Useful Links:
https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
https://xgboost.readthedocs.io/en/latest/tutorials/model.html
xgboost
- uses a second-order approximation of the loss function
- samples without replacement
- adds an explicit regularization term (see the sketch below)
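As a rough, hedged sketch of where these bullet points surface in the xgboost Python API (the dataset and parameter values below are made up for illustration): reg_lambda and gamma correspond to the regularization term derived later, and subsample / colsample_bytree control the sampling.

```python
# A rough sketch of where the bullet points show up in the xgboost API.
# Dataset and parameter values are made up for illustration only.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

model = xgb.XGBRegressor(
    n_estimators=100,       # boosting rounds: trees f_1, ..., f_t
    learning_rate=0.1,
    reg_lambda=1.0,         # lambda in the regularization term below
    gamma=0.0,              # gamma: minimum gain required to split a leaf
    subsample=0.8,          # row sampling (without replacement) per tree
    colsample_bytree=0.8,   # column sampling per tree
)
model.fit(X, y)
print(model.predict(X[:5]))
```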
Tree Boosting
The way to learn a supervised model is to define an objective function and optimize it.
The objective function is
$$\text{obj} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{k=1}^{t} \Omega(f_k)$$
We use an additive strategy: fix what we have learned so far, and add one new tree at a time. The prediction $\hat{y}_i^{(t)}$ at step $t$ is the sum of the trees learned so far:
$$\begin{aligned}
\hat{y}_i^{(0)} &= 0\\
\hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)\\
\hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i)\\
&\dots\\
\hat{y}_i^{(t)} &= \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)
\end{aligned}$$
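As a toy illustration of the additive form (the stump-like functions below are hypothetical stand-ins for learned trees, not xgboost output), the prediction is just a running sum:

```python
# Hypothetical "trees" f_1 and f_2 standing in for learned regression trees.
def f1(x):
    return 0.5 * x

def f2(x):
    return 0.1 if x > 0 else -0.1

x_i = 2.0
y_hat = 0.0                  # y_hat^(0) = 0
for f_t in (f1, f2):
    y_hat += f_t(x_i)        # y_hat^(t) = y_hat^(t-1) + f_t(x_i)
print(y_hat)                 # 0 + f1(2.0) + f2(2.0) = 1.1
```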
At each step we can write down the objective, where $y_i$ is the observed label:
$$\begin{aligned}
\text{obj}^{(t)} &= \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{k=1}^{t} \Omega(f_k)\\
&= \sum_{i=1}^n l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{constant}
\end{aligned}$$
For example, we can use MSE as the loss function:
$$\begin{aligned}
\text{obj}^{(t)} &= \sum_{i=1}^n \left(y_i - (\hat{y}_i^{(t-1)} + f_t(x_i))\right)^2 + \sum_{k=1}^{t} \Omega(f_k)\\
&= \sum_{i=1}^n \left[2(\hat{y}_i^{(t-1)} - y_i) f_t(x_i) + f_t(x_i)^2\right] + \Omega(f_t) + \text{constant}
\end{aligned}$$
The MSE has a nice form, but other loss functions, such as the logistic loss, do not. In the general case, we take the Taylor expansion of the loss function up to the second order, which turns the loss into a polynomial in $f_t(x_i)$:
$$\text{obj}^{(t)} = \sum_{i=1}^n \left[l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t) + \text{constant}$$
where $g_i$ and $h_i$ are defined as
$$\begin{aligned}
g_i &= \partial_{\hat{y}_i^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)})\\
h_i &= \partial^2_{\hat{y}_i^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)})
\end{aligned}$$
After we remove all the constants, the specific objective at step t becomes
$$\sum_{i=1}^n \left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t)$$
This function depends on the loss only through $g_i$ and $h_i$, and it is our optimization goal for the new tree.
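As a sketch of what $g_i$ and $h_i$ look like in practice (plain NumPy for the squared-error and logistic losses; this is illustrative, not xgboost's internal code):

```python
# A NumPy sketch of g_i and h_i for two common losses; illustrative only.
import numpy as np

def squared_error_grad_hess(y, y_hat):
    # l = (y - y_hat)^2  =>  g_i = 2 * (y_hat - y),  h_i = 2
    return 2.0 * (y_hat - y), np.full_like(y_hat, 2.0)

def logistic_grad_hess(y, y_hat):
    # l = y * ln(1 + e^-y_hat) + (1 - y) * ln(1 + e^y_hat)
    # =>  g_i = sigmoid(y_hat) - y,  h_i = sigmoid(y_hat) * (1 - sigmoid(y_hat))
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return p - y, p * (1.0 - p)

y = np.array([1.0, 0.0, 1.0])
y_hat_prev = np.array([0.2, -0.4, 1.5])   # predictions of the first t-1 trees
print(logistic_grad_hess(y, y_hat_prev))
```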
The Regularization Term
We have described the training step, but we still need the regularization term $\Omega(f)$, which measures the complexity of the tree. To define it, first refine the definition of the tree $f(x)$:
$$f_t(x) = w_{q(x)},\quad w \in \mathbb{R}^T,\quad q: \mathbb{R}^d \to \{1, 2, \dots, T\}$$
Here $w$ is the vector of scores on the leaves, $q$ is a function assigning each data point to its leaf, and $T$ is the number of leaves.
In XGBoost, the complexity is defined as:
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2$$
There is more than one way to define complexity; this choice works well in practice.
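A tiny worked example, with hypothetical leaf scores, just to make the formula concrete:

```python
# Complexity term for a toy tree with T = 3 leaves and made-up leaf scores.
import numpy as np

gamma, lam = 1.0, 1.0
w = np.array([2.0, 0.1, -1.0])                    # scores of the T = 3 leaves
omega = gamma * len(w) + 0.5 * lam * np.sum(w**2)
print(omega)                                      # 3 + 0.5 * 5.01 = 5.505
```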
The Structure Score
Here is the magical part of the derivation. After re-formulating the tree model, we can write the objective value with the t-th tree as:
$$\begin{aligned}
\text{obj}^{(t)} &\approx \sum_{i=1}^n \left[g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2\right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2\\
&= \sum_{j=1}^T \left[\left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2\right] + \gamma T
\end{aligned}$$
where $I_j = \{i \mid q(x_i) = j\}$ is the set of indices of data points assigned to the $j$-th leaf. Notice that in the second line we have changed the index of the summation, because all the data points on the same leaf get the same score. We can further compress the expression by defining $G_j = \sum_{i\in I_j} g_i$ and $H_j = \sum_{i\in I_j} h_i$:
$$\text{obj}^{(t)} = \sum_{j=1}^T \left[G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2\right] + \gamma T$$
In this equation the $w_j$ are independent of each other, and the form $G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2$ is quadratic in $w_j$. For a given structure $q(x)$, the best $w_j$ and the best objective reduction we can get are:
$$\begin{aligned}
w_j^* &= -\frac{G_j}{H_j + \lambda}\\
\text{obj}^* &= -\frac{1}{2}\sum_{j=1}^T \frac{G_j^2}{H_j + \lambda} + \gamma T
\end{aligned}$$
The last equation measures how good a tree structure $q(x)$ is.
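A NumPy sketch of scoring a fixed tree structure; the $G_j$ and $H_j$ values below are hypothetical per-leaf sums of $g_i$ and $h_i$:

```python
# Optimal leaf weights and structure score for a fixed tree structure.
import numpy as np

def structure_score(G, H, lam=1.0, gamma=1.0):
    w_star = -G / (H + lam)                                      # optimal leaf weights
    obj_star = -0.5 * np.sum(G**2 / (H + lam)) + gamma * len(G)  # objective value
    return w_star, obj_star

G = np.array([-2.0, 4.0, 1.5])
H = np.array([3.0, 5.0, 2.0])
w_star, obj_star = structure_score(G, H)
print(w_star, obj_star)    # a smaller obj_star means a better structure
```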
When growing the tree, we try to split a leaf into two leaves, and the score it gains is:
$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$
This formula can be decomposed as 1) the score on the new left leaf, 2) the score on the new right leaf, 3) the score on the original leaf, and 4) regularization on the additional leaf. We can see an important fact here: if the gain is smaller than $\gamma$, we would do better not to add that branch. This is exactly the pruning technique in tree-based models! By using the principles of supervised learning, we naturally arrive at the reason these techniques work 😃
For real-valued data, we usually want to search for an optimal split. To do so efficiently, we place all the instances in sorted order by feature value; a single left-to-right scan is then enough to accumulate $G_L$ and $H_L$ and evaluate the gain of every possible split.
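An illustrative NumPy version of this exact split search on one feature (a sketch under the notation above, not xgboost's actual implementation):

```python
# Sort by feature value, scan left to right accumulating G_L and H_L,
# and evaluate the gain formula at every split point.
import numpy as np

def best_split(x, g, h, lam=1.0, gamma=0.0):
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    G_L = H_L = 0.0
    best_gain, best_thresh = -np.inf, None
    for i in range(len(x) - 1):
        G_L += g[i]
        H_L += h[i]
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam)
                      + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thresh = gain, (x[i] + x[i + 1]) / 2.0
    return best_thresh, best_gain

rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = np.where(x > 0, 1.0, -1.0)      # toy labels
g = 2.0 * (0.0 - y)                 # squared-error g_i with current prediction 0
h = np.full_like(x, 2.0)            # squared-error h_i
print(best_split(x, g, h))          # threshold near 0, positive gain
```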
Random Forests in XGBoost
Random forests use the same model representation as gradient-boosted decision trees, but a different training algorithm. We can also use a random forest as a base model for gradient boosting.
In practice, we can just use XGBoost's scikit-learn-compatible API: XGBClassifier and XGBRegressor use gradient boosting, while XGBRFClassifier and XGBRFRegressor train random forests.
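A minimal sketch comparing the two training modes (the dataset and settings are illustrative only):

```python
# Gradient boosting vs. random forest mode via xgboost's sklearn-style classes.
import numpy as np
from xgboost import XGBClassifier, XGBRFClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

gbt = XGBClassifier(n_estimators=100, learning_rate=0.1)  # gradient boosting
rf = XGBRFClassifier(n_estimators=100)                    # random forest mode
gbt.fit(X, y)
rf.fit(X, y)
print(gbt.score(X, y), rf.score(X, y))
```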