xgboost

Useful Links:

https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf

https://xgboost.readthedocs.io/en/latest/tutorials/model.html

xgboost

  1. uses a second-order approximation of the loss function
  2. samples without replacement (row and column subsampling)
  3. adds an explicit regularization term

Tree Boosting

The way to learn a supervised model is to define an objective function and optimize it.

The objective function is

$$obj = \sum^{n}_{i=1} l(y_i, \hat{y}_i^{(t)}) + \sum^{t}_{i=1} \Omega(f_i)$$

We use an additive strategy: fix what we have already learned, and add one new tree at a time.

$\hat{y}$ is the accumulated prediction: each new tree fits the current residual, and $\hat{y}$ is the sum of all trees learned so far.

$$\begin{aligned}\hat{y}_i^{(0)} &= 0\\ \hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)\\ \hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i)\\ &\dots\\ \hat{y}_i^{(t)} &= \sum_{k=1}^t f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)\end{aligned}$$
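As a minimal sketch of additive prediction (the `trees` list and its `predict` method are illustrative names, not the xgboost API):

```python
# A minimal sketch of additive prediction: y_hat^(t) = y_hat^(t-1) + f_t(x).
# `trees` is a hypothetical list of fitted regression trees, not xgboost code.

def boosted_predict(trees, x):
    y_hat = 0.0                  # y_hat^(0) = 0
    for f_t in trees:            # add one tree at a time
        y_hat += f_t.predict(x)  # y_hat^(t) = y_hat^(t-1) + f_t(x)
    return y_hat
```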

At each step we can write down the objective ($y$ is the actual observed data):

$$\begin{aligned}\text{obj}^{(t)} &= \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t \Omega(f_i) \\ &= \sum_{i=1}^n l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) + \mathrm{constant}\end{aligned}$$

For example, if we take MSE (mean squared error) as the loss function:

$$\begin{aligned}\text{obj}^{(t)} &= \sum_{i=1}^n (y_i - (\hat{y}_i^{(t-1)} + f_t(x_i)))^2 + \sum_{i=1}^t \Omega(f_i) \\ &= \sum_{i=1}^n [2(\hat{y}_i^{(t-1)} - y_i) f_t(x_i) + f_t(x_i)^2] + \Omega(f_t) + \mathrm{constant}\end{aligned}$$
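To see where the second line comes from, expand the square; the term $(y_i - \hat{y}_i^{(t-1)})^2$ does not depend on $f_t$ and is absorbed into the constant:

$$(y_i - (\hat{y}_i^{(t-1)} + f_t(x_i)))^2 = (y_i - \hat{y}_i^{(t-1)})^2 - 2(y_i - \hat{y}_i^{(t-1)}) f_t(x_i) + f_t(x_i)^2$$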

The MSE has a friendly form, but other loss functions such as the logistic loss do not expand so nicely. In the general case, we take the Taylor expansion of the loss function up to second order, which turns the loss into a polynomial in $f_t(x_i)$:

$$\text{obj}^{(t)} = \sum_{i=1}^n [l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t) + \mathrm{constant}$$

where $g_i$ and $h_i$ are defined as

$$\begin{aligned}g_i &= \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})\\ h_i &= \partial_{\hat{y}_i^{(t-1)}}^2 l(y_i, \hat{y}_i^{(t-1)})\end{aligned}$$
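For concreteness, here is a small sketch (not xgboost code) of $g_i$ and $h_i$ for two common losses: for squared error $g_i = 2(\hat{y}_i - y_i)$ and $h_i = 2$, and for the logistic loss $g_i = p_i - y_i$ and $h_i = p_i(1 - p_i)$ with $p_i = \sigma(\hat{y}_i)$.

```python
import numpy as np

def grad_hess_squared_error(y, y_hat):
    """g_i and h_i for l(y, y_hat) = (y - y_hat)^2."""
    g = 2.0 * (y_hat - y)              # first derivative w.r.t. y_hat
    h = 2.0 * np.ones_like(y_hat)      # second derivative w.r.t. y_hat
    return g, h

def grad_hess_logistic(y, y_hat):
    """g_i and h_i for the logistic loss; y in {0, 1}, y_hat is the raw score."""
    p = 1.0 / (1.0 + np.exp(-y_hat))   # predicted probability
    g = p - y                          # first derivative
    h = p * (1.0 - p)                  # second derivative
    return g, h
```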

After we remove all the constants, the specific objective at step $t$ becomes

$$\sum_{i=1}^n [g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)] + \Omega(f_t)$$

This function depends only on $g_i$ and $h_i$, and it is our optimization goal for the new tree.

The Regularization Term

Having dealt with the training loss, we now add the regularization term that defines the complexity of the tree, $\Omega(f)$. First, redefine $f(x)$ as

$$f_t(x) = w_{q(x)},\quad w \in \mathbb{R}^T,\quad q: \mathbb{R}^d \rightarrow \{1,2,\dots,T\}$$

Here $w$ is the vector of leaf scores, $q$ is the function that maps each data point to its corresponding leaf, and $T$ is the number of leaves.
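As a toy sketch of this representation (the split, feature index, and weights below are made up for illustration):

```python
import numpy as np

# Illustrative only: a depth-1 tree (a stump) written as a mapping q(x) -> leaf
# index plus a weight vector w, so that f(x) = w[q(x)].
w = np.array([0.3, -0.8])                # leaf scores, T = 2 leaves

def q(x, feature=0, threshold=0.5):      # q: R^d -> {0, 1}
    return 0 if x[feature] < threshold else 1

def f(x):
    return w[q(x)]
```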

In XGBoost, the complexity is defined as

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda\sum^T_{j=1}w_j^2$$
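For example, for a toy tree with $T = 3$ leaves and weights $w = (2,\ 0.1,\ -1)$ (numbers made up for illustration):

$$\Omega(f) = 3\gamma + \frac{1}{2}\lambda\,(2^2 + 0.1^2 + (-1)^2) = 3\gamma + 2.505\,\lambda$$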

The definition of complexity is flexible; this particular form works well in practice.

The Structure Score

Here is the magical part of the derivation. After re-formulating the tree model, we can write the objective value with the t-th tree as:

$$\begin{aligned}\text{obj}^{(t)} &\approx \sum_{i=1}^n [g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2\\ &= \sum^T_{j=1} [(\sum_{i\in I_j} g_i) w_j + \frac{1}{2} (\sum_{i\in I_j} h_i + \lambda) w_j^2 ] + \gamma T\end{aligned}$$

where $I_j = \{i \mid q(x_i) = j\}$ is the set of indices of data points assigned to the $j$-th leaf. Notice that in the second line we have changed the index of the summation because all the data points on the same leaf get the same score. We could further compress the expression by defining $G_j = \sum_{i\in I_j} g_i$ and $H_j = \sum_{i\in I_j} h_i$:

$$\text{obj}^{(t)} = \sum^T_{j=1} [G_j w_j + \frac{1}{2} (H_j+\lambda) w_j^2] + \gamma T$$

In this equation the $w_j$ are independent of each other, and the form $G_j w_j + \frac{1}{2}(H_j+\lambda)w_j^2$ is quadratic in $w_j$. For a given structure $q(x)$, the best $w_j$ and the best objective reduction we can get are:

$$\begin{aligned} w_j^\ast &= -\frac{G_j}{H_j+\lambda}\\ \text{obj}^\ast &= -\frac{1}{2} \sum_{j=1}^T \frac{G_j^2}{H_j+\lambda} + \gamma T \end{aligned}$$

The last equation measures how good a tree structure $q(x)$ is.
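A minimal sketch of these two formulas, assuming the per-leaf sums $G_j$ and $H_j$ have already been computed (names are illustrative, not xgboost internals):

```python
import numpy as np

def leaf_weights_and_score(G, H, lam, gamma):
    """Best leaf weights w_j* and structure score obj* for a fixed tree structure.

    G, H: arrays of per-leaf sums of gradients and hessians (G_j, H_j)."""
    G = np.asarray(G, dtype=float)
    H = np.asarray(H, dtype=float)
    w_star = -G / (H + lam)                                   # w_j* = -G_j / (H_j + lambda)
    obj_star = -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * len(G)
    return w_star, obj_star
```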

When growing the tree, we try to split a leaf into two leaves; the gain of the split is:

$$Gain = \frac{1}{2} \left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$$

This formula can be decomposed as 1) the score on the new left leaf, 2) the score on the new right leaf, 3) the score on the original leaf, and 4) regularization on the additional leaf. We can see an important fact here: if the gain is smaller than $\gamma$, we would do better not to add that branch. This is exactly the pruning technique in tree-based models! By using the principles of supervised learning, we naturally arrive at the reason these techniques work 😃

For real-valued features, we usually want to search for an optimal split. To do so efficiently, we place all the instances in sorted order by the feature value and scan from left to right, accumulating $G_L$ and $H_L$ so that the gain of every candidate split can be computed in a single pass, as sketched below.
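A hedged Python sketch of this scan for a single feature (xgboost's actual implementation is in C++ and also handles missing values and approximate quantile splits):

```python
import numpy as np

def best_split_one_feature(x, g, h, lam, gamma):
    """Scan one feature in sorted order and return (best_gain, best_threshold)."""
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G_total, H_total = g.sum(), h.sum()

    G_L = H_L = 0.0
    best_gain, best_threshold = 0.0, None
    for i in range(len(x) - 1):
        G_L += g[i]                       # accumulate left-side statistics
        H_L += h[i]
        if x[i] == x[i + 1]:              # cannot split between equal values
            continue
        G_R, H_R = G_total - G_L, H_total - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam)
                      + G_R ** 2 / (H_R + lam)
                      - G_total ** 2 / (H_total + lam)) - gamma
        if gain > best_gain:
            best_gain, best_threshold = gain, (x[i] + x[i + 1]) / 2.0
    return best_gain, best_threshold
```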

Random Forests in XGBoost

Random forests use the same model representation as gradient-boosted decision trees (an ensemble of decision trees) but a different training algorithm. In XGBoost, we can also use a random forest as the base model for gradient boosting.

In practice, we can just use the scikit-learn-compatible API provided by the xgboost package.

In this API, XGBClassifier and XGBRegressor perform gradient boosting, while XGBRFClassifier and XGBRFRegressor train random forests.
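For example (a minimal usage sketch; the dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, XGBRFClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient boosting: trees are added sequentially, each correcting the previous ones.
gb = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
gb.fit(X_train, y_train)

# Random forest: many trees trained in one round on subsampled rows and columns.
rf = XGBRFClassifier(n_estimators=100, max_depth=3, subsample=0.8,
                     colsample_bynode=0.8)
rf.fit(X_train, y_train)

print(gb.score(X_test, y_test), rf.score(X_test, y_test))
```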

Author: shixuan liu
Link: http://tedlsx.github.io/2020/07/13/xgboost/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stated otherwise.