Useful Links:
https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
https://xgboost.readthedocs.io/en/latest/tutorials/model.html
xgboost
- uses a second-order approximation of the loss function
- samples without replacement
- adds an explicit regularization term (see the sketch below)
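As a rough, hedged sketch of where these bullet points surface in the xgboost Python API (the dataset and parameter values below are made up for illustration): reg_lambda and gamma correspond to the regularization term derived later, and subsample / colsample_bytree control the sampling.

```python
# A rough sketch of where the bullet points show up in the xgboost API.
# Dataset and parameter values are made up for illustration only.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

model = xgb.XGBRegressor(
    n_estimators=100,       # boosting rounds: trees f_1, ..., f_t
    learning_rate=0.1,
    reg_lambda=1.0,         # lambda in the regularization term below
    gamma=0.0,              # gamma: minimum gain required to split a leaf
    subsample=0.8,          # row sampling (without replacement) per tree
    colsample_bytree=0.8,   # column sampling per tree
)
model.fit(X, y)
print(model.predict(X[:5]))
```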
Tree Boosting
The way to learn a supervised model is to define an objective function and optimize it.
The objective function is
$$\text{obj} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{k=1}^{t} \Omega(f_k)$$
We use an additive strategy: fix what we have learned so far, and add one new tree at a time. The prediction $\hat{y}_i^{(t)}$ at step $t$ is the sum of the trees learned so far:
$$\begin{aligned}
\hat{y}_i^{(0)} &= 0\\
\hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i)\\
\hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i)\\
&\dots\\
\hat{y}_i^{(t)} &= \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)
\end{aligned}$$
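As a toy illustration of the additive form (the stump-like functions below are hypothetical stand-ins for learned trees, not xgboost output), the prediction is just a running sum:

```python
# Hypothetical "trees" f_1 and f_2 standing in for learned regression trees.
def f1(x):
    return 0.5 * x

def f2(x):
    return 0.1 if x > 0 else -0.1

x_i = 2.0
y_hat = 0.0                  # y_hat^(0) = 0
for f_t in (f1, f2):
    y_hat += f_t(x_i)        # y_hat^(t) = y_hat^(t-1) + f_t(x_i)
print(y_hat)                 # 0 + f1(2.0) + f2(2.0) = 1.1
```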
At each step we can write down the objective, where $y_i$ is the observed label:
$$\begin{aligned}
\text{obj}^{(t)} &= \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{k=1}^{t} \Omega(f_k)\\
&= \sum_{i=1}^n l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + \text{constant}
\end{aligned}$$
For example, we can use MSE as the loss function:
$$\begin{aligned}
\text{obj}^{(t)} &= \sum_{i=1}^n \left(y_i - (\hat{y}_i^{(t-1)} + f_t(x_i))\right)^2 + \sum_{k=1}^{t} \Omega(f_k)\\
&= \sum_{i=1}^n \left[2(\hat{y}_i^{(t-1)} - y_i) f_t(x_i) + f_t(x_i)^2\right] + \Omega(f_t) + \text{constant}
\end{aligned}$$
The MSE has a nice form, but other loss functions, such as the logistic loss, do not. In the general case, we take the Taylor expansion of the loss function up to the second order, which turns the loss into a polynomial in $f_t(x_i)$:
$$\text{obj}^{(t)} = \sum_{i=1}^n \left[l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t) + \text{constant}$$
where $g_i$ and $h_i$ are defined as
$$\begin{aligned}
g_i &= \partial_{\hat{y}_i^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)})\\
h_i &= \partial^2_{\hat{y}_i^{(t-1)}}\, l(y_i, \hat{y}_i^{(t-1)})
\end{aligned}$$
After we remove all the constants, the specific objective at step t becomes
$$\sum_{i=1}^n \left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t)$$
This function depends on the loss only through $g_i$ and $h_i$, and it is our optimization goal for the new tree.
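As a sketch of what $g_i$ and $h_i$ look like in practice (plain NumPy for the squared-error and logistic losses; this is illustrative, not xgboost's internal code):

```python
# A NumPy sketch of g_i and h_i for two common losses; illustrative only.
import numpy as np

def squared_error_grad_hess(y, y_hat):
    # l = (y - y_hat)^2  =>  g_i = 2 * (y_hat - y),  h_i = 2
    return 2.0 * (y_hat - y), np.full_like(y_hat, 2.0)

def logistic_grad_hess(y, y_hat):
    # l = y * ln(1 + e^-y_hat) + (1 - y) * ln(1 + e^y_hat)
    # =>  g_i = sigmoid(y_hat) - y,  h_i = sigmoid(y_hat) * (1 - sigmoid(y_hat))
    p = 1.0 / (1.0 + np.exp(-y_hat))
    return p - y, p * (1.0 - p)

y = np.array([1.0, 0.0, 1.0])
y_hat_prev = np.array([0.2, -0.4, 1.5])   # predictions of the first t-1 trees
print(logistic_grad_hess(y, y_hat_prev))
```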
The Regularization Term
We have described the training step, but we still need the regularization term $\Omega(f)$, which measures the complexity of the tree. To define it, first refine the definition of the tree $f(x)$:
$$f_t(x) = w_{q(x)},\quad w \in \mathbb{R}^T,\quad q: \mathbb{R}^d \to \{1, 2, \dots, T\}$$
Here $w$ is the vector of scores on the leaves, $q$ is a function assigning each data point to its leaf, and $T$ is the number of leaves.
In XGBoost, the complexity is defined as:
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2$$
There is more than one way to define complexity; this choice works well in practice.
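A tiny worked example, with hypothetical leaf scores, just to make the formula concrete:

```python
# Complexity term for a toy tree with T = 3 leaves and made-up leaf scores.
import numpy as np

gamma, lam = 1.0, 1.0
w = np.array([2.0, 0.1, -1.0])                    # scores of the T = 3 leaves
omega = gamma * len(w) + 0.5 * lam * np.sum(w**2)
print(omega)                                      # 3 + 0.5 * 5.01 = 5.505
```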
The Structure Score
Here is the magical part of the derivation. After re-formulating the tree model, we can write the objective value with the t-th tree as:
$$\begin{aligned}
\text{obj}^{(t)} &\approx \sum_{i=1}^n \left[g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2\right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2\\
&= \sum_{j=1}^T \left[\left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2\right] + \gamma T
\end{aligned}$$
where $I_j = \{i \mid q(x_i) = j\}$ is the set of indices of data points assigned to the $j$-th leaf. Notice that in the second line we have changed the index of the summation, because all the data points on the same leaf get the same score. We can further compress the expression by defining $G_j = \sum_{i\in I_j} g_i$ and $H_j = \sum_{i\in I_j} h_i$:
$$\text{obj}^{(t)} = \sum_{j=1}^T \left[G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2\right] + \gamma T$$
In this equation the $w_j$ are independent of each other, and the form $G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2$ is quadratic in $w_j$. For a given structure $q(x)$, the best $w_j$ and the best objective reduction we can get are:
$$\begin{aligned}
w_j^* &= -\frac{G_j}{H_j + \lambda}\\
\text{obj}^* &= -\frac{1}{2}\sum_{j=1}^T \frac{G_j^2}{H_j + \lambda} + \gamma T
\end{aligned}$$
The last equation measures how good a tree structure $q(x)$ is.
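A NumPy sketch of scoring a fixed tree structure; the $G_j$ and $H_j$ values below are hypothetical per-leaf sums of $g_i$ and $h_i$:

```python
# Optimal leaf weights and structure score for a fixed tree structure.
import numpy as np

def structure_score(G, H, lam=1.0, gamma=1.0):
    w_star = -G / (H + lam)                                      # optimal leaf weights
    obj_star = -0.5 * np.sum(G**2 / (H + lam)) + gamma * len(G)  # objective value
    return w_star, obj_star

G = np.array([-2.0, 4.0, 1.5])
H = np.array([3.0, 5.0, 2.0])
w_star, obj_star = structure_score(G, H)
print(w_star, obj_star)    # a smaller obj_star means a better structure
```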
When growing the tree, we try to split a leaf into two leaves, and the score it gains is:
$$\text{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$
This formula can be decomposed as 1) the score on the new left leaf, 2) the score on the new right leaf, 3) the score on the original leaf, and 4) regularization on the additional leaf. We can see an important fact here: if the gain is smaller than $\gamma$, we would do better not to add that branch. This is exactly the pruning technique in tree-based models! By using the principles of supervised learning, we naturally arrive at the reason these techniques work 😃
For real-valued data, we usually want to search for an optimal split. To do so efficiently, we place all the instances in sorted order by feature value; a single left-to-right scan is then enough to accumulate $G_L$ and $H_L$ and evaluate the gain of every possible split.
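An illustrative NumPy version of this exact split search on one feature (a sketch under the notation above, not xgboost's actual implementation):

```python
# Sort by feature value, scan left to right accumulating G_L and H_L,
# and evaluate the gain formula at every split point.
import numpy as np

def best_split(x, g, h, lam=1.0, gamma=0.0):
    order = np.argsort(x)
    x, g, h = x[order], g[order], h[order]
    G, H = g.sum(), h.sum()
    G_L = H_L = 0.0
    best_gain, best_thresh = -np.inf, None
    for i in range(len(x) - 1):
        G_L += g[i]
        H_L += h[i]
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L**2 / (H_L + lam)
                      + G_R**2 / (H_R + lam)
                      - G**2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain, best_thresh = gain, (x[i] + x[i + 1]) / 2.0
    return best_thresh, best_gain

rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = np.where(x > 0, 1.0, -1.0)      # toy labels
g = 2.0 * (0.0 - y)                 # squared-error g_i with current prediction 0
h = np.full_like(x, 2.0)            # squared-error h_i
print(best_split(x, g, h))          # threshold near 0, positive gain
```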
Random Forests in XGBoost
Random forests use the same model representation as gradient-boosted decision trees, but a different training algorithm. We can also use a random forest as a base model for gradient boosting.
In practice, we can just use XGBoost's scikit-learn-compatible API: XGBClassifier and XGBRegressor use gradient boosting, while XGBRFClassifier and XGBRFRegressor train random forests.
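A minimal sketch comparing the two training modes (the dataset and settings are illustrative only):

```python
# Gradient boosting vs. random forest mode via xgboost's sklearn-style classes.
import numpy as np
from xgboost import XGBClassifier, XGBRFClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

gbt = XGBClassifier(n_estimators=100, learning_rate=0.1)  # gradient boosting
rf = XGBRFClassifier(n_estimators=100)                    # random forest mode
gbt.fit(X, y)
rf.fit(X, y)
print(gbt.score(X, y), rf.score(X, y))
```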