Random Forest

Introduction of Random Forest

RF (random forest) is a classification method built from many randomized decision trees. The "random" in RF refers to two things: each tree is built on a random sample of the original data, and each split considers only a random subset of the features.

It has several advantages:

  • RF includes two sources of randomness, which keeps the model from overfitting and makes it robust to noise.
  • In high-dimensional problems, there is no need for separate feature selection or dimensionality reduction.
  • It gives estimates of which variables are important in the classification.
  • RF works on both continuous and categorical data.
  • It generates an internal, unbiased estimate of the generalization error as the forest is built.

Introduction of Decision Tree

A DT (decision tree) is a tree-like model in which each path from the root to a leaf represents a sequence of decisions and their consequences, such as event outcomes, resource costs, and utility. Decision trees are supervised learning methods and can be applied to both classification and regression problems.

When splitting a node into sub-nodes, the DT must decide which variable to split on first. One common criterion is Gini importance: for each split on variable m, the child nodes should have a smaller Gini impurity than the parent node, and the split that gives the largest decrease is preferred (a small sketch follows).
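As a minimal illustration (my own sketch, not code from the original post), the Gini impurity of the parent node can be compared with the weighted impurity of the child nodes; a good split is one with a large decrease:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k^2)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_decrease(parent, left, right):
    """Decrease in Gini impurity from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

# Toy example: a split that separates the two classes fairly well.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])
print(gini(parent))                        # 0.5
print(gini_decrease(parent, left, right))  # 0.125 > 0, so the split helps
```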

ID3

In ID3, smaller models are preferred over larger ones. The information gain of each feature is computed from the information entropy, and the feature with the largest gain is split on first (greedy search).
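A hedged sketch of how ID3 scores a feature (my own illustration): the information gain is the entropy of the parent minus the weighted entropy of the children obtained by splitting on that feature.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of class labels: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, feature_values):
    """Entropy of the parent minus the weighted entropy of the children
    produced by splitting on each distinct value of the feature."""
    total = entropy(labels)
    n = len(labels)
    for v in np.unique(feature_values):
        mask = feature_values == v
        total -= (mask.sum() / n) * entropy(labels[mask])
    return total

# Toy data: the feature perfectly determines the label, so the gain is 1.
y = np.array([0, 0, 1, 1, 1, 0])
x = np.array(['a', 'a', 'b', 'b', 'b', 'a'])
print(information_gain(y, x))  # 1.0
```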

Disadvantages:

  • There is no pruning strategy, so it overfits easily.
  • Information gain is biased toward features with many distinct values; for example, an "ID" feature (the row index of the data) has an information gain close to 1.
  • It can only handle discrete (categorical) features.
  • Missing values are not handled.

C4.5

Improvements:

  • It uses the information gain ratio, which corrects ID3's bias toward features with many distinct values (a small sketch after this list shows the effect).
  • Pre-pruning: a greedy strategy applied before a node is split. Pruning happens in three cases: 1. the number of samples in the node falls below a threshold; 2. all features have already been used for splitting; 3. the accuracy before the split is higher than after it. Pre-pruning reduces overfitting and training time, but can lead to underfitting.
  • Post-pruning: pruning performed on the already-grown tree. Pessimistic pruning works recursively from the bottom up over every non-leaf node and evaluates whether replacing it with its best leaf is beneficial (in terms of error rate). It generalizes better than pre-pruning and is less prone to underfitting, but training takes longer.
  • Continuous features are discretized.
  • Missing values: 1. when computing the information gain ratio for a feature with missing values, scale it by the proportion of samples that are not missing; 2. when splitting on that feature, assign the samples with missing values to all child nodes.
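To see why the gain ratio helps, here is a minimal sketch (my own illustration, not from the original post): the information gain is divided by the split information (the entropy of the feature itself), which penalizes features with many distinct values such as a row ID.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(labels, feature_values):
    """C4.5-style criterion: information gain divided by the split information
    (entropy of the feature values), which penalizes many-valued features."""
    n = len(labels)
    gain = entropy(labels)
    for v in np.unique(feature_values):
        mask = feature_values == v
        gain -= (mask.sum() / n) * entropy(labels[mask])
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0

y = np.array([0, 0, 1, 1])
useful = np.array(['a', 'a', 'b', 'b'])        # two values, separates the classes
row_id = np.array(['r1', 'r2', 'r3', 'r4'])    # unique per row, like an "ID" column
print(gain_ratio(y, useful))   # 1.0
print(gain_ratio(y, row_id))   # 0.5: high gain, but heavily penalized by split info
```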

Disadvantages:

It is only used for classification, builds multi-way trees, and has to sort the feature values to find split points.

CART (classification and regression tree)

Splitting: features can be either continuous or discrete.

Pruning: cost-complexity pruning uses a loss function that balances overfitting against prediction error:

$$C_{\alpha}(T) = C(T) + \alpha |T|$$

where $C(T)$ measures the fit to the training data and $\alpha |T|$ is the complexity of the tree. When classes are imbalanced, the classes are weighted when judging the quality of a split; in binary classification, a node is classified as 1 if and only if:

$$\frac{N_1(node)}{N_1(root)} > \frac{N_0(node)}{N_0(root)}$$
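For the cost-complexity trade-off above, scikit-learn exposes the penalty through the ccp_alpha parameter. A hedged sketch (my own example, using the iris data purely for illustration) of how the effective alphas can be enumerated and how a larger alpha yields a smaller tree:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Effective alphas of the pruned subtrees, from weakest-link pruning.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Larger alpha -> stronger complexity penalty -> smaller tree.
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={clf.get_n_leaves()}  test acc={clf.score(X_test, y_test):.3f}")
```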

Using the Gini index instead of entropy avoids the logarithm computation (the smaller the Gini, the higher the priority of the split).

Surrogate splits are used to handle missing values: if 20% of the records at a node are missing for a feature, the score of that feature is reduced by 20%. Each tree node looks for surrogate splitters, and the splitter with the higher score is adopted.

With k target classes, the root node assigns each class a probability of 1/k by default, i.e. the priors are assumed equal.

Handling continuous values: the feature is split at a point s into $D_1$ and $D_2$ so that the squared error within each set, and the sum over the two sets, is minimized:

$$\min\left[\min_{c_1}\sum_{D_1}(y_i-c_1)^2+\min_{c_2}\sum_{D_2}(y_i-c_2)^2\right]$$
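A minimal sketch of this criterion (my own illustration, not from the post): try each candidate threshold s, predict each side by its mean ($c_1$, $c_2$), and keep the split with the smallest total squared error.

```python
import numpy as np

def best_split(x, y):
    """Return the threshold s that minimizes the summed squared error of the
    two sides, predicting each side by its mean (c1, c2)."""
    best_s, best_err = None, np.inf
    for s in np.unique(x)[:-1]:          # candidate thresholds
        left, right = y[x <= s], y[x > s]
        err = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split(x, y))   # splits at 3.0, where the two groups separate
```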

In a classification tree, the class with the highest probability in the leaf node is the prediction.

In a regression tree, the mean or median of the samples in the final leaf is the result.

DT algorithm

The DT algorithm used here is referred to as CART (classification and regression trees).

In scikit-learn:

from sklearn.datasets import load_iris
from sklearn import tree

# Fit a decision tree classifier on the iris dataset.
iris = load_iris()
clf = tree.DecisionTreeClassifier()
clf = clf.fit(iris.data, iris.target)

# Visualize the fitted tree (requires matplotlib).
tree.plot_tree(clf)

How Random Forests work

It uses the out-of-bag (OOB) error to obtain an unbiased estimate of the generalization error and to measure the importance of variables.

It also uses the Gini criterion to choose the splitting variables from the top of each tree down.
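Both ideas are available in scikit-learn; a hedged sketch (my own example): oob_score=True reports the out-of-bag estimate, and feature_importances_ gives the Gini-based variable importance.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(iris.data, iris.target)

# Out-of-bag estimate of the generalization accuracy (no separate test set needed).
print("OOB score:", rf.oob_score_)

# Mean decrease in Gini impurity per feature, averaged over all trees.
for name, imp in zip(iris.feature_names, rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```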

Bagging and Boosting

Bagging aims at variance reduction for a complex model, while boosting aims at bias reduction for a simple model; random forest is a bagging method (its individual trees are complex, so reducing variance is the goal).

RF belongs to ensemble learning, which has two major families of algorithms: bagging and boosting.

Bagging (Bootstrap Aggregation)

  • Using bootstrapping, randomly draw n training samples with replacement. Repeating this k times gives k training sets, drawn independently of each other (see the sketch after this list).
  • From the k training sets we train k models.
  • For classification problems, combine the models by voting; for regression problems, average their results.
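A minimal sketch of this recipe (illustrative only, not the post's own code): draw k bootstrap samples, fit one tree per sample, and combine by majority vote.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
k, n = 25, len(X)

# 1) Bootstrap: k training sets drawn with replacement.  2) One model per set.
models = []
for _ in range(k):
    idx = rng.integers(0, n, size=n)          # random draw with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# 3) Classification: majority vote over the k models.
preds = np.array([m.predict(X) for m in models])                      # shape (k, n)
vote = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
print("training accuracy of the vote:", (vote == y).mean())
```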

Advantages of using Random Forest technique:

  • Handles higher dimensionality data very well.
  • Handles missing values and maintains accuracy for missing data.

Disadvantages of using Random Forest technique:

  • Since final prediction is based on the mean predictions from subset trees, it won’t give precise values for the regression model.

Boosting

In this technique, learners are trained sequentially: early learners fit simple models to the data, and later ones analyze the remaining errors. In other words, we fit consecutive trees (on random samples), and at every step the goal is to correct the net error of the prior trees.

Gradient Boosting

Gradient Boosting = Gradient Descent + Boosting.

It uses the gradient descent algorithm, which can optimize any differentiable loss function. An ensemble of trees is built one at a time, and the individual trees are summed sequentially. Each new tree tries to recover the remaining loss (the difference between the actual and predicted values).
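A hedged sketch of this idea for squared-error loss (my own illustration, with made-up data): each new shallow tree is fit to the current residuals, and its predictions, shrunk by a learning rate, are added to the running model.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

n_trees, lr = 100, 0.1
pred = np.full_like(y, y.mean())         # start from a constant model
trees = []
for _ in range(n_trees):
    residual = y - pred                  # negative gradient of the squared-error loss
    t = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += lr * t.predict(X)            # add the new tree with a small learning rate
    trees.append(t)

print("training MSE after boosting:", np.mean((y - pred) ** 2))
```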

Advantages of using Gradient Boosting technique:

  • Supports different loss functions.
  • Works well with interactions.

Disadvantages of using Gradient Boosting technique:

  • Prone to overfitting.
  • Requires careful tuning of different hyper-parameters.
Author: shixuan liu
Link: http://tedlsx.github.io/2019/08/06/random-forest/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless stated otherwise.