Using a LightGBM Classification Model [LightGBM Analysis] / May 11, 2019 /
1. Below we demonstrate how to use LightGBM on a dataset that ships with sklearn.
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification  # imported but not used in this example
iris = load_iris()
data = iris.data
print(data)
target = iris.target
print(target)
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)
# Re-split data and target; test_size is the fraction of the whole set held out for testing
print(y_test, y_train)
# Build LightGBM Dataset objects (the library's own data format)
lgb_train = lgb.Dataset(X_train, y_train)  # data (string, numpy array, pandas DataFrame, scipy.sparse or list of numpy arrays) – Data source of Dataset. If string, it represents the path to a txt file. For images, pass the image as a matrix.
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)  # reference (Dataset or None, optional (default=None)) – If this dataset is used for validation, the training data should be used as reference.
# print(type(lgb_eval))
# Write the parameters as a dictionary
params = {'task': 'train',
    'boosting_type': 'gbdt',    # type of boosting
    'objective': 'regression',  # objective function
    'metric': {'l2', 'auc'},    # evaluation metrics
    'num_leaves': 20,           # number of leaves
    'learning_rate': 0.07,      # learning rate
    'feature_fraction': 0.9,    # fraction of features sampled per tree
    'bagging_fraction': 0.8,    # fraction of rows sampled per tree
    'bagging_freq': 5,          # k means perform bagging every k iterations
    'verbose': 1                # <0: fatal only; =0: errors (warnings); >0: info
}
print('Start training...')
# Train (cv and train)
gbm = lgb.train(params, lgb_train, num_boost_round=100, valid_sets=lgb_eval, early_stopping_rounds=5)
print('Save model...')
# Save the model to a file
gbm.save_model('model.txt')
print('Start predicting...')
# Predict on the test set
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)
# Evaluate the model
print(y_pred,y_test)
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) **0.5)
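Note that the code above fits a regression objective to the integer class labels of iris. For an actual three-class classification setup, here is a minimal hedged sketch reusing lgb_train, lgb_eval, X_test and y_test from above; the parameter values are illustrative only, not recommendations:
# Multiclass variant: predict() returns an (n_samples, 3) matrix of class probabilities
multi_params = {'boosting_type': 'gbdt',
    'objective': 'multiclass',   # multiclass objective instead of regression
    'num_class': 3,              # iris has 3 classes
    'metric': 'multi_logloss',   # multiclass log-loss as evaluation metric
    'num_leaves': 20,
    'learning_rate': 0.07
}
gbm_clf = lgb.train(multi_params, lgb_train, num_boost_round=100,
                    valid_sets=lgb_eval, early_stopping_rounds=5)
prob = gbm_clf.predict(X_test, num_iteration=gbm_clf.best_iteration)
pred_label = prob.argmax(axis=1)  # pick the most probable class per sample
print('accuracy:', (pred_label == y_test).mean())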
If all you want is to use LightGBM, you do not really need to understand the underlying theory; for those interested, the following blog is recommended:
https://blog.csdn.net/qq_24519677/article/details/82811215
2. On tuning LightGBM parameters:
a. Parameter fine-tuning
Below is one Zhihu author's take:
(1). num_leaves. This is the main parameter to control the complexity of the tree model. Theoretically, we can set num_leaves = 2^(max_depth) to convert from a depth-wise tree. However, this simple conversion is not good in practice. The reason is that, when the number of leaves is the same, a leaf-wise tree is much deeper than a depth-wise tree, and as a result it may over-fit. Thus, when tuning num_leaves, we should keep it smaller than 2^(max_depth). For example, when max_depth=6 a depth-wise tree can get good accuracy, but setting num_leaves to 127 may cause over-fitting, while setting it to 70 or 80 may give better accuracy than the depth-wise tree. In fact, the concept of depth can be forgotten in a leaf-wise tree, since there is no exact mapping from leaves to depth.
(2). min_data_in_leaf. This is a very important parameter for dealing with over-fitting in a leaf-wise tree. Its value depends on the number of training samples and on num_leaves. Setting it to a large value can avoid growing too deep a tree, but may cause under-fitting. In practice, setting it to hundreds or thousands is enough for a large dataset.
(3). max_depth. You can also use max_depth to limit the tree depth explicitly.
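To make points (1)–(3) concrete, here is a hedged configuration sketch; the values are placeholders for a reasonably large dataset, not recommendations:
tuned_params = {'objective': 'regression',
    'max_depth': 6,           # explicit depth limit, item (3)
    'num_leaves': 40,         # kept well below 2^6 = 64, item (1)
    'min_data_in_leaf': 100,  # guards against overly specific leaves, item (2)
    'learning_rate': 0.05
}
# gbm = lgb.train(tuned_params, lgb_train, num_boost_round=200, valid_sets=lgb_eval)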
b. Adjusting parameters for different scenarios
1. For Faster Speed:
Use bagging by setting bagging_fraction and bagging_freq
Use feature sub-sampling by setting feature_fraction
Use small max_bin
Use save_binary to speed up data loading in future learning
Use parallel learning, refer to the Parallel Learning Guide
2. For Better Accuracy:
Use large max_bin (may be slower)
Use small learning_rate with large num_iterations
Use large num_leaves (may cause over-fitting)
Use bigger training data
Try dart
3. Deal with Over-fitting:
Use small max_bin
Use small num_leaves
Use min_data_in_leaf and min_sum_hessian_in_leaf
Use bagging by setting bagging_fraction and bagging_freq
Use feature sub-sampling by setting feature_fraction
Use bigger training data
Try lambda_l1, lambda_l2 and min_gain_to_split for regularization
Try max_depth to avoid growing a deep tree.
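As an illustration of the "Deal with Over-fitting" checklist above, one hedged parameter set; the values are placeholders to adapt to the data at hand, not recommendations:
overfit_control = {'objective': 'regression',
    'max_bin': 63,                    # max_bin belongs to the Dataset construction parameters
    'num_leaves': 15,
    'min_data_in_leaf': 50,
    'min_sum_hessian_in_leaf': 1e-2,
    'bagging_fraction': 0.7,          # row subsampling every bagging_freq iterations
    'bagging_freq': 5,
    'feature_fraction': 0.7,          # column subsampling per tree
    'lambda_l1': 0.1,                 # L1 / L2 regularization
    'lambda_l2': 0.1,
    'min_gain_to_split': 0.01,
    'max_depth': 5
}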
c. A few key parameters:
[figure: table of the main parameters]
3. Personal takeaways from the tuning process
Honestly, most of what is written above is generic advice. Tuning comes down to trying values one by one: with different datasets and different train/test splits, the optimal parameters differ, and there is no shortcut. Either you tune by hand or you automate the search (grid search, random search, and so on). The ranges other people give are only rough references, mainly useful as bounds for an automated search. All hyperparameter tuning works this way.
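As one way to automate the search mentioned above, here is a minimal sketch using the scikit-learn wrapper LGBMClassifier together with GridSearchCV; the parameter grid is illustrative only:
from sklearn.model_selection import GridSearchCV

param_grid = {'num_leaves': [15, 31, 63],
              'learning_rate': [0.05, 0.1],
              'min_child_samples': [10, 20, 50]}  # sklearn-API name for min_data_in_leaf
grid = GridSearchCV(lgb.LGBMClassifier(n_estimators=200),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)                        # exhaustively tries every combination
print(grid.best_params_, grid.best_score_)
print('test accuracy:', grid.score(X_test, y_test))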
Of course, two issues remain. If, in a classification task, accuracy stays at a low level no matter how you tune, the cause is probably the structure and distribution of the input features; in that case, consider operations such as dimensionality reduction on the features.
It is also possible that accuracy is extremely high, in which case the model is almost certainly over-fitting. Over-fitting has many causes: from the tuning side, refer to the tables above; from the data side, the dataset may be too small or the features may not be independent. Analyze each case on its own.
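For the low-accuracy case, one hedged sketch of the dimensionality-reduction idea, using PCA before training; whether it actually helps depends entirely on the data:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)        # keep components explaining 95% of the variance
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)  # apply the same projection to the test set
clf = lgb.LGBMClassifier(num_leaves=20, learning_rate=0.07)
clf.fit(X_train_pca, y_train)
print('accuracy after PCA:', clf.score(X_test_pca, y_test))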