[Machine Learning Basics] The LightGBM Operations You Should Know!


LightGBM is a fast, parallelizable gradient-boosted tree framework in the same family as XGBoost. It incorporates several ensemble-learning ideas and improves on XGBoost's node-splitting implementation, so it uses less memory and trains faster.

LightGBM documentation: https://lightgbm.readthedocs.io/en/latest/

Parameter reference: https://lightgbm.readthedocs.io/en/latest/Parameters.html

This article covers the topics below; see the end of the article for how to obtain the original code.

1 Installation

2 Usage

2.1 Defining the dataset

2.2 Model training

2.3 Saving and loading models

2.4 Inspecting feature importance

2.5 Continuing training

2.6 Adjusting hyperparameters during training

2.7 Custom loss functions

2.8 Hyperparameter tuning

Manual tuning

Grid search

Bayesian optimization

1 Installation

Installing LightGBM is straightforward, and on Linux it is easy to enable GPU training. Prefer installing with pip; fall back to building from source if that fails.

Installation: build from source

git clone --recursive https://github.com/microsoft/LightGBM
cd LightGBM
mkdir build
cd build
cmake ..
# Enable MPI communication for faster distributed training
# cmake -DUSE_MPI=ON ..
# GPU version, faster training
# cmake -DUSE_GPU=1 ..
make -j4

Installation: pip

# Default version
pip install lightgbm
# MPI version
pip install lightgbm --install-option=--mpi
# GPU version
pip install lightgbm --install-option=--gpu
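To confirm the installation works, a minimal smoke test like the following can help (the toy data here is made up purely for illustration; the 'device': 'gpu' parameter only applies if you built the GPU version):

import numpy as np
import lightgbm as lgb

print(lgb.__version__)

# Train a tiny model on random data just to verify the build.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, 100)
train_set = lgb.Dataset(X, y)
params = {'objective': 'binary', 'verbose': -1}  # add 'device': 'gpu' for the GPU build
bst = lgb.train(params, train_set, num_boost_round=5)
print('Trained', bst.num_trees(), 'trees')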

2 Usage

In Python, LightGBM provides two interfaces: the native API and the Scikit-learn API. Both support training and validation. The native API is more flexible, so choose according to personal preference.
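The rest of this article uses the native API. For reference, a minimal sketch of the Scikit-learn-style interface (the toy data here is hypothetical, only to show the calls):

import numpy as np
from lightgbm import LGBMClassifier

# Hypothetical toy data just to illustrate the interface
X = np.random.rand(500, 10)
y = np.random.randint(0, 2, 500)

clf = LGBMClassifier(num_leaves=31, learning_rate=0.05, n_estimators=100)
clf.fit(X[:400], y[:400], eval_set=[(X[400:], y[400:])])
print(clf.predict_proba(X[400:])[:5])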

2.1 Defining the dataset

import pandas as pd
import lightgbm as lgb

df_train = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.train',
                       header=None, sep='\t')
df_test = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.test',
                      header=None, sep='\t')
W_train = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.train.weight',
                      header=None)[0]
W_test = pd.read_csv('https://cdn.coggle.club/LightGBM/examples/binary_classification/binary.test.weight',
                     header=None)[0]

y_train = df_train[0]
y_test = df_test[0]
X_train = df_train.drop(0, axis=1)
X_test = df_test.drop(0, axis=1)
num_train, num_feature = X_train.shape

# create dataset for lightgbm
# if you want to re-use data, remember to set free_raw_data=False
lgb_train = lgb.Dataset(X_train, y_train,
                        weight=W_train, free_raw_data=False)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train,
                       weight=W_test, free_raw_data=False)

2.2 Model training

params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'binary_logloss',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0
}

# generate feature names
feature_name = ['feature_' + str(col) for col in range(num_feature)]

gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                valid_sets=lgb_train,  # eval training data
                feature_name=feature_name,
                categorical_feature=[21])

2.3 Saving and loading models

import json

# save model to file
gbm.save_model('model.txt')

print('Dumping model to JSON...')
model_json = gbm.dump_model()
with open('model.json', 'w+') as f:
    json.dump(model_json, f, indent=4)
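The saved text model can be loaded back for prediction. A minimal sketch, reusing the file name and test data from the snippets above:

# load model from file and predict with it
bst = lgb.Booster(model_file='model.txt')
y_pred = bst.predict(X_test)
print(y_pred[:5])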

2.4 Inspecting feature importance

# feature names
print('Feature names:', gbm.feature_name())

# feature importances
print('Feature importances:', list(gbm.feature_importance()))
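To make the raw importance numbers easier to read, one option (a small sketch, assuming pandas is imported as above) is to pair them with the feature names and sort:

importance_df = pd.DataFrame({
    'feature': gbm.feature_name(),
    'importance': gbm.feature_importance()
}).sort_values('importance', ascending=False)
print(importance_df.head(10))

# LightGBM also ships a plotting helper (requires matplotlib):
# lgb.plot_importance(gbm, max_num_features=10)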

2.5 Continuing training

# continue training
# init_model accepts:
# 1. model file name
# 2. Booster()
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model='model.txt',
                valid_sets=lgb_eval)
print('Finished 10 - 20 rounds with model file...')

2.6 Adjusting hyperparameters during training

# decay learning rates
# learning_rates accepts:
# 1. list/tuple with length = num_boost_round
# 2. function(curr_iter)
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                learning_rates=lambda iter: 0.05 * (0.99 ** iter),
                valid_sets=lgb_eval)
print('Finished 20 - 30 rounds with decay learning rates...')

# change other parameters during training
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                valid_sets=lgb_eval,
                callbacks=[lgb.reset_parameter(bagging_fraction=[0.7] * 5 + [0.6] * 5)])
print('Finished 30 - 40 rounds with changing bagging_fraction...')
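Note that the learning_rates argument has been removed in recent LightGBM releases. If the first call above fails on your version, the reset_parameter callback should achieve the same decay schedule (a sketch, not verified against every release):

# Callback-based equivalent of the learning-rate decay above
gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                valid_sets=lgb_eval,
                callbacks=[lgb.reset_parameter(
                    learning_rate=lambda iter: 0.05 * (0.99 ** iter))])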

2.7 Custom loss functions

import numpy as np

# self-defined objective function
# f(preds: array, train_data: Dataset) -> grad: array, hess: array
# log likelihood loss
def loglikelihood(preds, train_data):
    labels = train_data.get_label()
    preds = 1. / (1. + np.exp(-preds))
    grad = preds - labels
    hess = preds * (1. - preds)
    return grad, hess

# self-defined eval metric
# f(preds: array, train_data: Dataset) -> name: string, eval_result: float, is_higher_better: bool
# binary error
# NOTE: when you use a customized objective, the prediction value passed in is the raw margin,
# which may make built-in evaluation metrics compute wrong results.
# For example, with the log likelihood loss above, the prediction is the score before the
# logistic transformation. Keep this in mind when using customization.
def binary_error(preds, train_data):
    labels = train_data.get_label()
    preds = 1. / (1. + np.exp(-preds))
    return 'error', np.mean(labels != (preds > 0.5)), False

gbm = lgb.train(params,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                fobj=loglikelihood,
                feval=binary_error,
                valid_sets=lgb_eval)
print('Finished 40 - 50 rounds with self-defined objective function and eval metric...')
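As with learning_rates, the fobj argument was dropped in newer LightGBM versions; to the best of my knowledge the custom objective is now passed through params instead (hedged sketch, check your installed version's docs):

# In recent LightGBM releases the custom objective goes into params rather than fobj
params_custom = dict(params, objective=loglikelihood)
gbm = lgb.train(params_custom,
                lgb_train,
                num_boost_round=10,
                init_model=gbm,
                feval=binary_error,
                valid_sets=lgb_eval)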

2.8 Hyperparameter tuning

Manual tuning (see the example configuration after the lists below)

For Faster Speed

Use bagging by setting bagging_fraction and bagging_freq

Use feature sub-sampling by setting feature_fraction

Use small max_bin

Use save_binary to speed up data loading in future learning

Use parallel learning; refer to the Parallel Learning Guide in the official docs

For Better Accuracy

Use large max_bin (may be slower)

Use small learning_rate with large num_iterations

Use large num_leaves (may cause over-fitting)

Use bigger training data

Try dart

Deal with Over-fitting

Use small max_bin

Use small num_leaves

Use min_data_in_leaf and min_sum_hessian_in_leaf

Use bagging by setting bagging_fraction and bagging_freq

Use feature sub-sampling by setting feature_fraction

Use bigger training data

Try lambda_l1, lambda_l2 and min_gain_to_split for regularization

Try max_depth to avoid growing deep tree

Try extra_trees

Try increasing path_smooth
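As referenced above, here is a hedged example that combines several of the over-fitting knobs from these lists into one configuration; the values are illustrative only, not recommendations, and reuse the datasets defined earlier:

params_regularized = {
    'objective': 'binary',
    'num_leaves': 15,             # fewer leaves -> simpler trees
    'max_depth': 6,               # cap tree depth
    'min_data_in_leaf': 50,       # each leaf needs enough samples
    'feature_fraction': 0.8,      # feature sub-sampling
    'bagging_fraction': 0.8,      # row sub-sampling
    'bagging_freq': 5,
    'lambda_l1': 0.1,             # L1 regularization
    'lambda_l2': 0.1,             # L2 regularization
    'min_gain_to_split': 0.01,
}
gbm_reg = lgb.train(params_regularized, lgb_train,
                    num_boost_round=100, valid_sets=lgb_eval)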

Grid search

from sklearn.model_selection import GridSearchCV

lg = lgb.LGBMClassifier(silent=False)
param_dist = {"max_depth": [4, 5, 7],
              "learning_rate": [0.01, 0.05, 0.1],
              "num_leaves": [300, 900, 1200],
              "n_estimators": [50, 100, 150]}

grid_search = GridSearchCV(lg, n_jobs=-1, param_grid=param_dist, cv=5,
                           scoring="roc_auc", verbose=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_estimator_, grid_search.best_score_)

Bayesian optimization

import warnings
import time
warnings.filterwarnings("ignore")
from bayes_opt import BayesianOptimization

def lgb_eval(max_depth, learning_rate, num_leaves, n_estimators):
    params = {"metric": 'auc'}
    params['max_depth'] = int(max(max_depth, 1))
    params['learning_rate'] = np.clip(learning_rate, 0, 1)
    params['num_leaves'] = int(max(num_leaves, 1))
    params['n_estimators'] = int(max(n_estimators, 1))
    cv_result = lgb.cv(params, lgb_train, nfold=5, seed=0,
                       verbose_eval=200, stratified=False)
    return 1.0 * np.array(cv_result['auc-mean']).max()

lgbBO = BayesianOptimization(lgb_eval, {'max_depth': (4, 8),
                                        'learning_rate': (0.05, 0.2),
                                        'num_leaves': (20, 1500),
                                        'n_estimators': (5, 200)},
                             random_state=0)
lgbBO.maximize(init_points=5, n_iter=50, acq='ei')
print(lgbBO.max)
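After the search, lgbBO.max holds the best score and the corresponding parameters. A sketch of feeding them back into a final model (remember to cast the integer-valued parameters; the keys come from the search space above):

best = lgbBO.max['params']
final_params = {
    'objective': 'binary',
    'metric': 'auc',
    'max_depth': int(best['max_depth']),
    'learning_rate': best['learning_rate'],
    'num_leaves': int(best['num_leaves']),
}
final_model = lgb.train(final_params, lgb_train,
                        num_boost_round=int(best['n_estimators']),
                        valid_sets=lgb_eval)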

To get the code notebook for this article, reply 【lgb】 to the author's WeChat official account "datawhale".
