sklearn Beginner Cram Notes 3: Model Selection and Estimator Parameters


Model selection: choosing estimators and their parameters

Contents:
Score, and cross-validated scores
Cross-validation generators
Grid-search and cross-validated estimators
  Grid-search
  Nested cross-validation
  Cross-validated estimators
References


I'm writing this to help a friend with an assignment. Since I had never studied this before, I decided to simply work through the official tutorial and take notes for future reference.


Score, and cross-validated scores

Every estimator exposes a score method that judges the quality of its fit; the returned value is the model's score, and bigger is better.

from sklearn import datasets, svm

X_digits, y_digits = datasets.load_digits(return_X_y=True)
svc = svm.SVC(C=1, kernel='linear')
svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
0.98

A train/test split you picked yourself does not necessarily say much about the model's quality: it may just happen to score well on that particular split and poorly on another. For a better measure, we can split the data into several folds and score the model on each fold in turn:

import numpy as np

X_folds = np.array_split(X_digits, 3)
y_folds = np.array_split(y_digits, 3)
scores = list()
for k in range(3):
    # We use 'list' to copy, in order to 'pop' later on
    X_train = list(X_folds)
    X_test = X_train.pop(k)
    X_train = np.concatenate(X_train)
    y_train = list(y_folds)
    y_test = y_train.pop(k)
    y_train = np.concatenate(y_train)
    scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
print(scores)
[0.934..., 0.956..., 0.939...]

This is called KFold cross-validation.
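As an aside (my own addition, not from the tutorial): the fixed slicing above, X_digits[:-100], is just one particular split. A shuffled held-out split is usually made with train_test_split; here is a minimal sketch, where the test_size of 0.25 is an arbitrary choice of mine:

from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

X_digits, y_digits = datasets.load_digits(return_X_y=True)

# Shuffle and hold out 25% of the samples as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X_digits, y_digits, test_size=0.25, random_state=0)

svc = svm.SVC(C=1, kernel='linear')
print(svc.fit(X_train, y_train).score(X_test, y_test))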

Cross-validation generators

The cross-validation generators in sklearn all have a split method that generates the train/test sample indices for you:

from sklearn.model_selection import KFold, cross_val_score

X = ["a", "a", "a", "b", "b", "c", "c", "c", "c", "c"]
k_fold = KFold(n_splits=5)
for train_indices, test_indices in k_fold.split(X):
    print('Train: %s | test: %s' % (train_indices, test_indices))
Train: [2 3 4 5 6 7 8 9] | test: [0 1]
Train: [0 1 4 5 6 7 8 9] | test: [2 3]
Train: [0 1 2 3 6 7 8 9] | test: [4 5]
Train: [0 1 2 3 4 5 8 9] | test: [6 7]
Train: [0 1 2 3 4 5 6 7] | test: [8 9]
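Another generator worth knowing (my own aside, not covered at this point in the tutorial) is StratifiedKFold, which keeps the class proportions roughly the same in every fold. A minimal sketch:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.zeros((10, 1))                   # the features are irrelevant here
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# split(X, y) needs the labels; each test fold gets one sample per class
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    print('Train: %s | test: %s' % (train_idx, test_idx))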

The cross-validation score is then easy to compute:

[svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
...  for train, test in k_fold.split(X_digits)]
[0.963..., 0.922..., 0.963..., 0.963..., 0.930...]

Of course, you can also get this directly:

cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)
array([0.96388889, 0.92222222, 0.9637883 , 0.9637883 , 0.93036212])

Here n_jobs=-1 means the computation is dispatched to all of the machine's CPU cores.
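Note also (my own aside) that cv does not have to be a generator object; a plain integer works too, and for a classifier such as svc it falls back to a stratified k-fold:

# Reuses svc, X_digits, y_digits from above; integer cv means 5 folds
cross_val_score(svc, X_digits, y_digits, cv=5)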

It is worth mentioning that if you want to explore the other model-evaluation tools, they all live in the metrics module. A scorer can also be selected directly by name; everything is wrapped up for you, just pass the scoring parameter:

cross_val_score(svc, X_digits, y_digits, cv=k_fold, scoring='precision_macro')
array([0.96578289, 0.92708922, 0.96681476, 0.96362897, 0.93192644])
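You are not limited to the predefined scorer names either: any metric function can be wrapped into a scorer with make_scorer. A sketch of my own (the choice of macro-averaged f1_score is just an example):

from sklearn.metrics import f1_score, make_scorer

macro_f1 = make_scorer(f1_score, average='macro')
cross_val_score(svc, X_digits, y_digits, cv=k_fold, scoring=macro_f1)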

There are many other cross-validation generators to play with besides KFold (the tutorial lists them in a table). There is also a small exercise script to try:

print(__doc__)

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import datasets, svm

X, y = datasets.load_digits(return_X_y=True)

svc = svm.SVC(kernel='linear')
C_s = np.logspace(-10, 0, 10)

scores = list()
scores_std = list()
for C in C_s:
    svc.C = C
    this_scores = cross_val_score(svc, X, y, n_jobs=1)
    scores.append(np.mean(this_scores))
    scores_std.append(np.std(this_scores))

# Do the plotting
import matplotlib.pyplot as plt

plt.figure()
plt.semilogx(C_s, scores)
plt.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')
plt.semilogx(C_s, np.array(scores) - np.array(scores_std), 'b--')
locs, labels = plt.yticks()
plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))
plt.ylabel('CV score')
plt.xlabel('Parameter C')
plt.ylim(0, 1.1)
plt.show()

[Figure: CV score as a function of the parameter C, with dashed bands at plus/minus one standard deviation]

Grid-search and cross-validated estimators

Grid-search

During fitting, grid-search finds the hyperparameter setting that maximizes the cross-validation score, which is very convenient: you only need to supply the data and the estimator object.

>>> from sklearn.model_selection import GridSearchCV, cross_val_score
>>> Cs = np.logspace(-6, -1, 10)
>>> clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),
...                    n_jobs=-1)
>>> clf.fit(X_digits[:1000], y_digits[:1000])
GridSearchCV(cv=None,...
>>> clf.best_score_
0.925...
>>> clf.best_estimator_.C
0.0077...
>>> # Prediction performance on test set is not as good as on train set
>>> clf.score(X_digits[1000:], y_digits[1000:])
0.943...

By default, GridSearchCV uses 3-fold cross-validation (5-fold in recent versions). However, if it detects that a classifier, rather than a regressor, was passed in, it uses a stratified k-fold instead.
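param_grid is not limited to a single parameter: a dict with several entries makes GridSearchCV search the full Cartesian product of the candidate values. A sketch of my own (the RBF kernel and the value ranges are illustrative, not from the tutorial):

import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': np.logspace(-2, 2, 5),
    'gamma': np.logspace(-4, 0, 5),
}
# 5 x 5 = 25 candidate combinations, each scored by cross-validation;
# reuses X_digits, y_digits from above
clf_rbf = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, n_jobs=-1)
clf_rbf.fit(X_digits[:1000], y_digits[:1000])
print(clf_rbf.best_params_)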

Nested cross-validation

cross_val_score(clf, X_digits, y_digits)
array([0.938..., 0.963..., 0.944...])

Essentially there are two cross-validation loops here: the inner one, run by the GridSearchCV estimator, searches over the parameter grid for the best hyperparameter, while the outer one, run by cross_val_score, walks over the folds and measures prediction performance. "The resulting scores are unbiased estimates of the prediction score on new data" — unbiased because each outer test fold is never seen by the inner parameter search, so no information about it leaks into the model selection. Warning: you cannot nest objects with parallel computing (n_jobs different than 1).
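To make the two loops concrete, here is a sketch of mine spelling out roughly what cross_val_score does when handed the GridSearchCV object clf from above; the inner search never touches the outer test fold, which is exactly why the outer scores are unbiased:

from sklearn.model_selection import KFold

outer_scores = []
for train, test in KFold(n_splits=3).split(X_digits):
    # Inner loop: GridSearchCV picks C by cross-validation
    # using only the outer training fold
    clf.fit(X_digits[train], y_digits[train])
    # Outer loop: score the refitted best model on data the
    # parameter search never saw
    outer_scores.append(clf.score(X_digits[test], y_digits[test]))
print(outer_scores)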

Cross-validated estimators

Tuning a parameter by cross-validation can be done much more efficiently on an algorithm-by-algorithm basis: for certain estimators, scikit-learn exposes estimators that set their parameter automatically by cross-validation. With these, the hyperparameter is chosen for you:

>>> from sklearn import linear_model, datasets
>>> lasso = linear_model.LassoCV()
>>> X_diabetes, y_diabetes = datasets.load_diabetes(return_X_y=True)
>>> lasso.fit(X_diabetes, y_diabetes)
LassoCV()
>>> # The estimator chose automatically its lambda:
>>> lasso.alpha_
0.00375...
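Another estimator of this kind (my own example, not in this part of the tutorial) is RidgeCV, which picks its regularization strength from a list of candidate alphas:

import numpy as np
from sklearn import linear_model, datasets

X_diabetes, y_diabetes = datasets.load_diabetes(return_X_y=True)

ridge = linear_model.RidgeCV(alphas=np.logspace(-6, 6, 13))
ridge.fit(X_diabetes, y_diabetes)
print(ridge.alpha_)   # the alpha selected by cross-validation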

These cross-validated hyperparameter estimators are named like the base models with "CV" appended (Lasso becomes LassoCV, and so on). Below is the official exercise script:

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_diabetes(return_X_y=True)
X = X[:150]
y = y[:150]

lasso = Lasso(random_state=0, max_iter=10000)
alphas = np.logspace(-4, -0.5, 30)

tuned_parameters = [{'alpha': alphas}]
n_folds = 5

clf = GridSearchCV(lasso, tuned_parameters, cv=n_folds, refit=False)
clf.fit(X, y)
scores = clf.cv_results_['mean_test_score']
scores_std = clf.cv_results_['std_test_score']
plt.figure().set_size_inches(8, 6)
plt.semilogx(alphas, scores)

# plot error lines showing +/- std. errors of the scores
std_error = scores_std / np.sqrt(n_folds)

plt.semilogx(alphas, scores + std_error, 'b--')
plt.semilogx(alphas, scores - std_error, 'b--')

# alpha=0.2 controls the translucency of the fill color
plt.fill_between(alphas, scores + std_error, scores - std_error, alpha=0.2)

plt.ylabel('CV score +/- std error')
plt.xlabel('alpha')
plt.axhline(np.max(scores), linestyle='--', color='.5')
plt.xlim([alphas[0], alphas[-1]])

# #############################################################################
# Bonus: how much can you trust the selection of alpha?

# To answer this question we use the LassoCV object that sets its alpha
# parameter automatically from the data by internal cross-validation (i.e. it
# performs cross-validation on the training data it receives).
# We use external cross-validation to see how much the automatically obtained
# alphas differ across different cross-validation folds.
lasso_cv = LassoCV(alphas=alphas, random_state=0, max_iter=10000)
k_fold = KFold(3)

print("Answer to the bonus question:",
      "how much can you trust the selection of alpha?")
print()
print("Alpha parameters maximising the generalization score on different")
print("subsets of the data:")
for k, (train, test) in enumerate(k_fold.split(X, y)):
    lasso_cv.fit(X[train], y[train])
    print("[fold {0}] alpha: {1:.5f}, score: {2:.5f}".
          format(k, lasso_cv.alpha_, lasso_cv.score(X[test], y[test])))
print()
print("Answer: Not very much since we obtained different alphas for different")
print("subsets of the data and moreover, the scores for these alphas differ")
print("quite substantially.")

plt.show()

[Figure: CV score as a function of alpha, with plus/minus standard-error bands]

References

The official scikit-learn tutorial: Model selection: choosing estimators and their parameters
