I'm helping a friend with a homework assignment, and since I never studied this before, I decided to simply work through the official tutorial and take notes for future reference.
This is called KFold cross-validation: the data is split into K folds, and each fold takes one turn as the test set while the remaining folds are used for training.
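To make the fold mechanics concrete, here is a toy sketch of what `KFold.split` yields (the ten-sample array is my own example, not from the tutorial):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)            # ten toy samples
k_fold = KFold(n_splits=5)   # no shuffling by default
for train_idx, test_idx in k_fold.split(X):
    # each fold serves as the test set exactly once
    print("train:", train_idx, "test:", test_idx)
```

Each iteration prints disjoint index arrays, and concatenating the five test arrays recovers every sample exactly once.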
With that, the cross-validation scores are easy to compute:
```python
>>> [svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
...  for train, test in k_fold.split(X_digits)]
[0.963..., 0.922..., 0.963..., 0.963..., 0.930...]
```

Of course, you can also get this directly:
```python
>>> cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)
array([0.96388889, 0.92222222, 0.9637883 , 0.9637883 , 0.93036212])
```

Here `n_jobs=-1` means the computation is spread across all available CPU cores.
Worth mentioning: if you want to dig deeper into the other model-evaluation tools, they all live in the `metrics` module. In fact, a scorer can be selected directly by name — it's all wrapped up for you; just pass the `scoring` parameter.
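As a quick check of what those scoring names resolve to, `sklearn.metrics.get_scorer` looks a scorer up by name — the same lookup `scoring=` performs internally. A sketch (the digits/SVC setup mirrors the tutorial; the train/test split is my own addition):

```python
from sklearn import datasets, svm
from sklearn.metrics import get_scorer
from sklearn.model_selection import train_test_split

X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svc = svm.SVC(kernel='linear').fit(X_train, y_train)

# resolve the string to a scorer object, then call it on the fitted model
scorer = get_scorer('precision_macro')
print(scorer(svc, X_test, y_test))  # a float in [0, 1]
```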
```python
>>> cross_val_score(svc, X_digits, y_digits, cv=k_fold, scoring='precision_macro')
array([0.96578289, 0.92708922, 0.96681476, 0.96362897, 0.93192644])
```

The figure in the tutorial shows that there are many more cross-validation generators to play with. There is also a small exercise script to try:
```python
print(__doc__)

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import datasets, svm

X, y = datasets.load_digits(return_X_y=True)

svc = svm.SVC(kernel='linear')
C_s = np.logspace(-10, 0, 10)

scores = list()
scores_std = list()
for C in C_s:
    svc.C = C
    this_scores = cross_val_score(svc, X, y, n_jobs=1)
    scores.append(np.mean(this_scores))
    scores_std.append(np.std(this_scores))

# Do the plotting
import matplotlib.pyplot as plt

plt.figure()
plt.semilogx(C_s, scores)
plt.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')
plt.semilogx(C_s, np.array(scores) - np.array(scores_std), 'b--')
locs, labels = plt.yticks()
plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))
plt.ylabel('CV score')
plt.xlabel('Parameter C')
plt.ylim(0, 1.1)
plt.show()
```

The resulting figure looks roughly like this:
GridSearchCV uses 3-fold cross-validation by default (the exact number depends on your version; newer releases default to 5-fold), and if it detects that a classifier is passed rather than a regressor, it automatically uses stratified k-fold instead.
Essentially it's two nested loops: one iterates over the candidate parameters, the other over the cross-validation folds, and the parameter setting with the highest score wins. The docs add: "The resulting scores are unbiased estimates of the prediction score on new data." That sentence is interesting — it means the score is always measured on held-out folds the model never trained on, so it isn't inflated by how well the model memorized the training set. Warning: You cannot nest objects with parallel computing (n_jobs different than 1).
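Those two loops are exactly what GridSearchCV automates. A sketch along the lines of the tutorial's digits example (the C grid and the 1000-sample train/held-out split follow the tutorial):

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

X_digits, y_digits = datasets.load_digits(return_X_y=True)
svc = svm.SVC(kernel='linear')

Cs = np.logspace(-6, -1, 10)
clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs), n_jobs=-1)
clf.fit(X_digits[:1000], y_digits[:1000])  # both loops run here

print(clf.best_score_)        # best mean CV score found on the grid
print(clf.best_estimator_.C)  # the winning parameter value
# prediction on data the search never saw; may be slightly below the CV score
print(clf.score(X_digits[1000:], y_digits[1000:]))
```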
For these models, the hyperparameter cross-validation estimator is simply the model name with "CV" appended (LassoCV for Lasso, and so on). Below is an official exercise example script:
```python
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_diabetes(return_X_y=True)
X = X[:150]
y = y[:150]

lasso = Lasso(random_state=0, max_iter=10000)
alphas = np.logspace(-4, -0.5, 30)

tuned_parameters = [{'alpha': alphas}]
n_folds = 5

clf = GridSearchCV(lasso, tuned_parameters, cv=n_folds, refit=False)
clf.fit(X, y)
scores = clf.cv_results_['mean_test_score']
scores_std = clf.cv_results_['std_test_score']

plt.figure().set_size_inches(8, 6)
plt.semilogx(alphas, scores)

# plot error lines showing +/- std. errors of the scores
std_error = scores_std / np.sqrt(n_folds)

plt.semilogx(alphas, scores + std_error, 'b--')
plt.semilogx(alphas, scores - std_error, 'b--')

# alpha=0.2 controls the translucency of the fill color
plt.fill_between(alphas, scores + std_error, scores - std_error, alpha=0.2)

plt.ylabel('CV score +/- std error')
plt.xlabel('alpha')
plt.axhline(np.max(scores), linestyle='--', color='.5')
plt.xlim([alphas[0], alphas[-1]])

# #############################################################################
# Bonus: how much can you trust the selection of alpha?

# To answer this question we use the LassoCV object that sets its alpha
# parameter automatically from the data by internal cross-validation (i.e. it
# performs cross-validation on the training data it receives).
# We use external cross-validation to see how much the automatically obtained
# alphas differ across different cross-validation folds.
lasso_cv = LassoCV(alphas=alphas, random_state=0, max_iter=10000)
k_fold = KFold(3)

print("Answer to the bonus question:",
      "how much can you trust the selection of alpha?")
print()
print("Alpha parameters maximising the generalization score on different")
print("subsets of the data:")
for k, (train, test) in enumerate(k_fold.split(X, y)):
    lasso_cv.fit(X[train], y[train])
    print("[fold {0}] alpha: {1:.5f}, score: {2:.5f}".
          format(k, lasso_cv.alpha_, lasso_cv.score(X[test], y[test])))
print()
print("Answer: Not very much since we obtained different alphas for different")
print("subsets of the data and moreover, the scores for these alphas differ")
print("quite substantially.")

plt.show()
```

The resulting figure looks roughly like this:
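The bonus section of the script nests LassoCV inside an outer cross-validation loop; in its simplest form the estimator is used on its own like this (a minimal sketch — `cv=5` and the default alpha grid are my choices, not the script's):

```python
from sklearn import datasets
from sklearn.linear_model import LassoCV

X, y = datasets.load_diabetes(return_X_y=True)

# LassoCV picks alpha by internal cross-validation during fit
reg = LassoCV(cv=5, random_state=0).fit(X, y)
print(reg.alpha_)       # the alpha it settled on
print(reg.score(X, y))  # R^2 on the data it was fit on
```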
Reference: the scikit-learn official tutorial.