Contents
4-5 Hyperparameters (05-Hyper-Parameters)
4-6 Grid search and more hyperparameters in the k-nearest neighbors algorithm
4-5 Hyperparameters (05-Hyper-Parameters)
random_state=666 sets the random seed, so every run produces the same result.
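To make the effect of the seed concrete, here is a minimal sketch. The toy arrays and the test_size value are assumptions for illustration only; the notes do not show the original split call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 20 samples, 2 features
X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2

# Two independent calls with the same seed
split_a = train_test_split(X, y, test_size=0.2, random_state=666)
split_b = train_test_split(X, y, test_size=0.2, random_state=666)

# Every returned array (X_train, X_test, y_train, y_test) matches pairwise
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same)
```

Without a fixed random_state the split is generally different on each run, so scores are not directly comparable between runs.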
# X_train, X_test, y_train, y_test come from the earlier train/test split
best_score = 0.0
best_k = -1
for k in range(1, 11):
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score
print("best_k =", best_k)
print("best_score =", best_score)
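If the best value lies on the boundary of the search range, a better value may exist outside it. For example, if best_k is 10, values above 10 should be searched as well.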
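The boundary rule can be sketched as a small helper that widens the range while the best k sits on the upper edge. The names search_k, step, hard_limit and toy_score are hypothetical and not part of the notes; toy_score stands in for fitting a classifier and calling score.

```python
def search_k(score_fn, k_max=10, step=10, hard_limit=100):
    """Search k = 1..k_max and extend the range while the best k is on its upper edge."""
    best_k, best_score = -1, float("-inf")
    start = 1
    while True:
        for k in range(start, k_max + 1):
            score = score_fn(k)
            if score > best_score:
                best_k, best_score = k, score
        # Stop when the best k is strictly inside the range (or the limit is reached)
        if best_k < k_max or k_max >= hard_limit:
            return best_k, best_score
        start, k_max = k_max + 1, min(k_max + step, hard_limit)


def toy_score(k):
    # Hypothetical score curve that peaks at k = 13, outside the first range 1..10
    return 1.0 - abs(k - 13) / 100.0


best_k, best_score = search_k(toy_score)
print("best_k =", best_k)
print("best_score =", best_score)
```

In practice score_fn would train KNeighborsClassifier(n_neighbors=k) and return its score on held-out data.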
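Plain k-NN only counts votes and ignores weights. Giving nearer neighbors a larger weight is more reasonable.
The weight is the reciprocal of the distance.
Weighting also resolves ties: if each class gets exactly one vote, the plain vote is tied.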
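A minimal sketch of the two voting schemes, assuming neighbors are given as (label, distance) pairs with strictly positive distances. The function names and the sample neighbors are made up for illustration; this is not how scikit-learn implements it internally.

```python
from collections import Counter, defaultdict


def uniform_vote(neighbors):
    """Count one vote per neighbor, ignoring distance."""
    return Counter(label for label, _ in neighbors)


def distance_vote(neighbors):
    """Weight each vote by 1 / distance and return the label with the largest total."""
    totals = defaultdict(float)
    for label, dist in neighbors:
        totals[label] += 1.0 / dist  # assumes dist > 0
    return max(totals, key=totals.get)


# Two far "blue" neighbors outvote one near "red" neighbor under uniform voting,
# but the weighted totals are red = 1.0 and blue = 1/3 + 1/4.
neighbors = [("red", 1.0), ("blue", 3.0), ("blue", 4.0)]
print(uniform_vote(neighbors))
print(distance_vote(neighbors))

# Three classes with one vote each: a tie under uniform voting,
# resolved in favor of the nearest neighbor when weighted.
tied = [("red", 1.0), ("blue", 2.0), ("green", 3.0)]
print(uniform_vote(tied))
print(distance_vote(tied))
```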
sklearn.neighbors.KNeighborsClassifier — scikit-learn 1.0 documentation
See the official documentation for the parameter descriptions.
best_score = 0.0
best_k = -1
best_method = ""
for method in ["uniform", "distance"]:
    for k in range(1, 11):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method
print("best_method =", best_method)
print("best_k =", best_k)
print("best_score =", best_score)
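The Manhattan distance and the Euclidean distance have the same mathematical form, so they can be generalized into the Minkowski distance:
$$\left( \sum_{i=1}^{n} \left| X_i^{(a)} - X_i^{(b)} \right|^{p} \right)^{1/p}$$
p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance, so p is another hyperparameter.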
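The role of p can be checked numerically with a short sketch. The helper minkowski_distance and the two sample points are illustrative assumptions, not code from the course.

```python
import numpy as np


def minkowski_distance(a, b, p):
    """Minkowski distance between two points: (sum |a_i - b_i|^p)^(1/p)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))


a, b = [0, 0], [3, 4]
d1 = minkowski_distance(a, b, p=1)  # Manhattan: |3| + |4|
d2 = minkowski_distance(a, b, p=2)  # Euclidean: sqrt(3^2 + 4^2)
print("p=1:", d1)
print("p=2:", d2)
```

For these points the Manhattan distance is 7 and the Euclidean distance is 5.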
best_score = 0.0
best_k = -1
best_p = -1
for k in range(1, 11):
    for p in range(1, 6):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_p = p
            best_score = score
print("best_k =", best_k)
print("best_p =", best_p)
print("best_score =", best_score)
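In this search, p is only varied together with weights="distance"; with weights="uniform" it is left at its default. (Strictly speaking, p still determines which neighbors count as nearest under "uniform"; it just does not change how their votes are weighted.)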
4-6 Grid search and more hyperparameters in the k-nearest neighbors algorithm
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]
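weights='uniform': 10 combinations
weights='distance': 10 * 5 = 50 combinations
param_grid is a list of dictionaries; each dictionary defines one set of parameter values to explore.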
knn_clf = KNeighborsClassifier()
10 + 50 = 60 different parameter combinations in total.
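The count can be verified with scikit-learn's ParameterGrid, which enumerates the combinations described by such a list of dictionaries. This is a small check added for illustration; the notes themselves do not use ParameterGrid.

```python
from sklearn.model_selection import ParameterGrid

param_grid = [
    {"weights": ["uniform"], "n_neighbors": list(range(1, 11))},
    {"weights": ["distance"], "n_neighbors": list(range(1, 11)), "p": list(range(1, 6))},
]

grid = ParameterGrid(param_grid)
n_combinations = len(grid)
print("combinations:", n_combinations)

# Each element is one candidate setting passed to the estimator
first = list(grid)[0]
print(first)
```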
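The best weights found by the two searches can differ, because GridSearchCV evaluates each candidate with cross-validation (CV) rather than a single score on the test set.
n_jobs sets the number of CPU cores used for parallel computation; -1 uses all cores.
By default the search prints nothing while it runs. The larger verbose is, the more detailed the progress output, which is the purpose of verbose.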
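A minimal sketch of a grid search with n_jobs and verbose set. The dataset (iris), the split parameters and the reduced grid are assumptions chosen to keep the example fast; they are not the values used in the notes.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=666
)

# Reduced grid (illustrative): 5 + 5 * 2 = 15 candidates
param_grid = [
    {"weights": ["uniform"], "n_neighbors": list(range(1, 6))},
    {"weights": ["distance"], "n_neighbors": list(range(1, 6)), "p": [1, 2]},
]

# n_jobs=-1 uses all cores; verbose=1 prints a short progress summary
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)

print("best params:", grid_search.best_params_)
print("best CV score:", grid_search.best_score_)
# best_estimator_ is refit on the whole training set and can be scored on held-out data
print("test score:", grid_search.best_estimator_.score(X_test, y_test))
```

Note that best_score_ is a mean cross-validation score on the training data, so it is not directly comparable to a score computed on X_test.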
Iris classification example
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X = iris.data[:, :2]  # only the first two features, so the boundary can be drawn in 2D
# X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=6)

# Create color maps
cmap_light = ListedColormap(['orange', 'cyan', 'cornflowerblue'])
cmap_bold = ['darkorange', 'c', 'darkblue']
h = .02  # step size in the mesh


def drawBoundary(knn_clf, n_neighbors, weights):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = knn_clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, cmap=cmap_light)
    # plt.contour(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=iris.target_names[y],
                    palette=cmap_bold, alpha=1.0, edgecolor="black")
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, weights = '%s')"
              % (n_neighbors, weights))
    plt.xlabel(iris.feature_names[0])
    plt.ylabel(iris.feature_names[1])
    plt.show()  # with several figures, each one must be closed before the next is shown


# Hand-written grid search
best_score = 0.0
best_k = -1
best_method = ""
for method in ["uniform", "distance"]:
    for k in range(1, 18):
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method
        # drawBoundary(knn_clf, best_k, best_method)
        # If plt.show() is called inside drawBoundary, then with several figures
        # each one must be closed before the next is shown;
        # calling it here instead avoids that.
        # plt.show()  # required after drawBoundary, otherwise nothing is drawn; when
        #             # single-stepping only part of the figure appears, and nothing
        #             # is shown once the program has finished
print("best_method =", best_method)
print("best_k =", best_k)
print("best_score =", best_score)

# Grid search provided by scikit-learn
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 18)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 18)],
        'p': [i for i in range(1, 6)]
    }
]

clf = KNeighborsClassifier()
# verbose must be non-negative in recent scikit-learn releases (the original used -1)
grid_search = GridSearchCV(clf, param_grid, n_jobs=-1, verbose=1)
grid_search.fit(X_train, y_train)
print(10 * "-------------")
print("best:%f using %s" % (grid_search.best_score_, grid_search.best_params_))
# print(grid_search.best_params_['n_neighbors'])
# print(grid_search.best_params_['weights'])
# print(grid_search.best_estimator_)
# means = grid_search.cv_results_['mean_test_score']
# params = grid_search.cv_results_['params']
#
# for mean, param in zip(means, params):
#     print("%f with: %r" % (mean, param))
drawBoundary(grid_search.best_estimator_,
             grid_search.best_params_['n_neighbors'],
             grid_search.best_params_['weights'])
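Reading data with pandas
Data that pandas reads from Excel has a different type from NumPy data.
pandas gives a DataFrame, while NumPy uses an array (ndarray).
Excel spreadsheet data
Other hyperparameters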
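The difference matters when feeding the data to scikit-learn-style code that expects arrays. The sketch below builds a small DataFrame in memory instead of reading a real Excel file (pd.read_excel also returns a DataFrame); the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in for a table loaded with pd.read_excel(...); columns are made up
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3],
    "sepal_width": [3.5, 3.0, 3.3],
    "label": [0, 0, 2],
})

# Convert the feature columns and the label column to NumPy arrays
X = df[["sepal_length", "sepal_width"]].to_numpy()
y = df["label"].to_numpy()

print(type(df).__name__, "->", type(X).__name__)
print(X.shape, y.shape)
```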
sklearn.neighbors.DistanceMetric — scikit-learn 1.0 documentation
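One such additional hyperparameter is the distance metric itself, which KNeighborsClassifier accepts through its metric argument. A minimal sketch comparing a few metric names on iris; the dataset, the split and n_neighbors=5 are assumptions for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=666
)

# Try several built-in metric names and record the test accuracy of each
scores = {}
for metric in ["euclidean", "manhattan", "chebyshev"]:
    knn_clf = KNeighborsClassifier(n_neighbors=5, metric=metric)
    knn_clf.fit(X_train, y_train)
    scores[metric] = knn_clf.score(X_test, y_test)

for metric, score in scores.items():
    print(metric, score)
```

The metric could equally be added as another key in param_grid and searched together with n_neighbors and weights.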