Predicting NBA Players' Total Points with the K-Nearest Neighbors Algorithm

Date: 2024-02-25 14:49:04

Dataset

The dataset used in this article, nba_.csv, contains per-player NBA statistics for one season. Among its columns:

player: name of the player
pos: the position of the player
g: number of games the player was in
gs: number of games the player started
pts: total points the player scored

import pandas

with open("nba_.csv", 'r') as csvfile:
    nba = pandas.read_csv(csvfile)

# The names of the columns in the data.
print(nba.columns.values)
'''
['player' 'pos' 'age' 'bref_team_id' 'g' 'gs' 'mp' 'fg' 'fga' 'fg.' 'x3p'
 'x3pa' 'x3p.' 'x2p' 'x2pa' 'x2p.' 'efg.' 'ft' 'fta' 'ft.' 'orb' 'drb'
 'trb' 'ast' 'stl' 'blk' 'tov' 'pf' 'pts' 'season' 'season_end']
'''

Euclidean Distance

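The code below computes the Euclidean distance between LeBron James and every player. Note that iloc[0] is needed to pick out LeBron James: the boolean filter returns a DataFrame, and loc indexes by row label, but LeBron James's row label is not known in advance. Selecting the first row by position with iloc[0] is therefore the most suitable choice.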

import math

# Select LeBron James's row as a Series (first match by position).
selected_player = nba[nba["player"] == "LeBron James"].iloc[0]

# The numeric columns used to measure distance.
distance_columns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p', 'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts']

def euclidean_distance(row):
    # Sum the squared differences over all distance columns, then take the root.
    inner_value = 0
    for k in distance_columns:
        inner_value += (row[k] - selected_player[k]) ** 2
    return math.sqrt(inner_value)

# Apply the function to every row (axis=1) to get one distance per player.
lebron_distance = nba.apply(euclidean_distance, axis=1)
'''
lebron_distance : Series (<class 'pandas.core.series.Series'>)
0    3475.792868
1    3148.395016
2    ...1.567361
3    1189.554979
4    3216.773098
...
'''
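To make the distance formula concrete, here is a minimal standalone sketch using only the standard library. The helper name and the two tuples are illustrative assumptions, not values from the dataset.

```python
import math

def euclidean_distance(a, b):
    """Return the Euclidean distance between two equal-length numeric sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Square each per-feature difference, sum them, and take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical feature vectors (not real NBA statistics).
player_a = (3, 4, 0)
player_b = (0, 0, 0)
print(euclidean_distance(player_a, player_b))  # 5.0
```

A vector compared with itself gives a distance of 0, which is why the selected player always appears at the top of a sorted distance list.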

Normalizing Columns

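Attributes with large value ranges dominate the distance measure. To put all attributes on an equal footing, each one is normalized to have mean 0 and variance 1. nba_numeric.mean() returns the mean of each column, and nba_numeric.std() returns the standard deviation of each column.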

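# Keep only the numeric columns used for distances.
nba_numeric = nba[distance_columns]

# Standardize each column: subtract its mean and divide by its standard deviation.
nba_normalized = (nba_numeric - nba_numeric.mean()) / nba_numeric.std()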
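The same standardization can be sketched for a single column with the standard library. The function name and the sample values are assumptions made for illustration. statistics.stdev computes the sample standard deviation (dividing by n - 1), which matches the default of the pandas std() call.

```python
import statistics

def zscore(values):
    """Standardize values to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (n - 1)
    if std == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / std for v in values]

# Hypothetical columns on very different scales (not real NBA statistics).
minutes_played = [1000.0, 2000.0, 3000.0]
shooting_pct = [0.40, 0.45, 0.50]

print(zscore(minutes_played))  # [-1.0, 0.0, 1.0]
print(zscore(shooting_pct))    # roughly the same values, despite the tiny raw scale
```

After standardization both columns span about the same range, so neither dominates the distance.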

Finding The Nearest Neighbor

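Earlier we computed the distance from LeBron James to every player by hand. The scipy.spatial package provides a distance module with many distance functions. Here we use distance.euclidean, which computes the same Euclidean distance as the function above, this time on the normalized columns. We then sort the distances. The smallest one is the distance from LeBron James to himself, which is 0, so the nearest neighbor we want is the second entry. apply is called on an iterable object and takes a function as its argument. Here the distance function needs not only each row but also the LeBron James vector, so a lambda wraps the call to the two-argument distance function.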

from scipy.spatial import distance

# Fill in NA values in nba_normalized
nba_normalized.fillna(0, inplace=True)

# Find the normalized vector for lebron james.
lebron_normalized = nba_normalized[nba["player"] == "LeBron James"].iloc[0]

# Find the distance between lebron james and everyone else.
euclidean_distances = nba_normalized.apply(lambda row: distance.euclidean(row, lebron_normalized), axis=1)

# Create a new dataframe with distances, keeping each player's index in its own column.
distance_frame = pandas.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)

# The first row is lebron himself (distance 0), so the nearest neighbor is the second row.
second_smallest = distance_frame.iloc[1]["idx"]
most_similar_to_lebron = nba.loc[int(second_smallest)]["player"]

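After computing the distance from lebron_normalized to every player, we build a DataFrame and sort it by distance. To preserve each player's index through the sort, an extra idx column is added. This is not strictly necessary, because the DataFrame's row labels already are the players' indices. In distance_frame.sort_values("dist", inplace=True), inplace=True sorts the frame in place and is equivalent to distance_frame = distance_frame.sort_values("dist"). With inplace=False (the default), the original distance_frame is left unchanged and a sorted copy is returned.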
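The idea of carrying the index alongside each distance can be sketched without pandas. The function name and the sample rows below are hypothetical. Unlike the code above, this sketch excludes the target explicitly instead of relying on it sorting first, which also behaves sensibly if another row is identical to the target.

```python
import math

def nearest_neighbor(rows, target_idx):
    """Return the index of the row closest to rows[target_idx], excluding itself."""
    target = rows[target_idx]
    # Pair each distance with its row index so the index survives the comparison.
    candidates = [
        (math.dist(row, target), i)
        for i, row in enumerate(rows)
        if i != target_idx
    ]
    if not candidates:
        raise ValueError("need at least two rows")
    _, best_idx = min(candidates)
    return best_idx

# Hypothetical normalized feature vectors (not real NBA statistics).
rows = [(0.0, 0.0), (1.0, 1.0), (0.1, -0.2), (3.0, 4.0)]
print(nearest_neighbor(rows, 0))  # 2
```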

Generating Training And Testing Sets

The training and test sets are built from the raw data, which has not been normalized:

import math
from numpy.random import permutation

# Randomly shuffle the index of nba.
random_indices = permutation(nba.index)

# Set a cutoff for how many items we want in the test set (in this case 1/3 of the items)
test_cutoff = math.floor(len(nba)/3)

# Generate the test set by taking the first 1/3 of the randomly shuffled indices.
# The slice starts at 0 so that no shuffled row is left out of both sets.
test = nba.loc[random_indices[:test_cutoff]]

# Generate the train set with the rest of the data.
train = nba.loc[random_indices[test_cutoff:]]
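The shuffle-and-cut idea can be sketched with the standard library alone. The function name, the test_denominator parameter, and the seed are assumptions introduced for illustration. The point is that the two slices must share the same cutoff so that every row lands in exactly one set.

```python
import random

def split_indices(n_rows, test_denominator=3, seed=None):
    """Shuffle row positions and return (train_indices, test_indices)."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    cutoff = n_rows // test_denominator  # floor(n_rows / 3) by default
    # Both slices use the same cutoff, so the sets are disjoint and cover all rows.
    return indices[cutoff:], indices[:cutoff]

train_idx, test_idx = split_indices(9, seed=0)
print(len(train_idx), len(test_idx))  # 6 3
```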

Using Sklearn

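sklearn provides a dedicated nearest-neighbor estimator, KNeighborsRegressor. We want to predict each player's total points, pts, which is a continuous value, so this is a regression problem. Note that the training and test sets are raw data that we have not normalized. KNeighborsRegressor computes the distances automatically (Euclidean by default), but it does not normalize the features, so scaling has to be applied separately if the columns should carry equal weight. Parameters such as n_neighbors, weights, and metric can be adjusted in KNeighborsRegressor.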

# The columns that we will be making predictions with.
x_columns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p', 'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf']

# The column that we want to predict: total points the player scored
y_column = ["pts"]

from sklearn.neighbors import KNeighborsRegressor

# Create the knn model.
knn = KNeighborsRegressor(n_neighbors=5)

# Fit the model on the training data.
knn.fit(train[x_columns], train[y_column])

# Make predictions on the test set using the fit model.
predictions = knn.predict(test[x_columns])
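To show what a k-nearest-neighbors regressor does conceptually, here is a from-scratch sketch. It is not sklearn's implementation. It assumes uniform weights, where the prediction is the plain mean of the k nearest targets, and the function name and toy data are hypothetical.

```python
import math

def knn_predict(train_x, train_y, query, k=5):
    """Predict a value as the mean target of the k training rows nearest to query."""
    if len(train_x) != len(train_y):
        raise ValueError("train_x and train_y must have the same length")
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of training rows")
    # Rank training rows by Euclidean distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    # Uniform weights: average the targets of the k nearest rows.
    return sum(train_y[i] for i in nearest) / k

# Hypothetical one-feature training data (not real NBA statistics).
train_x = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [0.0, 10.0, 20.0, 100.0]
print(knn_predict(train_x, train_y, (1.2,), k=2))  # 15.0
```

Because the distances are computed on whatever scale the features arrive in, a feature measured in thousands will outweigh one measured in fractions unless the columns are standardized first.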

Computing Error

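We measure the model's error with the mean squared error (MSE). For classification problems one would typically compute the AUC, but that does not apply to regression. The roc_auc_score documentation states: "Note: this implementation is restricted to the binary classification task or multilabel classification task in label indicator format." For a regression problem such as this one, MSE is used instead.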

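# The true point totals for the test set.
actual = test[y_column]

# Mean squared error: the average of the squared prediction errors.
mse = (((predictions - actual) ** 2).sum()) / len(predictions)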
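The MSE calculation can be sketched on plain lists to make each step visible. The function name and the numbers are illustrative assumptions, not results from the model above.

```python
def mean_squared_error(actual, predicted):
    """Return the average of the squared differences between predictions and truth."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical point totals (not real NBA statistics).
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]

# Squared errors are 4, 4 and 9, so the MSE is 17 / 3.
print(mean_squared_error(actual, predicted))
```

Because the errors are squared, MSE is expressed in squared points. Taking its square root gives an error in the same unit as pts.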
