Implementing Principal Component Analysis in Python

Posted: 2021-09-05 19:15:11

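These notes are based on the book 《跟着迪哥学Python数据分析与机器学习实战》, with my own additions and reorganization. They are intended for personal review only.

Building on the theory, this post implements principal component analysis (PCA) in Python.

The iris dataset serves as the example. First, load the data:

import numpy as np
import pandas as pd

# Caution: iris.data has no header row, so read_csv treats the first sample as the
# header and only 149 rows remain; pass header=None to keep all 150 samples.
df = pd.read_csv(r'iris.data')
print(df.shape)
df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.head(6)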

df['class'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

1. Data exploration

# Split the frame into the feature matrix X (149 x 4) and the label vector y
X = df.iloc[:, 0:4].values
y = df.iloc[:, 4].values
X[y == 'Iris-setosa', 1]  # sepal width (column index 1) of the setosa samples

Take a rough look at how each feature is distributed within each class:

import matplotlib.pyplot as plt
%matplotlib inline

feature_dict = {0: 'sepal length [cm]',
                1: 'sepal width [cm]',
                2: 'petal length [cm]',
                3: 'petal width [cm]'}

plt.figure(figsize=(8, 6))
for cnt in range(4):
    # One subplot per feature, with one histogram per class overlaid
    plt.subplot(2, 2, cnt + 1)
    for lab in ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'):
        plt.hist(X[y == lab, cnt], label=lab, bins=10, alpha=0.3)
    plt.xlabel(feature_dict[cnt])
    plt.legend(loc='upper right', fancybox=True, fontsize=8)
plt.tight_layout()
plt.show()

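2. Implementing PCA by hand (without sklearn's PCA)

2.1 Data preprocessing

from sklearn.preprocessing import StandardScaler

# Standardize every feature to zero mean and unit variance
X_std = StandardScaler().fit_transform(X)
print(X_std.shape)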

(149, 4)

The standardized data has 149 samples and 4 features.
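To make the standardization step less of a black box, here is a minimal sketch of what it computes, written with NumPy only. The helper name `standardize` and the synthetic array `X_demo` are assumptions introduced for illustration (the random data merely stands in for the iris measurements). `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default for `std`.

```python
import numpy as np

def standardize(X):
    """Column-wise z-score using the population standard deviation (ddof=0)."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Hypothetical stand-in for the iris measurements: 149 samples, 4 features
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[5.8, 3.0, 3.8, 1.2],
                    scale=[0.8, 0.4, 1.7, 0.7],
                    size=(149, 4))

X_demo_std = standardize(X_demo)
print(X_demo_std.shape)
print(np.allclose(X_demo_std.mean(axis=0), 0))  # columns are centered
print(np.allclose(X_demo_std.std(axis=0), 1))   # columns have unit population std
```

Standardizing matters here because PCA is driven by variance: without it, the feature measured on the largest scale would dominate the leading component.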

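2.2 Computing the covariance matrix

mean_vec = np.mean(X_std, axis=0)
print('Mean vector of the 4 features:\n', mean_vec)  # 1-D array of shape (4,)

# Covariance matrix: center the data (still 149 x 4), then take the
# matrix product X_c^T X_c and divide by n - 1
cov_mat = (X_std - mean_vec).T.dot(X_std - mean_vec) / (X_std.shape[0] - 1)
print('Covariance matrix:\n', cov_mat)  # real symmetric matrix

# The same result comes directly from np.cov. By default np.cov treats each ROW
# as a variable, so the data must be transposed to obtain the 4 x 4 covariance
# of the features; without the transpose it would return the 149 x 149
# covariance between samples, which is not what PCA needs.
print('Covariance matrix:\n', np.cov(X_std.T))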
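The equivalence between the manual formula and `np.cov` can be checked numerically. The sketch below uses synthetic data (the array `X_demo` is an assumption, not the iris file) and passes `rowvar=False`, which is an alternative to transposing. It also shows a detail that is easy to miss: because the data was standardized with `ddof=0` but the covariance divides by `n - 1`, the diagonal entries come out as `n / (n - 1)` rather than exactly 1.

```python
import numpy as np

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(149, 4))  # hypothetical data, 149 samples x 4 features
X_demo_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)  # ddof=0

n = X_demo_std.shape[0]
centered = X_demo_std - X_demo_std.mean(axis=0)
cov_manual = centered.T @ centered / (n - 1)

# rowvar=False tells np.cov that columns (not rows) are the variables
cov_numpy = np.cov(X_demo_std, rowvar=False)

print(np.allclose(cov_manual, cov_numpy))            # both routes agree
print(np.allclose(np.diag(cov_manual), n / (n - 1)))  # diagonal is n/(n-1), not 1
```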

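2.3 Computing eigenvalues and eigenvectors

cov_mat = np.cov(X_std.T)
# Column i of eig_vecs is the eigenvector that belongs to eig_vals[i]
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)

# Pair each eigenvalue with its eigenvector
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]
print(eig_pairs)
print('-------------')
# Sort the pairs by eigenvalue, largest first
eig_pairs.sort(key=lambda x: x[0], reverse=True)
print('Eigenvalues in descending order:')
for i in eig_pairs:
    print(i[0])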
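Since a covariance matrix is symmetric, `np.linalg.eigh` is a possible alternative to `np.linalg.eig`: it always returns real eigenvalues (in ascending order) and orthonormal eigenvectors. The following is a sketch of that variant, not the tutorial's own method; the helper `sorted_eigh` and the synthetic matrix `cov_demo` are names introduced for illustration.

```python
import numpy as np

def sorted_eigh(cov):
    """Eigen-decompose a symmetric matrix and return the largest eigenvalue first."""
    vals, vecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
    order = np.argsort(vals)[::-1]     # indices for descending order
    return vals[order], vecs[:, order]

rng = np.random.default_rng(2)
cov_demo = np.cov(rng.normal(size=(149, 4)), rowvar=False)  # hypothetical 4 x 4 covariance

vals, vecs = sorted_eigh(cov_demo)
print(vals)                                        # descending eigenvalues
print(np.allclose(cov_demo @ vecs, vecs * vals))   # C v_i = lambda_i v_i for every column
print(np.allclose(vecs.T @ vecs, np.eye(4)))       # eigenvectors are orthonormal
```

Sorting the index array once and applying it to both outputs keeps each eigenvalue aligned with its eigenvector, which is the same job the list of pairs does in the code above.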

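2.4 Computing and plotting the cumulative explained variance

# Explained variance of each component: its eigenvalue as a percentage of the total
tot = sum(eig_vals)
var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]
print(var_exp)

# Cumulative explained variance
cum_var_exp = np.cumsum(var_exp)
cum_var_exp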

[72.6200333269203, 23.14740685864414, 3.7115155645845284, 0.5210442498510098]

array([ 72.62003333, 95.76744019, 99.47895575, 100. ])

plt.figure(figsize=(6, 4))
# Bars: variance explained by each component; step line: running total
plt.bar(range(4), var_exp, alpha=0.5, align='center',
        label='individual explained variance')
plt.step(range(4), cum_var_exp, where='mid',
         label='cumulative explained variance')
plt.ylabel('explained variance ratio (%)')
plt.xlabel('principal components')
plt.legend(loc='best')
plt.tight_layout()
plt.show()
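The cumulative curve is typically used to decide how many components to keep. A minimal sketch of that decision follows, assuming a coverage threshold chosen by the reader; the function `components_needed` and the eigenvalues in `demo_eig_vals` are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def components_needed(eig_vals, threshold=0.95):
    """Smallest number of leading components whose cumulative share reaches threshold."""
    vals = np.sort(np.asarray(eig_vals, dtype=float))[::-1]
    cum_ratio = np.cumsum(vals) / vals.sum()
    # searchsorted finds the first position where cum_ratio >= threshold
    k = int(np.searchsorted(cum_ratio, threshold)) + 1
    return min(k, len(vals))

demo_eig_vals = [2.9, 0.93, 0.15, 0.02]        # illustrative values, not the iris result
print(components_needed(demo_eig_vals, 0.95))  # 2
print(components_needed(demo_eig_vals, 0.99))  # 3
```

With the tutorial's own figures, the first two components already account for more than 95% of the variance, which is why two components are kept in the next section.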

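3. Dimensionality reduction

# Stack the two leading eigenvectors as columns to form the 4 x 2 projection matrix
matrix_w = np.hstack((eig_pairs[0][1].reshape(4, 1),
                      eig_pairs[1][1].reshape(4, 1)))
print('matrix W:\n', matrix_w)

# Project the standardized data onto the two components (result is 149 x 2)
Y = X_std.dot(matrix_w)
Y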
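One way to sanity-check a projection like this is to compute it by a second route. The sketch below compares the eigen-decomposition route with a singular value decomposition of the centered data on synthetic input (`X_demo` is an assumption, not the iris file). The two results should agree up to the sign of each column, because an eigenvector multiplied by -1 is still an eigenvector.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical correlated features: 149 samples x 4 features
X_demo = rng.normal(size=(149, 4)) @ rng.normal(size=(4, 4))
X_demo_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
n = X_demo_std.shape[0]

# Route 1: eigen-decomposition of the covariance matrix
vals, vecs = np.linalg.eigh(np.cov(X_demo_std, rowvar=False))
top2 = np.argsort(vals)[::-1][:2]   # indices of the two largest eigenvalues
W = vecs[:, top2]                   # 4 x 2 projection matrix
Y_eig = X_demo_std @ W              # 149 x 2 projected data

# Route 2: SVD of the centered data, X = U S V^T, so X V = U S
U, S, Vt = np.linalg.svd(X_demo_std, full_matrices=False)
Y_svd = U[:, :2] * S[:2]

print(np.allclose(np.abs(Y_eig), np.abs(Y_svd)))    # same projection up to column signs
print(np.allclose(S[:2] ** 2 / (n - 1), vals[top2]))  # singular values map to eigenvalues
```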

4. Comparing the data before and after the reduction

# Plot the original data using its first two features
plt.figure(figsize=(6, 4))
for lab, col in zip(('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'),
                    ('blue', 'red', 'green')):
    plt.scatter(X[y == lab, 0], X[y == lab, 1], label=lab, c=col)
plt.xlabel('sepal_len')
plt.ylabel('sepal_wid')
plt.legend(loc='best')
plt.tight_layout()
plt.show()

# Plot the data after projecting onto the two principal components
plt.figure(figsize=(6, 4))
for lab, col in zip(('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'),
                    ('blue', 'red', 'green')):
    plt.scatter(Y[y == lab, 0], Y[y == lab, 1], label=lab, c=col)
plt.xlabel('principal component 1')
plt.ylabel('principal component 2')
plt.legend(loc='lower center')
plt.show()

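After the reduction the classes separate more clearly in the plot, but the two axes no longer have an obvious meaning. In some cases, when a principal component is dominated by a few of the original features, those features can be used to give the component an interpretable meaning.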
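Looking for the dominant features of a component can be done by ranking the entries of its eigenvector by absolute value. The following is a rough sketch of that idea; the helper `rank_features` and the weights in `W_demo` are hypothetical values chosen for illustration, not output produced by the code in this post.

```python
import numpy as np

def rank_features(W, feature_names):
    """For each component (column of W), list features by decreasing absolute weight."""
    W = np.asarray(W, dtype=float)
    ranking = []
    for k in range(W.shape[1]):
        order = np.argsort(-np.abs(W[:, k]))
        ranking.append([(feature_names[i], float(W[i, k])) for i in order])
    return ranking

features = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid']
# Hypothetical 4 x 2 projection matrix (one column per component)
W_demo = np.array([[0.52, -0.37],
                   [-0.26, -0.93],
                   [0.58, -0.02],
                   [0.57, -0.07]])

ranked = rank_features(W_demo, features)
for k, items in enumerate(ranked, start=1):
    print(f'PC{k}:', items)
```

With weights like these, the first component would be read as mostly a petal-size axis and the second as mostly a sepal-width axis. Whether such a reading is justified depends on the actual weights obtained from the data.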
