Principal Component Analysis (PCA): MATLAB Program and Function Walkthrough

Date: 2022-04-05 00:10:11

References:

/Hand-Head/articles/5156435.html

/thread-11751-1-1.html

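MATLAB help documentation

Source code download link: /detail/ckzhb/9903051

The download contains three m-files:

drtool_pca: a function wrapper around the analysis. It does not include the box plot.

pca_test: MATLAB's built-in example, using the pca() function

princomp_test: MATLAB's built-in example, using the princomp() function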

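About the example:

It uses nine indicators that measure quality of life in 329 U.S. cities: climate, housing, health, crime, transportation, education, arts, recreation, and economics.

For every indicator, higher is better; for example, a high crime rating means a low crime rate.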

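Example program for this article:

%% test for princomp (Principal Component Analysis)
% 0706 BY Hubery_Zhang
clear; clc;

%% load data set
load cities;

%% box plot for ratings data
% To get a quick impression of the ratings data, make a box plot
figure;
boxplot(ratings,'orientation','horizontal','labels',categories);
grid on;

%% pre-process
stdr = std(ratings);
sr = ratings./repmat(stdr,329,1);

%% use princomp
[coeff,score,latent,tsquare] = princomp(sr);

%% How to select principal components for dimensionality reduction
% latent tells us how many leading principal components are enough.
% The line in the chart shows the cumulative variance explained.
% The chart shows that the first seven principal components explain 90% of the original data.
% So at the 90% level it is enough to keep the first seven components.
figure;
percent_explained = 100*latent/sum(latent); % cumsum(latent)./sum(latent)
pareto(percent_explained);
xlabel('Principal Component');
ylabel('Variance Explained (%)');

%% Visualizing the Results
% The horizontal and vertical axes are the first and second principal components.
% The red points are the 329 observations; their coordinates come from score.
% The direction and length of each blue vector show how much each original
% variable contributes to the principal components; its coordinates come from coeff.
figure;
biplot(coeff(:,1:2),'scores',score(:,1:2),...
    'varlabels',categories);
axis([-.26 1 -.51 .51]);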

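Program walkthrough:

1. The std() function

(1) std(A): computes the usual sample standard deviation, dividing by N-1. It works column by column and returns the standard deviation of each column.

(2) std(A,flag): flag selects the normalization. 0 divides by N-1; 1 divides by N.

(3) std(A,flag,dim): dim selects whether the standard deviation is taken along columns or along rows. std(A,1,1) returns one value per column; std(A,1,2) returns one value per row.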
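The three call forms can be checked on a small matrix. This is a minimal sketch: the matrix A below is made up for illustration, chosen so the column results are easy to verify by hand.

```matlab
% Hypothetical 3-by-2 matrix (not from the cities data set)
A = [1 2; 3 6; 5 10];

s_default = std(A);        % divide by N-1, one value per column
s_flag0   = std(A, 0);     % identical to the default
s_flag1   = std(A, 1);     % divide by N, one value per column
s_rows    = std(A, 0, 2);  % divide by N-1, one value per row (3-by-1)

disp(s_default);
disp(s_flag1);
disp(s_rows);
```

The default and flag = 0 results agree, while flag = 1 gives slightly smaller values because it divides by N instead of N-1.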

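2. B = repmat(A,m,n): replicate and tile a matrix

B is built by tiling m-by-n copies of A, that is, A is treated as a block and B consists of m rows and n columns of such blocks. The size of B is [size(A,1)*m, size(A,2)*n].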
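A short sketch of the tiling rule, using made-up values. The second call mirrors how the example program expands a row of standard deviations to the height of the data matrix; the row vector stdr here is hypothetical.

```matlab
A = [1 2; 3 4];
B = repmat(A, 2, 3);       % 2 rows and 3 columns of copies of A
disp(size(B));             % [size(A,1)*2, size(A,2)*3]

stdr = [2 4 8];            % hypothetical 1-by-3 row of standard deviations
S = repmat(stdr, 5, 1);    % stack the row 5 times: 5-by-3
disp(S);
```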

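3. "./"

With matrices, a dot operator works element by element: A./B divides each element of A by the corresponding element of B.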
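The difference between dot operators and matrix operators is easiest to see side by side. A minimal sketch with made-up matrices:

```matlab
X = [2 8; 4 16];
D = [2 4; 2 4];

E = X ./ D;   % element-wise division
P = X .* D;   % element-wise multiplication
M = X * D;    % ordinary matrix product, for contrast

disp(E);
disp(P);
disp(M);
```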

Summary:

%% pre-process

stdr = std(ratings);

sr = ratings./repmat(stdr,329,1);

This scales every feature to unit variance, so that all attributes are on the same order of magnitude.
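The effect of this step can be verified on a small matrix. The sketch below uses hypothetical data in place of ratings and size(X,1) in place of the hard-coded 329, so it works for any number of rows.

```matlab
% Hypothetical data: two features on very different scales
X = [1 100; 2 300; 3 200; 4 400];

stdr = std(X);                             % per-column standard deviation
sr = X ./ repmat(stdr, size(X, 1), 1);     % divide each column by its own std

disp(std(sr));                             % every column now has unit std
```

In MATLAB R2016b and later, implicit expansion allows the shorter form X ./ stdr; the repmat form is kept here to match the example program.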

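Note:

Whether or not the data were mean-centered beforehand (subtracting each feature's mean from every value), this step (computing each feature's standard deviation and dividing every value by it) gives every feature unit variance.

PCA should be run on standardized data whenever the features are measured on different scales, as they are here. princomp centers the columns automatically but does not rescale them, as the MATLAB documentation states:

princomp centers X by subtracting off column means, but does not rescale the columns of X. To perform principal components analysis with standardized variables, that is, based on correlations, use princomp(zscore(X)). To perform principal components analysis directly on a covariance or correlation matrix, use pcacov.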

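4. [COEFF,SCORE,latent,tsquare] = princomp(X)

princomp is scheduled for removal in a future release; MATLAB recommends the pca function instead.

X is n-by-p, where n is the number of observations and p is the number of features.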

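(1)

COEFF is a p-by-p matrix, each column containing coefficients for one principal component. The columns are in order of decreasing component variance.

Principal component coefficients: the weights applied to each original variable in the linear combination that forms a principal component. Each column of COEFF holds the coefficients of one principal component.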

(2)

SCORE, the principal component scores; that is, the representation of X in the principal component space. Rows of SCORE correspond to observations, columns to components. It is an n-by-p matrix.

In other words, these are the coordinates of the original data in the new principal component space.

(3)

latent, a vector containing the eigenvalues of the covariance matrix of X.

That is, latent = sort(eig(cov(sr)),'descend');

(4)

tsquare, which contains Hotelling's T^2 statistic for each data point.

This is a multivariate statistical distance: it measures how far each observation lies from the center of the data set.

The scores are the data formed by transforming the original data into the space of the principal components. The values of the vector latent are the variance of the columns of SCORE. Hotelling's T^2 is a measure of the multivariate distance of each observation from the center of the data set.

When n <= p, SCORE(:,n:p) and latent(n:p) are necessarily zero, and the columns of COEFF(:,n:p) define directions that are orthogonal to X.
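The relationships among the four outputs can be reproduced with core functions only. The following is a sketch of the underlying computation, not princomp's actual implementation: the data matrix X is hypothetical, and the signs of the columns of coeff (and therefore of score) may differ from what princomp returns, since eigenvectors are only defined up to sign.

```matlab
% Hypothetical 6-by-3 data matrix (rows = observations, columns = features)
X = [2 1 0; 4 3 1; 6 2 5; 8 7 3; 10 5 6; 12 9 4];
n = size(X, 1);

% Center the columns, as princomp does before the analysis
Xc = X - repmat(mean(X), n, 1);

% Eigen-decomposition of the covariance matrix
[V, D] = eig(cov(X));
[latent, order] = sort(diag(D), 'descend');  % eigenvalues, largest first
coeff = V(:, order);                         % matching eigenvectors as columns

% Scores: coordinates of the centered data in the principal component space
score = Xc * coeff;

% Hotelling's T^2: squared scores scaled by each component's variance
tsquare = sum((score.^2) ./ repmat(latent', n, 1), 2);

disp(latent');
disp(var(score));   % matches latent: the column variances of score
```

The last two lines illustrate the statement above that latent holds the variances of the columns of SCORE.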

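5. percent_explained = 100*latent/sum(latent);

The commented-out alternative, cumsum(latent)./sum(latent), gives the cumulative fraction of variance explained rather than the percentage explained by each component.

b = sum(a): if a is a vector, its elements are summed; if a is a matrix, the sum is taken over each column by default, and the dim argument can be set to sum over each row instead.

b = cumsum(a): if a is a vector, each element of b is the sum of all elements of a up to and including that position. For example, a = [1 2 3] gives b = [1 3 6].

If a is a matrix, each column of b is by default the running sum down the rows of the corresponding column of a; with dim = 2, each row of b is the running sum across the columns of the corresponding row.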
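Combining the two functions gives a direct way to pick the number of components for a target level of explained variance. This is a sketch with hypothetical eigenvalues, not the values from the cities data; the 90% threshold follows the example program.

```matlab
a = [1 2 3];
b = cumsum(a);                                   % running sum of a

latent = [5; 3; 1.5; 0.5];                       % hypothetical eigenvalues, sorted descending
percent_explained = 100 * latent / sum(latent);  % share of variance per component
cum_explained = cumsum(percent_explained);       % cumulative share

k = find(cum_explained >= 90, 1);                % smallest number of components reaching 90%
disp(cum_explained');
disp(k);
```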

6. pareto(percent_explained);

Pareto charts display the values in the vector Y as bars drawn in descending order. Values in Y must be nonnegative and not include NaNs. Only the first 95% of the cumulative distribution is displayed. pareto also plots a line displaying the cumulative sum of Y.

7. biplot(coefs,'Name',Value)

biplot(coefs) creates a biplot of the coefficients in the matrix coefs. The biplot is 2-D if coefs has two columns or 3-D if it has three columns. coefs usually contains principal component coefficients created with pca, pcacov, or factor loadings estimated with factoran. The axes in the biplot represent the principal components or latent factors (columns of coefs), and the observed variables (rows of coefs) are represented as vectors.

In the 2-D plot, for example, the horizontal and vertical axes are the first and second principal components. The red points are the 329 observations, with coordinates taken from score. The direction and length of each blue vector show how much each original variable contributes to the principal components, with coordinates taken from coefs.

Name-value pair arguments ('Name',Value):

'Scores':

scores usually contains principal component scores created with pca or factor scores estimated with factoran. Each observation (row of scores) is represented as a point in the biplot.

'VarLabels':

Labels each vector (variable) with the text in the character array or cell array varlabels.
