
Data Mining (1): Exploratory Data Analysis

Date: 2021-02-20 11:53:29


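Contents

Exploratory data analysis
EDA goals
Project overview
Implementation
1. Import the packages
2. Load the data
3. A quick look at the data
3.1 Descriptive statistics
3.2 Data info
4. Extended data analysis
4.1 Correlation analysis
4.1 Skewness and kurtosis of selected features
4.2 Visualizing numeric features
4.3 Visualizing pairwise relationships between numeric features
4.4 Visualizing regression relationships between variables
5. Categorical feature analysis
5.1 Unique-value distribution
5.2 Box plots of categorical features
5.3 Violin plots of categorical features
5.4 Bar plots of categorical features
5.5 Per-category frequency plots of categorical features
6.0 Generating a data report with pandas_profiling
7. Lessons learned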

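Exploratory data analysis

Exploratory data analysis (EDA) is a way of analyzing existing data under as few prior assumptions as possible, using plots, tables, equation fitting, and summary statistics to explore the structure and regularities of the data. Whenever a question comes up while planning or carrying out a project, an exploratory analysis can help clarify our thinking.

EDA goals

Explore the existing data (especially raw data from surveys or observation) under as few prior assumptions as possible, using plots, tables, equation fitting, and summary statistics to uncover its structure and regularities. Understand the relationships among the variables, and between the variables and the prediction target. Guide the data processing and feature engineering steps, so that the structure of the dataset and its feature set make the downstream prediction problem more reliable.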

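Project overview

Predicting the transaction price of used cars

Load the data science and visualization libraries
Load the data
Process the data
Understand the distribution of the prediction target
Analyze the features
Generate a data report

Implementation

The code below implements each part of this plan in turn.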

1. Import the packages

# Basic data science and visualization tools
import time
import warnings

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno  # needed for the missing-value plots in section 3.2
from scipy.special import jn
from IPython.display import display, clear_output

warnings.filterwarnings('ignore')
%matplotlib inline

# Model prediction
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Dimensionality reduction
from sklearn.decomposition import PCA, FastICA, FactorAnalysis, SparsePCA

import lightgbm as lgb
import xgboost as xgb

# Parameter search and evaluation
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

Installing the packages with pip is usually enough. When several Python versions are installed, use pip3 to target Python 3.

pip install ***
pip3 install ***

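2. Load the data

Data download link:

Link: /s/1qHPUtXBfWMa_qLoExJ94GA

Extraction code: idey

## Read the data with pandas (a friendly library for loading data)
# Note: both paths below point to the same file; f2 should point to the test set file.
f1 = open('C:/Users/zxy/Desktop/数据挖掘/data/used_car_train_-1.csv')
f2 = open('C:/Users/zxy/Desktop/数据挖掘/data/used_car_train_-1.csv')
Train_data = pd.read_csv(f1, encoding='gbk')
TestA_data = pd.read_csv(f2, encoding='gbk')

## Print the size of each dataset
print('Train data shape:', Train_data.shape)
print('TestA data shape:', TestA_data.shape)

Note: if the path contains Chinese characters, passing the path directly to pd.read_csv() may raise an error. Opening the file with open() first avoids the problem, as does using an ASCII-only path.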
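The workaround in the note can be sketched as follows. This is a minimal illustration, not the author's exact setup: the temporary directory, the file name demo.csv, and the two-row CSV are all made up for the example, and UTF-8 is assumed instead of gbk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical directory and file names, used only for this illustration.
tmp_root = tempfile.mkdtemp()
data_dir = os.path.join(tmp_root, "数据挖掘")
os.makedirs(data_dir, exist_ok=True)
csv_path = os.path.join(data_dir, "demo.csv")

# Write a tiny CSV so the example is self-contained.
with open(csv_path, "w", encoding="utf-8", newline="") as f:
    f.write("SaleID,price\n0,1850\n1,3600\n")

# Open the file ourselves, then hand the file object to pandas.
# The context manager closes the handle even if parsing fails.
with open(csv_path, encoding="utf-8") as f:
    demo = pd.read_csv(f)

print(demo.shape)
```

Using a with block also closes the file handle, which the bare open() calls in the walkthrough leave open.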

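3. A quick look at the data

## Use .head() to take a quick look at the loaded data
Train_data.head(10)

out:

3.1 Descriptive statistics

describe() reports summary statistics for each column: the count, the mean, the standard deviation (std), the minimum (min), the 25%, 50% (median), and 75% quantiles, and the maximum. It gives a quick sense of each column's range and helps spot abnormal values. For example, values such as 999, 9999, or -1 sometimes turn up; these are often just another way of encoding NaN and deserve attention.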
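If such placeholder codes do turn out to mean "missing", they can be converted to NaN before computing statistics. The sketch below uses a toy frame; the column values and the list of sentinel codes are assumptions for illustration, and whether a given code really means "missing" has to be checked per column.

```python
import numpy as np
import pandas as pd

# Toy frame; the values and the sentinel codes are assumptions.
df = pd.DataFrame({
    "power": [60, 9999, 110, -1],
    "kilometer": [12, 15, 999, 8],
})

# Codes that are assumed to stand in for "missing" in this toy data.
SENTINELS = [999, 9999, -1]

# Replace every sentinel with NaN so describe() and isnull() see the gaps.
cleaned = df.replace(SENTINELS, np.nan)

print(cleaned.isnull().sum())
```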

Train_data.describe()

out:

TestA_data.describe()

out:

3.2 Data info

info() shows the type of each column, which helps reveal abnormal special symbols other than NaN.

## .info() gives a quick view of the column names and the missing (NaN) counts
Train_data.info()

out:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 31 columns):
SaleID               150000 non-null int64
name                 150000 non-null int64
regDate              150000 non-null int64
model                150000 non-null int64
brand                150000 non-null int64
bodyType             150000 non-null int64
fuelType             150000 non-null float64
gearbox              150000 non-null object
power                150000 non-null object
kilometer            150000 non-null object
notRepairedDamage    150000 non-null object
regionCode           150000 non-null int64
seller               150000 non-null int64
offerType            150000 non-null float64
creatDate            150000 non-null float64
price                150000 non-null float64
v_0                  150000 non-null float64
v_1                  150000 non-null float64
v_2                  150000 non-null float64
v_3                  150000 non-null float64
v_4                  150000 non-null float64
v_5                  150000 non-null float64
v_6                  150000 non-null float64
v_7                  150000 non-null float64
v_8                  150000 non-null float64
v_9                  150000 non-null float64
v_10                 150000 non-null float64
v_11                 150000 non-null float64
v_12                 148531 non-null float64
v_13                 146417 non-null float64
v_14                 135884 non-null float64
dtypes: float64(19), int64(8), object(4)
memory usage: 35.5+ MB

Train_data.isnull().sum()

out:

SaleID                   0
name                     0
regDate                  0
model                    0
brand                    0
bodyType                 0
fuelType                 0
gearbox                  0
power                    0
kilometer                0
notRepairedDamage        0
regionCode               0
seller                   0
offerType                0
creatDate                0
price                    0
v_0                      0
v_1                      0
v_2                      0
v_3                      0
v_4                      0
v_5                      0
v_6                      0
v_7                      0
v_8                      0
v_9                      0
v_10                     0
v_11                     0
v_12                  1469
v_13                  3583
v_14                 14116
dtype: int64

TestA_data.isnull().sum()

out:

SaleID                   0
name                     0
regDate                  0
model                    0
brand                    0
bodyType                 0
fuelType                 0
gearbox                  0
power                    0
kilometer                0
notRepairedDamage        0
regionCode               0
seller                   0
offerType                0
creatDate                0
price                    0
v_0                      0
v_1                      0
v_2                      0
v_3                      0
v_4                      0
v_5                      0
v_6                      0
v_7                      0
v_8                      0
v_9                      0
v_10                     0
v_11                     0
v_12                  1469
v_13                  3583
v_14                 14116
dtype: int64

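Visualizing NaN

missing = Train_data.isnull().sum()
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()

out:

These few lines make it easy to see which columns contain NaN and how many. The main question is whether the NaN count is really large. If it is small, filling is the usual choice. With tree models such as LightGBM the gaps can be left as they are and the trees will handle them. If there are too many NaN values, consider dropping the column.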
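The rule of thumb above can be turned into a small report. This is a sketch only: the helper name missing_report, the 0.5 drop threshold, and the toy data are all assumptions chosen for illustration, not values the walkthrough prescribes.

```python
import numpy as np
import pandas as pd


def missing_report(df, drop_threshold=0.5):
    """Summarize the missing ratio per column and attach a rough suggestion.

    drop_threshold is an illustrative cut-off, not a recommended value.
    """
    ratio = df.isnull().mean()
    ratio = ratio[ratio > 0].sort_values(ascending=False)
    report = pd.DataFrame({"missing_ratio": ratio})
    report["suggestion"] = np.where(
        report["missing_ratio"] > drop_threshold,
        "consider dropping",
        "fill, or leave to a tree model",
    )
    return report


# Toy data standing in for the real training set.
demo = pd.DataFrame({
    "v_12": [0.1, np.nan, 0.3, 0.4],
    "v_13": [np.nan, np.nan, np.nan, 1.0],
    "price": [1000, 2000, 3000, 4000],
})

report = missing_report(demo)
print(report)
```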

Inspecting missing values

msno.matrix(Train_data.sample(250))

out:

<matplotlib.axes._subplots.AxesSubplot at 0x247481e7bc8>

msno.bar(Train_data.sample(1000))

out:

<matplotlib.axes._subplots.AxesSubplot at 0x24747ffbd88>

msno.matrix(TestA_data.sample(250))

out:

<matplotlib.axes._subplots.AxesSubplot at 0x2474ac862c8>

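The missing values in the test set look much like those in the training set. The plots show four columns with missing values, and notRepairedDamage has the most.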

Train_data['notRepairedDamage'].value_counts()

out:

0 109301- 175581 126117056486 55... 6736 14370 16220 129 16192 1Name: notRepairedDamage, Length: 4272, dtype: int64

TestA_data['notRepairedDamage'].value_counts()

out:

0 109301- 175581 126117056486 55... 6736 14370 16220 129 16192 1Name: notRepairedDamage, Length: 4272, dtype: int64
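The value_counts() output above appears to include a "-" entry, which pandas does not count as NaN. If that symbol is a placeholder for "unknown" (an assumption that should be confirmed against the data description), it can be converted to a real missing value. The toy Series below is made up for illustration.

```python
import pandas as pd

# Toy column; the values are made up, and treating "-" as missing is an assumption.
s = pd.Series(["0.0", "-", "1.0", "0.0", "-"], name="notRepairedDamage")

# Blank out the placeholder, then convert the remaining strings to numbers.
cleaned = pd.to_numeric(s.mask(s == "-"))

print(cleaned.isnull().sum())
print(cleaned.value_counts())
```

After this step isnull().sum() and the missingno plots count the placeholder entries as missing.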

Understanding the data distribution

import scipy.stats as st
y = Train_data['price']
plt.figure(1); plt.title('Johnson SU')
sns.distplot(y, kde=False, fit=st.johnsonsu)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)

out:

<matplotlib.axes._subplots.AxesSubplot at 0x2474b502648>

Price is not normally distributed, so it must be transformed before running a regression. The log transform does well, but the best fit is the unbounded Johnson distribution (Johnson SU).
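To illustrate why a log transform helps with a right-skewed target, the sketch below uses synthetic log-normal "prices" (not the competition data). It uses np.log1p and np.expm1 rather than np.log; that swap is my choice so that a zero price would not produce -inf, and it is not something the walkthrough requires.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices; the distribution parameters are arbitrary.
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=8.0, sigma=1.0, size=5000), name="price")

# Transform before modelling, and invert after predicting.
log_price = np.log1p(price)
restored = np.expm1(log_price)

print("skew before:", price.skew())
print("skew after: ", log_price.skew())
```

A model trained on the transformed target predicts in log space, so its predictions need np.expm1 applied before they are compared with real prices.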

Checking skewness and kurtosis

sns.distplot(Train_data['price'])
print("Skewness: %f" % Train_data['price'].skew())
print("Kurtosis: %f" % Train_data['price'].kurt())

out:

Skewness: 3.259240
Kurtosis: 17.966895
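For reference, pandas' .skew() and .kurt() return bias-corrected sample estimates, and .kurt() reports excess kurtosis, so a normal distribution scores about 0 on both. The sketch below recomputes the two numbers from central moments on a small made-up sample; the helper names are hypothetical.

```python
import numpy as np
import pandas as pd


def sample_skew(x):
    """Bias-corrected sample skewness computed from central moments."""
    x = np.asarray(x, dtype=float)
    n = x.size
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    m3 = np.mean(d ** 3)
    g1 = m3 / m2 ** 1.5
    return np.sqrt(n * (n - 1)) / (n - 2) * g1


def sample_excess_kurtosis(x):
    """Bias-corrected sample excess kurtosis computed from central moments."""
    x = np.asarray(x, dtype=float)
    n = x.size
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    m4 = np.mean(d ** 4)
    g2 = m4 / m2 ** 2 - 3.0
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)


# Small made-up sample with a long right tail.
values = pd.Series([1.0, 2.0, 2.0, 3.0, 4.0, 7.0, 15.0])

print(sample_skew(values), values.skew())
print(sample_excess_kurtosis(values), values.kurt())
```

A positive skewness such as the 3.26 reported for price indicates a long right tail, and a large positive kurtosis indicates heavier tails than a normal distribution.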

Train_data.skew(), Train_data.kurt()

out:

(SaleID         0.000000
 name           0.557606
 regDate        0.028495
 model          1.484396
 brand          1.150662
 bodyType      72.267715
 fuelType      92.831088
 regionCode     9.955996
 seller         8.244463
 offerType      3.364031
 creatDate     -2.780338
 price          3.259240
 v_0           -2.765477
 v_1            0.877980
 v_2            0.919860
 v_3            0.057813
 v_4            0.320513
 v_5            5.328188
 v_6            7.384523
 v_7           10.738620
 v_8            7.647080
 v_9            5.478185
 v_10           0.460554
 v_11           0.323848
 v_12           0.087052
 v_13           0.20
 v_14          -1.203489
 dtype: float64,
 SaleID           -1.200000
 name             -1.039945
 regDate          -0.697308
 model             1.740520
 brand             1.075831
 bodyType      10312.737535
 fuelType      17797.327090
 regionCode       97.123257
 seller           65.972059
 offerType         9.316830
 creatDate         5.730356
 price            17.966895
 v_0               6.011055
 v_1               0.914529
 v_2               3.373838
 v_3              -0.432035
 v_4              -0.020651
 v_5              48.841989
 v_6              62.786639
 v_7             115.770854
 v_8              60.469519
 v_9              37.329159
 v_10              1.446217
 v_11              0.730101
 v_12             -0.453978
 v_13             -0.357008
 v_14              2.350805
 dtype: float64)

sns.distplot(Train_data.skew(),color='blue',axlabel ='Skewness')

out:

<matplotlib.axes._subplots.AxesSubplot at 0x7ff016585e80>

sns.distplot(Train_data.kurt(), color='orange', axlabel='Kurtosis')

out:

<matplotlib.axes._subplots.AxesSubplot at 0x7ff00c5ed978>

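Checking the frequency distribution of the target

plt.hist(Train_data['price'], orientation='vertical', histtype='bar', color='red')
plt.show()

out:

The histogram shows that very few prices exceed 20000. These could be treated as special values (outliers) and either filled or removed in an earlier preprocessing step.

# After the log transform the distribution is much more even, so the prediction can be made
# on the log-transformed target; this is a common trick for prediction problems
plt.hist(np.log(Train_data['price']), orientation='vertical', histtype='bar', color='red')
plt.show()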

out:

Features

The features are divided into categorical and numeric features, and the unique-value distribution of each categorical feature is inspected.

Data types

Separating the label, i.e. the prediction target

Y_train = Train_data['price']

numeric_features = ['power', 'kilometer', 'v_0', 'v_1', 'v_2', 'v_3', 'v_4', 'v_5', 'v_6', 'v_7',
                    'v_8', 'v_9', 'v_10', 'v_11', 'v_12', 'v_13', 'v_14']
categorical_features = ['name', 'model', 'brand', 'bodyType', 'fuelType', 'gearbox',
                        'notRepairedDamage', 'regionCode']

# nunique distribution of each categorical feature (training set)
for cat_fea in categorical_features:
    print(cat_fea + " feature distribution:")
    print("{} has {} distinct values".format(cat_fea, Train_data[cat_fea].nunique()))
    print(Train_data[cat_fea].value_counts())

out:

name feature distribution:
name has 99662 distinct values
...
Name: name, Length: 99662, dtype: int64
model feature distribution:
model has 248 distinct values
...
Name: model, Length: 248, dtype: int64
brand feature distribution:
brand has 40 distinct values
...
Name: brand, dtype: int64
bodyType feature distribution:
bodyType has 8 distinct values
0.0    41420
1.0    35272
2.0    30324
3.0    13491
4.0     9609
5.0     7607
6.0     6482
7.0     1289
Name: bodyType, dtype: int64
fuelType feature distribution:
fuelType has 7 distinct values
0.0    91656
1.0    46991
2.0     2212
3.0      262
4.0      118
5.0       45
6.0       36
Name: fuelType, dtype: int64
gearbox feature distribution:
gearbox has 2 distinct values
0.0    111623
1.0     32396
Name: gearbox, dtype: int64
notRepairedDamage feature distribution:
notRepairedDamage has 2 distinct values
0.0    111361
1.0     14315
Name: notRepairedDamage, dtype: int64
regionCode feature distribution:
regionCode has 7905 distinct values
...
Name: regionCode, Length: 7905, dtype: int64

# nunique distribution of each categorical feature (test set)
for cat_fea in categorical_features:
    print(cat_fea + " feature distribution:")
    print("{} has {} distinct values".format(cat_fea, TestA_data[cat_fea].nunique()))
    print(TestA_data[cat_fea].value_counts())

out:

name feature distribution:
name has 37453 distinct values
...
Name: name, Length: 37453, dtype: int64
model feature distribution:
model has 247 distinct values
...
Name: model, Length: 247, dtype: int64
brand feature distribution:
brand has 40 distinct values
...
Name: brand, dtype: int64
bodyType feature distribution:
bodyType has 8 distinct values
0.0    13985
1.0    11882
2.0     9900
3.0     4433
4.0     3303
5.0     2537
6.0     2116
7.0      431
Name: bodyType, dtype: int64
fuelType feature distribution:
fuelType has 7 distinct values
0.0    30656
1.0    15544
2.0      774
3.0       72
4.0       37
6.0       14
5.0       10
Name: fuelType, dtype: int64
gearbox feature distribution:
gearbox has 2 distinct values
0.0    37301
1.0    10789
Name: gearbox, dtype: int64
notRepairedDamage feature distribution:
notRepairedDamage has 2 distinct values
0.0    37249
1.0     4720
Name: notRepairedDamage, dtype: int64
regionCode feature distribution:
regionCode has 6971 distinct values
...
Name: regionCode, Length: 6971, dtype: int64

Numeric feature analysis

numeric_features.append('price')
numeric_features

out:

['power','kilometer','v_0','v_1','v_2','v_3','v_4','v_5','v_6','v_7','v_8','v_9','v_10','v_11','v_12','v_13','v_14','price']

4. Extended data analysis

4.1 Correlation analysis

Train_data.corr()

out:

price_numeric = Train_data[numeric_features]
correlation = price_numeric.corr()
print(correlation['price'].sort_values(ascending=False), '\n')

out:

price        1.000000
v_12         0.692823
v_8          0.685798
v_0          0.628397
power        0.219834
v_5          0.164317
v_2          0.085322
v_6          0.068970
v_1          0.060914
v_14         0.035911
v_13        -0.013993
v_7         -0.053024
v_4         -0.147085
v_9         -0.206205
v_10        -0.246175
v_11        -0.275320
kilometer   -0.440519
v_3         -0.730946
Name: price, dtype: float64
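Sorting by the raw coefficient puts strongly negative features such as v_3 (-0.73) at the bottom, even though they carry as much linear signal as the top positive ones. One option is to rank by absolute value instead. The sketch below does this on synthetic data; the column names v_pos, v_neg, and noise are made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data: one positively related, one negatively related, one unrelated feature.
rng = np.random.default_rng(42)
n = 2000
v_pos = rng.normal(size=n)
v_neg = rng.normal(size=n)
noise = rng.normal(size=n)
demo = pd.DataFrame({
    "v_pos": v_pos,
    "v_neg": v_neg,
    "noise": noise,
    "price": 3.0 * v_pos - 2.0 * v_neg + rng.normal(scale=0.5, size=n),
})

# Pearson correlation of every feature with the target.
corr_with_price = demo.corr()["price"].drop("price")

# Order by strength of the linear relationship, keeping the sign for display.
order = corr_with_price.abs().sort_values(ascending=False).index
ranked = corr_with_price.reindex(order)

print(ranked)
```

Pearson correlation only measures linear association, so a feature near zero in this ranking can still matter to a non-linear model.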

f, ax = plt.subplots(figsize=(7, 7))
plt.title('Correlation of Numeric Features with Price', y=1, size=16)
sns.heatmap(correlation, square=True, vmax=0.8)

out:

<matplotlib.axes._subplots.AxesSubplot at 0x7ff01668d940>

4.1 Skewness and kurtosis of selected features

for col in numeric_features:
    print('{:15}'.format(col),
          'Skewness: {:05.2f}'.format(Train_data[col].skew()),
          '   ',
          'Kurtosis: {:06.2f}'.format(Train_data[col].kurt()))

out:

power           Skewness: 65.86     Kurtosis: 5733.45
kilometer       Skewness: -1.53     Kurtosis: 001.14
v_0             Skewness: -1.32     Kurtosis: 003.99
v_1             Skewness: 00.36     Kurtosis: -01.75
v_2             Skewness: 04.84     Kurtosis: 023.86
v_3             Skewness: 00.11     Kurtosis: -00.42
v_4             Skewness: 00.37     Kurtosis: -00.20
v_5             Skewness: -4.74     Kurtosis: 022.93
v_6             Skewness: 00.37     Kurtosis: -01.74
v_7             Skewness: 05.13     Kurtosis: 025.85
v_8             Skewness: 00.20     Kurtosis: -00.64
v_9             Skewness: 00.42     Kurtosis: -00.32
v_10            Skewness: 00.03     Kurtosis: -00.58
v_11            Skewness: 03.03     Kurtosis: 012.57
v_12            Skewness: 00.37     Kurtosis: 000.27
v_13            Skewness: 00.27     Kurtosis: -00.44
v_14            Skewness: -1.19     Kurtosis: 002.39
price           Skewness: 03.35     Kurtosis: 019.00

4.2 Visualizing numeric features

f = pd.melt(Train_data, value_vars=numeric_features)
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False)
g = g.map(sns.distplot, "value")

out:

The anonymous features appear to be relatively evenly distributed.

4.3 Visualizing pairwise relationships between numeric features

sns.set()
columns = ['price', 'v_12', 'v_8', 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14']
# height replaces the size argument used by older seaborn versions
sns.pairplot(Train_data[columns], height=2, kind='scatter', diag_kind='kde')
plt.show()

out:

This part visualizes the relationships among multiple variables. For more on visualization, a very good article is available at /p/6e18d21a4cad

4.4 Visualizing regression relationships between variables

# One regression plot of price against each selected feature, laid out on a 5 x 2 grid
features = ['v_12', 'v_8', 'v_0', 'power', 'v_5', 'v_2', 'v_6', 'v_1', 'v_14', 'v_13']
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(24, 20))
for feature, ax in zip(features, axes.ravel()):
    scatter_data = pd.concat([Y_train, Train_data[feature]], axis=1)
    sns.regplot(x=feature, y='price', data=scatter_data, scatter=True, fit_reg=True, ax=ax)

out:

<matplotlib.axes._subplots.AxesSubplot at 0x7ff00c9242b0>

5. Categorical feature analysis

5.1 Unique-value distribution

for fea in categorical_features:
    print(Train_data[fea].nunique())

out:

99662
248
40
8
7
2
2
7905

categorical_features

out:

['name','model','brand','bodyType','fuelType','gearbox','notRepairedDamage','regionCode']
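Features with thousands of distinct values, such as name and regionCode, do not plot well per category. A small helper can pick out the low-cardinality features automatically. This is a sketch on toy data; the helper name low_cardinality and the max_unique cut-off of 5 are assumptions for illustration.

```python
import pandas as pd

# Toy data: "name" is unique per row, the other two have few categories.
demo = pd.DataFrame({
    "name": list(range(8)),
    "brand": [0, 1, 0, 1, 2, 2, 0, 1],
    "gearbox": [0, 1, 0, 0, 1, 0, 1, 0],
})


def low_cardinality(df, columns, max_unique=5):
    """Return the columns whose number of distinct values is at most max_unique."""
    return [c for c in columns if df[c].nunique() <= max_unique]


plot_features = low_cardinality(demo, ["name", "brand", "gearbox"])
print(plot_features)
```

On the real data the cut-off would need to be much larger than 5, since model has 248 categories and is still plotted in the walkthrough.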

5.2 Box plots of categorical features

# name and regionCode are too sparse, so only the less sparse features are plotted here
categorical_features = ['model', 'brand', 'bodyType', 'fuelType', 'gearbox', 'notRepairedDamage']
for c in categorical_features:
    Train_data[c] = Train_data[c].astype('category')
    if Train_data[c].isnull().any():
        # Give missing values their own category so they show up in the plots
        Train_data[c] = Train_data[c].cat.add_categories(['MISSING'])
        Train_data[c] = Train_data[c].fillna('MISSING')

def boxplot(x, y, **kwargs):
    sns.boxplot(x=x, y=y)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
# height replaces the size argument used by older seaborn versions
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(boxplot, "value", "price")

out:

5.3 Violin plots of categorical features

catg_list = categorical_features
target = 'price'
for catg in catg_list:
    sns.violinplot(x=catg, y=target, data=Train_data)
    plt.show()

out:

categorical_features = ['model','brand','bodyType','fuelType','gearbox','notRepairedDamage']

5.4 Bar plots of categorical features

def bar_plot(x, y, **kwargs):
    sns.barplot(x=x, y=y)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, id_vars=['price'], value_vars=categorical_features)
# height replaces the size argument used by older seaborn versions
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(bar_plot, "value", "price")

5.5 Per-category frequency plots of categorical features

def count_plot(x, **kwargs):
    sns.countplot(x=x)
    x = plt.xticks(rotation=90)

f = pd.melt(Train_data, value_vars=categorical_features)
# height replaces the size argument used by older seaborn versions
g = sns.FacetGrid(f, col="variable", col_wrap=2, sharex=False, sharey=False, height=5)
g = g.map(count_plot, "value")

6.0 Generating a data report with pandas_profiling

pandas_profiling produces a fairly comprehensive visual data report with little effort. Open the resulting HTML file to view it.

import pandas_profiling

pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file("./example.html")

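7. Lessons learned

The EDA steps given here are widely used. In real work, whether engineering or competitions, they are only the first and most basic step.

Next, the actual modeling behavior of the data is usually analyzed together with model performance and feature engineering, drawing on one's own understanding and on the literature to judge and understand the real problem in depth.

Finally, EDA is repeated alongside data processing and mining to arrive at a better data structure and distribution, and at features that are more strongly correlated with the target.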

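In machine learning, data exploration is generally called EDA (Exploratory Data Analysis):

It is a data analysis approach that explores existing data (especially raw data from surveys or observation) under as few prior assumptions as possible, using plots, tables, equation fitting, and summary statistics to uncover the structure and regularities of the data.

Data exploration helps reveal properties of the data and relationships within it, which is very useful for building features later.

A preliminary analysis (looking at the data directly, or using statistical functions such as .sum(), .mean(), and .describe()) can cover: the number of samples, the size of the training set, whether there are time features, whether it is a time-series problem, what the (non-anonymous) features mean, the feature types (string, int, float, time), how the features are missing (note how missing values appear in the data: some are empty, others are symbols such as "NAN"), and the mean and variance of each feature.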
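Several items on this checklist can be collected into one overview table. The sketch below is one possible layout, not a standard pandas function: the helper name overview, the chosen columns, and the toy data are assumptions.

```python
import numpy as np
import pandas as pd


def overview(df):
    """One row per column: dtype, distinct values, and missing counts."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
        "missing": df.isnull().sum(),
        "missing_ratio": df.isnull().mean(),
    })


# Toy data with a date-like integer, a string column, and a float with a gap.
demo = pd.DataFrame({
    "regDate": [20040402, 20030301, 20040403],
    "notRepairedDamage": ["0.0", "-", "0.0"],
    "v_14": [0.91, np.nan, 0.25],
})

summary = overview(demo)
print(summary)
```

A string dtype on a column that should be numeric, as with notRepairedDamage here, is a hint that a non-NaN placeholder symbol is hiding in the data.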

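Analyze and record how missing values are handled for features that are missing in more than 30% of samples; this helps with later model validation and tuning. Decide whether such a feature should be filled (and how: with the mean, with 0, with the mode, and so on), dropped, or handled by first splitting the samples into groups and predicting each group with a different feature model.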
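The three filling strategies named above look like this on a toy column. The values are made up, and which strategy suits a given feature depends on the data and the model.

```python
import numpy as np
import pandas as pd

# Toy feature with one missing value.
s = pd.Series([1.0, 2.0, 2.0, np.nan, 7.0], name="v_14")

filled_mean = s.fillna(s.mean())          # mean of the observed values
filled_zero = s.fillna(0)                 # constant fill
filled_mode = s.fillna(s.mode().iloc[0])  # most frequent observed value

print(filled_mean[3], filled_zero[3], filled_mode[3])
```

To avoid leaking information, the fill values should be computed on the training set only and then reused for the test set.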

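Analyze outliers separately: check whether samples with abnormal feature values also have abnormal labels (values far from the mean, or special symbols), whether the outliers should be removed or replaced with normal values, and whether they come from recording errors or from faults in the equipment itself.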
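One common way to flag candidate outliers is the interquartile-range (IQR) rule, which is also how box plot whiskers are drawn by default. The text does not prescribe this rule; it is used here as an illustration, and the helper name, the factor k=1.5, and the toy prices are assumptions.

```python
import pandas as pd


def iqr_outlier_mask(s, k=1.5):
    """Mark values lying more than k * IQR outside the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Toy prices with one extreme value.
price = pd.Series([1200, 1500, 1800, 2100, 2500, 3000, 99999], name="price")

mask = iqr_outlier_mask(price)
print(price[mask])
```

The mask only flags candidates. Whether to drop, cap, or keep them is still the judgment call described above.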

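Analyze the label separately, including its distribution.

Further analysis can plot the features, and plot features jointly with the label (statistical charts, scatter plots), to get a direct view of the feature distributions. This step can also expose outliers in the data. Box plots show how far feature values deviate, and joint plots of feature against feature and of feature against label reveal the relationships among them.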
