
[大数据部落] Predicting Telecom Customer Churn in R with a k-Nearest-Neighbors (KNN) Model

Date: 2020-01-13 04:10:18


Original link: /?p=5521

Source: 拓端数据部落 WeChat official account

Data background

A telephone company wants to determine which customer characteristics are useful for predicting churn, that is, which customers will leave its service.

The data set is Churn. The fields are as follows:

Data Preparation and Exploration

View a summary of the data:

##      state      account.length    area.code      phone.number
##  WV     : 158   Min.   :  1.0   Min.   :408.0   327-1058:   1
##  MN     : 125   1st Qu.: 73.0   1st Qu.:408.0   327-1319:   1
##  AL     : 124   Median :100.0   Median :415.0   327-2040:   1
##  ID     : 119   Mean   :100.3   Mean   :436.9   327-2475:   1
##  VA     : 118   3rd Qu.:127.0   3rd Qu.:415.0   327-3053:   1
##  OH     : 116   Max.   :243.0   Max.   :510.0   327-3587:   1
##  (Other):4240                                   (Other) :4994
##
##  international.plan voice.mail.plan number.vmail.messages
##  no :4527           no :3677        Min.   : 0.000
##  yes: 473           yes:1323        1st Qu.: 0.000
##                                     Median : 0.000
##                                     Mean   : 7.755
##                                     3rd Qu.:17.000
##                                     Max.   :52.000
##
##  total.day.minutes total.day.calls total.day.charge total.eve.minutes
##  Min.   :  0.0     Min.   :  0     Min.   : 0.00    Min.   :  0.0
##  1st Qu.:143.7     1st Qu.: 87     1st Qu.:24.43    1st Qu.:166.4
##  Median :180.1     Median :100     Median :30.62    Median :201.0
##  Mean   :180.3     Mean   :100     Mean   :30.65    Mean   :200.6
##  3rd Qu.:216.2     3rd Qu.:113     3rd Qu.:36.75    3rd Qu.:234.1
##  Max.   :351.5     Max.   :165     Max.   :59.76    Max.   :363.7
##
##  total.eve.calls total.eve.charge total.night.minutes total.night.calls
##  Min.   :  0.0   Min.   : 0.00    Min.   :  0.0       Min.   :  0.00
##  1st Qu.: 87.0   1st Qu.:14.14    1st Qu.:166.9       1st Qu.: 87.00
##  Median :100.0   Median :17.09    Median :200.4       Median :100.00
##  Mean   :100.2   Mean   :17.05    Mean   :200.4       Mean   : 99.92
##  3rd Qu.:114.0   3rd Qu.:19.90    3rd Qu.:234.7       3rd Qu.:113.00
##  Max.   :170.0   Max.   :30.91    Max.   :395.0       Max.   :175.00
##
##  total.night.charge total.intl.minutes total.intl.calls total.intl.charge
##  Min.   : 0.000     Min.   : 0.00      Min.   : 0.000   Min.   :0.000
##  1st Qu.: 7.510     1st Qu.: 8.50      1st Qu.: 3.000   1st Qu.:2.300
##  Median : 9.020     Median :10.30      Median : 4.000   Median :2.780
##  Mean   : 9.018     Mean   :10.26      Mean   : 4.435   Mean   :2.771
##  3rd Qu.:10.560     3rd Qu.:12.00      3rd Qu.: 6.000   3rd Qu.:3.240
##  Max.   :17.770     Max.   :20.00      Max.   :20.000   Max.   :5.400
##
##  number.customer.service.calls     churn
##  Min.   :0.00                  False.:4293
##  1st Qu.:1.00                  True. : 707
##  Median :1.00
##  Mean   :1.57
##  3rd Qu.:2.00
##  Max.   :9.00

The summary reports no NA counts, so the data has no missing values. The phone number and the area code carry no predictive value and can be dropped.
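The missing-value check and the column removal described above can be sketched in base R. This is only an illustration: `churn_toy` is a hypothetical four-row stand-in for the Churn data with invented values, and `drop_cols` is a name introduced here, not one taken from the original analysis.

```r
# Hypothetical stand-in for the Churn data; the values are invented.
churn_toy <- data.frame(
  state = c("WV", "MN", "AL", "ID"),
  area.code = c(408, 415, 510, 415),
  phone.number = c("327-1058", "327-1319", "327-2040", "327-2475"),
  total.day.minutes = c(143.7, 180.1, 216.2, 351.5),
  churn = c("False.", "False.", "True.", "False."),
  stringsAsFactors = TRUE
)

# Count missing values per column; all zeros means nothing is missing.
na_counts <- colSums(is.na(churn_toy))
print(na_counts)

# Drop identifier-like columns that carry no predictive signal.
drop_cols <- c("phone.number", "area.code")
churn_clean <- churn_toy[, !(names(churn_toy) %in% drop_cols)]
print(names(churn_clean))
```

On the real data the same two steps would be applied to the full data frame instead of `churn_toy`.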

Examine the variables graphically

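The result above shows that customers who did not churn (4293) far outnumber those who did (707), so non-churners make up the large majority of the sample, about 86%.

The result above also shows that, except for emailcode and areacode, the numeric variables are approximately normally distributed.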
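As a small sketch of how the class balance can be quantified, the counts below are the `churn` counts reported in the data summary (False.: 4293, True.: 707). The variable names `churn_counts` and `churn_share` are introduced here for illustration.

```r
# Class counts as reported in the data summary.
churn_counts <- c("False." = 4293, "True." = 707)

# Convert the counts to proportions of the 5000 customers.
churn_share <- churn_counts / sum(churn_counts)
print(round(churn_share, 3))
```

With the full data frame loaded, `prop.table(table(churn))` would give the same proportions directly.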

##  account.length    area.code     number.vmail.messages total.day.minutes
##  Min.   :  1.0   Min.   :408.0   Min.   : 0.000        Min.   :  0.0
##  1st Qu.: 73.0   1st Qu.:408.0   1st Qu.: 0.000        1st Qu.:143.7
##  Median :100.0   Median :415.0   Median : 0.000        Median :180.1
##  Mean   :100.3   Mean   :436.9   Mean   : 7.755        Mean   :180.3
##  3rd Qu.:127.0   3rd Qu.:415.0   3rd Qu.:17.000        3rd Qu.:216.2
##  Max.   :243.0   Max.   :510.0   Max.   :52.000        Max.   :351.5
##
##  total.day.calls total.day.charge total.eve.minutes total.eve.calls
##  Min.   :  0     Min.   : 0.00    Min.   :  0.0     Min.   :  0.0
##  1st Qu.: 87     1st Qu.:24.43    1st Qu.:166.4     1st Qu.: 87.0
##  Median :100     Median :30.62    Median :201.0     Median :100.0
##  Mean   :100     Mean   :30.65    Mean   :200.6     Mean   :100.2
##  3rd Qu.:113     3rd Qu.:36.75    3rd Qu.:234.1     3rd Qu.:114.0
##  Max.   :165     Max.   :59.76    Max.   :363.7     Max.   :170.0
##
##  total.eve.charge total.night.minutes total.night.calls total.night.charge
##  Min.   : 0.00    Min.   :  0.0       Min.   :  0.00    Min.   : 0.000
##  1st Qu.:14.14    1st Qu.:166.9       1st Qu.: 87.00    1st Qu.: 7.510
##  Median :17.09    Median :200.4       Median :100.00    Median : 9.020
##  Mean   :17.05    Mean   :200.4       Mean   : 99.92    Mean   : 9.018
##  3rd Qu.:19.90    3rd Qu.:234.7       3rd Qu.:113.00    3rd Qu.:10.560
##  Max.   :30.91    Max.   :395.0       Max.   :175.00    Max.   :17.770
##
##  total.intl.minutes total.intl.calls total.intl.charge
##  Min.   : 0.00      Min.   : 0.000   Min.   :0.000
##  1st Qu.: 8.50      1st Qu.: 3.000   1st Qu.:2.300
##  Median :10.30      Median : 4.000   Median :2.780
##  Mean   :10.26      Mean   : 4.435   Mean   :2.771
##  3rd Qu.:12.00      3rd Qu.: 6.000   3rd Qu.:3.240
##  Max.   :20.00      Max.   :20.000   Max.   :5.400
##
##  number.customer.service.calls
##  Min.   :0.00
##  1st Qu.:1.00
##  Median :1.00
##  Mean   :1.57
##  3rd Qu.:2.00
##  Max.   :9.00

Relationships between variables

The result shows a significant positive linear relationship between the two variables.

Using the statistics node, report

##                               account.length    area.code
## account.length                  1.0000000000 -0.018054187
## area.code                      -0.0180541874  1.000000000
## number.vmail.messages          -0.0145746663 -0.003398983
## total.day.minutes              -0.0010174908 -0.019118245
## total.day.calls                 0.0282402279 -0.019313854
## total.day.charge               -0.0010191980 -0.019119256
## total.eve.minutes              -0.0095913331  0.007097877
## total.eve.calls                 0.0091425790 -0.012299947
## total.eve.charge               -0.0095873958  0.007114130
## total.night.minutes             0.0006679112  0.002083626
## total.night.calls              -0.0078254785  0.014656846
## total.night.charge              0.0006558937  0.002070264
## total.intl.minutes              0.0012908394 -0.004153729
## total.intl.calls                0.0142772733 -0.013623309
## total.intl.charge               0.0012918112 -0.004219099
## number.customer.service.calls  -0.0014447918  0.020920513
##
##                               number.vmail.messages total.day.minutes
## account.length                        -0.0145746663      -0.001017491
## area.code                             -0.0033989831      -0.019118245
## number.vmail.messages                  1.0000000000       0.005381376
## total.day.minutes                      0.0053813760       1.000000000
## total.day.calls                        0.0008831280       0.001935149
## total.day.charge                       0.0053767959       0.999999951
## total.eve.minutes                      0.0194901208      -0.010750427
## total.eve.calls                       -0.0039543728       0.008128130
## total.eve.charge                       0.0194959757      -0.010760022
## total.night.minutes                    0.0055413838       0.011798660
## total.night.calls                      0.0026762202       0.004236100
## total.night.charge                     0.0055349281       0.011782533
## total.intl.minutes                     0.0024627018      -0.019485746
## total.intl.calls                       0.0001243302      -0.001303123
## total.intl.charge                      0.0025051773      -0.019414797
## number.customer.service.calls         -0.0070856427       0.002732576
##
##                               total.day.calls total.day.charge
## account.length                   0.0282402279     -0.001019198
## area.code                       -0.0193138545     -0.019119256
## number.vmail.messages            0.0008831280      0.005376796
## total.day.minutes                0.0019351487      0.999999951
## total.day.calls                  1.0000000000      0.001935884
## total.day.charge                 0.0019358844      1.000000000
## total.eve.minutes               -0.0006994115     -0.010747297
## total.eve.calls                  0.0037541787      0.008129319
## total.eve.charge                -0.0006952217     -0.010756893
## total.night.minutes              0.0028044650      0.011801434
## total.night.calls               -0.0083083467      0.004234934
## total.night.charge               0.0028018169      0.011785301
## total.intl.minutes               0.0130972198     -0.019489700
## total.intl.calls                 0.0108928533     -0.001306635
## total.intl.charge                0.0131613976     -0.019418755
## number.customer.service.calls   -0.0107394951      0.002726370
##
##                               total.eve.minutes total.eve.calls
## account.length                    -0.0095913331     0.009142579
## area.code                          0.0070978766    -0.012299947
## number.vmail.messages              0.0194901208    -0.003954373
## total.day.minutes                 -0.0107504274     0.008128130
## total.day.calls                   -0.0006994115     0.003754179
## total.day.charge                  -0.0107472968     0.008129319
## total.eve.minutes                  1.0000000000     0.002763019
## total.eve.calls                    0.0027630194     1.000000000
## total.eve.charge                   0.9999997749     0.002778097
## total.night.minutes               -0.0166391160     0.001781411
## total.night.calls                  0.013463        -0.013682341
## total.night.charge                -0.0166420421     0.001799380
## total.intl.minutes                 0.0001365487    -0.007458458
## total.intl.calls                   0.0083881559     0.005574500
## total.intl.charge                  0.0001593155    -0.007507151
## number.customer.service.calls     -0.0138234228     0.006234831
##
##                               total.eve.charge total.night.minutes
## account.length                   -0.0095873958        0.0006679112
## area.code                         0.0071141298        0.0020836263
## number.vmail.messages             0.0194959757        0.0055413838
## total.day.minutes                -0.0107600217        0.0117986600
## total.day.calls                  -0.0006952217        0.0028044650
## total.day.charge                 -0.0107568931        0.0118014339
## total.eve.minutes                 0.9999997749       -0.0166391160
## total.eve.calls                   0.0027780971        0.0017814106
## total.eve.charge                  1.0000000000       -0.0166489191
## total.night.minutes              -0.0166489191        1.0000000000
## total.night.calls                 0.013424            0.0269718182
## total.night.charge               -0.0166518367        0.9999992072
## total.intl.minutes                0.000138           -0.0067209669
## total.intl.calls                  0.0083930603       -0.0172140162
## total.intl.charge                 0.0001547783       -0.0066545873
## number.customer.service.calls    -0.0138363623       -0.0085325365

Keeping highly correlated variables in the model can cause multicollinearity, so one variable from each highly correlated pair should be removed. In the matrix above, each minutes variable is almost perfectly correlated with its charge counterpart, for example total.day.minutes and total.day.charge at 0.999999951.
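One way to act on this is to scan the correlation matrix for pairs above a cutoff and drop the second variable of each pair. The sketch below runs on simulated data, not the Churn data: the 0.17 rate per minute, the 0.9 cutoff, and the names `toy`, `high_pairs` and `drop_vars` are all assumptions made for illustration.

```r
set.seed(1)
n <- 200

# Simulated stand-in: charge is an exact multiple of minutes (rate assumed).
day_minutes <- runif(n, 0, 350)
toy <- data.frame(
  total.day.minutes = day_minutes,
  total.day.charge = 0.17 * day_minutes,
  total.day.calls = rpois(n, 100)
)

cor_mat <- cor(toy)

# Look only at the upper triangle so each pair is reported once.
high <- which(abs(cor_mat) > 0.9 & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  stringsAsFactors = FALSE
)
print(high_pairs)

# Drop the second variable of every highly correlated pair.
drop_vars <- unique(high_pairs$var2)
reduced <- toy[, !(names(toy) %in% drop_vars), drop = FALSE]
print(names(reduced))
```

Which member of a pair to keep is a judgment call; here the column that appears later in the data frame is dropped.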

Data Manipulation

The result shows some correlation between total.day.calls and total.day.charge.

In particular, the relationship is negative among customers whose voice.mail.plan is no.

Discretize (make categorical) a relevant numeric variable

Discretize the variable.
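A minimal sketch of discretization with base R's `cut()` follows. The original does not state its cut points; the breaks below (143.7 and 216.2, the quartiles of total.day.minutes from the summary) and the level name `long` are assumptions chosen for illustration, while `short` and `medium` match the level names that appear in the model output.

```r
# A few example values of total.day.minutes (invented for illustration).
total.day.minutes <- c(0, 95.5, 143.7, 180.1, 216.2, 300.4, 351.5)

# Assumed cut points: the first and third quartiles from the data summary.
breaks <- c(-Inf, 143.7, 216.2, Inf)

# cut() uses right-closed intervals by default, so 143.7 falls in "short".
total.day.minutes1 <- cut(total.day.minutes,
                          breaks = breaks,
                          labels = c("short", "medium", "long"))
print(table(total.day.minutes1))
```

The resulting factor can be added to the data frame and used as a categorical predictor.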

Construct a distribution of the variable with a churn overlay

Construct a histogram of the variable with a churn overlay

Find a pair of numeric variables which are interesting with respect to churn.

The result shows some correlation between total.day.calls and total.day.charge.

Model Building

In particular, the correlation appears among customers whose churn is no.

##                                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)                    0.3082150  0.0735760   4.189 2.85e-05 ***
## stateAL                        0.0151188  0.0462343   0.327 0.743680
## stateAR                        0.0894792  0.0490897   1.823 0.068399 .
## stateAZ                        0.0329566  0.0494195   0.667 0.504883
## stateCA                        0.1951511  0.0567439   3.439 0.000588 ***
## international.plan yes         0.3059341  0.0151677  20.170  < 2e-16 ***
## voice.mail.plan yes           -0.1375056  0.0337533  -4.074 4.70e-05 ***
## number.vmail.messages          0.0017068  0.0010988   1.553 0.120402
## total.day.minutes              0.3796323  0.2629027   1.444 0.148802
## total.day.calls                0.0002191  0.0002235   0.981 0.326781
## total.day.charge              -2.2207671  1.5464583  -1.436 0.151056
## total.eve.minutes              0.0288233  0.1307496   0.220 0.825533
## total.eve.calls               -0.0001585  0.0002238  -0.708 0.478915
## total.eve.charge              -0.3316041  1.5382391  -0.216 0.829329
## total.night.minutes            0.0083224  0.0695916   0.120 0.904814
## total.night.calls             -0.0001824  0.0002225  -0.820 0.412290
## total.night.charge            -0.1760782  1.5464674  -0.114 0.909355
## total.intl.minutes            -0.0104679  0.4192270  -0.025 0.980080
## total.intl.calls              -0.0063448  0.0018062  -3.513 0.000447 ***
## total.intl.charge              0.0676460  1.5528267   0.044 0.965254
## number.customer.service.calls  0.0566474  0.0033945  16.688  < 2e-16 ***
## total.day.minutes1medium       0.0502681  0.0160228   3.137 0.001715 **
## total.day.minutes1short        0.2404020  0.0322293   7.459 1.02e-13 ***

The results show that state (for example stateCA), total.intl.calls, number.customer.service.calls, total.day.minutes1medium and total.day.minutes1short have significant effects. The table also marks international.plan and voice.mail.plan as highly significant.
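The `t value` and `Pr(>|t|)` columns suggest the coefficients came from a linear model such as `lm()`, although the original does not show the call. Under that assumption, the sketch below fits `lm()` to simulated data and lists the terms with p-values below 0.05. The data, the invented relationship between service calls and churn, and the names `toy`, `churn01` and `significant` are all hypothetical.

```r
set.seed(7)
n <- 500

# Simulated stand-in for two of the predictors.
toy <- data.frame(
  number.customer.service.calls = rpois(n, 1.5),
  total.day.calls = rpois(n, 100)
)

# Invented relationship: churn probability rises with service calls only.
p <- pmin(0.05 + 0.08 * toy$number.customer.service.calls, 0.95)
toy$churn01 <- rbinom(n, 1, p)

# Linear model on the 0/1 outcome (a linear probability model).
fit <- lm(churn01 ~ number.customer.service.calls + total.day.calls, data = toy)
coefs <- summary(fit)$coefficients
print(round(coefs, 4))

# Terms whose p-value is below the conventional 0.05 threshold.
significant <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
print(significant)
```

For a binary outcome, `glm(..., family = binomial)` is the more common choice; `lm()` is used here only because it matches the shape of the reported table.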

Use K-Nearest-Neighbors (K-NN) algorithm to develop a model for predicting Churn

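##         Direction.
## knn.pred   1   2
##        1 760  97
##        2 100  43
## [1] 0.803

A confusion matrix is a visualization tool used mainly in supervised learning; in unsupervised learning it is usually called a matching matrix. Conventions for its orientation vary. In the tables shown here, each row is a predicted class (knn.pred) and each column is an actual class (Direction.).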
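The 0.803 figure can be reproduced from the four cells of the table above. The sketch below re-enters those counts by hand and assumes, as the row and column labels suggest, that rows are predictions and columns are actual classes. The names `conf`, `recall_class2` and `baseline` are introduced here for illustration.

```r
# Counts copied from the first confusion matrix; matrix() fills by column.
conf <- matrix(c(760, 100, 97, 43), nrow = 2,
               dimnames = list(knn.pred = c("1", "2"),
                               Direction. = c("1", "2")))

# Accuracy: correctly classified cases (the diagonal) over all cases.
accuracy <- sum(diag(conf)) / sum(conf)

# Recall for class 2: share of actual class-2 cases predicted as class 2.
recall_class2 <- conf["2", "2"] / sum(conf[, "2"])

# Majority-class baseline: accuracy of always predicting the larger class.
baseline <- max(colSums(conf)) / sum(conf)

print(c(accuracy = accuracy, recall_class2 = recall_class2, baseline = baseline))
```

Comparing accuracy with the majority-class baseline is useful on imbalanced data, where a high accuracy can come mostly from the dominant class.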

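##         Direction.
## knn.pred   1   2
##        1 827 104
##        2  33  36
## [1] 0.863

On the test set, accuracy reaches about 86%. If the columns hold the actual classes, 860 of the 1000 cases belong to class 1, so always predicting class 1 would already score 86.0%; the accuracy figure should be read against that baseline.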
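The original does not show the code behind these tables. A minimal sketch of a KNN workflow with `knn()` from the `class` package is given below, under stated assumptions: the data are simulated, the labelling rule is invented, and `k = 5`, the 200/100 split and all variable names are illustrative choices rather than the author's settings.

```r
library(class)  # provides knn(); a recommended package shipped with R

set.seed(42)
n <- 300

# Simulated stand-in for two numeric predictors.
service_calls <- rpois(n, 1.5)
day_minutes <- rnorm(n, mean = 180, sd = 50)

# Invented labelling rule, used only to give the example some structure.
churn <- factor(ifelse(service_calls >= 4 | day_minutes > 260, "True.", "False."))

x <- cbind(service_calls, day_minutes)
train_idx <- sample(n, 200)

# Scale with training-set statistics only, then apply them to the test set,
# because KNN distances are sensitive to the units of each variable.
x_train <- scale(x[train_idx, ])
x_test <- scale(x[-train_idx, ],
                center = attr(x_train, "scaled:center"),
                scale = attr(x_train, "scaled:scale"))

knn_pred <- knn(train = x_train, test = x_test, cl = churn[train_idx], k = 5)

# Rows are predictions, columns are actual classes.
conf <- table(knn.pred = knn_pred, actual = churn[-train_idx])
accuracy <- sum(diag(conf)) / sum(conf)
print(conf)
print(accuracy)
```

Different values of `k` can be compared by repeating the last three steps and keeping the value with the best held-out accuracy.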

Findings

There is some correlation between total.day.calls and total.day.charge, particularly among customers whose churn is no. The variables state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium and total.day.minutes1short have significant effects. Finally, the KNN model reaches about 80% accuracy on the training set and about 86% on the test set, suggesting that the model predicts reasonably well.
