
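Machine Learning Project in Practice: Heart Disease Classification with Random Forest (Including Several Model Interpretation Methods)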

Date: 2023-04-25 01:25:59

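This project is a classic Kaggle competition problem: heart disease classification. It is built mainly on random forest, a bagging ensemble learning method, and uses 13 physiological features to predict whether a patient has heart disease.

Because I mainly wanted to learn about model interpretation in this project, I did not pay much attention to model accuracy or hyperparameter tuning. Model interpretation mainly uses eli5, shap, and partial dependence plots.

Below are the complete code and its output. The code runs under Python 3.7.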

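Contents

1 Importing the modules
2 Loading the data
  2.1 Renaming the features
  2.2 Feature descriptions
  2.3 Feature types
3 Modeling
  3.1 Model selection
  3.2 Plotting the random forest
4 Model evaluation
  4.1 Confusion matrix
  4.2 Precision, recall, accuracy
  4.3 ROC and AUC
5 Model interpretation
  5.1 Feature importance ranking with eli5
  5.2 Partial dependence plots
  5.3 SHAP values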

1 Importing the modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # plotting
from sklearn.ensemble import RandomForestClassifier  # bagging-based random forest
from sklearn.tree import DecisionTreeClassifier  # decision tree model
from sklearn.tree import export_graphviz  # draw a decision tree
from sklearn.metrics import roc_curve, auc  # model evaluation: ROC curve and AUC
from sklearn.metrics import classification_report  # classification report
from sklearn.metrics import confusion_matrix  # confusion matrix
from sklearn.model_selection import train_test_split  # train/test split
import eli5  # for permutation importance
from eli5.sklearn import PermutationImportance
import shap  # for SHAP values
# Available in the scikit-learn versions that run on Python 3.7; removed in scikit-learn 1.2
from sklearn.inspection import plot_partial_dependence

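2 Loading the data

The dataset can be downloaded for free; the code below expects it as heart.csv in the working directory.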

dt = pd.read_csv('./heart.csv')

2.1 Renaming the features

dt.columns = ['age', 'sex', 'chest_pain_type', 'resting_blood_pressure', 'cholesterol', 'fasting_blood_sugar', 'rest_ecg', 'max_heart_rate_achieved','exercise_induced_angina', 'st_depression', 'st_slope', 'num_major_vessels', 'thalassemia', 'target']

2.2 Feature descriptions

2.3 Feature types

# Show the first ten rows
dt.head(10)

# Data type of each feature
dt.dtypes

age                          int64
sex                          int64
chest_pain_type              int64
resting_blood_pressure       int64
cholesterol                  int64
fasting_blood_sugar          int64
rest_ecg                     int64
max_heart_rate_achieved      int64
exercise_induced_angina      int64
st_depression              float64
st_slope                     int64
num_major_vessels            int64
thalassemia                  int64
target                       int64
dtype: object

3 Modeling

3.1 Model selection

The model is the random forest classifier from sklearn.

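# Split into training and test sets
# axis=1 is passed by keyword; the positional form drop('target', 1) is rejected by newer pandas
X_train, X_test, y_train, y_test = train_test_split(
    dt.drop('target', axis=1), dt['target'], test_size=0.2, random_state=10)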

# Train the model
model = RandomForestClassifier(max_depth=5)
# model = DecisionTreeClassifier(max_depth=5)
model = model.fit(X_train, y_train)
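To make the bagging idea behind the random forest concrete, here is a minimal standard-library sketch. It is not how sklearn implements RandomForestClassifier: the names fit_stump, fit_bagged_stumps and predict_majority are made up for illustration, the base learner is a one-feature threshold rule rather than a full decision tree, and the per-split random feature selection of a real random forest is left out.

```python
import random
from collections import Counter

def fit_stump(points):
    """Pick the threshold with the fewest training errors; predict 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(label) for x, label in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def fit_bagged_stumps(points, n_estimators=25, seed=0):
    """Fit each stump on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_estimators):
        sample = [rng.choice(points) for _ in points]
        stumps.append(fit_stump(sample))
    return stumps

def predict_majority(stumps, x):
    """Combine the individual stumps by majority vote."""
    votes = Counter(int(x > threshold) for threshold in stumps)
    return votes.most_common(1)[0][0]

# Toy one-feature data: small values are class 0, large values are class 1
data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0),
        (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
stumps = fit_bagged_stumps(data)
print(predict_majority(stumps, 0.5), predict_majority(stumps, 9.5))
```

Each stump sees a slightly different resampled dataset, so the individual thresholds vary, and the vote averages out that variation.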

feature_names = [i for i in X_train.columns]
print(feature_names)

# String version of the training labels
y_train_str = y_train.astype('str')
y_train_str[y_train_str == '0'] = 'no disease'
y_train_str[y_train_str == '1'] = 'disease'
y_train_str = y_train_str.values

['age', 'sex', 'chest_pain_type', 'resting_blood_pressure', 'cholesterol', 'fasting_blood_sugar', 'rest_ecg', 'max_heart_rate_achieved', 'exercise_induced_angina', 'st_depression', 'st_slope', 'num_major_vessels', 'thalassemia']

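3.2 Plotting the random forest

A random forest is an ensemble method made up of many decision trees, so it is shown here by drawing a single tree from the forest.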

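from subprocess import call
from IPython.display import Image

estimator = model.estimators_[1]  # the second tree in the forest
# class_names must hold one name per class, in ascending order of the class label (0, 1),
# not one label per training sample
export_graphviz(estimator, out_file='tree.dot',
                feature_names=feature_names,
                class_names=['no disease', 'disease'],
                rounded=True, proportion=True, label='root',
                precision=2, filled=True)

# Convert the .dot file to PNG (requires the Graphviz "dot" command to be installed)
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])
Image(filename='tree.png')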

# Predict on the test set
y_predict = model.predict(X_test)
y_pred_quant = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1 (disease)
print(y_pred_quant)
print(y_predict)

[0.16269787 0.44748633 0.40041271 0.76694297 0.24910602 0.74548823
 0.49200005 0.75311419 0.89932211 0.1549346  0.96559097 0.18668808
 0.59132509 0.80893958 0.2246534  0.74900495 0.15050619 0.00945674
 0.67502776 0.30255965 0.17225514 0.83713577 0.62405556 0.88069621
 0.36353612 0.26581345 0.02924095 0.07584574 0.83546613 0.02489596
 0.87005523 0.23022382 0.06858459 0.37906588 0.01819351 0.10303544
 0.76307864 0.44975844 0.74887615 0.21951679 0.06297115 0.29810484
 0.74734177 0.67051724 0.94183962 0.45715414 0.59805333 0.88940583
 0.76094355 0.54006929 0.48825986 0.81975331 0.04957972 0.25929068
 0.95305684 0.7351655  0.79543294 0.73187127 0.00626538 0.07273316
 0.71277042]
[0 0 0 1 0 1 0 1 1 0 1 0 1 1 0 1 0 0 1 0 0 1 1 1 0 0 0 0 1 0 1 0 0 0 0 0 1
 0 1 0 0 0 1 1 1 0 1 1 1 1 0 1 0 0 1 1 1 1 0 0 1]
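The two printed arrays are linked: for a binary classifier, predict returns the class with the larger averaged probability, which apart from exact ties is the same as thresholding the class 1 probability at 0.5. A small check on the first eight probabilities printed above (the variable names are only illustrative):

```python
# First eight class-1 probabilities copied from the output above
probabilities = [0.16269787, 0.44748633, 0.40041271, 0.76694297,
                 0.24910602, 0.74548823, 0.49200005, 0.75311419]

# Threshold at 0.5 to turn probabilities into class labels
predictions = [int(p > 0.5) for p in probabilities]
print(predictions)  # [0, 0, 0, 1, 0, 1, 0, 1], matching the first eight printed labels
```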

4 Model evaluation

4.1 Confusion matrix

# Stored as cm so that the imported confusion_matrix function is not overwritten
# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_predict)
cm

array([[28,  7],
       [ 4, 22]], dtype=int64)

4.2 Precision, recall, accuracy

total_num = cm.sum()
# cm[0, 0] counts class 0 ('no disease') predicted correctly,
# so the precision and recall below are those of class 0
precise = cm[0, 0] / (cm[0, 0] + cm[1, 0])
recall = cm[0, 0] / (cm[0, 0] + cm[0, 1])
acc = (cm[0, 0] + cm[1, 1]) / total_num
print('precise:', precise)
print('recall:', recall)
print('acc:', acc)

precise: 0.875
recall: 0.8
acc: 0.819672131147541
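The figures above treat class 0 (no disease) as the class of interest. As a sketch, the same confusion matrix can be used to compute precision and recall for either class; the helper name per_class_metrics is made up for this example, and the counts are copied from the output in 4.1.

```python
# Confusion matrix from section 4.1: rows = true class, columns = predicted class
counts = [[28, 7],
          [4, 22]]

def per_class_metrics(matrix, k):
    """Return (precision, recall) when class k is treated as the positive class."""
    true_positive = matrix[k][k]
    predicted_as_k = sum(row[k] for row in matrix)  # column sum
    actually_k = sum(matrix[k])                     # row sum
    return true_positive / predicted_as_k, true_positive / actually_k

precision_0, recall_0 = per_class_metrics(counts, 0)  # 'no disease'
precision_1, recall_1 = per_class_metrics(counts, 1)  # 'disease'
accuracy = (counts[0][0] + counts[1][1]) / sum(sum(row) for row in counts)

print(precision_0, recall_0)  # 0.875 0.8
print(round(precision_1, 4), round(recall_1, 4))  # 0.7586 0.8462
```

For a screening task, the recall of the disease class (here 22 of 26 patients) is usually the number to watch.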

4.3 ROC and AUC

# Plot the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_quant)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], ls='--', color='gray')  # chance-level diagonal
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for heart disease classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.grid()

# AUC
auc(fpr, tpr)

0.9087912087912088
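For intuition about what roc_curve and auc compute, here is a small standard-library sketch: sweep the decision threshold from high to low, record the (FPR, TPR) point at each distinct score, and integrate with the trapezoid rule. The function names and the four-sample toy data are illustrative only, and sklearn's implementation differs in details such as dropping collinear points.

```python
def roc_points(y_true, scores):
    """Return the (fpr, tpr) points obtained by lowering the threshold step by step."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Samples with the same score cross the threshold together
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_curve = roc_points(toy_labels, toy_scores)
print(toy_curve)                 # [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
print(auc_trapezoid(toy_curve))  # 0.75
```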

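5 Model interpretation

5.1 Feature importance ranking with eli5

Feature importance ranking looks at each feature on its own and measures how much it affects the predictions. I have not read the official documentation for this part yet; Siyuan explained the general idea to me yesterday. How is the weight computed? For a given feature, say feature 1, shuffle the values of that feature at random across the samples, then compare the new predictions with the baseline. The larger the change, the more important the feature; these are the green rows at the top of the table. Features near the bottom are the unimportant ones.

Because of this, the method can also be used for dimensionality reduction (dropping low-importance features), reducing overfitting, and so on.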

perm = PermutationImportance(model, random_state=1).fit(X_test, y_test)
eli5.show_weights(perm, feature_names=X_test.columns.tolist())
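The shuffling procedure described above can be sketched without eli5. This is a simplified illustration, not eli5's code: the function names, the toy model and the synthetic data are all assumptions made for the example, and accuracy is used as the score.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows where the model's prediction equals the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=5, seed=1):
    """Mean drop in accuracy after shuffling one feature column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the label
            X_perm = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy setup: the label depends only on feature 0, feature 1 is pure noise
data_rng = random.Random(0)
X_toy = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y_toy = [int(row[0] > 0.5) for row in X_toy]

def toy_model(row):
    return int(row[0] > 0.5)

importances = permutation_importance(toy_model, X_toy, y_toy)
print(importances)  # large drop for feature 0, exactly 0.0 for the ignored feature 1
```

Shuffling a feature the model never uses cannot change its predictions, which is why such features end up at the bottom of the ranking.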

5.2 Partial dependence plots

# One-way partial dependence plots
features = ['num_major_vessels', 'age', 'st_depression']
display = plot_partial_dependence(model, X_train, features,
                                  kind="both", subsample=30,
                                  n_jobs=5, grid_resolution=5, random_state=0)
display.figure_.subplots_adjust(wspace=0.4, hspace=0.3)

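# Two-way plot (this did not render in the original run)
# Individual (ICE) curves cannot be drawn for a pair of features, so kind="average"
# is used here instead of kind="both", which is the likely cause of the failure
features = ['st_slope', 'st_depression', ('st_slope', 'st_depression')]
display = plot_partial_dependence(model, X_train, features,
                                  kind="average", subsample=30,
                                  n_jobs=5, grid_resolution=5, random_state=0)
display.figure_.subplots_adjust(wspace=0.4, hspace=0.3)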
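What a partial dependence curve measures can be shown in a few lines: for each grid value, overwrite the chosen feature with that value in every row and average the model's predictions. The sketch below uses a made-up linear model and three toy rows; partial_dependence is a hypothetical helper, not the sklearn function.

```python
def partial_dependence(predict, X, feature, grid):
    """Average prediction over X with the given feature forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in X:
            modified = list(row)
            modified[feature] = value  # every row gets the same value for this feature
            total += predict(modified)
        curve.append(total / len(X))
    return curve

def toy_predict(row):
    # Illustrative model: the output rises by 2 per unit of feature 0
    return 2 * row[0] + row[1]

X_small = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]
pd_curve = partial_dependence(toy_predict, X_small, feature=0, grid=[0.0, 1.0, 2.0])
print(pd_curve)  # [3.0, 5.0, 7.0]
```

The slope of 2 in the result recovers the model's dependence on feature 0, with the other feature averaged out (its mean here is 3).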

5.3 SHAP values

SHAP values explain how much each feature contributes to an individual prediction.

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# shap_values[1] holds the contributions towards class 1 (disease)
shap.summary_plot(shap_values[1], X_test, plot_type='bar')

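In the plot produced by the next call, the x-axis is the SHAP value, that is, the feature's contribution to the predicted probability of disease. Each feature has one row of points; the redder a point, the higher the feature value, and the bluer, the lower. Take the number of major vessels as an example: the red points lie on the left and the blue points on the right, which means that the larger the value, the lower the predicted probability, i.e. the lower the risk of disease.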

shap.summary_plot(shap_values[1],X_test)

Next, for a single patient, look at how each feature influences that patient's result.

def heart_disease_risk_factors(model, patient):
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(patient)
    shap.initjs()
    return shap.force_plot(explainer.expected_value[1], shap_values[1], patient)

data_for_prediction = X_test.iloc[0, :].astype(float)
heart_disease_risk_factors(model, data_for_prediction)

Next, the interaction between two features can be analyzed.

shap.dependence_plot('num_major_vessels', shap_values[1], X_test, interaction_index="st_depression")

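For a batch of patients, for example 50, the effect of each feature on the result and the relationships between features can be examined together. The plot below is interactive: the horizontal and vertical axes can be chosen freely to get the view you want.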

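# Compute the SHAP values and draw the plot on the same 50 rows
# (the original mixed X_train for the values with X_test for the display)
batch = X_test.iloc[:50]
shap_values_batch = explainer.shap_values(batch)
shap.force_plot(explainer.expected_value[1], shap_values_batch[1], batch)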
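The quantity behind these plots is the Shapley value: a feature's marginal contribution averaged over every order in which features can be added. The sketch below computes it exactly for a two-feature toy function. It is not what TreeExplainer does internally (that uses a tree-specific algorithm), and replacing "missing" features with a single fixed baseline row is a simplifying assumption; all names are illustrative.

```python
from itertools import permutations

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, relative to a single baseline row."""
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous = f(current)
        for j in order:
            current[j] = x[j]          # switch feature j from baseline to its real value
            value = f(current)
            phi[j] += value - previous  # marginal contribution of j in this order
            previous = value
    return [total / len(orders) for total in phi]

def toy_f(row):
    # Illustrative model with an interaction term
    return 3 * row[0] + 2 * row[1] + row[0] * row[1]

x_row = [1, 2]
base_row = [0, 0]
phi = shapley_values(toy_f, x_row, base_row)
print(phi)  # [4.0, 5.0]
print(sum(phi) == toy_f(x_row) - toy_f(base_row))  # True: contributions add up to the difference
```

The additivity shown in the last line is what the force plot visualizes: the base value plus the per-feature contributions equals the model output for that patient.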
