Evaluating decision tree classification results with ROC curves and AUC in Python scikit-learn

Last time, I wrote about cross-validation as part of validating model performance.

www.superi.jp

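When the sampled data is uneven, you run cross-validation first and then evaluate performance. The model itself is evaluated by the area under the ROC curve (AUC). For the details of ROC and AUC, please look elsewhere, but the reason to use AUC for model evaluation is data where the samples themselves are skewed (say, label A makes up 90% of the data across labels A and B). In that case, just predicting label A gets it right most of the time, doesn't it? AUC lets you evaluate a model while asking how good a model like that really is.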
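To make that point concrete, here is a minimal sketch in plain Python. The 90/10 data set and the helper `pairwise_auc` are made up for illustration; the helper computes AUC as the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one, counting ties as half.

```python
# Toy imbalanced data set (an assumption for illustration):
# 90 samples of label A (encoded as 0) and 10 samples of label B (encoded as 1).
y_true = [0] * 90 + [1] * 10

# A "model" that ignores its input and always answers label A.
y_pred = [0] * len(y_true)
# Its score for "this sample is label B" is the same constant everywhere.
scores = [0.0] * len(y_true)

# Fraction of samples where the prediction matches the true label.
accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)


def pairwise_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative), ties counted as 0.5."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


auc = pairwise_auc(y_true, scores)
print("accuracy:", accuracy)  # 0.9
print("AUC:", auc)            # 0.5
```

Accuracy looks good at 0.9, but the AUC of 0.5 shows that the model cannot tell the two labels apart at all.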

So, let's build a model with a decision tree and evaluate it with AUC.

# Load the iris dataset bundled with scikit-learn
from sklearn.datasets import load_iris
iris = load_iris()

Next, we model it with a decision tree.

from sklearn import tree

clf = tree.DecisionTreeClassifier(max_depth=3)
clf = clf.fit(iris.data, iris.target)
score = clf.score(iris.data, iris.target)

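We predict with the predict function.

The classification_report function returns the precision, recall, F1 score, and support (number of samples) for each class.

# Import the metrics module, which provides classification_report
from sklearn import metrics as mtr

# Predict
y_pred = clf.predict(iris.data)

# Print the classification results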
print(mtr.classification_report(iris.target, y_pred, target_names = iris.target_names))
             precision    recall  f1-score   support

     setosa       1.00      1.00      1.00        50
 versicolor       0.98      0.94      0.96        50
  virginica       0.94      0.98      0.96        50

avg / total       0.97      0.97      0.97       150
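As a rough check on how those columns relate to each other, here is a sketch that recomputes precision, recall, and F1 from a confusion matrix. The counts below are an assumption: they were chosen because they reproduce the rounded values in the report above, and they are not output from the original run. `report_row` is a hypothetical helper name.

```python
# Hypothetical confusion matrix (rows: true class, columns: predicted class).
# Assumption: these counts merely match the rounded report; they are not
# taken from the original run.
labels = ["setosa", "versicolor", "virginica"]
cm = [
    [50, 0, 0],
    [0, 47, 3],
    [0, 1, 49],
]


def report_row(cm, k):
    """Return (precision, recall, f1, support) for class index k."""
    tp = cm[k][k]
    predicted = sum(row[k] for row in cm)  # column sum: samples predicted as k
    actual = sum(cm[k])                    # row sum: samples that truly are k
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, actual


rows = {name: report_row(cm, k) for k, name in enumerate(labels)}
for name, (p, r, f1, support) in rows.items():
    print(f"{name:>10}  {p:.2f}  {r:.2f}  {f1:.2f}  {support}")
```

Precision divides by the column sum (everything predicted as that class), while recall divides by the row sum (everything that truly is that class), which is why the two differ for versicolor and virginica.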

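Next, we compute the predicted probabilities with the predict_proba function. Here we predict whether or not a sample is virginica.

The roc_curve function computes the ROC curve from the predicted probabilities and the true labels.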

prob = clf.predict_proba(iris.data)[:,2]
fpr, tpr, thresholds = mtr.roc_curve(iris.target, prob, pos_label=2)
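# To see what the false positive rate and true positive rate arrays mean,
# here is a plain-Python sketch that sweeps the threshold by hand. The helper
# name roc_points and the four sample points are made up for illustration;
# this is not how scikit-learn implements roc_curve internally.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists, lowering the threshold through each distinct score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Threshold above every score: nothing is predicted positive.
    fpr, tpr = [0.0], [0.0]
    for thr in sorted(set(scores), reverse=True):
        # A sample is predicted positive when its score is >= the threshold.
        tp = sum(1 for s, t in zip(scores, y_true) if s >= thr and t == 1)
        fp = sum(1 for s, t in zip(scores, y_true) if s >= thr and t == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr


# Made-up labels (1 = positive class) and scores, for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(y_true, scores)
print(fpr)  # [0.0, 0.0, 0.5, 0.5, 1.0]
print(tpr)  # [0.0, 0.5, 0.5, 1.0, 1.0]
```

# Each (fpr, tpr) pair is one point on the ROC curve; lowering the threshold
# moves the point up and to the right.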

Now we plot the ROC curve we just computed.

%matplotlib inline
from matplotlib import pyplot as plt
plt.plot(fpr, tpr)
plt.title("ROC curve")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()

We compute the AUC with the auc function.

mtr.auc(fpr, tpr)
0.9929
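The auc function integrates the curve with the trapezoidal rule. As a sketch of that idea in plain Python, with a hypothetical helper `trapezoid_auc` and made-up ROC points rather than the iris results:

```python
def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        mean_height = (tpr[i] + tpr[i - 1]) / 2.0
        area += width * mean_height
    return area


# Made-up ROC points, for illustration only.
fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
tpr = [0.0, 0.5, 0.5, 1.0, 1.0]
area = trapezoid_auc(fpr, tpr)
print(area)  # 0.75
```

A curve that hugs the top-left corner gives an area close to 1, and the diagonal from (0, 0) to (1, 1) gives 0.5.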

That's high. Then again, the training data and the test data are the same here, so of course it is.
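For a less optimistic number, one option is to hold out part of the data and compute the AUC only on samples the tree has not seen. The following is a sketch under assumptions that are not in the original post: the 70/30 split, `stratify`, and `random_state=0` are arbitrary choices.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Assumption: hold out 30% of the samples, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, stratify=iris.target, random_state=0
)

# Same tree depth as in the post; random_state fixed only for repeatability.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Probability of class 2 (virginica) for the held-out samples only.
prob = clf.predict_proba(X_test)[:, 2]

# Turn the three-class target into "virginica or not" for a binary AUC.
y_test_binary = (y_test == 2).astype(int)
test_auc = roc_auc_score(y_test_binary, prob)
print(test_auc)
```

The exact value depends on the split, so repeating this over several folds, as in cross-validation, gives a steadier estimate.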

References

Machine Learning Study Diary | Microsoft Azure Machine Learning / Evaluation results / Tweaking 3D printer settings / Machine Learning Study Diary summary