WeChat Official Account: 尤而小屋
Author: Peter
Editor: Peter

Hi everyone, I'm Peter~

Today I'm bringing you a new Kaggle data-modeling article: credit default prediction.

Based on a LendingClub credit dataset, this article uses borrowers' historical data and loan information to build a logistic regression model that predicts whether a borrower will repay their loan.

The mind map summarizes the main work of this article; missing-value handling and feature engineering are the focus.

[Figure: mind map of the workflow]

Importing libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as ms
import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
import plotly.offline as pyo
pyo.init_notebook_mode()
sns.set_style('darkgrid')
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score as f1
from sklearn.metrics import confusion_matrix
plt.rc('figure',figsize=(18,9))
import warnings
warnings.filterwarnings("ignore")

Basic information about the data

The data comes from LendingClub, a US peer-to-peer lending company headquartered in San Francisco, California, billed as the world's largest peer-to-peer (P2P) lending platform.

[Figure: data preview]

Check the data size, field types, etc.:

[Figure: data size and field types]

Missing-value handling

Missing-value handling is a major part of this project. The data contains both string-typed and numeric variables, and both types have missing values, so we handle them separately.

Visualizing missing values

In [6]:

# 4. Missing-value counts per field
df.isnull().sum().sort_values(ascending=False)

Out[6]:

id                     105451
hardship_start_date    105451
member_id              105451
hardship_type          105451
hardship_reason        105451
                        ...  
open_acc_6m                 0
tot_cur_bal                 0
tot_coll_amt                0
acc_now_delinq              0
max_bal_bc                  0
Length: 137, dtype: int64

In [7]:

ms.bar(df,color="blue")
plt.show()

[Figure: missingno bar chart]

A value of 1 means the field is complete with no missing values; clearly, many fields still have missing values.

Dropping fully missing fields

1. Some fields are entirely missing; we drop them:

In [8]:

# Method 1: use dropna's how parameter

# data.dropna(axis=1, how='all')

In [9]:

# Here we use Method 2 below instead
# 1. Check each field's missing-value ratio
isnull_col_percent = df.isnull().sum(axis=0).sort_values(ascending=False) / float(len(df)) 
isnull_col_percent

Out[9]:

id                     1.0
hardship_start_date    1.0
member_id              1.0
hardship_type          1.0
hardship_reason        1.0
                      ... 
open_acc_6m            0.0
tot_cur_bal            0.0
tot_coll_amt           0.0
acc_now_delinq         0.0
max_bal_bc             0.0
Length: 137, dtype: float64

In [10]:

isnull_col_percent[:20]  # first 20 fields

Out[10]:

id                                            1.000000
hardship_start_date                           1.000000
member_id                                     1.000000
hardship_type                                 1.000000
hardship_reason                               1.000000
hardship_status                               1.000000
deferral_term                                 1.000000
hardship_amount                               1.000000
hardship_end_date                             1.000000
url                                           1.000000
payment_plan_start_date                       1.000000
hardship_length                               1.000000
hardship_dpd                                  1.000000
hardship_loan_status                          1.000000
orig_projected_additional_accrued_interest    1.000000
hardship_payoff_balance_amount                1.000000
desc                                          1.000000
hardship_last_payment_amount                  1.000000
sec_app_mths_since_last_major_derog           0.977866
sec_app_revol_util                            0.936188
dtype: float64

In the results above, the fields with a ratio of 1 (the first 18) are entirely missing and useless for modeling, so we drop them:

In [11]:

no_use_col = df.columns[df.isnull().all()].tolist()
no_use_col

Out[11]:

['id', 'member_id', 'url', 'desc', 'hardship_type', 'hardship_reason', 'hardship_status', 'deferral_term', 'hardship_amount', 'hardship_start_date', 'hardship_end_date', 'payment_plan_start_date', 'hardship_length', 'hardship_dpd', 'hardship_loan_status', 'orig_projected_additional_accrued_interest', 'hardship_payoff_balance_amount', 'hardship_last_payment_amount']

In [12]:

df.drop(no_use_col,axis=1,inplace=True)

In [13]:

df.shape  # 137 - 18 = 119: the 18 columns were successfully dropped

Out[13]:

(105451, 119)
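Method 1 (dropna with how='all'), left commented out above, yields the same result as the mask-based Method 2; a toy demonstration:

```python
import numpy as np
import pandas as pd

# toy frame: column "b" is entirely missing, like the 18 hardship_* fields
toy = pd.DataFrame({"a": [1, 2, 3], "b": [np.nan] * 3, "c": [4.0, np.nan, 6.0]})

# Method 1: drop every column whose values are all NaN
dropped = toy.dropna(axis=1, how="all")

# Method 2: build the list of all-null columns explicitly, then drop them
all_null = toy.columns[toy.isnull().all()].tolist()
same = toy.drop(columns=all_null)

print(dropped.columns.tolist())  # ['a', 'c']
```

Method 2 is slightly more verbose but, unlike Method 1, it also records which columns were removed.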

Dropping fields with excessive missing values

2. Some fields are only partially missing, but a large missing share still hurts modeling.

Here we keep only columns with at least 30% non-missing values, i.e. we drop any column that is more than 70% missing (computed column-wise; the threshold can be tuned).

In [14]:

# the largest missing ratio is as high as 97.78%
df.isnull().sum(axis=0).sort_values(ascending=False) / float(len(df))

Out[14]:

sec_app_mths_since_last_major_derog    0.977866
sec_app_revol_util                     0.936188
revol_bal_joint                        0.935458
sec_app_earliest_cr_line               0.935458
sec_app_inq_last_6mths                 0.935458
                                         ...   
total_rec_int                          0.000000
total_rec_prncp                        0.000000
total_pymnt_inv                        0.000000
total_pymnt                            0.000000
hardship_flag                          0.000000
Length: 119, dtype: float64

In [15]:

thresh = len(df)*0.3 # threshold: minimum number of NON-null values a column needs to be kept
# columns with fewer non-null values than the threshold are dropped
df.dropna(thresh=thresh, axis=1, inplace=True) 

In [16]:

df.shape

Out[16]:

(105451, 102)
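Note that dropna's thresh argument counts the non-null values a column must have to survive, so thresh = len(df) * 0.3 drops only columns that are more than 70% missing (which is why the 64.8%-missing column below survives). A toy check:

```python
import numpy as np
import pandas as pd

# 10-row toy frame: "half_missing" is 50% NaN, "mostly_missing" is 80% NaN
toy = pd.DataFrame({
    "full":           list(range(10)),
    "half_missing":   [1.0] * 5 + [np.nan] * 5,
    "mostly_missing": [1.0] * 2 + [np.nan] * 8,
})

# thresh is the MINIMUM number of non-null values a column needs to be kept,
# so requiring 30% non-null drops only columns that are over 70% missing
kept = toy.dropna(thresh=int(len(toy) * 0.3), axis=1)

print(kept.columns.tolist())  # ['full', 'half_missing']
```

To literally "drop columns with more than 30% missing", the call would instead be `dropna(thresh=int(len(df) * 0.7), axis=1)`.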

In [17]:

df.isnull().sum(axis=0).sort_values(ascending=False) / float(len(df))

Out[17]:

mths_since_recent_revol_delinq    0.648405
mths_since_last_delinq            0.484765
il_util                           0.126884
mths_since_recent_inq             0.113313
emp_title                         0.064314
                                    ...   
total_pymnt_inv                   0.000000
total_pymnt                       0.000000
out_prncp_inv                     0.000000
out_prncp                         0.000000
hardship_flag                     0.000000
Length: 102, dtype: float64

The new data (after dropping missing fields)

Visualizing missing values in the new data

In [18]:

ms.bar(df,color="blue")
plt.show()

[Figure: missingno bar chart after dropping columns]

Compared with before, far fewer fields contain missing values.

After processing, only 102 fields remain:

In [19]:

df.shape

Out[19]:

(105451, 102)

Field type counts

In [20]:

df.dtypes.value_counts()

Out[20]:

int64      54
float64    25
object     23
dtype: int64

Constant-value fields

3. Some fields take a single value everywhere, e.g. all zeros; such fields are also useless for modeling, so we drop them.

First find these fields, using pandas' nunique to check whether a field has only one distinct value.

In [21]:

nunique_data = df.apply(pd.Series.nunique) != 1   # fields with more than one distinct value
nunique_data

Out[21]:

loan_amnt                      True
funded_amnt                    True
funded_amnt_inv                True
term                           True
int_rate                       True
                              ...  
tot_hi_cred_lim                True
total_bal_ex_mort              True
total_bc_limit                 True
total_il_high_credit_limit     True
hardship_flag                 False
Length: 102, dtype: bool

In [22]:

# df1: the data after dropping constant-value fields
df1 = df.loc[:, nunique_data]
df1.shape

Out[22]:

(105451, 97)

The field count dropped from 102 to 97, so there were 5 constant-value fields.
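A toy illustration of this constant-column filter:

```python
import pandas as pd

toy = pd.DataFrame({
    "loan_amnt":     [1000, 2000, 3000],
    "hardship_flag": ["N", "N", "N"],   # constant column: useless for modeling
})

# keep only the columns that take more than one distinct value
varying = toy.apply(pd.Series.nunique) != 1
filtered = toy.loc[:, varying]

print(filtered.columns.tolist())  # ['loan_amnt']
```

One caveat: nunique ignores NaN by default, so an all-NaN column has nunique 0 and would also pass this filter; that case was already handled by the earlier drops.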

In [23]:

# check the field types again
df1.dtypes.value_counts()

Out[23]:

int64      51
float64    25
object     21
dtype: int64

Missing-value handling: categorical variables

For categorical (string-typed) variables, the treatment typically involves label encoding, one-hot encoding, factorization, and similar operations.

In [24]:

object_df = df1.select_dtypes(include="object")
object_df.columns

Out[24]:

Index(['term', 'int_rate', 'grade', 'sub_grade', 'emp_title', 'emp_length',       'home_ownership', 'verification_status', 'issue_d', 'loan_status',       'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line',       'revol_util', 'initial_list_status', 'last_pymnt_d', 'next_pymnt_d',       'last_credit_pull_d', 'application_type'],
      dtype='object')

For example, the field grade is a genuinely categorical Object field:

In [25]:

# grade
df1["grade"].value_counts()

Out[25]:

C    36880
B    31301
A    17898
D    12015
E     4847
F     1725
G      785
Name: grade, dtype: int64

The fields below are stored as object type, but their values actually carry an inherent numeric ordering:

In [26]:

# int_rate
df1["int_rate"].value_counts()

Out[26]:

16.02%    4956
5.32%     4365
15.05%    4195
14.08%    4168
12.74%    4155
          ... 
30.75%     119
30.49%     113
30.74%      98
21.49%      89
6.00%        4
Name: int_rate, Length: 65, dtype: int64

In [27]:

# revol_util
df1["revol_util"].value_counts()

Out[27]:

0%         468
60%        244
64%        230
50%        227
53%        220
          ... 
116.20%      1
110.10%      1
106.10%      1
125%         1
113%         1
Name: revol_util, Length: 1076, dtype: int64

Converting categorical variables to numeric

  1. Convert the percentage strings to floats:

In [28]:

# strip the trailing % and cast to float
df1["int_rate"] = df1["int_rate"].str.rstrip("%").astype("float")
df1["revol_util"] = df1["revol_util"].str.rstrip("%").astype("float")

  2. Handle the employment-length field: convert it to numeric, replacing "10+" with 10 and "< 1" with 0:

In [29]:

# emp_length
df1["emp_length"].value_counts()

Out[29]:

10+ years    35438
2 years       9914
< 1 year      9542
3 years       8495
1 year        7034
4 years       6453
5 years       6382
6 years       4526
7 years       3847
9 years       3828
8 years       3295
Name: emp_length, dtype: int64

In [30]:

df1["emp_length"].isnull().sum()  # there are missing values

Out[30]:

6697

First fill the missing values with "< 1 year":

In [31]:

df1["emp_length"] = df1["emp_length"].fillna("< 1 year")

In [32]:

# 1. strip "year"/"years" first
df1["emp_length"] = df1["emp_length"].apply(lambda x: x.split("y")[0].rstrip(" "))

In [33]:

df1["emp_length"].value_counts()

Out[33]:

10+    35438
< 1    16239
2       9914
3       8495
1       7034
4       6453
5       6382
6       4526
7       3847
9       3828
8       3295
Name: emp_length, dtype: int64

In [34]:

# 2. convert to numeric
df1["emp_length"] = df1["emp_length"].apply(lambda x: "10" if x == "10+" else x)
df1["emp_length"] = df1["emp_length"].str.replace("< 1","0")
df1["emp_length"] = df1["emp_length"].astype("float")

In [35]:

df1["emp_length"].value_counts()

Out[35]:

10.0    35438
0.0     16239
2.0      9914
3.0      8495
1.0      7034
4.0      6453
5.0      6382
6.0      4526
7.0      3847
9.0      3828
8.0      3295
Name: emp_length, dtype: int64
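The multi-step cleanup above (fillna, strip, replace, cast) can be collapsed into a single regex extraction; a sketch on sample values, not the article's exact code:

```python
import pandas as pd

s = pd.Series(["10+ years", "< 1 year", "2 years", "1 year"])

# "< 1 year" would wrongly extract 1, so map it to "0 year" first,
# then pull out the leading number: "10+ years" -> 10.0, "2 years" -> 2.0
parsed = (
    s.replace("< 1 year", "0 year")
     .str.extract(r"(\d+)")[0]
     .astype(float)
)

print(parsed.tolist())  # [10.0, 0.0, 2.0, 1.0]
```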

Visualizing categorical variables

In [36]:

object_df = df1.select_dtypes(include="object")
object_df.columns

Out[36]:

Index(['term', 'grade', 'sub_grade', 'emp_title', 'home_ownership',       'verification_status', 'issue_d', 'loan_status', 'purpose', 'title',       'zip_code', 'addr_state', 'earliest_cr_line', 'initial_list_status',       'last_pymnt_d', 'next_pymnt_d', 'last_credit_pull_d',       'application_type'],
      dtype='object')

In [37]:

ms.matrix(object_df)
plt.show()

[Figure: missingno matrix of the object fields]

Missing values are concentrated mainly in two fields: emp_title and next_pymnt_d.

Filling missing values

In [38]:

object_df.isnull().sum()

Out[38]:

term                      0
grade                     0
sub_grade                 0
emp_title              6782
home_ownership            0
verification_status       0
issue_d                   0
loan_status               0
purpose                   0
title                     0
zip_code                  0
addr_state                0
earliest_cr_line          0
initial_list_status       0
last_pymnt_d            145
next_pymnt_d           3921
last_credit_pull_d        3
application_type          0
dtype: int64

Here we simply fill them with "Unknown":

In [39]:

object_df.fillna("Unknown",inplace=True)

Checking again confirms there are no missing values left:

In [40]:

[Figure: object_df.isnull().sum() output, all zeros]

Missing-value handling: numeric variables

In [41]:

number_df = df1.select_dtypes(exclude="object")
number_df.columns

Out[41]:

Index(['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'int_rate',       'installment', 'emp_length', 'annual_inc', 'dti', 'delinq_2yrs',       'inq_last_6mths', 'mths_since_last_delinq', 'open_acc', 'pub_rec',       'revol_bal', 'revol_util', 'total_acc', 'out_prncp', 'out_prncp_inv',       'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',       'total_rec_late_fee', 'last_pymnt_amnt', 'collections_12_mths_ex_med',       'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m',       'open_il_6m', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il',       'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc',       'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m',       'acc_open_past_24mths', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util',       'chargeoff_within_12_mths', 'delinq_amnt', 'mo_sin_old_il_acct',       'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl',       'mort_acc', 'mths_since_recent_bc', 'mths_since_recent_inq',       'mths_since_recent_revol_delinq', 'num_accts_ever_120_pd',       'num_actv_bc_tl', 'num_actv_rev_tl', 'num_bc_sats', 'num_bc_tl',       'num_il_tl', 'num_op_rev_tl', 'num_rev_accts', 'num_rev_tl_bal_gt_0',       'num_sats', 'num_tl_120dpd_2m', 'num_tl_30dpd', 'num_tl_90g_dpd_24m',       'num_tl_op_past_12m', 'pct_tl_nvr_dlq', 'percent_bc_gt_75',       'pub_rec_bankruptcies', 'tax_liens', 'tot_hi_cred_lim',       'total_bal_ex_mort', 'total_bc_limit', 'total_il_high_credit_limit'],
      dtype='object')

In [42]:

number_df.shape

Out[42]:

(105451, 79)

Visualizing missing values

In [43]:

ms.matrix(number_df)
plt.show()

[Figure: missingno matrix of the numeric fields]

How to check whether a row is entirely empty:

[Figure: row-wise completeness check]

Checking the missing-value counts

In [48]:

number_df.isnull().sum().sort_values(ascending=False)

Out[48]:

mths_since_recent_revol_delinq    68375
mths_since_last_delinq            51119
il_util                           13380
mths_since_recent_inq             11949
num_tl_120dpd_2m                   5278
                                  ...  
tot_cur_bal                           0
tot_coll_amt                          0
acc_now_delinq                        0
collections_12_mths_ex_med            0
total_il_high_credit_limit            0
Length: 79, dtype: int64

Finding the fields with missing values

In [49]:

# True if the field has missing values, False otherwise
number_df.isnull().sum() > 0

Out[49]:

loan_amnt                     False
funded_amnt                   False
funded_amnt_inv               False
int_rate                      False
installment                   False
                              ...  
tax_liens                     False
tot_hi_cred_lim               False
total_bal_ex_mort             False
total_bc_limit                False
total_il_high_credit_limit    False
Length: 79, dtype: bool

In [50]:

number_df.columns[number_df.isnull().sum() > 0]

Out[50]:

Index(['dti', 'mths_since_last_delinq', 'revol_util', 'mths_since_rcnt_il',       'il_util', 'all_util', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util',       'mo_sin_old_il_acct', 'mths_since_recent_bc', 'mths_since_recent_inq',       'mths_since_recent_revol_delinq', 'num_tl_120dpd_2m',       'percent_bc_gt_75'],
      dtype='object')

Filling missing values

For numeric variables, we fill missing values with each column's mean.

In [51]:

# Method 1
for col in number_df.columns[number_df.isnull().sum() > 0]:
    mean_val = number_df[col].mean()  # mean of the column
    number_df[col].fillna(mean_val, inplace=True)  # fill with the mean

In [52]:

# Method 2
# the old import raises an error in recent versions:
# from sklearn.preprocessing import Imputer
# use the new API instead
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

In [53]:

number_df.isnull().sum().sort_values(ascending=False)

Out[53]:

loan_amnt                         0
mo_sin_old_rev_tl_op              0
num_accts_ever_120_pd             0
mths_since_recent_revol_delinq    0
mths_since_recent_inq             0
                                 ..
acc_now_delinq                    0
collections_12_mths_ex_med        0
last_pymnt_amnt                   0
total_rec_late_fee                0
total_il_high_credit_limit        0
Length: 79, dtype: int64
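Method 2 above stops at constructing the imputer; for completeness, a self-contained sketch of how SimpleImputer is actually applied (on a toy frame, since Method 1 has already filled number_df):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({"dti": [10.0, np.nan, 20.0], "il_util": [np.nan, 50.0, 70.0]})

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# fit_transform returns a NumPy array, so wrap it back into a DataFrame
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled["dti"].tolist())  # [10.0, 15.0, 20.0]
```

Unlike the loop in Method 1, a fitted imputer remembers the training means, so the same fill values can later be applied to a test set with `transform`.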

The new modeling data: new_data

Manually filtering fields

Some fields are useless or redundant for modeling; we remove them by hand.

In [54]:

object_df.isnull().sum()

Out[54]:

term                   0
grade                  0
sub_grade              0
emp_title              0
home_ownership         0
verification_status    0
issue_d                0
loan_status            0
purpose                0
title                  0
zip_code               0
addr_state             0
earliest_cr_line       0
initial_list_status    0
last_pymnt_d           0
next_pymnt_d           0
last_credit_pull_d     0
application_type       0
dtype: int64

In [55]:

object_df["application_type"].value_counts()

Out[55]:

INDIVIDUAL    98619
JOINT          6813
DIRECT_PAY       19
Name: application_type, dtype: int64
  • term: the number of loan repayment installments, usually 36 or 60; keep (1)
  • grade and sub_grade: carry duplicate information; drop sub_grade and keep grade (2)
  • emp_title: does not reflect the borrower's income or assets; drop
  • home_ownership: home-ownership status; keep (3)
  • verification_status: whether the income was verified by LC, not verified, or the income source was verified; keep (4)
  • issue_d: the loan issuance date, which has no predictive value for default; drop
  • loan_status: the current status of the loan; keep (5)
  • purpose: the loan purpose; keep (6)
  • title: duplicates purpose; drop
  • zip_code: the address zip code; useless for modeling
  • addr_state: the applicant's state; useless for modeling
  • earliest_cr_line: the month the borrower's earliest reported credit line was opened; not meaningful for default modeling; drop
  • initial_list_status: the initial listing status of the loan, either W or F; keep (7)
  • last_pymnt_d, next_pymnt_d, last_credit_pull_d: post-origination information, not meaningful for default prediction; drop
  • application_type: whether the loan is an individual application or a joint application with two co-borrowers; keep (8)

In the end, 8 useful fields remain:

In [56]:

object_df = object_df[["term","grade","home_ownership","verification_status","loan_status",
                       "purpose","initial_list_status","application_type"]]
object_df

[Figure: object_df preview]

Merging the data

Concatenate the two datasets:

In [57]:

new_data = pd.concat([object_df,number_df],axis=1)
new_data.head()

Basic information

In [58]:

new_data.dtypes

Out[58]:

term                          object
grade                         object
home_ownership                object
verification_status           object
loan_status                   object
                               ...  
tax_liens                      int64
tot_hi_cred_lim                int64
total_bal_ex_mort              int64
total_bc_limit                 int64
total_il_high_credit_limit     int64
Length: 87, dtype: object

In [59]:

new_data.shape

Out[59]:

(105451, 87)

Descriptive statistics

Descriptive statistics for the numeric fields: extrema, quartiles, etc.:

In [60]:

new_data.describe()

[Figure: new_data.describe() output]

Feature engineering

Feature derivation

installment / (annual_inc / 12) is the share of monthly income taken by the loan payment; the larger the value, the heavier the repayment burden.

In [61]:

# installment: the monthly payment amount
# annual_inc: the applicant's total annual income
# new feature; map the inf produced by zero income to 0
new_data["installment_percent"] = new_data[["installment","annual_inc"]].apply(lambda x: x.installment / (x.annual_inc / 12), axis=1).replace(np.inf, 0)
new_data.head()

Out[61]:

[Figure: new_data.head() with the new installment_percent column]

Loan status: encoding and visualization

Encoding

In [62]:

new_data["loan_status"].value_counts()

Out[62]:

Current               99850
Fully Paid             3896
In Grace Period         932
Late (31-120 days)      436
Late (16-30 days)       312
Charged Off              25
Name: loan_status, dtype: int64

In [63]:

# default = 1, normal = 0
new_data["loan_status"] = new_data["loan_status"].apply(lambda x: 0 if (x == "Fully Paid" or x == "Current") else 1)

In [64]:

new_data["loan_status"].value_counts()

Out[64]:

0    103746
1      1705
Name: loan_status, dtype: int64
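An equivalent, vectorized way to write this target encoding (a sketch using isin instead of the lambda):

```python
import pandas as pd

s = pd.Series(["Current", "Fully Paid", "Late (31-120 days)", "Charged Off"])

# the two statuses treated as non-default; everything else is labeled 1
normal = ["Fully Paid", "Current"]
target = (~s.isin(normal)).astype(int)

print(target.tolist())  # [0, 0, 1, 1]
```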

Visualization

In [65]:

# bar chart of loan status counts
fig, axs = plt.subplots(1,2,figsize=(12,6))
sns.countplot(x='loan_status',
              data=new_data,
              ax=axs[0])
axs[0].set_title("Frequency of each Loan Status(Bar)")
# pie chart: share of each class
new_data['loan_status'].value_counts().plot(x=None,
                                            y=None, 
                                            kind='pie', 
                                            ax=axs[1],
                                            autopct='%1.2f%%')
axs[1].set_title("Percentage of each Loan status(Pie)")
plt.show()

[Figure: loan status bar chart and pie chart]

The default and non-default classes are clearly very imbalanced; we address this later through sampling.

Ordinal variable encoding

Some fields have an inherent order, e.g. clothing sizes "XS", "S", "M", "L", where the values themselves imply a magnitude relationship.

Here we hard-code the mapping:

In [66]:

new_data.select_dtypes("object")

[Figure: remaining object fields and their hard-coded encoding]
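The hard-coded mapping itself survives only as a screenshot in the original; a minimal sketch of what such an encoding could look like for term and grade (the mapping values here are illustrative assumptions, not the article's exact code):

```python
import pandas as pd

toy = pd.DataFrame({
    "term":  [" 36 months", " 60 months", " 36 months"],
    "grade": ["A", "C", "B"],
})

# illustrative ordinal maps: grade "A" is the best credit grade, "G" the worst
term_map  = {" 36 months": 0, " 60 months": 1}
grade_map = {g: i for i, g in enumerate("ABCDEFG")}

toy["term"]  = toy["term"].map(term_map)
toy["grade"] = toy["grade"].map(grade_map)

print(toy["grade"].tolist())  # [0, 2, 1]
```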

Nominal feature encoding

We use one-hot encoding, implemented with the get_dummies function.

In [68]:

df1 = new_data.select_dtypes("object")  # string fields
df2 = new_data.select_dtypes(exclude=["object"])   # numeric fields
df1.head()

Out[68]:

[Figure: df1 preview]

[Figure: one-hot encoding result]
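The get_dummies call itself appears only as screenshots; a sketch of the likely shape of that step, with new_df named to match the variable used later in the sampling section (an assumption about how it was built):

```python
import pandas as pd

# stand-ins for the article's df1 (string fields) and df2 (numeric fields)
str_part = pd.DataFrame({"home_ownership": ["RENT", "OWN", "RENT"]})
num_part = pd.DataFrame({"loan_status": [0, 1, 0], "loan_amnt": [1000, 2000, 1500]})

# one-hot encode the string columns, then glue the numeric columns back on
dummies = pd.get_dummies(str_part)
new_df = pd.concat([dummies, num_part], axis=1)

print(new_df.columns.tolist())
# ['home_ownership_OWN', 'home_ownership_RENT', 'loan_status', 'loan_amnt']
```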

Feature scaling: standardization

Standardize the numeric features (df2) to a common scale so that training converges faster later on.

In [70]:

# target variable loan_status
Y = df2[["loan_status"]]
Y.head()

Out[70]:

   loan_status
0            0
1            0
2            0
3            0
4            0

In [71]:

# the remaining numeric variables in df2 form df3
df3 = df2.drop("loan_status",axis=1)
df3.head()
# standardize df3
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss_data = ss.fit_transform(df3)
df4 = pd.DataFrame(ss_data, columns=df3.columns)
df4.head()

[Figure: standardized df4.head()]

Sampling (key step)

As seen earlier, the default/non-default ratio is extremely skewed, i.e. the classes are highly imbalanced. There are two common remedies:

  • Oversampling: add positive samples (e.g. with SMOTE) so the positive and negative counts are close, then train
  • Undersampling: remove some negative samples so the positive and negative counts are close, then train

Below we handle it with undersampling:

In [75]:

positive_data = new_df[new_df["loan_status"] == 1]  # 违约
negetive_data = new_df[new_df["loan_status"] == 0]  # 未违约

In [76]:

print("length of positive_data: ",len(positive_data))
print("length of negetive_data: ",len(negetive_data))
length of positive_data:  1705
length of negetive_data:  103746

In [77]:

selected_data = negetive_data.sample(len(positive_data))
selected_data.shape

Out[77]:

(1705, 104)

The new_data generated below is the final modeling dataset:

In [78]:

new_data = pd.concat([positive_data,selected_data],axis=0)
new_data  

Modeling

Separating features and target

In [79]:

y = new_data[["loan_status"]]
X = new_data.drop("loan_status", axis=1)

Train/test split

In [80]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

Logistic regression classifier

Build a classifier based on logistic regression:

In [81]:

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression() # build the logistic regression classifier
clf.fit(X_train, y_train)

Out[81]:

LogisticRegression()

In [82]:

# make predictions
predict = clf.predict(X_test) 
predict[:5]

Out[82]:

array([1, 1, 0, 1, 1])

Prediction accuracy

Check the prediction accuracy:

In [83]:

from sklearn.metrics import accuracy_score
accuracy_score(y_test, predict)  # signature is (y_true, y_pred); accuracy is symmetric either way

Out[83]:

0.7067448680351907

The confusion matrix:

In [84]:

from sklearn.metrics import confusion_matrix
m = confusion_matrix(y_test, predict)
m

Out[84]:

array([[262,  87],
       [113, 220]])

In [85]:

# visualize the confusion matrix
plt.figure(figsize=(12,8))
sns.heatmap(m) 
plt.show()

[Figure: confusion matrix heatmap]

In the heatmap the shading tracks the counts: true negatives (262) are the most numerous cell, while false positives (87) are the fewest.

ROC curve

In [86]:

from sklearn.metrics import roc_curve, auc
false_positive_rate, true_positive_rate,thresholds = roc_curve(y_test, predict)
roc_auc = auc(false_positive_rate,true_positive_rate)
roc_auc

Out[86]:

0.7056884965194421

In [87]:

import matplotlib.pyplot as plt
plt.figure(figsize=(10,10))
plt.title("ROC")  # Receiver Operating Characteristic
plt.plot(false_positive_rate,
         true_positive_rate,
         color="red",
         label="AUC = %0.2f"%roc_auc
        )
plt.legend(loc="lower right")
plt.plot([0,1],[0,1],linestyle="--")
plt.axis("tight")
# true positive: predicted positive (class 1) and the prediction is correct
plt.ylabel("True Positive Rate")
# false positive: predicted positive (class 1) but the prediction is wrong
plt.xlabel("False Positive Rate")

Out[87]:

Text(0.5, 0, 'False Positive Rate')

[Figure: ROC curve, AUC = 0.71]
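One caveat: roc_curve above is fed hard 0/1 predictions, which yields only a coarse three-point curve; passing the positive-class probability from predict_proba traces the full curve. A self-contained sketch on synthetic data (not the article's dataset):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# use the positive-class probability rather than the hard 0/1 label
proba = clf.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, proba)
roc_auc = auc(fpr, tpr)

print(len(fpr))  # many thresholds, not just the 3 points hard labels give
```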

Summary

The overall pipeline reaches an AUC of about 0.71, so there is clear room for improvement. Possible follow-ups:

  1. Too many features: consider dimensionality reduction or multicollinearity checks to isolate the most valuable features
  2. Encoding of discrete variables: currently everything just gets a uniform standardization with no dedicated encoding; factorization, feature binning, etc. could be added
  3. Model optimization: try different classifiers as well as model-ensembling approaches
  4. Hyperparameter tuning: e.g. grid search or random search
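As an illustration of point 4, a grid search over the logistic regression regularization strength might look like this (synthetic data; the grid values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# illustrative grid over the inverse regularization strength C
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="f1")
search.fit(X, y)

print(search.best_params_)
```

Scoring with f1 rather than accuracy is a sensible default here, given how imbalanced the original classes were before undersampling.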