Preface
Against the backdrop of the big-data era, data analysis has spread into many fields, and using it to mine the value of data and provide a reliable basis for business execution and decision-making keeps growing in importance. This article works through a financial-lending problem: predicting whether a borrower will default.
1.1 Jupyter Setup, Imports, and Dataset Loading
Import the required modules.
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import re
import time
from datetime import datetime
from sklearn.exceptions import ConvergenceWarning
import sklearn
import pandas_profiling
Suppress warnings.
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)
Set a Chinese-capable font for matplotlib and seaborn so labels are not garbled.
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
sns.set(font='SimHei')
Configure how many rows and columns Jupyter displays.
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', 30)
Load the data.
df_train = pd.read_csv('train.csv', encoding='utf-8')
df_test = pd.read_csv('testA.csv', encoding='utf-8')
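Several later cells operate on a `combined` frame and a `targets` Series that are never defined in the post. A minimal sketch of how they are presumably built: the target is split off the training set, then train and test are stacked so every cleaning and encoding step sees identical columns. The tiny frames below stand in for the real CSVs.

```python
import pandas as pd

# Tiny stand-ins for df_train / df_test (the real splits have 47 / 46 columns)
df_train = pd.DataFrame({'id': [0, 1], 'loanAmnt': [35000.0, 18000.0],
                         'isDefault': [1, 0]})
df_test = pd.DataFrame({'id': [2, 3], 'loanAmnt': [12000.0, 11000.0]})

# Keep the target aside; the test split has no isDefault column
targets = df_train['isDefault']

# Stack both splits so every cleaning/encoding step is applied consistently
combined = pd.concat([df_train.drop('isDefault', axis=1), df_test],
                     axis=0, ignore_index=True)

print(combined.shape)  # (4, 2)
```

With the real data this gives `combined` with 1,000,000 rows, matching the shapes printed later in the post.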
1.2 Exploratory Analysis
1.2.1 Previewing the Dataset
- Preview the dataset
- The training set contains 47 features
pd.concat([df_train.head(5), df_train.tail(5)])  # DataFrame.append was removed in pandas 2.0
————————————————————————————————————————————————————————
id loanAmnt term interestRate installment grade subGrade employmentTitle employmentLength homeOwnership annualIncome verificationStatus issueDate isDefault purpose ... n0 n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13 n14
0 0 35000.0 5 19.52 917.97 E E2 320.0 2 years 2 110000.0 2 2014-07-01 1 1 ... 0.0 2.0 2.0 2.0 4.0 9.0 8.0 4.0 12.0 2.0 7.0 0.0 0.0 0.0 2.0
1 1 18000.0 5 18.49 461.90 D D2 219843.0 5 years 0 46000.0 2 2012-08-01 0 0 ... NaN NaN NaN NaN 10.0 NaN NaN NaN NaN NaN 13.0 NaN NaN NaN NaN
2 2 12000.0 5 16.99 298.17 D D3 31698.0 8 years 0 74000.0 2 2015-10-01 0 0 ... 0.0 0.0 3.0 3.0 0.0 0.0 21.0 4.0 5.0 3.0 11.0 0.0 0.0 0.0 4.0
3 3 11000.0 3 7.26 340.96 A A4 46854.0 10+ years 1 118000.0 1 2015-08-01 0 4 ... 6.0 4.0 6.0 6.0 4.0 16.0 4.0 7.0 21.0 6.0 9.0 0.0 0.0 0.0 1.0
4 4 3000.0 3 12.99 101.07 C C2 54.0 NaN 1 29000.0 2 2016-03-01 0 10 ... 1.0 2.0 7.0 7.0 2.0 4.0 9.0 10.0 15.0 7.0 12.0 0.0 0.0 0.0 4.0
799995 799995 25000.0 3 14.49 860.41 C C4 2659.0 7 years 1 72000.0 0 2016-07-01 0 0 ... 0.0 5.0 10.0 10.0 6.0 6.0 2.0 12.0 13.0 10.0 14.0 0.0 0.0 0.0 3.0
799996 799996 17000.0 3 7.90 531.94 A A4 29205.0 10+ years 0 99000.0 2 2013-04-01 0 4 ... 0.0 2.0 2.0 2.0 2.0 15.0 16.0 2.0 19.0 2.0 7.0 0.0 0.0 0.0 0.0
799997 799997 6000.0 3 13.33 203.12 C C3 2582.0 10+ years 1 65000.0 2 2015-10-01 1 0 ... 2.0 1.0 4.0 4.0 1.0 4.0 26.0 4.0 10.0 4.0 5.0 0.0 0.0 1.0 4.0
799998 799998 19200.0 3 6.92 592.14 A A4 151.0 10+ years 0 96000.0 2 2015-02-01 0 4 ... 0.0 5.0 8.0 8.0 7.0 10.0 6.0 12.0 22.0 8.0 16.0 0.0 0.0 0.0 5.0
799999 799999 9000.0 3 11.06 294.91 B B3 13.0 5 years 0 120000.0 0 2018-08-01 0 4 ... 2.0 2.0 3.0 3.0 2.0 3.0 4.0 4.0 8.0 3.0 7.0 0.0 0.0 0.0 2.0
10 rows × 47 columns
- Preview summary statistics
df_train.describe()
————————————————————————————————————————————————————————————————————————
id loanAmnt term interestRate installment employmentTitle homeOwnership annualIncome verificationStatus isDefault purpose postCode regionCode dti delinquency_2years ... n0 n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13 n14
count 800000.000000 800000.000000 800000.000000 800000.000000 800000.000000 799999.000000 800000.000000 8.000000e+05 800000.000000 800000.000000 800000.000000 799999.000000 800000.000000 799761.000000 800000.000000 ... 759730.000000 759730.000000 759730.000000 759730.000000 766761.000000 759730.000000 759730.000000 759730.000000 759729.000000 759730.000000 766761.000000 730248.000000 759730.000000 759730.000000 759730.000000
mean 399999.500000 14416.818875 3.482745 13.238391 437.947723 72005.351714 0.614213 7.613391e+04 1.009683 0.199513 1.745982 258.535648 16.385758 18.284557 0.318239 ... 0.511932 3.642330 5.642648 5.642648 4.735641 8.107937 8.575994 8.282953 14.622488 5.592345 11.643896 0.000815 0.003384 0.089366 2.178606
std 230940.252015 8716.086178 0.855832 4.765757 261.460393 106585.640204 0.675749 6.894751e+04 0.782716 0.399634 2.367453 200.037446 11.036679 11.150155 0.880325 ... 1.333266 2.246825 3.302810 3.302810 2.949969 4.799210 7.400536 4.561689 8.124610 3.216184 5.484104 0.030075 0.062041 0.509069 1.844377
min 0.000000 500.000000 3.000000 5.310000 15.690000 0.000000 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 -1.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 199999.750000 8000.000000 3.000000 9.750000 248.450000 427.000000 0.000000 4.560000e+04 0.000000 0.000000 0.000000 103.000000 8.000000 11.790000 0.000000 ... 0.000000 2.000000 3.000000 3.000000 3.000000 5.000000 4.000000 5.000000 9.000000 3.000000 8.000000 0.000000 0.000000 0.000000 1.000000
50% 399999.500000 12000.000000 3.000000 12.740000 375.135000 7755.000000 1.000000 6.500000e+04 1.000000 0.000000 0.000000 203.000000 14.000000 17.610000 0.000000 ... 0.000000 3.000000 5.000000 5.000000 4.000000 7.000000 7.000000 7.000000 13.000000 5.000000 11.000000 0.000000 0.000000 0.000000 2.000000
75% 599999.250000 20000.000000 3.000000 15.990000 580.710000 117663.500000 1.000000 9.000000e+04 2.000000 0.000000 4.000000 395.000000 22.000000 24.060000 0.000000 ... 0.000000 5.000000 7.000000 7.000000 6.000000 11.000000 11.000000 10.000000 19.000000 7.000000 14.000000 0.000000 0.000000 0.000000 3.000000
max 799999.000000 40000.000000 5.000000 30.990000 1715.420000 378351.000000 5.000000 1.099920e+07 2.000000 1.000000 13.000000 940.000000 50.000000 999.000000 39.000000 ... 51.000000 33.000000 63.000000 63.000000 49.000000 70.000000 132.000000 79.000000 128.000000 45.000000 82.000000 4.000000 4.000000 39.000000 30.000000
8 rows × 42 columns
- Preview data types
df_train.info()
————————————————————————————————————————————————————————————————————————
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 800000 non-null int64
1 loanAmnt 800000 non-null float64
2 term 800000 non-null int64
3 interestRate 800000 non-null float64
4 installment 800000 non-null float64
5 grade 800000 non-null object
6 subGrade 800000 non-null object
7 employmentTitle 799999 non-null float64
8 employmentLength 753201 non-null object
9 homeOwnership 800000 non-null int64
10 annualIncome 800000 non-null float64
11 verificationStatus 800000 non-null int64
12 issueDate 800000 non-null object
13 isDefault 800000 non-null int64
14 purpose 800000 non-null int64
15 postCode 799999 non-null float64
16 regionCode 800000 non-null int64
17 dti 799761 non-null float64
18 delinquency_2years 800000 non-null float64
19 ficoRangeLow 800000 non-null float64
20 ficoRangeHigh 800000 non-null float64
21 openAcc 800000 non-null float64
22 pubRec 800000 non-null float64
23 pubRecBankruptcies 799595 non-null float64
24 revolBal 800000 non-null float64
25 revolUtil 799469 non-null float64
26 totalAcc 800000 non-null float64
27 initialListStatus 800000 non-null int64
28 applicationType 800000 non-null int64
29 earliesCreditLine 800000 non-null object
30 title 799999 non-null float64
31 policyCode 800000 non-null float64
32 n0 759730 non-null float64
33 n1 759730 non-null float64
34 n2 759730 non-null float64
35 n3 759730 non-null float64
36 n4 766761 non-null float64
37 n5 759730 non-null float64
38 n6 759730 non-null float64
39 n7 759730 non-null float64
40 n8 759729 non-null float64
41 n9 759730 non-null float64
42 n10 766761 non-null float64
43 n11 730248 non-null float64
44 n12 759730 non-null float64
45 n13 759730 non-null float64
46 n14 759730 non-null float64
dtypes: float64(33), int64(9), object(5)
- Preview the shapes of the training and test sets
df_train.shape,df_test.shape
————————————————————————————————————————————————————————————————————————
((800000, 47), (200000, 46))
- Count and distribution of missing values
df_train.isnull().sum()
missing_pct = combined.isnull().sum() * 100 / len(combined)  # percentage of nulls per column
missing = pd.DataFrame({
    'name': combined.columns,
    'missing_pct': missing_pct,
})
missing.sort_values(by='missing_pct', ascending=False).head()
————————————————————————————————————————————————————————
name missing_pct
n11 n11 8.7192
employmentLength employmentLength 5.8307
n8 n8 5.0322
n14 n14 5.0321
n3 n3 5.0321
1.2.2 Default Counts and Distribution
- The charts show the classes are imbalanced: negative samples (no default) make up 80%
fig, ax = plt.subplots(1, 2, figsize=(15, 8))
sns.countplot(x='isDefault', data=df_train, ax=ax[0], palette=['m', 'r'])
df_train['isDefault'].value_counts().plot.pie(autopct='%1.1f%%', ax=ax[1], colors=['orange', 'gray'])
ax[0].set_ylabel('')
ax[0].set_xlabel('isDefault')
ax[1].set_ylabel('')
ax[1].set_xlabel('isDefault')
plt.show()
1.2.3 Loan Amount, Term, Interest Rate vs. Default
- The mean loan amount of defaulted loans is higher than that of non-defaulted loans
- Most loans in the sample have a 3-year term, and 3-year loans account for more defaults than 5-year loans
- The mean interest rate of defaulted loans is markedly higher
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
df_train.groupby('isDefault')['loanAmnt'].mean().plot.bar(color=['m', 'c'])
plt.ylabel('loanAmnt')
plt.subplot(1, 3, 2)
sns.countplot(x='term', hue='isDefault', data=df_train, palette=['orange', 'g'])
plt.subplot(1, 3, 3)
df_train.groupby('isDefault')['interestRate'].mean().plot.bar(color=['k', 'r'])
plt.ylabel('interestRate')
plt.show()
- Among non-defaulted loans, 3-year loans cluster below 15,000 in amount, while 5-year loans mostly exceed 15,000
- The distribution for defaulted loans is similar
fig = plt.figure(figsize=(25, 10))
sns.violinplot(x='term', y='loanAmnt',
               hue='isDefault', data=df_train,
               split=True, alpha=0.9,
               palette={0: 'r', 1: 'g'})
plt.title('Loan amount by term', fontsize=25)
plt.xlabel('term', fontsize=22)
plt.ylabel('loanAmnt', fontsize=22)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=20)
1.2.4 Loan Amount, Grade, Subgrade vs. Default
- Mean loan amount trends upward with loan grade; subgrades behave similarly
- Grades C, D, B, and E account for more defaults
plt.figure(figsize=(15, 8))
plt.subplot(2, 2, 1)
df_train.groupby('grade')['loanAmnt'].mean().plot.bar(color=['m', 'c'])
plt.ylabel('loanAmnt')
plt.subplot(2, 2, 2)
df_train.groupby('subGrade')['loanAmnt'].mean().plot.bar(color=['m', 'c'])
plt.ylabel('loanAmnt')
plt.subplot(2, 2, 3)
sns.countplot(x='grade', hue='isDefault', data=df_train, palette=['orange', 'g'])
plt.subplot(2, 2, 4)
sns.countplot(x='subGrade', hue='isDefault', data=df_train, palette=['orange', 'c'])
plt.ylabel('subGrade')
- Across grades and subgrades, the distributions of defaulted and non-defaulted loans are similar
plt.figure(figsize=(20, 15))
plt.subplot(2, 1, 1)
sns.violinplot(x='grade', y='loanAmnt',
               hue='isDefault', data=df_train,
               split=True, alpha=0.9,
               palette={0: 'orange', 1: 'b'})
plt.title('Loan amount by grade', fontsize=25)
plt.xlabel('grade', fontsize=22)
plt.ylabel('loanAmnt', fontsize=22)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=20)
plt.subplot(2, 1, 2)
sns.violinplot(x='subGrade', y='loanAmnt',
               hue='isDefault', data=df_train,
               split=True, alpha=0.9,
               palette={0: 'orange', 1: 'b'})
plt.title('Loan amount by subgrade', fontsize=25)
plt.xlabel('subGrade', fontsize=22)
plt.ylabel('loanAmnt', fontsize=22)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=20)
1.2.5 Annual Income, Employment Length vs. Default
- Viewed jointly, annual income and employment length are fairly evenly distributed
plt.figure(figsize=(15, 8))
sns.barplot(x='employmentLength', y='annualIncome', hue='isDefault',
            data=df_train, palette=['g', 'c'])
1.2.6 Loan Amount, Debt-to-Income Ratio vs. Default
- Defaulted samples have a higher mean debt-to-income ratio, suggesting borrowers carrying multiple debts are more likely to end up defaulting
plt.figure(figsize=(15, 8))
df_train.groupby('isDefault')['dti'].mean().plot.bar(color=['r', 'b'])
plt.ylabel('dti')
- Defaults concentrate where dti lies between 100 and 200
- A few samples show extreme debt-to-income ratios above 200
plt.figure(figsize=(20, 8))
ax = plt.subplot()
defaulted = df_train[df_train['isDefault'] == 1]
repaid = df_train[df_train['isDefault'] == 0]
ax.scatter(defaulted['dti'], defaulted['loanAmnt'],
           color='r', s=defaulted['loanAmnt'] / 100)
ax.scatter(repaid['dti'], repaid['loanAmnt'],
           color='k', s=repaid['loanAmnt'] / 100)
1.3 Data Cleaning
1.3.1 Handling Missing Values
The features with missing values, identified in 1.2.1:
- ['employmentTitle', 'employmentLength', 'postCode', 'dti', 'pubRecBankruptcies', 'revolUtil', 'title', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']
- Fill employmentTitle, employmentLength, postCode, pubRecBankruptcies, and title with the mode
- Fill dti and revolUtil with the mean
combined['employmentLength'] = combined['employmentLength'].fillna('10+ years')
combined['employmentTitle'] = combined['employmentTitle'].fillna(54)
combined['postCode'] = combined['postCode'].fillna(134)
combined['dti'] = combined['dti'].fillna(combined['dti'].mean())
combined['pubRecBankruptcies'] = combined['pubRecBankruptcies'].fillna(0)
combined['title'] = combined['title'].fillna(0)
combined['revolUtil'] = combined['revolUtil'].fillna(combined['revolUtil'].mean())
- Fill the n-series missing values with the mode
n_features = [i for i in combined.columns if i.startswith('n')]
modes = combined[n_features].mode().values
modes_box = []
for i in range(15):
    modes_box.append(modes[0][i])  # first row of mode() holds each column's mode
combined[n_features] = combined[n_features].fillna(dict(zip(n_features, modes_box)))
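The loop above can be collapsed into a single pandas call: `mode()` returns a DataFrame whose first row holds each column's most frequent value, and `fillna` with a Series fills each column with its own mode. A toy sketch (the tiny frame stands in for `combined[n_features]`):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'n0': [0.0, 0.0, np.nan, 1.0],
                   'n1': [2.0, np.nan, 2.0, 3.0]})
n_features = ['n0', 'n1']

# mode().iloc[0] is a Series {column -> most frequent value};
# fillna aligns it by column name, filling each column with its own mode
df[n_features] = df[n_features].fillna(df[n_features].mode().iloc[0])

print(df['n0'].tolist())  # [0.0, 0.0, 0.0, 1.0]
```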
1.3.2 Handling Outliers
- The boxplots show outliers in employmentTitle, annualIncome, revolBal, title, and the n-series; they are handled one by one in the feature-engineering step
# 'n' was not defined in the original post; the numeric columns are assumed here
n = combined.select_dtypes(include='number').columns
fig, ax = plt.subplots(3, 2, figsize=(20, 15))
sns.boxplot(data=combined[n[:7]], ax=ax[0][0])
sns.boxplot(data=combined[n[7:14]], ax=ax[0][1])
sns.boxplot(data=combined[n[14:21]], ax=ax[1][0])
sns.boxplot(data=combined[n[21:28]], ax=ax[1][1])
sns.boxplot(data=combined[n[28:35]], ax=ax[2][0])
sns.boxplot(data=combined[n[35:42]], ax=ax[2][1])
1.4 Feature Engineering
The features to process:
- id: unique identifier assigned to the loan record
- loanAmnt: loan amount
- term: loan term (years)
- interestRate: interest rate
- installment: installment amount
- grade: loan grade
- subGrade: loan subgrade
- employmentTitle: job title
- employmentLength: employment length (years)
- homeOwnership: home-ownership status reported by the borrower at registration
- annualIncome: annual income
- verificationStatus: verification status
- issueDate: month the loan was issued
- purpose: loan purpose category declared at application
- postCode: first 3 digits of the postal code given at application
- regionCode: region code
- dti: debt-to-income ratio
- delinquency_2years: number of 30+ day past-due events in the borrower's credit file over the past 2 years
- ficoRangeLow: lower bound of the borrower's FICO range at issuance
- ficoRangeHigh: upper bound of the borrower's FICO range at issuance
- openAcc: number of open credit lines in the borrower's file
- pubRec: number of derogatory public records
- pubRecBankruptcies: number of public-record bankruptcies
- revolBal: total revolving credit balance
- revolUtil: revolving-line utilization rate, i.e. the credit used relative to all available revolving credit
- totalAcc: total number of credit lines currently in the borrower's file
- initialListStatus: initial listing status of the loan
- applicationType: whether the loan is an individual application or a joint application with two co-borrowers
- earliesCreditLine: month the borrower's earliest reported credit line was opened
- title: loan title provided by the borrower
- policyCode: publicly available policy_code = 1; new products not publicly available policy_code = 2
- n0-n14: anonymized features derived from counts of borrower behavior
1.4.1 Handling id, policyCode, grade
- Drop these features: policyCode takes only one value, and grade duplicates subGrade
combined.drop(['id', 'policyCode', 'grade'], axis=1, inplace=True)
1.4.2 Handling earliesCreditLine, issueDate
- Convert earliesCreditLine into a date format
def transfer_earliesCreditLine():
    global combined
    m = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
    M = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
    dict1 = dict(zip(M, m))
    dateline = []
    for k in combined['earliesCreditLine']:
        # e.g. 'Aug-2001' -> 'Aug-2001-08-01' -> '2001-08-01'
        k = k + '-' + str(dict1[k[:3]]) + '-' + '01' if dict1[k[:3]] > 9 else k + '-' + '0' + str(dict1[k[:3]]) + '-' + '01'
        k = k[4:]
        dateline.append(k)
    combined['earliesCreditLine'] = dateline
    return combined

combined = transfer_earliesCreditLine()
combined['earliesCreditLine']
————————————————————————————————————————————————————————
0 2001-08-01
1 2002-05-01
2 2006-05-01
3 1999-05-01
4 1977-08-01
...
199995 1996-11-01
199996 1994-09-01
199997 1988-04-01
199998 2003-05-01
199999 1997-10-01
Name: earliesCreditLine, Length: 1000000, dtype: object
- Compute the elapsed time between the two dates to build a new feature, sur_years
def date_modify(date):
    year = int(str(date)[:4])
    month = int(str(date)[5:7])
    day = int(str(date)[8:10])
    if month < 1:
        month = 1
    date_tr = datetime(year, month, day)
    return date_tr

combined['issueDate'] = combined['issueDate'].apply(date_modify)
combined['earliesCreditLine'] = combined['earliesCreditLine'].apply(date_modify)
sur_days = (combined['issueDate'] - combined['earliesCreditLine']).dt.days
sur_years = round(sur_days / 365, 1)
combined['sur_years'] = sur_years
del combined['issueDate']
del combined['earliesCreditLine']
1.4.3 Handling term, homeOwnership, and Other Categorical Features
-
One-hot encode: term, homeOwnership, verificationStatus, purpose, regionCode, initialListStatus, applicationType, subGrade, employmentLength
-
Preprocess employmentLength with a regular expression
employmentLength = []
for i in combined['employmentLength']:
    v = int(re.findall('[0-9]+', i)[0])
    employmentLength.append(v)
combined['employmentLength'] = employmentLength
combined['employmentLength']
- Apply one-hot encoding; afterwards there are 158 features
def dummies_coder():
    global combined
    for name in ['term', 'homeOwnership', 'verificationStatus',
                 'purpose', 'regionCode', 'initialListStatus', 'applicationType',
                 'subGrade', 'employmentLength']:
        df_dummies = pd.get_dummies(combined[name], prefix=name)
        combined = pd.concat([combined, df_dummies], axis=1)
        combined.drop(name, axis=1, inplace=True)
    return combined

combined = dummies_coder()
combined.shape
————————————————————————————————————————————————————————
(1000000, 158)
1.4.4 Handling loanAmnt, interestRate, and Other Numeric Features
- Preprocess ficoRangeLow and ficoRangeHigh
combined['ficoRange'] = combined['ficoRangeLow'] + combined['ficoRangeHigh']
combined.drop(['ficoRangeLow', 'ficoRangeHigh'], axis=1, inplace=True)
- Discretize these features with binning
# features to bin
modify_features = ['loanAmnt', 'interestRate', 'installment', 'annualIncome', 'postCode', 'dti',
                   'delinquency_2years', 'pubRec', 'openAcc', 'pubRecBankruptcies', 'revolBal',
                   'revolUtil', 'totalAcc', 'title']
- Derive thresholds from the 25% / 50% / 75% points of the distribution
# inner function: take thresholds at the 25% / 50% / 75% points
def get_threshold(feature):
    global combined
    threshold_values = []
    a = list(combined[feature].sort_values()[:250000])
    b = list(combined[feature].sort_values()[:500000])
    c = list(combined[feature].sort_values()[:750000])
    threshold_values.append(a[-1])
    threshold_values.append(b[-1])
    threshold_values.append(c[-1])
    return threshold_values

# outer function: apply the binning
def modify_df():
    global combined, modify_features
    for feature in modify_features:
        threshold_values = get_threshold(feature)
        combined.loc[combined[feature] < threshold_values[0], feature] = 0
        combined.loc[(combined[feature] >= threshold_values[0]) & (combined[feature] < threshold_values[1]),
                     feature] = 1
        combined.loc[(combined[feature] >= threshold_values[1]) & (combined[feature] < threshold_values[2]),
                     feature] = 2
        combined.loc[combined[feature] >= threshold_values[2], feature] = 3
    return combined

combined = modify_df()
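The sort-and-slice thresholds above implicitly assume exactly 1,000,000 rows. An equivalent quartile binning can be sketched with `pd.qcut`, which computes the 25%/50%/75% cut points for any length; `duplicates='drop'` guards skewed columns whose quartile edges coincide (those end up with fewer than 4 codes). The edge handling differs slightly from the manual thresholds, since qcut closes bins on the right:

```python
import pandas as pd

# Toy stand-in for one column of `combined`
s = pd.Series([500, 8000, 12000, 20000, 40000, 15000, 3000, 9000])

# labels=False yields integer quartile codes 0-3 directly
binned = pd.qcut(s, q=4, labels=False, duplicates='drop')

print(binned.tolist())  # [0, 1, 2, 3, 3, 2, 0, 1]
```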
- At this point all the binning is complete
combined.head()
————————————————————————————————————————————————————————
loanAmnt interestRate installment employmentTitle annualIncome postCode dti delinquency_2years ficoRangeLow ficoRangeHigh openAcc pubRec pubRecBankruptcies revolBal revolUtil ... subGrade_G1 subGrade_G2 subGrade_G3 subGrade_G4 subGrade_G5 employmentLength_1 employmentLength_2 employmentLength_3 employmentLength_4 employmentLength_5 employmentLength_6 employmentLength_7 employmentLength_8 employmentLength_9 employmentLength_10
0 3.0 3.0 3.0 320.0 3.0 1.0 1.0 3.0 730.0 734.0 0.0 3.0 3.0 3.0 1.0 ... 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
1 2.0 3.0 2.0 219843.0 1.0 1.0 3.0 3.0 700.0 704.0 2.0 3.0 3.0 2.0 1.0 ... 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
2 2.0 3.0 1.0 31698.0 2.0 2.0 2.0 3.0 675.0 679.0 2.0 3.0 3.0 0.0 1.0 ... 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
3 1.0 0.0 1.0 46854.0 3.0 1.0 1.0 3.0 685.0 689.0 1.0 3.0 3.0 1.0 2.0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
4 0.0 2.0 0.0 54.0 0.0 2.0 3.0 3.0 690.0 694.0 2.0 3.0 3.0 0.0 0.0 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
5 rows × 158 columns
1.4.5 Handling the n-Series
- The pair plot shows that n2 and n3 are perfectly linearly related; the other features need further characterization
sns.pairplot(combined[n_features])  # pairplot creates its own figure
plt.show()
- Feature correlation analysis; coefficients of 0.75 or above are treated as highly correlated
- n1 correlates highly with n2, n3, n4, n9
- n2 correlates highly with n1, n3, n7, n9
- n5 correlates highly with n8
- n7 correlates highly with n2, n3, n8, n9, n10
- n8 correlates highly with n5, n7
plt.figure(figsize=(15, 8))
sns.heatmap(combined[n_features].corr(), annot=True)
plt.show()
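The pairs above can also be extracted programmatically rather than read off the heatmap. A sketch using the same 0.75 threshold, on toy columns standing in for the n-series:

```python
import pandas as pd

df = pd.DataFrame({'n2': [1, 2, 3, 4, 5],
                   'n3': [2, 4, 6, 8, 10],   # exactly 2 * n2, so correlation 1.0
                   'n5': [5, 1, 4, 2, 3]})

corr = df.corr().abs()
# Walk the upper triangle only, so each pair is reported once
pairs = [(a, b, round(corr.loc[a, b], 2))
         for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:]
         if corr.loc[a, b] >= 0.75]

print(pairs)  # [('n2', 'n3', 1.0)]
```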
- Drop n2, n3, n4, n8, n9, n10
combined.drop(['n2', 'n3', 'n4', 'n8', 'n9', 'n10'], axis=1, inplace=True)
n_featuresrv = ['n0', 'n1', 'n5', 'n6', 'n7', 'n11', 'n12', 'n13', 'n14']
combined.shape
————————————————————————————————————————————————————————
(1000000, 151)
1.5 Model Training
- Import the required modules
import sklearn
import xgboost as xgb
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score, roc_auc_score, make_scorer
from bayes_opt import BayesianOptimization
- Split into training, validation, and test sets
train = combined[:800000]
test = combined[800000:]
x_train, x_val, y_train, y_val = train_test_split(train, targets, test_size=0.2, random_state=2021)
1.5.1 Baseline Models and Scores
- Instantiate the model classifiers
lgrcv = LogisticRegressionCV()
extree = ExtraTreesClassifier()
rf = RandomForestClassifier()
knn = KNeighborsClassifier()
xgb_clf = xgb.XGBClassifier()  # don't shadow the xgb / lgb module names
lgb_clf = lgb.LGBMClassifier()
models = [extree, lgrcv, rf, knn, lgb_clf, xgb_clf]
- Train each model individually and evaluate with AUC.
for model in models:
    model = model.fit(x_train, y_train)
    predict_val = model.predict(x_val)
    print(model)
    print('val f1-score :', metrics.f1_score(y_val, predict_val))
    proba = model.predict_proba(x_val)
    fpr, tpr, thresholds = metrics.roc_curve(y_val, y_score=proba[:, 1], pos_label=1)
    print('auc:', metrics.auc(fpr, tpr))
    print('**********************************')
——————————————————————————————————————————————————————
ExtraTreesClassifier()
val f1-score : 0.12774759370087427
auc: 0.7020222394446626
**********************************
LogisticRegressionCV()
val f1-score : 0.12568336494477295
auc: 0.7169715289708525
**********************************
RandomForestClassifier()
val f1-score : 0.11396893291310219
auc: 0.7066270946062161
**********************************
KNeighborsClassifier()
val f1-score : 0.19608019621520667
auc: 0.6092136514038466
**********************************
LGBMClassifier()
val f1-score : 0.14222881961625158
auc: 0.7214696572215835
**********************************
XGBClassifier()
val f1-score : 0.09031511997687193
auc: 0.7141476229518823
**********************************
- From the results, KNN performs worst and takes the longest to train
- Logistic regression, LGBM, and XGB perform best
1.5.2 Tuning Logistic Regression
- Preprocessing before training
mms = MinMaxScaler()
norm = Normalizer()
ss = StandardScaler()
x_train = mms.fit_transform(x_train)  # swap in norm / ss to compare preprocessors
- Tune the parameters; with Normalizer the AUC improves slightly
lgr = LogisticRegressionCV(fit_intercept=True, Cs=np.logspace(-5, 1, 100),
                           multi_class='multinomial', penalty='l2', solver='lbfgs')
lgr = lgr.fit(x_train, y_train)
val_predict = lgr.predict(x_val)
accuracy_score = metrics.accuracy_score(y_val, val_predict)
auc_score = roc_auc_score(y_val, val_predict)  # AUC on hard labels understates the probability-based AUC
print(f'accuracy_score:{accuracy_score}')
print(f'auc_score:{auc_score}')
————————————————————————————————————————————————————————
# no preprocessing, default parameters
# accuracy_score:0.73523125
# auc_score:0.6116656749736118
# no preprocessing, tuned parameters
# accuracy_score:0.70948125
# auc_score:0.627097088518769
# Normalizer
# accuracy_score:0.70948125
# auc_score:0.627097088518769
# StandardScaler
# accuracy_score:0.8006375
# auc_score:0.5008976649807645
# MinMaxScaler
# accuracy_score:0.44828125
# auc_score:0.6023187278237851
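The preprocessor comparison above appears to have been run one configuration at a time; it can be scripted as a loop. A sketch on synthetic data (the cleaned competition frame is not reproduced here); note it scores AUC on predicted probabilities, whereas the cell above scores hard labels, which understates AUC:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, random_state=2021)
x_tr, x_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=2021)

results = {}
for name, scaler in [('none', None), ('Normalizer', Normalizer()),
                     ('StandardScaler', StandardScaler()),
                     ('MinMaxScaler', MinMaxScaler())]:
    # Fit each scaler on the training split only, then apply it to validation
    if scaler is not None:
        x_tr_s = scaler.fit_transform(x_tr)
        x_va_s = scaler.transform(x_va)
    else:
        x_tr_s, x_va_s = x_tr, x_va
    clf = LogisticRegression(max_iter=1000).fit(x_tr_s, y_tr)
    results[name] = roc_auc_score(y_va, clf.predict_proba(x_va_s)[:, 1])

print({k: round(v, 4) for k, v in results.items()})
```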
1.5.3 Tuning XGB
- Bayesian optimization
def BO_xgb(x, y):
    t1 = time.perf_counter()  # time.clock() was removed in Python 3.8

    def xgb_cv(max_depth, gamma, min_child_weight, max_delta_step, subsample, colsample_bytree):
        paramt = {'booster': 'gbtree',
                  'max_depth': int(max_depth),
                  'gamma': gamma,
                  'eta': 0.1,
                  'objective': 'binary:logistic',
                  'nthread': 4,
                  'eval_metric': 'auc',
                  'subsample': max(min(subsample, 1), 0),
                  'colsample_bytree': max(min(colsample_bytree, 1), 0),
                  'min_child_weight': min_child_weight,
                  'max_delta_step': int(max_delta_step),
                  'seed': 1001}
        model = xgb.XGBClassifier(**paramt)
        res = cross_val_score(model, x, y, scoring='roc_auc', cv=5).mean()
        return res

    cv_params = {'max_depth': (5, 12),
                 'gamma': (0.001, 10.0),
                 'min_child_weight': (0, 20),
                 'max_delta_step': (0, 10),
                 'subsample': (0.4, 1.0),
                 'colsample_bytree': (0.4, 1.0)}
    xgb_op = BayesianOptimization(xgb_cv, cv_params)
    xgb_op.maximize(n_iter=20)
    print(xgb_op.max)
    t2 = time.perf_counter()
    print('elapsed:', (t2 - t1))
    return xgb_op.max

best_params = BO_xgb(x_train, y_train)
- The optimized parameters:
{'target': 0.7189167233509909,
'params': {'colsample_bytree': 0.4,
'gamma': 5.152572418614645,
'max_delta_step': 1.4062212725181387,
'max_depth': 8.845615277804477,
'min_child_weight': 13.505828422801075,
'subsample': 1.0}}
- Visualize the ROC curve on the validation set
# ROC visualization
def roc(m, x, y, name):
    y_pred = m.predict_proba(x)[:, 1]
    # predict and compute the ROC metrics
    fpr, tpr, threshold = metrics.roc_curve(y, y_pred)
    roc_auc = metrics.auc(fpr, tpr)
    print(name + ' AUC:{}'.format(roc_auc))
    # draw the ROC curve
    plt.figure(figsize=(8, 8))
    plt.plot(fpr, tpr, 'b', label=name + ' AUC = %0.4f' % roc_auc)
    plt.ylim(0, 1)
    plt.xlim(0, 1)
    plt.legend(loc='best')
    plt.title('ROC')
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    # diagonal reference line
    plt.plot([0, 1], [0, 1], 'r--')
    plt.show()

# run the visualization (xgb_final is the tuned model built in the next cell)
roc(xgb_final, x_val, y_val, 'validation set')
————————————————————————————————————————————————————————
validation set AUC:0.722190947212997
- Build the final model with the optimized parameters (rounded)
xgb_best = xgb.XGBClassifier(colsample_bytree=0.4, gamma=5.15,
                             max_delta_step=1.41, max_depth=9,
                             min_child_weight=13, subsample=1.0)
xgb_final = xgb_best.fit(x_train, y_train)
1.5.4 Tuning LGBM
- Bayesian optimization
def LGB_bayesian(num_leaves, bagging_fraction, feature_fraction, min_child_weight,
                 min_data_in_leaf, max_depth, reg_alpha, reg_lambda):
    param = {'num_leaves': int(num_leaves),          # tree-structure params must be ints
             'min_data_in_leaf': int(min_data_in_leaf),
             'min_child_weight': min_child_weight,
             'bagging_fraction': bagging_fraction,
             'feature_fraction': feature_fraction,
             'max_depth': int(max_depth),
             'reg_alpha': reg_alpha,
             'reg_lambda': reg_lambda,
             'objective': 'binary',
             'save_binary': True,
             'data_random_seed': 1337,
             'boosting_type': 'gbdt',
             'verbose': 1,
             'is_unbalance': False,
             'boost_from_average': True,
             'metric': 'auc'}
    trn_data = lgb.Dataset(x_train, label=y_train)
    val_data = lgb.Dataset(x_val, label=y_val)
    lgbm = lgb.train(param, trn_data, num_boost_round=300,
                     valid_sets=[trn_data, val_data], verbose_eval=50, early_stopping_rounds=200)
    pred_val = lgbm.predict(x_val, num_iteration=lgbm.best_iteration)
    score = roc_auc_score(y_val, pred_val)
    return score

bounds_LGB = {
    'num_leaves': (31, 500),
    'min_data_in_leaf': (20, 200),
    'bagging_fraction': (0.1, 0.9),
    'feature_fraction': (0.1, 0.9),
    'min_child_weight': (0.00001, 0.01),
    'reg_alpha': (1, 2),
    'reg_lambda': (1, 2),
    'max_depth': (-1, 50),
}
lgb_op = BayesianOptimization(LGB_bayesian, bounds_LGB)
lgb_op.maximize(init_points=10, n_iter=10, acq='ucb', xi=0.0, alpha=1e-6)
print(lgb_op.max)
————————————————————————————————————————————————————————
# best parameters found
#  iter  |  target  | baggin... | featur... | max_depth |
#  11    |  0.7234  |  0.1685   |  0.6071   |  45.41    |
#  min_ch... | min_da... | num_le... | reg_alpha | reg_la... |
#  0.000838  |  199.2    |  36.56    |  1.77     |  1.748
1.5.5 Prediction and Writing the Submission File
- Use the tuned LightGBM model as the final model
lgb_final = lgb.LGBMClassifier(random_state=2021, learning_rate=0.1,
                               n_estimators=300, bagging_fraction=0.17, feature_fraction=0.6,
                               min_child_weight=0.00084, max_depth=45, min_data_in_leaf=200,
                               num_leaves=37, reg_alpha=1.77, reg_lambda=1.75)
lgb_final.fit(x_train, y_train)
pre_test = lgb_final.predict(test)
proba_test = lgb_final.predict_proba(test)
- Write the results
df_predictions = pd.DataFrame()
sample = pd.read_csv('sample_submit.csv', encoding='utf-8')
df_predictions['id'] = sample['id']
df_predictions['isDefault'] = proba_test[:, 1]
df_predictions[['id', 'isDefault']].to_csv('03submit.csv', index=False)
- Running the whole pipeline, LightGBM edges out XGBoost, and XGBoost beats logistic regression. It's getting late (driving exam tomorrow morning), so I'll keep improving the models another day.