Antai Cup: FinTech Learning Competition

Preface

In the era of big data, data analysis permeates many fields, and using analytical methods to mine the value of data and provide a reliable basis for business execution and decision-making is increasingly important. This article tackles a financial lending problem: predicting whether a borrower will default on a loan.

1.1 Jupyter Setup, Imports, and Dataset Loading

Import the required modules.

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.exceptions import ConvergenceWarning
import sklearn
import pandas_profiling

Suppress warnings.

warnings.filterwarnings('ignore')
warnings.filterwarnings(action ='ignore',category=ConvergenceWarning)   

Set a Chinese font for matplotlib and seaborn to avoid garbled CJK labels.

mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False
sns.set(font='SimHei')

Configure pandas display options in Jupyter.

pd.options.display.min_rows = None
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', 30)

Load the data.

df_train = pd.read_csv('train.csv', encoding='utf-8')
df_test = pd.read_csv('testA.csv', encoding='utf-8')

1.2 Exploratory Analysis

1.2.1 Preview the Dataset

  • Preview the dataset
  • The training set has 47 features
pd.concat([df_train.head(5), df_train.tail(5)])
————————————————————————————————————————————————————————
	id	loanAmnt	term	interestRate	installment	grade	subGrade	employmentTitle	employmentLength	homeOwnership	annualIncome	verificationStatus	issueDate	isDefault	purpose	...	n0	n1	n2	n3	n4	n5	n6	n7	n8	n9	n10	n11	n12	n13	n14
0	0	35000.0	5	19.52	917.97	E	E2	320.0	2 years	2	110000.0	2	2014-07-01	1	1	...	0.0	2.0	2.0	2.0	4.0	9.0	8.0	4.0	12.0	2.0	7.0	0.0	0.0	0.0	2.0
1	1	18000.0	5	18.49	461.90	D	D2	219843.0	5 years	0	46000.0	2	2012-08-01	0	0	...	NaN	NaN	NaN	NaN	10.0	NaN	NaN	NaN	NaN	NaN	13.0	NaN	NaN	NaN	NaN
2	2	12000.0	5	16.99	298.17	D	D3	31698.0	8 years	0	74000.0	2	2015-10-01	0	0	...	0.0	0.0	3.0	3.0	0.0	0.0	21.0	4.0	5.0	3.0	11.0	0.0	0.0	0.0	4.0
3	3	11000.0	3	7.26	340.96	A	A4	46854.0	10+ years	1	118000.0	1	2015-08-01	0	4	...	6.0	4.0	6.0	6.0	4.0	16.0	4.0	7.0	21.0	6.0	9.0	0.0	0.0	0.0	1.0
4	4	3000.0	3	12.99	101.07	C	C2	54.0	NaN	1	29000.0	2	2016-03-01	0	10	...	1.0	2.0	7.0	7.0	2.0	4.0	9.0	10.0	15.0	7.0	12.0	0.0	0.0	0.0	4.0
799995	799995	25000.0	3	14.49	860.41	C	C4	2659.0	7 years	1	72000.0	0	2016-07-01	0	0	...	0.0	5.0	10.0	10.0	6.0	6.0	2.0	12.0	13.0	10.0	14.0	0.0	0.0	0.0	3.0
799996	799996	17000.0	3	7.90	531.94	A	A4	29205.0	10+ years	0	99000.0	2	2013-04-01	0	4	...	0.0	2.0	2.0	2.0	2.0	15.0	16.0	2.0	19.0	2.0	7.0	0.0	0.0	0.0	0.0
799997	799997	6000.0	3	13.33	203.12	C	C3	2582.0	10+ years	1	65000.0	2	2015-10-01	1	0	...	2.0	1.0	4.0	4.0	1.0	4.0	26.0	4.0	10.0	4.0	5.0	0.0	0.0	1.0	4.0
799998	799998	19200.0	3	6.92	592.14	A	A4	151.0	10+ years	0	96000.0	2	2015-02-01	0	4	...	0.0	5.0	8.0	8.0	7.0	10.0	6.0	12.0	22.0	8.0	16.0	0.0	0.0	0.0	5.0
799999	799999	9000.0	3	11.06	294.91	B	B3	13.0	5 years	0	120000.0	0	2018-08-01	0	4	...	2.0	2.0	3.0	3.0	2.0	3.0	4.0	4.0	8.0	3.0	7.0	0.0	0.0	0.0	2.0
10 rows × 47 columns
  • Preview summary statistics
df_train.describe()
————————————————————————————————————————————————————————————————————————
	id	loanAmnt	term	interestRate	installment	employmentTitle	homeOwnership	annualIncome	verificationStatus	isDefault	purpose	postCode	regionCode	dti	delinquency_2years	...	n0	n1	n2	n3	n4	n5	n6	n7	n8	n9	n10	n11	n12	n13	n14
count	800000.000000	800000.000000	800000.000000	800000.000000	800000.000000	799999.000000	800000.000000	8.000000e+05	800000.000000	800000.000000	800000.000000	799999.000000	800000.000000	799761.000000	800000.000000	...	759730.000000	759730.000000	759730.000000	759730.000000	766761.000000	759730.000000	759730.000000	759730.000000	759729.000000	759730.000000	766761.000000	730248.000000	759730.000000	759730.000000	759730.000000
mean	399999.500000	14416.818875	3.482745	13.238391	437.947723	72005.351714	0.614213	7.613391e+04	1.009683	0.199513	1.745982	258.535648	16.385758	18.284557	0.318239	...	0.511932	3.642330	5.642648	5.642648	4.735641	8.107937	8.575994	8.282953	14.622488	5.592345	11.643896	0.000815	0.003384	0.089366	2.178606
std	230940.252015	8716.086178	0.855832	4.765757	261.460393	106585.640204	0.675749	6.894751e+04	0.782716	0.399634	2.367453	200.037446	11.036679	11.150155	0.880325	...	1.333266	2.246825	3.302810	3.302810	2.949969	4.799210	7.400536	4.561689	8.124610	3.216184	5.484104	0.030075	0.062041	0.509069	1.844377
min	0.000000	500.000000	3.000000	5.310000	15.690000	0.000000	0.000000	0.000000e+00	0.000000	0.000000	0.000000	0.000000	0.000000	-1.000000	0.000000	...	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	199999.750000	8000.000000	3.000000	9.750000	248.450000	427.000000	0.000000	4.560000e+04	0.000000	0.000000	0.000000	103.000000	8.000000	11.790000	0.000000	...	0.000000	2.000000	3.000000	3.000000	3.000000	5.000000	4.000000	5.000000	9.000000	3.000000	8.000000	0.000000	0.000000	0.000000	1.000000
50%	399999.500000	12000.000000	3.000000	12.740000	375.135000	7755.000000	1.000000	6.500000e+04	1.000000	0.000000	0.000000	203.000000	14.000000	17.610000	0.000000	...	0.000000	3.000000	5.000000	5.000000	4.000000	7.000000	7.000000	7.000000	13.000000	5.000000	11.000000	0.000000	0.000000	0.000000	2.000000
75%	599999.250000	20000.000000	3.000000	15.990000	580.710000	117663.500000	1.000000	9.000000e+04	2.000000	0.000000	4.000000	395.000000	22.000000	24.060000	0.000000	...	0.000000	5.000000	7.000000	7.000000	6.000000	11.000000	11.000000	10.000000	19.000000	7.000000	14.000000	0.000000	0.000000	0.000000	3.000000
max	799999.000000	40000.000000	5.000000	30.990000	1715.420000	378351.000000	5.000000	1.099920e+07	2.000000	1.000000	13.000000	940.000000	50.000000	999.000000	39.000000	...	51.000000	33.000000	63.000000	63.000000	49.000000	70.000000	132.000000	79.000000	128.000000	45.000000	82.000000	4.000000	4.000000	39.000000	30.000000
8 rows × 42 columns
  • Preview data types
df_train.info()
————————————————————————————————————————————————————————————————————————
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800000 entries, 0 to 799999
Data columns (total 47 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   id                  800000 non-null  int64  
 1   loanAmnt            800000 non-null  float64
 2   term                800000 non-null  int64  
 3   interestRate        800000 non-null  float64
 4   installment         800000 non-null  float64
 5   grade               800000 non-null  object 
 6   subGrade            800000 non-null  object 
 7   employmentTitle     799999 non-null  float64
 8   employmentLength    753201 non-null  object 
 9   homeOwnership       800000 non-null  int64  
 10  annualIncome        800000 non-null  float64
 11  verificationStatus  800000 non-null  int64  
 12  issueDate           800000 non-null  object 
 13  isDefault           800000 non-null  int64  
 14  purpose             800000 non-null  int64  
 15  postCode            799999 non-null  float64
 16  regionCode          800000 non-null  int64  
 17  dti                 799761 non-null  float64
 18  delinquency_2years  800000 non-null  float64
 19  ficoRangeLow        800000 non-null  float64
 20  ficoRangeHigh       800000 non-null  float64
 21  openAcc             800000 non-null  float64
 22  pubRec              800000 non-null  float64
 23  pubRecBankruptcies  799595 non-null  float64
 24  revolBal            800000 non-null  float64
 25  revolUtil           799469 non-null  float64
 26  totalAcc            800000 non-null  float64
 27  initialListStatus   800000 non-null  int64  
 28  applicationType     800000 non-null  int64  
 29  earliesCreditLine   800000 non-null  object 
 30  title               799999 non-null  float64
 31  policyCode          800000 non-null  float64
 32  n0                  759730 non-null  float64
 33  n1                  759730 non-null  float64
 34  n2                  759730 non-null  float64
 35  n3                  759730 non-null  float64
 36  n4                  766761 non-null  float64
 37  n5                  759730 non-null  float64
 38  n6                  759730 non-null  float64
 39  n7                  759730 non-null  float64
 40  n8                  759729 non-null  float64
 41  n9                  759730 non-null  float64
 42  n10                 766761 non-null  float64
 43  n11                 730248 non-null  float64
 44  n12                 759730 non-null  float64
 45  n13                 759730 non-null  float64
 46  n14                 759730 non-null  float64
dtypes: float64(33), int64(9), object(5)
  • Preview the dimensions of the training and test sets
df_train.shape,df_test.shape
————————————————————————————————————————————————————————————————————————
((800000, 47), (200000, 46))
  • Count and distribution of missing values
df_train.isnull().sum()
# build `combined` by stacking train (without the label) and test for joint preprocessing
combined = pd.concat([df_train.drop('isDefault', axis=1), df_test], ignore_index=True)
missing_pct = combined.isnull().sum() * 100 / len(combined)  # per-column missing percentage
missing = pd.DataFrame({
    'name': combined.columns,
    'missing_pct': missing_pct,
})
missing.sort_values(by='missing_pct', ascending=False).head()
————————————————————————————————————————————————————————
name	missing_pct
n11	n11	8.7192
employmentLength	employmentLength	5.8307
n8	n8	5.0322
n14	n14	5.0321
n3	n3	5.0321
1.2.2 Default Counts and Distribution

  • As the plots show, the classes are imbalanced: negative (non-default) samples account for 80%
fig, ax = plt.subplots(1, 2, figsize=(15, 8))
sns.countplot(x='isDefault', data=df_train, ax=ax[0], palette=['m', 'r'])
df_train['isDefault'].value_counts().plot.pie(autopct='%1.1f%%', ax=ax[1], colors=['orange', 'gray'])
ax[0].set_ylabel('')
ax[0].set_xlabel('isDefault')
ax[1].set_ylabel('')
ax[1].set_xlabel('isDefault')
plt.show()

[Figure: count plot and pie chart of isDefault]

1.2.3 Loan Amount, Term, Interest Rate vs. Default

  • The mean loan amount of defaulted loans is higher than that of non-defaulted loans
  • Most loans in the sample have a 3-year term, and 3-year loans account for more defaults than 5-year loans
  • The mean interest rate of defaulted loans is noticeably higher

[Figure: mean loan amount, term counts, and mean interest rate by default status]

plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
df_train.groupby('isDefault')['loanAmnt'].mean().plot.bar(color=['m', 'c'])
plt.ylabel('loanAmnt')
plt.subplot(1, 3, 2)
sns.countplot(x='term', hue='isDefault', data=df_train, palette=['orange', 'g'])
plt.subplot(1, 3, 3)
df_train.groupby('isDefault')['interestRate'].mean().plot.bar(color=['k', 'r'])
plt.ylabel('interestRate')
plt.show()
  • Among non-defaults, 3-year loans are mostly under 15,000 while 5-year loans are mostly above 15,000
  • The distribution among defaults is similar
fig = plt.figure(figsize=(25, 10))
sns.violinplot(x='term', y='loanAmnt',
               hue='isDefault', data=df_train,
               split=True,
               palette={0: "r", 1: "g"})
plt.title('Loan amount by term', fontsize=25)
plt.xlabel('term', fontsize=22)
plt.ylabel('loanAmnt', fontsize=22)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=20)

[Figure: loan amount by term, split by default status]

1.2.4 Loan Amount, Grade, Subgrade vs. Default

  • Mean loan amount trends upward with loan grade; subgrades behave similarly
  • Grades C, D, B, and E account for more defaults
plt.figure(figsize=(15, 8))
plt.subplot(2, 2, 1)
df_train.groupby('grade')['loanAmnt'].mean().plot.bar(color=['m', 'c'])
plt.ylabel('loanAmnt')
plt.subplot(2, 2, 2)
df_train.groupby('subGrade')['loanAmnt'].mean().plot.bar(color=['m', 'c'])
plt.ylabel('loanAmnt')
plt.subplot(2, 2, 3)
sns.countplot(x='grade', hue='isDefault', data=df_train, palette=['orange', 'g'])
plt.subplot(2, 2, 4)
sns.countplot(x='subGrade', hue='isDefault', data=df_train, palette=['orange', 'c'])
plt.ylabel('subGrade')

[Figure: mean loan amount and default counts by grade and subgrade]

  • Across grades and subgrades, the loan-amount distributions of defaults and non-defaults are similar
plt.figure(figsize=(20, 15))
plt.subplot(2, 1, 1)
sns.violinplot(x='grade', y='loanAmnt',
               hue='isDefault', data=df_train,
               split=True,
               palette={0: "orange", 1: "b"})
plt.title('Loan amount by grade', fontsize=25)
plt.xlabel('grade', fontsize=22)
plt.ylabel('loanAmnt', fontsize=22)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=20)
plt.subplot(2, 1, 2)
sns.violinplot(x='subGrade', y='loanAmnt',
               hue='isDefault', data=df_train,
               split=True,
               palette={0: "orange", 1: "b"})
plt.title('Loan amount by subgrade', fontsize=25)
plt.xlabel('subGrade', fontsize=22)
plt.ylabel('loanAmnt', fontsize=22)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=20)

[Figure: loan amount by grade and subgrade, split by default status]

1.2.5 Annual Income, Employment Length vs. Default

  • Looking at annual income against employment length, the distribution is fairly even overall
plt.figure(figsize=(15, 8))
sns.barplot(x='employmentLength', y='annualIncome', hue='isDefault',
            data=df_train, palette=['g', 'c'])

[Figure: annual income by employment length, split by default status]

1.2.6 Loan Amount, Debt-to-Income Ratio vs. Default

  • Defaulted samples have a higher debt-to-income ratio on average, suggesting borrowers carrying debt from multiple sources are more likely to end up in default
plt.figure(figsize=(15, 8))
df_train.groupby('isDefault')['dti'].mean().plot.bar(color=['r', 'b'])
plt.ylabel('dti')

[Figure: mean dti by default status]

  • Defaults occur mostly where dti lies in the 100 to 200 range
  • There are also cases with dti above 200
plt.figure(figsize=(20, 8))
ax = plt.subplot()
default = df_train[df_train['isDefault'] == 1]
non_default = df_train[df_train['isDefault'] == 0]
ax.scatter(default['dti'], default['loanAmnt'],
           color='r', s=default['loanAmnt'] / 100)
ax.scatter(non_default['dti'], non_default['loanAmnt'],
           color='k', s=non_default['loanAmnt'] / 100)

[Figure: loan amount vs. dti scatter, defaults in red]

1.3 Data Cleaning

1.3.1 Handling Missing Values

Features with missing values, identified in 1.2.1:

  • ['employmentTitle', 'employmentLength', 'postCode', 'dti', 'pubRecBankruptcies', 'revolUtil', 'title', 'n0', 'n1', 'n2', 'n3', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']

  • Fill employmentTitle, employmentLength, postCode, pubRecBankruptcies, and title with the mode

  • Fill dti and revolUtil with the mean

# constants are the mode values observed earlier in the data
combined['employmentLength'] = combined['employmentLength'].fillna('10+ years')
combined['employmentTitle'] = combined['employmentTitle'].fillna(54)
combined['postCode'] = combined['postCode'].fillna(134)
combined['dti'] = combined['dti'].fillna(combined['dti'].mean())
combined['pubRecBankruptcies'] = combined['pubRecBankruptcies'].fillna(0)
combined['title'] = combined['title'].fillna(0)
combined['revolUtil'] = combined['revolUtil'].fillna(combined['revolUtil'].mean())
  • Fill the n-series missing values with the mode
n_features = [col for col in combined.columns if col.startswith('n')]
modes = combined[n_features].mode().values
modes_box = []
for i in range(len(n_features)):
    modes_box.append(modes[0][i])
combined[n_features] = combined[n_features].fillna(dict(zip(n_features, modes_box)))
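The loop above can be collapsed into a single vectorized call: `mode()` returns one row per candidate mode, and `iloc[0]` takes the first mode of each column. A minimal sketch on a stand-in frame (the real code would use `combined[n_features]`):

```python
import numpy as np
import pandas as pd

# Stand-in n-series columns with gaps; the real code would use combined[n_features].
df = pd.DataFrame({'n0': [0.0, 0.0, np.nan, 1.0],
                   'n1': [2.0, np.nan, 2.0, 3.0]})

# fillna with the per-column mode in one shot
df = df.fillna(df.mode().iloc[0])
print(df['n0'].tolist(), df['n1'].tolist())  # [0.0, 0.0, 0.0, 1.0] [2.0, 2.0, 2.0, 3.0]
```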

1.3.2 Handling Outliers

  • The boxplots show outliers in employmentTitle, annualIncome, revolBal, title, and the n-series; these are handled one by one during feature engineering
num_cols = combined.select_dtypes(include='number').columns  # numeric features for the boxplots
fig, ax = plt.subplots(3, 2, figsize=(20, 15))
sns.boxplot(data=combined[num_cols[:7]], ax=ax[0][0])
sns.boxplot(data=combined[num_cols[7:14]], ax=ax[0][1])
sns.boxplot(data=combined[num_cols[14:21]], ax=ax[1][0])
sns.boxplot(data=combined[num_cols[21:28]], ax=ax[1][1])
sns.boxplot(data=combined[num_cols[28:35]], ax=ax[2][0])
sns.boxplot(data=combined[num_cols[35:42]], ax=ax[2][1])

[Figure: boxplots of the numeric features]

1.4 Feature Engineering

Features to process:

  • id Unique credit identifier assigned to the loan record
  • loanAmnt Loan amount
  • term Loan term (years)
  • interestRate Interest rate
  • installment Monthly installment amount
  • grade Loan grade
  • subGrade Loan subgrade
  • employmentTitle Job title
  • employmentLength Employment length (years)
  • homeOwnership Home-ownership status provided by the borrower at registration
  • annualIncome Annual income
  • verificationStatus Verification status
  • issueDate Month the loan was issued
  • purpose Loan-purpose category provided by the borrower at application
  • postCode First 3 digits of the postal code provided by the borrower at application
  • regionCode Region code
  • dti Debt-to-income ratio
  • delinquency_2years Number of 30+ days past-due delinquency events in the borrower's credit file over the past 2 years
  • ficoRangeLow Lower bound of the borrower's FICO range at loan issuance
  • ficoRangeHigh Upper bound of the borrower's FICO range at loan issuance
  • openAcc Number of open credit lines in the borrower's credit file
  • pubRec Number of derogatory public records
  • pubRecBankruptcies Number of public-record bankruptcies
  • revolBal Total revolving credit balance
  • revolUtil Revolving-line utilization rate, i.e. credit used relative to all available revolving credit
  • totalAcc Total number of credit lines currently in the borrower's credit file
  • initialListStatus Initial listing status of the loan
  • applicationType Whether the loan is an individual application or a joint application with two co-borrowers
  • earliesCreditLine Month the borrower's earliest reported credit line was opened
  • title Loan title provided by the borrower
  • policyCode policy_code = 1 for publicly available products, policy_code = 2 for new products not publicly available
  • n-series Anonymized features n0-n14, derived counts of borrower behavior

1.4.1 Processing id, policyCode, grade

  • Drop these features: policyCode has only one value, and grade is redundant with subGrade
combined.drop(['id','policyCode','grade'],axis=1,inplace=True)

1.4.2 Processing earliesCreditLine, issueDate

  • Convert earliesCreditLine to a date format
def transfer_earliesCreditLine():
    global combined
    # map month abbreviations ('Aug-2001') to 'YYYY-MM-DD' strings
    month_map = dict(zip(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], range(1, 13)))
    dateline = []
    for k in combined['earliesCreditLine']:
        dateline.append('{}-{:02d}-01'.format(k[-4:], month_map[k[:3]]))
    combined['earliesCreditLine'] = dateline
    return combined
combined = transfer_earliesCreditLine()
combined['earliesCreditLine']
————————————————————————————————————————————————————————
0         2001-08-01
1         2002-05-01
2         2006-05-01
3         1999-05-01
4         1977-08-01
             ...    
199995    1996-11-01
199996    1994-09-01
199997    1988-04-01
199998    2003-05-01
199999    1997-10-01
Name: earliesCreditLine, Length: 1000000, dtype: object
  • Compute the elapsed time to build a new feature, sur_years
from datetime import datetime

def date_modify(date):
    year = int(str(date)[:4])
    month = int(str(date)[5:7])
    day = int(str(date)[8:10])
    if month < 1:
        month = 1
    return datetime(year, month, day)

combined['issueDate'] = combined['issueDate'].apply(date_modify)
combined['earliesCreditLine'] = combined['earliesCreditLine'].apply(date_modify)
sur_days = (combined['issueDate'] - combined['earliesCreditLine']).dt.days
combined['sur_years'] = round(sur_days / 365, 1)
del combined['issueDate']
del combined['earliesCreditLine']
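For reference, the two steps above (parsing the 'Mon-YYYY' strings and computing the elapsed years) can also be done with pandas' own date parsing; a sketch on stand-in values (the real columns are combined['earliesCreditLine'] and combined['issueDate']):

```python
import pandas as pd

# Stand-in values in the two formats that occur in the data.
ecl = pd.to_datetime(pd.Series(['Aug-2001', 'May-2002']), format='%b-%Y')
issue = pd.to_datetime(pd.Series(['2014-07-01', '2012-08-01']))

# elapsed years between the first credit line and loan issuance
sur_years = ((issue - ecl).dt.days / 365).round(1)
print(sur_years.tolist())  # [12.9, 10.3]
```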

1.4.3 Processing Categorical Features (term, homeOwnership, etc.)

  • One-hot encode: term, homeOwnership, verificationStatus, purpose, regionCode, initialListStatus, applicationType, subGrade, employmentLength

  • Preprocess employmentLength with a regular expression

import re

employmentLength = []
for i in combined['employmentLength']:
    v = int(re.findall('[0-9]+', i)[0])
    employmentLength.append(v)
combined['employmentLength'] = employmentLength
combined['employmentLength']
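The same extraction works without an explicit loop via pandas string methods; note that, exactly like the regex loop above, this maps '< 1 year' to 1 and '10+ years' to 10:

```python
import pandas as pd

# Stand-in values covering the formats that occur in employmentLength.
s = pd.Series(['2 years', '10+ years', '< 1 year'])

# pull the first run of digits out of each string
years = s.str.extract(r'(\d+)')[0].astype(int)
print(years.tolist())  # [2, 10, 1]
```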
  • Run the one-hot encoding; afterwards there are 158 features
def dummies_coder():  
    global combined
    for name in ['term','homeOwnership','verificationStatus',
                 'purpose','regionCode','initialListStatus','applicationType',
                 'subGrade','employmentLength']:
        df_dummies = pd.get_dummies(combined[name],prefix=name)
        combined = pd.concat([combined,df_dummies],axis=1)
        combined.drop(name,axis=1,inplace=True)
    return combined
combined =dummies_coder()
combined.shape
————————————————————————————————————————————————————————
(1000000, 158)

1.4.4 Processing Numerical Features (loanAmnt, interestRate, etc.)

  • Preprocess ficoRangeLow, ficoRangeHigh
combined['ficoRange'] = combined['ficoRangeLow'] + combined['ficoRangeHigh']
combined.drop(['ficoRangeLow', 'ficoRangeHigh'], axis=1, inplace=True)
  • Bin these features
# features to bin
modify_features = ['loanAmnt', 'interestRate', 'installment', 'annualIncome', 'postCode', 'dti',
                   'delinquency_2years', 'pubRec', 'openAcc', 'pubRecBankruptcies', 'revolBal',
                   'revolUtil', 'totalAcc', 'title']
  • Get thresholds at the 25%, 50%, and 75% points of the distribution
# inner function: thresholds at the 25% / 50% / 75% points of the sorted values
def get_threshold(feature):
    global combined
    values = combined[feature].sort_values().values
    n = len(values)
    return [values[n // 4 - 1], values[n // 2 - 1], values[3 * n // 4 - 1]]

# outer function: apply the binning
def modify_df():
    global combined, modify_features
    for feature in modify_features:
        threshold_values = get_threshold(feature)
        combined.loc[combined[feature] < threshold_values[0], feature] = 0
        combined.loc[(combined[feature] >= threshold_values[0]) & (combined[feature] < threshold_values[1]),
                     feature] = 1
        combined.loc[(combined[feature] >= threshold_values[1]) & (combined[feature] < threshold_values[2]),
                     feature] = 2
        combined.loc[combined[feature] >= threshold_values[2], feature] = 3
    return combined
combined = modify_df()
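pd.qcut offers a built-in equivalent of this quartile binning (edge handling at tied values differs slightly from the manual thresholds); a sketch on a stand-in series:

```python
import pandas as pd

# Stand-in numeric column; the notebook applies this to loanAmnt, dti, etc.
s = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

# Four quantile bins labelled 0..3, mirroring the manual 25/50/75% thresholds.
binned = pd.qcut(s, q=4, labels=False)
print(binned.tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]
```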
  • At this point all of the binning is done
combined.head()
————————————————————————————————————————————————————————
loanAmnt	interestRate	installment	employmentTitle	annualIncome	postCode	dti	delinquency_2years	ficoRangeLow	ficoRangeHigh	openAcc	pubRec	pubRecBankruptcies	revolBal	revolUtil	...	subGrade_G1	subGrade_G2	subGrade_G3	subGrade_G4	subGrade_G5	employmentLength_1	employmentLength_2	employmentLength_3	employmentLength_4	employmentLength_5	employmentLength_6	employmentLength_7	employmentLength_8	employmentLength_9	employmentLength_10
0	3.0	3.0	3.0	320.0	3.0	1.0	1.0	3.0	730.0	734.0	0.0	3.0	3.0	3.0	1.0	...	0	0	0	0	0	0	1	0	0	0	0	0	0	0	0
1	2.0	3.0	2.0	219843.0	1.0	1.0	3.0	3.0	700.0	704.0	2.0	3.0	3.0	2.0	1.0	...	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0
2	2.0	3.0	1.0	31698.0	2.0	2.0	2.0	3.0	675.0	679.0	2.0	3.0	3.0	0.0	1.0	...	0	0	0	0	0	0	0	0	0	0	0	0	1	0	0
3	1.0	0.0	1.0	46854.0	3.0	1.0	1.0	3.0	685.0	689.0	1.0	3.0	3.0	1.0	2.0	...	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1
4	0.0	2.0	0.0	54.0	0.0	2.0	3.0	3.0	690.0	694.0	2.0	3.0	3.0	0.0	0.0	...	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1
5 rows × 158 columns

1.4.5 Processing the n-Series

  • The pair plot shows that n2 and n3 are perfectly linearly related, while the other features need further characterization
sns.pairplot(combined[n_features])  # pairplot creates its own figure
plt.show()

[Figure: pair plot of the n-series features]

  • Correlation analysis: a coefficient of 0.75 or above is treated as highly correlated
  • n1 is highly correlated with n2, n3, n4, n9
  • n2 is highly correlated with n1, n3, n7, n9
  • n5 is highly correlated with n8
  • n7 is highly correlated with n2, n3, n8, n9, n10
  • n8 is highly correlated with n5, n7
plt.figure(figsize=(15,8))
sns.heatmap(combined[n_features].corr(),annot=True)
plt.show()

[Figure: correlation heatmap of the n-series features]

  • Drop n2, n3, n4, n8, n9, n10
combined.drop(['n2','n3','n4','n8','n9','n10'],axis=1,inplace=True)
n_featuresrv=['n0','n1','n5','n6','n7','n11', 'n12', 'n13','n14']
combined.shape
————————————————————————————————————————————————————————
(1000000, 151)
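The manual reading of the heatmap can also be automated: scan the upper triangle of the correlation matrix and collect every column whose correlation with an earlier column reaches the 0.75 cutoff. A sketch on a stand-in frame with one perfectly correlated group (mirroring the n2/n3 relationship):

```python
import numpy as np
import pandas as pd

# Stand-in frame: n2 and n3 are exact multiples of n1.
df = pd.DataFrame({'n1': [1, 2, 3, 4],
                   'n2': [2, 4, 6, 8],
                   'n3': [2, 4, 6, 8],
                   'n5': [5, 1, 4, 2]})

corr = df.corr().abs()
# keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] >= 0.75).any()]
print(to_drop)  # ['n2', 'n3']
```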

1.5 Model Training

  • Import the required modules
import sklearn
from sklearn.model_selection import GridSearchCV,train_test_split
from sklearn.preprocessing import MinMaxScaler,StandardScaler,Normalizer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score,roc_auc_score,make_scorer
from sklearn.model_selection import cross_val_score
from bayes_opt import BayesianOptimization
  • Split into training, validation, and test sets
train = combined[:800000]
test = combined[800000:]
targets = df_train['isDefault']  # labels come from the original training frame
x_train, x_val, y_train, y_val = train_test_split(train, targets, test_size=0.2, random_state=2021)

1.5.1 Baseline Models and Scores

  • Instantiate the classifiers
lgrcv = LogisticRegressionCV()
extree = ExtraTreesClassifier()
rf = RandomForestClassifier()
knn = KNeighborsClassifier()
xgb_clf = xgb.XGBClassifier()   # avoid shadowing the xgb module
lgb_clf = lgb.LGBMClassifier()  # avoid shadowing the lgb module
models = [extree, lgrcv, rf, knn, lgb_clf, xgb_clf]
  • Train each model individually and evaluate its AUC
for model in models:
    model = model.fit(x_train, y_train)
    predict_val = model.predict(x_val)
    print(model)
    print('val f1-score :', metrics.f1_score(y_val, predict_val))
    proba_val = model.predict_proba(x_val)[:, 1]
    fpr, tpr, thresholds = metrics.roc_curve(y_val, y_score=proba_val, pos_label=1)
    print('auc:', metrics.auc(fpr, tpr))
    print('**********************************')
  ——————————————————————————————————————————————————————
ExtraTreesClassifier()
val f1-score : 0.12774759370087427
auc: 0.7020222394446626
**********************************
LogisticRegressionCV()
val f1-score : 0.12568336494477295
auc: 0.7169715289708525
**********************************
RandomForestClassifier()
val f1-score : 0.11396893291310219
auc: 0.7066270946062161
**********************************
KNeighborsClassifier()
val f1-score : 0.19608019621520667
auc: 0.6092136514038466
**********************************
LGBMClassifier()
val f1-score : 0.14222881961625158
auc: 0.7214696572215835
**********************************
XGBClassifier()
val f1-score : 0.09031511997687193
auc: 0.7141476229518823
**********************************
  • The results show KNN performs worst and takes the longest to train
  • Logistic regression, LGBM, and XGB perform well
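As an aside, sklearn's roc_auc_score computes the same number as the roc_curve + auc pair used in the evaluation loop above, straight from the predicted probabilities:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores; the loop above would pass y_val and the predict_proba output.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
print(roc_auc_score(y_true, y_score))  # 0.75
```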

1.5.2 Tuning Logistic Regression

  • Preprocessing before training
mms = MinMaxScaler()
norm = Normalizer()
ss = StandardScaler()
x_train = mms.fit_transform(x_train)
x_val = mms.transform(x_val)  # transform validation data with the training statistics
  • Tune the parameters; with Normalizer the AUC improves slightly
lgr = LogisticRegressionCV(fit_intercept=True, Cs=np.logspace(-5, 1, 100),
                           multi_class='multinomial', penalty='l2', solver='lbfgs')
lgr = lgr.fit(x_train, y_train)
val_predict = lgr.predict(x_val)
accuracy_score = metrics.accuracy_score(y_val, val_predict)
auc_score = roc_auc_score(y_val, val_predict)  # note: computed on hard labels, not probabilities
print(f'accuracy_score:{accuracy_score}')
print(f'auc_score:{auc_score}')
————————————————————————————————————————————————————————
# no preprocessing, default params
# accuracy_score:0.73523125
# auc_score:0.6116656749736118
# no preprocessing, tuned params
# accuracy_score:0.70948125
# auc_score:0.627097088518769
# Normalizer
# accuracy_score:0.70948125
# auc_score:0.627097088518769
# StandardScaler
# accuracy_score:0.8006375
# auc_score:0.5008976649807645
# MinMaxScaler
# accuracy_score:0.44828125
# auc_score:0.6023187278237851
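One caveat with the scaler experiments above: the scaler must be fitted on the training data only, with the validation data transformed by the same fitted scaler. An sklearn Pipeline enforces this automatically; a minimal sketch on toy data (the data here is illustrative, not from the competition):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data; the notebook's x_train / y_train would go here instead.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

# The pipeline fits the scaler on the training split only, then
# applies the same transformation when scoring the held-out split.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X[:150], y[:150])
print(pipe.score(X[150:], y[150:]))
```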

1.5.3 Tuning XGB

  • Bayesian optimization
import time

def BO_xgb(x, y):
    t1 = time.time()  # time.clock() was removed in Python 3.8

    def xgb_cv(max_depth, gamma, min_child_weight, max_delta_step, subsample, colsample_bytree):
        paramt = {'booster': 'gbtree',
                  'max_depth': int(max_depth),
                  'gamma': gamma,
                  'eta': 0.1,
                  'objective': 'binary:logistic',
                  'nthread': 4,
                  'eval_metric': 'auc',
                  'subsample': max(min(subsample, 1), 0),
                  'colsample_bytree': max(min(colsample_bytree, 1), 0),
                  'min_child_weight': min_child_weight,
                  'max_delta_step': int(max_delta_step),
                  'seed': 1001}
        model = xgb.XGBClassifier(**paramt)
        res = cross_val_score(model, x, y, scoring='roc_auc', cv=5).mean()
        return res

    cv_params = {'max_depth': (5, 12),
                 'gamma': (0.001, 10.0),
                 'min_child_weight': (0, 20),
                 'max_delta_step': (0, 10),
                 'subsample': (0.4, 1.0),
                 'colsample_bytree': (0.4, 1.0)}
    xgb_op = BayesianOptimization(xgb_cv, cv_params)
    xgb_op.maximize(n_iter=20)
    print(xgb_op.max)

    t2 = time.time()
    print('elapsed:', (t2 - t1))
    return xgb_op.max

best_params = BO_xgb(x_train, y_train)

[Figure: Bayesian optimization iteration log]

  • The optimized parameters:
{'target': 0.7189167233509909,
 'params': {'colsample_bytree': 0.4,
            'gamma': 5.152572418614645,
            'max_delta_step': 1.4062212725181387,
            'max_depth': 8.845615277804477,
            'min_child_weight': 13.505828422801075,
            'subsample': 1.0}}
  • Visualize the ROC curve on the validation set
# ROC visualization
def roc(m, x, y, name):
    y_pred = m.predict_proba(x)[:, 1]
    # predict and compute the ROC metrics
    fpr, tpr, threshold = metrics.roc_curve(y, y_pred)
    roc_auc = metrics.auc(fpr, tpr)
    print(name + ' AUC:{}'.format(roc_auc))
    # plot the ROC curve
    plt.figure(figsize=(8, 8))
    plt.plot(fpr, tpr, 'b', label=name + ' AUC = %0.4f' % roc_auc)
    plt.ylim(0, 1)
    plt.xlim(0, 1)
    plt.legend(loc='best')
    plt.title('ROC')
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    # draw the diagonal
    plt.plot([0, 1], [0, 1], 'r--')
    plt.show()
# run the visualization on the fitted XGB classifier from the baseline loop
roc(xgb_clf, x_val, y_val, 'validation set')
————————————————————————————————————————————————————————
validation set AUC:0.722190947212997

[Figure: ROC curve on the validation set]

  • Build the final model
xgb_final = xgb.XGBClassifier(colsample_bytree=0.4, gamma=5.15,
                              max_delta_step=1.41, max_depth=9,
                              min_child_weight=13, subsample=1.0)
xgb_final = xgb_final.fit(x_train, y_train)

1.5.4 Tuning LGBM

  • Bayesian optimization
def LGB_bayesian(num_leaves, bagging_fraction, feature_fraction, min_child_weight,
                 min_data_in_leaf, max_depth, reg_alpha, reg_lambda):

    param = {'num_leaves': int(num_leaves),          # cast: the optimizer passes floats
             'min_data_in_leaf': int(min_data_in_leaf),
             'min_child_weight': min_child_weight,
             'bagging_fraction': bagging_fraction,
             'feature_fraction': feature_fraction,
             'max_depth': int(max_depth),
             'reg_alpha': reg_alpha,
             'reg_lambda': reg_lambda,
             'objective': 'binary',
             'save_binary': True,
             'data_random_seed': 1337,
             'boosting_type': 'gbdt',
             'verbose': 1,
             'is_unbalance': False,
             'boost_from_average': True,
             'metric': 'auc'}

    trn_data = lgb.Dataset(x_train, label=y_train)
    val_data = lgb.Dataset(x_val, label=y_val)

    lgbm = lgb.train(param, trn_data, num_boost_round=300,
                     valid_sets=[trn_data, val_data],
                     verbose_eval=50, early_stopping_rounds=200)

    pred_val = lgbm.predict(x_val, num_iteration=lgbm.best_iteration)
    score = roc_auc_score(y_val, pred_val)
    return score

bounds_LGB = {
    'num_leaves': (31, 500),
    'min_data_in_leaf': (20, 200),
    'bagging_fraction': (0.1, 0.9),
    'feature_fraction': (0.1, 0.9),
    'min_child_weight': (0.00001, 0.01),
    'reg_alpha': (1, 2),
    'reg_lambda': (1, 2),
    'max_depth': (-1, 50),
}
lgb_op = BayesianOptimization(LGB_bayesian, bounds_LGB)
lgb_op.maximize(init_points=10, n_iter=10, acq='ucb', xi=0.0, alpha=1e-6)
print(lgb_op.max)
————————————————————————————————————————————————————————
# best parameters
# iter     |  target   | baggin... | featur... | max_depth |
# 11       |  0.7234   |  0.1685   |  0.6071   |  45.41    |
# min_ch... | min_da... | num_le... | reg_alpha | reg_la... |
# 0.000838 |  199.2    |  36.56    |  1.77     |  1.748

1.5.5 Prediction and Submission File

  • Build the final LightGBM model
lgb_final = lgb.LGBMClassifier(random_state=2021, learning_rate=0.1,
                               n_estimators=300, bagging_fraction=0.17, feature_fraction=0.6,
                               min_child_weight=0.00084, max_depth=45, min_data_in_leaf=200,
                               num_leaves=37, reg_alpha=1.77, reg_lambda=1.75)
lgb_final.fit(x_train, y_train)
proba_test = lgb_final.predict_proba(test)
  • Write the submission file
df_predictions = pd.DataFrame()
submit = pd.read_csv('sample_submit.csv', encoding='utf-8')
df_predictions['id'] = submit['id']
df_predictions['isDefault'] = proba_test[:, 1]
df_predictions[['id', 'isDefault']].to_csv('03submit.csv', index=False)
  • Running the whole pipeline, LightGBM does somewhat better than XGBoost, and XGBoost better than logistic regression. It's late; I'll keep refining the models later.

