Preface
Titanic is a classic data-analysis project: the dataset is small and well structured, which makes it ideal for experimenting with modeling and parameter tuning. It was also my first data-mining project. Returning to this starting point after several months, I want to revise the original write-up with a more complete and systematic approach.
1.1 Jupyter Setup, Imports and Data Loading
Import the required modules.
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.exceptions import ConvergenceWarning
import sklearn
import pandas_profiling
Suppress warnings.
warnings.filterwarnings('ignore')
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)
Set a CJK-capable font for matplotlib and seaborn to avoid garbled labels.
mpl.rcParams['font.sans-serif'] = [u'SimHei']
mpl.rcParams['axes.unicode_minus'] = False
sns.set(font='SimHei')
Configure how many rows and columns Jupyter displays.
pd.options.display.min_rows = None
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', 30)
Load the data.
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
df_train.shape,df_test.shape
————————————————————————————————————————————————————————
((891, 12), (418, 11))
1.2 Exploratory Analysis
1.2.1 Preview the Dataset
- Extract the target variable Survived
- Append the test set to the training set to prepare for data cleaning and feature engineering
targets = df_train.Survived
combined = pd.concat([df_train.drop('Survived', axis=1), df_test])  # DataFrame.append is deprecated
- Preview the first and last five rows
pd.concat([combined.head(5), combined.tail(5)])
————————————————————————————————————————————————————————
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
413 1305 3 Spector, Mr. Woolf male NaN 0 0 A.5. 3236 8.0500 NaN S
414 1306 1 Oliva y Ocana, Dona. Fermina female 39.0 0 0 PC 17758 108.9000 C105 C
415 1307 3 Saether, Mr. Simon Sivertsen male 38.5 0 0 SOTON/O.Q. 3101262 7.2500 NaN S
416 1308 3 Ware, Mr. Frederick male NaN 0 0 359309 8.0500 NaN S
417 1309 3 Peter, Master. Michael J male NaN 1 1 2668 22.3583 NaN C
- Preview summary statistics
combined.describe()
————————————————————————————————————————————————————————
PassengerId Pclass Age SibSp Parch Fare
count 1309.000000 1309.000000 1046.000000 1309.000000 1309.000000 1308.000000
mean 655.000000 2.294882 29.881138 0.498854 0.385027 33.295479
std 378.020061 0.837836 14.413493 1.041658 0.865560 51.758668
min 1.000000 1.000000 0.170000 0.000000 0.000000 0.000000
25% 328.000000 2.000000 21.000000 0.000000 0.000000 7.895800
50% 655.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 982.000000 3.000000 39.000000 1.000000 0.000000 31.275000
max 1309.000000 3.000000 80.000000 8.000000 9.000000 512.329200
- Preview data types and non-null counts
combined.info()
————————————————————————————————————————————————————————
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 1309 non-null int64
1 Pclass 1309 non-null int64
2 Name 1309 non-null object
3 Sex 1309 non-null object
4 Age 1046 non-null float64
5 SibSp 1309 non-null int64
6 Parch 1309 non-null int64
7 Ticket 1309 non-null object
8 Fare 1308 non-null float64
9 Cabin 295 non-null object
10 Embarked 1307 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 122.7+ KB
- Missing-value counts and distribution
missing_pct = df_train.isnull().sum() * 100 / len(df_train)  # percentage of missing values per column
missing = pd.DataFrame({
    'name': df_train.columns,
    'missing_pct': missing_pct,
})
missing.sort_values(by='missing_pct', ascending=False).head()
————————————————————————————————————————————————————————
name missing_pct
Cabin 77.104377
Age 19.865320
Embarked 0.224467
PassengerId 0.000000
Survived 0.000000
1.2.2 Meaning of Each Feature
- PassengerId: unique identifier for each passenger
- Pclass: passenger class; three possible values: 1, 2, 3 (first, second and third class)
- Name: passenger's name
- Sex: gender
- Age: age
- SibSp: number of siblings and spouses traveling with the passenger
- Parch: number of parents and children traveling with the passenger
- Ticket: ticket number
- Fare: ticket fare
- Cabin: cabin number
- Embarked: port of embarkation; three possible values: S, C, Q
1.2.3 Target Counts and Distribution
- The survival rate on board was 38.4%, below the 61.6% death rate; the class distribution is reasonably balanced
fig, ax = plt.subplots(1, 2, figsize=(15, 8))
sns.countplot(x='Survived', data=df_train, ax=ax[0], palette=['g', 'r'])
df_train['Survived'].value_counts().plot.pie(autopct='%1.1f%%',ax=ax[1],colors=['g','r'])
ax[0].set_ylabel('')
ax[0].set_xlabel('Survived')
ax[1].set_ylabel('')
ax[1].set_xlabel('Survived')
plt.show()
1.2.4 Sex, Age and Survival
- Male survival is far lower than female survival, likely reflecting the "women and children first" policy
- By age, men aged 20~30 account for the most deaths, while children under 10 were rescued with priority
df_train['Died'] = 1-df_train['Survived']
fig,ax=plt.subplots(1,2,figsize=(15,8))
df_train.groupby('Sex')[['Survived','Died']].sum().plot.bar(ax=ax[0],
color=['m','c'],stacked=True)
plt.ylabel('count')
df_train.groupby('Survived')['Age'].plot.hist(ax=ax[1])
plt.xlabel('Age')
plt.ylabel('count')
plt.legend()
plt.show()
1.2.5 Fare, Age and Survival
- The highest fares occur among passengers aged 30~40; fares above 100 are associated with a noticeably higher chance of survival
plt.figure(figsize=(30,10))
ax = plt.subplot()
ax.scatter(df_train[df_train['Survived'] ==1]['Age'],df_train[df_train['Survived']==1]['Fare'],
color = 'purple',s=df_train[df_train['Survived']==1]['Fare'])
ax.scatter(df_train[df_train['Survived']==0]['Age'],df_train[df_train['Survived']==0]['Fare'],
color = 'k',s=df_train[df_train['Survived']==0]['Fare'])
plt.show()
1.2.6 Class, Embarkation Port, Fare and Survival
- Passengers with Pclass = 1 paid the highest average fare
- The average fare for Pclass = 2 is only slightly higher than for Pclass = 3
plt.figure()
plt.ylabel('Average fare')
df_train.groupby('Pclass').mean()['Fare'].plot(kind='bar',color=['orange','g','c'],figsize=(30,10))
- At every embarkation port, Pclass = 3 passengers have the highest death rate, while Pclass = 1 passengers always have a survival rate above their death rate
- For Pclass 2 and 3, fares stay below 100 and survival is close to 50/50; for Pclass 1, survival and death are similar in the low-fare range, but survival approaches 100% in the high-fare range, suggesting higher fares bought priority in evacuation
- Everyone who embarked at Q paid under 100, with survival near 50/50; passengers who embarked at S and paid over 100 were more likely to survive; passengers from C survived in greater numbers overall, and all who paid over 200 survived
fig,ax = plt.subplots(3,1,figsize =(20,15))
sns.violinplot(ax=ax[0],x='Embarked',y='Pclass',hue='Survived',data=df_train,split = True,
palette ={0:'r',1:'g'})
sns.violinplot(ax=ax[1],x='Pclass',y='Fare',hue='Survived',data=df_train,split = True,
palette ={0:'r',1:'g'})
sns.violinplot(ax=ax[2],x='Embarked',y='Fare',hue='Survived',data=df_train,split = True,
palette ={0:'r',1:'g'})
1.3 Data Cleaning
1.3.1 Handling Missing Values
- Features with missing values:
name missing_pct
Cabin 77.463713
Age 20.091673
Embarked 0.152788
Fare 0.076394
PassengerId 0.000000
- Define helper functions to inspect missing values and report progress
# Preview missing values
def get_missing():
    global combined
    missing_pct = combined.isnull().sum() * 100 / len(combined)  # percentage missing per column
    missing = pd.DataFrame({
        'name': combined.columns,
        'missing_pct': missing_pct})
    return missing.sort_values(by='missing_pct', ascending=False).head()

# Print 'ok' when a processing step finishes
def status(feature):
    print(f'{feature} is ok')
- Fill missing ages with the median age of passengers of the same sex and class
def get_age(row):
    global df_train
    # Median age per (Sex, Pclass) group, computed on the training set
    train_age_median = df_train.groupby(['Sex', 'Pclass'])['Age'].median().reset_index()
    condition = (
        (train_age_median['Sex'] == row['Sex']) &
        (train_age_median['Pclass'] == row['Pclass'])
    )
    return train_age_median[condition]['Age'].values[0]

def fill_age():
    global combined
    combined['Age'] = combined.apply(
        lambda row: get_age(row) if np.isnan(row['Age']) else row['Age'], axis=1)
    status('Age')
    return combined

combined = fill_age()
————————————————————————————————————————————————————————
Age is ok
- Fill missing Cabin values with 'U' for unknown
combined['Cabin']=combined['Cabin'].fillna('U')
- Fill missing Embarked values with the mode ('S')
combined['Embarked']=combined['Embarked'].fillna('S')
- Fill the single missing Fare with the training-set mean
combined['Fare']=combined['Fare'].fillna(df_train['Fare'].mean())
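Fare is heavily right-skewed, so the overall mean is pulled up by a few very large values. An alternative sketch (not what this article uses) fills each missing fare with the median fare of that passenger's Pclass; the small frame below is a toy stand-in for `combined`:

```python
import pandas as pd
import numpy as np

# Toy stand-in for the combined dataset
df = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3],
    'Fare':   [80.0, np.nan, 7.25, 8.05, np.nan],
})
# Fill each missing Fare with the median fare of that passenger's class,
# which is more robust to the skew than the global mean
df['Fare'] = df.groupby('Pclass')['Fare'].transform(lambda s: s.fillna(s.median()))
```

On the real data this would fill the one missing third-class fare with the third-class median rather than the overall mean of 33.3.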
1.3.2 Handling Outliers
- Drop PassengerId, which carries no predictive information
combined.drop('PassengerId',axis=1,inplace=True)
- Fare contains outliers, but EDA showed that higher fares correlate with higher survival, so these values are related to the target and should not be dropped lightly
num_type=combined.select_dtypes(exclude=['object'])
#['PassengerId', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
num=['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
plt.figure(figsize=(16,8))
sns.boxplot(data=combined[num])
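If the extreme fares did need taming without dropping rows, one common alternative is clipping at the IQR fences. A minimal sketch, not applied in this article:

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] instead of deleting rows."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy fares including one extreme value like the 512.33 maximum seen above
fares = pd.Series([7.25, 8.05, 13.0, 26.55, 512.33])
capped = cap_outliers_iqr(fares)
```

Clipping preserves the "this passenger paid a lot" signal while limiting its leverage on distance-based or linear models.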
1.4 Feature Engineering
1.4.1 Feature Correlation Check
- No feature pairs show very high correlation, so no special treatment is needed
plt.figure(figsize=(15,8))
sns.pairplot(df_train.drop('PassengerId',axis=1))
plt.show()
plt.figure(figsize=(15,8))
sns.heatmap(df_train.drop('PassengerId',axis=1).corr(),annot=True)
plt.show()
1.4.2 Processing Name
- Extract the title from Name, e.g. Miss, Mrs
Title_Dictionary = {'Capt':'Officer',
'Col':'Officer',
'Don':'Royalty',
'Dr':'Officer',
'Jonkheer':'Royalty',
'Lady':'Royalty',
'Major':'Officer',
'Master':'Master',
'Miss':'Miss',
'Mlle':'Miss',
'Mme':'Mrs',
'Mr':'Mr',
'Mrs':'Mrs',
'Ms':'Mrs',
'Rev':'Officer',
'Sir':'Royalty',
'the Countess':'Royalty'}
# Outer function: one-hot encode Title
def modify_names():
    # Inner function: build the new Title feature
    def get_titles():
        global combined, Title_Dictionary
        combined['Title'] = combined['Name'].map(lambda name: name.split(',')[1].split('.')[0].strip())
        combined['Title'] = combined.Title.map(Title_Dictionary)
        combined.drop('Name', axis=1, inplace=True)
        status('Title')
        return combined
    combined = get_titles()
    titles_dummies = pd.get_dummies(combined['Title'], prefix='Title')
    combined = pd.concat([combined, titles_dummies], axis=1)
    combined.drop('Title', axis=1, inplace=True)
    status('names')
    return combined

# Run the Name-processing function
combined = modify_names()
1.4.3 Processing Sex
- Map male to 1 and female to 0
def modify_Sex():
    global combined
    combined['Sex'] = combined.Sex.map({'male': 1, 'female': 0})
    status('sex')
    return combined

combined = modify_Sex()
1.4.4 Processing Pclass, Embarked, Cabin
- Cabin preprocessing: keep only the deck letter
combined['Cabin'] = combined['Cabin'].map(lambda e: e[0])
- One-hot encode these features
# One-hot encoding
def dummies_coder():
    global combined
    for name in ['Embarked', 'Cabin', 'Pclass']:
        df_dummies = pd.get_dummies(combined[name], prefix=name)
        combined = pd.concat([combined, df_dummies], axis=1)
        combined.drop(name, axis=1, inplace=True)
        status(name)
    return combined

combined = dummies_coder()
————————————————————————————————————————————————————————
Embarked is ok
Cabin is ok
Pclass is ok
1.4.5 Processing SibSp, Parch
- Group passengers by family size
- Singleone marks passengers traveling alone, SmallFamily families of four or fewer, BigFamily families of more than four
def modify_Family():
    global combined
    combined['Family_size'] = combined['Parch'] + combined['SibSp'] + 1
    combined['Singleone'] = combined['Family_size'].map(lambda s: 1 if s == 1 else 0)
    combined['BigFamily'] = combined['Family_size'].map(lambda s: 1 if s > 4 else 0)
    combined['SmallFamily'] = combined['Family_size'].map(lambda s: 1 if 1 < s <= 4 else 0)
    combined.drop(['Parch', 'SibSp', 'Family_size'], axis=1, inplace=True)
    status('family')
    return combined

combined = modify_Family()
————————————————————————————————————————————————————————
family is ok
1.4.6 Processing Ticket
- Extract the letter prefix from each ticket
- One-hot encode the result
def preperform_Ticket(ticket):
    ticket = ticket.replace('.', '')
    ticket = ticket.replace('/', '')
    ticket = ticket.split()
    ticket = map(lambda t: t.strip(), ticket)
    ticket = list(filter(lambda t: not t.isdigit(), ticket))
    if len(ticket) > 0:
        return ticket[0]
    else:
        return 'xxx'

def modify_Ticket():
    global combined
    combined['Ticket'] = combined['Ticket'].map(preperform_Ticket)
    tickets_dummies = pd.get_dummies(combined['Ticket'], prefix='Ticket')
    combined = pd.concat([combined, tickets_dummies], axis=1)
    combined.drop('Ticket', axis=1, inplace=True)
    status('ticket')
    return combined

combined = modify_Ticket()
1.4.7 Processing Fare, Age
- Discretize both features into bins
def modify_df(feature, threshold_values):
    global combined
    combined.loc[combined[feature] < threshold_values[0], feature] = 0
    combined.loc[(combined[feature] >= threshold_values[0]) & (combined[feature] < threshold_values[1]),
                 feature] = 1
    combined.loc[(combined[feature] >= threshold_values[1]) & (combined[feature] < threshold_values[2]),
                 feature] = 2
    combined.loc[combined[feature] >= threshold_values[2], feature] = 3
    return combined

age_threshold_values = [15, 30, 45]
fare_threshold_values = [15, 30, 100]
combined = modify_df('Age', age_threshold_values)
combined = modify_df('Fare', fare_threshold_values)
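The chain of `.loc` assignments above is equivalent to `pd.cut` with explicit bin edges; a sketch on toy ages:

```python
import numpy as np
import pandas as pd

ages = pd.Series([5, 22, 38, 70])
# right=False makes each bin half-open [low, high), matching the >= comparisons above
binned = pd.cut(ages, bins=[-np.inf, 15, 30, 45, np.inf],
                labels=[0, 1, 2, 3], right=False).astype(int)
```

`pd.cut` also generalizes more easily if the number of thresholds changes later.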
1.5 Modeling and Tuning
- Import the required modules
import time
import sklearn
from sklearn.model_selection import GridSearchCV, train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from catboost import CatBoostClassifier,Pool,cv
from sklearn import metrics
from sklearn.metrics import accuracy_score,roc_auc_score,f1_score,make_scorer
from sklearn.model_selection import cross_val_score
from bayes_opt import BayesianOptimization
- Split the dataset
train = combined[:891]
test =combined[891:]
targets =np.array(targets)
x_train,x_val,y_train,y_val= train_test_split(train,targets,test_size=0.2,random_state=2021)
1.5.1 Baseline Models and Scores
- Train and evaluate several baseline models
lg_cv = LogisticRegressionCV()
rf = RandomForestClassifier()
extree = ExtraTreesClassifier()
gbdt = GradientBoostingClassifier()
knn = KNeighborsClassifier()
models = [lg_cv, rf, extree, knn, gbdt]
for model in models:
    model = model.fit(x_train, y_train)
    predict_val = model.predict(x_val)
    print(model)
    print('accuracy-score :', metrics.accuracy_score(y_val, predict_val))
    val_proba = model.predict_proba(x_val)
    print(f'auc:{roc_auc_score(y_val, val_proba[:, 1])}')
    print('*' * 50)
————————————————————————————————————————————————————————
LogisticRegressionCV()
accuracy-score : 0.7597765363128491
auc:0.7911450182576942
**************************************************
RandomForestClassifier()
accuracy-score : 0.7094972067039106
auc:0.7766692749087115
**************************************************
ExtraTreesClassifier()
accuracy-score : 0.7039106145251397
auc:0.744131455399061
**************************************************
KNeighborsClassifier()
accuracy-score : 0.7206703910614525
auc:0.7631716223265518
**************************************************
GradientBoostingClassifier()
accuracy-score : 0.7430167597765364
auc:0.8000782472613458
**************************************************
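The scores above come from a single 80/20 split, which is a noisy estimate on 891 rows; cross-validated scores are steadier. A sketch on synthetic data (the real `x_train`, `y_train` would replace `X`, `y`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the engineered Titanic features
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
# Mean and spread over 5 folds give a more reliable ranking of the baselines
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring='roc_auc')
```

Comparing `scores.mean()` across models, with `scores.std()` as an error bar, is less likely to flip the ranking than one hold-out split.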
1.5.2 Tuning RandomForest
- Tune with GridSearchCV
# RandomForest parameter search
model_rf = RandomForestClassifier()
parameter_grid = {
    'max_depth': [4, 6, 8, 10],
    'n_estimators': [10, 30, 50, 100],
    'max_features': ['sqrt', 'auto', 'log2'],
    'min_samples_split': [2, 3, 10],
    'min_samples_leaf': [1, 3, 10],
    'bootstrap': [True, False]}
cross_validation = StratifiedKFold(n_splits=5)
auc_score = make_scorer(roc_auc_score, average='micro')
grid_search = GridSearchCV(model_rf, cv=cross_validation, param_grid=parameter_grid,
                           scoring=auc_score, verbose=1)
grid_search.fit(x_train, y_train)
print(f'Best parameters: {grid_search.best_params_}')
————————————————————————————————————————————————————————
# Best parameters: {'bootstrap': False, 'max_depth': 10, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 30}
- Rebuild the model with the tuned parameters and evaluate on the hold-out set; the AUC improves after tuning
- Plot the ROC curve
model_rf = RandomForestClassifier(**grid_search.best_params_)
model_rf.fit(x_train, y_train)

def roc(model, x, y, name):
    y_proba = model.predict_proba(x)[:, 1]
    # Predict and compute the ROC metrics
    fpr, tpr, threshold = metrics.roc_curve(y, y_proba)
    roc_auc = metrics.auc(fpr, tpr)
    print(f'{name} AUC:{roc_auc}')
    # Plot the ROC curve
    plt.figure(figsize=(15, 7))
    plt.plot(fpr, tpr, 'b', label=name + ' AUC = %0.4f' % roc_auc)
    plt.xlim(0, 1)
    plt.ylim(0, 1)
    plt.legend(loc='best')
    plt.title('ROC')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    # Draw the diagonal
    plt.plot([0, 1], [0, 1], 'r--')
    plt.show()

roc(model_rf, x_val, y_val, 'Validation set')
————————————————————————————————————————————————————————
Validation set AUC:0.7916666666666666
1.5.3 Tuning XGBoost
- Bayesian optimization for XGBoost
def BO_xgb(x, y):
    t1 = time.perf_counter()  # time.clock() was removed in Python 3.8
    def xgb_cv(max_depth, gamma, min_child_weight, max_delta_step, subsample, colsample_bytree):
        paramt = {'booster': 'gbtree',
                  'max_depth': int(max_depth),
                  'gamma': gamma,
                  'eta': 0.1,
                  'objective': 'binary:logistic',
                  'nthread': 4,
                  'eval_metric': 'auc',
                  'subsample': max(min(subsample, 1), 0),
                  'colsample_bytree': max(min(colsample_bytree, 1), 0),
                  'min_child_weight': min_child_weight,
                  'max_delta_step': int(max_delta_step),
                  'seed': 1001}
        model = XGBClassifier(**paramt)
        res = cross_val_score(model, x, y, scoring='roc_auc', cv=5).mean()
        return res
    cv_params = {'max_depth': (1, 30),  # keep the search range positive
                 'gamma': (0.001, 10.0),
                 'min_child_weight': (0, 20),
                 'max_delta_step': (0, 10),
                 'subsample': (0.1, 1.0),
                 'colsample_bytree': (0.1, 1.0)}
    xgb_op = BayesianOptimization(xgb_cv, cv_params)
    xgb_op.maximize(n_iter=20)
    print(xgb_op.max)
    t2 = time.perf_counter()
    print('Elapsed:', (t2 - t1))
    return xgb_op.max

result = BO_xgb(x_train, y_train)
————————————————————————————————————————————————————————
{'target': 0.9039100745280519, 'params': {'colsample_bytree': 1.0, 'gamma': 0.001, 'max_delta_step': 1.5409323299196414, 'max_depth': 3.216932691991951, 'min_child_weight': 0.0, 'subsample': 1.0}}
Elapsed: 18.490272999999434
- The ROC curves show that a gap between training and validation performance remains; the model can still be improved
best_params = result['params']
best_params['max_depth'] = int(best_params['max_depth'])  # a Python float has no .astype()
model_xgb = XGBClassifier(**best_params)
model_xgb.fit(x_train, y_train)

def roc(m, xt, yt, namet, xv, yv, namev):
    yt_pred = m.predict_proba(xt)[:, 1]
    yv_pred = m.predict_proba(xv)[:, 1]
    # Predict and compute the ROC metrics for both sets
    fprt, tprt, thresholdt = metrics.roc_curve(yt, yt_pred)
    roc_auct = metrics.auc(fprt, tprt)
    print(namet + ' AUC:{}'.format(roc_auct))
    fprv, tprv, thresholdv = metrics.roc_curve(yv, yv_pred)
    roc_aucv = metrics.auc(fprv, tprv)
    print(namev + ' AUC:{}'.format(roc_aucv))
    # Plot both ROC curves
    ax = plt.subplot()
    ax.plot(fprt, tprt, 'r', label=namet + ' AUC = %0.4f' % roc_auct)
    ax.plot(fprv, tprv, 'g', label=namev + ' AUC = %0.4f' % roc_aucv)
    plt.ylim(0, 1)
    plt.xlim(0, 1)
    plt.legend(loc='best')
    plt.title('ROC')
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    # Draw the diagonal
    plt.plot([0, 1], [0, 1], 'r--')
    plt.show()

roc(model_xgb, x_val, y_val, 'Validation set', x_train, y_train, 'Training set')
————————————————————————————————————————————————————————
Validation set AUC:0.799621804903495
Training set AUC:0.9331065759637188
1.5.4 Model Blending
- Predict with the RandomForest and XGBoost models and average their predicted probabilities
trained_models = [model_xgb, model_rf]
predictions = []
for model in trained_models:
    predictions.append(model.predict_proba(test)[:, 1])
predictions_df = pd.DataFrame(predictions).T
predictions_df['out'] = predictions_df.mean(axis=1)
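The simple mean weights both models equally. Giving more weight to whichever model validated better is a small, hypothetical variation (the 0.6 weight below is illustrative, not tuned):

```python
import numpy as np

# Hypothetical probabilities from the two models on three passengers
p_xgb = np.array([0.9, 0.2, 0.6])
p_rf = np.array([0.7, 0.4, 0.4])
w = 0.6  # weight for the model with the better validation AUC
blend = w * p_xgb + (1 - w) * p_rf
labels = (blend >= 0.5).astype(int)  # same 0.5 threshold as the submission below
```

With only two models and a small validation set the gain is usually marginal, but the weight can be picked by a quick grid search over [0, 1] on `x_val`.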
1.5.5 Prediction and Submission File
- Write the submission file
abc =pd.read_csv('gender_submission.csv')
df_out =pd.DataFrame()
df_out['PassengerId'] = abc['PassengerId']
df_out['Survived'] =predictions_df['out'].map(lambda x : 1 if x >= 0.5 else 0)
df_out.to_csv('422titanic.csv',index=False)
- Somewhat embarrassingly, the score improves only a little.