Machine Learning from Zero: Kaggle_Crime_Prediction

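Note: this post was written in October 2019 about the first machine learning project I ever worked on. Much of it is less thorough than it should be, so please bear with any explanations that fall short.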

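Task recap

Task name: Crime prediction

Task summary: Predict the type of crime that occurred from the accompanying data. The training data is in train.csv and the test data is in test.csv; in the end you submit your predictions for the test set.

Data description: The dataset contains incidents from a crime incident reporting system. It covers 1/1/2003 through 5/13/2015. Each record contains exactly one crime.

Data fields:
Dates: timestamp of the crime incident
Category: category of the crime incident (train.csv only); this is the target variable to predict
Descript: detailed description of the crime incident (train.csv only)
DayOfWeek: day of the week
PdDistrict: name of the police department district
Resolution: how the crime incident was resolved (train.csv only)
Address: approximate street address of the crime incident
X: longitude
Y: latitude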

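TIPS

Recommended Python packages: scikit-learn, numpy, csv

I am too lazy to write up background material here; search for the following to learn more.

google: "python classification prediction problem"

google: "kaggle Titanic hands-on"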

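Preface

Preparation: switch to a mirror source and install the required libraries and packages from the cmd command line (a tedious step for beginners).

0.1 Import the libraries up front (plotting first)

A Python library must be imported with an import statement before it can be used.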

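import numpy as np    # numerical computing
import matplotlib.pyplot as plt    # Python plotting library
import pandas as pd    # data analysis
import seaborn as sns  # statistical plotting and color palettes
import string          # string.capwords is used to tidy index labels
import matplotlib.colors as colors  # color utilities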

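0.2 Load the training and test CSV files with pandas so the data can be analyzed

# raw strings keep the backslashes in the Windows path from being read as escape sequences
train = pd.read_csv(r"D:\Computer\kaggle machine learning/task/train.csv")
test = pd.read_csv(r"D:\Computer\kaggle machine learning/task/test.csv")
train.info()
test.info()      # column dtypes and non-null counts

The output is: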

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600000 entries, 0 to 599999
Data columns (total 9 columns):
Dates         600000 non-null object
Category      600000 non-null object
Descript      600000 non-null object
DayOfWeek     600000 non-null object
PdDistrict    600000 non-null object
Resolution    600000 non-null object
Address       600000 non-null object
X             600000 non-null float64
Y             600000 non-null float64
dtypes: float64(2), object(7)
memory usage: 41.2+ MB


Data columns (total 6 columns):
Dates         250000 non-null object
DayOfWeek     250000 non-null object
PdDistrict    250000 non-null object
Address       250000 non-null object
X             250000 non-null float64
Y             250000 non-null float64
dtypes: float64(2), object(4)
memory usage: 11.4+ MB


print(train.isnull().sum()) # count the missing values in each column

Dates         0
Category      0
Descript      0
DayOfWeek     0
PdDistrict    0
Resolution    0
Address       0
X             0
Y             0
dtype: int64

print(test.isnull().sum())

Dates         0
DayOfWeek     0
PdDistrict    0
Address       0
X             0
Y             0
dtype: int64

The train set has 600,000 rows and the test set has 250,000 rows; neither has any missing values.

I. Initial data analysis and feature analysis

1.1 Category (the target variable to predict)

print(train['Category'].describe())

count            600000
unique               39
top       LARCENY/THEFT
freq             119238
Name: Category, dtype: object

cate_group = train.groupby(by='Category').size()
cate_group
cate_num = len(cate_group.index)
cate_num
cate_group.index = cate_group.index.map(string.capwords)
cate_group.sort_values(ascending=False,inplace=True)
# ascending=False sorts in descending order; inplace=True modifies cate_group itself
cate_group.plot(kind='bar',color=sns.color_palette('coolwarm',cate_num))
# optionally pass logy=True to draw the y-axis on a log scale, which makes the small categories readable
plt.title('Number of Crime types')

plt.show()

[Figure: bar chart of the number of crimes per category]

What if we use a logarithmic y-axis instead (enabled with logy=True)?

[Figure: the same bar chart with a logarithmic y-axis]
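As a rough illustration of what logy=True changes, here is a minimal sketch. The counts are made-up numbers for illustration only, not values from train.csv, and the names counts and ax are my own.

```python
import pandas as pd

# Hypothetical counts for illustration only (not taken from train.csv).
counts = pd.Series(
    {"Larceny/theft": 120000, "Assault": 50000, "Trespass": 5000, "Gambling": 100}
).sort_values(ascending=False)

# logy=True puts the y-axis on a log scale, so the small categories stay visible
# next to the large ones instead of collapsing to zero-height bars.
ax = counts.plot(kind="bar", logy=True, title="Number of Crime types (log scale)")
ax.set_ylabel("count (log scale)")
```

Call plt.show() afterwards to display the chart. On a linear axis the smallest bar here would be over a thousand times shorter than the tallest one; on the log axis each step of equal height is a factor of ten.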

By inspection, the six most frequent crime categories are (listed alphabetically):

Category
ASSAULT
DRUG/NARCOTIC
LARCENY/THEFT
NON-CRIMINAL
OTHER OFFENSES
VEHICLE THEFT

1.2 DayOfWeek

print(train['DayOfWeek'].describe())

count     600000
unique         7
top       Friday
freq       91634
Name: DayOfWeek, dtype: object

Day_group = train.groupby(by='DayOfWeek').size()
Day_group
Day_num = len(Day_group.index)
Day_num
Day_group.index = Day_group.index.map(string.capwords)
Day_group.sort_values(ascending=False,inplace=True)
Day_group.plot(kind='bar',color=sns.color_palette('coolwarm',Day_num))
plt.title('Number of crimes by day of week')

plt.show()

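[Figure: bar chart of the number of crimes per day of the week]

Analysis: Within a week, Friday has the most crimes and Sunday the fewest. Overall the counts differ little from day to day, with each day at roughly 80,000 or more.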
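The chart above is sorted by count, which hides the weekly rhythm. If you would rather read it in calendar order, one option is to reindex the grouped counts. This is only a sketch on a tiny made-up sample; sample, calendar_order and day_counts are names I chose for the example.

```python
import pandas as pd

# Tiny hypothetical sample standing in for train['DayOfWeek'].
sample = pd.DataFrame(
    {"DayOfWeek": ["Friday", "Monday", "Friday", "Sunday", "Monday", "Friday"]}
)

calendar_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                  "Friday", "Saturday", "Sunday"]

# Count rows per day, then reorder by calendar position.
# Days that never occur in the sample get a count of 0.
day_counts = sample.groupby("DayOfWeek").size().reindex(calendar_order, fill_value=0)
print(day_counts)
```

The same reindex call should work on Day_group before plotting, as long as the labels match the capitalized day names.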

1.3 PdDistrict

print(train['PdDistrict'].describe())

count       600000
unique          10
top       SOUTHERN
freq        107457
Name: PdDistrict, dtype: object

dis_group = train.groupby(by='PdDistrict').size()
dis_group
dis_group = dis_group/dis_group.sum()    # convert counts to shares of all crimes
dis_group.index = dis_group.index.map(string.capwords)
dis_group.sort_values(ascending=True,inplace=True)
dis_group.plot(kind='barh',figsize=(15,10),fontsize=10,color=sns.color_palette('coolwarm',10))
plt.title('The Frequency of crimes by district',fontsize=15)

plt.show()

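[Figure: horizontal bar chart of the share of crimes per police district]

Analysis: Crime is most frequent in Southern, followed by Mission and Northern, while Richmond is the safest district. The amount of crime is clearly related to PdDistrict.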

1.4 year/month/day

To analyze year, month and day separately, the timestamp in Dates first has to be split into its components. The following code does that and plots the counts:

fig=plt.figure()
fig.set(alpha=0.2)

# convert the Dates strings to datetime, then pull out the individual components
train['Dates']=pd.to_datetime(train['Dates'])
train['year'] = train.Dates.dt.year
train['month'] = train.Dates.dt.month
train['day'] = train.Dates.dt.day
train['hour'] = train.Dates.dt.hour

plt.subplot2grid((3,1),(0,0))
year_group = train.groupby(by='year').size()
# iloc[:-1] drops the last year, 2015, because the data only runs to 13 May of that year
plt.plot(year_group.index[:-1],year_group.iloc[:-1],'ks-')
plt.title('statistics of criminal cases measured by year')
plt.xlabel('year')
plt.ylabel('number of crimes')

plt.subplot2grid((3,1),(1,0))
month_group = train.groupby(by='month').size()
plt.plot(month_group,'ks-')
plt.title('statistics of criminal cases measured by month')
plt.xlabel('month')
plt.ylabel('number of crimes')

plt.subplot2grid((3,1),(2,0))
day_group = train.groupby(by='day').size()
plt.plot(day_group,'ks-')
plt.title('statistics of criminal cases measured by day')
plt.xlabel('day')
plt.ylabel('number of crimes')

plt.tight_layout()
plt.show()

Part of the info() output after splitting the timestamp is shown below (it confirms that the split worked):

year          600000 non-null int64
month         600000 non-null int64
day           600000 non-null int64
hour          600000 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(5), object(4)

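[Figure: three line plots of the number of crimes by year, by month and by day of the month]

Analysis: 2011 has the fewest crimes; the count rises sharply after 2011 and peaks in 2013. By month, August has the fewest crimes, while October and May have the most. Apart from visible swings at the start and end of the month, the count per day of the month does not change much, staying at roughly 17,500 to 20,000 incidents. Of the three curves, the day curve is the closest to a straight line.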
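To check the date splitting on something small before running it on 600,000 rows, a sketch like the following can help. The two timestamps are made up, and I am assuming the "YYYY-MM-DD HH:MM:SS" layout; sample and parts are names chosen for this example.

```python
import pandas as pd

# Two hypothetical timestamps in an assumed "YYYY-MM-DD HH:MM:SS" layout.
sample = pd.DataFrame({"Dates": ["2015-05-13 23:53:00", "2003-01-06 00:01:00"]})

# Parse the strings into datetime64 values; the .dt accessor only works after this.
sample["Dates"] = pd.to_datetime(sample["Dates"])

# Pull out each component as its own integer column.
parts = pd.DataFrame({
    "year": sample["Dates"].dt.year,
    "month": sample["Dates"].dt.month,
    "day": sample["Dates"].dt.day,
    "hour": sample["Dates"].dt.hour,
})
print(parts)
```

Calling .dt on a column that is still of dtype object raises an AttributeError, which is the usual sign that the to_datetime step was skipped.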

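1.5 Number of crimes by hour for each day of the week

week_group = train.groupby(['DayOfWeek','hour']).size()     # group by two keys at once
week_group = week_group.unstack()     # pivot the hour level into columns (rows: day, columns: hour)
week_group.T.plot(figsize=(12,8))     # transpose so hour is on the x-axis, one line per day
plt.xlabel('hour of day',size=15)
plt.ylabel('Number of crimes',size=15)

plt.show()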

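[Figure: number of crimes by hour of day, one line per day of the week]

Analysis: Crime counts peak at 12:00 and 18:00 and drop markedly in the early morning hours; the counts differ a lot between times of day.

PS. Be careful here: the hour column used in groupby(['DayOfWeek','hour']) does not exist in the raw data, so the line that derives it from Dates (part of the date-splitting code in 1.4) must run first:

train['hour'] = train.Dates.dt.hour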
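The groupby, unstack and transpose chain is easier to follow on a handful of rows. Below is a small sketch with invented data; sample, counts and table are my own names.

```python
import pandas as pd

# Five hypothetical incidents: three on Monday, two on Friday.
sample = pd.DataFrame({
    "DayOfWeek": ["Monday", "Monday", "Monday", "Friday", "Friday"],
    "hour": [12, 12, 18, 12, 18],
})

# Step 1: count rows for every (day, hour) pair; the result has a two-level index.
counts = sample.groupby(["DayOfWeek", "hour"]).size()

# Step 2: move the inner level (hour) into columns; missing pairs become 0.
table = counts.unstack(fill_value=0)
print(table)

# Step 3: table.T has hours as rows, which is what .plot() puts on the x-axis.
print(table.T)
```

Here table has one row per day and one column per hour, so table.loc["Monday", 12] is the number of Monday incidents at noon.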

1.6 Address

print(train['Address'].describe())

count                     600000
unique                     22262
top       800 Block of BRYANT ST
freq                       18064
Name: Address, dtype: object

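As the output shows, Address has a very large number of distinct values (22,262) that are thinly spread, so its dimensionality is high and it is hard to turn into features.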

1.7 Resolution

(This field exists only in the training set.)

print(train['Resolution'].describe())

count     600000
unique        17
top         NONE
freq      359779
Name: Resolution, dtype: object

[Figure: distribution of Resolution values]

Analysis: Only three Resolution values occur frequently; the others are comparatively rare.

1.8 Descript

(This field exists only in the training set.)

print(train['Descript'].describe())

count                           600000
unique                             862
top       GRAND THEFT FROM LOCKED AUTO
freq                             41046

Analysis: Descript has 862 distinct values and is widely scattered, second only to Address.

1.9 Location coordinates X, Y

print(train['X'].describe())
print(train['Y'].describe())

count    600000.000000
mean       -122.422635
std           0.030175
min        -122.513642
25%        -122.432956
50%        -122.416442
75%        -122.406959
max        -120.500000
Name: X, dtype: float64
        
count    600000.000000
mean         37.770856
std           0.447932
min          37.707920
25%          37.752441
50%          37.775421
75%          37.784353
max          90.000000
Name: Y, dtype: float64

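Analysis: I did not plot the coordinates. The standard deviations of X and Y are 0.030175 and 0.447932, both small in absolute terms, so the coordinates vary little and I did not expect them to influence the result much. Two caveats apply. The maxima (X = -120.5, Y = 90) lie far outside the quartile range and look like invalid entries, which also inflates the standard deviation of Y. And a small spread measured in degrees does not by itself prove that location is uninformative.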
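If you want to see how much those extreme coordinates distort the statistics, a bounding-box filter is one simple option. This is a sketch on four invented points; the box limits are my assumption, picked loosely around the quartiles printed above, and coords, in_box and clean are names chosen for the example.

```python
import pandas as pd

# Four hypothetical points; the third mimics the suspicious maxima (X=-120.5, Y=90).
coords = pd.DataFrame({
    "X": [-122.42, -122.41, -120.5, -122.43],
    "Y": [37.77, 37.78, 90.0, 37.75],
})

# Assumed bounding box around the bulk of the data; adjust the limits as needed.
in_box = coords["X"].between(-123.0, -122.0) & coords["Y"].between(37.0, 38.0)
clean = coords[in_box]

# Compare the spread of Y with and without the outlier.
print(coords["Y"].std(), clean["Y"].std())
```

On this toy sample a single bad row is enough to dominate the standard deviation, which is worth keeping in mind when reading the describe() output.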

1.10 The six most frequent crime categories versus hour

top6 = list(cate_group.index[:6])    # cate_group is already sorted in descending order
tmp = train[train['Category'].map(string.capwords).isin(top6)]
tmp_group = tmp.groupby(['Category','hour']).size()
tmp_group = tmp_group.unstack()
tmp_group.T.plot(figsize=(12,6),style='o-')

plt.show()

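[Figure: hourly counts for the six most frequent crime categories]

Analysis: Counts start to rise after 5:00 and peak at 12:00 and 18:00; the time of day affects every one of these crime categories.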

1.11 The six most frequent crime categories versus month

mon_g = tmp.groupby(['Category','month']).size()
mon_g = mon_g.unstack()
# divide each row by its own total so every category is shown as monthly shares
mon_g = mon_g.div(mon_g.sum(axis=1), axis=0)
mon_g.T.plot(figsize=(12,6),style='o-')

plt.show()

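[Figure: monthly share of crimes for the six most frequent categories]

Analysis: The month has a strong effect on crime. May and October are the peaks, and ASSAULT contributes to them.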
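The plot shows shares rather than raw counts: each category's row is divided by that row's total, so large and small categories can share one axis. A tiny worked sketch of that normalization, with invented numbers and my own names table and shares:

```python
import pandas as pd

# Hypothetical counts: rows are categories, columns are months 1 to 3.
table = pd.DataFrame(
    {1: [10, 30], 2: [30, 30], 3: [60, 40]},
    index=["ASSAULT", "VEHICLE THEFT"],
)

# Divide every row by its own sum; axis=0 aligns the row totals with the row index.
shares = table.div(table.sum(axis=1), axis=0)
print(shares)
```

After this step every row sums to 1, so a value of 0.6 means that 60% of that category's incidents fall in that month.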

1.12 The six most frequent crime categories versus year

year_g = tmp.groupby(['Category','year']).size()
year_g = year_g.unstack()
# divide each row by its own total so every category is shown as yearly shares
year_g = year_g.div(year_g.sum(axis=1), axis=0)
year_g.T.plot(figsize=(12,6),style='o-')

plt.show()

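[Figure: yearly share of crimes for the six most frequent categories]

Analysis: Very clearly, the share of VEHICLE THEFT is extremely high in 2003~2006 and lower afterwards. The frequency of every crime type starts to drop markedly in 2014 (perhaps public safety improved).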

II. Data processing

# import the sklearn modules used below

from sklearn import preprocessing
#from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
#from sklearn.feature_selection import SelectKBest
#from sklearn.feature_selection import chi2

2.1 Label-encode the classification target

# label-encode the classification target

from sklearn.preprocessing import LabelEncoder      # label encoding
from sklearn.preprocessing import OneHotEncoder     # one-hot encoding
label = preprocessing.LabelEncoder()
target = label.fit_transform(train.Category)
target
crime = target    # integer class labels, attached to the feature table in 2.2
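LabelEncoder sorts the distinct class names and replaces each with its position in that sorted list. A minimal sketch on four made-up labels (categories, encoder and codes are names chosen for the example):

```python
from sklearn.preprocessing import LabelEncoder

# Four hypothetical labels, three distinct values.
categories = ["WARRANTS", "OTHER OFFENSES", "LARCENY/THEFT", "OTHER OFFENSES"]

encoder = LabelEncoder()
codes = encoder.fit_transform(categories)

print(list(encoder.classes_))                    # distinct labels in sorted order
print(list(codes))                               # integer code for each input label
print(list(encoder.inverse_transform(codes)))    # back to the original strings
```

Keeping the fitted encoder around matters later: inverse_transform and classes_ are what map predicted class indices back to category names when writing a submission.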

# split Dates in the test data the same way
# (Dates is converted in place, because 2.3 reads test.Dates.dt)

test['Dates'] = pd.to_datetime(test['Dates'])
test['year'] = test.Dates.dt.year
test['month'] = test.Dates.dt.month
test['day'] = test.Dates.dt.day
test['hour'] = test.Dates.dt.hour

test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 10 columns):
Dates         250000 non-null datetime64[ns]
DayOfWeek     250000 non-null object
PdDistrict    250000 non-null object
Address       250000 non-null object
X             250000 non-null float64
Y             250000 non-null float64
year          250000 non-null int64
month         250000 non-null int64
day           250000 non-null int64
hour          250000 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(4), object(3)
memory usage: 19.1+ MB

2.2 One-hot encode the train data and combine the features

# one-hot encode the categorical columns of train
days = pd.get_dummies(train.DayOfWeek)
district = pd.get_dummies(train.PdDistrict)
hour = pd.get_dummies(train.Dates.dt.hour)
month = pd.get_dummies(train.Dates.dt.month)
year = pd.get_dummies(train.Dates.dt.year)

# combine the train features
train = pd.concat([days,district,hour,month,year], axis=1)
train['crime'] = crime

2.3 One-hot encode the test data the same way and combine the features

# one-hot encode the categorical columns of test
days = pd.get_dummies(test.DayOfWeek)
district = pd.get_dummies(test.PdDistrict)
hour = pd.get_dummies(test.Dates.dt.hour)
month = pd.get_dummies(test.Dates.dt.month)
year = pd.get_dummies(test.Dates.dt.year)

# combine the test features (only after every dummy table is built, since this overwrites test)
test = pd.concat([days, district, hour,month,year],axis=1)
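Two things are easy to trip over with get_dummies on integer columns: hour and month dummies end up sharing the integer labels 1 to 12, and the test set may lack a value that the train set has. A hedged sketch of one way around both, using the prefix argument and a reindex. This is not what the code above does; train_s, test_s and encode are hypothetical names on toy data.

```python
import pandas as pd

# Toy stand-ins for the hour and month columns of train and test.
train_s = pd.DataFrame({"hour": [0, 1, 23], "month": [1, 5, 12]})
test_s = pd.DataFrame({"hour": [1, 23], "month": [5, 5]})

def encode(df):
    # prefix keeps the two groups apart: "hour_1" and "month_1" instead of 1 and 1.
    return pd.concat(
        [pd.get_dummies(df["hour"], prefix="hour", dtype=int),
         pd.get_dummies(df["month"], prefix="month", dtype=int)],
        axis=1,
    )

train_enc = encode(train_s)

# Reindex test to the train columns: dummies missing from test are filled with 0,
# so both tables have identical columns in identical order.
test_enc = encode(test_s).reindex(columns=train_enc.columns, fill_value=0)
print(test_enc)
```

With prefixed names the feature list can refer to columns unambiguously, at the cost of having to build it from strings such as "hour_0" rather than bare integers.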

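The one-hot encoding above is used because we want every feature value to lie in the 0-1 range.

The sigmoid function $\sigma(z)$ that we need to construct and apply is:

$\sigma(z)=\frac{1}{1+e^{-z}}.$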
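To get a feel for the function, here is a small numeric sketch with NumPy. The helper name sigmoid and the sample inputs are my own choices.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid 1 / (1 + exp(-z)), applied element-wise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs map close to 0, zero maps to 0.5, large positive close to 1.
values = sigmoid([-5.0, 0.0, 5.0])
print(values)
```

Whatever the input, the output stays strictly between 0 and 1, and the curve is symmetric in the sense that sigmoid(-z) + sigmoid(z) = 1.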

[Figure: plot of the sigmoid function]

The processed data looks like this:

print(train.head(10))
print(test.head(10))

 Friday  Monday  Saturday  Sunday  Thursday  ...  2012  2013  2014  2015  crime
0       0       1         0       0         0  ...     0     0     0     0     16
1       0       1         0       0         0  ...     0     0     0     0     25
2       0       1         0       0         0  ...     0     0     0     0     21
3       0       1         0       0         0  ...     0     0     0     0     21
4       0       1         0       0         0  ...     0     0     0     0     16
5       0       1         0       0         0  ...     0     0     0     0     35
6       0       1         0       0         0  ...     0     0     0     0     34
7       0       1         0       0         0  ...     0     0     0     0     21
8       0       1         0       0         0  ...     0     0     0     0      1
9       0       1         0       0         0  ...     0     0     0     0     21

[10 rows x 67 columns]
   Friday  Monday  Saturday  Sunday  Thursday  ...  2011  2012  2013  2014  2015
0       0       1         0       0         0  ...     0     0     0     0     0
1       0       1         0       0         0  ...     0     0     0     0     0
2       0       1         0       0         0  ...     0     0     0     0     0
3       0       1         0       0         0  ...     0     0     0     0     0
4       0       1         0       0         0  ...     0     0     0     0     0
5       0       1         0       0         0  ...     0     0     0     0     0
6       0       1         0       0         0  ...     0     0     0     0     0
7       0       1         0       0         0  ...     0     0     0     0     0
8       0       1         0       0         0  ...     0     0     0     0     0
9       0       1         0       0         0  ...     0     0     0     0     0

[10 rows x 66 columns]

III. Build the model and evaluate it

3.1 Import log_loss from sklearn

from sklearn.metrics import log_loss
import time

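3.2 Select features

Since we are building a model, we need to choose the features to fit it on. We start with the day-of-week and PdDistrict columns:

features = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'BAYVIEW', 'CENTRAL', 'INGLESIDE', 'MISSION',
 'NORTHERN', 'PARK', 'RICHMOND', 'SOUTHERN', 'TARAVAL', 'TENDERLOIN']

# then add the three kinds of time features
# note: range() excludes its end value, so monFea covers months 1-11 and yearFea covers 2003-2014
# note: the hour and month dummies from 2.2 share the integer labels 1-12, so those labels match columns from both groups
hourFea = [x for x in range(0,24)]
monFea = [x for x in range(1,12)]
yearFea = [x for x in range(2003,2015)]

features = features + hourFea + monFea + yearFea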

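3.3 Cross-validation

The basic idea of cross-validation is to split the original data (dataset) into groups: one part serves as the training set (train set) and the other as the validation set (validation set or test set). The classifier is first trained on the training set, and the trained model is then tested on the validation set; that score serves as the measure of the classifier's performance. Step 3.4 splits the dataset accordingly, using a single hold-out split.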
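For comparison with a single split, this is roughly what k-fold cross-validation looks like in scikit-learn. It is a sketch on synthetic data, not the setup used in this post: the random features and labels, BernoulliNB as the model, and cv=5 are all my assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 8))   # synthetic 0/1 features
y = rng.integers(0, 3, size=300)        # synthetic labels with 3 classes

# cv=5: the data is cut into 5 folds; each fold is held out once for scoring
# while the model is trained on the other 4.
# scoring="neg_log_loss" returns the negated log loss, so higher is better.
scores = cross_val_score(BernoulliNB(), X, y, cv=5, scoring="neg_log_loss")
print(scores)
print(scores.mean())
```

Averaging over several folds gives a steadier estimate than one split, at the price of training the model several times.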

3.4 Split the dataset

# split the data: 6/10 for training and 4/10 held out for testing
X_train,X_test,y_train,y_test = train_test_split(train[features], train['crime'] , train_size = 0.6)
# explanation:
# the X_ pair is the feature table train[features], split into X_train and X_test at a ratio of 3:2
# the y_ pair is the label column train['crime'], split into y_train and y_test the same way
# now let's look at the resulting sizes
print("The length of original data X is:", train[features].shape[0])
print("The length of train Data is:", X_train.shape[0])
print("The length of test Data is:", X_test.shape[0])

The length of original data X is: 600000
The length of train Data is: 360000
The length of test Data is: 240000
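The split above is random on every run. If you want repeatable results and the same class mix on both sides, train_test_split accepts random_state and stratify. A small sketch on synthetic, deliberately imbalanced labels; the arrays and the seed 42 are my own choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)     # 50 synthetic rows with 2 features
y = np.array([0] * 40 + [1] * 10)     # imbalanced labels: 80% class 0, 20% class 1

# random_state fixes the shuffle; stratify=y keeps the 80/20 class mix in both parts.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, train_size=0.6, random_state=42, stratify=y
)
print(len(X_tr), len(X_va), y_tr.mean(), y_va.mean())
```

With 39 categories of very different sizes, stratifying helps make sure the rare classes appear on both sides of the split.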

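3.5 Import a model from sklearn and train it

(1) Naive Bayes

Model formula:

$P(Y_k|X)=\frac{P(X,Y_k)}{P(X)}=\frac{P(Y_k)P(X|Y_k)}{\sum_j P(Y_j)P(X|Y_j)}.$

The right-hand side of the formula summarizes history.
The left-hand side of the formula predicts the future.
If we treat $Y$ as the class and $X$ as the features, $P(Y_k|X)$ is the probability of class $Y_k$ given the features $X$, and computing $P(Y_k|X)$ reduces to the class prior $P(Y_k)$ and the distribution of the features within each class $Y_k$.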
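As a rough sketch of how this step could look in code: BernoulliNB is my assumption here, chosen because it is the scikit-learn naive Bayes variant meant for 0/1 features like the dummies built earlier. The data below is synthetic, so the printed loss says nothing about the crime dataset.

```python
import numpy as np
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=600)      # synthetic labels with 3 classes

# Feature j is 1 with probability 0.8 when y == j and 0.2 otherwise,
# so the features carry real information about the class.
p_one = np.where(np.arange(3) == y[:, None], 0.8, 0.2)
X = (rng.random((600, 3)) < p_one).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, train_size=0.6, random_state=0, stratify=y
)

model = BernoulliNB()
model.fit(X_tr, y_tr)                 # estimates P(Y_k) and P(X_j | Y_k) from counts

proba = model.predict_proba(X_va)     # one column of P(Y_k | X) per class
loss = log_loss(y_va, proba, labels=model.classes_)
print(loss)
```

A handy baseline: predicting a uniform distribution over K classes gives a log loss of ln(K), so a useful model should score below that.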
