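With the popularity of social networking platforms such as Facebook and Twitter, more and more teenage users post messages on them. This case study uses the Pandas package and several scikit-learn preprocessing classes to preprocess a dataset of teenagers' social network information.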

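Contents

1 Data loading and inspection
2 Missing-value handling
3 Outlier handling
4 Standardization
4.1 Z-Score standardization
4.2 Min-Max scaling
5 Feature encoding
5.1 Integer encoding
5.2 One-Hot encoding
5.3 Binarization
6 Discretization
6.1 Equal-width discretization
6.2 Equal-frequency discretization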

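1 Data loading and inspection

We use a dataset of social network information for 30,000 US high school students. The data were sampled evenly across the graduation years 2006 to 2009, and each sample contains 40 variables. Four of them, gradyear, gender, age and friends, record basic information about the student: graduation year, gender, age and number of friends. The remaining 36 keyword variables represent five broad interest categories: extracurricular activities, fashion, religion, romance and antisocial behavior. The variables are described below: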

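Variable	Description
`gradyear` (graduation year)	2006 to 2009, one year per high school class (2006 is the most senior class, 2009 the first-year class)
`gender`	M (male), F (female)
`age`	Age computed from the date of birth, to three decimal places
`friends`	Number of friends on the social network
36 keywords	Number of times the word appears in the messages the student posted on the social networking service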

We first import the necessary libraries:

import numpy as np
import pandas as pd  # data loading, discretization
import matplotlib.pyplot as plt  # visualization
from sklearn.impute import SimpleImputer  # missing-value imputation (successor to preprocessing.Imputer, removed in scikit-learn 0.22)
from sklearn.preprocessing import StandardScaler  # Z-Score standardization
from sklearn.preprocessing import MinMaxScaler  # Min-Max scaling
from sklearn.preprocessing import LabelEncoder  # integer encoding
from sklearn.preprocessing import OneHotEncoder  # One-Hot encoding
from sklearn.preprocessing import Binarizer  # binarization

Next, we read the data from a local file and look at the first 5 rows:

teenager_sns = pd.read_csv("./input/teenager_sns.csv")
teenager_sns.head()

The info() method gives a high-level overview of the dataset:

teenager_sns.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 40 columns):
gradyear        30000 non-null int64
gender          27276 non-null object
age             24914 non-null float64
friends         30000 non-null int64
basketball      30000 non-null int64
football        30000 non-null int64
soccer          30000 non-null int64
softball        30000 non-null int64
volleyball      30000 non-null int64
swimming        30000 non-null int64
cheerleading    30000 non-null int64
baseball        30000 non-null int64
tennis          30000 non-null int64
sports          30000 non-null int64
cute            30000 non-null int64
sex             30000 non-null int64
sexy            30000 non-null int64
hot             30000 non-null int64
kissed          30000 non-null int64
dance           30000 non-null int64
band            30000 non-null int64
marching        30000 non-null int64
music           30000 non-null int64
rock            30000 non-null int64
god             30000 non-null int64
church          30000 non-null int64
jesus           30000 non-null int64
bible           30000 non-null int64
hair            30000 non-null int64
dress           30000 non-null int64
blonde          30000 non-null int64
mall            30000 non-null int64
shopping        30000 non-null int64
clothes         30000 non-null int64
hollister       30000 non-null int64
abercrombie     30000 non-null int64
die             30000 non-null int64
death           30000 non-null int64
drunk           30000 non-null int64
drugs           30000 non-null int64
dtypes: float64(1), int64(38), object(1)
memory usage: 9.2+ MB

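We find that gender and age have fewer than 30,000 non-null records, which means both variables contain missing values. Because age is continuous, its situation is more complex than that of the binary gender variable, so we call describe() to take a closer look at age: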

print('Number of missing values in age:', teenager_sns["age"].isnull().sum())
teenager_sns["age"].describe()

Number of missing values in age: 5086

count    24914.000000
mean        17.993949
std          7.858054
min          3.086000
25%         16.312000
50%         17.287000
75%         18.259000
max        106.927000
Name: age, dtype: float64

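In the output above, the maximum of 106.927 years and the minimum of 3.086 years are implausible for high school students. We restrict teenagers' ages to the range 13 ≤ age < 20, mark every value outside it as missing, and recount the missing values: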

def tag_nan(value):
    # Keep ages in [13, 20); anything else (including NaN) becomes NaN
    if 13 <= value < 20:
        return value
    return np.nan

teenager_sns["age"] = teenager_sns["age"].map(tag_nan)
print('Number of missing values in age:', teenager_sns["age"].isnull().sum())
teenager_sns["age"].describe()

Number of missing values in age: 5523

count    24477.000000
mean        17.252429
std          1.157465
min         13.027000
25%         16.304000
50%         17.265000
75%         18.220000
max         19.995000
Name: age, dtype: float64

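This gives the age variable a consistent definition, but it also introduces more missing values (5,523 of 30,000, about 18%). Because that share is large, we will fill the missing ages in rather than drop the rows.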
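The same range filter can also be written without a row-by-row function. The following is a minimal sketch on a toy Series (the values are hypothetical, chosen only to include the boundary cases): a boolean mask selects ages in [13, 20), and `Series.where` turns everything else into NaN.

```python
import numpy as np
import pandas as pd

# Toy ages (hypothetical values, not rows from the dataset)
ages = pd.Series([3.086, 13.0, 17.5, 19.995, 20.0, 106.927, np.nan])

# True only for 13 <= age < 20; comparisons with NaN evaluate to False
in_range = (ages >= 13) & (ages < 20)

# Keep values where the mask is True, replace the rest with NaN
cleaned = ages.where(in_range)

print(cleaned.tolist())
print("missing:", cleaned.isna().sum())
```

On a full column this avoids calling a Python function once per row, while producing the same result as the map-based version.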

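2 Missing-value handling

In the previous step, info() showed that both age and gender contain missing values. We now fill the age column of teenager_sns with its mean using scikit-learn's SimpleImputer. (The older Imputer class, called with missing_values='NaN' and axis=0, was removed in scikit-learn 0.22.)

# Mean imputation of age
from sklearn.impute import SimpleImputer  # available in scikit-learn >= 0.20
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
teenager_sns["age_imputed"] = imp.fit_transform(teenager_sns[["age"]]).ravel()

# Show rows where age is missing, together with the imputed column "age_imputed"
teenager_sns[teenager_sns['age'].isnull()].head()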
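To make the effect of mean imputation concrete, here is a minimal sketch on a toy column (hypothetical values; it assumes scikit-learn 0.20 or later, where the imputer is available as `sklearn.impute.SimpleImputer`). Every missing entry receives the mean of the observed entries, which the fitted imputer stores in `statistics_`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column with two missing ages (hypothetical values)
toy = pd.DataFrame({"age": [16.0, np.nan, 18.0, np.nan, 17.0]})

imp = SimpleImputer(missing_values=np.nan, strategy="mean")
# fit_transform returns an (n, 1) array; ravel() flattens it to one column
toy["age_imputed"] = imp.fit_transform(toy[["age"]]).ravel()

print(imp.statistics_)  # mean of the observed values
print(toy)
```

Because every gap is filled with the same value, mean imputation leaves the column mean unchanged but shrinks its variance, which is worth keeping in mind when about 18% of the ages are missing.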

Next, check how many values are missing in the gender column:

# Handle gender (a categorical variable)
print('Number of missing values in gender:', teenager_sns["gender"].isnull().sum())

Number of missing values in gender: 2724

Because relatively few gender values are missing (2,724 of 30,000, about 9%), we simply drop those rows:

teenager_sns = teenager_sns.dropna(subset=['gender'])

Check the remaining missing values:

teenager_sns.isnull().sum().sort_values(ascending=False)  # age_imputed holds the imputed ages and has no missing values; the original age column still does

age             3674
age_imputed        0
cheerleading       0
hot                0
sexy               0
sex                0
cute               0
sports             0
tennis             0
baseball           0
swimming           0
dance              0
volleyball         0
softball           0
soccer             0
football           0
basketball         0
friends            0
gender             0
kissed             0
band               0
drugs              0
marching           0
drunk              0
death              0
die                0
abercrombie        0
hollister          0
clothes            0
shopping           0
mall               0
blonde             0
dress              0
hair               0
bible              0
jesus              0
church             0
god                0
rock               0
music              0
gradyear           0
dtype: int64

3 Outlier handling

Next, we check the friends column for outliers.

# Box plot
fig = plt.figure(figsize=(20, 8))
plt.subplot(1, 2, 1)
plt.boxplot(x=teenager_sns.friends)
plt.xlabel('friends', fontsize=20)
plt.ylabel("Count", fontsize=20)

# Histogram of the distribution
plt.subplot(1, 2, 2)
plt.hist(teenager_sns.friends, bins=15)
plt.xlabel('friends', fontsize=20)
plt.ylabel("Count", fontsize=20)

plt.show()

The plots show that friends is right-skewed overall, and that values above roughly 100 appear as outliers. We remove them next:

# Remove outliers
# Rule: a point is an outlier if it lies above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR,
# where the interquartile range (IQR) is the upper quartile Q3 minus the lower quartile Q1
quantile = np.percentile(teenager_sns.friends, [0, 25, 50, 75, 100])
friendsIQR = quantile[3] - quantile[1]
UpLimit = quantile[3] + friendsIQR * 1.5
DownLimit = quantile[1] - friendsIQR * 1.5
teenager_sns = teenager_sns[(teenager_sns['friends'] > DownLimit) & (teenager_sns['friends'] < UpLimit)]
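The 1.5 × IQR rule is easy to check by hand on a small array. The sketch below wraps it in a helper; the function name `iqr_limits` and the toy values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def iqr_limits(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy friend counts with one obvious outlier (hypothetical values)
friends = np.array([0, 2, 5, 8, 10, 12, 15, 20, 300])

low, high = iqr_limits(friends)
# Keep only the points strictly inside the fences
kept = friends[(friends > low) & (friends < high)]

print(low, high)
print(kept)
```

Here Q1 = 5 and Q3 = 15, so the fences are -10 and 30 and only the value 300 is removed. For a non-negative, right-skewed variable such as friends, the lower fence is often negative and never removes anything.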

Check the distribution after removing the outliers:

# Box plot
fig = plt.figure(figsize=(20, 8))
plt.subplot(1, 2, 1)
plt.boxplot(x=teenager_sns.friends)
plt.xlabel('friends', fontsize=20)
plt.ylabel("Count", fontsize=20)

# Histogram of the distribution
plt.subplot(1, 2, 2)
plt.hist(teenager_sns.friends, bins=15)
plt.xlabel('friends', fontsize=20)
plt.ylabel("Count", fontsize=20)

plt.show()

# Rows were dropped, so rebuild a consecutive index
# (drop=True avoids keeping the old index as an extra "index" column)
teenager_sns = teenager_sns.reset_index(drop=True)

4 Standardization

4.1 Z-Score standardization

We now use scikit-learn's StandardScaler to Z-Score standardize the friends column of teenager_sns, so that the transformed data have mean 0 and standard deviation 1.

# Z-Score standardization
scaler = StandardScaler(copy=True)
teenager_sns_zscore = pd.DataFrame(scaler.fit_transform(teenager_sns[["friends"]]), columns=["friends_StandardScaled"])
teenager_sns_zscore["friends"] = teenager_sns["friends"]

print("Mean:", teenager_sns_zscore["friends_StandardScaled"].mean(axis=0))
# pandas .std() is the sample standard deviation (ddof=1), so it comes out marginally above 1
print("Standard deviation:", teenager_sns_zscore["friends_StandardScaled"].std(axis=0))
teenager_sns_zscore.head()

Mean: 3.4817192472740534e-17
Standard deviation: 1.0000191415035669
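The second printed value is 1.0000191 rather than exactly 1 because StandardScaler divides by the population standard deviation (ddof=0), while pandas' `.std()` defaults to the sample standard deviation (ddof=1). The ratio between the two is sqrt(n / (n - 1)); with the 26,122 rows left after outlier removal (the sum of the bin frequencies in section 6.1) this gives about 1.0000191. A minimal sketch on toy values:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy column (hypothetical values)
x = pd.DataFrame({"friends": [7, 0, 69, 0, 72]})

scaled = StandardScaler().fit_transform(x).ravel()

# Manual Z-Score with the population standard deviation (ddof=0)
manual = (x["friends"] - x["friends"].mean()) / x["friends"].std(ddof=0)

print(np.allclose(scaled, manual))     # the two computations agree
print(pd.Series(scaled).std(ddof=0))   # population std of the result
print(pd.Series(scaled).std(ddof=1))   # sample std: sqrt(n / (n - 1))
```

With n = 5 the sample standard deviation of the scaled values is sqrt(5/4) ≈ 1.118; the gap shrinks toward zero as n grows.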

4.2 Min-Max scaling

Next, we use scikit-learn's MinMaxScaler to Min-Max scale the friends column of teenager_sns, so that the transformed values lie in the interval [0, 1].

# Min-Max scaling
scaler = MinMaxScaler()
teenager_sns_minmaxscore = pd.DataFrame(scaler.fit_transform(teenager_sns[["friends"]]),
                                        columns=["friends_MinMaxScaled"])
teenager_sns_minmaxscore["friends"] = teenager_sns["friends"]
teenager_sns_minmaxscore.head()
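Min-Max scaling applies (x - min) / (max - min) to each value. As a minimal sketch, the toy column below reuses a few friend counts and includes 0 and 103, the minimum and maximum reported for friends after outlier removal, so the manual results can be compared with the scaled values in the notebook output (for example 7 / 103 ≈ 0.067961).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy column; 0 and 103 are the min and max of friends after outlier removal
friends = np.array([[7.0], [0.0], [69.0], [72.0], [103.0]])

scaled = MinMaxScaler().fit_transform(friends)

# Manual formula: (x - min) / (max - min)
manual = (friends - friends.min()) / (friends.max() - friends.min())

print(np.allclose(scaled, manual))
print(np.round(manual.ravel(), 6))
```

Because the result depends only on the minimum and maximum, a single extreme value compresses all other values toward 0, which is one reason to remove outliers first.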

5 Feature encoding

5.1 Integer encoding

We use scikit-learn's LabelEncoder to encode the gender column of teenager_sns:

# Integer encoding
le = LabelEncoder()

# Print the gender of the first 4 students
print(teenager_sns["gender"][:4])

# Encode them as integers
print(le.fit_transform(teenager_sns["gender"][:4]))

0    M
1    F
2    M
3    F
Name: gender, dtype: object
[1 0 1 0]

The result shows that M is encoded as 1 and F as 0: LabelEncoder sorts the class labels and numbers them from 0.
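The mapping between labels and codes does not have to be read off the printed output. A fitted LabelEncoder keeps the sorted labels in `classes_` (the position of a label is its code), and `inverse_transform` converts codes back to labels. A minimal sketch on the same four values:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["M", "F", "M", "F"])

print(le.classes_)                   # sorted labels; index = integer code
print(codes)                         # encoded values
print(le.inverse_transform(codes))   # back to the original labels
```

Note that the integer codes imply an order (0 < 1) that gender does not have, which is the motivation for One-Hot encoding.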

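5.2 One-Hot encoding

Next, we One-Hot encode the gender column. We first integer-encode it, mapping M to 1 and F to 2 (older scikit-learn releases of OneHotEncoder required integer input), and the One-Hot encoding then turns 1 into (1, 0) and 2 into (0, 1).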

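# Integer-encode gender
teenager_sns['gender'] = teenager_sns['gender'].map({'M': 1, 'F': 2, np.nan: 3})

enc = OneHotEncoder()

# Fit the OneHotEncoder on gender
enc.fit(teenager_sns[['gender']])

# categories_[0] holds the values seen during fitting for the first (only) feature;
# it replaces active_features_, which was removed in scikit-learn 0.22
print(enc.categories_[0])

# One-Hot encode the value 2 (F)
print(enc.transform(pd.DataFrame({'gender': [2]})).toarray())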

[1 2]
[[0. 1.]]
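For a quick look at One-Hot encoded data, pandas offers `get_dummies`, which works directly on the string labels without a prior integer encoding. This is shown as an alternative sketch on toy values, not as the method used in this case study.

```python
import pandas as pd

# Toy gender column (same first four values as the encoding example above)
gender = pd.Series(["M", "F", "M", "F"], name="gender")

# One indicator column per category, named <prefix>_<category>
dummies = pd.get_dummies(gender, prefix="gender", dtype=int)

print(dummies)
```

The columns come out in sorted category order (gender_F, gender_M). Unlike a fitted OneHotEncoder, `get_dummies` does not remember the categories, so it is less convenient when the same encoding must later be applied to new data.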

5.3 Binarization

We use scikit-learn's Binarizer to turn the friends column of teenager_sns into a binary feature:

# Binarization
# With threshold 3, values greater than 3 map to 1 and values less than or equal to 3 map to 0
scaler = Binarizer(threshold=3)
teenager_sns_binarizer = pd.DataFrame(scaler.fit_transform(teenager_sns[["friends"]]), columns=["friends_Binarized"])
teenager_sns_binarizer["friends"] = teenager_sns["friends"]
teenager_sns_binarizer.head()
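Binarization is a strict "greater than" comparison against the threshold, so a value equal to the threshold maps to 0. The sketch below checks this on toy values (hypothetical, chosen to include the boundary value 3) by comparing Binarizer with a plain NumPy comparison.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Toy friend counts, including the boundary value 3
friends = np.array([[7], [0], [69], [3], [4]])

binarized = Binarizer(threshold=3).fit_transform(friends)

# Equivalent manual rule: 1 if x > threshold, else 0
manual = (friends > 3).astype(int)

print(binarized.ravel())
print(np.array_equal(binarized, manual))
```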

6 Discretization

6.1 Equal-width discretization

The cut function in Pandas performs equal-width discretization:

# Equal-width discretization
data = teenager_sns[['friends']].copy()
k = 4

# Split into k equal-width bins, labelled 0, 1, 2, 3 in order
d = pd.cut(data['friends'], k, labels=range(k))
data['f_cut'] = d

# Frequency of each bin
print(data.f_cut.value_counts())

# Check that the binning is correct
print('Max friends:', max(teenager_sns.friends), '; min friends:', min(teenager_sns.friends))
data.head()

0    15502
1     6352
2     3005
3     1263
Name: f_cut, dtype: int64
Max friends: 103 ; min friends: 0

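Equal-width bins do not generally contain the same number of samples; because friends is right-skewed, the frequencies decrease from bin to bin. We can also verify the binning: splitting the range 0 to 103 into 4 bins gives a width of 25.75, so 69 falls into the third bin (51.5, 77.25], which carries label 2, and the assignment is correct.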
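The bin edges can be inspected directly instead of being computed by hand: with `retbins=True`, `pd.cut` also returns the edges it used. A minimal sketch on toy values spanning the same range, 0 to 103:

```python
import pandas as pd

# Toy friend counts covering the range 0..103
friends = pd.Series([0, 7, 69, 72, 103])

# retbins=True additionally returns the array of bin edges
codes, edges = pd.cut(friends, 4, labels=range(4), retbins=True)

print(edges)
print(codes.tolist())
```

The interior edges are 25.75, 51.5 and 77.25. The lowest edge is placed slightly below the minimum (by 0.1% of the range) so that the minimum value itself falls inside the first bin.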

6.2 Equal-frequency discretization

The qcut function in Pandas performs equal-frequency discretization:

# Equal-frequency discretization
data = teenager_sns[['friends']].copy()
k = 4
# Split at the quartiles into k bins, labelled 'A', 'B', 'C', 'D' in order
d = pd.qcut(data['friends'], k, labels=['A', 'B', 'C', 'D'])
data['f_qcut'] = d
# Frequency of each bin
print(data.f_qcut.value_counts())
data.head()

A    7003
D    6440
B    6354
C    6325
Name: f_qcut, dtype: int64

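The frequencies show that equal-frequency discretization yields bins of roughly equal size. They are not exactly equal because tied values always fall into the same bin.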
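The effect of ties on the bin sizes can be reproduced on a small example. In the sketch below (toy values, hypothetical), the value 3 occurs four times and sits on a quartile boundary, so `qcut` cannot split the data into four bins of exactly three values each.

```python
import pandas as pd

# Toy friend counts; the value 3 is repeated four times
friends = pd.Series([0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8])

# retbins=True additionally returns the quantile-based bin edges
groups, edges = pd.qcut(friends, 4, labels=["A", "B", "C", "D"], retbins=True)

print(edges)
print(groups.value_counts().sort_index())
```

The edges are the quartiles 2.75, 3 and 5.25 (plus the minimum and maximum), and the bin sizes come out as 3, 4, 2 and 3 instead of 3 each. If ties are heavy enough to make two edges identical, `qcut` raises an error unless `duplicates="drop"` is passed.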

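Output of teenager_sns_zscore.head() (section 4.1):

	friends_StandardScaled	friends
0	-0.737834	7
1	-1.018388	0
2	1.747072	69
3	-1.018388	0
4	1.867309	72

Output of teenager_sns_minmaxscore.head() (section 4.2):

	friends_MinMaxScaled	friends
0	0.067961	7
1	0.000000	0
2	0.669903	69
3	0.000000	0
4	0.699029	72

Output of teenager_sns.head() (section 1):

	gradyear	gender	age	friends	football	...	mall	shopping	death	drunk	drugs
0	2006	M	18.980	7	0	...	0	0	0	0	0
1	2006	F	18.801	0	1	...	1	0	0	0	0
2	2006	M	18.335	69	1	...	0	0	1	0	0
3	2006	F	18.875	0	0	...	0	0	0	0	0
4	2006	NaN	18.995	10	0	...	0	2	0	1	1

Rows with a missing age and the imputed column age_imputed (section 2):

	gradyear	gender	age	friends	...	shopping	drunk	age_imputed
5	2006	F	NaN	142	...	1	1	17.252429
13	2006	NaN	NaN	0	...	0	0	17.252429
15	2006	NaN	NaN	0	...	0	0	17.252429
16	2006	NaN	NaN	135	...	0	0	17.252429
26	2006	F	NaN	0	...	0	0	17.252429

Output of teenager_sns_binarizer.head() (section 5.3):

	friends_Binarized	friends
0	1	7
1	0	0
2	1	69
3	0	0
4	1	72

Output of data.head() after equal-width discretization (section 6.1):

	friends	f_cut
0	7	0
1	0	0
2	69	2
3	0	0
4	72	2

Output of data.head() after equal-frequency discretization (section 6.2):

	friends	f_qcut
0	7	B
1	0	A
2	69	D
3	0	A
4	72	D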