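# Predicting Gender from Names with Naive Bayes
3 sections, estimated time: **30 minutes**.
Open your computer and follow along step by step!
This tutorial is based on **Python 3.5**.
Original author: **[s3040608090](http://sofasofa.io/user_competition.php?id=1001216)**   |   Edited and proofread by: SofaSofa TeamC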
----
### 1. Conditional Probability and Bayes' Theorem
For events $A$ and $B$, the conditional probability of $A$ given that $B$ has occurred is
$$P(A|B) = \frac{P(AB)}{P(B)}.$$
Writing $P(AB)$ as $P(B|A)P(A)$ gives
$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}.$$
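As a quick numeric check of the formula above, here is a minimal sketch. The three input probabilities are made-up values chosen only for illustration; $P(B)$ is obtained from the law of total probability.

```python
# Hypothetical inputs, chosen only for illustration
p_a = 0.3              # P(A)
p_b_given_a = 0.8      # P(B|A)
p_b_given_not_a = 0.1  # P(B|not A)

# Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)
```

Observing $B$ raises the probability of $A$ from the prior 0.3 to roughly 0.77, because $B$ is much more likely when $A$ holds.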
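### 2. Naive Bayes
Naive Bayes is a classification algorithm based on Bayes' theorem. Its basic assumption is that all features are independent of one another given the label.
As an example, consider a binary classification problem in which each sample has only two binary features, $X\_1$ and $X\_2$. Given a sample $(X\_1=1, X\_2=0)$, predicting the probability that its label is 1 is equivalent to computing
$$P(Y=1|X\_1=1,X\_2=0)$$
By Bayes' theorem, $$P(Y=1|X\_1=1,X\_2=0)=\frac{P(Y=1)P(X\_1=1,X\_2=0|Y=1)}{P(X\_1=1, X\_2=0)}$$
Here $P(Y=1)$ is called the prior, $P(X\_1=1,X\_2=0|Y=1)$ is called the likelihood, and $P(X\_1=1, X\_2=0)$ is called the evidence.
Because we assume the features are independent given the label, the likelihood factorizes and $P(Y=1|X\_1=1,X\_2=0)$ can be written as
$$P(Y=1|X\_1=1,X\_2=0)=\frac{P(Y=1)P(X\_1=1|Y=1)P(X\_2=0|Y=1)}{P(X\_1=1, X\_2=0)}$$
More generally, suppose the data has $n$ features; then
$$P(Y|X\_1,X\_2,\cdots, X\_n)=\frac{1}{Z}P(Y)\prod\_{i=1}^nP(X\_i|Y)$$
where $Z$ is a normalizing factor (the evidence) that makes the probabilities sum to 1.
For a classification problem where only the predicted label is needed, $Z$ can be ignored because it does not depend on $y$, and we only need to solve
$$y\_{pred} = \arg\max\_y P(Y=y)\prod\_{i=1}^nP(X\_i|Y=y)$$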
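The decision rule above can be sketched on a tiny two-feature problem. The eight samples below are invented for illustration, and `fit`, `score` and `predict` are hypothetical helper names, not part of the tutorial's code.

```python
# Each sample is (x1, x2, y); the data is made up for illustration
samples = [
    (1, 0, 1), (1, 0, 1), (1, 1, 1), (0, 0, 1),
    (0, 1, 0), (0, 1, 0), (1, 1, 0), (0, 0, 0),
]

def fit(samples):
    """Estimate P(Y=y) and P(X_i=1|Y=y) by simple counting."""
    model = {}
    for y in (0, 1):
        rows = [s for s in samples if s[2] == y]
        prior = len(rows) / len(samples)
        p_one = [sum(r[i] for r in rows) / len(rows) for i in range(2)]
        model[y] = (prior, p_one)
    return model

def score(model, x, y):
    """Unnormalized posterior P(Y=y) * prod_i P(X_i=x_i|Y=y)."""
    prior, p_one = model[y]
    result = prior
    for xi, p in zip(x, p_one):
        result *= p if xi == 1 else 1 - p
    return result

def predict(model, x):
    """Return the label with the largest unnormalized posterior."""
    return max(model, key=lambda y: score(model, x, y))

model = fit(samples)
x = (1, 0)
scores = {y: score(model, x, y) for y in model}
z = sum(scores.values())  # normalizing factor Z
print(predict(model, x), scores[1] / z)
```

For the sample $(X\_1=1, X\_2=0)$ the unnormalized scores are 0.28125 for label 1 and 0.03125 for label 0, so the predicted label is 1, and dividing by $Z$ gives a posterior of 0.9.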
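### 3. Hands-on Practice
Next we use naive Bayes to make predictions on the data from "**[Machine Reads Chinese: Predicting Gender from Names](http://sofasofa.io/competition.php?id=3)**". First download the data and read it in.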
```python
# -*- coding: utf-8 -*-
import pandas as pd
from collections import defaultdict
import math
# Read the training set, the test set and the sample submission file
train = pd.read_csv('train.txt')
test = pd.read_csv('test.txt')
submit = pd.read_csv('sample_submit.csv')
```
Let's see what the training data looks like:
```python
train.head(10)
```
  
    
|   | id | name | gender |
|---|----|------|--------|
| 0 | 1 | 闳家 | 1 |
| 1 | 2 | 玉璎 | 0 |
| 2 | 3 | 于邺 | 1 |
| 3 | 4 | 越英 | 0 |
| 4 | 5 | 蕴萱 | 0 |
| 5 | 6 | 子颀 | 0 |
| 6 | 7 | 靖曦 | 0 |
| 7 | 8 | 鲁莱 | 1 |
| 8 | 9 | 永远 | 1 |
| 9 | 10 | 红孙 | 1 |

```python
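# Split the data into female (gender == 0) and male (gender == 1) parts
names_female = train[train['gender'] == 0]
names_male = train[train['gender'] == 1]
# totals stores the number of female and male names in the training set
totals = {'f': len(names_female),
          'm': len(names_male)}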
```
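For each gender, compute how often each character appears among all female (male) names. This step amounts to estimating
$P(X\_i|\text{female})$ and $P(X\_i|\text{male})$.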
```python
frequency_list_f = defaultdict(int)
for name in names_female['name']:
    for char in name:
        frequency_list_f[char] += 1. / totals['f']
        
frequency_list_m = defaultdict(int)
for name in names_male['name']:
    for char in name:
        frequency_list_m[char] += 1. / totals['m']
```
```python
print(frequency_list_f['娟'])
```
    0.004144009000562539
```python
print(frequency_list_m['钢'])
```
    0.0006299685015749209
These two examples show that $P(\text{name contains 娟}|\text{female})\approx 0.004144$ and $P(\text{name contains 钢}|\text{male})\approx 0.000630$.
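One subtlety: the loop in the tutorial adds one increment per character occurrence, so a name that repeats a character contributes twice and the value is strictly "occurrences per name" rather than "fraction of names containing the character". For most names the two coincide. If the stricter reading is wanted, a sketch such as the following counts each character at most once per name; `char_presence_freq` and the toy names are hypothetical and not part of the tutorial's code.

```python
from collections import Counter

def char_presence_freq(names):
    """Fraction of names that contain each character at least once."""
    counts = Counter()
    for name in names:
        # set(name) ensures a repeated character is counted once per name
        counts.update(set(name))
    total = len(names)
    return {ch: c / total for ch, c in counts.items()}

# Toy names, invented for illustration
toy_names = ["丽丽", "丽娟", "秀英", "娟"]
freq = char_presence_freq(toy_names)
print(freq["丽"], freq["娟"], freq["秀"])
```

Here 丽 appears three times in total but in only two of the four names, so its presence frequency is 0.5 rather than 0.75.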
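The test set may contain characters that never appear in the training set, so we need to apply Laplace smoothing to the frequencies ([what is Laplace smoothing](http://sofasofa.io/forum_main_post.php?postid=1001239)).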
```python
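def LaplaceSmooth(char, frequency_list, total, alpha=1.0):
    # Recover the raw count; .get avoids inserting unseen characters
    # into the defaultdict, which would silently change distinct_chars
    count = frequency_list.get(char, 0) * total
    distinct_chars = len(frequency_list)
    freq_smooth = (count + alpha) / (total + distinct_chars * alpha)
    return freq_smooth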
```
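To see what the smoothing does numerically, here is a small standalone sketch of the same formula, $(\text{count} + \alpha) / (\text{total} + \text{distinct\_chars} \cdot \alpha)$. The function `laplace_smooth` and the numbers 1000 and 500 are hypothetical values chosen for illustration, not statistics of the real data.

```python
def laplace_smooth(count, total, distinct_chars, alpha=1.0):
    """Smoothed frequency using the same formula as LaplaceSmooth."""
    return (count + alpha) / (total + distinct_chars * alpha)

total_names = 1000    # hypothetical number of names in one class
distinct_chars = 500  # hypothetical number of distinct characters seen

p_unseen = laplace_smooth(0, total_names, distinct_chars)  # never seen
p_seen = laplace_smooth(4, total_names, distinct_chars)    # seen 4 times
print(p_unseen, p_seen)
```

A character that was never seen gets a small but nonzero probability (1/1500 here), so its logarithm stays finite, while a character seen 4 times is shrunk from 4/1000 to 5/1500.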
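Recall the expression from Section 2: $$P(Y)\prod\_{i=1}^n P(X\_i|Y),$$
In gender prediction, the vast majority of the features of each sample are 0. For example, if only $X\_2=1$ and all other features are 0, then
$$y\_{pred}=\arg\max\_y P(Y=y)P(X\_2=1|Y=y)\frac{\prod\_{i=1}^n P(X\_i=0|Y=y)}{P(X\_2=0|Y=y)}$$
Since the probabilities $P(X\_i|Y)$ are usually small, we take the logarithm of the whole expression (to avoid floating-point underflow), which gives
$$\log P(Y=y)+\sum\_{i=1}^n\log P(X\_i=0|Y=y) +\log P(X\_2=1|Y=y) - \log P(X\_2=0|Y=y)$$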
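The underflow concern can be illustrated with a short sketch: multiplying many small probabilities collapses to exactly 0.0 in double precision, while the sum of their logarithms stays finite. The 100 probabilities of 1e-5 are an artificial example.

```python
import math

# 100 artificial small probabilities
probs = [1e-5] * 100

# Direct product: the true value 1e-500 is below the smallest double
product = 1.0
for p in probs:
    product *= p

# Sum of logarithms: the same quantity on a log scale
log_sum = sum(math.log(p) for p in probs)

print(product)  # underflows to 0.0
print(log_sum)  # finite, roughly -1151.29
```

Because the logarithm is monotonic, comparing log scores gives the same $\arg\max$ as comparing the original products.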
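If a name has two characters, say $X\_5=1$ and $X\_{10}=1$ with all other features equal to $0$, then the log-probability expression for that name is
$$\log P(Y=y)+\sum\_{i=1}^n\log P(X\_i=0|Y=y) $$
$$+\log P(X\_5=1|Y=y) - \log P(X\_5=0|Y=y)+\log P(X\_{10}=1|Y=y) - \log P(X\_{10}=0|Y=y)$$
For each gender, $\log P(Y=y)+\sum\_{i=1}^n\log P(X\_i=0|Y=y)$ only needs to be computed once. For convenience, we store these values in `bases`.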
```python
base_f = math.log(1 - train['gender'].mean())
base_f += sum([math.log(1 - frequency_list_f[char]) for char in frequency_list_f])
base_m = math.log(train['gender'].mean())
base_m += sum([math.log(1 - frequency_list_m[char]) for char in frequency_list_m])
bases = {'f': base_f, 'm': base_m}
```
The $\log P(X\_i=1|Y) - \log P(X\_i=0|Y)$ part is computed with the following function:
```python
def GetLogProb(char, frequency_list, total):
    freq_smooth = LaplaceSmooth(char, frequency_list, total)
    return math.log(freq_smooth) - math.log(1 - freq_smooth)
```
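The "base plus correction" decomposition used above can be checked numerically. The sketch below uses a made-up table of four per-character probabilities for a single class (the keys `a` to `d` stand in for characters and the class prior is omitted), and compares the direct sum of log-probabilities with the shortcut.

```python
import math

# Hypothetical P(X_i = 1 | Y = y) for a four-character vocabulary
p = {"a": 0.2, "b": 0.05, "c": 0.5, "d": 0.01}
present = {"b", "d"}  # characters that appear in the name

# Direct form: log P(X_i=1|y) if present, log P(X_i=0|y) otherwise
direct = sum(
    math.log(p[ch]) if ch in present else math.log(1 - p[ch])
    for ch in p
)

# Shortcut: precompute the all-zeros base once, then correct only
# the characters that are actually present
base = sum(math.log(1 - p[ch]) for ch in p)
shortcut = base + sum(
    math.log(p[ch]) - math.log(1 - p[ch]) for ch in present
)

print(math.isclose(direct, shortcut))
```

The two values agree up to floating-point rounding, which is why each name only costs a handful of additions once the base has been computed.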
Finally, we only need to combine the functions above to implement $$y\_{pred}=\arg\max\_y P(Y=y)P(X\_2=1|Y=y)\frac{\prod\_{i=1}^n P(X\_i=0|Y=y)}{P(X\_2=0|Y=y)}$$
```python
def ComputeLogProb(name, bases, totals, frequency_list_m, frequency_list_f):
    logprob_m = bases['m']
    logprob_f = bases['f']
    for char in name:
        logprob_m += GetLogProb(char, frequency_list_m, totals['m'])
        logprob_f += GetLogProb(char, frequency_list_f, totals['f'])
    return {'male': logprob_m, 'female': logprob_f}
def GetGender(LogProbs):
    return LogProbs['male'] > LogProbs['female']
result = []
for name in test['name']:
    LogProbs = ComputeLogProb(name, bases, totals, frequency_list_m, frequency_list_f)
    gender = GetGender(LogProbs)
    result.append(int(gender))
    
submit['gender'] = result
submit.to_csv('my_NB_prediction.csv', index=False)
```
The final result is written to `'my_NB_prediction.csv'`. Try uploading it to the [competition page](http://sofasofa.io/competition.php?id=3) to see how it scores.
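Before uploading, it can be useful to estimate accuracy locally, for instance by setting aside part of the labeled training data and comparing predictions against the known labels. The helper below is a minimal sketch of that comparison step; `accuracy` is a hypothetical function and the two label lists are toy values, not results from this model.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels, invented for illustration
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))
```

In practice `y_true` would come from the held-out rows of `train['gender']` and `y_pred` from running the classifier, fitted on the remaining rows, over the held-out names.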
Let's take a look at the predictions.
```python
test['pred'] = result
test.head(20)
```
  
    
|   | id | name | pred |
|---|----|------|------|
| 0 | 0 | 辰君 | 0 |
| 1 | 1 | 佳遥 | 0 |
| 2 | 2 | 淼剑 | 1 |
| 3 | 3 | 浩苳 | 1 |
| 4 | 4 | 俪妍 | 0 |
| 5 | 5 | 秉毅 | 1 |
| 6 | 6 | 妍艺 | 0 |
| 7 | 7 | 海防 | 1 |
| 8 | 8 | 壬尧 | 1 |
| 9 | 9 | 珞千 | 0 |
| 10 | 10 | 义元 | 1 |
| 11 | 11 | 才君 | 1 |
| 12 | 12 | 吉喆 | 1 |
| 13 | 13 | 少竣 | 1 |
| 14 | 14 | 创海 | 1 |
| 15 | 15 | 熙兰 | 0 |
| 16 | 16 | 家冬 | 1 |
| 17 | 17 | 方荧 | 1 |
| 18 | 18 | 介二 | 1 |
| 19 | 19 | 钰泷 | 1 |

----
**The complete code is as follows**:
```python
# -*- coding: utf-8 -*-
import pandas as pd
from collections import defaultdict
import math
# Read the training set, the test set and the sample submission file
train = pd.read_csv('train.txt')
test = pd.read_csv('test.txt')
submit = pd.read_csv('sample_submit.csv')
# Split the data into female (gender == 0) and male (gender == 1) parts
names_female = train[train['gender'] == 0]
names_male = train[train['gender'] == 1]
totals = {'f': len(names_female),
          'm': len(names_male)}
frequency_list_f = defaultdict(int)
for name in names_female['name']:
    for char in name:
        frequency_list_f[char] += 1. / totals['f']
        
frequency_list_m = defaultdict(int)
for name in names_male['name']:
    for char in name:
        frequency_list_m[char] += 1. / totals['m']
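def LaplaceSmooth(char, frequency_list, total, alpha=1.0):
    # Recover the raw count; .get avoids inserting unseen characters
    # into the defaultdict, which would silently change distinct_chars
    count = frequency_list.get(char, 0) * total
    distinct_chars = len(frequency_list)
    freq_smooth = (count + alpha) / (total + distinct_chars * alpha)
    return freq_smooth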
def GetLogProb(char, frequency_list, total):
    freq_smooth = LaplaceSmooth(char, frequency_list, total)
    return math.log(freq_smooth) - math.log(1 - freq_smooth)
def ComputeLogProb(name, bases, totals, frequency_list_m, frequency_list_f):
    logprob_m = bases['m']
    logprob_f = bases['f']
    for char in name:
        logprob_m += GetLogProb(char, frequency_list_m, totals['m'])
        logprob_f += GetLogProb(char, frequency_list_f, totals['f'])
    return {'male': logprob_m, 'female': logprob_f}
def GetGender(LogProbs):
    return LogProbs['male'] > LogProbs['female']
base_f = math.log(1 - train['gender'].mean())
base_f += sum([math.log(1 - frequency_list_f[char]) for char in frequency_list_f])
base_m = math.log(train['gender'].mean())
base_m += sum([math.log(1 - frequency_list_m[char]) for char in frequency_list_m])
bases = {'f': base_f, 'm': base_m}
result = []
for name in test['name']:
    LogProbs = ComputeLogProb(name, bases, totals, frequency_list_m, frequency_list_f)
    gender = GetGender(LogProbs)
    result.append(int(gender))
    
submit['gender'] = result
submit.to_csv('my_NB_prediction.csv', index=False)
```
