Machine Learning Notes

Machine Learning¶

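Introduction¶

Machine learning is a branch of artificial intelligence (AI). It gives computers new abilities:

Spam filtering
Data mining: web, medical, and biological data
Tasks that cannot be programmed by hand:
- Autonomous driving
- Handwriting recognition
- Natural language processing (NLP)
- Computer vision
Self-customizing services: recommender systems for all kinds of web services
Helping us understand human learning and real intelligence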

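Definition¶

Letting a computer learn by itself, without being explicitly programmed.
Performance on a task improves as experience accumulates:
- Task: computing a result, the hypothesis function
- Performance: error feedback, the cost function
- Experience: training samples, the training set $X \to y$

By feeding back the error, the model keeps fitting the training samples, refines its internal parameters, and performs the task better. Learn from the known to predict the unknown; study the past to understand the present.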

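Categories¶

Supervised learning: like studying for an exam, with correct answers available to guide training
- Regression: continuous output, e.g. house price estimation
- Classification: discrete output, e.g. tumor diagnosis
Unsupervised learning: exploration and discovery, with only facts and no right or wrong answers
- Clustering
  - News grouping
  - Gene sequencing data
  - Organizing computing clusters
  - Social network analysis
  - Market segmentation
  - Astronomical data analysis
- Cocktail party problem: noise removal and separation of sound sources (how is it done?)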

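Steps¶

The key is how to obtain a good model (by analogy with human mental models).

Computationalism: the universe is one great computation, and its model is the universe itself. Every theory is just a simplification of that original model from some angle and at some level.

What we call a god is a model with some kind of self-referential structure. To reach the divine is to attain an inner infinity through self-reference, and so approach the infinity of the outer universe. Self-reference can in turn be divided into levels and structures.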

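First choose the model family (the skeleton):
- Linear regression: for simple models, predicts continuous values
- Logistic regression: for classification, yields discrete values
- Polynomial regression: for more complex models, reducible to linear regression
- Neural network: nonlinear
Then fix the model architecture (the muscles):
- Regression and classification
  - Number of features
  - Polynomial features
  - Passing the result through the sigmoid function to get a logistic output
  - Nonlinear features?
- Nonlinear neural network
  - Number of layers
  - Number of units per layer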
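Finally, use the training data to determine the model weights (the texture):
1. Define the hypothesis function: $h_\theta(X)$
2. Find the cost function and its partial derivatives: $J(\theta)$, $\frac{\partial }{\partial \theta} J(\theta)$
3. Compute the parameters $\theta$ that minimize the total cost over the training set
  - Iterative approximation of the optimum:
    - Gradient descent and related algorithms
      - Conjugate gradient
      - BFGS
      - L-BFGS
    - Backpropagation (supplies the gradient for neural networks)
  - Analytic computation of the optimum:
    - Normal equation: set the partial derivatives to zero and solve for the extremum
    - Only suitable for simple regression models with modest amounts of data
    - $X^T X$ is not invertible when:
      - There are too many features, more than the number of samples
      - Features are not linearly independent and influence one another
    - Machine-learning Newton's laws from astronomical observation data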
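The analytic route can be sketched in a few lines of NumPy. This is a minimal illustration on made-up data, not the exercise data; the duplicated feature column and the explicit tolerances are assumptions chosen to show what happens when $X^T X$ loses rank.

```python
import numpy as np


def normal_equation(X, y, rcond=1e-10):
    """Least-squares theta via the pseudo-inverse of X^T X.

    rcond is an assumed cutoff: singular values below rcond * largest
    are treated as zero, which keeps a singular X^T X usable.
    """
    return np.linalg.pinv(X.T @ X, rcond=rcond) @ X.T @ y


# Toy data (made up for illustration): y = 1 + 2*x exactly.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
X = np.column_stack([np.ones_like(x), x])   # bias column + one feature
theta = normal_equation(X, y)

# Redundant feature: the third column is 2*x, so X^T X is singular.
X_red = np.column_stack([X, 2.0 * x])
rank = np.linalg.matrix_rank(X_red.T @ X_red, tol=1e-8)
theta_red = normal_equation(X_red, y)       # still a least-squares solution
pred_red = X_red @ theta_red
```

Using the pseudo-inverse rather than a plain inverse is what keeps the second call usable: the redundant column makes $X^T X$ singular, yet the predictions still match the samples.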

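The skeleton is the foundation; the texture is only the surface. If the skeleton is wrong, nothing else matters. Nonlinear networks are general-purpose but complex.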

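Overfitting¶

The model fits the training samples very closely, but its performance drops markedly on new samples. High scores, low ability: knowledge that cannot guide practice is not real knowledge.

Cause: the samples contain mistakes, measurement error, special cases, and noise. Hugging the samples too closely distorts the model, which then learns the mistakes and noise as well. Events unfold through both chance and necessity; a model should capture the necessity that governs them and leave out the chance.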

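How to reduce it:

Reduce the number of features
- Select features by hand
- Use a model selection algorithm
Regularization
- Keep all features, but penalize large parameter values so that the parameters stay small and the model does not over-match the samples
- If many features each contribute a little to the result, dropping features is not an option; regularization works well in this case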

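An overfitted curve is wiggly, usually because $|\theta|$ is too large. Adding $\theta^2$ to the cost function penalizes large $\theta$ and avoids this (by convention the bias $\theta_0$ is not penalized).

$$ J_{R}(\theta) = J(\theta) + \lambda \sum_{j \geq 1} \theta_j^2$$

Too much is as bad as too little: if the penalty is too heavy, the model underfits instead (low scores, low ability). The whole art lies in choosing $\lambda$.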
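The effect of $\lambda$ can be checked numerically. The sketch below assumes a squared-error cost $J(\theta) = \frac{1}{2m}\sum (h_\theta(x) - y)^2$, leaves the bias $\theta_0$ unpenalized, and uses synthetic noisy samples with degree-8 polynomial features; the helper name `fit_regularized` is hypothetical. Setting the gradient of $J_R$ to zero gives the linear system $(X^T X + 2 m \lambda D)\,\theta = X^T y$, where $D$ is the identity with its first diagonal entry zeroed.

```python
import numpy as np


def fit_regularized(X, y, lam):
    """Minimize (1/2m)*||X theta - y||^2 + lam * sum(theta[1:]**2) in closed form."""
    m, n = X.shape
    D = np.eye(n)
    D[0, 0] = 0.0                      # do not penalize the bias term
    return np.linalg.solve(X.T @ X + 2.0 * m * lam * D, X.T @ y)


# Synthetic samples (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 12)
y = np.sin(np.pi * x) + 0.2 * rng.standard_normal(x.size)
X = np.vander(x, 9, increasing=True)   # columns 1, x, x^2, ..., x^8

lams = [0.0, 1e-3, 1e-1, 10.0]
# Size of the penalized weights for each lambda.
norms = [np.linalg.norm(fit_regularized(X, y, lam)[1:]) for lam in lams]
```

The weight norm shrinks as $\lambda$ grows; at the largest value the fit is close to a constant, which is the underfitting end of the trade-off.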

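Neural Networks¶

Properties of the brain's neural circuits:

One general-purpose algorithm: hearing, speaking, reading, writing... new skills are picked up with ease
Built from a huge number of interconnected neurons: fault tolerance and redundancy
Multiple inputs, multiple outputs
A threshold separates the activated from the non-activated state: the sigmoid function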

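The artificial neural network algorithm:

Inputs and outputs are linked through matrices
The hidden layers in between are feature layers that extract abstract features from the raw data
Forward propagation: computes $h_\theta(x)$
Backpropagation: computes $\frac{\partial }{\partial \theta} J(\theta)$
Parameter unrolling: flatten the two-dimensional parameter matrices into a one-dimensional parameter vector
Gradient checking: approximate the partial derivatives numerically to verify backpropagation
Random initialization of $\theta$:
- The result is a local optimum that depends on the starting point
- Breaks the symmetry between hidden units (identical initial weights would stay identical)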
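The pieces listed above fit together as follows. This is a minimal sketch under stated assumptions, not the course implementation: a network with one hidden layer and a single sigmoid output, a cross-entropy cost, made-up data, and hypothetical helper names (`unroll`, `reshape_params`, `cost_and_grad`, `numerical_grad`).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def unroll(theta1, theta2):
    """Parameter unrolling: flatten both weight matrices into one vector."""
    return np.concatenate([theta1.ravel(), theta2.ravel()])


def reshape_params(params, n_in, n_hidden):
    split = n_hidden * (n_in + 1)
    theta1 = params[:split].reshape(n_hidden, n_in + 1)
    theta2 = params[split:].reshape(1, n_hidden + 1)
    return theta1, theta2


def cost_and_grad(params, X, y, n_in, n_hidden):
    theta1, theta2 = reshape_params(params, n_in, n_hidden)
    m = X.shape[0]
    # Forward propagation (a column of ones is the bias unit of each layer).
    a1 = np.hstack([np.ones((m, 1)), X])
    a2 = np.hstack([np.ones((m, 1)), sigmoid(a1 @ theta1.T)])
    h = sigmoid(a2 @ theta2.T)
    cost = (-y * np.log(h) - (1 - y) * np.log(1 - h)).mean()
    # Backpropagation: output error, then hidden error without the bias unit.
    d3 = h - y
    d2 = (d3 @ theta2)[:, 1:] * a2[:, 1:] * (1 - a2[:, 1:])
    grad2 = d3.T @ a2 / m
    grad1 = d2.T @ a1 / m
    return cost, unroll(grad1, grad2)


def numerical_grad(f, params, eps=1e-5):
    """Gradient checking: central differences, one parameter at a time."""
    grad = np.zeros_like(params)
    for i in range(params.size):
        step = np.zeros_like(params)
        step[i] = eps
        grad[i] = (f(params + step) - f(params - step)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
n_in, n_hidden = 2, 3
X = rng.standard_normal((5, n_in))                 # made-up inputs
y = np.array([[0.0], [1.0], [1.0], [0.0], [1.0]])  # made-up labels
# Small random initial weights, so hidden units do not start identical.
params = rng.uniform(-0.5, 0.5, n_hidden * (n_in + 1) + (n_hidden + 1))

cost, grad = cost_and_grad(params, X, y, n_in, n_hidden)
num = numerical_grad(lambda p: cost_and_grad(p, X, y, n_in, n_hidden)[0], params)
max_diff = np.abs(grad - num).max()
```

If backpropagation is implemented correctly, `max_diff` is tiny. The numerical gradient needs two cost evaluations per parameter, so it is only practical as a one-off check before training.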

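Tips¶

Too much is as bad as too little; staying balanced is the craft.

Learning rate $\alpha$
- Too large: likely to diverge
- Too small: slow to converge
- Plot the cost to check visually, and try values of $\alpha$ from small to large: ..., 0.01, 0.1, 1, ...
Degree of fit
- Overfitting is bad: the training samples contain noise and mistakes, and the model learns the good along with the bad
- Underfitting is bad: a model that cannot even explain the known data is no use for prediction
Feature normalization:
- Feature scaling: $ -1 \leq X \leq 1$
- Mean normalization: roughly $ -0.5 \leq X \leq 0.5$, $\bar{X} = 0$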
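Avoid loops. Batch and vectorize for performance, using a vector and matrix library such as NumPy
Only a convex function guarantees that the global optimum is found. For logistic regression, a cost built from $\log$ replaces the non-convex squared-error cost
Nonlinear models can generally only be driven to a local optimum
The sigmoid function extends linear regression to logistic regression, for classification
Adding $X^n$ to X extends linear regression to polynomial regression
Choose constant factors that simplify the algebra. For example, in $\frac{1}{2m}\sum{(h(X)-y)^2}$ the $\frac{1}{2}$ cancels the 2 produced by differentiation
Why add the bias term $X_0 = 1$? It lets the intercept $\theta_0$ be handled inside the same product $\theta^T x$
One-vs-all extends binary classification to multiple classes: predict $\arg\max_i\ h_\theta^{(i)}(x)$
Numerical verification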
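Feature normalization is easy to get subtly wrong, so here is a minimal vectorized sketch. It assumes mean normalization divides by the column range (dividing by the standard deviation is a common alternative); the sample values and the helper name `mean_normalize` are made up for illustration.

```python
import numpy as np


def mean_normalize(X):
    """Per column: subtract the mean, then divide by (max - min).

    Assumes no column is constant; a zero range would divide by zero.
    """
    mu = X.mean(axis=0)
    span = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / span, mu, span


# Made-up samples: two features on very different scales.
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])
X_norm, mu, span = mean_normalize(X)

# A new sample must be transformed with the training mu and span,
# not with statistics computed from the new sample itself.
x_new_norm = (np.array([1650.0, 3.0]) - mu) / span
```

After the transform every column has mean 0 and a range of exactly 1, so all features contribute on a comparable scale and gradient descent tolerates a single learning rate.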

Exercises¶

Implemented with NumPy, matplotlib, SymPy, and other Python scientific libraries

In [1]:

from __future__ import unicode_literals
from __future__ import division
%matplotlib inline
from IPython.display import display, Math
import numpy as np
import scipy.optimize as op
import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

import sympy as sp

sp.init_printing(use_latex=True, use_unicode=True)
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS']
plt.rcParams['font.size'] = 14
plt.rcParams['axes.unicode_minus'] = False
plt.rcParams['figure.figsize'] = 8, 6

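Linear Regression¶

Fit the training set $X \to y$ with a linear function,
giving the linear model $h_\theta(x) = \theta^T x$ for predicting new samples.
The cost function $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m} (h_\theta(x_i) - y_i)^2$ serves as the performance measure; the weights $\theta$ are the values that minimize it
- Find the optimum iteratively with gradient descent or a similar method; plotting the cost makes convergence easy to debug
- Compute the optimum directly with the normal equation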
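For reference, differentiating the cost gives the update that gradient descent repeats. The sample index $i$, the feature index $j$, and $x_{ij}$ for the $j$-th feature of sample $i$ are notation assumed here:

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m} (h_\theta(x_i) - y_i)\, x_{ij}, \qquad \theta_j := \theta_j - \alpha \frac{\partial J}{\partial \theta_j}$$

Setting the same gradient to zero for all $j$ at once, $X^T (X\theta - y) = 0$, and solving for $\theta$ yields the normal equation (when $X^T X$ is invertible):

$$\theta = (X^T X)^{-1} X^T y$$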

In [2]:

def ex1(data_file_path):
    data = np.loadtxt(data_file_path, delimiter=',')
    x, y = data[:,0:-1], data[:, -1:]
    bias = np.ones((x.shape[0], 1))
    x_ = np.hstack((bias, x))
    theta = np.zeros(x_.shape[1])

    # Algorithms
    def h(x, theta):
        '''hypothesis'''
        return np.dot(x, theta).reshape(-1, 1)


    def J(x, y, theta):
        '''cost'''
        return np.power(h(x, theta) - y, 2).mean()/2


    def gradient_descent(x, y, theta, alpha, iterations):
        thetas = [theta]
        costs = []
        for i in range(iterations):
            delta = alpha * np.multiply(h(x, theta)-y, x).mean(0)
            theta = theta - delta
            thetas.append(theta)
            cost = J(x, y, theta)
            costs.append(cost)
        return thetas, costs


    def normal_equation(x, y):
        '''closed-form least squares: theta = pinv(x^T x) x^T y'''
        return np.linalg.pinv(x.T @ x) @ x.T @ y

    # Plot the results
    thetas, costs = gradient_descent(x_, y, theta, 0.01, 1500)
    theta_gradient_descent = thetas[-1]
    theta_normal_equation = normal_equation(x_, y)
    theta_minimize = op.minimize(lambda theta: J(x_, y, theta), [0, 0])['x']

    plt.plot(costs)
    plt.title('Cost convergence over iterations')
    plt.xlabel('Iterations')
    plt.ylabel('Cost')
    plt.show()


    display(Math(r'$\theta_{\text{gradient descent}} = \begin{bmatrix}%.2f\\%.2f\end{bmatrix}\
                 ,\ \theta_{\text{normal equation}} = \begin{bmatrix}%.2f\\%.2f\end{bmatrix}\
                 ,\ \theta_{\text{BFGS}} = \begin{bmatrix}%.2f\\%.2f\end{bmatrix}$' % 
                 (theta_gradient_descent[0], theta_gradient_descent[1], 
                  theta_normal_equation[0, 0], theta_normal_equation[1, 0],
                  theta_minimize[0], theta_minimize[1])))

    y_35 = h([1, 3.5], theta_gradient_descent)[0, 0]
    y_7 = h([1, 7], theta_gradient_descent)[0, 0]

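    plt.plot(x, y, 'x', label='Samples')
    px = range(2, 30, 2)
    py = [h(np.array([1, i]), theta_gradient_descent)[0, 0] for i in px]
    plt.plot(px, py, label='Linear regression model')
    plt.plot([3.5, 7], [y_35, y_7], 'o', label='Predictions')
    plt.title('Linear regression')
    # x holds the city population, y the food truck profit
    plt.xlabel('City population (10,000s)')
    plt.ylabel('Food truck profit (10,000s)')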
    plt.legend(loc=4)
    plt.show()
    display(Math(r'h(%s) = %.2f' % (3.5, y_35)))
    display(Math(r'h(%s) = %.2f' % (7, y_7)))

    # Plot the cost function
    theta0s = np.linspace(-10, 10, 100)
    theta1s = np.linspace(-1, 4, 100)
    theta0s, theta1s = np.meshgrid(theta0s, theta1s)
    m, n = theta0s.shape
    hs = np.array([J(x_, y, [theta0s[i, j], theta1s[i, j]]) 
                   for i in range(m) for j in range(n)]).reshape(m, n)

    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.plot_surface(theta0s, theta1s, hs,
                    rstride=3, cstride=3, cmap=mpl.cm.jet, linewidth=0.3, antialiased=True)
    plt.xlabel(r'$\theta_0$')
    plt.ylabel(r'$\theta_1$')
    plt.title(r'Cost function $J(\theta)$')
    plt.show()

    plt.contour(theta0s, theta1s, hs, np.logspace(-2, 3, 30))
    plt.xlabel(r'$\theta_0$')
    plt.ylabel(r'$\theta_1$')
    plt.title(r'Cost function contours')
    plt.plot([i[0] for i in thetas], [i[1] for i in thetas], 'r-')
    plt.plot(theta_gradient_descent[0], theta_gradient_descent[1], 'rx')
    plt.plot(theta_normal_equation[0, 0], theta_normal_equation[1, 0], 'k+')
    plt.plot(theta_minimize[0], theta_minimize[1], 'yx')
    plt.show()


ex1('ex1data1.txt')

$$\theta_{\text{gradient descent}} = \begin{bmatrix}-3.63\\1.17\end{bmatrix} ,\ \theta_{\text{normal equation}} = \begin{bmatrix}-3.90\\1.19\end{bmatrix} ,\ \theta_{\text{BFGS}} = \begin{bmatrix}-3.90\\1.19\end{bmatrix}$$

$$h(3.5) = 0.45$$

$$h(7) = 4.53$$

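Logistic Regression¶

Extending linear regression with the sigmoid function gives a simple binary classification model.
The one-vs-all strategy extends it to classification with more than two classes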
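The one-vs-all idea can be sketched as follows: train one binary logistic classifier per class, then predict the class whose classifier reports the highest probability. The data here is synthetic (three well-separated clusters), plain gradient descent stands in for whichever optimizer is preferred, and the helper names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_binary(X, y, alpha=0.1, iterations=2000):
    """Logistic regression for one class, fitted by batch gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        theta -= alpha * X.T @ (sigmoid(X @ theta) - y) / len(y)
    return theta


def one_vs_all(X, labels, classes):
    """One row of weights per class: class c against everything else."""
    return np.array([train_binary(X, (labels == c).astype(float))
                     for c in classes])


def predict(thetas, X, classes):
    """Pick the class whose classifier gives the highest probability."""
    return classes[np.argmax(sigmoid(X @ thetas.T), axis=1)]


# Synthetic data (an assumption for illustration): three clusters in the plane.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
classes = np.arange(3)
labels = np.repeat(classes, 30)
points = centers[labels] + 0.5 * rng.standard_normal((90, 2))
X = np.hstack([np.ones((90, 1)), points])      # prepend the bias column

thetas = one_vs_all(X, labels, classes)
accuracy = (predict(thetas, X, classes) == labels).mean()
```

Each classifier only has to separate its own class from the rest, so the binary model is reused unchanged; the cost is training one model per class.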

In [3]:

z = sp.symbols('z')
sigmoid = 1/(1+sp.exp(-z))
sp.plotting.plot(sigmoid, (z, -8, 8))
display(sigmoid)

$$\frac{1}{1 + e^{- z}}$$

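The value of this function lies between 0 and 1 and is read as the predicted probability of the class. The cost function applies $\log$ to it so that the cost is convex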
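Written out, with $x_i$, $y_i$ denoting the $i$-th of $m$ training samples and $x_{ij}$ its $j$-th feature (notation assumed here), the logistic cost is:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\, y_i \log h_\theta(x_i) + (1 - y_i)\log\bigl(1 - h_\theta(x_i)\bigr) \right], \qquad h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

Its gradient has the same form as in linear regression; only $h_\theta$ differs:

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m} \bigl(h_\theta(x_i) - y_i\bigr)\, x_{ij}$$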

In [4]:

def ex2(data_file_path):
    data = np.loadtxt(data_file_path, delimiter=',')
    x, y = data[:,0:-1], data[:, -1:]
    bias = np.ones((x.shape[0], 1))
    x_ = np.hstack((bias, x))
    theta = np.zeros(x_.shape[1])


    # Algorithms
    def sigmoid(z):
        return 1 / (1+np.exp(-z))

    def h(x, theta):
        return sigmoid(np.dot(x, theta).reshape(-1, 1))

    def J(x, y, theta):
        h_val = h(x, theta)
        return (-y * np.log(h_val) - (1-y)*np.log(1-h_val)).mean()

    def gradient(x, y, theta):
        return np.multiply(h(x, theta)-y, x).mean(0)

    print('Initial cost %.3f' % J(x_, y, theta))

    cost = lambda t: J(x_, y, t)
    gradient_ = lambda t: gradient(x_, y, t)

    # TODO: make it work with other minimization algorithms
    #op.minimize(cost, theta, method='TNC', jac=gradient_)
    #print(gradient_(theta))
    #print('fmin_cg()\n', op.fmin_cg(cost, theta, fprime=gradient_))
    #op.check_grad(cost, gradient_, theta)

    theta_fmin = op.fmin(cost, theta, disp=False)
    display(Math(r'\theta_{fmin} = \begin{bmatrix}%.3f\\%.3f\\%.3f\end{bmatrix}' % tuple(theta_fmin)))

    rejected = np.where(y == 0)[0]
    admitted = np.where(y != 0)[0]
    plt.plot(x[rejected, 0], x[rejected, 1], 'yo', label='Rejected')
    plt.plot(x[admitted, 0], x[admitted, 1], 'k+', label='Admitted')
    plt.xlabel('Exam A score')
    plt.ylabel('Exam B score')
    plot_x = np.array([x[:,0].min()-2, x[:,0].max()+2])
    plot_y =  (theta_fmin[0] + theta_fmin[1]*plot_x) / -theta_fmin[2]
    plt.plot(plot_x, plot_y, label='Decision boundary') # TODO how to plot the decision boundary in general?
    plt.legend(loc=3)
    plt.show()

    print('Admission probability for a candidate scoring 45 and 85 on the two exams: {:.1%}'.format(h([1, 45, 85], theta_fmin)[0][0]))
    print('Overall accuracy on the training set: {:.1%}'.format(((h(x_, theta_fmin) >= 0.5) == y).mean()))


ex2('ex2data1.txt')

Initial cost 0.693

$$\theta_{fmin} = \begin{bmatrix}-25.161\\0.206\\0.201\end{bmatrix}$$

Admission probability for a candidate scoring 45 and 85 on the two exams: 77.6%
Overall accuracy on the training set: 89.0%
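The exercise falls back to the derivative-free `op.fmin` and leaves gradient-based minimizers as a TODO. Why the earlier attempt failed is not recorded; one plausible cause (an assumption, not a diagnosis) is that unscaled scores push the sigmoid to exactly 0 or 1, so the cost evaluates `log(0)`. The sketch below shows one way to make a gradient-based method work under that assumption: standardize the features and write the cost with `np.logaddexp`. It uses synthetic scores, not `ex2data1.txt`.

```python
import numpy as np
import scipy.optimize as op


def cost(theta, X, y):
    z = X @ theta
    # log(1 + e^z) - y*z equals the cross-entropy cost, without taking log(0).
    return (np.logaddexp(0.0, z) - y * z).mean()


def gradient(theta, X, y):
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (h - y) / len(y)


# Synthetic exam scores (made up); noisy labels keep the classes overlapping.
rng = np.random.default_rng(0)
scores = rng.uniform(30.0, 100.0, size=(100, 2))
noise = 10.0 * rng.standard_normal(100)
y = ((scores.sum(axis=1) + noise) > 130.0).astype(float)

# Standardize the features, then prepend the bias column.
mu, sigma = scores.mean(axis=0), scores.std(axis=0)
X = np.hstack([np.ones((100, 1)), (scores - mu) / sigma])

theta0 = np.zeros(3)
# Difference between the analytic and finite-difference gradients.
grad_error = op.check_grad(cost, gradient, theta0, X, y)
result = op.minimize(cost, theta0, args=(X, y), jac=gradient, method='BFGS')
```

Because the features were standardized, `result.x` applies to standardized scores; a new sample has to be transformed with the same `mu` and `sigma` before prediction.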

道可叨

Free Will