Machine Learning: Deriving the Support Vector Machine (SVM)


1. Mathematical Derivation of the Support Vector Machine Model

1.1. Hyperplanes

We begin with two simple examples, in two and three dimensions:

$$\text{2D:}\quad w_1x_1+w_2x_2+b=0 \qquad\qquad \text{3D:}\quad w_1x_1+w_2x_2+w_3x_3+b=0$$

Generalizing, a hyperplane in $\mathbb{R}^d$ is the set of points satisfying

$$w^Tx+b=0,\qquad w=\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{pmatrix}_{d\times1},\quad x=\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix}_{d\times1}$$

A hyperplane $H$ has the following three properties:

1. The distance from the origin to the hyperplane is

$$dist(0,H)=\frac{|b|}{||w||}$$

2. The sign of $w^Tx_0+b$ tells us on which side of the hyperplane an arbitrary point $x_0$ lies:

$$w^Tx_0+b>0\Longrightarrow \text{above } H \qquad w^Tx_0+b<0\Longrightarrow \text{below } H \qquad w^Tx_0+b=0\Longrightarrow \text{on } H$$

3. The distance from an arbitrary point $x_0$ to the hyperplane is

$$dist(x_0,H)=\frac{|w^Tx_0+b|}{||w||}$$
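These three properties are easy to check numerically. Below is a minimal sketch for a hyperplane in $\mathbb{R}^2$; the helper names `dist_to` and `side_of` are our own, chosen for illustration:

```python
import numpy as np

w = np.array([3.0, 4.0])  # normal vector of the hyperplane w^T x + b = 0
b = 5.0

def dist_to(x0, w, b):
    """Distance from point x0 to the hyperplane: |w^T x0 + b| / ||w||."""
    return abs(w @ x0 + b) / np.linalg.norm(w)

def side_of(x0, w, b):
    """Sign of w^T x0 + b: +1 above H, -1 below H, 0 on H."""
    return np.sign(w @ x0 + b)

origin = np.zeros(2)
print(dist_to(origin, w, b))                # |b|/||w|| = 5/5 = 1.0
print(side_of(np.array([1.0, 1.0]), w, b))  # 3 + 4 + 5 = 12 > 0  ->  +1.0
```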

1.2. The Binary Classification Problem

We now face a binary classification problem, with dataset

$$D=\{(x_i,y_i)\mid x_i\in\mathbb{R}^d,\ y_i=\pm1\}$$

If there exists a hyperplane $w^Tx+b=0$ that classifies all points correctly, then

$$f^*(x)=sgn(w^Tx+b)=\begin{cases} +1 & (w^Tx+b>0) \\ -1 & (w^Tx+b<0) \end{cases}$$

The parameters to optimize are $w$ and $b$. Define the $margin$ as the distance from the hyperplane to its closest point. We want to maximize this distance, subject to a constraint guaranteeing that the hyperplane classifies every point correctly. Minimizing a squared-error criterion would be pointless here: as soon as the constraint is satisfied, that error is already zero, so it cannot distinguish among the many hyperplanes that separate the data. Our optimization problem is therefore

$$\begin{cases} \max_{w,b}\min_{i=1,2,\cdots,N}\frac{|w^Tx_i+b|}{||w||} \\ s.t.\quad y_i(w^Tx_i+b)>0\quad \forall i \end{cases} \qquad (1)$$
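To make objective $(1)$ concrete, here is a minimal sketch that evaluates the geometric margin of a candidate $(w,b)$ and checks the feasibility constraint; the toy data and the helper `geometric_margin` are made up for illustration:

```python
import numpy as np

# Made-up toy data for illustration only.
X = np.array([[1.0, 2.0],
              [2.0, 3.0],
              [3.0, 1.0]])
y = np.array([1, 1, -1])

def geometric_margin(w, b, X, y):
    """Objective of (1): min_i |w^T x_i + b| / ||w||,
    or None if some point violates y_i (w^T x_i + b) > 0."""
    scores = X @ w + b
    if np.any(y * scores <= 0):   # constraint violated: not a separating hyperplane
        return None
    return np.min(np.abs(scores)) / np.linalg.norm(w)

# A feasible candidate: the line x2 = x1, margin = 1/sqrt(2).
print(geometric_margin(np.array([-1.0, 1.0]), 0.0, X, y))
```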

1.3. Mathematical Derivation of the SVM Model [1]

First, transform the constraint. For any correctly classifying $(w,b)$,

$$\begin{aligned} y_i(w^Tx_i+b)&\geq\min_{i=1,2,\cdots,N} y_i(w^Tx_i+b) \\ &= \min_{i=1,2,\cdots,N}|w^Tx_i+b| \end{aligned}$$

Let

$$\lambda=\frac{1}{\min_{i=1,2,\cdots,N}|w^Tx_i+b|}$$

Then the objective function satisfies

$$\begin{aligned} \max_{w,b}\min_{i=1,2,\cdots,N}\frac{|w^Tx_i+b|}{||w||} &= \max_{w,b}\frac{1}{\lambda||w||}\,\lambda\min_{i=1,2,\cdots,N}|w^Tx_i+b| \\ &= \max_{w,b}\frac{1}{\lambda||w||} \end{aligned}$$

At this stage, the optimization problem has become

$$\begin{cases} \max_{w,b}\frac{1}{\lambda||w||} \\ s.t.\quad y_i(w^Tx_i+b)\geq\frac{1}{\lambda}\quad \forall i \end{cases} \qquad (2)$$

Next, define the rescaled parameters

$$\tilde{w}=\lambda w,\qquad \tilde{b}=\lambda b$$

so that

$$\begin{cases} \max_{\tilde{w},\tilde{b}}\frac{1}{||\tilde{w}||} \\ s.t.\quad y_i(\tilde{w}^Tx_i+\tilde{b})\geq 1\quad \forall i \end{cases} \qquad (3)$$

Dropping the tildes, the problem becomes

$$\begin{cases} \max_{w,b}\frac{1}{||w||} \\ s.t.\quad y_i(w^Tx_i+b)\geq 1\quad \forall i \end{cases} \qquad (4)$$

which is equivalent to

$$\begin{cases} \min_{w,b}\frac{1}{2}||w||^2 \\ s.t.\quad 1-y_i(w^Tx_i+b)\leq 0\quad \forall i \end{cases} \qquad (5)$$

Problem $(5)$ is our constrained original problem, the $primal\ problem$. Applying the method of Lagrange multipliers with

$$\begin{cases} z=(w,b) \\ F(z)=\frac{1}{2}||w||^2 \\ G_i(z)=1-y_i(w^Tx_i+b) \end{cases}$$

gives the Lagrangian

$$L(w,b,\mu) = \frac{1}{2}||w||^2+\sum_{i=1}^{N}\mu_i\big(1-y_i(w^Tx_i+b)\big)$$

and the unconstrained form of the primal problem

$$\begin{cases} \min_{w,b}\max_{\mu}L(w,b,\mu) \\ s.t.\quad \mu_i\geq 0\quad \forall i \end{cases} \qquad (6)$$

Since the objective is a convex quadratic function and the constraints are linear, the conditions for strong duality are satisfied. The dual problem is

$$\begin{cases} \max_{\mu}\min_{w,b}L(w,b,\mu) \\ s.t.\quad \mu_i\geq 0\quad \forall i \end{cases} \qquad (7)$$

For the inner minimization we set the partial derivatives of $L$ to zero:

$$\frac{\partial{L(w,b,\mu)}}{\partial b} = \frac{\partial\Big(\sum_{i=1}^{N}\mu_i\big(1-y_i(w^Tx_i+b)\big) \Big)}{\partial b} = 0 \qquad (8)$$

$$\frac{\partial{L(w,b,\mu)}}{\partial w} = \frac{\partial{\big(\frac{1}{2}w^Tw+\sum_{i=1}^{N}\mu_i-\sum_{i=1}^{N}\mu_iy_iw^Tx_i-\sum_{i=1}^{N}\mu_iy_ib\big)}}{\partial w} = 0 \qquad (9)$$

Equation $(8)$ gives

$$\sum_{i=1}^{N}\mu_iy_i = 0 \qquad (10)$$

Substituting $(10)$ into $(9)$ gives

$$\frac{\partial L}{\partial w} = w-\sum_{i=1}^{N}\mu_iy_ix_i = 0 \ \Longrightarrow\ w = \sum_{i=1}^{N}\mu_iy_ix_i \qquad (11)$$

Substituting $(10)$ and $(11)$ back into $L(w,b,\mu)$:

$$\begin{aligned} L(w,b,\mu) &= \frac{1}{2}\Big(\sum_{i=1}^{N}\mu_iy_ix_i\Big)^T\Big(\sum_{j=1}^{N}\mu_jy_jx_j\Big) - \sum_{i=1}^{N}\mu_iy_i\Big(\sum_{j=1}^{N}\mu_jy_jx_j\Big)^Tx_i + \sum_{i=1}^{N}\mu_i \\ &= \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\mu_i\mu_jy_iy_jx_i^Tx_j - \sum_{i=1}^{N}\sum_{j=1}^{N}\mu_i\mu_jy_iy_jx_i^Tx_j + \sum_{i=1}^{N}\mu_i \\ &= -\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\mu_i\mu_jy_iy_jx_i^Tx_j + \sum_{i=1}^{N}\mu_i \end{aligned}$$

The optimization problem therefore becomes

$$\begin{cases} \max_{\mu}-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\mu_i\mu_jy_iy_jx_i^Tx_j + \sum_{i=1}^{N}\mu_i \\ s.t. \quad \mu_i\geq 0\quad \forall i \\ s.t. \quad \sum_{i=1}^{N}\mu_iy_i = 0 \end{cases} \qquad (12)$$

Setting the gradient of the dual objective to zero (setting the constraints of $(12)$ aside for the moment, as the worked example below does):

$$\frac{\partial{\Big(-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\mu_i\mu_jy_iy_jx_i^Tx_j + \sum_{i=1}^{N}\mu_i\Big)}}{\partial \mu} = 0 \qquad (13)$$

Solving $(13)$ yields $\hat{\mu}$, and substituting it into $(11)$ yields $\hat{w}$. Moreover, by complementary slackness, at least one training point lies exactly on the margin:

$$\exists(x_k,y_k)\qquad s.t.\quad 1-y_k(w^Tx_k+b) = 0$$

Multiplying both sides of $y_k(w^Tx_k+b)=1$ by $y_k$ and using $y_k^2=1$:

$$y_k^2(w^Tx_k+b) = y_k \ \Longrightarrow\ w^Tx_k+b = y_k \ \Longrightarrow\ b = y_k - \Big(\sum_{i=1}^{N}\mu_iy_ix_i\Big)^Tx_k \qquad (14)$$

Substituting $\hat{\mu}$ into $(14)$ gives $\hat{b}$. With both $\hat{w}$ and $\hat{b}$ in hand, the classifier is

$$\hat{f}(x) = sgn(\hat{w}^Tx+\hat{b})$$
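For reference, the chain of results above can be packaged as the standard Karush-Kuhn-Tucker (KKT) conditions for the primal problem $(5)$; in particular, the existence of a margin point $(x_k,y_k)$ used to derive $(14)$ is complementary slackness applied to any index $k$ with $\mu_k>0$:

$$\begin{aligned} &\text{stationarity:} && w=\sum_{i=1}^{N}\mu_iy_ix_i,\qquad \sum_{i=1}^{N}\mu_iy_i=0 \\ &\text{primal feasibility:} && y_i(w^Tx_i+b)\geq 1\quad\forall i \\ &\text{dual feasibility:} && \mu_i\geq 0\quad\forall i \\ &\text{complementary slackness:} && \mu_i\big(1-y_i(w^Tx_i+b)\big)=0\quad\forall i \end{aligned}$$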

1.4. A Simple SVM Example

Data point    x1     x2     y
A             1.0    1.5    +1
B             1.5    2.4    +1
C             1.5    1.0    -1

Below, we use Python to plot these points in the 2D plane.

```python
import matplotlib.pyplot as plt
import numpy as np

# Candidate separating line x2 = x1 (plotted in green).
x = np.linspace(0, 3, 100)
y = x

plt.title("Data Points")
plt.xlim(0, 3)
plt.ylim(0, 3)
plt.xlabel("x1")
plt.ylabel("x2")

# Positive points A and B in red, negative point C in blue.
plt.plot([1.0, 1.5], [1.5, 2.4], 'ro')
plt.plot([1.5], [1.0], 'bo')
plt.plot(x, y, c='green')

plt.annotate('A', xy=(1.0, 1.5))
plt.annotate('B', xy=(1.5, 2.4))
plt.annotate('C', xy=(1.5, 1.0))
plt.show()
```

If we simply judge by eye and pick a straight line that separates the red points from the blue point, that line should lie close to x2 = x1, i.e., the green line in the plot.

Substituting the data into equation $(13)$:

$$\frac{\partial{\Big(-\frac{1}{2}\sum_{i=1}^{3}\sum_{j=1}^{3}\mu_i\mu_jy_iy_jx_i^Tx_j + \sum_{i=1}^{3}\mu_i\Big)}}{\partial \mu} = 0$$

Expanding the dual objective for $N=3$:

$$\begin{aligned} -\frac{1}{2}\sum_{i=1}^{3}\sum_{j=1}^{3}\mu_i\mu_jy_iy_jx_i^Tx_j + \sum_{i=1}^{3}\mu_i &= -\frac{1}{2}\sum_{i=1}^{3}\big(\mu_i\mu_1y_iy_1x_i^Tx_1+\mu_i\mu_2y_iy_2x_i^Tx_2+\mu_i\mu_3y_iy_3x_i^Tx_3\big) + \sum_{i=1}^{3}\mu_i \\ &= -\frac{1}{2}\big(\mu_1\mu_1y_1y_1x_1^Tx_1+\mu_1\mu_2y_1y_2x_1^Tx_2+\mu_1\mu_3y_1y_3x_1^Tx_3 \\ &\qquad +\mu_2\mu_1y_2y_1x_2^Tx_1+\mu_2\mu_2y_2y_2x_2^Tx_2+\mu_2\mu_3y_2y_3x_2^Tx_3 \\ &\qquad +\mu_3\mu_1y_3y_1x_3^Tx_1+\mu_3\mu_2y_3y_2x_3^Tx_2+\mu_3\mu_3y_3y_3x_3^Tx_3\big) + \mu_1+\mu_2+\mu_3 \end{aligned}$$

Taking partial derivatives with respect to $\mu_1$, $\mu_2$, and $\mu_3$:

$$-\frac{1}{2}\big(2\mu_1y_1y_1x_1^Tx_1+\mu_2y_1y_2x_1^Tx_2+\mu_3y_1y_3x_1^Tx_3+\mu_2y_2y_1x_2^Tx_1+\mu_3y_3y_1x_3^Tx_1\big)+1=0 \qquad (15)$$

$$-\frac{1}{2}\big(\mu_1y_1y_2x_1^Tx_2+\mu_1y_2y_1x_2^Tx_1+2\mu_2y_2y_2x_2^Tx_2+\mu_3y_2y_3x_2^Tx_3+\mu_3y_3y_2x_3^Tx_2\big)+1=0 \qquad (16)$$

$$-\frac{1}{2}\big(\mu_1y_1y_3x_1^Tx_3+\mu_2y_2y_3x_2^Tx_3+\mu_1y_3y_1x_3^Tx_1+\mu_2y_3y_2x_3^Tx_2+2\mu_3y_3y_3x_3^Tx_3\big)+1=0 \qquad (17)$$
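Equations $(15)$–$(17)$ set the unconstrained gradient to zero, ignoring the constraints $\mu_i\geq 0$ and $\sum_i\mu_iy_i=0$ from $(12)$. For this data set the $3\times 3$ matrix $\big(y_iy_jx_i^Tx_j\big)$ is singular (three points in $\mathbb{R}^2$ give rank at most 2), so $(15)$–$(17)$ cannot be solved as an ordinary linear system and the constraints must be brought back in. A safer route is to solve the dual $(12)$ directly as a small constrained quadratic program. Below is a minimal numerical sketch, assuming SciPy is available; the helper names are our own:

```python
import numpy as np
from scipy.optimize import minimize

# Data from the table above.
X = np.array([[1.0, 1.5],
              [1.5, 2.4],
              [1.5, 1.0]])
y = np.array([1.0, 1.0, -1.0])

# Q_ij = y_i y_j x_i^T x_j, the matrix appearing in the dual objective (12).
Q = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_dual(mu):
    # Negative of the dual objective, so minimizing it maximizes (12).
    return 0.5 * mu @ Q @ mu - mu.sum()

constraints = [{"type": "eq", "fun": lambda mu: mu @ y}]  # sum_i mu_i y_i = 0
bounds = [(0.0, None)] * len(y)                           # mu_i >= 0

res = minimize(neg_dual, x0=np.ones(len(y)),
               bounds=bounds, constraints=constraints, method="SLSQP")
mu = res.x

w = (mu * y) @ X         # equation (11)
k = int(np.argmax(mu))   # a support vector: any index with mu_k > 0
b = y[k] - w @ X[k]      # equation (14)
print("mu =", mu, "w =", w, "b =", b)
```

Under these assumptions the solver should recover approximately $\mu=(4,0,4)$, $w=(-2,2)$, and $b=0$, i.e., the separating line $x_2=x_1$ that we guessed by eye, with A and C as the support vectors.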


[1] This derivation follows the National University of Singapore course "Foundations of Machine Learning" and the videos of the bilibili uploader "shuhuai008".
