First, two simple examples in two and three dimensions:

$$\text{2-D:}\quad w_1x_1+w_2x_2+b=0 \\ \text{3-D:}\quad w_1x_1+w_2x_2+w_3x_3+b=0$$

Generalizing, we obtain

$$w^Tx+b=0,\qquad w=\left(\begin{array}{c} w_1 \\ w_2 \\ \vdots \\ w_d \end{array}\right)_{d\times1},\quad x=\left(\begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_d \end{array}\right)_{d\times1}$$

A hyperplane has the following three properties:
1. The distance from the origin to the hyperplane $H$:
$$dist(0,H)=\frac{|b|}{\|w\|}$$
2. The position of an arbitrary point $x_0$ relative to the hyperplane:
$$w^Tx_0+b>0\Longrightarrow \text{above } H \\ w^Tx_0+b<0\Longrightarrow \text{below } H \\ w^Tx_0+b=0\Longrightarrow \text{on } H$$
3. The distance from an arbitrary point $x_0$ to the hyperplane (the absolute value makes the distance non-negative regardless of side):
$$dist(x_0,H)=\frac{|w^Tx_0+b|}{\|w\|}$$

We now face a binary classification problem with dataset
$$D=\{(x_i,y_i)\mid x_i\in\mathbb{R}^d,\ y_i=\pm1\}$$
If there exists a hyperplane $w^Tx+b=0$ that classifies the points correctly, then
$$f^*(x)=sgn(w^Tx+b)=\begin{cases}+1&(w^Tx+b>0)\\-1&(w^Tx+b<0)\end{cases}$$
The parameters to optimize are $w$ and $b$. Define the $margin$ as the distance from the point closest to the hyperplane to the hyperplane; we want to maximize this distance, under a constraint guaranteeing that the hyperplane classifies every point correctly. Once that constraint is satisfied the classification error is already zero, so minimizing mean squared error would be meaningless. Our objective is therefore
$$\begin{cases}\max_{w,b}\min_{i=1,2,\cdots,N}\frac{|w^Tx_i+b|}{\|w\|}\\ s.t.\quad y_i(w^Tx_i+b)>0\quad\forall i\end{cases}\qquad(1)$$
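The three properties above translate directly into code. A minimal NumPy sketch (the hyperplane $w=(1,1)$, $b=-2$ is an arbitrary choice for illustration, not taken from the text):

```python
import numpy as np

w = np.array([1.0, 1.0])   # hypothetical normal vector (illustration only)
b = -2.0                   # hypothetical offset

def dist_to_hyperplane(x0, w, b):
    """Distance from point x0 to the hyperplane w^T x + b = 0."""
    return abs(w @ x0 + b) / np.linalg.norm(w)

def side(x0, w, b):
    """+1 above, -1 below, 0 on the hyperplane."""
    return np.sign(w @ x0 + b)

# Property 1: the distance from the origin is |b| / ||w||
print(dist_to_hyperplane(np.zeros(2), w, b))   # 2 / sqrt(2) ≈ 1.414
# Property 2: which side a point lies on
print(side(np.array([2.0, 2.0]), w, b))        # 1.0 (above H)
```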
First transform the constraint. Since correct classification means $y_i(w^Tx_i+b)=|w^Tx_i+b|$,
$$\begin{aligned} y_i(w^Tx_i+b)&\geq\min_{i=1,2,\cdots,N}y_i(w^Tx_i+b)\\&=\min_{i=1,2,\cdots,N}|w^Tx_i+b|\end{aligned}$$
Let
$$\lambda=\frac{1}{\min_{i=1,2,\cdots,N}|w^Tx_i+b|}$$
Then the objective function becomes
$$\begin{aligned}\max_{w,b}\min_{i=1,2,\cdots,N}\frac{|w^Tx_i+b|}{\|w\|}&=\max_{w,b}\frac{1}{\lambda\|w\|}\,\lambda\min_{i=1,2,\cdots,N}|w^Tx_i+b|\\&=\max_{w,b}\frac{1}{\lambda\|w\|}\end{aligned}$$
At this stage, the optimization problem has been simplified to
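The rescaling step relies on the fact that $(w,b)$ and $(\lambda w,\lambda b)$ describe the same hyperplane and the same distances. A quick numerical check (all values below are chosen arbitrarily for illustration):

```python
import numpy as np

# Arbitrary illustration values
w, b, lam = np.array([2.0, -1.0]), 0.5, 3.7
x = np.array([1.3, 0.4])

# Distance to the hyperplane before and after rescaling (w, b) -> (lam*w, lam*b)
d1 = abs(w @ x + b) / np.linalg.norm(w)
d2 = abs((lam * w) @ x + lam * b) / np.linalg.norm(lam * w)
print(np.isclose(d1, d2))   # True: the geometric distance is scale-invariant
```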
$$\begin{cases}\max_{w,b}\frac{1}{\lambda\|w\|}\\ s.t.\quad y_i(w^Tx_i+b)\geq\frac{1}{\lambda}\quad\forall i\end{cases}\qquad(2)$$
Next define $\tilde{w}=\lambda w$ and $\tilde{b}=\lambda b$, so that
$$\begin{cases}\max_{\tilde{w},\tilde{b}}\frac{1}{\|\tilde{w}\|}\\ s.t.\quad y_i(\tilde{w}^Tx_i+\tilde{b})\geq1\quad\forall i\end{cases}\qquad(3)$$
Dropping the tildes, the optimization problem updates to
$$\begin{cases}\max_{w,b}\frac{1}{\|w\|}\\ s.t.\quad y_i(w^Tx_i+b)\geq1\quad\forall i\end{cases}\qquad(4)$$
$$\begin{cases}\min_{w,b}\frac{1}{2}\|w\|^2\\ s.t.\quad 1-y_i(w^Tx_i+b)\leq0\quad\forall i\end{cases}\qquad(5)$$
Problem $(5)$ is our constrained primal problem. Applying the method of Lagrange multipliers,
$$\begin{cases}z=(w,b)\\ F(z)=\frac{1}{2}\|w\|^2\\ G_i(z)=1-y_i(w^Tx_i+b)\end{cases}$$
$$\Longrightarrow L(w,b,\mu)=\frac{1}{2}\|w\|^2+\sum_{i=1}^{N}\mu_i\big(1-y_i(w^Tx_i+b)\big)$$
$$\Longrightarrow\begin{cases}\min_{w,b}\max_{\mu}L(w,b,\mu)\\ s.t.\quad\mu_i\geq0\quad\forall i\end{cases}\qquad(6)$$
Problem $(6)$ is the unconstrained form of the primal. Since the objective is a convex quadratic function and the constraints are linear, the conditions for strong duality hold. The dual problem is
$$\begin{cases}\max_{\mu}\min_{w,b}L(w,b,\mu)\\ s.t.\quad\mu_i\geq0\quad\forall i\end{cases}\qquad(7)$$
Setting the partial derivatives of $L$ to zero:
$$\begin{aligned}\frac{\partial L(w,b,\mu)}{\partial b}&=\frac{\partial}{\partial b}\Big(\sum_{i=1}^{N}\mu_i\big(1-y_i(w^Tx_i+b)\big)\Big)=0\qquad(8)\\ \frac{\partial L(w,b,\mu)}{\partial w}&=\frac{\partial}{\partial w}\Big(\frac{1}{2}w^Tw+\sum_{i=1}^{N}\mu_i-\sum_{i=1}^{N}\mu_iy_iw^Tx_i-\sum_{i=1}^{N}\mu_iy_ib\Big)=0\qquad(9)\end{aligned}$$
From $(8)$ we get
$$\sum_{i=1}^{N}\mu_iy_i=0\qquad(10)$$
Substituting $(10)$ into $(9)$ gives
$$\frac{\partial L}{\partial w}=w-\sum_{i=1}^{N}\mu_iy_ix_i=0\Longrightarrow w=\sum_{i=1}^{N}\mu_iy_ix_i\qquad(11)$$
Substituting $(10)$ and $(11)$ into $L(w,b,\mu)$:
$$\begin{aligned}L(w,b,\mu)&=\frac{1}{2}\Big(\sum_{i=1}^{N}\mu_iy_ix_i\Big)^T\Big(\sum_{j=1}^{N}\mu_jy_jx_j\Big)-\sum_{i=1}^{N}\mu_iy_i\Big(\sum_{j=1}^{N}\mu_jy_jx_j\Big)^Tx_i+\sum_{i=1}^{N}\mu_i\\&=\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\mu_i\mu_jy_iy_jx_i^Tx_j-\sum_{i=1}^{N}\sum_{j=1}^{N}\mu_i\mu_jy_iy_jx_i^Tx_j+\sum_{i=1}^{N}\mu_i\\&=-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\mu_i\mu_jy_iy_jx_i^Tx_j+\sum_{i=1}^{N}\mu_i\end{aligned}$$
The optimization problem now becomes
$$\begin{cases}\max_{\mu}-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\mu_i\mu_jy_iy_jx_i^Tx_j+\sum_{i=1}^{N}\mu_i\\ s.t.\quad\mu_i\geq0\quad\forall i\\ s.t.\quad\sum_{i=1}^{N}\mu_iy_i=0\end{cases}\qquad(12)$$
$$\Longrightarrow\frac{\partial}{\partial\mu}\Big(-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\mu_i\mu_jy_iy_jx_i^Tx_j+\sum_{i=1}^{N}\mu_i\Big)=0\qquad(13)$$
Solving $(13)$ yields $\hat{\mu}$, and substituting it into $(11)$ yields $\hat{w}$. Moreover, since at least one point lies exactly on the margin,
$$\exists(x_k,y_k)\quad s.t.\quad 1-y_k(w^Tx_k+b)=0$$
Multiplying both sides by $y_k$ and using $y_k^2=1$,
$$y_k^2(w^Tx_k+b)=y_k\Longrightarrow w^Tx_k+b=y_k\Longrightarrow b=y_k-\Big(\sum_{i=1}^{N}\mu_iy_ix_i\Big)^Tx_k\qquad(14)$$
Substituting $\hat{\mu}$ into $(14)$ yields $\hat{b}$. With $\hat{w}$ and $\hat{b}$ both determined,
$$\hat{f}(x)=sgn(\hat{w}^Tx+\hat{b})$$
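Given the optimal multipliers, equations $(11)$ and $(14)$ become a few lines of NumPy. The two-point dataset and multipliers below are placeholders chosen for illustration (for this symmetric pair, $\mu=(0.25,0.25)$ happens to satisfy both the constraint $\sum_i\mu_iy_i=0$ and the margin conditions):

```python
import numpy as np

# Placeholder data: one point per class
X = np.array([[0.0, 2.0],
              [2.0, 0.0]])
y = np.array([1.0, -1.0])
mu_hat = np.array([0.25, 0.25])   # satisfies sum(mu_hat * y) == 0

# Equation (11): w = sum_i mu_i y_i x_i
w_hat = (mu_hat * y) @ X

# Equation (14): b = y_k - w^T x_k for a support vector (x_k, y_k); here k = 0
b_hat = y[0] - w_hat @ X[0]

def f_hat(x):
    """Decision function f(x) = sgn(w^T x + b)."""
    return np.sign(w_hat @ x + b_hat)

print(w_hat, b_hat)               # [-0.5  0.5] 0.0
print(f_hat(X[0]), f_hat(X[1]))   # 1.0 -1.0
```

Both training points are classified correctly and lie exactly on the margins $w^Tx+b=\pm1$, as the hard-margin conditions require.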
Below we use Python to plot the positions of these points in the two-dimensional plane.
```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 3, 100)
y = x
plt.title("Data Points")
plt.xlim(0, 3)
plt.ylim(0, 3)
plt.xlabel("x1")
plt.ylabel("x2")
plt.plot([1.0, 1.5], [1.5, 2.4], 'ro')   # A and B (red)
plt.plot([1.5], [1.0], 'bo')             # C (blue)
plt.plot(x, y, c='green')                # the line x2 = x1
plt.annotate(text='A', xy=(1.0, 1.5))
plt.annotate(text='B', xy=(1.5, 2.4))
plt.annotate(text='C', xy=(1.5, 1.0))
plt.show()
```

Judging by eye alone, a line separating the red points from the blue point should be close to $x_1=x_2$, i.e., the green line.
Substituting the data into equation $(13)$:
$$\frac{\partial}{\partial\mu}\Big(-\frac{1}{2}\sum_{i=1}^{3}\sum_{j=1}^{3}\mu_i\mu_jy_iy_jx_i^Tx_j+\sum_{i=1}^{3}\mu_i\Big)=0$$
$$\begin{aligned}-\frac{1}{2}\sum_{i=1}^{3}\sum_{j=1}^{3}\mu_i\mu_jy_iy_jx_i^Tx_j+\sum_{i=1}^{3}\mu_i&=-\frac{1}{2}\sum_{i=1}^{3}\big(\mu_i\mu_1y_iy_1x_i^Tx_1+\mu_i\mu_2y_iy_2x_i^Tx_2+\mu_i\mu_3y_iy_3x_i^Tx_3\big)+\sum_{i=1}^{3}\mu_i\\&=-\frac{1}{2}\big(\mu_1\mu_1y_1y_1x_1^Tx_1+\mu_1\mu_2y_1y_2x_1^Tx_2+\mu_1\mu_3y_1y_3x_1^Tx_3\\&\qquad+\mu_2\mu_1y_2y_1x_2^Tx_1+\mu_2\mu_2y_2y_2x_2^Tx_2+\mu_2\mu_3y_2y_3x_2^Tx_3\\&\qquad+\mu_3\mu_1y_3y_1x_3^Tx_1+\mu_3\mu_2y_3y_2x_3^Tx_2+\mu_3\mu_3y_3y_3x_3^Tx_3\big)+\mu_1+\mu_2+\mu_3\end{aligned}$$
Taking derivatives with respect to $\mu_1$, $\mu_2$ and $\mu_3$ gives
$$-\frac{1}{2}\big(2\mu_1y_1y_1x_1^Tx_1+\mu_2y_1y_2x_1^Tx_2+\mu_3y_1y_3x_1^Tx_3+\mu_2y_2y_1x_2^Tx_1+\mu_3y_3y_1x_3^Tx_1\big)+1=0\qquad(15)$$
$$-\frac{1}{2}\big(\mu_1y_1y_2x_1^Tx_2+\mu_1y_2y_1x_2^Tx_1+2\mu_2y_2y_2x_2^Tx_2+\mu_3y_2y_3x_2^Tx_3+\mu_3y_3y_2x_3^Tx_2\big)+1=0\qquad(16)$$
$$-\frac{1}{2}\big(\mu_1y_1y_3x_1^Tx_3+\mu_2y_2y_3x_2^Tx_3+\mu_1y_3y_1x_3^Tx_1+\mu_2y_3y_2x_3^Tx_2+2\mu_3y_3y_3x_3^Tx_3\big)+1=0\qquad(17)$$
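As a numerical sanity check (assuming labels $y=+1$ for the red points A, B and $y=-1$ for the blue point C, which matches the plot above), we can solve for the multipliers directly. Treating A and C as the support vectors, so that $\mu_2=0$ and, by $(10)$, $\mu_1=\mu_3=\mu$, reduces the KKT conditions to a one-variable equation:

```python
import numpy as np

# Points from the plot; the +1/-1 labeling of red vs. blue is an assumption
A, B, C = np.array([1.0, 1.5]), np.array([1.5, 2.4]), np.array([1.5, 1.0])

# Assume A and C are the support vectors: mu_A = mu_C = mu, w = mu * (A - C),
# and both lie exactly on the margin: w^T A + b = +1, w^T C + b = -1.
d = A - C
mu = 2.0 / (d @ d)        # subtracting the two margin equations: mu * d^T d = 2
w = mu * d                # equation (11)
b = 1.0 - w @ A           # equation (14) with support vector A

print(mu, w, b)           # 4.0 [-2.  2.] 0.0
print(w @ B + b)          # 1.8 >= 1, so B is not an active constraint (mu_2 = 0)
```

The resulting boundary $-2x_1+2x_2=0$ is exactly the line $x_2=x_1$ guessed by eye in the plot, and $w^TB+b=1.8\geq1$ confirms that B imposes no active constraint.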
This derivation follows the National University of Singapore course "Foundations of Machine Learning" and the videos of the bilibili uploader "shuhuai008". ↩︎