Deriving Bias² + Variance + σ²_ε
Background

We have a true function $f(x)$ and a model $\hat{f}(x;D)$ trained on a dataset $D$. For any input $x$:

- $y$ is the observed value, defined as $y = f(x) + \epsilon$, where $\epsilon$ is random noise with $E[\epsilon] = 0$ and $\text{Var}(\epsilon) = \sigma^2_\epsilon$;
- $\hat{f}(x;D)$ is the model's prediction, which depends on the training data $D$;
- the expected squared error $E[(y - \hat{f}(x;D))^2]$ measures the model's prediction error, where the expectation $E[\cdot]$ is taken over both the training data $D$ and the noise $\epsilon$.
The goal is to prove:

$$E[(y - \hat{f}(x;D))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2_\epsilon$$

where:

- Bias: $\text{Bias} = E[\hat{f}(x;D)] - f(x)$;
- Variance: $\text{Variance} = E[(\hat{f}(x;D) - E[\hat{f}(x;D)])^2]$;
- $\sigma^2_\epsilon$: the irreducible error, i.e., the variance of the noise.
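The setting above can be sketched numerically. The true function, noise level, and linear model below are illustrative assumptions chosen for the sketch, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Hypothetical true function (an assumption for illustration)."""
    return np.sin(x)

sigma_eps = 0.3  # assumed noise standard deviation

def sample_dataset(n=30):
    """Draw one training set D: inputs x plus noisy observations y = f(x) + eps."""
    x = rng.uniform(-3, 3, size=n)
    y = f(x) + rng.normal(0.0, sigma_eps, size=n)
    return x, y

def fit_and_predict(x_train, y_train, x_query):
    """A simple model f_hat(x; D): a degree-1 polynomial fit to D."""
    coeffs = np.polyfit(x_train, y_train, deg=1)
    return np.polyval(coeffs, x_query)

x_train, y_train = sample_dataset()
print(fit_and_predict(x_train, y_train, x_query=1.0))
```

Each call to `sample_dataset` corresponds to one draw of $D$; the prediction therefore varies with $D$, which is exactly the randomness the expectation $E_D[\cdot]$ averages over.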
Derivation

Step 1: Define the expected squared error

We start from the expected squared error:

$$E[(y - \hat{f}(x;D))^2]$$

Since $y = f(x) + \epsilon$, substituting gives:

$$y - \hat{f}(x;D) = f(x) + \epsilon - \hat{f}(x;D)$$

so the expected squared error becomes:

$$E[(y - \hat{f}(x;D))^2] = E[(f(x) + \epsilon - \hat{f}(x;D))^2]$$

Here the expectation $E[\cdot]$ is the joint expectation over the training data $D$ and the noise $\epsilon$, i.e., $E_{D,\epsilon}[\cdot]$.
Step 2: Expand the square

Expanding the expression:

$$(f(x) + \epsilon - \hat{f}(x;D))^2 = [f(x) - \hat{f}(x;D)]^2 + 2[f(x) - \hat{f}(x;D)]\epsilon + \epsilon^2$$

Taking the expectation of the whole expression:

$$E[(f(x) + \epsilon - \hat{f}(x;D))^2] = E[[f(x) - \hat{f}(x;D)]^2] + 2E[[f(x) - \hat{f}(x;D)]\epsilon] + E[\epsilon^2]$$
Step 3: Evaluate each term

Since the expectation is over both $D$ and $\epsilon$, we use the independence of $\epsilon$ and $\hat{f}(x;D)$ ($\epsilon$ is noise inherent to the data and does not depend on the training set $D$) together with the property $E[\epsilon] = 0$.

- First term: $E[[f(x) - \hat{f}(x;D)]^2]$
  - $f(x)$ is the fixed true value and depends on neither $D$ nor $\epsilon$;
  - $\hat{f}(x;D)$ depends on $D$ but not on $\epsilon$;
  - therefore $E_{D,\epsilon}[[f(x) - \hat{f}(x;D)]^2] = E_D[[f(x) - \hat{f}(x;D)]^2]$ (taking the expectation over $\epsilon$ leaves this term unchanged).
- Second term: $2E[[f(x) - \hat{f}(x;D)]\epsilon]$
  - Since $\epsilon$ is independent of $D$ (and hence of $\hat{f}(x;D)$) and $E[\epsilon] = 0$:
    $$E_{D,\epsilon}[[f(x) - \hat{f}(x;D)]\epsilon] = E_D[f(x) - \hat{f}(x;D)] \cdot E[\epsilon] = E_D[f(x) - \hat{f}(x;D)] \cdot 0 = 0$$
  - so this term vanishes.
- Third term: $E[\epsilon^2]$
  - The variance of $\epsilon$ is defined as $\text{Var}(\epsilon) = E[\epsilon^2] - (E[\epsilon])^2$;
  - since $E[\epsilon] = 0$:
    $$E[\epsilon^2] = \text{Var}(\epsilon) = \sigma^2_\epsilon$$
  - Because $\epsilon$ does not depend on $D$, this term is simply $\sigma^2_\epsilon$.

The expected squared error therefore simplifies to:

$$E[(y - \hat{f}(x;D))^2] = E_D[[f(x) - \hat{f}(x;D)]^2] + \sigma^2_\epsilon$$
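The three-term analysis can be checked by Monte Carlo simulation. The distribution of $\hat{f}(x;D)$ below is an arbitrary stand-in (an assumption for illustration); what matters is only that it is independent of $\epsilon$:

```python
import numpy as np

rng = np.random.default_rng(1)

sigma_eps = 0.5
f_x = 2.0          # assumed true value f(x) at a fixed query point x
n_trials = 200_000

# Stand-in for f_hat(x; D): one draw per simulated training set D.
# The normal distribution here is an assumption chosen for the sketch.
f_hat = f_x + rng.normal(0.3, 0.4, size=n_trials)
eps = rng.normal(0.0, sigma_eps, size=n_trials)  # independent of f_hat

cross_term = np.mean(2 * (f_x - f_hat) * eps)    # should be close to 0
noise_term = np.mean(eps ** 2)                   # should be close to sigma_eps**2

lhs = np.mean((f_x + eps - f_hat) ** 2)          # E[(y - f_hat)^2]
rhs = np.mean((f_x - f_hat) ** 2) + sigma_eps ** 2

print(cross_term, noise_term, lhs, rhs)
```

Up to sampling noise, the cross term vanishes, the noise term matches $\sigma^2_\epsilon$, and the two sides of the simplified identity agree.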
Step 4: Decompose $E_D[[f(x) - \hat{f}(x;D)]^2]$

We now split $E_D[[f(x) - \hat{f}(x;D)]^2]$ into a bias part and a variance part. Define $\bar{f}(x) = E_D[\hat{f}(x;D)]$, the expected model prediction (averaged over all possible training sets $D$).

Add and subtract $\bar{f}(x)$ inside the expression:

$$f(x) - \hat{f}(x;D) = [f(x) - \bar{f}(x)] + [\bar{f}(x) - \hat{f}(x;D)]$$

Squaring:

$$[f(x) - \hat{f}(x;D)]^2 = [f(x) - \bar{f}(x)]^2 + 2[f(x) - \bar{f}(x)][\bar{f}(x) - \hat{f}(x;D)] + [\bar{f}(x) - \hat{f}(x;D)]^2$$

Taking the expectation over $D$:

$$E_D[[f(x) - \hat{f}(x;D)]^2] = E_D[[f(x) - \bar{f}(x)]^2] + 2E_D[[f(x) - \bar{f}(x)][\bar{f}(x) - \hat{f}(x;D)]] + E_D[[\bar{f}(x) - \hat{f}(x;D)]^2]$$

Term by term:

- First term: $E_D[[f(x) - \bar{f}(x)]^2]$
  - Both $f(x)$ and $\bar{f}(x) = E_D[\hat{f}(x;D)]$ are fixed (they do not vary with the particular $D$), so:
    $$E_D[[f(x) - \bar{f}(x)]^2] = [f(x) - \bar{f}(x)]^2$$
  - By definition, $\text{Bias} = E_D[\hat{f}(x;D)] - f(x) = \bar{f}(x) - f(x)$, so:
    $$[f(x) - \bar{f}(x)]^2 = [\bar{f}(x) - f(x)]^2 = (\text{Bias})^2$$
- Second term: $2E_D[[f(x) - \bar{f}(x)][\bar{f}(x) - \hat{f}(x;D)]]$
  - $f(x) - \bar{f}(x)$ is fixed and can be pulled out of the expectation:
    $$E_D[[f(x) - \bar{f}(x)][\bar{f}(x) - \hat{f}(x;D)]] = [f(x) - \bar{f}(x)]\, E_D[\bar{f}(x) - \hat{f}(x;D)]$$
  - Since $\bar{f}(x) = E_D[\hat{f}(x;D)]$:
    $$E_D[\bar{f}(x) - \hat{f}(x;D)] = \bar{f}(x) - E_D[\hat{f}(x;D)] = \bar{f}(x) - \bar{f}(x) = 0$$
  - so this term vanishes.
- Third term: $E_D[[\bar{f}(x) - \hat{f}(x;D)]^2]$
  - This is exactly the variance of the model's prediction:
    $$E_D[[\bar{f}(x) - \hat{f}(x;D)]^2] = E_D[(\hat{f}(x;D) - E_D[\hat{f}(x;D)])^2] = \text{Variance}$$

Hence:

$$E_D[[f(x) - \hat{f}(x;D)]^2] = (\text{Bias})^2 + \text{Variance}$$
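Note that Step 4 is a purely algebraic identity, so it holds exactly even for a finite empirical sample of predictions, not just in expectation. A quick sketch (the values of $f(x)$ and the prediction distribution are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

f_x = 1.5                                  # assumed true value f(x)
f_hat = rng.normal(1.0, 0.7, size=10_000)  # stand-in predictions over many D

f_bar = f_hat.mean()                       # empirical analogue of f_bar(x)
bias_sq = (f_bar - f_x) ** 2
variance = np.mean((f_hat - f_bar) ** 2)   # population-style variance (ddof=0)

total = np.mean((f_x - f_hat) ** 2)        # empirical E_D[(f - f_hat)^2]

# The cross term cancels exactly because mean(f_bar - f_hat) == 0 by
# construction, so total == bias_sq + variance up to floating-point rounding.
print(total, bias_sq + variance)
```

The key design point is using the population variance (`ddof=0`): with the sample mean as $\bar{f}(x)$, the cross term is identically zero and the identity is exact, not merely approximate.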
Step 5: Combine the results

Substituting the decomposition back:

$$E[(y - \hat{f}(x;D))^2] = E_D[[f(x) - \hat{f}(x;D)]^2] + \sigma^2_\epsilon = (\text{Bias})^2 + \text{Variance} + \sigma^2_\epsilon$$

This completes the derivation.
Intuition

- Bias² (squared bias): measures the gap between the model's average prediction $\bar{f}(x)$ and the true value $f(x)$, reflecting the model's systematic error (e.g., a model that is too simple).
- Variance: measures how much the prediction $\hat{f}(x;D)$ fluctuates across different training sets $D$, reflecting the model's sensitivity to the training data (e.g., a model that is too complex).
- $\sigma^2_\epsilon$ (irreducible error): noise inherent to the data, which no model can eliminate.
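The trade-off between the first two terms can be made concrete by estimating Bias² and Variance at a single query point for polynomial models of increasing degree. The true function, data range, and noise level below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    return np.sin(x)  # assumed true function

sigma_eps, n_points, n_datasets = 0.3, 20, 500
x_query = 1.0

def bias_var_at_point(degree):
    """Estimate Bias^2 and Variance of a degree-`degree` polynomial fit at x_query."""
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        x = rng.uniform(-3, 3, size=n_points)           # one training set D
        y = f(x) + rng.normal(0.0, sigma_eps, size=n_points)
        preds[i] = np.polyval(np.polyfit(x, y, degree), x_query)
    bias_sq = (preds.mean() - f(x_query)) ** 2          # systematic error
    variance = preds.var()                              # spread across D
    return bias_sq, variance

for degree in (1, 3, 9):
    b, v = bias_var_at_point(degree)
    print(f"degree={degree}: bias^2={b:.4f}, variance={v:.4f}")
```

With these assumptions, the linear model (too simple) shows large bias and small variance, while the degree-9 model (too flexible for 20 noisy points) shows small bias but much larger variance.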
Summary

By writing $y - \hat{f}(x;D)$ as a combination of the true value, the model prediction, and the noise, expanding the square and taking expectations, using the independence and zero mean of $\epsilon$, and finally decomposing the model error term, we have shown:

$$E[(y - \hat{f}(x;D))^2] = \text{Bias}^2 + \text{Variance} + \sigma^2_\epsilon$$
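The full decomposition can be verified end to end for an actual fitted model. All concrete choices below (a sine true function, a linear model, the noise level) are assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(4)

def f(x):
    return np.sin(x)  # assumed true function

sigma_eps, n_points, n_trials, x_q = 0.3, 40, 20_000, 1.0

preds = np.empty(n_trials)
errs = np.empty(n_trials)
for i in range(n_trials):
    x = rng.uniform(-3, 3, size=n_points)                # one training set D
    y = f(x) + rng.normal(0.0, sigma_eps, size=n_points)
    coeffs = np.polyfit(x, y, 1)                         # f_hat(.; D), linear
    preds[i] = np.polyval(coeffs, x_q)
    y_new = f(x_q) + rng.normal(0.0, sigma_eps)          # fresh observation y
    errs[i] = (y_new - preds[i]) ** 2

bias_sq = (preds.mean() - f(x_q)) ** 2
variance = preds.var()
lhs = errs.mean()                        # Monte Carlo E[(y - f_hat)^2]
rhs = bias_sq + variance + sigma_eps ** 2
print(lhs, rhs)
```

Up to Monte Carlo error, the average squared prediction error at $x$ matches the sum of the three components, as the derivation guarantees.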