Maximum Likelihood Estimation: Closed-form Solutions for the Exponential Family Parameter $\eta$
In statistics and machine learning, Maximum Likelihood Estimation (MLE) is a widely used method for estimating model parameters by maximizing the likelihood of the observed data. For exponential family distributions, thanks to their special mathematical form, the MLE of the parameter $\eta$ often has a closed-form solution, which significantly simplifies computation.
Basic Principle of Maximum Likelihood Estimation
Given an observed dataset $D = \{x_1, x_2, \dots, x_N\}$ of independent, identically distributed samples, the joint probability density is:
$$p(D \mid \eta) = \prod_{i=1}^N p(x_i \mid \eta).$$
The goal of maximum likelihood estimation is to find the parameter $\eta$ that maximizes $p(D \mid \eta)$. For convenience, we take the logarithm, turning this into the maximization of the log-likelihood function:
$$\ell(\eta) = \ln p(D \mid \eta) = \sum_{i=1}^N \ln p(x_i \mid \eta).$$
For an exponential family distribution, its probability density function is:
$$p(x \mid \eta) = h(x)\, g(\eta) \exp\left(\eta^T t(x)\right).$$
Substituting this into the log-likelihood function, we obtain:
$$\ell(\eta) = \sum_{i=1}^N \ln h(x_i) + \sum_{i=1}^N \ln g(\eta) + \sum_{i=1}^N \eta^T t(x_i).$$
Since $\ln h(x_i)$ does not depend on $\eta$, it can be dropped, leaving:
$$\ell(\eta) = N \ln g(\eta) + \eta^T \sum_{i=1}^N t(x_i).$$
Solving for the Closed-form Solution
In exponential family distributions, the normalization term $g(\eta)$ is often written as $\exp(-A(\eta))$, where $A(\eta)$ is the log-partition function, satisfying:
$$A(\eta) = -\ln g(\eta).$$
Thus, the log-likelihood function becomes:
$$\ell(\eta) = -N A(\eta) + \eta^T \sum_{i=1}^N t(x_i).$$
Taking the derivative of $\ell(\eta)$ with respect to $\eta$ and setting it to zero yields the first-order condition for maximizing the log-likelihood:
$$\frac{\partial \ell(\eta)}{\partial \eta} = -N \frac{\partial A(\eta)}{\partial \eta} + \sum_{i=1}^N t(x_i) = 0.$$
By a standard property of exponential family distributions, $\frac{\partial A(\eta)}{\partial \eta}$ is exactly the expected value of the sufficient statistic, $\mathbb{E}[t(x)]$:
$$\mathbb{E}[t(x)] = \frac{\partial A(\eta)}{\partial \eta}.$$
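For completeness, this identity follows from differentiating the log-partition function $A(\eta) = \ln \int h(x) \exp\left(\eta^T t(x)\right) dx$, assuming differentiation under the integral sign is valid (the continuous case is shown; sums replace integrals in the discrete case):
$$\frac{\partial A(\eta)}{\partial \eta} = \frac{\int t(x)\, h(x) \exp\left(\eta^T t(x)\right) dx}{\int h(x) \exp\left(\eta^T t(x)\right) dx} = \int t(x)\, p(x \mid \eta)\, dx = \mathbb{E}[t(x)].$$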
Substituting this into the equation, we get:
$$N\, \mathbb{E}[t(x)] = \sum_{i=1}^N t(x_i).$$
This simplifies to:
$$\mathbb{E}[t(x)] = \frac{1}{N} \sum_{i=1}^N t(x_i).$$
This shows that the MLE of $\eta$ is precisely the value that sets the expected value of the sufficient statistic equal to its sample average. Because this moment-matching condition is linear in the sufficient statistics, many exponential family distributions admit closed-form parameter estimates. (If this statement raises questions, see the explanation at the end of this article.)
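As a quick numerical sanity check of this moment-matching condition, here is a minimal sketch using the Poisson distribution, an exponential family with $t(x) = x$, natural parameter $\eta = \ln \lambda$, and $A(\eta) = e^\eta$. The variable names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.poisson(lam=3.5, size=10_000)

# Up to constants, the Poisson log-likelihood in natural form is
#   l(eta) = eta * sum(x_i) - N * exp(eta),
# and setting dl/deta = 0 gives exp(eta_hat) = mean(x): the closed form.
lambda_hat = data.mean()

# Brute-force check: evaluate l(eta) on a grid and locate its maximum.
etas = np.linspace(0.5, 2.0, 2001)
loglik = etas * data.sum() - len(data) * np.exp(etas)
eta_star = etas[np.argmax(loglik)]

print(f"closed-form lambda_hat = {lambda_hat:.4f}")
print(f"grid-search lambda     = {np.exp(eta_star):.4f}")  # agrees to grid precision
```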
Example: Maximum Likelihood Estimation for the Univariate Gaussian Distribution
For a univariate Gaussian distribution:
$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$
the sufficient statistic is $t(x) = [x, x^2]^T$ and the natural parameter is $\eta = \left(\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}\right)$, as the rewriting below shows.
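To see where this parameterization comes from, expand the square in the exponent and group terms by $x$ and $x^2$; the factors then line up with $h(x)$, $g(\eta)$, and $\eta^T t(x)$:
$$p(x \mid \mu, \sigma^2) = \underbrace{\frac{1}{\sqrt{2\pi}}}_{h(x)} \, \underbrace{\frac{1}{\sigma}\exp\left(-\frac{\mu^2}{2\sigma^2}\right)}_{g(\eta)} \, \exp\left(\underbrace{\frac{\mu}{\sigma^2}}_{\eta_1} x + \underbrace{\left(-\frac{1}{2\sigma^2}\right)}_{\eta_2} x^2\right),$$
so that $\eta^T t(x) = \eta_1 x + \eta_2 x^2$ with $t(x) = [x, x^2]^T$.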
Given sample data $\{x_1, x_2, \dots, x_N\}$, the maximum likelihood estimates of $\mu$ and $\sigma^2$ are:
- Estimate of the mean $\mu$:
$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^N x_i.$$
- Estimate of the variance $\sigma^2$:
$$\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^N (x_i - \hat{\mu})^2.$$
These estimates have closed-form solutions, avoiding the need for complex numerical optimization.
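Here is a minimal sketch (assuming NumPy; all names are illustrative) that reads the Gaussian MLE directly off the sample averages of the sufficient statistics $t(x) = [x, x^2]^T$:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.5, size=100_000)

# Sample averages of the two sufficient statistics.
t1 = data.mean()           # (1/N) * sum(x_i)
t2 = (data ** 2).mean()    # (1/N) * sum(x_i^2)

mu_hat = t1                # mu_hat = sample mean
sigma2_hat = t2 - t1 ** 2  # sigma2_hat = mean(x^2) - mean(x)^2

print(mu_hat, sigma2_hat)  # close to the true values 2.0 and 1.5**2 = 2.25
print(np.isclose(sigma2_hat, data.var()))  # matches NumPy's ddof=0 variance
```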
Summary
In exponential family distributions, due to the linear relationship between the log-likelihood function and the sufficient statistic, Maximum Likelihood Estimation (MLE) often has a closed-form solution. This property not only makes parameter estimation more efficient but also provides significant advantages for statistical inference and model optimization. For example, in the case of the Gaussian distribution, the MLE for the parameters directly corresponds to the sample mean and sample variance, which avoids the complexity of iterative numerical calculations.
Explanation of "Setting the Expectation of the Sufficient Statistic Equal to the Sample Average"
This statement means that, in an exponential family distribution, the expectation of the sufficient statistic and the sample average are linked by a direct mathematical relationship. In other words, the core of maximum likelihood estimation is to find a parameter $\eta$ such that the theoretical expectation of the sufficient statistic (computed under $\eta$) exactly equals the sample average of the observed data.
To make this concrete, consider the following example:
Example: Maximum Likelihood Estimation for the Normal Distribution
Suppose we have $N$ samples $x_1, x_2, \dots, x_N$ drawn from a univariate normal distribution with probability density function:
$$p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$
For the normal distribution, the sufficient statistic is $t(x) = [x, x^2]$, i.e., each sample's value and its square. Maximum likelihood estimation requires adjusting the parameters $\mu$ and $\sigma^2$ so that:
- the sample mean of the data (the average of the actual observations) agrees with the theoretical mean of the normal distribution;
- the sample mean of the squared data (the average of the squared observations) agrees with the theoretical expectation of the square under the normal distribution.
The calculation proceeds as follows:
Solving for the mean $\mu$
In the normal distribution, the theoretical expectation of $x$ is simply the parameter $\mu$ itself.
The sample mean is:
$$\bar{x} = \frac{1}{N} \sum_{i=1}^N x_i.$$
Maximum likelihood estimation sets the theoretical mean equal to the sample mean, giving directly:
$$\hat{\mu} = \bar{x} = \frac{1}{N} \sum_{i=1}^N x_i.$$
Solving for the variance $\sigma^2$
In the normal distribution, the theoretical expectation of the squared value can be written in terms of the variance and the mean:
$$\mathbb{E}[x^2] = \sigma^2 + \mu^2.$$
The sample mean of the squares is:
$$\frac{1}{N} \sum_{i=1}^N x_i^2.$$
Maximum likelihood estimation sets the theoretical second moment equal to the sample second moment, so:
$$\sigma^2 + \mu^2 = \frac{1}{N} \sum_{i=1}^N x_i^2.$$
Substituting $\mu = \hat{\mu}$, we can solve for the variance estimate:
$$\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^N (x_i - \hat{\mu})^2.$$
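The substitution step, written out in full (a one-line algebraic check, using $\sum_{i=1}^N x_i = N\hat{\mu}$):
$$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^N x_i^2 - \hat{\mu}^2 = \frac{1}{N}\sum_{i=1}^N \left(x_i^2 - 2\hat{\mu}\,x_i + \hat{\mu}^2\right) = \frac{1}{N}\sum_{i=1}^N (x_i - \hat{\mu})^2.$$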
For the details of this step, see the author's other article, 一维高斯分布的方差估计推导 (a derivation of the variance estimate for a univariate Gaussian); it may help with understanding this part, though then again, it may not.
Interpreting "the expectation of the sufficient statistic equals the sample average"
- Theoretical expectation: the expectation of the sufficient statistic computed from the distribution's parameter $\eta$. For the normal distribution, the mean $\mu$ and the second moment $\sigma^2 + \mu^2$ are the theoretical expectations.
- Sample average: the average computed from the actual data, for example $\frac{1}{N} \sum_{i=1}^N x_i$.
- Core idea of maximum likelihood: adjust the parameter $\eta$ so that the distribution's theoretical expectations agree with the sample averages. For example, the normal distribution's mean $\mu$ is estimated directly by the sample mean $\frac{1}{N} \sum_{i=1}^N x_i$.
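To double-check that this moment-matching recipe really is the maximizer of the likelihood, here is a small sketch that optimizes the Gaussian log-likelihood numerically (assuming SciPy is available; all names are illustrative) and compares the result against the closed-form answer:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = rng.normal(loc=-1.0, scale=0.7, size=50_000)

def neg_loglik(params):
    mu, log_sigma2 = params               # optimize log(sigma^2) so sigma^2 > 0
    sigma2 = np.exp(log_sigma2)
    return 0.5 * np.sum(np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

res = minimize(neg_loglik, x0=[0.0, 0.0])  # generic numerical optimizer
mu_opt, sigma2_opt = res.x[0], np.exp(res.x[1])

print(mu_opt, x.mean())                          # numerically identical
print(sigma2_opt, ((x - x.mean()) ** 2).mean())  # numerically identical
```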
Summary
"Setting the expectation of the sufficient statistic equal to the sample average" means that maximum likelihood estimation adjusts the distribution's parameter $\eta$ so that the distribution's mathematical expectations (theoretical values) exactly match the statistics of the observed data (sample values). In exponential family distributions, because this relationship is linear, the solution often has a closed form; for example, the normal distribution's $\mu$ and $\sigma^2$ can be computed directly from simple formulas.
Postscript
Written in Shanghai at 22:25 on December 1, 2024, with the assistance of the GPT-4o model.