ROFORMER: ENHANCED TRANSFORMER WITH ROTARY POSITION EMBEDDING
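Main contribution: the paper proposes RoPE (Rotary Position Embedding) and applies it to models such as BERT and Performer, where it gives better results.
Reference: the author's blog
Paper: https://arxiv.org/pdf/2104.09864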
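Background
- In NLP, word order is essential to the meaning of a sentence. For example, "你爱我" ("you love me") and "我爱你" ("I love you") mean very different things.
- Self-attention in the Transformer depends only on the tokens, not on their positions: the same token placed at different positions produces exactly the same result when it interacts with a given token. A positional encoding is therefore needed to supply position information.
- Existing positional encodings (absolute or relative) are added directly to the token embedding, which is not well suited to linear self-attention. [The authors only "kindly argue" this point. My own guess is that a linear operation may fail to capture position information injected through addition: position information and tokens should carry different semantic representations and should not share the same computational scheme, and extracting more useful information usually requires some nonlinearity.]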
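The order-blindness of plain self-attention can be checked on a toy example. The sketch below is only an illustration under simplifying assumptions: the projections $W_q$, $W_k$, $W_v$ are taken to be the identity, the embeddings are made-up values, and the helper names `softmax` and `attend` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query dot-product attention with no positional information."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]

# Toy 2-d embeddings for three tokens (hypothetical values).
tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
query = tokens[0]

out_original = attend(query, tokens, tokens)
out_shuffled = attend(query, tokens[::-1], tokens[::-1])  # same tokens, reversed order

# Compare the two outputs up to floating-point error.
order_blind = all(math.isclose(a, b, abs_tol=1e-12)
                  for a, b in zip(out_original, out_shuffled))
print(order_blind)
```

Reversing the context leaves the output for the query unchanged, which is the reason some form of positional signal has to be injected.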
RoPE (Rotary Position Embedding)
RoPE computation logic
Self-attention is computed as follows:
- Let $\mathbb{S}_N = \{w_i\}_{i=1}^{N}$ be a sequence of $N$ input tokens, where $w_i$ is the $i$-th element. The corresponding word embeddings are $\mathbb{E}_N = \{x_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^{d}$ is the $d$-dimensional embedding of token $w_i$. Following the notation of $\mathbb{E}_N$, each $x_i$ is taken to be a column vector.
- Position information is added to the embedding, giving $x_i + p_i$, where $p_i$ is the positional encoding of position $i$.
- Three mappings $f_q$, $f_k$, $f_v$ are then built to incorporate the position information:
$$\begin{aligned} q_m &= f_q(x_m, m) \\ k_n &= f_k(x_n, n) \\ v_n &= f_v(x_n, n) \end{aligned}$$
The three mappings are defined as
$$f_{t:t\in\{q,k,v\}}(x_i, i) = W_{t:t\in\{q,k,v\}}(x_i + p_i)$$
so that $q_m$, $k_n$, $v_n$ carry the information of positions $m$ and $n$ through $f_q$, $f_k$, $f_v$.
- The query and key are then used to compute the attention weights, which are normalized and used to form the output:
$$a_{m,n} = \frac{\exp\left(\frac{q_m^{\top} k_n}{\sqrt{d}}\right)}{\sum_{j=1}^{N} \exp\left(\frac{q_m^{\top} k_j}{\sqrt{d}}\right)}, \qquad o_m = \sum_{n=1}^{N} a_{m,n} v_n$$
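The additive scheme above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the text leaves $p_i$ generic, so the sinusoidal encoding used here is an assumed choice, positions are 0-based, the projections are identity matrices, and the helper names are hypothetical.

```python
import math

def sinusoidal_position(i, d):
    """Assumed choice of p_i: the sinusoidal encoding of the original Transformer."""
    return [math.sin(i / 10000 ** (c / d)) if c % 2 == 0
            else math.cos(i / 10000 ** ((c - 1) / d)) for c in range(d)]

def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attention(X, W_q, W_k, W_v):
    """o_m = sum_n a_{m,n} v_n with q, k, v = W (x_i + p_i)."""
    d = len(X[0])
    H = [[x + p for x, p in zip(x_i, sinusoidal_position(i, d))]
         for i, x_i in enumerate(X)]
    Q = [matvec(W_q, h) for h in H]
    K = [matvec(W_k, h) for h in H]
    V = [matvec(W_v, h) for h in H]
    outputs, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        a_m = [e / total for e in exps]          # attention weights a_{m, .}
        weights.append(a_m)
        outputs.append([sum(a * v[c] for a, v in zip(a_m, V)) for c in range(d)])
    return outputs, weights

identity = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # tokens 0 and 1 share an embedding
outputs, weights = attention(X, identity, identity, identity)
print(weights[0])
print(weights[1])
```

Tokens 0 and 1 have identical embeddings, yet their attention rows differ because $p_0 \neq p_1$; each row of `weights` still sums to 1.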
RoPE derivation
The query and key measure the similarity of two vectors through the inner product $\langle\cdot,\cdot\rangle$. We look for a mapping $g(\cdot)$ that satisfies the relation below, whose inputs are the token embeddings at positions $m$ and $n$ and their relative position $m-n$. The task then reduces to solving for $g(\cdot)$:
$$\left\langle f_q(x_m, m), f_k(x_n, n)\right\rangle = g(x_m, x_n, m-n) \tag{11}$$
The following mappings $f_q(\cdot)$, $f_k(\cdot)$, $g(\cdot)$ satisfy equation (11). Take the two-dimensional case, where $x_m$ and $x_n$ are described by coordinate pairs in the plane (horizontal axis real, vertical axis imaginary), and let $\theta \in \mathbb{R}$ be a preset nonzero constant:
$$\begin{aligned} f_q(x_m, m) &= (W_q x_m)\, e^{i m \theta} \\ f_k(x_n, n) &= (W_k x_n)\, e^{i n \theta} \\ g(x_m, x_n, m-n) &= \mathrm{Re}\left[(W_q x_m)(W_k x_n)^{*}\, e^{i(m-n)\theta}\right] \end{aligned}$$
A short proof in the two-dimensional case:
Let $z_1 = a + bi = (a, b)$ and $z_2 = c + di = (c, d)$. Then $\langle z_1, z_2\rangle = ac + bd$. Moreover $z_1 z_2^{*} = (a+bi)(c+di)^{*} = (a+bi)(c-di) = (ac+bd) + (bc-ad)i$, so $\mathrm{Re}[z_1 z_2^{*}] = ac + bd = \langle z_1, z_2\rangle$, where $z_2^{*}$ is the complex conjugate of $z_2$.
By Euler's formula,
$$\left(e^{i m\theta}\right)^{*} = \left(\cos m\theta + i \sin m\theta\right)^{*} = \cos m\theta - i \sin m\theta = \cos(-m\theta) + i \sin(-m\theta) = e^{-i m\theta}$$
Using these two facts, together with the commutativity of complex multiplication, we can verify $g(x_m, x_n, m-n)$:
$$\begin{aligned} \left\langle f_q(x_m, m), f_k(x_n, n)\right\rangle &= \mathrm{Re}\left[(W_q x_m)\, e^{i m\theta} \cdot \left\{(W_k x_n)\, e^{i n\theta}\right\}^{*}\right] \\ &= \mathrm{Re}\left[(W_q x_m)\, e^{i m\theta} \cdot \left\{e^{i n\theta}\right\}^{*} \cdot (W_k x_n)^{*}\right] \\ &= \mathrm{Re}\left[(W_q x_m)\, e^{i(m-n)\theta} \cdot (W_k x_n)^{*}\right] \\ &= \mathrm{Re}\left[(W_q x_m)(W_k x_n)^{*}\, e^{i(m-n)\theta}\right] \\ &= g(x_m, x_n, m-n) \end{aligned}$$
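The identity just verified can also be checked numerically with Python's built-in complex numbers. The values of `theta`, `q` and `k` below are arbitrary illustrative choices (they stand in for $\theta$, $W_q x_m$ and $W_k x_n$), and the helper names are hypothetical.

```python
import cmath

theta = 0.3                      # any nonzero constant (illustrative value)
q = complex(0.8, -1.1)           # stands in for W_q x_m
k = complex(-0.4, 0.7)           # stands in for W_k x_n

def f(z, pos):
    """Rotate the 2-d vector z (written as a complex number) by pos * theta."""
    return z * cmath.exp(1j * pos * theta)

def inner(z1, z2):
    """Real inner product <z1, z2> = Re[z1 * conj(z2)]."""
    return (z1 * z2.conjugate()).real

def g(q, k, rel):
    """Right-hand side of equation (11) for relative position rel = m - n."""
    return (q * k.conjugate() * cmath.exp(1j * rel * theta)).real

score_a = inner(f(q, 5), f(k, 2))       # m = 5,   n = 2
score_b = inner(f(q, 103), f(k, 100))   # m = 103, n = 100, same m - n
print(score_a, score_b, g(q, k, 3))
```

The three printed numbers agree up to floating-point error: the score depends on the positions only through $m-n$.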
In two dimensions this is computed as
$$f_{\{q,k\}}(x_m, m) = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} W_{\{q,k\}}^{(11)} & W_{\{q,k\}}^{(12)} \\ W_{\{q,k\}}^{(21)} & W_{\{q,k\}}^{(22)} \end{pmatrix} \begin{pmatrix} x_m^{(1)} \\ x_m^{(2)} \end{pmatrix}$$
For reference: to rotate a two-dimensional vector $\vec{v} = (x, y)$ by an angle $\theta$, use the rotation matrix $R = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$. The rotated vector is $\vec{v'} = R\vec{v} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}$.
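As a quick sanity check, the matrix form and the complex form $z\,e^{i m\theta}$ describe the same rotation. The sketch below uses made-up values for the vector and the angle, and `rotate` is a hypothetical helper name.

```python
import cmath
import math

def rotate(vec, angle):
    """Apply the 2x2 rotation matrix to the vector (x, y)."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

vec = (0.8, -1.1)        # illustrative 2-d vector, e.g. W x_m
angle = 7 * 0.3          # m * theta with m = 7, theta = 0.3 (assumed values)

by_matrix = rotate(vec, angle)
z = complex(*vec) * cmath.exp(1j * angle)
by_complex = (z.real, z.imag)
print(by_matrix, by_complex)
```

Both routes give the same coordinates, which is why the derivation can switch freely between complex exponentials and rotation matrices.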
Detailed proof
This part derives $f_q(\cdot)$, $f_k(\cdot)$, $g(\cdot)$ step by step, following the proof in the paper and again working in the two-dimensional complex plane. From $q_m = f_q(x_m, m)$ and $k_n = f_k(x_n, n)$ we have
$$q_m^{\top} k_n = \left\langle f_q(x_m, m), f_k(x_n, n)\right\rangle = g(x_m, x_n, n-m) \tag{12}$$
Pick any two embeddings associated with the query and the key, denoted $x_q$ and $x_k$. Then
$$q_m = f_q(x_q, m), \qquad k_n = f_k(x_k, n)$$
As the initial condition, set both $m$ and $n$ to $0$:
$$q = f_q(x_q, 0), \qquad k = f_k(x_k, 0)$$
In the two-dimensional complex plane every point can be written in terms of its modulus and argument, which gives the following representation:
$$\begin{aligned} f_q(x_q, m) &= R_q(x_q, m)\, e^{i\Theta_q(x_q, m)}, \\ f_k(x_k, n) &= R_k(x_k, n)\, e^{i\Theta_k(x_k, n)}, \\ g(x_q, x_k, n-m) &= R_g(x_q, x_k, n-m)\, e^{i\Theta_g(x_q, x_k, n-m)}, \end{aligned} \tag{※}$$
where $R_q(\cdot)$, $R_k(\cdot)$, $R_g(\cdot)$ are the moduli and $\Theta_q(\cdot)$, $\Theta_k(\cdot)$, $\Theta_g(\cdot)$ are the arguments. Using equation (12) and matching components (moduli equal, arguments equal) gives
$$R_q(x_q, m)\, R_k(x_k, n) = R_g(x_q, x_k, n-m) \tag{13a}$$
$$\Theta_k(x_k, n) - \Theta_q(x_q, m) = \Theta_g(x_q, x_k, n-m) \tag{13b}$$
Likewise, the initial values can be written in terms of modulus and argument:
$$\begin{aligned} q &= \|q\|\, e^{i\theta_q} = R_q(x_q, 0)\, e^{i\Theta_q(x_q, 0)} \\ k &= \|k\|\, e^{i\theta_k} = R_k(x_k, 0)\, e^{i\Theta_k(x_k, 0)} \end{aligned}$$
where $\|q\|$, $\|k\|$, $\theta_q$, $\theta_k$ are the corresponding moduli and arguments. Setting $m = n$, equations (13a) and (13b) become
$$R_q(x_q, m)\, R_k(x_k, m) = R_g(x_q, x_k, 0) = R_q(x_q, 0)\, R_k(x_k, 0) = \|q\|\,\|k\| \tag{14a}$$
$$\Theta_k(x_k, m) - \Theta_q(x_q, m) = \Theta_g(x_q, x_k, 0) = \Theta_k(x_k, 0) - \Theta_q(x_q, 0) = \theta_k - \theta_q \tag{14b}$$
For simplicity, one solution can be read off directly from (14a):
$$\begin{aligned} R_q(x_q, m) &= R_q(x_q, 0) = \|q\| \\ R_k(x_k, n) &= R_k(x_k, 0) = \|k\| \\ R_g(x_q, x_k, n-m) &= R_g(x_q, x_k, 0) = \|q\|\,\|k\| \end{aligned}$$
In this solution the modulus mappings $R_q(\cdot)$, $R_k(\cdot)$, $R_g(\cdot)$ do not depend on the argument, that is, on the position; they depend only on the initial values. Rearranging (14b) gives $\Theta_q(x_q, m) - \theta_q = \Theta_k(x_k, m) - \theta_k$: the argument mapping is the same whether it is applied to the query or to the key, and depends only on the position $m$ and the embedding $x_{\{q,k\}}$. Hence $\Theta_q(\cdot) = \Theta_k(\cdot)$, and we denote both by $\Theta_f(\cdot)$, i.e. $\Theta_f(\cdot) = \Theta_q(\cdot) = \Theta_k(\cdot)$. It follows that $\Theta_f(x_{\{q,k\}}, m) - \theta_{\{q,k\}}$ is a function of the position $m$ only.
To spell this out: $x_q$ and $x_k$ were chosen arbitrarily among the query and key embeddings, so their values cannot affect $\Theta_f(x_{\{q,k\}}, m) - \theta_{\{q,k\}}$, which is therefore a function of the position $m$ alone.
We can therefore write
$$\Theta_f(x_{\{q,k\}}, m) - \theta_{\{q,k\}} = \phi(m) \;\Rightarrow\; \Theta_f(x_{\{q,k\}}, m) = \phi(m) + \theta_{\{q,k\}} \tag{15}$$
Setting $n = m+1$, equation (13b) leads to:
$$\Theta_k(x_k, m+1) - \Theta_q(x_q, m) = \Theta_g(x_q, x_k, 1) \tag{16a}$$
$$\Theta_k(x_k, m+1) = \phi(m+1) + \theta_k \tag{16b}$$
$$\Theta_q(x_q, m) = \phi(m) + \theta_q \tag{16c}$$
$$\phi(m+1) - \phi(m) = \Theta_g(x_q, x_k, 1) + \theta_q - \theta_k \tag{16d}$$
Substituting $n = m+1$ into (13b) gives (16a). Applying (15) to the key mapping at position $m+1$ gives (16b), and applying (15) to the query mapping at position $m$ gives (16c). Subtracting (16c) from (16b) gives $\phi(m+1) + \theta_k - \phi(m) - \theta_q = \Theta_k(x_k, m+1) - \Theta_q(x_q, m)$, which by (16a) equals $\Theta_g(x_q, x_k, 1)$. Rearranging gives $\phi(m+1) - \phi(m) = \Theta_g(x_q, x_k, 1) + \theta_q - \theta_k$, which is (16d).
The right-hand side of (16d) does not depend on $m$, so it can be treated as a constant $\theta$. The difference $\phi(m+1) - \phi(m)$ is thus the constant $\theta$, and $\phi(m)$ is an arithmetic progression:
$$\phi(m) = m\theta + \gamma$$
where $\theta$ is the nonzero common difference and $\gamma$ is a constant initial value. Equation (※) then becomes (the parentheses in the exponent are missing in the authors' arXiv paper):
$$\begin{aligned} f_q(x_q, m) = R_q(x_q, m)\, e^{i\Theta_q(x_q, m)} &= \|q\|\, e^{i(\theta_q + m\theta + \gamma)} = q\, e^{i(m\theta + \gamma)} \\ f_k(x_k, n) = R_k(x_k, n)\, e^{i\Theta_k(x_k, n)} &= \|k\|\, e^{i(\theta_k + n\theta + \gamma)} = k\, e^{i(n\theta + \gamma)} \end{aligned}$$
These equalities hold because the modulus mapping $R(\cdot)$ was shown above to be independent of the argument, the argument mapping is replaced by $\Theta_f(x_{\{q,k\}}, m) = \phi(m) + \theta_{\{q,k\}} = m\theta + \gamma + \theta_{\{q,k\}}$, and $q = \|q\|\, e^{i\theta_q}$ (similarly for $k$).
Finally, since $q$ and $k$ were chosen arbitrarily, we make the following definitions so that the final result is cleaner:
$$\begin{aligned} q &= f_q(x_m, 0) = W_q x_m \\ k &= f_k(x_n, 0) = W_k x_n \end{aligned}$$
and set the initial value $\gamma = 0$, which gives
$$\begin{aligned} f_q(x_m, m) &= (W_q x_m)\, e^{i m\theta} \\ f_k(x_n, n) &= (W_k x_n)\, e^{i n\theta} \end{aligned}$$
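Choosing $\gamma = 0$ costs nothing: in $\langle q\,e^{i(m\theta+\gamma)},\, k\,e^{i(n\theta+\gamma)}\rangle$ the common phase $\gamma$ cancels when the key is conjugated. The sketch below checks this numerically; `theta`, `q`, `k` and the tested values of `gamma` are arbitrary illustrative choices.

```python
import cmath

theta = 0.3                      # illustrative nonzero constant
q = complex(0.8, -1.1)           # stands in for W_q x_m
k = complex(-0.4, 0.7)           # stands in for W_k x_n

def f(z, pos, gamma):
    """General solution z * exp(i (pos * theta + gamma)) before fixing gamma."""
    return z * cmath.exp(1j * (pos * theta + gamma))

def inner(z1, z2):
    """Real inner product <z1, z2> = Re[z1 * conj(z2)]."""
    return (z1 * z2.conjugate()).real

# Same positions (m = 7, n = 4), different constant offsets gamma.
scores = [inner(f(q, 7, gamma), f(k, 4, gamma)) for gamma in (0.0, 0.5, 2.0)]
print(scores)
```

All entries of `scores` coincide up to rounding, so the attention score is unaffected by $\gamma$.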
Efficient computation
Generalizing the two-dimensional case to $d$ dimensions gives
$$f_{\{q,k\}}(x_m, m) = R_{\Theta,m}^{d}\, W_{\{q,k\}}\, x_m$$
with the rotation matrix
$$R_{\Theta,m}^{d} = \begin{pmatrix} \cos m\theta_1 & -\sin m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ \sin m\theta_1 & \cos m\theta_1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \cos m\theta_2 & -\sin m\theta_2 & \cdots & 0 & 0 \\ 0 & 0 & \sin m\theta_2 & \cos m\theta_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & \cos m\theta_{d/2} & -\sin m\theta_{d/2} \\ 0 & 0 & 0 & 0 & \cdots & \sin m\theta_{d/2} & \cos m\theta_{d/2} \end{pmatrix}$$
Here $d$ is even and $\Theta = \{\theta_i = 10000^{-2(i-1)/d},\ i \in [1, 2, \ldots, d/2]\}$. The product $q_m^{\top} k_n$ is then computed as follows:
$$q_m^{\top} k_n = \left(R_{\Theta,m}^{d} W_q x_m\right)^{\top} \left(R_{\Theta,n}^{d} W_k x_n\right) = x_m^{\top} W_q^{\top} R_{\Theta,n-m}^{d} W_k x_n$$
where $R_{\Theta,n-m}^{d} = (R_{\Theta,m}^{d})^{\top} R_{\Theta,n}^{d}$ and $R_{\Theta}^{d}$ is an orthogonal matrix. Because $R_{\Theta}^{d}$ is sparse, applying it through a full matrix multiplication is inefficient, so the product is rewritten element-wise:
$$R_{\Theta,m}^{d}\, x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_{d-1} \\ x_d \end{pmatrix} \otimes \begin{pmatrix} \cos m\theta_1 \\ \cos m\theta_1 \\ \cos m\theta_2 \\ \cos m\theta_2 \\ \vdots \\ \cos m\theta_{d/2} \\ \cos m\theta_{d/2} \end{pmatrix} + \begin{pmatrix} -x_2 \\ x_1 \\ -x_4 \\ x_3 \\ \vdots \\ -x_d \\ x_{d-1} \end{pmatrix} \otimes \begin{pmatrix} \sin m\theta_1 \\ \sin m\theta_1 \\ \sin m\theta_2 \\ \sin m\theta_2 \\ \vdots \\ \sin m\theta_{d/2} \\ \sin m\theta_{d/2} \end{pmatrix}$$
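The element-wise form can be compared against the explicit block-diagonal matrix on a small example. This is a minimal pure-Python sketch rather than any library's implementation: the function names `rope_angles`, `rope_matrix`, `rope_fast` and `dot` are hypothetical, the vectors are made-up values standing in for $W_q x_m$ and $W_k x_n$, and indices are 0-based.

```python
import math

def rope_angles(d, base=10000.0):
    """theta_i = base^(-2(i-1)/d) for i = 1..d/2, stored with a 0-based index."""
    return [base ** (-2 * i / d) for i in range(d // 2)]

def rope_matrix(m, d):
    """Explicit d x d block-diagonal rotation matrix R^d_{Theta,m}."""
    R = [[0.0] * d for _ in range(d)]
    for i, theta in enumerate(rope_angles(d)):
        c, s = math.cos(m * theta), math.sin(m * theta)
        R[2 * i][2 * i], R[2 * i][2 * i + 1] = c, -s
        R[2 * i + 1][2 * i], R[2 * i + 1][2 * i + 1] = s, c
    return R

def rope_fast(x, m):
    """Element-wise form: x * cos + (pairwise-swapped, sign-flipped x) * sin."""
    d = len(x)
    out = [0.0] * d
    for i, theta in enumerate(rope_angles(d)):
        c, s = math.cos(m * theta), math.sin(m * theta)
        x1, x2 = x[2 * i], x[2 * i + 1]
        out[2 * i] = x1 * c - x2 * s
        out[2 * i + 1] = x2 * c + x1 * s
    return out

def dot(a, b):
    """Plain inner product of two equal-length lists."""
    return sum(u * v for u, v in zip(a, b))

q = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 0.1, -0.3]   # stands in for W_q x_m
k = [1.2, 0.4, -0.6, 0.9, 0.3, -1.1, 0.7, 0.2]      # stands in for W_k x_n

slow = [dot(row, q) for row in rope_matrix(3, len(q))]   # O(d^2) matrix product
fast = rope_fast(q, 3)                                   # O(d) element-wise form

score_near = dot(rope_fast(q, 3), rope_fast(k, 8))       # n - m = 5
score_far = dot(rope_fast(q, 40), rope_fast(k, 45))      # n - m = 5 again
print(score_near, score_far)
```

`slow` and `fast` match, and the two scores agree because both pairs of positions have the same offset $n-m$. The element-wise version touches each coordinate once instead of multiplying by a mostly-zero $d \times d$ matrix.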
Illustrated walkthrough

Long-term decay of RoPE
Take the two-dimensional case first: $q_m^{\top} k_n = \mathrm{Re}\left[(W_q x_m)(W_k x_n)^{*}\, e^{i(m-n)\theta}\right]$, where $W_q x_m$ and $W_k x_n$ are two-dimensional vectors (in $q_m^{\top} k_n$ the former appears as a row vector and the latter as a column vector). In the multi-dimensional case with even dimension $d$, the vectors are split into consecutive pairs so that each block is a two-dimensional vector, which gives the formula below. To avoid confusion with the summation index $i$, the imaginary unit is written $\mathrm{j}$:
$$\left(R_{\Theta,m}^{d} W_q x_m\right)^{\top} \left(R_{\Theta,n}^{d} W_k x_n\right) = \mathrm{Re}\left[\sum_{i=0}^{d/2-1} q_{[2i:2i+1]}\, k_{[2i:2i+1]}^{*}\, e^{\mathrm{j}(m-n)\theta_i}\right]$$
where $q_{[2i:2i+1]}$ denotes entries $2i$ through $2i+1$ of $q$ (two values). Let $h_i = q_{[2i:2i+1]}\, k_{[2i:2i+1]}^{*}$ and $S_l = \sum_{i=0}^{l-1} e^{\mathrm{j}(m-n)\theta_i}$, with $h_{d/2} = 0$ and $S_0 = 0$. The Abel transformation then gives
$$\sum_{i=0}^{d/2-1} q_{[2i:2i+1]}\, k_{[2i:2i+1]}^{*}\, e^{\mathrm{j}(m-n)\theta_i} = \sum_{i=0}^{d/2-1} h_i \left(S_{i+1} - S_i\right) = -\sum_{i=0}^{d/2-1} S_{i+1} \left(h_{i+1} - h_i\right)$$
and therefore
$$\begin{aligned} \left|\sum_{i=0}^{d/2-1} q_{[2i:2i+1]}\, k_{[2i:2i+1]}^{*}\, e^{\mathrm{j}(m-n)\theta_i}\right| &= \left|\sum_{i=0}^{d/2-1} S_{i+1} \left(h_{i+1} - h_i\right)\right| \\ &\leq \sum_{i=0}^{d/2-1} \left|S_{i+1}\right| \left|h_{i+1} - h_i\right| \\ &\leq \left(\max_i \left|h_{i+1} - h_i\right|\right) \sum_{i=0}^{d/2-1} \left|S_{i+1}\right| \end{aligned}$$
Abel transformation: a brief proof of the transformation itself and of the derivation above.
- Definition and formula: let $\{a_n\}$ and $\{b_n\}$ be two sequences and write $B_k = \sum_{i=1}^{k} b_i$ (with $B_0 = 0$). Then $\sum_{k=1}^{n} a_k b_k = \sum_{k=1}^{n} a_k (B_k - B_{k-1}) = a_n B_n - \sum_{k=1}^{n-1} (a_{k+1} - a_k) B_k$.
- Proof:
- From $B_k - B_{k-1} = b_k$ we get $\sum_{k=1}^{n} a_k b_k = \sum_{k=1}^{n} a_k (B_k - B_{k-1})$.
- Expanding: $a_1(B_1 - B_0) + a_2(B_2 - B_1) + a_3(B_3 - B_2) + \cdots + a_n(B_n - B_{n-1})$.
- Regrouping by $B_k$, this equals $a_n B_n - \left[(a_2 B_1 - a_1 B_1) + (a_3 B_2 - a_2 B_2) + \cdots + (a_n B_{n-1} - a_{n-1} B_{n-1}) + a_1 B_0\right]$. Since $B_0 = 0$, it reduces to $a_n B_n - \sum_{k=1}^{n-1} (a_{k+1} - a_k) B_k$.
- Derivation in the paper: with $S_0 = 0$ the sum has the form required by the Abel transformation. The authors also set $h_{d/2} = 0$ to simplify the result, so that $\sum_{i=0}^{d/2-1} h_i (S_{i+1} - S_i) = \sum_{i=0}^{d/2-1} h_i (S_{i+1} - S_i) + h_{d/2}(S_{d/2+1} - S_{d/2}) = \sum_{i=0}^{d/2} h_i (S_{i+1} - S_i)$. Applying the Abel formula (take care with the index substitution) gives $\sum_{i=0}^{d/2} h_i (S_{i+1} - S_i) = h_{d/2} S_{d/2+1} - \sum_{i=0}^{d/2-1} S_{i+1} (h_{i+1} - h_i)$. Since $h_{d/2} = 0$, we obtain $\sum_{i=0}^{d/2} h_i (S_{i+1} - S_i) = -\sum_{i=0}^{d/2-1} S_{i+1} (h_{i+1} - h_i)$.
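The summation-by-parts identity and the resulting bound can be checked numerically. The sketch below is only an illustration: the dimension `d`, the relative distance `rel` and the complex values in `h` are made-up stand-ins for $d$, $m-n$ and $h_i = q_{[2i:2i+1]} k_{[2i:2i+1]}^{*}$.

```python
import cmath

d = 16
rel = 5                                           # relative distance m - n (illustrative)
thetas = [10000 ** (-2 * i / d) for i in range(d // 2)]

# Hypothetical per-pair products h_0 .. h_{d/2-1}, with h_{d/2} = 0 appended.
h = [complex(0.3 * i - 1.0, 0.5 - 0.2 * i) for i in range(d // 2)] + [0j]

# Partial sums S_0 = 0 and S_l = sum_{i<l} exp(1j * rel * theta_i).
S = [0j]
for theta in thetas:
    S.append(S[-1] + cmath.exp(1j * rel * theta))

# Left side: sum_i h_i (S_{i+1} - S_i); right side: -sum_i S_{i+1} (h_{i+1} - h_i).
lhs = sum(h[i] * (S[i + 1] - S[i]) for i in range(d // 2))
rhs = -sum(S[i + 1] * (h[i + 1] - h[i]) for i in range(d // 2))

# Upper bound: (max_i |h_{i+1} - h_i|) * sum_i |S_{i+1}|.
bound = (max(abs(h[i + 1] - h[i]) for i in range(d // 2))
         * sum(abs(S[i + 1]) for i in range(d // 2)))
print(abs(lhs - rhs), abs(lhs), bound)
```

The first printed value is at rounding level, and `abs(lhs)` stays below `bound`, matching the two inequalities in the derivation.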
With $\theta_i = 10000^{-2i/d}$, the quantity $\frac{1}{d/2} \sum_{i=1}^{d/2} |S_i|$ decays as $m-n$ grows, that is, as the distance between the two tokens increases.

The formula above can be implemented in Python as follows:
import numpy as np
import matplotlib.pyplot as plt

# Vector dimension d
d = 128


def theta(t):
    """Rotary frequency theta_t = 10000^(-2t/d)."""
    return 10000 ** (-2 * t / d)


def f(m):
    """Average of |S_1|, ..., |S_{d/2}| at relative distance m."""
    result = 0
    for j in range(d // 2):
        # S_{j+1} = sum_{i=0}^{j} exp(1j * m * theta_i)
        inner_sum = np.sum(np.exp(1j * m * theta(np.arange(0, j + 1))))
        result += np.abs(inner_sum)
    return result / (d / 2)


# Range of relative distances m
m_values = np.linspace(0, 256, 500)
# Function value for each m
f_values = [f(m) for m in m_values]

# Plot the curve
plt.plot(m_values, f_values)
plt.xlabel('relative distance')
plt.ylabel('relative magnitude')
plt.title('Relative magnitude vs. relative distance')
plt.grid(True)
plt.show()
The resulting curve closely follows the one in the original paper.
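For readers without a plotting backend, the same quantity can be tabulated with the standard library alone. This is a plot-free sketch under the same assumptions as the formula ($d = 128$, $\theta_i = 10000^{-2i/d}$); the function name `mean_partial_sum` and the sampled distances are hypothetical choices.

```python
import cmath

def mean_partial_sum(rel, d=128):
    """(1/(d/2)) * sum_{l=1}^{d/2} |S_l| with theta_i = 10000^(-2i/d)."""
    partial, total = 0j, 0.0
    for i in range(d // 2):
        partial += cmath.exp(1j * rel * 10000 ** (-2 * i / d))  # S_{i+1}
        total += abs(partial)
    return total / (d // 2)

distances = [0, 1, 16, 64, 256]
values = [mean_partial_sum(rel) for rel in distances]
for rel, value in zip(distances, values):
    print(f"{rel:>4d}  {value:.3f}")
```

At distance 0 every term of $S_l$ equals 1, so $|S_l| = l$, which by the triangle inequality is the largest value $|S_l|$ can take; the entry for distance 0 is therefore the maximum of the table.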
