This post records the concrete matrix operations behind self-attention, including how the heads are separated, how the QK multiplication is carried out, and other details, so that I can look them up later.
Dot-Product Self-Attention
Note:
Here $X^{n, d_{model}}$ is typically the input sequence, where $n$ is the sequence length and $d_{model}$ is the embedding dimension of the sequence. In self-attention, $d_k$ and $d_v$ are equal. In $\frac{QK^T}{\sqrt{d_k}}$, dividing by $\sqrt{d_k}$ keeps the dot products from becoming too large when $d_k$ is large; it acts as a normalization. softmax(A, axis=-1) normalizes over the last dimension, i.e. for row 0 it normalizes across all the columns of row 0, so that each word in Q gets a weight over every position in K. Regarding the axis argument: for a 2D matrix, axis=-1 is the same as axis=1, meaning the column index varies while the row index stays fixed, so each row of the attention matrix is normalized to sum to 1.
Multi-Head Self-Attention
Here $X^{bs, length, emb}$ is typically the input sequence; the meaning of each dimension is as its name suggests.
First, three weight matrices $W_Q$, $W_K$, $W_V$ project the input into Q, K, V with a new last dimension emb2 (the projection_dim); of course the original dimension can also be kept unchanged. Taking $Q^{bs, length, emb2}$ as an example, the heads then have to be separated out, which is done by reshaping (and transposing) the last dimension of Q before the QK matrix multiplication. Note that Q and K are now four-dimensional tensors ($Q^{bs, head, length, emb2//head}$); the multiplication keeps bs and head fixed and only multiplies over the last two dimensions, so the resulting attention matrix is $A^{bs, head, length, length}$, which means that every word of the sequence has an attention value with respect to every other word.
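A small sketch of this batched matmul over the last two dimensions (the shapes below are assumed toy values, chosen only for illustration):

import tensorflow as tf

# Toy shapes: bs=2, head=8, length=5, depth=emb2//head=16 (illustrative values).
bs, head, length, depth = 2, 8, 5, 16
Q = tf.random.normal((bs, head, length, depth))
K = tf.random.normal((bs, head, length, depth))

# tf.matmul treats the leading dimensions (bs, head) as batch dimensions and
# multiplies only the last two, so A has shape (bs, head, length, length).
A = tf.nn.softmax(
    tf.matmul(Q, K, transpose_b=True) / tf.sqrt(tf.cast(depth, tf.float32)),
    axis=-1)
print(A.shape)  # (2, 8, 5, 5)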
Multiplying $A$ with $V$ (in the code: output = tf.matmul(weights, value), with no transpose) gives $Y^{bs, head, length, emb2//head}$. Finally Y is transposed and reshaped to remove the head dimension and recover the original shape of Q/K/V: $Y^{bs, length, emb2}$.
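The head separation and the final merge are just a reshape plus a transpose; here is a minimal sketch of the shape round-trip (the dimensions are assumed toy values, only for illustration):

import tensorflow as tf

bs, length, emb2, head = 2, 5, 64, 8   # toy values; emb2 must be divisible by head
Y = tf.random.normal((bs, length, emb2))

# Separate heads: (bs, length, emb2) -> (bs, length, head, emb2//head) -> (bs, head, length, emb2//head)
y = tf.reshape(Y, (bs, -1, head, emb2 // head))
y = tf.transpose(y, perm=[0, 2, 1, 3])
print(y.shape)  # (2, 8, 5, 8)

# Merge heads back: (bs, head, length, emb2//head) -> (bs, length, head, emb2//head) -> (bs, length, emb2)
y = tf.transpose(y, perm=[0, 2, 1, 3])
y = tf.reshape(y, (bs, -1, emb2))
print(y.shape)  # (2, 5, 64)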
Encoder Mask & Decoder Mask
Reference
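The referenced article is not reproduced here. As a rough sketch of what these two masks usually mean (my own reading of the section title, not from the original post): the encoder mask hides padding positions, while the decoder mask additionally hides future positions (a look-ahead / causal mask).

import tensorflow as tf

# Padding mask (encoder): 1 where the token is padding (id 0 is an assumed convention).
seq = tf.constant([[7, 6, 0, 0], [1, 2, 3, 0]])
padding_mask = tf.cast(tf.equal(seq, 0), tf.float32)   # (bs, length)

# Look-ahead mask (decoder): upper-triangular 1s hide future positions.
length = 4
look_ahead_mask = 1.0 - tf.linalg.band_part(tf.ones((length, length)), -1, 0)

# A mask is typically applied by adding a large negative number to the masked
# scores before the softmax, e.g. scaled_score += mask * -1e9.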
Questions
Why can attention learn the connection between a word and the word most related to it? (CSDN answer)
Why can BERT's three embeddings be added together? (Zhihu answer)
Self-Attention Implementation
import tensorflow as tf
from tensorflow.keras import layers


class MultiHeadSelfAttention(layers.Layer):
    def __init__(self, embed_dim, num_heads=8):
        super(MultiHeadSelfAttention, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        if embed_dim % num_heads != 0:
            raise ValueError(
                f"embedding dimension = {embed_dim} should be divisible by number of heads = {num_heads}"
            )
        self.projection_dim = embed_dim // num_heads
        self.query_dense = layers.Dense(embed_dim)
        self.key_dense = layers.Dense(embed_dim)
        self.value_dense = layers.Dense(embed_dim)
        self.combine_heads = layers.Dense(embed_dim)

    def attention(self, query, key, value):
        # QK^T: (bs, head, length, length)
        score = tf.matmul(query, key, transpose_b=True)
        # scale by sqrt(d_k) so the dot products do not grow too large
        dim_key = tf.cast(tf.shape(key)[-1], tf.float32)
        scaled_score = score / tf.math.sqrt(dim_key)
        # normalize each row: every query's weights over the keys sum to 1
        weights = tf.nn.softmax(scaled_score, axis=-1)
        # weighted sum of the values: (bs, head, length, projection_dim)
        output = tf.matmul(weights, value)
        return output, weights

    def separate_heads(self, x, batch_size):
        # (bs, length, embed_dim) -> (bs, length, head, projection_dim)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim))
        # -> (bs, head, length, projection_dim)
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, inputs):
        # inputs: (bs, length, embed_dim)
        batch_size = tf.shape(inputs)[0]
        query = self.query_dense(inputs)  # (bs, length, embed_dim)
        key = self.key_dense(inputs)      # (bs, length, embed_dim)
        value = self.value_dense(inputs)  # (bs, length, embed_dim)
        query = self.separate_heads(query, batch_size)  # (bs, head, length, projection_dim)
        key = self.separate_heads(key, batch_size)      # (bs, head, length, projection_dim)
        value = self.separate_heads(value, batch_size)  # (bs, head, length, projection_dim)
        attention, weights = self.attention(query, key, value)
        # move head back next to projection_dim: (bs, length, head, projection_dim)
        attention = tf.transpose(attention, perm=[0, 2, 1, 3])
        # merge heads: (bs, length, embed_dim)
        concat_attention = tf.reshape(attention, (batch_size, -1, self.embed_dim))
        output = self.combine_heads(concat_attention)  # (bs, length, embed_dim)
        return output
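A quick usage check of the layer above (the batch size, sequence length, and embedding size are made-up toy values):

x = tf.random.normal((2, 10, 64))   # (batch_size, seq_len, embed_dim)
layer = MultiHeadSelfAttention(embed_dim=64, num_heads=8)
y = layer(x)
print(y.shape)                      # (2, 10, 64): same shape as the input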