Attention Is All You Need: Points to Note


This post records the concrete matrix operations of self-attention, including how the heads are separated and how the QK multiplication is carried out, so that I can look it up later.

Dot-Product Self-Attention

Note:

Here $X^{n, d_{model}}$ is usually the input sequence, where $n$ is the sequence length and $d_{model}$ is the embedding dimension. In self-attention, $d_k$ and $d_v$ are equal. In $\frac{QK^T}{\sqrt{d_k}}$, dividing by $\sqrt{d_k}$ prevents the dot products from becoming too large when $d_k$ is large; it acts as a normalization. softmax(A, axis=-1) normalizes over the last dimension, i.e., for row 0 it normalizes across all of row 0's columns; the goal is to obtain, for each word in Q, its weights over the words in K. As for the axis argument: for a 2-D matrix, axis=1 (the last axis here) means the column index varies while the row index stays fixed, so the softmax is applied across the columns of each row and every row sums to 1.
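A minimal sketch of this computation (assuming TensorFlow, with small made-up shapes; the full multi-head layer is given further below):

import tensorflow as tf

# Toy shapes: n = 4 tokens, d_k = d_v = 8 (in self-attention d_k and d_v are equal).
n, d_k = 4, 8
Q = tf.random.normal((n, d_k))
K = tf.random.normal((n, d_k))
V = tf.random.normal((n, d_k))

score = tf.matmul(Q, K, transpose_b=True) / tf.math.sqrt(tf.cast(d_k, tf.float32))  # (n, n)
A = tf.nn.softmax(score, axis=-1)  # each row sums to 1: weights of one query word over all keys
Y = tf.matmul(A, V)                # (n, d_v)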

Multi-Head Self-Attention

Here $X^{bs, length, emb}$ is usually the input sequence; the dimensions mean what their names suggest (batch size, sequence length, embedding dimension).

First, three weight matrices $W_Q$, $W_K$, $W_V$ project the input into Q, K, and V with a new last dimension emb2, i.e. the projection_dim (the original dimension can also be kept unchanged). Taking $Q^{bs, length, emb2}$ as an example, the heads have to be separated out, which is done by reshaping the last dimension of Q (and likewise K and V) into (head, emb2 // head) and transposing. Then Q and K are matrix-multiplied. Note that Q and K now have four dimensions ($Q^{bs, head, length, emb2//head}$); the multiplication keeps bs and head fixed and operates only on the last two dimensions, so the resulting attention matrix is $A^{bs, head, length, length}$, meaning that every word in the sequence has an attention value for every other word. Multiplying $A$ by $V$ gives $Y^{bs, head, length, emb2//head}$; reshaping Y to drop the head dimension restores the original shape of Q/K/V, $Y^{bs, length, emb2}$. The shape bookkeeping is sketched below.
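A shape-only sketch of the steps above (assuming TensorFlow; bs, length, emb2 and head are illustrative values chosen so that emb2 is divisible by head):

import tensorflow as tf

bs, length, emb2, head = 2, 5, 16, 4
Q = tf.random.normal((bs, length, emb2))
K = tf.random.normal((bs, length, emb2))
V = tf.random.normal((bs, length, emb2))

def separate_heads(x):
    # (bs, length, emb2) -> (bs, head, length, emb2 // head)
    x = tf.reshape(x, (bs, length, head, emb2 // head))
    return tf.transpose(x, perm=[0, 2, 1, 3])

Qh, Kh, Vh = separate_heads(Q), separate_heads(K), separate_heads(V)
score = tf.matmul(Qh, Kh, transpose_b=True) / tf.math.sqrt(tf.cast(emb2 // head, tf.float32))
A = tf.nn.softmax(score, axis=-1)                                        # (bs, head, length, length)
Yh = tf.matmul(A, Vh)                                                    # (bs, head, length, emb2 // head)
Y = tf.reshape(tf.transpose(Yh, perm=[0, 2, 1, 3]), (bs, length, emb2))  # (bs, length, emb2)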

Encoder Mask & Decoder Mask

References

Questions

Why can attention learn the connection between a word and the word most related to it? (CSDN answer)
Why can BERT's three embeddings be added together? (Zhihu answer)

Self-Attention Implementation

import tensorflow as tf
from tensorflow.keras import layers


class MultiHeadSelfAttention(layers.Layer):
    def __init__(self, embed_dim, num_heads=8):
        super(MultiHeadSelfAttention, self).__init__()
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        if embed_dim % num_heads != 0:
            raise ValueError(
                f"embedding dimension = {embed_dim} should be divisible by number of heads = {num_heads}"
            )
        self.projection_dim = embed_dim // num_heads
        self.query_dense = layers.Dense(embed_dim)
        self.key_dense = layers.Dense(embed_dim)
        self.value_dense = layers.Dense(embed_dim)
        self.combine_heads = layers.Dense(embed_dim)

    def attention(self, query, key, value):
        # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
        score = tf.matmul(query, key, transpose_b=True)
        dim_key = tf.cast(tf.shape(key)[-1], tf.float32)
        scaled_score = score / tf.math.sqrt(dim_key)
        weights = tf.nn.softmax(scaled_score, axis=-1)
        output = tf.matmul(weights, value)
        return output, weights

    def separate_heads(self, x, batch_size):
        # (batch_size, seq_len, embed_dim) -> (batch_size, num_heads, seq_len, projection_dim)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.projection_dim))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, inputs):
        # inputs.shape = (batch_size, seq_len, embedding_dim)
        batch_size = tf.shape(inputs)[0]
        query = self.query_dense(inputs)  # (batch_size, seq_len, embed_dim)
        key = self.key_dense(inputs)      # (batch_size, seq_len, embed_dim)
        value = self.value_dense(inputs)  # (batch_size, seq_len, embed_dim)
        query = self.separate_heads(query, batch_size)  # (batch_size, num_heads, seq_len, projection_dim)
        key = self.separate_heads(key, batch_size)      # (batch_size, num_heads, seq_len, projection_dim)
        value = self.separate_heads(value, batch_size)  # (batch_size, num_heads, seq_len, projection_dim)
        attention, weights = self.attention(query, key, value)
        attention = tf.transpose(attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len, num_heads, projection_dim)
        concat_attention = tf.reshape(
            attention, (batch_size, -1, self.embed_dim)
        )  # (batch_size, seq_len, embed_dim)
        output = self.combine_heads(concat_attention)  # (batch_size, seq_len, embed_dim)
        return output
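A quick usage check for the layer above (a sketch; the shapes are chosen arbitrarily):

layer = MultiHeadSelfAttention(embed_dim=64, num_heads=8)
x = tf.random.normal((2, 10, 64))  # (batch_size, seq_len, embed_dim)
y = layer(x)
print(y.shape)  # (2, 10, 64)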