201020 Study Notes (BERT)


Prerequisites: word2vec, RNN models, and how word vectors are built. Focus: the Transformer architecture, BERT's training method, and practical application. The basic structure is still the Seq2Seq network common in machine-translation models.

Problems with traditional RNNs: each step needs the previous step's output, so computation cannot be parallelized. Transformer: the self-attention mechanism allows parallel computation, with input and output of the same length; all output positions are computed simultaneously, and it has largely replaced RNNs. The idea is to fuse the surrounding context into each word vector. For two words x1 and x2: step 1, initialize the vectors (e.g., a 4-dimensional encoding with four features); step 2, compute the Q, K, and V matrices with the help of three auxiliary weight matrices.

Compute a score between the current word and every other word. Softmax then gives how much each word influences the position being encoded. Larger vector dimensions produce larger raw scores without being more important, so the effect of dimensionality has to be removed (scaling). Each word computes a score against every K in the sequence, then redistributes the features according to those scores to obtain its attention value. Overall: multi-head attention. One set of Q/K/V extracts one feature representation of the current word; multiple sets extract multiple representations. After self-attention, layer normalization is applied.
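A minimal NumPy sketch of that scaled dot-product attention step, for intuition only (the weight matrices and the 4-dimensional features are made up, not from the course code):

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)   # score of every word against every other word, scaled to remove the dimension effect
        weights = softmax(scores, axis=-1)
        return weights @ V                # redistribute the value features according to the weights

    # two words, 4-dimensional features, as in the note
    x = np.random.randn(2, 4)
    Wq, Wk, Wv = (np.random.randn(4, 4) for _ in range(3))
    out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
    print(out.shape)  # (2, 4)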

Decoder: the input and the output are both sequences.

Model: BERT_BASE_DIR. Data: glue_data. We pick MRPC — decide whether two strings describe the same meaning.

Files in BERT_BASE_DIR/uncased…/: bert_config.json — configuration parameters; ckpt — Google's saved pre-trained model; vocab.txt — the vocabulary (all words in the corpus).

run_classifier.py — set up the run configuration. Arguments:

--task_name=MRPC \
--do_train=true \
--do_eval=true \
--data_dir=…/GLUE/glue_data/MRPC \
--vocab_file=…/GLUE/BERT_BASE_DIR/uncased_L-12_H-768_A-12/vocab.txt \
--bert_config_file=…/GLUE/BERT_BASE_DIR/uncased_L-12_H-768_A-12/bert_config.json \
--init_checkpoint=…/GLUE/BERT_BASE_DIR/uncased_L-12_H-768_A-12/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=1 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--output_dir=…/GLUE/output

task_name is the task name; do_train controls whether to train; do_eval controls whether to evaluate. On Windows, avoid absolute paths and avoid Chinese characters in paths.

run_classifier.py: lines 177-192 — reading the data is something you implement yourself. Line 842: train_examples = processor.get_train_examples(FLAGS.data_dir) reads the data (jump to line 299).

num_train_steps = int(len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)

↑ Line 844: train_examples has 3668 entries; with batch_size = 100 that is roughly 3668 / 100 ≈ 37 steps per epoch, and with 3 epochs the total is about int(3668 / 100 * 3) = 110 training steps.

num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)

↑ Line 845, num_warmup_steps: keep the learning rate small at the start of training and restore it after the warmup phase (with warmup_proportion set to 0.1, the learning rate is restored after about 110 × 0.1 = 11 warmup steps).
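As a quick sanity check, plugging the note's numbers (3668 examples, batch size 100, 3 epochs, warmup proportion 0.1) into the two formulas above:

    num_examples = 3668
    train_batch_size = 100
    num_train_epochs = 3.0
    warmup_proportion = 0.1

    num_train_steps = int(num_examples / train_batch_size * num_train_epochs)   # 110
    num_warmup_steps = int(num_train_steps * warmup_proportion)                 # 11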

Line 869: data reading.

file_based_convert_examples_to_features(train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)

Inside file_based_convert_examples_to_features:

writer = tf.python_io.TFRecordWriter(output_file)

↑ Line 483: convert to the TFRecord format via TFRecordWriter.

if ex_index % 10000 == 0:
  tf.logging.info("Writing example %d of %d" % (ex_index, len(examples)))

↑ Line 485: log progress every 10000 examples.

feature = convert_single_example(ex_index, example, label_list,
                                 max_seq_length, tokenizer)

↑ Line 489: the core function; jump to line 377.

label_map = {}
for (i, label) in enumerate(label_list):  # build the label map
  label_map[label] = i

Line 389: build the label map; there are two classes, 0 and 1.

tokens_a = tokenizer.tokenize(example.text_a)  # tokenize the first sentence

Line 393: tokenization; jump to tokenization.py line 170.

split_tokens = []
for token in self.basic_tokenizer.tokenize(text):
  for sub_token in self.wordpiece_tokenizer.tokenize(token):
    split_tokens.append(sub_token)

↑ WordPiece example: <class 'list'>: ['am', '##ro', '##zi', 'accused', 'his', 'brother', ',', 'whom', 'he', 'called', '"', 'the', 'witness', '"', ',', 'of', 'deliberately', 'di', '##stor', '##ting', 'his', 'evidence', '.']. Chinese text is basically split into individual characters; the overall idea is always to split into finer-grained pieces. After tokenization we return to run_classifier; if a second sentence exists, it is tokenized as well (checked at line 398).
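A minimal sketch of calling the repo's tokenizer directly (assuming the BERT repo's tokenization.py is on the path and the uncased vocab.txt is used; the sentence is just an example):

    import tokenization  # from the google-research/bert repo

    tokenizer = tokenization.FullTokenizer(
        vocab_file="BERT_BASE_DIR/uncased_L-12_H-768_A-12/vocab.txt",
        do_lower_case=True)

    tokens = tokenizer.tokenize("Amrozi accused his brother of deliberately distorting his evidence.")
    ids = tokenizer.convert_tokens_to_ids(tokens)  # indices into vocab.txt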

if tokens_b:
  # Modifies `tokens_a` and `tokens_b` in place so that the total
  # length is less than the specified length.
  # Account for [CLS], [SEP], [SEP] with "- 3"  (reserve 3 special tokens)
  _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)  # truncate if the pair is too long
else:
  # Account for [CLS] and [SEP] with "- 2"
  if len(tokens_a) > max_seq_length - 2:
    tokens_a = tokens_a[0:(max_seq_length - 2)]

① If the input is too long, truncate it. ② With sentence b present, three special tokens are reserved; without b, two. Encoding starts at line 408; the code's own comment explains the convention:

# The convention in BERT is:
# (a) For sequence pairs:
#  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
#  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1      <- indicates which sentence each token comes from
# (b) For single sequences:
#  tokens:   [CLS] the dog is hairy . [SEP]
#  type_ids: 0     0   0   0  0     0 0
#
# Where "type_ids" are used to indicate whether this is the first
# sequence or the second sequence. The embedding vectors for `type=0` and
# `type=1` were learned during pre-training and are added to the wordpiece
# embedding vector (and position vector). This is not *strictly* necessary
# since the [SEP] token unambiguously separates the sequences, but it makes
# it easier for the model to learn the concept of sequences.
#
# For classification tasks, the first vector (corresponding to [CLS]) is
# used as the "sentence vector". Note that this only makes sense because
# the entire model is fine-tuned.

type_id = 0 marks the first sentence, 1 marks the second.

tokens = []
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for token in tokens_a:
  tokens.append(token)
  segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)

Line 426: encoding starts. The first token is always [CLS], with segment id 0; then every WordPiece token of sentence a is appended with type_id 0. After all of a's tokens are added, the separator [SEP] is appended, again with 0.

if tokens_b:
  for token in tokens_b:
    tokens.append(token)
    segment_ids.append(1)
  tokens.append("[SEP]")
  segment_ids.append(1)

Line 436: append b (if it exists) with type_id 1; tokens are looked up through their indices in vocab.txt. Line 443 converts the tokens to IDs (via vocab.txt): <class 'list'>: [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102]

max_seq_length = 128; shorter sequences are padded with 0.

while len(input_ids) < max_seq_length:  # the amount of padding depends on the configured max length
  input_ids.append(0)
  input_mask.append(0)
  segment_ids.append(0)

↑ Line 450: because of the padding, self-attention needs an extra mask to distinguish padded zeros. Real tokens get input_mask = 1 and take part in the self-attention computation; padded positions get input_mask = 0 and do not. input_ids: <class 'list'>: [101, 2572, 3217, 5831, 5496, 2010, 2567, 1010, 3183, 2002, 2170, 1000, 1996, 7409, 1000, 1010, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 7727, 2000, 2032, 2004, 2069, 1000, 1996, 7409, 1000, 1010, 2572, 3217, 5831, 5496, 2010, 2567, 1997, 9969, 4487, 23809, 3436, 2010, 3350, 1012, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] input_mask: <class 'list'>: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] Line 459 prints the output:

Line 470: InputFeatures -> line 161, a container class whose __init__ just assigns its fields. Line 485: a for loop iterates over every example. Line 496 onward: the example is processed.

Lines 496-502: convert the data into the feature types used by TFRecord.

Lines 504-505:

tf_example = tf.train.Example(features=tf.train.Features(feature=features))
writer.write(tf_example.SerializeToString())

Serialize one tf.train.Example and write it to the writer.

Embedding layer: the BERT model is created starting at line 574.

def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels, use_one_hot_embeddings):
  """Creates a classification model."""
  model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,          # (8, 128)
      input_mask=input_mask,        # (8, 128)
      token_type_ids=segment_ids,   # (8, 128)
      use_one_hot_embeddings=use_one_hot_embeddings)

config: the configuration file. is_training: whether we are training. input_ids: batch size 8, each sentence 128 tokens long. input_mask: 0 or 1, i.e., padded position or real content. segment_ids: which sentence each token belongs to.

modeling.py, line 165:

if input_mask is None:  # if no mask is given, default to all ones
  input_mask = tf.ones(shape=[batch_size, seq_length], dtype=tf.int32)

if token_type_ids is None:
  token_type_ids = tf.zeros(shape=[batch_size, seq_length], dtype=tf.int32)

Mask: if none is supplied, one is added automatically with all ones (not great for self-attention with padding). type_ids: if the number of sentences is not given, assume a single sentence and set everything to 0.

First build the embedding layer, starting at line 171. Each of the 128 positions is turned into a vector; the three encodings (token, segment, position) must have the same dimensionality so they can be added together.

with tf.variable_scope(scope, default_name="bert"):
  with tf.variable_scope("embeddings"):
    # Perform embedding lookup on the word ids.
    (self.embedding_output, self.embedding_table) = embedding_lookup(
        input_ids=input_ids,
        vocab_size=config.vocab_size,
        embedding_size=config.hidden_size,
        initializer_range=config.initializer_range,
        word_embedding_name="word_embeddings",
        use_one_hot_embeddings=use_one_hot_embeddings)

↑ Line 171, the embedding lookup: input_ids is 8 x 128, vocab_size is a little over thirty thousand (from the pre-trained model), embedding_size is the target dimensionality (768 in the official model), initializer_range is the initializer's range (0.02), use_one_hot_embeddings defaults to False. Do not change these parameters when using a pre-trained model.

Additional encoded features: the input has two dimensions, (batch_size x max_length) = 8 x 128; the output is a batch_size x max_length x 768 tensor. modeling.py lines 171-180 complete the word embedding; line 409 onward:

if input_ids.shape.ndims == 2:
  input_ids = tf.expand_dims(input_ids, axis=[-1])

embedding_table = tf.get_variable(  # the word embedding matrix, (30522, 768)
    name=word_embedding_name,
    shape=[vocab_size, embedding_size],
    initializer=create_initializer(initializer_range))

flat_input_ids = tf.reshape(input_ids, [-1])
if use_one_hot_embeddings:
  one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
  output = tf.matmul(one_hot_input_ids, embedding_table)
else:
  output = tf.gather(embedding_table, flat_input_ids)  # on CPU/GPU: (1024, 768), the lookup for every token in the batch

First an extra dimension is added to the input: 8 x 128 x 1. Flattening: flat_input_ids = the flattened input_ids (8 x 128 x 1) = 1024 ids; output is 1024 x 768.

input_shape = get_shape_list(input_ids)

output = tf.reshape(output,
                    input_shape[0:-1] + [input_shape[-1] * embedding_size])  # (8, 128, 768)
return (output, embedding_table)

↑ Line 421: output has three dimensions, 8 x 128 x 768 = batch_size x tokens per sentence x vector per token. The words have become vectors.

Lines 184-194 complete the position encoding (position_embedding). It only mixes the information in and does not change the shape. Jump to line 472:

if use_token_type:
  if token_type_ids is None:
    raise ValueError("`token_type_ids` must be specified if"
                     "`use_token_type` is True.")
  token_type_table = tf.get_variable(  # (2, 768)
      name=token_type_embedding_name,
      shape=[token_type_vocab_size, width],
      initializer=create_initializer(initializer_range))
  # This vocab will be small so we always do one-hot here, since it is always
  # faster for a small vocabulary.
  flat_token_type_ids = tf.reshape(token_type_ids, [-1])  # (1024)
  one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
  token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
  token_type_embeddings = tf.reshape(token_type_embeddings,
                                     [batch_size, seq_length, width])  # (8, 128, 768)
  output += token_type_embeddings

if use_position_embeddings:
  assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
  with tf.control_dependencies([assert_op]):
    full_position_embeddings = tf.get_variable(
        name=position_embedding_name,
        shape=[max_position_embeddings, width],
        initializer=create_initializer(initializer_range))

Because at most two sentences are assumed, the segment embedding table is (2, 768) and its first dimension only has the values 0 and 1; each token is marked as 0 or 1. The one-hot here is just for speed: token_type_embeddings = tf.matmul(one_hot_ids, token_type_table) multiplies (1024 x 2) by (2 x 768). Each of the 1024 tokens has 2 possible segment ids, and each possibility in the table is a 768-dimensional vector, so the result is again 1024 x 768, which is then reshaped to 8 x 128 x 768. full_position_embeddings is 512 x 768. Lines 505-507:
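A small NumPy sketch (my own, not from modeling.py) showing that the one-hot matmul is just a row lookup into the table:

    import numpy as np

    token_type_table = np.random.randn(2, 768)            # (2, 768) segment embedding table
    flat_token_type_ids = np.random.randint(0, 2, 1024)   # 0 or 1 for each of the 1024 tokens

    one_hot = np.eye(2)[flat_token_type_ids]              # (1024, 2)
    via_matmul = one_hot @ token_type_table               # (1024, 2) x (2, 768) -> (1024, 768)
    via_gather = token_type_table[flat_token_type_ids]    # direct row lookup

    print(np.allclose(via_matmul, via_gather))            # True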

position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                               [seq_length, -1])  # the position table is generous in size; slice out only the useful part -> (128, 768)
num_dims = len(output.shape.as_list())

A slice is taken from the 512 x 768 table; the returned position_embeddings only keeps the 128 x 768 part that is actually used.

Lines 512-518:

position_broadcast_shape = []
for _ in range(num_dims - 2):
  position_broadcast_shape.append(1)
position_broadcast_shape.extend([seq_length, width])  # [1, 128, 768]: the position encoding is independent of the input data;
                                                      # the original embedding has batch_size as its first dimension, so a leading 1 is added for broadcasting
position_embeddings = tf.reshape(position_embeddings,
                                 position_broadcast_shape)
output += position_embeddings

The position embedding obtained now is 128 x 768, and the same one is added to every example in the batch. The position encoding does not depend on which words are passed in; adding a dimension gives [1, 128, 768]. So we have accounted for 1. the segment (2 possibilities) and 2. the position (128 positions).

output = layer_norm_and_dropout(output, dropout_prob)
return output

Finally layer normalization and a dropout layer are applied; the output is the sum of the three embeddings.
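Summarized as a formula (my own notation for the code above):

$$\text{output} = \mathrm{Dropout}\big(\mathrm{LayerNorm}(E_{\text{word}} + E_{\text{segment}} + E_{\text{position}})\big) \in \mathbb{R}^{8 \times 128 \times 768}$$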

Mask mechanism: modeling.py, line 200:

# This converts a 2D mask of shape [batch_size, seq_length] to a 3D
# mask of shape [batch_size, seq_length, seq_length] which is used
# for the attention scores.
attention_mask = create_attention_mask_from_input_mask(
    input_ids, input_mask)

For each word, this determines which words it should attend to (attend to positions marked 1, ignore positions marked 0). The input is 8 x 128; the output is 8 x 128 x 128, where the last 128 says which words each word is allowed to see.
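A rough NumPy sketch of what this broadcast amounts to (shapes taken from the note; the real create_attention_mask_from_input_mask is written in TF):

    import numpy as np

    batch_size, seq_length = 8, 128
    input_mask = np.zeros((batch_size, seq_length), dtype=np.float32)
    input_mask[:, :50] = 1.0   # pretend the first 50 positions are real tokens

    # every query position may attend to every key position whose input_mask is 1
    attention_mask = np.broadcast_to(input_mask[:, None, :],
                                     (batch_size, seq_length, seq_length))
    print(attention_mask.shape)  # (8, 128, 128)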

Transformer: modeling.py, line 205:

self.all_encoder_layers = transformer_model(
    input_tensor=self.embedding_output,
    attention_mask=attention_mask,
    hidden_size=config.hidden_size,
    num_hidden_layers=config.num_hidden_layers,          # number of Transformer encoder layers
    num_attention_heads=config.num_attention_heads,      # number of attention heads
    intermediate_size=config.intermediate_size,          # number of neurons in the feed-forward layer
    intermediate_act_fn=get_activation(config.hidden_act),
    hidden_dropout_prob=config.hidden_dropout_prob,
    attention_probs_dropout_prob=config.attention_probs_dropout_prob,
    initializer_range=config.initializer_range,
    do_return_all_layers=True)                            # whether to return every layer's output

input_tensor: the embedding output from before. attention_mask: maps each position to 0 or 1, i.e., ignore or keep that word. Fine-tuning continues training from the pre-trained weights, so many of these parameters must not be changed. Line 802:

if hidden_size % num_attention_heads != 0:
  raise ValueError(
      "The hidden size (%d) is not a multiple of the number of attention "
      "heads (%d)" % (hidden_size, num_attention_heads))

hidden_size = 768, num_attention_heads = 12, so 768 / 12 = 64 features per head; the per-head vectors are concatenated back together. If it did not divide evenly, the later computation would be awkward. Line 807 ↓

attention_head_size = int(hidden_size / num_attention_heads)  # 768 output features in total, split evenly across the heads
input_shape = get_shape_list(input_tensor, expected_rank=3)   # [8, 128, 768]
batch_size = input_shape[0]
seq_length = input_shape[1]
input_width = input_shape[2]

Line 815 ↓

if input_width != hidden_size:  # the residual connection adds the two tensors, so their widths must match
  raise ValueError("The width of the input tensor (%d) != hidden size (%d)" %
                   (input_width, hidden_size))

This is not concatenation, it is addition: the residual connection requires the input to be 768-dimensional so it can be added to the 768-dimensional output, hence the check.

Reshape: 8 x 128 is flattened to 1024 (probably for speed?). Line 819:

# We keep the representation as a 2D tensor to avoid re-shaping it back and
# forth from a 3D tensor to a 2D tensor. Re-shapes are normally free on
# the GPU/CPU but may not be free on the TPU, so we want to minimize them to
# help the optimizer.
prev_output = reshape_to_matrix(input_tensor)  # the reshape is probably for speed

The input becomes 1024 x 768. Line 825:

all_layer_outputs = []
for layer_idx in range(num_hidden_layers):
  with tf.variable_scope("layer_%d" % layer_idx):
    layer_input = prev_output

Each layer's output is the next layer's input; layer_input and prev_output are both 1024 x 768. The attention starts at line 830.

with tf.variable_scope("attention"): attention_heads = [] with tf.variable_scope("self"): attention_head = attention_layer( from_tensor=layer_input, to_tensor=layer_input, attention_mask=attention_mask, num_attention_heads=num_attention_heads, size_per_head=attention_head_size, attention_probs_dropout_prob=attention_probs_dropout_prob, initializer_range=initializer_range, do_return_2d_tensor=True, batch_size=batch_size, from_seq_length=seq_length, to_seq_length=seq_length) attention_heads.append(attention_head)

from_tensor and to_tensor are both layer_input: attention over itself (self-attention). attention_mask supplies the 1s and 0s. A 2D tensor is returned, and both sequence lengths are 128.

Line 558: building attention_layer:

def attention_layer(from_tensor,
                    to_tensor,
                    attention_mask=None,
                    num_attention_heads=1,
                    size_per_head=512,
                    query_act=None,
                    key_act=None,
                    value_act=None,
                    attention_probs_dropout_prob=0.0,
                    initializer_range=0.02,
                    do_return_2d_tensor=False,
                    batch_size=None,
                    from_seq_length=None,
                    to_seq_length=None):
  """Performs multi-headed attention from `from_tensor` to `to_tensor`."""

Line 637:

from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])  # [1024, 768]
to_shape = get_shape_list(to_tensor, expected_rank=[2, 3])      # [1024, 768]

# Scalar dimensions referenced here:
#   B = batch size (number of sequences)   8
#   F = `from_tensor` sequence length      128
#   T = `to_tensor` sequence length        128
#   N = `num_attention_heads`              12
#   H = `size_per_head`                    64

Building the Q, K, V matrices. Query: line 666:

# `query_layer` = [B*F, N*H]
query_layer = tf.layers.dense(
    from_tensor_2d,
    num_attention_heads * size_per_head,
    activation=query_act,
    name="query",
    kernel_initializer=create_initializer(initializer_range))

The Query matrix is built from from_tensor; num_attention_heads = 12 heads, size_per_head = 64, so query_layer is 1024 x 768 (B*F, N*H). With 8 examples per batch and 128 words each, all 1024 words compute inner products against the other words, so every word needs its own query; 12 heads with 64 features each give 12 x 64 = 768.

# `key_layer` = [B*T, N*H]
key_layer = tf.layers.dense(
    to_tensor_2d,
    num_attention_heads * size_per_head,
    activation=key_act,
    name="key",
    kernel_initializer=create_initializer(initializer_range))

The Key matrix obviously needs the same dimensions as the Query matrix; apart from being fed to_tensor, its parameters are the same as the query's.

# `value_layer` = [B*T, N*H]
value_layer = tf.layers.dense(
    to_tensor_2d,
    num_attention_heads * size_per_head,
    activation=value_act,
    name="value",
    kernel_initializer=create_initializer(initializer_range))

V carries the features that are actually aggregated (see the earlier Q/K/V figure), so V has the same dimensions as K.

Inner-product computation:

# `query_layer` = [B, N, F, H]  (transposed to speed up the inner products)
query_layer = transpose_for_scores(query_layer, batch_size,
                                   num_attention_heads, from_seq_length,
                                   size_per_head)

# `key_layer` = [B, N, T, H]  (transposed to speed up the inner products)
key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads,
                                 to_seq_length, size_per_head)

# Take the dot product between "query" and "key" to get the raw
# attention scores.
# `attention_scores` = [B, N, F, T]
attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True)  # result is (8, 12, 128, 128)
attention_scores = tf.multiply(attention_scores,
                               1.0 / math.sqrt(float(size_per_head)))   # remove the effect of the vector dimension

sqrt(d_k) = sqrt(64) = 8; dividing by it removes the effect of the dimensionality.
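Written out, this is the standard scaled dot-product attention (my notation, not from the repo):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \qquad d_k = \text{size\_per\_head} = 64,\ \sqrt{d_k} = 8$$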

if attention_mask is not None:
  # `attention_mask` = [B, 1, F, T]
  attention_mask = tf.expand_dims(attention_mask, axis=[1])

  # Since attention_mask is 1.0 for positions we want to attend and 0.0 for
  # masked positions, this operation will create a tensor which is 0.0 for
  # positions we want to attend and -10000.0 for masked positions.
  adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0  # 0 where the mask is 1, a very large negative number where the mask is 0

  # Since we are adding it to the raw scores before the softmax, this is
  # effectively the same as removing these entirely.
  attention_scores += adder  # adding this to the raw scores leaves mask=1 positions unchanged and pushes mask=0 positions far negative

When the attention mask is 1: adder = (1 - 1) × (-10000) = 0. When the mask is 0: adder = (1 - 0) × (-10000) = -10000. In the softmax, x = -10000 gives a probability of essentially 0, so practically no attention weight is assigned to those positions.

# Normalize the attention scores to probabilities.
# `attention_probs` = [B, N, F, T]
attention_probs = tf.nn.softmax(attention_scores)  # after softmax, the large negative scores become ~0 and are effectively ignored

# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs = dropout(attention_probs, attention_probs_dropout_prob)

The matrix computation with the values:

# `value_layer` = [B, T, N, H]
value_layer = tf.reshape(
    value_layer,
    [batch_size, to_seq_length, num_attention_heads, size_per_head])  # (8, 128, 12, 64)

# `value_layer` = [B, N, T, H]
value_layer = tf.transpose(value_layer, [0, 2, 1, 3])  # (8, 12, 128, 64)

# `context_layer` = [B, N, F, H]
context_layer = tf.matmul(attention_probs, value_layer)  # compute the final features, (8, 12, 128, 64)

# `context_layer` = [B, F, N, H]
context_layer = tf.transpose(context_layer, [0, 2, 1, 3])  # back to (8, 128, 12, 64)

Line 857: the residual connection:

# Run a linear projection of `hidden_size` then add a residual
# with `layer_input`.
with tf.variable_scope("output"):  # (1024, 768), the residual connection
  attention_output = tf.layers.dense(
      attention_output,
      hidden_size,
      kernel_initializer=create_initializer(initializer_range))
  attention_output = dropout(attention_output, hidden_dropout_prob)
  attention_output = layer_norm(attention_output + layer_input)

After the fully connected layers comes the return branch at line 884:

if do_return_all_layers:
  final_outputs = []
  for layer_output in all_layer_outputs:
    final_output = reshape_from_matrix(layer_output, input_shape)
    final_outputs.append(final_output)
  return final_outputs
else:
  final_output = reshape_from_matrix(prev_output, input_shape)
  return final_output

Return either every layer's output or only the last layer's.

Creating the model: run_classifier line 577:

"""Creates a classification model.""" model = modeling.BertModel( config=bert_config, is_training=is_training, input_ids=input_ids,#(8,128) input_mask=input_mask,#(8,128) token_type_ids=segment_ids,#(8,128) use_one_hot_embeddings=use_one_hot_embeddings)

Line 590 defines the output:

# If you want to use the token-level output, use model.get_sequence_output()
# instead.
output_layer = model.get_pooled_output()

hidden_size = output_layer.shape[-1].value  # 768

output_weights = tf.get_variable(  # the extra fully connected layer on top
    "output_weights", [num_labels, hidden_size],
    initializer=tf.truncated_normal_initializer(stddev=0.02))

output_bias = tf.get_variable(  # bias, fine-tuned for the 0/1 classification
    "output_bias", [num_labels], initializer=tf.zeros_initializer())

get_pooled_output: the first position is [CLS], which covers the whole sentence. hidden_size: 768. output_weights: (2, 768). num_labels = 2: binary classification.

The final result, modeling.py line 205:

# Run the stacked transformer.
# `sequence_output` shape = [batch_size, seq_length, hidden_size].
self.all_encoder_layers = transformer_model(
    input_tensor=self.embedding_output,
    attention_mask=attention_mask,
    hidden_size=config.hidden_size,
    num_hidden_layers=config.num_hidden_layers,      # number of Transformer encoder layers
    num_attention_heads=config.num_attention_heads,
    intermediate_size=config.intermediate_size,      # number of neurons in the feed-forward layer
    intermediate_act_fn=get_activation(config.hidden_act),
    hidden_dropout_prob=config.hidden_dropout_prob,
    attention_probs_dropout_prob=config.attention_probs_dropout_prob,
    initializer_range=config.initializer_range,
    do_return_all_layers=True)                        # whether to return every layer's output

self.sequence_output = self.all_encoder_layers[-1]

# The "pooler" converts the encoded sequence tensor of shape
# [batch_size, seq_length, hidden_size] to a tensor of shape
# [batch_size, hidden_size]. This is necessary for segment-level
# (or segment-pair-level) classification tasks where we need a fixed
# dimensional representation of the segment.
with tf.variable_scope("pooler"):
  # We "pool" the model by simply taking the hidden state corresponding
  # to the first token. We assume that this has been pre-trained
  first_token_tensor = tf.squeeze(self.sequence_output[:, 0:1, :], axis=1)
  self.pooled_output = tf.layers.dense(
      first_token_tensor,
      config.hidden_size,
      activation=tf.tanh,
      kernel_initializer=create_initializer(config.initializer_range))

first_token_tensor: the first token's tensor, i.e., [CLS].

BertModel gives us the encoder; whatever output is needed, attach the corresponding fully connected layer on top. run_classifier line 601:

with tf.variable_scope("loss"): if is_training: # I.e., 0.1 dropout output_layer = tf.nn.dropout(output_layer, keep_prob=0.9) logits = tf.matmul(output_layer, output_weights, transpose_b=True) logits = tf.nn.bias_add(logits, output_bias) probabilities = tf.nn.softmax(logits, axis=-1) log_probs = tf.nn.log_softmax(logits, axis=-1) one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32) per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1) loss = tf.reduce_mean(per_example_loss) return (loss, per_example_loss, logits, probabilities)

logits = output_layer × weights + bias, followed by a softmax layer; the loss is the cross-entropy.
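As a formula (standard softmax cross-entropy, matching the code above; B is the batch size, y the one-hot label, h_[CLS] the pooled output, W the (2, 768) output_weights):

$$\text{logits} = h_{[\mathrm{CLS}]} W^{\top} + b, \qquad \mathcal{L} = -\frac{1}{B}\sum_{i=1}^{B}\sum_{c} y_{i,c}\,\log \mathrm{softmax}(\text{logits}_i)_c$$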

Basically only the data reading and preprocessing need to change, around line 177 of run_classifier:

class DataProcessor(object):
  """Base class for data converters for sequence classification data sets."""

  def get_train_examples(self, data_dir):
    """Gets a collection of `InputExample`s for the train set."""
    raise NotImplementedError()

  def get_dev_examples(self, data_dir):
    """Gets a collection of `InputExample`s for the dev set."""
    raise NotImplementedError()

  def get_test_examples(self, data_dir):
    """Gets a collection of `InputExample`s for prediction."""
    raise NotImplementedError()

  def get_labels(self):
    """Gets the list of labels for this data set."""
    raise NotImplementedError()

Reading your own data set: run_classifier line 199:

To avoid changing the source code, text_b is simply left as a dummy (None). InputExample is just a plain container that does nothing itself; the data is passed into examples one item at a time.

class MyDataProcessor(DataProcessor):  # custom processor, inherits from DataProcessor
  """Data converter for my own sequence classification data set."""

  def get_train_examples(self, data_dir):
    """Gets a collection of `InputExample`s for the train set."""
    file_path = os.path.join(data_dir, 'GLUE\glue_data\mydata\\train_sentiment.txt')
    f = open(file_path, 'r', encoding='utf-8')
    train_data = []
    index = 0
    for line in f.readlines():
      guid = "train-%d" % (index)                 # assign an id
      line = line.replace('\n', '').split('\t')   # strip the newline and split on tabs
      text_a = tokenization.convert_to_unicode(str(line[1]))
      label = str(line[2])
      train_data.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
      index += 1
    return train_data

  def get_dev_examples(self, data_dir):
    """Gets a collection of `InputExample`s for the dev set."""
    file_path = os.path.join(data_dir, 'GLUE\glue_data\mydata\\test_sentiment.txt')
    f = open(file_path, 'r', encoding='utf-8')
    dev_data = []
    index = 0
    for line in f.readlines():
      guid = "dev-%d" % (index)                   # assign an id
      line = line.replace('\n', '').split('\t')   # strip the newline and split on tabs
      text_a = tokenization.convert_to_unicode(str(line[1]))
      label = str(line[2])
      dev_data.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
      index += 1
    return dev_data

  def get_test_examples(self, data_dir):
    """Gets a collection of `InputExample`s for prediction."""
    file_path = os.path.join(data_dir, 'GLUE\glue_data\mydata\\test.csv')
    # the original used open() but then iterated over .values, which only a
    # DataFrame has; reading the csv with pandas (import pandas as pd assumed)
    test_df = pd.read_csv(file_path, encoding='utf-8')
    test_data = []
    for index, test in enumerate(test_df.values):
      guid = "test-%d" % (index)
      text_a = tokenization.convert_to_unicode(str(test[0]))
      label = str(test[1])
      test_data.append(
          InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
    return test_data

  def get_labels(self):
    """Gets the list of labels for this data set."""
    return ['0', '1', '2']

Register your own preprocessor as a parameter: add the line "mydata": MyDataProcessor to the processors dict (see the sketch below). Run parameters:

--data_dir=data \
--task_name=mydata \
--vocab_file=…/GLUE/BERT_BASE_DIR/chinese_L-12_H-768_A-12/vocab.txt \
--bert_config_file=…/GLUE/BERT_BASE_DIR/chinese_L-12_H-768_A-12/bert_config.json \
--output_dir=…/mydata_model \
--do_train=true \
--do_eval=true \
--init_checkpoint=…/GLUE/BERT_BASE_DIR/chinese_L-12_H-768_A-12/bert_model.ckpt \
--max_seq_length=70 \
--train_batch_size=32 \
--learning_rate=5e-5 \
--num_train_epochs=3.0 \
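For reference, the registration is just one extra entry in the processors dict inside main() of run_classifier.py (the other entries already exist in the repo; only the "mydata" line is added):

    processors = {
        "cola": ColaProcessor,
        "mnli": MnliProcessor,
        "mrpc": MrpcProcessor,
        "xnli": XnliProcessor,
        "mydata": MyDataProcessor,   # the added line for the custom data set
    }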
