I will add as many comments to the code as I can.

First:
```python
import json
import logging

logger = logging.getLogger(__name__)

# whitespace_tokenize and CoQAExample are defined elsewhere in the repository.


def read_coqa_examples(input_file, is_training=True, use_history=False, n_history=-1):
    '''
    CoQA is a conversational reading-comprehension dataset, so later questions
    depend on earlier questions and answers. The paper is not specifically about
    conversational datasets, so it does not add the earlier questions and answers.
    To push the model further on CoQA, they would have to be added.
    '''
    total_cnt = 0
    with open(input_file) as reader:
        input_data = json.load(reader)["data"]  # a list; each item is a dict

    def is_whitespace(c):
        if c == " " or c == "\t" or c == "\r" or c == "\n" or ord(c) == 0x202F:
            return True
        return False

    examples = []
    for entry in input_data:
        # entry is a dict with keys
        # ['answers', 'filename', 'id', 'questions', 'source', 'story']
        paragraph_text = entry["story"]
        paragraph_id = entry["id"]
        doc_tokens = []
        char_to_word_offset = []
        prev_is_whitespace = True
        for c in paragraph_text:
            if is_whitespace(c):
                prev_is_whitespace = True
            else:
                if prev_is_whitespace:
                    doc_tokens.append(c)
                else:
                    doc_tokens[-1] += c
                prev_is_whitespace = False
            # char_to_word_offset records, for every character of paragraph_text,
            # the index of the word it belongs to. For "I want to." it is
            # [0,0,1,1,1,1,1,2,2,2]; a space gets the index of the word before it.
            char_to_word_offset.append(len(doc_tokens) - 1)
        # doc_tokens is the list of words obtained by splitting paragraph_text
        # on whitespace.

        # Next we process the question-answer pairs: every question, its answer
        # and paragraph_text together form one example.
        # Because the task is conversational, earlier questions and answers
        # should be added as well.
        # question_history_texts records all questions. Strictly speaking we
        # should also keep an answer_history_texts and concatenate the two.
        question_history_texts = []
        # entry["questions"] and entry["answers"] are two lists of equal length.
        for (question, ans) in zip(entry['questions'], entry['answers']):
            # question looks like:
            #   {'input_text': 'When was the Vat formally opened?', 'turn_id': 1}
            # ans looks like:
            #   {'input_text': 'It was formally established in 1475',
            #    'span_end': 192, 'span_start': 151,
            #    'span_text': 'Formally established in 1475',
            #    'text': 'Formally established in 1475, although it',
            #    'turn_id': 1, 'yes_no_ans': -1, 'yes_no_flag': 0}
            # text is the ground-truth label of the answer, which means:
            # 1. for a yes/no question, text is the sentence that supports the answer;
            # 2. if the answer is a span of the passage, text is that span;
            # 3. if the answer is not a span of the passage, the span with the
            #    highest word overlap with the answer is used as the label.
            # yes_no_flag=0: not a yes/no question, so yes_no_ans must be -1.
            # yes_no_flag=1: a yes/no question; yes_no_ans=1 means yes and
            # yes_no_ans=0 means no.
            total_cnt += 1
            cur_question_text = question["input_text"]
            question_history_texts.append(cur_question_text)  # add this turn's question
            question_id = question["turn_id"]
            ans_id = ans["turn_id"]
            start_position = None
            end_position = None
            yes_no_flag = None
            yes_no_ans = None
            orig_answer_text = None
            if question_id != ans_id:
                print("question turns are not ordered!")
                print("mismatched question {}".format(cur_question_text))
            if is_training:
                orig_answer_text = ans["text"]
                answer_offset = ans["span_start"]
                answer_length = len(orig_answer_text)
                start_position = char_to_word_offset[answer_offset]
                if answer_offset + answer_length >= len(char_to_word_offset):
                    end_position = char_to_word_offset[-1]
                else:
                    end_position = char_to_word_offset[answer_offset + answer_length]
                # The lines above locate the first and last word of the answer in
                # the document; they serve as the cross-entropy labels.
                actual_text = " ".join(doc_tokens[start_position:(end_position + 1)])
                cleaned_answer_text = " ".join(whitespace_tokenize(orig_answer_text))
                yes_no_flag = int(ans["yes_no_flag"])
                yes_no_ans = int(ans["yes_no_ans"])
                if actual_text.find(cleaned_answer_text) == -1:
                    logger.warning("Could not find answer: '%s' vs. '%s'",
                                   actual_text, cleaned_answer_text)
                    continue
            if use_history:
                if n_history == -1 or n_history > len(question_history_texts):
                    question_texts = question_history_texts[:]
                else:
                    question_texts = question_history_texts[-1 * n_history:]
            else:
                question_texts = question_history_texts[-1]
            # To include earlier questions, set n_history; they are then joined
            # with the current question and treated as one question.
            example = CoQAExample(
                paragraph_id=paragraph_id,
                turn_id=question_id,
                question_texts=question_texts,
                doc_tokens=doc_tokens,
                orig_answer_text=orig_answer_text,
                start_position=start_position,
                end_position=end_position,
                yes_no_flag=yes_no_flag,
                yes_no_ans=yes_no_ans)
            examples.append(example)
            # Meaning of the fields:
            # paragraph_id is the id of the current document (it comes with the
            # dataset); turn_id says which question of the document this is;
            # question_texts is the text of the current question, possibly with
            # earlier questions added; orig_answer_text is the answer text:
            # for a yes/no question it is the supporting passage, and if the
            # answer is not a span of the document it is the span with the
            # highest word overlap with the answer.
    logger.info("Total raw examples: {}".format(total_cnt))
    return examples
```
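The character-to-word bookkeeping is easier to follow on the toy sentence from the comment. The following is a minimal sketch, not the repository's code: `split_on_whitespace` and `char_span_to_word_span` are hypothetical helper names that isolate the two steps `read_coqa_examples` performs inline.

```python
def split_on_whitespace(text):
    """Return (doc_tokens, char_to_word_offset) for text."""
    doc_tokens = []
    char_to_word_offset = []
    prev_is_whitespace = True
    for c in text:
        if c in " \t\r\n" or ord(c) == 0x202F:
            prev_is_whitespace = True
        else:
            if prev_is_whitespace:
                doc_tokens.append(c)      # first character of a new word
            else:
                doc_tokens[-1] += c       # extend the current word
            prev_is_whitespace = False
        # every character, whitespace included, maps to the latest word index
        char_to_word_offset.append(len(doc_tokens) - 1)
    return doc_tokens, char_to_word_offset


def char_span_to_word_span(char_to_word_offset, span_start, answer_text):
    """Map a character span to (start_word, end_word), clamping at the end."""
    start = char_to_word_offset[span_start]
    end_char = span_start + len(answer_text)
    if end_char >= len(char_to_word_offset):
        end = char_to_word_offset[-1]
    else:
        end = char_to_word_offset[end_char]
    return start, end


tokens, offsets = split_on_whitespace("I want to.")
print(tokens)   # ['I', 'want', 'to.']
print(offsets)  # [0, 0, 1, 1, 1, 1, 1, 2, 2, 2]
print(char_span_to_word_span(offsets, 2, "want"))  # (1, 1)
```

The character span of "want" starts at offset 2, and both of its ends land on word 1, which is the label the cross-entropy loss is trained on.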
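As the output shows, the features and the examples are consistent so far, and the answer positions in the subword sequence map correctly onto those in the whole-word sequence.

The figure above shows the tensors we have constructed and are about to feed to the model. Among them:

- batch_query_tokens holds the batch_size question texts
- batch_doc_tokens holds the batch_size documents
- both have been tokenized with WordPiece
- the rest need no explanation

We define a variable cur_global_pointers that indicates how far the current segment moves relative to the previous one. For example, cur_global_pointers=-16 means the current segment starts 16 tokens to the left of where the previous segment started in the document.

For the first split, cur_global_pointers is obviously all zeros, because we can only start from the first token.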
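To make the pointer idea concrete, here is a small sketch of how one segment window could be derived from a pointer and a stride. It rests on assumptions: `next_window` is a hypothetical helper, the stride list is the action space {-16,16,32,64,128} quoted further down in this post, and the pointer is assumed to be updated by simply adding the chosen stride. The clamping mirrors what gen_model_features does with doc_start and doc_end.

```python
# Action space quoted later in this post; index 2 corresponds to stride 32.
STRIDE_ACTIONS = [-16, 16, 32, 64, 128]


def next_window(pointer, action_index, doc_len, max_doc_len):
    """Move the pointer by the chosen stride and clamp the window.

    Assumption: the new pointer is the old pointer plus the stride.
    Returns the half-open token range [doc_start, doc_end).
    """
    pointer = pointer + STRIDE_ACTIONS[action_index]
    doc_start = max(0, pointer)                      # cannot move left of token 0
    doc_end = min(doc_start + max_doc_len, doc_len)  # cannot run past the document
    if doc_start >= doc_len:
        # pointer jumped past the end: fall back to the last full window
        doc_end = doc_len
        doc_start = max(0, doc_end - max_doc_len)
    return doc_start, doc_end


print(next_window(0, 0, doc_len=300, max_doc_len=100))    # (0, 100)
print(next_window(0, 2, doc_len=300, max_doc_len=100))    # (32, 132)
print(next_window(200, 4, doc_len=300, max_doc_len=100))  # (200, 300)
```

A leftward move from the very first window is clamped back to token 0, which is why the first window always starts at the first token.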
```python
def gen_model_features(cur_global_pointers, batch_query_tokens, batch_doc_tokens,
                       batch_start_positions, batch_end_positions, batch_max_doc_length,
                       max_seq_length, tokenizer, is_train):
    '''
    cur_global_pointers indicates how the current segment should move relative
    to the previous one. The function re-splits the document according to
    cur_global_pointers; start_position and end_position change accordingly.
    '''
    chunk_doc_offsets = []
    chunk_doc_tokens = []
    chunk_start_positions = []
    chunk_end_positions = []
    chunk_stop_flags = []
    for index in range(len(cur_global_pointers)):
        # span: [doc_start, doc_end)
        # After the first prediction the model may want to move left, hence max(0, ...).
        doc_start = max(0, cur_global_pointers[index])
        doc_end = min(doc_start + batch_max_doc_length[index], len(batch_doc_tokens[index]))
        if doc_start >= len(batch_doc_tokens[index]):
            doc_end = len(batch_doc_tokens[index])
            doc_start = max(0, doc_end - batch_max_doc_length[index])
        chunk_doc_tokens.append(batch_doc_tokens[index][doc_start:doc_end])
        chunk_doc_offsets.append(doc_start)
        if is_train:
            one_doc_len = doc_end - doc_start
            # shift the answer positions into the segment
            one_start_position = batch_start_positions[index] - doc_start
            one_end_position = batch_end_positions[index] - doc_start
            # The lines above follow the baseline.
            # Note the code below: in BERT, a segment that does not contain the
            # answer is not used as a training sample. In this paper the whole
            # document is treated as one unit, so every segment is kept.
            if (one_start_position < 0 or one_start_position >= one_doc_len or
                    one_end_position < 0 or one_end_position >= one_doc_len):
                # segments without the answer are marked as such
                chunk_stop_flags.append(0)
                chunk_start_positions.append(max_seq_length)
                chunk_end_positions.append(max_seq_length)
            else:
                chunk_stop_flags.append(1)
                chunk_start_positions.append(one_start_position)
                chunk_end_positions.append(one_end_position)

    # We now have one segment per sample. Next we join the question and the
    # segment into [CLS]question[SEP]segment[SEP].
    chunk_input_ids = []
    chunk_segment_ids = []
    chunk_input_mask = []
    # position in input_ids to position in batch_doc_tokens
    id_to_tok_maps = []
    for index in range(len(cur_global_pointers)):
        one_id_to_tok_map = {}
        one_query_tokens = batch_query_tokens[index]
        one_doc_tokens = chunk_doc_tokens[index]
        one_doc_offset = chunk_doc_offsets[index]
        one_tokens = []
        one_segment_ids = []
        one_tokens.append("[CLS]")
        one_segment_ids.append(0)
        # add query tokens
        for token in one_query_tokens:
            one_tokens.append(token)
            one_segment_ids.append(0)
        one_tokens.append("[SEP]")
        one_segment_ids.append(0)
        # add doc tokens
        for (i, token) in enumerate(one_doc_tokens):
            one_id_to_tok_map[len(one_tokens)] = one_doc_offset + i
            one_tokens.append(token)
            one_segment_ids.append(1)
        one_tokens.append("[SEP]")
        one_segment_ids.append(1)
        id_to_tok_maps.append(one_id_to_tok_map)
        # This part is identical to BERT.
        # gen features
        one_input_ids = tokenizer.convert_tokens_to_ids(one_tokens)
        one_input_mask = [1] * len(one_input_ids)
        while len(one_input_ids) < max_seq_length:
            one_input_ids.append(0)
            one_input_mask.append(0)
            one_segment_ids.append(0)
        assert len(one_input_ids) == max_seq_length
        assert len(one_input_mask) == max_seq_length
        assert len(one_segment_ids) == max_seq_length
        chunk_input_ids.append(one_input_ids[:])
        chunk_input_mask.append(one_input_mask[:])
        chunk_segment_ids.append(one_segment_ids[:])
        if is_train:
            # adjust start_positions and end_positions due to doc offsets caused
            # by query and CLS/SEP tokens in the input feature
            chunk_start_positions[index] += len(one_query_tokens) + 2
            chunk_end_positions[index] += len(one_query_tokens) + 2
    return (chunk_input_ids, chunk_input_mask, chunk_segment_ids, id_to_tok_maps,
            chunk_start_positions, chunk_end_positions, chunk_stop_flags)
```

Let's see what gen_model_features produces.
Let's rewrite the code:
```python
import torch
from torch.utils.data import DataLoader, SequentialSampler
from tqdm import tqdm

# tensor of indices from 0 to the number of features
train_indices = torch.arange(len(features), dtype=torch.long)
# note: I switched to sequential sampling here
train_sampler = SequentialSampler(train_indices)
train_dataloader = DataLoader(train_indices, sampler=train_sampler, batch_size=6, drop_last=True)
for step, batch_indices in enumerate(tqdm(train_dataloader, desc="Iteration")):
    batch_features = [features[ind] for ind in batch_indices]
    batch_query_tokens = [f.query_tokens for f in batch_features]
    batch_doc_tokens = [f.doc_tokens for f in batch_features]
    batch_start_positions = [f.start_position for f in batch_features]
    batch_end_positions = [f.end_position for f in batch_features]
    batch_yes_no_flags = [f.yes_no_flag for f in batch_features]
    batch_yes_no_answers = [f.yes_no_ans for f in batch_features]
    # Assumed here: only the first batch is inspected, because examples[i]
    # lines up with batch position i only in that batch.
    break

# Data is read sequentially, so examples[i] and batch_doc_tokens[i] correspond.
i = 5
print(examples[i].orig_answer_text)
print(examples[i].start_position, examples[i].end_position)
print(examples[i].doc_tokens[examples[i].start_position:examples[i].end_position + 1])
print(features[i].start_position, features[i].end_position)
print(features[i].tok_to_orig_map[features[i].start_position],
      features[i].tok_to_orig_map[features[i].end_position])
print(features[i].doc_tokens[features[i].start_position:features[i].end_position + 1])
# The lines above check that examples and features correspond.

max_seq_length = 256  # maximum sequence length
cur_global_pointers = [0] * 6
batch_max_doc_length = [max_seq_length - 3 - len(query_tokens) for query_tokens in batch_query_tokens]
(chunk_input_ids, chunk_input_mask, chunk_segment_ids, id_to_tok_maps,
 chunk_start_positions, chunk_end_positions, chunk_stop_flags) = gen_model_features(
    cur_global_pointers, batch_query_tokens, batch_doc_tokens,
    batch_start_positions, batch_end_positions, batch_max_doc_length,
    max_seq_length, tokenizer, is_train=True)
print(chunk_start_positions, chunk_end_positions)

i = 3
print(chunk_start_positions[i], chunk_end_positions[i])
print(id_to_tok_maps[i][chunk_start_positions[i]], id_to_tok_maps[i][chunk_end_positions[i]])
orig_answer_start_position = id_to_tok_maps[i][chunk_start_positions[i]]
orig_answer_end_position = id_to_tok_maps[i][chunk_end_positions[i]]
print(batch_doc_tokens[i][orig_answer_start_position:orig_answer_end_position + 1])
print(examples[i].orig_answer_text)
```

Output:

So chunk_start_positions and chunk_end_positions are correct after the split. chunk_input_ids, chunk_input_mask and chunk_segment_ids, together with chunk_start_positions and chunk_end_positions, are the inputs and labels the model needs. The one extra item is chunk_stop_flags, which marks whether a segment contains the answer.
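The relabeling rule that produces chunk_start_positions, chunk_end_positions and chunk_stop_flags can be isolated into a few lines. This is a sketch under assumptions: `relabel_answer` is a hypothetical helper that mirrors the is_train branch of gen_model_features for a single sample, it is not a function from the repository.

```python
def relabel_answer(start, end, doc_start, doc_end, query_len, max_seq_length):
    """Re-express a document-level answer span inside one segment.

    Returns (stop_flag, start_in_input, end_in_input), where the positions
    index into [CLS] query [SEP] segment [SEP].
    """
    seg_len = doc_end - doc_start
    rel_start = start - doc_start
    rel_end = end - doc_start
    inside = 0 <= rel_start < seg_len and 0 <= rel_end < seg_len
    if not inside:
        # segment does not contain the answer: use max_seq_length as a sentinel
        rel_start = rel_end = max_seq_length
    shift = query_len + 2  # [CLS] + query tokens + [SEP]
    return int(inside), rel_start + shift, rel_end + shift


# answer at document tokens 40..42, query of 10 tokens
print(relabel_answer(40, 42, 0, 100, 10, 256))    # (1, 52, 54)
print(relabel_answer(40, 42, 32, 132, 10, 256))   # (1, 20, 22)
print(relabel_answer(150, 152, 0, 100, 10, 256))  # (0, 268, 268)
```

The same answer gets different input positions as the window slides, and a window that misses it is kept with stop flag 0 rather than being discarded.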
```python
device = torch.device("cpu")
chunk_input_ids = torch.tensor(chunk_input_ids, dtype=torch.long, device=device)
chunk_input_mask = torch.tensor(chunk_input_mask, dtype=torch.long, device=device)
chunk_segment_ids = torch.tensor(chunk_segment_ids, dtype=torch.long, device=device)
chunk_start_positions = torch.tensor(chunk_start_positions, dtype=torch.long, device=device)
chunk_end_positions = torch.tensor(chunk_end_positions, dtype=torch.long, device=device)
chunk_yes_no_flags = torch.tensor(batch_yes_no_flags, dtype=torch.long, device=device)
chunk_yes_no_answers = torch.tensor(batch_yes_no_answers, dtype=torch.long, device=device)
chunk_stop_flags = torch.tensor(chunk_stop_flags, dtype=torch.long, device=device)
```

The code above converts the lists into torch tensors. Now they can finally be fed to the model.
########################## Model ##########################

Here $\tilde{v}_c$ is the output of the LSTM at the current step.
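```python
import torch.nn as nn


class stopNetwork(nn.Module):
    """
    chunk_states is the LSTM output at the current time step; it is used to
    decide whether the current segment contains the answer.
    """
    def __init__(self, input_size):
        super(stopNetwork, self).__init__()
        self.fc = nn.Linear(input_size, 2)

    def forward(self, chunk_states):
        stop_logits = self.fc(chunk_states)
        return stop_logits
```

For tasks without yes/no questions, answer_loss is simply the cross-entropy loss on the start and end positions.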
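As a reminder of what that cross-entropy looks like for a single sample, here is a dependency-free sketch. `cross_entropy` and `answer_loss` are illustrative names, and averaging the start and end terms is an assumption borrowed from the usual BERT question-answering setup; the repository may weight them differently.

```python
import math


def cross_entropy(logits, target):
    """Negative log-softmax of `logits` at index `target` (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def answer_loss(start_logits, end_logits, start_position, end_position):
    """Average of the start and end cross-entropy terms (assumed weighting)."""
    return 0.5 * (cross_entropy(start_logits, start_position)
                  + cross_entropy(end_logits, end_position))


# With uniform logits over 4 positions every position has probability 1/4,
# so the loss equals log(4).
uniform = [0.0, 0.0, 0.0, 0.0]
print(abs(answer_loss(uniform, uniform, 1, 2) - math.log(4)) < 1e-12)  # True
```

The stop loss has the same shape, only with the two-way stop_logits as input and chunk_stop_flags as the target.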
During training, the model returns the following tensors:

```python
stop_logits, sampled_stride_inds, sampled_stride_log_probs, start_logits, end_logits, yes_no_flag_logits, yes_no_ans_logits, cur_hidden_states, stop_loss, answer_loss
```

Here:

- stop_logits is the LSTM output passed through a fully connected layer with output dimension 2. Note that it has not been normalized into probabilities yet.
- sampled_stride_inds is the action the model takes after the policy network, i.e. the result of random sampling from $p^{act}(a|s)$. If it is 2, for example, the corresponding value in the action space {-16,16,32,64,128} is 32, meaning that the next split moves 32 tokens to the right of the current one. (This value is also the value of cur_global_pointers; the relevant code in gen_model_features is doc_start = max(0, cur_global_pointers[index]).)
- sampled_stride_log_probs is the log probability of the model taking this action, i.e. $\log p^{act}(a|s)$.
- start_logits and end_logits need no further explanation.
- yes_no_flag_logits are the scores for whether the question is a yes/no question.
- yes_no_ans_logits are the scores for whether the answer is yes or no.
- cur_hidden_states is the LSTM output at the current step. It is the LSTM input for the next segment, which is what makes the mechanism recurrent.

The model's loss consists of three parts; stop_loss is $L_{cs}$ and answer_loss is $L_{ans}$. Here:

- sampled_stride_log_probs corresponds to $\log p^{act}(a|s)$
- stop_logits after softmax corresponds to $q_c$
- start_logits after softmax corresponds to $p_{c,i}^{start}$
- end_logits after softmax corresponds to $p_{c,j}^{end}$

For the first sample, stop_probs=[0.15 0.2 0.25 0.4 0.15]

For the first sample, stop_rewards=[0.08 0. 0.5 0.15 0. ]
Next we follow the formula $R(s,a)=q_c r_c+(1-q_c)R(s',a')$ and compute by hand the reward the first sample should receive.

We have to compute from the last step backwards, with max_read_times=5:

- Step 5 has no next step, so its reward is set to its own $r_c$, which is 0.
- Step 4: 0.4*0.15+(1-0.4)*0.=0.06
- Step 3: 0.25*0.5+(1-0.25)*0.06=0.17
- Step 2: 0.2*0. +(1-0.2)*0.17=0.136
- Step 1: 0.15*0.08+(1-0.15)*0.136=0.1276

So for the first sample, the rewards from the first step through the second-to-last step should be [0.1276,0.136,0.17,0.06].
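The hand calculation can be checked with a few lines of plain Python. This is a sketch for a single sample, with `backward_rewards` as a hypothetical helper name; it applies the recursion $R(s,a)=q_c r_c+(1-q_c)R(s',a')$ from the last step backwards and seeds it with the last step's own $r_c$, as done above.

```python
def backward_rewards(stop_probs, stop_rewards):
    """Apply R_t = q_t * r_t + (1 - q_t) * R_{t+1}, seeded with R_T = r_T.

    Returns the rewards for steps 1 .. T-1 in chronological order.
    """
    next_r = stop_rewards[-1]  # the last step has no successor
    rewards = []
    for q, r in zip(reversed(stop_probs[:-1]), reversed(stop_rewards[:-1])):
        next_r = q * r + (1 - q) * next_r
        rewards.append(next_r)
    rewards.reverse()
    return rewards


stop_probs = [0.15, 0.2, 0.25, 0.4, 0.15]
stop_rewards = [0.08, 0.0, 0.5, 0.15, 0.0]
print([round(v, 4) for v in backward_rewards(stop_probs, stop_rewards)])
# [0.1276, 0.136, 0.17, 0.06]
```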
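However, running the code from the paper's GitHub repository gives:

I believe my understanding is correct, so it looks as if the paper's code is wrong.

So I think the code in rl_reward.py should look like this: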
```python
import numpy as np

# stop_rewards and stop_probs are arrays of shape (bsz, max_read_times)
q_vals = []
# calc from the end to the beginning time
next_q_vals = stop_rewards[:, -1]  # instead of np.zeros(len(stop_rewards))
for t in reversed(range(0, stop_rewards.shape[1] - 1)):
    t_rewards = stop_rewards[:, t]
    t_probs = stop_probs[:, t]
    cur_q_vals = np.multiply(t_rewards, t_probs) + np.multiply(next_q_vals, 1 - t_probs)
    q_vals.append(list(cur_q_vals)[:])
    next_q_vals = cur_q_vals
# q_vals: (bsz, max_read_times-1)
q_vals = list(reversed(q_vals))
q_vals = np.transpose(q_vals)
```

Output:
Now we have both $\log p^{act}(a|s)$ and $R(s,a)$.

The next step naturally is:

```python
reinforce_loss = torch.mean(torch.sum(-q_vals * stride_log_probs, dim=1))
```

The final loss has the form:

```python
loss = (stop_loss + answer_loss) / args.max_read_times + reinforce_loss
```

Let's look at a piece of code in run_RCM_coqa.py.

However, if we compare it with the reward_estimation_for_stop function, you will find that the arguments are passed in the wrong order. The correct order is as follows: