"Towards Accurate Scene Text Recognition with Semantic Reasoning Networks" (PaddlePaddle OCR)
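Feature extraction: ResNet50 + FPN. The output feature has shape [b,t,c], where t = imgH/8 * imgW/8. Each feature value is multiplied by sqrt(512), and the positional feature is added. The positions encoder_word_pos, [[[0],[1],[2],…[t-1]]], are converted into positional features by an embedding layer. The result is fed into a stack of two transformer layers with 8-head attention and d_key = 512/8; the output dimension is still 512.

This gives word_features of shape [b,t,c]. A fully connected layer (c to c) is applied on dim 2, and the result is expanded on dim 1 to max_length, giving [b,max_length,t,c] (equivalent to making max_length copies). gsrm_word_pos, [[[0],[1], …, [max_length-1]]], goes through an embedding layer to produce gsrm_pos_embedding of shape [b,max_length,c], which is expanded on dim 2 to t, giving [b,max_length,t,c].

gsrm_pos_embedding is added to word_features and passed through the tanh activation function to get temp. A fully connected layer reduces dim 3 of temp from c to 1; dim 3 is then squeezed out and a softmax yields the attention values. The attention [b,max_length,t] is multiplied (dot product) with word_features [b,t,c] to give pvam_features [b,max_length,c]. One fully connected layer maps this feature to the number of characters; softmax followed by argmax then gives the predicted character.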
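The PVAM attention step described above can be sketched in plain Python for a single sample (no batch dimension). This is a minimal illustration, not the PaddleOCR code: the function name `pvam_attention`, the toy sizes, and the random values are all hypothetical. It also assumes that the softmax runs over the t spatial positions and that the final weighted sum uses the unprojected word_features, neither of which the text states explicitly.

```python
import math
import random

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def pvam_attention(word_features, projected_features, pos_embedding, score_weights):
    """Single-sample PVAM attention (hypothetical helper).

    word_features:      t x c, encoder output
    projected_features: t x c, word_features after the c-to-c FC layer
    pos_embedding:      max_length x c, embedded reading-order positions
    score_weights:      length c, weights of the c-to-1 FC layer (bias omitted)
    """
    t, c = len(word_features), len(word_features[0])
    attention, pvam_features = [], []
    for pos in pos_embedding:                  # one row per output character
        scores = []
        for feat in projected_features:        # one score per spatial position
            temp = [math.tanh(p + f) for p, f in zip(pos, feat)]
            scores.append(sum(w * v for w, v in zip(score_weights, temp)))
        weights = softmax(scores)              # assumed: softmax over the t positions
        attention.append(weights)
        # Assumed: the weighted sum is taken over the unprojected features.
        pvam_features.append(
            [sum(weights[i] * word_features[i][k] for i in range(t)) for k in range(c)]
        )
    return attention, pvam_features

def random_matrix(rows, cols):
    """Toy data generator, purely illustrative."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
t, c, max_length = 6, 4, 3
word_features = random_matrix(t, c)
projected_features = random_matrix(t, c)
pos_embedding = random_matrix(max_length, c)
score_weights = [random.uniform(-1, 1) for _ in range(c)]

attention, pvam_features = pvam_attention(
    word_features, projected_features, pos_embedding, score_weights
)
print(len(attention), len(attention[0]))          # max_length, t
print(len(pvam_features), len(pvam_features[0]))  # max_length, c
```

Each row of `attention` sums to 1, so every output character position gets its own weighted average of the t visual features.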
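word_ids has shape [b,max_length,1]. An idx is padded on dim 1 (this idx acts as a start symbol), so word_ids becomes [b,max_length+1,1]. Suppose word_ids is [[[s], [1], [2], …,[n-1],[n]]]; then word_1 is [[[s], [1], [2], …,[n-1]]] and word_2 is [[[1], [2], …,[n-1],[n]]]. Each of the two goes through an embedding layer and is then fed into a four-layer transformer. Note that each has its own mask. The mask for word_1 ensures that, when self-attention is computed from front to back, a character's attention weights cover only itself and the characters before it; in other words, each character can see only itself and what precedes it. The mask for word_2 is the reverse: each character sees itself and the characters after it, so the first character can see all characters.

For a length of 4, the word_1 mask is [[0,-10^9, -10^9, -10^9], [0,0,-10^9, -10^9], [0,0,0,-10^9], [0,0,0,0]] and the word_2 mask is [[0,0,0,0], [-10^9,0,0,0], [-10^9,-10^9,0,0], [-10^9,-10^9,-10^9,0]]. After the q·k weights are computed, -10^9 is added to the masked positions, so the subsequent softmax effectively ignores them.

After both pass through the four transformer layers, gsrm_feature_2 is zero-padded at the last position and sliced with [1:], so the resulting features correspond to character 2 through the last character plus an end symbol.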
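As a rough illustration of the two masks and the word_1 / word_2 split, here is a small stdlib-only sketch. The helper names (`forward_mask`, `backward_mask`, `masked_softmax`, `split_word_ids`) and the start index 99 are hypothetical. The backward mask assumes that "reversed" means each position sees itself and the positions after it.

```python
import math

NEG_INF = -1e9  # additive bias that effectively removes a position from softmax

def forward_mask(n):
    """Row i may attend to columns 0..i (itself and earlier characters)."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

def backward_mask(n):
    """Row i may attend to columns i..n-1 (itself and later characters)."""
    return [[0.0 if j >= i else NEG_INF for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Add the mask to the q.k scores, then apply a stable softmax."""
    biased = [s + m for s, m in zip(scores, mask_row)]
    top = max(biased)
    exps = [math.exp(x - top) for x in biased]
    total = sum(exps)
    return [e / total for e in exps]

def split_word_ids(word_ids, start_idx):
    """Pad a start symbol in front, then split into word_1 and word_2."""
    padded = [start_idx] + list(word_ids)
    return padded[:-1], padded[1:]

# With equal scores, row 1 of the forward mask spreads its weight over columns 0 and 1.
weights = masked_softmax([1.0, 1.0, 1.0, 1.0], forward_mask(4)[1])
print(weights)

word_1, word_2 = split_word_ids([5, 6, 7], start_idx=99)
print(word_1, word_2)
```

Adding a large negative bias instead of multiplying by a 0/1 mask keeps the softmax well defined: the masked entries underflow to zero after exponentiation.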
Diagram: <SOS>_embedding + char_2_embedding → predict char_1; char_1_embedding + char_3_embedding → predict char_2; …; char_n-1_embedding + <EOS>_embedding → predict char_n.

gsrm_feature_1 [b,max_length,c] + gsrm_feature_2 [b,max_length,c] gives gsrm_feature, and characters are predicted from this feature. In effect, the embedding of the start symbol and the embedding of the second character are added to predict the first character, and the features of the end symbol and the second-to-last character are added to predict the last character.
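The shift-and-add that combines the two directions can be sketched as follows for a single sample. The function name `combine_gsrm_features` and the toy values are hypothetical; the sketch only mirrors the "pad a zero row at the end, take [1:], then add" description.

```python
def combine_gsrm_features(feature_1, feature_2):
    """Add forward features to backward features shifted left by one step.

    feature_1, feature_2: max_length x c lists of floats.
    feature_2 is padded with one zero row at the end and its first row is dropped,
    so row i of the result is feature_1[i] + feature_2[i + 1].
    """
    c = len(feature_1[0])
    shifted = feature_2[1:] + [[0.0] * c]
    return [[a + b for a, b in zip(row_1, row_2)]
            for row_1, row_2 in zip(feature_1, shifted)]

# Toy values chosen so the shift is easy to see.
feature_1 = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
feature_2 = [[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]]
gsrm_feature = combine_gsrm_features(feature_1, feature_2)
print(gsrm_feature)
```

The last row receives only the forward feature, because the padded backward row is all zeros.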
gsrm_feature and pvam_feature are concatenated on dim 2, and a fully connected layer on dim 2 restores the size to c. A sigmoid then produces attention_map, and vsfd_out = attention_map * pvam_features + (1 - attention_map) * gsrm_features.
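The gated fusion can be sketched for one character position as below. The function `vsfd_fuse`, the concatenation order, and the zero-valued weights are assumptions made for illustration only; a trained model would learn the weights of the fully connected layer.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def vsfd_fuse(pvam_feature, gsrm_feature, weight, bias):
    """Gated fusion for one position (hypothetical helper).

    pvam_feature, gsrm_feature: length-c lists
    weight: c x 2c matrix of the FC layer that maps the concatenation back to c
    bias:   length-c list
    """
    combined = pvam_feature + gsrm_feature          # concat: length 2c (order assumed)
    attention_map = [
        sigmoid(sum(w * v for w, v in zip(row, combined)) + b)
        for row, b in zip(weight, bias)
    ]
    return [
        g * p + (1.0 - g) * s
        for g, p, s in zip(attention_map, pvam_feature, gsrm_feature)
    ]

# With zero weights and bias the gate is 0.5 everywhere, so the output is the average.
pvam_feature = [1.0, 3.0]
gsrm_feature = [3.0, 5.0]
zero_weight = [[0.0] * 4 for _ in range(2)]
vsfd_out = vsfd_fuse(pvam_feature, gsrm_feature, zero_weight, [0.0, 0.0])
print(vsfd_out)

# A large positive bias pushes the gate towards 1, so the visual feature dominates.
visual_dominant = vsfd_fuse(pvam_feature, gsrm_feature, zero_weight, [50.0, 50.0])
print(visual_dominant)
```

Because the gate is computed per channel, the model can rely on visual evidence for some channels and on semantic context for others.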
The total loss is pvam_loss + vsfd_loss*2 + gsrm_loss*0.15.
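As a small worked example of the weighting, the sketch below combines three scalar losses with the coefficients given above. The function name `srn_total_loss` and the sample loss values are hypothetical.

```python
def srn_total_loss(pvam_loss, vsfd_loss, gsrm_loss,
                   w_pvam=1.0, w_vsfd=2.0, w_gsrm=0.15):
    """Weighted sum of the three branch losses, using the weights from the notes."""
    return w_pvam * pvam_loss + w_vsfd * vsfd_loss + w_gsrm * gsrm_loss

# Illustrative values only: with all three losses equal to 1.0 the total is about 3.15.
total = srn_total_loss(1.0, 1.0, 1.0)
print(total)
```

The fused (VSFD) prediction carries the largest weight, while the semantic (GSRM) branch contributes only a small share of the total.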
