1. Model Construction

1.1 Key Configuration Parameters
```python
aspect_ratios_per_layer = [[1.0, 2.0, 0.5],
                           [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                           [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                           [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                           [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                           [1.0, 2.0, 0.5],
                           [1.0, 2.0, 0.5]]
aspect_ratios = aspect_ratios_per_layer

n_boxes = []
for ar in aspect_ratios_per_layer:
    if (1 in ar) & two_boxes_for_ar1:
        # +1 for the second box for aspect ratio 1
        # -> [3+1, 5+1, 5+1, 5+1, 5+1, 3+1, 3+1]
        n_boxes.append(len(ar) + 1)
    else:
        n_boxes.append(len(ar))

# The number of predictor conv layers in the network is 7 for the original SSD512.
n_predictor_layers = 7

# Account for the background class.
n_classes += 1
```
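As a self-contained sanity check, the following sketch (assuming `two_boxes_for_ar1 = True`, as in the original SSD) reproduces the per-layer box counts:

```python
# Reproduce the n_boxes computation above with the SSD512 values.
aspect_ratios_per_layer = ([[1.0, 2.0, 0.5]]
                           + [[1.0, 2.0, 0.5, 3.0, 1.0/3.0]] * 4
                           + [[1.0, 2.0, 0.5]] * 2)
two_boxes_for_ar1 = True

n_boxes = [len(ar) + 1 if (1.0 in ar and two_boxes_for_ar1) else len(ar)
           for ar in aspect_ratios_per_layer]
print(n_boxes)  # [4, 6, 6, 6, 6, 4, 4]
```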
1.2 VGG Base Network

```python
conv1_1 = Conv2D(64, (3, 3), activation='relu', padding='same',
                 kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg),
                 name='conv1_1')(x1)
conv1_2 = Conv2D(64, (3, 3), activation='relu', padding='same',
                 kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg),
                 name='conv1_2')(conv1_1)
pool1 = MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding='same',
                     name='pool1')(conv1_2)
...
conv10_1 = Conv2D(128, (1, 1), activation='relu', padding='same',
                  kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg),
                  name='conv10_1')(conv9_2)
conv10_1 = ZeroPadding2D(padding=((1, 1), (1, 1)), name='conv10_padding')(conv10_1)
conv10_2 = Conv2D(256, (4, 4), strides=(1, 1), activation='relu', padding='valid',
                  kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg),
                  name='conv10_2')(conv10_1)
...
```
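For orientation, here are the spatial resolutions the seven predictor layers of SSD512 operate on for a $512 \times 512$ input: conv4_3 sits behind three stride-2 poolings ($512 / 2^3 = 64$), and each later predictor layer halves the resolution again. A short reference sketch:

```python
# Predictor layers of SSD512 and their feature map resolutions (512x512 input).
predictor_layers = ['conv4_3_norm', 'fc7', 'conv6_2', 'conv7_2',
                    'conv8_2', 'conv9_2', 'conv10_2']
feature_map_sizes = [64, 32, 16, 8, 4, 2, 1]
for name, size in zip(predictor_layers, feature_map_sizes):
    print(f'{name}: {size}x{size}')
```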
1.3 Extra Layers for Object Detection

1.3.1 Confidence Layers

If there are $c$ categories of detection targets, SSD actually needs to predict $c+1$ confidence values, the first of which is the score for containing no object, i.e. the background. From here on, when we speak of the $c$ class confidences, remember that they include this special background class, so there are really only $c-1$ actual object classes. At prediction time, the class with the highest confidence is the one assigned to the bounding box; in particular, when the first confidence value is the highest, the bounding box contains no object.
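A toy illustration of this decision rule, with hypothetical confidences for a single box and $c = 4$ (background plus three real classes):

```python
import numpy as np

# Hypothetical softmax confidences for one box; index 0 is the background class.
confidences = np.array([0.10, 0.05, 0.70, 0.15])
class_id = int(np.argmax(confidences))
if class_id == 0:
    print('Box contains no object (background).')
else:
    print(f'Box belongs to class {class_id} with confidence {confidences[class_id]:.2f}')
# -> Box belongs to class 2 with confidence 0.70
```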
```python
conv4_3_norm_mbox_conf = Conv2D(n_boxes[0] * n_classes, (3, 3), padding='same',
                                kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg),
                                name='conv4_3_norm_mbox_conf')(conv4_3_norm)
...
conv10_2_mbox_conf = Conv2D(n_boxes[6] * n_classes, (3, 3), padding='same',
                            kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg),
                            name='conv10_2_mbox_conf')(conv10_2)
```
1.3.2 Localization Layers

For a feature map of size $m \times n$, there are $m \times n$ cells. Let $k$ denote the number of prior boxes per cell (this is n_boxes[...]). Each cell then requires $(c+4) \times k$ predicted values, so the whole feature map requires $(c+4) \times kmn$ predictions. Since SSD performs detection with convolutions, this means $(c+4) \times k$ convolution kernels are needed to cover the feature map. For example, on conv4_3 of SSD512 with $m = n = 64$, $k = 4$ and (say) $c = 21$, that is $(21+4) \times 4 = 100$ kernels.
```python
n_boxes = [4, 6, 6, 6, 6, 4, 4]
```

Given the feature maps, the detection results are obtained by convolving them. Take a $5 \times 5$ feature map as an example: a Priorbox layer generates the prior boxes, while the detection values consist of two parts, class confidences and box locations, each produced by its own $3 \times 3$ convolution. Every prior box predicts one bounding box, so SSD512 predicts a total of $64 \times 64 \times 4 + 32 \times 32 \times 6 + 16 \times 16 \times 6 + 8 \times 8 \times 6 + 4 \times 4 \times 6 + 2 \times 2 \times 4 + 1 \times 1 \times 4 = 24564$ bounding boxes; in this sense SSD is essentially dense sampling.
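A one-liner verifies that sum:

```python
# Total number of prior boxes for SSD512: feature map area times boxes per cell.
feature_map_sizes = [64, 32, 16, 8, 4, 2, 1]
n_boxes = [4, 6, 6, 6, 6, 4, 4]
print(sum(s * s * k for s, k in zip(feature_map_sizes, n_boxes)))  # 24564
```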
```python
conv4_3_norm_mbox_loc = Conv2D(n_boxes[0] * 4, (3, 3), padding='same',
                               kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg),
                               name='conv4_3_norm_mbox_loc')(conv4_3_norm)
...
conv10_2_mbox_loc = Conv2D(n_boxes[6] * 4, (3, 3), padding='same',
                           kernel_initializer='he_normal', kernel_regularizer=l2(l2_reg),
                           name='conv10_2_mbox_loc')(conv10_2)
```
1.3.3 Prior Boxes

```python
# Output shape of anchors: `(batch, height, width, n_boxes, 8)`.
# The last axis contains the four anchor box coordinates
# and the four variance values for each box.
# The prior boxes are fixed values; they are computed in the call() method
# of the AnchorBoxes class:
'''
Note that this tensor does not participate in any graph computations at runtime.
It is being created as a constant once during graph creation and is just being
output along with the rest of the model output during runtime. Because of this,
all logic is implemented as Numpy array operations and it is sufficient to convert
the resulting Numpy array into a Keras tensor at the very end before outputting it.
'''
# scales = [0.07, 0.15, 0.33, 0.51, 0.69, 0.87, 1.05]
# Otherwise, the scales increase linearly from min_scale to max_scale:
# scales = np.linspace(min_scale, max_scale, n_predictor_layers+1)
# two_boxes_for_ar1 = True
conv4_3_norm_mbox_priorbox = AnchorBoxes(img_height, img_width,
                                         this_scale=scales[0], next_scale=scales[1],
                                         aspect_ratios=aspect_ratios[0],
                                         two_boxes_for_ar1=two_boxes_for_ar1,
                                         this_steps=steps[0], this_offsets=offsets[0],
                                         clip_boxes=clip_boxes, variances=variances,
                                         coords=coords, normalize_coords=normalize_coords,
                                         name='conv4_3_norm_mbox_priorbox')(conv4_3_norm_mbox_loc)
...
```
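The geometry behind those fixed values follows the paper's formulas: for aspect ratio $ar$ at scale $s$, the box has $w = s\sqrt{ar}$ and $h = s/\sqrt{ar}$ (relative to the shorter image side), and for $ar = 1$ a second box with scale $\sqrt{s_k \, s_{k+1}}$ is added. A minimal sketch of that logic (not the actual AnchorBoxes code, just the formulas):

```python
import numpy as np

def anchor_sizes(this_scale, next_scale, aspect_ratios, img_size=512,
                 two_boxes_for_ar1=True):
    """Sketch of the (width, height) pairs generated for each cell."""
    wh = []
    for ar in aspect_ratios:
        if ar == 1.0:
            wh.append((this_scale * img_size, this_scale * img_size))
            if two_boxes_for_ar1:  # extra box with the geometric-mean scale
                s = np.sqrt(this_scale * next_scale) * img_size
                wh.append((s, s))
        else:
            wh.append((this_scale * img_size * np.sqrt(ar),
                       this_scale * img_size / np.sqrt(ar)))
    return wh

# conv4_3: this_scale=0.07, next_scale=0.15, aspect_ratios=[1.0, 2.0, 0.5]
for w, h in anchor_sizes(0.07, 0.15, [1.0, 2.0, 0.5]):
    print(f'w={w:.1f}px, h={h:.1f}px')
```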
1.3.4 Reshape Layers

```python
# Reshape the class predictions, yielding 3D tensors of shape
# `(batch, height * width * n_boxes, n_classes)`.
# We want the classes isolated in the last axis to perform softmax on them.
conv4_3_norm_mbox_conf_reshape = Reshape((-1, n_classes), name='conv4_3_norm_mbox_conf_reshape')(conv4_3_norm_mbox_conf)

# Reshape the box predictions, yielding 3D tensors of shape
# `(batch, height * width * n_boxes, 4)`.
# We want the four box coordinates isolated in the last axis to compute the smooth L1 loss.
conv4_3_norm_mbox_loc_reshape = Reshape((-1, 4), name='conv4_3_norm_mbox_loc_reshape')(conv4_3_norm_mbox_loc)

# Reshape the anchor box tensors, yielding 3D tensors of shape
# `(batch, height * width * n_boxes, 8)`.
conv4_3_norm_mbox_priorbox_reshape = Reshape((-1, 8), name='conv4_3_norm_mbox_priorbox_reshape')(conv4_3_norm_mbox_priorbox)
```
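To make the shapes concrete, here is a small numpy trace of the conv4_3 branch, assuming `n_classes = 21` and `n_boxes[0] = 4`:

```python
import numpy as np

# conv4_3 predicts on a 64x64 map with 4 boxes per cell (assumed n_classes = 21).
conf = np.zeros((1, 64, 64, 4 * 21))    # like conv4_3_norm_mbox_conf
loc = np.zeros((1, 64, 64, 4 * 4))      # like conv4_3_norm_mbox_loc
print(conf.reshape(1, -1, 21).shape)    # (1, 16384, 21)
print(loc.reshape(1, -1, 4).shape)      # (1, 16384, 4)
```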
1.3.5 Output Concatenation Layers

```python
# Concatenate the 7 conf layers.
# Output shape of `mbox_conf`: (batch, n_boxes_total, n_classes)
mbox_conf = Concatenate(axis=1, name='mbox_conf')([conv4_3_norm_mbox_conf_reshape,
                                                   fc7_mbox_conf_reshape,
                                                   conv6_2_mbox_conf_reshape,
                                                   conv7_2_mbox_conf_reshape,
                                                   conv8_2_mbox_conf_reshape,
                                                   conv9_2_mbox_conf_reshape,
                                                   conv10_2_mbox_conf_reshape])

# Concatenate the 7 loc layers.
# Output shape of `mbox_loc`: (batch, n_boxes_total, 4)
mbox_loc = Concatenate(axis=1, name='mbox_loc')([conv4_3_norm_mbox_loc_reshape,
                                                 fc7_mbox_loc_reshape,
                                                 conv6_2_mbox_loc_reshape,
                                                 conv7_2_mbox_loc_reshape,
                                                 conv8_2_mbox_loc_reshape,
                                                 conv9_2_mbox_loc_reshape,
                                                 conv10_2_mbox_loc_reshape])

# Concatenate the 7 priorbox layers.
# Output shape of `mbox_priorbox`: (batch, n_boxes_total, 8)
mbox_priorbox = Concatenate(axis=1, name='mbox_priorbox')([conv4_3_norm_mbox_priorbox_reshape,
                                                           fc7_mbox_priorbox_reshape,
                                                           conv6_2_mbox_priorbox_reshape,
                                                           conv7_2_mbox_priorbox_reshape,
                                                           conv8_2_mbox_priorbox_reshape,
                                                           conv9_2_mbox_priorbox_reshape,
                                                           conv10_2_mbox_priorbox_reshape])

# Apply softmax to the class confidences.
mbox_conf_softmax = Activation('softmax', name='mbox_conf_softmax')(mbox_conf)

# Concatenate all output layers.
# Output shape of `predictions`: (batch, n_boxes_total, n_classes + 4 + 8)
predictions = Concatenate(axis=2, name='predictions')([mbox_conf_softmax, mbox_loc, mbox_priorbox])
```
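With the counts from above, `n_boxes_total` is $16384 + 6144 + 1536 + 384 + 96 + 16 + 4 = 24564$ for SSD512, so assuming `n_classes = 21` the final `predictions` tensor has shape `(batch, 24564, 21 + 4 + 8)` = `(batch, 24564, 33)`.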
1.4 Building the Model

```python
# In 'training' mode, the training labels have the same structure as `predictions`,
# with the loc coordinates already encoded.
if mode == 'training':
    model = Model(inputs=x, outputs=predictions)
# In 'inference' mode, the encoded loc coordinates are decoded back into absolute boxes.
elif mode == 'inference':
    # 3D tensor of shape `(batch_size, top_k, 6)`.
    # The last axis contains the coordinates for each predicted box in the format
    # [class_id, confidence, xmin, ymin, xmax, ymax].
    decoded_predictions = DecodeDetections(confidence_thresh=confidence_thresh,
                                           iou_threshold=iou_threshold,
                                           top_k=top_k,
                                           nms_max_output_size=nms_max_output_size,
                                           coords=coords,
                                           normalize_coords=normalize_coords,
                                           img_height=img_height,
                                           img_width=img_width,
                                           name='decoded_predictions')(predictions)
    model = Model(inputs=x, outputs=decoded_predictions)
```

1.5 Encoding and Decoding

A bounding box location consists of four values $(cx, cy, w, h)$: the center coordinates plus the width and height. However, what the network actually regresses is only the transformation of the bounding box relative to its prior box (the paper calls this an offset; R-CNN calls it a transformation). With the prior box denoted $d = (d^{cx}, d^{cy}, d^{w}, d^{h})$ and the corresponding bounding box denoted $b = (b^{cx}, b^{cy}, b^{w}, b^{h})$, the prediction target $l$ is the transformation of $b$ relative to $d$:

$$l^{cx} = (b^{cx} - d^{cx}) / d^{w}, \quad l^{cy} = (b^{cy} - d^{cy}) / d^{h}$$
$$l^{w} = \log(b^{w} / d^{w}), \quad l^{h} = \log(b^{h} / d^{h})$$

This process is conventionally called encoding the bounding box. At prediction time, you need to invert the process, i.e. decode, recovering the true box position $b$ from the prediction $l$:
$$b^{cx} = d^{w} l^{cx} + d^{cx}, \quad b^{cy} = d^{h} l^{cy} + d^{cy}$$
$$b^{w} = d^{w} \exp(l^{w}), \quad b^{h} = d^{h} \exp(l^{h})$$
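A minimal numpy sketch of both directions (boxes and anchors in $(cx, cy, w, h)$ format; the variance trick from the next paragraph is left out here):

```python
import numpy as np

def encode(box, anchor):
    """Ground-truth box -> regression target l (both in (cx, cy, w, h))."""
    bcx, bcy, bw, bh = box
    dcx, dcy, dw, dh = anchor
    return np.array([(bcx - dcx) / dw, (bcy - dcy) / dh,
                     np.log(bw / dw), np.log(bh / dh)])

def decode(l, anchor):
    """Predicted offsets l -> absolute box; the exact inverse of encode()."""
    lcx, lcy, lw, lh = l
    dcx, dcy, dw, dh = anchor
    return np.array([dw * lcx + dcx, dh * lcy + dcy,
                     dw * np.exp(lw), dh * np.exp(lh)])

anchor = np.array([0.50, 0.50, 0.20, 0.20])
box = np.array([0.52, 0.48, 0.25, 0.18])
assert np.allclose(decode(encode(box, anchor), anchor), box)
```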
The Caffe implementation of SSD adds one more trick: variance hyperparameters that rescale the regression values:
$$b^{cx} = d^{w} (\text{variance}[0] \cdot l^{cx}) + d^{cx}, \quad b^{cy} = d^{h} (\text{variance}[1] \cdot l^{cy}) + d^{cy}$$
$$b^{w} = d^{w} \exp(\text{variance}[2] \cdot l^{w}), \quad b^{h} = d^{h} \exp(\text{variance}[3] \cdot l^{h})$$
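Extending the decode() sketch above accordingly (the corresponding encoder divides by the variances, so decoding multiplies by them; [0.1, 0.1, 0.2, 0.2] are SSD's default values):

```python
import numpy as np

def decode_with_variances(l, anchor, variances=(0.1, 0.1, 0.2, 0.2)):
    """Decode with the variance trick: each offset is rescaled by its variance."""
    lcx, lcy, lw, lh = l
    dcx, dcy, dw, dh = anchor
    return np.array([dw * (variances[0] * lcx) + dcx,
                     dh * (variances[1] * lcy) + dcy,
                     dw * np.exp(variances[2] * lw),
                     dh * np.exp(variances[3] * lh)])
```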