Pixel Aggregation Network (PAN): Principles and Code Walkthrough

2024-06-25

Pipeline

Loss

Pixel Aggregation algorithm

Paper: https://arxiv.org/abs/1908.05900

Official code: https://github.com/whai362/pan_pp.pytorch

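Pixel Aggregation Network (PAN) is an improved version of PSENet. It is still a segmentation-based text detection method and can detect text of arbitrary shape. Its main contribution is fixing PSENet's slow inference: on the CTW1500 dataset, PAN-320 reaches 84.2 FPS while keeping an F-measure of 79.9%, whereas PSENet-1s reaches only 3.9 FPS with an F-measure of 78.0%.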

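PAN makes two main changes to speed up detection:

1. It uses ResNet-18 as the backbone and adds a low-cost head to compensate for ResNet-18's weaker feature extraction, which would otherwise give small receptive fields and limited representation power.
2. It introduces a learnable post-processing method, Pixel Aggregation (PA), which uses the predicted similarity vectors to guide text pixels to the correct kernels.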

Pipeline

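1. Backbone: ResNet-18. Assuming the input has shape (16,3,736,736), where 16 is the batch size, the backbone outputs, from left to right (shallow to deep), have shapes (16,64,184,184), (16,128,92,92), (16,256,46,46), (16,512,23,23).

2. Reducing Channel: each backbone output passes through a 1×1 convolution with 128 output channels, followed by BN and ReLU. From left to right the resulting shapes are (16,128,184,184), (16,128,92,92), (16,128,46,46), (16,128,23,23).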

3. FPEM

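As shown in the figure above, FPEM is a U-shaped module with two phases: up-scale enhancement and down-scale enhancement. Up-scale enhancement acts on the input feature pyramid and iteratively enhances the feature maps at strides of 32, 16, 8, and 4 pixels. The down-scale phase takes the feature pyramid produced by up-scale enhancement as input and enhances it from stride 4 to stride 32. The feature pyramid output by down-scale enhancement is the final output of FPEM.

The code is shown below; the inputs f1 to f4 are the outputs of the previous step.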

import torch.nn as nn
import torch.nn.functional as F

# Conv_BN_ReLU is the conv + BN + ReLU block defined elsewhere in the official repository


class FPEM_v1(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(FPEM_v1, self).__init__()
        planes = out_channels  # 128

        # up-scale enhancement: 3x3 depthwise conv (stride 1) + smoothing layer
        self.dwconv3_1 = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, groups=planes, bias=False)
        self.smooth_layer3_1 = Conv_BN_ReLU(planes, planes)

        self.dwconv2_1 = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, groups=planes, bias=False)
        self.smooth_layer2_1 = Conv_BN_ReLU(planes, planes)

        self.dwconv1_1 = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, groups=planes, bias=False)
        self.smooth_layer1_1 = Conv_BN_ReLU(planes, planes)

        # down-scale enhancement: 3x3 depthwise conv (stride 2) + smoothing layer
        self.dwconv2_2 = nn.Conv2d(planes, planes, kernel_size=3, stride=2, padding=1, groups=planes, bias=False)
        self.smooth_layer2_2 = Conv_BN_ReLU(planes, planes)

        self.dwconv3_2 = nn.Conv2d(planes, planes, kernel_size=3, stride=2, padding=1, groups=planes, bias=False)
        self.smooth_layer3_2 = Conv_BN_ReLU(planes, planes)

        self.dwconv4_2 = nn.Conv2d(planes, planes, kernel_size=3, stride=2, padding=1, groups=planes, bias=False)
        self.smooth_layer4_2 = Conv_BN_ReLU(planes, planes)

    @staticmethod
    def _upsample_add(x, y):
        # resize x to the spatial size of y, then add element-wise
        _, _, H, W = y.size()
        # F.upsample is deprecated; F.interpolate with align_corners=False behaves the same
        return F.interpolate(x, size=(H, W), mode='bilinear', align_corners=False) + y

    def forward(self, f1, f2, f3, f4):
        # up-scale enhancement (stride 32 -> 4)
        f3 = self.smooth_layer3_1(self.dwconv3_1(self._upsample_add(f4, f3)))
        f2 = self.smooth_layer2_1(self.dwconv2_1(self._upsample_add(f3, f2)))
        f1 = self.smooth_layer1_1(self.dwconv1_1(self._upsample_add(f2, f1)))

        # down-scale enhancement (stride 4 -> 32)
        f2 = self.smooth_layer2_2(self.dwconv2_2(self._upsample_add(f2, f1)))
        f3 = self.smooth_layer3_2(self.dwconv3_2(self._upsample_add(f3, f2)))
        f4 = self.smooth_layer4_2(self.dwconv4_2(self._upsample_add(f4, f3)))

        return f1, f2, f3, f4
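A rough parameter count shows why this head is cheap. The sketch below assumes each enhancement step is a 3×3 depthwise convolution followed by a 1×1 convolution over 128 channels (that is, it assumes `Conv_BN_ReLU` is used with a 1×1 kernel), and compares it with a dense 3×3 convolution. Biases and BN parameters are ignored, and the two helper functions are hypothetical names introduced only for this illustration.

```python
def standard_conv_params(c_in, c_out, k):
    """Number of weights in a dense k x k convolution (no bias)."""
    return c_in * c_out * k * k


def separable_conv_params(c_in, c_out, k):
    """Weights of a k x k depthwise conv followed by a 1 x 1 pointwise conv (no bias)."""
    depthwise = c_in * k * k   # one k x k filter per input channel (groups=c_in)
    pointwise = c_in * c_out   # 1 x 1 conv that mixes the channels
    return depthwise + pointwise


planes = 128
dense = standard_conv_params(planes, planes, 3)
separable = separable_conv_params(planes, planes, 3)
print(dense, separable, round(dense / separable, 1))  # 147456 17536 8.4
```

Under these assumptions, one enhancement step has roughly 8 times fewer weights than a dense 3×3 convolution of the same width.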

4. FFM

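FPEM is a cascadable module, and the official code stacks two FPEMs. FFM fuses the feature pyramids of different depths: it first combines the feature maps of the same scale from each FPEM output by element-wise addition, then upsamples the summed maps to the size of the largest one and concatenates them into a final feature map with 4×128 channels.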

# FPEM
f1_1, f2_1, f3_1, f4_1 = self.fpem1(f1, f2, f3, f4)
f1_2, f2_2, f3_2, f4_2 = self.fpem2(f1_1, f2_1, f3_1, f4_1)

# FFM
f1 = f1_1 + f1_2
f2 = f2_1 + f2_2
f3 = f3_1 + f3_2
f4 = f4_1 + f4_2
f2 = self._upsample(f2, f1.size())
f3 = self._upsample(f3, f1.size())
f4 = self._upsample(f4, f1.size())
f = torch.cat((f1, f2, f3, f4), 1)  # torch.Size([16, 512, 184, 184])
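The fusion step can be reproduced without the network. The following is a minimal NumPy sketch, not the official implementation: `ffm` and `upsample_nearest` are hypothetical helpers, nearest-neighbour upsampling stands in for the upsampling used in the official code, and small spatial sizes replace 184, 92, 46, 23.

```python
import numpy as np


def upsample_nearest(x, size):
    """Nearest-neighbour upsampling of a (C, H, W) array to spatial `size`.

    Assumes the target size is an integer multiple of the input size.
    """
    _, h, w = x.shape
    H, W = size
    return x.repeat(H // h, axis=1).repeat(W // w, axis=2)


def ffm(pyramids):
    """Fuse feature pyramids; each pyramid is a list of (C, H, W) arrays, finest level first."""
    # 1) element-wise sum of the maps at the same scale across pyramids
    fused = [np.sum(levels, axis=0) for levels in zip(*pyramids)]
    # 2) bring every level to the resolution of the finest one
    target = fused[0].shape[1:]
    fused = [fused[0]] + [upsample_nearest(f, target) for f in fused[1:]]
    # 3) concatenate along the channel axis
    return np.concatenate(fused, axis=0)


rng = np.random.default_rng(0)
sizes = [16, 8, 4, 2]  # toy spatial sizes standing in for 184, 92, 46, 23
pyr1 = [rng.standard_normal((128, s, s)).astype(np.float32) for s in sizes]
pyr2 = [rng.standard_normal((128, s, s)).astype(np.float32) for s in sizes]
out = ffm([pyr1, pyr2])
print(out.shape)  # (512, 16, 16)
```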

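5. Finally, the fused feature map passes through a 3×3 convolution with 128 output channels, BN, ReLU, and a 1×1 convolution with num_class output channels, and is then upsampled by a factor of 4 to the original input size to give the final output. Here num_class=6: channel 0 is the full text region, channel 1 is the kernel, and channels 2 to 5 are the similarity vector.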
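The five steps can be summarized as a shape trace. This is a sketch in plain Python that only does the arithmetic and runs no network; `pan_shapes` is a hypothetical helper, and it assumes the input height and width are divisible by 32.

```python
def pan_shapes(batch, height, width, num_class=6, reduced=128):
    """Return the tensor shape after each stage of the pipeline described above."""
    strides = [4, 8, 16, 32]
    backbone_channels = [64, 128, 256, 512]  # ResNet-18 stage widths

    shapes = {}
    shapes["backbone"] = [(batch, c, height // s, width // s)
                          for c, s in zip(backbone_channels, strides)]
    # the 1x1 conv reduces every level to `reduced` channels
    shapes["reduced"] = [(batch, reduced, h, w) for _, _, h, w in shapes["backbone"]]
    # FPEM keeps the shape of every level
    shapes["fpem"] = list(shapes["reduced"])
    # FFM upsamples all levels to the finest one and concatenates the channels
    _, _, h, w = shapes["fpem"][0]
    shapes["ffm"] = (batch, reduced * len(strides), h, w)
    # head: 3x3 conv to `reduced` channels, then 1x1 conv to num_class channels
    shapes["head"] = (batch, num_class, h, w)
    # final 4x upsampling back to the input resolution
    shapes["output"] = (batch, num_class, h * strides[0], w * strides[0])
    return shapes


for stage, shape in pan_shapes(16, 736, 736).items():
    print(stage, shape)
```

For the (16,3,736,736) input used in this post, the trace ends with an FFM output of (16, 512, 184, 184) and a final output of (16, 6, 736, 736).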

Loss

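The full loss function is:

$$L = L_{tex} + \alpha L_{ker} + \beta (L_{agg} + L_{dis})$$

where $L_{tex}$ and $L_{ker}$ are the losses for the full text region and the kernel, respectively, computed the same way as in PSENet.

There is only one difference: the mask used when computing $L_{ker}$.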

# PSENet: take the region where the predicted full-text map is greater than 0.5, excluding regions labeled as ignored
mask0 = torch.sigmoid(texts).data.cpu().numpy()
mask1 = training_masks.data.cpu().numpy()
selected_masks = ((mask0 > 0.5) & (mask1 > 0.5)).astype('float32')

# PAN: take the labeled full-text region, excluding regions labeled as ignored
selected_masks = gt_texts * training_masks
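To make the difference concrete, here is a small NumPy sketch on a 1×4 toy image. It assumes the kernel loss is a dice loss on the masked pixels, as in PSENet. The `dice_loss` helper, its `eps` smoothing term, and all the toy values are illustrative assumptions, not code from either repository.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def dice_loss(logits, target, mask, eps=1e-3):
    """Dice loss restricted to the pixels selected by `mask` (arrays of shape (H, W))."""
    pred = sigmoid(logits) * mask
    target = target * mask
    inter = np.sum(pred * target)
    union = np.sum(pred * pred) + np.sum(target * target) + 2 * eps
    return 1.0 - 2.0 * inter / union


# texts/kernels are predicted logits, gt_* are ground truth maps,
# training_masks is 0 on regions labeled as ignored
texts = np.array([[4.0, -4.0, 4.0, 4.0]])
kernels = np.array([[3.0, 3.0, -3.0, -3.0]])
gt_texts = np.array([[1.0, 1.0, 0.0, 1.0]])
gt_kernels = np.array([[1.0, 0.0, 0.0, 0.0]])
training_masks = np.array([[1.0, 1.0, 1.0, 0.0]])

# PSENet-style: pixels predicted as text and not ignored
mask_psenet = ((sigmoid(texts) > 0.5) & (training_masks > 0.5)).astype(np.float32)
# PAN-style: pixels labeled as text and not ignored
mask_pan = gt_texts * training_masks

print(mask_psenet)  # [[1. 0. 1. 0.]]
print(mask_pan)     # [[1. 1. 0. 0.]]
print(dice_loss(kernels, gt_kernels, mask_psenet), dice_loss(kernels, gt_kernels, mask_pan))
```

The second pixel is labeled text but predicted as background, so only the PAN-style mask lets the kernel loss see it.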

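Here we focus on $L_{agg}$ and $L_{dis}$, with reference to the code. $L_{agg}$ is defined as

$$L_{agg} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|T_i|}\sum_{p \in T_i}\ln\left(\mathcal{D}(p, K_i) + 1\right)$$

where $N$ is the number of text instances, $T_i$ is the i-th text instance, $K_i$ is the kernel of the i-th text instance, and $\mathcal{D}(p, K_i)$ defines the distance between a pixel $p$ inside the text instance and $K_i$:

$$\mathcal{D}(p, K_i) = \max\left(\lVert \mathcal{F}(p) - \mathcal{G}(K_i) \rVert - \delta_{agg},\ 0\right)^2$$

Here $\mathcal{F}(p)$ is the similarity vector of pixel $p$, $\mathcal{G}(K_i)$ is the mean similarity vector of the pixels in $K_i$, and $\delta_{agg}$ corresponds to `self.delta_v` in the code.

Walkthrough of the official implementation: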

def forward_single(self, emb, instance, kernel, training_mask):
    # emb: the similarity vectors, shape (4, w, h), where w and h are the width and height of the network input
    # instance: ground truth of the text instances; the region of the i-th instance has value i, background is 0, shape (w, h)
    # kernel: the shrunk version of every text instance in `instance`; every kernel region is 1, background is 0, shape (w, h)
    # training_mask: 0 in text regions labeled DO NOT CARE, 1 elsewhere, shape (w, h)
    training_mask = (training_mask > 0.5).long()
    kernel = (kernel > 0.5).long()
    instance = instance * training_mask  # drop the text instances labeled as ignored
    instance_kernel = (instance * kernel).view(-1)  # the i-th kernel has value i
    instance = instance.view(-1)
    emb = emb.view(self.feature_dim, -1)  # (4, 541696)

    unique_labels, unique_ids = torch.unique(instance_kernel, sorted=True, return_inverse=True)
    # e.g. with 5 text instances, unique_labels == tensor([0, 1, 2, 3, 4, 5], device='cuda:0'); 0 is the background
    num_instance = unique_labels.size(0)
    if num_instance <= 1:
        return 0

    emb_mean = emb.new_zeros((self.feature_dim, num_instance), dtype=torch.float32)  # shape (4, 6)
    for i, lb in enumerate(unique_labels):
        if lb == 0:  # background
            continue
        ind_k = instance_kernel == lb  # indices of all pixels of the i-th kernel
        emb_mean[:, i] = torch.mean(emb[:, ind_k], dim=1)  # G(K_i) in the formula

    l_agg = emb.new_zeros(num_instance, dtype=torch.float32)  # bug (this comment comes from the official code, I did not add it)
    for i, lb in enumerate(unique_labels):  # iterate over every text instance
        if lb == 0:  # 0 is the background
            continue
        ind = instance == lb
        emb_ = emb[:, ind]  # F(p) in the formula
        # similarity vectors of all pixels of one text instance; e.g. the first instance in this image
        # has 1012 pixels, each with a similarity vector of shape (4,)
        dist = (emb_ - emb_mean[:, i:i + 1]).norm(p=2, dim=0)
        # (torch.Size([4, 1012]) - torch.Size([4, 1])) -> torch.Size([1012]): the distance of every pixel of one text instance
        # note emb_mean[:, i].shape == torch.Size([4]) while emb_mean[:, i:i+1].shape == torch.Size([4, 1]);
        # the latter must be used here, the former raises an error
        dist = F.relu(dist - self.delta_v) ** 2
        l_agg[i] = torch.mean(torch.log(dist + 1.0))
    l_agg = torch.mean(l_agg[1:])  # the average over the N text instances in the formula
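The same computation can be checked by hand on a tiny input. Below is a minimal NumPy sketch of the aggregation loss for a single image, written independently of the official code. `aggregation_loss` is a hypothetical helper, and the default margin of 0.5 is an assumption taken from the paper's setting for the aggregation margin.

```python
import numpy as np


def aggregation_loss(emb, instance, kernel, delta_agg=0.5):
    """Aggregation loss for one image.

    emb:      (C, H, W) similarity vectors
    instance: (H, W) int labels, 0 = background, i = i-th text instance
    kernel:   (H, W) binary map of the shrunk text regions
    """
    C = emb.shape[0]
    emb = emb.reshape(C, -1)
    instance = instance.reshape(-1)
    instance_kernel = instance * kernel.reshape(-1)

    losses = []
    for lb in np.unique(instance_kernel):
        if lb == 0:  # background
            continue
        g = emb[:, instance_kernel == lb].mean(axis=1, keepdims=True)  # G(K_i), shape (C, 1)
        f = emb[:, instance == lb]                                     # F(p) for every p in T_i
        dist = np.linalg.norm(f - g, axis=0)                           # ||F(p) - G(K_i)||
        d = np.maximum(dist - delta_agg, 0.0) ** 2                     # D(p, K_i)
        losses.append(np.mean(np.log(d + 1.0)))
    return float(np.mean(losses)) if losses else 0.0


# One text instance of 3 pixels whose kernel is the middle pixel.
instance = np.array([[1, 1, 1, 0]])
kernel = np.array([[0, 1, 0, 0]])
emb = np.array([[[0.0, 0.0, 2.0, 9.0]],
                [[0.0, 0.0, 0.0, 9.0]]])
loss = aggregation_loss(emb, instance, kernel)
print(loss)
```

Only the third pixel lies farther than the margin from the kernel mean (distance 2), so the loss equals ln(1 + 1.5²) / 3.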

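The formula for $L_{dis}$ is:

$$L_{dis} = \frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{\substack{j=1 \\ j \neq i}}^{N}\ln\left(\mathcal{D}(K_i, K_j) + 1\right)$$

where

$$\mathcal{D}(K_i, K_j) = \max\left(\delta_{dis} - \lVert \mathcal{G}(K_i) - \mathcal{G}(K_j) \rVert,\ 0\right)^2$$

and $\mathcal{G}(K_i)$ is the mean similarity vector of kernel $K_i$.

The official implementation is as follows ($\delta_{dis}$ corresponds to `2 * self.delta_d`):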

    if num_instance > 2:
        # rows cycle through the kernel means: 0, 1, ..., n-1, 0, 1, ...
        emb_interleave = emb_mean.permute(1, 0).repeat(num_instance, 1)
        # every kernel mean repeated n times in a row: 0, 0, ..., 1, 1, ...
        emb_band = emb_mean.permute(1, 0).repeat(1, num_instance).view(-1, self.feature_dim)

        # keep only pairs with i != j that do not involve the background (index 0)
        mask = (1 - torch.eye(num_instance, dtype=torch.int8)).view(-1, 1).repeat(1, self.feature_dim)
        mask = mask.view(num_instance, num_instance, -1)
        mask[0, :, :] = 0
        mask[:, 0, :] = 0
        mask = mask.view(num_instance * num_instance, -1)

        dist = emb_interleave - emb_band
        dist = dist[mask > 0].view(-1, self.feature_dim).norm(p=2, dim=1)
        dist = F.relu(2 * self.delta_d - dist) ** 2
        l_dis = torch.mean(torch.log(dist + 1.0))

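Note that the code computes the distance between every pair of kernels twice, once as (i, j) and once as (j, i). Because the result is averaged, this does not change the value of l_dis.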
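A short NumPy sketch can confirm that counting each pair twice leaves the mean unchanged. It is written independently of the official code: `discrimination_loss` is a hypothetical helper that takes the kernel means directly (background excluded), and the default margin of 3 is an assumption based on the paper's setting.

```python
import numpy as np


def discrimination_loss(kernel_means, delta_dis=3.0):
    """Discrimination loss from the mean similarity vectors of N kernels, shape (N, C)."""
    n = kernel_means.shape[0]
    if n < 2:
        return 0.0
    # pairwise Euclidean distances between kernel means, shape (N, N)
    diff = kernel_means[:, None, :] - kernel_means[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    d = np.maximum(delta_dis - dist, 0.0) ** 2          # D(K_i, K_j)
    off_diag = ~np.eye(n, dtype=bool)                   # ordered pairs with i != j
    upper = np.triu(np.ones((n, n), dtype=bool), k=1)   # every unordered pair once
    loss_ordered = np.mean(np.log(d[off_diag] + 1.0))
    loss_unordered = np.mean(np.log(d[upper] + 1.0))
    assert np.isclose(loss_ordered, loss_unordered)     # double counting keeps the mean
    return float(loss_ordered)


means = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
l_dis = discrimination_loss(means)
print(l_dis)
```

In this toy case only the first two kernels are closer than the margin, so the loss is ln(1 + 2²) / 3.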

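The official code also computes an additional loss that does not appear in the paper:

l_reg = torch.mean(torch.log(torch.norm(emb_mean, 2, 0) + 1.0)) * 0.001

The author's reply was: "l_reg is used to keep the norm of emb from getting too large; whether or not this term is included probably makes little difference."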

Pixel Aggregation algorithm

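PA is quite similar to the PSE algorithm: both expand outward from the smallest kernel to recover the full text region. The difference is that PSE outputs several kernels and expands them from the smallest to the largest, and each round of expansion ends when all pixels of the current kernel have been expanded. PA has only one kernel map and one full-text prediction, and a pixel is merged only if it lies inside the predicted full-text region and the Euclidean distance between its similarity vector and that of its kernel is less than 6 (3 in the code). The official PA is implemented in pyx (Cython); I ported it to Python and added comments, as shown below.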

import queue

import cv2
import numpy as np


def _pa(kernel, emb, label, cc, label_num, min_area=0):
    pred = np.zeros((label.shape[0], label.shape[1]), dtype=np.int32)
    mean_emb = np.zeros((label_num, 4), dtype=np.float32)
    area = np.full((label_num,), -1, dtype=np.float32)
    flag = np.zeros((label_num,), dtype=np.int32)
    inds = np.zeros((label_num, label.shape[0], label.shape[1]), dtype=np.uint8)
    p = np.zeros((label_num, 2), dtype=np.int32)

    max_rate = 1024
    for i in range(1, label_num):
        ind = label == i
        inds[i] = ind

        area[i] = np.sum(ind)  # e.g. 614.0
        if area[i] < min_area:  # min_area is 0 here
            label[ind] = 0
            continue

        px, py = np.where(ind)  # e.g. shapes (614,), (614,)
        # first pixel in row-major order: px[0] == min(px), and py[0] is its column
        # (not necessarily min(py))
        p[i] = (px[0], py[0])

        for j in range(1, i):
            if area[j] < min_area:
                continue
            if cc[p[i, 0], p[i, 1]] != cc[p[j, 0], p[j, 1]]:
                # the full-text prediction does not merge these two kernels into one region
                continue
            rate = area[i] / area[j]
            if rate < 1 / max_rate or rate > max_rate:
                flag[i] = 1
                mean_emb[i] = np.mean(emb[:, ind], axis=1)

                if flag[j] == 0:
                    flag[j] = 1
                    # np.bool was removed from recent NumPy versions; use the builtin bool
                    mean_emb[j] = np.mean(emb[:, inds[j].astype(bool)], axis=1)

    que = queue.Queue(maxsize=0)
    dx = [-1, 1, 0, 0]
    dy = [0, 0, -1, 1]

    points = np.array(np.where(label > 0)).transpose((1, 0))
    for point_idx in range(points.shape[0]):
        x, y = points[point_idx, 0], points[point_idx, 1]
        l = label[x, y]
        que.put((x, y, l))
        pred[x, y] = l

    while not que.empty():
        (x, y, l) = que.get()
        for j in range(4):
            tmpx = x + dx[j]
            tmpy = y + dy[j]
            if tmpx < 0 or tmpx >= label.shape[0] or tmpy < 0 or tmpy >= label.shape[1]:
                continue
            if kernel[0, tmpx, tmpy] == 0 or pred[tmpx, tmpy] > 0:
                # this point is 0 in the full-text prediction, or it has already been expanded
                continue
            if flag[l] == 1 and np.linalg.norm(emb[:, tmpx, tmpy] - mean_emb[l]) > 3:  # 6 in the paper
                continue

            # the label must be enqueued too, because que.get() unpacks three values
            que.put((tmpx, tmpy, l))
            pred[tmpx, tmpy] = l

    return pred


def pa(kernels, emb, min_area=0):
    # shapes: kernels (2, 184, 328), emb (4, 184, 328); min_area is 0
    # kernels[0] is the predicted full-text map, kernels[1] is the predicted kernel map shrunk with ratio 0.5
    _, cc = cv2.connectedComponents(kernels[0], connectivity=4)
    label_num, label = cv2.connectedComponents(kernels[1], connectivity=4)
    # label_num includes the background, so the actual number of kernels is label_num - 1

    # (1, 184, 328), (4, 184, 328), (184, 328), (184, 328), 2, 3, 0
    # kernels[0].shape == (184, 328), kernels[:-1].shape == (1, 184, 328)
    return _pa(kernels[:-1], emb, label, cc, label_num, min_area)
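The expansion rule is easier to see on a single row of pixels. The following is a simplified sketch, not the official algorithm: `pixel_aggregation` is a hypothetical function that takes an already labeled kernel map, always applies the distance test (the port above applies it only when `flag` is set), and uses the threshold of 6 mentioned in the text as its default.

```python
from collections import deque

import numpy as np


def pixel_aggregation(text_mask, kernel_label, emb, dist_thresh=6.0):
    """Grow labeled kernels into the text mask by breadth-first search.

    text_mask:    (H, W) binary full-text prediction
    kernel_label: (H, W) int map, 0 = background, i = i-th kernel
    emb:          (C, H, W) similarity vectors
    """
    H, W = kernel_label.shape
    pred = kernel_label.copy()
    # mean similarity vector of every kernel
    mean_emb = {int(l): emb[:, kernel_label == l].mean(axis=1)
                for l in np.unique(kernel_label) if l != 0}

    que = deque((int(x), int(y)) for x, y in zip(*np.nonzero(kernel_label)))
    while que:
        x, y = que.popleft()
        l = int(pred[x, y])
        for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nx, ny = x + dx, y + dy
            if not (0 <= nx < H and 0 <= ny < W):
                continue
            if text_mask[nx, ny] == 0 or pred[nx, ny] > 0:
                continue  # outside the text region, or already labeled
            if np.linalg.norm(emb[:, nx, ny] - mean_emb[l]) > dist_thresh:
                continue  # similarity vector too far from the kernel mean
            pred[nx, ny] = l
            que.append((nx, ny))
    return pred


# Two kernels inside one connected text region; the similarity vectors separate them.
text_mask = np.array([[1, 1, 1, 1, 1, 1, 1]])
kernel_label = np.array([[0, 1, 0, 0, 0, 2, 0]])
emb = np.array([[[0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 10.0]]])
pred = pixel_aggregation(text_mask, kernel_label, emb)
print(pred)  # [[1 1 1 2 2 2 2]]
```

Kernel 1 cannot absorb the fourth pixel because its similarity vector is 10 away from kernel 1's mean, so that pixel is later claimed by kernel 2.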
