Paper reading: "Point-Set Anchors for Object Detection, Instance Segmentation and Pose Estimation"

it2025-07-24 41

Paper link: https://arxiv.org/pdf/2007.02846.pdf (ECCV 2020)

Contents

1 Background and Motivation
2 Related work
3 Advantages/Contributions
4 Methods
4.1 Point-Set Anchors
4.2 Shape Regression
4.3. PointSetNet
5 Experiments
5.1. Datasets
5.2 Experiments on Instance Segmentation
5.3 Experiments on Pose Estimation
6 Conclusions

1 Background and Motivation

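Estimating keypoints is an effective and fundamental approach to object localization. Object detection, for example, can determine a bounding box from a few keypoints. A representative method is CenterNet, which extracts the center point of an object and regresses the bounding box size from it; the same idea extends easily to human pose estimation. Although CenterNet is highly practical and widely applicable, regressing keypoints from features extracted at the center point is a significant weakness: some keypoints do not lie near the center, so center features may carry little information about their locations. Geometric variation makes the problem harder still.

In this paper the authors address the issue by acquiring more informative features for keypoint regression. Instead of extracting features at the center point, the method extracts features at a set of points that lie closer to the regression targets. The point set is task-specific. For instance segmentation, the points are placed on the edges of an implicit bounding box. For pose estimation, the arrangement of the points follows the pose distribution of the training data. A point set works best when it is aligned with the scale and aspect ratio of the object. To this end the authors propose point-set anchors: like anchor boxes in object detection, point-set anchors are sampled at multiple scales, aspect ratios and image locations.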

Motivation:

(1) While this center-point regression is simple and efficient, we argue that the image features extracted at a central point contain limited information for predicting distant keypoints or bounding box boundaries, due to object deformation and scale/orientation variation. We propose to address this issue by acquiring more informative features for keypoint regression.

(2) To some extent an anchor is just a form of prior. An anchor can be a center point or a rectangle, and it also shapes other design choices, such as how positive and negative samples are assigned and which features are used for classification and regression. The question is therefore whether a more general anchor can be defined, one that applies to more tasks than object detection alone.

2 Related work

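Object representations: object detection usually localizes objects with anchor boxes, but an anchor box is a coarse representation that is not well suited to tasks such as instance segmentation and pose estimation. The alternative is to represent an object with specific points, e.g. center points, corner points, extreme points, octagon points, point sets and radial points. None of these can be applied to object detection, instance segmentation and pose estimation at the same time. The point-set anchors of this paper combine the advantages of anchor boxes and point representations and can be applied to all three tasks.

Instance segmentation: two-stage methods follow a detect-then-segment pipeline. Single-stage instance segmentation methods include PolarMask, YOLACT, ExtremeNet and Deep Snake. The method differs from Deep Snake in three ways: (1) it is a one-stage method and needs no detector to generate proposals; (2) Point-Set Anchors regress the mask shape directly; (3) it is evaluated on MS COCO and compared with existing methods on object detection, instance segmentation and pose estimation.

Pose estimation: most previous work estimates a heatmap for each joint, where the heatmap gives the confidence that a joint is present at each image location. Although this performs well, it has drawbacks: training is not end-to-end, high resolution is required, and joint localization and association are separate steps. This paper follows the regression-based paradigm and tackles the long-range displacement problem by regressing from a set of points that are placed at more favorable locations.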

3 Advantages/Contributions

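• A new object representation, Point-Set Anchors, is proposed. It generalizes and extends the conventional anchor box. Point-Set Anchors provide more informative features for shape regression and a better task-specific initialization.

• A network based on point-set anchors, PointSetNet, is proposed. It is a modification of RetinaNet. The network is applied to object detection, human pose estimation and instance segmentation, which requires defining regression targets specific to each task.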

4 Methods

4.1 Point-Set Anchors

Pose point-set anchor: We initialize the point-set anchors as the most frequent poses in the training set. Specifically, k-means is used to cluster the poses in the training set, and the mean pose of each cluster serves as a pose point-set anchor (3 clusters, 3 aspect ratios, 3 scales).
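The clustering step can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' code: the function names `sq_dist` and `kmeans_pose_anchors` are hypothetical, each pose is assumed to be already normalized and flattened into a list of coordinates, and the farthest-point initialization is an assumption chosen to keep the example deterministic.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two flattened poses."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pose_anchors(poses, k=3, iters=20):
    """Cluster flattened poses and return the k mean poses.

    poses: list of equal-length lists [x1, y1, x2, y2, ...] (assumed normalized).
    """
    # Farthest-point initialization (an assumption; keeps the sketch deterministic).
    centers = [list(poses[0])]
    while len(centers) < k:
        far = max(poses, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))

    for _ in range(iters):
        # Assignment step: each pose goes to its nearest center.
        groups = [[] for _ in range(k)]
        for p in poses:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: mean pose per cluster; an empty cluster keeps its center.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
    return centers
```

With real data each pose would hold 17 keypoints (34 numbers), and the returned mean poses would be the pose point-set anchors before they are scaled and placed at each feature-map location.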

Instance mask point-set anchor: two parts, one center point and n ordered anchor points.

The center point and the implicit bounding box together form the instance mask point-set anchor. Each spatial location of the feature map has 9 instance mask point-set anchors (3 aspect ratios, 3 scales). In the paper, n = 36.
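As a rough illustration of how such anchors could be laid out, the sketch below enumerates 9 implicit box shapes and places n ordered points on the edge of one box. The names `implicit_box_shapes` and `mask_point_set_anchor` are hypothetical. The scale values (2^0, 2^(1/3), 2^(2/3)), the height/width ratio convention, the equal arc-length spacing and the clockwise order starting at the top-left corner are all assumptions made for this example.

```python
from itertools import product


def implicit_box_shapes(base_size,
                        scales=(1.0, 2 ** (1 / 3), 2 ** (2 / 3)),
                        ratios=(0.5, 1.0, 2.0)):
    """Return the 9 (width, height) pairs for one feature-map location."""
    shapes = []
    for scale, ratio in product(scales, ratios):
        side = base_size * scale
        # Keep the area side * side while setting height / width = ratio
        # (assumed convention).
        w = side / ratio ** 0.5
        h = side * ratio ** 0.5
        shapes.append((w, h))
    return shapes


def mask_point_set_anchor(cx, cy, w, h, n=36):
    """Return (center, points): n points on the box edge, clockwise from top-left."""
    x0, y0 = cx - w / 2, cy - h / 2
    perimeter = 2 * (w + h)
    points = []
    for i in range(n):
        d = perimeter * i / n  # arc length travelled from the top-left corner
        if d < w:                      # top edge, left to right
            points.append((x0 + d, y0))
        elif d < w + h:                # right edge, top to bottom
            points.append((x0 + w, y0 + (d - w)))
        elif d < 2 * w + h:            # bottom edge, right to left
            points.append((x0 + w - (d - w - h), y0 + h))
        else:                          # left edge, bottom to top
            points.append((x0, y0 + h - (d - 2 * w - h)))
    return (cx, cy), points
```

Calling `mask_point_set_anchor` once per shape from `implicit_box_shapes` gives the 9 anchors of a location, each with one center point and 36 ordered edge points.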

4.2 Shape Regression

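In this work, instance segmentation, object detection and pose estimation are treated as a shape regression problem.

(1) Offsets for pose estimation: the offset is the ground-truth points minus the anchor points (S and T both consist of 17 keypoints). Earlier methods usually predict keypoints through heatmaps; here the network regresses offsets relative to the point-set anchor to localize the keypoints.

(2) Offsets for instance segmentation: the number of points in the ground-truth shape S may differ between object instances, whereas the point-set anchor T is fixed. The authors therefore introduce matching points T* to approximate S. Different matching strategies give different T*. Three strategies are described: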

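• Nearest point: as in figure (a), yellow is S (ground truth) and green is the point-set anchor. For each green point, the nearest yellow point (L1 distance) is taken as T*. The drawback is that many green points can map to the same yellow point.

• Nearest line: as in figure (b), each green point is projected perpendicularly onto the line segments formed by the yellow points, and the projection with the shortest distance (the foot of the perpendicular on the yellow line) is taken as T*. Several green points can still map to the same T*, but this is much less likely.

• Corner point with projection: as in figure (c), Nearest point is first used to find the yellow points closest to the four corner green points. These four yellow points divide the contour into top, bottom, right and left regions (in the figure, the region bounded by the two red lines and the two orange lines is the top region; the others are analogous). Each green point is projected onto its corresponding region, and the intersection with the segment formed by the yellow points is T*. Green points with no valid projection are invalid anchor points and do not take part in training.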
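The first two strategies can be sketched as follows. This is a minimal illustration and not the paper's implementation: the function names are hypothetical, the ground-truth contour is assumed to be a closed polygon given as a list of (x, y) tuples, and clamping the projection to the segment end points is an assumption for the case where the foot of the perpendicular falls outside the segment.

```python
def nearest_point(anchor_pts, gt_pts):
    """For each anchor point, pick the ground-truth point with the smallest L1 distance."""
    return [
        min(gt_pts, key=lambda g: abs(g[0] - a[0]) + abs(g[1] - a[1]))
        for a in anchor_pts
    ]


def project_to_segment(p, a, b):
    """Project p onto segment a-b, clamped to the segment (assumption)."""
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    denom = dx * dx + dy * dy
    t = 0.0 if denom == 0 else ((p[0] - ax) * dx + (p[1] - ay) * dy) / denom
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)


def nearest_line(anchor_pts, gt_pts):
    """For each anchor point, pick the closest projection onto any contour segment."""
    # Consecutive ground-truth points form the segments of a closed contour.
    segments = list(zip(gt_pts, gt_pts[1:] + gt_pts[:1]))
    matched = []
    for p in anchor_pts:
        candidates = [project_to_segment(p, a, b) for a, b in segments]
        matched.append(
            min(candidates, key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)
        )
    return matched
```

On a square contour with two anchor points just above its top edge, `nearest_point` can send both anchors to the same corner, while `nearest_line` gives each anchor its own target on the edge, which is the collision problem described in the list above.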

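(3) Offsets for object detection: S consists of the top-left and bottom-right points, and ΔT is the offset between the top-left and bottom-right points of S and those of the point-set anchor.

(4) Positive and negative samples

• For object detection and instance segmentation, anchors with IoU > 0.6 are positive and anchors with IoU < 0.4 are negative.

• For human keypoint detection, anchors with OKS > 0.5 are positive and anchors with OKS < 0.4 are negative.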
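The thresholds and the box offsets above can be illustrated with a small sketch. The helper names `box_iou`, `label_anchor` and `box_offsets` are hypothetical, boxes are assumed to be (x1, y1, x2, y2) tuples, and treating scores between the two thresholds as "ignored" is an assumption borrowed from common RetinaNet-style practice rather than something stated in the text.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchor(score, pos_thr=0.6, neg_thr=0.4):
    """Label an anchor from its similarity score (IoU for boxes/masks, OKS for poses)."""
    if score > pos_thr:
        return "positive"
    if score < neg_thr:
        return "negative"
    return "ignored"  # assumption: in-between anchors do not contribute to training


def box_offsets(anchor_box, gt_box):
    """Offsets of the top-left and bottom-right corners: ground truth minus anchor."""
    return tuple(g - a for a, g in zip(anchor_box, gt_box))
```

For pose anchors the same labelling function would be called as `label_anchor(oks_value, pos_thr=0.5, neg_thr=0.4)`.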

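Object keypoint similarity (OKS) plays the same role for keypoints as IoU does for boxes. It is the standard evaluation metric for human keypoint detection, inspired by the IoU metric in object detection, and it measures the similarity between the keypoints of two person instances.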
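A minimal sketch of OKS in the form used by the COCO keypoint evaluation is given below, assuming each keypoint contributes exp(-d² / (2 · area · (2σ)²)) and that only labelled keypoints (visibility > 0) are averaged. The function name and argument layout are hypothetical, and the per-keypoint `sigmas` are passed in by the caller rather than hard-coded.

```python
import math


def oks(pred, gt, visible, area, sigmas):
    """Object keypoint similarity between a predicted and a ground-truth pose.

    pred, gt: lists of (x, y); visible: visibility flags (> 0 means labelled);
    area: object area in pixels; sigmas: per-keypoint falloff constants.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visible, sigmas):
        if v <= 0:
            continue  # unlabelled keypoints are skipped
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k2 = (2 * sigma) ** 2
        total += math.exp(-d2 / (2 * area * k2))
        count += 1
    return total / count if count else 0.0
```

Identical poses give an OKS of 1, and the score decays towards 0 as the keypoints move apart, more slowly for large objects and for keypoints with a large sigma.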

4.3. PointSetNet

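(1) Architecture: PointSetNet is a modification of RetinaNet. The original anchor boxes are replaced by the proposed point-set anchors, and a mask/pose regression branch is added in parallel to the class and box branches.

[Figure: output channel counts of the three RetinaNet sub-networks]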

(2) Point-set anchor density

Object detection and instance segmentation: the implicit bounding box of the Point-Set Anchors uses 3 scales (2^{k/3}, k ≤ 3) and 3 aspect ratios ([0.5, 1, 2]), so each location produces 9 anchors (36 points per anchor, or 2 points for object detection).

Pose estimation: the Point-Set Anchors are the 3 poses produced by k-means clustering, combined with 3 scales and 3 rotations, which gives 27 anchors per location (17 keypoints per anchor).

(3) Loss function: The loss is calculated over all locations and all feature maps.

(4) Elements specific to pose estimation

Deep shape indexed features: how feature aggregation with point-set anchors achieves a certain feature transformation invariance

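This module uses the prior carried by the point-set anchor: a deformable convolution (DCN) aggregates the features at the anchor-specific locations, which gives classification and regression better features than a single center-point feature.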
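The idea of indexing features by the anchor shape can be illustrated without a deep-learning framework. The sketch below is a simplification: it swaps the deformable convolution for plain bilinear sampling of a single-channel feature map at the anchor point locations, and the names `bilinear_sample` and `shape_indexed_feature` are hypothetical.

```python
def bilinear_sample(fmap, x, y):
    """Bilinearly sample fmap[row][col] at (x, y), clamped to the map border."""
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = fmap[y0][x0] * (1 - fx) + fmap[y0][x1] * fx
    bottom = fmap[y1][x0] * (1 - fx) + fmap[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def shape_indexed_feature(fmap, anchor_pts):
    """Collect one feature value per anchor point instead of a single center value."""
    return [bilinear_sample(fmap, x, y) for x, y in anchor_pts]
```

In a real network each sampled value would be a channel vector, and the sampling would be learned as part of the deformable convolution; the sketch only shows that the feature is gathered at the anchor points rather than at the center.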

Multi-stage refinement: how point-set anchors can be extended to multi-stage learning

Holistic shape regression is generally more difficult than part-based heat map learning. A common remedy is a sequence of weak regressors (the boosting idea), and the paper adopts this approach for pose estimation. For simplicity and efficiency, it uses one-step refinement.
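One way to picture one-step refinement is that the first-stage prediction moves the anchor, and the second stage regresses the remaining residual from the moved anchor. The sketch below only shows this bookkeeping; the function names are hypothetical, and how the paper wires the second stage into the network is not covered here.

```python
def offsets(shape, anchor):
    """Regression target: ground-truth points minus anchor points."""
    return [(sx - ax, sy - ay) for (sx, sy), (ax, ay) in zip(shape, anchor)]


def apply_offsets(anchor, delta):
    """Move every anchor point by its predicted offset."""
    return [(ax + dx, ay + dy) for (ax, ay), (dx, dy) in zip(anchor, delta)]


def two_stage_targets(shape, anchor, stage1_pred):
    """Return the refined anchor and the residual target for the second stage."""
    refined = apply_offsets(anchor, stage1_pred)
    return refined, offsets(shape, refined)
```

If the first stage already lands close to the ground truth, the residual target is small, which is the motivation for chaining weak regressors.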

5 Experiments

5.1. Datasets

MS COCO

5.2 Experiments on Instance Segmentation

1）Mask matching strategies

2）Effect of point-set anchors

3）Comparison with state-of-the-art methods

The results are not ideal; the author explains the reasons in their own post: zhuanlan.zhihu.com/p/158054890

5.3 Experiments on Pose Estimation

1）Effect of point-set anchors

2）Effect of deep shape indexed feature

3）Effect of multi-stage refinement

4）Effect of stronger backbone network and multi-scale testing

5）Comparison with state-of-the-art methods

6 Conclusions

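• The paper proposes the point-set anchor, a generalization of the anchor, and uses regression to unify three high-level recognition tasks: object detection, instance segmentation and pose estimation.

• It proposes PointSetNet, which is built on RetinaNet without extra design. The network achieves good performance on pose estimation and reasonable results on object detection and instance segmentation (purely RetinaNet-based and regression-driven).

• It is also the first framework that attempts to unify object detection, instance segmentation and pose estimation with single-stage regression.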
