Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection (ATSS)

Authors: Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, Stan Z. Li
Affiliations: CBSR NLPR CASIA, SAI UCAS, AIR CAS, BUPT, Westlake University

The authors argue that the essential difference between anchor-based and anchor-free detection is how positive and negative training samples are defined, regardless of whether regression starts from a box or a point. They then propose Adaptive Training Sample Selection (ATSS), which automatically selects positive and negative samples according to the statistical characteristics of each object. Experiments show that ATSS can achieve 50.7% AP on MS COCO.

Difference Analysis of Anchor-Based and Anchor-Free Detection

Without loss of generality, the representative anchor-based RetinaNet and anchor-free FCOS are adopted to dissect their differences:

  • The number of anchors tiled per location: RetinaNet tiles several anchor boxes per location, while FCOS tiles one anchor point per location.

  • The definition of positive and negative samples: RetinaNet resorts to IoU for positives and negatives, while FCOS utilizes spatial and scale constraints to select samples.

  • The regression starting status: RetinaNet regresses the object bounding box from a preset anchor box, while FCOS locates the object from an anchor point.
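
The second difference, how samples are labeled, can be made concrete with a small sketch. It is illustrative only and rests on assumptions not stated in this text: the IoU thresholds (0.5 for positives, below 0.4 for negatives, ignored in between) and the per-level scale range on the largest regression distance follow the original RetinaNet and FCOS papers as commonly described, and the function names `retinanet_label` and `fcos_label` are hypothetical.

```python
import numpy as np


def retinanet_label(iou, pos_thr=0.5, neg_thr=0.4):
    """Label one anchor box from its best IoU with any ground-truth box.

    The thresholds are assumed defaults, not values given in this text.
    """
    if iou >= pos_thr:
        return "positive"
    if iou < neg_thr:
        return "negative"
    return "ignored"


def fcos_label(point, gt, scale_range):
    """Label one anchor point against one ground-truth box (x1, y1, x2, y2).

    scale_range is the (low, high) range assumed for this pyramid level.
    """
    x, y = point
    # Distances from the point to the left, top, right and bottom box edges.
    ltrb = np.array([x - gt[0], y - gt[1], gt[2] - x, gt[3] - y])
    inside = ltrb.min() > 0              # spatial constraint: point lies in the box
    low, high = scale_range
    in_scale = low <= ltrb.max() <= high  # scale constraint: box size fits the level
    return "positive" if inside and in_scale else "negative"
```

The anchor-based rule depends only on overlap, whereas the anchor-free rule depends on where the point lies and on which pyramid level it belongs to. The same object can therefore receive different positive samples under the two schemes.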

By removing the inconsistencies between the two methods, the authors compare RetinaNet with one anchor per location (#A = 1) against FCOS on the MS COCO minival set. The results are reported below.

_images/atss-1.png

The results in each column demonstrate that the definition of positive and negative samples is an essential difference between anchor-based and anchor-free detectors. The results in each row indicate that the regression starting status is an irrelevant difference.

Adaptive Training Sample Selection

The authors propose ATSS, a method that automatically divides positive and negative samples according to the statistical characteristics of each object, with almost no hyperparameters.


Algorithm: Adaptive Training Sample Selection (ATSS)
Inputs:
\(\mathcal{G}\) is a set of ground-truth boxes on the image
\(\mathcal{L}\) is the number of feature pyramid levels
\(\mathcal{A}_i\) is a set of anchor boxes from the \(i\)-th pyramid level
\(\mathcal{A}\) is a set of all anchor boxes
\(k\) is a hyperparameter with a default value of 9; the results are quite robust to its choice
Outputs:
\(\mathcal{P}\) is a set of positive samples
\(\mathcal{N}\) is a set of negative samples
Algorithm:
for \(g \in \mathcal{G}\) do
    build an empty set for candidate positive samples of \(g\): \(\mathcal{C}_g \leftarrow \emptyset\);
    for \(i \in [1, \mathcal{L}]\) do
        \(\mathcal{S}_i\) \(\leftarrow\) select \(k\) anchors from \(\mathcal{A}_i\) whose centers are closest to the center of \(g\) based on L2 distance;
        \(\mathcal{C}_g = \mathcal{C}_g \cup \mathcal{S}_i\);
    end for
    compute IoU between \(\mathcal{C}_g\) and \(g\): \(\mathcal{D}_g = IoU(\mathcal{C}_g, g)\);
    compute mean of \(\mathcal{D}_g\): \(m_g = Mean(\mathcal{D}_g)\);
    compute standard deviation of \(\mathcal{D}_g\): \(v_g = Std(\mathcal{D}_g)\);
    compute IoU threshold for \(g\): \(t_g = m_g + v_g\);
    for \(c \in \mathcal{C}_g\) do
        if \(IoU(c, g) \geq t_g\) and center of \(c\) in \(g\) then
            \(\mathcal{P} = \mathcal{P} \cup c\);
        end if
    end for
end for
\(\mathcal{N} = \mathcal{A} - \mathcal{P}\);
return \(\mathcal{P}\), \(\mathcal{N}\);
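
The steps above can be sketched in NumPy as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the function names `pairwise_iou` and `atss_assign`, the `(x1, y1, x2, y2)` box layout, the population standard deviation (NumPy's default), and the inclusive center-inside test are all choices made for this sketch. It returns index sets only and does not resolve an anchor that qualifies for several ground-truth boxes.

```python
import numpy as np


def pairwise_iou(boxes, gt):
    """IoU between each (x1, y1, x2, y2) box in `boxes` and one box `gt`."""
    x1 = np.maximum(boxes[:, 0], gt[0])
    y1 = np.maximum(boxes[:, 1], gt[1])
    x2 = np.minimum(boxes[:, 2], gt[2])
    y2 = np.minimum(boxes[:, 3], gt[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return inter / (area_b + area_g - inter)


def atss_assign(anchors_per_level, gt_boxes, k=9):
    """Return (positive, negative) indices into the concatenated anchor array.

    anchors_per_level: list of (n_i, 4) arrays, one per pyramid level.
    gt_boxes: (m, 4) array of ground-truth boxes.
    """
    all_anchors = np.concatenate(anchors_per_level, axis=0)
    centers = (all_anchors[:, :2] + all_anchors[:, 2:]) / 2.0
    # Start offset of each level inside the concatenated array.
    offsets = np.cumsum([0] + [len(a) for a in anchors_per_level[:-1]])
    positives = set()
    for g in gt_boxes:
        g_center = (g[:2] + g[2:]) / 2.0
        dist = np.linalg.norm(centers - g_center, axis=1)  # L2 center distance
        candidates = []
        for start, level in zip(offsets, anchors_per_level):
            d = dist[start:start + len(level)]
            topk = np.argsort(d)[:min(k, len(level))]  # k closest on this level
            candidates.extend(start + topk)
        candidates = np.asarray(candidates)
        ious = pairwise_iou(all_anchors[candidates], g)
        t_g = ious.mean() + ious.std()  # threshold: mean + standard deviation
        c = centers[candidates]
        inside = ((c[:, 0] >= g[0]) & (c[:, 0] <= g[2]) &
                  (c[:, 1] >= g[1]) & (c[:, 1] <= g[3]))
        positives.update(int(i) for i in candidates[(ious >= t_g) & inside])
    pos = np.array(sorted(positives), dtype=int)
    neg = np.setdiff1d(np.arange(len(all_anchors)), pos)
    return pos, neg
```

Each ground-truth box contributes at most \(k\) candidates per level, so the per-object cost of the IoU and threshold steps is bounded by \(k\mathcal{L}\) anchors regardless of how densely anchors are tiled.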

Here are some motivations behind this method:

  • Selecting candidates based on the center distance between anchor box and object.

  • Using the sum of the mean and standard deviation as the IoU threshold. This adaptively selects enough positives for each object from the appropriate pyramid levels, in accordance with the statistical characteristics of the object. The figure below illustrates this method.

    _images/atss-2.png
  • Limiting the positive samples’ centers to lie inside the object. Although the IoU of candidates does not follow a standard normal distribution (for a normal distribution, about 16% of samples would exceed the mean plus one standard deviation), the statistics show that each object has about \(0.2 \times k\mathcal{L}\) positive samples, invariant to its scale, aspect ratio, and location.

  • Remaining almost hyperparameter-free. Experiments show that the results are quite insensitive to variations of \(k\).
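
To see how the mean-plus-standard-deviation threshold adapts to an object, the sketch below applies it to two sets of candidate IoUs. The IoU values are hypothetical, made up for illustration and not taken from the paper, and the helper name `atss_threshold` is likewise an assumption. It uses the population standard deviation, which is NumPy's default.

```python
import numpy as np


def atss_threshold(ious):
    """IoU threshold t_g = mean + standard deviation of the candidate IoUs."""
    ious = np.asarray(ious, dtype=float)
    return ious.mean() + ious.std()


# Hypothetical candidate IoUs: one pyramid level fits the object far better.
peaked = np.array([0.05, 0.10, 0.70, 0.75, 0.10, 0.05])
# Hypothetical candidate IoUs: several levels fit the object about equally well.
flat = np.array([0.30, 0.35, 0.40, 0.35, 0.30, 0.40])

for name, ious in [("peaked", peaked), ("flat", flat)]:
    t = atss_threshold(ious)
    print(f"{name}: threshold={t:.3f}, positives={int((ious >= t).sum())}")
```

In the peaked case the large standard deviation raises the threshold well above the mean, so only candidates from the well-matched level survive. In the flat case the threshold stays close to the mean, so positives can come from any of the similarly suitable levels.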