
1 Introduction

Among the various factors that confront real-world face detection, large pose variation remains a big challenge. For example, the seminal Viola-Jones [1] detector works well for near-frontal faces, but becomes much less effective for faces in poses far from frontal views, due to the weakness of the Haar features on non-frontal faces.

Abundant works have attempted to tackle large pose variations under the regime of the boosting cascade advocated by Viola and Jones [1]. Most adopt a divide-and-conquer strategy to build a multi-view face detector. Some works [2–4] proposed to train a detector cascade for each view and combine the results of all detectors at test time. Others [5–7] proposed to first estimate the face pose and then run the cascade of the corresponding pose to verify the detection. The complexity of the former approach increases with the number of pose categories, while the accuracy of the latter is prone to mistakes in pose estimation.

Part-based models offer an alternative solution [8–10]. These detectors are flexible and robust to both pose variation and partial occlusion, since they can reliably detect faces based on a few confident part detections. However, these methods require the target face to be large and clear, which is essential for reliably modeling the parts.

Other works approach this issue by using more sophisticated invariant features instead of Haar wavelets, e.g., HOG [8], SIFT [9], multiple channel features [11], and high-level CNN features [12]. Besides these model-based methods, Shen et al. [13] proposed an exemplar-based method that detects faces by image retrieval, which achieved state-of-the-art detection accuracy.

It has been shown in recent years that a face detector trained end-to-end using a DNN can significantly outperform previous methods [10, 14]. However, to effectively handle the different variations, especially pose variations, such a detector often requires a DNN with many parameters, inducing high computational cost. To address this conflict, Li et al. [15] proposed a cascade DNN architecture operating at multiple resolutions: it quickly rejects background regions in the low-resolution stages and carefully evaluates the challenging candidates in the high-resolution stage.

However, the set of DNNs in Li et al. [15] are trained sequentially, instead of end-to-end, which may not be desirable. In contrast, we propose a new cascade Convolutional Neural Network that is trained end-to-end. The first stage is a multi-task Region Proposal Network (RPN), which simultaneously proposes candidate face regions along with the associated facial landmarks. Inspired by Chen et al. [16], we jointly conduct face detection and face alignment, since face alignment is helpful for distinguishing face/non-face patterns.

Different from Li et al. [15], this network operates on the original resolution to better leverage discriminative information. The alignment step warps each candidate face region to a canonical pose by mapping the facial landmarks to a set of canonical positions. The aligned candidate face region is then fed into the second-stage network, an RCNN [17], for further verification. Note that we only keep the K candidate face regions with top responses in a local neighborhood of the RPN; in other words, the non-top K regions are suppressed. This helps increase detection recall.

Inspired by previous work [18], which revealed that joining features from different spatial resolutions or scales improves accuracy, we concatenate the feature maps from the two cascaded networks to form an architecture that is trained end-to-end, as shown in Fig. 1. Note that in the learning process, we also treat the set of canonical positions as parameters, which are learnt end-to-end.

Note that the canonical positions of the facial landmarks in the aligned face image and the predicted facial landmarks in the candidate face region jointly define the transform applied to the candidate face region. In the end-to-end training, the first-stage RPN's prediction of facial landmarks is also supervised by the annotated facial landmarks in each true face region. We hence call our network a Supervised Transformer Network. These two characteristics differentiate our model from the Spatial Transformer Network [19], because (a) the Spatial Transformer Network regresses the transformation parameters directly, and (b) it is supervised only by the final recognition objective.

Fig. 1. Illustration of the structure of our Supervised Transformer Network.

The proposed Supervised Transformer Network runs efficiently on the GPU. In practice, however, the CPU is still the only choice in most situations. Therefore, we propose a region-of-interest (ROI) convolution scheme to make the run-time of the Supervised Transformer Network more efficient. It first uses a conventional boosting cascade to obtain a set of candidate face areas, which we combine into an irregular binary ROI mask. All DNN operations (convolution, ReLU, pooling, and concatenation) are processed only inside the ROI mask, which significantly reduces the computation.

Our contributions are: (1) we propose a new cascaded network named Supervised Transformer Network, trained end-to-end, for efficient face detection; (2) we introduce the supervised transformer layer, which enables learning the optimal canonical pose to best differentiate face/non-face patterns; (3) we introduce a Non-top K suppression scheme, which achieves better recall without sacrificing precision; (4) we introduce an ROI convolution scheme, which speeds up our detector 3x on the CPU with little recall drop.

Our face detector outperforms the current best-performing algorithms on the public benchmarks we evaluated, with real-time performance of 30 FPS at VGA resolution.

2 Network Architecture

2.1 Overview

In this section, we introduce the architecture of our proposed cascade network. As illustrated in Fig. 1, the whole architecture consists of two stages. The first stage is a multi-task Region Proposal Network (RPN) that produces a set of candidate face regions along with the associated facial landmarks. We conduct Non-top K suppression to keep only the candidate face regions whose responses rank in the top K within a local neighborhood.

The second stage starts with a Supervised Transformer layer, followed by an RCNN that further verifies whether a candidate region is a true face. The transformer layer takes the facial landmarks and the candidate face regions, and warps each face region into a canonical pose by mapping the detected facial landmarks to a set of canonical positions. This explicitly eliminates the effect of rotation and scale variation according to the facial points.

Note that the geometric transformation is uniquely determined by the facial landmarks and the canonical positions. In our cascade network, both the prediction of the facial landmarks and the canonical positions are learned in the end-to-end training process. We call it a Supervised Transformer layer, as it receives supervision from two aspects: on one hand, the learning of the facial landmark prediction model is supervised by the annotated ground-truth facial landmarks; on the other hand, the learning of both the canonical positions and the landmark prediction model is supervised by the final classification objective.

To make the final decision, we concatenate the fine-grained features from the second-stage RCNN and the global features from the first-stage RPN. The concatenated features are then fed into a fully connected layer to make the final face/non-face arbitration. This concludes the whole architecture of our proposed cascade network.

2.2 Multi-task RPN

The design of the multi-task RPN is inspired by the JDA detector [16], which validated that face alignment is helpful for distinguishing faces from non-faces. Our method is straightforward: we use an RPN to simultaneously detect faces and the associated facial landmarks. It is very similar to the work in [20], except that our regression targets are facial landmark locations instead of bounding box parameters.

2.3 The Supervised Transformer Layer

In this section, we describe the details of the supervised transformer layer. Similarity transformations have been widely used in face detection and face recognition to eliminate scale and rotation variation. The common practice is to train a prediction model to detect the facial landmarks, and then warp the face image to a canonical pose by mapping the facial landmarks to a set of manually specified canonical locations.

This process has at least two drawbacks: (1) one needs to manually set the canonical locations. Since the canonical locations determine the scale and offset of the rectified face images, it often takes much trial and error to find a relatively good setting, which is not only time-consuming but also suboptimal. (2) The learning of the facial landmark prediction model is supervised by the ground-truth facial landmark points. However, labeling ground-truth facial landmarks is a highly subjective process and hence prone to introducing noise.

We propose to learn both the canonical positions and the facial landmark predictions end-to-end, with additional supervision from the classification objective of the RCNN provided through back propagation. Specifically, we define a similarity transformation by

$$\begin{aligned} \left[ \begin{array}{c} \bar{x}_i - m_{\bar{x}}\\ \bar{y}_i - m_{\bar{y}} \end{array} \right] = \left[ \begin{array}{cc} a & b\\ -b & a \end{array} \right] \left[ \begin{array}{c} x_i - m_{x}\\ y_i - m_{y} \end{array} \right] , \end{aligned}$$
(1)

where \(x_i, y_i\) are the detected facial landmarks, \(\bar{x}_i, \bar{y}_i\) are the canonical positions, \(m_*\) is the mean value of the corresponding variable, e.g., \(m_x = \frac{1}{N} \sum x_i\), N is the number of facial landmarks, and a and b are the parameters of the similarity transform.
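For readers who want the intermediate step, the solution given below follows from minimizing the squared residual of Eq. (1) over a and b (a standard least-squares argument; this derivation is ours, not reproduced from the original text):

$$\begin{aligned} \min_{a,b} \sum_i \left\| \left[ \begin{array}{c} \bar{x}_i - m_{\bar{x}}\\ \bar{y}_i - m_{\bar{y}} \end{array} \right] - \left[ \begin{array}{cc} a & b\\ -b & a \end{array} \right] \left[ \begin{array}{c} x_i - m_{x}\\ y_i - m_{y} \end{array} \right] \right\|^2 . \end{aligned}$$

Setting the derivatives with respect to a and b to zero yields \(a\,c_3 = c_1\) and \(b\,c_3 = c_2\), with \(c_1, c_2, c_3\) as defined in Eq. (3) below.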

We found that this two-parameter model is equivalent to the traditional four-parameter one, but is much simpler in derivation and avoids numerical problems. After some straightforward mathematical derivation, we obtain the least squares solution of the parameters, i.e.,

$$\begin{aligned} \begin{aligned} a&= \frac{c_1}{c_3}, \\ b&= \frac{c_2}{c_3}, \end{aligned} \end{aligned}$$
(2)

where

$$\begin{aligned} \begin{aligned} c_1&= \sum {\left( (\bar{x}_i-m_{\bar{x}})(x_i-m_x)+(\bar{y}_i-m_{\bar{y}})(y_i-m_y)\right) } \\ c_2&= \sum {\left( (\bar{x}_i-m_{\bar{x}})(y_i-m_y)-(\bar{y}_i-m_{\bar{y}})(x_i-m_x)\right) } \\ c_3&= \sum {\left( (x_i-m_x)^2+(y_i-m_y)^2\right) }. \end{aligned} \end{aligned}$$
(3)
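As a concrete illustration, here is a minimal NumPy sketch of Eqs. (1)–(3); the function name and the (N, 2) array layout are our own choices, not part of the original implementation:

```python
import numpy as np

def estimate_similarity(landmarks, canonical):
    """Least-squares similarity parameters (a, b) per Eqs. (1)-(3).

    landmarks: (N, 2) detected facial landmarks (x_i, y_i)
    canonical: (N, 2) canonical positions (xbar_i, ybar_i)
    """
    x, y = landmarks[:, 0], landmarks[:, 1]
    xb, yb = canonical[:, 0], canonical[:, 1]
    # Center both point sets (the means m_x, m_y, m_xbar, m_ybar).
    dx, dy = x - x.mean(), y - y.mean()
    dxb, dyb = xb - xb.mean(), yb - yb.mean()
    c1 = np.sum(dxb * dx + dyb * dy)   # Eq. (3)
    c2 = np.sum(dxb * dy - dyb * dx)
    c3 = np.sum(dx**2 + dy**2)
    return c1 / c3, c2 / c3            # Eq. (2): a, b
```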

After obtaining the similarity transformation parameters, we can compute the rectified image \(\bar{I}\) from the original image I via \(\bar{I}(\bar{x}, \bar{y}) = I(x, y)\). Each point \((\bar{x}, \bar{y})\) in the rectified image is mapped back to the original image space (x, y) by

$$\begin{aligned} \begin{aligned} x&= \frac{a}{a^2+b^2}(\bar{x}-m_{\bar{x}})-\frac{b}{a^2+b^2}(\bar{y}-m_{\bar{y}}) + m_x \\ y&= \frac{b}{a^2+b^2}(\bar{x}-m_{\bar{x}})+\frac{a}{a^2+b^2}(\bar{y}-m_{\bar{y}}) + m_y. \end{aligned} \end{aligned}$$
(4)

Since x and y may not be integers, bilinear interpolation is used to obtain the value of I(x, y). Therefore, we can calculate the derivative by the chain rule:

$$\begin{aligned} \begin{aligned} \frac{\partial L}{\partial a}&= \sum _{\{\bar{x}, \bar{y}\}}{\frac{\partial L}{\partial \bar{I}(\bar{x}, \bar{y})} \frac{\partial \bar{I}(\bar{x}, \bar{y})}{\partial a}} = \sum _{\{\bar{x}, \bar{y}\}}{\frac{\partial L}{\partial \bar{I}(\bar{x}, \bar{y})} \frac{\partial I(x, y)}{\partial a}} \\&=\sum _{\{\bar{x}, \bar{y}\}}{\frac{\partial L}{\partial \bar{I}(\bar{x}, \bar{y})} \left( \frac{\partial I(x, y)}{\partial x}\frac{\partial x}{\partial a}+\frac{\partial I(x, y)}{\partial y}\frac{\partial y}{\partial a}\right) } \\&=\sum _{\{\bar{x}, \bar{y}\}}{\frac{\partial L}{\partial \bar{I}(\bar{x}, \bar{y})} \left( I_x\frac{\partial x}{\partial a}+I_y\frac{\partial y}{\partial a}\right) } \end{aligned} \end{aligned}$$
(5)

where L is the final classification loss and \(\frac{\partial L}{\partial \bar{I}(\bar{x}, \bar{y})}\) is the gradient signal back-propagated from the RCNN network. \(I_x\) and \(I_y\) are the horizontal and vertical gradients of the original image:

$$\begin{aligned} \begin{aligned} I_x&= \beta _y (I(x_r,y_b) - I(x_l,y_b))+(1-\beta _y)(I(x_r,y_t)-I(x_l,y_t)) \\ I_y&= \beta _x (I(x_r,y_b) - I(x_r,y_t))+(1-\beta _x)(I(x_l,y_b)-I(x_l,y_t)). \end{aligned} \end{aligned}$$
(6)

Here we use bilinear interpolation, with \(\beta _x = x - \lfloor x \rfloor \) and \(\beta _y = y - \lfloor y \rfloor \), where \(x_l = \lfloor x \rfloor , x_r = x_l + 1, y_t = \lfloor y \rfloor , y_b = y_t + 1\) are the left, right, top, and bottom integer boundaries of the point (x, y). The derivatives of the other parameters follow similarly. Finally, we obtain the gradients of the canonical positions of the facial landmarks, \(\frac{\partial L}{\partial \bar{x}_i}\) and \(\frac{\partial L}{\partial \bar{y}_i}\), as well as the gradients with respect to the detected facial landmarks, \(\frac{\partial L}{\partial x_i}\) and \(\frac{\partial L}{\partial y_i}\). Please refer to the supplementary material for more detail.
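Putting Eqs. (4) and (6) together, the following NumPy sketch implements the forward rectification for a single-channel image (the boundary clipping and the function signature are our own simplifications; the backward pass of Eq. (5) is omitted):

```python
import numpy as np

def rectify(image, a, b, m, m_bar, out_shape):
    """Backward-warp `image` into the canonical pose via Eq. (4).

    image:    (H, W) single-channel array
    a, b:     similarity parameters from Eqs. (2)-(3)
    m, m_bar: (m_x, m_y) and (m_xbar, m_ybar) mean positions
    """
    hb, wb = out_shape
    ybar, xbar = np.meshgrid(np.arange(hb), np.arange(wb), indexing='ij')
    s = a * a + b * b
    # Eq. (4): map each rectified pixel back to the original image.
    x = (a * (xbar - m_bar[0]) - b * (ybar - m_bar[1])) / s + m[0]
    y = (b * (xbar - m_bar[0]) + a * (ybar - m_bar[1])) / s + m[1]
    # Bilinear interpolation, same notation as Eq. (6).
    xl = np.clip(np.floor(x).astype(int), 0, image.shape[1] - 2)
    yt = np.clip(np.floor(y).astype(int), 0, image.shape[0] - 2)
    bx, by = x - xl, y - yt
    top = (1 - bx) * image[yt, xl] + bx * image[yt, xl + 1]
    bot = (1 - bx) * image[yt + 1, xl] + bx * image[yt + 1, xl + 1]
    return (1 - by) * top + by * bot
```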

The proposed Supervised Transformer layer sits between the RPN and RCNN networks. In end-to-end training, it automatically adjusts the canonical positions and guides the detection of the facial landmarks such that the rectified image is more suitable for face/non-face classification. We further illustrate this in the experiments.

2.4 Non-top K Suppression

In RCNN-based object detection [17, 20], non-maximum suppression (NMS) is usually adopted after the region proposal stage to reduce the number of candidate regions for efficiency. However, the candidate with the highest confidence score may still be rejected by the later-stage RCNN, and decreasing the NMS overlap threshold brings in many useless candidates, which slows down the subsequent RCNN. Our idea is instead to keep the K candidate regions with the highest confidence for each potential face, since these samples are the most promising ones for the RCNN classifier. In the experiments we demonstrate that the proposed Non-top K suppression effectively improves recall, as sketched below.
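The text does not spell out the neighborhood definition, so the sketch below is one plausible reading in which a neighborhood is defined by IoU overlap, exactly as in NMS; with \(K = 1\) it degenerates to standard NMS:

```python
import numpy as np

def iou(b1, b2):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    iy = max(0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = ix * iy
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter)

def non_top_k_suppression(boxes, scores, k=3, thresh=0.5):
    """Keep the k highest-scoring candidates per overlapping group."""
    kept = []
    for i in np.argsort(scores)[::-1]:
        # A candidate survives unless k stronger overlapping
        # candidates have already been kept.
        if sum(iou(boxes[i], boxes[j]) > thresh for j in kept) < k:
            kept.append(i)
    return kept
```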

2.5 Multi-granularity Feature Combination

Some works have revealed that joining features from different spatial resolutions or scales improves accuracy [18]. The most straightforward way to exploit this would be to combine several RCNN networks with different input scales. However, this approach significantly increases the computational complexity.

In our end-to-end network, the details of the RPN structure are shown in Table 1. There are 3 convolution and 2 inception layers in our RPN, from which we can calculate that its receptive field size is 85 pixels, while the target face size is \(36{\sim }72\) pixels. Therefore, our RPN takes advantage of the contextual information surrounding the face regions. The RCNN network, on the other hand, focuses more on the fine-grained details inside the face region once rotation and scale variation have been removed. We therefore concatenate these two features in an end-to-end training architecture, which makes the two parts complementary. Experiments demonstrate that this joint feature significantly improves face detection accuracy. Besides, the proposed method is much more efficient.

Table 1. RPN network structure
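Since Table 1 is not reproduced here, the following sketch only shows the standard recurrence used to compute such a receptive field; the layer list in the example is hypothetical and does not reproduce the actual RPN configuration:

```python
def receptive_field(layers):
    """Receptive field of stacked layers given (kernel, stride) pairs.

    Uses the recurrence rf += (k - 1) * jump; jump *= s,
    where `jump` is the cumulative stride.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Hypothetical example (not the layers of Table 1):
print(receptive_field([(5, 2), (3, 1), (3, 2), (3, 1), (3, 1)]))  # -> 29
```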

3 The ROI Convolution

3.1 Motivation

For a practical face detection algorithm, real-time performance is very important. However, the heavy computation incurred at test time by DNN-based models often makes them impractical in real-world systems; this is why current DNN-based models rely heavily on high-end GPUs for run-time performance. High-end GPUs are not often available in commodity computing systems, so most often we still need to run the DNN model on a CPU. Even using a high-end CPU with highly optimized code, a model is still about 4 times slower than on a GPU [21]. More importantly, portable devices such as phones and tablets mostly have only low-end CPUs, so it is necessary to accelerate the test-phase performance of DNNs.

In a typical DNN, the convolutional layers are the most computationally expensive and often account for more than \(90\,\%\) of the runtime. Several works have attempted to reduce the computational complexity of the convolution layers. For example, Jaderberg et al. [22] applied a sparse decomposition to reconstruct the convolutional filters. Other works [23, 24] assume that the convolutional filters are approximately low-rank along certain dimensions and can be decomposed into a series of smaller filters. Our detector could also benefit from these model compression techniques.

Nevertheless, we propose a more practical approach to accelerate the run-time speed of our proposed Supervised Transformer Network for face detection. Our main idea is to use a conventional cascade-based face detector to quickly reject non-face regions and obtain a binary ROI mask. The ROI mask has the same size as the input: background areas are marked 0 and face areas are marked 1. The DNN convolution is computed only within the regions marked 1, ignoring all others. Because most regions do not participate in the calculation, the amount of computation in the convolution layers is greatly reduced.

Fig. 2. Illustration of the ROI mask.

We want to emphasize that our method is different from RCNN-based algorithms [17, 25], which treat each candidate region independently; in those models, features in overlapping subregions are calculated repeatedly. Instead, with the ROI mask, different samples share the features in overlapping areas, which further reduces the computational cost by avoiding repeated operations. In the following section, we introduce the implementation details of our ROI convolution. Similar to Caffe [26], we take advantage of the matrix multiplication in the BLAS library to obtain an almost linear speedup.

3.2 Implementation Details

Cascade Pre-filter. As shown in Fig. 2, we use a cascade detector as a pre-filter. It is basically a variant of the Viola-Jones detector [1], but it has more weak classifiers and is trained with more data. Our boosted classifier consists of 1000 weak classifiers. Different from [1], we adopt a boosted fern [27] as the weak classifier, since a fern is more powerful than a decision stump on a single Haar feature, and more efficient than a boosted tree on CPUs. For completeness, we briefly describe our implementation.

Each fern contains 8 binary nodes. Each splitting function compares the difference of two image pixel values at two different locations with a threshold, i.e.,

$$\begin{aligned} s_i = \begin{cases} 1 &{} p(x_{1_i}, y_{1_i}) - p(x_{2_i}, y_{2_i}) < \theta _i\\ 0 &{} \text {otherwise} \end{cases} \end{aligned}$$
(7)

where p is the image patch; the patch size is fixed to 32 in our experiments. The \((x_{1_i}, y_{1_i}, x_{2_i}, y_{2_i}, \theta _i)\) are fern parameters learned from the training data. Each fern splits the data space into \(2^8=256\) partitions. We use a Real-Boost algorithm to learn the cascade classifier. In each space partition, the classification score is computed as

$$\begin{aligned} \frac{1}{2} \log \left( \frac{\sum _{\{i\in piece \bigcap y_i = 1\}}w_i}{\sum _{\{i\in piece \bigcap y_i = 0\}}w_i}\right) , \end{aligned}$$
(8)

where the numerator and denominator are the sums of the weights of the positive and negative samples in the space partition, respectively.
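A compact sketch of how one fern evaluates a patch (Eq. 7) and scores it (Eq. 8); the variable names and the small \(\epsilon\) smoothing term are our additions:

```python
import numpy as np

def fern_index(patch, fern):
    """Map a patch to one of 2^8 = 256 partitions (Eq. 7).

    fern:  8 tuples (x1, y1, x2, y2, theta) of learned parameters;
    patch: 32 x 32 pixel array, indexed as patch[y, x].
    """
    idx = 0
    for i, (x1, y1, x2, y2, theta) in enumerate(fern):
        s_i = 1 if patch[y1, x1] - patch[y2, x2] < theta else 0
        idx |= s_i << i
    return idx

def fern_score(idx, w_pos, w_neg, eps=1e-9):
    """Real-Boost score of partition `idx` (Eq. 8).

    w_pos, w_neg: (256,) summed weights of positive/negative samples.
    """
    return 0.5 * np.log((w_pos[idx] + eps) / (w_neg[idx] + eps))
```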

The ROI Mask. After we obtain the candidate face regions, we group them according to their sizes, such that the maximum size in each group is twice the minimum size. Since the smallest face size that can be detected by the proposed DNN-based face detector is \(36\times 36\) pixels, the first group contains face sizes between 36 and 72 pixels, the second group contains face sizes between 72 and 144 pixels, and so on (as shown in Fig. 2).

Note that, starting from the second group, we need to down-sample the image such that the candidate face size is always between 36 and 72 pixels. Besides, in order to retain some background information, we double the side length of each candidate box, but the side length never exceeds the receptive field size (85) of the following DNN face detector. Finally, we set the ROI mask according to the sizes and positions of the candidate boxes in each group.
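To make the grouping concrete, here is a small sketch of the scale selection it implies (the helper name and return convention are ours):

```python
def group_and_scale(face_size, base=36):
    """Group index and down-sampling factor for a candidate face.

    Group g covers face sizes [36 * 2**g, 72 * 2**g); the image of
    group g is down-sampled by 2**g so faces land in 36-72 pixels.
    """
    g = 0
    while face_size >= base * 2 ** (g + 1):   # 72, 144, 288, ...
        g += 1
    return g, 1.0 / 2 ** g

# e.g. a 200-pixel face falls in group 2 (144-288 pixels) and its
# image is down-sampled by 1/4, leaving a 50-pixel face.
```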

We use this grouping strategy for two reasons. First, when a face almost fills the whole image, we do not have to process the full original resolution; the image is down-sampled to a rather small resolution, which effectively reduces the computational cost. Second, since the following DNN detector only needs to handle a two-fold scale variation, this induces a great advantage compared with the RPN in [20], which needs to handle all scale changes. This advantage allows us to use a relatively cheaper network for the DNN-based detection.

Besides, such a sparse pyramid structure only increases the computational cost by about \(33\,\%\) (\(\frac{1}{2^2}+\frac{1}{4^2}+\frac{1}{8^2}+\dots \approx \frac{1}{3}\)) compared with the cost at the base scale.

Details of the ROI Convolution. There are several ways to implement convolutions efficiently. Currently, the most popular method is to transform the convolution into a matrix multiplication. As described in [28] and implemented in Caffe [26], this is done by first reshaping the filter tensor into a matrix F with dimensions \(CK^2 \times N\), where C and N are the input and output channel numbers, and K is the filter width/height.

We then gather a data matrix by duplicating the original input data into a matrix D with dimensions \(WH \times CK^2\), where W and H are the output width and height. The computation is then performed with a single matrix multiplication to form an output matrix \(O=DF\) with dimensions \(WH \times N\). This matrix multiplication can be efficiently calculated with optimized linear algebra libraries such as BLAS.

Our main idea in ROI convolution is to calculate only the areas marked 1 (i.e., the ROI regions) while skipping the other regions. According to the ROI mask, we duplicate only the input patches whose centers are marked 1, so the input data become a matrix \(D'\) with dimensions \(M \times CK^2\), where M is the number of non-zero entries in the ROI mask. We can then use matrix multiplication to obtain the output \(O'=D'F\) with dimensions \(M \times N\). Finally, we put each row of \(O'\) into the corresponding position of the output channels. The computational complexity of ROI convolution is \(MCK^2N\), so the computational cost decreases linearly with the mask sparsity.
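The following NumPy sketch mirrors this gather-then-multiply scheme for a single image with stride 1 and odd K ("same" padding); it illustrates the data layout, not the optimized BLAS implementation:

```python
import numpy as np

def roi_conv(inp, filt, mask):
    """ROI convolution: compute outputs only where mask == 1.

    inp:  (C, H, W) input feature map
    filt: (N, C, K, K) filters, K odd
    mask: (H, W) binary ROI mask
    Returns an (N, H, W) output that is zero outside the ROI.
    """
    C, H, W = inp.shape
    N, _, K, _ = filt.shape
    pad = K // 2
    x = np.pad(inp, ((0, 0), (pad, pad), (pad, pad)))
    ys, xs = np.nonzero(mask)                    # M active positions
    # Data matrix D' of shape (M, C*K*K): one row per ROI position.
    D = np.stack([x[:, i:i + K, j:j + K].ravel()
                  for i, j in zip(ys, xs)])
    F = filt.reshape(N, -1).T                    # (C*K*K, N)
    O = D @ F                                    # (M, N), one matmul
    out = np.zeros((N, H, W), dtype=O.dtype)
    out[:, ys, xs] = O.T                         # scatter back
    return out
```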

Fig. 3. Illustration of the ROI convolution.

As illustrated in Fig. 3, we apply the ROI convolution only in the test phase, replacing all convolution layers with ROI convolution layers. After each max pooling, the size of the input is halved, so we also down-sample the ROI mask by half to keep the sizes matched. The original DNN detector runs at 50 FPS on a GPU and 10 FPS on a CPU for a VGA image. With ROI convolution, it speeds up to 30 FPS on a CPU with little accuracy loss.

4 Experiments

In this section, we experimentally validate the proposed method. We collected about 400 K face images from the web with various variations as positive training samples. These images are exclusive of the FDDB [29], AFW [8], and PASCAL [30] datasets. We labeled all faces with 5 facial points (two eye centers, nose tip, and two mouth corners). For the negative training samples, we use the COCO dataset [31], which has pixel-level annotations of various objects, including people. We therefore covered all person areas with random color blocks and ensured that no samples are drawn from those colored regions. We use more than 120 K images (including the 2014 training and validation data) for training. Some sample images are shown in Fig. 4.

We use GoogLeNet in both the RPN and RCNN networks. The network structure is similar to that in FaceNet [32], but we cut the number of convolution kernels in half for efficiency. Moreover, we only include two inception layers in the RPN (as shown in Table 1), and the input size of the RCNN network is 64.

Fig. 4. Illustration of our negative training samples. We covered all person areas with random color blocks in the COCO [31] dataset and ensured that no positive training samples are drawn from these regions. (Color figure online)

To avoid initialization problems and improve convergence speed, we first train the RPN from random initialization without the RCNN. Once the predicted facial landmarks are largely correct, we add the RCNN and perform end-to-end training of the whole network. For evaluation, we use three challenging public datasets, i.e., FDDB [29], AFW [8], and PASCAL faces [30], all of which are widely used face detection benchmarks. We employ the Intersection over Union (IoU) as the evaluation metric and fix the IoU threshold to 0.5.

4.1 Learning Canonical Position

In this part, we verify the ability of the Supervised Transformer layer to find the best canonical positions. We intentionally initialize it with three inappropriate canonical position settings, i.e., too large, too small, and with an offset. We then perform end-to-end training and record the canonical point positions after 10 K, 100 K, and 500 K iterations.

As shown in Fig. 5, each row shows the movement of the canonical positions for one kind of initialization, with the corresponding image warp result placed beside the canonical points. We observe that all three initializations eventually converge to very similar position settings after 500 K iterations. This demonstrates that the proposed Supervised Transformer module is robust to initialization: it automatically adjusts the canonical positions such that the rectified image is more suitable for face/non-face classification.

4.2 Ablative Evaluation of Various Network Components

As discussed in Sect. 2, our end-to-end cascade network consists of four notable parts, i.e., the multi-task RPN, the Supervised Transformer layer, the multi-granularity feature combination, and Non-top K suppression. The former three affect the training network structure, while the last one only appears in the test phase.

Fig. 5. Results of learning canonical positions.

Table 2. Evaluation of the effects of the three parts of the training architecture.

To study the effect of each part separately, we conduct an ablative study by removing one or more parts from our network structure and evaluating the new network with the same training and testing data. Removing the multi-task RPN means that we directly regress the face rectangle, similar to [20], instead of the facial points. Without the Supervised Transformer layer, we simply replace it with a standard similarity transformation that is not trained with back propagation. Without the feature combination component, we directly use the output of the RCNN features to make the final decision. When the multi-task RPN is removed, there are no facial points for either the Supervised Transformer or a conventional similarity transformation; in this case, we directly resize the face patch to \(64 \times 64\) and feed it into the RCNN network.

There are 6 different ablative settings in total. We perform end-to-end training with the same training samples for all settings, and evaluate the recall rate on the FDDB dataset at 10 false alarms. We manually reviewed the face detection results and added 67 unlabeled faces to the FDDB annotations to make sure all the counted false alarms are real. As shown in Table 2, the multi-task RPN, the Supervised Transformer, and the feature combination bring about \(1\,\%\), \(1\,\%\), and \(2\,\%\) recall improvement, respectively. Moreover, the three parts are complementary: removing any one of them causes a recall drop.

In the training phase, in order to increase the variation of training samples, we randomly select K positive/negative samples from each image for the RCNN network. In the test phase, however, we need to balance the recall rate against efficiency. Next, we compare the proposed Non-top K suppression with NMS in the testing phase.

Fig. 6. Comparison of NMS and Non-top K suppression.

Table 3. Various results demonstrating the effects of ROI convolution.

We present a sample visual result of the RPN, NMS, and Non-top K suppression in Fig. 6. We keep the same number of candidates for both NMS and Non-top K suppression (\(K = 3\) in the visual result). We find that NMS tends to include too many noisy, low-confidence candidates. We also compare the PR curves of using all candidates, NMS, and Non-top K suppression: our Non-top K suppression is very close to using all candidates, and achieves consistently better results than NMS under the same number of candidates.

4.3 The Effect of ROI Convolution

In this section, we validate the acceleration performance of the proposed ROI convolution. We train the cascade pre-filter with the same training data. By adjusting the classification threshold of the cascade pre-filter, we can obtain ROI masks of different areas and thus strike the right balance between speed and accuracy.

We conduct the experiments on the FDDB database. We resized all images to 1.5 times their original size, so that the resulting average photo resolution is approximately \(640\times 480\). We evaluate the ROI mask sparsity, the run-time speed of each part, and the recall rate at 10 false alarms under different pre-filter thresholds. We also compare with the standard network without ROI convolution. Non-top K (\(K=3\)) suppression is adopted in all settings to make the RCNN network more efficient.

Table 3 shows the average ROI mask sparsity, the testing speed of each part, and the recall rate of each setting. Comparing the second row with the fourth row confirms that the computational cost decreases linearly with the mask sparsity. The last two rows show the recall rate and the average test time of the different settings. The original DNN detector runs at 10 FPS on a CPU for a VGA image; with ROI convolution, it speeds up to 30 FPS on a CPU. That is, we achieve about a 3x speed-up with only a \(0.6\,\%\) drop in recall rate.

Fig. 7. Comparison with the state of the art on the FDDB [29], AFW [8] and PASCAL faces [30] datasets.

Fig. 8. Qualitative face detection results on (a) FDDB [29], (b) AFW [8], (c) PASCAL faces [30] datasets.

4.4 Comparing with State-of-the-art

We conduct face detection experiments on the three benchmark datasets. On the FDDB dataset, we compare with all published methods [8–10, 33–42]. We regress the annotation ellipses from the 5 facial points and ignore the 67 unlabeled faces to make sure all counted false alarms are real. On the AFW and PASCAL faces datasets, we compare with (1) deformable part based methods, e.g., the structure model [30] and the Tree Parts Model (TSM) [8]; (2) cascade-based methods, e.g., HeadHunter [4]; (3) commercial systems, e.g., face.com, Face++, and Picasa. We learn a global regression from the 5 facial points to face rectangles to match the annotation convention of each dataset, and use the toolbox from [4] for evaluation. Figure 7 shows that our method outperforms all previous methods by a considerable margin.

5 Conclusion and Future Work

In this paper, we proposed a new Supervised Transformer Network for face detection. Its superior performance on three challenging datasets shows its ability to learn the optimal canonical positions for distinguishing face/non-face patterns. We also introduced an ROI convolution scheme, which speeds up our detector 3x on the CPU with little recall drop. Our future work will explore how to enhance the ROI convolution so that it incurs no additional drop in recall.