A multi-stage approach known as R-CNN was proposed by Girshick et al. (2014)
that performs classification using the regions paradigm. The system has three
components: a region proposal component, a CNN for extracting feature vectors,
and a Support Vector Machine (SVM) as the classifier. During training, the CNN
is first trained with supervision on a large dataset (ILSVRC) and then
fine-tuned on a small dataset (PASCAL). The fine-tuned CNN is then used to
extract feature vectors for both positive regions from the ground truth and
negative regions from the background, and these vectors are saved to disk for
the next training stage. Next, the feature vectors are used to train the SVM
classifier. At test time, an external region proposal component, Selective
Search [28], generates category-independent regions that might contain objects,
which are then warped to a fixed size. The CNN extractor converts these
candidate regions into feature vectors, which the SVM uses for domain-specific
classification. Finally, a linear regression model refines the bounding boxes,
and greedy non-maximum suppression eliminates duplicate detections based on
their Intersection-over-Union (IoU) overlap with a higher-scoring region.
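The IoU measure and the greedy non-maximum suppression step can be sketched in
a few lines. This is a minimal illustration, not the original implementation:
the corner box format (x1, y1, x2, y2), the function names, and the 0.5
threshold are assumptions made for the example.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: List[Box], scores: List[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of the kept boxes, highest score first.

    The threshold value is an assumption chosen for illustration.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if no already-kept (higher-scoring) box overlaps it
        # by more than the threshold.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)  # kept == [0, 2]
```

Here the second box overlaps the first with an IoU of about 0.82, so it is
treated as a duplicate of the higher-scoring detection and dropped.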
Compared to other models of its time, R-CNN achieved strong detection
accuracy. But its complex multi-stage pipeline also brings many drawbacks. The
CNN component acts as the classifier, yet region prediction still depends on
the external region proposal method, which is quite slow and drags the whole
system down. Moreover, because each component is trained individually, the CNN
part is hard to improve: while the SVM classifier is being trained, the CNN
cannot be updated.
1: Spatial Pyramid Pooling Network (SPP-Net) [13]
SPP-Net was proposed to speed up the convolution computation and to remove the
fixed-size input constraint. Before SPP-Net, CNNs required fixed-size input
images. A CNN generally consists of two parts: convolutional layers that
output feature maps, and fully-connected (FC) layers or an SVM for
classification. The convolutional part is made up of convolutional layers
alternating with sliding-window pooling layers. Convolutional layers can in
fact process input images of any size, whereas FC layers or an SVM need
fixed-size input. To meet this requirement, the common practice is to crop or
warp region proposals before feeding them to the convolutional layers. Both
solutions hurt CNN performance: a cropped region may contain only part of the
object and cause recognition to fail, while warping loses the original aspect
ratio, which matters for ratio-sensitive objects.
SPP-Net made the first progress toward removing the fixed-size input
constraint. It replaced the final sliding-window pooling layer of the
convolutional part with a spatial pyramid pooling (SPP) layer. Spatial pyramid
pooling [18] can be seen as a spatial version of Bag of Words (BoW) that
extracts features at multiple scales. The number of bins is fixed to match the
input size of the FC layers or SVM rather than depending on the size of the
input image. Thus, SPP-Net, and the network as a whole, can accept images of
arbitrary size.
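The fixed-length property can be sketched as follows. This is a simplified
illustration on a single-channel feature map stored as nested lists; the
pyramid levels (1, 2, 4), the max-pooling choice, and the bin-boundary
rounding are assumptions for the example, not details taken from the text.

```python
from typing import List, Sequence


def spp_single_channel(fmap: List[List[float]],
                       levels: Sequence[int] = (1, 2, 4)) -> List[float]:
    """Max-pool a 2-D feature map into a fixed number of bins.

    Each level n splits the map into an n x n grid, so the output length is
    sum(n * n for n in levels) regardless of the map's height and width.
    """
    h, w = len(fmap), len(fmap[0])
    out: List[float] = []
    for n in levels:
        for i in range(n):
            r0 = (i * h) // n                 # floor of the bin's start row
            r1 = ((i + 1) * h + n - 1) // n   # ceil of the bin's end row
            for j in range(n):
                c0 = (j * w) // n
                c1 = ((j + 1) * w + n - 1) // n
                out.append(max(fmap[r][c]
                               for r in range(r0, r1)
                               for c in range(c0, c1)))
    return out


# Two feature maps of different sizes produce vectors of the same length.
small = [[r * 8 + c for c in range(8)] for r in range(6)]      # 6 x 8
large = [[r * 10 + c for c in range(10)] for r in range(13)]   # 13 x 10
vec_small = spp_single_channel(small)
vec_large = spp_single_channel(large)
# len(vec_small) == len(vec_large) == 21  (1 + 4 + 16 bins)
```

Because the bin count depends only on the pyramid levels, the layers that
follow always receive a vector of the same length.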
Another benefit of spatial pyramid pooling is that SPP-Net shares the
convolution computation across all region proposals. As discussed previously,
earlier CNNs first crop or warp the regions and then pass each of them through
the convolutional layers to extract features. The duplicate computation on
overlapping regions wastes an excessive amount of time. SPP-Net instead passes
the whole image through the convolutional layers once to create a feature map,
and uses a projection function to project all regions onto the last
convolutional layer. Feature extraction for each region is therefore performed
only on the feature map. In general, the idea behind SPP-Net could speed up
the CNN-based image classification methods of that period.
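A projection of this kind can be sketched as below. This is a simplified
version under stated assumptions: the total stride of 16 between image pixels
and feature-map cells, the function name, and the floor/ceil rounding are
illustrative choices, and the exact rounding used by SPP-Net may differ.

```python
import math
from typing import Tuple


def project_region(box: Tuple[int, int, int, int],
                   fmap_size: Tuple[int, int],
                   stride: int = 16) -> Tuple[int, int, int, int]:
    """Map an image-space box (x1, y1, x2, y2) to feature-map cell indices.

    Returns (fx1, fy1, fx2, fy2) with exclusive end indices, clamped to the
    feature map and covering at least one cell. The stride is an assumed
    value standing in for the network's total downsampling factor.
    """
    x1, y1, x2, y2 = box
    fw, fh = fmap_size  # feature-map width and height in cells
    fx1 = min(max(x1 // stride, 0), fw - 1)
    fy1 = min(max(y1 // stride, 0), fh - 1)
    fx2 = min(max(math.ceil(x2 / stride), fx1 + 1), fw)
    fy2 = min(max(math.ceil(y2 / stride), fy1 + 1), fh)
    return fx1, fy1, fx2, fy2


# The image is convolved once; each region then only indexes the shared map.
window = project_region((32, 48, 200, 310), fmap_size=(40, 30))
# window == (2, 3, 13, 20)
```

The cost of the convolutional layers is paid once per image, and each region
proposal only selects a window of the resulting feature map.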
SPP-Net improved both the detection accuracy and the speed of CNNs. But like
R-CNN, it has some drawbacks as well. Region proposals still depend on
external methods. The multi-stage structure requires the convolutional layers
and the classifier to be trained separately. Because the loss error is not
back-propagated through the SPP layer, the convolutional layers still cannot
be updated.
2: Fast Region-based Convolutional Network (Fast R-CNN) [11]
Fast R-CNN was developed to realize end-to-end training and testing. It can be
regarded as an extension of R-CNN and SPP-Net. Like SPP-Net, it swaps the
order of region feature extraction and the pass through the CNN so that
computation is shared, and it also revises the last pooling layer to handle
input images of any size. The difference is the use of a region of interest
(RoI) pooling layer. This trick matters for training because it allows the
convolutional layers to be updated. In addition, Fast R-CNN uses a multi-task
loss on every labeled RoI, combining a class score loss with a bounding box
loss.
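A multi-task loss of this kind can be sketched for a single RoI as follows.
The text only says that a class score loss and a bounding box loss are
combined; the log loss, the smooth L1 form, the weight `lam`, and the
convention that class 0 is background are assumptions following the usual
description of Fast R-CNN.

```python
import math
from typing import Sequence


def smooth_l1(x: float) -> float:
    """Quadratic near zero, linear elsewhere (assumed box-loss form)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multi_task_loss(class_probs: Sequence[float], true_class: int,
                    pred_box: Sequence[float], target_box: Sequence[float],
                    lam: float = 1.0) -> float:
    """Classification loss plus weighted box-regression loss for one RoI."""
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:
        # Assumed convention: class 0 is background and has no box target.
        return cls_loss
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(pred_box, target_box))
    return cls_loss + lam * loc_loss


loss = multi_task_loss(
    class_probs=[0.1, 0.7, 0.2], true_class=1,
    pred_box=[0.1, 0.2, 0.0, 2.0], target_box=[0.0, 0.0, 0.0, 0.0],
)
```

Because both terms are computed from the same forward pass, a single backward
pass can update the classifier, the box regressor, and the shared
convolutional layers together.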
In practice, the results show that these innovations reduce training and
testing time while also improving detection accuracy. Meanwhile, the network's
fast running time exposes the slow region proposal step as the bottleneck.
3: Faster R-CNN [21]
Faster R-CNN mainly solved the problem of slow region proposal. Rather than
using external proposal methods, it introduced a Region Proposal Network (RPN)
to perform region proposal, which shares the convolutional computation of the
image with the detection network. An RPN is basically a class-agnostic Fast
R-CNN. An n x n window slides over the feature map, and at each location it
gives out nine region proposals of various scales and aspect ratios. All
region proposals are fed into the RPN to predict an objectness score along
with their positions. Afterward, the high-scoring output regions of the RPN
become the input of the second stage, a Fast R-CNN, for class-specific
classification and further bounding box refinement. Classification is now
carried out inside the network, and training is end-to-end. As a result,
Faster R-CNN achieved high detection accuracy on various datasets and became
the foundation of many later detection methods. Its detection speed is 10 fps,
which was considered the fastest compared with other methods.
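The nine proposals per location can be illustrated by generating reference
boxes from three scales and three aspect ratios. The specific scale and ratio
values, the function name, and the corner box format are assumptions for this
sketch; the text only states that nine proposals of various scales and ratios
are produced.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def anchors_at(cx: float, cy: float,
               scales: Sequence[float] = (128, 256, 512),
               ratios: Sequence[float] = (0.5, 1.0, 2.0)) -> List[Box]:
    """Return len(scales) * len(ratios) boxes centred on (cx, cy).

    Each ratio is height / width, and every box at a given scale keeps an
    area of roughly scale * scale. Scale and ratio values are assumed.
    """
    boxes: List[Box] = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


anchors = anchors_at(200.0, 200.0)
# len(anchors) == 9; anchors[1] is the square 128 x 128 box
# (136.0, 136.0, 264.0, 264.0)
```

The RPN then scores each of these reference boxes for objectness and regresses
offsets that adjust its position and size.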
References of Region-based Convolutional Neural Network (R-CNN)
UIJLINGS, J. R., VAN DE SANDE, K. E., GEVERS, T., AND SMEULDERS, A. W.
Selective search for object recognition. International journal of computer
vision 104, 2 (2013), 154–171.
REN, S., HE, K., GIRSHICK, R., AND SUN, J. Faster R-CNN:
Towards real-time object detection with region proposal networks. In Advances
in neural information processing systems (2015), pp. 91–99.
GIRSHICK, R. Fast R-CNN. In Proceedings of the IEEE
International Conference on Computer Vision (2015), pp. 1440–1448.
GIRSHICK, R., DONAHUE, J., DARRELL, T., AND MALIK, J. Rich feature
hierarchies for accurate object detection and semantic
segmentation. In Proceedings of the IEEE conference on computer vision and
pattern recognition (2014), pp. 580–587.
HE, K., ZHANG, X., REN, S., AND SUN, J. Spatial pyramid pooling
in deep convolutional networks for visual recognition. In European Conference
on Computer Vision (2014), Springer, pp. 346–361.