In the last article series, written by my colleague, we learned about image classification and its implementation. Now we are going to take a look at object detection.
Whereas image classification returns a single class and probability for the whole image, object detection finds a list of objects in the image and returns a class, a probability and a bounding box for each of them.
Object detection methods can be divided into two groups: the traditional way, which chains together a specific pipeline of methods, and the other way, which uses a framework built for detection.
In the traditional method we slide a fixed-size window over the image (left to right and top to bottom) to localize objects at different positions. We then use an image pyramid, a series of progressively downscaled copies of the image, to be able to detect objects of different sizes. Finally, we run a pre-trained Convolutional Neural Network as a classifier on each step of the sliding window, at every level of the image pyramid.
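The sliding window and image pyramid described above can be sketched roughly as follows. This is only an illustration under assumed values: the window size, step, scale factor and the function names are hypothetical choices, and the classification step is left out (each returned box would be cropped and passed to the pre-trained CNN).

```python
def pyramid_sizes(width, height, scale=1.5, min_size=(64, 64)):
    """Yield the (width, height) of each pyramid level, largest first."""
    while width >= min_size[0] and height >= min_size[1]:
        yield width, height
        width, height = int(width / scale), int(height / scale)


def sliding_windows(width, height, window=(64, 64), step=32):
    """Yield (x, y, w, h) windows, left to right and top to bottom."""
    win_w, win_h = window
    for y in range(0, height - win_h + 1, step):
        for x in range(0, width - win_w + 1, step):
            yield x, y, win_w, win_h


def candidate_boxes(width, height, window=(64, 64), step=32, scale=1.5):
    """Return (x1, y1, x2, y2) boxes in original-image coordinates."""
    boxes = []
    for level_w, level_h in pyramid_sizes(width, height, scale, window):
        # How much this level was shrunk relative to the original image.
        factor = width / level_w
        for x, y, w, h in sliding_windows(level_w, level_h, window, step):
            boxes.append((round(x * factor), round(y * factor),
                          round((x + w) * factor), round((y + h) * factor)))
    return boxes


boxes = candidate_boxes(128, 128)
print(len(boxes), "candidate regions to classify")
```

Because the window size stays fixed while the image shrinks, windows on smaller pyramid levels cover larger regions of the original image, which is how differently sized objects are reached.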
After consolidating possible duplicate detections of the same object, we have a basic, slow and not very precise method for object detection. It has its uses, but the frameworks described next are used when more accurate results are needed.
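The article does not say how duplicates are consolidated. One common choice is non-maximum suppression, sketched below as an assumption rather than as the method used here: the highest-scoring detection is kept and any remaining detection that overlaps it too much is dropped. The function names and the 0.5 threshold are illustrative.

```python
def _overlap(a, b):
    """Overlap ratio (intersection over union) of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, overlap_threshold=0.5):
    """Keep the best-scoring box of each group of overlapping detections.

    detections: list of (x1, y1, x2, y2, score) tuples.
    """
    remaining = sorted(detections, key=lambda d: d[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Drop every detection that overlaps the kept one too strongly.
        remaining = [d for d in remaining
                     if _overlap(best[:4], d[:4]) < overlap_threshold]
    return kept


detections = [
    (10, 10, 50, 50, 0.9),      # strong detection
    (12, 12, 52, 52, 0.8),      # near-duplicate of the first
    (100, 100, 140, 140, 0.7),  # a different object
]
print(non_max_suppression(detections))
```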
There are currently three main frameworks for object detection: Faster R-CNN, YOLO and SSD.
Faster R-CNN is part of a series of frameworks that use region proposals for object detection. It is the “faster one” of the series, improving on the speed of its predecessors (R-CNN and Fast R-CNN), which used selective search to find region proposals.
It uses a pre-trained CNN to generate a convolutional feature map, which is shared with the Region Proposal Network (RPN) that finds the region proposals.
Anchor boxes, a predefined set of bounding boxes with different scales and aspect ratios, are used here as references when predicting object locations.
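To make the idea of anchor boxes more concrete, here is a minimal sketch that builds a set of anchors around one position. The three scales and three aspect ratios are values commonly cited for Faster R-CNN and are used here as assumptions; `make_anchors` is a hypothetical helper, not part of any framework.

```python
import math


def make_anchors(center_x, center_y,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centred on one location.

    Each scale s gives boxes with area s * s; each ratio is width / height.
    """
    anchors = []
    for scale in scales:
        area = scale * scale
        for ratio in ratios:
            width = math.sqrt(area * ratio)
            height = area / width
            anchors.append((center_x - width / 2, center_y - height / 2,
                            center_x + width / 2, center_y + height / 2))
    return anchors


for anchor in make_anchors(300, 300):
    print([round(v, 1) for v in anchor])
```

The network then predicts, for every anchor, whether it contains an object and how the anchor should be shifted and resized to fit it, instead of predicting box coordinates from scratch.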
Of the three frameworks, it is generally regarded as the most accurate, but it is also the slowest.
Unlike the R-CNN family, YOLO (You Only Look Once) uses a single convolutional network to predict both the bounding boxes of objects and their classes. It works by splitting the image into a grid and predicting bounding boxes and class probabilities for each grid cell.
It is much faster than the previous framework, but it has rather low accuracy and struggles with smaller objects.
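The grid idea behind YOLO can be illustrated with a small sketch. In the original YOLO formulation, the grid cell that contains the centre of an object is the one responsible for predicting it. The 7 x 7 grid and the 448 x 448 image size below are assumed example values, and `responsible_cell` is a hypothetical helper.

```python
def responsible_cell(box, image_w, image_h, grid=7):
    """Return the (row, col) of the grid cell containing the box centre.

    box is (x1, y1, x2, y2) in pixel coordinates.
    """
    x1, y1, x2, y2 = box
    center_x = (x1 + x2) / 2
    center_y = (y1 + y2) / 2
    # Clamp so a centre lying exactly on the right or bottom edge
    # still falls into the last cell.
    col = min(int(center_x / image_w * grid), grid - 1)
    row = min(int(center_y / image_h * grid), grid - 1)
    return row, col


print(responsible_cell((100, 200, 200, 300), 448, 448))
```

Since each cell can only predict a limited number of objects, several small objects whose centres fall into the same cell compete with each other, which is one reason small objects are difficult for this approach.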
SSD (Single Shot MultiBox Detector), originally developed by Google engineers, seeks a balance between the two previous frameworks. Like YOLO, it runs a convolutional network on the original image only once, but it also uses anchor boxes as seen in Faster R-CNN.
When using object detectors, we need a way to determine whether our predictions are accurate. This is done by calculating the Intersection over Union (IoU): the area of overlap between the predicted bounding box and the ground truth bounding box, divided by the area of their union.
This allows us to match ground truth annotations with the predictions and to calculate the accuracy of the model on the predicted objects.
Since predicted box coordinates rarely match the ground truth perfectly, the IoU value is a good indicator of the accuracy of object detection (values above 0.5 are generally considered good).
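As a minimal sketch, IoU for two axis-aligned boxes can be computed as shown below. Boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates; the function name is illustrative.

```python
def intersection_over_union(box_a, box_b):
    """Return the IoU of two (x1, y1, x2, y2) boxes as a value in [0, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if disjoint).
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


ground_truth = (50, 50, 150, 150)
prediction = (60, 60, 160, 160)
score = intersection_over_union(ground_truth, prediction)
print(f"IoU = {score:.3f}", "-> good match" if score > 0.5 else "-> poor match")
```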
Summary
The goal of this post was to introduce the concept of object detection, the methods currently used to implement it, and a way to measure the accuracy of our model. These are basic building blocks for further work with Deep Learning in computer vision and can be useful tools when tackling more advanced problems.
It is interesting to see how people can work together to improve on old methods and develop them in different ways to tackle certain requirements. This can give us the motivation to continue and create something of our own that might influence others in the future.