Deep Learning - Object Detection

Vladimir Ivančev, Software Engineer


03.12.2020.


In the last article series, written by my colleague, we learned about image classification and how to implement it. Now we are going to take a look at object detection.

Whereas image classification returns a single class and probability for the whole image, object detection finds a list of objects in the image and returns a class, a probability and a bounding box for each of them.

Image classification and object detection

Approaches to object detection can be divided into two groups: the traditional way, which chains together a specific pipeline of methods, and the use of a dedicated detection framework.

Traditional Method

In this method we use a sliding window (left to right and top to bottom) to bound the object at different locations in the image. Then we use an image pyramid to be able to detect objects of different sizes. Finally, we use a pre-trained Convolutional Neural Network for classification on each step of the sliding window, as well as on every stage of the image pyramid.
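
The two loops described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 32x32 window, the step of 16 pixels and the pyramid scale of 1.5 are hypothetical choices, and the classifier that would be run on each crop is left out. The sketch works on image dimensions only and returns the candidate boxes in original-image coordinates.

```python
def sliding_window(width, height, window, step):
    """Yield (x, y) top-left corners, left to right and top to bottom."""
    win_w, win_h = window
    for y in range(0, height - win_h + 1, step):
        for x in range(0, width - win_w + 1, step):
            yield x, y


def pyramid_sizes(width, height, scale=1.5, min_size=(32, 32)):
    """Yield (width, height) of each pyramid level, largest first."""
    while width >= min_size[0] and height >= min_size[1]:
        yield width, height
        width, height = int(width / scale), int(height / scale)


def candidate_boxes(width, height, window=(32, 32), step=16, scale=1.5):
    """Yield boxes (x1, y1, x2, y2) mapped back to the original image.

    A fixed-size window on a shrunken pyramid level covers a larger
    area of the original image, which is how bigger objects are found.
    """
    for level_w, level_h in pyramid_sizes(width, height, scale, window):
        factor = width / level_w  # level coordinates -> original coordinates
        for x, y in sliding_window(level_w, level_h, window, step):
            yield (round(x * factor), round(y * factor),
                   round((x + window[0]) * factor),
                   round((y + window[1]) * factor))


if __name__ == "__main__":
    for box in candidate_boxes(64, 64):
        print(box)  # each box would be cropped and passed to the classifier
```

Even for a tiny 64x64 image this produces several candidate boxes, each of which needs its own classifier pass, which is the main reason the method is slow.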

Sliding window
Image pyramid

After consolidating possible duplicate detections, we have a basic object detector that is slow and not very precise. It has its uses, but the detection frameworks described next generally give better results.
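
The article does not say how the duplicates are consolidated. One common choice is non-maximum suppression (NMS), which keeps the highest-scoring box and drops any other box that overlaps it too much. The sketch below is a minimal greedy version; the function names and the 0.5 overlap threshold are assumptions made for illustration, and boxes are given as (x1, y1, x2, y2).

```python
def iou(a, b):
    """Overlap ratio of two boxes: intersection area / union area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not heavily overlap a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
    scores = [0.9, 0.8, 0.7]
    print(non_max_suppression(boxes, scores))
```

In this example the first two boxes cover almost the same area, so only the higher-scoring one survives, while the third box is far away and is kept.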

Framework

There are currently three main frameworks for object detection: Faster R-CNN, YOLO and SSD.

Faster R-CNN

This one is part of a series of frameworks that use region proposals for object detection. It is the "faster one" of the series because it improves on the speed of its predecessors (R-CNN and Fast R-CNN), which relied on selective search to find region proposals.

It uses a pre-trained CNN to generate a convolutional feature map, and that same feature map is fed to the Region Proposal Network (RPN), which finds the region proposals.

Anchor boxes, a predefined set of bounding boxes with different scales and aspect ratios, are used here as references when predicting object locations.
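
To make the idea of anchor boxes concrete, here is a small sketch that generates the anchors for a single location. The function name is hypothetical. The default scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) follow the configuration reported for the original Faster R-CNN, but they are only defaults and can be changed freely.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors (x1, y1, x2, y2) centered on the point (cx, cy).

    Each ratio is width / height. For a given scale, every anchor has
    the same area (scale * scale), only its shape changes.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale * math.sqrt(ratio)
            h = scale / math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


if __name__ == "__main__":
    for anchor in make_anchors(100, 100):
        print([round(v, 1) for v in anchor])
```

With three scales and three ratios this gives nine anchors per location. The network then predicts, for each anchor, how likely it is to contain an object and how the box should be shifted and resized to fit it.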

Of the three frameworks covered here, it is generally regarded as the most accurate, but it is also the slowest.

Architecture of Faster R-CNN
Visualization of Anchor Boxes

YOLO (You Only Look Once)

Unlike the R-CNN family, YOLO uses a single convolutional network to predict both the bounding boxes of objects and their classes. It works by splitting the image into a grid and predicting bounding boxes and class probabilities for each grid cell.
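
The grid idea can be illustrated with a short sketch. In the original YOLO formulation, the grid cell that contains the center of an object is the one responsible for detecting it. The function below only computes that cell and is not part of any YOLO implementation; its name is hypothetical, and the 7x7 grid and 448x448 input in the example are assumptions borrowed from the first YOLO version.

```python
def responsible_cell(box, image_size, grid_size=7):
    """Return (row, col) of the grid cell containing the box center.

    box is (x1, y1, x2, y2) in pixels, image_size is (width, height).
    """
    x1, y1, x2, y2 = box
    img_w, img_h = image_size
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Clamp so a center on the right or bottom edge stays inside the grid.
    col = min(int(cx / img_w * grid_size), grid_size - 1)
    row = min(int(cy / img_h * grid_size), grid_size - 1)
    return row, col


if __name__ == "__main__":
    print(responsible_cell((100, 200, 200, 300), (448, 448)))
```

Because each cell can only account for a limited number of objects, several small objects whose centers fall into the same cell are hard to separate.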

It is much faster than Faster R-CNN, but it is less accurate and has trouble with small objects.

SSDs (Single Shot Detectors)

Originally developed by a team that included Google researchers, this framework seeks a balance between the accuracy of Faster R-CNN and the speed of YOLO. Similar to YOLO, it runs a convolutional network on the original image only once, but it also uses anchor boxes, as seen in Faster R-CNN.

Architectures of SSD (top) and YOLO (bottom)

Calculating accuracy

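When using object detectors, we need a way to determine whether our predictions are accurate. This is done by calculating Intersection over Union (IoU): the area where the predicted box and the ground truth box overlap, divided by the area covered by both boxes together.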

This allows us to accurately match ground truth annotations with the predictions and calculate the accuracy of the model on the predicted objects.

Since a predicted box will almost never match the ground truth coordinates exactly, IoU is a good indicator of how well an object was located. It ranges from 0 (no overlap) to 1 (perfect match), and values above 0.5 are generally considered good.
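
As a minimal sketch, IoU for two axis-aligned boxes can be computed as follows. The function names are hypothetical, boxes are assumed to be given as (x1, y1, x2, y2), and the 0.5 threshold is the rule of thumb mentioned above rather than a fixed standard.

```python
def intersection_over_union(pred, truth):
    """Return the IoU of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if there is one.
    ix1, iy1 = max(pred[0], truth[0]), max(pred[1], truth[1])
    ix2, iy2 = min(pred[2], truth[2]), min(pred[3], truth[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    pred_area = (pred[2] - pred[0]) * (pred[3] - pred[1])
    truth_area = (truth[2] - truth[0]) * (truth[3] - truth[1])
    union = pred_area + truth_area - inter
    return inter / union if union > 0 else 0.0


def is_good_detection(pred, truth, threshold=0.5):
    """Treat a prediction as correct when its IoU exceeds the threshold."""
    return intersection_over_union(pred, truth) > threshold


if __name__ == "__main__":
    pred, truth = (50, 50, 150, 150), (60, 60, 160, 160)
    print(intersection_over_union(pred, truth))
    print(is_good_detection(pred, truth))
```

Here the prediction is shifted by 10 pixels in each direction relative to a 100x100 ground truth box. The overlap is 90 x 90 = 8100 and the union is 10000 + 10000 - 8100 = 11900, which gives an IoU of about 0.68, so the detection counts as good.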

Bounding boxes
Intersection over Union visualization

Summary

The goal of this post was to introduce the concept of object detection, the methods currently used to implement it, and the way to measure the accuracy of a model. These are the building blocks for further work with Deep Learning in computer vision and can be useful tools when tackling more advanced problems.

It is interesting to see how people can work together to improve on old methods and develop them in different ways to tackle certain requirements. This can give us the motivation to continue and create something of our own that might influence others in the future.
