Problems in computer science are easy to both underestimate and overestimate: tasks that seem easy can turn out to be nearly impossible to execute, while tasks that look hopelessly complex sometimes yield to simple solutions. This has been the situation facing computer vision since its inception.
Many tasks that are very simple for humans are extremely difficult for computers, and detecting faces in pictures is one of them. One of the main reasons face detection is so daunting for computers is that we cannot clearly define how humans do it: people recognize faces more accurately than a computer, yet they cannot explain why or how they recognize a particular face.
Face detection is the first key step in many systems for recognizing and analyzing faces. Earlier approaches were based on recognizing particular facial features such as wrinkles, ears, and noses, and then checking how many of those characteristics and their expected relationships appeared in a given sample (for example, that the two ears are roughly level). However, these approaches were not robust enough to detect faces in pictures with a high level of accuracy.
Today, face detection takes a more modern approach that uses large amounts of computing power to overcome the inefficiencies of the classical methods. Face detection with deep learning gives considerably better precision, and a deep model's parameters can be adjusted to cope with distracting and undesirable variables found in pictures (low-quality images, objects covering part of a person's face, and so on).
Face detection is the computer vision problem of locating and recognizing one or more faces in a picture. Detection means finding the coordinates of each face in the image, while localization means delimiting that region with a bounding box (a bordered frame) drawn around the face.
Technically, face detection comes down to determining the location and size of every human face in a given image. The goal is to confirm whether a face exists and then assign a bounding box to each detected face, much like detecting any other object in a picture, while ignoring everything else in the image, such as trees and buildings. In other words, face detection is a special case of object detection in which the task is to find the location and size of all objects in the picture that belong to a single class, i.e., the face.
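To make the idea of a bounding box concrete, here is a minimal sketch, assuming OpenCV is installed; the file name and the box coordinates are hypothetical placeholders standing in for the output of a real detector.

```python
import cv2

# A detection is usually reported as (x, y, width, height) plus a confidence
# score; the values below are hypothetical placeholders.
image = cv2.imread("photo.jpg")          # hypothetical input image
x, y, w, h, confidence = 120, 80, 96, 96, 0.98

# Draw the "bordered frame" (bounding box) around the detected face.
cv2.rectangle(image, (x, y), (x + w, y + h), color=(0, 255, 0), thickness=2)
cv2.putText(image, f"face {confidence:.2f}", (x, y - 5),
            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
cv2.imwrite("photo_with_box.jpg", image)
```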
Face detection is the first step in facial recognition pipelines, which also include aligning the detected faces, recognizing whose faces they are, verifying faces, and identifying the individual facial features of a particular face. It is important to note that facial recognition software is used in many industries, for example as a tool for indexing video, for video conferencing, and by governments for video surveillance of mass gatherings and intelligence gathering.
Detecting the human face remains a difficult problem for experts in the computer vision field. The main reason is that the human face is a dynamic object with a high degree of variability: every face is different.
Face detection has made great strides in the past few years. However, building effective face detection software remains a difficult challenge, especially for images containing many small faces, images with complex backgrounds, and images of varying resolution, particularly low-resolution pictures or those with unusual contrast or lighting. Even in perfectly ordinary pictures, faces can be difficult to detect because of different skin tones, the distance from which the picture was taken, or an unusual orientation of the camera.
The modern approach corrects the deficiencies of the classical approach by building on the classical model of face detection. It relies on deep learning methods, which have grown in popularity over the past few years thanks to major advances in convolutional neural networks and the wider availability of powerful graphics cards.
MTCNN (Multi-task Cascaded Convolutional Networks) is a face detection model that can achieve excellent results in real time. The model consists of three convolutional neural networks, where the output of each network is filtered and refined before being passed as input to the next one. Before an image is sent through the three networks, it is resized into several copies at different scales. That process is known as image pyramid creation, and it allows faces of different sizes to be found in the frame.
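As an illustration of image pyramid creation, the sketch below repeatedly shrinks an image by a fixed scale factor, assuming OpenCV; the factor 0.709 and the minimum size of 12 pixels are values commonly used in MTCNN implementations, and the file name is a placeholder.

```python
import cv2

def build_image_pyramid(image, scale_factor=0.709, min_size=12):
    """Return progressively smaller copies of `image` until the shorter
    side would drop below `min_size` (the P-Net input size)."""
    pyramid = [image]
    h, w = image.shape[:2]
    while True:
        h, w = int(h * scale_factor), int(w * scale_factor)
        if min(h, w) < min_size:
            break
        pyramid.append(cv2.resize(image, (w, h)))
    return pyramid

# Every level of the pyramid is scanned by the first network, so faces of
# different sizes all appear at roughly the size the network was trained on.
image = cv2.imread("photo.jpg")          # hypothetical input image
levels = build_image_pyramid(image)
print([level.shape[:2] for level in levels])
```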
The first network in the MTCNN model proposes candidate face regions in the image. It is known as the Proposal Network (P-Net), because its function is to work out where faces might be located in the picture. P-Net is a fully convolutional neural network.
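The sketch below, written in PyTorch, loosely follows the layer sizes described in the MTCNN paper to show what "fully convolutional" means in practice; exact kernel, pooling, and activation choices vary between implementations, so treat it as an illustration rather than a faithful reimplementation.

```python
import torch
import torch.nn as nn

class PNet(nn.Module):
    """Sketch of the Proposal Network: a small fully convolutional net that
    slides over each pyramid level and outputs, for every position, a
    face/no-face score and a bounding-box regression offset."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 10, kernel_size=3), nn.PReLU(10),
            nn.MaxPool2d(2, 2, ceil_mode=True),
            nn.Conv2d(10, 16, kernel_size=3), nn.PReLU(16),
            nn.Conv2d(16, 32, kernel_size=3), nn.PReLU(32),
        )
        self.classifier = nn.Conv2d(32, 2, kernel_size=1)       # face vs. background
        self.bbox_regressor = nn.Conv2d(32, 4, kernel_size=1)   # box offsets

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x), self.bbox_regressor(x)

# Because the network is fully convolutional, it accepts any image size,
# not just the small fixed-size patches it was trained on.
scores, boxes = PNet()(torch.randn(1, 3, 120, 160))
print(scores.shape, boxes.shape)
```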
The second network is known as the Refine Network (R-Net). Its task is to further filter out false candidates from the regions proposed by P-Net.
The third network is called the Output Network (O-Net). Its function is the same as the second network's, with the added task of detecting the five key facial landmarks (the two eyes, the tip of the nose, and the left and right corners of the mouth).
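To see the whole cascade in action, the sketch below assumes the open-source `mtcnn` Python package (installable with `pip install mtcnn`), which bundles P-Net, R-Net, and O-Net behind a single detector; the image file name is a placeholder.

```python
import cv2
from mtcnn import MTCNN   # pip install mtcnn

detector = MTCNN()                                  # builds P-Net, R-Net and O-Net
image = cv2.cvtColor(cv2.imread("photo.jpg"),       # hypothetical input image
                     cv2.COLOR_BGR2RGB)             # the detector expects RGB

for face in detector.detect_faces(image):
    x, y, w, h = face["box"]                        # bounding box from the cascade
    print("confidence:", face["confidence"])
    print("landmarks:", face["keypoints"])          # eyes, nose, mouth corners from O-Net
```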
The MTCNN model came out in 2016 and still produces great results. Its main advantages over classical models are its precision and adaptability, and it remains one of the most accurate models for detecting faces in real time. That precision comes from the cascade of networks the data passes through before the model makes a prediction.
Inspired by the rapid advances in deep learning based detection, MTCNN among them, researchers started to experiment with the idea of detecting faces at the speed of the classical methods but with the precision and accuracy of the modern ones. From these experiments, RetinaFace was born. RetinaFace drew on MTCNN and the Supervised Transformer Network (STN), models that can detect faces and locate the five key facial points simultaneously.
RetinaFace currently achieves the best results on the hard subset of WIDER FACE, a standard benchmark in face detection.
The model consists of three main components: a feature pyramid network that extracts features from the image at several scales, a context module that enlarges the receptive field around each candidate face, and a multi-task loss that jointly predicts the face score, the bounding box, and the five facial landmarks for every face.
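For readers who want to try RetinaFace, the sketch below assumes one of the open-source implementations published on PyPI as `retina-face`; the exact return format may differ between versions and other implementations, and the image file name is a placeholder.

```python
from retinaface import RetinaFace   # pip install retina-face (one of several implementations)

# Detect every face in a (hypothetical) image; each entry carries the
# detection score, the bounding box and the five facial landmarks.
faces = RetinaFace.detect_faces("photo.jpg")

for name, face in faces.items():
    print(name, face["score"])
    print("  box:", face["facial_area"])       # [x1, y1, x2, y2]
    print("  landmarks:", face["landmarks"])   # eyes, nose, mouth corners
```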
In this blog we covered face detection as a problem in the field of computer vision and the application of deep learning and neural networks to solving it. We also explained how the modern approach solves face detection through two deep learning models, MTCNN and RetinaFace, and how these models can achieve highly accurate results.