Having covered classification and detection, we now turn to another computer vision task: semantic segmentation.
Semantic segmentation is the process of assigning a semantic label to each pixel of an image. The labels represent the categories, or classes, to which the labeled objects or regions belong. The result is usually a matrix with the same height and width as the image, in which each element holds an integer that denotes the semantic class of the corresponding pixel.
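To make the output format concrete, here is a minimal NumPy sketch. It assumes a model that produces one score per class for every pixel; the class names, the toy scores, and the helper `scores_to_label_map` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical class indices, used only for this illustration.
CLASS_NAMES = {0: "background", 1: "road", 2: "car"}


def scores_to_label_map(scores: np.ndarray) -> np.ndarray:
    """Convert per-class scores of shape (num_classes, H, W) into an (H, W) label map."""
    # For every pixel, pick the index of the class with the highest score.
    return np.argmax(scores, axis=0)


# Toy scores for a 2x3 "image" and 3 classes (made-up numbers).
scores = np.array([
    [[0.90, 0.10, 0.20], [0.80, 0.10, 0.10]],  # background scores
    [[0.05, 0.80, 0.70], [0.10, 0.20, 0.10]],  # road scores
    [[0.05, 0.10, 0.10], [0.10, 0.70, 0.80]],  # car scores
])

label_map = scores_to_label_map(scores)
print(label_map)
# [[0 1 1]
#  [0 2 2]]
```

Each integer in `label_map` is a class index, so the matrix has exactly one entry per pixel of the input.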
Today, semantic segmentation is one of the key problems in the field of computer vision. Looking at the big picture, it is one of the high-level tasks that pave the way toward a holistic understanding of a scene. The importance of scene understanding as a fundamental problem in computer vision is highlighted by the growing number of applications that rely on extracting knowledge from images. These applications include autonomous vehicles, human-computer interaction, and virtual reality.
With the popularity of deep learning in recent years, many semantic segmentation problems are addressed using deep architectures, most commonly convolutional neural networks, which generally outperform earlier approaches in accuracy and often in efficiency. Semantic segmentation is a natural step in the progression from coarse to fine inference. The starting point is classification, which makes a single prediction for the entire input. The next step is localization/detection, which provides not only the classes but also information about where those classes are located in the image. Finally, semantic segmentation achieves fine-grained inference by making dense predictions that assign a label to every pixel, so that each pixel is marked with the class of the object or region it belongs to.
One important thing to note is that we're not separating instances of the same class; we only care about the category of each pixel. In other words, if you have two objects of the same category in your input image, the segmentation map does not inherently distinguish these as separate objects. There exists a different class of models, known as instance segmentation models, which do distinguish between separate objects of the same class.
The basic architecture in image segmentation consists of an encoder and a decoder. The layers that downsample the input form the encoder, and the layers that upsample form the decoder.
The encoder extracts features from the image through filters. The decoder is responsible for generating the final output, which is usually a segmentation mask that assigns a class to every pixel. Most segmentation models follow this design or a variant of it.
For the task of semantic segmentation, we need to retain the spatial information, so no fully connected layers are used. This is why such models are called fully convolutional networks. The convolutional layers, coupled with downsampling layers, produce a low-resolution tensor containing the high-level information.
Taking the low-resolution spatial tensor, which contains high-level information, we have to produce high-resolution segmentation outputs. To do that, we add more convolution layers coupled with upsampling layers, which increase the size of the spatial tensor. As the resolution increases, the number of channels is typically reduced, since the network is moving back from abstract, high-level features toward pixel-level detail.
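The change in spatial size through an encoder and a decoder can be traced with a small NumPy sketch. This is not a trainable network: it assumes 2x2 max pooling for downsampling and nearest-neighbor repetition for upsampling, and it leaves out the convolutions that would change the number of channels in a real model. The helper names `max_pool_2x2` and `upsample_nearest_2x` are hypothetical.

```python
import numpy as np


def max_pool_2x2(x: np.ndarray) -> np.ndarray:
    """Halve the height and width of a (C, H, W) tensor (H and W must be even)."""
    c, h, w = x.shape
    # Split each spatial axis into blocks of 2 and keep the max of every 2x2 block.
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))


def upsample_nearest_2x(x: np.ndarray) -> np.ndarray:
    """Double the height and width of a (C, H, W) tensor by repeating each value."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


image = np.random.rand(3, 64, 64)                # (channels, height, width)

# "Encoder": two downsampling steps, 64x64 -> 32x32 -> 16x16.
encoded = max_pool_2x2(max_pool_2x2(image))

# "Decoder": two upsampling steps, 16x16 -> 32x32 -> 64x64.
decoded = upsample_nearest_2x(upsample_nearest_2x(encoded))

print(image.shape, encoded.shape, decoded.shape)
# (3, 64, 64) (3, 16, 16) (3, 64, 64)
```

In an actual fully convolutional network, learned convolutions sit between these resizing steps, and the upsampling itself may also be learned.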
Let's discuss the metrics which are generally used to understand and evaluate the results of a model.
Pixel accuracy is the most basic metric for validating results. It is the ratio of correctly classified pixels to the total number of pixels:
Accuracy = (TP+TN)/(TP+TN+FP+FN)
The main disadvantage of this metric is that the result can look good when one class dominates the image. For example, if the background class covers 90% of the input image, we can reach 90% accuracy simply by classifying every pixel as background.
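The 90% example can be reproduced with a short NumPy sketch. The 10x10 ground-truth mask and the helper `pixel_accuracy` are hypothetical and exist only to illustrate the point.

```python
import numpy as np


def pixel_accuracy(pred: np.ndarray, target: np.ndarray) -> float:
    """Fraction of pixels whose predicted class matches the ground truth."""
    return float(np.mean(pred == target))


# Hypothetical ground truth: 100 pixels, 10 of them foreground (class 1).
target = np.zeros((10, 10), dtype=int)
target[0, :] = 1

# A useless "model" that labels every pixel as background (class 0).
pred = np.zeros_like(target)

acc = pixel_accuracy(pred, target)
print(acc)  # 0.9
```

The prediction never finds a single foreground pixel, yet it still scores 90%.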
IoU (intersection over union) is defined as the ratio of the intersection of the ground truth and the predicted segmentation to their union. When there are multiple classes, the IoU of each class is calculated and the mean is taken. It is a better metric than pixel accuracy: if every pixel is predicted as background in a 2-class input where the background covers 90% of the image, the background IoU is 90/100 and the foreground IoU is 0/10, so the mean IoU is (0.9 + 0)/2 = 45%, which reflects the quality of the prediction far better than 90% accuracy does.
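The same all-background prediction can be scored with mean IoU. This is a minimal NumPy sketch under a stated assumption: classes that appear in neither the prediction nor the ground truth are skipped, which is one common convention but not the only one. The helper `mean_iou` and the 10x10 masks are hypothetical.

```python
import numpy as np


def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Average the per-class intersection over union across classes."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            # Assumption: a class absent from both masks is skipped.
            continue
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious)) if ious else float("nan")


# Hypothetical ground truth: 100 pixels, 10 of them foreground (class 1).
target = np.zeros((10, 10), dtype=int)
target[0, :] = 1

# Prediction that labels every pixel as background (class 0).
pred = np.zeros_like(target)

# Background IoU = 90/100, foreground IoU = 0/10, so the mean is 0.45.
miou = mean_iou(pred, target, num_classes=2)
print(miou)
```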
Segmentation models are useful for a variety of tasks, including:
Autonomous vehicles such as self-driving cars and drones can benefit from automated segmentation. For example, self-driving cars can detect drivable regions.
Automated segmentation of body scans can help doctors to perform diagnostic tests. For example, models can be trained to segment tumors.
Segmentation models can distinguish different types of land in aerial images, which also makes automated land mapping possible.
For more resources on semantic segmentation, GitHub is a good place to start.
In this post, we discussed the concepts of deep learning based segmentation.
Semantic segmentation is a key topic in image processing and computer vision with applications such as scene understanding, medical image analysis, robotic perception, video surveillance, augmented reality, and image compression, among many others. This should provide a basic understanding of semantic segmentation as a topic in general.
If you need help with the concepts of deep learning based segmentation, feel free to reach out to us. We will be glad to help.