
Deep learning – OCR Text Detection and Text Recognition

Jasmin Kurtanović, Machine Learning Engineer

Computer vision is an area of artificial intelligence that deals with extracting and analyzing useful information from images. It emerged in the 1960s, and over the past ten years it has experienced a real boom, best seen in the variety of its applications. Examples include computer vision in medicine and in production processes, and self-driving cars are an increasingly popular topic. Some cell phones and computers can be unlocked by face recognition, and face recognition is also used in far more serious settings. Computer vision is also used to read text and numbers, for example to translate text seen through a camera in real time or to evaluate handwritten equations.

The use of deep learning in computer vision has allowed us to hand over feature extraction, which previously had to be designed by hand, to a deep neural network.

Reading text in images has attracted increasing attention in computer vision because of its many practical applications in document analysis, scene understanding, robot navigation, and image retrieval. Although there has been significant progress in text detection and text recognition, the problem remains challenging because of the large variety of text patterns and complicated backgrounds. The most common approach splits the task into text detection and text recognition, which are handled as two separate steps. Approaches based on deep learning are becoming dominant in both.

Text detection

Text detection is one of the main problems in the field of computer vision, with a wide variety of applications in autonomous driving, industrial automation, geolocation, and navigation aids for blind people.

CRAFT (Character Region Awareness For Text detection) is a deep neural network that predicts two scores for text characters: a region score and an affinity score. The region score is used to localize the individual characters in the image, and the affinity score is used to group neighboring characters into a single text instance.

The CRAFT model gives us two output maps: the Region Level Map and the Affinity Map. The Region Level Map shows where the characters are, while the Affinity Map shows which neighboring characters have a high affinity and should be merged into a word. Combined, the region score and the affinity score give us a bounding box for every word.
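To make the merging step concrete, here is a minimal sketch of how two score maps could be turned into word boxes. It is a simplification under stated assumptions, not the CRAFT post-processing itself: the function name `word_boxes`, the threshold values, and the toy maps are all made up for illustration, and the boxes are plain axis-aligned rectangles.

```python
from collections import deque


def word_boxes(region, affinity, region_thr=0.4, affinity_thr=0.4):
    """Return one (x_min, y_min, x_max, y_max) box per connected text blob.

    region and affinity are equally sized 2D lists of scores in [0, 1].
    The thresholds are illustrative values, not tuned ones.
    """
    height, width = len(region), len(region[0])
    # A pixel counts as text if either map is confident there; affinity
    # pixels bridge the gaps between neighboring characters.
    text = [
        [region[y][x] >= region_thr or affinity[y][x] >= affinity_thr
         for x in range(width)]
        for y in range(height)
    ]
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if not text[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill over 4-connected text pixels.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                x, y = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and text[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


# Toy maps: four characters at x = 1, 3, 7, 9 on the middle row.
region = [
    [0.0] * 10,
    [0.0, 0.9, 0.0, 0.9, 0.0, 0.0, 0.0, 0.9, 0.0, 0.9],
    [0.0] * 10,
]
# High affinity between characters 1-3 and 7-9, none between 3 and 7.
affinity = [
    [0.0] * 10,
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 0.0],
    [0.0] * 10,
]
no_affinity = [[0.0] * 10 for _ in range(3)]

boxes = word_boxes(region, affinity)
boxes_without_affinity = word_boxes(region, no_affinity)
```

With the affinity map, the four characters collapse into two word boxes; without it, each character stays in its own box. That is the role the affinity score plays in the description above.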


The CRAFT method is implemented as a fully convolutional network with the VGG16 neural network as its backbone. The decoding part is similar to a U-Net.

Text Recognition

Once we have a bounding box for each word from the CRAFT model, the next step is text recognition. There are several text recognition techniques, but we will focus on what is perhaps one of the most popular deep learning approaches to this problem.
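The hand-off between the two steps is simple: each word box is cut out of the image and passed to the recognizer as its own small picture. A minimal sketch, assuming the image is a 2D list of pixel values and the boxes are axis-aligned `(x_min, y_min, x_max, y_max)` tuples with inclusive bounds (both are illustrative choices, and `crop_words` is a hypothetical helper):

```python
def crop_words(image, boxes):
    """Cut one sub-image per (x_min, y_min, x_max, y_max) box, bounds inclusive."""
    crops = []
    for x_min, y_min, x_max, y_max in boxes:
        rows = image[y_min:y_max + 1]
        crops.append([row[x_min:x_max + 1] for row in rows])
    return crops


# Toy grayscale image with two "words" on the middle row.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 5, 6, 0, 7, 8],
    [0, 0, 0, 0, 0, 0],
]
crops = crop_words(image, [(1, 1, 2, 1), (4, 1, 5, 1)])
```

Each entry in `crops` is then an independent input for the recognition model.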

Clova AI Research has developed a four-stage scene text recognition (STR) framework that most existing STR models fit into. These models are based on CRNN neural networks.

A CRNN neural network combines a convolutional neural network and a recurrent neural network to recognize text based on an image.

The four stages derived from existing STR models are as follows: 

Transformation: Text in real-world images is often curved or distorted. At this stage, the input text image is normalized into a rectangular shape.

Feature Extraction: The transformed image is converted into a feature map suited to text recognition.

Sequence modeling: Captures contextual information within the sequence of characters so that the next stage can predict each character.

Prediction: This phase estimates the output string of characters.
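As an illustration of the prediction stage, here is a minimal sketch of greedy CTC decoding. This is an assumption on our part: CTC is one common choice for this stage (attention-based decoding is another), and the text above does not say which one is used. The function name, the alphabet, and the hand-made scores are illustrative only.

```python
def ctc_greedy_decode(scores, alphabet):
    """Decode per-time-step class scores into a string.

    scores: one list of class scores per time step. Class 0 is the CTC
    blank; class i (i >= 1) stands for alphabet[i - 1].
    """
    decoded = []
    previous = 0  # start as if the previous step was a blank
    for step in scores:
        # Pick the most likely class at this time step.
        best = max(range(len(step)), key=step.__getitem__)
        # Drop blanks and collapse immediate repeats of the same class.
        if best != 0 and best != previous:
            decoded.append(alphabet[best - 1])
        previous = best
    return "".join(decoded)


def one_hot(index, size):
    """Hand-made stand-in for a network output that is certain of one class."""
    return [1.0 if i == index else 0.0 for i in range(size)]


alphabet = "0123456789."            # classes 1..11; class 0 is the blank
classes = [10, 10, 0, 11, 10, 0, 10]  # "9", "9", blank, ".", "9", blank, "9"
scores = [one_hot(c, len(alphabet) + 1) for c in classes]
text = ctc_greedy_decode(scores, alphabet)
```

The two leading 9s collapse into one character because no blank separates them, while the blank between the last two 9s keeps them apart, so the decoded string is "9.99".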


The RedAI application is based on image processing. It uses computer vision for:

  • Image detection
  • Image classification
  • Image segmentation

Another functionality of the RedAI app is price detection and recognition, which is based on the concepts described above.


Price detection and recognition


In this blog post, we described techniques for text detection and text recognition based on deep learning concepts. CRAFT provides a bounding box for each word. The four-stage STR model takes one word (as a picture) as input and predicts its letters.


The project was co-financed by the European Union from the European Regional Development Fund. The content of the site is the sole responsibility of Serengeti ltd.