It’s not easy to teach a machine to see the space around us as humans do, but it is happening. Image processing, voice identification, media recreation, and language processing are becoming everyday routines, performed in many cases by machines.

Tremendous improvements and new inventions are happening in the AI industry. Over the last few decades, the gap between the abilities of smart machines and real humans has narrowed considerably. Among the most impressive achievements is computer vision. Let’s go deeper into the basic features to understand how it works.

A Convolutional Neural Network (CNN, or ConvNet) is an algorithm that has enabled advances and refinements in Computer Vision and Deep Learning.

A CNN is a neural network made up of several layers that takes the digital representation of images and looks for similarities among them. Trained on an initial set of labeled examples prepared by a person (called the training dataset), a CNN learns to recognize patterns and can then group an arbitrary set of pictures into classes.

How does a CNN learn?

To simplify the example, let’s take a straightforward two-colored image, treat it as a set of pixels, and submit it to a Multilayer Perceptron for analysis. If a pixel is black, we’ll set it to 1, and if it is white, to 0.
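As a minimal sketch of that encoding, the snippet below turns a tiny two-colored image into the flat vector of ones and zeros that a perceptron would receive. The 3×3 image and the "B"/"W" markers are made up for illustration.

```python
# Hypothetical 3x3 two-colored image: "B" marks a black pixel, "W" a white one.
image = [
    ["B", "W", "B"],
    ["W", "B", "W"],
    ["B", "W", "B"],
]

# Encode black as 1 and white as 0.
binary = [[1 if pixel == "B" else 0 for pixel in row] for row in image]

# A perceptron takes a flat vector, so the rows are concatenated.
flat = [value for row in binary for value in row]

print(flat)  # [1, 0, 1, 0, 1, 0, 1, 0, 1]
```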

This procedure may achieve moderate accuracy when classifying elementary binary images. Still, it will have little to no accuracy on complicated images, where the relationships between neighboring pixels matter.

By applying appropriate filters, a CNN can capture these relationships, and the filter values are fine-tuned during training to provide the right results.

We can only tell that the results are good if the data were labeled in advance, because the labels are what the predictions are checked against. In the end, a well-trained model can recognize even a complex image reliably.

Image input

Today, several color schemes are in use, such as RGB, CMYK, Grayscale, and HSV. An RGB image, for example, is divided into three color layers: red, green, and blue.
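To make the three layers concrete, here is a small sketch that separates an RGB image into its color planes. The 2×2 image and the helper name `channel` are hypothetical, and plain Python lists stand in for a real image library.

```python
# Hypothetical 2x2 RGB image: each pixel is an (R, G, B) tuple of 0-255 values.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]

def channel(img, index):
    """Return one color plane (0 = red, 1 = green, 2 = blue) as a 2D list."""
    return [[pixel[index] for pixel in row] for row in img]

red, green, blue = (channel(image, i) for i in range(3))

print(red)    # [[255, 0], [0, 255]]
print(green)  # [[0, 255], [0, 255]]
print(blue)   # [[0, 0], [255, 255]]
```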

The amount of computation becomes immense once the picture size reaches 7680×4320. The role of the ConvNet is to reduce the image to a form that is easier to process while keeping the features that are critical for a proper prediction. This matters when designing an architecture that not only learns features well but also scales to big data sets.
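A quick back-of-the-envelope calculation shows the scale involved. The three channels, the fully connected layer, and its width of 100 neurons are assumptions made purely for illustration.

```python
width, height = 7680, 4320   # resolution mentioned in the text
channels = 3                 # assumption: an RGB image

values_per_image = width * height * channels
print(values_per_image)  # 99532800

# Assumption for illustration: a fully connected neuron that reads every
# input value needs one weight per value.
weights_per_neuron = values_per_image
hidden_neurons = 100         # hypothetical layer width
print(weights_per_neuron * hidden_neurons)  # 9953280000
```

Close to ten billion weights for one modest layer is the kind of cost that reducing the image first is meant to avoid.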


With a Stride Length of 1, the Kernel visits nine positions (as happens, for example, when a 3×3 Kernel covers a 5×5 image). At each position it multiplies the Kernel K element by element with the portion P of the image over which it is hovering and sums the products.

The filter shifts to the right, one Stride Value at a time, until it has covered the entire width. It then returns to the left edge of the picture, moves down by the same Stride Value, and repeats the procedure until the whole image has been traversed.
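The sliding procedure can be sketched in a few lines of plain Python. The 5×5 image and 3×3 kernel are illustrative values, chosen so that a stride of 1 yields nine positions. The function name `convolve2d` is hypothetical, and the sketch assumes a square kernel and no padding.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    k = len(kernel)
    rows = (len(image) - k) // stride + 1
    cols = (len(image[0]) - k) // stride + 1
    output = []
    for r in range(rows):            # move down only after a row is finished
        out_row = []
        for c in range(cols):        # shift right across the full width
            top, left = r * stride, c * stride
            # Multiply the kernel with the patch under it and sum the products.
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            out_row.append(total)
        output.append(out_row)
    return output

# Hypothetical 5x5 binary image and 3x3 kernel.
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d(image, kernel)
print(feature_map)  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

The 3×3 result has nine entries, one per kernel position.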

For multichannel images (RGB, for example), the Kernel has the same depth as the input image.
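A rough sketch of the multichannel case follows, assuming the usual convention that the products from all channels are summed into a single feature map. The function name, the channel-first layout, and all values are hypothetical.

```python
def convolve_multichannel(image, kernel):
    """Convolve a [channel][row][col] image with a kernel of the same depth.

    Stride 1, no padding. Products from all channels are summed, so the
    result is a single 2D feature map.
    """
    depth, k = len(kernel), len(kernel[0])
    assert depth == len(image), "kernel depth must match image depth"
    rows = len(image[0]) - k + 1
    cols = len(image[0][0]) - k + 1
    return [
        [
            sum(
                image[d][r + i][c + j] * kernel[d][i][j]
                for d in range(depth)
                for i in range(k)
                for j in range(k)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]

# Hypothetical 3-channel 3x3 image and 3-channel 2x2 kernel.
image = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],   # red plane
    [[1, 0, 1], [0, 1, 0], [1, 0, 1]],   # green plane
    [[2, 2, 2], [2, 2, 2], [2, 2, 2]],   # blue plane
]
kernel = [
    [[1, 0], [0, 1]],   # applied to red
    [[1, 1], [1, 1]],   # applied to green
    [[0, 0], [0, 1]],   # applied to blue
]

feature_map = convolve_multichannel(image, kernel)
print(feature_map)  # [[10, 12], [16, 18]]
```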

The goal of the Convolution operation is to extract features, such as edges, from the input image. A ConvNet does not have to be limited to a single Convolutional layer. Generally, the initial layer is in charge of picking up lower-level features, including color, edges, the direction of the gradient, and so on. As we add more layers, the architecture adapts to higher-level features as well, giving us a network with a fuller understanding of the images in the data set, much as a human would have.

Depending on how much padding is added around the input, the operation has one of two outcomes:

  1. The dimensionality of the convolved feature decreases relative to the input;
  2. The dimensionality increases or stays the same.
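Both outcomes can be checked with the standard output-size formula for a convolution, sketched below. The helper name `output_size` and the `padding` parameter (the number of zero pixels added on each side) are assumptions introduced for this example.

```python
def output_size(input_size, kernel_size, stride=1, padding=0):
    """Length of one output dimension of a convolution."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# No padding: a 3x3 kernel over a 5-pixel-wide input shrinks it to 3.
print(output_size(5, 3))             # 3

# One pixel of zero padding on each side keeps the width at 5.
print(output_size(5, 3, padding=1))  # 5

# More padding than that makes the output larger than the input.
print(output_size(5, 3, padding=2))  # 7
```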

Pooling Layer

The Pooling Layer shrinks the spatial size of the Convolved feature. Reducing the dimensionality lowers the computational power needed to process the data. Additionally, pooling is useful for extracting dominant features, which helps the model learn efficiently.
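One common way to do this is max pooling, which keeps only the largest value in each window. The sketch below assumes a 2×2 window with a stride of 2; the text does not specify a pooling type, so treat this as one possible choice. The feature values are made up.

```python
def max_pool(feature_map, size=2, stride=2):
    """Keep the largest value in each size x size window."""
    rows = (len(feature_map) - size) // stride + 1
    cols = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]

# Hypothetical 4x4 convolved feature.
feature_map = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]

pooled = max_pool(feature_map)
print(pooled)  # [[6, 5], [7, 9]]
```

The 4×4 input becomes 2×2, a quarter of the original number of values, while the strongest response in each region survives.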

Together, a Convolutional layer and a Pooling layer form one layer of the CNN. Depending on the complexity of the images, the number of such layers may be increased to capture low-level details even further, but at the cost of more processing power.

After the previous procedure, the model has successfully built up a comprehensive representation of the features. Next, we are going to flatten the final output and send it to a regular Neural Network for classification.
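Flattening simply concatenates every value of the final feature maps into one vector. A minimal sketch, assuming a hypothetical output of two 2×2 feature maps stored channel-first:

```python
# Hypothetical output of the last pooling layer: 2 feature maps of 2x2 each,
# stored as [channel][row][col].
pooled = [
    [[6, 5], [7, 9]],
    [[1, 0], [2, 3]],
]

# Flattening concatenates every value into one vector for the classifier.
flat = [value for fmap in pooled for row in fmap for value in row]

print(flat)       # [6, 5, 7, 9, 1, 0, 2, 3]
print(len(flat))  # 8
```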

Fully Connected Layer (FC Layer)

Adding a Fully-Connected layer is typically a low-cost way of learning nonlinear combinations of the high-level features presented at the output of a convolutional layer. The Fully-Connected layer learns a possibly nonlinear function in that feature space.

Over the course of training, the model learns to distinguish dominant features and certain lower-level features in pictures and to classify the pictures accordingly.
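The classification step can be sketched as a tiny fully connected network applied to a flattened feature vector. Everything specific here is an assumption for illustration: the ReLU nonlinearity, the softmax output, the layer sizes, and the hand-picked (untrained) weights.

```python
import math

def dense(inputs, weights, biases):
    """Fully connected layer: each output is a weighted sum of all inputs."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def relu(values):
    """Nonlinearity (an assumed choice): negative values become zero."""
    return [max(0.0, v) for v in values]

def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical flattened features and hand-picked weights (not trained).
features = [1.0, 2.0, 0.5]
hidden_w = [[0.5, -1.0, 2.0], [1.0, 1.0, 0.0]]
hidden_b = [0.0, -1.0]
output_w = [[1.0, 0.0], [0.0, 1.0]]
output_b = [0.0, 0.0]

hidden = relu(dense(features, hidden_w, hidden_b))
probabilities = softmax(dense(hidden, output_w, output_b))
predicted_class = probabilities.index(max(probabilities))

print(hidden)           # [0.0, 2.0]
print(predicted_class)  # 1
```

In a trained network the weights would be learned from the labeled examples rather than written by hand.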