Convolutional Neural Network


Convolutional Neural Networks (CNNs) are one of the most important and successful research results in deep learning. Today they are at the heart of most major computer vision applications and have become the first choice of many computer vision researchers.

The idea of Convolutional Neural Networks was first proposed in the 1980s by Yann LeCun, one of the best-known AI researchers, and this line of work led to the LeNet-5 architecture (published in 1998). LeNet-5 stacks convolution layers with two pooling layers, followed by a fully connected dense layer and an output layer. The architecture was applied to the MNIST dataset. The goal of this article is to answer four questions:

  • Where did the idea of the Convolutional Neural Network originate?
  • What is the convolution operation?
  • What is max pooling?
  • How is a CNN structured, using LeNet-5 as an example?

Where did the idea of the Convolutional Neural Network originate?

The CNN, or ConvNet, is strongly inspired by the biological vision system. In 1959 two scientists, Torsten Wiesel and David H. Hubel, conducted a study on an anesthetized cat (how ethical that is, I leave to you). They found that the cat's visual system has a hierarchical neural structure: lower-level neurons respond to general features in an image, such as a vertical or horizontal edge, while higher-level neurons respond to more complex structures, such as a face or a hand. This hierarchy makes sense, because every object in an image is a collection of low-level features like edges, lines and shapes. The two scientists were awarded the Nobel Prize in 1981 for their research. This work is one of the core ideas behind the hierarchical structure of a CNN, and it is one of the main reasons why CNNs outperform earlier approaches on image tasks.

What is the convolution operation?

Each layer in a CNN is responsible for detecting certain features. The initial layers detect basic features like edges and lines, whereas the higher layers detect complex features like faces or hands. A CNN is just a series of convolution operations mixed with max pooling and fully connected neurons. Before diving into further details, let's first see how a convolution layer works.

Consider a 3×3 Sobel vertical edge detector applied to the 6×6 grayscale image below:

The output image is calculated by multiplying each pixel value element-wise with the corresponding filter weight and then adding up all the products. This gives the value of one pixel in the output image. The operation is then repeated by sliding the filter over the image with some stride (the number of pixels the filter moves at each step). This is called the convolution operation, and it is the heart of any CNN.
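The sliding-window computation can be sketched in plain Python. This is a minimal illustration, not an optimized implementation: the function name `conv2d` and the sample image are assumptions made for the example, and no padding is applied. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d(image, kernel, stride=1):
    """Slide a square `kernel` over `image` (no padding) and return the output rows."""
    n_rows, n_cols = len(image), len(image[0])
    k = len(kernel)  # the kernel is assumed to be square (k x k)
    output = []
    for top in range(0, n_rows - k + 1, stride):
        row = []
        for left in range(0, n_cols - k + 1, stride):
            # Multiply each pixel in the window by the matching filter weight, then sum.
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total)
        output.append(row)
    return output


# Hypothetical 6x6 grayscale image: bright left half, dark right half.
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]

# 3x3 Sobel filter for vertical edges.
sobel_vertical = [
    [1, 0, -1],
    [2, 0, -2],
    [1, 0, -1],
]

edges = conv2d(image, sobel_vertical)
for row in edges:
    print(row)  # each of the 4 rows is [0, 40, 40, 0]
```

The large values in the middle columns mark the vertical edge where the bright half meets the dark half; the flat regions on either side give 0.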

The size (height, width) of the output image is given by:

\left\lfloor \frac{n - k + 2p}{s} \right\rfloor + 1 \ , \ \left\lfloor \frac{n - k + 2p}{s} \right\rfloor + 1


where

n – size of the input image
k – size of the kernel
p – padding (values appended at the border of the image so that the output image does not shrink in size)
s – stride (number of pixels the filter moves at each step of the convolution)
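As a quick check of the formula, here is a small helper. The name `conv_output_size` is made up for this sketch, and it assumes the usual convention of rounding down when the stride does not divide evenly.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output size along one dimension for input size n, kernel k, padding p, stride s."""
    return (n - k + 2 * p) // s + 1


print(conv_output_size(6, 3))        # 4: the 6x6 image with a 3x3 filter
print(conv_output_size(6, 3, p=1))   # 6: one pixel of padding keeps the size
print(conv_output_size(32, 5))       # 28
print(conv_output_size(7, 3, s=2))   # 3
```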

We used a special type of filter, the Sobel filter, in the explanation above, but that was just for demonstration. You can also start with random weights as filter parameters and find their optimum values using any modern optimizer.
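To make the idea of learned filters concrete, here is a rough sketch that starts from random weights and nudges them toward reproducing the Sobel output. Plain gradient descent on a mean squared error stands in for a "modern optimizer"; the image, learning rate and step count are arbitrary assumptions for the illustration.

```python
import random


def conv2d(image, kernel):
    """Stride-1, no-padding convolution of a square image with a square kernel."""
    k = len(kernel)
    out_size = len(image) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(out_size)
        ]
        for r in range(out_size)
    ]


def mse(a, b):
    """Mean squared error between two equally sized 2-D lists."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


random.seed(0)
image = [[random.random() for _ in range(6)] for _ in range(6)]  # hypothetical 6x6 image
sobel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
target = conv2d(image, sobel)  # the output we want the learned filter to reproduce

kernel = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]  # random start
initial_loss = mse(conv2d(image, kernel), target)

learning_rate = 0.05  # assumed value, small enough for this toy problem
for step in range(2000):
    output = conv2d(image, kernel)  # computed once per step, before any weight changes
    n = len(output) * len(output[0])
    for i in range(3):
        for j in range(3):
            # d(loss)/d(kernel[i][j]) = 2/n * sum(error * pixel under that weight)
            grad = 2 / n * sum(
                (output[r][c] - target[r][c]) * image[r + i][c + j]
                for r in range(len(output))
                for c in range(len(output[0]))
            )
            kernel[i][j] -= learning_rate * grad

final_loss = mse(conv2d(image, kernel), target)
print(initial_loss, final_loss)  # the loss shrinks as the weights are tuned
```

In a real CNN the target is not a known filter output but the training labels, and the gradients flow back through all layers, yet the principle of adjusting filter weights to reduce a loss is the same.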

This operation can be extended to any RGB image, with many filters applied at the same time (for illustration, see the figure below).

Key points to note before applying convolution to an RGB image:

  1. All three channel slices of a filter must be the same size.
  2. The depth of a single filter (its slices combined) must equal the depth of the image.
  3. The stride must be the same for all three slices.
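A minimal sketch of the multi-channel case, assuming a channels-first layout (`image[channel][row][col]`) and made-up data. Each filter has one 3×3 slice per channel, and each filter produces one 2-D feature map.

```python
def conv2d_multichannel(image, filters, stride=1):
    """image: [channel][row][col]; filters: [filter][channel][row][col].
    Returns one 2-D feature map per filter: [filter][row][col]."""
    depth, n_rows, n_cols = len(image), len(image[0]), len(image[0][0])
    feature_maps = []
    for kernel in filters:
        # The filter depth must match the image depth.
        if len(kernel) != depth:
            raise ValueError("filter depth must equal image depth")
        k = len(kernel[0])
        fmap = []
        for top in range(0, n_rows - k + 1, stride):
            row = []
            for left in range(0, n_cols - k + 1, stride):
                # Sum over every channel and every position in the window.
                total = sum(
                    image[c][top + i][left + j] * kernel[c][i][j]
                    for c in range(depth)
                    for i in range(k)
                    for j in range(k)
                )
                row.append(total)
            fmap.append(row)
        feature_maps.append(fmap)
    return feature_maps


# Hypothetical 4x4 RGB image: every red pixel is 1, green is 2, blue is 3.
image = [[[value] * 4 for _ in range(4)] for value in (1, 2, 3)]

# Two hypothetical 3x3x3 filters: all ones, and ones on the red slice only.
ones = [[[1] * 3 for _ in range(3)] for _ in range(3)]
red_only = [[[1 if c == 0 else 0] * 3 for _ in range(3)] for c in range(3)]

maps = conv2d_multichannel(image, [ones, red_only])
print(len(maps), len(maps[0]), len(maps[0][0]))  # 2 2 2 (filters, rows, cols)
print(maps[0][0][0], maps[1][0][0])              # 54 9
```

Note that the three channels collapse into a single number per window, so the output depth equals the number of filters, not the number of input channels.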

The convolution operation is then followed by a ReLU activation, which produces the final output image.
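ReLU simply replaces every negative value with zero and leaves positive values unchanged. A tiny sketch on a made-up feature map:

```python
def relu(feature_map):
    """Apply ReLU element-wise: negative values become 0."""
    return [[max(0, value) for value in row] for row in feature_map]


feature_map = [[-40, 0], [25, -3]]  # hypothetical convolution output
print(relu(feature_map))  # [[0, 0], [25, 0]]
```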

What is the max-pooling operation?

Max pooling is usually applied after the convolution operation in a CNN, although not every layer needs a max-pooling layer. It is done to make the network's predictions more general: after convolution the network has extracted features, and max pooling helps ensure that those features are less dependent on exact size or location. It is a fairly simple operation: take a block of size N × N, find the maximum value, and replace the block with that maximum value (for illustration, take a look at the image below).

Consider a max-pool operation on a 3×3 image with a max-pool block of 2×2.

Note: Keep the max-pool block reasonably small, because the larger the block, the greater the loss of information.
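Max pooling can be sketched in a few lines. The pixel values below are made up, and the stride for the 3×3 case is assumed to be 1 (the text does not state it); by default the sketch uses a stride equal to the block size, which is the common choice.

```python
def max_pool2d(image, size=2, stride=None):
    """Replace each size x size block with its maximum value."""
    stride = stride or size  # default: non-overlapping blocks
    n_rows, n_cols = len(image), len(image[0])
    return [
        [
            max(image[top + i][left + j] for i in range(size) for j in range(size))
            for left in range(0, n_cols - size + 1, stride)
        ]
        for top in range(0, n_rows - size + 1, stride)
    ]


# Hypothetical 3x3 image pooled with a 2x2 block and stride 1.
small = [
    [1, 3, 2],
    [4, 6, 5],
    [7, 9, 8],
]
print(max_pool2d(small, size=2, stride=1))  # [[6, 6], [9, 9]]

# Hypothetical 4x4 feature map pooled with non-overlapping 2x2 blocks.
feature_map = [
    [1, 2, 5, 6],
    [3, 4, 7, 8],
    [9, 10, 13, 14],
    [11, 12, 15, 16],
]
print(max_pool2d(feature_map))  # [[4, 8], [12, 16]]
```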

Before moving ahead, make sure you understand what a fully connected layer is; refresh your memory first if you need to.

Now let's understand a very popular CNN architecture by combining all the pieces above. The model I will be using is LeNet-5. The reason for choosing this particular architecture is simple: it is one of the first production-ready CNNs, built by the CNN pioneer Yann LeCun.

The architecture of LeNet-5 is quite simple to understand (for illustration).



Configuration of the architecture:

Input Size : 32×32

Conv 1:

  • Size : 5×5
  • Activation : Sigmoid
  • No. of filters : 6

Pool 1:

  • Pool Size : 2×2
  • Activation : unknown
  • No. of pooling kernels : 1

Conv 2:

  • Size : 5×5
  • Activation : Sigmoid
  • No. of filters : 16

Pool 2:

  • Pool Size : 2×2
  • Activation : unknown
  • No. of pooling kernels : 1

Conv 3:

  • Size : 5×5
  • Activation : Sigmoid
  • No. of filters : 120

Fully connected:

  • Activation : Sigmoid
  • Size : 84
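To check that the dimensions fit together, the convolution and pooling layers listed above can be traced with the output-size formula. This sketch assumes no padding, stride 1 for the convolutions and stride 2 for the pooling layers; the listing does not state these values, so treat them as assumptions.

```python
def output_size(n, k, p=0, s=1):
    """Output size along one dimension for input size n, kernel k, padding p, stride s."""
    return (n - k + 2 * p) // s + 1


# (name, kernel size, stride, number of output channels)
# Strides are assumed: 1 for convolution, 2 for pooling.
layers = [
    ("Conv 1", 5, 1, 6),
    ("Pool 1", 2, 2, 6),
    ("Conv 2", 5, 1, 16),
    ("Pool 2", 2, 2, 16),
    ("Conv 3", 5, 1, 120),
]

size = 32  # input is 32x32
shapes = []
for name, k, s, channels in layers:
    size = output_size(size, k, s=s)
    shapes.append((name, size, size, channels))
    print(f"{name}: {size}x{size}x{channels}")
```

Under these assumptions the trace runs 28×28×6, 14×14×6, 10×10×16, 5×5×16 and finally 1×1×120, so the last convolution yields a vector of 120 values that feeds the fully connected layer.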


Note: Because LeNet-5 is quite an old architecture, it uses average pooling rather than max pooling. In average pooling we take the average of the values in each block rather than the maximum value.
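A minimal sketch of average pooling, using a made-up 4×4 feature map and non-overlapping 2×2 blocks. It covers only the plain averaging described in the note.

```python
def avg_pool2d(image, size=2, stride=None):
    """Replace each size x size block with the average of its values."""
    stride = stride or size  # default: non-overlapping blocks
    n_rows, n_cols = len(image), len(image[0])
    return [
        [
            sum(image[top + i][left + j] for i in range(size) for j in range(size))
            / (size * size)
            for left in range(0, n_cols - size + 1, stride)
        ]
        for top in range(0, n_rows - size + 1, stride)
    ]


feature_map = [
    [1, 2, 5, 6],
    [3, 4, 7, 8],
    [9, 10, 13, 14],
    [11, 12, 15, 16],
]
print(avg_pool2d(feature_map))  # [[2.5, 6.5], [10.5, 14.5]]
```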

If this feels complex, don't worry. Understanding it is quite simple: just make sure you follow all the dimensions and concepts explained above, and you are good to go. There are also some modern CNN architectures that are very large and complex. Some of the popular ones are listed below.

  1. AlexNet
  2. ResNet (Residual Network) by Microsoft
  3. VGG-16
  4. VGG-19
  5. GoogLeNet

Understanding CNNs is crucial in the field of deep learning, as they are used extensively in almost all computer vision applications. The use of CNNs is not limited to image recognition: DeepMind has released a research paper in which they use a convolutional architecture called WaveNet to generate sound. So the next time you hear the Google Assistant voice, it may well be a CNN producing that human-like sound. We will be covering many such exciting topics in deep learning and machine learning, so stay connected.





About the author


I write blogs about Machine Learning and data science

By abhinavsinghml
