Backpropagation is one of the most popular ideas in the whole field of deep learning, and it is built on the chain rule of differentiation. It was popularized by Geoffrey Hinton and his collaborators in the 1980s as a way to train neural networks.

Note: interesting math ahead ;)

Before understanding backpropagation, let’s revise our differentiation basics.

Differentiation is a technique to find the rate of change of a function w.r.t. a variable. By rate of change we simply mean the effect on the function’s output when the value of one of its variables changes, i.e. the slope of the function.

Calculating the slope of a linear function is straightforward:

Slope = \frac{Y_{2}-Y_{1}}{X_{2}-X_{1}}  , where Y= F(X)
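As a quick sanity check, here is the two-point slope formula in Python (a minimal sketch; the function F and the sample points are just illustrative, not from the post):

```python
# Slope of a linear function F(X) = 3*X + 2 from any two points.
def F(x):
    return 3 * x + 2

x1, x2 = 1.0, 4.0
slope = (F(x2) - F(x1)) / (x2 - x1)
print(slope)  # 3.0 — for a line, any pair of points gives the same slope
```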


Consider this image ,

Now if we apply the same technique of finding the slope to this function, the line passing through the two points will look something like this. Using the two-point formula here introduces an error, caused by the gap (the “void”) between the line and the curve. Our task is to reduce that error, and we can do so by shrinking the distance between the x values until the line passing through the two points looks like a tangent, touching the curve at exactly one point, as in the image given below.

Formalizing our discussion,

let the difference between the X values be denoted by dx,

so the slope can be written as slope = \frac{F(X_{1} + dx) - F(X_{1})}{dx}.

Taking F(X) = X^{2} as an example,

=> \frac{(X_{1}+dx)^{2} - (X_{1})^{2}}{dx} = 2X_{1} + dx, and as dx becomes negligible we get

=> 2\ast X_{1}, which is the power rule of differentiation (congratulations, you have just derived a differentiation rule).
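The derivation above can be verified numerically: as dx shrinks, the two-point slope of F(X) = X^2 approaches 2X. A small sketch (the point x1 is chosen arbitrarily):

```python
def F(x):
    return x ** 2

def slope(x, dx):
    # two-point slope between x and x + dx
    return (F(x + dx) - F(x)) / dx

x1 = 3.0
for dx in (1.0, 0.1, 0.001, 1e-6):
    print(dx, slope(x1, dx))
# As dx shrinks, the slope approaches 2 * x1 = 6.
```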

Points to consider from the above derivation:

  1. Differentiation is all about making the difference so negligible that even discarding it won’t have much effect on the final result.
  2. Differentiation can tell us where the minima or maxima of a function lie, since the rate of change at those points is zero.

We know that the goal of any machine learning algorithm is to find the best plane, either a decision plane or the plane that fits the data best, i.e. to minimize the loss, aka find the global minimum of the function.
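As an illustration of “finding the minimum by following the rate of change”, here is plain gradient descent on a toy loss (w - 3)^2; the loss, learning rate, and iteration count are all made up for this sketch:

```python
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    # dL/dw = 2 * (w - 3), by the power rule derived above
    return 2.0 * (w - 3.0)

w, lr = 0.0, 0.1
for _ in range(100):
    w -= lr * grad(w)  # step against the slope
print(round(w, 4))  # converges to 3.0, where the derivative is zero
```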


Backpropagation is, in a sense, an extension of the chain rule.

Now the question is: what is the chain rule, and where is it used?

The chain rule is used when we have a composite function like f(g(x)). Here, if we want to study the behavior of f() when we change x, we have to use the chain rule, because f is a function of g, which in turn is a function of x. Formalizing our discussion,

\frac{dF}{dx} = \frac{dF}{dg} \ast \frac{dg}{dx}, which means we first calculate the change in g w.r.t. x, then the change in F w.r.t. g; by multiplying the two we can study the combined effect.
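The chain rule can be checked numerically too. This sketch (with f = sin and g(x) = x^2 chosen arbitrarily for the example) compares the chain-rule derivative against a finite-difference estimate:

```python
import math

def g(x):
    return x ** 2

def f(u):
    return math.sin(u)

def chain_derivative(x):
    # dF/dx = dF/dg * dg/dx = cos(g(x)) * 2x
    return math.cos(g(x)) * 2 * x

def numeric_derivative(x, dx=1e-6):
    # two-point slope of the composite function f(g(x))
    return (f(g(x + dx)) - f(g(x))) / dx

x = 1.3
print(chain_derivative(x), numeric_derivative(x))  # the two agree closely
```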

Now let’s understand one more concept: vector differentiation.

The concept is very simple: consider a vector that stores the derivatives of a function with respect to every variable present in the function, just like in the image given below.

For a scalar-valued function this vector is called the gradient; for a vector-valued function the derivatives form a matrix called the Jacobian matrix. In both cases the entries are partial derivatives, i.e. the derivative of a function with respect to only one variable at a time. Vector differentiation is denoted by \bigtriangledown.
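A minimal sketch of such a vector of partial derivatives, computed numerically for an example two-variable function (the function itself is just illustrative):

```python
def f(x, y):
    return x ** 2 + 3 * x * y  # example multivariable function

def gradient(x, y, dx=1e-6):
    # partial derivative: hold one variable fixed, nudge the other
    df_dx = (f(x + dx, y) - f(x, y)) / dx
    df_dy = (f(x, y + dx) - f(x, y)) / dx
    return [df_dx, df_dy]

print(gradient(2.0, 1.0))  # analytically: [2x + 3y, 3x] = [7, 6]
```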

From now onwards we will use the following architecture.

Some notations,

  • W_{i,j}^{k}, where k is the number of the next layer, i is the current neuron’s number in its layer, and j is the neuron number in the next layer.
  • O_{i,j}: output of the jth neuron of the ith layer.
  • L(): the loss function.
  • X_{i}: the ith input.

Now if we want to study the effect of W_{1,1}^{1} on the loss function, we can write it as

\frac{\partial L}{\partial W_{1,1}^{1}} = \frac{\partial L}{\partial F_{3,1}} \ast \frac{\partial F_{3,1}}{\partial F_{2,3}} \ast \frac{\partial F_{2,3}}{\partial F_{1,1}} \ast \frac{\partial F_{1,1}}{\partial W_{1,1}^{1}} + \frac{\partial L}{\partial F_{3,1}} \ast \frac{\partial F_{3,1}}{\partial F_{2,2}} \ast \frac{\partial F_{2,2}}{\partial F_{1,1}} \ast \frac{\partial F_{1,1}}{\partial W_{1,1}^{1}} + \frac{\partial L}{\partial F_{3,1}} \ast \frac{\partial F_{3,1}}{\partial F_{2,1}} \ast \frac{\partial F_{2,1}}{\partial F_{1,1}} \ast \frac{\partial F_{1,1}}{\partial W_{1,1}^{1}}

By doing this for all the w’s present in the network, we update their values, and this is backpropagation applied to the above neural network. The process can be sped up using memoization, which simply means storing intermediate values (such as the repeated \frac{\partial L}{\partial F_{3,1}} terms) so they can be reused later instead of being recomputed. Backpropagation works well with any optimization algorithm, such as gradient descent or stochastic gradient descent.
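To make this concrete, here is a sketch of backpropagation on a tiny 1-2-1 network. The weights, the sigmoid activation, and all names are assumptions for illustration, not the architecture from the post; the stored activations h and the shared term dL_dy play the role of the memoized values:

```python
import math

# Illustrative 1-2-1 network: one input, two sigmoid hidden units, linear output.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    h = [sigmoid(w * x) for w in w1]           # hidden activations (stored for reuse)
    y = sum(wo * hi for wo, hi in zip(w2, h))  # linear output
    return h, y

def backprop(x, target, w1, w2):
    h, y = forward(x, w1, w2)
    dL_dy = 2.0 * (y - target)  # L = (y - target)^2
    # memoization: reuse dL_dy and h for every weight instead of recomputing
    grad_w2 = [dL_dy * hi for hi in h]
    grad_w1 = [dL_dy * wo * hi * (1 - hi) * x for wo, hi in zip(w2, h)]
    return grad_w1, grad_w2

x, target = 0.5, 1.0
w1, w2 = [0.1, -0.2], [0.3, 0.4]
g1, g2 = backprop(x, target, w1, w2)

# check grad_w1[0] against a finite-difference estimate of dL/dw
eps = 1e-6
_, y1 = forward(x, [w1[0] + eps, w1[1]], w2)
_, y0 = forward(x, w1, w2)
numeric = ((y1 - target) ** 2 - (y0 - target) ** 2) / eps
print(g1[0], numeric)  # the two agree closely
```

The same analytic gradients could then be fed to gradient descent or SGD, exactly as the text describes.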


About the author


I write blogs about Machine Learning and data science

By abhinavsinghml
