Hi @shsuratwala,

A simple example here: a price discount is a linear transformation, and a 20% off discount on a $100 product is just the transformation 100 * 0.8 = 80.

Now, if I give three discounts on the product - 20% off for the winter sale, an extra 25% off for a limited-time offer, and another extra 10% off for a membership offer - then the transformation becomes 100 * 0.8 * 0.75 * 0.9 = 54.

Above, there are three linear transformations, but in our hearts we know we can replace them with just one linear transformation - a total of 46% off. With that, the next time a customer buys a $270 product, we can compute 270 * 0.54 = 145.8 right away.

So the idea here is: a linear transformation of another linear transformation of another linear transformation ( * 0.8 * 0.75 * 0.9 ) can be combined into one linear transformation ( * 0.54 ).
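The discount arithmetic above can be sketched in a few lines - a minimal example using the same numbers from this post:

```python
# Three successive discounts (20%, 25%, 10% off) applied one after another.
price = 100
step_by_step = price * 0.8 * 0.75 * 0.9   # 54.0

# The same three discounts collapsed into a single combined factor.
combined = 0.8 * 0.75 * 0.9               # 0.54, i.e. a total of 46% off
one_shot = price * combined               # also 54.0

# Once we have the combined factor, a new purchase needs only one multiply.
next_purchase = 270 * combined            # 145.8
```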

The slide below shows the same idea - two linear transformations can be combined into one. It will be clearer if you plug some numbers into the w's and the b's, which is easy because they are all just scalars.
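Here is the same check you can do by hand, written out with scalars (the w and b values below are arbitrary example numbers, not from the slide):

```python
# Two affine maps: f1(x) = w1*x + b1, and f2(x) = w2*x + b2.
w1, b1 = 2.0, 3.0    # example numbers, chosen arbitrarily
w2, b2 = 0.5, -1.0

def f1(x): return w1 * x + b1
def f2(x): return w2 * x + b2

# Their composition f2(f1(x)) is itself affine, with:
w = w2 * w1          # combined slope
b = w2 * b1 + b2     # combined intercept

x = 7.0
composed = f2(f1(x))
collapsed = w * x + b
# composed and collapsed give the same value for any x.
```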

The above is just some maths, but the key takeaway is this: no matter how many layers we build into our neural network, as long as only linear transformations exist there, it is no smarter than a network with just one layer. In other words, we would be wasting our time training many layers for the effect of just one layer.
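The same collapse happens with matrices, not just scalars. A small sketch (my own illustration with random weights, assuming NumPy) showing that three activation-free layers fold into a single equivalent layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three "layers" with no activation function: each is just W @ x + b.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(5, 4)), rng.normal(size=5)
W3, b3 = rng.normal(size=(2, 5)), rng.normal(size=2)

def forward(x):
    h = W1 @ x + b1
    h = W2 @ h + b2
    return W3 @ h + b3

# Fold all three into one equivalent layer (W, b).
W = W3 @ W2 @ W1
b = W3 @ (W2 @ b1 + b2) + b3

x = rng.normal(size=3)
# forward(x) and W @ x + b agree for any input x,
# so the 3-layer linear network is really a 1-layer one.
```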

The second slide of the video elaborates the idea of the first slide in simple maths. Try to fully understand the second slide first, then go back to the first and see what you can or can't make sense of. After that, if you still have questions, try to ask a more specific one - for example, which sentence did you not get?

Raymond