Deep learning from a mathematical view

Hi, I’m trying to figure out what the math behind Deep Learning roughly looks like.

Can somebody please verify whether my understanding of DL with a (simple, tiny, fully connected) neural network is correct? (1 epoch)
If it’s correct, I think it would also help others acquire a better understanding of neural networks.

  1. Forward pass
  * = dot product (matrix product), + = addition (can include broadcasting), f = activation function (may be different ones)

a) example with 1 hidden layer
Y^ = f ( W2 * f ( W1 * X + b1 ) + b2 )

b) example with 2 hidden layers
Y^ = f ( W3 * f ( W2 * f ( W1 * X + b1 ) + b2 ) + b3 )
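
To make example b) concrete, here is a minimal NumPy sketch of that forward pass (the layer sizes, the ReLU choice for the hidden layers and the random initialization are just placeholders for illustration, not prescribed anywhere above):

```python
import numpy as np

# Placeholder layer sizes, just for illustration: 3 inputs, two hidden
# layers with 4 and 3 units, and a single output unit.
rng = np.random.default_rng(0)
n_x, n_h1, n_h2, n_y = 3, 4, 3, 1

W1, b1 = rng.standard_normal((n_h1, n_x)), np.zeros((n_h1, 1))
W2, b2 = rng.standard_normal((n_h2, n_h1)), np.zeros((n_h2, 1))
W3, b3 = rng.standard_normal((n_y, n_h2)), np.zeros((n_y, 1))

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.standard_normal((n_x, 5))      # 5 examples, one per column

# Y^ = f ( W3 * f ( W2 * f ( W1 * X + b1 ) + b2 ) + b3 )
A1 = relu(W1 @ X + b1)                 # hidden layer 1
A2 = relu(W2 @ A1 + b2)                # hidden layer 2
Y_hat = sigmoid(W3 @ A2 + b3)          # output layer: sigmoid for binary classification
```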

Y^, the result of the forward pass, is then plugged into a cost function (e.g. for binary classification):

  2. Cost function
    J = -( Y * log ( Y^ ) + ( 1 - Y ) * log ( 1 - Y^ ) )   (averaged over all training examples)

  3. Back propagation
    For back propagation this cost function J is partially differentiated with respect to each Wi and bi, giving the gradients dWi and dbi (a small numerical sketch of steps 2 and 3 follows right after this list).
    The parameters are then updated:
    Wi := Wi - lr * dWi
    bi := bi - lr * dbi
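
Here is that sketch of steps 2 and 3 for the 1-hidden-layer network from example a). To keep the back-prop formulas out of it, the partial derivatives are estimated by finite differences instead of the chain rule; the ReLU hidden layer, sigmoid output and the random data are all placeholder assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(params, X, Y):
    """J = -( Y*log(Y^) + (1-Y)*log(1-Y^) ), averaged over the examples."""
    W1, b1, W2, b2 = params
    Y_hat = sigmoid(W2 @ np.maximum(0, W1 @ X + b1) + b2)   # forward pass, example a)
    return -np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))

def numerical_gradients(params, X, Y, eps=1e-6):
    """dJ/dp for every entry of W1, b1, W2, b2, via central differences."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps; J_plus = cost(params, X, Y)
            p[idx] = old - eps; J_minus = cost(params, X, Y)
            p[idx] = old
            g[idx] = (J_plus - J_minus) / (2 * eps)
        grads.append(g)
    return grads

# Placeholder data and parameters, just for illustration.
rng = np.random.default_rng(0)
X = rng.standard_normal((2, 8))                    # 8 examples, 2 features
Y = (rng.random((1, 8)) > 0.5).astype(float)
params = [rng.standard_normal((4, 2)) * 0.1, np.zeros((4, 1)),   # W1, b1
          rng.standard_normal((1, 4)) * 0.1, np.zeros((1, 1))]   # W2, b2
lr = 0.1

# one update step:  Wi := Wi - lr * dWi,  bi := bi - lr * dbi
grads = numerical_gradients(params, X, Y)
params = [p - lr * g for p, g in zip(params, grads)]
print(cost(params, X, Y))   # the cost should have decreased slightly
```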

The epoch is finished.

W: weight matrix
b: bias vector
lr: learning rate
X: input
Y: label
Y^: output of the forward pass
f: activation functions (may be different ones)
i: layer index for W, b

I think what you show above is essentially the same as Prof Ng has shown us in the lectures in Week 3 and Week 4. But your notation is a little different. You write out the composition of the functions explicitly, which may make it more apparent how forward propagation actually works, but is not really generalizable. It will be pretty confusing to read with more than 2 hidden layers.

One other thing to note: the activation functions, which you have called f() and Prof Ng calls g(), do not all have to be the same. You have a choice at the hidden layers, and they don’t even have to be the same across the hidden layers, although using a single one is typically the way it’s done. At the output layer, the choice is determined by what you are doing: for a binary classification network, the output activation is always sigmoid; for multiclass classification it is always softmax, which is the generalization of sigmoid.
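
Just as a quick illustration (my own sketch, not taken from the course notebooks), the two output activations side by side:

```python
import numpy as np

def sigmoid(z):
    # binary classification output: one probability per example
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # multiclass output: z has one row per class, each column sums to 1
    z = z - np.max(z, axis=0, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=0, keepdims=True)
```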

You also haven’t shown any of the formulas for back prop, so it’s a bit unclear how this really adds any value over and above what Prof Ng has already shown us.

Also note that Prof Ng has specifically structured these courses so that they do not require knowledge of calculus (either univariate or vector calculus). There are lots of resources available on the web if you have a math background and actually want to see the derivations. Here’s a thread that has some links to get you started on that quest.

Hi Paul,

Thank you very much for your reply!

I agree that it’s not really generalizable - it wouldn’t work exactly like this for CNNs or other types of neural networks, but for a start I just wanted to build the example on a plain, tiny, fully connected neural network.

> It will be pretty confusing to read with more than 2 hidden layers.

Well, I guess for every additional hidden layer a term would be added on the left and on the right, e.g. for 3 hidden layers:

Y^ = f ( W4 * f ( W3 * f ( W2 * f ( W1 * X + b1 ) + b2 ) + b3 ) + b4)
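
A loop-based version (my own sketch, with ReLU hidden layers and a sigmoid output as placeholder choices) shows how the same nesting generalizes to any number of layers without writing the composition out by hand:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, weights, biases):
    """Forward pass for an arbitrary number of layers.
    weights = [W1, ..., WL], biases = [b1, ..., bL]."""
    A = X
    for W, b in zip(weights[:-1], biases[:-1]):
        A = relu(W @ A + b)                          # hidden layers
    return sigmoid(weights[-1] @ A + biases[-1])     # output layer
```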

> One other thing to note: the activation functions which you have called f() and Prof Ng calls g() do not all have to be the same

Yes, I mentioned that the activation functions can be different, but thanks for the information that they can also differ within the hidden layers - I didn’t know that before.

> You also haven’t shown any of the formulas for back prop, so it’s a bit unclear how this really adds any value over and above what Prof Ng has already shown us.

Backpropagation is nowadays done automatically by deep learning tools anyway … that’s why I didn’t want to dive too deeply into it.
As a side note - I think calculus is not that difficult to learn or to understand.
Not sure why it is left out.

I think it’s useful to imagine that by minimizing the cost function - that is, computing the slope via the first derivative of the cost function and moving the parameters so that the slope gets closer to zero - the distance between Y and Y^ is gradually reduced, and that the purpose of forward propagation is to provide the expression which, when inserted into the cost function, yields the best minimum.
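
A toy example of that intuition (just a one-parameter quadratic cost, nothing to do with a real network): gradient descent drives the slope toward zero and thereby moves the parameter toward the minimum.

```python
# Toy cost J(w) = (w - 3)^2 with derivative dJ/dw = 2*(w - 3).
w, lr = 0.0, 0.1
for step in range(100):
    dw = 2 * (w - 3)        # slope of the cost at the current w
    w = w - lr * dw         # same update rule as Wi := Wi - lr * dWi
print(w)                    # approaches 3.0, where the slope is ~0
```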

So IMHO, in summary, DL is all about the cost function and the argument supplied to it by forward propagation.
(Backpropagation is done automatically by DL tools.)

For experienced deep learners it probably wouldn’t make much difference, but I found it a bit more difficult to get a quick insight into what’s really going on just from the schematic presentation of a neural network.
I am also aware that Prof Ng teaches more than that.
Maybe it’s somehow just too distributed for my needs.

There is no intention to belittle the outstanding work of Prof Ng, of course.

Kind regards