Activation functions vs optimization method

So the downside of the sigmoid (and tanh) activation function is that the derivative gets close to zero out in the tails. The reason that’s a downside is that your step size is proportional to your derivative (BECAUSE you’re using steepest descent optimization), with the learning rate as the proportionality constant, so a small derivative means a small step size and it takes a long time to get to the solution.

But if you were to use Gauss-Newton iteration instead of steepest descent, it chooses the pointwise best step size/learning rate for each step. Gauss-Newton is a variation on Newton’s method applicable only when you’re minimizing the sum of squared errors across a vector of right hand sides/equations. The wonderful thing about GN is that it only requires a Jacobian (a matrix whose rows are the gradients of the equations, one row/gradient per equation). In the Coursera deep learning course we were already computing the Jacobian but doing a (1/m)*np.sum(…) across it to flatten it into a single gradient, so computing the Jacobian is easy as pie. However, you have to compute the PSEUDO inverse of the Jacobian, which if you tried to do it in a single step would be huge and just kill your update, but…

I think you could do back propagation with Jacobians one layer at a time (you use the Jacobian to compute a matrix of updates that you flatten into a vector update, rather than flattening the Jacobian into a gradient that you use to compute a vector update). Because each layer is a linear algebra step followed by applying the activation function, at each layer you would need to do a single pinv (of the previous layer’s activations augmented with a row of ones, so you merge W and b), and that, scaled by a constant determined by the current activation function, can be used for the whole layer. Moreover, the layer 1 pinv can be cached across GN iterations because X is constant.

The benefit of GN over steepest descent is that it chooses a pointwise best learning rate (in a better than steepest descent direction), so you take a lot fewer steps. Since with GN your step size is no longer a fixed scalar multiple (learning rate) of your gradient, wouldn’t the choice of activation function matter a lot less if you were to use Gauss-Newton iteration instead of steepest descent?
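For concreteness, here’s a minimal numpy sketch of what I mean by the two update rules, written for a generic residual vector r(w) with Jacobian J (the function names are mine, not from the course):

```python
import numpy as np

def steepest_descent_step(w, J, r, learning_rate=0.01):
    """Steepest descent on the sum-of-squares cost 0.5*||r(w)||^2:
    the step is a fixed scalar (the learning rate) times the gradient J^T r,
    so a tiny derivative means a tiny step."""
    return w - learning_rate * (J.T @ r)

def gauss_newton_step(w, J, r):
    """Gauss-Newton step: solve the linearized least-squares problem
    J @ dw = -r (in the least-squares sense), i.e. dw = -pinv(J) @ r.
    No hand-picked learning rate; the pseudoinverse sets the step length."""
    return w - np.linalg.pinv(J) @ r
```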


Sounds like an interesting topic! Here are a couple of high level points to consider:

  1. We are not doing a least squares optimization here, right? Our cost function has nothing to do with least squares. Does that have any effect on whether Gauss Newton is applicable in our case?
  2. Prof Ng is only showing us the simplest version of Gradient Descent with a fixed learning rate. In Course 2, we will learn about techniques like Learning Rate Decay, and then we’ll switch to using TensorFlow, which implements more sophisticated optimization methods (e.g. momentum, RMSprop, Adam) that manage the effective learning rate dynamically.

It is Prof Ng’s pedagogical approach to show us how to build the basic algorithms ourselves directly in Python first, but that’s not how anyone actually applies these techniques in “real life”. Everyone uses a high level ML “platform” like TensorFlow or PyTorch or Caffe or … to actually build real solutions. But it really helps to have an intuitive feel for what is happening “under the covers”, because things don’t always “just work” and require tuning and tweaking. Having some intuition about what’s really happening gives us better tools for knowing what to do when the first try at a solution doesn’t work.

Mind you, nothing I said there actually addresses any of the mathematics you are suggesting, but it makes sense to get an answer to my first question above before we dig any deeper here.


In one of the early exercises (logistic regression, which was a single layer NN) I implemented Gauss-Newton, and it converged to zero cost in something like 40-ish steps rather than the 10K steps that steepest descent took without quite getting there.

The right hand side “happened to be” the same (at least when using logistic regression, a.k.a. a sigmoid activation function for classification) whether you considered it to be a least squares problem OR minimizing the cost function of logistic regression, so at least for a final layer using a sigmoid activation function the question is moot. I haven’t looked to see whether the question is still moot for final layers with activation functions other than sigmoid.
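For what it’s worth, one way to see why the question is moot for a single sigmoid output a = \sigma(z) with cross entropy cost:

J_{CE} = -\big(y \log a + (1-y)\log(1-a)\big)
\frac{dJ_{CE}}{dz} = \frac{dJ_{CE}}{da}\cdot\frac{da}{dz} = \frac{a-y}{a(1-a)}\cdot a(1-a) = a-y

which is exactly the residual a-y that the least squares view drives to zero, so both formulations hand the same quantity to the update.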

So a little about me… while I was a doctoral student (over a decade ago now) at the University at Buffalo I was simultaneously faculty (job title was “instructor”) where I taught MAE376 Applied Mathematics for Mechanical and Aerospace Engineers… which I will describe as “numerical methods; matrix math; the combination of numerical methods and matrix math; an introduction to statistics; and making sure the students can program all of the above in MATLAB”. I’ve taught myself a lot more on the job at a national lab (where I eat, sleep, and breathe this stuff day in and day out; I effectively function as an applied mathematician/algorithm/uncertainty quantification subject matter expert) for the last 13 or so years since getting the PhD in mechanical engineering. But in life you either continuously grow or become obsolete, and Python and machine learning are my current areas of growth.

Conjugate gradient is one of the things I taught in MAE376. Newton’s method is preferable if you can compute a Hessian, but that’s a lot of extra complexity; Gauss-Newton is a magical version of Newton limited to a certain special case where you only need to compute the Jacobian. In the examples in the deep learning class, which I finished in about 3 days (including time spent developing improved versions of the algorithms), we were effectively already computing the Jacobian but flattening/averaging it (across training examples) into a single gradient, from which we were computing a single update. The real difference is that with GN you would be using the Jacobian itself to compute the update (which includes a pointwise best step size). I’ll do some math “on paper” and post that, and then maybe we can discuss it (unfortunately I don’t have access to the example data anymore so it’ll be difficult for me to test).


OK, so I’ve got it worked out, and it’s a lot cleaner than I thought it’d be.
The notation is a little different from Andrew Ng’s class (everything is transposed, for starters). I’m going to use MATLAB-ish notation where * means matrix multiplication, .* means elementwise multiplication, and ./ means elementwise division.
I’m assuming a sigmoid activation for the final layer (actually you can have multiple binary classifications, so a final layer with n^{[L]} nodes). There are m training examples and n^{[l]} nodes in layer l.

When I’m talking about the b^{[l]} (a vector) and W^{[l]} matrices being joined I’m going to refer to them as bW^{[l]}. bW^{[l]} is a (1+n^{[l-1]}) by n^{[l]} matrix. However, I sometimes need W^{[l]} by itself; it is a n^{[l-1]} by n^{[l]} matrix. A^{[L]} and Y are m by n^{[L]} matrices (where for a single classification n^{[L]}=1). More generally A^{[l]} is m by n^{[l]} and A^{[0]}=X.

so doing the back propagation

layer L: sigmoid activation function g^{[L]}(Z^{[L]}) with derivative g^{[L]'}(Z^{[L]}) = A^{[L]}.*(1-A^{[L]})
\Delta Z^{[L]} = (A^{[L]}-Y)./ g^{[L]'}(Z^{[L]}) %comment this is m by n^{[L]}
\Delta bW^{[L]} = pinv([ones(m,1) A^{[L-1]}]) * \Delta Z^{[L]}
\Delta A^{[L-1]} = \Delta Z^{[L]} * pinv(W^{[L]}) %comment I’m unsure whether W^{[L]} is before or after update, this is a Jacobi vs Gauss Seidel type question

layer l: tanh activation function g^{[l]}(Z^{[l]}) with derivative g^{[l]'}(Z^{[l]}) = 1 - A^{[l]}.^2
\Delta Z^{[l]} = (\Delta A^{[l]}) ./ g^{[l]'}(Z^{[l]}) %comment this is m by n^{[l]}
\Delta bW^{[l]} = pinv([ones(m,1) A^{[l-1]}])*\Delta Z^{[l]}
\Delta A^{[l-1]} = \Delta Z^{[l]} * pinv(W^{[l]}) %comment I’m unsure whether W^{[l]} is before or after update, this is a Jacobi vs Gauss Seidel type question

at layer l=1: pinv([ones(m,1) A^{[0]}]) = pinv([ones(m,1) X]) is a constant that can/should be cached

So like I said, it’s not that complicated, and the matrices you take the pinv of are never huge.
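If it helps, here’s a literal numpy transcription of the per-layer update above (shapes follow this thread’s convention, so A^{[l]} is m by n^{[l]}; at the output layer the incoming \Delta A is A^{[L]}-Y; the function name is mine):

```python
import numpy as np

def gn_layer_update(dA, A_prev, A, W, activation="tanh"):
    """One layer of the proposed GN-style backward pass.
    Shapes: dA and A are (m, n), A_prev is (m, n_prev), W is (n_prev, n)."""
    m = A_prev.shape[0]
    # activation derivative written in terms of A, as in the equations above
    dAdZ = A * (1 - A) if activation == "sigmoid" else 1 - A**2  # tanh
    dZ = dA / dAdZ                                   # the elementwise ./ step
    A_aug = np.hstack([np.ones((m, 1)), A_prev])     # [ones(m,1) A^{[l-1]}]
    dbW = np.linalg.pinv(A_aug) @ dZ                 # update for the merged [b; W]
    dA_prev = dZ @ np.linalg.pinv(W)                 # passed down to layer l-1
    return dbW, dA_prev
```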

and don’t forget to negate the \Delta bW^{[L]} and \Delta bW^{[l]} for the update, i.e.
bW^{[L]} -= \Delta bW^{[L]}
bW^{[l]} -= \Delta bW^{[l]}

we may want to do an update like
bW^{[L]} -= 0.5*\Delta bW^{[L]}
bW^{[l]} -= 0.5*\Delta bW^{[l]}

to reduce the chances of overstepping when we’re out in the tails of the sigmoid or tanh

Note that the ./ will fail when g^{[l]'}(Z^{[l]}) is zero, but that isn’t/shouldn’t be a big problem for sigmoid or tanh activation functions. ReLU wouldn’t work, but “leaky ReLU” should work fine; actually it seems like leaky ReLU would be the best activation function to protect the ./ division, and it probably wouldn’t need the 0.5*\Delta bW^{[l]} protection.
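For example, a leaky ReLU derivative is bounded away from zero, so the elementwise division above is always safe (the 0.01 slope here is just an illustrative choice):

```python
import numpy as np

def leaky_relu_derivative(Z, slope=0.01):
    """1 where Z > 0, `slope` elsewhere; never zero, so dividing by it is safe."""
    return np.where(Z > 0, 1.0, slope)
```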

And the pinv is notional… it may be preferable to use a QR factorization.

can you try implementing/testing this? I don’t have any data to test it on


I’m not sure what you mean by “no data to test on”. All the data used in the course exercises is available to you. See the appropriate topic on the FAQ Thread.

Also note that this Discourse Forum supports LaTeX interpolation. That is also covered on the FAQ Thread.

Sorry, I haven’t actually looked at any of the math yet, but wanted to get those statements out.


Thanks. I downloaded the week 4 exercises and switched the post over to LaTeX; it still uses MATLAB-ish notation.


Deep_Neural_Network_MINE.ipynb (103.5 KB)

So this is an implementation of the deep learning week 4 assignment 2 using the GN approach (I basically had to change everything in it except function names because of the transposed matrices).

For the 2 layer network it’s hitting machine-zero cost in under 40 iterations on the training data but only scoring 44% on the test data, so it’s definitely overfitting. Fixes for this are relatively easy: you either

  1. throw a lot more data at it (so the system is over constrained) and it’ll sort itself out OR
  2. terminate iteration early (e.g. when the cost/error drops below some threshold that depends on the size of your data)

In the L-layer network it’s encountering a different problem: AL is doing a binary classification (as in giving a 1 or 0 answer in under 10 iterations) but with the wrong answer in a small number of cases. Because of the sigmoid activation function, whenever AL is binary the derivative we need to divide by is identically zero. I tried a hack to try to get it to pop to the negative of the current Z value, but without success, so there is a very minor problem that I don’t know how to fix.

The easiest fix would be to use an activation function other than sigmoid for the final layer, but I need a suggestion for what to use (I’m a machine learning newb so I don’t know what other activation functions besides sigmoid are available for binary classification applications).

Something else that could be happening is that it is heading to the wrong local extreme (heading towards a zero derivative), which could be prevented by also computing the gradient and making sure the step isn’t going in an opposite direction in any dimension (so elementwise flipping of dimension signs, before we get into the binary state)… I haven’t tried it though.

If you are doing binary classification, the combination of sigmoid and cross entropy loss is the only reasonable choice. There are simple ways to deal with the “saturation” issue. In mathematical terms, the output of sigmoid is never exactly 0 or 1, but we are dealing with the pathetic limitations of floating point representations here. Here’s a thread which shows a couple of simple techniques for avoiding Inf or NaN values from the loss.

So a Google search turned up the “hinge” loss function
https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html#hinge
which says it’s used in classification

do you know what activation function it’s typically used with?

I’m not familiar with the hinge function. So what is the domain of that function? How does it work?

Update: that link is, how shall we say, “not useful”. More googling required …

It turns out TensorFlow provides Hinge and basically every other possible loss function. That link makes clear that the y_pred and y_true values are expected to be in the range [-1, 1]. That sounds more like tanh than sigmoid, but it has the very same problem with “flat tails” FWIW …
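For reference, in plain numpy the hinge loss is just the following (my own sketch; labels are expected in {-1, +1} and y_pred is a raw score):

```python
import numpy as np

def hinge_loss(y_true, y_pred):
    """Mean hinge loss: zero once a prediction is on the correct side of the +/-1 margin."""
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred))
```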

So I went back and took a solid look at everything…
I had copied and modded the functions in dnn_app_v3_utils.py because I had presumed they were correct; that was a mistake… the indexing on the L* stuff was seriously messed up. There’s a //2 in places that was causing only half (a.k.a. the first of the 2 layers) to be used, and THAT bug was causing my code to kind of work.

The equations I latex’ed above are wrong

\Delta Z^{[l]} = (\Delta A^{[l]}) ./ {g^{[l]}}'(Z^{[l]})
was correct but the SINGLE correct expression for the other two is
[ones(m,1) A^{[l-1]}]*\Delta bW^{[l]} + \Delta A^{[l-1]}*W^{[l]} = \Delta Z^{[l]}
and for a while I was in a pickle about how to solve that equation, but in the end it’s not too hard:
first substitute C for \Delta A^{[l-1]}*W^{[l]} to obtain
[ones(m,1) A^{[l-1]}]*\Delta bW^{[l]} + C = \Delta Z^{[l]} which can be re-written as
[ones(m,1) A^{[l-1]} I_m]*[[\Delta bW^{[l]}];[C]] = \Delta Z^{[l]}
then [[\Delta bW^{[l]}];[C]] = pinv([ones(m,1) A^{[l-1]} I_m])*\Delta Z^{[l]}
you can simply extract \Delta bW^{[l]} as the top part of that matrix; you can similarly extract C and then use
C=\Delta A^{[l-1]}*W^{[l]}
which can be solved as C*pinv(W^{[l]})=\Delta A^{[l-1]}
Now doing this as-is will be slow, but there’s a way to speed up pinv([ones(m,1) A^{[l-1]} I_m]).
[ones(m,1) A^{[l-1]} I_m] is m by (1+n^{[l-1]}+m),
i.e. it has more columns than rows, so because of its shape we can use the wide-matrix identity
pinv([ones(m,1) A^{[l-1]} I_m]) = [ones(m,1) A^{[l-1]} I_m]^T * pinv([ones(m,1) A^{[l-1]} I_m]*[ones(m,1) A^{[l-1]} I_m]^T)
and
[ones(m,1) A^{[l-1]} I_m]*[ones(m,1) A^{[l-1]} I_m]^T = [ones(m,1) A^{[l-1]}]*[ones(m,1) A^{[l-1]}]^T + I_m
which is clearly symmetric and positive definite (a positive semi-definite matrix plus the identity). As a way to accelerate this further (and potentially improve accuracy) you could do a symmetric eigendecomposition of
[ones(m,1) A^{[l-1]}]*[ones(m,1) A^{[l-1]}]^T + I_m
rather than an SVD. In my current draft implementation I’m just doing the
pinv([ones(m,1) A^{[l-1]}]*[ones(m,1) A^{[l-1]}]^T + I_m)
for simplicity, and that dramatically sped things up.
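Here’s a numpy sketch of that partitioned solve (shapes as defined earlier in the thread; the function name is mine):

```python
import numpy as np

def gn_layer_update_v2(dZ, A_prev, W):
    """Solve [ones(m,1) A_prev  I_m] @ [[dbW]; [C]] = dZ using the wide-matrix
    identity pinv(M) = M.T @ pinv(M @ M.T), where M @ M.T collapses to
    A_aug @ A_aug.T + I_m.  Shapes: dZ is (m, n), A_prev is (m, n_prev), W is (n_prev, n)."""
    m = A_prev.shape[0]
    A_aug = np.hstack([np.ones((m, 1)), A_prev])   # (m, 1+n_prev)
    MMt = A_aug @ A_aug.T + np.eye(m)              # symmetric positive definite, (m, m)
    T = np.linalg.pinv(MMt) @ dZ                   # (m, n)
    dbW = A_aug.T @ T                              # top block of M^T @ T: the [b; W] update
    C = T                                          # bottom block (I_m @ T)
    dA_prev = C @ np.linalg.pinv(W)                # recover dA_prev from C = dA_prev @ W
    return dbW, dA_prev
```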

my fix for the sigmoid function was to compute
A^{[L]}_{lim}=min(max(2^{-20},A^{[L]}),1-2^{-20})
then compute \frac{dA^{[L]}}{dZ^{[L]}}=A^{[L]}_{lim}*(1-A^{[L]}_{lim}) and
\Delta Z^{[L]} = (A^{[L]}-Y)/\frac{dA^{[L]}}{dZ^{[L]}}
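In numpy that limiting step is just a clip before forming the derivative (the function name is mine):

```python
import numpy as np

def safe_sigmoid_dZ(AL, Y, eps=2.0**-20):
    """Clip AL away from exactly 0 and 1 so the sigmoid derivative never hits zero."""
    AL_lim = np.clip(AL, eps, 1.0 - eps)
    dAdZ = AL_lim * (1.0 - AL_lim)
    return (AL - Y) / dAdZ
```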
With the above fixes the 2 layer NN converged in about 40 iterations, had 0.99999999… accuracy on the training data and about 0.65 on the test data, which was an improvement over yesterday; it still seems like it might be overfitting. And I don’t think the comparison to the 0.72 of the exercise is meaningful, because the indexing bug in the week 4 homework assignment disqualifies it as being “right by accident”.
still fiddling with the algorithm to get it to work for L layers instead of just 2 layers.

I haven’t looked through your new math yet, but I question your analysis of the utility routines as being incorrect. In the cases where len(parameters)//2 appears, that is because there are two parameters per layer, right? W^{[l]} and b^{[l]}. So if your analysis is based on misunderstanding that, I think you should take another look.

There is no indexing bug in the actual assignment as presented, so it is not “right by accident”. You may have changed the definition of what is in the parameters dictionary, but that is your mistake, not a mistake in the code as given.

print out “L” and run it then tell me there’s no bug

I have done that many times in the past and there is no such mistake. I don’t have time to do it right this minute, but will take another look later today.

Of course it is possible that you have mucked with the code. You might also want to get a clean copy and compare before you go casting aspersions. There is an explanation of how to do that on the FAQ Thread, which I think I’ve given you before.

I think you are correct, but I have bW to facilitate linear algebra rather than “b” and “W” separately, sorry

The other way you could mistakenly think there was a bug is running the test cell for L_layer_model without first running the cell that redefines layers_dims to be the 4 layer network. If you run the L layer code with the 2 layer architecture, note that you get a slightly better answer than in the two layer case because they use a more sophisticated initialization algorithm for the L layer case.

I added some instrumentation to my L_layer_model function and to L_model_forward. Here’s the first iteration from a test case with the 4 layer architecture:

layers_dims = [12288, 20, 7, 5, 1]
layer 1
A_prev.shape (12288, 209)
A.shape (20, 209)
layer 2
A_prev.shape (20, 209)
A.shape (7, 209)
layer 3
A_prev.shape (7, 209)
A.shape (5, 209)
layer 4
AL.shape (1, 209)
Cost after iteration 0: 0.7717493284237686

So there is no problem with the logic iterating over all 4 of the layers. Whatever happened in your case must be the result of bugs you introduced.

Deep_Neural_Network_MINE.ipynb (105.4 KB)

This is where I’m at right now: it works fine for 2 layers, and L-layer with 2 layers produces identical results to the 2 layer version, but a larger L-layer network is diverging after the first step, and I don’t know why yet. I’m thinking that the initial step size may be too large; it might also be because there are 20 rather than 7 nodes in the first layer… will have to experiment.