Activation functions vs optimization method

Update: it works great (extremely fast convergence) for 3 or fewer layers; at 4 layers you have to massively scale down the step size to keep it from diverging. Segmenting the pseudo-inverse into 4 parts (one for each layer) causes the layers to fight each other, since they’re all trying to make the same corrections.

But (and I thought of this before discovering it worked great for 3 layers) you could build it 3 layers at a time, then freeze the first layer, take its activations as the new input layer, and repeat, building the network up one layer at a time.

Interesting. What is magic about 3? Either the layer updates fight each other or they don’t, one would think. But I guess intuition and reality do not always agree. :nerd_face:

My intuition was that 3 would work: 3 points define a parabola, 4 points a cubic, and with a cubic you can have both a max and a min, i.e. one full oscillation, between the two endpoints.

What I think is happening is oscillatory behavior with 4 layers: in some trial runs the cost was flipping back and forth between two approximately equal (almost the same, but with noise) values. I think that only happens because the optimization starts a large distance from the solution, so trying to make a large correction on the first step is the problem; you might get good results with a starting point near the solution. How can you get that? Build the network up one layer at a time: do a 3-layer fit, throw away the top 2 layers, and use the bottom layer’s activations as a new set of inputs. And my intuition is that you need the method to work for 3 layers for this to work, because you need a layer of indirection between the layer you want to keep and the classification layer, so that you don’t over-optimize (try to answer too much of the problem with) the layer you want to keep.
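To make that concrete, here is a rough NumPy sketch of the greedy layer-wise loop I have in mind. The `fit_three_layers` helper is purely a hypothetical stand-in (random hidden weights plus a pseudo-inverse readout, just so the sketch runs end to end) for the actual 3-layer Gauss-Newton fit; the point is the structure of keeping only the bottom layer and feeding its activations forward:

```python
import numpy as np

def fit_three_layers(X, Y, hidden=16, rng=None):
    """Hypothetical stand-in for a real 3-layer Gauss-Newton fit.
    Uses random hidden weights plus a pseudo-inverse readout so the
    sketch runs; only the bottom layer's parameters matter here."""
    rng = rng or np.random.default_rng(0)
    n = X.shape[1]
    W1 = rng.standard_normal((hidden, n)) / np.sqrt(n)   # bottom layer
    b1 = np.zeros(hidden)
    A1 = np.tanh(X @ W1.T + b1)
    W2 = rng.standard_normal((hidden, hidden)) / np.sqrt(hidden)
    A2 = np.tanh(A1 @ W2.T)                              # middle layer
    W3 = np.linalg.pinv(A2) @ Y                          # top layer via pinv
    return (W1, b1), (W2, W3)

def greedy_layerwise(X, Y, depth=4, hidden=16):
    """Fit 3 layers, keep (freeze) only the bottom one, then use its
    activations as the new inputs, repeating `depth` times."""
    kept, features = [], X
    for _ in range(depth):
        (W1, b1), _discarded_top = fit_three_layers(features, Y, hidden)
        kept.append((W1, b1))                      # freeze the bottom layer
        features = np.tanh(features @ W1.T + b1)   # new "input layer"
    return kept, features
```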

There’s still the issue of how to prevent overfitting. Because you get to the minimum possible cost very fast, there isn’t a lot of leeway about when to stop; we’d need a good criterion to tell us exactly when to stop so that we don’t overfit.

Lots of interesting questions here. Yes, it is definitely the case that 0 cost is really not the goal here, since what it most likely represents is extreme overfitting on the training data. So you need a way to detect and mitigate that. One way is (as you mentioned) “early stopping”. I don’t recall Prof Ng going into any more detail on how you pick a good point for early stopping. One thought is that rather than just looking at the J value, you could also compute training accuracy and dev set accuracy periodically; that might give you more visibility into when you’re edging into overfitting territory.
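Something like this generic loop is what I have in mind (just a sketch; `step_fn` and `accuracy_fn` are placeholders for whatever your Gauss-Newton update and accuracy computation look like):

```python
def train_with_early_stopping(step_fn, accuracy_fn, params,
                              train, dev, patience=5, max_iters=200):
    """Run the optimizer but keep the params that scored best on the
    dev set, stopping once dev accuracy hasn't improved for `patience`
    consecutive checks (one common form of early stopping)."""
    X_tr, Y_tr = train
    X_dev, Y_dev = dev
    best_acc, best_params, stale = float("-inf"), params, 0
    for _ in range(max_iters):
        params = step_fn(params, X_tr, Y_tr)        # one optimizer step
        dev_acc = accuracy_fn(params, X_dev, Y_dev)
        if dev_acc > best_acc:
            best_acc, best_params, stale = dev_acc, params, 0
        else:
            stale += 1
            if stale >= patience:
                break        # dev accuracy plateaued: likely overfitting
    return best_params, best_acc
```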

The other general approach is to change the cost function, e.g. by adding L2 regularization. How does that affect your Gauss-Newton method?
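(For context, in textbook least-squares terms — this is standard material, not anything specific to your method — adding a penalty $\frac{\lambda}{2}\lVert\theta\rVert^2$ to the cost $\frac{1}{2}\lVert r(\theta)\rVert^2$ turns the Gauss-Newton normal equations into

$$\left(J^\top J + \lambda I\right)\Delta\theta = -\left(J^\top r + \lambda\,\theta\right),$$

where $J$ is the Jacobian of the residuals $r$. The $\lambda I$ damping term is essentially the Levenberg-Marquardt modification, and it also tends to tame exactly the kind of divergence you described at 4 layers.)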

I’m sorry that I haven’t had time to invest more effort in understanding what is really going on here. The top-level question is: if Gauss-Newton is really better, then why doesn’t Prof Ng talk about it? You have to believe that he knows at least as much math as you do and probably quite a bit more, and he’s got the resources of the Stanford Math Department at his disposal. My guess is that you can see the problem already in your description of fiddling with numbers of layers that you can count on the fingers of one hand. That’s not going to scale, right? The actual types of networks people use for solving “real world” problems typically have literally hundreds of layers and sometimes millions or even billions of parameters (read up on the recent GPT-3 model :scream_cat:). When you’re dealing with north of 10^6 parameters, computational complexity matters. Maybe it’s the case that Gauss-Newton is more expensive to compute.

A pinv is normally an O(m^3) operation that isn’t very GPU-friendly (there may be versions of BLAS and LAPACK geared toward execution on a GPU; I haven’t looked). However, there is a “divide and conquer” algorithm for symmetric eigenvalue problems that MIGHT be GPU-friendly, and you could use that to do the pinv. That’s what I think is the most significant obstacle to using GN for real. My intuition tells me that you could do 3 layers at a time, keep only the left-most layer, and build the network up 1 layer at a time that way; building up a layer at a time (rather than optimizing the whole NN all at once) MIGHT also help prevent over-optimization.
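To show what I mean, here’s a NumPy sketch of a pinv built on a symmetric eigendecomposition (np.linalg.eigh is typically backed by LAPACK’s divide-and-conquer routine syevd, and MAGMA ships GPU versions of the same solver):

```python
import numpy as np

def pinv_via_eigh(G, tol=1e-10):
    """Moore-Penrose pseudo-inverse of a symmetric PSD matrix
    (e.g. the Gauss-Newton normal matrix G = J.T @ J) using a
    symmetric eigendecomposition instead of an SVD."""
    w, V = np.linalg.eigh(G)                    # G = V @ diag(w) @ V.T
    # invert only the eigenvalues that are numerically nonzero
    w_inv = np.divide(1.0, w, out=np.zeros_like(w),
                      where=w > tol * w.max())
    return (V * w_inv) @ V.T

# quick check on a random normal matrix
J = np.random.default_rng(0).standard_normal((50, 8))
G = J.T @ J
assert np.allclose(G @ pinv_via_eigh(G) @ G, G)
```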

EDIT: I just checked, and there is MAGMA | NVIDIA Developer, so linear algebra on a GPU should be doable.
