Hey @David_Farago,
Welcome to the community. This indeed is an interesting piece of work. I loved reading it, thanks a ton for sharing.
Now, coming to the question. First of all, ResNets are only discussed in the 4th course of this specialization, and in this assignment the goal is to learn about initialization in general, so it's reasonable to assume that the neural networks referred to here are the plain vanilla networks we have already discussed in the specialization so far. Additionally, my intuition is that if an idea benefits a simple vanilla neural network, it should also benefit more sophisticated architectures, provided that even with their extra sophistication they can still exploit that idea. So here, we are learning about ideas that could benefit even the simplest neural networks.

Let's therefore assume we are talking about simple neural networks. If we refer to the comment, 0-initialization reduces multi-neuron layers to single-neuron layers, since they fail to break symmetry: every neuron in a layer computes the same output and receives the same gradient update, so the neurons never become different from one another. And if we refer to the paper you included, we can clearly say that "vanilla neural networks with as many single neuron hidden layers as you want" are not "universal approximators" (something I borrowed from the paper), and hence not a good choice.
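To see the symmetry argument in action, here is a minimal NumPy sketch (a toy dataset and training loop I made up for illustration, not the assignment's code): with 0-initialization, all the hidden neurons stay identical no matter how long you train.

```python
import numpy as np

# Toy setup (assumed for illustration): one hidden layer with 3 sigmoid neurons,
# sigmoid output, every parameter initialized to zero, plain gradient descent.
np.random.seed(1)
sigmoid = lambda z: 1 / (1 + np.exp(-z))

X = np.random.randn(2, 200)                      # 2 features, 200 examples
Y = (X[0:1, :] * X[1:2, :] > 0).astype(float)    # a non-linearly-separable toy label

W1, b1 = np.zeros((3, 2)), np.zeros((3, 1))      # 0-initialization everywhere
W2, b2 = np.zeros((1, 3)), np.zeros((1, 1))
m, lr = X.shape[1], 0.5

for _ in range(1000):
    A1 = sigmoid(W1 @ X + b1)                    # hidden layer
    A2 = sigmoid(W2 @ A1 + b2)                   # output layer
    dZ2 = A2 - Y                                 # cross-entropy gradient
    dW2, db2 = dZ2 @ A1.T / m, dZ2.mean(axis=1, keepdims=True)
    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)
    dW1, db1 = dZ1 @ X.T / m, dZ1.mean(axis=1, keepdims=True)
    W1, b1 = W1 - lr * dW1, b1 - lr * db1
    W2, b2 = W2 - lr * dW2, b2 - lr * db2

print(W1)   # all 3 rows are identical: the hidden neurons never diverge
print(W2)   # all 3 entries are identical: the layer behaves like a single neuron
```

Because the 3 hidden neurons always see identical weights and identical gradients, the trained network is equivalent to one with a single hidden neuron per layer.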
Now, coming to the second point: this is something I also wondered about when I was doing the assignment, but didn't give it much thought at the time, so let me present it now. A logistic regression model is simply
sigmoid(w^Tx + b),
so if we have multiple single-neuron hidden layers with a non-linear activation in each layer (ReLU, tanh, etc.), I would guess the resulting neural network is more powerful than a logistic regression model, since its equation is a composition of non-linearities rather than this single simple equation. I don't know to what extent it will be more powerful, but it should be to some extent. It's only when every hidden layer has a linear activation (or no activation function at all) that this neural network collapses to a simple linear model like logistic regression.
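To make the collapse concrete, here is a minimal NumPy sketch (toy weights I made up for illustration, not anything from the assignment): with identity activations, two single-neuron hidden layers fold into a single sigmoid(w^T x + b), while with ReLU they don't.

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))
relu = lambda z: np.maximum(0, z)

x = np.array([1.0, 2.0])
w1, b1 = np.array([0.5, -1.0]), 0.2   # hidden neuron 1 (takes the 2 inputs)
w2, b2 = 1.5, -0.3                    # hidden neuron 2
w3, b3 = -2.0, 0.7                    # output neuron

# With identity (linear) activations the stack collapses to one affine map,
# i.e. plain logistic regression sigmoid(w^T x + b):
linear_net = sigmoid(w3 * (w2 * (w1 @ x + b1) + b2) + b3)
w_eff = w3 * w2 * w1
b_eff = w3 * w2 * b1 + w3 * b2 + b3
print(np.isclose(linear_net, sigmoid(w_eff @ x + b_eff)))   # True

# With ReLU in the hidden layers the collapse no longer holds in general:
relu_net = sigmoid(w3 * relu(w2 * relu(w1 @ x + b1) + b2) + b3)
print(linear_net, relu_net)   # different outputs for the same input
```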
Now, let me present the most important point. If I asked you for the gist of this discussion of 0-initialization as intended in the assignment, you would most probably say that it highlights the fact that layers with multiple neurons, when 0-initialized, are only as good as layers with single neurons. Whether these single-neuron layers are efficient for modelling purposes is another discussion altogether, one which you have ignited with your query. So, if you want to use a neural network with single-neuron hidden layers, you should straightaway set units = 1 for those layers; using multiple neurons in the hidden layers with 0-initialization is probably not a good choice.
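Just to make the "units = 1" point concrete, here is a hedged sketch, assuming you are building the model in TensorFlow/Keras (where Dense layers take a `units` argument); the same idea applies if you write the layers by hand in NumPy.

```python
import tensorflow as tf

# Declare single-neuron hidden layers explicitly, instead of relying on
# 0-initialization to collapse wider layers into them.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=1, activation="relu", input_shape=(2,)),  # single-neuron hidden layer
    tf.keras.layers.Dense(units=1, activation="relu"),                    # another single-neuron hidden layer
    tf.keras.layers.Dense(units=1, activation="sigmoid"),                 # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```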
Now, should this be included in the assignment? In my opinion, it would create unnecessary confusion among learners who are still trying to grasp the basics of Deep Learning. But once they complete the Deep Learning Specialization, I am sure they will be all set to explore these concepts for themselves, and perhaps come up with new ideas that refute some of the most basic concepts we know of today, at which point the DeepLearning.AI team will be compelled to modify the assignments to include them.
I hope this helps.
Regards,
Elemento