Power of a DNN with a single neuron per layer

Week 1’s programming exercise about initialization says:

… In general, initializing all the weights to zero results in the network failing to break symmetry. This means that every neuron in each layer will learn the same thing, so you might as well be training a neural network with n^{[l]} = 1 for every layer. This way, the network is no more powerful than a linear classifier like logistic regression.

I am confused about the last sentence: doesn’t the power of a neural network increase as you add further single-neuron layers? If not, I would like to understand why.

I would have thought that with some smart input and output encoding (some kind of dovetailing), a DNN with a single neuron per layer would be as powerful as DNNs in general (perhaps needing exponentially more layers). For ResNets, the paper “ResNet with one-neuron hidden layers is a Universal Approximator” does suggest that a DNN with a single neuron per layer can be quite powerful…

Hey @David_Farago,
Welcome to the community. This indeed is an interesting piece of work. I loved reading it, thanks a ton for sharing.

Now coming to the question. First of all, ResNets are only discussed in the 4th course of this specialization, and in this assignment the goal is to learn about initialization in general. So it is a reasonable assumption that the neural networks referred to in this assignment are the plain vanilla neural networks we have discussed in the specialization so far. Additionally, my intuition is that if some idea benefits a simple vanilla neural network, it should also benefit more complex neural networks, provided that, even with their extra sophistication, they can still exploit that particular idea. So here, we are learning about ideas that benefit even the simplest neural networks. With that assumption, and coming back to the comment: 0-initialization reduces multi-neuron layers to single-neuron layers, since the neurons fail to break symmetry (see the sketch below). And referring to the paper you included, we can say that "vanilla neural networks with as many single-neuron hidden layers as you want" are not "universal approximators" (to borrow the paper's term), and hence are not a good choice.
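
To make the symmetry point concrete, here is a minimal numpy sketch (my own toy example, not the assignment's code) of a 2-layer network whose hidden layer has 4 neurons, all initialized to zero. After any number of gradient-descent steps, every hidden neuron still has identical parameters, so the layer behaves like a single neuron:

```python
import numpy as np

# Toy demonstration that 0-initialization fails to break symmetry.
np.random.seed(1)
X = np.random.randn(3, 8)                 # 3 features, 8 examples
Y = (np.random.rand(1, 8) > 0.5) * 1.0    # binary labels

n_h = 4                                   # 4 hidden neurons, all starting identical
W1, b1 = np.zeros((n_h, 3)), np.zeros((n_h, 1))
W2, b2 = np.zeros((1, n_h)), np.zeros((1, 1))

for _ in range(100):
    # forward pass
    A1 = np.tanh(W1 @ X + b1)
    A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))          # sigmoid output
    # backward pass (binary cross-entropy)
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2, db2 = dZ2 @ A1.T / m, dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)
    dW1, db1 = dZ1 @ X.T / m, dZ1.sum(axis=1, keepdims=True) / m
    # gradient-descent update
    W1 -= 0.5 * dW1; b1 -= 0.5 * db1
    W2 -= 0.5 * dW2; b2 -= 0.5 * db2

# True True: all hidden neurons remain identical, so the layer acts like one
# neuron (with tanh and zero init, these weights in fact never move at all).
print(np.allclose(W1, W1[0]), np.allclose(W2, W2[:, :1]))
```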


Now, coming to the second point. This is something I wondered about too when I was doing the assignment, but didn't give it much thought back then, so let me present it now. A logistic regression model is simply

sigmoid(w^T x + b),

so if we have multiple hidden layers of single neurons with a non-linear activation in each layer, such as ReLU or tanh, the resulting neural network will be more powerful than a logistic regression model, since its equation is a composition of non-linearities and is much more complex than this single expression. I don't know to what extent such a network is more powerful than a logistic regression model, but it definitely is to some extent. It is only when every hidden layer has a linear activation (or no activation function at all) that the network reduces to a simple linear model like logistic regression.
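
Just to make the comparison concrete, here is a tiny numpy sketch (the weights are made-up illustrative numbers, not anything from the course). With tanh activations the single-neuron-per-layer network computes a nested composition of non-linearities, whereas the same stack with linear activations collapses exactly to a single logistic-regression-style expression:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x = np.array([0.4, -1.2, 2.0])              # one example with 3 features

# Logistic regression: sigmoid(w^T x + b)
w, b = np.array([0.1, -0.3, 0.2]), 0.05
y_logreg = sigmoid(w @ x + b)

# Single-neuron-per-layer network: each layer is a scalar affine map followed
# by a non-linearity, i.e. sigmoid(w3 * tanh(w2 * tanh(w1^T x + b1) + b2) + b3).
w1, b1 = np.array([0.1, -0.3, 0.2]), 0.05   # layer 1: R^3 -> R
w2, b2 = 1.7, -0.4                          # layer 2: R -> R
w3, b3 = -2.1, 0.3                          # output layer: R -> R
a1 = np.tanh(w1 @ x + b1)
a2 = np.tanh(w2 * a1 + b2)
y_deep = sigmoid(w3 * a2 + b3)
print(y_logreg, y_deep)

# With *linear* activations, the same stack collapses to one affine map,
# i.e. exactly a logistic regression with rescaled weights:
a1_lin = w1 @ x + b1
a2_lin = w2 * a1_lin + b2
y_lin = sigmoid(w3 * a2_lin + b3)
y_collapsed = sigmoid((w3 * w2 * w1) @ x + (w3 * (w2 * b1 + b2) + b3))
print(np.isclose(y_lin, y_collapsed))       # True
```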


Now, let me present the most important point. If I ask you for the gist of the 0-initialization discussion as intended in the assignment, you will most probably say that it highlights the fact that layers with multiple neurons, when 0-initialized, are only as good as layers with single neurons. Whether layers with single neurons are efficient for modelling purposes is another discussion altogether, one which your query has ignited. So if you want a neural network with single-neuron hidden layers, you should straightaway set units = 1; using multiple neurons in the hidden layers with 0-initialization is probably not a good choice.
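
For instance (purely as an illustration; Keras only appears later in the specialization, and this snippet is my own sketch, not course code), "set units = 1" would just mean declaring the hidden layers with a single unit directly, rather than relying on 0-initialization to collapse wider layers:

```python
import tensorflow as tf

# A single-neuron-per-layer network declared explicitly with units=1,
# using the default (symmetry-breaking) weight initialization.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),
    tf.keras.layers.Dense(units=1, activation="tanh"),
    tf.keras.layers.Dense(units=1, activation="tanh"),
    tf.keras.layers.Dense(units=1, activation="sigmoid"),
])
```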


Now, should this be included in the assignment? My opinion is that it would create unnecessary confusion among learners who are still trying to grasp the basics of Deep Learning. But once they complete the Deep Learning Specialization, I am sure they will be all set to explore these concepts for themselves, and perhaps come up with some new concept that refutes the most basic concepts we know of today, at which point the DeepLearning.AI team will be compelled to modify the assignments to include it.

I hope this helps.

Regards,
Elemento

Thanks for clarifying. So my understanding was right: a single-neuron-per-layer network IS more powerful than a linear classifier like logistic regression.

Though I doubt I will use a single-neuron-per-layer network in practice, it would be interesting to know how powerful it is, to train one's intuition on how much the number of layers L and how much the layer widths n^{[l]} contribute to a network's power.

Hey @David_Farago,
That indeed sounds like a fun thing to do. If you ever do such a thing, do share your results with the community.

Regards,
Elemento