About how to determine the number of layers of the neural network and the number of neurons in each layer

Hi @_AisingioroHao
Welcome to the community!

  • Why the ReLU activation function is used in the hidden layers is discussed in this thread

  • As for why the number of neurons decreases from layer to layer:

Suppose you have 2,000 inputs and 1 output.

Let’s look at some potential architectures.

The hidden layers are:

  1. 2000, 2000, 2000, 2000 (~16 million parameters)

  2. 2000, 1000, 500, 250 (~6.63 million parameters)

  3. 1000, 1000, 1000, 1000 (~5 million parameters)

Obviously network 1 can do anything networks 2 or 3 can do, and more, but at roughly triple the cost per evaluation/backprop. It also requires more data, so the training time is likely more than tripled.
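To make those counts concrete, here’s a minimal sketch (my own illustration, not from the course) that sums the weight matrices between consecutive fully connected layers. Biases are omitted, which is why the figures above are approximate:

```python
def param_count(widths):
    """Number of weights in a dense network with the given layer widths
    (input, hidden layers, output); biases are ignored."""
    return sum(a * b for a, b in zip(widths, widths[1:]))

# 2000 inputs, the hidden layers from the list above, 1 output.
candidates = {
    "Network 1": [2000, 2000, 2000, 2000, 2000, 1],
    "Network 2": [2000, 2000, 1000, 500, 250, 1],
    "Network 3": [2000, 1000, 1000, 1000, 1000, 1],
}

for name, widths in candidates.items():
    print(f"{name}: ~{param_count(widths) / 1e6:.2f} million parameters")
```

Running this prints roughly 16.00, 6.63, and 5.00 million, matching the list above.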

There are two things to consider here:

  1. Is the performance increase in network 1 worth the increased cost?
  2. Networks 2 and 3 have roughly the same number of parameters, but which is more effective?

The answer to both of these will depend on the particular problem, but most of the time, for a starting architecture, Network 2 will be a pretty safe choice.
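If you want to try Network 2 as a starting point, a minimal Keras sketch might look like the following. The ReLU hidden activations and sigmoid output are my assumptions for illustration (e.g. binary classification); only the layer widths come from the list above:

```python
import tensorflow as tf

# "Network 2" above: 2000 inputs narrowing through 2000 -> 1000 -> 500 -> 250
# hidden units to a single output. Swap the output activation to suit your
# problem (e.g. no activation for regression).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(2000,)),
    tf.keras.layers.Dense(2000, activation="relu"),
    tf.keras.layers.Dense(1000, activation="relu"),
    tf.keras.layers.Dense(500, activation="relu"),
    tf.keras.layers.Dense(250, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.summary()  # reports ~6.63 million parameters (weights plus biases)
```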

Whether the increased performance is worth the cost is something you’ll have to judge for yourself, but if it is, then perhaps you can get similar performance for even less cost by training a narrowing network that simply starts out larger.

Our ultimate goal is to map 2000 dimensions of data into 1. Intuitively, it seems like doing it smoothly might be better than doing it all at once. At each layer, the network learns new features based on the previous layers. We would like to remove the “noise” in the data while keeping the important information. Narrowing the network forces it to give up some information, while desperately trying to keep as much relevant information as possible.

In practice, narrowing networks seem to work better. In theory, we can see a little bit of why they might, but it’s very hard to say anything conclusive.
