I’m having a bit of trouble understanding, as we’re designing neural networks, how many layers and how many units are sufficient.
For example, I know that in the coffee roasting example we had 3 units in 1 layer because we determined there were 3 reasons the coffee might not be roasted successfully - undercooked, overcooked, or too short a duration.
In our Lab Assignment for Week 1, where we are asked to help write a neural network for classifying handwritten digits and determining whether they are 0 or 1, how did we know that for the neural network we wanted 25 units in layer 1, 15 units in layer 2, and 1 output unit in layer 3?
I understand the number of layers as the neural network looking at a small window in layer 1, then moving on to a larger window in the second layer, and then creating a final output in layer 3.
I have questions about:
How do we know that essentially 2 layers were enough?
How did we know that we needed 25 units in the first layer and 15 units in the second layer?
The sizes of the input and output layers are set by the number of input features and the number of output labels, respectively.
The number of hidden layers, and the number of units in each layer, are found by experimentation.
The goals are that you want enough complexity that you get “good enough” results, but you don’t want so much complexity that training is computationally costly.
There are methods later in the course that will help you decide what is “good enough”.
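For reference, here is a minimal Keras sketch of the architecture being discussed. The layer sizes (25, 15, 1) and the sigmoid output come from the thread; the 400-feature input assumes the 20x20 pixel images are flattened, and the ReLU hidden activations are my own choice, so check the lab for the exact activations it uses.

```python
import tensorflow as tf

# Layer sizes (25, 15, 1) as discussed in the thread; the 400-feature
# input assumes flattened 20x20 images, and the ReLU hidden activations
# are an assumption here, not necessarily what the lab uses.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(400,)),                    # 20x20 = 400 pixels
    tf.keras.layers.Dense(25, activation='relu'),    # hidden layer 1
    tf.keras.layers.Dense(15, activation='relu'),    # hidden layer 2
    tf.keras.layers.Dense(1, activation='sigmoid'),  # 1 output unit: is it a 1?
])

model.compile(loss=tf.keras.losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(0.001))
model.summary()
```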
So just to confirm, using 25 > 15 > 1 here is just a guess for now? And for this example there isn’t some formula that determined these numbers?
Hey @timtait, welcome to our community! We use “1” in the last layer because it’s a binary classification problem. We need one and only one neuron to output the probability of the sample being classified as True.
The “25” and “15” are determined by experiment. Initially, if you have no idea how many layers and how many neurons are needed, you can only start by guessing. There is no mathematical formula for calculating these numbers. They are related to the complexity of your problem: for example, if the digits you are predicting are always centered and upright in the photo, you may need fewer neurons than if the digits can appear anywhere in the photo and in any orientation.
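To make the “start with guessing” idea concrete, here is a rough sketch of the kind of experiment loop you might run. The candidate sizes and the synthetic stand-in data are arbitrary; with your real data you would compare the validation scores and keep the best guess.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data: 400 features per sample, like the flattened
# 20x20 digit images. Replace with your real training / validation split.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((800, 400)), rng.integers(0, 2, 800)
X_val, y_val = rng.random((200, 400)), rng.integers(0, 2, 200)

# A few guesses for (units in layer 1, units in layer 2).
candidates = [(10, 5), (25, 15), (50, 25)]

for units1, units2 in candidates:
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(400,)),
        tf.keras.layers.Dense(units1, activation='relu'),
        tf.keras.layers.Dense(units2, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=20, verbose=0)
    _, val_acc = model.evaluate(X_val, y_val, verbose=0)
    print(f"{units1} > {units2} > 1  validation accuracy: {val_acc:.3f}")
```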
Usually, how many layers do you start and end with during experimentation? And how many units per layer do you start and end with?
Wouldn’t this experimentation take too much time? Would it be possible to just sample a subset of the population to test for the best parameters and still obtain the same results?
Hello @RafaelDichoso, objectively speaking it will take some time to experiment, but this is the only way to find the architecture we need for the specific dataset and problem we have this time. We will need another series of experiments for the next dataset and problem, because it is likely that this problem and the next one need different architectures.
The amount of time depends on the size of the dataset, and a NN can be large if you have a large dataset. So if you have a large dataset, you can start with a big architecture and use reasonable regularization to prevent overfitting. I am not able to tell you how big the architecture should be or how strong the regularization has to be, but you will find out as you experiment more, and you will come up with your own set of defaults. Those defaults can depend on the amount of resources you have access to, the field of your interest, and the way you solve a problem, and these can all differ from person to person.
Once you have your own default starting point, you might spend less time experimenting.
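As a rough illustration of “start big and regularize”, here is one way it could look in Keras. The layer sizes and the L2 strength are arbitrary starting guesses, not recommendations; they are exactly the kind of thing you tune by experiment.

```python
import tensorflow as tf

# Arbitrary starting guesses for a "big" architecture, with L2 weight
# penalties and dropout as two common forms of regularization.
reg = tf.keras.regularizers.l2(0.01)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(400,)),
    tf.keras.layers.Dense(128, activation='relu', kernel_regularizer=reg),
    tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=reg),
    tf.keras.layers.Dropout(0.2),  # another common form of regularization
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
```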
Btw, by “test”, we are still experimenting, but the test result might not be very useful, because after we bring the whole dataset back in, we have the potential to enlarge the NN for better performance, and how much more to enlarge it is, again, back to experiment.
If you want to start with a reliable architecture, the usual way is to look for existing solutions, and you can find many of those for image recognition problems, for example. You may download an EfficientNet architecture and search for discussions of the level of regularization needed to train that sort of model.
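For example, Keras ships several EfficientNet variants through keras.applications, so a sketch of starting from an existing architecture might look like this. The input size and the single-unit sigmoid head are assumptions for a binary image problem; adapt them to your own data.

```python
import tensorflow as tf

# Load an EfficientNet backbone with ImageNet weights. The 224x224 input
# and the one-unit sigmoid head are assumptions for a binary problem.
base = tf.keras.applications.EfficientNetB0(
    include_top=False,        # drop the original 1000-class ImageNet head
    weights='imagenet',
    input_shape=(224, 224, 3),
)
base.trainable = False        # a common starting point: train only the new head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
```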
The numbers of layers and units are decided based on how we think about the function and efficiency of the process of predicting “y-hat”. To describe the process of building the structure of our neural network, we refer to the arrangement of layers and units as the architecture of the neural network. And as the professor has mentioned, designing the architecture so that each layer has fewer units than the previous layer is a good choice.
In my model, I get the best results when I set my NN architecture to 12 hidden layers with 120 neurons each.
Is it normal to use such a large number?
Hi, so does it usually mean that the more units we have in a layer, the more accurate the predictions from that layer will be? Or does that still depend, and it is not a rule?
Unfortunately, there is no absolute answer to the question. If it turns out to overfit, the answer is no; otherwise, maybe it is a yes.
Also, we don’t talk about just one layer but the whole NN when it comes to prediction accuracy, because we need the whole network to make and evaluate predictions.
If I wanted to check whether adding units helped, I would add them, train the model, and evaluate it on a separate cv dataset (the cv dataset is discussed in Course 2 Week 3). If it improved my metrics, it helped; otherwise it didn’t.
If it overfit, we might need to strengthen the regularization and see if we could turn it from not-helping to helping. However, there is a warning for every beginner: it is an excellent exercise to see how far changing a NN architecture can get us, but never spend too much time on just that, looking for a “miracle”. Instead, go to the data, because it also dictates how far we can get.
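As a sketch of that workflow, here is one way to compare a model before and after adding units on a held-out cv set, using synthetic stand-in data and arbitrary unit counts. A large gap between training and cv accuracy would be the overfitting signal mentioned above.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in data split into a training set and a cv set.
rng = np.random.default_rng(1)
X, y = rng.random((1000, 400)), rng.integers(0, 2, 1000)
X_train, y_train = X[:800], y[:800]
X_cv, y_cv = X[800:], y[800:]

def train_and_score(units):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(400,)),
        tf.keras.layers.Dense(units, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    model.fit(X_train, y_train, epochs=20, verbose=0)
    _, train_acc = model.evaluate(X_train, y_train, verbose=0)
    _, cv_acc = model.evaluate(X_cv, y_cv, verbose=0)
    return train_acc, cv_acc

for units in (15, 30):  # before vs. after adding units
    train_acc, cv_acc = train_and_score(units)
    # a large train/cv gap suggests overfitting -> strengthen regularization
    print(f"{units} units  train: {train_acc:.3f}  cv: {cv_acc:.3f}")
```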
Hey, I’m also interested in knowing more about units. Unless I missed something, so far (Course 2 Week 1) we haven’t been told what’s happening under the hood in each unit during the “fitting” phase. How do they work so that 2 units are not doing the same task?
Under the naive assumption that each unit implements the same algorithm, I would expect them to produce exactly the same output given the same input. That would make it pretty useless to have multiple neurons, so I suppose that 2 units somehow have different parameters.
So my question is: is each unit initialized with different initial values to achieve different results, or something like that? Is that random? Is there an arbitrary value? Something else? Is that topic going to be covered later in the course, or should I do my own research?
From your questions I think you already have a pretty good understanding!
Different initial values are exactly why neurons behave differently. I have written a post on how initializing neurons to the same values prevents them from achieving different results.
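Here is a small toy demonstration of that symmetry problem (not the linked post itself): when every weight in the network starts at the same constant, every unit in a layer receives the same gradient update, so the units never diverge; random initialization breaks that symmetry.

```python
import numpy as np
import tensorflow as tf

# Toy data, just enough to run a few training steps.
rng = np.random.default_rng(0)
X, y = rng.random((200, 4)), rng.integers(0, 2, 200)

def hidden_weights(initializer):
    # Both layers use the same kernel initializer so that, in the
    # constant case, the whole network starts perfectly symmetric.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(3, activation='relu',
                              kernel_initializer=initializer, name='hidden'),
        tf.keras.layers.Dense(1, activation='sigmoid',
                              kernel_initializer=initializer),
    ])
    model.compile(loss='binary_crossentropy', optimizer='sgd')
    model.fit(X, y, epochs=10, verbose=0)
    return model.get_layer('hidden').get_weights()[0]  # shape (4, 3): one column per unit

same_value = tf.keras.initializers.Constant(0.5)
print(hidden_weights(same_value))        # the 3 columns remain identical
print(hidden_weights('glorot_uniform'))  # random init: the columns diverge
```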
There are 2 videos (first, second) in a row (from DLS Course 2 Week 1): the first describes a well-known deep neural network problem and, at the end, mentions that we can address it by weight initialization, which is then discussed in the second video.
I know those videos are from the DLS, but I trust you can take away whatever you can from them.
This page has a list of initializers, which are mostly random initializers but with different kinds of “randomness”. I am sharing that page to show you some names you can refer to if you would like to do your own research. In particular, the left-hand side of the page gives a glance at those names, as below.
There are, for example, 4 random initializers whose names include “Uniform”, which means they generate random numbers with a uniform probability distribution over different ranges. Among those, GlorotUniform is the default choice for some TensorFlow layers; however, that does not mean one initializer is always superior to the rest. They each have their own origins, and it takes some research to find out that information. On the page, if you click into some of the initializers, you can see more detailed information about them, for example formulas and links to their reference papers.
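For reference, this is how those initializers are selected in Keras. GlorotUniform is the default kernel initializer for Dense layers, so the first two layers below use the same initializer family; the third swaps in a different one by name.

```python
import tensorflow as tf

# Dense uses GlorotUniform for its kernel by default, so the first two
# layers use the same initializer family; the third picks another one
# by its string name. The seed is only there for reproducibility.
default_init = tf.keras.layers.Dense(25, activation='relu')
explicit_glorot = tf.keras.layers.Dense(
    25, activation='relu',
    kernel_initializer=tf.keras.initializers.GlorotUniform(seed=0))
he_uniform = tf.keras.layers.Dense(
    25, activation='relu', kernel_initializer='he_uniform')
```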
I have the same question as @irenetheninja… how did our instructor know to use 25, then 15? I get the third layer… it’s simply returning a sigmoid. There were 20x20 pixels. Are the 25 input values random? Are they the result of a 5x5 matrix mapped on top of the 20x20 matrix, giving a ‘blockier’ version of the original scanned image?
I played with this lab a little, and I understand how, but not why, this worked. Surely there’s a heuristic (rule of thumb) about how to approach these?