Hyperparameter optimization

I want to ask about hyperparameter optimization. In the video lectures we are told to perform a grid search over a range of hyperparameters, but how do we decide the range of each hyperparameter? Please tell me how to decide the range for the number of layers and the number of neurons per layer.

Prof Ng goes into quite a bit of detail in this series of lectures. Note that there is a bit more subtlety to it than just doing a “grid search”. You can organize all your choices as a grid, but he also points out that it’s not efficient to exhaustively enumerate all the combinations on the grid, since the number of combinations grows multiplicatively and gets big very fast. He suggests doing random sampling of the grid instead, if I’m remembering what he said correctly.
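To make that concrete, here is a minimal sketch of random sampling versus exhaustive enumeration. The search space below is purely illustrative; the parameter names and value lists are my own placeholders, not anything from the course:

```python
import itertools
import random

# Hypothetical search space -- names and value lists are placeholders.
grid = {
    "num_layers": [2, 3, 4, 5],
    "neurons_per_layer": [16, 32, 64, 128, 256],
    "learning_rate": [0.3, 0.1, 0.03, 0.01, 0.003, 0.001],
}

# Exhaustive enumeration: every combination is a full training run.
all_combos = list(itertools.product(*grid.values()))
print(len(all_combos))  # 4 * 5 * 6 = 120 training runs

# Random sampling: spend a fixed budget on randomly chosen combinations instead.
budget = 15
samples = [{name: random.choice(values) for name, values in grid.items()}
           for _ in range(budget)]
for s in samples:
    print(s)  # each dict is one candidate configuration to train and evaluate
```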

The other subtlety is that for some parameters, you use a linear scale and for some you need a logarithmic scale. He describes various examples on that front.
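For example, the number of layers can reasonably be drawn on a linear scale, while the learning rate is better drawn on a log scale, along the lines of the learning-rate example from the lectures. A small sketch (the ranges here are just assumptions):

```python
import random

# Linear scale: fine when the plausible range is narrow, e.g. number of layers.
num_layers = random.randint(2, 6)

# Logarithmic scale: sample the exponent uniformly, so 0.0001-0.001 gets as
# many draws as 0.01-0.1. This mirrors the lecture's learning-rate example:
# r = -4 * rand(); alpha = 10 ** r.
exponent = random.uniform(-4, 0)
learning_rate = 10 ** exponent  # uniform in log space over [1e-4, 1]

print(num_layers, learning_rate)
```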

But specifically for the number of layers and the number of neurons per layer, there are some general rules: in a feed forward network, the number of neurons generally decreases as you go deeper through the layers. And for the first hidden layer, it typically doesn’t make sense to have more outputs than there are inputs. Beyond those general guidelines, his advice is to start by finding another network that solves a problem at least somewhat similar to yours, start from that architecture, and then adjust as needed. So having experience helps.

But in all the points above, I’m just trying to express my interpretation of what I believe Prof Ng was saying in the lectures. It’s always worth listening to what he says again: he covers quite a lot and it’s not easy to catch everything the first time through.

Hi sir,
If I am starting hyperparameter optimization and my hyperparameters are the number of layers and the number of neurons, and say I am doing random sampling: in order to do random sampling I need a range. My question is, how do I set the range for these hyperparameters? What is the rationale behind it?

If you are only working with two hyperparameters that both have linear scales, then it probably is not necessary to do random sampling. You can just do something like a binary search on both scales.
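To illustrate, here is a rough coarse-to-fine sketch in that spirit (my own construction, not from the lectures; `train_and_score` is a hypothetical function you would supply that trains a model with the given architecture and returns a validation score, higher being better):

```python
import numpy as np

def coarse_to_fine(train_and_score, layers_range=(1, 8), neurons_range=(8, 256),
                   rounds=3, points=3):
    """Successively narrow both ranges around the best configuration found so far."""
    best = None  # (score, num_layers, num_neurons)
    for _ in range(rounds):
        # A few evenly spaced candidates inside the current ranges.
        layer_grid = np.unique(np.linspace(*layers_range, points).astype(int))
        neuron_grid = np.unique(np.linspace(*neurons_range, points).astype(int))
        for L in layer_grid:
            for n in neuron_grid:
                score = train_and_score(int(L), int(n))
                if best is None or score > best[0]:
                    best = (score, int(L), int(n))
        # Zoom in: narrow both ranges around the best candidate so far.
        _, L_best, n_best = best
        layers_range = (max(1, L_best - 1), L_best + 1)
        neurons_range = (max(1, n_best // 2), n_best * 2)
    return best

# Toy usage: pretend the "true" optimum is 3 layers of 64 neurons.
print(coarse_to_fine(lambda L, n: -abs(L - 3) - abs(n - 64) / 64))
```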

For the ranges, I’m not aware of any hard and fast rules, other than what I mentioned above: in a fully connected feed forward net, the number of neurons decreases with each successive hidden layer. The reason is that you are distilling information at each layer: e.g. you start from pixels and end up with an answer that is a classification of some sort, which typically has many fewer possible values than the input has dimensions.

The high level answer that Prof Ng discusses at several points in C1 and here in C2 is that you start by using your experience or other people’s experience to pick a network architecture that was successful in solving a problem as similar to yours as you can find. Start with one of those architectures, run experiments to see how well it works, and then adjust from there, using the principles that Prof Ng has talked about here (how to deal with high bias or high variance).

If you’re starting completely from scratch with no comparable examples, then you start with the inputs and the outputs. How big are your input vectors? Let’s say you are using 64 x 64 RGB images, so each vector will have 64 * 64 * 3 = 12288 values. And what is your output? Are you doing a binary classification (cat or not a cat), or a multiclass classification with 4 or 10 or 100 output classes? So the number of inputs is 12288 and the number of outputs is some number between 1 and 100 (in my artificial example).

So that gives you the outer range for the number of neurons per layer: 12288 down to 1. Although in the examples we see here in DLS C1 and C2, the first hidden layer typically has far fewer than 12288 neurons, probably something between 50 and 200. Notice that in the DLS C1 W4 A2 assignment, the dimensions were:

layers_dims = [12288, 20, 7, 5, 1] # 4-layer model
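To make that concrete, here is one way you could randomly sample candidate architectures that respect those guidelines. This is a sketch only; the ranges and the shrink rule are my own assumptions, not anything Prof Ng prescribes:

```python
import random

n_x = 64 * 64 * 3   # 12288 input values for a 64 x 64 RGB image
n_y = 1             # 1 output unit for binary classification (cat / not cat)

def sample_architecture(max_hidden_layers=5, first_layer_range=(50, 200)):
    """Draw a random layers_dims list with non-increasing hidden-layer widths."""
    num_hidden = random.randint(1, max_hidden_layers)
    width = random.randint(*first_layer_range)
    dims = [n_x]
    for _ in range(num_hidden):
        dims.append(width)
        # Shrink (or keep) the width: we are distilling information layer by layer.
        width = random.randint(max(n_y, width // 4), width)
    dims.append(n_y)
    return dims

print(sample_architecture())  # e.g. [12288, 142, 57, 12, 1]
```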

Of course the other point that I didn’t mention above is that you also have to consider how you will evaluate the answers. The two “axes” of concern are a) prediction accuracy and b) compute cost, both of training and of prediction. How much weight you place on each of those depends on your application. If the predictions are going to have potentially life-changing effects (e.g. medical diagnosis), then you will be willing to put up with much higher compute costs in order to meet your prediction accuracy standards than if you are just predicting which link someone will click next on a website.

Sorry if this is not a very satisfying answer. Maybe we get lucky and someone else knows some more concrete guidelines. But my interpretation of what Prof Ng has said on this topic in general is that there are not very many hard and fast solutions and it requires a combination of experience and experimentation to arrive at a good solution.

I agree with Paul. It is like cooking: there is no universal recipe for every dish (machine learning problem). If you have a specific dish, you might ask someone with experience for a recipe, but if there is no one to ask, you will need to keep tasting it while cooking. In the same way, we might not be able to set up a full grid search and let it run for a long time; instead, we need to inspect from time to time to narrow down to a reasonable range for each hyperparameter - tasting it while cooking.

With food, we know the lethal dose of salt, just as we know not to set the learning rate to 1,000 when the weights are randomly initialized in the usual way and the data is properly normalized. But the best range for the salt depends on many things, all of which you will see by inspecting the trial-and-error process.

I think we need to keep in mind the lectures about bias and variance - they are the taste.

Cheers,
Raymond
