A deterministic option better than random sampling or grid search

Regarding course 2 week 3 slide 4 “Try random values: Don’t use a grid,” Andrew’s justification for using random rather than grid values seems to make sense for the given example where epsilon barely matters and alpha matters a lot: if we use grid search we only assess 5 unique values of alpha (and epsilon) and thereby “lost an opportunity” to have 25 values of alpha (and epsilon) instead.

However, I don’t see why we would attempt to achieve this goal by random sampling. In fact it is trivial to deterministically specify 25 unique, equally spaced values of alpha and 25 unique equally spaced values of epsilon using only 25 points. (Unless there is some benefit to injecting noise into this process?) Here’s an example of a deterministic choice where for simplicity’s sake I’m assuming range of each is 1 to 6.
(1,1), (1.2,2), (1.4,3), (1.6,4), (1.8,5)
(2,1.2), (2.2,2.2), (2.4,3.2), (2.6,4.2), (2.8,5.2)
(3,1.4), (3.2,2.4), (3.4,3.4), (3.6,4.4), (3.8,5.4)
(4,1.6), (4.2,2.6), (4.4,3.6), (4.6,4.6), (4.8,5.6)
(5,1.8), (5.2,2.8), (5.4,3.8), (5.6,4.8), (5.8,5.8)

In contrast, the grid shown in the diagram would be something like:
(1,1), (1,2), (1,3), (1,4), (1,5)
(2,1), (2,2), (2,3), (2,4), (2,5)
(3,1), (3,2), (3,3), (3,4), (3,5)
(4,1), (4,2), (4,3), (4,4), (4,5)
(5,1), (5,2), (5,3), (5,4), (5,5)

Am I wrong?

Hello @am003e,

It is not about right or wrong. Let’s look at your first (left) and second (right) set of searching hyperparameters this way:

From the left plot, it seems to me you are injecting some prior knowledge about the hyperparameter 1 (HP1) and HP2 that the HP1 and HP2 has some linear relations. This is why it is planned to ignore large HP2 when HP1 is small.

I don’t mean you really actually had thought that HP1 and HP2 have any relations, but this is implied in your first set of searching hyperparameters. OK? It is just implied.

From the right plot, it seems to me that the two HPs are uncorrelated.

Therefore, your first set isn’t completely an example to demo ONLY the idea that is brought up in that slide, but perhaps unintentionally added some extra information about a prior assumption of the relation between HP1 and HP2.

Would you like to think about an alternative for the first set that has no relations between HP1 and HP2? Perhaps you will end up with randomizing them? Give it a try :wink:

Cheers,
Raymond

PS: if you have really thought an alternative, it would be easier to draw it out like mine. Don’t make me draw it :laughing:

I like your second to last sentence about randomizing, because the values I gave were just an example. The point is that it is trivial to deterministically generate 25 values of each hyperparameter that are perfectly equally spaced across the range sampled.

Indeed, you could randomize participation in each pair while maintaining the deterministic grid sampling of each hyperparameter. So to ask my question a different way, is the random selection of values actually necessary or advisable (as the lecture states), or can we deterministically specify 25 values of each variable and randomly pair them?

Hello @am003e,

You can. Again, there is no right or wrong. Even your first set is not wrong, but it implied some relations. Put it this way, if the true best HP1 and HP2 are 3 and 3 respectively, both the first and the second sets are fine. The point of searching is merely searching. You just want to search for where the best points are. If you really know a prior relation between HP1 and HP2, then I would encourage you to inject it into your searching list, instead of randomizing them. For example, in building a decision tree, the “max. depth” HP is related to the “max. number of leaves” HP. Who will search for “max. leaves = 6” when “max. depth = 1”?

If you deterministically specify 25 values of each variable and randomly pair them, and turns out that by chance, they concentrate in a narrow region, then it just means you are searching that narrow region, and is that wrong? Perhaps you won’t like it if your goal is to search over the whole rectanglar region.

We should have a goal first (like searching over the whole rectanglar region), and then we come up with a strategy. Randomizing both of them (like the slide suggests) has a good chance of spreading roughly evenly. If you tell me, it is also your goal to search over the whole rectanglar region and your way also has a good chance, then you can prove it, and then it’s fine. Because your way served your purpose.

Now, what is your purpose? And do you think that “specify 25 values of each variable and randomly pair them?” serves your purpose? You can use numpy to actually do it and draw the list like mine. Draw 30 different plots and see if your strategy is fine.

Raymond

Good questions, thanks for raising them. My purpose here is to address the stated weaknesses of grid search as stated in the lecture for slide 4, which I believe my proposals do better than the slide 4 proposal. However, I could be persuaded that my proposal is in the end similar enough to slide 4 that it doesn’t make much difference… which I think is what you are driving at.