Sequence/process from training to testing with one hyperparameter, and with more than one

Hello -
This is a two-part question. The first part is more of a confirmation of what is happening (C2, W3); the second is about how to extend that process to the case where multiple hyperparameters need to be tuned. Any thoughts would be hugely helpful for me. I’ve put together some pieces not explicitly stated in the class and just want to make sure I’m thinking about things correctly.

Part 1: Confirmation of the process with a single tuning parameter
In the Class 2, Week 3 assignment, in the section “7 – Iterate to find optimal regularization value”, there is a for loop where the models[i] models are created and trained with the .fit(…) function, using a different lambda for each loop. This gives different learned weights for each loop. Then, those learned weights for each lambda are used to get predictions (in the plot_iterate(…) function) for both the training data and the cross validation data. Finally, the err_train[i] and err_cv[i] variables (from the plot_iterate function) are used to make the learning curve plot.
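To make sure I’m describing it right, here is roughly how I picture that loop. This is just my own paraphrase with made-up data and a hypothetical build_model helper, not the assignment’s actual code:

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-ins for the assignment's data splits.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(400, 2)), rng.integers(0, 6, size=400)
X_cv,    y_cv    = rng.normal(size=(100, 2)), rng.integers(0, 6, size=100)

def build_model(lam):
    """Small classifier whose dense layers use L2 regularization strength `lam`."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(120, activation="relu",
                              kernel_regularizer=tf.keras.regularizers.l2(lam)),
        tf.keras.layers.Dense(40, activation="relu",
                              kernel_regularizer=tf.keras.regularizers.l2(lam)),
        tf.keras.layers.Dense(6, activation="linear"),
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    )
    return model

lambdas_ = [0.0, 0.001, 0.01, 0.1, 1.0]
models, err_train, err_cv = [], [], []

for lam in lambdas_:
    # A different lambda per loop gives a different set of learned weights.
    model = build_model(lam)
    model.fit(X_train, y_train, epochs=50, verbose=0)
    models.append(model)

    # Classification error = fraction of misclassified examples.
    yhat_train = np.argmax(model.predict(X_train, verbose=0), axis=1)
    yhat_cv = np.argmax(model.predict(X_cv, verbose=0), axis=1)
    err_train.append(np.mean(yhat_train != y_train))
    err_cv.append(np.mean(yhat_cv != y_cv))

# err_train and err_cv vs. lambdas_ are what get plotted.
```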

Questions:
1.) That all makes sense to me; if there is something incorrect with that process/logic, could someone please point it out?
2.) The next step after creating a learning curve isn’t explicitly shown, but it seems pretty important. I assume we would use judgement to pick the lambda that is best, such as a lambda of 0.01 in this case (models[2]). Then, we use the learned weights stored in models[2] (the model trained with that lambda of 0.01) and run the .predict(…) method on the test data, via

probs = tf.nn.softmax(models[2].predict(X_test)).numpy()

This could be followed by calculating the error for the test data (using y_test), and that would be considered our final test error. Is the next step / logic that I just described the correct way to go about it?
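To make the error calculation concrete, here is what I have in mind; y_test is just my stand-in for the test labels and this continues from the probs line above, it isn’t the assignment’s code:

```python
import numpy as np

# probs comes from the softmax/predict line above; y_test stands in for the test labels.
yhat = np.argmax(probs, axis=1)      # predicted class for each test example
test_err = np.mean(yhat != y_test)   # fraction misclassified = final test error
```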

Part 2: How to deal with multiple hyper-parameters
The great thing about the C2, W3 assignment is that, with only one hyperparameter, it was easy to know which learned weights to use: the ones learned with the lambda we chose (0.01).

Question:
1.) But what if you are tuning both lambda and alpha (the learning rate)? Here is my thought; can someone please let me know if anything is wrong with this thinking? Make a learning curve plot of the training and cv error for each tuning parameter being varied, one for lambda and one for alpha (2 plots, 2 curves on each plot). From those plots, if we chose a lambda of 0.01 and an alpha of 0.001 (for example), then the learned weights for those two models would be different from each other, since they are two different models. So, would it be correct to create one additional model using the chosen lambda and alpha values and run .fit(…) on the training data, which would now give you new learned weights? Then the last step would be to do a .predict(…) using the new learned weights / model on the test data (you could do it on the cv data too just to see, I guess), then calculate an error for the test data. Is that the correct way to go, or am I missing anything?
Thanks once again!

Hello Navead,

No, I don’t see anything incorrect there.

The judgement is “lowest cv error”, so I think 0.2 is the best lambda.

[Screenshot of the cv errors for each lambda from the assignment]

This method is ONLY good as long as the two hyperparameters in question (alpha and lambda) are independent of each other. However, changing lambda changes the cost space, and consequently the choice of alpha will be affected, because alpha determines the step size in that cost space.

Given that they are not independent, in your idea, when you make a plot over various lambdas, which alpha value would you fix? This is a tricky question because different alpha values may result in different lambda plots. If different plots result in different optimal lambdas, then we don’t know which plot we should base our decision on.

Therefore, instead of fixing an alpha and searching for the best lambda, we search for both alpha and lambda at the same time. A basic way to do this search is called grid search. In a grid search, for example, we set a list of alpha-lambda pairs, such as [ (0.01, 1), (0.01, 2), (0.1, 1), (0.1, 2) ]. Then, for each of the pairs, we build a model with the training set and evaluate it with the cv set. We select the best model according to the 4 evaluation results (corresponding to the 4 pairs of candidate hyperparameters). Lastly, we evaluate our selected model with the test set to speak about its generalizability.
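Just as an illustration, a minimal sketch of such a grid search could look like this. The data and the tiny Keras classifier below are made up (not the assignment’s code); build_model is a hypothetical helper:

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-ins for the splits (test data is held out separately).
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(400, 2)), rng.integers(0, 6, size=400)
X_cv,    y_cv    = rng.normal(size=(100, 2)), rng.integers(0, 6, size=100)

def build_model(alpha, lam):
    """Tiny classifier: learning rate `alpha`, L2 regularization strength `lam`."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(25, activation="relu",
                              kernel_regularizer=tf.keras.regularizers.l2(lam)),
        tf.keras.layers.Dense(6, activation="linear"),
    ])
    model.compile(
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        optimizer=tf.keras.optimizers.Adam(learning_rate=alpha),
    )
    return model

grid = [(0.01, 1), (0.01, 2), (0.1, 1), (0.1, 2)]   # (alpha, lambda) pairs

results = []
for alpha, lam in grid:
    model = build_model(alpha, lam)
    model.fit(X_train, y_train, epochs=50, verbose=0)    # train with the training set
    cv_loss = model.evaluate(X_cv, y_cv, verbose=0)       # evaluate with the cv set
    results.append((cv_loss, alpha, lam))

best_cv_loss, best_alpha, best_lam = min(results)         # lowest cv loss wins
print(f"best alpha={best_alpha}, best lambda={best_lam}, cv loss={best_cv_loss:.3f}")
```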

Cheers,
Raymond

Adjusting the learning rate doesn’t change the quality of the model you learn; it only determines how many iterations you have to run to reach convergence.

It isn’t really a hyper-parameter in the same way that a polynomial degree or lambda value is.

Good to keep in mind, thanks.

@rmwkwok -
I am looking into grid search now. I was trying to create this type of functionality myself with a series of functions to evaluate all of the potential combinations of tuning parameters, so thanks for pointing this out and saving me time.

Thanks for looking at my thought process. It looks like you thought it was okay for both situations (one tuning parameter and more than one tuning parameter), except for the items you pointed out. Is that a correct assumption? Sorry if that was supposed to be obvious from your answer, but I thought I would double check. I’m particularly interested in knowing whether the logic in this part of my question is sound:

“…would it be correct to create one additional model using the chosen lambda and alpha values (and whatever other tuning parameters) from your tuning analysis, and using .fit(…) on the training data which would now give you new learned weights. Then the last step would be to do a .predict(…) using the new learned weights / model on the test data (you could do it on cv data too just to see I guess), then calculate an error for the test data.”

I think it is obvious that my quoted question above is correct (I hope it is anyways) now that I’ve re-read through some of your other answers on the site.

A second question is: when does k-fold cross validation come into the picture? Is it done before, during, or after you have chosen your tuning parameters? What I don’t get about k-fold cv is that you are varying the training and cv data sets each time, so I would assume you would already want your tuning parameters to be set by then. Or do you use the k-fold process to help in tuning the parameters? Nested cross validation seems to be the answer I need to look into more. Sorry if this opens a different topic; I can repost this portion somewhere else.

Thanks!

Hello Navead,

Yes. Basically, we seldom only tune one hyper-parameter.

Yes, we need to tune all hyperparameters at the same time, instead of one at a time.

I believe you are asking about my answers above?

Yes, after we pick the best combination of hyper-parameters, we train one final model and treat that one as our final output. We evaluate this final output with the test data. However, we can’t evaluate it with the cv data. The purpose of the test data is to provide the chosen model with a set of completely unseen data to measure its generalizability. Since our model was chosen with the help of the cv dataset, the cv data is not unseen. Evaluating our chosen model with any part of the cv data will over-estimate its performance.
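Continuing the grid-search sketch from above (X_test and y_test here are hypothetical stand-ins for the held-out test split):

```python
import numpy as np

# Hypothetical held-out test split, never touched during tuning.
rng = np.random.default_rng(1)
X_test, y_test = rng.normal(size=(100, 2)), rng.integers(0, 6, size=100)

# Train one final model with the chosen hyperparameters on the training set.
final_model = build_model(best_alpha, best_lam)   # from the grid-search sketch
final_model.fit(X_train, y_train, epochs=50, verbose=0)

# Report generalizability with the test set only; the cv data already helped
# choose the hyperparameters, so it no longer counts as unseen data.
test_loss = final_model.evaluate(X_test, y_test, verbose=0)
print(f"final test loss: {test_loss:.3f}")
```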

During hyper-parameter tuning. For each candidate combination of hyper-parameters, instead of splitting our data (with the test data already held out) once, we split it N times. Each time, we get a training set and a cv set. We train a model with that training set and evaluate it with that cv set. If N = 5, for example, we will have 5 evaluation results based on 5 different cv sets. Then we can aggregate these results into a final performance number for that candidate combination, to be compared with the other candidates.
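As a rough sketch (reusing the hypothetical build_model and made-up data from the grid-search example above), a 5-fold version of the search could look like this:

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical tuning data, with the test split already held out elsewhere.
rng = np.random.default_rng(2)
X, y = rng.normal(size=(500, 2)), rng.integers(0, 6, size=500)

grid = [(0.01, 1), (0.01, 2), (0.1, 1), (0.1, 2)]   # (alpha, lambda) candidates
kf = KFold(n_splits=5, shuffle=True, random_state=0)

results = []
for alpha, lam in grid:
    fold_losses = []
    for train_idx, cv_idx in kf.split(X):
        model = build_model(alpha, lam)               # fresh model for every fold
        model.fit(X[train_idx], y[train_idx], epochs=50, verbose=0)
        fold_losses.append(model.evaluate(X[cv_idx], y[cv_idx], verbose=0))
    # Aggregate the 5 cv scores into one number for this candidate.
    results.append((np.mean(fold_losses), alpha, lam))

best_mean_cv_loss, best_alpha, best_lam = min(results)
```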

C2 W3 is the perfect place. :wink:

I am glad to hear that. After your own implementation, please also try sklearn’s. It’s good to be familiar with existing options.
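For example, sklearn’s GridSearchCV wraps the grid search and the k-fold loop in one call. A sketch with made-up data; note that in sklearn’s MLPClassifier, alpha is the L2 penalty (the course’s lambda) and learning_rate_init is the learning rate:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Hypothetical data, with the test split held out elsewhere.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 2)), rng.integers(0, 6, size=500)

param_grid = {
    "alpha": [1, 2],                    # L2 regularization candidates
    "learning_rate_init": [0.01, 0.1],  # learning rate candidates
}

search = GridSearchCV(
    MLPClassifier(hidden_layer_sizes=(25,), max_iter=500),
    param_grid,
    cv=5,   # 5-fold cross validation for every candidate pair
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```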

Cheers,
Raymond

Thanks! All extremely helpful.

You are welcome, Navead!

Hi @rmwkwok -
I’ve been revisiting this discussion in the past couple of days, and I thought of a couple of things I was hoping to get clarification on.

1.) What is the primary purpose of making learning curves with both the training and CV data error scores? I guess it is to look for bias and variance behavior, and not primarily to get the best CV scores, although that is a benefit.

2.) I assume that when looking at multiple folds, we are looking at the error/accuracy scores of the model on each of the CV sets (for each tuning parameter combination), and that is all we use to assess the accuracy of the model, is that correct? So, we don’t really need to make any learning curves if we are only interested in the scores for the CV data.

3.) If indeed we are just using the error scores on the CV data for each fold, then I guess we can just average those scores for each parameter combination, is that the correct assumption? Using just the minimum might be misleading. When discussing “the scores of our model”, should we use the average CV score or the score on the final test data (which is not part of the train/cv data used for tuning), or maybe both?

Thanks again!

Another significant reason for making the learning curve is to get a hint as to whether the training set is large enough.

If the training cost never reaches a stable minimum value, the data set is too small.
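One way to get that hint is to score the model on growing subsets of the training data, for example with sklearn’s learning_curve utility. A sketch with made-up data and a simple classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Hypothetical data; if the cv score is still improving as the training set
# grows, collecting more data is likely to help.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 2)), rng.integers(0, 2, size=500)

sizes, train_scores, cv_scores = learning_curve(
    LogisticRegression(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8), cv=5,
)
print(sizes)
print(train_scores.mean(axis=1))   # average training score over the 5 folds
print(cv_scores.mean(axis=1))      # average cv score over the 5 folds
```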

Hello @naveadjensen,

  1. My immediate answer would be as you said: bias and variance behavior. A learning curve compares a training set curve with a cv set curve, whereas “getting the best CV scores” compares among all the cv set scores and has nothing to do with the training sets.
    And then @TMosh’s answer reminded me of a different angle. I started to ask myself what else I would look at, but of course I had never tried to list these out, so I image-googled “troubleshooting funny neural network learning curve” and found a few examples that would definitely make me wonder whether something else is wrong:

    [example images of problematic learning curves from the search]

    I also found this nice blog post, which explains how such curves can sometimes lead us to other potential problems.

  2. “that is all we use to assess the accuracy of the model, is that correct?” Yes. For a 5-fold case, we have 5 scores to talk about the goodness of the model.
    “we don’t really need to make any learning curves if we are only interested in the scores for the CV data.” Agreed.

  3. “is that the correct assumption?” Yes!
    “Using just the minimum might be misleading” Of course!
    “or the score on the final test data…” This is what I would use. Some people don’t make a test set at all, in which case the cv score will be the choice.

Cheers,
Raymond