Hi,
I've completed the ‘Structuring Machine Learning Projects’ course, but I still have a few remaining questions. Also, if anyone can think of any practical / real-world examples that apply to my questions, that'd be great!
- Selecting promising hyperparameters for tuning -
In the previous course the ‘random sampling’ technique came up for selecting/zoning in on hyperparameter ranges to tune. I wondered whether doing that on a much smaller subset of the full data set might also work. For example, say we have a data set with 10,000,000 points.
Is it reasonable then to train a model on, say, a 100k/10k train/test split, iterating through various values of the hyperparameters to see which ones are most sensitive to change? (Which ones those are would obviously differ from model to model.) The thinking is that, on the much larger model, you would at least have an inkling of which hyperparameters are worth iterating on and tuning.
Or is this logic completely wrong, in that a parameter may be quite sensitive on the small subset yet become effectively moot on the full data set, and vice versa, a previously ‘ineffective’ parameter suddenly becomes very efficacious?
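To make the idea concrete, here is a minimal sketch of what I have in mind: random-sample hyperparameters (on a log scale, as in the course) on a small pilot subset, look at which ranges move the dev score, and only then tune on the full data. The dataset, sizes, and the logistic-regression stand-in are all placeholders, not a claim about what the "right" setup is:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend this is the "full" dataset; in the real case it might be 10M points.
X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)

# Small pilot subset (the 100k/10k idea, scaled down here).
X_small, _, y_small, _ = train_test_split(X, y, train_size=5_000, random_state=0)
X_tr, X_dev, y_tr, y_dev = train_test_split(X_small, y_small, test_size=0.1, random_state=0)

# Randomly sample the regularization strength on a log scale.
results = []
for _ in range(20):
    C = 10 ** rng.uniform(-4, 2)  # sample the exponent uniformly
    model = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    results.append((C, model.score(X_dev, y_dev)))

# Sort by dev accuracy to see which region of C looks promising/sensitive.
for C, acc in sorted(results, key=lambda r: -r[1])[:5]:
    print(f"C={C:.4g}  dev_acc={acc:.3f}")
```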
-
I've done cross-validation in other settings/courses, but am still trying to understand Prof. Ng's structuring of the Dev set. Is it only used for validation (i.e., once validation is performed, do we lump it back into the train set for the full training run), or is the Dev data kept out of training entirely?
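To make the question concrete, these are the two interpretations I'm trying to distinguish (the numbers and model are just placeholders):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

best_C = 1.0  # say this was chosen by comparing dev-set scores

# Interpretation A: refit on train + dev once hyperparameters are fixed.
final_a = LogisticRegression(C=best_C, max_iter=1000).fit(
    np.concatenate([X_train, X_dev]), np.concatenate([y_train, y_dev]))

# Interpretation B: the dev set stays out of training for good.
final_b = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)

print(final_a.score(X_test, y_test), final_b.score(X_test, y_test))
```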
-
Custom cost functions – I know Prof. Ng has stressed that many in the Deep Learning community look down on much of the manual feature generation/selection that used to be common, preferring to just let the network ‘figure it out’. Honestly, in some circumstances this surprised me a bit. Typical loss functions include cross-entropy/log loss, MAE, MSE, Hinge, etc.
But as of yet I have seen little about hand-designing a loss function specific to the dataset and the problem being solved, even though in the end the loss is the thing the network is actually optimizing.
Are custom loss functions actually a thing? And if they are a ‘bad idea’, why? (Granted, you'd have to both really know what you are doing and have a good grasp of the particular problem at hand, specific to the data set and the question being asked.)
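For what it's worth, here is the kind of thing I mean by a "custom" loss: an asymmetric squared error that penalizes under-prediction more than over-prediction. The 3x weighting and the tiny model are made up purely for illustration:

```python
import tensorflow as tf

def asymmetric_mse(y_true, y_pred):
    err = y_true - y_pred
    # Penalize under-prediction (err > 0) three times as much as over-prediction.
    weight = tf.where(err > 0, 3.0, 1.0)
    return tf.reduce_mean(weight * tf.square(err))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss=asymmetric_mse)
# model.fit(X, y, epochs=10)  # trained exactly like any built-in loss
```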
- With regard to transfer learning, Prof. Ng suggests one could take an existing model and perhaps only need to remove/retrain the last one or two layers. However, it is really not clear to me how you would do that, i.e., train new data on only the final layers. Even in that case, when the new data goes in, are you still doing backprop on the entire set of layers/weights?
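Here is roughly what I imagine "retrain only the last layer or two" looks like in Keras; the MobileNetV2 base and the 10-class head are just placeholders, and my (possibly wrong) understanding is that freezing the base means gradients still flow through it but only the new head's weights get updated:

```python
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # pretrained weights are not updated during training

inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)  # new task-specific head

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(new_X, new_y, epochs=5)  # only the Dense head's weights change
```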
Further, while I haven't gotten to LLMs yet, I have heard about people doing something very similar with ‘fine-tuning’. However, my understanding is that even with the open-source offerings out there, all you really have are the weights, typically not the entire model and original dataset. So how are they doing this?
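My (again, possibly wrong) picture of how this works, sketched with Hugging Face transformers, is below; "gpt2" is just a stand-in for any open-weights model and I haven't verified this end-to-end:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder for any released open-weights model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)  # loads the released weights

# The architecture ships with the library/weights, so you can keep training
# on your own data even without the original training set.
inputs = tokenizer("Example fine-tuning text.", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()  # gradients w.r.t. the pretrained weights
```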
Any thoughts would be appreciated.