A few theoretical/practical questions related to structuring projects

Hi,

So I have been able to complete the ‘Structuring Machine Learning Projects’ course, but I still have a few remaining questions. Also, if anyone can think of any practical / real-world examples that apply to my questions, that’d be great!

  1. Selecting attractive parameters for hyperparameter tuning

While the previous course covered the ‘random sampling’ technique for selecting/zoning in on hyperparameter ranges to tune, I wondered if using a much smaller subset of the full data set at the same time might also work? For example, say we have a data set with 10,000,000 points.

Is it reasonable then to train a model on, say, a 100k/10k train/test split, iterating through various values of the hyperparameters to see which ones are ‘most sensitive’ to change? Obviously which ones those are would likely differ between models. The thinking is that, on the much larger model, you would then at least have an inkling of which ones you should iterate on and tune.

Or is this logic completely wrong? That is, a parameter may be quite sensitive on this small subset, yet become effectively ‘moot’ on the full data set, and vice versa, a previously ‘ineffective’ parameter may suddenly become very efficacious?

  2. I’ve done cross-validation in other settings/courses, but am still trying to understand Prof. Ng’s structuring of the Dev set. Is it only used for validation? (i.e. once validation is performed, do we lump it back into the train set for the full training run?) Or is the data in the Dev set kept excluded from training?

  3. Custom cost functions: Though I know Prof. Ng has stressed many times that much of the Deep Learning community looks down on manual feature generation/discrimination, preferring to just let the network ‘figure it out’, honestly this surprised me a bit in some circumstances. I mean, typical loss functions include cross-entropy/log loss, MAE, MSE, Hinge, etc.

But as of yet I have seen little about hand-designing a loss function specific to the dataset and the problem being solved. In the end, the network is still the thing performing the optimization.

Are custom loss functions actually a thing? And if not, or if it is a ‘bad idea’, why? (Granted, you’d have to both really know what you are doing and have a good grasp of the particular problem at hand, specific to the data set and the question being asked.)

  4. With regard to transfer learning, Prof. Ng suggests one could take an existing model and perhaps only need to remove/retrain the last one or two layers. However, it is really not clear to me how you would do that, i.e. inject/train new data on the final layers. Even in that case, if the data enters at that point, are you still doing backprop on the entire set of layers/weights?

Further, while I haven’t gotten to LLMs yet, I have heard about people doing something very similar with ‘fine-tuning’. However, my understanding is that even with the open-source offerings out there, all you really have are the weights, typically not the entire model and original dataset. So how then are they doing this?

Any thoughts would be appreciated.


@Nevermnd

Just a suggestion: if you have multiple doubts or queries that are not closely related, post them as separate threads. They are easier to read and respond to that way, and each query tends to get a lengthier response of its own :slight_smile:

Regards
DP


@Deepti_Prasad I understand. This particular course was more ‘conceptual’ than the others, so I figured fewer people would be asking about it.

Further, I certainly did not expect anyone to try and respond to all the questions. It was more that someone might think ‘Oh, well, I have a thought on your Q2’, etc.


Lol :+1: Will respond after going through your post completely :grin:

Splitting a dataset based on the selective attractiveness of parameters (basically, you mean in relation to a particular obvious feature?) is again like hand-crafting the ‘best’ model and not letting the training model learn anything. Random splitting gives you a better shot here.

But if you are splitting the data based on identity, like cats vs. dogs, that is done.

Randomness is generally preferred so that the model can be tuned on the dataset, whether that is a smaller subset, a larger subset, or the full dataset.

I am a bit confused by your question. Are you asking about the importance of the larger dataset versus a smaller subset with respect to selecting parameters for model training?

Just want to confirm that my understanding matches what you are framing.

Regards
DP

The Dev dataset is the same as the cross-validation set or validation set.

Why did this doubt occur? Can I ask?
The reason I am asking is that Prof. Ng clearly mentions that a dataset is divided into training, dev (cross-validation), and test sets based on a ratio, and the training set usually has a much higher ratio compared to the cross-validation and test sets, because those other two sets are only used to check how well the model trained on the training set is doing.

That is, the cross-validation set is used during training to compare how well the model should do, and the test set then shows how well the model finally did after that comparison against the cross-validation set.
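
For example, here is a minimal sketch of such a ratio-based split (assuming scikit-learn’s train_test_split and some synthetic data; the 80/10/10 ratio is just illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset (illustrative only).
X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# First carve off 20% of the data, then split that 20% in half,
# giving roughly an 80/10/10 train / dev (cross-validation) / test split.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
```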

@Deepti_Prasad Yes, I am suggesting that I take a much smaller subset of my complete dataset and iterate over different values of the parameters to see which ones are most responsive. Is it reasonable to expect that there is something about the nature of the data in the subset (which is presumably of the same dimensions as the full data set) telling me that these ought to be the parameters to focus on tuning? And again, the key point: I don’t mean the actual value of, say, α that I get on my smaller subset, only the fact that, for this model/data, tweaking α seems to produce a strong response whereas, say, λ turns out not to.

Or is it same data, but different size, so all bets are off?


A custom loss function is a real thing :upside_down_face: and is used when the standard losses don’t give good model performance.

You will come across this in the TensorFlow Advanced Techniques specialisation, where you create a custom loss.

A custom loss function can be created by defining a function that takes the true values and predicted values as required parameters. The function should return an array of losses. The function can then be passed at the compile stage.
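
A minimal sketch of what that can look like in TensorFlow/Keras (the asymmetric-MSE loss and the tiny model here are hypothetical, purely for illustration):

```python
import tensorflow as tf

# Hypothetical custom loss: an asymmetric MSE that penalizes
# under-prediction twice as heavily as over-prediction.
def asymmetric_mse(y_true, y_pred):
    y_true = tf.cast(y_true, y_pred.dtype)
    error = y_true - y_pred
    weights = tf.where(error > 0, 2.0 * tf.ones_like(error), tf.ones_like(error))
    # Return one loss value per example, as Keras expects.
    return tf.reduce_mean(weights * tf.square(error), axis=-1)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])

# The custom function is passed at the compile stage like any built-in loss.
model.compile(optimizer="adam", loss=asymmetric_mse)
```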

Regards
DP


@Nevermnd

This is actually done, but only when it is really required, for example if there is very little data for the dev set in comparison to the training or test set. In such cases we can do what you are stating.

But one also needs to understand when and why we do this. This could be done when:

  1. there is less data to cross-validate
  2. the dataset is imbalanced (not taken too literally)
  3. you are not seeing better model performance
  4. there are other, unknown reasons

Regards
DP

I am not sure what Prof. Ng stated, as it has been a long time since I did the course.

But transfer learning is basically reusing a trained model while selecting which layers to keep. We do not remove the layers; rather, we freeze the layers we don’t want to update, feed our new data into the new model architecture, train it, and then see how the newer model performs with the combination of the old and new parts once compiled.
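
A minimal sketch of this freezing idea in TensorFlow/Keras (the MobileNetV2 base, the input size, and the 3-class head are illustrative assumptions, not something from the course):

```python
import tensorflow as tf

# Load a model that was already trained on ImageNet, dropping its
# original classification layers (include_top=False).
base_model = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3),
    include_top=False,
    weights="imagenet",
)
base_model.trainable = False  # freeze all the transferred layers

# Add a new, trainable head for the new task (3 classes here, as an example).
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(3, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(new_dataset, epochs=5)  # new_dataset is the new task's (hypothetical) data
```

Only the pooling and Dense layers are updated during training; the frozen base still runs in the forward pass.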

@Nevermnd Honestly, I would advise you to do the TensorFlow Advanced Techniques specialisation and give it time, to understand these somewhat complex things, because doing them practically gives you a better understanding.

Regards
DP

LLMs are not only about fine-tuning, but yes, what you have heard is also partly true.

Some people like to create LLMs from scratch; others prefer to take existing LLMs and fine-tune them.

What you are describing here is essentially a part of transfer learning. LLMs are, again, a versatile subject of interest: they can be created from your own original dataset, or built on top of a source such as OpenAI’s models.

For example, LLM chatbots are created using GPT, which is itself a type of LLM. The famous example is ChatGPT, a chatbot service powered by the GPT backend provided by OpenAI.

Regards
DP

Hello @Nevermnd,

I agree with @Deepti_Prasad that these questions would be better split up into different threads, because threads are linear and discussing multiple topics in parallel is not very convenient to follow.

  1. No, I wouldn’t 100% rely on that for hyperparameter tuning, because I do not know how large that subset should be, and that size should depend on the complexity of the problem.
    If we have a problem of only one dependent feature that has a linear relation with the labels, then obviously, whether we have 1k or 100k training points shouldn’t make much difference. What if it becomes a two-feature problem? Maybe 1k is still fine. Then 10-feature? 100-feature? What if the underlying relationship becomes non-linear? What if the noise in the dataset goes up? I mean, that required subset size should go up with the complexity of the problem, right?
    In other words, we do not know whether the smaller subset can get us a good set of hyperparameters that will still do the best in the case of the full set.

  2. This is another version of your first question - can we (1) have a smaller training set by splitting out a dev set, (2) fix the hyperparameters using that smaller training set and the dev set, and (3) retrain the model with the training+dev sets combined under those fixed hyperparameters?
    I have the same answer - we don’t know if that “smaller training set” here was large enough to give you hyperparameters that still do the best with the larger training set.
    The only thing we can know is whether the two models (i.e. before and after the train and dev sets are combined) perform similarly well, by evaluating them on the test set. If the combined one does better, why not combine?

  3. Yes, it’s a thing. For example, we can add more constraints to our cost function. L2 regularization is actually an example of a constraint that pushes the weights towards zero. If we know the physics between an output and an input, or between two outputs, then we can also convert that physics into constraints and add them to the cost function (see the sketch after this list).

  4. Packages like tensorflow allow you to selectively freeze some of the layers so that the frozen ones are non-trainable and the others are trainable. If you have a 5-layer model, and only the 1st and the 5th layers are trainable, then backprop will still go through the whole model, but only the weights in the 1st and the 5th layers will be changed. You need to know, for example, tensorflow’s command to freeze and unfreeze a layer - check this out.
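
For point 3, a minimal sketch of what adding a constraint to the cost function can look like (the non-negativity “physics” and the lambda_phys weight are hypothetical, purely to illustrate the idea):

```python
import tensorflow as tf

lambda_phys = 0.1  # strength of the (hypothetical) physics constraint

def constrained_mse(y_true, y_pred):
    y_true = tf.cast(y_true, y_pred.dtype)
    # Standard data-fitting term.
    mse = tf.reduce_mean(tf.square(y_true - y_pred), axis=-1)
    # Suppose we "know the physics" that the output can never be negative:
    # penalize negative predictions, and add nothing when they are >= 0.
    physics_penalty = tf.reduce_mean(tf.square(tf.nn.relu(-y_pred)), axis=-1)
    return mse + lambda_phys * physics_penalty
```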

Cheers,
Raymond


@rmwkwok :grin: I was also a little shy and didn’t want to make it seem like ‘Why is this guy asking so many questions / making so many posts!?!’


@rmwkwok P.S. Thanks in particular for your response to Q4. Though I know we will be depending more on frameworks from here on out, I still have in my mind ‘Well, how would I code/program this myself if I had to?’ Say, if I wanted to experiment with an idea or concept that was not part of the framework.

It was at first not clear to me how transfer learning could be implemented in this way, but your response now provides some insight.


Hello @Nevermnd,

In C1, from scratch, we developed a trainable multi-layer Neural Network with numpy. We can do it, you can do it, right?

Let’s take that as our common ground to start off.

  1. You trained a 5-layer classification model for cats, dogs, and fishes. (call it model A)
  2. You want to build a new model for tigers, dogs, and fishes.
  3. We apply transfer learning:
    1. Possible approach number one:
      1. randomly initialize a new model B of the same architecture as model A
      2. replace model B’s weights with model A’s except for the 5th layer, in other words, layer 5 has random weights
      3. modify our training numpy code to functionally remove all the parts that will update layer 1 to layer 4’s weights. So, the part for forward propagation is untouched, and back prop will only work on layer 5.
      4. train model B with the datasets for tigers, dogs, and fishes, using the modified code.

Besides 3.1, we can have approaches 3.2, 3.3, and so on.

In this way, model A layer 1-4’s weights are transferred to model B and they affect how model B’s layer 5 will end up.
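
To make approach 3.1 concrete, here is a minimal numpy sketch (the layer sizes, ReLU/softmax choices, and random data are illustrative assumptions, not the actual course code):

```python
import numpy as np

rng = np.random.default_rng(0)
layer_sizes = [20, 16, 16, 16, 16, 3]  # input, four hidden layers, 3-class output

def init_params(sizes):
    return [{"W": rng.standard_normal((n_out, n_in)) * 0.01,
             "b": np.zeros((n_out, 1))}
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

params_A = init_params(layer_sizes)  # stands in for the trained model A
params_B = init_params(layer_sizes)  # model B, randomly initialized (step 3.1.1)

# Step 3.1.2: copy layers 1-4 from A into B; layer 5 keeps its random weights.
for l in range(4):
    params_B[l] = {k: v.copy() for k, v in params_A[l].items()}

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def forward(X, params):
    A, caches = X, []
    for l, p in enumerate(params):
        caches.append(A)                       # activation feeding this layer
        Z = p["W"] @ A + p["b"]
        A = softmax(Z) if l == len(params) - 1 else relu(Z)
    return A, caches

# Step 3.1.3: back prop only updates layer 5; layers 1-4 stay frozen.
def train_step_last_layer_only(X, Y, params, lr=0.1):
    m = X.shape[1]
    A_out, caches = forward(X, params)
    A_prev = caches[-1]                        # output of layer 4, input to layer 5
    dZ = A_out - Y                             # softmax + cross-entropy gradient
    params[-1]["W"] -= lr * (dZ @ A_prev.T) / m
    params[-1]["b"] -= lr * dZ.sum(axis=1, keepdims=True) / m
    # No gradients are computed for layers 1-4, so their weights never change.

# Step 3.1.4: train model B on the new (here random, illustrative) dataset.
X = rng.standard_normal((20, 64))              # 64 examples, 20 features each
Y = np.eye(3)[:, rng.integers(0, 3, 64)]       # one-hot labels for 3 classes
train_step_last_layer_only(X, Y, params_B)
```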

Cheers,
Raymond
