Can you give me any strategies for when my dev set fluctuates too much? I am training on a fairly small dataset of almost 20K examples, with 2K examples for dev and 2K for test. After some pre-processing I have found that my dev set accuracy swings quite a bit, from 88% down to 83%.
I think you are putting too many examples into dev and test. The split should put the majority into training, so the model sees enough examples to reduce overfitting.
"One popular heuristic had been to use 30% of your data for your test set. This works well when you have a modest number of examples—say 100 to 10,000 examples. "
I believe you will see a boost in accuracy with a larger training set and smaller dev/test sets.
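If it helps, here is a minimal sketch of what that re-split could look like with scikit-learn. The synthetic data and the 90/5/5 proportions are only placeholders standing in for your ~24K examples, not a recommendation for exact sizes:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Stand-in for your ~24K examples; swap in your own features and labels.
X, y = make_classification(n_samples=24000, n_features=50, random_state=0)

# 90% train, then split the remaining 10% evenly into dev and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0)

print(len(X_train), len(X_dev), len(X_test))  # 21600 1200 1200
```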
And yes, I’m facing an overfitting problem: my training set is at 95% accuracy.
At this point, you’re going to have to either 1) adjust the data model or 2) adjust the hyperparameters of your model.
Without understanding your data or your model’s objective function, it’s hard to advise on a specific remedy.
In machine learning we are guided by prediction error. When you see a drop in model accuracy, it’s natural to modify the variables in the data model so the algorithm can pick up a signal.
It’s an iterative process, going back and forth between data modeling and measuring model performance.
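If you go with option 2, a rough sketch of a hyperparameter search might look like the following. I'm assuming a scikit-learn style workflow here; the logistic regression and the C grid are just stand-ins for whatever model and knobs you are actually tuning:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Stand-in data; use your own training set here.
X, y = make_classification(n_samples=20000, n_features=50, random_state=0)

# Search over regularization strength; stronger regularization (smaller C)
# is one way to pull a 95% train / 85% dev gap closer together.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="accuracy",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```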
I have found that overfitting can be addressed very effectively with more data, but hyper-parameter tuning only decreased the accuracy of both my train and dev sets. I am working on decreasing the training dataset size and using a lighter-weight model.
Data modeling means adjusting the data to improve its quality.
Data quality is determined by how accurately the sample data represents the population. There are endless variations on how you can organize your data.
The best approach is to decide on a single evaluation metric that you are looking to improve.
For example, I worked on healthcare data and ran a classification algorithm to predict treatment outcomes. My goal was to improve precision, so that positive treatment outcomes were predicted correctly 90% of the time. I changed the data model several times over 3 months: how many weeks of clinical data to use, different aggregations of medication doses, including/excluding certain test results, and including/excluding patients who did or did not complete treatment.
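As a concrete illustration of tracking a single metric, this is roughly what scoring each data-model iteration on precision alone could look like (the labels below are made up for the example, not from my project):

```python
from sklearn.metrics import precision_score

# Toy dev-set labels and predictions after one data-model change.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

# Precision: of the cases predicted as positive outcomes, how many were right.
print(precision_score(y_true, y_pred))  # 0.833...
```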
I am doing similar research, but on mechanical components: I am using 1D frequency-domain data for binary classification. I would like to learn more about your research; would you mind linking it below so I can draw more inspiration from it?
One strategy to mitigate fluctuation in your dev set performance is to increase its size. A larger dev set can provide a more stable estimate of model performance. Alternatively, consider cross-validation techniques to get a more robust estimate of your model’s performance. Additionally, ensure your model architecture and hyperparameters are well-tuned and avoid overfitting to the dev set. Regularization techniques like dropout or early stopping might help.
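Here is a quick sketch of the cross-validation idea, assuming a scikit-learn style classifier (the logistic regression and the synthetic data are placeholders for your own model and features). The spread of the fold scores gives you a sense of how much of that 83-88% swing is just sampling noise from a 2K dev set:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in for your ~20K training examples.
X, y = make_classification(n_samples=20000, n_features=50, random_state=0)

# 5-fold stratified CV: each fold plays the role of a ~4K dev set once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```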