Machine Learning Strategies

Can you give me any strategies for when my dev set fluctuates too much? I am fitting a fairly small dataset of roughly 20K examples for training, 2K examples for dev, and 2K for test. After some pre-processing I have found that my dev set accuracy fluctuates quite a lot, from 88% down to 83%.

Any help would be appreciated,

Best,
Yuhan Chiang

Hi Chiang_Yuhan

I think you are putting too many examples into dev and test. The split should put the majority into training, so the model sees enough examples to reduce overfitting.

According to Andrew Ng in his book on machine learning strategy, Machine Learning Yearning (Chapter 7):

"One popular heuristic had been to use 30% of your data for your test set. This works well when you have a modest number of examples—say 100 to 10,000 examples. "

I believe you will see a boost in accuracy with a larger training set and smaller dev/test sets.
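For example, with your ~24K total examples, a rough sketch of a 90/5/5-style split could look like the following (the arrays and ratios here are placeholders, not a prescription):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for your ~24K examples and labels.
X = np.random.randn(24_000, 32)
y = np.random.randint(0, 2, size=24_000)

# Roughly 90% train, 5% dev, 5% test; adjust the ratios to your needs.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.10, random_state=42, stratify=y
)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42, stratify=y_rest
)
print(len(X_train), len(X_dev), len(X_test))  # 21600 1200 1200
```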

Cheers

Thank you @dan_herman for your swift reply!

I realize I mistyped, sorry about that.
I’m using 20K examples for training, 2K for dev, and 2K for test.

Dev and test are from the same distribution. I mixed the other 2K examples from that pool into my training set along with another pool of similar data.

So that makes 18K from one pool of data for training, and 6K from another pool distributed 2K each to the train, dev, and test splits.
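Just to make the bookkeeping explicit, here is a minimal sketch of how I assemble the splits (the pool arrays below are random placeholders, not my actual data):

```python
import numpy as np

# Placeholder pools: pool A is the large source (18K examples),
# pool B is the smaller pool that dev and test are drawn from (6K).
rng = np.random.default_rng(0)
pool_a_X, pool_a_y = rng.normal(size=(18_000, 32)), rng.integers(0, 2, 18_000)
pool_b_X, pool_b_y = rng.normal(size=(6_000, 32)), rng.integers(0, 2, 6_000)

# Shuffle pool B, then hand out 2K each to train, dev, and test.
order = rng.permutation(len(pool_b_X))
b_X, b_y = pool_b_X[order], pool_b_y[order]

X_train = np.concatenate([pool_a_X, b_X[:2_000]])
y_train = np.concatenate([pool_a_y, b_y[:2_000]])
X_dev, y_dev = b_X[2_000:4_000], b_y[2_000:4_000]
X_test, y_test = b_X[4_000:6_000], b_y[4_000:6_000]
```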

I am using synthetic data methods to reduce the data mismatch.

And yes, I’m facing an overfitting problem: my training set accuracy is at 95%.

Could you give me another suggestion for this problem?

Thank you,
Yuhan Chiang

"And yes, I’m facing an overfitting problem: my training set accuracy is at 95%."

At this point, you’re going to have to either 1) adjust the data model or 2) adjust the hyperparameters of your model.

Without understanding your data or your model’s objective function, it’s hard to advise on a specific remedy.

In machine learning we are guided by prediction error. When you measure a loss in model accuracy, it’s natural to modify the variables in the data model so the algorithm can pick up a signal.

It’s an iterative process, going back and forth between data modeling and measuring model performance.

To me, the difference between 83% and 88% doesn’t indicate excessive variation.
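If you want a rough number for how much of that swing could come from the dev set alone, a back-of-the-envelope binomial estimate is enough (assuming roughly 85% accuracy and 2K independent examples):

```python
import math

# Sampling noise of an accuracy estimate on a 2K-example dev set,
# assuming ~85% true accuracy and i.i.d. examples.
n_dev = 2_000
acc = 0.85
std_err = math.sqrt(acc * (1 - acc) / n_dev)
print(f"std. error ~ {std_err * 100:.1f} points")               # ~0.8 points
print(f"~95% interval ~ +/- {1.96 * std_err * 100:.1f} points")  # ~1.6 points
```

The remainder of the swing is more likely coming from run-to-run differences in preprocessing and training than from the dev set itself.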


The overfitting is the first issue to address.


I have found that with more data, overfitting can be addressed very efficiently, but hyper-parameter tuning only decreased the accuracy of both my train and dev sets. I am now working on decreasing the training dataset size and using a lighter-weight model.


Hello!

I’m not quite sure what you mean by "adjust the data model": does it mean the size or the data distribution of the dataset that I’m training on?

Thanks!

Using less data is rarely going to give you better practical results.

Resist the temptation to tweak the data until you get a result that looks nice.


Data modeling means adjusting the data to improve its quality.

Data quality is determined by how accurately the sample data represents the population. There are endless variations on how you can organize your data.

The best approach is to decide on a single evaluation metric that you are looking to improve.

For example, I worked on healthcare data and ran a classification algorithm to predict treatment outcomes. My goal was to improve precision: predicting positive treatment outcomes correctly about 90% of the time. I changed the data model several times over 3 months, varying how many weeks of clinical data to use, trying different aggregations of medication doses, including or excluding certain test results, and including or excluding patients who did or did not complete treatment.
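As a toy illustration of anchoring every iteration on one number (the data and model here are synthetic placeholders, not my actual project):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split

# Synthetic placeholder data standing in for the real clinical features.
X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
preds = model.predict(X_te)

# The single number tracked across every change to the data model.
print(f"precision = {precision_score(y_te, preds):.3f}")
```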


Thank you!

I am doing similar research, but on mechanical components: I am using 1D frequency-domain data for binary classification. I would like to learn more about your research; would you mind linking it below so that I can draw more inspiration?

Thanks,
Yuhan Chiang

I have just started this basic course and have a question on AI terminology: are scorecard creation techniques classified under machine learning?

@deepti.garg, I do not know, because I am not familiar with “scorecard creation”.

Hi Chiang_Yuhan

One strategy to mitigate fluctuation in your dev set performance is to increase its size. A larger dev set can provide a more stable estimate of model performance. Alternatively, consider cross-validation techniques to get a more robust estimate of your model’s performance. Additionally, ensure your model architecture and hyperparameters are well-tuned and avoid overfitting to the dev set. Regularization techniques like dropout or early stopping might help.
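For example, a quick k-fold sketch along those lines (synthetic placeholder data and a generic model, just to show how cross-validation gives you a spread of estimates instead of a single noisy number):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic placeholder data standing in for your ~24K examples.
X, y = make_classification(n_samples=24_000, n_features=40, random_state=0)

# 5-fold cross-validation: each fold serves once as a held-out set.
scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0),
    X, y, cv=5, scoring="accuracy",
)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```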
