In a real problem, what are the techniques for finding good values for the similarity threshold?
Obviously, in the real world it depends on at least two factors:
- Dataset:
  - how clean it is (deduplicated, missing values, typos, etc.) - generally, the cleaner the dataset, the narrower the margin can be;
  - how long and wide it is (how many samples vs. how many features) - generally, the more samples and the fewer features, the narrower the margin can be;
- Time/money (“All models are wrong, but some are useful” - a cliché but…):
  - how soon you need the model - generally, the world won’t wait years for the best hyperparameters, because things can change fast;
  - what benefit/profit does every percent of accuracy bring?
Having said that, some common approaches to hyperparameter optimization that also apply to a similarity threshold are:
- Grid search
- Random search
- Bayesian optimization
- Gradient-based optimization
- Evolutionary optimization
- Population-based
I mostly use Random Search and Grid Search (a simple tutorial).
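For a similarity threshold specifically, a labelled validation set plus a plain grid or random search is often all you need. Here is a minimal sketch of both; the similarity scores, labels, and the F1 objective are made-up placeholders for illustration, not part of the original question:

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy stand-ins: in a real problem these would be precomputed similarity
# scores for candidate pairs and ground-truth labels from a validation set.
rng = np.random.default_rng(0)
similarities = rng.random(1000)                                    # scores in [0, 1]
labels = (similarities + rng.normal(0, 0.2, 1000) > 0.6).astype(int)

# Grid search: sweep a fixed grid of thresholds and keep the one that
# maximizes F1 on the validation pairs.
grid = np.linspace(0.05, 0.95, 19)
best_t, best_f1 = max(
    ((t, f1_score(labels, (similarities >= t).astype(int))) for t in grid),
    key=lambda x: x[1],
)
print(f"grid search:   threshold={best_t:.2f}, F1={best_f1:.3f}")

# Random search: same objective, but sample thresholds instead of sweeping;
# this scales better when several hyperparameters are tuned jointly.
candidates = rng.uniform(0.05, 0.95, 50)
best_t_rand = max(candidates, key=lambda t: f1_score(labels, (similarities >= t).astype(int)))
print(f"random search: threshold={best_t_rand:.2f}")
```

Swap in whatever metric actually matters for your problem (precision at a fixed recall, cost-weighted error, etc.) - the search loop stays the same.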
But all in all, in my experience, a “data-centric approach” > “model-centric approach”. On the other hand, in other areas of ML, like Reinforcement Learning, hyperparameter search is essential.
Just my thoughts
Cheers