How to find a good value for the threshold?

In a real problem, what are the techniques for finding good values for the similarity threshold?

In the real world it depends on at least two factors:

  1. Dataset:
  • how clean it is (duplicates, missing values, typos, etc.) - generally, the cleaner the dataset, the narrower the margin can be;
  • how long and wide it is (how many samples vs. how many features) - generally, the more samples and the fewer features, the narrower the margin can be;
  2. Time/money (“All models are wrong, but some are useful” - a cliché, but…):
  • how soon you need the model - generally the world won’t wait years for the best model (hyperparameters), because things can change fast;
  • how much benefit/profit each percentage point of accuracy brings.

Having said that, some common approaches to hyperparameter optimization that also apply to a similarity threshold are:

  • Grid search
  • Random search
  • Bayesian optimization
  • Gradient-based optimization
  • Evolutionary optimization
  • Population-based

I mostly use Random Search and Grid Search (a simple tutorial).
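To make the grid and random search options concrete for a single similarity threshold, here is a minimal sketch. It assumes you have a small labeled validation set of (similarity score, is it a true match) pairs and that F1 is the metric you care about; the data, the `f1_at` helper, and the search ranges are all made up for illustration and are not from any particular library.

```python
import random

# Hypothetical validation data: (similarity score, is_true_match) pairs.
pairs = [
    (0.95, True), (0.91, True), (0.88, True), (0.83, True), (0.79, True),
    (0.74, False), (0.70, True), (0.66, False), (0.58, False), (0.52, False),
    (0.45, False), (0.31, False),
]

def f1_at(threshold, pairs):
    """F1 score when every pair with score >= threshold is predicted a match."""
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def grid_search(pairs, low=0.0, high=1.0, steps=21):
    """Try evenly spaced thresholds and keep the one with the best F1."""
    candidates = [low + i * (high - low) / (steps - 1) for i in range(steps)]
    return max(candidates, key=lambda t: f1_at(t, pairs))

def random_search(pairs, low=0.0, high=1.0, trials=200, seed=0):
    """Try uniformly sampled thresholds and keep the one with the best F1."""
    rng = random.Random(seed)
    candidates = [rng.uniform(low, high) for _ in range(trials)]
    return max(candidates, key=lambda t: f1_at(t, pairs))

grid_best = grid_search(pairs)
rand_best = random_search(pairs)
print(f"grid:   threshold={grid_best:.2f}  F1={f1_at(grid_best, pairs):.3f}")
print(f"random: threshold={rand_best:.2f}  F1={f1_at(rand_best, pairs):.3f}")
```

With only one parameter both searches are cheap, so the choice matters little here; the difference shows up when the threshold is tuned together with other hyperparameters. In either case, pick the threshold on validation data and confirm it on a separate held-out set.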

But all in all, in my experience, a “data-centric” approach beats a “model-centric” one. On the other hand, in other types of ML, such as Reinforcement Learning, hyperparameter search is essential.

Just my thoughts :)
