What is the best way to find a balance between a data-centric and a model-centric approach?

Hi, @srik, and welcome to the community! :wave:

TL;DR: data-centric first, model-centric second. Switch your attention to the model only once you’ve exhausted your options for improving the data.

This is a debatable topic. In my opinion, you should always start by picking a proven, well-known model suitable for your task, training it on the data you have, and using the results you get as the baseline for all of your future work. If you have to pick between multiple models at that stage, pick the simpler one. Simpler models are easier to train, which lets you iterate faster. Also, we tend to overestimate how complex a model the task actually needs.

Then, focus on improving your data. Go fully data-centric, and don’t forget to re-evaluate your model after every improvement and compare the result with the baseline.
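To make the "baseline first, then re-evaluate after every data fix" loop concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than a recommendation: the toy 1-D dataset, the nearest-centroid classifier standing in for "a simple, well-known model", and the names `fit_centroids` and `accuracy`. The point is only that the model and the test set stay fixed while the training data changes.

```python
from statistics import mean

def fit_centroids(rows):
    """Train a nearest-centroid classifier: one mean feature value per label."""
    by_label = {}
    for x, y in rows:
        by_label.setdefault(y, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}

def accuracy(centroids, rows):
    """Fraction of rows whose nearest centroid carries the correct label."""
    hits = 0
    for x, y in rows:
        predicted = min(centroids, key=lambda label: abs(x - centroids[label]))
        hits += predicted == y
    return hits / len(rows)

# Hypothetical toy data: (feature, label). The point (12.0, "a") is mislabeled.
train_v1 = [(1.0, "a"), (2.0, "a"), (3.0, "a"), (12.0, "a"),
            (8.0, "b"), (9.0, "b"), (10.0, "b")]
# Data-centric step: fix the bad label, leave the model untouched.
train_v2 = [(x, "b") if x == 12.0 else (x, y) for x, y in train_v1]

# The held-out test set never changes, so scores stay comparable.
test = [(1.5, "a"), (3.5, "a"), (6.5, "b"), (9.5, "b")]

baseline = accuracy(fit_centroids(train_v1), test)
improved = accuracy(fit_centroids(train_v2), test)
print(f"baseline={baseline:.2f} after label fix={improved:.2f}")
```

In a real project you would swap in your own model and metric; the habit worth keeping is recording one score per data change against the same held-out set.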

At some point, you’ll notice that the model isn’t improving anymore and/or that you’ve exhausted all your options for improving the data. If, at this point, you are not happy with the performance of your model, switch your attention to the model. You may, for example, pick a more complex model and evaluate it, or look into fine-tuning the architecture manually. But don’t spend too much time on it; if you still have options to improve your data, apply them with the new model and see if that helps.
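If it helps, the switching rule described above can be written down as a small decision function. This is just one possible reading of it: the function name `next_focus`, the `min_gain` plateau threshold, and the `target` score are all hypothetical knobs you would set for your own project.

```python
def next_focus(scores, data_options_left, target, min_gain=0.01):
    """Suggest where to spend effort next: "data", "model", or "done".

    scores: evaluation scores so far, oldest first (baseline at index 0).
    data_options_left: True if you still have ideas for improving the data.
    target: the score at which you are happy with the model (assumed).
    min_gain: smallest improvement that still counts as progress (assumed).
    """
    if scores[-1] >= target:
        return "done"
    # A plateau means the latest data change barely moved the score.
    plateaued = len(scores) >= 2 and (scores[-1] - scores[-2]) < min_gain
    if plateaued or not data_options_left:
        return "model"
    return "data"

# Still improving and ideas left: keep working on the data.
print(next_focus([0.70, 0.78], data_options_left=True, target=0.90))
# Score has stalled below the target: time to look at the model.
print(next_focus([0.70, 0.78, 0.781], data_options_left=True, target=0.90))
```

After a model change, the same rule applies again with the new model's score as the latest entry, so any remaining data improvements get re-tested against it.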