How much data is needed?

The question clients always ask: how much data is needed?
I can’t say ‘as much as possible’, because to the client that means I’m not setting any realistic goal.
If I say an arbitrary number like 100k records, I might end up with poor-quality data that has issues like low variance or missing values, and I can’t ask for more because the client has already provided the 100k records.

So… how should I respond to ‘How much data is needed?’ Is there a systematic approach I can employ and quantify in the contract with my client?

The answer to this is largely based on circumstance. I will give my opinion, which is certainly not the only one, or even the best one. As such, I invite other mentors to weigh in further.

I would frame the question as follows:

  1. What are the customer’s qualitative objectives (i.e. what is the nature of the prediction: predicting cats, cats and dogs, etc.)?
  2. What are the quantitative objectives (i.e. required accuracy, precision, recall, etc.)?
  3. What is their resource budget (i.e. the required computing resources and their associated cost)?

On point 1, you can quickly do some high-level checks that the data is sufficient, e.g. if there are no dog pictures in the dataset, then you need to manage the client’s expectations, namely that their model won’t be predicting dogs. Also, interrogate* the client on how similar future prediction data will be to the existing data. Think of the difference between pictures of cats and pictures containing cats (where a cat is in the picture, but not necessarily the focus of the picture) - nuances like this can have a big impact.
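Some of those sufficiency checks can be scripted as a first pass. Here is a minimal sketch in Python, assuming the client’s records arrive as a CSV with a `label` column; the file name, column name and expected classes are all placeholders for whatever the client actually provides.

```python
import pandas as pd

# Hypothetical setup: the client's records in a CSV with a "label" column.
df = pd.read_csv("client_data.csv")  # placeholder file name

expected_classes = {"cat", "dog"}  # what the client wants to predict

# Are any expected classes missing from the data entirely?
missing = expected_classes - set(df["label"].unique())
if missing:
    print(f"Warning: no examples at all for {missing}")

# How balanced are the classes that are present?
print(df["label"].value_counts(normalize=True))

# Quick quality red flags: missing values and near-constant columns.
print(df.isna().mean().sort_values(ascending=False).head())
print(df.nunique().sort_values().head())  # ~1 unique value means no variance
```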

On point 2, armed with the client’s quantitative requirements, the next step would be to train a model and do a rough test on a subset of the data. This will give you a quick idea of how realistic the quantitative objectives are. If you are getting 80% accuracy on 10% of the data, and the client’s objective is 95%, then you have more motivation to invest some more time in extending the data input and/or tweaking the model. If, for example, you ratchet the input data up to 50% and try a few alternative model configurations, and your best output is still 80%, that’s not a great sign, and another discussion with the client has to take place.
(At this stage, a visual (human) inspection of the failed test-set data may yield further insight into what is causing the low prediction rate; maybe a bunch of fox pictures got caught up in the dataset.)
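One cheap way to systematize this “train on a subset” approach is a learning curve. The sketch below uses scikit-learn’s `learning_curve` with a stand-in dataset (`load_digits`) and a plain logistic regression; both are my assumptions for illustration, and in practice you would swap in the client’s data, model and agreed metric.

```python
import numpy as np
from sklearn.datasets import load_digits  # stand-in for the client's data
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# Train on growing fractions of the training split, cross-validating each time.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # 10%, 32.5%, ..., 100%
    cv=5,
    scoring="accuracy",
)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{n:5d} examples: train {tr:.1%}, validation {va:.1%}")
```

If validation accuracy is still climbing at the largest training size, more data will probably help; if the curve has already flattened well below the target, more data alone is unlikely to close the gap, and the model or the data quality becomes the better lever.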

Point 3 follows from the outcome of point 2, but it is good to know from the outset so that you keep input data and model sizes in check from the start.

The key point is that you can’t always have a conclusive discussion with the client until you have explored and even run some models on a dataset. It is like trying to get a mechanic to find an issue with your car without them seeing the car. They would be able to give you some high-level diagnoses based on experience, but at some point they will need to look under the hood.

* I use this word intentionally as, in my experience, clients tend to be on the vague side about their data. They are focused on fulfilling a business need, not on the technical aspects of how it drives ML. I have also seen a tendency in clients to think that lots of data equals lots of insight/intelligence. That is like saying that if I spend 10,000 hours practicing Beethoven’s 5th, I will be able to play the piano (I might, but not as well as if I had varied my practice).


Hi, thanks for the response. I reckon the best way is to form a contract that consists of two parts: a ‘fact-finding’ portion that really dives into what the client has and determines the potential of the project, and then a ‘full-on’ portion that defines the conservative gain achievable by machine learning, or whether it’s even worthwhile to do at all. That way we have a first part where clients don’t have to pay a lot and get to assess their readiness for AI in the first place.


I think an important parameter is the quality of the data. With a smaller but clean dataset you can end up with higher accuracy than with a bigger, noisy dataset.
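A toy illustration of that trade-off (the synthetic dataset, the 25% label-flip rate and the unpruned decision tree are all my own assumptions, purely for demonstration): train once on a small clean subset, once on a ten-times-larger set with flipped labels, and score both on the same clean test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data standing in for a client dataset.
X, y = make_classification(n_samples=12_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=2_000, random_state=0)

# Scenario A: small but clean training set (first 1,000 examples).
X_small, y_small = X_train[:1_000], y_train[:1_000]

# Scenario B: the full 10,000-example training set with 25% of labels flipped.
y_noisy = y_train.copy()
flip = rng.random(len(y_noisy)) < 0.25
y_noisy[flip] = 1 - y_noisy[flip]

# Score both on the same clean held-out test set.
clean_model = DecisionTreeClassifier(random_state=0).fit(X_small, y_small)
noisy_model = DecisionTreeClassifier(random_state=0).fit(X_train, y_noisy)
print("small + clean:", clean_model.score(X_test, y_test))
print("big + noisy:  ", noisy_model.score(X_test, y_test))
```

In this setup the small clean set typically wins; the exact gap depends on the model’s capacity and the noise rate, and flexible models that can memorize the noise are hit hardest.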

Dr. Andrew Ng has a great video about data-centric AI which I recommend.