Any best practices on how much data is sufficient for ML/DL?

At various times Prof. Ng mentions “have sufficient data, large enough data set”, etc.

How much is sufficient or large enough? It surely depends on many things, but where can one find some heuristics/guidelines?

Hi there, I think one way to check whether the dataset is sufficiently large is to look at the test accuracy: if it keeps improving as you add more data, you probably need more. There is also a ‘10 times rule’ of thumb for estimating the minimum sample size: roughly ten samples per degree of freedom in the model (e.g. per input feature).
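As a minimal sketch of the ‘10 times rule’ mentioned above (the function name and the choice of counting input features as degrees of freedom are my own illustration, not from the rule itself):

```python
def min_samples_10x(n_features, factor=10):
    """Rule-of-thumb minimum sample size: `factor` samples per
    model degree of freedom (here approximated by input features)."""
    return factor * n_features

# e.g. a model with 50 input features
print(min_samples_10x(50))  # 500
```

Treat this as a starting point for data collection, not a guarantee that the model will generalize.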


Theoretically, the feature space grows exponentially as each new feature is added, and so does the number of samples required to be statistically sound everywhere in that space. In practice, however, it is often unrealistic to keep up with that rate of growth. I therefore agree with @kchong37 that a good metric result is how you know your sample size is large enough for the model assumptions you made. Since this is a very empirical question, I spent 3 minutes on a Google search (I am lazy) and found this article, which experimented with different settings of “number of samples per degree of freedom” on the author’s dataset. That said, I have never found a robust general conclusion on the question you asked.
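One practical way to apply the “check the test accuracy” advice is a learning curve: train on growing subsets of the data and watch whether held-out accuracy has plateaued. A rough sketch, assuming scikit-learn is available (the synthetic dataset, model choice, and subset sizes are all illustrative):

```python
# Learning-curve sketch: if test accuracy still climbs as n grows,
# more data would likely help; if it has flattened, the dataset is
# probably large enough for this model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

accs = []
for n in [100, 200, 400, 800, 1500]:
    model = LogisticRegression(max_iter=1000).fit(X_tr[:n], y_tr[:n])
    acc = model.score(X_te, y_te)
    accs.append(acc)
    print(f"n={n:5d}  test accuracy={acc:.3f}")
```

If the last few accuracies are essentially flat, adding samples is unlikely to help as much as changing the model or features.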