Hi everyone, I've been struggling to pin down what people mean by "a lot" of data. Can anyone define what counts as "a lot" of data in machine learning - how much data do we need to have before it can be called "a lot"?
The amount of data required depends on the problem at hand. For instance, to build a classifier for a simple decision boundary, just tens of data points might be sufficient. On the other hand, building an image captioning system or a language translation model from scratch requires far more data.
What if we want to build a regression model for forecasting or prediction?
Are there general guidelines for determining whether our data is considered “large”/”big data”?
This also affects whether we should use traditional machine learning or deep learning for modeling later on, right?
Even for a regression problem, if the output is a simple affine transformation (i.e. y=mx+b) of the input, odds are good that a simple neural network with 10s of data points might be able to get it right.
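As a quick illustration of that point, here is a minimal sketch (using synthetic data I made up for the example) showing that a plain least-squares line fit - which is what a single linear neuron would learn - can recover y = mx + b from only 20 noisy points:

```python
import numpy as np

# Hypothetical data: 20 noisy samples of y = 3x + 2.
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=20)
y = 3 * x + 2 + rng.normal(0, 0.1, size=20)

# Least-squares fit of a degree-1 polynomial, i.e. y = m*x + b.
m, b = np.polyfit(x, y, deg=1)
print(f"m ≈ {m:.2f}, b ≈ {b:.2f}")
```

With so little noise relative to the signal, the recovered slope and intercept land very close to the true values, which is the point: a simple target function needs only a small dataset.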
There are quite a few guidelines on when and how to consider adding more data when training a model to achieve the desired performance. Please see the Deep Learning Specialization for more details. The specialization also covers what to do when you have limited data and how to leverage existing models for the problem at hand.
Consider starting with traditional ML models for tabular data, and neural networks for unstructured data like audio, text, and video.
Just one possible guideline, this is certainly not any sort of rule:
For simple regressions, the number of training examples must be much greater than the number of features.
I start with "much greater" being around 10x. That may (or may not) be sufficient to get good performance; it depends entirely on the complexity of the model.
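That rule of thumb is easy to turn into a quick sanity check. Here is a sketch (the function name and the 10x default are just my illustration of the heuristic above, not a standard API):

```python
def enough_data(n_samples: int, n_features: int, ratio: int = 10) -> bool:
    """Rough heuristic, not a rule: flag whether the dataset meets the
    ~10 training examples per feature starting point for a simple regression."""
    return n_samples >= ratio * n_features

# 500 samples, 20 features: 500 >= 200, so the heuristic passes.
print(enough_data(500, 20))
# 100 samples, 20 features: 100 < 200, so consider gathering more data.
print(enough_data(100, 20))
```

Again, passing this check only means you have a reasonable starting point; actual performance on a validation set is what tells you whether the data is truly sufficient.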