In Search of Data

Sorry that no one noticed this thread when you posted it. It may be too late for Roberto, but if anyone else sees this, here are some thoughts:

You can easily get the data files used in any of the assignments. Here’s a thread about how to get all the files for a given assignment. Then you just have to look at the various utility functions provided to see how to access the contents of the files.
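For example, many of the course data files are just HDF5 files, so you can open them directly with h5py. Here’s a minimal sketch, assuming a file laid out like the cat/not-cat dataset; the path and key names below are placeholders, so check the assignment’s own load utilities for the real ones:

```python
import h5py
import numpy as np

# Placeholder path and key names -- run list(f.keys()) on your
# actual file to find out what it really contains.
with h5py.File("datasets/train_catvnoncat.h5", "r") as f:
    print(list(f.keys()))                 # see what datasets the file holds
    train_x = np.array(f["train_set_x"])  # e.g. image data
    train_y = np.array(f["train_set_y"])  # e.g. labels

print(train_x.shape, train_y.shape)
```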

There are lots of sources of public datasets. ImageNet is a famous one for image data, and Kaggle, which you already found, is another great source of ML datasets. Here’s a thread with more links to datasets.
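If you go the Kaggle route, their official `kaggle` package lets you script the downloads. A quick sketch; the dataset slug below is a made-up placeholder, and you need an API token saved at ~/.kaggle/kaggle.json (created from your Kaggle account page):

```python
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads the token from ~/.kaggle/kaggle.json

# "some-user/some-dataset" is a placeholder -- copy the real slug
# from the dataset's page on Kaggle.
api.dataset_download_files("some-user/some-dataset", path="data/", unzip=True)
```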

2GB is not a large dataset by modern ML standards. In terms of memory and disk space, you should be able to handle that on your local computer, although training a complex model may require a GPU. I have no experience with running “real” training locally, though.

There are many cloud-based services for running your training. I have tried Google Colab, and it’s easy to get started there since it supports Jupyter Notebooks. Here’s a thread about how to get one of our assignment notebooks to run on Colab. One nice thing about Colab is that you can experiment with it for free and get access to GPUs and TPUs. The one catch is that if you are not a paying customer, you may have to wait to run your jobs when the paying users are busy.

There are other services, including AWS, that support running your training, but I have no personal experience with them. They will probably not be free, but it’s still way cheaper than building the same amount of compute power yourself. :nerd_face:
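One practical note if you do try Colab: after picking a GPU runtime (Runtime → Change runtime type), it’s worth confirming the accelerator is actually attached. A sketch assuming TensorFlow (PyTorch users would check torch.cuda.is_available() instead):

```python
import tensorflow as tf

# Lists any GPUs attached to this runtime; an empty list means you
# are still on the plain CPU runtime.
print(tf.config.list_physical_devices("GPU"))

# Optional: mount your Google Drive so a 2GB dataset persists across
# sessions instead of being re-uploaded every time.
from google.colab import drive
drive.mount("/content/drive")
```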