I have to train a CNN model for a medical imaging classification task. The dataset has around 500 thousand images. The input to the model has dimensions (256, 256, 3). What are the possible platforms where I can train it? Colab is out of the question, as it takes forever to run the first epoch. Google Cloud Platform is another option, but I’m unable to get GPU resources in any region. What are the other alternatives?
Are you sure that you cannot get adequate resources with paid Colab or paid Google Cloud? I was under the impression that the resources are much better if you pay for them.
Other services such as IBM Cloud, Azure, and AWS also offer online computing capabilities.
Hi @gent.spah. Yes, I’ve tried paid Colab Pro and it takes too much time to train, particularly because the dataset is stored in Google Drive and the I/O time is too high. I have yet to explore Google Cloud and the other options. Thanks for the suggestions.
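One common way to reduce the Drive I/O cost described above is to keep the dataset as a single archive on Drive, copy it once to the runtime's local disk, and read images from there during training. The following is only a minimal sketch under that assumption: the function name `stage_dataset` and the paths in the usage comment are hypothetical, and it assumes the images have already been packed into one zip file.

```python
import shutil
import zipfile
from pathlib import Path


def stage_dataset(archive_path, local_dir):
    """Copy one zip archive to local disk and extract it there.

    Returns the directory that holds the extracted files.
    """
    archive_path = Path(archive_path)
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)

    # One large sequential copy instead of many small reads over the mount.
    local_archive = local_dir / archive_path.name
    shutil.copy(archive_path, local_archive)

    # Extract next to the copy, into a folder named after the archive.
    extract_dir = local_dir / archive_path.stem
    with zipfile.ZipFile(local_archive) as zf:
        zf.extractall(extract_dir)

    # Remove the local copy of the archive to free disk space.
    local_archive.unlink()
    return extract_dir


# Hypothetical usage on Colab (paths are assumptions, not from the thread):
# data_dir = stage_dataset("/content/drive/MyDrive/dataset.zip", "/content/data")
```

Whether this helps depends on the local disk being large enough for the extracted images; the training input pipeline would then point at the returned directory instead of the Drive mount.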
You may want to see Amazon Web Services as well.
I bet processing 500K images will take quite some time on any machine. I was working with 90K images just last week on a Google Colab GPU and it took a good number of hours.
How long does 1 epoch take on the Google Colab GPU, or on any other GPU you’ve tried so far?
Hi Juan. I tried a Google Colab GPU with a dataset of over 250 thousand images. It took around 10 hours for the 1st epoch. After that, every subsequent epoch would run in approx. 3 hours. But the runtime would disconnect because of the 12-hour limit on Colab.
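Taking the numbers above at face value, a quick calculation shows how many full epochs fit into one session before the disconnect. This is only an illustrative sketch; the helper name `epochs_per_session` is hypothetical, and it assumes the epoch times stay constant.

```python
def epochs_per_session(first_epoch_h, later_epoch_h, limit_h):
    """Count the full epochs that finish within one session of limit_h hours."""
    if first_epoch_h > limit_h:
        return 0
    remaining = limit_h - first_epoch_h
    # The first epoch plus as many later epochs as fit in the remaining time.
    return 1 + int(remaining // later_epoch_h)


# Numbers from the post: 10 h first epoch, 3 h afterwards, 12 h limit.
print(epochs_per_session(10, 3, 12))  # 1

# If every epoch took 3 h, the same session would fit more epochs.
print(epochs_per_session(3, 3, 12))  # 4
```

With a 10-hour first epoch, the second epoch would end at hour 13, so only one epoch completes per session. Progress would have to be saved at the end of each epoch and restored in the next session for training to continue.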
Yes, I’m currently looking in detail at AWS SageMaker and other services. One thing I want to ask about is the time limit in AWS. How long can one instance run before it disconnects automatically? For Google Colab it is 12 hours.
Regarding your question on AWS SageMaker runtime, you can refer to THIS LINK where you’ll find the details.
From this link I extracted this:
The default value is 1 day. The maximum value is 28 days.
The maximum time that a TrainingJob can run in total, including any time spent publishing metrics or archiving and uploading models after it has been stopped, is 30 days.
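As a rough illustration, the limits quoted above can be converted to seconds. This sketch assumes the quote describes the `MaxRuntimeInSeconds` field of the `StoppingCondition` passed when creating a SageMaker training job; please confirm against the linked page. The helper `stopping_condition` is hypothetical and makes no AWS calls.

```python
SECONDS_PER_DAY = 24 * 60 * 60

DEFAULT_DAYS = 1  # default runtime limit from the quote
MAX_DAYS = 28     # maximum configurable runtime limit from the quote


def stopping_condition(days=DEFAULT_DAYS):
    """Build a stopping-condition dict for a runtime limit given in days."""
    if not 0 < days <= MAX_DAYS:
        raise ValueError(f"days must be in (0, {MAX_DAYS}], got {days}")
    # Assumed field name: MaxRuntimeInSeconds.
    return {"MaxRuntimeInSeconds": int(days * SECONDS_PER_DAY)}


print(stopping_condition())    # {'MaxRuntimeInSeconds': 86400}
print(stopping_condition(28))  # {'MaxRuntimeInSeconds': 2419200}
```

So, by default, a job stops after 86,400 seconds unless a longer limit is requested, up to 2,419,200 seconds.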
Hope this helps!