So I built a ResNet-50 model. Each epoch took 2 hours, so 100 epochs would take 200 hours, which is around 8 days for a single training run. I thought I would create a smaller model, so I dropped from 15 million parameters to 1.5 million. Each epoch now takes 1 hour and 45 minutes. The drop in parameters is significant, but training time dropped by only 15 minutes.
Any input on this?
Hi, @Marios_Constantinou!
When it comes to training deep learning models, there are a couple of things to take into account to speed up the process. First, I am going to go over the most common bottlenecks in the overall pipeline:
- Loading data from disk for each batch: if all the data is pre-loaded in RAM, loading is much faster, although this is not always possible due to memory restrictions. If that is your case, make sure the loading process itself is optimized.
- The evaluation process may take some time. It can be a good option to evaluate only every few epochs, not after every single one.
- Saving the model on each epoch. Similar to the previous point: checkpoint every few epochs instead. (See the sketch after this list for all three.)
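Here is a minimal sketch of what that can look like in PyTorch. The dataset, model, and the `eval_every`/`save_every` intervals are all placeholders I made up for illustration; you would swap in your own ResNet and data:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def main():
    # Dummy data and model just so the sketch runs end to end;
    # replace these with your own dataset and ResNet.
    train_ds = TensorDataset(torch.randn(1024, 32),
                             torch.randint(0, 10, (1024,)))
    model = nn.Linear(32, 10)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()

    # Parallel workers + pinned memory keep the GPU from waiting on disk I/O.
    loader = DataLoader(train_ds, batch_size=64, shuffle=True,
                        num_workers=4, pin_memory=True)

    num_epochs, eval_every, save_every = 20, 5, 5  # hypothetical intervals

    for epoch in range(num_epochs):
        model.train()
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

        # Evaluate and checkpoint only every few epochs, not on every one.
        if (epoch + 1) % eval_every == 0:
            model.eval()
            # ... run your validation loop here ...
        if (epoch + 1) % save_every == 0:
            torch.save(model.state_dict(), f"checkpoint_{epoch + 1}.pt")


if __name__ == "__main__":
    main()
```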
Assuming everything else is optimized, the number of parameters is not the only thing that matters for runtime performance. You also have to consider how many FLOPs (floating point operations) each single forward pass needs and how well those ops parallelize on your hardware (throughput). Check Table 1 of Gao et al. for reference.
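A rough way to see this yourself is to count parameters and time forward passes directly. A minimal sketch (the two toy architectures below are just stand-ins I picked to make the point, not your models):

```python
import time

import torch
from torch import nn


def params_and_latency(model, x, warmup=3, iters=20):
    """Count parameters and measure average forward-pass time."""
    n_params = sum(p.numel() for p in model.parameters())
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):          # warm-up runs, not timed
            model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()     # wait for queued GPU work
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        latency = (time.perf_counter() - start) / iters
    return n_params, latency


# Two toy models: "narrow" has ~8x fewer parameters than "wide", but its
# deeper sequential structure parallelizes worse, so it is not ~8x faster.
wide = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
narrow = nn.Sequential(*[nn.Linear(256, 256) for _ in range(8)])

x_wide, x_narrow = torch.randn(64, 1024), torch.randn(64, 256)
for name, m, x in [("wide", wide, x_wide), ("narrow", narrow, x_narrow)]:
    p, t = params_and_latency(m, x)
    print(f"{name}: {p / 1e6:.2f}M params, {t * 1e3:.2f} ms/forward")
```

If you run something like this on your two models, you will likely see exactly what you observed: parameter count and wall-clock time per pass do not scale together.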
Gotcha, I will look into it!