How to create a training set from custom paths

Using the Dataset construct that Balaji points out will simplify the code a bit, but my guess is that the fundamental problem is that you are sequentially opening, reading, and closing 18,000 disk files every time you run this. That's a lot of file operations, and it's just going to take a long time. Another approach would be to repurpose that code as a "preprocessing" step that you only need to run once: do the open and read operations on each file, then write a single output database file in a format like h5 that contains all the images. The h5 file will be pretty huge, of course (the sum of the sizes of all the individual files plus a bit of overhead), but I would bet that your actual model code, which only has to open and load that one file, will then be a lot quicker.
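Here's a minimal sketch of what that one-time preprocessing step could look like. All the names (`build_h5`, the `"images"` dataset key, the `loader` callback) are my own illustrations, not from the original code; the demo uses `.npy` files as stand-in "images" so it runs anywhere, but in your case the loader would be something like `lambda p: np.array(Image.open(p))` using PIL:

```python
import os
import tempfile

import h5py
import numpy as np


def build_h5(image_paths, out_path, loader, image_shape):
    """One-time preprocessing: read each image file once and append it
    to a single HDF5 dataset, so training runs open only one file."""
    n = len(image_paths)
    with h5py.File(out_path, "w") as f:
        dset = f.create_dataset("images", shape=(n, *image_shape), dtype="uint8")
        for i, path in enumerate(image_paths):
            dset[i] = loader(path)  # one disk read per source file, done once


# Demo with synthetic .npy "images" standing in for real image files.
tmp = tempfile.mkdtemp()
paths = []
for i in range(5):
    p = os.path.join(tmp, f"img_{i}.npy")
    np.save(p, np.full((4, 4, 3), i, dtype=np.uint8))
    paths.append(p)

out = os.path.join(tmp, "dataset.h5")
build_h5(paths, out, loader=np.load, image_shape=(4, 4, 3))

with h5py.File(out, "r") as f:
    print(f["images"].shape)  # (5, 4, 4, 3)
```

You pay the 18,000 file opens exactly once here, and every later training run just memory-maps or reads the single h5 file.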

Here’s a thread from a fellow student about how to create an h5 file containing multiple images from separate files. You’ll have to adapt the code for your purposes, but it gives you all the logic you need to build your “preprocessor”. Then the “load” logic that you run every time can be simplified to work the way the load_dataset functions do in the C1 W2 Logistic Regression assignment.
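For reference, the fast per-run load can be as small as the sketch below. This mirrors the shape of the assignment's load_dataset function, but the dataset keys (`"images"`, `"labels"`) are assumptions on my part; use whatever names your preprocessor actually wrote:

```python
import os
import tempfile

import h5py
import numpy as np


def load_dataset(h5_path):
    """Fast per-run load: one file open instead of thousands."""
    with h5py.File(h5_path, "r") as f:
        images = np.array(f["images"])  # pulls the whole array into memory
        labels = np.array(f["labels"]) if "labels" in f else None
    return images, labels


# Demo: write a tiny h5 file, then load it back in one call.
path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("images", data=np.zeros((10, 8, 8, 3), dtype=np.uint8))
    f.create_dataset("labels", data=np.arange(10))

X, y = load_dataset(path)
print(X.shape, y.shape)  # (10, 8, 8, 3) (10,)
```

If the full array won't fit in memory, you can instead keep the `h5py.File` open and slice `f["images"][i:j]` per minibatch, since h5py only reads the slices you index.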