How to create a training set from custom paths

I have a dataframe which contains about 18,000 rows; one column contains the paths to the training images.

I wrote the following code to build the X array, which should have shape (18000, 160, 120, 3):

import cv2
import numpy as np

def process_path(path):
    # Read one image from disk, convert BGR -> RGB, and scale pixel values to [0, 1].
    img = cv2.imread(path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = img.astype("float")
    img /= 255.0
    return img

X = np.zeros((df.shape[0], 160, 120, 3))

path_array = df["path"].to_numpy()

for i in range(len(path_array)):
    X[i, :, :, :] = process_path(path_array[i])


This code takes a very long time, and my question is: is there a faster way to build the training array X?

Have you seen this? tf.data.Dataset  |  TensorFlow Core v2.7.0

Using the Dataset construct that Balaji points out will simplify the code a bit, but my guess is that the fundamental problem is that you are sequentially opening, reading and closing 18,000 disk files every time you run this. That’s a lot of file operations, and it’s just going to take a long time. Another approach would be to repurpose that code as a “preprocessing” step that you only need to run once: open and read each file, and then write a single output database file in a format like h5 that contains all the images. The h5 file will be pretty huge of course (the sum of the sizes of all the individual files plus a bit of overhead), but I would bet that the actual model code that opens and loads that one file will be a lot quicker.
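In case it helps, here is a minimal sketch of that one-time preprocessing step using h5py. The output file name train_images.h5, the dataset name "X", and the assumption that every image is already 160 x 120 are my own choices, so adjust them for your data:

import cv2
import h5py

def build_image_h5(df, out_path="train_images.h5"):
    # One-time step: read every image file once and write them all into a
    # single h5 file, so later runs only have to open one file.
    n = df.shape[0]
    with h5py.File(out_path, "w") as f:
        dset = f.create_dataset("X", shape=(n, 160, 120, 3), dtype="float32")
        for i, path in enumerate(df["path"]):
            img = cv2.imread(path)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            dset[i] = img.astype("float32") / 255.0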

Here’s a thread from a fellow student about how to create an h5 file containing multiple images from separate files. You’ll have to adjust the code for your purposes, but it gives you all the logic you need to build your “preprocessor”. Then the “load” logic that you run every time can be simplified to work the way the load_dataset functions do in the C1 W2 Logistic Regression assignment, for example.
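The per-run “load” step then reduces to a couple of lines, in the spirit of those load_dataset helpers (again assuming the hypothetical train_images.h5 and "X" names from the sketch above):

import h5py
import numpy as np

def load_dataset(h5_path="train_images.h5"):
    # Read the whole preprocessed image array back with a single file open.
    with h5py.File(h5_path, "r") as f:
        X = np.array(f["X"])  # shape (num_images, 160, 120, 3)
    return X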

@paulinpaloalto num_parallel_calls in Dataset#map and the filename argument in Dataset#cache should help with that.
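For reference, a rough sketch of what that could look like (the parse function, the assumption that the images are JPEGs, and the cache file name are placeholders, not tested on the original dataframe):

import tensorflow as tf

def parse_image(path):
    # Decode and normalize one image inside the tf.data pipeline.
    raw = tf.io.read_file(path)
    img = tf.io.decode_jpeg(raw, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)  # scales to [0, 1]
    return img

ds = tf.data.Dataset.from_tensor_slices(df["path"].to_numpy())
ds = ds.map(parse_image, num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.cache("images_cache")  # writes decoded images to a cache file after the first pass
ds = ds.batch(32).prefetch(tf.data.AUTOTUNE)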

Hi, Balaji. That’s a good point. Thanks for illuminating those features of the TF Dataset class. But even if you introduce 16-way parallelism, you’re still opening 18,000 files. That’s a lot of file opens, and opening a file is an intrinsically expensive operation. I’ll bet you all the beer you can drink in one sitting that having all the images in a single database file will be significantly faster. :beer: :nerd_face:

Really, I appreciate your replies. I realized that I need a data generator that generates data from a dataframe. I found a library called keras_preprocessing (GitHub - keras-team/keras-preprocessing: Utilities for working with image data, text data, and sequence data).
There is a function called flow_from_dataframe that may help with my project.
As I understand it, it takes x_col, which is the image path column, and y_col, which is the target column.
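For a plain classification-style target, I think a minimal call would look something like this (the column names "path" and "label" are just placeholders for whatever is actually in the dataframe, and class_mode="raw" hands the y_col values to the model unchanged):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)

train_gen = datagen.flow_from_dataframe(
    dataframe=df,
    x_col="path",           # column holding the full image paths
    y_col="label",          # placeholder target column
    target_size=(160, 120),
    class_mode="raw",       # pass the y_col values through as-is
    batch_size=32,
)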

My project deals with a custom object detection dataset, so I also need the bounding box column from my dataframe to be included in the generated batches.
I don’t know how to use this function to do that, or whether I should implement a custom data generator for this purpose.

Have you seen Dataset#from_tensor_slices?
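If the box coordinates can be pulled out of the dataframe as an (N, 4) numeric array, something along these lines should work; the column names and the JPEG assumption below are placeholders to adapt to your data:

import tensorflow as tf

# Hypothetical column names: one path column plus four numeric box columns.
paths = df["path"].to_numpy()
boxes = df[["x_min", "y_min", "x_max", "y_max"]].to_numpy().astype("float32")

def parse_example(path, box):
    # Load and normalize the image; pass the box through as the label.
    raw = tf.io.read_file(path)
    img = tf.io.decode_jpeg(raw, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)
    return img, box

ds = tf.data.Dataset.from_tensor_slices((paths, boxes))
ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.batch(32).prefetch(tf.data.AUTOTUNE)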