How to create a training set from custom paths

I have a dataframe which contains about 18,000 rows; one column contains the paths to the training images.

I wrote the following code to build the X array, which should have shape (18000, 160, 120, 3):

import cv2
import numpy as np

def process_path(path):
    # Read one image from disk, convert BGR -> RGB, and scale pixel values to [0, 1].
    img = cv2.imread(path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = img.astype("float")
    img /= 255.0
    return img

X = np.zeros((df.shape[0], 160, 120, 3))

path_array = df["path"].to_numpy()

for i in range(len(path_array)):
    X[i, :, :, :] = process_path(path_array[i])


This code takes a very long time, and my question is: is there a faster way to build the training array X?

Have you seen this? tf.data.Dataset  |  TensorFlow Core v2.7.0

Using the Dataset construct that Balaji points out will simplify the code a bit, but my guess is that the fundamental problem is that you are sequentially opening, reading and closing 18,000 disk files every time you run this. That’s a lot of file operations, and it’s just going to take a long time. Another approach would be to repurpose that code as a “preprocessing” step that you only need to run once: open and read each file, and then write a single output database file in a format like h5 that contains all the images. The h5 file will be pretty huge of course (the sum of the sizes of all the individual files plus a bit of overhead), but I would bet that the actual model code that opens and loads that one file will be a lot quicker.
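In case it helps, here is a minimal sketch of that one-time preprocessing step using h5py. The output file name train_images.h5, the dataset name "X", and the assumption that every image is already 160 x 120 are my own choices, so adjust them for your data:

import cv2
import h5py

def build_image_h5(df, out_path="train_images.h5"):
    # One-time step: read every image file once and write them all into a
    # single h5 file, so later runs only have to open one file.
    n = df.shape[0]
    with h5py.File(out_path, "w") as f:
        dset = f.create_dataset("X", shape=(n, 160, 120, 3), dtype="float32")
        for i, path in enumerate(df["path"]):
            img = cv2.imread(path)
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            dset[i] = img.astype("float32") / 255.0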

Here’s a thread from a fellow student about how to create an h5 file containing multiple images from separate files. You’ll have to adjust the code for your purposes, but it gives you all the logic you need to build your “preprocessor”. Then the “load” logic that you run every time can be simplified to work the way the load_dataset functions do in the C1 W2 Logistic Regression assignment, for example.
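The per-run “load” step then reduces to a couple of lines, in the spirit of those load_dataset helpers (again assuming the hypothetical train_images.h5 and "X" names from the sketch above):

import h5py
import numpy as np

def load_dataset(h5_path="train_images.h5"):
    # Read the whole preprocessed image array back with a single file open.
    with h5py.File(h5_path, "r") as f:
        X = np.array(f["X"])  # shape (num_images, 160, 120, 3)
    return X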

@paulinpaloalto num_parallel_calls in Dataset#map and the filename argument in Dataset#cache should help with that.
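For reference, a rough sketch of what that could look like (the parse function, the assumption that the images are JPEGs, and the cache file name are placeholders, not tested on the original dataframe):

import tensorflow as tf

def parse_image(path):
    # Decode and normalize one image inside the tf.data pipeline.
    raw = tf.io.read_file(path)
    img = tf.io.decode_jpeg(raw, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)  # scales to [0, 1]
    return img

ds = tf.data.Dataset.from_tensor_slices(df["path"].to_numpy())
ds = ds.map(parse_image, num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.cache("images_cache")  # writes decoded images to a cache file after the first pass
ds = ds.batch(32).prefetch(tf.data.AUTOTUNE)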

Hi, Balaji. That’s a good point. Thanks for illuminating those features of the TF Dataset class. But even if you introduce 16-way parallelism, you’re still opening 18,000 files. That’s a lot of file opens, and opening a file is an intrinsically expensive operation. I’ll bet you all the beer you can drink in one sitting that having all the images in a single database file will be significantly faster. :beer: :nerd_face:

Really, I appreciate your replies. I realized that I need a data generator that generates data from a dataframe. I found a library called keras_preprocessing (GitHub - keras-team/keras-preprocessing: Utilities for working with image data, text data, and sequence data).
There is a function called flow_from_dataframe that may help with my project.
As I understand it, it takes x_col, which is the image path column, and y_col, which is the target column.
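For a plain classification-style target, I think a minimal call would look something like this (the column names "path" and "label" are just placeholders for whatever is actually in the dataframe, and class_mode="raw" hands the y_col values to the model unchanged):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)

train_gen = datagen.flow_from_dataframe(
    dataframe=df,
    x_col="path",           # column holding the full image paths
    y_col="label",          # placeholder target column
    target_size=(160, 120),
    class_mode="raw",       # pass the y_col values through as-is
    batch_size=32,
)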

My project deals with a custom object detection dataset, so I also need the bounding box column from my dataframe to be included in the generated batches.
I don’t know how to use this function to do that, or whether I should implement a custom data generator for this purpose.

Have you seen Dataset#from_tensor_slices?
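If the box coordinates can be pulled out of the dataframe as an (N, 4) numeric array, something along these lines should work; the column names and the JPEG assumption below are placeholders to adapt to your data:

import tensorflow as tf

# Hypothetical column names: one path column plus four numeric box columns.
paths = df["path"].to_numpy()
boxes = df[["x_min", "y_min", "x_max", "y_max"]].to_numpy().astype("float32")

def parse_example(path, box):
    # Load and normalize the image; pass the box through as the label.
    raw = tf.io.read_file(path)
    img = tf.io.decode_jpeg(raw, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)
    return img, box

ds = tf.data.Dataset.from_tensor_slices((paths, boxes))
ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.batch(32).prefetch(tf.data.AUTOTUNE)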