I'm training a CNN model on some images. The images are saved in a Google Cloud Storage bucket. There is a CSV file that contains the image paths (HTTP URLs) in one column and the corresponding labels in subsequent columns.
I want to create a TensorFlow dataset using tf.data, but found that TensorFlow doesn't understand HTTP URLs, so it throws an error when I run model.fit(). How can I proceed?
You can create a function to download each image and return it as an array. Something like this:

import requests, numpy as np
from io import BytesIO
from PIL import Image

def load_image(url):
    response = requests.get(url)                 # fetch the image bytes over HTTP
    img = Image.open(BytesIO(response.content))  # decode with Pillow
    return np.array(img)
Then you can call this function in a for loop to build up your dataset.
Have you tried something like this?
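For instance, a minimal sketch of that loop, assuming the load_image function above and a CSV with a 'url' column and a single hypothetical 'label' column (your file may differ):

import pandas as pd
import numpy as np
import tensorflow as tf

df = pd.read_csv('your_file.csv')              # placeholder file name
images, labels = [], []
for url, label in zip(df['url'], df['label']):
    images.append(load_image(url))             # download and decode one image
    labels.append(label)

# all images must have the same shape for np.stack / from_tensor_slices to work
ds = tf.data.Dataset.from_tensor_slices((np.stack(images), np.array(labels)))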
Thanks @Juan_Olano !
I am not sure how to update my dataset object for each image array. I'm using tf.data.Dataset.from_tensor_slices to create the Dataset object. So is there a way to update this Dataset every time after I load an image?
Maybe take advantage of @Juan_Olano's idea, but move the loop earlier, i.e. download all of the images to a directory first, then load them all into your dataset in one go? See the sketch below.
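A rough sketch of that approach, assuming hypothetical 'url' and 'label' columns, numeric labels, and JPEG images (none of which I can verify from the thread):

import os
import pandas as pd
import requests
import tensorflow as tf

df = pd.read_csv('your_file.csv')
os.makedirs('images', exist_ok=True)

# 1) download each image once to a local directory
local_paths = []
for i, url in enumerate(df['url']):
    path = os.path.join('images', f'{i}.jpg')
    if not os.path.exists(path):
        with open(path, 'wb') as f:
            f.write(requests.get(url).content)
    local_paths.append(path)

# 2) build the dataset from local file paths in one go
def parse(path, label):
    img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)  # scale to [0, 1]
    img = tf.image.resize(img, [224, 224])               # fixed size so batches stack; 224x224 is a placeholder
    return img, label

ds = (tf.data.Dataset.from_tensor_slices((local_paths, df['label'].values))
        .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
        .batch(32)
        .prefetch(tf.data.AUTOTUNE))

Reading from local disk with tf.io.read_file keeps the whole pipeline in TensorFlow ops, so it can be parallelized and cached, unlike per-image HTTP requests.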
A possible solution could be something like this, using the function I proposed above to load an image into an array:
- Load your CSV of URLs into a list using pandas:

import pandas as pd

df = pd.read_csv('your_file.csv')
urls_list = df['url'].tolist()  # this assumes the column name is 'url'
- Load a dataset with this list
ds = tf.data.Dataset.from_tensor_slices(urls_list)
- Use the map function to load the arrays
Check HERE for an explanation of 'map'. Note that inside tf.py_function the URL arrives as a string tensor, so it needs to be decoded to a Python str before calling load_image, and the result cast to match Tout:
ds2 = ds.map(lambda url: tf.py_function(func=lambda u: load_image(u.numpy().decode('utf-8')).astype('float32'), inp=[url], Tout=tf.float32))
I have not tried this code so there may be glitches to fix, but that gives you a general idea.
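If it helps, here is one hedged way the pieces might fit together with labels attached and shapes fixed up so batching works; the 'label' column, the 3 channels, and the 224x224 size are assumptions on my part:

import pandas as pd
import tensorflow as tf

df = pd.read_csv('your_file.csv')
ds = tf.data.Dataset.from_tensor_slices((df['url'].tolist(), df['label'].values))

def fetch(url, label):
    # load_image is the requests/Pillow helper defined earlier in the thread
    img = tf.py_function(func=lambda u: load_image(u.numpy().decode('utf-8')).astype('float32'),
                         inp=[url], Tout=tf.float32)
    img = tf.ensure_shape(img, [None, None, 3])   # py_function loses shape information
    img = tf.image.resize(img, [224, 224]) / 255.0
    return img, label

ds2 = ds.map(fetch, num_parallel_calls=tf.data.AUTOTUNE).batch(32).prefetch(tf.data.AUTOTUNE)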
Thanks @Juan_Olano !
I will try this approach.
Weren't you also asking about reducing training time? One of the strategies for improving system performance, not just in ML, is to move data and computation closer together. Sometimes that means putting computation at the 'edge' so the data doesn't move (much). Sometimes it means moving the data once, even though that step is expensive, and then using it multiple times from the new location. So you probably want to avoid pulling big data across the web on the fly every time you run a training cycle. Once you have paid that price, consider storing the data and reusing it locally.
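For example, if you do keep streaming images over HTTP with the py_function approach above, tf.data's cache transformation can at least limit the downloads to the first epoch; the cache path here is just an illustration:

# write decoded elements to local cache files on the first pass, so later epochs
# read from '/content/image_cache*' instead of re-downloading every image
ds2 = ds2.cache('/content/image_cache')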
For the time being I'm using Google Drive for data storage, as I'm training my model on Google Colaboratory. I tried using Google Cloud Storage but the results are about the same. Until I get my own workstation, I'll be training on Colab.