Creating a tf.data.Dataset from URL images

I’m training a CNN model on some images. The images are stored in a Google Cloud Storage bucket, and a CSV file contains the image paths (HTTP URLs) in one column and the corresponding labels in subsequent columns.
I want to create a TensorFlow dataset using tf.data, but TensorFlow doesn’t understand HTTP URLs, so it throws an error when I run model.fit(). How can I proceed?

You can create a function that downloads an image and returns it as an array. Something like this:

import numpy as np
import requests
from io import BytesIO
from PIL import Image

def load_image(url):
    response = requests.get(url)
    img = Image.open(BytesIO(response.content))
    return np.array(img)

Then you can call this function in a for loop to build up your dataset.
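A minimal sketch of that idea (untested), assuming the CSV has columns named 'url' and 'label' and that all images share the same dimensions (otherwise resize them inside load_image):

import numpy as np
import pandas as pd
import tensorflow as tf

df = pd.read_csv('your_file.csv')  # assumed columns: 'url' and 'label'

# Download every image up front and stack the arrays into one tensor
images = np.stack([load_image(u) for u in df['url']])
labels = np.array(df['label'])

ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(32)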

Have you tried something like this?

Thanks @Juan_Olano !
I am not sure how to update my Dataset object for each image array. I’m using tf.data.Dataset.from_tensor_slices to create the Dataset object. Is there a way to update this Dataset every time I load an image?

Maybe take advantage of @Juan_Olano's idea, but move the loop forward, i.e. download all of the images to a directory first, then load them all into your dataset in one go?
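A rough sketch of that approach (untested), assuming you already have parallel lists urls_list and labels_list from the CSV and a local target folder named images (all hypothetical names):

import os
import requests
import tensorflow as tf

# Save each image as images/<label>/<index>.jpg so labels can be inferred from the folders
for i, (url, label) in enumerate(zip(urls_list, labels_list)):
    out_dir = os.path.join('images', str(label))
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, f'{i}.jpg'), 'wb') as f:
        f.write(requests.get(url).content)

# Load everything in one go; labels are inferred from the folder names
ds = tf.keras.utils.image_dataset_from_directory('images', image_size=(224, 224), batch_size=32)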


A possible solution could be something like this, using the function I proposed above to load an image into an array:

  1. Load your CSV of URLs into a list using pandas:

df = pd.read_csv('your_file.csv')
urls_list = df['url'].tolist()  # This is assuming the column name is 'url'

  2. Create a dataset from this list:

ds = tf.data.Dataset.from_tensor_slices(urls_list)

  3. Use the map function to load the arrays.
     Check HERE for an explanation of 'map'.

ds2 = ds.map(lambda url: tf.py_function(func=load_image, inp=[url], Tout=tf.float32))

I have not tried this code so there may be glitches to fix, but that gives you a general idea.
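Putting the pieces together, a fuller end-to-end sketch might look like this (untested; note that inside tf.py_function the URL arrives as a string tensor and has to be converted with .numpy().decode(), and the column names and image size are assumptions):

import numpy as np
import pandas as pd
import requests
import tensorflow as tf
from io import BytesIO
from PIL import Image

df = pd.read_csv('your_file.csv')  # assumed columns: 'url' and 'label'
urls_list = df['url'].tolist()
labels_list = df['label'].tolist()  # assumed to be integer class labels

def load_image(url):
    # Inside tf.py_function the argument is a tensor, so convert it to a Python string
    url = url.numpy().decode('utf-8')
    response = requests.get(url)
    img = Image.open(BytesIO(response.content)).convert('RGB').resize((224, 224))
    return np.array(img, dtype=np.float32)

def load_pair(url, label):
    img = tf.py_function(func=load_image, inp=[url], Tout=tf.float32)
    img.set_shape((224, 224, 3))  # py_function drops shape info, so set it explicitly
    return img, label

ds = tf.data.Dataset.from_tensor_slices((urls_list, labels_list))
ds2 = ds.map(load_pair).batch(32)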


Thanks @Juan_Olano !
I will try this approach.


Weren’t you also asking about reducing training time? One of the strategies for improving system performance, not just in ML, is to move data and computation closer together. Sometimes that means putting computation at the ‘edge’ so the data doesn’t move (much). Sometimes it means moving the data once, even though that step is expensive, and then using it multiple times from the new location. So you probably want to avoid loading big data across the web on the fly every time you run a training cycle. Once you have paid that price, consider storing the data locally and reusing it.
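For example, with tf.data you can pay the download cost once and cache the decoded images to local disk, so later epochs read from the cache file instead of going back over the network (ds2 here is the mapped dataset from the earlier sketch, model is your compiled CNN, and the cache path is just an example):

# Cache decoded batches to a local file during the first epoch;
# subsequent epochs read from the cache instead of re-downloading
ds_cached = ds2.cache('/content/image_cache').prefetch(tf.data.AUTOTUNE)
model.fit(ds_cached, epochs=10)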

For the time being I'm using Google Drive for data storage, as I'm training my model on Google Colaboratory. I tried using Google Cloud Storage but the results were about the same. Until I get my own workstation, I'll keep training on Colab.
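In case it helps anyone else, the usual pattern there is to mount Drive and read from the local mount point (the folder name below is just an example):

from google.colab import drive
import tensorflow as tf

# Mount Google Drive into the Colab filesystem
drive.mount('/content/drive')

# Read images from the mounted folder instead of over HTTP
data_dir = '/content/drive/MyDrive/my_images'  # hypothetical folder with one subfolder per class
ds = tf.keras.utils.image_dataset_from_directory(data_dir, image_size=(224, 224), batch_size=32)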