Index issue with code copied on colab

Hey team/community,
I copied the beginning of the W3A2 assignment to Colab, with the following code placed in front of it:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
import os
os.chdir('/content/drive/Drive/W3A2C4')
os.path.exists('/content/drive/Drive/W3A2C4')

N = 90
img = imageio.v2.imread(image_list[N])
mask = imageio.v2.imread(mask_list[N])
#mask = np.array([max(mask[i, j]) for i in range(mask.shape[0]) for j in range(mask.shape[1])]).reshape(img.shape[0], img.shape[1])
fig, arr = plt.subplots(1, 2, figsize=(14, 10))
arr[0].imshow(img)
arr[0].set_title('Image')
arr[1].imshow(mask[:, :, 0])
arr[1].set_title('Segmentation')
At that point I noticed an index issue (the mask does not correspond to the image).

The code two cells below shows the same thing:

Is there anything I am doing wrong?

Are you sure that image_list and mask_list end up the same in both environments? Where are those variables set? That seems to be what drives it. Did you modify any of that code? If it’s just enumerating files in a directory, maybe a different version of linux works differently? It’s also possible that you end up using a different version of TF when you run on Colab. The TF APIs evolve pretty quickly and not always in backwards compatible ways …

Is it just me, or is the fact that the numbers in the file names are different a bit concerning? If the numbers of the image file and the mask file don’t align, how could one ever keep them synched?

Hey Paul, the files have been copied to Google Drive as is. The code has not been modified. Even if TF is newer (it is), I wonder why the indexing of the files (img and mask) would not be robust against a change of TF version. On Coursera, it is of course working. On Colab, not only do the img and the mask not match, but the img is also not the same as the one on Coursera for the same N. That should not happen, should it?

TensorFlow routinely ignores backward compatibility, so do not expect newer TF versions to work the same way on a dataset that was prepared for an earlier version.

But as @ai_curious has pointed out, it did happen: the file names are different between the image and the mask. Now your job is to debug why that happened.

I just checked in my version of the notebook on the course website and here’s what I see for the section you showed:

image_filenames = tf.constant(image_list)
masks_filenames = tf.constant(mask_list)

dataset = tf.data.Dataset.from_tensor_slices((image_filenames, masks_filenames))

for image, mask in dataset.take(1):
    print(image)
    print(mask)

And the output is:

tf.Tensor(b'./data/CameraRGB/002128.png', shape=(), dtype=string)
tf.Tensor(b'./data/CameraMask/002128.png', shape=(), dtype=string)

So the file names match, but are from different directories. Why do they not match when you run it on Colab? You got 018084.png and 004094.png. How did that happen?
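A quick way to answer that question for the whole dataset, rather than for one pair, is to compare base file names position by position. This is only a minimal sketch: the helper name `find_mismatched_pairs` and the sample lists are made up for illustration and are not part of the course notebook.

```python
import os


def find_mismatched_pairs(image_paths, mask_paths):
    """Return (index, image_name, mask_name) for each position whose base names differ."""
    mismatches = []
    # zip stops at the shorter list, so compare the lengths separately if they can differ.
    for idx, (img, msk) in enumerate(zip(image_paths, mask_paths)):
        img_name = os.path.basename(img)
        msk_name = os.path.basename(msk)
        if img_name != msk_name:
            mismatches.append((idx, img_name, msk_name))
    return mismatches


# Hypothetical lists that mimic the two situations seen in this thread.
aligned_images = ["./data/CameraRGB/002128.png", "./data/CameraRGB/008579.png"]
aligned_masks = ["./data/CameraMask/002128.png", "./data/CameraMask/008579.png"]
shuffled_masks = ["./data/CameraMask/008579.png", "./data/CameraMask/002128.png"]

print(find_mismatched_pairs(aligned_images, aligned_masks))   # no mismatches
print(find_mismatched_pairs(aligned_images, shuffled_masks))  # every position differs
```

Running it on the real image_list and mask_list in both environments would show whether the lists are already misaligned before any TF code touches them.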

Hey Paul again. I get the exact same outputs on Coursera (see below).

Now let’s look at the notebook on colab:
here are the folders (and structure)

here is the code to set the dir
(Is W3A2 hard-coded somewhere else in the other files? I had to rename the folder because I already had one with that name.) EDIT: changed to W3A2 to make sure, same behavior.
imports 1

imports 2 & loads

first check

dataset split

second check

I stopped here as it does not make sense

What should I do to have the same behavior as coursera?

Hi @pat,

As Paul mentioned, we encourage that you debug why this has happened in the colab.

From my experience: I once downloaded all of the files from Coursera, which came as a zip archive. I use a Mac, and macOS usually creates hidden files within folders (to make “Spotlight search” easier, maybe? I’m not sure why it does that), so when I unzipped the folder, there were hidden files inside.

Now, the code works by index, right? So when I tried the code, it gave me an error. I searched online and it turned out that the index (value of N) I was using was hitting a hidden file, which then threw an error when displaying.
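If hidden files are the suspect, one option is to filter the directory listing before indexing into it. The sketch below is an assumption about what such a filter could look like, not code from the notebook: `list_visible_pngs` is a hypothetical helper, and `.DS_Store` and `._002128.png` are just typical examples of macOS leftovers.

```python
import os
import tempfile


def list_visible_pngs(directory):
    """Return the .png names in `directory`, skipping dot-prefixed (hidden) entries."""
    return [
        name
        for name in os.listdir(directory)
        if not name.startswith(".") and name.lower().endswith(".png")
    ]


# Demonstration on a throwaway directory with two real images and two hidden files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["002128.png", "008579.png", ".DS_Store", "._002128.png"]:
        open(os.path.join(tmp, name), "w").close()
    visible = list_visible_pngs(tmp)
    print(len(os.listdir(tmp)), "entries in total,", len(visible), "visible .png files")
```

Comparing `len(os.listdir(...))` with the filtered count on both platforms is a cheap way to see whether extra entries crept in during the zip and upload round trip.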

So either your files have changed in the folders or, as others suspect, it could be a TF issue.

Good luck,
Mubsi

Note from what you most recently showed, we can’t really see what is happening in the “dataset split” portion because we can only see the image pathnames and can’t see whether the mask filenames match.

But it seems like you’ve narrowed it down to the TF Dataset “from_tensor_slices” method. Have you read the doc page on that? What does it do? You can also print the TF versions both in the course and on Colab. You can then find version-specific TF documentation and see if there is some change in the default behavior of that API.

I added this to my notebook:

print(f"tf version {tf.__version__}")
tf version 2.9.1

What version do you see on Colab? It turns out this notebook was one of the ones that was recently revised to support using the GPUs on AWS for running the training, and they upgraded to a more recent TF version. The other notebooks that were last updated in April 2021 mostly run 2.3.x versions of TF.
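To compare the two environments in one go, a small report of interpreter, platform and package versions can be run in both places. This is a sketch using only the standard library; the function name `environment_report` and the default package list are my own choices, and packages that are not installed are simply reported as such.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("tensorflow", "numpy", "imageio")):
    """Collect Python, platform and installed package versions into a dict."""
    report = {"python": sys.version.split()[0], "platform": platform.platform()}
    for pkg in packages:
        try:
            report[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            report[pkg] = "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Pasting the two outputs side by side makes it obvious which components differ between Coursera and Colab.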

I don’t have access to this code, but my intuition is that the fancy list comprehension expression is hiding what is going on in the creation of image_list and mask_list. My initial reply above was intended to be a clue about how to explore @paulinpaloalto ’s original observation about those variables as being key. If it were me, I would write those in an explicit loop and output each file name as it is encountered.

That is definitely one of the things to check, but it seems less likely that the semantics of that kind of basic python construct would get changed. I looked up os.listdir() and I can’t find any statement about the order in which the names are returned. It could be in “directory order” over which you basically have no control or it could be lexicographic order. The “official” python document specifically says that it won’t return “.” or “..”, but says nothing about other invisible names like “.__pycache__” or the like. So another possibility is that your Google Drive folder has different invisible files in it for some googly reason.

Or the other place to look would be the semantics of tf.data.Dataset.from_tensor_slices.

If I were the one whose ox was being gored here, I’d put some print statements at every step in the process and run them both places.
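On the ordering question: the `os.listdir` documentation describes the returned list as being in arbitrary order, so nothing guarantees that two directories are listed in the same sequence. A minimal sketch of one way around that is to sort both listings; `sorted_listing` is a hypothetical helper, and the temporary directories only stand in for CameraRGB and CameraMask.

```python
import os
import tempfile


def sorted_listing(directory):
    """Return directory entries in lexicographic order, independent of the file system."""
    return sorted(os.listdir(directory))


with tempfile.TemporaryDirectory() as root:
    image_dir = os.path.join(root, "CameraRGB")
    mask_dir = os.path.join(root, "CameraMask")
    os.makedirs(image_dir)
    os.makedirs(mask_dir)

    names = ["008579.png", "002128.png", "015232.png"]
    # Create the files in opposite orders to mimic directories built differently.
    for name in names:
        open(os.path.join(image_dir, name), "w").close()
    for name in reversed(names):
        open(os.path.join(mask_dir, name), "w").close()

    image_names = sorted_listing(image_dir)
    mask_names = sorted_listing(mask_dir)
    print(image_names)
    print(image_names == mask_names)
```

Sorting only aligns the two lists when both directories contain exactly the same file names, but it has the side benefit that a given index N should point at the same file on every platform.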

I added some print logic to the first block that creates the lists:

path = ''
image_path = os.path.join(path, './data/CameraRGB/')
mask_path = os.path.join(path, './data/CameraMask/')
image_list = os.listdir(image_path)
mask_list = os.listdir(mask_path)
print("Before list comprehension")
print(f"type(image_list) = {type(image_list)}, len {len(image_list)}, type(image_list[0]) {type(image_list[0])}")
print(f"type(mask_list) = {type(mask_list)}, len {len(mask_list)}")
for ii in range(6):
    print(f"{ii} - image {image_list[ii]} mask {mask_list[ii]}")
image_list = [image_path+i for i in image_list]
mask_list = [mask_path+i for i in mask_list]
print("After list comprehension")
print(f"type(image_list) = {type(image_list)}, len {len(image_list)}, type(image_list[0]) {type(image_list[0])}")
print(f"type(mask_list) = {type(mask_list)}, len {len(mask_list)}")
for ii in range(6):
    print(f"{ii} - image {image_list[ii]} mask {mask_list[ii]}")

Here’s what I get when I run that on the course website:

Before list comprehension
type(image_list) = <class 'list'>, len 1060, type(image_list[0]) <class 'str'>
type(mask_list) = <class 'list'>, len 1060
0 - image 002128.png mask 002128.png
1 - image 008579.png mask 008579.png
2 - image 015232.png mask 015232.png
3 - image 006878.png mask 006878.png
4 - image 008104.png mask 008104.png
5 - image 002281.png mask 002281.png
After list comprehension
type(image_list) = <class 'list'>, len 1060, type(image_list[0]) <class 'str'>
type(mask_list) = <class 'list'>, len 1060
0 - image ./data/CameraRGB/002128.png mask ./data/CameraMask/002128.png
1 - image ./data/CameraRGB/008579.png mask ./data/CameraMask/008579.png
2 - image ./data/CameraRGB/015232.png mask ./data/CameraMask/015232.png
3 - image ./data/CameraRGB/006878.png mask ./data/CameraMask/006878.png
4 - image ./data/CameraRGB/008104.png mask ./data/CameraMask/008104.png
5 - image ./data/CameraRGB/002281.png mask ./data/CameraMask/002281.png

So what we see there is that clearly os.listdir() in fact does not return the file names in the directory in lexicographic order. Not sure I completely understand that, but I find it a bit scary. If it’s really “directory order” meaning just doing a raw read of the directory data block and returning them in the order that they are there, that would imply that they created the directories in a very disciplined way: putting the corresponding files into the two parallel directories in exactly the same order. Then when you use “zip” to download the files and then upload them to your Google Drive, can we guarantee that all the orders are preserved? Hmmmmm. I must be missing something basic here.

But clearly it would be worth having Pat try the above experiment in both places and see what happens.

Agree. I meant only as a diagnostic of what is actually coming back from the directory listing, not that it is performing different logic.

I doubt that relying on the order of the items in the list returned by os.listdir() is robust and portable.

Much better if there was some metadata associated with each file - or maybe a method of extracting the info from the file name.

Actually that’s a great idea: we already know the name of the matching mask file, it’s just in the other directory. So we build either list first and then construct the other one from it. Actually we’ve already got the mechanism set up in the list comprehension section. Just rewrite it like this:

path = ''
image_path = os.path.join(path, './data/CameraRGB/')
mask_path = os.path.join(path, './data/CameraMask/')
image_list_orig = os.listdir(image_path)
image_list = [image_path+i for i in image_list_orig]
mask_list = [mask_path+i for i in image_list_orig]
print("After list comprehension")
print(f"type(image_list) = {type(image_list)}, len {len(image_list)}, type(image_list[0]) {type(image_list[0])}")
print(f"type(mask_list) = {type(mask_list)}, len {len(mask_list)}")
for ii in range(6):
    print(f"{ii} - image {image_list[ii]} mask {mask_list[ii]}")

See what I did there? No need to do a listdir on the mask directory. That should work even if the two directories are somehow not created in the same order. It gives the same final output as above:

After list comprehension
type(image_list) = <class 'list'>, len 1060, type(image_list[0]) <class 'str'>
type(mask_list) = <class 'list'>, len 1060
0 - image ./data/CameraRGB/002128.png mask ./data/CameraMask/002128.png
1 - image ./data/CameraRGB/008579.png mask ./data/CameraMask/008579.png
2 - image ./data/CameraRGB/015232.png mask ./data/CameraMask/015232.png
3 - image ./data/CameraRGB/006878.png mask ./data/CameraMask/006878.png
4 - image ./data/CameraRGB/008104.png mask ./data/CameraMask/008104.png
5 - image ./data/CameraRGB/002281.png mask ./data/CameraMask/002281.png
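One caveat with deriving the mask path from the image name is that it assumes every image really has a mask with the same name. A hedged sketch of a guard for that, with a hypothetical helper `build_pairs` and throwaway directories in place of the real data folders:

```python
import os
import tempfile


def build_pairs(image_dir, mask_dir):
    """Pair each image with the same-named mask; return (pairs, names_without_mask)."""
    pairs, missing = [], []
    for name in sorted(os.listdir(image_dir)):
        image_path = os.path.join(image_dir, name)
        mask_path = os.path.join(mask_dir, name)
        if os.path.isfile(mask_path):
            pairs.append((image_path, mask_path))
        else:
            missing.append(name)
    return pairs, missing


with tempfile.TemporaryDirectory() as root:
    image_dir = os.path.join(root, "CameraRGB")
    mask_dir = os.path.join(root, "CameraMask")
    os.makedirs(image_dir)
    os.makedirs(mask_dir)

    for name in ["002128.png", "008579.png", "015232.png"]:
        open(os.path.join(image_dir, name), "w").close()
    # Deliberately leave out the third mask to show how a gap is reported.
    for name in ["002128.png", "008579.png"]:
        open(os.path.join(mask_dir, name), "w").close()

    pairs, missing = build_pairs(image_dir, mask_dir)
    print(len(pairs), "pairs;", "images without a mask:", missing)
```

From there, `image_list = [p[0] for p in pairs]` and `mask_list = [p[1] for p in pairs]` would give two lists that are aligned by construction.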

Pat, you should give that a try and let us know!

Working as expected! Thanks Paul!! I guess the notebook on Coursera should be updated (if not done already) 🙂

to be complete:
here it is:

still not the same order, but now matching with masks

short comparison:
Coursera/Jupyter

Colab

It’s good to hear that it works for you with that change. I can report this as a suggested enhancement to the course notebook, but they may well just shine it on since their code works as advertised in the context for which they built it. But what I believe we’ve shown is that their methodology is a bit sketchy …