Extra data sets for week 2

Hi all, I made some data sets that contain 500x500 pixel images. They have way fewer images, so I could only get to ~62% accuracy when playing around with the learning rate and number of iterations, but it was a bit of fun. The script also resizes images to 500x500 by squishing rather than cropping, which I imagine isn’t ideal either.

As they are .h5 files I can’t upload them here, but if anyone is interested I can try to post a link to them or something easier.

Interesting! It’s always a good learning experience when you try experiments like this. The normal method of resizing images is to rescale them, rather than cropping, so that shouldn’t be a problem. Any decent image library supports downsampling or upsampling images.
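To make the resizing point concrete, here is a minimal Pillow sketch (the image is a synthetic stand-in, not one of Tom’s files):

```python
from PIL import Image
import numpy as np

# A dummy 300x200 RGB image standing in for a real photo.
img = Image.new("RGB", (300, 200), color=(120, 80, 40))

# resize() rescales ("squishes") to the target size rather than cropping,
# so the aspect ratio is not preserved.
small = img.resize((64, 64))
arr = np.array(small)
print(arr.shape)  # (64, 64, 3)
```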

Discourse supports attachments to posts or replies, so you should be able to use the “up arrow” tool if you want to upload your h5 files containing the images. Maybe they have restrictions on the file types that can be attached. If h5 doesn’t work, you could also try packaging it as a zip file. Well, I guess the other potential issue is copyright violation: I assume these are images that are in the public domain or that you have the rights to publish.

I assume these are images that are in the public domain or that you have the rights to publish

Possibly not, depending on your country etc. Would it be allowed to post my code that generated the .h5 files instead?

As long as the code you are describing was written by you and not based on the solution code for any of the exercises or someone else’s copyrighted code, it should be fine to share it here.

import sys
import glob
import h5py
import numpy as np
from PIL import Image

IMG_WIDTH = 500
IMG_HEIGHT = 500

h5file = 'test1_catvnoncat.h5'

nfiles = len(glob.glob('images/TestData/*.jpg'))
print(f'count of image files nfiles={nfiles}')


with h5py.File(h5file, 'w') as h5f:

    dt = h5py.string_dtype(encoding='ascii',length=7)
    classes = h5f.create_dataset('list_classes',shape=(2), dtype=dt)
    classes[0] = "non-cat"
    classes[1] = "cat"

    binary_output = np.zeros(nfiles).reshape(nfiles,1)

    img_ds = h5f.create_dataset('data_set_x',shape=(nfiles, IMG_WIDTH, IMG_HEIGHT,3), dtype=int)
    for cnt, ifile in enumerate(glob.iglob('images/TestData/*.jpg')):

        image_resized = np.array(Image.open(ifile).resize((IMG_WIDTH, IMG_HEIGHT)))

        print(ifile)
        img_ds[cnt:cnt+1:,:,:] = image_resized
        print(cnt)

        if ifile[16] == 'c':
            binary_output[cnt] = 1


    bin_ds = h5f.create_dataset('data_set_y',shape=(nfiles, 1), dtype=int)
    bin_ds[:] = binary_output

Hi, Tom.

Thanks! Very nice! I had not actually looked into how to use the h5 python API, so this is really helpful.

Just a couple of suggestions/questions:

It would be nice to add a comment explaining your “labeling” scheme: the (leaf) file name of a cat image must start with ‘c’, as one can deduce by reading the code. But you have to manually compute the length of the string ‘images/TestData/’ to figure out how it works. Why make the reader work so hard? :grin:

You could also simplify the initialization of the label variable like this:

binary_output = np.zeros((nfiles,1))

That seems simpler than creating it as a 1D array and then doing a reshape.

Also this line is making my head hurt a bit trying to understand it:

img_ds[cnt:cnt+1:,:,:] = image_resized

Wouldn’t it be simpler and clearer to say:

img_ds[cnt] = image_resized

Since img_ds is a 4D array, image_resized is a 3D array, and the “samples” dimension of img_ds is dimension 0. Or does that actually not do what I think it does? Or maybe this would be the explicit and more clearly correct version of the above:

img_ds[cnt,:,:,:] = image_resized

Thanks again!
Paul

Note that if you change the name of the directory, then you have to do the length computation all over again. How about converting the above code into a function that takes the directory name and the output h5 file name as arguments? Then you could just “cd” to the directory as the first step in the function and then blast away without having to compute the length of any strings. You’d just be getting the leaf file names, so it would be filename[0] that is the label.
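A sketch of that refactoring might look like the following (the function and argument names are just my own invention, and it assumes the same file layout and labeling scheme as Tom’s script):

```python
import os
import glob
import h5py
import numpy as np
from PIL import Image

def build_h5_dataset(image_dir, h5file, img_size=(500, 500)):
    """Pack all .jpg files in image_dir into an h5 file.

    Labels each image by the first character of its leaf file name:
    'c' means cat (label 1), anything else is non-cat (label 0).
    Note: assumes a square img_size, since PIL's resize takes (width, height).
    """
    paths = sorted(glob.glob(os.path.join(image_dir, '*.jpg')))
    nfiles = len(paths)

    with h5py.File(h5file, 'w') as h5f:
        dt = h5py.string_dtype(encoding='ascii', length=7)
        classes = h5f.create_dataset('list_classes', shape=(2,), dtype=dt)
        classes[0] = "non-cat"
        classes[1] = "cat"

        img_ds = h5f.create_dataset(
            'data_set_x', shape=(nfiles, img_size[0], img_size[1], 3), dtype=int)
        labels = np.zeros((nfiles, 1))
        for cnt, path in enumerate(paths):
            img_ds[cnt] = np.array(Image.open(path).resize(img_size))
            # Only the leaf name matters now, so there are no string
            # lengths to count by hand.
            if os.path.basename(path)[0] == 'c':
                labels[cnt] = 1

        bin_ds = h5f.create_dataset('data_set_y', shape=(nfiles, 1), dtype=int)
        bin_ds[:] = labels
```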

Thanks Paul, helpful feedback! Yes, the labelling of the files was a very rushed job as I wanted to get into it, but it definitely isn’t the cleanest method.

Hi @paulinpaloalto

I came across this post and worked my way through the code for the last 3 days. I more or less understand all of it by now except for one thing.

  1. The line

classes = h5f.create_dataset('list_classes', shape=(2), dtype=dt)

gives me the error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-a3a34845c1a2> in <module>
     19 with h5py.File(peter, 'w') as h5f:
     20     dt = h5py.string_dtype(encoding='ascii', length=7)
---> 21     classes = h5f.create_dataset('list_classes', shape=(2), dtype=dt)
     22     print("Classes shape = " + str(classes.shape))
     23     print("Classes plain = " + str(classes))

/opt/conda/lib/python3.7/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    134 
    135         with phil:
--> 136             dsid = dataset.make_new_dset(self, shape, dtype, data, **kwds)
    137             dset = dataset.Dataset(dsid)
    138             if name is not None:

/opt/conda/lib/python3.7/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl)
     91         shape = data.shape
     92     else:
---> 93         shape = tuple(shape)
     94         if data is not None and (numpy.product(shape, dtype=numpy.ulonglong) != numpy.product(data.shape, dtype=numpy.ulonglong)):
     95             raise ValueError("Shape tuple is incompatible with data")

TypeError: 'int' object is not iterable

So I tried alternatives like shape=(2,1)

classes = h5f.create_dataset('list_classes', shape=(2,1), dtype=dt)

But that gives me a different error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-82dfe3bcaf1d> in <module>
     22     print("Classes shape = " + str(classes.shape))
     23     print("Classes plain = " + str(classes))
---> 24     classes[0] = "non-cat"
     25     classes[1] = "cat"
     26 

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

/opt/conda/lib/python3.7/site-packages/h5py/_hl/dataset.py in __setitem__(self, args, val)
    706         mspace = h5s.create_simple(mshape_pad, (h5s.UNLIMITED,)*len(mshape_pad))
    707         for fspace in selection.broadcast(mshape):
--> 708             self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)
    709 
    710     def read_direct(self, dest, source_sel=None, dest_sel=None):

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5d.pyx in h5py.h5d.DatasetID.write()

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

TypeError: No conversion path for dtype: dtype('<U7')

The only “solution” (if it is one???) I was able to come up with was to use shape=(2,0)

classes = h5f.create_dataset('list_classes', shape=(2,0), dtype=dt)

Why is this the only solution to get it run?
Is this solution really working?
How can I check if it is working?

I can’t find the values “non-cat” and “cat” anywhere. Tried a lot and used debugging mode, but nothing would show me these values after they have been assigned.

Any valuable ideas, thoughts, or hints are very welcome. I ran out of ideas about where to search next.

Many thanks!

Hi, Sven,

I think it’s just a matter of thinking a little bit harder about the meaning of that shape. The original error is griping that it’s only an integer and not a tuple. At least that’s my reading of the error message. But then you use (2,1) to make it a 2D array. But when you write to one of the elements, you only use 1 index, which is what I think causes it to “throw” on the

classes[0] = ....

line. If it’s a 2D array, then it would be classes[0,0], right? My suggestion would be to try (2,), which would make it a 1D array and then leave the assignment statements alone. Let us know if that helps or not.
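For what it’s worth, the shape argument follows ordinary Python tuple semantics, which is why (2) and (2,) behave differently; a quick illustrative sketch:

```python
import numpy as np

# (2) is just the integer 2 in parentheses; only the trailing comma
# makes it a one-element tuple.
print(type((2)))   # <class 'int'>
print(type((2,)))  # <class 'tuple'>

# With a tuple shape you get a 1D container holding two elements.
a = np.zeros((2,))
print(a.shape)  # (2,)
print(a.size)   # 2
```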

Oh, sorry, I didn’t read all the way to the end of your post the first time. Note that (2,0) is not actually equivalent to (2,): a dataset of shape (2,0) contains 2 × 0 = 0 elements, so the string assignments silently write nothing. That would explain why you can’t find “non-cat” and “cat” anywhere afterwards. (2,) is the shape you want.

As to why the original code with (2) worked for Tom, but not for you, I’m not sure. One theory would be that it’s just “versionitis”. All these python packages mutate really quickly, so maybe something changed between the versions Tom is using and what you are using.

Hi, Sven.

The point of this code is that it creates an h5 file. The way to see whether it is working is to save the file and then try opening it with the equivalent of the load_dataset routine. The code for how to extract the classes was shown in the Week 2 Logistic Regression notebook. The load_dataset routine is in the utility file for that assignment.
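As a rough sketch of that check (it writes a tiny stand-in file first so the example is self-contained; the dataset names are the ones Tom’s script uses):

```python
import h5py
import numpy as np

# Write a tiny stand-in file in place of Tom's real one.
with h5py.File('tiny_catvnoncat.h5', 'w') as f:
    dt = h5py.string_dtype(encoding='ascii', length=7)
    classes = f.create_dataset('list_classes', shape=(2,), dtype=dt)
    classes[0] = "non-cat"
    classes[1] = "cat"
    f.create_dataset('data_set_y', data=np.array([[1], [0]]))

# Read it back the way load_dataset does and inspect the contents.
with h5py.File('tiny_catvnoncat.h5', 'r') as f:
    print(sorted(f.keys()))             # ['data_set_y', 'list_classes']
    print(np.array(f['list_classes']))  # [b'non-cat' b'cat']
    print(np.array(f['data_set_y']).shape)  # (2, 1)
```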

I should give the disclaimer that I have not tried any of this code and don’t really have any prior knowledge of how the python h5 package works. I did go so far as to google it a long time ago and they do have a webpage. Might be worth a look if you want to understand more about this. One question is how the actual file gets closed and saved. I don’t see anything that looks like it does that in the logic that Tom shows.

Hi Paul,

thanks for your answer. Regarding the closing of the file, I read in the documentation that if you use a with … as … statement, the file is automatically closed when that block is exited. In the docs it says:

Closing files
If you call File.close(), **or leave a with h5py.File(...) block**, the file will be closed and any objects (such as groups or datasets) you have from that file will become unusable. This is equivalent to what HDF5 calls ‘strong’ closing.

h5py File Objects

Hi, Sven.

Cool! Thanks for this additional info about how the APIs work. So it should be fine to then call your equivalent of the load_dataset routine after that logic and see if the file is as you expect it to be.

Hello Sir, thanks for the thread on how to create an h5 dataset. But the dataset created here is binary, and I need a dataset for multiclass. I tried to extend the same code for multiclass, but it is not working. Kindly help me out in this regard.

A couple of points to make here:

  1. This is beyond the scope of the course, so whether we help you is optional. That said, we can probably offer some suggestions but may not want to go as far as to actually write the code for you.
  2. If you want help, you have to give us a little more to go on than “I need help”. What have you tried? What is the symptom that it doesn’t work?

Just to talk in general terms, you need to extend the code that Tom shows above in a couple of ways:

The classes list will have more possible values, and you need to specify the string names for those and assign them numeric values that correspond to each name’s position in the classes list.

You need to figure out how you are going to determine the label for a given one of the image files. Notice that Tom did that by making the first letter of the “leaf” name of the image file be “c” for cat and then looking at the file name as they are processed. There are lots of potential ways to do that. And perhaps if you are using a dataset that you got from someplace else, it may already have a scheme for the labels.
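To make the classes-list point concrete, here is a hypothetical sketch (the class names and the first-letter labeling scheme are made up for illustration, extending Tom’s idea):

```python
import h5py
import numpy as np

# Hypothetical multiclass scheme: the label is the index of the class
# whose name starts with the same letter as the file name.
class_names = ["non-cat", "cat", "dog", "horse"]
letter_to_label = {name[0]: idx for idx, name in enumerate(class_names)}

def label_for(filename):
    # Fall back to class 0 ("non-cat") if the first letter is unknown.
    return letter_to_label.get(filename[0], 0)

with h5py.File('multiclass_demo.h5', 'w') as h5f:
    dt = h5py.string_dtype(encoding='ascii', length=7)
    classes = h5f.create_dataset('list_classes',
                                 shape=(len(class_names),), dtype=dt)
    for i, name in enumerate(class_names):
        classes[i] = name

    files = ['cat1.jpg', 'dog2.jpg', 'horse3.jpg', 'zebra.jpg']  # example names
    labels = np.array([[label_for(f)] for f in files])
    h5f.create_dataset('data_set_y', data=labels)

print(labels.ravel())  # [1 2 3 0]
```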