get_mean_std_per_batch function

I know this may be an unasked-for code review, but I was looking through the util.py file and noticed that this function is not well written.

def get_mean_std_per_batch(image_path, df, H=320, W=320):
    sample_data = []
    for idx, img in enumerate(df.sample(100)["Image"].values):
        # path = image_dir + img
        sample_data.append(
            np.array(image.load_img(image_path, target_size=(H, W))))

    mean = np.mean(sample_data[0])
    std = np.std(sample_data[0])
    return mean, std

First, in the for loop, neither the idx nor the img variable is referenced in the body of the loop. They are superfluous.

Then, once the list of randomly sampled images has been built, only the first image is used to compute the mean and standard deviation. So why have the loop at all? Why not just select a single random image (which is effectively what the code does)? Probably because we want a larger sample!

I suggest instead:

def get_mean_std_per_batch(image_path, df, H=320, W=320):
    sample_data = []
    for _ in df.sample(100)["Image"].values:
        sample_data.append(
            np.array(image.load_img(image_path, target_size=(H, W))))

    mean = np.mean(sample_data, axis=(0, 1, 2, 3))
    std = np.std(sample_data, axis=(0, 1, 2, 3), ddof=1)
    return mean, std

In the np.std call, notice also the ddof=1 argument, which tells it to divide by N - 1 instead of N, giving the sample (as opposed to population) standard deviation.
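
To make the ddof difference concrete, here is a minimal sketch on a tiny made-up array (the values are illustrative only and are not taken from the image data):

```python
import numpy as np

# Small made-up sample, purely illustrative.
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population standard deviation: divides by N (NumPy's default, ddof=0).
population_std = np.std(data)

# Sample standard deviation: divides by N - 1 (ddof=1).
sample_std = np.std(data, ddof=1)

print(population_std)  # 2.0
print(sample_std)      # about 2.138
```

With 100 images of 320 x 320 pixels, N is so large that the two values will be nearly identical. The choice matters mostly for small samples.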

Hello,
Thanks for your feedback on the code. We will look into it and try to make the changes. We appreciate your enthusiasm in looking into the utils and trying to correct them.
Thank You!

Though I made a mistake (or two) as well. It should be more like:

def get_mean_std_per_batch(image_dir, df, H=320, W=320):
    sample_data = []
    for img in df.sample(100)["Image"].values:
        image_path = os.path.join(image_dir, img)
        sample_data.append(
            np.array(image.load_img(image_path, target_size=(H, W))))

    mean = np.mean(sample_data, axis=(0, 1, 2, 3))
    std = np.std(sample_data, axis=(0, 1, 2, 3), ddof=1)
    return mean, std

It is also suggested to use the Python os library function os.path.join() to build a file path instead of concatenating strings, so as to be more OS agnostic. (Remember of course to import os.)
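
As a quick illustration of why os.path.join is safer than concatenation, here is a small sketch. The directory and file names are hypothetical, chosen only for the example:

```python
import os

# Hypothetical directory and file name, for illustration only.
image_dir = "images"
img = "00001.png"

# Plain concatenation silently drops the separator when image_dir
# has no trailing slash.
concatenated = image_dir + img

# os.path.join inserts the platform's separator between the parts.
joined = os.path.join(image_dir, img)

print(concatenated)  # images00001.png
print(joined)        # images/00001.png on POSIX, images\00001.png on Windows
```

Concatenation only works if the caller remembers to end image_dir with the right separator, while os.path.join handles both cases.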

And for this to work, there also needs to be a change in load_image:

def load_image(img, image_dir, df, preprocess=True, H=320, W=320):
    """Load and preprocess image."""
    mean, std = get_mean_std_per_batch(image_dir, df, H=H, W=W)
    img_path = os.path.join(image_dir, img)
    x = image.load_img(img_path, target_size=(H, W))
    if preprocess:
        x -= mean
        x /= std
        x = np.expand_dims(x, axis=0)
    return x

The change passes the image directory, rather than the full path to one file, to the get_mean_std_per_batch function.

As the code was, it basically loaded one image 100 times and then computed statistics on only one copy of it. That is not what was intended, I am sure.
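
The effect can be sketched with small synthetic arrays standing in for the loaded images (random 4 x 4 RGB arrays, an assumption made only so the example runs without any image files):

```python
import numpy as np

rng = np.random.default_rng(0)

# Ten synthetic 4x4 RGB "images" standing in for the loaded files.
distinct_images = [rng.integers(0, 256, size=(4, 4, 3)) for _ in range(10)]

# What the original loop effectively built: the same image repeated.
repeated_images = [distinct_images[0]] * 10

# Original statistic: only the first element is ever used.
first_only_mean = np.mean(repeated_images[0])

# Pooling identical copies adds no information, so the mean is unchanged.
repeated_mean = np.mean(repeated_images)

# Pooling distinct images gives statistics for the whole sample.
stack = np.stack(distinct_images)  # shape (10, 4, 4, 3)
pooled_mean = np.mean(stack, axis=(0, 1, 2, 3))
pooled_std = np.std(stack, axis=(0, 1, 2, 3), ddof=1)

print(first_only_mean, repeated_mean)
print(pooled_mean, pooled_std)
```

For a 4-D stack like this, axis=(0, 1, 2, 3) reduces over every axis, so it gives the same scalar as calling np.mean or np.std with no axis argument.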

Thank you, @karencfisher, for pointing it out and for the suggestion.