Some fun graphs derived from the Week 2 Programming Assignment

So after basically rewriting the Week 2 Programming Assignment with type annotations and objects, leaving barely anything of the original (Python actually becomes readable at that point, moving beyond MVP, although even then the IDE’s static checker is not sure about the code semantics :face_with_monocle:), I have added a bit of code to generate plots of the cost produced by “perturbed parameters”.

Suppose you take the neural network at training step t, and you perturb one of its weights a bit (adding a small value Δw or Δb). You can then generate the new cost for this “perturbed network” and compute the Δcost relative to the “unperturbed network” cost. This allows you to draw a curve Δcost(Δw) or Δcost(Δb).
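
For instance, a minimal sketch of that idea (helper names like delta_cost_curve are mine, not assignment code; the cost is the usual logistic-regression cross-entropy):

import numpy as np

def cost(w: np.ndarray, b: float, X: np.ndarray, Y: np.ndarray) -> float:
    A = 1.0 / (1.0 + np.exp(-(np.dot(w.T, X) + b)))  # sigmoid of the linear output
    m = X.shape[1]
    return float(-np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m)

def delta_cost_curve(w, b, X, Y, weight_index: int, deltas: np.ndarray) -> np.ndarray:
    base_cost = cost(w, b, X, Y)              # cost of the "unperturbed network"
    delta_costs = np.empty_like(deltas)
    for i, delta_w in enumerate(deltas):
        w_pert = w.copy()                     # perturb a single weight only
        w_pert[weight_index, 0] += delta_w
        delta_costs[i] = cost(w_pert, b, X, Y) - base_cost
    return delta_costs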

This can be repeated for a number of weights (drawing curves for all 64’000 parameters is a bit much, so I just sampled a few at random, and also considered the bias).

One can then get an impression of the “shape” of the cost function along a small sample of the dimensions defined by the 64’000+ parameters, and of how gradient descent moves the “current point” in that space (always at (0,0) in the graphs, as we plot (Δw, Δcost) relative to that point) towards the minimum (which happens to be global in this case).

The orange curve is the Δcost curve defined by the “perturbed bias”; the blue ones are a sample of the Δcost curves defined by a smattering of “perturbed weights”, not always the same ones.

At first, all parameters are a bit too large; adding negative ε’s will reduce the cost:

All parameters are a bit too small; adding positive ε’s will reduce the cost:

All parameters are about right:

The gradient here is certainly zero:

Nothing changes any more:

And if one plots the cost:

Finally, I have transformed the images into histograms over a 20x20x20 color cube and trained the logistic regression network on those. So we are just looking at the distribution of color values and want to predict whether there may be a cat behind it. This gives good results relative to the (ordered) set of R,G,B values as specified in the assignment, for much less computation. We find:

Cost after iteration 1900: 0.0000
train accuracy: 100.00 %
test accuracy: 74.00 %
train false positives: 0.00 %
train false negatives: 0.00 %
test false positives: 12.00 %
test false negatives: 14.00 %
Found 13 failed tests
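
(For reference, numpy can also do this kind of binning in one call; a sketch of the idea, not the exact code I used:)

import numpy as np

def color_cube_histogram(img: np.ndarray, bins_per_side: int = 20) -> np.ndarray:
    # img: (64, 64, 3) uint8; flatten into a list of RGB triples, then bin in 3D
    pixels = img.reshape(-1, 3).astype(np.float64)
    hist, _ = np.histogramdd(pixels, bins=(bins_per_side,) * 3, range=[(0, 256)] * 3)
    return hist.reshape(-1)  # 20*20*20 = 8000-dimensional feature vector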

But one must not normalize the histogram to sum to 1.0, otherwise the results are really bad :thinking: even though I feel that normalization of some kind is required. Then again, the NN model is too simple anyway.

The perturbation curves are more exciting than before, although a lot of them, including the bias one, are just flat. Not sure whether that is a bug.

I can’t say much more about this as this is a graded assignment. Except that you really want to use Python’s type annotations and object system as soon as possible to spare yourself unnecessary pain (and throw asserts at the code, too).


Your perturbation curves can give us some feeling about the cost surface.

Interesting… Wouldn’t normalization mean dividing all samples by the same value of 4096, because there are always 4096 pixels? Consequently, the final weights should end up scaled up by 4096? In other words, would a larger learning rate get us good results, because the weights need to walk on a larger scale? Or, what did the cost curve look like for the normalized version? I mean, was it still dropping as the training terminated?

The above metrics are for the 20x20x20 cube, right? What about the ordered one? I mean, in case you have them in hand.

Interesting…

I just printed the first 50 images and found that cat images there don’t have red things. I think this is a good beginning.

Actually I meant normalization of the histogram to the number of “bins” in the histogram, in this case 20 x 20 x 20.

I checked it; it’s just that the values in the feature vector become too small for anything to happen. If the normalized features are multiplied by 1000, training kicks off again.
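
A toy check of that intuition (made-up numbers, not assignment code): the weight gradient dw = X·dZᵀ/m scales linearly with the feature values, so dividing the histogram by 4096 shrinks every gradient component by exactly that factor.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8000, 10))            # 10 fake samples of 8000 histogram features
dZ = rng.random((1, 10)) - 0.5
dw_raw = X @ dZ.T / 10                # gradient with raw histogram values
dw_norm = (X / 4096) @ dZ.T / 10      # gradient after dividing features by rows*cols
print(np.max(np.abs(dw_norm)) / np.max(np.abs(dw_raw)))  # exactly 1/4096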

Code of histogram construction:

Passing an original (64,64,3) ndarray of a single image:

from math import floor
from typing import Final, List

import numpy as np

normalize: Final[bool] = True

def build_color_histogram(nonflat_features: np.ndarray) -> np.ndarray:
    bins_by_side: Final[int] = 20
    histogram = np.zeros((bins_by_side, bins_by_side, bins_by_side))
    rows = nonflat_features.shape[0]
    cols = nonflat_features.shape[1]
    assert nonflat_features.shape[2] == 3, "Three color channels"
    assert nonflat_features.shape == (rows, cols, 3)
    assert nonflat_features.dtype == np.uint8
    histogram_coords: List[int] = [0, 0, 0]  # mutable storage, why no local variables, Python, eh?!
    for row in range(rows):
        for col in range(cols):
            for channel in range(3):
                color_value = nonflat_features[row, col, channel]
                assert 0 <= color_value <= 255  # chained comparison - nice!
                # compute a coordinate along one of the axes of the R,G,B cube
                histogram_coords[channel] = floor(color_value / 255.0 * bins_by_side)
                # handle the extremal case where color_value/255.0 was exactly 1.0
                # (i.e. color 255) and the bin index is now out of range
                if histogram_coords[channel] == bins_by_side:
                    histogram_coords[channel] = bins_by_side - 1
            # END of loop, we now have the histogram coordinates
            histogram[tuple(histogram_coords)] += 1
            # print(f"histogram{histogram_coords} = {histogram[tuple(histogram_coords)]}", file=sys.stderr)
    # Normalize but also stretch, otherwise the values are too small and nothing happens!
    # Normalization gives slightly different results than lack of normalization (slightly changed
    # set of misclassified pictures); this also happens if you change the scaling factor 1000 to 100
    if normalize:
        histogram /= (rows * cols)
        # assert np.isclose(np.sum(histogram), 1.0), f"Sum of histogram is {np.sum(histogram)}"
        histogram *= 1000
    return histogram
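
Usage on a single image then looks like this (train_set_x_orig as named in the assignment; the reshape into a feature column is my own convention):

# (64, 64, 3) uint8 image -> (20, 20, 20) histogram -> (8000, 1) feature column
features = build_color_histogram(train_set_x_orig[0]).reshape(-1, 1)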

Yes, that’s right.

By the “ordered RGB values” I just mean the fully flattened image ndarray (all RGB values of all pixels aligned in a 64x64x3 = 12288 features column vector) from the assignment.
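
(As a one-line sketch, with variable names roughly as in the assignment:)

# (m, 64, 64, 3) -> (12288, m): one column of 12288 RGB values per image, scaled to [0, 1]
train_set_x = train_set_x_orig.reshape(train_set_x_orig.shape[0], -1).T / 255.0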

This is the exact data from the assignment (can I post that?):

train accuracy: 99.52 %
test accuracy: 70.00 %
train false positives: 0.48 %
train false negatives: 0.00 %
test false positives: 12.00 %
test false negatives: 18.00 %
Found 15 failed tests

:smirk:

histogram /= (rows * cols)

You normalized it by the size of the image, instead of the number of bins. However, I think the size is a very reasonable choice.


Arrr… you are right :sob:


Thanks for the data. So, modeling the histograms instead of images improved the test accuracy by 4%. :raised_hands:

This seems to be the case, but it can obviously only be a stroke of luck.

If more cats in the test set were orange (say), anything could happen.


Indeed!

Whenever I see loops, I will think about how to change them into a vectorized version. :rofl:

Or implement them in fewer lines with numpy functions.

I should distract myself now…

I didn’t find a good way to pack the loop into something else, so there it is.

On second thoughts, this is correct :sweat:. As there are “size of the image” (rows*cols) many +1 increments in total, after division by rows*cols the condition

assert np.isclose(np.sum(histogram),1.0), \
   f"Sum of histogram is {np.sum(histogram)}"

holds.

I need to take a break!

Finally, here is why the work on typing and OO since ALGOL 60 hasn’t been in vain. Apparently there is a secular cycle whereby some programming language throws “typing” overboard because the designer is impatient, but then the language slowly reacquires it, though never quite as it should be (also a kind of gradient descent, but one that gets stuck in a local minimum). Maybe a better solution for what Guido van Rossum wanted to do would have been to just create a command-line interpreter for Oberon-2 and add features as needed (philosophy: “minimalistic and strongly structured, avoiding complex features, although lambdas, closures, and functional programming constructs are not part of Prof. Wirth’s philosophy”. Sensei Wirth is wrong about that one, though.) But I digress. :joy:

The object-oriented and type-annotated neural network model skeleton for Week 2. Isn’t it beautiful? The student fills in the methods marked TBD!

from typing import Final, Tuple, Union

import numpy as np

# "FeaturesAndLabels" (a container bundling the X and Y arrays) is defined elsewhere;
# the annotations below quote it so that this skeleton stands alone.

class NNModel:
    def __init__(self, w: np.ndarray, b: np.floating):
        self.width: int = w.shape[0]  # aka nx_[0]
        assert w.shape == (self.width, 1)
        self.w: Final[np.ndarray] = w
        self.b: Final[np.floating] = b

    # Create the NNModel with only zeros
    @staticmethod
    def only_zeroes(width: int) -> "NNModel":
        zeroes = np.zeros((width, 1), dtype=np.float64)
        return NNModel(zeroes, 0.0)

    # This takes anything that np.exp(x) can take
    # https://numpy.org/doc/stable/reference/generated/numpy.exp.html#numpy.exp
    @staticmethod
    def sigmoid(z) -> Union[np.ndarray, np.floating]:
        ...  # TBD

    # Check whether X can be processed by this NN
    def is_compatible(self, X: np.ndarray) -> bool:
        ...  # TBD

    # Create a deep copy of this NN
    def deep_copy(self) -> "NNModel":
        ...  # TBD

    # Perform a gradient descent step, mutating this instance.
    # The gradient is *itself* an NNModel, being structured the same as (w,b)
    def gradient_descent_mutating(self, gradient: "NNModel", learning_rate: float) -> None:
        assert learning_rate > 0.0
        assert self.width == gradient.width
        ...  # TBD

    # Extra, for drawing nice pictures of the perturbed cost.
    # Build a new NNModel with a perturbed bias, "w" not deep-copied
    def perturb_b(self, delta_b: np.floating) -> "NNModel":
        ...  # TBD

    # Extra, for drawing nice pictures of the perturbed cost.
    # Build a new NNModel with a single perturbed weight, "w" deep-copied
    def perturb_w(self, weight_index: int, delta_w: np.floating) -> "NNModel":
        ...  # TBD

    # Predict whether the label is 0 or 1 based on current parameters. "features" is the "X".
    def predict(self, features: np.ndarray) -> np.ndarray:
        ...  # TBD

    # Compute the feed-forward output for "features". "features" is the "X".
    def feed_forward(self, features: np.ndarray) -> np.ndarray:
        assert self.is_compatible(features)
        ...  # TBD

    # Compute the feed-forward cost for the "features" & "labels" passed.
    # A has already been computed, no need to do it again
    @staticmethod
    def compute_cost(A: np.ndarray, data: "FeaturesAndLabels") -> float:
        ...  # TBD

    # Returns the current gradient as an NNModel, and the current cost as a "float".
    # The gradient is *itself* an NNModel, being structured the same as (w,b)
    def propagate(self, data: "FeaturesAndLabels") -> Tuple["NNModel", float]:
        ...  # TBD
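
Once the TBDs are filled in, a hypothetical driver loop could look like this (my sketch, not the assignment’s model() function; training_data is assumed to be a FeaturesAndLabels instance, and the learning rate is just an example):

model = NNModel.only_zeroes(width=12288)
for step in range(2000):
    # propagate() returns the gradient (itself an NNModel) and the current cost
    gradient, cost = model.propagate(training_data)
    model.gradient_descent_mutating(gradient, learning_rate=0.005)
    if step % 100 == 0:
        print(f"Cost after iteration {step}: {cost:.4f}")
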
Here is a vectorized version using numpy functions:
# x: (m, rows, cols, channels)

nbin = 20
m, rows, cols, _ = x.shape

# Convert RGB values into binned RGB values
x_binned = x // (256 / nbin)  # (m, rows, cols, channels)

# Convert binned RGB values into a single bin number
x_bin_number = (x_binned * (nbin ** np.arange(3).reshape(1, 1, 1, 3))).sum(axis=3)  # (m, rows, cols)

x_bin_number_flattened = x_bin_number.reshape(m, -1) # (m, rows * cols)

x_histograms = np.apply_along_axis(
    # bincount builds histogram, but it works on 1D array only, so we need apply_along_axis
    np.bincount,

    # apply bincount along axis=1
    axis=1,

    arr=x_bin_number_flattened,

    # bincount arguments
    minlength=nbin**3,
) # (m, nbin**3)

I was just writing as I thought, and didn’t actually run it. I have a habit of putting down the expected array shapes for myself to check later.


Agreed! :smile:

Of course it is!!!

Cheers!
Raymond

Yes, this looks more “functional”. I will try it … but I have to get away from the machine for some time.

Yes… Take a break. I need one, too!

Sure, it’s fine to post the data or any analysis of it. The only thing we aren’t supposed to show in public is the solution code.

Yes, this dataset is unrealistically small for a task this difficult, so it’s hard to draw any generalizable conclusions. In Week 4 of Course 1, you’ll see real Neural Networks applied to the same problem with better results than we see for LR in Week 2. Here’s a thread in which I show some experiments with perturbing the dataset to see what happens. My conclusion was that the dataset is carefully curated to get reasonable results.


I can confirm it works :sunglasses:. I adapted it to the rest of the code. I have to confess it took me some time to get what’s going on. A very interesting approach. :ok_hand:

I just had to map the “float ndarray” to an “int ndarray” before applying “bincount”, and then map the “int ndarray” back to a “float ndarray” before normalizing.

from typing import Final

import numpy as np

nbins_per_side: Final[int] = 20
normalize_color_histogram: Final[bool] = True
scale_normalized_color_histogram: Final[float] = 1000.0

# Very compact code by Raymond Kwok
def build_color_histograms(x: np.ndarray) -> np.ndarray:
    # x: (m, rows, cols, channels)
    m, rows, cols, colchannels = x.shape
    assert colchannels == 3
    # Convert RGB values into binned RGB values.
    # 256 / nbins_per_side = width of a bin, but slightly too large;
    # x // (256 / nbins_per_side) => all the x values are mapped to their respective
    # integer bin number, and the largest bin value is 19 for x = 255.
    # Note the cast to integer. Needed to not get an error in bincount() in apply_along_axis()
    x_binned: np.ndarray = (x // (256 / nbins_per_side)).astype(int)
    assert x_binned.shape == (m, rows, cols, colchannels)
    # Convert binned RGB values into a single bin number.
    # The below creates "[[[[0 1 2]]]]"
    t_arr: np.ndarray = np.arange(3).reshape(1, 1, 1, 3)
    # The below creates "[[[[1 20 400]]]]" if nbins_per_side = 20,
    # which are the scaling factors for each color channel to hit the right bin
    g_arr: np.ndarray = (nbins_per_side ** t_arr)
    # The individual bin number is just the sum along axis 3
    x_bin_number: np.ndarray = (x_binned * g_arr).sum(axis=3)
    assert x_bin_number.shape == (m, rows, cols)
    x_bin_number_flattened: np.ndarray = x_bin_number.reshape(m, -1)
    assert x_bin_number_flattened.shape == (m, rows * cols)
    # https://numpy.org/doc/stable/reference/generated/numpy.apply_along_axis.html
    # https://numpy.org/doc/stable/reference/generated/numpy.bincount.html#numpy-bincount
    # Note the transformation back to float, which is necessary
    # to not get an error in x_histograms /= (rows * cols)
    x_histograms: np.ndarray = np.apply_along_axis(
        # bincount builds the histogram, but it works on 1D arrays only, so we need apply_along_axis
        np.bincount,
        # apply bincount along axis=1
        axis=1,
        arr=x_bin_number_flattened,
        # bincount arguments
        minlength=nbins_per_side ** 3,
    ).astype(np.float64)
    assert x_histograms.shape == (m, nbins_per_side ** 3)
    # Normalize but also scale, otherwise the values are too small and nothing happens!
    # Normalization gives slightly different classification than lack of normalization.
    if normalize_color_histogram:
        x_histograms /= (rows * cols)
        assert np.all(np.isclose(x_histograms.sum(axis=1), 1.0)), \
            "All histograms should sum to 1.0"
        x_histograms *= scale_normalized_color_histogram
    return x_histograms

The classification results of the two approaches are somewhat different, on a random selection of images of course: the loop version bins with floor(c/255 * 20) while the vectorized one uses c // (256/20), so the bin boundaries don’t coincide exactly.
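
A quick toy check of where the two binnings disagree (standalone, not part of the model code):

import numpy as np

c = np.arange(256)
loop_bins = np.minimum(np.floor(c / 255.0 * 20).astype(int), 19)  # clamped floor binning
vec_bins = (c // (256 / 20)).astype(int)                          # integer-division binning
print(np.flatnonzero(loop_bins != vec_bins))  # color values that land in different bins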

And btw, if anyone of the mentors is interested in running the code, just ask.

It’s amazingly grown to ~800 lines.

Additionally, I noticed that the loss computation sometimes just generates infinities if an a[i] has been pushed to 1.0 or 0.0, as then either log(A) or log(1-A) yields infinity. Multiplying by 0.0 does not help, as one then gets NaN. This does not happen when training on the image data, but does happen when training on the histograms.

Luckily one does not need to do anything about that … because the optimization dZ = A - Y squelches such problematic values. They simply are not used. Only the cost is momentarily messy; gradient descent continues unperturbed.
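
If one wanted the printed cost to stay finite anyway, clipping A away from 0 and 1 would be the usual trick; a minimal sketch (my own helper, not assignment code):

import numpy as np

def safe_cost(A: np.ndarray, Y: np.ndarray, eps: float = 1e-12) -> float:
    A_clipped = np.clip(A, eps, 1.0 - eps)  # keep log() away from log(0) = -inf
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A_clipped) + (1 - Y) * np.log(1.0 - A_clipped)) / m)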

Also, a smoother plot of gradient descent for the full pixel data at learning_rate=0.01:

And a smoother plot of gradient descent for the histogram data at learning_rate=0.01: