Some notes on week 4, programming assignment 2

Good day!

So I’m finally getting around to typing out my notes on programming assignment 2 of week 4.

Note 1

In “4 - Two-layer Neural Network” “Exercise 1 - two_layer_model”

Tell the student to

DO USE

    (n_x, n_h, n_y) = layers_dims
    parameters = initialize_parameters(n_x, n_h, n_y)

AND DO NOT USE

    parameters = initialize_parameters_deep(layers_dims)

The text says to use the first, but it doesn’t say why (so, yes, I used the other one).

Both of these functions should be computationally equivalent, right? NO! Because they depend on hidden state, namely the state of the global random number generator:

def initialize_parameters(n_x: int, n_h: int, n_y: int) -> Dict[str, np.ndarray]:
    init_scale: float = 0.01
    np.random.seed(1)
    # random numbers are generated based on the global RNG

def initialize_parameters_deep(layer_dims: List[int]) -> Dict[str, np.ndarray]:
    init_scale: float = 0.01
    np.random.seed(3)
    # random numbers are generated based on the global RNG

If one uses the “wrong” function, the random number stream will not be as expected by the tests, and the tests will fail (quite mysteriously, too).
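To see the mechanism in isolation, here is a tiny sanity check (plain NumPy, nothing assignment-specific):

    import numpy as np

    np.random.seed(1)               # what initialize_parameters does
    a = np.random.randn(2, 2)       # first draws from the seed-1 stream

    np.random.seed(3)               # what initialize_parameters_deep does
    b = np.random.randn(2, 2)       # first draws from the seed-3 stream

    print(np.allclose(a, b))        # False: different weights, different costs,
                                    # and the tests' expected values no longer match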

There should be a note in the exercise text pointing out that using the alternative, nominally “computationally equivalent” function will fail the tests.

Taking a step back, and reflecting on how things should be done: if the (global) random number generator state is somehow important, one would like to be explicit about it. Make it local and pass it as a parameter:

def initialize_parameters(n_x: int, n_h: int, n_y: int,
                          rand: np.random.Generator) -> Dict[str, np.ndarray]:
    init_scale: float = 0.01
    # use the local "rand" RNG to generate random numbers as needed

def initialize_parameters_deep(layer_dims: List[int],
                               rand: np.random.Generator) -> Dict[str, np.ndarray]:
    init_scale: float = 0.01
    # use the local "rand" RNG to generate random numbers as needed

This would also make the test structure viable: there are initializations of the RNG in the testing code that step on each other's toes, which is really weird.
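A minimal sketch of how that could look, with a body of my own invention (the real helpers keep the global-seed approach, so this is illustrative only):

    import numpy as np
    from typing import Dict

    def initialize_parameters(n_x: int, n_h: int, n_y: int,
                              rand: np.random.Generator) -> Dict[str, np.ndarray]:
        init_scale: float = 0.01
        return {
            "W1": rand.standard_normal((n_h, n_x)) * init_scale,
            "b1": np.zeros((n_h, 1)),
            "W2": rand.standard_normal((n_y, n_h)) * init_scale,
            "b2": np.zeros((n_y, 1)),
        }

    # the caller (including the tests) now controls the stream explicitly
    parameters = initialize_parameters(5, 4, 1, rand=np.random.default_rng(1))

With this, a test that wants a particular stream simply constructs its own Generator instead of relying on whatever the global state happens to be.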

Addendum: here is another one…

def L_layer_model(X, Y, layers_dims, learning_rate=0.0075, num_iterations=3000, print_cost=False):
    np.random.seed(1)

Note 2

In “4.1 - Train the model”

there is a call to

plot_costs(costs, learning_rate)

Maybe I’m confused somehow, but that function wasn’t found; I had to add it myself, from week 2 I think.

def plot_costs(costs, learning_rate):
    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations (per hundreds)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

Note 3

In the two_layer_model(), the “cost” may have to be squeezed before printing.

print("Cost after iteration {}: {}".format(i, np.squeeze(cost)))
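The squeeze is only about shape: if compute_cost hands back a 1×1 array rather than a plain scalar (an assumption about how that function is written), the printout looks odd without it:

    import numpy as np

    cost = np.array([[0.6931]])   # e.g. a cost that comes back with shape (1, 1)
    print("Cost after iteration {}: {}".format(0, cost))              # ... [[0.6931]]
    print("Cost after iteration {}: {}".format(0, np.squeeze(cost)))  # ... 0.6931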

No, it’s not because of the random seeds: they initialize those before each call. They actually had to give you a different init algorithm for the “deep” case, because the simple one we built in the Step by Step exercise gives genuinely terrible convergence with the particular dataset and 4-layer architecture we use here. They used a more sophisticated algorithm that we will learn about in DLS C2. They did not call this out for at least two reasons, I’m guessing:

  1. They didn’t want to call attention to the fact that they’ve given you worked solutions to the Step by Step functions (although the “deep” init is the one that differs).
  2. There is just too much to discuss in Course 1, so they didn’t think it was worth adding the C2 course material here.

That function was given to you in the template code. It is at the end of the two_layer_model cell:

        # Print the cost every 100 iterations and for the last iteration
        if print_cost and (i % 100 == 0 or i == num_iterations - 1):
            print("Cost after iteration {}: {}".format(i, np.squeeze(cost)))
        if i % 100 == 0:
            costs.append(cost)
            
    return parameters, costs

def plot_costs(costs, learning_rate=0.0075):
    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations (per hundreds)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

You must have accidentally deleted it. If you don’t believe me, get a clean copy and compare for yourself.

They gave you explicit and clear instructions. It’s possible that the reason they didn’t explain why is what I said in my previous post: they didn’t want to call your attention to the utility functions in the imported file.

There is an interesting followup you could do here:

Notice that when you use the “deep” routine for the 2-layer case, you actually get slightly better results than with the simpler init function: 74% test accuracy instead of 72%. Of course you then fail the test case and the grader, because they are looking for a match with their expected (worse) results, which they can do because of the setting of the random seed.

It turns out that initialization matters.

Then you can see another, more compelling case for why initialization matters by “hand importing” the simple version of the “deep” init from the previous Step by Step exercise, giving it a different name of course. Then compare the results you get with the imported function they gave you versus that version, which is equivalent to the 2-layer init function. Big difference in this case. Try it and see!
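If you want to try that, a sketch of the “hand imported” version under its own name could look like this (the name is mine; the body mirrors the Step by Step exercise with its fixed 0.01 scaling):

    import numpy as np

    def initialize_parameters_deep_simple(layer_dims):
        # Step-by-Step-style init: the same fixed 0.01 scale for every layer
        np.random.seed(3)   # keep the seed so runs stay comparable
        parameters = {}
        L = len(layer_dims)
        for l in range(1, L):
            parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
            parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
        return parameters

    # Swap this in for initialize_parameters_deep inside L_layer_model and compare
    # the cost curves and test accuracy against the version the assignment imports.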

Prof Ng will have a lot more to say about initialization algorithms in DLS C2.

Hello David, I would like to continue on Paul’s excellent suggestions, but look at it from a slightly different angle, since I had read your previous code.

I checked and confirmed that the random seeds are the same for both functions; only the generated weights were “scaled” differently. I emphasize “scale” because you did the same kind of thing to save the training, though they scaled the weights and you scaled the data. This shouldn’t be overlooked. When you get to the Course 2 material on parameter initialization, we can, if you would like to, post-mortem the whole thing together (but it would be great if you take the first step :wink: ), analyzing the similarities and differences between what you did and what the lectures suggest.

Cheers,
Raymond

By the “whole thing”, I meant both normalization and weight initialization. As you have probably sensed, they are considered together.

I do believe you. I really wasn’t sure about that myself.

Okay, got it, it’s the division by sqrt(n_{l-1}) in

initialize_parameters_deep(layer_dims: List[int]) -> Dict[str, np.ndarray]

called from

L_layer_model()

in “5 - L-layer Neural Network” “Exercise 2 - L_layer_model”

that provides the better convergence: “Xavier initialization”.
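So the only substantive difference between the two “deep” initializers is the scale of the random weights. Roughly, using the assignment’s 4-layer dims (quoted from memory, so treat them as an assumption):

    import numpy as np

    layer_dims = [12288, 20, 7, 5, 1]   # assumed 4-layer model dims
    l = 2                               # look at the second weight matrix, W2

    W_simple = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01   # Step by Step
    W_xavier = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                / np.sqrt(layer_dims[l - 1]))                             # assignment helper

    print(W_simple.std(), W_xavier.std())   # roughly 0.01 vs roughly 0.22 (= 1/sqrt(20))

The deeper layers are narrow, so dividing by sqrt(n_{l-1}) gives them much larger initial weights than the flat 0.01, which lines up with the much better convergence reported above.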

Now I’m torn between doing more exploratory programming or proceeding to Part 2 of the course :joy:

As usual, a diagram: [diagram attached in the original post]

I would proceed.

Exploration is a lifelong business, and courses can fuel that.

Indeed. Also, no reason to burn out on Part 1 already.

P.S.

Thinking about feature extraction (getting better features than raw pixels or a color histogram), I have found that one can do cat detectors quite differently, with “Haar Cascades”, which use cascades of boosted, very simple classifiers but apparently nothing deeper.

https://pyimagesearch.com/2016/06/20/detecting-cats-in-images-with-opencv/
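For anyone curious, a minimal sketch along the lines of that post (the cascade file ships with OpenCV; the detectMultiScale parameters are plausible starting values, and the file names here are made up):

    import cv2

    # pre-trained frontal-cat-face Haar cascade bundled with OpenCV
    cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalcatface.xml")

    image = cv2.imread("cat.jpg")                   # hypothetical input image
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # scaleFactor and minNeighbors are the usual tuning knobs
    rects = cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=10)

    for (x, y, w, h) in rects:
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 0, 255), 2)

    cv2.imwrite("cat_detected.jpg", image)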

Nice!

I always thought the various automated detectors out there were using deep NNs, but maybe not.