Yup! Notes on the “Regularization” exercise!
Loading the dataset
The function load_2D_dataset() plots the dataset, but why is the aspect ratio not equal to 1? I suggest adding a plt.axis('equal') to load_2D_dataset().
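A standalone sketch (toy data, not the actual reg_utils.py code) of what the suggested call changes on a scatter plot:
import numpy as np
import matplotlib.pyplot as plt

# Toy point cloud, deliberately wider than tall, to make the aspect-ratio effect visible.
rng = np.random.default_rng(0)
pts = rng.normal(size=(2, 200)) * np.array([[3.0], [1.0]])

plt.scatter(pts[0, :], pts[1, :], s=10)
plt.axis('equal')  # the suggested addition: one data unit spans the same length on both axes
plt.show()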
Non-Regularized Model
The naming is off. We are presented with two modes:
- regularization mode
- dropout mode
But both modes are properly about regularization. The correct naming should be:
- L2 mode (or maybe “weight decay” mode, “Frobenius mode”, or “2-norm mode”; doesn't “L2” refer to the two-norm of vectors rather than matrices?)
- dropout mode
This applies to the function naming as well:
compute_cost_with_regularization() → compute_cost_with_frobenius_norm()
backward_propagation_with_regularization() → backward_propagation_with_frobenius_norm()
The model() code can be improved:
- Move the plotting out of the model() function
- Make mode testing more explicit
- Extend data collection: also collect the Frobenius norms of the weight and gradient matrices and plot those. This gives interesting results, as we will see.
Here we go:
import numpy as np
import matplotlib.pyplot as plt
# The helpers initialize_parameters, forward_propagation, compute_cost, backward_propagation,
# update_parameters and their _with_regularization / _with_dropout variants come from the
# notebook and its reg_utils.py.

def plot_grad_norms(grad_norms):
    dW1_norm, dW2_norm, dW3_norm = zip(*grad_norms)
    indices = range(len(grad_norms))
    plt.plot(indices, dW1_norm, label='||dW1||')
    plt.plot(indices, dW2_norm, label='||dW2||')
    plt.plot(indices, dW3_norm, label='||dW3||')
    plt.xlabel("kilo iterations")
    plt.title("Frobenius norms of gradient matrices")
    plt.legend()
    plt.grid()
    plt.show()

def plot_weight_norms(weight_norms):
    w1_norm, w2_norm, w3_norm = zip(*weight_norms)
    indices = range(len(weight_norms))
    plt.plot(indices, w1_norm, label='||W1||')
    plt.plot(indices, w2_norm, label='||W2||')
    plt.plot(indices, w3_norm, label='||W3||')
    plt.xlabel("kilo iterations")
    plt.title("Frobenius norms of weight matrices")
    plt.legend()
    plt.grid()
    plt.show()

def plot_costs(costs, learning_rate, lambd, keep_prob):
    indices = range(len(costs))
    plt.plot(indices, costs, label='cost')
    plt.ylim(min(costs) * 0.9, max(costs) * 1.1)
    plt.ylabel("cost")
    plt.xlabel("kilo iterations")
    if use_dropout(keep_prob) and use_l2(lambd):
        text = f", (keep_prob = {keep_prob}, lambda = {lambd})"
    elif use_dropout(keep_prob):
        text = f", (keep_prob = {keep_prob})"
    elif use_l2(lambd):
        text = f", (lambda = {lambd})"
    else:
        text = ""
    plt.title(f"Cost. Learning rate = {learning_rate}{text}")
    plt.legend()
    plt.grid()
    plt.show()

def collect_grad_norms(print_cost, i, grads, grad_norms):
    if print_cost and i % 1000 == 0:
        dw1_norm = np.linalg.norm(grads["dW1"], ord='fro')
        dw2_norm = np.linalg.norm(grads["dW2"], ord='fro')
        dw3_norm = np.linalg.norm(grads["dW3"], ord='fro')
        grad_norms.append([dw1_norm, dw2_norm, dw3_norm])

def collect_weight_norms(print_cost, i, parameters, weight_norms):
    if print_cost and i % 1000 == 0:
        w1_norm = np.linalg.norm(parameters["W1"], ord='fro')
        w2_norm = np.linalg.norm(parameters["W2"], ord='fro')
        w3_norm = np.linalg.norm(parameters["W3"], ord='fro')
        weight_norms.append([w1_norm, w2_norm, w3_norm])

def collect_cost(print_cost, i, cost, costs):
    # Print the loss every 10000 iterations
    if print_cost and i % 10000 == 0:
        print(f"Cost after iteration {i}: {cost}")
    # Collect the loss every 1000 iterations
    if print_cost and i % 1000 == 0:
        costs.append(cost)

def use_dropout(keep_prob):
    return not np.isclose(keep_prob, 1.0)

def use_l2(lambd):
    return not np.isclose(lambd, 0.0)

def model(X, Y, learning_rate=0.3, num_iterations=30000, print_cost=True, lambd=0, keep_prob=1):
    """
    Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.

    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (output size, number of examples)
    learning_rate -- learning rate of the optimization
    num_iterations -- number of iterations of the optimization loop
    print_cost -- if True, print the cost every 10000 iterations
    lambd -- regularization hyperparameter, scalar
    keep_prob -- probability of keeping a neuron active during drop-out, scalar

    Returns:
    parameters -- parameters learned by the model. They can then be used to predict.
    """
    assert 0.0 <= keep_prob <= 1.0
    assert 0.0 <= lambd
    grads = {}
    costs = []         # to keep track of the cost
    weight_norms = []  # to keep track of the weight norms
    grad_norms = []    # to keep track of the grad norms
    m = X.shape[1]     # number of examples
    layers_dims = [X.shape[0], 20, 3, 1]
    parameters = initialize_parameters(layers_dims)
    for i in range(0, num_iterations):
        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        # Depends on whether we use dropout regularization
        if use_dropout(keep_prob):
            a3, cache = forward_propagation_with_dropout(X, parameters, keep_prob)
        else:
            a3, cache = forward_propagation(X, parameters)
        # Cost, depends on whether we use L2 regularization
        if use_l2(lambd):
            cost = compute_cost_with_regularization(a3, Y, parameters, lambd)
        else:
            cost = compute_cost(a3, Y)
        # Back-propagation, depends on both L2 and dropout regularization
        if use_dropout(keep_prob) and use_l2(lambd):
            # it is possible to use both L2 regularization and dropout,
            # but this assignment will only explore one at a time
            raise RuntimeError("Both 'dropout' and 'L2 regularization' are being used")
        elif use_dropout(keep_prob):
            grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)
        elif use_l2(lambd):
            grads = backward_propagation_with_regularization(X, Y, cache, lambd)  # badly named!
        else:
            grads = backward_propagation(X, Y, cache)
        parameters = update_parameters(parameters, grads, learning_rate)
        collect_cost(print_cost, i, cost, costs)
        collect_weight_norms(print_cost, i, parameters, weight_norms)
        collect_grad_norms(print_cost, i, grads, grad_norms)
    if print_cost:
        plot_weight_norms(weight_norms)
        plot_grad_norms(grad_norms)
        plot_costs(costs, learning_rate, lambd, keep_prob)
    return parameters
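For reference, a minimal sketch of how this refactored model() could be driven; the data split and the hyperparameter values (lambd = 0.7, keep_prob = 0.86) are assumed to be the ones used in the notebook:
train_X, train_Y, test_X, test_Y = load_2D_dataset()

parameters = model(train_X, train_Y)                     # baseline, no regularization
# parameters = model(train_X, train_Y, lambd=0.7)        # L2 ("Frobenius") mode
# parameters = model(train_X, train_Y, keep_prob=0.86)   # dropout mode
predictions = predict(test_X, test_Y, parameters)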
Running the model with all regularization methods “off” (as done in the notebook, this being the “baseline model”) now generates two additional plots, of the weight and gradient norms, plus a somewhat nicer plot of the cost:
Exercise 1 - compute_cost_with_regularization
As noted above, this should really be compute_cost_with_frobenius_norm.
The text says to use np.sum(np.square(Wl)) to compute the square of the Frobenius norm. But what about this one, which is “higher level” as it uses norm and may be more efficient:
np.linalg.norm(Wl, ord='fro')**2
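For concreteness, a sketch of what the renamed function might look like; the λ/(2m) scaling of the penalty term is assumed to be the one used in the exercise, and compute_cost() is the notebook's cross-entropy helper:
def compute_cost_with_frobenius_norm(a3, Y, parameters, lambd):
    cross_entropy_cost = compute_cost(a3, Y)   # notebook helper
    m = Y.shape[1]
    W1, W2, W3 = parameters["W1"], parameters["W2"], parameters["W3"]
    # either np.sum(np.square(W)) or np.linalg.norm(W, ord='fro')**2 works here
    l2_cost = (lambd / (2 * m)) * (np.sum(np.square(W1))
                                   + np.sum(np.square(W2))
                                   + np.sum(np.square(W3)))
    return cross_entropy_cost + l2_cost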
Also remind the reader that X = X * foo creates a modified copy of X and rebinds the name, but X *= foo modifies X in place and is more efficient.
Btw, there are a lot of divisions by m; why not define a single inv_m = np.float64(1.0 / m) and multiply by inv_m wherever needed?
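A small illustration of the inv_m idea; the shapes and the gradient formula below are hypothetical stand-ins for the ones in the notebook's backprop functions:
import numpy as np

m = 211                          # number of training examples (example value)
inv_m = np.float64(1.0 / m)      # compute the reciprocal once

dZ3 = np.random.randn(1, m)      # made-up stand-ins for the real quantities
A2 = np.random.randn(3, m)
W3 = np.random.randn(1, 3)
lambd = 0.7

dW3 = inv_m * np.dot(dZ3, A2.T) + (lambd * inv_m) * W3   # instead of dividing by m twice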
Adding the Frobenius norms to the cost actually gives interesting weight/gradient plots, open to discussion:
After that we read:
Thus, by penalizing the square values of the weights in the cost function you drive all the weights to smaller values. It becomes too costly for the cost to have large weights! This leads to a smoother model in which the output changes more slowly as the input changes.
This can maybe be changed to
Thus, by penalizing high values of the squared norms of the weight matrices through the cost function you drive all the weights to smaller values. It becomes too costly to have large weights! This leads to a model in which the boundary between classes is smoother than it would be with large weights.
Dropout
We read:
Also note that without using .astype(int), the result is an array of booleans True and False, which Python automatically converts to 1 and 0 if we multiply it with numbers.
I salute this attention to proper type conversion, and one may note that the autoconversion from boolean to numeric may not yield a data type of the expected accuracy (e.g. a 16-bit float…).
But why not stay in the “numpy 64-bit float” domain:
X = (X < keep_prob).astype(np.float64)
OTOH, to “shut down” neurons, it is proposed to use elementwise multiplication. However, this works too:
D = (D > keep_prob)                        # a matrix which has "True" at the places to zap
A1[D] = 0.0                                # zap the selected places
D = np.logical_not(D).astype(np.float64)   # this is what the tester expects
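A self-contained sketch comparing the two masking styles; the shapes and keep_prob are just example values, and the division by keep_prob is the inverted-dropout rescaling done in the exercise:
import numpy as np

rng = np.random.default_rng(3)
keep_prob = 0.86
A1 = rng.standard_normal((20, 5))
A1_alt = A1.copy()

# Notebook-style mask: a float 0/1 matrix applied by elementwise multiplication.
D1 = (rng.random(A1.shape) < keep_prob).astype(np.float64)
A1 = A1 * D1
A1 /= keep_prob

# Alternative from above: boolean-index the positions to zap, then build
# the float keep-mask that the tester expects.
D = rng.random(A1_alt.shape)
zap = (D > keep_prob)                          # True at the places to zap
A1_alt[zap] = 0.0                              # zero them in place
D = np.logical_not(zap).astype(np.float64)     # keep-mask as 64-bit floats
A1_alt /= keep_prob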
In any case, after implementing everything, we get the following plots for the weight and gradient matrices when “dropout” is applied:
And also
But is there a bug in the dropout processing?
Going back to the previously implemented forward_propagation_with_dropout(), we notice that the random seed is set inside that function:
np.random.seed(1)
This actually means that rather than doing proper “dropout”, the neurons that get dropped are always the same: we are training a “crippled network” rather than doing “dropout regularization”. And apparently with good effect.
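A tiny demonstration of the effect, using a hypothetical mask() helper that mimics the seeding inside forward_propagation_with_dropout():
import numpy as np

def mask(shape, keep_prob):
    np.random.seed(1)                      # reseeding inside the function...
    return np.random.rand(*shape) < keep_prob

# ...means every "random" mask is identical, so the same neurons
# are dropped on every single iteration:
print(np.array_equal(mask((3, 4), 0.8), mask((3, 4), 0.8)))   # prints True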
If we comment out this RNG initialization and re-run learning with dropout, the situation changes:
And also:
This is not as beautiful as earlier, and the accuracy has now decreased to 94% from the previous 95%.
And also
In predict(X, y, parameters) in file reg_utils.py, the transformation of a3 to {0,1}^n is performed as follows:
for i in range(0, a3.shape[1]):
    if a3[0,i] > 0.5:
        p[0,i] = 1
    else:
        p[0,i] = 0
rather than in vectorized form using np.rint:
p = np.rint(a3)
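A quick check with made-up sigmoid outputs showing that the loop and the vectorized version agree:
import numpy as np

a3 = np.array([[0.2, 0.51, 0.5, 0.9]])    # made-up sigmoid outputs
p = np.zeros_like(a3)
for i in range(0, a3.shape[1]):
    p[0, i] = 1 if a3[0, i] > 0.5 else 0

print(np.array_equal(p, np.rint(a3)))      # prints True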