Derivation of dL/dz

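Before the code, here is the one-line derivation behind the `y_pred - y` trick, written out for the standard setup of a sigmoid output unit with the binary log loss (cross-entropy):

    L(y, \hat{y}) = -\left[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\right],
    \qquad \hat{y} = \sigma(z), \qquad \sigma'(z) = \sigma(z)\,(1 - \sigma(z))

    \frac{\partial L}{\partial \hat{y}}
    = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}
    = \frac{\hat{y} - y}{\hat{y}\,(1 - \hat{y})}

    \frac{\partial L}{\partial z}
    = \frac{\partial L}{\partial \hat{y}} \cdot \sigma'(z)
    = \frac{\hat{y} - y}{\hat{y}\,(1 - \hat{y})} \cdot \hat{y}\,(1 - \hat{y})
    = \hat{y} - y

The sigmoid's derivative exactly cancels the denominator of the loss gradient, which is why the backward pass can start from the plain residual `y_pred - y`.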
I'll leave a short Python snippet here. It is just the backward pass for a network with a single hidden layer.

    # Assumes `numpy` imported as np, a `sigmoid` helper, inputs `x_mat`,
    # labels `y`, and weights `W_1`, `W_2` are already defined.
    # First, compute the new predictions `y_pred`
    z_2 =, W_1)
    a_2 = sigmoid(z_2)
    z_3 =, W_2)
    y_pred = sigmoid(z_3).reshape((len(x_mat),))
    # Now compute the gradient
    J_z_3_grad = y_pred - y  # Trick: this is the gradient of the log loss w.r.t. z_3
    J_W_2_grad =, a_2)
    a_2_z_2_grad = sigmoid(z_2) * (1 - sigmoid(z_2))  # elementwise sigmoid derivative
    J_W_1_grad =,
                ,1), W_2.reshape(-1,1).T) * a_2_z_2_grad)
    gradient = (J_W_1_grad, J_W_2_grad)