Hey @saifkhanengr,
It took me some time to test something out before I could suggest what to do next in the last part of this reply.
Change to your code
I removed 1/m from dA[2].
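For anyone else reading, here is a minimal sketch of the convention I am assuming (DLS-style, with the cache unpacked into explicit arguments for readability): the 1/m averaging over the m examples lives only in dW and db inside linear_backward, which is why the dA terms passed backwards should not carry another 1/m.

```python
import numpy as np

# Sketch of the assumed linear_backward convention:
# the 1/m averaging appears only in dW and db, never in dA_prev.
def linear_backward(dZ, A_prev, W):
    m = A_prev.shape[1]                               # number of examples
    dW = (1 / m) * dZ @ A_prev.T                      # averaged here
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)  # and here
    dA_prev = W.T @ dZ                                # no 1/m here
    return dA_prev, dW, db
```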
Verifying your work
I am lazy, so the way I verified your work was by comparing it with TensorFlow:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(7, activation='relu', input_shape=(1,)),
    tf.keras.layers.Dense(1, activation='relu'),
])

# initialize weights the way you do
# (initialize_parameters_deep, layers_dims, X, Y come from your notebook)
np.random.seed(1)
parameters = initialize_parameters_deep(layers_dims)
# Keras stores a Dense kernel as (n_in, n_out) and the bias as a 1-D vector,
# hence the transposes below
w = [parameters['W1'].T, parameters['b1'].T[0], parameters['W2'].T, parameters['b2'].T[0]]
model.set_weights(w)

# SGD with full-batch updates (batch_size = m, no shuffle) matches our vanilla gradient descent
model.compile(optimizer=tf.keras.optimizers.experimental.SGD(learning_rate=0.0075), loss='mse')

# fit: batch_size equals the number of examples, so one update per epoch
h = model.fit(X.T, Y[0, :], epochs=5000, batch_size=X.shape[1], verbose=0, shuffle=False)
print(h.history['loss'][-1])

plt.scatter(X[0, :], Y[0, :])
plt.scatter(X[0, :], model(X.T))
Results:
So your result is pretty much the same as the TF result, and your code looks OK! I wouldn't expect them to be exactly the same.
Conclusion
learning_rate = 0.0075 is too large
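If you want a quick sanity check of that conclusion, just rerun your training with a smaller rate and compare the cost curves. The call below is a sketch under my assumption that your L_layer_model keeps the assignment's signature and returns both the parameters and the recorded costs; adjust it if yours differs.

```python
# Hypothetical re-run with a smaller learning rate, for comparison only.
parameters_small, costs_small = L_layer_model(
    X, Y, layers_dims, learning_rate=0.001, num_iterations=5000, print_cost=True
)
```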
Some further study
- Set learning_rate = 0.0075
- Modify your L_layer_model to print some grads:
def L_layer_model(...):
    ...
    for i in range(0, num_iterations):
        ...
        if print_cost and i % 1000 == 0 or i == num_iterations - 1:
            # print("Cost after iteration {}: {}".format(i, np.squeeze(cost)))
            print(grads['dW1'].T)
- Results:
[[ 0. 3.61 -2.82 16.52 0. -3.64 0. ]]
[[ 0. -131.81 0. 0. 0. 0. 0. ]]
[[0. 0.07 0. 0. 0. 0. 0. ]]
[[0. 0. 0. 0. 0. 0. 0.]]
[[0. 0. 0. 0. 0. 0. 0.]]
[[0.00e+00 5.23e-06 0.00e+00 0.00e+00 0.00e+00 0.00e+00 0.00e+00]]
- Observation: dW1 becomes almost all zeros after roughly 2000 iterations, so the network is barely learning anymore.
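If you want to see where those zeros come from, here is a quick check I would try. It is a sketch under my assumptions: `parameters` means the trained parameters returned by your L_layer_model, and `X` is the training input from your notebook. It counts, for each hidden unit, the fraction of training examples whose pre-activation is <= 0 — exactly the examples where ReLU outputs 0 and passes back a 0 gradient.

```python
# Per-unit fraction of examples with Z1 <= 0; a fraction of 1.0 means that
# hidden unit outputs 0 (and receives 0 gradient) for every single example.
Z1 = parameters['W1'] @ X + parameters['b1']
print((Z1 <= 0).mean(axis=1))
```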
Why does it stop learning?
- ReLU.
- The gradient descent algorithm and our learning rate (0.0075) lead us there.
Since I don't have code bugs to list out, please try to brainstorm some explanations around the above 2 points, based on your observations, experiments, and knowledge.
However, for (1), please try replacing relu with leaky_relu while keeping learning_rate = 0.0075. For (2), first I want to say, @saifkhanengr, welcome to the world of deep neural networks, where the cost surface is no longer simple and convex; second, try modifying my TensorFlow code to use Adam as the optimizer instead, again keeping learning_rate = 0.0075 (see the sketches below). You would have come across Adam in DLS Course 2 Week 2.
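To make these two experiments concrete, here are minimal sketches. The names leaky_relu / leaky_relu_backward and the 0.01 slope are my own choices, not something from your code, so rename them to match your relu / relu_backward helpers; the Adam line assumes the `model` and `tf` from my TensorFlow snippet above.

```python
import numpy as np

# (1) A leaky ReLU pair you could swap in for relu / relu_backward.
def leaky_relu(Z, alpha=0.01):
    A = np.where(Z > 0, Z, alpha * Z)      # small slope instead of a hard zero
    cache = Z
    return A, cache

def leaky_relu_backward(dA, cache, alpha=0.01):
    Z = cache
    dZ = dA * np.where(Z > 0, 1.0, alpha)  # slope alpha on the negative side
    return dZ

# (2) Same TensorFlow model as before, but compiled with Adam,
# keeping learning_rate = 0.0075.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0075), loss='mse')
```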
Lastly, try experimenting more, and don't limit yourself to my suggestions. Also, I only printed grads['dW1'], but you should try the others to make sure I am not fooling you by picking the only problematic one. Trust your observations, not mine.
Cheers,
Raymond
PS: Personally, I think if coding an L-layer network is worth 100 experience points, then answering for yourself why learning_rate = 0.0075 is problematic is worth 5000 experience points.
PS2: It’s sleeping hours in my timezone, so I will come back tomorrow.