# NN unable to fit Y = sign-of-tan(X)

• Week #3
• Classroom item: Programming Assignment: Planar Data Classification with One Hidden Layer
• Description: When I modified the data (X, Y) to the following, the network is unable to give a good fit and settles at 50% accuracy. This happens even when the number of hidden units is increased. Any intuition on why this is the case?
• `xsin = np.sin(2*np.pi*np.arange(num_samples)/num_samples)`
• `xcos = np.cos(2*np.pi*np.arange(num_samples)/num_samples)`
• `X = np.stack((xsin,xcos))`
• `Y = ((xsin/xcos)>0.0).astype(np.int16).reshape(1,num_samples)`
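For readers who want to reproduce the data, here is a minimal self-contained sketch of the snippet above. The value `num_samples = 400` is an assumption (the post does not state it), and `Y_xor_like` is a hypothetical helper added only to show that the labels depend on whether the two features share a sign. That makes the classes sit in opposite quadrants, an XOR-like layout that a single straight line cannot separate.

```python
import numpy as np

num_samples = 400  # assumed value; the original post does not state it

# Points evenly spaced around the unit circle.
angles = 2 * np.pi * np.arange(num_samples) / num_samples
xsin = np.sin(angles)
xcos = np.cos(angles)
X = np.stack((xsin, xcos))  # shape (2, num_samples)

# Label is 1 where tan(angle) > 0, i.e. in the first and third quadrants.
Y = ((xsin / xcos) > 0.0).astype(np.int16).reshape(1, num_samples)

# Same idea without the division: label is 1 when both features share a sign.
# This should match Y except possibly at points lying exactly on an axis.
Y_xor_like = ((xsin * xcos) > 0.0).astype(np.int16).reshape(1, num_samples)

positive_fraction = float(Y.mean())
labels_agree = bool(np.array_equal(Y, Y_xor_like))
print(X.shape, Y.shape, positive_fraction, labels_agree)
```

The two classes are roughly balanced, so a model that predicts the same class everywhere scores about 50%, which matches the accuracy reported in the question.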

Output

``````
Accuracy for 1 hidden units: 50.0 %
Accuracy for 2 hidden units: 50.0 %
Accuracy for 3 hidden units: 50.0 %
Accuracy for 4 hidden units: 50.0 %
Accuracy for 5 hidden units: 50.0 %
``````

This issue could be due to the inability of the network to break symmetry during training. I learned this concept in Course 2 (Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and…), Week 1.

I increased the weight initialization scale from np.random.randn() * 0.01 to np.random.randn() * 0.1 to address this.

Yes, we learn in Course 2 that initialization can really matter. What results do you get with the larger initial values?

Also, just as a terminology question, you're already breaking symmetry with the 0.01 times normal distribution. So the question isn't breaking symmetry vs. not breaking symmetry. It's just that the type of initialization you use matters, and there is no one universally correct answer for the choice there. It's a classic "hyperparameter".
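To make the terminology point concrete, here is a small sketch that treats the initialization scale as a hyperparameter. The function names `initialize_parameters` and `hidden_units_identical`, the layer sizes, and the seed are assumptions for illustration, not the assignment's code. It shows that an all-zero initialization leaves every hidden unit identical, while even a 0.01 scale already makes them differ.

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, scale, seed=0):
    """Random-normal weights multiplied by `scale`; biases start at zero."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((n_h, n_x)) * scale,
        "b1": np.zeros((n_h, 1)),
        "W2": rng.standard_normal((n_y, n_h)) * scale,
        "b2": np.zeros((n_y, 1)),
    }

def hidden_units_identical(W1):
    """True when every row of W1 equals the first row (fully symmetric units)."""
    return bool(np.all(W1 == W1[0]))

zero_init = initialize_parameters(2, 4, 1, scale=0.0)    # symmetric case
small_init = initialize_parameters(2, 4, 1, scale=0.01)  # scale from the thread

print(hidden_units_identical(zero_init["W1"]))   # True
print(hidden_units_identical(small_init["W1"]))  # False
```

So the 0.01 scale does break symmetry. Whether it is large enough for training to make progress on this data is a separate question.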

I haven't looked at the values in the dataset, but normalization is always something to consider.

The accuracy settles to the following once the initial weights are scaled by sqrt(2/n[l-1]):

``````
W1 = np.random.randn(n_h, n_x) * np.sqrt(2/n_x)
W2 = np.random.randn(n_y, n_h) * np.sqrt(2/n_h)

Accuracy for 1 hidden units: 74.33333333333333 %
Accuracy for 2 hidden units: 100.0 %
Accuracy for 3 hidden units: 100.0 %
Accuracy for 4 hidden units: 100.0 %
Accuracy for 5 hidden units: 100.0 %
``````
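It may help to see how large the sqrt(2/n[l-1]) factor actually is for this network. The sketch below uses a hypothetical helper `he_scale`, and `n_h = 4` is only an example hidden-layer size. With two input features the first-layer factor comes out to 1.0, roughly 100 times the original 0.01.

```python
import numpy as np

def he_scale(fan_in):
    """Multiplier sqrt(2 / fan_in) applied to standard-normal weights."""
    return float(np.sqrt(2.0 / fan_in))

n_x = 2  # two input features (sin and cos)
n_h = 4  # example hidden-layer size, chosen only for illustration

w1_scale = he_scale(n_x)  # scale for W1, whose fan-in is n_x
w2_scale = he_scale(n_h)  # scale for W2, whose fan-in is n_h
ratio_to_small_init = w1_scale / 0.01

print(w1_scale, w2_scale, ratio_to_small_init)
```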

I've been editing your posts to use the "{}" formatting tool to avoid having your outputs interpreted as "markdown". That is recommended for clarity.

Interesting. Did you try graphing the results in the same way that they showed for the standard data?

Here's another thread about experimenting with this exercise (using the ReLU activation in that case), which shows the kind of graphing I mean. A picture is (sometimes) worth the proverbial thousand words in terms of seeing what is happening.

[Image: plot with weights initialized as random * 0.01]

[Image: plot with weights initialized as random * sqrt(2/n[l-1])]


Interesting. But I confess that I'm a bit surprised that it made that much difference. It's not a limitation of the network itself, since that is the same in both cases (right?). Did you try running the first initialization for a lot more iterations? Of course, if the He Initialization works better, then clearly that's the way to go.

The network itself is not the limitation, since it is the same in both cases.
More iterations do not help if the weights are initialized to randn() * 0.01, as you can see here. The cost stays at 0.693147, which equals -ln(0.5), i.e. the network predicts a probability of 0.5 for every sample. It appears to be stuck at a local minimum.
Another option is to increase the learning rate so that training skips this local minimum, though that is not a clean fix, since it may oscillate and may not converge in some cases. The best way is to initialize with the He method, which worked fine and was stable.
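As a quick check on the 0.693147 figure, the following sketch evaluates the binary cross-entropy cost when every prediction is exactly 0.5. The helper name `cross_entropy_cost` and the tiny label vector are made up for illustration. The result does not depend on the labels.

```python
import numpy as np

def cross_entropy_cost(A2, Y):
    """Mean binary cross-entropy over the m columns of Y."""
    m = Y.shape[1]
    return float(-np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m)

Y = np.array([[0, 1, 1, 0, 1, 0]])  # arbitrary illustrative labels
A2 = np.full(Y.shape, 0.5)          # network outputs 0.5 for every sample

cost = cross_entropy_cost(A2, Y)
print(round(cost, 6))  # 0.693147, i.e. -ln(0.5) = ln(2)
```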

With weights initialized to randn() * 0.01
Cost after iteration 0: 0.693147
Cost after iteration 1000: 0.693147
Cost after iteration 2000: 0.693147
Cost after iteration 3000: 0.693147
Cost after iteration 4000: 0.693147
Cost after iteration 5000: 0.693147
Cost after iteration 6000: 0.693147
Cost after iteration 7000: 0.693147
Cost after iteration 8000: 0.693147
Cost after iteration 9000: 0.693147

The data set is already normalized, since X is generated from sin and cos. The issue was that, with the weights initialized to randn() * 0.01, they converged to small values with the cost stuck at -ln(0.5). This was resolved by initializing the weights to randn() * sqrt(2/n[l-1]), i.e. He initialization from Course 2, Week 1. With this, the same network is able to converge.
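One possible way to probe why the small initialization stalls is to compare the size of the very first gradient under the two schemes. The sketch below is not the assignment's code: `make_data`, `initial_gradient_norm`, `num_samples = 400`, `n_h = 4`, and the seed are all assumptions. It runs one forward and backward pass of a tanh hidden layer with a sigmoid output and reports the overall gradient norm.

```python
import numpy as np

def make_data(num_samples=400):
    # num_samples=400 is an assumed value, not taken from the thread.
    angles = 2 * np.pi * np.arange(num_samples) / num_samples
    xsin, xcos = np.sin(angles), np.cos(angles)
    X = np.stack((xsin, xcos))
    Y = ((xsin / xcos) > 0.0).astype(float).reshape(1, num_samples)
    return X, Y

def initial_gradient_norm(X, Y, scale_w1, scale_w2, n_h=4, seed=0):
    """Norm of all gradients at initialization (tanh hidden, sigmoid output)."""
    rng = np.random.default_rng(seed)
    n_x, m = X.shape
    W1 = rng.standard_normal((n_h, n_x)) * scale_w1
    W2 = rng.standard_normal((1, n_h)) * scale_w2
    # Forward pass; both biases start at zero, so they are left out here.
    A1 = np.tanh(W1 @ X)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1)))
    # Backward pass for the cross-entropy cost.
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)
    dW1 = dZ1 @ X.T / m
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    grads = (dW1, db1, dW2, db2)
    return float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))

X, Y = make_data()
n_x, n_h = X.shape[0], 4
small_norm = initial_gradient_norm(X, Y, 0.01, 0.01, n_h=n_h)
he_norm = initial_gradient_norm(X, Y, np.sqrt(2 / n_x), np.sqrt(2 / n_h), n_h=n_h)
print(small_norm, he_norm)
```

Under these assumptions the gradient at the 0.01 initialization is far smaller than at the He initialization, which is consistent with the cost barely moving from 0.693147. Whether that starting point is a true local minimum or just a very flat region is not something this sketch settles.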