NN unable to fit Y = sign-of-tan(X)

  • Week #3
  • Classroom item: Programming Assignment: Planar Data Classification with One Hidden Layer
  • Description: When I modified the data (X, Y) to the following, the network is unable to find a good fit and settles at 50% accuracy. This happens even when the number of hidden units is increased. Any intuition on why this is the case? (A runnable version of the snippet, with imports, is included below.)
xsin = np.sin(2*np.pi*np.arange(num_samples)/num_samples)
xcos = np.cos(2*np.pi*np.arange(num_samples)/num_samples)
X = np.stack((xsin,xcos))
Y = ((xsin/xcos)>0.0).astype(np.int16).reshape(1,num_samples)
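
For completeness, here is the same snippet in a self-contained form with imports; num_samples = 400 is just an example value, since the exact size is not important:

import numpy as np

num_samples = 400   # example value; any reasonably large number shows the same behaviour
xsin = np.sin(2 * np.pi * np.arange(num_samples) / num_samples)
xcos = np.cos(2 * np.pi * np.arange(num_samples) / num_samples)
X = np.stack((xsin, xcos))                                           # shape (2, num_samples)
Y = ((xsin / xcos) > 0.0).astype(np.int16).reshape(1, num_samples)   # shape (1, num_samples)
print(X.shape, Y.shape)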

Output

Accuracy for 1 hidden units: 50.0 %
Accuracy for 2 hidden units: 50.0 %
Accuracy for 3 hidden units: 50.0 %
Accuracy for 4 hidden units: 50.0 %
Accuracy for 5 hidden units: 50.0 %

This issue could be due to the network's inability to break symmetry during training. I learned this concept in Course 2 (Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and…), Week 1.

I increased the weight initialization scale from np.random.randn() * 0.01 to np.random.randn() * 0.1 to address this.
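
Concretely, the change is in the weight initialization helper; a rough sketch (the assignment's initialize_parameters may differ in detail):

import numpy as np

def initialize_parameters(n_x, n_h, n_y):
    # Only the scale factor is the point here: 0.1 instead of the original 0.01.
    W1 = np.random.randn(n_h, n_x) * 0.1
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.1
    b2 = np.zeros((n_y, 1))
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}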

Yes, we learn in Course 2 that initialization can really matter. What results do you get with the larger initial values?

Also, just as a point of terminology: you’re already breaking symmetry with the 0.01 times normal distribution. So the question isn’t breaking symmetry vs. not breaking symmetry; it’s just that the type of initialization you use matters, and there is no one universally correct answer for that choice. It’s a classic “hyperparameter”.
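
A quick way to see that (just an illustrative snippet, not from the notebook): the rows of a 0.01 * randn matrix are already all different, which is all that “breaking symmetry” requires; they are just very small.

import numpy as np

np.random.seed(0)
W1 = np.random.randn(4, 2) * 0.01   # the notebook's default recipe

print(W1)                 # the rows (hidden units) are all different, so symmetry is already broken
print(np.abs(W1).max())   # ...but every entry is only on the order of 0.01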

I haven’t looked at the values in the dataset, but normalization is always something to consider.
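
For example, a check along these lines (illustrative only) would show whether the features are roughly zero-mean and of similar scale; with a full period of sin and cos they should be:

import numpy as np

num_samples = 400   # assumed, as above
t = 2 * np.pi * np.arange(num_samples) / num_samples
X = np.stack((np.sin(t), np.cos(t)))

print(X.mean(axis=1))   # both features ~0 over a full period
print(X.std(axis=1))    # both ~sqrt(1/2) ~= 0.707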

It settles to the following accuracy once the weights are scaled by sqrt(2/n[l-1]):

W1 = np.random.randn(n_h, n_x) * np.sqrt(2/n_x)   # He initialization
W2 = np.random.randn(n_y, n_h) * np.sqrt(2/n_h)

Accuracy for 1 hidden units: 74.33333333333333 %
Accuracy for 2 hidden units: 100.0 %
Accuracy for 3 hidden units: 100.0 %
Accuracy for 4 hidden units: 100.0 %
Accuracy for 5 hidden units: 100.0 %

I’ve been editing your posts to use the “{}” formatting tool to avoid having your outputs interpreted as “markdown”. That is recommended for clarity.

Interesting. Did you try graphing the results in the same way that they showed for the standard data?

Here’s another thread about experimenting with this exercise (using the ReLU activation in that case), which shows the kind of graphing I mean. A picture is (sometimes) worth the proverbial thousand words in terms of seeing what is happening.
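
If you want to roll your own plot rather than reuse the notebook’s helper, the generic recipe is just to evaluate the trained model on a grid and contour-plot it. A sketch (assuming a predict function that maps a (2, m) array of points to 0/1 labels, like the one in the assignment):

import numpy as np
import matplotlib.pyplot as plt

def plot_boundary(predict_fn, X, Y):
    # Evaluate the classifier on a grid covering the data, then draw the regions.
    x_min, x_max = X[0].min() - 0.5, X[0].max() + 0.5
    y_min, y_max = X[1].min() - 0.5, X[1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 300),
                         np.linspace(y_min, y_max, 300))
    grid = np.stack((xx.ravel(), yy.ravel()))        # shape (2, 90000)
    Z = predict_fn(grid).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.Spectral)
    plt.scatter(X[0], X[1], c=Y.ravel(), s=10, cmap=plt.cm.Spectral)
    plt.show()

# e.g. plot_boundary(lambda pts: predict(parameters, pts), X, Y)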

With weights initialized as randn() * 0.01: [plot]

With weights initialized as randn() * sqrt(2/n[l-1]): [plot]


Interesting. But I confess that I’m a bit surprised that it made that much difference. It’s not a limitation of the network itself, since that is the same in both cases (right?). Did you try running the first initialization for a lot more iterations? Of course if the He Initialization works better, then clearly that’s the way to go.

The network itself is not the limitation, since it is the same in both cases.
More iterations do not help when the weights are initialized to randn() * 0.01, as you can see below: the cost converges to 0.693147, which is -ln(0.5), i.e. the network predicts a probability of 0.5 for every sample and is stuck at a local minimum (there is a quick check of that number after the cost log).
The other option is to increase the learning rate to make it escape this minimum, though that is not a clean fix, since it may oscillate and fail to converge in some cases. The best way is to initialize with the He method, which worked fine and was stable.

With weights initialized to randn() * 0.01
Cost after iteration 0: 0.693147
Cost after iteration 1000: 0.693147
Cost after iteration 2000: 0.693147
Cost after iteration 3000: 0.693147
Cost after iteration 4000: 0.693147
Cost after iteration 5000: 0.693147
Cost after iteration 6000: 0.693147
Cost after iteration 7000: 0.693147
Cost after iteration 8000: 0.693147
Cost after iteration 9000: 0.693147
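
As a quick check on where that 0.693147 number comes from:

import numpy as np

# Cross-entropy cost when the model predicts 0.5 for every sample,
# whatever the true label is:
print(-np.log(0.5))   # 0.6931471805599453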

The data set is normalized, since X is generated from sin and cos. The issue was that the weights stayed small and the cost got stuck at 0.693147 (-ln(0.5)) when the weights were initialized to randn() * 0.01. This was resolved by initializing the weights to randn() * sqrt(2/n[l-1]), i.e. He initialization from Course 2, Week 1. With this, the same network is able to converge.
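
A small standalone illustration (not the notebook’s code) of why the cost sits at 0.693147 right from iteration 0 with the 0.01 scale: the very first forward pass already outputs almost exactly 0.5 for every sample (biases omitted, since they start at zero anyway):

import numpy as np

np.random.seed(1)
num_samples = 400                       # assumed value, as before
t = 2 * np.pi * np.arange(num_samples) / num_samples
X = np.stack((np.sin(t), np.cos(t)))    # (2, num_samples)

n_x, n_h, n_y = 2, 4, 1
W1 = np.random.randn(n_h, n_x) * 0.01
W2 = np.random.randn(n_y, n_h) * 0.01

A1 = np.tanh(W1 @ X)                    # hidden activations, all a few hundredths at most
A2 = 1 / (1 + np.exp(-(W2 @ A1)))       # sigmoid output of the network

print(np.abs(A1).max())
print(A2.min(), A2.max())               # both ~0.5, hence cost ~ -ln(0.5) = 0.693147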