I have created a neural network from scratch for multiclass classification on the MNIST handwritten digits dataset. During training I run into an issue: the cost does not start to decrease until around 3000 iterations, and only then begins to fall. Any idea where I can improve?
It is worth tracking the accuracy as well as the cost. What are the training and test accuracy after 500 iterations, 1000 iterations, and so forth?
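For example, a minimal way to compute accuracy, assuming the predictions and the one-hot labels are both `(num_classes, m)` arrays (the function and variable names here are just illustrative, not from the course):

```python
import numpy as np

def accuracy(probs, Y_one_hot):
    # probs: (num_classes, m) softmax outputs; Y_one_hot: (num_classes, m) labels
    preds = np.argmax(probs, axis=0)
    labels = np.argmax(Y_one_hot, axis=0)
    return np.mean(preds == labels)
```

You could log this (on both the training and test sets) every 500 iterations alongside the cost.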
Are you doing this in TF or writing all the code directly in numpy?
Check whether you are using an appropriate method of weight initialization.
That’s a good point. There are lots of “hyperparameters” to consider here. Another is the learning rate. Of course the higher-level questions concern your network architecture: how many layers, how many neurons, which activation functions …
I tried this a few years ago using basically the code that we wrote in DLS C1 W4 and added the softmax and softmax derivative functions myself. It worked pretty well using network architectures very similar to the “two layer” and “L layer” choices we saw in DLS C1 W4.
At least you can start there and then see if you can tune things to get even better results.
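A sketch of what those softmax pieces can look like in numpy (variable names are my own, not the course’s; the gradient function assumes cross-entropy loss, under which dL/dZ simplifies to (A − Y)/m):

```python
import numpy as np

def softmax(Z):
    # subtract the column-wise max for numerical stability
    Z_shift = Z - np.max(Z, axis=0, keepdims=True)
    expZ = np.exp(Z_shift)
    return expZ / np.sum(expZ, axis=0, keepdims=True)

def softmax_cross_entropy_grad(A, Y):
    # combined softmax + cross-entropy gradient w.r.t. Z: (A - Y) / m
    m = Y.shape[1]
    return (A - Y) / m
```

Folding the softmax derivative into the cross-entropy gradient this way avoids building the full Jacobian of softmax.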
The cost function stays constant at around 2.30 until 3000 iterations and then starts to decay exponentially. I have written the whole code using just numpy and imported the data using the scikit-learn library.
I have randomly initialized the weights.
I have just audited this course, so I have no access to the programming assignments and quizzes.
Also, I have normalized both the training and test datasets, and implemented a 3-layer neural network (excluding the input layer): the 1st hidden layer contains 64 neurons, the 2nd hidden layer contains 32 neurons, and the output layer contains 10 neurons.
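For reference, loading and normalizing scikit-learn’s 8x8 digits data can be sketched like this (the divide-by-16 scaling and the 80/20 split are just illustrative assumptions, not necessarily what was done here):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()          # 8x8 images, pixel values in 0..16
X = digits.data / 16.0          # scale pixel values to [0, 1]
Y = digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0)

# transpose to (features, examples) so each column is one image,
# matching a W1 of shape (64, 64)
X_train, X_test = X_train.T, X_test.T
```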
Using what method?
I have created a class Neural_Net and defined the parameters there. The input images are 8x8 pixels, so the input layer has 64 features:
```python
self.W1 = np.random.randn(64, 64) * 0.01   # (hidden1, input)
self.B1 = np.zeros((64, 1))
self.W2 = np.random.randn(32, 64) * 0.01   # (hidden2, hidden1)
self.B2 = np.zeros((32, 1))
self.W3 = np.random.randn(10, 32) * 0.01   # (output, hidden2)
self.B3 = np.zeros((10, 1))
```
You might try different multipliers besides 0.01.
What optimization method are you using?
Batch Gradient descent
It seems like the optimization algorithm got stuck in a plateau region. Should I try the Adam algorithm, or RMSProp…?
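For reference, an Adam update for a single parameter can be sketched like this (the hyperparameter defaults are the usual textbook values, and the function name is just illustrative):

```python
import numpy as np

def adam_update(W, dW, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # m, v: running estimates of the first and second moments of the gradient
    # t: 1-based step count, used for bias correction
    m = beta1 * m + (1 - beta1) * dW
    v = beta2 * v + (1 - beta2) * dW ** 2
    m_hat = m / (1 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)   # bias-corrected second moment
    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v
```

One reason Adam often helps on plateaus is that the bias-corrected moments rescale each parameter’s step, so tiny but consistent gradients still produce a step of roughly size `lr`.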
Have you tried different weight initialization values besides multiplying by 0.01?
I tried multiplying by 0.001 and it got even worse, but when I multiplied by 0.1 the cost function did not get stuck as above. It ended up with poor generalization, though, so I reduced the iterations to around 4000; now it works fine, but with slightly lower accuracy.
What training and test accuracy values are you getting? Have you also tried any experiments with different learning rates? I found 0.1 to be a good value for α in my experiments.
It’s also worth trying Xavier initialization, which Prof Ng does not officially introduce until DLS Course 2, but they gave it to us in the initialize_parameters_deep implementation in DLS C1 W4 A2.
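A sketch of that style of initialization, scaling each layer’s weights by 1/√(fan-in) (the function name mirrors the assignment’s, but this version is my own approximation of it):

```python
import numpy as np

def initialize_parameters_deep(layer_dims, seed=1):
    # layer_dims e.g. [64, 64, 32, 10]: input features, then each layer's size
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        # divide by sqrt of the previous layer's size (fan-in)
        params[f"W{l}"] = (rng.standard_normal((layer_dims[l], layer_dims[l - 1]))
                           / np.sqrt(layer_dims[l - 1]))
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params
```

With layer_dims = [64, 64, 32, 10] this produces the same shapes as the Neural_Net class above, but the variance of each layer’s initial outputs stays roughly constant instead of shrinking with a fixed 0.01 multiplier.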