I’m trying to implement a regularization technique for my neural network because I suspect it is overfitting my dataset. I am using a simple one-hidden-layer DNN with 600 input units and 6 output units. After adding L2 regularization to the hidden layer with lambda set to 0.001, I have some findings that are a bit confusing.
The accuracy on my training set has dropped from 97% to 90% (which indeed reduced the overfitting issue that I suspected). It decreased the bias though.
The accuracy on my testing set has not increased (still at around 65%).
I applied L2 regularization to both the hidden layer and the softmax layer, but it didn’t make much difference.
Is the whole point of regularization to decrease the bias so that I can train my network for longer without overfitting, and get a better result on my validation set?
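For reference, my setup looks roughly like this (a simplified sketch rather than my exact code; the hidden-layer size, optimizer, and loss here are placeholders):

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

lam = 0.001  # the lambda value mentioned above

model = tf.keras.Sequential([
    layers.Input(shape=(600,)),                     # 600 input units
    layers.Dense(64, activation='relu',             # hidden layer; size is a placeholder
                 kernel_regularizer=regularizers.l2(lam)),
    layers.Dense(6, activation='softmax'),          # 6 output units
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',      # adjust to your label format
              metrics=['accuracy'])
```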
Thank you for reading my message. If I have any conceptual misunderstandings about regularization, please let me know. It will be much appreciated.
Two curves: one for training set accuracy and one for testing set;
For simplicity, regularize each and every layer (this should be easy with TensorFlow).
Whatever lambda value r you are using, try a few more values centered around it, in a sequence like \frac{r}{10}, \frac{r}{3}, r, 3r, 10r. Feel free to add a few more if training one network doesn’t cost much time.
Make sure to include \lambda=0.
I would like to see the \lambda values on the x-axis.
If you still have time, can you make one more plot like the above, but with an additional hidden layer of 60 units in between the current two Dense layers? Please use the same set of \lambda's for both plots. (A sketch of what I mean is below.)
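Here is a rough sketch of the sweep and plot I have in mind (build_model stands for your own model-building function with l2(lam) on every layer, and the data variables are placeholders):

```python
import matplotlib.pyplot as plt

r = 0.001                                  # your current lambda
lambdas = [0, r/10, r/3, r, 3*r, 10*r]     # include lambda = 0

train_accs, test_accs = [], []
for lam in lambdas:
    model = build_model(lam)               # your model, with l2(lam) on every layer
    model.fit(X_train, y_train, epochs=200, verbose=0)
    train_accs.append(model.evaluate(X_train, y_train, verbose=0)[1])
    test_accs.append(model.evaluate(X_test, y_test, verbose=0)[1])

# put the lambda values on the x-axis, evenly spaced so that 0 fits in too
xs = list(range(len(lambdas)))
plt.plot(xs, train_accs, marker='o', label='train accuracy')
plt.plot(xs, test_accs, marker='o', label='test accuracy')
plt.xticks(xs, [str(l) for l in lambdas])
plt.xlabel('lambda')
plt.ylabel('accuracy')
plt.legend()
plt.show()
```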
Certainly the goal of regularization would be for your model to perform better on the testing set, but it is not a guarantee. For example, I can set \lambda to an unreasonably large number and destroy everything.
You were right! I will be editing the post shortly!
And you’re also right, what I meant is that it decreased the variance … I was a bit confused yesterday haha. I will be making the plot you mentioned shortly.
This one’s behavior is easy to understand. However, before I can really comment on it, I notice in your code that you didn’t use EarlyStopping, so I suppose this graph recorded the training and validation accuracies at the end of the 200th epoch.
This is not good, because it doesn’t reflect the best validation accuracies. Please allow the runs to early stop and remake the graph with the stopped training and validation accuracies. With EarlyStopping, your first model training run should finish within 40 epochs (saving 80% of your time).
This is a very interesting graph, especially how you picked the hidden-unit values to experiment with; it would just be more usable if EarlyStopping had been applied to these runs.
In the 3rd code cell, change your DNN1_model(input_shape, r) to DNN1_model(hidden_units, r), since input_shape is actually fixed and you are experimenting with the number of hidden units.
In the 4th code cell, use nested loops. The outer loop goes over the different values of hidden_units, and the inner one goes over the lambdas. Then adjust the way you store the final accuracies so that in your next graph there will be as many training (and validation) curves as there are values of hidden_units.
In the 5th code cell, in those plt.plot lines, add the label argument to give each curve a name. Then, before plt.show, call plt.legend to show the legend containing the curve names.
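Putting those three cells together, it would look roughly like this (just a sketch; DNN1_model is your function, and the data variables and the hidden_units values are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt

hidden_units_list = [6, 30, 60]            # placeholder values
r = 0.001
lambdas = [0, r/10, r/3, r, 3*r, 10*r]

results = {}                               # hidden_units -> (train accuracies, valid accuracies)
for hidden_units in hidden_units_list:     # outer loop: hidden units
    train_accs, val_accs = [], []
    for lam in lambdas:                    # inner loop: lambda
        model = DNN1_model(hidden_units, lam)
        history = model.fit(X_train, y_train,
                            validation_data=(X_val, y_val),
                            epochs=200, verbose=0)
        # assumes the model was compiled with metrics=['accuracy']
        best = int(np.argmax(history.history['val_accuracy']))
        train_accs.append(history.history['accuracy'][best])
        val_accs.append(history.history['val_accuracy'][best])
    results[hidden_units] = (train_accs, val_accs)

xs = list(range(len(lambdas)))
for hidden_units, (train_accs, val_accs) in results.items():
    plt.plot(xs, train_accs, label=f'train, {hidden_units} units')
    plt.plot(xs, val_accs, '--', label=f'valid, {hidden_units} units')
plt.xticks(xs, [str(l) for l in lambdas])
plt.xlabel('lambda')
plt.ylabel('accuracy')
plt.legend()
plt.show()
```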
I have finished recoding as you taught me. Here are the plots of accuracy with respect to different lambda values. There are 5 plots, one for each number of hidden units. I used early stopping at 40 epochs for all cases.
To my surprise, the 30 hidden units performed better, which might stem from simpler training for the simpler network (thus my implementation of early stopping is wrong). I’ve also observed that regularization has more impact when there are more hidden units.
Would you mind also updating your repo with the code that generated these 5 plots? I want to make sure we are on the same page, because something doesn’t seem to make sense to me.
Also, you said 30 units performed better, but based on validation accuracy, I would say 6 units is the one I would pick.
Some trends are understandable:
as lambda increased, validation accuracy first increased and then decreased
with more hidden units, validation accuracy dropped
But some are hard to understand:
training accuracies increasing with higher lambda (plots 1, 2)
accuracy gaps growing larger with higher lambda (plots 1, 2)
with more hidden units, training accuracy dropped (across all plots)
I need to inspect your code before I can make any further comment. In case you also want to do some inspection, would you first update your repo and notify me before that? We can exchange our findings later.
I have updated the code and uploaded it to the repo. One of the problems mentioned above
is resolved. (I don’t know how I resolved this issue though; I only recoded because my file didn’t save.) Please refer to the file ‘1 hidden layer V3’.
Actually, all three of the problems are gone! Now their opposites are true; see if the opposites make sense to you.
You are still not using EarlyStopping; it does not mean fixing the number of epochs to a smaller value. You need to google how to use this class to add EarlyStopping to your training process.
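To be concrete, it is a callback class, used roughly like this (the monitored metric and patience here are placeholders for you to adapt):

```python
import tensorflow as tf

# EarlyStopping is not a smaller epoch count: you still request many epochs,
# and training stops once the monitored metric stops improving.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_accuracy',
    patience=10,                   # epochs to wait without improvement (placeholder)
    restore_best_weights=True,     # roll the weights back to the best epoch
)

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=200,    # an upper bound, not a fixed training length
                    callbacks=[early_stop])
```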
I hope that, besides what I have suggested, in your next post I will see your analysis of the situation, your further work, and the latest results.
For example, validation accuracy does increase in some cases when Layer 1’s units increase. However, the effect is not clear because you have to go up and down to compare some plots. Having said that, if you did not notice any improvement, you might need to think about what you can do to assist yourself better. For example, instead of scrolling up and down to compare two plots, why not put them in one single plot?
Another example: the accuracy gaps close as lambda increases, which is reasonable. However, a dropping validation accuracy isn’t something we want.
Therefore, there were both improvements and unfavourable outcomes, and they should inform your next actions.
@Chiang_Yuhan, I believe your goal with this work is to train yourself to be a good model trainer. So, the key is your work, not mine. Below is an experiment loop you can jump into:
Experiment → observation → make sense of the observation → hypothesis → experiment on your hypothesis → observation → …
If you still have no idea what to do, here are my suggestions:
Scan through your code character-by-character and make a list of all tunable parameters:
learning rate,
L1 hidden units,
number of hidden layers (can be 0, why not?)
lambda,
number of training samples (by adjusting train/valid ratio),
many more… I am just lazy, but you can do it for yourself.
For each, ask yourself whether increasing it would help with overfitting or underfitting.
Observing overfitting/underfitting is easy because you can compare the train/valid losses and accuracies (see the small sketch after this list).
Based on your observation and your list of parameters, what are you going to do next?
You may add a step in the loop to log the latest best validation accuracy you have achieved and how you achieved it.
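For the comparison step, a minimal sketch using a Keras History object (the variable names are placeholders):

```python
import matplotlib.pyplot as plt

# Quick over-/under-fitting check: a widening gap between the two curves
# suggests overfitting; two high, close losses suggest underfitting.
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='valid loss')
plt.xlabel('epoch')
plt.legend()
plt.show()
```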
I won’t be surprised if you go through that experimentation loop 100 times yourself within a day, and if you do so, there are two consequences:
You are accumulating experience through ups and downs, which will be very reusable in your future. We can’t only ever see the ups, but if I am the one who keeps pushing you into the downs, I can’t imagine how you would feel. It’s best for you to follow your own instinct, and train it as you go through the loop.
You can get way more done in one day than we can by just exchanging messages here for two weeks.
Cheers,
Raymond
PS1: I would love to read and discuss your list of tunable parameters, along with how each deals with over-/under-fitting. However, it’s up to you whether you want to share it.
We can put this behind us. It would be good if you knew how you fixed it, because that knowledge is entirely for your own gain; but if you don’t, it isn’t a big issue at all, because sooner or later it will come back and you will find out.