I wanted to experiment with how more data improves accuracy. I noticed that if I add more than 19 values, the network cannot find a solution at all. Instead of a decreasing value, the loss prints Loss: inf (infinity?) or Loss: nan (not a number?). I have tried other loss functions and it’s even worse. Can anyone explain this a little? Why is that happening?
When you mentioned y = 2x - 1, can I know why and how this relation was created?
Your model statement should be tf.keras.models.Sequential([tf.keras.layers.Dense(units=1, input_shape=[1])])
For the housing price exercise only a single Dense layer with 1 unit was used. Until we know your dataset, we cannot say whether a single Dense layer is enough for your model; the same goes for model.fit (the number of epochs) and for model.compile, where the loss and optimizer are chosen based on how your input and output are related. Probably that is the reason your model cannot handle 19 values. To understand and respond properly one needs to know your dataset and what kind of analysis or output you are trying to get. In C1W1 the housing price exercise was a predictive analysis of housing prices based on the number of bedrooms, and the model was created using a single Dense layer with one input (i.e. the number of bedrooms). That was a very simple model, but your equation y = 2x - 1 doesn’t seem directly related to the housing price analysis.
Hello Deepti_Prasad, and thank you for the answer!
This is a very basic exercise from week 1. All the data and the model definition are in my post; it’s not the housing price exercise, but the previous one. The goal is to determine the relationship between the numbers x and y, and the data is provided for us as follows
so for:
x0 = -1, y0 = -3
x1 = 0, y1 = -1
and so on.
And we know from the exercise that the actual relationship between x and y is
y = 2x-1.
We try to “discover” it using a neural network. The model can learn that using just the 6 provided values and then predict accurately.
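For reference, the setup is essentially the one from the lab; a minimal sketch of it looks roughly like this (the epoch count and the explicit reshape to column vectors are just my choices, nothing special):

import numpy as np
import tensorflow as tf

# the six x/y pairs provided in the exercise (they satisfy y = 2x - 1)
xs = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0], dtype=float).reshape(-1, 1)
ys = np.array([-3.0, -1.0, 1.0, 3.0, 5.0, 7.0], dtype=float).reshape(-1, 1)

# a single Dense layer with one unit, trained with sgd and mean squared error
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units=1, input_shape=[1])
])
model.compile(optimizer='sgd', loss='mean_squared_error')
model.fit(xs, ys, epochs=500, verbose=0)

print(model.predict(np.array([[10.0]])))   # prints something close to 19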
I wanted to experiment with more data, so I added more x/y pairs, and I noticed that when there are more than 19 pairs the model is unable to find a solution, and I was wondering why?
Here is my notebook if you would like to take a look
and also I have found that the Adam optimizer works well with more data, but my question is about the optimizer we used in the exercise, which is sgd
I hope I made it clear; if not, please ask and I will be glad to explain my problem in more detail.
Yes, just to show the basic concept, I suppose. I have also found that the Adam optimizer works well with more data, but sgd, which was used in the exercise, does not. And this is what I’m wondering about.
That clears up a lot of doubt: you are applying an inhomogeneous (non-homogeneous) equation to a homogeneous equation relation, hence your model is failing. Your equation requires more neural network complexity (or units), whereas the housing price model has only one unit. Even if you change your optimizer and loss, the equation you are trying to fit with this model is not the right choice.
Probably you have to apply some mathematical transformation to your model for it to work. I would need some time to give you any idea about how to go about it. Or you could ask the super mentors, especially those with a mathematical background, about your model.
I agree with DP, and I will also take a look at your data set and report back if I find anything interesting.
Note that this assignment isn’t really a “neural network” at all, as it has only one Dense layer and a single output. There is no hidden layer, which is the hallmark of a neural network.
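Just to make that distinction concrete (a sketch, not code from the assignment; the 8 units and the relu activation are arbitrary choices for illustration):

import tensorflow as tf

# the assignment's model: one Dense output unit, no hidden layer
linear_model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units=1, input_shape=[1])
])

# a minimal "real" neural network would add at least one hidden layer
tiny_nn = tf.keras.models.Sequential([
    tf.keras.layers.Dense(units=8, activation='relu', input_shape=[1]),   # hidden layer
    tf.keras.layers.Dense(units=1)                                        # output layer
])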
Note that you have to be very careful when you do experiments using a notebook. After you run it once, if you change any of the settings (like the optimizer or the data set), you have to restart the notebook before you will get valid results.
Did you use optimizer='sgd' or Adam?
At the beginning of the notebook, under the “# Compile the model” section, the active code is
model.compile(optimizer='Adam', loss='mean_squared_error')
and Adam works great.
The problem was with sgd as the optimizer; it is now commented out. If you would like to see/reproduce the situation, uncomment the first line:
model.compile(optimizer='sgd', loss='mean_squared_error')
Since you increased the range of the input data (integers from -1 to 2000), you need to normalize the data set if you’re going to use a plain-vanilla optimizer like SGD.
Ideally the range of input feature values will be between around -5 and +5, with a mean of zero. Those are general guidelines.
SGD is also extremely sensitive to the learning rate. Too high and you get +Inf for the cost, too low and it takes an infinite number of iterations to get a converged solution.
I’ve been experimenting with how to use the Keras Normalization layer; I’ll post more about that later.
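In the meantime, here is roughly the pattern I have in mind, as a sketch only (the 200 points and the epoch count are arbitrary; in recent TF versions the layer is tf.keras.layers.Normalization):

import numpy as np
import tensorflow as tf

# 200 x values over a wide range, with y = 2x - 1
xs = np.arange(0.0, 200.0, dtype=float).reshape(-1, 1)
ys = 2 * xs - 1

# the Normalization layer learns the mean/variance of xs from adapt(),
# then rescales inputs to roughly zero mean and unit variance inside the model
norm = tf.keras.layers.Normalization(axis=None, input_shape=[1])
norm.adapt(xs)

model = tf.keras.models.Sequential([
    norm,
    tf.keras.layers.Dense(units=1)
])
model.compile(optimizer='sgd', loss='mean_squared_error')
model.fit(xs, ys, epochs=500, verbose=0)

# the scaling happens inside the model, so predictions take raw x values
print(model.predict(np.array([[150.0]])))   # should be close to 299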
Yes, the data needs to be normalized, then it works!
Thank you. Here is what I found after more testing:
I tested it with just 6 points of data, which worked fine in the week 1 lab. They were:
xs = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
ys = np.array([-3.0, -1.0, 1.0, 3.0, 5.0, 7.0], dtype=float)
Without increasing the number of values, I just increased the values themselves, and sgd could still find a solution up to this:
xs = [ 7. 8. 9. 10. 11. 12.]
ys = [13. 15. 17. 19. 21. 23.]
Shift the values of xs up by one more (+1):
xs = [ 8. 9. 10. 11. 12. 13.]
ys = [15. 17. 19. 21. 23. 25.]
and it cannot find a solution; instead it prints increasing loss numbers and then loss: nan, loss: inf.
BTW, what do loss: nan and loss: inf mean?
Then I fed the model with 200 normalized values and it found a solution, no problem. So the lack of normalization was definitely the issue, not the amount of data. Still, I wonder what the math behind it is.
Nan means “not a number”.
Inf means “positive infinity”.
These are both signs that the cost calculation has gone haywire.
Gradient descent needs to have the magnitudes of the gradients all be in the same limited range. This helps the optimizer find a solution with one fixed learning rate applied to all of the gradients.
If the gradients are too large for that learning rate, then the cost could diverge to infinity.
If you try to fix this by using a smaller learning rate, then you have to drastically increase the number of iterations.
If you look at the equation for the gradients (it’s covered in other courses, not sure about this one), the equation includes multiplying an error value by the feature value.
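To make that concrete, here is a little sketch of that calculation (not the notebook’s code; I’m assuming the Keras SGD default learning rate of 0.01, and with only 6 samples each epoch is effectively one full-batch gradient step, ignoring the random weight initialization):

import numpy as np

# plain full-batch gradient descent on y_hat = w*x + b with MSE loss,
# which is roughly what the single Dense unit is doing here
def run_gd(xs, ys, lr=0.01, steps=20):
    w, b = 0.0, 0.0
    for step in range(steps):
        err = (w * xs + b) - ys          # error term
        dw = 2 * np.mean(err * xs)       # the gradient multiplies the error by the feature value x
        db = 2 * np.mean(err)
        w -= lr * dw
        b -= lr * db
        print(f"step {step:2d}  loss {np.mean(err ** 2):14.3f}  w {w:10.3f}")

xs = np.arange(7.0, 13.0)               # x = 7..12: the loss shrinks (slowly)
run_gd(xs, 2 * xs - 1)

xs = np.arange(8.0, 14.0)               # x = 8..13: now the steps overshoot; the loss grows every
run_gd(xs, 2 * xs - 1)                  # iteration and, if you keep going, overflows to inf and then nan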
The SGD optimizer uses a fixed learning rate.
The Adam optimizer has a variable learning rate, so it copes with non-normalized data a little better.
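In case it helps with experimenting, here is how you can pass an explicit learning rate instead of the 'sgd' string shortcut (a sketch, not code from the notebook; 0.001 is an arbitrary example value, and how small the rate needs to be depends on the data):

import tensorflow as tf

model = tf.keras.models.Sequential([tf.keras.layers.Dense(units=1, input_shape=[1])])

# option 1: keep SGD but lower its fixed learning rate (expect to need many more epochs)
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
              loss='mean_squared_error')

# option 2: Adam adapts its per-parameter step sizes, so it is more forgiving
model.compile(optimizer=tf.keras.optimizers.Adam(),
              loss='mean_squared_error')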