When I add more training values, the neural network cannot find a solution. Why?

In the first lab of “TensorFlow for ML, AI and DL” we train a basic neural network using just 6 data points.

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(units=1, input_shape=[1])])
model.compile(optimizer='sgd', loss='mean_squared_error')
xs = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
ys = np.array([-3.0, -1.0, 1.0, 3.0, 5.0, 7.0], dtype=float)
model.fit(xs, ys, epochs=500)

The goal is to predict y from x, where the underlying relationship is y = 2x-1.

I wanted to experiment with how more data improves accuracy. I noticed that if I add more than 19 values, the network cannot find a solution at all. Instead of a decreasing value, the loss is reported as loss: inf (infinity?) or loss: nan (not a number?). I have tried other loss functions and it’s even worse. Can anyone explain this a little? Why is this happening?

Hello Tomas,

You mentioned y = 2x-1. Can I ask why and how this relation was created?

Your model statement should be tf.keras.models.Sequential([tf.keras.layers.Dense

For the housing price exercise, only a single Dense layer with 1 unit was used. Until we know your dataset, we cannot say whether a single Dense layer is enough for your model. The same goes for model.fit (the number of epochs) and for model.compile, where the loss and optimizer are chosen based on how your inputs and outputs are related. That is probably the reason your model cannot handle more than 19 values. To respond properly, one needs to know your dataset and what kind of analysis or output you are after. In the C1W1 housing price exercise, for example, the task was to predict housing prices from the number of bedrooms, and the model used a single Dense layer with one input (the number of bedrooms). That was a very simple model, but your equation y = 2x-1 doesn’t seem to be related to that housing price analysis.

Regards
DP

Hello Deepti_Prasad, and thank you for the answer!

This is a very basic exercise from week 1. All the data and the model definition are in my post. It’s not the house pricing exercise but the one before it. The goal is to determine the relationship between the numbers x and y. The data is provided as follows:

xs = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
ys = np.array([-3.0, -1.0, 1.0, 3.0, 5.0, 7.0], dtype=float)

This is all the data, there is nothing more.

so for:
x0 = -1, y0 = -3
x1 = 0, y1 = -1
and so on.

And we know from the exercise that the actual relationship between x and y is
y = 2x-1.
We try to “discover” it using a neural network. The model can learn it from just the 6 provided values and predict accurately.

I wanted to experiment with more data, so I added more x/y pairs, and I noticed that when there are more than 19 pairs the model is unable to find a solution, and I was wondering why?

Here is my notebook if you would like to take a look

Also, I have found that the Adam optimizer works well with more data, but my question is about the optimizer used in the exercise, which is sgd.

So basically you’re creating a model for a mathematical equation here? Am I understanding that correctly?

Yes, just to show the basic concept, I suppose. I have also found that the Adam optimizer works well with more data, but sgd, which was used in the exercise, does not. And this is what I’m wondering about.

Hello TomasF,

That clears up a lot of doubt. You are applying an inhomogeneous (nonhomogeneous) equation to a homogeneous equation relation, hence your model is failing. Your equation requires more neural network complexity (more units), whereas the housing price model has only one unit. Even if you change your optimizer and loss, the equation you are trying to implement is not the right choice for this model.

You probably have to apply some mathematical technique for your model to work. I would need some time to give you an idea of how to proceed. Alternatively, you can ask the super mentors, especially those with a mathematical background, about your model.

Regards
DP

I agree with DP, and I will also take a look at your data set and report back if I find anything interesting.

Note that this assignment isn’t really a “neural network” at all, as it has only one Dense layer and a single output. There is no hidden layer, which is the hallmark of a neural network.

It’s just doing linear regression.

When I ran the code in your repo (on Colab), I got a cost of essentially zero after about 350 iterations, and the predictions are all perfect.

Here is the learned weight and bias.

This is the expected result.

That’s the version with 2000 examples that form a perfectly straight line, using the Adam optimizer.

So I am unable to duplicate the problem you are reporting.

Typically this error means the learning rate is too high.

Or you need to normalize the data set, because the range of the input feature values is too large.

Note that you have to be very careful when you do experiments using a notebook. After you run it once, if you change any of the settings (like the optimizer or the data set), you have to restart the notebook before you will get valid results.

Did you use optimizer='sgd' or Adam?
The problem was with sgd as the optimizer. In the notebook, under the “#Compile the model” section at the beginning, that line is now commented out. If you would like to reproduce the situation, uncomment the first line:
model.compile(optimizer='sgd', loss='mean_squared_error')

This was the default code for this exercise.

I’m continuing to look into this. I think there may be some issue with your notebook, as the results don’t make a lot of sense.

Since you increased the range of the input data (integers from -1 to 2000), you need to normalize the data set if you’re going to use a plain-vanilla optimizer like SGD.

Ideally the input feature values will range roughly from -5 to +5, with a mean of zero. Those are general guidelines.
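As a rough illustration of that guideline, here is a minimal sketch of standardizing the inputs (subtract the mean, divide by the standard deviation) in plain Python. The helper name `standardize` is made up for this example, and the data simply follows y = 2x-1 over the integer range mentioned above.

```python
from statistics import mean, pstdev

def standardize(values):
    """Scale values to mean 0 and standard deviation 1 (hypothetical helper)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values], mu, sigma

# Raw inputs: integers from -1 to 2000, with targets on the line y = 2x - 1
xs = [float(x) for x in range(-1, 2001)]
ys = [2 * x - 1 for x in xs]

xs_scaled, x_mean, x_std = standardize(xs)
print(min(xs_scaled), max(xs_scaled))

# A new input must be scaled with the SAME mean and std before predicting
x_new_scaled = (10.0 - x_mean) / x_std
```

The model is then trained on `xs_scaled` instead of `xs`. Keras also provides a `Normalization` layer with an `adapt` method that can do the same scaling inside the model.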

SGD is also extremely sensitive to the learning rate. Too high and you get +Inf for the cost, too low and it takes an infinite number of iterations to get a converged solution.

I’ve been experimenting with how to use the Keras Normalization layer. I’ll post more about that later.

Tomas, since the equation is an inhomogeneous linear equation, you could try this out.

What you could do is add an epsilon.

Epsilon-greedy is a simple method to balance exploration and exploitation by choosing between them randomly.

So basically you derive the probability at the mean and max, and then:
p = random()
if p < epsilon:
    pull a random action
else:
    pull the current best action

Then first derive the NumPy mean of your data.

Pass your x value through with the probability mentioned above (the if/else random action).

Whatever your result is from this, pass it to the model you have created.

Then plot a graph to see how your analysis is doing.

Regards
DP

Yes, the data needs to be normalized, then it works!
Thank you. Here is what I found after more testing

I tested it with just 6 points of data, which in the lab of week one were working fine. They were:
xs = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
ys = np.array([-3.0, -1.0, 1.0, 3.0, 5.0, 7.0], dtype=float)

Without increasing the number of values, I just increased the values, and sgd could find a solution up to this:
xs = [ 7. 8. 9. 10. 11. 12.]
ys = [13. 15. 17. 19. 21. 23.]

Add one more (+1) to xs:
xs = [ 8. 9. 10. 11. 12. 13.]
ys = [15. 17. 19. 21. 23. 25.]
and it cannot find a solution. Instead it prints increasing loss values and then loss: nan or loss: inf.

BTW, what do loss: nan and loss: inf mean?

Then I fed the model 200 normalized values and it found a solution, no problem. So the lack of normalization was definitely the issue, not the amount of data. Still, I wonder what the math behind it is.

I’m interested in what you have written, but I do not fully understand.

Could you show me how to actually implement this in the notebook?
Here is the link to an editable notebook in my colab

NaN means “not a number”.
Inf means “positive infinity”.

These are both signs that the cost calculation has gone haywire.
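For illustration, here is a small sketch of how these two values arise in ordinary floating-point arithmetic. It uses Python's 64-bit floats. Keras computes in 32-bit floats by default, which overflow much sooner (at roughly 3.4e38), but the mechanism is the same.

```python
import math

big = 1e308                          # close to the largest finite 64-bit float
overflowed = big * 10                # too large to represent, so it becomes inf
undefined = overflowed - overflowed  # inf - inf has no defined value, so it is nan

print(overflowed, undefined)         # prints: inf nan
print(math.isinf(overflowed), math.isnan(undefined))
```

Once the loss or a weight reaches inf, later updates tend to mix infinities and produce nan, which is why a run often shows inf first and nan afterwards.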

Gradient descent needs to have the magnitudes of the gradients all be in the same limited range. This helps the optimizer find a solution with one fixed learning rate applied to all of the gradients.
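To put a number on that "limited range" for this exercise, here is a rough sketch. It assumes full-batch gradient descent on mean squared error for the model y = w*x + b (with only 6 points, each Keras epoch is a single full-batch step) and the Keras SGD default learning rate of 0.01. For a quadratic loss like this one, gradient descent is stable only while the learning rate stays below 2 divided by the largest eigenvalue of the loss's Hessian. The helper name `max_stable_lr` is invented for this example.

```python
import math

def max_stable_lr(xs):
    """Largest learning rate for which full-batch gradient descent on MSE
    stays stable for the model y = w*x + b (hypothetical helper)."""
    n = len(xs)
    m1 = sum(xs) / n                  # mean of x
    m2 = sum(x * x for x in xs) / n   # mean of x squared
    # The Hessian of the MSE loss is 2 * [[m2, m1], [m1, 1]].
    # Its largest eigenvalue, via the closed form for a symmetric 2x2 matrix:
    trace, det = m2 + 1.0, m2 - m1 * m1
    largest_eigenvalue = trace + math.sqrt(trace * trace - 4.0 * det)
    return 2.0 / largest_eigenvalue

lr = 0.01  # default learning rate of the Keras SGD optimizer
for xs in ([-1, 0, 1, 2, 3, 4], [7, 8, 9, 10, 11, 12], [8, 9, 10, 11, 12, 13]):
    limit = max_stable_lr(xs)
    print(xs, "limit = %.4f" % limit, "stable" if lr < limit else "diverges")
```

Under these assumptions the limit falls from just above 0.01 for xs = 7..12 to just below 0.01 for xs = 8..13, which matches where the experiment in this thread stopped converging.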

If the gradients are too large for that learning rate, then the cost could diverge to infinity.

If you try to fix this by using a smaller learning rate, then you have to drastically increase the number of iterations.
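Both effects can be seen in a small simulation. This is only a sketch in plain Python, not what Keras runs internally. It assumes full-batch gradient descent on mean squared error, starts from w = 0 and b = 0 (Keras initializes the weight randomly), and uses the un-normalized xs = 8..13 data from this thread. The function name `train` is made up.

```python
def train(xs, ys, lr, epochs):
    """Full-batch gradient descent on MSE for y = w*x + b (illustrative only)."""
    w, b = 0.0, 0.0
    n = len(xs)
    losses = []
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        losses.append(sum(e * e for e in errors) / n)
        # Gradients of the mean squared error with respect to w and b
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses

xs = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
ys = [15.0, 17.0, 19.0, 21.0, 23.0, 25.0]

_, _, fast = train(xs, ys, lr=0.01, epochs=500)    # Keras SGD default rate
w, b, slow = train(xs, ys, lr=0.001, epochs=500)   # ten times smaller

print("lr=0.01  final loss:", fast[-1])
print("lr=0.001 final loss:", slow[-1], "w =", w, "b =", b)
```

With lr=0.01 the loss grows by a large factor each epoch. In 32-bit arithmetic, which Keras uses by default, that growth overflows to inf and then nan. With lr=0.001 the loss falls, but after 500 epochs w and b are still creeping toward 2 and -1, so many more iterations would be needed.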

If you look at the equation for the gradients (it’s covered in other courses, not sure about this one), the equation includes multiplying an error value by the feature value.
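For reference, here is that equation written out for this exercise. It assumes the model is y = wx + b trained with mean squared error over n examples; the symbols J (cost) and α (learning rate) are notation chosen here, not taken from the course.

```latex
J(w, b) = \frac{1}{n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right)^2

\frac{\partial J}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right) x_i
\qquad
\frac{\partial J}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} \left( w x_i + b - y_i \right)

w \leftarrow w - \alpha \, \frac{\partial J}{\partial w}
\qquad
b \leftarrow b - \alpha \, \frac{\partial J}{\partial b}
```

The error term is multiplied by x_i in the gradient for w, so larger feature values produce larger gradients and therefore larger steps at the same learning rate.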

The SGD optimizer uses a fixed learning rate.

The Adam optimizer has an adaptive (per-parameter) learning rate, so it copes with a non-normalized dataset a little better.
