C2_W1_Lab02_CoffeeRoasting_TF - Why does adding additional neurons result in weird plots?

Hi,

I wanted to experiment a little with the lab, so I added 2 additional neurons to the first layer, but the plots at the end of the lab aren't what I expected. I'm assuming it's not overfitting, as the decision boundary for the sigmoid function runs right through the middle of the good-roast data points.

Increased the number of neurons from 3 to 5.

The sizes of W1, W2, B1 and B2 appear to be OK.
image

Changed the epochs from 10 to 30.
image

Updated weights. I also commented out the following section of code.

The plots: specifically, plots 3 and 4 don't make sense to me, and I was hoping someone could explain why the network thought that was a good fit.

Thanks very much.

week-1 Advanced Learning Algorithms

Edit: I'm assuming this is in the wrong area, apologies for that, but I tried posting it on the following page, "Advanced Learning Algorithms - DeepLearning.AI", but I kept getting a "You need to assign at least one tag" error message.


I’ll take a look at this and see if I can understand it.


Hi @Frazer_Mann

Which course are you attending?


Hi,
The “Machine Learning Specialisation”, currently on the first week of the “Advanced Learning Algorithms”.


Thanks! I'll move your query to the right category so that you can get proper support.

Best regards
elirod


Hello @Frazer_Mann,

It is very easy to understand, and all you need are the following two screenshots. I can tell you how I would kick-start the thinking, and let you finish the rest.

Take unit 2 as an example: (1) by examining its weights, (2) we can conclude that the region of high temperature and long duration evaluates to 1 (agree?), and the other side to 0. (3) Focusing on the 1 side: the output layer assigns -32.79 as its weight which, after adding the bias 15.54, becomes a fairly negative value. What does that mean for unit 2's contribution to the final prediction on the 1 side? Then, redo the thinking process of step (3) for the 0 side.

Then, try to think about this for all the units. Knowing that these units' contributions are additive in the output layer, see if you can come up with a theory that answers your question. I am pretty confident that you can. If not, let me know, and I can give you one more hint.

We can discuss your theory.

Cheers,
Raymond


Before I start, I would just like to say thank you to @TMosh & @rmwkwok for your help thus far, I really appreciate it.

I took the W1 and W2 for W1_3 and converted them back from normalised values with the following:

image

where the StDev and Mean were taken from the Temp values for W1 and the Duration values for W2.

Question 1: I'm not sure what I was supposed to do with "b" when I de-normalised this?

I then estimated the line through a couple of the data points, and yes, it seems to have the same gradient as Layer 1.

I think this is where I'm missing something. I thought W2 and B2 were the weight and y-intercept of the 2nd neuron's linear decision boundary? Since the data has been normalised, I find it hard to visualise, though.

From what you wrote I get the impression that because it's a large negative number it doesn't contribute a lot, but I'd be lying if I said I understood why.

Question 2: Could you please clarify my above misunderstanding?

Question 3: For layer 1 unit 4, why did the network think that was a good fit for the data, when most of the x's would be in the 100% probability region of the sigmoid surface plot?

Thanks again for your help.


That just means you need to pick a week number from the drop-down list.


Note that your plots, with the two features on the horizontal and vertical axes and the 'y' values shown as 'o' or 'x', do not show the "gradients".

Gradients have to do with the cost function, not a plot of the classifications.

Two thoughts about this from your last post:

  1. You don’t get “Y” from this, you get the raw X value.
  2. That should be Xn, not X.

So more correctly, your de-normalized equation would be:
X = (Xn * std) + mean

Regarding what to do with the bias:
The weights and biases only apply to the normalized x values.
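As a quick sanity check, that de-normalization can be sketched in a few lines. The mean and standard deviation values below are hypothetical placeholders, not the lab's actual statistics:

```python
import numpy as np

# Hypothetical normalization statistics (in the lab these come from
# the raw temperature/duration training data)
mean = np.array([212.0, 12.0])  # placeholder means for (temp, duration)
std = np.array([30.0, 2.0])     # placeholder standard deviations

def denormalize(Xn):
    """Recover raw feature values from normalized ones: X = Xn * std + mean."""
    return Xn * std + mean

Xn = np.array([0.5, -1.0])      # a normalized example point
X = denormalize(Xn)
print(X)                        # [227.  10.]
```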


Large weights do contribute significantly, in a positive or negative manner.
A weight near zero would not contribute much.

Hi @TMosh,

I found the following video useful for understanding how the linear decision boundary and the sigmoid equation are connected.

15mins in - Decision Boundary and Sigmoid Function.

So I had assumed Temp would be my X axis and Duration would be my Y axis, hence the labelling above. Sorry for not making that clear. How would you normally refer to these axes?

Yeah, I just spent most of this morning looking into it and was hoping I could reply before either you or @rmwkwok did. Thanks for clarifying that. For some reason I also forgot that each neuron in the first layer only outputs a single value. D'oh!

So I see the weight is likely completely offset by the bias term for layer 1 unit 4, suggesting that the network will effectively ignore it, but I'm still confused about 2 things:

  1. Why did it position the decision boundary there in the first place?
  2. The final output graph is as follows, which suggests it's not ignoring unit 4:


I go with “horizontal axis” and “vertical axis”.

Or you could use “east/west” and “north/south”, like used on a map. That’s less common.


Hello @Frazer_Mann,

De-normalizing is not wrong, but there is no need to do it.

Take unit 2, again, as an example,

image

I deliberately omitted the axis labels because they are unimportant. We know that the shaded region includes the extreme point (+∞, +∞), and with W1 = [12.89, 10.79], we can argue that at that extreme the evaluated value is +∞, so after the sigmoid it becomes 1. Therefore, the shaded region represents 1.

I never needed to denormalize the weights to come to this conclusion. So, I will not go into how to denormalize, as it is unnecessary and would distract from the focus.
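To make the extreme-point argument concrete, here is a small sketch using the unit's weights as quoted in this thread (W1 = [12.89, 10.79]; the bias 1.01 is the value quoted elsewhere in the thread). A large finite point stands in for (+∞, +∞):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Unit 2's parameters as quoted in the thread
w = np.array([12.89, 10.79])
b = 1.01

# A point deep inside the shaded (upper-right) region
x = np.array([100.0, 100.0])
z = w @ x + b                   # hugely positive pre-activation
print(sigmoid(z))               # 1.0 (to machine precision)
```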

The above is also how I would approach the following steps:

For the below one:

Let's refrain from drawing conclusions so quickly; instead, I recommend you finish the following table without questioning its purpose:

| | layer 1, unit 0 | layer 1, unit 1 | layer 1, unit 2 | layer 1, unit 3 | layer 1, unit 4 |
|---|---|---|---|---|---|
| Is the 1-side of layer 1, unit x's contribution to the final prediction plus b2 < 0? | ? | ? | Yes | ? | ? |
| Is the 0-side of layer 1, unit x's contribution to the final prediction plus b2 < 0? | ? | ? | ? | ? | ? |

The explanation for the "Yes" that I filled in for you was given in the quote above. Note that the output of any unit in layer 1 must be between 0 and 1 because of the sigmoid. For the purpose of our discussion, we only consider the extreme cases, meaning that the shaded region outputs 1 and the non-shaded region outputs 0; we don't need to think about the marginal cases where the outputs are in between.
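If it helps to mechanize the table, the check for each cell is just (layer-2 weight) × (unit output on that side) + b2. In this sketch only unit 2's weight (-32.79) and b2 (15.54) are taken from the thread; the other layer-2 weights are made-up placeholders you would replace with your own trained values:

```python
import numpy as np

# Layer-2 weights, one per layer-1 unit. Only index 2 (-32.79) and b2
# come from the thread; the rest are hypothetical placeholders.
W2 = np.array([-10.0, -20.0, -32.79, 5.0, 8.0])
b2 = 15.54

for j, w in enumerate(W2):
    # Extreme cases: the unit outputs ~1 on its "1 side", ~0 on its "0 side"
    one_side = w * 1.0 + b2
    zero_side = w * 0.0 + b2
    print(f"unit {j}: 1-side + b2 = {one_side:7.2f} (< 0? {one_side < 0}), "
          f"0-side + b2 = {zero_side:.2f} (< 0? {zero_side < 0})")
```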

After you fill in the table, you might share your table, your observations, and any theory answering your question. Then we can discuss your table, observations, and theory.

There is no need to denormalize, there is no need to visualize more than you have already shown. Please try what I have suggested.

I won’t tell you why unit 4 is like that because I expect you to come up with some theory from the table.

Cheers,
Raymond


@Frazer_Mann, please also note that your concern about the plot of unit 4 compares the boundary of layer 1 unit 4 with the predictions (labels) from layer 2.

We need my table to connect layer 1 and layer 2.

Fill in the whole table, and you should see something interesting. Fill in the whole table, and you will have gone through some necessary steps to consider your question.


Hi,

Thanks for taking the time to reply, and I hope you had a great weekend. I'm still struggling with this and feeling rather stupid/confused now, so I'm sorry in advance.

Just to clarify so I’m not misunderstanding something, when you mention “(+∞,+∞)” you mean on the figure we have +∞ on what would typically be called the y-axis and +∞ on what would typically be called the x-axis? If so, then yes I agree.

I don't follow. I understand that because of the sigmoid we jump from 0 to 1, and on the decision boundary line we will be at 50%, or 0.5 if you prefer.

:frowning_face: … looks like I'm not that smart :pleading_face:

I took a guess at answering the table (see the note below as to why I'm not sure I got the correct answers); see below. I don't know how to answer the 2nd row other than that it seems common sense it would be the opposite of the first row, so: No, No, No, Yes, Yes.

Note: You stated "the 1-side of layer 1 unit x's", but why are we adding this to b2? In the example you quoted, you added the -32.79, which is a weight from layer 2, so I'm confused as to how to correctly populate this table. To answer the first row, I've just visually assessed whether it looked like the removal of the blue zone played a part in the final graphic.

I thought taking a week away from this would let me come back to it with a fresh set of eyes, but it doesn't appear to have helped.


Hello @Frazer_Mann,

Please don't feel discouraged :wink: It is very common to need several tries, and I believe that, as we learn, the process is as important as the outcome (the answer), and probably matters more for our future learning on a new question, because probably only the process can be replicated. While I can do my part, you are actually the >99% shareholder of the process ;), so enjoy and make the most of it!

Yes. That's the idea. This is a way of thinking, and in this way we consider the extreme case. If you don't like "(+∞,+∞)", we can use "(+100,000, +100,000)", just any pair of large numbers. Do you have any real-life experience where an extreme case helped you illustrate your ideas to others?

If the inputs are "(+100,000, +100,000)", then after the sigmoid the output is between 0 and 1; what about before the sigmoid?

In the most extreme case, if the inputs are “(+∞,+∞)”, what would it be before and after the sigmoid respectively?

It is too soon to say anything; this is just our first exchange.

image

Can you explain your logic for the circled "No" for unit 3 above?

Here is my explanation for the "Yes" in the unit 2 column, for your reference:

image

  1. The equation of layer 1 unit 2 (the 3rd unit) is Sigmoid(12.89 x1 + 10.79 x2 + 1.01).

image

  2. The 1-side spans the upper-right corner of the graph.

  3. Based on 2, to consider an extreme case on the 1-side, we pick (+100,000, +100,000).

  4. Plug the extreme case into the equation in 1.

  5. 12.89 x1 + 10.79 x2 + 1.01 = 2,368,001.01

  6. Sigmoid(2,368,001.01) = 1

  7. To answer "Is the 1-side of layer 1, unit x's contribution to the final prediction plus b2 < 0?":

image

  8. Look at the weight in layer 2 that multiplies the 3rd unit's output.

  9. The weight is -32.79.

  10. Remember from step 6, the 3rd unit outputs "1".

  11. -32.79 * 1 + b2 = -32.79 * 1 + 15.74 = -17.05

  12. Go back to the question: "Is the 1-side of layer 1, unit x's contribution to the final prediction plus b2 < 0?"

  13. -17.05 is less than zero, so the answer is Yes!
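Those steps can be verified in a few lines (all numbers taken from the steps above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The unit's pre-activation at the extreme point (+100,000, +100,000)
z = 12.89 * 100_000 + 10.79 * 100_000 + 1.01
print(round(z, 2))              # 2368001.01
print(sigmoid(z))               # 1.0

# The unit's output times the layer-2 weight, plus b2
contribution = -32.79 * sigmoid(z) + 15.74
print(round(contribution, 2))   # -17.05, i.e. < 0, so the answer is "Yes"
```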

This is, again, too soon to tell.

Follow my logic above and you can fill in the table. Perhaps try to fill in the following table again, and then explain the logic behind your answers for unit 3?

No worries :wink: It is your learning, and it is your decision to make :wink: If I were you, to make the most of that time, I would also consider exploring other possible ways to find an answer to your question. Feel free to do so, and we can come back later.

Cheers,
Raymond

@Frazer_Mann, after you re-fill the table with the steps I provided, please share your table, along with similar steps showing how you filled in the two values for layer 1 unit 3 (you may simplify them), so that I can follow your steps and see if everything's alright.

Please do take your time :wink: I won’t be here all day anyway.

Thanks,
Raymond

Hi Raymond,

Thanks for the reply and the positive encouragement. I've tried to answer your questions below. Let me know if I have understood things correctly, and then I'll try to discuss why certain things are happening.

Before the sigmoid, we have to normalise the values, so they will then be clustered around the origin. I'm not sure what more can be deduced from them, other than that the weights would be very large if we didn't normalise, which, if I remember correctly, is not ideal for convergence etc.

I made a plot of the decision boundary and experimented with the various weights.

If w1 and w2 are positive, the slope of the decision boundary is negative and the top right region is the “1” region of the sigmoid.

If w1 becomes negative, this changes the sign of the slope, and the "1" region of the sigmoid is still on the same side of the decision line (i.e. the top of the page).

If w2 becomes negative, it flips the decision boundary line around the x-axis. This therefore results in the "1" region of the sigmoid being below the line, if it was originally above the line before we changed the sign of w2.

Changes to b shift the decision boundary along the y-axis, which I believe makes sense, as it represents where the plane intercepts the z-axis (at least I believe it does).

I don't know if this is correct, but my working is below.
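Those observations match what you get by solving the boundary equation directly: setting z = w1·x1 + w2·x2 + b = 0 and solving for x2 gives slope -w1/w2 and intercept -b/w2. A tiny sketch with arbitrary illustrative weights (not values from the lab):

```python
def boundary(w1, w2, b):
    """Slope and x2-intercept of the line w1*x1 + w2*x2 + b = 0."""
    return -w1 / w2, -b / w2

# Both weights positive -> negative slope ("1" region to the upper right)
print(boundary(2.0, 4.0, 0))    # (-0.5, 0.0)
# Flipping the sign of w1 flips the slope
print(boundary(-2.0, 4.0, 0))   # (0.5, 0.0)
# Changing b shifts the line up or down
print(boundary(2.0, 4.0, -8.0)) # (-0.5, 2.0)
```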

And the updated table.

image

Can you let me know if this is correct, please? I changed the (100,000, 100,000) for each unit to ensure it was either in the 1 region or the 0 region.

Rgds

Fraz

Hey Fraz @Frazer_Mann!

Great work! Your approach went beyond my initial imagination. Thank you for your work! As I read it, I can see why you made the first 3 boundary plots: because you needed them to do the following.

This is the very right thing to do!!

The only very minor thing I can find is that “exp(z)” should be “exp(-z)”:

image

However, your math is absolutely correct!

I also read every number in your last table, and they all look correct to me!

Now, for the text of your reply:

Let me tell you my initial thought. In the following example (image), before the sigmoid we have z = -4, and after the sigmoid, 0.0179.

So, if the inputs are "(+100,000, +100,000)", using w_1 = w_2 = 2 and b_1 = 0, then before the sigmoid we have z = 2 × 100,000 + 2 × 100,000 + 0 = 400,000, and after the sigmoid, sigmoid(400,000) = 1.
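Both of those computations can be reproduced directly, with no normalization step anywhere:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(round(sigmoid(-4), 4))    # 0.018 (the 0.0179... from the example)

z = 2 * 100_000 + 2 * 100_000 + 0
print(sigmoid(z))               # 1.0
```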

There is no normalization in between ;), just as there was no normalization in your Excel when computing image, right?

Very correct!

Yes!

Feel free to say anything you like to :wink: After clearing up this part, the next thing we are going to focus on will be:

In your first post in this thread, you questioned the additional neurons (units 3 & 4), and in the table, we can see that they show different behavior. Before I go on, I would really, really, really like to know if you want to say anything about that different behavior. Anything at all. Anything you may think of that explains their importance in the final outcome of layer 2. How much do they matter (as individuals, and as members of the inputs to layer 2) in the final outcome? And perhaps, if you would like to try: why would I ask to add b2? Don't be limited by how I pose the questions :wink:

I look forward to your next post, but still, again, please take your time :wink:

Cheers,
Raymond
