C2_W1_Lab02_CoffeeRoasting_TF layer functions

Hello!

I have doubts about these 4 plots. When they say that in layer 1 each neuron gives a higher value only if the example has characteristics opposite to those of a “good roast”, it confuses me, because supposedly in the end we choose the highest probability. Shouldn’t it be that the higher the value (probability), the closer we are to a good roast? I ask because >= 0.5 was defined above as a good example (good roast).

yhat = np.zeros_like(predictions)
for i in range(len(predictions)):
    if predictions[i] >= 0.5:
        yhat[i] = 1
    else:
        yhat[i] = 0
print(f"decisions = \n{yhat}")

Hello @Nicolas_Gamarra,

Thank you for the question!

I think you were confused by the graphs and the text below the graphs.

When reading the text, there are a few facts we need to keep in mind:

  1. Only the output from layer 2 is a probability. The outputs from layer 1, however, are NOT probabilities even though they also range from 0 to 1.

  2. The outputs from layer 1 are in the range of 0 to 1 because layer 1 uses sigmoid as its activation function. However, using a sigmoid doesn’t imply probability: a sigmoid output in a hidden layer (layer 1) is not a probability, whereas the sigmoid output in the output layer (layer 2) is.

  3. Scroll up a little to see the weight values for W2:
    [image: W2 = [[-45.71], [-42.95], [-50.19]], b2 = [26.14]]

  4. W2 are all negative. What does being negative mean? It means “reverse”. If layer 1 unit 0’s output is high, then because layer 2’s 0th weight (-45.71) is negative, it ends up contributing a low value to the final probability output. For example, if layer 1 unit 0 outputs a high value of 0.95, unit 1 outputs a high value of 0.89, and unit 2 outputs a high value of 0.93, then the final output probability becomes:
    sigmoid(-45.71 \times 0.95 - 42.95 \times 0.89 - 50.19 \times 0.93 + 26.14)
    = sigmoid(-102.1867)
    \approx 0
    See how high values in layer 1 produce a low value in the output layer? (A short NumPy check of this calculation is sketched below, after point 5.)

  5. Keep in mind that the above conclusion of “high values in layer 1 produce low values in layer 2” is true only because the W2 values are all negative. If any of them were positive, the conclusion would have to change. Do you know how the conclusion should change if W2 became the following? (Note the positive signs. You don’t have to answer me, but if you try, I will respond.)

W2:
 [[-45.71]
  [+42.95]
  [+50.19]]
b2: [26.14]
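
Here is the check mentioned in point 4: a minimal NumPy sketch using the all-negative W2 and b2 from points 3 and 4 (not the sign-flipped version just above); the layer-1 outputs a1 are the made-up “high” values from the example, and the trained weights can differ slightly between runs of the lab.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# W2 and b2 as quoted in points 3 and 4 above (values vary slightly per training run).
W2 = np.array([[-45.71], [-42.95], [-50.19]])
b2 = np.array([26.14])

# Made-up "high" layer-1 outputs from the example in point 4.
a1 = np.array([0.95, 0.89, 0.93])

z2 = a1 @ W2 + b2   # about -102.19
p = sigmoid(z2)     # about 0, i.e. far from a "good roast"
print(z2, p)

# To explore point 5, flip the signs of the last two weights and rerun.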

Cheers,
Raymond

Hi @rmwkwok ,

Thank you for your answers here! Could you still clarify some things?

  1. Why is the output of L1 not a probability, while the output of L2 is? Both of them use the same sigmoid function, which (as a mathematical function) has the same meaning.

  2. What does L2 use as Y during gradient descent? It’s clear what L1 takes as X and Y. And it’s clear that L2 takes A1 as the values for the ‘X’ parameter of the sigmoid. But what does L2 take as its Y values for gradient descent?

  3. The case of ‘1’ switching from good to bad and then back to good - is it just what happened in this particular case with this particular set of data, i.e. accidentally? Or is it part of the system, and this switch is supposed to happen because of the NN structure or something?

Regards,
Alex

Hi Alex @alex_fkh,

  1. Look at which one’s output you put into the logistic loss function. See if you can reason anything out from that.

  2. You know the gradient descent formula, right? We need both the label and the prediction for computing the gradient (a sketch of the formula appears after this list).
    I believe Andrew walks us through how they are used in his video “Computation graph (Optional)” in Course 2 Week 2. Since you are replying to a Week 1 lab, you might want to wait until week 2 for a more detailed answer?

  3. Your point number 3 does not give the context for what “the case” is. Can you clarify where I can read about the case, because I have no idea what you are talking about?
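
For reference on point 2, here is a sketch of the output layer’s gradients, assuming a sigmoid output unit trained with the logistic loss (writing A^{[1]} for layer 1’s outputs, A^{[2]} for layer 2’s outputs, and Y for the training labels):

\frac{\partial J}{\partial W^{[2]}} = \frac{1}{m} (A^{[1]})^{T} (A^{[2]} - Y), \qquad \frac{\partial J}{\partial b^{[2]}} = \frac{1}{m} \sum_{i=1}^{m} (a^{[2](i)} - y^{(i)})

Both the prediction A^{[2]} and the label Y appear in it; the “Y” that L2 uses is simply the same training labels.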

Raymond

Hi @rmwkwok .

I also don’t understand why the output from L1 is not a probability. We run gradient descent for each unit in L1, so answering your question: we put in the output (sigmoid) of the current unit. Could you please clarify this point?

Hello @DagerD,

[image: logistic loss, L(p, y) = -y \log(p) - (1 - y) \log(1 - p)]

Looking at this logistic loss function, which has only two places, y for the label and p for L2’s output, do you still think there is room for L1’s output in it?
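
As a tiny illustration with made-up numbers, using TensorFlow’s built-in binary cross-entropy (this logistic loss averaged over examples), the function takes exactly two inputs, the label and the prediction:

import numpy as np
import tensorflow as tf

y_true = np.array([[1.0]])  # the label
p = np.array([[0.8]])       # L2's sigmoid output, used as the predicted probability

# Only two arguments: the label and L2's output. There is no slot for L1's output.
loss = tf.keras.losses.BinaryCrossentropy()(y_true, p)
print(float(loss))          # about 0.223, i.e. -log(0.8)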

Just because we used L1’s sigmoid output in gradient descent does not make it a probability.

Think about the following:

Our input features are temperature and duration. If we applied a sigmoid to the input layer, given that the features are used in gradient descent just as much as L1’s output is, would that make Sigmoid(temperature) a probability?

I hope your answer is no.

Sigmoid + involvement in gradient descent is not a probability-maker.

Being between 0 and 1 is necessary for a value to be a probability, but not sufficient. Involvement in gradient descent has nothing to do with the qualification of being a probability at all.

The reason I asked this is that it is a good starting point to think. Below I will demonstrate how it can connect to the answer:

Step 1 is just Math.

Step 2, 3, 4, 5 and 6 are by definitions. Read Wikipedia or Google for more on the Binomial distribution yourself if you are not familiar with it.

Step 7 and 8 are what we give.

So, through the use of the logistic loss function, and by providing the label as y and L2’s output as p, we are actually producing a neural network that models p.

p is probability, and so L2’s output is modelled to be probability.

It is L2’s output we give as p, not L1’s output, and not Sigmoid(temperature). :wink:
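
A compressed sketch of that chain of reasoning in equations, assuming the label y follows a Bernoulli distribution (the one-trial case of the Binomial distribution mentioned above) with parameter p:

P(y \mid p) = p^{y} (1 - p)^{1 - y}

-\log P(y \mid p) = -y \log(p) - (1 - y) \log(1 - p)

The second line is exactly the logistic loss, so minimizing the loss over the training set maximizes the likelihood of the labels under a Bernoulli model whose parameter is L2’s output. That is what entitles us to read L2’s output as a probability.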

Cheers,
Raymond

PS: tagging @alex_fkh too.

Sigmoid + involvement in gradient descent is not a probability-maker.

So what is a probability-maker then? Because there is a slide in course one:

I found this topic with a similar question: What does each neuron really do?

And it seems like, for now, I don’t possess all the information needed to cover this question. If so, feel free to let me go through the necessary courses first.

I also found this topic: How do Neurons and activations actually work? - #5 by rmwkwok

And these words made everything clear!

However, we never constrain those intermediate (hidden) layers. Meaning that, we don’t have label data for affordability, and we never constrain any neuron’s output to be consistent with the label data for affordability. Without such constraint, we can’t interpret those hidden layers’ neurons as “affordability” or anything else.

Cool! Happy to know that everything is clear! You know better than anyone what can convince you :slight_smile:

Just a very very little follow up:

The word “constrain”.

The sentence “constrain any neuron’s output to be consistent with the label data for affordability”.

Now, when we train a model, what is the only thing that you constrain? L2’s output. To what? The label. How? By minimizing the cost function.
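
To make that concrete, here is a minimal sketch of a two-layer model with the same shape as the lab’s (the data arrays below are made-up stand-ins, not the lab’s real X and Y). The loss function only ever receives layer 2’s output and the label; layer 1’s output is never compared to any label.

import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

# Made-up stand-ins for the normalized (temperature, duration) features and labels.
X = np.random.rand(200, 2).astype(np.float32)
Y = np.random.randint(0, 2, size=(200, 1)).astype(np.float32)

model = Sequential([
    tf.keras.Input(shape=(2,)),
    Dense(3, activation="sigmoid", name="layer1"),  # hidden layer: outputs in (0, 1), never compared to a label
    Dense(1, activation="sigmoid", name="layer2"),  # output layer: compared to Y by the loss
])

# The only constraint applied during training: make layer2's output match Y
# by minimizing the binary cross-entropy (logistic) loss.
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
)
model.fit(X, Y, epochs=10, verbose=0)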

Cheers,
Raymond

As a follow-up to this query, can you or someone clarify the shaded regions in each unit?

If the shaded region indicates bad roasts for each unit, then isn’t unit 0 focused on low duration regardless of the temperature, unit 1 on high temperatures (in excess of 260), and unit 2 on low temperatures?

Look forward to the response.

No. For example, Unit 0 identifies temperature < about 175 degrees, at any duration.

I think I see the issue: I was commenting on the images as pictured in my lab (attached), but it appears they are skewed. The versions above make sense.

Note that when the system is trained, we don’t control which neuron learns which characteristics. It’s totally random what each hidden layer unit learns. In the image you recently posted, Unit 2 corresponds to what I described in my earlier reply.