I cannot understand the term “probability of being person XYZ.”
If I give the pixel intensity values of Prof. Andrew Ng’s picture as the x vector (input), then the probability of being the person ‘Prof. Andrew Ng’ is 100%.
So how could the model ever output anything other than 100%?

Or is there something else going on behind the scenes? For example, the pixel intensity values of multiple people’s pictures in the input layer could be compared with a reference picture of Prof. Andrew Ng to detect similarities with Prof. Andrew, so the output could be the probability of being the person ‘Prof. Andrew Ng’.

No, because from the architecture you can clearly see that there is no place for a reference picture, right? Instead, there is just one place for the input picture.

What is behind the scenes is that the model is trained to recognize ONLY photos of Andrew as True and anyone else as False. In other words, the whole model itself is about one single person: Andrew.

So the input picture can be anyone, but the model, which is trained to recognize Andrew, outputs a value that is large if the picture is Andrew, and that value is converted by the sigmoid to fall into the range between 0 and 1, which is part of the reason it can be called a probability.
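As a quick sketch of that last step (plain Python, with made-up raw scores, not the course’s actual model), the sigmoid squashes any raw score into the 0-to-1 range:

```python
import math

def sigmoid(z):
    """Squash any real-valued score into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical raw scores the model might produce before the sigmoid:
score_for_andrew = 4.0         # large positive score -> probability near 1
score_for_someone_else = -3.0  # negative score -> probability near 0

print(sigmoid(score_for_andrew))        # ~0.982
print(sigmoid(score_for_someone_else))  # ~0.047
```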

Because at this stage, “the probability of being the person XYZ” is determined. So the model, which is trained to recognize photos of XYZ as true or not true, will have an impact on this determination.

Suppose I took a picture of ABC as input; now, at the final stage, the probability of being the person XYZ = 0, since ABC and XYZ have no similarities.

If there are some similarities, the probability will be greater than 0.
If the actual person is in input, the probability is 100%.

Thanks for the explanation, which I needed in order to see how you understood it.

The remaining thing I am not certain about is your definition of “this stage” and how you contrast “this stage” with “the model”.

My definition for your “this stage”, based on what I have read, is the stage of producing a^{[2]}, as shown in the image you shared.

Under my definition, “this stage” is just part of the model. In particular, “this stage” is about the “layer 2” of the model.
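For concreteness, here is a minimal sketch of what producing a^{[2]} looks like, assuming a toy two-layer network with made-up sizes and weights (not the actual network from the course):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(a_in, W, b):
    """One fully connected layer with sigmoid activation."""
    return [sigmoid(sum(w_ij * a_j for w_ij, a_j in zip(w_i, a_in)) + b_i)
            for w_i, b_i in zip(W, b)]

# Toy numbers: 3 input "pixels", 2 hidden units, 1 output unit.
x  = [0.2, 0.8, 0.5]                       # input pixels
W1 = [[0.5, -0.3, 0.8], [-0.7, 0.1, 0.4]]  # layer 1 weights (2 units)
b1 = [0.0, 0.1]
W2 = [[1.2, -0.6]]                         # layer 2 weights (1 unit)
b2 = [-0.2]

a1 = dense(x, W1, b1)   # a^[1]: layer 1 activations
a2 = dense(a1, W2, b2)  # a^[2]: the final "probability of being XYZ"
print(a2)
```

So “this stage” (producing a^{[2]}) is just the last `dense` call, while the whole model includes every layer before it.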

Now, coming back to your last question:

My “behind the scenes” focuses on the whole model, not just that stage. It does happen at “this stage”, but it also happens at other stages of the model.

Yes, I think so!

I suppose “actual person” means “ABC”.

Even if you give your trained model a photo of “ABC”, it does not guarantee 100%. Practically, the best we can hope for is that it is much larger than 50%, or that it is closer to 100% than to 0%.

I think my understanding is clearer now.
The pixels are labelled with numbers (0 to 1) in that way so they can undergo the operation of the logistic function and predict the probability of the person being ‘XYZ’.

No matter what the scale of the pixel values is, be it from 0 to 1 or from 0 to 255, our network will convert them to something else before the converted values are fed into the logistic function for a probability output.

It is more usual for us to use 0 to 1 to avoid large numbers in the training process, which may cause other issues.

Therefore, the purpose of scaling to 0 to 1 is not to make the logistic function work.
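A minimal sketch of that rescaling, assuming 8-bit pixel values (the numbers here are illustrative):

```python
# Hypothetical raw 8-bit pixel intensities (0 to 255):
raw_pixels = [0, 64, 128, 255]

# Rescale to 0-1 purely to keep the numbers small during training;
# the logistic function itself accepts any real-valued input.
scaled = [p / 255.0 for p in raw_pixels]
print(scaled)  # [0.0, ~0.251, ~0.502, 1.0]
```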

Hi Farhana - I think it is simpler: if you check the paper at the source, the way they do it is to make the output layer a softmax, the function which emphasizes the strongest signals while normalizing them into probabilities.
So it’s like having a vocabulary of faces that collects the sum of signals, and the activation at the output layer maps it into probabilities.

But there is also a way to make it a binary problem - then you’ll train it with a sigmoid output layer, where 1 means the person is True and everyone else is False.
However, given the 3rd picture at the bottom (as well as the paper) - it’s likely the first option.
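The two options can be contrasted with a small sketch (plain Python, made-up scores, not the paper’s actual code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Turn a list of scores into probabilities that sum to 1."""
    exps = [math.exp(z) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

# Binary setup: one sigmoid output answering "is this Andrew?"
print(sigmoid(2.0))         # a single probability, ~0.88

# Multi-class setup: one softmax score per known face.
scores = [2.0, 0.5, -1.0]   # hypothetical scores for 3 faces
probs = softmax(scores)
print(probs, sum(probs))    # probabilities summing to 1, with a clear leader
```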

And the structure of the output looks like this (usually well below 100% but with a clear leader - this example is from image recognition, not face recognition): https://miro.medium.com/v2/resize:fit:1000/1*QloCzL8cAjWKXGb1ijwyGw.png
My view - it’s good that this is NOT 100%, as the model relates nuances of the object to similar things, so it kind of challenges itself with a broader analysis, right?

Yes, so the model is more focused on the features of [the only] person it’s checking.

The closest example of binary classification (in contrast with multi-class, on your slide) inside the course is the handwritten-digit recognition problem for the digits 0 vs 1 [C2_W1_Practice Lab].

I’ve actually tried to replicate that representation picture for faces (showing how the model creates features to activate neurons).
The full lab with my small experiment is here - Google Colab

You can see it kind of draws contours, and 1-like or 0-like combinations, which help the model guess digits in different possible positions.

The challenge is that it works when mapping Layer 1 onto the Input layer (mid picture)… But moving into the L2 –> L1 mapping, there is a “black-box” heatmap…

But that’s an example on the computer vision side (and by the way, for the 0-9 digits with a softmax layer, L1 is unreadable). The best example of this representation problem (aka how to translate NN-created weights into categories and sub-categories corresponding to its decision-making/classification) that I found in the course is at C2_W1, which is as simple as this:

Thank you for answering Farhan’s last question, and for your wonderful analysis! Your heatmaps drew my attention. While L2 looks like a black box, which I very much agree with, I was wondering (I have not verified this myself) whether it would be informative to show instead the gradient of L2’s output activations with respect to the pixel values of any one selected input image.

The gradient value is kind of like answering “How much does L2’s output change if this pixel of that photo changes?” Sounds like a good approach to assess L2 (and its previous layers) in a less black-box way?

Hi Raymond - thank you for the idea!
You mean kind of a chain-rule/backpropagation of L2 neurons onto the Input layer? Expecting, as a result, a heatmap of changes (instead of the static weights, as above)?
A few challenges I could see:

Technical. The ‘back-reshape-weights’ approach worked well, as you just need to extract weights from TensorFlow. To backpropagate effectively, I need to check if there are some specialized libraries (calculating deltas-from-deltas doesn’t seem as quick).

Impact of L1 & combinations. If we try to do it NOT on the previous layer, any pixel will be connected with an L2 neuron via 25 L1 mediators.
So it’s like pixel-1 will have a projection of neuron L2_1: via L1_1, then via L1_2, and so on.
==> The gradient you are suggesting should deliver some composition.
But when training: a) aren’t we expecting the gradients to be smaller (the learning curve approaching a minimum)? b) should it be some sequence of gradients to see the heatmap of changes (not just, let’s say, one delta = 0.001, then backpropagate)? c) is there an impact of the L1 “filters” when we “project” L2 onto the 20x20 input via L1?

I think Andrej Karpathy has a good analogy for what was done in the paper with faces (he’s done it for cars). Which is: those Input-to-L1 weights are kind of filters/templates (set during training).
And when you re-shape, i.e. project, the L2(n) weights back onto the input picture, it basically shows some composition of what this filter is “looking at”.
When doing the gradient, it’s honestly not as easy to imagine this filter landing on pixels through the impact of L1.

You can use TF’s auto differentiation! The following (from the docs) shows dy/dx. You need \frac{{\partial\text{ L2 activation}}}{{\partial\text{ input pixel}}}.
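In case the snippet doesn’t come through, the dy/dx example from TensorFlow’s autodiff guide looks roughly like this (assuming TensorFlow 2.x):

```python
import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2  # y = x^2

# dy/dx = 2x = 6.0 at x = 3
dy_dx = tape.gradient(y, x)
print(dy_dx.numpy())

# The same tape mechanism can watch an input image tensor instead, and
# return d(L2 activation)/d(input pixel) for a chosen hidden unit.
```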

Not sure if I follow you. \frac{{\partial\text{ L2 activation}}}{{\partial\text{ input pixel}}} cannot be computed without the previous layer.

The loss is at its minimum, but we don’t know what \frac{{\partial\text{ L2 activation}}}{{\partial\text{ input pixel}}} should be.

Definitely yes. We can’t compute \frac{{\partial\text{ L2 activation}}}{{\partial\text{ input pixel}}} without L1 - this is how the chain rule works.
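To make the chain-rule point concrete, here is a tiny sketch (plain Python, made-up scalar weights) of ∂(L2 activation)/∂(input pixel) for a 1-input, 1-hidden-unit, 1-output network; note that the layer-1 terms appear unavoidably in the product:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Made-up scalar weights for a 1-1-1 network.
w1, b1 = 0.9, 0.1    # layer 1
w2, b2 = -1.4, 0.3   # layer 2
x = 0.5              # one "pixel"

a1 = sigmoid(w1 * x + b1)   # layer 1 activation
a2 = sigmoid(w2 * a1 + b2)  # layer 2 activation

# Chain rule: d(a2)/dx = a2*(1-a2) * w2 * a1*(1-a1) * w1.
# The layer-1 quantities (a1, w1) cannot be skipped.
da2_dx = (a2 * (1 - a2)) * w2 * (a1 * (1 - a1)) * w1

# Finite-difference check of the same derivative:
eps = 1e-6
a2_eps = sigmoid(w2 * sigmoid(w1 * (x + eps) + b1) + b2)
print(da2_dx, (a2_eps - a2) / eps)  # the two values should agree
```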

Mine is just a different thing for a different purpose. Don’t consider it a replacement for anything else.

The result needs to be interpreted carefully too – it is conditioned on one particular sample, so any attempt to generalize should be very carefully thought through.

Try it. It takes less than 30 minutes with TF’s auto-differentiation.

I had done \frac{{\partial\text{ Output activation}}}{{\partial\text{ input pixel}}} previously and it was quite interesting. I have not tried \frac{{\partial\text{ hidden layer activation}}}{{\partial\text{ input pixel}}} though.

Not in class yet, but, looking through the thread, has anyone checked what happens if at layer 2 you use a^{[2]} > .5 instead of >= .5? Just wondering how that impacts things downstream because of the 0 or 1 result. > .5 says yes, there’s a chance. >= .5 says yes, there’s a chance, except in the case where it equals .5, and then it’s ambiguous. Probably a silly question.