Difference between y and y hat

Hello everyone,
I’d like to know what each of these terms means and the difference between them:
w, b, y, and \hat{y} = a
Thank you

It sounds like you are asking about Logistic Regression in Week 2.

The goal is to make predictions based on input data. In order to train the “model”, we are given a collection (dataset) full of pairs of values: x and y.

x is the input data sample and is formatted as a vector with dimensions n_x x 1, where n_x is the number of “features” or elements in each input vector. For example, the image data that we use in the Logistic Regression exercise has 12288 elements because it is “unrolled” from an RGB image that is 64 x 64 x 3 (64 x 64 pixels, each of which has 3 color values).
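In NumPy, that “unrolling” is just a reshape. Here is a minimal sketch, with a random array standing in for a real image from the dataset:

```python
import numpy as np

# Hypothetical stand-in for one 64 x 64 RGB image from the dataset.
image = np.random.rand(64, 64, 3)

# "Unroll" it into a single column vector of shape (n_x, 1).
x = image.reshape(-1, 1)
print(x.shape)   # (12288, 1), since 64 * 64 * 3 = 12288
```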

y is the “label” that corresponds to x. So it is the “correct answer”, which is either 0 (no) or 1 (yes). In our case here, 1 means the image x is a picture of a cat and 0 means x is not a picture of a cat.

Now that we understand the input data, here is how we make a prediction:

We have another vector w of “weights” that is the same size as x. We first perform the “linear combination” of w and x and then we add the bias value b, which is just a scalar:

z = \displaystyle \sum_{i = 1}^{n_x} w_i x_i + b
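Written out element by element, that sum is just a loop over the n_x entries. This is purely illustrative, with made-up values; the vectorized version appears further down:

```python
import numpy as np

n_x = 12288
w = np.random.randn(n_x, 1) * 0.01   # weights, same shape as x
x = np.random.rand(n_x, 1)           # one made-up input sample
b = 0.0                              # bias, a scalar

# Element-by-element version of the sum above.
z = b
for i in range(n_x):
    z += w[i, 0] * x[i, 0]
```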

That sum gives us a single real number z as the output. We then want to convert z into a number between 0 and 1 that we can interpret as a probability. To do that, we feed it through the sigmoid function:

\sigma(z) = \displaystyle \frac {1}{1 + e^{-z}}
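If you want to see what the sigmoid does numerically, here is a small sketch (not the course’s utility function, just an illustration):

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # ~0.99995, close to 1
print(sigmoid(-10.0))  # ~0.00005, close to 0
```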

The final output or prediction of the model is called either a or \hat{y}:

\hat{y} = a = \sigma(z)

The way we interpret \hat{y} is that if it is >= 0.5, then the model is predicting that the input sample is a “yes” (a picture of a cat in our particular case). If \hat{y} < 0.5, then the model is predicting that the input sample is classified as a “no” (not a cat in our example).
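Here is a tiny sketch of that thresholding step, using a made-up value of z:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.3                                  # made-up value of w^T x + b
y_hat = sigmoid(z)                       # ~0.574
prediction = 1 if y_hat >= 0.5 else 0    # 1 = "cat", 0 = "not a cat"
print(y_hat, prediction)
```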

The goal is that, for a given input x, \hat{y} is as close as possible to the corresponding correct answer given by the label y for that x. The big question, now that we have defined all of this, is: how do we find the w and b values such that the computations described above give accurate predictions? That is what training with Back Propagation and Gradient Descent is all about.

The one additional point to make here is that if we express the linear combination formula above as a vector operation, it is the following:

z = w^T \cdot x + b

The reason we need the transpose there is that both w and x are formatted as n_x x 1 column vectors. It is thus necessary to transpose w so that it becomes 1 x n_x in order for the dot product to work. Note that Prof Ng could have chosen to define w as a row vector, but he uses the convention that all standalone vectors are column vectors.
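In NumPy, that vectorized version looks something like this (again just a sketch with made-up values), and you can see the shapes work out once w is transposed:

```python
import numpy as np

n_x = 12288
w = np.random.randn(n_x, 1) * 0.01   # (n_x, 1) column vector of weights
x = np.random.rand(n_x, 1)           # (n_x, 1) column vector, one sample
b = 0.0

z = np.dot(w.T, x) + b               # (1, n_x) dot (n_x, 1) -> (1, 1)
print(z.shape)                       # (1, 1), effectively a scalar
```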

Also please note as a general matter that everything I said above was covered in the lectures, although maybe it was spread out over several lectures. If what I said above does not make sense, you might want to watch the Week 2 lectures again.
