Why does it need to sum over the 128 dimensions?

In the lecture on Face Verification and Binary Classification, the loss function is presented below:
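
If I'm reading the slide correctly, the expression is:

$$\hat{y} = \sigma\left(\sum_{k=1}^{128} w_k \left| f\left(x^{(i)}\right)_k - f\left(x^{(j)}\right)_k \right| + b\right)$$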

Since f(x) is just the 128-dimensional vector of an image’s encoding, and computing the difference of two vectors gives another vector, why does it need to sum over k? Can it just be f(x_i) - f(x_j), i.e. one vector minus another vector? Or is cosine similarity used here?

If you take the difference of two vectors with 128 elements, then the result is also a vector with 128 elements, right? But note that what they are showing above is not a loss function: it is generating a binary prediction value \hat{y}. The loss will come from comparing \hat{y} to the labels.

The point is that they are taking the “embedding” vectors produced from the input face images and then learning a function that does a good job of mapping the differences in the various elements of the embeddings into a “yes/no” answer to the question “are these input images of the same person’s face?”. That answer is a scalar, so you need an operation that converts the 128-entry vectors into a scalar. The operation they have chosen in this particular case is shown above.
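
To make that concrete, here is a minimal NumPy sketch of that computation (the names `w`, `b`, and the random embeddings are just placeholders, not the actual course code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def same_person_probability(f_xi, f_xj, w, b):
    """Map two 128-d embeddings to a single scalar prediction y_hat.

    f_xi, f_xj: embedding vectors, shape (128,)
    w:          learned weights, shape (128,)
    b:          learned scalar bias
    """
    diff = np.abs(f_xi - f_xj)   # still a 128-element vector
    z = np.dot(w, diff) + b      # the sum over k collapses it to a scalar
    return sigmoid(z)            # y_hat in (0, 1): probability of "same person"

# Toy usage with random stand-ins for the embeddings and parameters
rng = np.random.default_rng(0)
f_xi, f_xj = rng.normal(size=128), rng.normal(size=128)
w, b = rng.normal(size=128), 0.0
print(same_person_probability(f_xi, f_xj, w, b))
```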

There are other alternatives of course: e.g. they could have just taken the 2-norm of the difference vector. Please listen again to what Prof Ng says in the lectures to understand more about the tradeoffs here.
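
For comparison, a 2-norm version would collapse the difference vector into a single scalar distance and then compare it to a threshold. A rough sketch (the 0.7 threshold and random embeddings are just illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
f_xi, f_xj = rng.normal(size=128), rng.normal(size=128)

# Collapse the 128-d difference vector into one scalar distance,
# then threshold it to get the "same person" decision.
distance = np.linalg.norm(f_xi - f_xj)
same_person = distance < 0.7
print(distance, same_person)
```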

Also note that there’s a reading item before that lecture that gives a correction: the coefficients on that linear combination should be w_k, not w_i.

Yes, simply doing a vector subtraction is going to be another vector!