Face recognition - Understanding img_to_encoding code

In section 5.1 - Face Verification, the following function is defined:

import numpy as np
import tensorflow as tf

#tf.keras.backend.set_image_data_format('channels_last')
def img_to_encoding(image_path, model):
    # Load the image and resize it to the model's expected input size
    img = tf.keras.preprocessing.image.load_img(image_path, target_size=(160, 160))
    # Scale pixel values from [0, 255] to [0, 1]
    img = np.around(np.array(img) / 255.0, decimals=12)
    # Add a batch dimension: (160, 160, 3) -> (1, 160, 160, 3)
    x_train = np.expand_dims(img, axis=0)
    embedding = model.predict_on_batch(x_train)
    # Normalize the embedding to unit length
    return embedding / np.linalg.norm(embedding, ord=2)

I would appreciate it if someone could explain the code above to me, i.e.:

  1. What is the commented code doing in the 1st line?
  2. Any special reason for the around method to be used with 12 decimals?
  3. Although I know what expand_dims does, why is this used to “re-calculate” img?
  4. Why is the predicted image (embedding) divided by its L2-norm to produce the result of the function?

Many thanks for the help.

The point is that there are two standard orientations for image tensors:

“channels last”, which means m x h x w x c

“channels first”, which means m x c x h x w

In these courses we always use “channels last” mode, which is the default for most of the TF functions; that is probably why they left that line commented out.
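To make that concrete, here is a small sketch (nothing assumed beyond the standard tf.keras API) showing the default data format and what each mode means for a batch of 32 RGB 160 x 160 images:

import tensorflow as tf

# 'channels_last' is the default in tf.keras
print(tf.keras.backend.image_data_format())  # -> 'channels_last'

# For a batch of 32 RGB images of size 160 x 160:
#   channels last:  (32, 160, 160, 3)  i.e. m x h x w x c
#   channels first: (32, 3, 160, 160)  i.e. m x c x h x w

# The commented-out line would just set the default explicitly:
tf.keras.backend.set_image_data_format('channels_last')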

I don’t know the real answer there. It’s certainly plausible that you’re wasting your time with more than 12 decimal places of resolution. Color values only have 256 choices per channel, right? But I would have thought the point is that you want to be doing 32 bit FP arithmetic instead of 64 bit to save memory and compute cost. The clearer way to do that would be to use tf.cast to float32.
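Here is a rough sketch of what I mean, just for illustration (the file name is a placeholder):

import numpy as np
import tensorflow as tf

img = tf.keras.preprocessing.image.load_img('example.jpg', target_size=(160, 160))

# What the notebook does: rounding in float64
x64 = np.around(np.array(img) / 255.0, decimals=12)  # dtype float64

# The more direct route: cast to float32 and scale
x32 = tf.cast(np.array(img), tf.float32) / 255.0     # dtype float32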

The point is that we are feeding a single image to the model, but the model is trained on 4D input tensors, right? It is defined to take batches of images as input. So we have to convert our single image (a 3D tensor) to look like a “batch of one”, which is the purpose of that expand_dims call.
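You can see the effect with a dummy array standing in for a real image:

import numpy as np

img = np.zeros((160, 160, 3))        # one image: h x w x c
batch = np.expand_dims(img, axis=0)  # a "batch of one": m x h x w x c
print(batch.shape)                   # (1, 160, 160, 3)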

This one is interesting. I would have thought that the definition of an embedding is that it’s a unit vector. So I’d expect the model to emit embeddings with length one. But if you’re not sure, then it does no harm to divide by the norm, right? Worst case you just wasted a bit of compute to create the same answer. :nerd_face: You can try adding an extra line before the return to print the L2 norm of the resultant embedding and maybe you’ll find that we’re wrong and the model does not normalize them for us.
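For example, the same function with one extra print line (a quick experiment, not part of the official notebook):

import numpy as np
import tensorflow as tf

def img_to_encoding(image_path, model):
    img = tf.keras.preprocessing.image.load_img(image_path, target_size=(160, 160))
    img = np.around(np.array(img) / 255.0, decimals=12)
    x_train = np.expand_dims(img, axis=0)
    embedding = model.predict_on_batch(x_train)
    print('norm =', np.linalg.norm(embedding, ord=2))  # inspect the raw norm
    return embedding / np.linalg.norm(embedding, ord=2)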

I added a print statement in the img_to_encoding function to print the norm of each embedding returned by the model, and it turns out they are not unit length:

norm = 9.145319938659668
norm = 9.275681495666504
norm = 6.319966793060303
norm = 7.333863735198975
norm = 6.1986918449401855
norm = 3.8730342388153076
norm = 3.8907790184020996
norm = 4.685695648193359
norm = 11.39218807220459
norm = 7.841488838195801
norm = 3.648789167404175
norm = 5.053682327270508

So that normalization step of dividing by the norm is required to produce unit length embeddings.
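You can convince yourself with a made-up vector (the numbers here are arbitrary, not real embeddings):

import numpy as np

raw = np.array([[1.0, 2.0, 2.0]])    # pretend embedding; its norm is 3
unit = raw / np.linalg.norm(raw, ord=2)
print(np.linalg.norm(unit, ord=2))   # -> 1.0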

Thanks, Paul, for your further input! Much appreciated.