Problem with tensor shape when implementing a custom loss function for my model in TensorFlow

I picked up the idea of triplet loss and global orthogonal regularization from this paper: http://cs230.stanford.edu/projects_fall_2019/reports/26251543.pdf. However, I keep running into a tensor shape error.
After defining modelv1 as the base model (modelv1 takes input of shape (None,224,224,3) and returns a tensor of shape (None,64)), the complete model is defined as follows:

import tensorflow as tf
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

# each training sample is a triplet (anchor, positive, negative) of 224x224x3 images
input_shape=(3,224,224,3)
input_all=Input(shape=input_shape)
input_anchor=input_all[:,0,:]
input_pos=input_all[:,1,:]
input_neg=input_all[:,2,:]
# the same base network modelv1 embeds each image into a 64-d vector
output_anchor=modelv1(input_anchor)
output_pos=modelv1(input_pos)
output_neg=modelv1(input_neg)
model=Model(inputs=input_all,outputs=[output_anchor,output_pos,output_neg])

The formula for triplet loss with global orthogonal regularization, as provided in the paper I mentioned above is:
[image: formula for the loss function]
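Written out from my implementation below (my reading of the paper's formula rather than a verbatim copy of its notation), the loss is:

$$
\mathcal{L}=\sum_i \max\big(\lVert a_i-p_i\rVert_2^2-\lVert a_i-n_i\rVert_2^2+m,\,0\big)+\alpha\big(M_1^2+M_2\big)
$$

$$
M_1=\sum_i a_i^\top n_i,\qquad M_2=\sum_i \max\big((a_i^\top n_i)^2-\tfrac{1}{d},\,0\big)
$$

where $a_i$, $p_i$, $n_i$ are the 64-d anchor, positive, and negative embeddings, $m=0.4$ is the margin, $\alpha=1.1$, and $d=64$.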

I implemented this formula as follows:

def triplet_loss_with_margin(margin=0.4,d=64,alpha=1.1):
    def triplet_loss(y_true,y_pred):

        """
         Implementation of the triplet loss as defined by formula (3)

            Arguments:
            y_true -- true labels, required when you define a loss in Keras, you don't need it in this function.
            y_pred -- python list containing three objects:
                    anchor -- the encodings for the anchor images, of shape (None, 64)
                    positive -- the encodings for the positive images, of shape (None, 64)
                    negative -- the encodings for the negative images, of shape (None, 64)

        Returns:
        loss -- real number, value of the loss
        """
        anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]

        # Step 1: Compute the (encoding) distance between the anchor and the positive
        pos_dist = tf.math.reduce_sum(tf.math.square(tf.math.subtract(anchor,positive)),axis=-1)
        # Step 2: Compute the (encoding) distance between the anchor and the negative
        neg_dist = tf.math.reduce_sum(tf.math.square(tf.math.subtract(anchor,negative)),axis=-1)
        # Step 3: subtract the two previous distances and add the margin.
        basic_loss = tf.math.add(tf.math.subtract(pos_dist,neg_dist),margin)
        # Step 4: Take the maximum of basic_loss and 0.0. Sum over the training examples.
        loss = tf.math.reduce_sum(tf.math.maximum(basic_loss,0.0))

        # add the global orthogonal regularization term
        # the diagonal of anchor @ negative^T gives the dot product a_i . n_i for each matching pair
        dot_product=tf.matmul(anchor,tf.transpose(negative))
        multiply_2_vectors_value=tf.linalg.diag_part(dot_product)

        
        M1=tf.math.reduce_sum(multiply_2_vectors_value,axis=-1)
      
        M2=tf.math.square(multiply_2_vectors_value)
        M2=tf.math.maximum(tf.math.subtract(M2,1/d),0.0)      
        M2=tf.math.reduce_sum(M2,axis=-1)
        
        loss+=alpha*(tf.math.square(M1)+M2)

        return loss

    return triplet_loss
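For completeness, I then compile the model with this loss, roughly like this (the optimizer is just an example):

model.compile(optimizer='adam',
              loss=triplet_loss_with_margin(margin=0.4,d=64,alpha=1.1))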

I assumed that since anchor and negative both have shape (None,64), this approach should work. However, when I trained the model, I encountered the error below:

ValueError: in user code:

    /opt/conda/lib/python3.7/site-packages/keras/engine/training.py:853 train_function  *
        return step_function(self, iterator)
    /tmp/ipykernel_24/1319124991.py:34 triplet_loss  *
        dot_product=tf.matmul(anchor,tf.transpose(negative))
    /opt/conda/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py:206 wrapper  **
        return target(*args, **kwargs)
    /opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:3655 matmul
        a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
    /opt/conda/lib/python3.7/site-packages/tensorflow/python/ops/gen_math_ops.py:5714 mat_mul
        name=name)
    /opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:750 _apply_op_helper
        attrs=attr_protos, op_def=op_def)
    /opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py:601 _create_op_internal
        compute_device)
    /opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:3569 _create_op_internal
        op_def=op_def)
    /opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:2042 __init__
        control_input_ops, op_def)
    /opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:1883 _create_c_op
        raise ValueError(str(e))

    ValueError: Shape must be rank 2 but is rank 1 for '{{node triplet_loss/MatMul}} = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false](triplet_loss/strided_slice, triplet_loss/transpose)' with input shapes: [64], [64].

From what I understand, the error occurs because at dot_product=tf.matmul(anchor,tf.transpose(negative)), anchor and negative only have shape (64,), which matmul cannot handle. But shouldn't anchor and negative be of shape (batch_size,64)? I really cannot understand what I did wrong. Could you please enlighten me about this? Thank you.

I tried to debug by implementing an independent function to test:

def triplet_loss(y_pred):

        """
         Implementation of the triplet loss as defined by formula (3)

            Arguments:
            y_pred -- python list containing three objects:
                    anchor -- the encodings for the anchor images, of shape (None, 64)
                    positive -- the encodings for the positive images, of shape (None, 64)
                    negative -- the encodings for the negative images, of shape (None, 64)

        Returns:
        loss -- real number, value of the loss
        """
        anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]

        # Step 1: Compute the (encoding) distance between the anchor and the positive
        pos_dist = tf.math.reduce_sum(tf.math.square(tf.math.subtract(anchor,positive)),axis=-1)
        # Step 2: Compute the (encoding) distance between the anchor and the negative
        neg_dist = tf.math.reduce_sum(tf.math.square(tf.math.subtract(anchor,negative)),axis=-1)
        # Step 3: subtract the two previous distances and add the margin.
        basic_loss = tf.math.add(tf.math.subtract(pos_dist,neg_dist),0.4)
        # Step 4: Take the maximum of basic_loss and 0.0. Sum over the training examples.
        loss = tf.math.reduce_sum(tf.math.maximum(basic_loss,0.0))

        # add regularization term
        
        print("anchor shape: ",anchor.shape)
        print("neg shape: ",negative.shape)
        
        dot_product=tf.matmul(anchor,tf.transpose(negative))
        multiply_2_vectors_value=tf.linalg.diag_part(dot_product)

        
        M1=tf.math.reduce_sum(multiply_2_vectors_value,axis=-1)
        
        M2=tf.math.square(multiply_2_vectors_value)
        M2=tf.math.maximum(tf.math.subtract(M2,1/64),0.0)
        M2=tf.math.reduce_sum(M2,axis=-1)
        
        loss+=1.1*(tf.math.square(M1)+M2)
    
        return loss
  

And it works fine with the dummy tensor I passed to it:

dummy=tf.random.uniform((1,3,224,224,3))
re_dum=model.predict(dummy)
test=triplet_loss(re_dum)

re_dum is a list of 3 elements, each a tensor of shape (1,64), and test is a number. So this little test shows that there is no problem with my implementation. But why does the error keep showing up?

Besides, when I replace

 dot_product=tf.matmul(anchor,tf.transpose(negative))

with

dot_product=tf.matmul(tf.expand_dims(anchor,axis=0),tf.transpose(tf.expand_dims(negative,axis=0)))

The error disappeared, but it is very perplexing to me why it works.
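To see what expand_dims actually changes, I reproduced the behaviour on plain rank-1 tensors in a small standalone sketch (the tensors here are just random placeholders):

import tensorflow as tf

a=tf.random.uniform((64,))  # rank 1, like what the loss apparently receives
b=tf.random.uniform((64,))
# tf.matmul needs inputs of rank >= 2, so this raises a rank error like the one above:
# tf.matmul(a,tf.transpose(b))
# with a leading axis the shapes become (1,64) and (64,1), and matmul returns a (1,1) tensor:
print(tf.matmul(tf.expand_dims(a,axis=0),tf.transpose(tf.expand_dims(b,axis=0))))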

Did you add tf.print to print the shapes of those three unpacked variables and make sure their shapes match your expectation?

anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]
tf.print(tf.shape(anchor), ....)

Btw, if the shapes are not the problem and you can quickly produce a minimal reproducible example that I can run on my machine to see the error, I can help debug it. I will have time for it about 3 hours from now.

Cheers,
Raymond

I did run a dummy tensor of shape (1,3,224,224,3) through it and everything worked as I expected. I can send you the notebook and dataset for convenience.

Dataset
notebook in Kaggle

What about tf.print(tf.shape(anchor), ....)? You think they should be rank 2 (e.g. (batch_size,64)), but are they really? I always find printing the shapes to be very helpful. You can keep those tf.print calls in the function while you use it to train a model, but you may want to run only 1 epoch with a very small amount of data to avoid too many printouts.

Hey @FreyMiggen,

The shapes ain't right. It seems that TensorFlow didn't pass all three outputs into the loss function. I suggest you use tf.keras.layers.Concatenate to combine the three outputs into one single output like (None, 3, 64) and take them apart inside the loss function.
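Roughly something like this (an untested sketch; I'm assuming your modelv1 returns a 64-d embedding):

import tensorflow as tf
from tensorflow.keras.layers import Input, Concatenate
from tensorflow.keras.models import Model

input_all = Input(shape=(3, 224, 224, 3))
out_a = modelv1(input_all[:, 0])  # (None, 64)
out_p = modelv1(input_all[:, 1])
out_n = modelv1(input_all[:, 2])
# stack the three embeddings into one single (None, 3, 64) output
merged = Concatenate(axis=1)([tf.expand_dims(out_a, axis=1),
                              tf.expand_dims(out_p, axis=1),
                              tf.expand_dims(out_n, axis=1)])
model = Model(inputs=input_all, outputs=merged)

def triplet_loss(y_true, y_pred):
    # y_pred now arrives as (batch, 3, 64); split it back into the three parts
    anchor, positive, negative = y_pred[:, 0, :], y_pred[:, 1, :], y_pred[:, 2, :]
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # plus your regularization term, exactly as before
    return tf.reduce_sum(tf.maximum(pos_dist - neg_dist + 0.4, 0.0))

This way Keras hands the whole (batch, 3, 64) tensor to the loss at once, instead of applying the loss to each of the three outputs separately.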

Cheers,
Raymond


Thank you very much for your suggestion. I will try modifying my code along those lines.

Sure. Or concatenate them to (3, None, 64) or (None, 64, 3), whichever way looks better.


I modified it to a shape of (3,None,64) and so far it works, although I need more time to make sure that the model behaves as I expect.
This bug has haunted me my whole weekend! Fully understanding the way TF works is still a challenge for me!
Thank you very much!
Have a good day~

You are welcome @FreyMiggen!