How do we compute loss between the outputs of two layers?

from tensorflow.keras.layers import Conv2D, MaxPooling2D

x1 = Conv2D(32, (3, 3), activation='relu', padding='same')(inputs)
x1 = MaxPooling2D((2, 2), padding='same')(x1)
x2 = Conv2D(64, (3, 3), activation='relu', padding='same')(x1)
x2 = MaxPooling2D((2, 2), padding='same')(x2)
x3 = Conv2D(128, (3, 3), activation='relu', padding='same')(x2)
x3 = MaxPooling2D((2, 2), padding='same')(x3)

Say I wish to compute an L2 loss between x1 and x3. How would I do that? I want the deeper layers to guide the feature maps of the intermediate layers. Since x1 and x3 have different shapes, it is hard to compute a loss between them directly, and simple reshaping does not help. Please help!


A “loss” involves computing the difference between some expected and predicted values.

Do you have any expected values for your intermediate layers?

I don’t think a comparison between x1 and x3 would be considered a “loss”. And if the number of units is different, such a comparison is not possible.


I’m trying to make the feature maps of the shallow layers learn from the feature maps of the deeper layers (self-distillation). This method computes such a loss.

Sorry, I have never heard of “self-distillation” and will google it in a second, so maybe I’m not the right person to answer this question. But “back propagation” is the normal way that the earlier layers learn from what is happening in the later layers, right? That’s exactly what back propagation does: the loss is computed at the output layer and then the gradients are calculated at every layer and that enables us to adjust the parameters at all layers using the Chain Rule. The gradients at each layer show how the current layer’s parameters need to change in order to get better results at the next layer.
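To make that chain-rule description concrete, here is a toy illustration (not the self-distillation method from the paper, just a hypothetical two-layer scalar "network" y = w2 * relu(w1 * x)): the loss gradient is computed once at the output and then propagated back, giving each layer its own parameter gradient.

```python
# Forward pass of a two-layer scalar network
x, w1, w2, target = 1.0, 0.5, 2.0, 3.0
h = max(w1 * x, 0.0)        # layer 1 (ReLU)
y = w2 * h                  # layer 2
loss = (y - target) ** 2    # squared-error loss at the output

# Backward pass (chain rule), exactly as described above:
dy = 2 * (y - target)       # dL/dy, computed at the output layer
dw2 = dy * h                # gradient for the later layer's parameter
dh = dy * w2                # signal passed back to the earlier layer
dw1 = dh * (1.0 if w1 * x > 0 else 0.0) * x   # earlier layer's gradient
```

Note that the earlier layer's gradient `dw1` is built entirely out of quantities flowing back from the output loss; no separate loss is defined at the intermediate layer.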

I’ll attach a link, I’m having trouble figuring it out myself. If anyone has any inputs please help!
[1905.08094] Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation

Hi, Sara.

Thanks for the link! I took a quick look at the paper, but do not claim to understand enough to implement anything yet. The paper was published in 2019 and they claim in the text that they’ll “soon” release their code on github. Did you try searching on github to see if they actually did publish any implementation code?

Hello @learner1tk,

I can give you some ideas but not the details. However, you can google with keywords like “tensorflow”, “custom loss”, and “multiple outputs” for more discussions.

You need to define a custom loss function/class. A loss function accepts y_true and y_pred as input arguments.

y_true is provided by you in the training data, while y_pred is the output of the network.

You need to think about how to construct the network’s outputs so that everything the custom loss function needs, e.g. both x1 and x3, gets passed into it.

Then you implement a loss function that takes all of those outputs into account.
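Before any such loss can be computed, the two feature maps have to be brought to the same shape. A minimal NumPy sketch of that step (all names here are hypothetical, not from any particular library): downsample x1 spatially by average pooling until it matches x3, project its channels with a matrix standing in for a trainable 1x1 convolution, then take the mean squared difference.

```python
import numpy as np

def avg_pool(x, k):
    # Average-pool an (H, W, C) feature map by factor k (H, W divisible by k).
    h, w, c = x.shape
    return x.reshape(h // k, k, w // k, k, c).mean(axis=(1, 3))

def l2_feature_loss(x1, x3, proj):
    # Downsample x1 spatially to x3's resolution, project its channels
    # with `proj` (32 -> 128 here; in a real model this would be a
    # trainable 1x1 conv), and return the mean squared difference.
    k = x1.shape[0] // x3.shape[0]
    x1_small = avg_pool(x1, k)   # e.g. (8, 8, 32)
    x1_proj = x1_small @ proj    # e.g. (8, 8, 128)
    return np.mean((x1_proj - x3) ** 2)

rng = np.random.default_rng(0)
x1 = rng.standard_normal((32, 32, 32))    # shallow feature map (two poolings less deep)
x3 = rng.standard_normal((8, 8, 128))     # deep feature map
proj = rng.standard_normal((32, 128)) * 0.1
loss = l2_feature_loss(x1, x3, proj)
```

In Keras this alignment would live inside the model, with the aligned tensors either exposed as extra model outputs for the custom loss, or combined via `model.add_loss`.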