Multi Output Regression: Multi Parameters?

Hi there! So in Course 1 Week 1 of TF-AT multi-output regression is reviewed. I can see there’s a functionality to specify different (surely also custom) loss functions for each output. That surely means there’s a set of parameters for each output (I’m thinking that parameters that result from minimizing different loss functions can’t be the same), am I right to assume that? If so, then effectively multi output amounts to two separate regressions under the hood, correct? How is this computationally more effective than simply having two separate regressions? (It may well be more effective, I just don’t understand how or why this would be the case).

I understand that further down the line branching models become a tool in their own right, but at this point, would it be effectively the same to code two separate regressions except for the extra lines of code?

Hi @jondoff

This is a very small model, but in larger models, we find advantages in training times and inference.

The common layers only need to be trained once, on the same data. In cases where training can last hours or even days, the savings in both computation and cost can be very large.

At inference time it is much the same: you get two predictions while running a large part of the layers just once, which saves time.
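
To put rough numbers on that saving, here is a minimal sketch in plain Python that counts multiply-accumulate operations for one forward pass. The layer widths and the dense_macs helper are assumptions made up for illustration, not the sizes of the course model.

```python
def dense_macs(widths):
    """Multiply-accumulate count for one forward pass through a stack of dense layers."""
    return sum(n_in * n_out for n_in, n_out in zip(widths, widths[1:]))

# Hypothetical layer widths, chosen only for illustration.
trunk = [8, 128, 128]   # shared layers: 8 inputs -> 128 -> 128
head = [128, 64, 1]     # each output head: 128 -> 64 -> 1

# One multi-output model: the trunk runs once, then both heads.
shared_model = dense_macs(trunk) + 2 * dense_macs(head)

# Two separate single-output models: each one repeats the trunk.
two_models = 2 * (dense_macs(trunk) + dense_macs(head))

print("multi-output model:", shared_model)
print("two separate models:", two_models)
print("saved per forward pass:", two_models - shared_model)
```

The difference is exactly one pass through the trunk, so the bigger the shared part is relative to the heads, the bigger the saving.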

They are not widely used models, but the technique is well worth knowing, since it also appears elsewhere, for example in Siamese networks, which share part of the model.

Something similar could possibly be done with transfer learning: train a model, replace its last layers, and train those with new data. In that case, though, we would get no advantage at inference time.

I leave you an example that uses this technique on a Kaggle dataset, in which both a classification variable and a regression variable have to be predicted.

I think you are right in your assumption about the parameters: they should fit better with only one variable to predict. But, as always, it is a matter of weighing gains and losses and deciding on the best solution.

Best Regards!


Thank you for the thoughtful reply.

Do you know if there is a set of coefficients, latent states, biases per loss function per layer or one for all? That is I suppose my main question.

When the model branches then there’s no question about each branch having its own parameters. But at the times where different loss functions share the same layers, I’m not so sure. If you are to compute loss at each step of SGD, then surely you must compute two losses, so which one does the model minimise? It’s surely both, so then there’s two resulting sets of parameters per shared layer?

If so, then effectively they’re two separate models sharing “the same” architecture but not one model that predicts two outcomes. Correct?


You can assign a weight to every loss function and pass the weights in the loss_weights parameter when you call compile.

Not sure if this is what you are looking for :slight_smile:

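# one loss per named output
losses = {
    'output_1': 'mean_squared_error',
    'output_2': 'mean_squared_error',
}

# relative weight of each output's loss, in the order of the model's outputs
loss_weights = [2.0, 1.0]

model.compile(optimizer='adam', loss=losses, loss_weights=loss_weights)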

There is only one set of parameters, minimizing both functions according to the weights you have indicated.

I think that it’s just one model predicting two variables. Sometimes it can be accurate enough to compare with two models, and sometimes not.

If you have 2 objective functions that you want to minimise (say one for classification and one for regression), how do you update the coefficients at each step of stochastic gradient descent (or whatever optimiser, adam or rmsprop…)? Should you use the gradient * learning rate of one or the other loss to update coefficients?

The results of both loss functions are used: each is multiplied by its assigned loss weight, and the two are added together. That sum is the value of the loss the model minimises.

The learning_rate then controls how far the model weights move at each update.

I think it is well explained in this article:

You can see that the model reports 3 different losses: one for each output, plus a general loss that combines them and is the one used to decide how to update the weights.

When the weights are updated, the learning_rate scales the step, which is taken in the direction that reduces the value of the loss function in the next iteration.

P.S. Not sure if I’m explaining this well :slight_smile: But, in brief, the model gets just one loss value, like any other model, only it is the combination of the two loss functions indicated.
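
To make the update rule concrete, here is a minimal sketch of one SGD step on a toy two-output model, written in plain Python rather than Keras. The model (one shared weight w feeding two head weights a and b), the squared-error losses, the data point and the learning rate are all assumptions chosen for illustration. It is not how Keras is implemented internally, only the same arithmetic on a tiny scale.

```python
LOSS_WEIGHTS = (2.0, 1.0)    # assumed, in the spirit of loss_weights=[2.0, 1.0]
LEARNING_RATE = 0.05         # assumed
X, T1, T2 = 1.0, 2.0, -1.0   # one made-up input with a target for each output

def forward(w, a, b):
    h = w * X                # shared layer
    return h, a * h, b * h   # hidden value, output 1, output 2

def total_loss(w, a, b):
    _, y1, y2 = forward(w, a, b)
    return LOSS_WEIGHTS[0] * (y1 - T1) ** 2 + LOSS_WEIGHTS[1] * (y2 - T2) ** 2

def sgd_step(w, a, b):
    h, y1, y2 = forward(w, a, b)
    # Derivative of the single weighted loss with respect to each output.
    d_y1 = LOSS_WEIGHTS[0] * 2 * (y1 - T1)
    d_y2 = LOSS_WEIGHTS[1] * 2 * (y2 - T2)
    grad_a = d_y1 * h                   # head 1 only sees loss 1
    grad_b = d_y2 * h                   # head 2 only sees loss 2
    grad_w = (d_y1 * a + d_y2 * b) * X  # shared weight sees both, added together
    return (w - LEARNING_RATE * grad_w,
            a - LEARNING_RATE * grad_a,
            b - LEARNING_RATE * grad_b)

w, a, b = 0.5, 1.0, 1.0
before = total_loss(w, a, b)
w, a, b = sgd_step(w, a, b)
after = total_loss(w, a, b)
print("total loss before:", before, "after:", after)
```

Each parameter is updated exactly once per step, and the shared weight receives a single gradient that already contains the contribution of both losses.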

Thank you for taking the time to search for detailed info man, I appreciate it! It’s a really good and detailed post.

However, I found that:

# initialize our FashionNet multi-output network
model =, 96,

# define two dictionaries: one that specifies the loss method for
# each output of the network along with a second dictionary that
# specifies the weight per loss
losses = {
    "category_output": "categorical_crossentropy",
    "color_output": "categorical_crossentropy",
}
lossWeights = {"category_output": 1.0, "color_output": 1.0}

# initialize the optimizer and compile the model
print("[INFO] compiling model…")
opt = Adam(lr=INIT_LR, decay=INIT_LR / EPOCHS)
model.compile(optimizer=opt, loss=losses, loss_weights=lossWeights,

I’m on my mobile, sorry if it looks a bit off!

In the definition of loss and subsequent training there’s no third loss definition that encapsulates both losses. I think these are effectively two separate models which output two disjointed metrics for colour and garment.

Don’t you think?

Hi @jondof,

I’m using the notebook from the sample. You can see the losses with these lines:

# Specify the optimizer, and compile the model with loss functions for both outputs
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)
model.compile(optimizer=optimizer,
              loss={'y1_output': 'mse', 'y2_output': 'mse'},
              loss_weights=[2, 1],
              metrics={'y1_output': tf.keras.metrics.RootMeanSquaredError(),
                       'y2_output': tf.keras.metrics.RootMeanSquaredError()})

With these metrics, in each epoch you get the general loss and the loss for every branch:
Epoch 1/500
5/62 [=>…] - ETA: 0s - loss: 0.1110 - y1_output_loss: 0.0391 - y2_output_loss: 0.0327 - y1_output_root_mean_squared_error: 0.1978 - y2_output_root_mean_squared_error: 0.1809

If you modify the loss weights, it is really easy to see how the loss values change.
I tried with the weights [2, 1] and [1, 4], obtaining these results:

Loss Weight values: [2, 1]
Epoch 1/500
5/62 [=>…] - ETA: 0s - loss: 0.1110 - y1_output_loss: 0.0391 - y2_output_loss: 0.0327 - y1_output_root_mean_squared_error: 0.1978 - y2_output_root_mean_squared_error: 0.1809

Epoch 500/500
62/62 [==============================] - 1s 8ms/step - loss: 0.0604 - y1_output_loss: 0.0202 - y2_output_loss: 0.0200 - y1_output_root_mean_squared_error: 0.1420 - y2_output_root_mean_squared_error: 0.1416 - val_loss: 0.7472 - val_y1_output_loss: 0.1892 - val_y2_output_loss: 0.3689 - val_y1_output_root_mean_squared_error: 0.4349 - val_y2_output_root_mean_squared_error: 0.6074

Loss Weight values: [1, 4]
Epoch 1/500
5/62 [=>…] - ETA: 0s - loss: 0.2546 - y1_output_loss: 0.0127 - y2_output_loss: 0.0605 - y1_output_root_mean_squared_error: 0.1129 - y2_output_root_mean_squared_error: 0.2459

Epoch 500/500
62/62 [==============================] - 1s 8ms/step - loss: 0.3129 - y1_output_loss: 0.0733 - y2_output_loss: 0.0599 - y1_output_root_mean_squared_error: 0.2707 - y2_output_root_mean_squared_error: 0.2447 - val_loss: 1.6289 - val_y1_output_loss: 0.2147 - val_y2_output_loss: 0.3535 - val_y1_output_root_mean_squared_error: 0.4633 - val_y2_output_root_mean_squared_error: 0.5946

I’m still almost sure that it is only one model :slight_smile:
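
As a sanity check on the logs above, the reported loss can be recomputed as the weighted sum of the two per-output losses. This is a small sketch in plain Python; the numbers are copied from the training lines above, and the tolerance is an assumption that allows for the four-decimal rounding in the printout.

```python
# (loss_weights, y1_output_loss, y2_output_loss, reported total loss)
logged = [
    ((2, 1), 0.0391, 0.0327, 0.1110),  # weights [2, 1], epoch 1
    ((2, 1), 0.0202, 0.0200, 0.0604),  # weights [2, 1], epoch 500
    ((1, 4), 0.0127, 0.0605, 0.2546),  # weights [1, 4], epoch 1
    ((1, 4), 0.0733, 0.0599, 0.3129),  # weights [1, 4], epoch 500
]

TOLERANCE = 5e-4  # assumed slack for values printed with four decimals

gaps = []
for (w1, w2), y1_loss, y2_loss, reported in logged:
    recomputed = w1 * y1_loss + w2 * y2_loss
    gaps.append(abs(recomputed - reported))
    print(f"weights [{w1}, {w2}]: recomputed {recomputed:.4f}, reported {reported:.4f}")

all_match = all(gap < TOLERANCE for gap in gaps)
print("all rows match within tolerance:", all_match)
```

In every row the general loss is the weighted sum of the branch losses up to rounding, which fits with a single loss value driving the updates.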


@gent.spah @Wendy, @jondof and I are having a good time discussing how multiple-output models work, how the loss is calculated, and how the weights are updated.

Please feel free to share your knowledge, because I have some doubts.

I’m sure that it is a single model that computes both variables in one training run, and that the weights are updated using the loss obtained by applying the weight indicated for each specific loss. But, if I’m not misunderstanding, @jondoff believes that there are two models that share “the same” architecture, rather than one model that predicts two outcomes.

Hope you can shed some light on the matter!

Hi @Pere_Martra, I saw this post, and this is what I think about it, as I said in a previous post.

Basically you have one set of weights for the main branch and 2 other sets, one for each branch, after it splits.

To picture how this works, I always use an analogy of water flowing in pipes: one main branch splits into 2 other branches; the pressure is the same throughout the main branch but differs in the splits (here the pressure is analogous to the gradients/learned weights).

The overall pressure in each part of the system is determined by both taps at the ends of the branches (suppose we have taps/valves at their ends).

Now the maths behind it is bound to be a bit complex, because you have a few “valves” and we are also dealing with a high-dimensional space. Even if it were water pressure, the maths would be complex :grin:

But I think these are the main principles.

Thank you both! This has been really interesting. Yes, I also suspect it’s two sets of gradients/coefficients (two in this case, but however many according to the number of branches one might have in the pipeline). The pipe is a great way to see it, I’ll use that mental image more.

I’ve been thinking about how different loss functions impact the “shape and structure” of the resulting abstract representations learned from the data. Particularly in a transformer or whatever encoder, surely different loss functions - even ever so slightly different loss functions - will yield different, maybe even quite different encoded representations… It would be really interesting to read about it: if you guys have any references to research which contrasts the use of, say, some loss function and its smoothed counterpart or something like that, I’d be very interested and grateful.

Also how might one contrast those sets of learned abstractions? I’ve seen heatmaps of embeddings for example, but do you know any meaningful way of contrasting and making sense of different types of abstract representations like that?


Having said that, if there are two sets of coefficients, why weight losses, right? hmmm…

Which two sets…? For each branch there is only one set of weights, not 2.

Hi guys,

Sorry for the slow reply. So: each branch has one set of parameters. I suppose that is also the case for the ‘main’ branch (i.e. prior to it splitting into branches): there’s one set of parameters. Correct? Then it splits and each branch has its own.

If so, which loss function is being minimised in that main branch? Which of the two? We have two loss functions defined, one for each branch, but there are parts of the network with ONE set of parameters which need to be minimised for TWO loss functions. Each step of SGD involves calculating the gradient of a loss function. Which of the two loss functions defined is being minimised in the main branch which has one set of parameters?

Do you see? @Pere suggests there’s a third loss function that accounts for the two at the stages where there’s only one branch, if I understood correctly. But I’ve never come across that before and not sure that would work? Coefficients would be really bad (i.e. model performance would be really bad) if that was the case, methinks… Dunno…

Any thoughts?


Hi jondoff, my opinion is that the common branch of the model uses a combination of the other loss functions. Since we indicate the weights of the losses in loss_weights, it must be a weighted combination.

Yes, that does make sense.

@Pere_Martra, @jondoff,
Yes, the loss for the common branch is the sum of the losses from each branch that feeds up to it from backprop.
I tried to find a good article that explains it. I didn’t find exactly what I was looking for, but here are a couple of posts that touch on it: keras - How is a multiple-outputs deep learning model trained? - Stack Overflow
neural networks - Single input - multiple outputs with different loss functions in Keras: how is the gradient computed? - Cross Validated
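
A small numerical check of that idea, as a sketch in plain Python rather than Keras: the gradient reaching the shared weight equals the weighted sum of the gradients of the individual losses, while each head only receives gradient from its own loss. The toy model (a tanh trunk with weight w and two head weights a and b), the two different losses, the data point and the num_grad helper are all assumptions made up for illustration.

```python
import math

LOSS_WEIGHTS = (2.0, 1.0)   # assumed, mirroring loss_weights=[2, 1]
X, T1, T2 = 1.0, 2.0, 1.0   # one made-up input with a target for each output

def forward(p):
    h = math.tanh(p["w"] * X)       # shared trunk
    return p["a"] * h, p["b"] * h   # two heads

def loss_1(p):
    y1, _ = forward(p)
    return (y1 - T1) ** 2           # squared error on output 1

def loss_2(p):
    _, y2 = forward(p)
    return abs(y2 - T2)             # absolute error on output 2

def total_loss(p):
    return LOSS_WEIGHTS[0] * loss_1(p) + LOSS_WEIGHTS[1] * loss_2(p)

def num_grad(fn, p, name, eps=1e-6):
    """Central finite-difference estimate of d fn / d p[name]."""
    up, down = dict(p), dict(p)
    up[name] += eps
    down[name] -= eps
    return (fn(up) - fn(down)) / (2 * eps)

params = {"w": 0.5, "a": 1.0, "b": -2.0}

trunk_total = num_grad(total_loss, params, "w")
trunk_sum = (LOSS_WEIGHTS[0] * num_grad(loss_1, params, "w")
             + LOSS_WEIGHTS[1] * num_grad(loss_2, params, "w"))
head_a_from_loss_2 = num_grad(loss_2, params, "a")
head_b_from_loss_1 = num_grad(loss_1, params, "b")

print("trunk gradient from total loss:", trunk_total)
print("weighted sum of per-loss trunk gradients:", trunk_sum)
print("head a gradient from loss 2:", head_a_from_loss_2)
print("head b gradient from loss 1:", head_b_from_loss_1)
```

So there is one set of shared parameters and one gradient for it per step; the two losses simply add their contributions before the update.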

Spot on Wendy, thank you so much for posting these. And thank you all for pulling me out of the shadows on this one. I’ll read and continue to pick your brains.

Happy Friday for now