Hi there! In Course 1 Week 1 of TF-AT, multi-output regression is reviewed. I can see there’s functionality to specify different (presumably also custom) loss functions for each output. Surely that means there’s a set of parameters for each output (I’m thinking that parameters resulting from minimising different loss functions can’t be the same), am I right to assume that? If so, then effectively multi-output amounts to two separate regressions under the hood, correct? How is this computationally more effective than simply running two separate regressions? (It may well be more effective, I just don’t understand how or why this would be the case.)
I understand that further down the line branching models become a tool in their own right, but at this point, would it be effectively the same to code two separate regressions except for the extra lines of code?
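For reference, the kind of model I mean is something like this, a minimal sketch with made-up layer sizes and names (not the course notebook’s exact code):

import tensorflow as tf

# Hypothetical shapes and sizes, just to illustrate the shared-trunk idea.
inputs = tf.keras.Input(shape=(8,))
shared = tf.keras.layers.Dense(64, activation='relu')(inputs)   # shared trunk
shared = tf.keras.layers.Dense(32, activation='relu')(shared)   # still shared
y1_output = tf.keras.layers.Dense(1, name='y1_output')(shared)  # branch 1
y2_output = tf.keras.layers.Dense(1, name='y2_output')(shared)  # branch 2

model = tf.keras.Model(inputs=inputs, outputs=[y1_output, y2_output])
model.summary()  # the trunk Dense layers appear once, feeding both heads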
This is a very small model, but in larger models there are advantages in both training time and inference.
The common layers must be trained only once, with the same data; in cases where training can last hours or even days, the savings in both computation and cost can be very large.
At inference time the benefit is similar: you get two predictions while running a large part of the layers just once, which saves time.
They are not widely used models, but it is very good to know the technique, since it is also used elsewhere, as in the case of Siamese networks, which share part of the model.
Possibly something similar could be achieved with transfer learning: replacing the last layers of a trained model and training them with new data. But in that case we would not get the advantage at inference time.
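Something like this, more or less (a made-up sketch, not from the course; the layer and model names are hypothetical):

import tensorflow as tf

# Hypothetical single-output model that has already been trained.
inputs = tf.keras.Input(shape=(8,))
h = tf.keras.layers.Dense(64, activation='relu', name='trunk')(inputs)
y1 = tf.keras.layers.Dense(1, name='y1_output')(h)
trained = tf.keras.Model(inputs, y1)

# Transfer learning: freeze the trunk, attach and train a new head for y2.
trained.get_layer('trunk').trainable = False
y2 = tf.keras.layers.Dense(1, name='y2_output')(trained.get_layer('trunk').output)
second = tf.keras.Model(inputs, y2)

# At inference we now have to run two models, repeating the trunk computation,
# which is exactly the saving the branched model keeps.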
I think you are right in your assumption about the parameters, that they would fit better with only one variable to predict, but, as always, it is a matter of weighing the gains and losses and deciding on the best solution.
Do you know if there is a set of coefficients, latent states and biases per loss function per layer, or one set for all? That is, I suppose, my main question.
When the model branches, there’s no question about each branch having its own parameters. But where different loss functions share the same layers, I’m not so sure. If you compute the loss at each step of SGD, then surely you must compute two losses; which one does the model minimise? It’s surely both, so are there then two resulting sets of parameters per shared layer?
If so, then effectively they’re two separate models sharing “the same” architecture but not one model that predicts two outcomes. Correct?
If you have 2 objective functions that you want to minimise (say one for classification and one for regression), at each step of stochastic gradient descent (or whatever optimiser: Adam, RMSprop…) how do you update the coefficients? Should you use the gradient × learning rate of one loss or the other to update them?
The results of both loss functions are used: each is multiplied by its assigned loss weight, then the two are added, and that sum is the value of the overall loss function.
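With toy numbers (just to illustrate; these are not real values from the notebook):

# Toy illustration of the combined loss with loss_weights = [2, 1].
loss_y1 = 0.50                     # say, the mse on y1_output for a batch
loss_y2 = 0.30                     # say, the mse on y2_output for a batch
loss_weights = [2, 1]

total_loss = loss_weights[0] * loss_y1 + loss_weights[1] * loss_y2
print(total_loss)                  # 2*0.50 + 1*0.30 = 1.30, the single value SGD minimises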
You can see that the model reports three different loss values: one for each output, plus the result stored in an overall loss, which is the one used to decide how to update the weights.
When the weights are updated, the learning_rate is used, always trying to minimise the result of the combined loss function in the next iteration.
P.S. I’m not sure if I’m explaining this well, but in brief: the model gets just one loss value, like any other model, only it is the combination of the two loss functions indicated.
In the definition of the losses and the subsequent training there’s no third loss definition that encapsulates both. I think these are effectively two separate models which output two disjoint metrics for colour and garment.
I’m using the notebook from the sample; you can see the losses with these lines:
# Specify the optimizer, and compile the model with loss functions for both outputs
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)
model.compile(optimizer=optimizer,
              loss={'y1_output': 'mse', 'y2_output': 'mse'},
              loss_weights=[2, 1],
              metrics={'y1_output': tf.keras.metrics.RootMeanSquaredError(),
                       'y2_output': tf.keras.metrics.RootMeanSquaredError()})
With these metrics, in each epoch you get the overall loss and the loss for each branch: Epoch 1/500
If you modify the loss weights, it is really easy to see how the loss values change.
I tried with the weights [1, 2] and [4, 1], obtaining these results:
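By the way, if I’m not mistaken, loss_weights can also be passed as a dict keyed by output name, which makes the pairing explicit:

model.compile(optimizer=optimizer,
              loss={'y1_output': 'mse', 'y2_output': 'mse'},
              loss_weights={'y1_output': 2, 'y2_output': 1},  # same as [2, 1]
              metrics={'y1_output': tf.keras.metrics.RootMeanSquaredError(),
                       'y2_output': tf.keras.metrics.RootMeanSquaredError()})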
@gent.spah @Wendy, @jondoff and I are having a good time discussing how Multiple Output Models work, and how the loss is calculated and the weights updated.
Please feel free to share your knowledge, because I have some doubts.
I’m sure that it is a single model that computes both variables with a single training run, and that the weights are updated using the loss calculated as a result of applying the weights indicated for each specific loss. But, if I don’t misunderstand, @jondoff believes that there are two models that share “the same” architecture, rather than one model that predicts two outcomes.
Hi @Pere_Martra, I saw this post, and here is what I think about it, as I wrote in a previous post.
Basically you have one set of weights for the main branch and two other sets, one for each branch after it splits.
To picture how this works, I always use a water-flow-in-pipes analogy: one main pipe splits into two branches; the pressure is the same in the main pipe but different in the splits (here the pressure is analogous to the gradients/learned weights).
The overall pressure in each part of the system is determined by both taps at the ends of the branches (suppose we have taps/valves at their ends).
Now, the maths behind it is bound to be a bit complex, because you have a few “valves” and we are also dealing with a high-dimensional space. Even if it were water pressure, it would be complex maths.
Thank you both! This has been really interesting. Yes, I also suspect it’s two sets of gradients/coefficients (two in this case, but however many, according to the number of branches one might have in the pipeline). The pipe is a great way to see it; I’ll use that mental image more.
I’ve been thinking about how different loss functions impact the “shape and structure” of the abstract representations learned from the data. In a transformer, or any encoder, surely different loss functions (even ever-so-slightly different ones) will yield different, maybe even quite different, encoded representations. If you have any references to research contrasting, say, some loss function and its smoothed counterpart, I’d be very interested and grateful.
Also, how might one contrast those sets of learned abstractions? I’ve seen heatmaps of embeddings, for example, but do you know of any meaningful way of contrasting and making sense of different types of abstract representations like that?
Sorry for the slow reply. So, each branch has one set of parameters. I suppose that is also the case for the ‘main’ branch (i.e. prior to it splitting into branches): there’s one set of parameters. Correct? Then it splits and each branch has its own.
If so, which loss function is being minimised in that main branch? Which of the two? We have two loss functions defined, one for each branch, but there are parts of the network with ONE set of parameters which need to be optimised for TWO loss functions. Each step of SGD involves calculating the gradient of a loss function. Which of the two defined loss functions is being minimised in the main branch, which has one set of parameters?
Do you see? @Pere suggests there’s a third loss function that accounts for the two at the stages where there’s only one branch, if I understood correctly. But I’ve never come across that before and I’m not sure it would work. The coefficients (i.e. model performance) would be really bad if that were the case, methinks… Dunno…
Hi jondoff, my opinion is that the common branch of the model uses a combination of the other loss functions. Since we are indicating the weights of the losses in loss_weights, it must be a weighted combination.
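One way to check this (a sketch I put together, not from the course notebook; sizes and names are made up) is to compute the gradient of each weighted loss separately with tf.GradientTape and compare their sum with the gradient of the combined loss on the shared layer:

import tensorflow as tf

# Tiny shared-trunk model, just for the experiment.
inputs = tf.keras.Input(shape=(4,))
trunk = tf.keras.layers.Dense(8, activation='relu', name='trunk')
h = trunk(inputs)
y1 = tf.keras.layers.Dense(1, name='y1_output')(h)
y2 = tf.keras.layers.Dense(1, name='y2_output')(h)
model = tf.keras.Model(inputs, [y1, y2])

x = tf.random.normal((16, 4))
t1 = tf.random.normal((16, 1))
t2 = tf.random.normal((16, 1))
mse = tf.keras.losses.MeanSquaredError()
w1, w2 = 2.0, 1.0  # the loss_weights

with tf.GradientTape(persistent=True) as tape:
    p1, p2 = model(x)
    loss1 = w1 * mse(t1, p1)   # weighted loss of branch 1
    loss2 = w2 * mse(t2, p2)   # weighted loss of branch 2
    total = loss1 + loss2      # the single combined loss

g1 = tape.gradient(loss1, trunk.kernel)       # trunk gradient from branch 1 only
g2 = tape.gradient(loss2, trunk.kernel)       # trunk gradient from branch 2 only
g_total = tape.gradient(total, trunk.kernel)  # trunk gradient of the combined loss
print(tf.reduce_max(tf.abs(g1 + g2 - g_total)))  # ~0: they add up
del tape

So the trunk is not choosing between the two losses: SGD follows the gradient of the single weighted sum, and at the split point the gradients flowing back from both branches simply add up.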
Spot on Wendy, thank you so much for posting these. And thank you all for pulling me out of the shadows on this one. I’ll read and continue to pick your brains.