How are 2 objective functions implemented in BERT-like transformer models?

I have read about the BERT model and learned that it has two training objectives: 1) Masked Language Modeling and 2) Next Sentence Prediction. So I wonder how this is implemented in practice for a custom transformer, say if I want 4-5 training objectives for my custom model. Is implementing two training objectives the same as in the code below, or is there another approach?

import tensorflow as tf
from tensorflow.keras import layers

# Define your model architecture
inputs = layers.Input(shape=(10,))
hidden = layers.Dense(16, activation='relu')(inputs)
output1 = layers.Dense(1, activation='sigmoid')(hidden)  # Binary classification output
output2 = layers.Dense(1)(hidden)  # Regression output

model = tf.keras.Model(inputs=inputs, outputs=[output1, output2])

# Define the loss functions for each objective
loss_fn1 = tf.keras.losses.BinaryCrossentropy()
loss_fn2 = tf.keras.losses.MeanSquaredError()

# Define the optimizer
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(inputs, labels1, labels2):
    with tf.GradientTape() as tape:
        # Forward pass
        predictions1, predictions2 = model(inputs, training=True)
        
        # Compute the losses for each objective
        loss1 = loss_fn1(labels1, predictions1)
        loss2 = loss_fn2(labels2, predictions2)
        
        # Combine the losses
        total_loss = loss1 + loss2
    
    # Compute gradients and update weights
    gradients = tape.gradient(total_loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

# Generate some dummy data for training
train_inputs = tf.random.normal((100, 10))
train_labels1 = tf.random.uniform((100, 1), minval=0, maxval=2, dtype=tf.int32)
train_labels2 = tf.random.normal((100, 1))

# Training loop (processes one example at a time for simplicity)
for epoch in range(10):
    for inputs, labels1, labels2 in zip(train_inputs, train_labels1, train_labels2):
        train_step(tf.expand_dims(inputs, 0), tf.expand_dims(labels1, 0), tf.expand_dims(labels2, 0))

# Perform predictions
test_inputs = tf.random.normal((10, 10))
predictions1, predictions2 = model(test_inputs, training=False)

The transformer architecture is a bit different from this. I think it is introduced in Course 5 of the DLS, and you can also find it in the NLP Specialization.

Basically, a transformer can have an encoder part and a decoder part, and it performs high-dimensional matrix multiplications of the Q, K, V matrices.

You can also find information about it just by searching on Google, for example.
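For reference, here is a minimal sketch of the scaled dot-product attention computation that the Q, K, V matrices feed into (the shapes and function name here are just illustrative):

import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, depth)
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    # Similarity of every query position with every key position, scaled by sqrt(depth)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_k)  # (batch, seq_len, seq_len)
    weights = tf.nn.softmax(scores, axis=-1)
    # Each output position is a weighted sum of the value vectors
    return tf.matmul(weights, v)  # (batch, seq_len, depth)

# Example usage with random tensors
q = tf.random.normal((2, 5, 8))
k = tf.random.normal((2, 5, 8))
v = tf.random.normal((2, 5, 8))
out = scaled_dot_product_attention(q, k, v)  # shape (2, 5, 8)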

I get the point that the transformer uses a self-attention mechanism where Q, K, V are the building blocks, and that it has an encoder-decoder architecture. But my question is only about the training objectives: is the TensorFlow code I posted above the same process that BERT-like transformer models use to optimize their training objectives (Masked Language Modeling and Next Sentence Prediction), or is it done differently?

Your approach to training for multiple objectives looks right to me.

That said, you don't need a custom training loop for this, since TensorFlow/Keras sums the per-output losses by default when you compile a model with multiple outputs. Here's an example:
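A minimal sketch of that compile-and-fit approach, assuming the same toy two-output model as above (the layer sizes, output names, loss choices, and loss_weights are illustrative):

import tensorflow as tf
from tensorflow.keras import layers

# Same toy architecture as above: one shared hidden layer, two output heads
inputs = layers.Input(shape=(10,))
hidden = layers.Dense(16, activation='relu')(inputs)
output1 = layers.Dense(1, activation='sigmoid', name='clf')(hidden)  # binary classification head
output2 = layers.Dense(1, name='reg')(hidden)                        # regression head

model = tf.keras.Model(inputs=inputs, outputs=[output1, output2])

# One loss per output; Keras computes each loss and sums them
# (optionally weighted) into the single total loss that is minimized.
model.compile(
    optimizer='adam',
    loss=['binary_crossentropy', 'mse'],
    loss_weights=[1.0, 1.0],  # optional per-objective weights
)

# Dummy data, matching the shapes used in the custom-loop version
train_inputs = tf.random.normal((100, 10))
train_labels1 = tf.random.uniform((100, 1), minval=0, maxval=2, dtype=tf.int32)
train_labels2 = tf.random.normal((100, 1))

model.fit(train_inputs, [train_labels1, train_labels2], epochs=10, batch_size=32)

Conceptually, this is how BERT handles its two objectives as well: the Masked Language Modeling head and the Next Sentence Prediction head each produce a loss on top of the shared encoder, and the total pre-training loss is their sum. Adding 4-5 objectives just means adding more heads and more loss terms.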
