How are 2 objective functions implemented in BERT-like transformer models?

I have read about the BERT model and learned that it has two training objectives: 1) Masked Language Modeling and 2) Next Sentence Prediction. So I wonder how this is implemented in practice for a custom transformer, say if I want 4-5 training objectives for my custom model. Is implementing two training objectives the same as in the code below, or is there another approach?

import tensorflow as tf
from tensorflow.keras import layers

# Define your model architecture
inputs = layers.Input(shape=(10,))
hidden = layers.Dense(16, activation='relu')(inputs)
output1 = layers.Dense(1, activation='sigmoid')(hidden)  # Binary classification output
output2 = layers.Dense(1)(hidden)  # Regression output

model = tf.keras.Model(inputs=inputs, outputs=[output1, output2])

# Define the loss functions for each objective
loss_fn1 = tf.keras.losses.BinaryCrossentropy()
loss_fn2 = tf.keras.losses.MeanSquaredError()

# Define the optimizer
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(inputs, labels1, labels2):
    with tf.GradientTape() as tape:
        # Forward pass
        predictions1, predictions2 = model(inputs, training=True)
        
        # Compute the losses for each objective
        loss1 = loss_fn1(labels1, predictions1)
        loss2 = loss_fn2(labels2, predictions2)
        
        # Combine the losses
        total_loss = loss1 + loss2
    
    # Compute gradients and update weights
    gradients = tape.gradient(total_loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

# Generate some dummy data for training
train_inputs = tf.random.normal((100, 10))
train_labels1 = tf.random.uniform((100, 1), minval=0, maxval=2, dtype=tf.int32)
train_labels2 = tf.random.normal((100, 1))

# Training loop (processes one example at a time for simplicity)
for epoch in range(10):
    for inputs, labels1, labels2 in zip(train_inputs, train_labels1, train_labels2):
        train_step(tf.expand_dims(inputs, 0), tf.expand_dims(labels1, 0), tf.expand_dims(labels2, 0))

# Perform predictions
test_inputs = tf.random.normal((10, 10))
predictions1, predictions2 = model(test_inputs, training=False)

The transformer architecture is a bit different from this. I think it is introduced in Course 5 of the DLS, and you can also find it in the NLP Specialization.

Basically, a transformer can have an encoder part and a decoder part, and it performs high-dimensional matrix multiplications of the Q, K, V matrices.

You can also find information about it just by searching on Google, for example.
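For reference, here is a minimal sketch of the scaled dot-product attention computation that the Q, K, V matrices feed into (the shapes and function name here are just illustrative):

import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, depth)
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    # Similarity of every query position with every key position, scaled by sqrt(depth)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d_k)  # (batch, seq_len, seq_len)
    weights = tf.nn.softmax(scores, axis=-1)
    # Each output position is a weighted sum of the value vectors
    return tf.matmul(weights, v)  # (batch, seq_len, depth)

# Example usage with random tensors
q = tf.random.normal((2, 5, 8))
k = tf.random.normal((2, 5, 8))
v = tf.random.normal((2, 5, 8))
out = scaled_dot_product_attention(q, k, v)  # shape (2, 5, 8)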

I get the point that the transformer uses a self-attention mechanism where Q, K, V are the building blocks, and that it has an encoder-decoder architecture. But my question is only about the training objectives: is the TensorFlow code I posted above the same process that BERT-like transformer models use to optimize their training objectives (Masked Language Modeling and Next Sentence Prediction), or is it done differently?

Your approach to training for multiple objectives looks right to me.

That said, you don't need a custom training loop for this, since TensorFlow/Keras sums the per-output losses by default when you compile a model with multiple outputs. Here's an example:
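A minimal sketch of that compile-and-fit approach, assuming the same toy two-output model as above (the layer sizes, output names, loss choices, and loss_weights are illustrative):

import tensorflow as tf
from tensorflow.keras import layers

# Same toy architecture as above: one shared hidden layer, two output heads
inputs = layers.Input(shape=(10,))
hidden = layers.Dense(16, activation='relu')(inputs)
output1 = layers.Dense(1, activation='sigmoid', name='clf')(hidden)  # binary classification head
output2 = layers.Dense(1, name='reg')(hidden)                        # regression head

model = tf.keras.Model(inputs=inputs, outputs=[output1, output2])

# One loss per output; Keras computes each loss and sums them
# (optionally weighted) into the single total loss that is minimized.
model.compile(
    optimizer='adam',
    loss=['binary_crossentropy', 'mse'],
    loss_weights=[1.0, 1.0],  # optional per-objective weights
)

# Dummy data, matching the shapes used in the custom-loop version
train_inputs = tf.random.normal((100, 10))
train_labels1 = tf.random.uniform((100, 1), minval=0, maxval=2, dtype=tf.int32)
train_labels2 = tf.random.normal((100, 1))

model.fit(train_inputs, [train_labels1, train_labels2], epochs=10, batch_size=32)

Conceptually, this is how BERT handles its two objectives as well: the Masked Language Modeling head and the Next Sentence Prediction head each produce a loss on top of the shared encoder, and the total pre-training loss is their sum. Adding 4-5 objectives just means adding more heads and more loss terms.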
