Hi,
in exercise 3 of the week 3 assignment one of the given comments says:
“The model was only trained for 2 steps because training the whole Siamese network takes too long, and produces slightly different results for each run.”
I increased the number of train_steps to 100, but I don't see any improvement in the loss. It remains at the same high value that occurred after epoch 2 (around 126). What could be the issue? Is there a problem with the train set, or are other learning parameters not optimal?
→ the loss does not improve even after hundreds of epochs
The following comment from the assignment also sounds a bit suspicious:
“For the rest of the assignment you will be using a pretrained model, but this small example should help you understand how the training can be done.”
Is the training actually just meant to show the steps involved, and was it never intended to work well enough for the precise task of question duplicate detection?
To fully train an NLP model you need a lot of data and many epochs to really fit the dataset nicely!
They are giving you a pretrained model because it would take a lot of time and resources to do that in the lab; in the Coursera environment it would actually break down!
Yes!
It does not make much difference. The number of train steps is just how the dataset is divided up per pass! You need many epochs and a large dataset to fit well, as well as, generally speaking, a good model architecture!
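Just to illustrate what “division of the dataset per pass” means, here is a small back-of-the-envelope sketch (the numbers, especially the batch size, are assumptions and not the lab's actual values):

# Hypothetical numbers for illustration only.
dataset_size = 400_000   # roughly the number of question pairs mentioned in this thread
batch_size = 256         # assumed value, not necessarily the lab's setting

steps_per_epoch = dataset_size // batch_size
print(steps_per_epoch)   # 1562 batches (gradient updates) per full pass over the data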
What exactly do you mean by resources?
I can train the model for 2 epochs, which takes less than a minute, so to my understanding this proves that I at least have the computing resources needed, right?
Additionally, I run the Coursera lab offline in my own environment after downloading the Jupyter notebook and all related source files.
The only thing that might become a problem is time, but suppose I had time (even weeks): should it work under these conditions?
The train set they used in the lab included about 400,000 question pairs. Can that already be considered a large enough dataset?
“It does not make much difference. The number of train steps is just how the dataset is divided up per pass! You need many epochs and a large dataset to fit well,”
When I said train_steps I was referring to the number of epochs (they called the number of epochs train_steps in the lab code…)
When you explicitly mention “you need a large dataset”, it sounds like the 400,000-sample train dataset is still too small?
“as well as, generally speaking, a good model architecture!”
I suppose the model presented in the lab should be a good model architecture, right?
So I still wonder whether I should be able to train a model from scratch using a) the given lab code and b) the given lab training data.
Do they use the entire dataset for the 2 epochs there? I think maybe not! 400,000 is considerable, but speaking of today's LLMs, they use much, much bigger datasets!
Yeah, in the course assignment one should only run the training process for two epochs, just to demonstrate the general procedure. Then a pretrained model was provided to work through the evaluation steps.
Now I was just curious whether I would be able to get a working model myself by training it from scratch. Gent.spah indicated that it should be possible with some patience. I am trying to run the whole training process on a GPU now and will report back on what comes out.
I believe there might be an issue with your implementation. Using the vanilla model with default settings, I was able to achieve results comparable to the pretrained model after training for just 9 epochs on GPU. Please feel free to DM me your code if you need help with debugging.
Epoch 1/9
3/349 [..............................] - ETA: 21s - loss: 103.7652
349/349 [==============================] - 14s 30ms/step - loss: 63.2801 - val_loss: 25.4438
Epoch 2/9
349/349 [==============================] - 4s 10ms/step - loss: 16.1986 - val_loss: 13.1102
Epoch 3/9
349/349 [==============================] - 3s 8ms/step - loss: 8.7310 - val_loss: 10.9608
Epoch 4/9
349/349 [==============================] - 3s 8ms/step - loss: 6.6691 - val_loss: 9.8793
Epoch 5/9
349/349 [==============================] - 2s 6ms/step - loss: 5.7970 - val_loss: 9.3495
Epoch 6/9
349/349 [==============================] - 2s 6ms/step - loss: 5.2989 - val_loss: 8.9974
Epoch 7/9
349/349 [==============================] - 2s 6ms/step - loss: 4.9343 - val_loss: 9.0217
Epoch 8/9
349/349 [==============================] - 2s 6ms/step - loss: 4.8120 - val_loss: 8.6252
Epoch 9/9
349/349 [==============================] - 2s 6ms/step - loss: 4.6276 - val_loss: 8.5840
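In case you want to reproduce a longer run like this yourself, here is a minimal sketch of how it could be launched with Keras; train_dataset, val_dataset and the already-compiled model are placeholder names for the objects built earlier in the notebook, not confirmed lab names:

# Hypothetical sketch of a longer training run (placeholder names).
history = model.fit(
    train_dataset,                 # batched question-pair training data
    validation_data=val_dataset,   # produces the val_loss column in the log above
    epochs=9,                      # instead of the 2 epochs used in the assignment
)
print(history.history["val_loss"])  # should trend downward, as in the log above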
accuracy, cm = classify(Q1_test, Q2_test, y_test, 0.7, model, batch_size=512)
print("Accuracy", accuracy.numpy())
print(f"Confusion matrix:\n{cm.numpy()}")
20/20 [==============================] - 1s 3ms/step
Accuracy 0.73466796875
Confusion matrix:
[[4789 1593]
[1124 2734]]
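As a quick sanity check, the printed accuracy is consistent with the confusion matrix: 20 batches of 512 test pairs give 10,240 pairs in total, and (assuming the rows/columns are ordered [not duplicate, duplicate]) the accuracy is just the correct predictions divided by that total:

# Reading the confusion matrix above; the [not duplicate, duplicate] ordering is an assumption.
tn, fp = 4789, 1593
fn, tp = 1124, 2734

total = tn + fp + fn + tp      # 10240 = 20 batches * 512 pairs
accuracy = (tn + tp) / total
print(accuracy)                # 0.73466796875, matching the printed value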
1/1 [==============================] - 0s 9ms/step
Q1 = When will I see you?
Q2 = When can I see you again?
d = 0.8820064
res = True
question1 = "Do they enjoy eating the dessert?"
question2 = "Do they like hiking in the desert?"
# 1 means it is duplicated, 0 otherwise
predict(question1, question2, 0.7, model, verbose=True)
question1 = "Do they enjoy eating the dessert?"
question2 = "Do they like hiking in the desert?"
# 1 means it is duplicated, 0 otherwise
predict(question1 , question2, 0.7, model, verbose=True)
1/1 [==============================] - 0s 6ms/step
Q1 = Do they enjoy eating the dessert?
Q2 = Do they like hiking in the desert?
d = 0.1801807
res = False
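From the printouts it looks like res simply comes from comparing the similarity d of the two encoded questions against the threshold passed to predict (0.7 here). A minimal sketch of that decision, where the embedding inputs and the cosine-similarity computation are assumptions rather than the lab's confirmed internals:

import tensorflow as tf

def is_duplicate(v1, v2, threshold=0.7):
    # v1, v2: hypothetical 1-D embeddings of the two questions
    # (in the lab they would come out of the two Siamese branches).
    v1n = tf.math.l2_normalize(v1, axis=-1)
    v2n = tf.math.l2_normalize(v2, axis=-1)
    d = float(tf.reduce_sum(v1n * v2n))   # cosine similarity of the embeddings
    return d, d > threshold

# Matches the printouts above: d = 0.882 > 0.7 -> True, d = 0.180 < 0.7 -> False.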
I was mixing up the l2_normalize function. I thought tf.math.l2_normalize(x) would be the same as tf.math.l2_normalize(x, axis=-1), but that is obviously not the case. Consider the following possible situations again:
tf.math.l2_normalize(x) calculates the l2 norm over the entire 2D tensor x
tf.math.l2_normalize(x, axis=0) calculates the l2 norm in the direction of the rows (thus column-wise)
tf.math.l2_normalize(x, axis=1) calculates the l2 norm in the direction of the columns (thus row-wise)
Option 3 is what we are looking for in the given case, as the sequences are stored in the rows and shall be normalized individually. Note that because x is a 2D tensor, tf.math.l2_normalize(x, axis=-1) would also work.
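For anyone who runs into the same confusion, here is a minimal standalone demonstration of the three cases (the matrix values are arbitrary):

import tensorflow as tf

x = tf.constant([[3.0, 4.0],
                 [6.0, 8.0]])

# Default (axis=None): every element is divided by the L2 norm of the whole tensor.
print(tf.math.l2_normalize(x))

# axis=0: each column is scaled to unit length (normalization runs down the rows).
print(tf.math.l2_normalize(x, axis=0))

# axis=1 (same as axis=-1 for a 2D tensor): each row is scaled to unit length,
# so both rows become [0.6, 0.8] here.
print(tf.math.l2_normalize(x, axis=1))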
Thank you again for your help. The model now converges to an acceptable loss (3.55) after 10-20 epochs, which matches the performance of the given pretrained model.