Hi,
in exercise 3 of the week 3 assignment one of the given comments says:
“The model was only trained for 2 steps because training the whole Siamese network takes too long, and produces slightly different results for each run.”
I increased the number of train_steps to 100, but I don't see any improvement in the loss. It remains at the same high value that occurred after epoch 2 (around 126). What could be the issue? Is there a problem with the train set, or are other learning parameters not optimal?
→ the loss does not improve even after hundreds of epochs
The following comment from the assignment also sounds a bit suspicious:
“For the rest of the assignment you will be using a pretrained model, but this small example should help you understand how the training can be done.”
Is the training actually just meant to show the steps involved, and was it never intended to work well enough for the precise task of question duplicate detection?
To fully train an NLP model you need a lot of data and many epochs to really fit the dataset nicely!
They are giving you a pretrained model because it would take a lot of time and resources to do that in the lab; in the Coursera environment it would actually break down!
Yes!
It does not make much difference. The number of train steps is just how the dataset is divided up per pass! You need many epochs and a large dataset to fit well, as well as, generally speaking, a good model architecture!
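Just to illustrate what “division of the dataset per pass” means, here is a small back-of-the-envelope sketch (the numbers, especially the batch size, are assumptions and not the lab's actual values):

# Hypothetical numbers for illustration only.
dataset_size = 400_000   # roughly the number of question pairs mentioned in this thread
batch_size = 256         # assumed value, not necessarily the lab's setting

steps_per_epoch = dataset_size // batch_size
print(steps_per_epoch)   # 1562 batches (gradient updates) per full pass over the data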
What exactly do you mean by resources?
I can train the model for 2 epochs, which takes less than a minute, so to my understanding this proves that I at least have the computing resources needed, right?
Additionally, I run the Coursera lab offline in my own environment after downloading the Jupyter notebook and all related source files.
The only thing that might become a problem is time, but suppose I had time (even weeks): should it work under these conditions?
The train set they used in the lab included about 400,000 question pairs. Can that already be considered a large enough dataset?
“It does not make much difference. The number of train steps is just how the dataset is divided up per pass! You need many epochs and a large dataset to fit well,”
When I said train_steps I was referring to the number of epochs (they called the number of epochs train_steps in the lab code…)
When you explicitly mention “you need a large dataset”, it sounds like the 400,000-sample train dataset is still too small?
“as well as, generally speaking, a good model architecture!”
I suppose the model presented in the lab should be a good model architecture, right?
So I still wonder whether I should be able to train a model from scratch using a) the given lab code and b) the given lab training data.
Do they use the entire dataset for the 2 epochs there? I think maybe not! 400,000 is considerable, but speaking of today's LLMs, they use much, much bigger datasets!
Yeah, in the course assignment one should only run the training process for two epochs, just to demonstrate the general procedure. Then a pretrained model was provided to work through the evaluation steps.
Now I was just curious whether I would be able to get a working model myself by training it from scratch. Gent.spah indicated that it should be possible with some patience. I am trying to run the whole training process on a GPU now and will report back on what comes out.
I believe there might be an issue with your implementation. Using the vanilla model with default settings, I was able to achieve results comparable to the pretrained model after training for just 9 epochs on GPU. Please feel free to DM me your code if you need help with debugging.
Epoch 1/9
3/349 [..............................] - ETA: 21s - loss: 103.7652
349/349 [==============================] - 14s 30ms/step - loss: 63.2801 - val_loss: 25.4438
Epoch 2/9
349/349 [==============================] - 4s 10ms/step - loss: 16.1986 - val_loss: 13.1102
Epoch 3/9
349/349 [==============================] - 3s 8ms/step - loss: 8.7310 - val_loss: 10.9608
Epoch 4/9
349/349 [==============================] - 3s 8ms/step - loss: 6.6691 - val_loss: 9.8793
Epoch 5/9
349/349 [==============================] - 2s 6ms/step - loss: 5.7970 - val_loss: 9.3495
Epoch 6/9
349/349 [==============================] - 2s 6ms/step - loss: 5.2989 - val_loss: 8.9974
Epoch 7/9
349/349 [==============================] - 2s 6ms/step - loss: 4.9343 - val_loss: 9.0217
Epoch 8/9
349/349 [==============================] - 2s 6ms/step - loss: 4.8120 - val_loss: 8.6252
Epoch 9/9
349/349 [==============================] - 2s 6ms/step - loss: 4.6276 - val_loss: 8.5840
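In case you want to reproduce a longer run like this yourself, here is a minimal sketch of how it could be launched with Keras; train_dataset, val_dataset and the already-compiled model are placeholder names for the objects built earlier in the notebook, not confirmed lab names:

# Hypothetical sketch of a longer training run (placeholder names).
history = model.fit(
    train_dataset,                 # batched question-pair training data
    validation_data=val_dataset,   # produces the val_loss column in the log above
    epochs=9,                      # instead of the 2 epochs used in the assignment
)
print(history.history["val_loss"])  # should trend downward, as in the log above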
accuracy, cm = classify(Q1_test, Q2_test, y_test, 0.7, model, batch_size=512)
print("Accuracy", accuracy.numpy())
print(f"Confusion matrix:\n{cm.numpy()}")
20/20 [==============================] - 1s 3ms/step
Accuracy 0.73466796875
Confusion matrix:
[[4789 1593]
[1124 2734]]
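As a quick sanity check, the printed accuracy is consistent with the confusion matrix: 20 batches of 512 test pairs give 10,240 pairs in total, and (assuming the rows/columns are ordered [not duplicate, duplicate]) the accuracy is just the correct predictions divided by that total:

# Reading the confusion matrix above; the [not duplicate, duplicate] ordering is an assumption.
tn, fp = 4789, 1593
fn, tp = 1124, 2734

total = tn + fp + fn + tp      # 10240 = 20 batches * 512 pairs
accuracy = (tn + tp) / total
print(accuracy)                # 0.73466796875, matching the printed value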
1/1 [==============================] - 0s 9ms/step
Q1 = When will I see you?
Q2 = When can I see you again?
d = 0.8820064
res = True
question1 = "Do they enjoy eating the dessert?"
question2 = "Do they like hiking in the desert?"
# 1 means it is duplicated, 0 otherwise
predict(question1, question2, 0.7, model, verbose=True)
question1 = "Do they enjoy eating the dessert?"
question2 = "Do they like hiking in the desert?"
# 1 means it is duplicated, 0 otherwise
predict(question1 , question2, 0.7, model, verbose=True)
1/1 [==============================] - 0s 6ms/step
Q1 = Do they enjoy eating the dessert?
Q2 = Do they like hiking in the desert?
d = 0.1801807
res = False
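From the printouts it looks like res simply comes from comparing the similarity d of the two encoded questions against the threshold passed to predict (0.7 here). A minimal sketch of that decision, where the embedding inputs and the cosine-similarity computation are assumptions rather than the lab's confirmed internals:

import tensorflow as tf

def is_duplicate(v1, v2, threshold=0.7):
    # v1, v2: hypothetical 1-D embeddings of the two questions
    # (in the lab they would come out of the two Siamese branches).
    v1n = tf.math.l2_normalize(v1, axis=-1)
    v2n = tf.math.l2_normalize(v2, axis=-1)
    d = float(tf.reduce_sum(v1n * v2n))   # cosine similarity of the embeddings
    return d, d > threshold

# Matches the printouts above: d = 0.882 > 0.7 -> True, d = 0.180 < 0.7 -> False.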
I was mixing up the l2_normalize function. I thought tf.math.l2_normalize(x) would be the same as tf.math.l2_normalize(x, axis=-1), but that is obviously not the case. Consider the following possible situations again:
tf.math.l2_normalize(x) calculates the l2 norm over the entire 2D tensor x
tf.math.l2_normalize(x, axis=0) calculates the l2 norm in the direction of the rows (thus column-wise)
tf.math.l2_normalize(x, axis=1) calculates the l2 norm in the direction of the columns (thus row-wise)
Option 3 is what we are looking for in the given case, as the sequences are stored in the rows and shall be normalized individually. Note that because x is a 2D tensor, tf.math.l2_normalize(x, axis=-1) would also work.
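For anyone who runs into the same confusion, here is a minimal standalone demonstration of the three cases (the matrix values are arbitrary):

import tensorflow as tf

x = tf.constant([[3.0, 4.0],
                 [6.0, 8.0]])

# Default (axis=None): every element is divided by the L2 norm of the whole tensor.
print(tf.math.l2_normalize(x))

# axis=0: each column is scaled to unit length (normalization runs down the rows).
print(tf.math.l2_normalize(x, axis=0))

# axis=1 (same as axis=-1 for a 2D tensor): each row is scaled to unit length,
# so both rows become [0.6, 0.8] here.
print(tf.math.l2_normalize(x, axis=1))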
Thank you again for your help. The model now converges to an acceptable loss (3.55) after 10-20 epochs, which matches the performance of the given pretrained model.