Kernel dies when training the model

On my laptop (MacBook Pro M2 with 16 GB RAM), the kernel dies when I get to the cell where you run:

training_loop = train_model(Siamese, TripletLoss, train_generator, val_generator)
training_loop.run(train_steps)

I’m having a hard time figuring out where the memory overuse might be, or whether I need to configure my Jupyter notebook differently to give it more memory. It also fails when I submit the assignment to the grader (I don’t know whether that runs in a different environment on the server).

I used fastnp for the operations in TripletLossFn.
I used this syntax for getting the row maximums:
closest_negative = negative_without_positive.max(axis=1), though I also tried fastnp.max(negative_without_positive, axis=1).

I also tried fastnp.multiply() instead of the * operator. I think these are equivalent, but the OOM issue recurs either way.

This is how I set up the training loop; I think I’m supposed to instantiate the Siamese model here with the () syntax:

training_loop = training.Loop(Siamese(),
                              train_task,
                              eval_tasks=[eval_task],
                              output_dir=output_dir)
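
For context, the train_task and eval_task that feed into this Loop are set up roughly like the sketch below (I’m paraphrasing from memory, so the optimizer, learning rate, and metric choice are placeholders rather than the notebook’s exact values):

    from trax import optimizers
    from trax.supervised import training

    # Rough sketch, assuming the TripletLoss layer and the generators defined
    # earlier in the notebook; optimizer and learning rate are placeholders.
    train_task = training.TrainTask(
        labeled_data=train_generator,       # yields (Q1_batch, Q2_batch)
        loss_layer=TripletLoss(),           # the custom triplet loss layer
        optimizer=optimizers.Adam(0.01),
    )

    eval_task = training.EvalTask(
        labeled_data=val_generator,
        metrics=[TripletLoss()],            # evaluate with the same loss
    )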

Any ideas what I should do to debug / solve this further?


Dear @adp214,

Welcome to the Community.

I think there is a problem with your code. You will have to tweak your code so that it runs faster.

Hmm… Any tips on what I could do to improve the speed of my code in this case?

  1. I already use fastnp methods to do the matrix math of TripletLossFn, including fastnp.dot(), fastnp.diag(), fastnp.multiply(), fastnp.sum(), fastnp.eye() and fastnp.maximum().
    For calculating the maximum in each row, I used this syntax:
    closest_negative = negative_without_positive.max(axis=1)

But it didn’t make any difference compared to:

fastnp.max(negative_without_positive, axis=1)
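
For reference, here is roughly how those ops fit together as a hard-negative-mining triplet loss (a simplified sketch rather than my exact cell; the margin value and some variable names are placeholders):

    from trax.fastmath import numpy as fastnp

    # Simplified sketch of a triplet loss with hard-negative mining over a
    # batch of paired embeddings v1, v2 (margin is a placeholder value).
    def TripletLossFn(v1, v2, margin=0.25):
        scores = fastnp.dot(v1, v2.T)                   # pairwise similarities, shape (b, b)
        batch_size = len(scores)
        positive = fastnp.diag(scores)                  # similarities of the true pairs
        # Mean of the off-diagonal (negative) similarities in each row.
        negative_zero_on_duplicate = scores * (1.0 - fastnp.eye(batch_size))
        mean_negative = fastnp.sum(negative_zero_on_duplicate, axis=1) / (batch_size - 1)
        # Hardest negative per row: push the diagonal out of the running, then take the row max.
        negative_without_positive = scores - 2.0 * fastnp.eye(batch_size)
        closest_negative = negative_without_positive.max(axis=1)
        triplet_loss1 = fastnp.maximum(0.0, margin - positive + closest_negative)
        triplet_loss2 = fastnp.maximum(0.0, margin - positive + mean_negative)
        return fastnp.mean(triplet_loss1 + triplet_loss2)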

The data_generator function returns 2 regular np arrays, not fastnp DeviceArrays. And TripletLoss evaluates correctly to close to 0.70 on the test example.

I’m wondering whether something should be different in the generators (are they causing the problem?):

train_generator = data_generator(train_Q1, train_Q2, batch_size, vocab['<PAD>'])
val_generator = data_generator(val_Q1, val_Q2, batch_size, vocab['<PAD>'])
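
One thing I could try as a sanity check (just a debugging idea, not something from the notebook) is pulling a single batch and looking at its shape and dtype before ever starting the training loop:

    # Inspect one batch to see how big the padded arrays actually are.
    q1_batch, q2_batch = next(train_generator)
    print(q1_batch.shape, q1_batch.dtype)   # expect roughly (batch_size, max_len)
    print(q2_batch.shape, q2_batch.dtype)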

I noticed the text talks about some lambda use here, but I didn’t use any lambda. I’m guessing the problem might be that I’m invoking data_generator with its arguments right away, so train_generator ends up holding the already-built generator and its values, rather than, for instance, being a function that only builds a generator when it is called (perhaps with a smaller batch_size).
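
Just to make the distinction concrete, I mean something like this (the lambda version is only my guess at what the text intends, and make_train_generator is a name I made up):

    # What I actually did: data_generator(...) runs now and returns one generator object.
    train_generator = data_generator(train_Q1, train_Q2, batch_size, vocab['<PAD>'])

    # What the lambda hint might mean: defer the call, so a fresh generator
    # is only built when the function is invoked.
    make_train_generator = lambda: data_generator(train_Q1, train_Q2, batch_size, vocab['<PAD>'])
    train_stream = make_train_generator()
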
What do you think?

OK, I figured it out. I had a bug in my generator’s logic for determining the max_len of the questions. I guess this was masked and not caught by any of the previous unit tests. Eventually it led to my arrays being padded with huge amounts of padding, which crashed at model training time.

Here’s my final solution to the max_len determination:

            max_len = max(len(max(input1, key=len)), len(max(input2, key=len)))
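
For anyone else hitting this, here is a small self-contained sketch of how that max_len line drives the padding (pad_batch is just a name I made up for illustration; the real generator also handles shuffling, batching, and looping forever):

    import numpy as np

    def pad_batch(input1, input2, pad):
        # The longest question across both batches determines the padded width.
        max_len = max(len(max(input1, key=len)), len(max(input2, key=len)))
        b1 = [q + [pad] * (max_len - len(q)) for q in input1]
        b2 = [q + [pad] * (max_len - len(q)) for q in input2]
        return np.array(b1), np.array(b2)

    # Tiny example: the longest question has 5 tokens, so both arrays end up (2, 5).
    q1s = [[1, 2, 3], [4, 5, 6, 7, 8]]
    q2s = [[9, 10], [11, 12, 13]]
    b1, b2 = pad_batch(q1s, q2s, pad=0)
    print(b1.shape, b2.shape)   # (2, 5) (2, 5)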