Kernel dies: w3a2 Image_segmentation_Unet_v2

I followed all the instructions, all my functions passed, and there are no infinite loops or the like. Training the model fails with “Kernel has died, it will be restarted”. It has gotten as far as Epoch 3, but it usually fails in Epoch 1 or 2, at different places. I have restarted the kernel, saved and checkpointed my work, and rebooted the server many times over. It still fails. The output just stops like this:

(TensorSpec(shape=(96, 128, 3), dtype=tf.float32, name=None), TensorSpec(shape=(96, 128, 1), dtype=tf.uint8, name=None))
Epoch 1/5
5/34 [===>…] - ETA: 8s - loss: 2.9687 - accuracy: 0.1911

Lab Id wcwzghbmwikn

Can you please help? The notebook also seems very slow to respond.

AW

There are two common reasons for the kernel dying.

  1. The Jupyter server is too busy. “Try again later” seems helpful in this case.

  2. Your notebook contains a lot of output data, and the kernel cannot digest it all. For this, try “Kernel → Restart & Clear Output” followed by “File → Save and Checkpoint”, then Submit without running any cells.

If neither of these helps, then maybe there is an error in your code.
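For the second case, if the notebook is too sluggish to use the menus at all, the stored outputs can also be stripped from the file itself. This is only a rough sketch, assuming the usual nbformat 4 JSON layout (a top-level `cells` list whose code cells carry `outputs` and `execution_count`). The helper names and file paths are hypothetical, and it writes to a copy so the original notebook stays untouched.

```python
import json


def clear_outputs(notebook):
    """Remove stored outputs from every code cell of a notebook dict (in place)."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []            # drop printed text, plots, etc.
            cell["execution_count"] = None  # reset the In [n] counter
    return notebook


def clear_outputs_in_file(src_path, dst_path):
    """Read a .ipynb file, clear its outputs, and write the result to dst_path."""
    with open(src_path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(clear_outputs(notebook), f, indent=1)


# Hypothetical usage; substitute the real notebook file name:
# clear_outputs_in_file("my_notebook.ipynb", "my_notebook_cleared.ipynb")
```

Run it from a separate notebook or terminal rather than from the notebook being cleaned, then open the cleared copy.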

Thank you for the quick response. For the benefit of anyone looking at this with a similar problem, I did NOT get a “Try again later” message or anything implying that the server was too busy, although that may well have been the case.

I also did not add any output beyond the course's test output that was already there. If I add output for debugging, I always comment it out right after I have figured out the issue.

That said, I ran each cell from the top to make sure there weren’t any errors in my code or I had missed something. I did not change a thing in my code, and when I got to the training portion it trained without issue and I completed the lab. I actually trained it a second time just to make sure it wasn’t a fluke.

It seems that there was something wrong with the system (like low resources) that has now been resolved. It would be nice to get an error message of some type indicating that.

Thank you for your assistance.

Thanks for your report.

Well, I guess what your experiment has shown is that “Kernel has died, it will be restarted” is the error message indicating that. :laughing:

I guess so. I guess I could reword that to say a HELPFUL error message. If I see a stack overflow or attempt to access protected memory or such I would expect that it was an error in my code. But when the kernel just dies it is anyone’s guess.

So, just FYI, since I am not really a Python programmer and this is the first time I have used Jupyter: is there a way to check the status of your environment to see if there is a lack of resources for your program? I spent a lot of time trying to troubleshoot my own code; in fact, I had hoped to complete another module last night. I might just take a Python class after this so I can learn in a more orderly fashion rather than on the fly.

Maybe, but not that I’m aware of. I’m not much of a Jupyter expert.
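That said, here is one generic thing that could be tried from a notebook cell before training. It is only a sketch using the Python standard library, and it assumes the notebook runs on Linux where `/proc/meminfo` is readable. The function names are made up for illustration, and in a shared or containerized lab the numbers may describe the host machine rather than the limit applied to your own session, so treat them as a hint and not a diagnosis.

```python
import os
import shutil


def read_meminfo(path="/proc/meminfo"):
    """Return /proc/meminfo as {field: value in KiB}; empty dict if unavailable."""
    info = {}
    try:
        with open(path) as f:
            for line in f:
                key, _, rest = line.partition(":")
                parts = rest.split()
                if parts and parts[0].isdigit():
                    info[key.strip()] = int(parts[0])
    except OSError:
        pass  # not Linux, or /proc is not readable (assumption noted above)
    return info


def resource_snapshot(disk_path="."):
    """Collect a few coarse indicators of how loaded the environment is."""
    mem = read_meminfo()
    disk = shutil.disk_usage(disk_path)
    try:
        load_1min = os.getloadavg()[0]  # average runnable processes, last minute
    except (AttributeError, OSError):
        load_1min = None  # not available on this platform
    return {
        "cpu_count": os.cpu_count(),
        "load_avg_1min": load_1min,
        "mem_total_mib": mem["MemTotal"] // 1024 if "MemTotal" in mem else None,
        "mem_available_mib": mem["MemAvailable"] // 1024 if "MemAvailable" in mem else None,
        "disk_free_gib": disk.free / 1024**3,
    }


print(resource_snapshot())
```

If `mem_available_mib` is very small, or `load_avg_1min` is far above `cpu_count`, a dying kernel is more likely an environment problem than a bug in the assignment code.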

Of course you’re right about that, and I assumed that was what you really meant. I was just making a small joke, but it’s not totally without a point: in general, error messages of all sorts (from compilers, e.g.) sometimes take some experience to interpret. We should advocate for improvements, but in the meantime we must deal with things as they are, not as we wish them to be. :nerd_face:

Agreed, but if we accept mediocrity (or worse) that is what we get.

I understand you were making a joke. It is just frustrating, as time is the one thing I don’t have enough of, and it’s frustrating to make time for the class and have it wasted by something like this.