C2_W2_Assignment training job failed after 2 hours

%%time
estimator.latest_training_job.wait(logs=False)

Error: UnexpectedStatusException: Error for Training job pytorch-training-2021-06-14-19-49-54-773: Failed. Reason: InternalServerError: We encountered an internal error. Please try again.
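For reference, a minimal sketch of catching the exception and printing the service-side failure reason, assuming a reasonably recent SageMaker Python SDK where `latest_training_job.describe()` is available:

```python
from sagemaker.exceptions import UnexpectedStatusException

try:
    estimator.latest_training_job.wait(logs=False)
except UnexpectedStatusException:
    # describe() returns the DescribeTrainingJob response, which includes
    # the job status and the failure reason reported by the service
    info = estimator.latest_training_job.describe()
    print(info["TrainingJobStatus"])
    print(info.get("FailureReason", "no FailureReason reported"))
```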

@humanityrultz,

Please try restarting the notebook kernel and running the job again. Did this solve the problem?

@mkosanovic Thanks for your reply.
I tried a couple of times, and it is the same issue every time. Starting is quick (1-2 mins), Downloading takes seconds, Training takes about 40 mins, Uploading takes roughly 60-80 mins, and then it fails.

It sounds to me like there is something wrong with the estimator. Upload time should not be longer than a couple of minutes.

Try uploading the notebook (run the last cell in the notebook) and then submitting it to the grader. The report should indicate if you got some of the previous exercises wrong.

I have the same problem. I have already tried it 3 times. The model gets uploaded, but the training job keeps running and fails after an hour in a step that should take only a few minutes.

| Status | Start time | End time | Description |
| --- | --- | --- | --- |
| Starting | Jun 14, 2021 14:41 UTC | Jun 14, 2021 14:44 UTC | Preparing the instances for training |
| Downloading | Jun 14, 2021 14:44 UTC | Jun 14, 2021 14:44 UTC | Downloading input data |
| Training | Jun 14, 2021 14:44 UTC | Jun 14, 2021 15:25 UTC | Training image download completed. Training in progress. |
| Uploading | Jun 14, 2021 15:25 UTC | Jun 14, 2021 16:25 UTC | Uploading generated training model |
| Failed | Jun 14, 2021 16:25 UTC | Jun 14, 2021 16:25 UTC | Training job failed |
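For anyone who wants to see the same per-phase timeline for their own job, here is a minimal sketch using boto3's `describe_training_job` (the job name below is the one from the error at the top of the thread; substitute your own):

```python
import boto3

sm = boto3.client("sagemaker")

# Substitute the training job name printed by your own estimator
job = sm.describe_training_job(
    TrainingJobName="pytorch-training-2021-06-14-19-49-54-773"
)

# Each transition carries the phase name, its start/end times, and a message
for t in job["SecondaryStatusTransitions"]:
    print(t["Status"], t["StartTime"], t.get("EndTime"), t["StatusMessage"])
```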

Hi guys,

Unfortunately, it seems that this particular training job is demanding on the server. What worked for me was restarting the kernel and running the previous cells again.

DeepLearning.ai have reported the issue to AWS, who said that they are investigating. For now, the workaround is to restart the kernel and try to run the job again, as you did.

Please also refer to the following thread, as it ended up working out.


It’s been two weeks since you said that AWS is investigating, and it’s still happening today =/

I’m doing this for work, and it’s pretty hard to stay motivated when the labs take several hours just to fail like this. Since we only have a four-hour window before the lab environment becomes unusable, we can’t really start doing something else while waiting either: the wait is too short to begin a productive task and too long to spend doing nothing.


Got the exact same error: a problem while Uploading the generated training model, and then the job failed.

If the issue is that the training is compute-intensive, I’d suggest lowering the lab’s default number of epochs to reduce the annoyance of sporadic failures after long training times. I tried with 2 epochs and it took ~10 minutes (twice, since the first attempt failed to upload yet again, just like the original 40-minute run).
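A minimal sketch of what lowering the epoch count could look like, assuming the assignment’s `train.py` parses an `epochs` hyperparameter; the role, instance type, framework versions, entry point, and input channel below are illustrative placeholders that should be copied from the notebook’s own estimator cell:

```python
from sagemaker.pytorch import PyTorch

# All settings below are placeholders; reuse the values already defined
# in the assignment notebook. "epochs" is an assumption about the
# hyperparameter name that train.py actually reads.
estimator = PyTorch(
    entry_point="train.py",
    role=role,                      # execution role from the notebook
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    framework_version="1.6.0",
    py_version="py3",
    hyperparameters={"epochs": 2},  # lowered from the lab default
)

# train_s3_uri stands in for the S3 path to the training data defined
# earlier in the notebook; wait=False returns immediately so the job can
# be monitored with estimator.latest_training_job.wait()
estimator.fit(inputs={"train": train_s3_uri}, wait=False)
```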

Pro tip for anyone facing this:
Skip this exercise, upload your assignment (run the last 2 cells), and take the 80/100 while you figure this out. That way you’re not completely stuck on it.


Any progress on this issue? I have been stuck on it for a couple of days already, wasting time restarting again and again without success so far. Does the problem seem to be in the uploading of the generated training model? The uploading takes longer than the training itself.

Is this being actively worked on? It’s Nov 2021 and I am facing the same issue.

I’m facing the same issue. After about 40 minutes, the uploading is still ongoing:

2021-11-22 15:50:14 Starting - Preparing the instances for training
2021-11-22 15:50:14 Downloading - Downloading input data
2021-11-22 15:50:14 Training - Training image download completed. Training in progress…
2021-11-22 16:26:32 Uploading - Uploading generated training model…