Week 2 - Lab - Section 2.1 - Internal Server Error

memark · June 13, 2021, 7:42pm

Three times today I have tried and failed this lab. This is my latest output:

In CloudWatch Logs everything looks ok:

bj.kim · June 14, 2021, 4:21am

Hello @memark,

I have faced the same issue. I will report it.

Kind regards,

Raul · June 14, 2021, 1:00pm

Hi @memark,

Unfortunately, it seems that this particular training job is demanding on the server. What worked for me was restarting the kernel and running the previous cells again.

Let us know how it goes.

Cheers!

memark · June 14, 2021, 1:12pm

I have already tried three times (restarting the whole lab, not just the kernel) and failed. Is there no other solution?

memark · June 14, 2021, 4:29pm

Did another try. It finally succeeded!

Raul · June 14, 2021, 8:26pm

Thanks for letting us know. I have reported it to the DeepLearning.AI team, and I think @bj.kim did too.

Additional Edit:
The DeepLearning.AI team said that they have reported the issue to AWS and that they are investigating. This happens very rarely. For now, the solution is to restart the kernel and run the job again, as you did.
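
Before re-running, it may help to check what SageMaker itself reports for the job, independent of the notebook. Below is a minimal sketch, not an official fix: it assumes you know the training job name and have a boto3 SageMaker client available in the lab. The helper name `summarize_training_job` is made up for illustration, and the client is passed in as a parameter so the function does not depend on any particular session setup.

```python
def summarize_training_job(sm_client, job_name):
    """Return the key status fields SageMaker reports for a training job.

    sm_client is assumed to be a boto3 SageMaker client, e.g.
    boto3.client("sagemaker"); job_name is the training job's name.
    """
    desc = sm_client.describe_training_job(TrainingJobName=job_name)
    return {
        # Overall state, e.g. InProgress, Completed, Failed, Stopped
        "status": desc["TrainingJobStatus"],
        # Finer-grained phase, e.g. Training or Uploading
        "secondary_status": desc.get("SecondaryStatus"),
        # S3 location of model.tar.gz once the upload has finished
        "model_artifacts": desc.get("ModelArtifacts", {}).get("S3ModelArtifacts"),
        # Only present when the job has failed
        "failure_reason": desc.get("FailureReason"),
    }


# Example usage (the job name here is a placeholder):
# import boto3
# print(summarize_training_job(boto3.client("sagemaker"), "my-training-job"))
```

If the status is already `Completed` and a model artifact path is listed, the training itself succeeded and the error is on the notebook side. This does not fix the underlying problem, but it can show whether a full re-run is really needed.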

norcalpedaler · June 27, 2021, 1:59am

I’ve tried this several times today and it is still a problem. I spent a lot of time restarting the kernel and restarting the lab, and have gotten nowhere.

This is not a rare occurrence.

norcalpedaler · June 27, 2021, 8:43pm

Same issue today. CloudWatch suggests the training was successful. Could this be an issue with how I set up the notebook?

Alkanen · June 30, 2021, 1:51pm

Same happens to me. Training takes a reasonable amount of time (30-40 minutes) but uploading the resulting model takes literally hours, which makes absolutely no sense.

Raul · June 30, 2021, 10:48pm

Hi @Alkanen and @norcalpedaler,

Thanks for letting us know, and apologies that you are facing this issue.

I have followed up with the teaching staff team within DeepLearning.AI and AWS and will let you know as soon as they return a status to share.

Raul · July 1, 2021, 2:15pm

@Alkanen and @norcalpedaler:

Can you please share your Vocareum lab link and your AWS account in a private message? I have had feedback from DeepLearning.AI: we’ll forward them to the AWS team.

However, I was told that it might not be a fast fix, as this is a SageMaker problem.

Alkanen · July 1, 2021, 5:46pm

Hi @Raul!

Thanks for getting back to us so quickly. I actually got it working shortly after writing my previous comment (my very next retry actually), so I finished the lab and don’t have the link anymore.

Raul · July 1, 2021, 6:10pm

I’m glad you’ve made it. Nonetheless, this issue is currently under assessment…

ValentinDeLaRosa · July 5, 2021, 3:33am

The same is happening to me; I can’t make it work after several tries…

sachinkl · July 16, 2021, 4:59am

Hello all, I tried twice this evening and failed both times. Will try again tomorrow. Is there no other solution than to just try again? I noticed the generated model file is ~800 MB. Is that why it is taking time?

gary · July 18, 2021, 7:36am

Hi, I also got the same issue. What should I do?

Raul · July 19, 2021, 2:57pm

Hi @gary, @sachinkl and @ValentinDeLaRosa

Can you please share your Vocareum lab link and your AWS account so that I can forward them to the DeepLearning.AI and AWS teams?

However, this is not a fast fix because it is a SageMaker problem. Apologies for the inconvenience.

sachinkl · July 19, 2021, 11:02pm

I tried it early the next morning and it worked fine, so I am good for now.
Thanks, Raul, for the follow-up.

gary · July 24, 2021, 8:32am

It works now. Thanks Raul!
