In the labs, while trying to train the models, I frequently get the error “the kernel has died” and cannot proceed. Is anyone else getting the same error? Any suggestions on what to do?
Yes, try filling out this form:
Please check that you selected the right size for your virtual machine. The recommended instance is ml.m5.2xlarge, which creates a VM with 8 vCPUs and 32 GB of RAM.
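If you are not sure which instance your notebook actually landed on, a quick check from inside the notebook can help. This is only a minimal sketch: `vm_resources` is a hypothetical helper name, it uses only the Python standard library, and it assumes a Linux kernel (the `os.sysconf` names used here are POSIX-only). It also assumes the kernel sees the full resources of the instance, which may not hold in every container setup.

```python
import os


def vm_resources():
    """Return (cpu_count, total_ram_gib) as seen by the running kernel.

    Hypothetical helper for illustration; POSIX-only because it relies
    on os.sysconf.
    """
    cpus = os.cpu_count()
    # Total physical memory in bytes = page size * number of physical pages.
    total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return cpus, total_bytes / 2**30


cpus, ram_gib = vm_resources()
print(f"{cpus} CPUs, {ram_gib:.1f} GiB RAM")
```

On the recommended instance you would expect roughly 8 CPUs and a figure close to 32 GB. If the reported memory is around 4 GB, the notebook is probably still on a small default instance, and training is likely to run out of memory and kill the kernel.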
I had the same problem with Lab 2.
I was unable to select any kernel instance with more than 4 GB of RAM, so the train() functions would fail after exhausting the 4 GB. Trying to pick a larger instance failed with this error:
Failed to start kernel
Failed to launch app [sagemaker-data-scienc-ml-t3-xlarge-926c947fdea9528b6cf58021c71b]. AccessDeniedException: User: arn:aws:sts::452237133870:assumed-role/sagemaker-studio-vpc-firewall-us-east-1-sagemaker-execution-role/SageMaker is not authorized to perform: sagemaker:CreateApp on resource: arn:aws:sagemaker:us-east-1:452237133870:app/d-8p6dilptvyvi/sagemaker-user-profile-us-east-1/kernelgateway/sagemaker-data-scienc-ml-t3-xlarge-926c947fdea9528b6cf58021c71b with an explicit deny in an identity-based policy (Context: RequestId: cc67018b-a996-462a-a88f-e20588b28da6, TimeStamp: 1697733015.3176513, Date: Thu Oct 19 16:30:15 2023)
The only bigger instance you can select is the one with the exact name shown in the lab instructions. When I did the lab, that name was ml.m5.2xlarge.
Thanks, Leonardo. Must have misread that!