This was originally posted to Stanford’s CS 230 EdStem forum and has been modified to be more general and to remove links.
I’ve spent the last couple of months working on the CS 230 final project using AWS Sagemaker and I wanted to share what I’ve learned so that other students can take advantage of it to help move their projects along quickly. AWS Sagemaker is AWS’s managed machine learning service which offers a wide range of products to help with every step in the machine learning pipeline.
Here are some of the ways that AWS Sagemaker can help you with model development:
- Access to JupyterLab on GPU instances without needing any funky CUDA setup
- Managed training, allowing you to use instances with better GPUs to train your model
- Spot instance training, which takes advantage of EC2 spot instances to save costs when training on GPU instances (I’ve seen machines with 8 V100s available for as low as $5/hr)
- Hyperparameter tuning jobs to help fine-tune your model*
- Real-time collaboration on notebooks*
* = I haven’t actually tried this yet
In this post, I am going to go over getting started in Sagemaker quickly and cost-effectively. There are a lot of different ways you can use Sagemaker, like storing data in Feature Store, but this post will go over the best workflow (in my opinion) for getting a model built and trained as quickly as possible. I’ll also go over some advanced tips for more specific training scenarios.
Getting Started
First, you’ll need to log into your AWS Console and navigate to Sagemaker. If this is your first time, you’ll be prompted to set up a domain; follow the instructions for the quick setup. If you don’t see the option, here are the manual instructions from AWS:
- Open the Sagemaker Console.
- Open the left navigation pane.
- Under Admin configurations, choose Domains.
- Choose Create domain.
- Choose Set up for single user (Quick setup). Your domain and user profile are created automatically.
A Note on Collaboration
If you want multiple people to be able to access your Sagemaker instance, you will need to instead go through the custom setup of the domain. I haven’t done this, but you need to set up IAM users for each team member. If you can configure this, then you and your team can access the same notebooks and edit them together in real time. However, it may take some time to troubleshoot and get working. If you’re comfortable with AWS or if you have some free time, I would go for it. But if you just want a space to experiment in, just do the quick setup.
Once you’ve made your domain, you can launch Sagemaker Studio: under Applications and IDEs, click Studio, then click “Open Studio” to open up Sagemaker Studio.
From here you can launch a new JupyterLab instance to start experimenting with. Click “JupyterLab,” then click “+ Create JupyterLab Space”. Give it a name and you’ll be taken to a configuration page.
The first option to configure is which underlying EC2 instance to use. You can view the pricing for each of these instances on AWS’s pricing page for Sagemaker. The main thing to know is that the ml.g4dn.xlarge instance has a T4 GPU on it. For some reason I couldn’t choose it immediately after making my domain; I had to wait a day before the quota limit went up. If you run into this, you can always submit a quota increase request with AWS. From this screen you can also configure how much storage you want to use on your instance. You can change this value later without losing data.
Once all these values are configured, click “Start Space” and wait for the instance to spin up. Please remember that you are billed for all the time that your instance spends running! Once the instance has started, click “Open JupyterLab”.
Building an Experiment with JupyterLab
The JupyterLab instance works just like any plain old JupyterLab instance. You can upload files, create notebooks, and run commands through the terminal. There are a few good things to know about using JupyterLab on Sagemaker.
Dependency and Environment Management
There is a base conda environment installed which contains common machine learning packages. You can install dependencies into this base environment, or you can create a new one with conda create -n my-new-env. To activate the new environment, run conda activate my-new-env. You can then install packages with conda install and pip install.
If you use the base conda environment, or you install a lot of packages into your own environment, you may notice that conda gets really slow when solving its environment. By replacing the solver with the libmamba solver, you can make conda considerably faster. Run the following commands to replace the solver:
conda update -n base conda
conda install -n base conda-libmamba-solver
conda config --set solver libmamba
If you are using a custom conda environment, you will also need to register a kernel to use with Jupyter:
conda activate my-new-env
conda install ipykernel
ipython kernel install --user --name=my-new-env
Working with GPUs
All the base conda images should have versions of tensorflow and pytorch installed that are compatible with the instance’s GPU. Each instance has at least CUDA 12.0 installed. GPU-accelerated training should work as normal in notebooks.
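If you want to double-check that your kernel actually sees the GPU before kicking off anything expensive, a quick sanity check in a notebook cell looks something like this:
import torch

print(torch.__version__)              # installed pytorch version
print(torch.version.cuda)             # CUDA version pytorch was built against
print(torch.cuda.is_available())      # should be True on a GPU instance
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4" on ml.g4dn.xlarge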
If you are using a custom conda environment, you might be wondering which version of pytorch to install. The pytorch website lists installable versions for CUDA 12.1, but the minimum system version is 12.0. Plus, some instances, like the one I’m working on, have files for CUDA 12.5 on them, but the base environment has the cuda conda package installed for 12.0. I personally found the exact wheel files for pytorch 2.4.1 and CUDA 12.0 by searching through the conda-forge repository. If you want to install it along with torchaudio and torchvision, the command is conda install pytorch=2.4.1=cuda120_py311he27b719_303 torchvision=0.19.1=cuda120py311h272b9ac_1 torchaudio=2.4.1=cuda_120py311h3050088_1 -c conda-forge.
Note: Building Libraries with CUDA
All the instances have CUDA installed, but you may run into issues when building libraries from source (rather than installing them with conda or pip). This especially applies when building custom CUDA kernels for pytorch. Here are a couple of the issues I ran into, along with some fixes:
- Missing nvcc: conda install nvidia/label/cuda-12.0.0::cuda or conda install nvidia/label/cuda-12.0.0::cuda-toolkit
- Missing cuda_runtime.h: I was able to easily solve this with conda install nvidia/label/cuda-12.0.0::cuda.
- Missing libraries when nvcc runs: sometimes the libraries within the conda torch installation folder are symlinked to files that don’t exist, based on the version of CUDA that torch thinks the system is running. Find the “missing” libraries within the torch directory, remove the old symlink, and symlink to a version 12.x library in the same directory; that should fix the error (see the diagnostic sketch below).
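To figure out which symlinks are actually broken, here’s a small diagnostic sketch. It assumes the libraries live under torch’s bundled lib directory, so verify the path in your own environment before changing anything:
import os
from pathlib import Path
import torch

# torch's bundled shared libraries (assumed location -- check your install)
lib_dir = Path(torch.__file__).parent / "lib"
for entry in sorted(lib_dir.iterdir()):
    if entry.is_symlink() and not entry.exists():  # symlink whose target is missing
        print(f"broken symlink: {entry.name} -> {os.readlink(entry)}")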
Permissions and External Resources
Say you have your data stored in S3 and you want to use it to train your model. To access it from within Sagemaker Studio, you need to give S3 access to your Studio instance. From the Sagemaker page on the Console, select Admin configurations → Domains and navigate to your domain. Scroll down to Authentication and Permissions and copy the part of the string in Space Execution Role starting with “AmazonSageMaker-ExecutionRole-”. Next, search “Roles” in the search box and select IAM Roles. Paste the role name in the role search box and select it. You can then add permissions by selecting “Add Permissions” and choosing to either attach policies (ex. AmazonS3FullAccess) or create your own inline policy for fine-grained access. Once you do this, you should be able to access your data, either by downloading it from S3 or by using something like a custom DataLoader and s3fs to stream the training data from S3 (I’ve implemented this if anyone is interested in seeing it); a minimal sketch is below.
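Here’s a minimal sketch of the streaming approach, assuming one S3 object per sample stored as a serialized tensor. The bucket path, file format, and parsing logic are placeholders you’d swap for your own data layout:
import s3fs
import torch
from torch.utils.data import Dataset, DataLoader

class S3StreamingDataset(Dataset):
    def __init__(self, bucket_prefix):
        self.fs = s3fs.S3FileSystem()          # picks up the instance's IAM role
        self.keys = self.fs.ls(bucket_prefix)  # assumes one object per sample

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        # stream the object from S3 and deserialize it (swap torch.load for your format)
        with self.fs.open(self.keys[idx], "rb") as f:
            return torch.load(f)

loader = DataLoader(S3StreamingDataset("my-data-bucket/train"), batch_size=16)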
Training Models
Say you’ve been working in a notebook and you have a model which you would like to train. To train it on a different instance, you need to create a python training script. The training script should:
- Parse command line arguments (explained more in-depth later)
- Train your model, saving checkpoints to the checkpoint directory
- Save your model to the model directory
Let’s go over these steps:
Parse Arguments
One of the most important steps is parsing the command line arguments. The command line arguments provide many important details to training, including where the data is located, where to save the model, and user-defined hyperparameters. Here’s an example of how to do this in the training script:
import os
import json
import argparse
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--depth", type=int, default=24)
    parser.add_argument("--learning_rate", type=float, default=5e-4)
    parser.add_argument("--weight_decay", type=float, default=0.01)
    parser.add_argument("--batch_size", type=int, default=16)
    parser.add_argument("--num_workers", type=int, default=0)
    parser.add_argument("--every_n_epochs", type=int, default=2)
    parser.add_argument("--warmup_epochs", type=int, default=20)
    parser.add_argument("--warmup_lr", type=float, default=1e-6)
    parser.add_argument("--cosine_epochs", type=int, default=260)
    parser.add_argument("--cosine_t", type=int, default=20)
    parser.add_argument("--cosine_min", type=float, default=1e-5)
    parser.add_argument("--cooldown_epochs", type=int, default=20)
    parser.add_argument("--cooldown_lr", type=float, default=1e-5)
    parser.add_argument("--model_name", type=str, default="model")
    # arguments provided by Sagemaker and the training instance
    parser.add_argument('--hosts', type=list, default=json.loads(os.environ['SM_HOSTS']))
    parser.add_argument('--current-host', type=str, default=os.environ['SM_CURRENT_HOST'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
    parser.add_argument("--checkpoint-path", type=str, default="/opt/ml/checkpoints")
    parser.add_argument('--num-gpus', type=int, default=os.environ['SM_NUM_GPUS'])
    args = vars(parser.parse_args())
The first set of arguments are hyperparameters and other arguments which you can specify when setting up a training job. The last set of arguments are defined by Sagemaker and the training instance. There are three important arguments:
- model_dir is the directory to output the final model to after training. The model is uploaded to S3 once training is finished.
- data_dir is the directory where data is stored. When training, you may pass one or more S3 URIs to the job. The files are downloaded/streamed and cached in data_dir, where they can be used for training/validation (as a sidenote, you should have your data stored in S3 before training).
- checkpoint_path is the directory to save model checkpoints to. This is important when training on spot instances.
Train your model
Next, you train your model as you would normally, reading data from args['data_dir'] and writing checkpoints to args['checkpoint_path']. If you want to get really fancy with it, you can configure TensorBoard to write logs out to an S3 bucket where you can view them during training. This makes it really easy to see whether training is going well or not.
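One way to set this up (there are others) is the SDK’s TensorBoardOutputConfig, which has Sagemaker sync a local log directory to S3 while the job runs. The bucket path below is a placeholder, and the config gets passed to the estimator objects covered later in this post:
from sagemaker.debugger import TensorBoardOutputConfig

tb_config = TensorBoardOutputConfig(
    s3_output_path="s3://my-logs-bucket/tensorboard",          # placeholder bucket
    container_local_output_path="/opt/ml/output/tensorboard",  # where your script writes logs
)
# Pass tensorboard_output_config=tb_config when constructing the estimator, and
# point your TensorBoard/Lightning logger at /opt/ml/output/tensorboard in the
# training script.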
Training on Spot Instances
One way to save money is by using spot instances to train. Spot instances are cheaper EC2 instances which are available for inconsistent periods of time. Sagemaker supports training on spot instances, but you need to adapt your training script to be resumable. When your script starts, it should check the checkpoint directory to see if any checkpoints are saved. If so, it should load that checkpoint and resume training; otherwise, it should start training as normal. Every few epochs, you must save a checkpoint to the checkpoint directory. If the spot instance shuts down, the checkpoint directory will be saved and copied over to the next instance, where training will resume. Here’s an example of how to do this using pytorch lightning:
import os
import lightning as L
def locate_checkpoint(checkpoint_dir):
    print("Locating checkpoint...")
    if not os.path.exists(checkpoint_dir):
        return None
    checkpoints = os.listdir(checkpoint_dir)
    print(checkpoints)
    # return a saved checkpoint if one exists, otherwise start fresh
    return checkpoints[0] if checkpoints else None

if __name__ == "__main__":
    # ...
    checkpoint = locate_checkpoint(args["checkpoint_path"])
    if checkpoint:
        model = MyLitModel.load_from_checkpoint(checkpoint_path=f"{args['checkpoint_path']}/{checkpoint}")
    else:
        model = MyLitModel(
            # default args
        )
    # continue training
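The example above only handles loading. For the saving side, one option is lightning’s ModelCheckpoint callback pointed at the checkpoint directory; the epoch interval and trainer settings below are just examples, continuing the sketch above:
from lightning.pytorch.callbacks import ModelCheckpoint

# Write a checkpoint into the checkpoint directory every few epochs so a
# restarted spot instance has something to resume from.
checkpoint_callback = ModelCheckpoint(
    dirpath=args["checkpoint_path"],
    every_n_epochs=args["every_n_epochs"],
    save_top_k=1,  # keep only the most recent checkpoint, so locate_checkpoint finds one file
)
trainer = L.Trainer(max_epochs=300, callbacks=[checkpoint_callback])
trainer.fit(model, train_dataloader)  # train_dataloader: your own DataLoader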
Save your Model
After training, save your model to model_dir. The model will be uploaded to an S3 bucket once training is finished, where you can download it and try it yourself.
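Continuing the lightning example, this can be as simple as writing the final checkpoint (or a plain state dict) into model_dir; the filenames here are arbitrary:
import os
import torch

# Write the final weights into model_dir; Sagemaker packages this directory
# and uploads it to S3 when the job finishes.
trainer.save_checkpoint(os.path.join(args["model_dir"], "model.ckpt"))
# or, if you only want the raw weights:
torch.save(model.state_dict(), os.path.join(args["model_dir"], "model.pth"))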
Running a Training Job
Now it is time to run a training job. In the python Sagemaker SDK, training jobs are run using Sagemaker Estimator objects. Estimators take in a configuration, including your training script, dependencies, instance type, and hyperparameters, and run the training job. You don’t have to do this from the nice GPU instance you are paying for; you could create a separate CPU-based instance just for initializing training jobs. In fact, you can even initialize jobs from your own computer if you have the correct AWS credentials set.
Pytorch Estimators
Sagemaker offers many types of estimators. The most commonly used is the Pytorch estimator, although I will also go over using custom containers. Here’s the most basic use of the estimator:
from sagemaker.pytorch import PyTorch

pytorch_estimator = PyTorch('pytorch-train.py',
                            source_dir='code/',
                            instance_type='ml.p3.2xlarge',
                            instance_count=1,
                            framework_version='1.8.0',
                            py_version='py3',
                            dependencies=['einops'],
                            hyperparameters={'epochs': 20, 'batch-size': 64, 'learning-rate': 0.1})

pytorch_estimator.fit('s3://my-data-bucket/path/to/my/data')
When this is called:
- A new EC2 instance is spun up for the training job.
- The contents of source_dir are copied to the machine, into the same directory where your script will run.
- Dependencies are installed with pip.
- The training data is downloaded to data_dir (see the channel example after this list).
- The training script is run with the hyperparameters passed as arguments.
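As a side note on the data step above: if you pass fit() a dictionary instead of a single URI, each key becomes its own input channel with its own SM_CHANNEL_* directory on the training instance. The bucket paths here are placeholders:
pytorch_estimator.fit({
    'training': 's3://my-data-bucket/train/',    # exposed as SM_CHANNEL_TRAINING
    'validation': 's3://my-data-bucket/val/'     # exposed as SM_CHANNEL_VALIDATION
})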
Spot training can be configured by setting use_spot_instances, max_run, and max_wait; a sketch of this is below. For more information, google “sagemaker training spot instances” and read the docs (I can’t link them).
There are many different ways to configure the estimator, so I highly recommend looking at the API documentation for the Estimator class to learn more.
Estimators with Custom Containers
There may be cases in which you cannot use the normal PyTorch estimator. For example, you may need to install custom OS packages or build a library from source before using it in your code. To do this, you will need to create a custom container which can run your training code. AWS also has an article on this which you can reference for more information.
First, you need to install Docker on your JupyterLab instance. Download the setup script based on your environment’s version from the sagemaker-training-toolkit repository (a direct link to the folder is also in AWS’s article, which I can’t link). Run the script, then verify that Docker is installed by running docker.
Next, create your container. The container should contain your training code within the /opt/ml/code directory and have the environment variable SAGEMAKER_PROGRAM set to the filename of the training script. Here’s an example Dockerfile for a training script:
FROM pytorch/pytorch:2.4.1-cuda12.1-cudnn9-devel
# ^ select the best image that works for your needs
RUN pip3 install sagemaker-training
# install packages, do whatever you need to do
COPY src/train.py /opt/ml/code/
# define train.py as script entrypoint
ENV SAGEMAKER_PROGRAM train.py
Build the image with docker build -t my-image . --network sagemaker.
To use the image for training, AWS suggests uploading it to Elastic Container Registry (ECR). You may need to further configure permissions if you are uploading from your JupyterLab instance or your own computer.
Once the image has been uploaded to ECR, you can finally train! Here’s the barebones code for training:
import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator

inputs = sagemaker.inputs.TrainingInput('s3://my-data-bucket/path/to/my/data', input_mode='FastFile')

# byoc_image_uri is the ECR URI of the image you pushed above
estimator = Estimator(image_uri=byoc_image_uri,
                      role=get_execution_role(),
                      base_job_name='tf-custom-container-test-job',
                      instance_count=1,
                      instance_type='ml.g4dn.xlarge')

estimator.fit(inputs)
You can also change the instance_type to local_gpu if you want to test the container running on your own device.
That’s It!
I hope that this has helped you see what you can achieve with Sagemaker. Happy to answer any questions you may have!