Course 2, Week 1: Does the SageMaker sklearn processor really support preprocessing at scale?

As far as I understand, the fit method of preprocessing classes like StandardScaler from scikit-learn loads the entire input dataset into memory. When preprocessing with multiple instances / sklearn containers in a SageMaker sklearn processing job, does SageMaker take care of the interaction between the containers? Otherwise, how can we call the fit method on data that doesn't fit into the memory of a single instance?

Hello @Anilsekhar,

As far as I understand, the benefit of using multiple instances with the SKLearnProcessor is that you can decrease your processing time.

If you want to work around the memory limit, you can use the PySparkProcessor, although you may need more instances to resolve the problem. Please refer to this example.

And if you want to resolve the memory limit problem on a single instance, you may need to implement your own logic based on your processing scenario.
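For instance, one common single-instance pattern (a sketch, not SageMaker-specific) is to stream the data in chunks and use scikit-learn's partial_fit, which StandardScaler supports, so the full dataset never has to sit in memory at once:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def fit_scaler_in_chunks(chunks):
    """Fit a StandardScaler incrementally, one chunk at a time."""
    scaler = StandardScaler()
    for chunk in chunks:
        scaler.partial_fit(chunk)  # updates the running mean/variance
    return scaler

# Simulate streaming a large dataset as 4 chunks of 250 rows each.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
chunks = np.array_split(data, 4)

chunked = fit_scaler_in_chunks(chunks)
full = StandardScaler().fit(data)

# The incremental fit matches fitting on the full array.
assert np.allclose(chunked.mean_, full.mean_)
assert np.allclose(chunked.scale_, full.scale_)
```

In a processing script, the chunks would come from something like pandas' `read_csv(..., chunksize=...)` instead of an in-memory array.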

Best regards,


Hi @bj.kim ,

Thanks for replying. When we use SKLearnProcessor with an instance count greater than 1, does it work such that only one instance is used during the fit method, and all the instances are used during transform?

Hi @Anilsekhar,

Thanks for reaching out again.

All the instances will be used, for both training jobs and processing jobs, if we specify multiple instances.

In addition, there is a fit() method for training jobs in the SKLearn Estimator class; please refer to this. And there is a run() method for processing jobs in the SKLearnProcessor class; please refer to this.

Best regards,

Hi @bj.kim ,

  • When we use the Spark processor, I understand how all the instances are used during both the fit and transform methods, since Spark is designed for parallel processing.
  • However, the sklearn library is not designed to run on a cluster of nodes. In this context, I am trying to understand what benefit increasing the number of instances in an sklearn processor provides.
    • The fit method needs to see the entire data. How does having multiple instances help here?

Hi @Anilsekhar,

As you already know, the SKLearnProcessor cannot share information or data among the instances of a processing job, unlike the PySparkProcessor. So we cannot use the SKLearnProcessor if a single file is too big to load into memory.

Yes, you are right that we cannot always get the benefit of parallel processing; it depends on the kind of scenario we have.

However, we can get the benefit of parallel processing if the input is split across multiple files rather than a single file. Please refer to this.
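To illustrate (a rough sketch; the bucket, role, and version values below are placeholders, not from this thread), the multi-file case is typically configured by sharding the S3 input across instances with `s3_data_distribution_type="ShardedByS3Key"`, so each instance receives a disjoint subset of the files:

```python
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Placeholder role; use your own SageMaker execution role.
processor = SKLearnProcessor(
    framework_version="1.2-1",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    instance_type="ml.m5.xlarge",
    instance_count=2,  # each instance processes its own shard of the files
)

processor.run(
    code="preprocess.py",
    inputs=[
        ProcessingInput(
            source="s3://my-bucket/raw/",           # many files, not one big file
            destination="/opt/ml/processing/input",
            s3_data_distribution_type="ShardedByS3Key",  # split files across instances
        )
    ],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output")],
)
```

With the default `"FullyReplicated"` every instance would receive all the files, so sharding is what actually divides the work.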

Best regards,

Hi @bj.kim ,

The link you have shared (SageMaker Processing: how to parallelize when instance_count > 1 ? · Issue #1075 · aws/amazon-sagemaker-examples · GitHub) seems to clarify my doubt. What I have understood is this:

  • We cannot do any preprocessing that requires calculating statistics over the entire dataset when using multiple instances.
    • It may be possible, but it won't be straightforward: we would need to keep track of which hosts exist and which files are allocated to each host, and write our own custom logic to calculate statistics that involve a pass through the entire data.
  • We can apply any instance-level transformation in a straightforward manner.
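To make the first point concrete, here is a minimal sketch (plain NumPy, independent of SageMaker; the shards stand in for the files assigned to each host) of the kind of custom logic that would be needed: each host computes count/mean/variance on its own shard, and a final merge step combines them into global statistics using the parallel variance formula:

```python
import numpy as np

def shard_stats(x):
    """Per-host pass: summary statistics for one shard."""
    return len(x), x.mean(axis=0), x.var(axis=0)

def merge_stats(stats):
    """Combine per-shard (count, mean, variance) triples into
    global statistics without a second pass over the raw data."""
    total = sum(n for n, _, _ in stats)
    g_mean = sum(n * m for n, m, _ in stats) / total
    g_var = sum(n * (v + (m - g_mean) ** 2) for n, m, v in stats) / total
    return total, g_mean, g_var

rng = np.random.default_rng(42)
data = rng.normal(size=(900, 2))
shards = np.array_split(data, 3)  # stand-ins for files on 3 instances

n, mean, var = merge_stats([shard_stats(s) for s in shards])

# Merged statistics match a single pass over the full dataset.
assert n == 900
assert np.allclose(mean, data.mean(axis=0))
assert np.allclose(var, data.var(axis=0))
```

Instance-level transformations (the second point) need no such merge step, since each row can be transformed independently on whichever host holds it.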

Hi @Anilsekhar,

It’s good to hear that your concern is cleared up now.

Happy learning