Azure ML vs. AWS SageMaker: A Deep Dive into Model Training — Part 1

Microsoft Azure and Amazon Web Services (AWS) are the world's two largest cloud computing platforms, providing database, network, and compute resources at global scale. Together, they hold about 50% of the global enterprise cloud infrastructure services market: AWS at 30% and Azure at 20%. Azure ML and AWS SageMaker are their respective machine learning services, enabling data scientists and ML engineers to develop and manage the entire ML lifecycle, from data preprocessing and feature engineering to model training, deployment, and monitoring. You can create and manage these ML services through the console interface, the cloud CLI, or software development kits (SDKs) in your preferred programming language – the approach discussed in this article.

Azure ML & AWS SageMaker Training Jobs

While they offer similar high-level functionalities, Azure ML and AWS SageMaker have fundamental differences that determine which platform best suits you, your team, or your company. Firstly, consider the ecosystem of the existing data storage, compute resources, and monitoring services. For instance, if your company’s data primarily sits in an AWS S3 bucket, then SageMaker may become a more natural choice for developing your ML services, as it reduces the overhead of connecting to and transferring data across different cloud providers. However, this doesn’t mean that other factors are not worth considering, and we will dive into the details of how Azure ML differs from AWS SageMaker in a common ML scenario—training and building models at scale using jobs.

Although Jupyter notebooks are valuable for experimentation and exploration in an interactive development workflow on a single device, they are not designed for productionization or distribution. Training jobs (and other ML jobs) become essential at this stage of the ML workflow: they deploy the task to multiple cloud instances so it can run for longer and process more data. This requires setting up the data, code, compute instances, and runtime environments to ensure consistent outputs once the task is no longer executed on one local machine. Think of it like the difference between developing a dinner recipe (Jupyter notebook) and hiring a catering team to cook it for 500 customers (ML job): everyone on the catering team needs access to the same ingredients, recipe, and tools, and must follow the same cooking procedure.

Now that we understand the importance of training jobs, let’s look at how they’re defined in Azure ML vs. SageMaker in a nutshell.

Define Azure ML training job

from azure.ai.ml import command

job = command(
    code=...,
    command=...,
    environment=...,
    compute=...,
)

ml_client.jobs.create_or_update(job)

Create SageMaker training job estimator

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=...,
    role=...,
    instance_type=...,
)
 
estimator.fit(training_data_s3_location)

We’ll break down the comparison into the following dimensions:

  • Project and Permission Management
  • Data Storage
  • Compute
  • Environment

In part 1, we will start with comparing the high-level project setup and permission management, then talk about storing and accessing the data required for model training. Part 2 will discuss various compute options under both cloud platforms, and how to create and manage runtime environments for training jobs.

Project and Permission Management

Let’s start by understanding a typical ML workflow in a medium-to-large team of data scientists, data engineers, and ML engineers. Each member may specialize in a specific role and responsibility, and be assigned to one or more projects. For example, a data engineer is tasked with extracting data from the source and storing it in a centralized location for data scientists to process. They don’t need to spin up compute instances for running training jobs; they may have read and write access to the data storage location but don’t necessarily need permission to create GPU instances for heavy workloads. Depending on data sensitivity and their role in an ML project, team members need different levels of access to the data and the underlying cloud infrastructure. We are going to explore how the two cloud platforms structure their resources and services to balance team collaboration with separation of responsibilities.

Azure ML

Project management in Azure ML is Workspace-centric: you start by creating a Workspace (under your Azure subscription and resource group) that stores the relevant resources and assets and is shared across the project team for collaboration.

Permissions to access and manage resources are granted at the user level based on roles – i.e. role-based access control (RBAC). Generic roles in Azure include Owner, Contributor, and Reader. ML-specialized roles include AzureML Data Scientist and AzureML Compute Operator; the latter is responsible for creating and managing compute instances, which are generally the largest cost element in an ML project. The objective of setting up an Azure ML Workspace is to create a contained environment for storing data, compute, models, and other resources, so that only users within the Workspace are granted the relevant access to read or edit data assets, or to use existing and create new compute instances, based on their responsibilities.

In the code snippet below, we connect to the Azure ML workspace through MLClient by passing the subscription ID, resource group, workspace name, and the default credential – Azure follows the hierarchical structure Subscription > Resource Group > Workspace.

Upon workspace creation, associated services like an Azure Storage Account (stores metadata and artifacts and can store training data) and an Azure Key Vault (stores secrets like usernames, passwords, and credentials) are also instantiated automatically.

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

subscription_id = '<YOUR_SUBSCRIPTION_ID>'
resource_group = '<YOUR_RESOURCE_GROUP>'
workspace = '<YOUR_AZUREML_WORKSPACE>'

# Connect to the workspace
credential = DefaultAzureCredential()
ml_client = MLClient(credential, subscription_id, resource_group, workspace)

When developers run this code during an interactive development session, the workspace connection is authenticated with the developer’s personal credentials. They can then create a training job using ml_client.jobs.create_or_update(job), as demonstrated below. To detach personal account credentials in the production environment, it is recommended to authenticate automated pipelines or scheduled jobs with a service principal account. More information can be found in the article “Authenticate in your workspace using a service principal”.
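As a minimal sketch of that setup: a service principal’s credentials are typically supplied through environment variables, which DefaultAzureCredential (via its EnvironmentCredential chain) picks up automatically, so the connection code stays unchanged. The helper below is illustrative only, not part of the Azure SDK:

```python
import os

# Environment variables read by DefaultAzureCredential's EnvironmentCredential
# when authenticating as a service principal.
SERVICE_PRINCIPAL_VARS = ("AZURE_TENANT_ID", "AZURE_CLIENT_ID", "AZURE_CLIENT_SECRET")

def missing_service_principal_vars(env=os.environ):
    """Return the credential variables not yet set, so a pipeline can
    fail fast before attempting to connect to the workspace."""
    return [name for name in SERVICE_PRINCIPAL_VARS if name not in env]
```

With all three variables set in the job’s runtime, DefaultAzureCredential() authenticates as the service principal instead of the signed-in developer, and the MLClient code above needs no changes.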

# Define Azure ML training job
from azure.ai.ml import command

job = command(
    code=...,
    command=...,
    environment=...,
    compute=...,
)

ml_client.jobs.create_or_update(job)

AWS SageMaker

Roles and permissions in SageMaker are designed on a completely different principle, primarily using roles in the AWS Identity and Access Management (IAM) service. Although IAM allows creating user-level (or account-level) access similar to Azure, AWS recommends granting permissions at the job level throughout the ML lifecycle. In this way, your personal AWS permissions are irrelevant at runtime: SageMaker assumes a role (the SageMaker execution role) to access the relevant AWS services, such as S3 buckets, SageMaker pipelines, and the compute instances executing the job.

For example, here is a quick peek at setting up an Estimator with the SageMaker execution role for running a training job.

import sagemaker
from sagemaker.estimator import Estimator

# Get the SageMaker execution role
role = sagemaker.get_execution_role()

# Define the estimator
estimator = Estimator(
    image_uri=image_uri,
    role=role,  # assume the SageMaker execution role during runtime
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# Start training
estimator.fit("s3://my-training-bucket/train/")

This means we can grant role permissions at a fine granularity, e.g. allowing a role to run training jobs in the development environment while never touching production. For example, if the role is given access to an S3 bucket that holds test data and is blocked from the one that holds production data, then a training job that assumes this role has no chance of overwriting the production data by accident.
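To make this concrete, here is a sketch of what such a scoped policy document could look like, built as a plain Python dict (the bucket names are hypothetical; in practice you would attach a policy like this to the execution role via IAM):

```python
def scoped_s3_policy(dev_bucket: str, prod_bucket: str) -> dict:
    """Build an IAM policy document that lets a training role read and write
    a development bucket while explicitly denying the production bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{dev_bucket}",
                    f"arn:aws:s3:::{dev_bucket}/*",
                ],
            },
            {
                # An explicit Deny always overrides any Allow granted elsewhere.
                "Effect": "Deny",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{prod_bucket}",
                    f"arn:aws:s3:::{prod_bucket}/*",
                ],
            },
        ],
    }

policy = scoped_s3_policy("demo-dev-data", "demo-prod-data")
```

Because the Deny statement is explicit, even a broader Allow attached to the same role elsewhere cannot reopen access to the production bucket.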

Permission management in AWS is a sophisticated domain by itself, and I won’t pretend I can fully explain this topic. I recommend reading the article “Permissions management” from the AWS official documentation for more best practices.

What does this mean in practice?

  • Azure ML: Azure’s role-based access control (RBAC) fits companies or teams that manage which users need access to what resources. It is more intuitive to understand and useful for centralized user access control.
  • AWS SageMaker: AWS fits systems that care about which job needs access to what services. Decoupling individual user permissions from job execution enables better automation and MLOps practices, and suits large data science teams with granular job and pipeline definitions and isolated environments.

Data Storage

You may have wondered: can I store the data in the working directory? At least that was my question for a long time, and the answer is still yes if you are experimenting or prototyping with a simple script or notebook in an interactive development environment. But the data storage location becomes important to consider in the context of creating ML jobs.

Since the code runs in a cloud-managed environment or a Docker container separate from your local directory, any locally stored data cannot be accessed when executing pipelines and jobs in SageMaker or Azure ML. This calls for centralized, managed data storage services. In Azure, this is handled through a storage account within the Workspace that supports datastores and data assets.

Datastores contain connection information, while data assets are versioned snapshots of data used for training or inference. AWS, on the other hand, relies heavily on S3 buckets as centralized storage locations that enable secure, durable, cross-region access across different accounts, and users can access data through its unique URI path.

Azure ML

Azure ML data storage

Azure ML treats data as attached resources and assets in the Workspace. One storage account and four built-in datastores are automatically created upon the instantiation of each Workspace, storing files (in Azure File Share) and datasets (in Azure Blob Storage).

Since datastores securely keep the data connection information and automatically handle credentials and identity behind the scenes, they decouple data location and access permissions from the code, so the code remains unchanged even if the underlying data connection changes. Datastores can be accessed through their unique URIs. Here’s an example of creating an Input object with the type uri_file by passing a datastore path.

# Create training data input using a datastore path
from azure.ai.ml import Input

training_data = Input(
    type="uri_file",
    path="azureml://datastores/workspaceblobstore/paths/demo-datasets/train/data.csv",
)

Then this data can be used as the training data for an AutoML classification job.

classification_job = automl.classification(
    compute='aml-cluster',
    training_data=training_data,
    target_column_name='Survived',
    primary_metric='accuracy',
)
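The datastore path used above follows the pattern azureml://datastores/&lt;datastore_name&gt;/paths/&lt;relative_path&gt;. A small helper (illustrative only, not part of the SDK) can build these URIs consistently:

```python
def datastore_uri(datastore_name: str, relative_path: str) -> str:
    """Build an azureml:// datastore URI from a datastore name and a
    path relative to that datastore's root."""
    return f"azureml://datastores/{datastore_name}/paths/{relative_path.lstrip('/')}"

uri = datastore_uri("workspaceblobstore", "demo-datasets/train/data.csv")
# uri == "azureml://datastores/workspaceblobstore/paths/demo-datasets/train/data.csv"
```

Centralizing the URI construction keeps job definitions free of hard-coded storage details when the datastore name changes between environments.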

A data asset is another option to access data in an ML job, especially when it is beneficial to keep track of multiple data versions, so that data scientists can identify the exact data snapshot used for model building or experimentation. Here is example code for creating an Input object with the AssetTypes.URI_FILE type by passing the data asset path “azureml:my_train_data:1” (the data asset name plus version number) and using the mode InputOutputModes.RO_MOUNT for read-only access. You can find more information in the documentation “Access data in a job”.

# Create training data input using a data asset
from azure.ai.ml import Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes

training_data = Input(
    type=AssetTypes.URI_FILE,
    path="azureml:my_train_data:1",
    mode=InputOutputModes.RO_MOUNT,
)
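The azureml:&lt;name&gt;:&lt;version&gt; shorthand used above can be unpacked with a tiny parser; this helper is illustrative only, not an SDK function:

```python
def parse_asset_reference(ref: str) -> tuple[str, str]:
    """Split an 'azureml:<name>:<version>' data asset reference into
    its name and version parts."""
    prefix, name, version = ref.split(":")
    if prefix != "azureml":
        raise ValueError(f"not an azureml asset reference: {ref!r}")
    return name, version

name, version = parse_asset_reference("azureml:my_train_data:1")
# name == "my_train_data", version == "1"
```

Logging the parsed name and version alongside each training run makes it straightforward to trace which data snapshot produced which model.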

AWS SageMaker

AWS SageMaker data storage

AWS SageMaker is tightly integrated with Amazon S3 (Simple Storage Service) for ML workflows: SageMaker training jobs, inference endpoints, and pipelines read input data from S3 buckets and write output data back to them. You may find that creating a SageMaker managed job environment (discussed in Part 2) requires an S3 bucket location as a key parameter; alternatively, a default bucket is created if none is specified.
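For reference, the default bucket that the SageMaker SDK creates follows the naming convention sagemaker-&lt;region&gt;-&lt;account-id&gt;. The helper below only mirrors that convention for illustration; in real code, sagemaker.Session().default_bucket() returns the actual bucket name:

```python
def default_bucket_name(region: str, account_id: str) -> str:
    """Mirror the naming convention SageMaker uses for a session's
    default bucket: sagemaker-<region>-<account-id>."""
    return f"sagemaker-{region}-{account_id}"

bucket = default_bucket_name("us-east-1", "123456789012")
# bucket == "sagemaker-us-east-1-123456789012"
```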

Unlike Azure ML’s Workspace-centric datastore approach, AWS S3 is a standalone data storage service that provides scalable, durable, and secure cloud storage that can be shared across other AWS services and accounts. This offers more flexibility for permission management at the individual folder level, but at the same time requires explicitly granting the SageMaker execution role access to the S3 bucket.

In the code snippet below, we call estimator.fit(train_data_uri) to fit the model on the training data by passing its S3 URI directly; the job then generates the output model and stores it at the specified S3 bucket location. More scenarios can be found in the documentation “Amazon S3 examples using SDK for Python (Boto3)”.

import sagemaker
from sagemaker.estimator import Estimator

# Define S3 paths
train_data_uri = "s3://demo-bucket/train/data.csv"
output_folder_uri = "s3://demo-bucket/model/"

# Use in training job
estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_type="ml.m5.xlarge",
    output_path=output_folder_uri,
)

estimator.fit(train_data_uri)
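The S3 URIs passed above follow the s3://&lt;bucket&gt;/&lt;key&gt; form; splitting them with the standard library makes it easy to hand the parts to boto3 or validate them before a job starts (illustrative helper, not part of the SageMaker SDK):

```python
from urllib.parse import urlparse

def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3://bucket/key URI into its (bucket, key) parts."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")

bucket, key = split_s3_uri("s3://demo-bucket/train/data.csv")
# bucket == "demo-bucket", key == "train/data.csv"
```

Validating the URI up front fails fast on a typo, rather than several minutes into a training job when the input download errors out.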

What does it mean in practice?

  • Azure ML: use datastores to manage data connections, which handle the credential and identity information behind the scenes. This approach decouples data location and access permissions from the code, allowing the code to remain unchanged when the underlying connection changes.
  • AWS SageMaker: use S3 buckets as the primary data storage service for managing input and output data of SageMaker jobs through their URI paths. This approach requires explicit permission management to grant the SageMaker execution role access to the required S3 bucket.

Take-Home Message

This series compares Azure ML and AWS SageMaker for scalable model training, focusing on project setup, permission management, and data storage patterns, so that teams can better align platform choices with their existing cloud ecosystem and preferred MLOps workflows.

In part 1, we compared the high-level project setup and permission management, then covered storing and accessing the data required for model training. Part 2 will discuss the various compute options on both cloud platforms, and how to create and manage runtime environments for training jobs.
