AWS vs. Azure: A Deep Dive into Model Training – Part 2


In Part 1 of this series, we explored how Azure and AWS take fundamentally different approaches to machine learning project management and data storage.

Azure ML uses a workspace-centric structure with user-level role-based access control (RBAC), where permissions are granted to individuals based on their responsibilities. In contrast, AWS SageMaker adopts a job-centric architecture that decouples user permissions from job execution, granting access at the job level through IAM roles. For data storage, Azure ML relies on datastores and data assets within workspaces to manage connections and credentials behind the scenes, while AWS SageMaker integrates directly with S3 buckets, requiring explicit permission grants for SageMaker execution roles to access data.

Having established how these platforms handle project setup and data access, in Part 2, we’ll examine the compute resources and runtime environments that power the model training jobs.

Compute

Compute is the virtual machine where your model and code run. Along with network and storage, it is one of the fundamental building blocks of cloud computing. Compute resources typically represent the largest cost component of an ML project, as training models—especially large AI models—requires long training times and often specialized, more expensive compute instances (e.g., GPU instances). This is why Azure ML provides a dedicated AzureML Compute Operator role (see details in Part 1) for managing compute resources.

Azure and AWS offer various instance types that differ in the number of CPUs/GPUs, memory, disk space and type, each designed for specific purposes. Both platforms use a pay-as-you-go pricing model, charging only for active compute time.

Azure virtual machine series are named alphabetically; for instance, D-family VMs are designed for general-purpose workloads and meet the requirements of most development and production environments. AWS compute instances are also grouped into families based on their purpose; for instance, the m5 family contains general-purpose instances for SageMaker ML development. The table below compares compute instances offered by Azure and AWS based on their purpose, hourly pricing, and typical use cases. (Please note that pricing varies by region and plan, so I recommend checking the official pricing pages.)

Now that we’ve compared compute pricing in AWS and Azure, let’s explore how the two platforms differ in integrating compute resources into ML systems.

Azure ML

Azure Compute for ML

Compute targets are persistent resources in the Azure ML Workspace, typically created once by the AzureML Compute Operator and reused by the data science team. Since compute resources are cost-intensive, this structure allows them to be centrally managed by a role with cloud infrastructure expertise, while data scientists and engineers focus on development work.

Azure offers a spectrum of compute target options designed for ML development and deployment, depending on the scale of the workload. A compute instance is a single-node machine suitable for interactive development and testing in the Jupyter notebook environment. A compute cluster is another type of compute target that spins up multi-node cluster machines; it can scale out for parallel processing based on workload demand and supports auto-scaling via the min_instances and max_instances parameters. Additionally, there are serverless compute, Kubernetes clusters, and containers that fit different purposes. Here is a useful visual summary that helps you make the decision based on your use case.

image from “[Explore and configure the Azure Machine Learning workspace DP-100](https://www.youtube.com/watch?v=_f5dlIvI5LQ)”

To create an Azure ML managed compute target, we create an AmlCompute object using the code below:

  • type: use "amlcompute" for a compute cluster. Alternatively, use "computeinstance" for single-node interactive development and "kubernetes" for AKS clusters.
  • name: specify the compute target name.
  • size: specify the instance size.
  • min_instances and max_instances (optional): set the range of instances allowed to run simultaneously.
  • idle_time_before_scale_down (optional): automatically shut down the compute cluster when idle to avoid incurring unnecessary costs.
from azure.ai.ml.entities import AmlCompute

# Create a compute cluster (assumes an authenticated MLClient is available as ml_client)
cpu_cluster = AmlCompute(
    name="cpu-cluster",
    type="amlcompute",
    size="Standard_DS3_v2",
    min_instances=0,
    max_instances=4,
    idle_time_before_scale_down=120
)

# Create or update the compute
ml_client.compute.begin_create_or_update(cpu_cluster)

Once the compute resource is created, anyone in the shared Workspace can use it by simply referencing its name in an ML job, making it easily accessible for team collaboration.

from azure.ai.ml import command

# Use the persisted compute "cpu-cluster" in the job
job = command(
    code='./src',
    command='python code.py',
    compute='cpu-cluster',
    display_name='train-custom-env',
    experiment_name='training'
)

# Submit the job
ml_client.jobs.create_or_update(job)

AWS SageMaker AI

AWS Compute Instance

Compute resources are managed by a standalone AWS service, EC2 (Elastic Compute Cloud). When using these compute resources in SageMaker, developers must explicitly configure the instance type for each job; compute instances are then created on demand and terminated when the job finishes. This approach gives developers more flexibility over compute selection for each task, but requires more infrastructure knowledge to select and manage the appropriate compute resources. For example, available instance types differ by job type: ml.t3.medium and ml.t3.large are commonly used for powering SageMaker notebooks in interactive development environments, but they are not available for training jobs, which require more powerful instance types from the m5, c5, p3, or g4dn families.

As shown in the code snippet below, AWS SageMaker specifies the compute instance and the number of instances running simultaneously as job parameters. A compute instance with the ml.m5.xlarge type is created during job execution and charged based on the job runtime.

from sagemaker.estimator import Estimator

# The compute instance type and count are specified as job parameters
estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1
)

SageMaker jobs spin up on-demand instances by default. They are billed by the second and provide guaranteed capacity for running time-sensitive jobs. For jobs that can tolerate interruptions and higher latency, spot instances are a more cost-effective option that utilizes unused compute capacity. The downside is the additional waiting period when no spot instances are available. We use the code snippet below to enable spot instances for a training job.

  • use_spot_instances: set to True to use spot instances; otherwise the job defaults to on-demand instances
  • max_wait: the maximum amount of time you are willing to wait for available spot instances (waiting time is not charged)
  • max_run: the maximum amount of training time allowed for the job
  • checkpoint_s3_uri: the S3 URI path for saving model checkpoints, so that training can safely resume after an interruption
estimator = Estimator(
    image_uri=image_uri,
    role=role,  
    instance_type="ml.m5.xlarge", 
    instance_count=1,
    use_spot_instances=True, 
    max_run=3600,
    max_wait=7200,  
    checkpoint_s3_uri="s3://demo-bucket/model/checkpoints"
)

What does this mean in practice?

  • Azure ML: Azure’s persistent compute approach enables centralized management and sharing across multiple developers, letting data scientists focus on model development rather than infrastructure management.
  • AWS SageMaker AI: SageMaker requires developers to explicitly define the compute instance type for each job, providing more flexibility but also demanding deeper infrastructure knowledge of instance types, costs, and availability constraints.


Environment

An environment defines where the code or job runs, including the operating system, software packages, Docker image, and environment variables. While compute covers the underlying infrastructure and hardware selection, the environment setup is crucial for ensuring consistent and reproducible behavior across development and production, mitigating package conflicts and dependency issues when the same code is executed in different runtime setups by different developers. Azure ML and SageMaker both support using their curated environments as well as setting up custom environments.

Azure ML

Similar to Data and Compute, Environment is considered a type of resource and asset in the Azure ML Workspace. Azure ML offers a comprehensive list of curated environments for popular Python frameworks (e.g., PyTorch, TensorFlow, scikit-learn), designed for CPU or GPU/CUDA targets.

The code snippet below retrieves the list of all curated environments in Azure ML. They generally follow a naming convention that includes the framework name and version, operating system, Python version, and compute target (CPU/GPU); e.g., AzureML-sklearn-1.0-ubuntu20.04-py38-cpu indicates scikit-learn 1.0 running on Ubuntu 20.04 with Python 3.8 for CPU compute.

envs = ml_client.environments.list()
for env in envs:
    print(env.name)
    
    
# >>> Azure ML Curated Environments
"""
AzureML-AI-Studio-Development
AzureML-ACPT-pytorch-1.13-py38-cuda11.7-gpu
AzureML-ACPT-pytorch-1.12-py38-cuda11.6-gpu
AzureML-ACPT-pytorch-1.12-py39-cuda11.6-gpu
AzureML-ACPT-pytorch-1.11-py38-cuda11.5-gpu
AzureML-ACPT-pytorch-1.11-py38-cuda11.3-gpu
AzureML-responsibleai-0.21-ubuntu20.04-py38-cpu
AzureML-responsibleai-0.20-ubuntu20.04-py38-cpu
AzureML-tensorflow-2.5-ubuntu20.04-py38-cuda11-gpu
AzureML-tensorflow-2.6-ubuntu20.04-py38-cuda11-gpu
AzureML-tensorflow-2.7-ubuntu20.04-py38-cuda11-gpu
AzureML-sklearn-1.0-ubuntu20.04-py38-cpu
AzureML-pytorch-1.10-ubuntu18.04-py38-cuda11-gpu
AzureML-pytorch-1.9-ubuntu18.04-py37-cuda11-gpu
AzureML-pytorch-1.8-ubuntu18.04-py37-cuda11-gpu
AzureML-sklearn-0.24-ubuntu18.04-py37-cpu
AzureML-lightgbm-3.2-ubuntu18.04-py37-cpu
AzureML-pytorch-1.7-ubuntu18.04-py37-cuda11-gpu
AzureML-tensorflow-2.4-ubuntu18.04-py37-cuda11-gpu
AzureML-Triton
AzureML-Designer-Score
AzureML-VowpalWabbit-8.8.0
AzureML-PyTorch-1.3-CPU
"""

To run the training job in a curated environment, we create an environment object by referencing its name and version, then pass it as a job parameter.

# Get a curated environment
environment = ml_client.environments.get("AzureML-sklearn-1.0-ubuntu20.04-py38-cpu", version=44)

# Use the curated environment in Job
job = command(
    code=".",
    command="python train.py",
    environment=environment,
    compute="cpu-cluster"
)

ml_client.jobs.create_or_update(job)

Alternatively, create a custom environment from a Docker image registered in Docker Hub using the code snippet below.

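The snippet below is a minimal sketch, assuming a public base image (the image name is illustrative) and an authenticated MLClient available as ml_client:

from azure.ai.ml import command
from azure.ai.ml.entities import Environment

# Register a custom environment from a Docker image (image name is illustrative)
custom_env = Environment(
    name="custom-docker-env",
    image="docker.io/library/python:3.10",
    description="Custom environment built from a Docker Hub image",
)
ml_client.environments.create_or_update(custom_env)

# Use the custom environment in a job
job = command(
    code=".",
    command="python train.py",
    environment=custom_env,
    compute="cpu-cluster"
)

ml_client.jobs.create_or_update(job)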

AWS SageMaker AI

SageMaker’s environment configuration is tightly coupled with job definitions, offering three levels of customization to establish the OS, frameworks, and packages required for job execution. These are Built-in Algorithm, Bring Your Own Script (script mode), and Bring Your Own Container (BYOC), ranging from the simplest but most rigid option to the most complex but most customizable one.

Built-in Algorithms

AWS SageMaker Built-in Algorithm

This is the lowest-effort option for developers to train and deploy machine learning models at scale in AWS SageMaker. Azure currently does not offer an equivalent built-in algorithm approach through its Python SDK (as of February 2026).

SageMaker encapsulates the machine learning algorithm, along with its Python library and framework dependencies, within an estimator object. For example, here we instantiate a KMeans estimator by specifying the algorithm-specific hyperparameter k and passing the training data to fit the model. The training job then spins up an ml.m5.large compute instance, and the trained model is saved to the output location.
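A minimal sketch of this built-in algorithm workflow is shown below; the k value and output path are illustrative, train_data is assumed to be a NumPy float32 array, and role is assumed to be an existing SageMaker execution role.

from sagemaker import KMeans

# Built-in KMeans estimator; k and the output path are illustrative
kmeans = KMeans(
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://demo-bucket/model/",
    k=10,
)

# Convert the training data (NumPy float32 array) into the RecordSet format
# expected by built-in algorithms, then launch the training job
train_records = kmeans.record_set(train_data)
kmeans.fit(train_records)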

Bring Your Own Script

The bring-your-own-script approach (also known as script mode or bring your own model) allows developers to leverage SageMaker’s prebuilt containers for popular Python machine learning frameworks such as scikit-learn, PyTorch, and TensorFlow. It provides the flexibility of customizing the training job through your own script without the need to manage the job execution environment, making it the most popular choice when using specialized algorithms not included in SageMaker’s built-in options.

In the example below, we instantiate an estimator using the scikit-learn framework by providing a custom training script train.py and the model’s hyperparameters, along with the framework version and Python version.

from sagemaker.sklearn import SKLearn

sk_estimator = SKLearn(
    entry_point="train.py",
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    py_version="py3",
    framework_version="1.2-1",
    script_mode=True,
    hyperparameters={"estimators": 20},
)

# Train the estimator
sk_estimator.fit({"train": training_data})

Bring Your Own Container

This is the approach with the highest level of customization, allowing developers to bring a custom environment as a Docker image. It suits scenarios that rely on unsupported Python frameworks, specialized packages, or other programming languages (e.g., R, Java). The workflow involves building a Docker image that contains all required package dependencies and model training scripts, then pushing it to Elastic Container Registry (ECR), AWS’s container registry service equivalent to Docker Hub.

In the code below, we specify the custom Docker image URI as a parameter to create the estimator, then fit it with the training data.

from sagemaker.estimator import Estimator

image_uri = "<your-dkr-ecr-image-uri>:<tag>"

byoc_estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://demo-bucket/model/",
    sagemaker_session=sess,
)

byoc_estimator.fit(training_data)

What does this mean in practice?

  • Azure ML: Provides support for running training jobs using its extensive collection of curated environments that cover popular frameworks such as PyTorch, TensorFlow, and scikit-learn, as well as offering the capability to build and configure custom environments from Docker images for more specialized use cases. However, it is important to note that Azure ML does not currently offer the built-in algorithm approach that encapsulates and packages popular machine learning algorithms directly into the environment in the same way that SageMaker does.
  • AWS SageMaker AI: SageMaker is known for its three levels of customization—Built-in Algorithm, Bring Your Own Script, Bring Your Own Container—which cover a spectrum of developer requirements. Built-in Algorithm and Bring Your Own Script use AWS’s managed environments and integrate tightly with ML algorithms or frameworks. They offer simplicity but are less suitable for highly specialized model training processes.

In Summary

Based on the comparisons of Compute and Environment above, along with what we discussed in AWS vs. Azure: A Deep Dive into Model Training — Part 1 (Project Setup and Data Storage), it becomes clear that the two platforms adopt different design principles to structure their machine learning ecosystems.

Azure ML follows a more modular architecture where Data, Compute, and Environment are treated as independent resources and assets within the Azure ML Workspace. Since they can be configured and managed separately, this approach is more beginner-friendly, especially for users without extensive cloud computing or permission management knowledge. For instance, a data scientist can create a training job by attaching an existing compute in the Workspace without needing infrastructural expertise to manage compute instances.

AWS SageMaker has a steeper learning curve, as multiple services are tightly coupled and orchestrated together as a holistic system for ML job execution. However, this job-centric approach offers clear separation between model training and model deployment environments, as well as the ability to run distributed training at scale. By giving developers more infrastructure control, SageMaker is well suited for large-scale data science and AI teams with high MLOps maturity and a need for CI/CD pipelines.

Take-Home Message

In this series, we compared two of the most popular cloud platforms for scalable model training, Azure and AWS, breaking the comparison down into the following dimensions:

  • Project and Permission Management
  • Data Storage
  • Compute
  • Environment

In Part 1, we discussed high-level project setup and permission management, then talked about storing and accessing the data required for model training.

In Part 2, we examined how Azure ML’s persistent, workspace-centric compute resources differ from AWS SageMaker’s on-demand, job-specific approach. Additionally, we explored environment customization options, from Azure’s curated and custom environments to SageMaker’s three levels of customization—Built-in Algorithm, Bring Your Own Script, and Bring Your Own Container. This comparison highlights Azure ML’s modular, beginner-friendly architecture versus SageMaker’s integrated, job-centric design, which offers greater scalability and infrastructure control for teams with MLOps requirements.
