
AWS SageMaker

SageMaker overview, training jobs, endpoints, pipelines, built-in algorithms, JumpStart, and cost optimization


Amazon SageMaker is AWS's fully managed machine learning platform that covers the entire ML lifecycle — from data labeling and feature engineering to training, tuning, deployment, and monitoring. It is one of the most widely adopted cloud ML platforms in enterprise settings.

SageMaker's Core Value Proposition

SageMaker removes the undifferentiated heavy lifting of ML infrastructure. Instead of managing GPU clusters, Docker containers, and load balancers yourself, you define WHAT you want to train or deploy and SageMaker handles HOW — provisioning instances, managing storage, auto-scaling endpoints, and cleaning up resources.

SageMaker Architecture Overview

SageMaker is organized around several key components:

SageMaker Studio

The integrated IDE for ML development. Provides Jupyter notebooks, experiment tracking, model registry, and pipeline management in a unified web interface.

Training Jobs

Managed training infrastructure that:
  • Spins up ML instances (CPU or GPU) on demand
  • Pulls training data from S3
  • Runs your training script in a Docker container
  • Saves model artifacts back to S3
  • Automatically shuts down instances when training completes
Endpoints

Managed real-time inference infrastructure:
  • Deploys models behind HTTPS endpoints
  • Supports auto-scaling based on traffic
  • Enables A/B testing with production variants
  • Handles model versioning and rollback

Pipelines

ML workflow orchestration:
  • Define multi-step ML workflows (preprocessing, training, evaluation, deployment)
  • Integrates with SageMaker Experiments for tracking
  • Supports conditional steps and approval gates
```python
# SageMaker Training Job — Complete Example
# This shows how to train an XGBoost model on SageMaker

import sagemaker
from sagemaker import Session
from sagemaker.inputs import TrainingInput
from sagemaker.xgboost import XGBoost

# Initialize SageMaker session
session = Session()
role = sagemaker.get_execution_role()  # IAM role for SageMaker
bucket = session.default_bucket()

# --- Step 1: Upload data to S3 ---
train_path = session.upload_data(
    path="data/train.csv",
    bucket=bucket,
    key_prefix="demo/train",
)
val_path = session.upload_data(
    path="data/val.csv",
    bucket=bucket,
    key_prefix="demo/val",
)

# --- Step 2: Configure the training job ---
xgb_estimator = XGBoost(
    entry_point="train.py",           # Your training script
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",     # CPU instance
    framework_version="1.7-1",
    py_version="py3",
    hyperparameters={
        "max_depth": 5,
        "eta": 0.2,
        "gamma": 4,
        "min_child_weight": 6,
        "subsample": 0.8,
        "objective": "binary:logistic",
        "num_round": 200,
    },
    output_path=f"s3://{bucket}/demo/output",
)

# --- Step 3: Launch training ---
xgb_estimator.fit({
    "train": TrainingInput(train_path, content_type="csv"),
    "validation": TrainingInput(val_path, content_type="csv"),
})

# SageMaker provisions an instance, runs training,
# saves model to S3, and terminates the instance.
print(f"Model artifact: {xgb_estimator.model_data}")
```

Deploying to a Real-Time Endpoint

Once a model is trained, deploying it to a real-time endpoint takes a single API call:

```python
# Deploy model to a real-time endpoint
from sagemaker.serializers import CSVSerializer

predictor = xgb_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="my-xgb-endpoint",
    serializer=CSVSerializer(),  # XGBoost container expects CSV input
)

# Make predictions
import numpy as np
test_data = np.array([[25, 50000, 3], [45, 120000, 7]])
predictions = predictor.predict(test_data)
print(f"Predictions: {predictions}")

# IMPORTANT: Delete endpoint when done to stop charges!
predictor.delete_endpoint()
```
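The A/B testing mentioned earlier works through production variants: one endpoint configuration can route a weighted fraction of traffic to each of several models. Below is a minimal sketch of such a configuration; the model names and config name are hypothetical placeholders, and the boto3 call is shown commented out since it requires AWS credentials and existing SageMaker Models.

```python
# Sketch: endpoint config with two weighted production variants.
# Model names and the config name are hypothetical placeholders.
variants = [
    {
        "VariantName": "model-a",
        "ModelName": "my-xgb-model-a",   # an existing SageMaker Model
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 0.9,     # weights are relative
    },
    {
        "VariantName": "model-b",
        "ModelName": "my-xgb-model-b",   # the challenger model
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 0.1,     # small canary share
    },
]

# With credentials in place, the config would be created via boto3:
# import boto3
# sm = boto3.client("sagemaker")
# sm.create_endpoint_config(
#     EndpointConfigName="my-ab-endpoint-config",
#     ProductionVariants=variants,
# )

# Traffic share for each variant = its weight / sum of all weights
total = sum(v["InitialVariantWeight"] for v in variants)
shares = [v["InitialVariantWeight"] / total for v in variants]
print(f"Traffic split: {shares}")
```

Because weights are relative, updating the split later (e.g. shifting traffic to model-b after a successful canary) is a config change, not a redeploy.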

Built-In Algorithms

SageMaker provides optimized implementations of common algorithms that are often faster and more cost-effective than custom implementations:

| Algorithm | Use Case | Key Advantage |
| --- | --- | --- |
| XGBoost | Classification, regression | Distributed training, GPU support |
| Linear Learner | Linear/logistic regression | Highly optimized for large datasets |
| K-Means | Clustering | Distributed training |
| Image Classification | Image recognition | Built on ResNet, transfer learning |
| BlazingText | Text classification, Word2Vec | Orders of magnitude faster |
| DeepAR | Time series forecasting | Handles multiple related time series |
| Object Detection | Finding objects in images | Single-shot detection |
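Using a built-in algorithm differs slightly from the script-mode example earlier: you retrieve the algorithm's container image for your region and set hyperparameters directly, with no training script of your own. A sketch for built-in XGBoost, reusing the session, role, bucket, and data paths from the training example above (not runnable without AWS credentials):

```python
# Sketch: training with the built-in XGBoost algorithm (no entry_point script).
# Reuses session, role, bucket, train_path, val_path from the earlier example.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# Resolve the built-in algorithm's container image for this region
xgb_image = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=session.boto_region_name,
    version="1.7-1",
)

builtin_xgb = Estimator(
    image_uri=xgb_image,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=f"s3://{bucket}/demo/builtin-output",
)
builtin_xgb.set_hyperparameters(
    objective="binary:logistic",
    num_round=200,
    max_depth=5,
)

builtin_xgb.fit({
    "train": TrainingInput(train_path, content_type="csv"),
    "validation": TrainingInput(val_path, content_type="csv"),
})
```

The trade-off: built-ins skip the flexibility of a custom `train.py` in exchange for AWS-maintained, pre-optimized containers.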

SageMaker JumpStart

JumpStart is SageMaker's model hub — a collection of pre-trained models that you can deploy with one click or fine-tune on your data:

  • Foundation models: LLaMA, Falcon, Mistral, Stable Diffusion
  • Task-specific models: Sentiment analysis, named entity recognition, object detection
  • Solution templates: End-to-end ML solutions for common business problems
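Programmatically, JumpStart models can be deployed through the SDK's JumpStartModel class. A sketch under stated assumptions: the model_id shown is illustrative and should be checked against the JumpStart catalog, and the call requires AWS credentials and sufficient instance quota.

```python
# Sketch: deploying a JumpStart model via the SageMaker Python SDK.
# The model_id is illustrative — verify it in the JumpStart catalog.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="huggingface-llm-falcon-7b-instruct-bf16")
predictor = model.deploy()  # provisions a real-time endpoint

response = predictor.predict({"inputs": "Summarize: Amazon SageMaker is ..."})
print(response)

predictor.delete_endpoint()  # as always: delete when done to stop charges
```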
Cost Optimization Strategies

| Strategy | Savings | Trade-off |
| --- | --- | --- |
| Spot Instances for training | Up to 90% | Job may be interrupted |
| SageMaker Savings Plans | Up to 64% | 1-3 year commitment |
| Multi-model Endpoints | Share one endpoint across models | Slightly higher latency |
| Serverless Inference | Pay per request | Cold start latency |
| Right-sizing instances | Variable | Requires benchmarking |
| Managed Warm Pools | Reduce startup time | Ongoing instance cost |
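Of these, managed Spot training is often the quickest win, because it only requires a few extra estimator arguments. A sketch of the relevant settings (the checkpoint S3 path is a hypothetical placeholder):

```python
# Sketch: estimator settings that enable managed Spot training.
# The checkpoint S3 URI is a hypothetical placeholder.
spot_kwargs = {
    "use_spot_instances": True,
    "max_run": 3600,    # max seconds of actual training
    "max_wait": 7200,   # total budget incl. waiting for Spot capacity
    "checkpoint_s3_uri": "s3://my-bucket/checkpoints/",  # resume after interruption
}

# These would be passed to any estimator, e.g.:
# xgb_estimator = XGBoost(entry_point="train.py", role=role, ..., **spot_kwargs)

# Sanity check SageMaker enforces: the wait budget must cover the run time
assert spot_kwargs["max_wait"] >= spot_kwargs["max_run"]
print("Spot settings valid")
```

Checkpointing matters here: if the Spot instance is reclaimed mid-job, SageMaker restarts training, and a script that saves/restores checkpoints avoids losing completed epochs.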

Cost Trap: Idle Endpoints

The most common SageMaker cost mistake is leaving endpoints running after testing. A single ml.m5.xlarge endpoint costs ~$280/month. Note that real-time endpoints cannot auto-scale below one instance, so they never scale to zero on their own. Always delete endpoints when done testing, or use Serverless Inference — which does scale to zero between requests — for intermittent workloads.
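For those intermittent workloads, Serverless Inference replaces the instance-based deploy call with a small configuration object. A sketch of the two knobs involved (the SDK call is commented out since it needs AWS credentials and a trained estimator):

```python
# Sketch: Serverless Inference configuration — pay per request, scales to zero.
serverless_config = {
    "memory_size_in_mb": 2048,  # allowed values: 1024 to 6144 in 1 GB steps
    "max_concurrency": 10,      # concurrent invocations before throttling
}

# With the sagemaker SDK this would be wrapped and passed to deploy():
# from sagemaker.serverless import ServerlessInferenceConfig
# predictor = xgb_estimator.deploy(
#     serverless_inference_config=ServerlessInferenceConfig(**serverless_config),
# )

assert serverless_config["memory_size_in_mb"] in range(1024, 6145, 1024)
print("Serverless config valid")
```

The trade-off from the table applies: a cold endpoint pays start-up latency on the first request after an idle period.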
Putting these pieces together, a pipeline chains preprocessing, training, and a conditional gate:

```python
# SageMaker Pipeline — Multi-Step ML Workflow
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep, TrainingStep
from sagemaker.workflow.conditions import ConditionGreaterThan
from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.functions import JsonGet
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingOutput

# Step 1: Data preprocessing
sklearn_processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

preprocess_step = ProcessingStep(
    name="PreprocessData",
    processor=sklearn_processor,
    code="scripts/preprocess.py",
    outputs=[ProcessingOutput(
        output_name="train",
        source="/opt/ml/processing/train",
    )],
)

# Step 2: Model training, consuming the preprocessing output
train_step = TrainingStep(
    name="TrainModel",
    estimator=xgb_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=preprocess_step.properties.ProcessingOutputConfig
            .Outputs["train"].S3Output.S3Uri,
            content_type="csv",
        ),
    },
)

# Step 3: Conditional deployment (only if accuracy > 0.8).
# Assumes an evaluation step has registered a PropertyFile named
# "evaluation" that exposes metrics.accuracy.
condition = ConditionGreaterThan(
    left=JsonGet(
        step_name=train_step.name,
        property_file="evaluation",
        json_path="metrics.accuracy",
    ),
    right=0.8,
)

cond_step = ConditionStep(
    name="CheckAccuracy",
    conditions=[condition],
    if_steps=[],     # deploy steps would go here
    else_steps=[],   # alert/retrain steps
)

# Create and execute pipeline
pipeline = Pipeline(
    name="my-ml-pipeline",
    steps=[preprocess_step, train_step, cond_step],
    sagemaker_session=session,
)

pipeline.upsert(role_arn=role)
execution = pipeline.start()
print(f"Pipeline execution: {execution.arn}")
```

SageMaker vs DIY Infrastructure

A common question: why pay SageMaker's premium when you can run training on EC2 yourself? The answer: SageMaker handles instance lifecycle, container management, experiment tracking, model versioning, endpoint management, and auto-scaling. For teams of 3+ ML engineers, the productivity gains typically outweigh the ~15-20% cost premium over raw EC2.