
Neural Architecture Search

NAS concepts (search space, search strategy, performance estimation), search strategies (random, RL-based, evolutionary, differentiable/DARTS), EfficientNet (compound scaling), hardware-aware NAS, once-for-all networks, and practical NAS tools

~45 min

Neural Architecture Search (NAS)

Designing neural network architectures has traditionally been a manual, expertise-intensive process. Neural Architecture Search (NAS) automates this by treating architecture design as a search problem: define a space of possible architectures, search through it efficiently, and evaluate candidates to find the best one.

NAS has produced some of the most successful architectures in deep learning, including NASNet, EfficientNet, and MobileNetV3.

The Three Pillars of NAS

Every NAS method has three components:

1. Search Space - the set of possible architectures (what can be built)
2. Search Strategy - how to explore the space efficiently (random, RL, evolutionary, gradient-based)
3. Performance Estimation - how to evaluate a candidate architecture without fully training it (early stopping, weight sharing, proxy tasks)

The interaction between these three determines NAS effectiveness and cost.
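The interaction of the three pillars can be sketched as a generic search loop. Everything below is a toy stand-in, not a real NAS system: the space is a flat list of op names, the strategy is random sampling, and the "performance estimator" is an arbitrary cheap heuristic.

```python
import random

random.seed(0)

# (1) Search space: an architecture is a list of operation names
SEARCH_SPACE = ["conv3x3", "conv5x5", "sep_conv3x3", "max_pool3x3", "skip"]

def sample_architecture(n_ops=4):
    """(2) Search strategy (here: random sampling)."""
    return [random.choice(SEARCH_SPACE) for _ in range(n_ops)]

def estimate_performance(arch):
    """(3) Performance estimation: a cheap proxy score.
    A real system would train briefly or share weights; this toy
    heuristic just prefers separable convolutions."""
    return sum(1.0 if op == "sep_conv3x3" else 0.5 for op in arch)

best_arch, best_score = None, float("-inf")
for _ in range(50):                       # search budget
    arch = sample_architecture()
    score = estimate_performance(arch)    # cheap proxy, not full training
    if score > best_score:
        best_arch, best_score = arch, score

print("Best architecture:", best_arch, "score:", best_score)
```

Swapping out `sample_architecture` or `estimate_performance` while keeping this loop is essentially how the strategies below differ from one another.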

Search Spaces

Cell-based Search Space

Instead of searching over entire architectures, search for a cell (a small building block) that is repeated to form the full network. This dramatically reduces the search space.

  • Normal cell: Preserves spatial dimensions
  • Reduction cell: Reduces spatial dimensions (stride 2)
  • The full network is built by stacking normal and reduction cells

Operation Space

Each edge in the cell can be one of several operations:

  • Convolutions: 3x3, 5x5, dilated, depthwise separable
  • Pooling: max, average
  • Identity (skip connection)
  • Zero (no connection)

Macro vs Micro Search

Approach             Searches for              Space size   Cost
Macro                Entire network topology   Enormous     Very high
Micro (cell-based)   Cell structure only       Small        Manageable
Hierarchical         Both cell and network     Medium       Medium

Search Strategies

Random Search

A surprisingly competitive baseline: randomly sample architectures and evaluate them. Random search works well because many architectures in a well-designed search space perform similarly.

Reinforcement Learning (NASNet, 2017)

An RNN controller generates architecture descriptions. The controller is trained with REINFORCE, using the validation accuracy of each generated architecture as the reward signal. This is very expensive: the original NASNet search used 500 GPUs for 4 days.
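The REINFORCE update can be sketched with a toy controller. This is a deliberate simplification: instead of an RNN, the controller is a factorized categorical distribution (one set of logits per architecture slot), and `fake_reward` stands in for the validation accuracy of a trained architecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy controller: one categorical distribution per architecture slot.
# (The real NASNet controller is an RNN; this is a simplified sketch.)
N_SLOTS, N_OPS = 4, 5
logits = nn.Parameter(torch.zeros(N_SLOTS, N_OPS))
opt = torch.optim.Adam([logits], lr=0.1)

def fake_reward(arch):
    """Stand-in for validation accuracy of the sampled architecture."""
    return sum(arch) / (N_SLOTS * (N_OPS - 1))  # toy: prefers high op indices

baseline = 0.0
for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    arch = dist.sample()                       # one op index per slot
    reward = fake_reward(arch.tolist())
    baseline = 0.9 * baseline + 0.1 * reward   # moving-average baseline
    # REINFORCE: scale the log-probability of the sampled architecture
    # by its advantage (reward minus baseline)
    loss = -(reward - baseline) * dist.log_prob(arch).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("Learned op preference per slot:", logits.argmax(dim=1).tolist())
```

The cost of the real method comes from the reward: every sample requires training a full architecture to get its validation accuracy.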

Evolutionary Methods (AmoebaNet, 2018)

Maintain a population of architectures. In each generation:

1. Select parent architectures (tournament selection)
2. Mutate (add/remove/change operations)
3. Evaluate offspring
4. Replace the weakest members of the population

Evolutionary methods are more sample-efficient than RL and naturally explore diverse architectures.
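The generational steps above can be sketched as follows; the `fitness` function is an arbitrary toy stand-in for validation accuracy.

```python
import random

random.seed(0)
OPS = ["conv3x3", "conv5x5", "sep_conv3x3", "max_pool3x3", "skip"]

def fitness(arch):
    """Stand-in for validation accuracy (arbitrary toy scoring)."""
    return sum(len(op) for op in arch)

def mutate(arch):
    """Change one randomly chosen operation."""
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(OPS)
    return child

# Initial population of random 4-op cells
population = [[random.choice(OPS) for _ in range(4)] for _ in range(20)]

for generation in range(30):
    # 1. Tournament selection: best of a small random sample
    parent = max(random.sample(population, 5), key=fitness)
    # 2-3. Mutate the parent and evaluate the offspring
    child = mutate(parent)
    # 4. Replace the weakest member of the population
    weakest = min(range(len(population)), key=lambda i: fitness(population[i]))
    population[weakest] = child

best = max(population, key=fitness)
print("Best cell:", best, "fitness:", fitness(best))
```

Replacing the weakest member (rather than the oldest) is one of several possible variants; AmoebaNet's regularized evolution removes the oldest, which the paper found to work better.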

Differentiable NAS (DARTS, 2019)

The breakthrough that made NAS practical. Instead of discrete search, make the search space continuous:

1. Place all possible operations on every edge (a mixed operation)
2. Weight each operation with a learnable architecture parameter alpha
3. Optimize architecture parameters and model weights jointly using gradient descent
4. After search, discretize: keep the operation with the highest alpha on each edge

DARTS reduces search cost from thousands of GPU-days to a single GPU-day.
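The heart of DARTS is the mixed operation. Below is a simplified sketch with a reduced op set; `MixedOp` and `discretize` are illustrative names, not the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """DARTS-style mixed operation: softmax-weighted sum of candidate ops."""
    def __init__(self, C):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Sequential(nn.Conv2d(C, C, 3, padding=1, bias=False), nn.ReLU()),
            nn.Sequential(nn.Conv2d(C, C, 5, padding=2, bias=False), nn.ReLU()),
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Identity(),
        ])
        # Architecture parameters (alpha): one logit per candidate op
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)   # continuous relaxation
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def discretize(self):
        """After search: keep only the op with the highest alpha."""
        return self.ops[self.alpha.argmax().item()]

mixed = MixedOp(C=8)
x = torch.randn(2, 8, 16, 16)
y = mixed(x)
print(tuple(y.shape))  # (2, 8, 16, 16)
```

Because `forward` is differentiable in `alpha`, the architecture parameters receive gradients like any other weights; in full DARTS they are updated on the validation loss while the op weights are updated on the training loss (bilevel optimization).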

python
# === NAS Concepts: Search Space & Evaluation ===
import time

import numpy as np
import torch
import torch.nn as nn

class Zero(nn.Module):
    """Zero operation (no connection)."""
    def forward(self, x):
        return torch.zeros_like(x)

# --- Define a simple cell-based search space ---
OPERATIONS = {
    "conv3x3": lambda C: nn.Sequential(
        nn.Conv2d(C, C, 3, padding=1, bias=False),
        nn.BatchNorm2d(C), nn.ReLU()
    ),
    "conv5x5": lambda C: nn.Sequential(
        nn.Conv2d(C, C, 5, padding=2, bias=False),
        nn.BatchNorm2d(C), nn.ReLU()
    ),
    "sep_conv3x3": lambda C: nn.Sequential(
        nn.Conv2d(C, C, 3, padding=1, groups=C, bias=False),
        nn.Conv2d(C, C, 1, bias=False),
        nn.BatchNorm2d(C), nn.ReLU()
    ),
    "max_pool3x3": lambda C: nn.MaxPool2d(3, stride=1, padding=1),
    "avg_pool3x3": lambda C: nn.AvgPool2d(3, stride=1, padding=1),
    "skip": lambda C: nn.Identity(),
    "zero": lambda C: Zero(),
}

class NASCell(nn.Module):
    """A cell with a specific architecture (list of operations)."""
    def __init__(self, channels, ops_config):
        super().__init__()
        self.ops = nn.ModuleList([
            OPERATIONS[op](channels) for op in ops_config
        ])

    def forward(self, x):
        # All operations preserve shape, so their outputs can be summed
        outputs = [op(x) for op in self.ops]
        return sum(outputs)

class NASNetwork(nn.Module):
    """Full network built by stacking cells."""
    def __init__(self, n_cells, channels, ops_config, n_classes=10):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU()
        )
        self.cells = nn.ModuleList([
            NASCell(channels, ops_config) for _ in range(n_cells)
        ])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(channels, n_classes)

    def forward(self, x):
        x = self.stem(x)
        for cell in self.cells:
            x = cell(x)
        x = self.pool(x).flatten(1)
        return self.classifier(x)

    def count_params(self):
        return sum(p.numel() for p in self.parameters())


# --- Random Architecture Search ---
def random_architecture(n_ops=4):
    """Generate a random cell configuration."""
    ops = list(OPERATIONS.keys())
    return [np.random.choice(ops) for _ in range(n_ops)]

def evaluate_architecture(ops_config, n_cells=3, channels=16):
    """Quick evaluation of an architecture (parameter count + dummy forward)."""
    model = NASNetwork(n_cells, channels, ops_config)
    model.eval()  # inference mode for BatchNorm during timing
    params = model.count_params()

    # Measure forward pass time
    x = torch.randn(1, 3, 32, 32)
    start = time.time()
    with torch.no_grad():
        for _ in range(10):
            model(x)
    latency = (time.time() - start) / 10 * 1000  # ms

    return params, latency

# --- Search ---
np.random.seed(42)
n_candidates = 20

print("=== Random Architecture Search ===")
print(f"Operations: {list(OPERATIONS.keys())}")
print(f"Evaluating {n_candidates} random architectures...\n")

results = []
for i in range(n_candidates):
    ops = random_architecture(n_ops=4)
    params, latency = evaluate_architecture(ops)
    results.append({
        "id": i, "ops": ops, "params": params, "latency_ms": latency,
    })

# Sort by efficiency (params * latency)
results.sort(key=lambda x: x["params"] * x["latency_ms"])

print(f"{'Rank':<6} {'Params':>10} {'Latency':>10} {'Operations'}")
print("-" * 60)
for rank, r in enumerate(results[:10], 1):
    ops_str = ", ".join(r["ops"])
    print(f"{rank:<6} {r['params']:>10,} {r['latency_ms']:>9.2f}ms "
          f"{ops_str}")

print(f"\nMost efficient: {results[0]['ops']}")
print(f"Least efficient: {results[-1]['ops']}")
print(f"\nParam range: {results[0]['params']:,} - {results[-1]['params']:,}")

EfficientNet: Compound Scaling

EfficientNet (Tan & Le, 2019) addresses a key question: given a fixed compute budget, how should you scale a network? Prior work scaled one dimension at a time (depth, width, or resolution). EfficientNet scales all three simultaneously using a compound coefficient.

The Compound Scaling Method

Given a baseline architecture, scale with coefficient phi:

  • Depth: d = alpha^phi
  • Width: w = beta^phi
  • Resolution: r = gamma^phi

where alpha, beta, gamma are constants found via grid search such that alpha * beta^2 * gamma^2 ~= 2 (roughly doubling compute per step of phi).

For EfficientNet-B0 (baseline): alpha=1.2, beta=1.1, gamma=1.15

Model   Phi   Params   Top-1 Acc   FLOPs
B0      0     5.3M     77.1%       0.39B
B1      1     7.8M     79.1%       0.70B
B3      3     12M      81.6%       1.8B
B5      5     30M      83.6%       9.9B
B7      7     66M      84.3%       37B
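The compute-doubling constraint can be checked directly from the B0 constants in the text:

```python
alpha, beta, gamma = 1.2, 1.1, 1.15  # EfficientNet-B0 constants

# FLOPs scale roughly as depth * width^2 * resolution^2, so this
# product is the compute growth per unit increase of phi
growth = alpha * beta**2 * gamma**2
print(f"Compute growth per phi step: {growth:.3f}")  # ~1.92, close to 2

for phi in (1, 3, 5, 7):
    d, w, r = alpha**phi, beta**phi, gamma**phi
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```

Because the product is close to 2, each increment of phi roughly doubles FLOPs, which matches the B0-to-B7 FLOPs progression in the table.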

Hardware-Aware NAS

Modern NAS incorporates hardware constraints directly:

  • MnasNet (Google): Optimizes for mobile latency, not just accuracy
  • FBNet (Facebook): Uses a differentiable latency predictor
  • Once-for-All (OFA): Train one supernetwork that contains all sub-networks; extract the best one for any hardware target

Once-for-All Networks

Instead of searching separately for each hardware target:

1. Train a single supernet that supports variable depth, width, and resolution
2. Use progressive shrinking: first train the largest network, then gradually allow smaller sub-networks
3. At deployment time, search for the best sub-network that fits the target hardware constraints

This amortizes the training cost: one training run supports deployment to phones, tablets, servers, and IoT devices.
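A minimal sketch of the elastic-depth idea behind sub-network extraction. `ElasticStack` is a hypothetical toy class: real OFA also makes width and kernel size elastic and trains the supernet with progressive shrinking.

```python
import torch
import torch.nn as nn

class ElasticStack(nn.Module):
    """Toy supernet with elastic depth: a runtime-settable block cutoff.
    (Real OFA also makes width and kernel size elastic; this only
    illustrates sub-network extraction.)"""
    def __init__(self, channels=16, max_depth=4):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.ReLU())
            for _ in range(max_depth)
        ])
        self.active_depth = max_depth

    def forward(self, x):
        # Only the first active_depth blocks run: a sub-network
        for block in self.blocks[:self.active_depth]:
            x = block(x)
        return x

supernet = ElasticStack()
x = torch.randn(1, 16, 8, 8)

# Deployment-time search: enumerate sub-networks and pick the largest
# one that fits a budget (parameter count stands in for measured latency)
for depth in (1, 2, 3, 4):
    supernet.active_depth = depth
    params = sum(p.numel() for b in supernet.blocks[:depth]
                 for p in b.parameters())
    print(f"depth={depth}: {params:,} params, output {tuple(supernet(x).shape)}")
```

The key point: all sub-networks share the same trained weights, so choosing a sub-network at deployment costs a search, not a retraining run.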

python
# === Compound Scaling (EfficientNet Style) ===
import time

import torch
import torch.nn as nn

def make_network(depth_mult, width_mult, resolution, base_channels=32,
                 base_depth=3, n_classes=10):
    """Create a simple CNN with configurable scaling.

    Resolution affects only the input size (see measure_model); thanks to
    AdaptiveAvgPool2d, the architecture depends on depth and width alone."""
    channels = int(base_channels * width_mult)
    depth = int(base_depth * depth_mult)

    layers = [
        nn.Conv2d(3, channels, 3, padding=1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(),
    ]

    for _ in range(depth):
        layers.extend([
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        ])

    layers.extend([
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(channels, n_classes),
    ])

    return nn.Sequential(*layers)

def measure_model(model, resolution, n_runs=20):
    """Measure model parameters and latency."""
    params = sum(p.numel() for p in model.parameters())
    x = torch.randn(1, 3, resolution, resolution)

    model.eval()
    # Warmup
    with torch.no_grad():
        for _ in range(5):
            model(x)

    start = time.time()
    with torch.no_grad():
        for _ in range(n_runs):
            model(x)
    latency = (time.time() - start) / n_runs * 1000

    return params, latency

# === Compound Scaling Experiments ===
# EfficientNet constants (simplified)
alpha = 1.2   # depth multiplier base
beta = 1.1    # width multiplier base
gamma = 1.15  # resolution multiplier base

base_resolution = 32

print("=== Compound Scaling (EfficientNet-Style) ===\n")
print(f"Scaling constants: alpha={alpha}, beta={beta}, gamma={gamma}")
print(f"Base: depth=3, width=32, resolution={base_resolution}\n")

# Compare scaling strategies
strategies = {
    "Depth only": [],
    "Width only": [],
    "Resolution only": [],
    "Compound": [],
}

for phi in range(5):
    # Depth only
    d, w, r = alpha**phi, 1.0, base_resolution
    model = make_network(d, w, r)
    params, lat = measure_model(model, r)
    strategies["Depth only"].append((phi, params, lat))

    # Width only
    d, w, r = 1.0, beta**phi, base_resolution
    model = make_network(d, w, r)
    params, lat = measure_model(model, r)
    strategies["Width only"].append((phi, params, lat))

    # Resolution only
    d, w, r = 1.0, 1.0, int(base_resolution * gamma**phi)
    model = make_network(d, w, r)
    params, lat = measure_model(model, r)
    strategies["Resolution only"].append((phi, params, lat))

    # Compound (all three)
    d, w, r = alpha**phi, beta**phi, int(base_resolution * gamma**phi)
    model = make_network(d, w, r)
    params, lat = measure_model(model, r)
    strategies["Compound"].append((phi, params, lat))

# Print comparison
for strategy_name, results in strategies.items():
    print(f"--- {strategy_name} ---")
    print(f"{'Phi':>4} {'Params':>10} {'Latency':>10} {'Efficiency':>12}")
    for phi, params, lat in results:
        # Efficiency = params per ms of latency (lower is better)
        eff = params / lat if lat > 0 else 0
        print(f"{phi:>4} {params:>10,} {lat:>9.2f}ms {eff:>11,.0f} p/ms")
    print()

# Summary
print("=== Scaling Summary (at phi=4) ===")
print(f"{'Strategy':<20} {'Params':>10} {'Latency':>10}")
print("-" * 42)
for name, results in strategies.items():
    _, params, lat = results[4]
    print(f"{name:<20} {params:>10,} {lat:>9.2f}ms")

print("\nCompound scaling achieves a balanced tradeoff between")
print("model capacity (params) and computational cost (latency).")

Practical NAS Today

For most practitioners, full NAS is unnecessary. Instead, use proven architectures (EfficientNet, ResNet, MobileNet) as starting points and tune depth/width/resolution for your specific task and hardware. Tools like Optuna and Ray Tune can automate this scaling search. Reserve full NAS for when you are building a product where a 1-2% accuracy improvement justifies the engineering investment.

Practical NAS Tools

Optuna

General-purpose hyperparameter optimization that works well for architecture search:

python
import optuna

def objective(trial):
    n_layers = trial.suggest_int("n_layers", 2, 8)
    hidden = trial.suggest_int("hidden", 32, 256)
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    # Build and train model, return validation accuracy
    ...

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

NNI (Microsoft)

Neural Network Intelligence toolkit with built-in NAS support:

  • DARTS, ENAS, ProxylessNAS implementations
  • Search space visualization
  • Multi-trial and one-shot NAS

AutoKeras

NAS for Keras/TensorFlow with a simple API:

python
import autokeras as ak

clf = ak.ImageClassifier(max_trials=10)
clf.fit(x_train, y_train)

When Full NAS Matters

  • Building a product deployed to millions of devices (every % matters)
  • Targeting new hardware where existing architectures are not optimized
  • Working on novel domains where standard architectures underperform
  • Research on architecture design principles

You've reached the end!