
Neural Architecture Search

NAS concepts (search space, search strategy, performance estimation), search strategies (random, RL-based, evolutionary, differentiable/DARTS), EfficientNet (compound scaling), hardware-aware NAS, once-for-all networks, and practical NAS tools

~45 min

Neural Architecture Search (NAS)

Designing neural network architectures has traditionally been a manual, expertise-intensive process. Neural Architecture Search (NAS) automates this by treating architecture design as a search problem: define a space of possible architectures, search through it efficiently, and evaluate candidates to find the best one.

NAS has produced some of the most successful architectures in deep learning, including NASNet, EfficientNet, and MobileNetV3.

The Three Pillars of NAS

Every NAS method has three components:

1. Search Space - the set of possible architectures (what can be built)
2. Search Strategy - how to explore the space efficiently (random, RL, evolutionary, gradient-based)
3. Performance Estimation - how to evaluate a candidate architecture without fully training it (early stopping, weight sharing, proxy tasks)

The interaction between these three determines NAS effectiveness and cost.
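The interaction of the three pillars can be sketched as a generic search loop. Everything below is a toy stand-in, not a real NAS system: the space is a flat list of op names, the strategy is random sampling, and the "performance estimator" is an arbitrary cheap heuristic.

```python
import random

random.seed(0)

# (1) Search space: an architecture is a list of operation names
SEARCH_SPACE = ["conv3x3", "conv5x5", "sep_conv3x3", "max_pool3x3", "skip"]

def sample_architecture(n_ops=4):
    """(2) Search strategy (here: random sampling)."""
    return [random.choice(SEARCH_SPACE) for _ in range(n_ops)]

def estimate_performance(arch):
    """(3) Performance estimation: a cheap proxy score.
    A real system would train briefly or share weights; this toy
    heuristic just prefers separable convolutions."""
    return sum(1.0 if op == "sep_conv3x3" else 0.5 for op in arch)

best_arch, best_score = None, float("-inf")
for _ in range(50):                       # search budget
    arch = sample_architecture()
    score = estimate_performance(arch)    # cheap proxy, not full training
    if score > best_score:
        best_arch, best_score = arch, score

print("Best architecture:", best_arch, "score:", best_score)
```

Swapping out `sample_architecture` or `estimate_performance` while keeping this loop is essentially how the strategies below differ from one another.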

Search Spaces

Cell-based Search Space

Instead of searching over entire architectures, search for a cell (a small building block) that is repeated to form the full network. This dramatically reduces the search space.

  • Normal cell: Preserves spatial dimensions
  • Reduction cell: Reduces spatial dimensions (stride 2)
  • The full network is built by stacking normal and reduction cells

Operation Space

Each edge in the cell can be one of several operations:

  • Convolutions: 3x3, 5x5, dilated, depthwise separable
  • Pooling: max, average
  • Identity (skip connection)
  • Zero (no connection)

Macro vs Micro Search

Approach             Searches for              Space size   Cost
Macro                Entire network topology   Enormous     Very high
Micro (cell-based)   Cell structure only       Small        Manageable
Hierarchical         Both cell and network     Medium       Medium

Search Strategies

Random Search

A surprisingly competitive baseline: randomly sample architectures and evaluate them. Random search works well because many architectures in a well-designed search space perform similarly.

Reinforcement Learning (NASNet, 2017)

An RNN controller generates architecture descriptions. The controller is trained with REINFORCE, using the validation accuracy of each generated architecture as the reward signal. This is very expensive: the original NASNet search used 500 GPUs for 4 days.
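The REINFORCE update can be sketched with a toy controller. This is a deliberate simplification: instead of an RNN, the controller is a factorized categorical distribution (one set of logits per architecture slot), and `fake_reward` stands in for the validation accuracy of a trained architecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy controller: one categorical distribution per architecture slot.
# (The real NASNet controller is an RNN; this is a simplified sketch.)
N_SLOTS, N_OPS = 4, 5
logits = nn.Parameter(torch.zeros(N_SLOTS, N_OPS))
opt = torch.optim.Adam([logits], lr=0.1)

def fake_reward(arch):
    """Stand-in for validation accuracy of the sampled architecture."""
    return sum(arch) / (N_SLOTS * (N_OPS - 1))  # toy: prefers high op indices

baseline = 0.0
for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    arch = dist.sample()                       # one op index per slot
    reward = fake_reward(arch.tolist())
    baseline = 0.9 * baseline + 0.1 * reward   # moving-average baseline
    # REINFORCE: scale the log-probability of the sampled architecture
    # by its advantage (reward minus baseline)
    loss = -(reward - baseline) * dist.log_prob(arch).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("Learned op preference per slot:", logits.argmax(dim=1).tolist())
```

The cost of the real method comes from the reward: every sample requires training a full architecture to get its validation accuracy.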

Evolutionary Methods (AmoebaNet, 2018)

Maintain a population of architectures. In each generation:

1. Select parent architectures (tournament selection)
2. Mutate (add/remove/change operations)
3. Evaluate offspring
4. Replace the weakest members of the population

Evolutionary methods are more sample-efficient than RL and naturally explore diverse architectures.
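The generational steps above can be sketched as follows; the `fitness` function is an arbitrary toy stand-in for validation accuracy.

```python
import random

random.seed(0)
OPS = ["conv3x3", "conv5x5", "sep_conv3x3", "max_pool3x3", "skip"]

def fitness(arch):
    """Stand-in for validation accuracy (arbitrary toy scoring)."""
    return sum(len(op) for op in arch)

def mutate(arch):
    """Change one randomly chosen operation."""
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(OPS)
    return child

# Initial population of random 4-op cells
population = [[random.choice(OPS) for _ in range(4)] for _ in range(20)]

for generation in range(30):
    # 1. Tournament selection: best of a small random sample
    parent = max(random.sample(population, 5), key=fitness)
    # 2-3. Mutate the parent and evaluate the offspring
    child = mutate(parent)
    # 4. Replace the weakest member of the population
    weakest = min(range(len(population)), key=lambda i: fitness(population[i]))
    population[weakest] = child

best = max(population, key=fitness)
print("Best cell:", best, "fitness:", fitness(best))
```

Replacing the weakest member (rather than the oldest) is one of several possible variants; AmoebaNet's regularized evolution removes the oldest, which the paper found to work better.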

Differentiable NAS (DARTS, 2019)

The breakthrough that made NAS practical. Instead of discrete search, make the search space continuous:

1. Place all possible operations on every edge (a mixed operation)
2. Weight each operation with a learnable architecture parameter alpha
3. Optimize architecture parameters and model weights jointly using gradient descent
4. After search, discretize: keep the operation with the highest alpha on each edge

DARTS reduces search cost from thousands of GPU-days to a single GPU-day.
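The heart of DARTS is the mixed operation. Below is a simplified sketch with a reduced op set; `MixedOp` and `discretize` are illustrative names, not the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """DARTS-style mixed operation: softmax-weighted sum of candidate ops."""
    def __init__(self, C):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Sequential(nn.Conv2d(C, C, 3, padding=1, bias=False), nn.ReLU()),
            nn.Sequential(nn.Conv2d(C, C, 5, padding=2, bias=False), nn.ReLU()),
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Identity(),
        ])
        # Architecture parameters (alpha): one logit per candidate op
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)   # continuous relaxation
        return sum(w * op(x) for w, op in zip(weights, self.ops))

    def discretize(self):
        """After search: keep only the op with the highest alpha."""
        return self.ops[self.alpha.argmax().item()]

mixed = MixedOp(C=8)
x = torch.randn(2, 8, 16, 16)
y = mixed(x)
print(tuple(y.shape))  # (2, 8, 16, 16)
```

Because `forward` is differentiable in `alpha`, the architecture parameters receive gradients like any other weights; in full DARTS they are updated on the validation loss while the op weights are updated on the training loss (bilevel optimization).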

python
# === NAS Concepts: Search Space & Evaluation ===
import time

import numpy as np
import torch
import torch.nn as nn

class Zero(nn.Module):
    """Zero operation (no connection)."""
    def forward(self, x):
        return torch.zeros_like(x)

# --- Define a simple cell-based search space ---
OPERATIONS = {
    "conv3x3": lambda C: nn.Sequential(
        nn.Conv2d(C, C, 3, padding=1, bias=False),
        nn.BatchNorm2d(C), nn.ReLU()
    ),
    "conv5x5": lambda C: nn.Sequential(
        nn.Conv2d(C, C, 5, padding=2, bias=False),
        nn.BatchNorm2d(C), nn.ReLU()
    ),
    "sep_conv3x3": lambda C: nn.Sequential(
        nn.Conv2d(C, C, 3, padding=1, groups=C, bias=False),
        nn.Conv2d(C, C, 1, bias=False),
        nn.BatchNorm2d(C), nn.ReLU()
    ),
    "max_pool3x3": lambda C: nn.MaxPool2d(3, stride=1, padding=1),
    "avg_pool3x3": lambda C: nn.AvgPool2d(3, stride=1, padding=1),
    "skip": lambda C: nn.Identity(),
    "zero": lambda C: Zero(),
}

class NASCell(nn.Module):
    """A cell with a specific architecture (list of operations)."""
    def __init__(self, channels, ops_config):
        super().__init__()
        self.ops = nn.ModuleList([
            OPERATIONS[op](channels) for op in ops_config
        ])

    def forward(self, x):
        # All operations preserve shape, so their outputs can be summed
        outputs = [op(x) for op in self.ops]
        return sum(outputs)

class NASNetwork(nn.Module):
    """Full network built by stacking cells."""
    def __init__(self, n_cells, channels, ops_config, n_classes=10):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU()
        )
        self.cells = nn.ModuleList([
            NASCell(channels, ops_config) for _ in range(n_cells)
        ])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(channels, n_classes)

    def forward(self, x):
        x = self.stem(x)
        for cell in self.cells:
            x = cell(x)
        x = self.pool(x).flatten(1)
        return self.classifier(x)

    def count_params(self):
        return sum(p.numel() for p in self.parameters())


# --- Random Architecture Search ---
def random_architecture(n_ops=4):
    """Generate a random cell configuration."""
    ops = list(OPERATIONS.keys())
    return [np.random.choice(ops) for _ in range(n_ops)]

def evaluate_architecture(ops_config, n_cells=3, channels=16):
    """Quick evaluation of an architecture (parameter count + dummy forward)."""
    model = NASNetwork(n_cells, channels, ops_config)
    model.eval()  # inference mode for BatchNorm during timing
    params = model.count_params()

    # Measure forward pass time
    x = torch.randn(1, 3, 32, 32)
    start = time.time()
    with torch.no_grad():
        for _ in range(10):
            model(x)
    latency = (time.time() - start) / 10 * 1000  # ms

    return params, latency

# --- Search ---
np.random.seed(42)
n_candidates = 20

print("=== Random Architecture Search ===")
print(f"Operations: {list(OPERATIONS.keys())}")
print(f"Evaluating {n_candidates} random architectures...\n")

results = []
for i in range(n_candidates):
    ops = random_architecture(n_ops=4)
    params, latency = evaluate_architecture(ops)
    results.append({
        "id": i, "ops": ops, "params": params, "latency_ms": latency,
    })

# Sort by efficiency (params * latency)
results.sort(key=lambda x: x["params"] * x["latency_ms"])

print(f"{'Rank':<6} {'Params':>10} {'Latency':>10} {'Operations'}")
print("-" * 60)
for rank, r in enumerate(results[:10], 1):
    ops_str = ", ".join(r["ops"])
    print(f"{rank:<6} {r['params']:>10,} {r['latency_ms']:>9.2f}ms "
          f"{ops_str}")

print(f"\nMost efficient: {results[0]['ops']}")
print(f"Least efficient: {results[-1]['ops']}")
print(f"\nParam range: {results[0]['params']:,} - {results[-1]['params']:,}")

EfficientNet: Compound Scaling

EfficientNet (Tan & Le, 2019) addresses a key question: given a fixed compute budget, how should you scale a network? Prior work scaled one dimension at a time (depth, width, or resolution). EfficientNet scales all three simultaneously using a compound coefficient.

The Compound Scaling Method

Given a baseline architecture, scale with coefficient phi:

  • Depth: d = alpha^phi
  • Width: w = beta^phi
  • Resolution: r = gamma^phi

where alpha, beta, gamma are constants found via grid search such that alpha * beta^2 * gamma^2 ~= 2 (roughly doubling compute per step of phi).

For EfficientNet-B0 (baseline): alpha=1.2, beta=1.1, gamma=1.15

Model   Phi   Params   Top-1 Acc   FLOPs
B0      0     5.3M     77.1%       0.39B
B1      1     7.8M     79.1%       0.70B
B3      3     12M      81.6%       1.8B
B5      5     30M      83.6%       9.9B
B7      7     66M      84.3%       37B
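The compute-doubling constraint can be checked directly from the B0 constants in the text:

```python
alpha, beta, gamma = 1.2, 1.1, 1.15  # EfficientNet-B0 constants

# FLOPs scale roughly as depth * width^2 * resolution^2, so this
# product is the compute growth per unit increase of phi
growth = alpha * beta**2 * gamma**2
print(f"Compute growth per phi step: {growth:.3f}")  # ~1.92, close to 2

for phi in (1, 3, 5, 7):
    d, w, r = alpha**phi, beta**phi, gamma**phi
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```

Because the product is close to 2, each increment of phi roughly doubles FLOPs, which matches the B0-to-B7 FLOPs progression in the table.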

Hardware-Aware NAS

Modern NAS incorporates hardware constraints directly:

  • MnasNet (Google): Optimizes for mobile latency, not just accuracy
  • FBNet (Facebook): Uses a differentiable latency predictor
  • Once-for-All (OFA): Train one supernetwork that contains all sub-networks; extract the best one for any hardware target

Once-for-All Networks

Instead of searching separately for each hardware target:

1. Train a single supernet that supports variable depth, width, and resolution
2. Use progressive shrinking: first train the largest network, then gradually allow smaller sub-networks
3. At deployment time, search for the best sub-network that fits the target hardware constraints

This amortizes the training cost: one training run supports deployment to phones, tablets, servers, and IoT devices.
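A minimal sketch of the elastic-depth idea behind sub-network extraction. `ElasticStack` is a hypothetical toy class: real OFA also makes width and kernel size elastic and trains the supernet with progressive shrinking.

```python
import torch
import torch.nn as nn

class ElasticStack(nn.Module):
    """Toy supernet with elastic depth: a runtime-settable block cutoff.
    (Real OFA also makes width and kernel size elastic; this only
    illustrates sub-network extraction.)"""
    def __init__(self, channels=16, max_depth=4):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.ReLU())
            for _ in range(max_depth)
        ])
        self.active_depth = max_depth

    def forward(self, x):
        # Only the first active_depth blocks run: a sub-network
        for block in self.blocks[:self.active_depth]:
            x = block(x)
        return x

supernet = ElasticStack()
x = torch.randn(1, 16, 8, 8)

# Deployment-time search: enumerate sub-networks and pick the largest
# one that fits a budget (parameter count stands in for measured latency)
for depth in (1, 2, 3, 4):
    supernet.active_depth = depth
    params = sum(p.numel() for b in supernet.blocks[:depth]
                 for p in b.parameters())
    print(f"depth={depth}: {params:,} params, output {tuple(supernet(x).shape)}")
```

The key point: all sub-networks share the same trained weights, so choosing a sub-network at deployment costs a search, not a retraining run.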

python
# === Compound Scaling (EfficientNet Style) ===
import time

import torch
import torch.nn as nn

def make_network(depth_mult, width_mult, resolution, base_channels=32,
                 base_depth=3, n_classes=10):
    """Create a simple CNN with configurable scaling.

    Resolution affects only the input size (see measure_model); thanks to
    AdaptiveAvgPool2d, the architecture depends on depth and width alone."""
    channels = int(base_channels * width_mult)
    depth = int(base_depth * depth_mult)

    layers = [
        nn.Conv2d(3, channels, 3, padding=1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(),
    ]

    for _ in range(depth):
        layers.extend([
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        ])

    layers.extend([
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(channels, n_classes),
    ])

    return nn.Sequential(*layers)

def measure_model(model, resolution, n_runs=20):
    """Measure model parameters and latency."""
    params = sum(p.numel() for p in model.parameters())
    x = torch.randn(1, 3, resolution, resolution)

    model.eval()
    # Warmup
    with torch.no_grad():
        for _ in range(5):
            model(x)

    start = time.time()
    with torch.no_grad():
        for _ in range(n_runs):
            model(x)
    latency = (time.time() - start) / n_runs * 1000

    return params, latency

# === Compound Scaling Experiments ===
# EfficientNet constants (simplified)
alpha = 1.2   # depth multiplier base
beta = 1.1    # width multiplier base
gamma = 1.15  # resolution multiplier base

base_resolution = 32

print("=== Compound Scaling (EfficientNet-Style) ===\n")
print(f"Scaling constants: alpha={alpha}, beta={beta}, gamma={gamma}")
print(f"Base: depth=3, width=32, resolution={base_resolution}\n")

# Compare scaling strategies
strategies = {
    "Depth only": [],
    "Width only": [],
    "Resolution only": [],
    "Compound": [],
}

for phi in range(5):
    # Depth only
    d, w, r = alpha**phi, 1.0, base_resolution
    model = make_network(d, w, r)
    params, lat = measure_model(model, r)
    strategies["Depth only"].append((phi, params, lat))

    # Width only
    d, w, r = 1.0, beta**phi, base_resolution
    model = make_network(d, w, r)
    params, lat = measure_model(model, r)
    strategies["Width only"].append((phi, params, lat))

    # Resolution only
    d, w, r = 1.0, 1.0, int(base_resolution * gamma**phi)
    model = make_network(d, w, r)
    params, lat = measure_model(model, r)
    strategies["Resolution only"].append((phi, params, lat))

    # Compound (all three)
    d, w, r = alpha**phi, beta**phi, int(base_resolution * gamma**phi)
    model = make_network(d, w, r)
    params, lat = measure_model(model, r)
    strategies["Compound"].append((phi, params, lat))

# Print comparison
for strategy_name, results in strategies.items():
    print(f"--- {strategy_name} ---")
    print(f"{'Phi':>4} {'Params':>10} {'Latency':>10} {'Efficiency':>12}")
    for phi, params, lat in results:
        # Efficiency = params per ms of latency (lower is better)
        eff = params / lat if lat > 0 else 0
        print(f"{phi:>4} {params:>10,} {lat:>9.2f}ms {eff:>11,.0f} p/ms")
    print()

# Summary
print("=== Scaling Summary (at phi=4) ===")
print(f"{'Strategy':<20} {'Params':>10} {'Latency':>10}")
print("-" * 42)
for name, results in strategies.items():
    _, params, lat = results[4]
    print(f"{name:<20} {params:>10,} {lat:>9.2f}ms")

print("\nCompound scaling achieves a balanced tradeoff between")
print("model capacity (params) and computational cost (latency).")

Practical NAS Today

For most practitioners, full NAS is unnecessary. Instead, use proven architectures (EfficientNet, ResNet, MobileNet) as starting points and tune depth/width/resolution for your specific task and hardware. Tools like Optuna and Ray Tune can automate this scaling search. Reserve full NAS for when you are building a product where a 1-2% accuracy improvement justifies the engineering investment.

Practical NAS Tools

Optuna

General-purpose hyperparameter optimization that works well for architecture search:

python
import optuna

def objective(trial):
    n_layers = trial.suggest_int("n_layers", 2, 8)
    hidden = trial.suggest_int("hidden", 32, 256)
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    # Build and train model, return validation accuracy
    ...

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)

NNI (Microsoft)

Neural Network Intelligence toolkit with built-in NAS support:

  • DARTS, ENAS, ProxylessNAS implementations
  • Search space visualization
  • Multi-trial and one-shot NAS

AutoKeras

NAS for Keras/TensorFlow with a simple API:

python
import autokeras as ak

clf = ak.ImageClassifier(max_trials=10)
clf.fit(x_train, y_train)

When Full NAS Matters

  • Building a product deployed to millions of devices (every % matters)
  • Targeting new hardware where existing architectures are not optimized
  • Working on novel domains where standard architectures underperform
  • Research on architecture design principles

You've reached the end!