
Object Detection

Bounding boxes, YOLO, R-CNN family, anchor boxes, and evaluation metrics


Object detection goes beyond classification by identifying what objects are in an image and where they are located, using bounding boxes.

Problem Formulation

Each detection consists of:

  • Bounding box: (x, y, width, height) or (x_min, y_min, x_max, y_max)
  • Class label: Which category the object belongs to
  • Confidence score: How certain the model is about the detection
Intersection over Union (IoU)

IoU measures the overlap between a predicted bounding box and a ground-truth box:

    $$\text{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}}$$

  • IoU = 0: No overlap at all
  • IoU = 1: Perfect match
  • Typical threshold for a "correct" detection: IoU >= 0.5
Two-Stage Detectors

Two-stage detectors first propose regions that might contain objects, then classify and refine those regions.

R-CNN (2014)

1. Use Selective Search to propose ~2000 region candidates
2. Warp each region to a fixed size
3. Pass each region through a CNN to extract features
4. Classify with SVMs and refine boxes with regression

Problem: Very slow (~47 seconds per image), since each of the ~2000 regions is processed independently.
Fast R-CNN (2015)

1. Pass the entire image through a CNN once to get a feature map
2. Project region proposals onto the feature map
3. Use RoI Pooling to extract fixed-size features for each proposal
4. Classify and regress boxes in a single network

Improvement: ~200x faster than R-CNN.
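RoI Pooling is simple to sketch: divide the projected region into a fixed output grid and max-pool each cell, so every proposal yields the same-size feature regardless of its shape. Here is a minimal NumPy illustration (the 2x2 output size and the toy 6x6 feature map are invented for the example; real implementations also handle coordinate quantization and operate per channel):

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool a region of a 2D feature map into a fixed-size grid.

    feature_map: 2D array (H, W)
    roi: (x1, y1, x2, y2) in feature-map coordinates
    output_size: (rows, cols) of the pooled output
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    rows, cols = output_size
    # Split the region into a rows x cols grid of (possibly uneven) cells
    h_edges = np.linspace(0, region.shape[0], rows + 1).astype(int)
    w_edges = np.linspace(0, region.shape[1], cols + 1).astype(int)
    out = np.empty(output_size)
    for i in range(rows):
        for j in range(cols):
            cell = region[h_edges[i]:h_edges[i + 1], w_edges[j]:w_edges[j + 1]]
            out[i, j] = cell.max()
    return out

fmap = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 feature map
pooled = roi_pool(fmap, roi=(1, 1, 5, 4), output_size=(2, 2))
print(pooled)  # always 2x2, whatever the RoI size
```

Because every RoI is reduced to the same shape, the downstream classifier and box regressor can be ordinary fully connected layers.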
Faster R-CNN (2015)

1. Replace Selective Search with a Region Proposal Network (RPN)
2. The RPN shares the CNN backbone, producing proposals nearly for free
3. RoI Pooling, classification, and regression proceed as in Fast R-CNN

Improvement: Near real-time (5-17 FPS) and end-to-end trainable.
Anchor Boxes

Anchor boxes are predefined bounding boxes of various sizes and aspect ratios placed at each spatial position in the feature map. Instead of predicting boxes from scratch, the model predicts *offsets* from these anchors. This makes the prediction problem easier because:

1. The model only needs to learn small adjustments, not absolute coordinates
2. Different anchors capture different object shapes (tall/wide/square)
3. Multiple anchors per position allow detecting multiple objects at the same location
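The offset idea can be made concrete with the parameterization used by Faster R-CNN-style detectors: the network predicts (tx, ty, tw, th) relative to an anchor's center and size. A minimal sketch (the anchor and offset values below are invented for illustration):

```python
import math

def decode_anchor(anchor, offsets):
    """Decode predicted offsets (tx, ty, tw, th) against an anchor.

    anchor: (cx, cy, w, h): center x/y, width, height
    offsets: (tx, ty, tw, th): network predictions
    Returns the decoded box as (cx, cy, w, h).
    """
    cx, cy, w, h = anchor
    tx, ty, tw, th = offsets
    # Center shifts are scaled by the anchor's size; width/height
    # adjustments are exponentiated so sizes stay positive.
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))

anchor = (100.0, 100.0, 64.0, 128.0)   # a tall anchor
offsets = (0.1, -0.05, 0.2, 0.0)       # small predicted adjustments
print(decode_anchor(anchor, offsets))
```

Near-zero offsets reproduce the anchor itself, which is exactly why learning "small adjustments" is easier than regressing absolute coordinates.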

One-Stage Detectors

One-stage detectors skip the region proposal step and predict boxes and classes directly from the feature map in a single pass. They are typically faster but historically less accurate (though this gap has largely closed).

SSD (Single Shot MultiBox Detector, 2016)

  • Predicts at multiple scales using feature maps from different layers
  • Uses anchor boxes at each scale
  • Faster than Faster R-CNN, but lower accuracy on small objects
YOLO Family

You Only Look Once — the most popular object detection framework.

#### YOLO v1 (2016)

  • Divides the image into an S x S grid
  • Each cell predicts B bounding boxes and C class probabilities
  • Single forward pass — extremely fast (45 FPS)
  • Struggles with small objects and groups of small objects
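The grid formulation fixes the output size. With the original paper's settings (S=7, B=2, C=20 for PASCAL VOC), each cell outputs B*5 box values (x, y, w, h, confidence) plus C class probabilities, giving a 7x7x30 tensor:

```python
def yolo_v1_output_shape(S=7, B=2, C=20):
    """Output tensor shape for YOLO v1: each of the S*S grid cells
    predicts B boxes (4 coordinates + 1 confidence each) and C
    class probabilities."""
    return (S, S, B * 5 + C)

print(yolo_v1_output_shape())  # (7, 7, 30)
```

The hard cap of B boxes per cell is also why v1 struggles with groups of small objects: several objects falling into one cell compete for the same few predictions.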
#### YOLO v2/v3 (2017-2018)

  • Added anchor boxes (from Faster R-CNN)
  • Multi-scale predictions (like SSD)
  • Darknet-53 backbone (in v3)
  • Much better accuracy while maintaining speed
#### YOLO v5 / v8 (Ultralytics)

  • PyTorch-based (the original YOLO was Darknet/C)
  • Excellent engineering: auto-augmentation, auto-anchoring, NMS
  • Easy to use, state-of-the-art speed/accuracy tradeoff
  • YOLOv8 adds instance segmentation and pose estimation
#### YOLO v11 / YOLO-World (2024+)

  • Open-vocabulary detection (detect objects from text descriptions)
  • Real-time performance with language grounding

Non-Maximum Suppression (NMS)

Detectors produce many overlapping boxes. NMS filters them:

1. Sort all detections by confidence score
2. Take the highest-scoring box and add it to the final results
3. Remove all remaining boxes whose IoU with the selected box exceeds a threshold (e.g., 0.5)
4. Repeat until no boxes remain

Evaluation Metrics

Precision and Recall (per class)

  • Precision: Of all detections, what fraction are correct?
  • Recall: Of all ground-truth objects, what fraction did we detect?

Average Precision (AP)

The area under the precision-recall curve for a single class, at a given IoU threshold.

Mean Average Precision (mAP)

  • mAP@0.5: Mean of AP across all classes at IoU = 0.5
  • mAP@0.5:0.95: Mean of AP at IoU thresholds from 0.5 to 0.95 (step 0.05) — the primary COCO metric, and much harder to score well on
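To make AP concrete, here is a small sketch that walks confidence-sorted detections (already matched to ground truth at some IoU threshold), accumulates precision and recall, and integrates the precision-recall curve as plain rectangles per recall step. The TP/FP flags below are invented for illustration, and real evaluators such as COCO's additionally interpolate precision:

```python
def average_precision(is_tp, num_gt):
    """AP from detections sorted by descending confidence.

    is_tp: for each detection (best first), True if it matched an
           unmatched ground-truth box at the chosen IoU threshold.
    num_gt: total number of ground-truth objects for this class.
    """
    tp = fp = 0
    points = []  # (recall, precision) pairs as we walk the ranking
    for hit in is_tp:
        tp += hit
        fp += not hit
        points.append((tp / num_gt, tp / (tp + fp)))
    # Area under the PR curve, one rectangle per recall increase
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

# 5 detections for one class, 3 ground-truth objects
flags = [True, True, False, True, False]
print(f"AP: {average_precision(flags, num_gt=3):.3f}")  # AP: 0.917
```

mAP is then just this value averaged over classes (and, for mAP@0.5:0.95, over IoU thresholds as well).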
```python
# ==============================================================
# Using Ultralytics YOLOv8 for object detection
# pip install ultralytics
# ==============================================================
from ultralytics import YOLO
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np

# Load a pretrained YOLOv8 model
model = YOLO("yolov8n.pt")  # nano model — fast and lightweight

# Run inference on an image
results = model("https://ultralytics.com/images/bus.jpg")

# Parse results
result = results[0]
boxes = result.boxes

print(f"Detected {len(boxes)} objects:\n")
for box in boxes:
    cls_id = int(box.cls[0])
    cls_name = result.names[cls_id]
    confidence = float(box.conf[0])
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(f"  {cls_name}: {confidence:.2f} at [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}]")

# Visualize results
fig, ax = plt.subplots(1, figsize=(12, 8))
img = Image.open(result.path)
ax.imshow(img)

colors = plt.cm.Set3(np.linspace(0, 1, len(result.names)))
for box in boxes:
    cls_id = int(box.cls[0])
    conf = float(box.conf[0])
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    color = colors[cls_id % len(colors)]
    rect = patches.Rectangle(
        (x1, y1), x2 - x1, y2 - y1,
        linewidth=2, edgecolor=color, facecolor="none"
    )
    ax.add_patch(rect)
    ax.text(x1, y1 - 5, f"{result.names[cls_id]} {conf:.2f}",
            color="white", fontsize=10,
            bbox=dict(boxstyle="round,pad=0.2", facecolor=color, alpha=0.8))

ax.axis("off")
plt.tight_layout()
plt.show()
```
```python
# ==============================================================
# Calculating IoU from scratch
# ==============================================================
def calculate_iou(box1, box2):
    """
    Calculate IoU between two boxes in [x1, y1, x2, y2] format.
    """
    # Intersection coordinates
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    # Intersection area (0 if no overlap)
    intersection = max(0, x2 - x1) * max(0, y2 - y1)

    # Union area
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    return intersection / union if union > 0 else 0.0

# Example
pred_box = [100, 100, 200, 200]
gt_box = [120, 110, 210, 210]
iou = calculate_iou(pred_box, gt_box)
print(f"IoU: {iou:.3f}")  # ~0.61

# ==============================================================
# Non-Maximum Suppression from scratch
# ==============================================================
def nms(boxes, scores, iou_threshold=0.5):
    """
    Apply Non-Maximum Suppression.
    boxes: list of [x1, y1, x2, y2]
    scores: list of confidence scores
    Returns: indices of kept boxes
    """
    indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []

    while indices:
        current = indices.pop(0)
        keep.append(current)
        indices = [
            i for i in indices
            if calculate_iou(boxes[current], boxes[i]) < iou_threshold
        ]

    return keep

# Example: multiple overlapping detections of the same object
boxes = [
    [100, 100, 200, 200],  # High confidence
    [105, 95, 205, 195],   # Overlapping, lower confidence
    [110, 100, 210, 200],  # Overlapping, even lower
    [300, 300, 400, 400],  # Different object
]
scores = [0.95, 0.88, 0.75, 0.92]

kept = nms(boxes, scores, iou_threshold=0.5)
print(f"Kept box indices: {kept}")  # [0, 3] — one per cluster
```

Choosing a YOLO Model Size

Ultralytics provides models from Nano to Extra-Large:

  • **YOLOv8n** (Nano): 3.2M params, fastest, good for edge devices
  • **YOLOv8s** (Small): 11.2M params, good speed/accuracy balance
  • **YOLOv8m** (Medium): 25.9M params, general purpose
  • **YOLOv8l** (Large): 43.7M params, high accuracy
  • **YOLOv8x** (XL): 68.2M params, maximum accuracy

Start with the smallest model that meets your accuracy needs.