
Object Detection

Bounding boxes, YOLO, R-CNN family, anchor boxes, and evaluation metrics


Object detection goes beyond classification by identifying what objects are in an image and where they are located, using bounding boxes.

Problem Formulation

Each detection consists of:

  • Bounding box: (x, y, width, height) or (x_min, y_min, x_max, y_max)
  • Class label: Which category the object belongs to
  • Confidence score: How certain the model is about the detection
Intersection over Union (IoU)

IoU measures the overlap between a predicted bounding box and a ground-truth box:

    $$\text{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}}$$

  • IoU = 0: No overlap at all
  • IoU = 1: Perfect match
  • Typical threshold for a "correct" detection: IoU >= 0.5
Two-Stage Detectors

Two-stage detectors first propose regions that might contain objects, then classify and refine those regions.

R-CNN (2014)

1. Use Selective Search to propose ~2000 region candidates
2. Warp each region to a fixed size
3. Pass each region through a CNN to extract features
4. Classify with SVMs and refine boxes with regression

Problem: Very slow (~47 seconds per image), since each of the ~2000 regions is processed independently.
Fast R-CNN (2015)

1. Pass the entire image through a CNN once to get a feature map
2. Project region proposals onto the feature map
3. Use RoI Pooling to extract fixed-size features for each proposal
4. Classify and regress boxes in a single network

Improvement: ~200x faster than R-CNN.
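RoI Pooling is simple to sketch: divide the projected region into a fixed output grid and max-pool each cell, so every proposal yields the same-size feature regardless of its shape. Here is a minimal NumPy illustration (the 2x2 output size and the toy 6x6 feature map are invented for the example; real implementations also handle coordinate quantization and operate per channel):

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool a region of a 2D feature map into a fixed-size grid.

    feature_map: 2D array (H, W)
    roi: (x1, y1, x2, y2) in feature-map coordinates
    output_size: (rows, cols) of the pooled output
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    rows, cols = output_size
    # Split the region into a rows x cols grid of (possibly uneven) cells
    h_edges = np.linspace(0, region.shape[0], rows + 1).astype(int)
    w_edges = np.linspace(0, region.shape[1], cols + 1).astype(int)
    out = np.empty(output_size)
    for i in range(rows):
        for j in range(cols):
            cell = region[h_edges[i]:h_edges[i + 1], w_edges[j]:w_edges[j + 1]]
            out[i, j] = cell.max()
    return out

fmap = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 feature map
pooled = roi_pool(fmap, roi=(1, 1, 5, 4), output_size=(2, 2))
print(pooled)  # always 2x2, whatever the RoI size
```

Because every RoI is reduced to the same shape, the downstream classifier and box regressor can be ordinary fully connected layers.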
Faster R-CNN (2015)

1. Replace Selective Search with a Region Proposal Network (RPN)
2. The RPN shares the CNN backbone, producing proposals nearly for free
3. RoI Pooling, classification, and regression proceed as in Fast R-CNN

Improvement: Near real-time (5-17 FPS) and end-to-end trainable.
Anchor Boxes

Anchor boxes are predefined bounding boxes of various sizes and aspect ratios placed at each spatial position in the feature map. Instead of predicting boxes from scratch, the model predicts *offsets* from these anchors. This makes the prediction problem easier because:

1. The model only needs to learn small adjustments, not absolute coordinates
2. Different anchors capture different object shapes (tall/wide/square)
3. Multiple anchors per position allow detecting multiple objects at the same location
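The offset idea can be made concrete with the parameterization used by Faster R-CNN-style detectors: the network predicts (tx, ty, tw, th) relative to an anchor's center and size. A minimal sketch (the anchor and offset values below are invented for illustration):

```python
import math

def decode_anchor(anchor, offsets):
    """Decode predicted offsets (tx, ty, tw, th) against an anchor.

    anchor: (cx, cy, w, h): center x/y, width, height
    offsets: (tx, ty, tw, th): network predictions
    Returns the decoded box as (cx, cy, w, h).
    """
    cx, cy, w, h = anchor
    tx, ty, tw, th = offsets
    # Center shifts are scaled by the anchor's size; width/height
    # adjustments are exponentiated so sizes stay positive.
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))

anchor = (100.0, 100.0, 64.0, 128.0)   # a tall anchor
offsets = (0.1, -0.05, 0.2, 0.0)       # small predicted adjustments
print(decode_anchor(anchor, offsets))
```

Near-zero offsets reproduce the anchor itself, which is exactly why learning "small adjustments" is easier than regressing absolute coordinates.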

One-Stage Detectors

One-stage detectors skip the region proposal step and predict boxes and classes directly from the feature map in a single pass. They are typically faster but historically less accurate (though this gap has largely closed).

SSD (Single Shot MultiBox Detector, 2016)

  • Predicts at multiple scales using feature maps from different layers
  • Uses anchor boxes at each scale
  • Faster than Faster R-CNN, but lower accuracy on small objects
YOLO Family

You Only Look Once — the most popular object detection framework.

#### YOLO v1 (2016)

  • Divides the image into an S x S grid
  • Each cell predicts B bounding boxes and C class probabilities
  • Single forward pass — extremely fast (45 FPS)
  • Struggles with small objects and groups of small objects
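The grid formulation fixes the output size. With the original paper's settings (S=7, B=2, C=20 for PASCAL VOC), each cell outputs B*5 box values (x, y, w, h, confidence) plus C class probabilities, giving a 7x7x30 tensor:

```python
def yolo_v1_output_shape(S=7, B=2, C=20):
    """Output tensor shape for YOLO v1: each of the S*S grid cells
    predicts B boxes (4 coordinates + 1 confidence each) and C
    class probabilities."""
    return (S, S, B * 5 + C)

print(yolo_v1_output_shape())  # (7, 7, 30)
```

The hard cap of B boxes per cell is also why v1 struggles with groups of small objects: several objects falling into one cell compete for the same few predictions.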
#### YOLO v2/v3 (2017-2018)

  • Added anchor boxes (from Faster R-CNN)
  • Multi-scale predictions (like SSD)
  • Darknet-53 backbone (in v3)
  • Much better accuracy while maintaining speed
#### YOLO v5 / v8 (Ultralytics)

  • PyTorch-based (the original YOLO was Darknet/C)
  • Excellent engineering: auto-augmentation, auto-anchoring, NMS
  • Easy to use, state-of-the-art speed/accuracy tradeoff
  • YOLOv8 adds instance segmentation and pose estimation
#### YOLO v11 / YOLO-World (2024+)

  • Open-vocabulary detection (detect objects from text descriptions)
  • Real-time performance with language grounding

Non-Maximum Suppression (NMS)

Detectors produce many overlapping boxes. NMS filters them:

1. Sort all detections by confidence score
2. Take the highest-scoring box and add it to the final results
3. Remove all remaining boxes whose IoU with the selected box exceeds a threshold (e.g., 0.5)
4. Repeat until no boxes remain

Evaluation Metrics

Precision and Recall (per class)

  • Precision: Of all detections, what fraction are correct?
  • Recall: Of all ground-truth objects, what fraction did we detect?

Average Precision (AP)

The area under the precision-recall curve for a single class, at a given IoU threshold.

Mean Average Precision (mAP)

  • mAP@0.5: Mean of AP across all classes at IoU = 0.5
  • mAP@0.5:0.95: Mean of AP at IoU thresholds from 0.5 to 0.95 (step 0.05) — the primary COCO metric, and much harder to score well on
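To make AP concrete, here is a small sketch that walks confidence-sorted detections (already matched to ground truth at some IoU threshold), accumulates precision and recall, and integrates the precision-recall curve as plain rectangles per recall step. The TP/FP flags below are invented for illustration, and real evaluators such as COCO's additionally interpolate precision:

```python
def average_precision(is_tp, num_gt):
    """AP from detections sorted by descending confidence.

    is_tp: for each detection (best first), True if it matched an
           unmatched ground-truth box at the chosen IoU threshold.
    num_gt: total number of ground-truth objects for this class.
    """
    tp = fp = 0
    points = []  # (recall, precision) pairs as we walk the ranking
    for hit in is_tp:
        tp += hit
        fp += not hit
        points.append((tp / num_gt, tp / (tp + fp)))
    # Area under the PR curve, one rectangle per recall increase
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap

# 5 detections for one class, 3 ground-truth objects
flags = [True, True, False, True, False]
print(f"AP: {average_precision(flags, num_gt=3):.3f}")  # AP: 0.917
```

mAP is then just this value averaged over classes (and, for mAP@0.5:0.95, over IoU thresholds as well).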
```python
# ==============================================================
# Using Ultralytics YOLOv8 for object detection
# pip install ultralytics
# ==============================================================
from ultralytics import YOLO
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np

# Load a pretrained YOLOv8 model
model = YOLO("yolov8n.pt")  # nano model — fast and lightweight

# Run inference on an image
results = model("https://ultralytics.com/images/bus.jpg")

# Parse results
result = results[0]
boxes = result.boxes

print(f"Detected {len(boxes)} objects:\n")
for box in boxes:
    cls_id = int(box.cls[0])
    cls_name = result.names[cls_id]
    confidence = float(box.conf[0])
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    print(f"  {cls_name}: {confidence:.2f} at [{x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}]")

# Visualize results
fig, ax = plt.subplots(1, figsize=(12, 8))
img = Image.open(result.path)
ax.imshow(img)

colors = plt.cm.Set3(np.linspace(0, 1, len(result.names)))
for box in boxes:
    cls_id = int(box.cls[0])
    conf = float(box.conf[0])
    x1, y1, x2, y2 = box.xyxy[0].tolist()
    color = colors[cls_id % len(colors)]
    rect = patches.Rectangle(
        (x1, y1), x2 - x1, y2 - y1,
        linewidth=2, edgecolor=color, facecolor="none"
    )
    ax.add_patch(rect)
    ax.text(x1, y1 - 5, f"{result.names[cls_id]} {conf:.2f}",
            color="white", fontsize=10,
            bbox=dict(boxstyle="round,pad=0.2", facecolor=color, alpha=0.8))

ax.axis("off")
plt.tight_layout()
plt.show()
```
```python
# ==============================================================
# Calculating IoU from scratch
# ==============================================================
def calculate_iou(box1, box2):
    """
    Calculate IoU between two boxes in [x1, y1, x2, y2] format.
    """
    # Intersection coordinates
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])

    # Intersection area (0 if no overlap)
    intersection = max(0, x2 - x1) * max(0, y2 - y1)

    # Union area
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    return intersection / union if union > 0 else 0.0

# Example
pred_box = [100, 100, 200, 200]
gt_box = [120, 110, 210, 210]
iou = calculate_iou(pred_box, gt_box)
print(f"IoU: {iou:.3f}")  # ~0.61

# ==============================================================
# Non-Maximum Suppression from scratch
# ==============================================================
def nms(boxes, scores, iou_threshold=0.5):
    """
    Apply Non-Maximum Suppression.
    boxes: list of [x1, y1, x2, y2]
    scores: list of confidence scores
    Returns: indices of kept boxes
    """
    indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []

    while indices:
        current = indices.pop(0)
        keep.append(current)
        indices = [
            i for i in indices
            if calculate_iou(boxes[current], boxes[i]) < iou_threshold
        ]

    return keep

# Example: multiple overlapping detections of the same object
boxes = [
    [100, 100, 200, 200],  # High confidence
    [105, 95, 205, 195],   # Overlapping, lower confidence
    [110, 100, 210, 200],  # Overlapping, even lower
    [300, 300, 400, 400],  # Different object
]
scores = [0.95, 0.88, 0.75, 0.92]

kept = nms(boxes, scores, iou_threshold=0.5)
print(f"Kept box indices: {kept}")  # [0, 3] — one per cluster
```

Choosing a YOLO Model Size

Ultralytics provides models from Nano to Extra-Large:

  • **YOLOv8n** (Nano): 3.2M params, fastest, good for edge devices
  • **YOLOv8s** (Small): 11.2M params, good speed/accuracy balance
  • **YOLOv8m** (Medium): 25.9M params, general purpose
  • **YOLOv8l** (Large): 43.7M params, high accuracy
  • **YOLOv8x** (XL): 68.2M params, maximum accuracy

Start with the smallest model that meets your accuracy needs.