Object Detection

Object detection goes beyond classification: it localizes multiple objects in an image and labels each one. Each detection includes a bounding box and a class label.

Task: For each object in an image, predict: (1) What it is (class), (2) Where it is (bounding box coordinates).

Key Concepts

Bounding Box

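Axis-aligned rectangle enclosing an object.

Format: (x, y, width, height), where (x, y) is the top-left corner or the box center, depending on the convention
or (x1, y1, x2, y2), the top-left and bottom-right corners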
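The two formats carry the same information, so converting between them is simple arithmetic. A minimal sketch, assuming (x, y) is the top-left corner; the function names are illustrative and not taken from any library:

```python
def xywh_to_xyxy(box):
    """Convert (x, y, width, height) to (x1, y1, x2, y2).

    Assumes (x, y) is the top-left corner of the box.
    """
    x, y, w, h = box
    return (x, y, x + w, y + h)


def xyxy_to_xywh(box):
    """Convert corner format (x1, y1, x2, y2) to (x, y, width, height)."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)


print(xywh_to_xyxy((10, 20, 30, 40)))  # (10, 20, 40, 60)
print(xyxy_to_xywh((10, 20, 40, 60)))  # (10, 20, 30, 40)
```

If a dataset stores the box center instead of the top-left corner, subtract half the width and height before applying the first function.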

IoU (Intersection over Union)

Measures how much a predicted box overlaps a ground-truth box.

IoU = Area of Overlap / Area of Union
Range: 0 (no overlap) to 1 (identical boxes)
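The formula can be sketched in a few lines of Python. This assumes boxes in (x1, y1, x2, y2) corner format with x2 > x1 and y2 > y1; the function name is illustrative:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


# Two 2x2 boxes offset by 1 in each direction: overlap 1, union 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

The union is computed as the sum of both areas minus the overlap, so the shared region is not counted twice.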

Confidence Score

How confident the model is that a box contains an object of the predicted class.

Combines objectness and classification confidence (in YOLO-style detectors, their product)
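As a rough illustration, a YOLO-style score multiplies the objectness by the probability of the most likely class. This is a sketch under that assumption; other detectors may output per-class scores directly, and the function name is hypothetical:

```python
def detection_confidence(objectness, class_probs):
    """Return (best_class_index, score) for one predicted box.

    Assumes score = objectness * P(class | object), as in YOLO-style
    detectors. Other detector families may define the score differently.
    """
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    return best, objectness * class_probs[best]


cls, score = detection_confidence(0.8, [0.1, 0.7, 0.2])
print(cls, round(score, 2))  # 1 0.56
```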

NMS (Non-Max Suppression)

Removes duplicate detections of the same object.

Keep the highest-confidence box
Suppress remaining boxes that overlap it (IoU > threshold)
Repeat with the boxes that are left
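A minimal sketch of greedy NMS, assuming boxes in (x1, y1, x2, y2) format and a single class. The helper names and the default threshold of 0.5 are illustrative choices, not fixed by any standard:

```python
def _iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-max suppression. Returns indices of the boxes to keep."""
    # Candidate indices, highest confidence first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if _iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

In the example, the second box overlaps the first with IoU of about 0.68, so it is suppressed; the third box is far away and survives. In multi-class settings NMS is usually applied per class.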

Detection Approaches

Two-Stage Detectors (R-CNN Family)

First propose regions, then classify them.

1. Generate candidate boxes (selective search in R-CNN and Fast R-CNN; a Region Proposal Network (RPN) in Faster R-CNN and Mask R-CNN)
2. Classify and refine each proposal
Examples: R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN

✓ High accuracy | ✗ Slower (two stages)

One-Stage Detectors (YOLO, SSD)

Predict boxes and classes in a single pass.

Divide image into a grid (YOLO) or lay out a dense set of predefined anchor boxes (SSD, RetinaNet)
Each cell or anchor predicts bounding boxes + class probabilities
Examples: YOLO (v1-v8), SSD, RetinaNet

✓ Very fast (real-time) | ✗ Slightly lower accuracy

YOLO (You Only Look Once)

One of the most widely used real-time object detectors. Treats detection as a single regression problem, mapping the image directly to box coordinates and class probabilities.

Divide Image

Split into S×S grid (e.g., 7×7)

Predict Boxes

Each cell predicts B bounding boxes, each with a confidence score

Predict Classes

Each cell predicts class probabilities

Apply NMS

Remove duplicate detections
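To make the grid steps concrete, the sketch below follows the original YOLOv1 layout: each cell outputs B boxes of 5 values (x, y, w, h, confidence) plus C class probabilities, with x and y given as offsets inside the cell and w and h as fractions of the image. Later YOLO versions encode boxes differently, and the function names here are illustrative:

```python
def output_shape(S, B, C):
    """Shape of a YOLOv1-style prediction tensor: S x S x (B*5 + C)."""
    return (S, S, B * 5 + C)


def decode_box(row, col, pred, S, img_w, img_h):
    """Convert one cell-relative prediction to pixel corners (x1, y1, x2, y2).

    pred = (x, y, w, h): x, y are offsets within the cell in [0, 1];
    w, h are fractions of the full image width and height.
    """
    x, y, w, h = pred
    cx = (col + x) / S * img_w  # box center in pixels
    cy = (row + y) / S * img_h
    bw = w * img_w
    bh = h * img_h
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)


# YOLOv1 on PASCAL VOC used S=7, B=2, C=20.
print(output_shape(7, 2, 20))  # (7, 7, 30)

# A box centered in the middle cell of a 448x448 image.
print(decode_box(3, 3, (0.5, 0.5, 0.2, 0.4), S=7, img_w=448, img_h=448))
```

After decoding, the boxes from all cells are scored and passed through NMS as described in the last step.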

YOLO Versions

YOLOv1-v3: Original versions, progressively faster and more accurate

YOLOv4-v5: Community improvements, optimizations

YOLOv6-v7: Industrial applications, edge devices

YOLOv8: Released by Ultralytics in 2023 with strong accuracy and speed; newer versions have since followed

Evaluation Metrics

mAP (mean Average Precision)

Primary metric for object detection.

Calculate AP for each class, then average
AP = area under precision-recall curve
Common: mAP@0.5 (IoU threshold of 0.5), mAP@0.5:0.95 (averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05)
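A simplified sketch of the calculation for one class at one IoU threshold. It assumes detections are already sorted by descending confidence and marked as true or false positives (a detection counts as a true positive if it matches an unmatched ground-truth box with IoU at or above the threshold). It uses all-point interpolation of the precision-recall curve; benchmark toolkits differ in the details, and the function names are illustrative:

```python
def average_precision(is_true_positive, num_ground_truth):
    """Area under the interpolated precision-recall curve for one class.

    is_true_positive: list of bools, one per detection, sorted by
    descending confidence.
    """
    if num_ground_truth == 0 or not is_true_positive:
        return 0.0

    tp = 0
    precisions, recalls = [], []
    for rank, hit in enumerate(is_true_positive, start=1):
        tp += 1 if hit else 0
        precisions.append(tp / rank)
        recalls.append(tp / num_ground_truth)

    # Interpolate: precision at each point becomes the max precision
    # at that recall or any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum rectangle areas between successive recall values.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


ap = average_precision([True, False, True], num_ground_truth=2)
print(round(ap, 4))  # 0.8333
```

For mAP@0.5:0.95, the same procedure is repeated at each IoU threshold and the results are averaged.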

FPS (Frames Per Second)

Speed metric for real-time applications.

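Real-time: 30+ FPS
YOLO: roughly 30-150 FPS depending on version, model size, and hardware
Faster R-CNN: roughly 5-10 FPS
Figures are approximate and vary with hardware and input resolution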
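FPS is best measured on the target hardware. A minimal timing sketch, where `infer` stands in for any hypothetical model call that takes one frame; the warm-up count is an arbitrary choice:

```python
import time


def measure_fps(infer, frames, warmup=2):
    """Estimate frames per second of `infer` over a list of frames."""
    # Warm-up runs so one-time setup costs do not skew the timing.
    for frame in frames[:warmup]:
        infer(frame)

    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start

    return len(frames) / elapsed if elapsed > 0 else float("inf")


# Stand-in workload instead of a real detector.
fps = measure_fps(lambda frame: sum(range(10_000)), frames=list(range(50)))
print(f"{fps:.1f} FPS")
```

With a GPU model, the device usually has to be synchronized before reading the clock, otherwise asynchronous execution makes the result look faster than it is.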

Applications

🚗

Autonomous Driving

Detect cars, pedestrians, traffic signs

📹

Surveillance

Track people, detect suspicious activity

🏭

Manufacturing

Quality control, defect detection

🏥

Medical Imaging

Detect tumors, lesions in scans

🛒

Retail

Checkout-free stores, inventory management

🎮

AR/VR

Real-time object tracking

Key Takeaway: Object detection localizes and classifies multiple objects. YOLO is a common choice for real-time applications, while two-stage detectors such as Faster R-CNN can offer higher accuracy when speed isn't critical.