facebookresearch/detectron2
Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
Trains computer vision models to detect, segment, and analyze objects in images
Under the hood, the system uses 3 feedback loops, 3 data pools, and 5 control points to manage its runtime behavior.
An 8-component ML training system. 512 files analyzed. Data flows through 5 distinct pipeline stages.
How Data Flows Through the System
Images flow through a multi-stage detection pipeline: raw images are loaded and augmented by DatasetMapper, the backbone (ResNet+FPN) extracts multi-scale features, RPN generates region proposals from these features, ROI heads classify proposals and predict masks/keypoints, and finally predictions are post-processed through NMS before evaluation.
- Load and preprocess images — DatasetMapper loads images from COCO dataset, applies augmentations (resize, flip), normalizes pixel values, and packages with ground truth annotations into training dicts [COCO annotations → TrainingBatch] (config: INPUT.MIN_SIZE_TRAIN, INPUT.MAX_SIZE_TRAIN, dataloader.train.mapper.augmentations)
- Extract backbone features — ResNet backbone processes input images through conv layers to extract features at multiple scales (res2-res5), then FPN combines these into a feature pyramid with consistent 256-channel dimensions [ImageTensor → FeatureMaps] (config: MODEL.BACKBONE.NAME, MODEL.RESNETS.DEPTH, MODEL.FPN.IN_FEATURES)
- Generate region proposals — RPN takes FPN features and generates object proposals by classifying dense anchor grids as foreground/background, then refines anchor coordinates using box regression [FeatureMaps → Boxes] (config: MODEL.RPN.IN_FEATURES, MODEL.ANCHOR_GENERATOR.SIZES, MODEL.RPN.PRE_NMS_TOPK_TRAIN)
- ROI feature extraction and classification — StandardROIHeads pools features from FPN levels for each proposal using ROIAlign, then runs box classification, regression, and optionally mask/keypoint prediction through separate heads [FeatureMaps → Instances] (config: MODEL.ROI_HEADS.NAME, MODEL.ROI_HEADS.NUM_CLASSES, MODEL.MASK_ON)
- Post-processing and evaluation — Raw predictions are filtered through NMS to remove duplicate detections, then COCOEvaluator computes AP metrics by matching predictions to ground truth using IoU thresholds [Instances → Evaluation metrics] (config: MODEL.ROI_HEADS.SCORE_THRESH_TEST, MODEL.ROI_HEADS.NMS_THRESH_TEST, TEST.DETECTIONS_PER_IMAGE)
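The final post-processing stage above can be illustrated with a minimal greedy NMS over (x1, y1, x2, y2) boxes. This is a plain-Python sketch of the idea, not detectron2's batched, per-class GPU implementation:

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlaps above thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= thresh]
    return keep
```

The `thresh` here plays the role of MODEL.ROI_HEADS.NMS_THRESH_TEST: two near-identical detections of the same object collapse to the higher-scoring one.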
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- detectron2/structures/image_list.py — Tensor[B, 3, H, W] in BGR format with pixel values normalized using ImageNet statistics. Raw images are loaded, resized, and normalized into tensors, fed through the backbone, then discarded after feature extraction.
- detectron2/modeling/backbone/fpn.py — Dict[str, Tensor] with keys like 'p2', 'p3', 'p4', 'p5' mapping to feature tensors of shape [B, 256, H/stride, W/stride]. The backbone extracts multi-scale features, FPN combines them into a pyramid, and RPN and ROI heads consume specific pyramid levels.
- detectron2/structures/instances.py — Structure containing pred_boxes: Boxes[N, 4], scores: Tensor[N], pred_classes: Tensor[N], pred_masks: Tensor[N, H, W] (optional). ROI heads predict raw detections, NMS filters overlapping boxes, and the evaluator computes metrics against ground truth.
- detectron2/structures/boxes.py — Tensor[N, 4] in (x1, y1, x2, y2) format representing bounding box coordinates. RPN generates initial proposals, box regression refines coordinates, and NMS removes duplicates.
- detectron2/data/common.py — List[Dict] where each dict contains 'image': Tensor[3, H, W] and 'instances': Instances with gt_boxes, gt_classes, gt_masks. DatasetMapper creates dicts from annotations, the DataLoader batches them, and the model processes each batch and computes losses.
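The training-dict contract can be shown with a toy packaging function. This is a hedged, plain-Python stand-in (nested lists instead of tensors, a dict instead of an Instances object); the helper name and fields are illustrative:

```python
def to_training_dict(image, annotations):
    """Package one example in the shape the data pipeline expects:
    an 'image' array [C][H][W] plus ground-truth boxes and classes."""
    return {
        "image": image,                    # stand-in for Tensor[3, H, W]
        "height": len(image[0]),           # H
        "width": len(image[0][0]),         # W
        "instances": {                     # stand-in for an Instances object
            "gt_boxes": [a["bbox"] for a in annotations],
            "gt_classes": [a["category_id"] for a in annotations],
        },
    }
```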
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
All input feature maps from the backbone have spatial dimensions that align with the expected strides (4, 8, 16, 32, 64), but FPN never validates that each pyramid level is exactly half the height/width of the level below it (e.g. that 'p3' is 2x smaller than 'p2')
If this fails: If backbone produces feature maps with unexpected spatial ratios, FPN's lateral connections and top-down pathway will misalign features, causing detection heads to process spatially inconsistent features and produce wrong bounding box coordinates
detectron2/modeling/backbone/fpn.py:FPN.forward
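A validation of the kind the text says is missing could look like this. It is a hypothetical helper, not code from the repo: it checks that consecutive pyramid levels halve in each spatial dimension, matching the stride sequence 4, 8, 16, 32, 64.

```python
def check_level_ratios(feats):
    """feats: ordered dict {'p2': (H, W), 'p3': (H, W), ...}.
    Returns True when each level is exactly half the previous one."""
    sizes = list(feats.values())
    for (h1, w1), (h2, w2) in zip(sizes, sizes[1:]):
        if h1 != 2 * h2 or w1 != 2 * w2:
            return False
    return True
```

Running it on a well-formed pyramid returns True; a backbone that produced a misaligned level would be caught before the lateral connections misregister features.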
Input images are in standard 8-bit RGB/BGR format with pixel values in [0,255] range, but DatasetMapper applies ImageNet normalization (mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0]) without checking pixel value distribution
If this fails: If input contains HDR images, medical images with 16-bit depth, or pre-normalized images in [0,1] range, the normalization will push pixel values far outside expected ranges, causing backbone features to saturate and models to fail silently
detectron2/data/dataset_mapper.py:DatasetMapper.__call__
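The failure mode is easy to see numerically. A sketch of the per-channel normalization using the Caffe-style BGR means from the configs (std of 1.0 leaves values unscaled):

```python
# Caffe-style BGR channel means from the default configs.
PIXEL_MEAN = [103.530, 116.280, 123.675]
PIXEL_STD = [1.0, 1.0, 1.0]

def normalize_pixel(value, channel):
    """Per-channel normalization as applied to raw [0, 255] pixels."""
    return (value - PIXEL_MEAN[channel]) / PIXEL_STD[channel]

# A raw 8-bit pixel lands in a sane range: normalize_pixel(128, 0) is ~24.5.
# A pre-normalized [0, 1] pixel does not: normalize_pixel(0.5, 0) is ~-103,
# which is the silent distribution shift described above.
```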
Proposal boxes from RPN are in absolute image coordinates (x1,y1,x2,y2) format matching the original image size, but ROI heads never validate that proposal coordinates are within image boundaries
If this fails: If RPN generates out-of-bounds proposals due to anchor misconfiguration or image resizing bugs, ROIAlign will sample features from invalid memory locations, causing crashes or corrupted gradients during training
detectron2/modeling/roi_heads/roi_heads.py:StandardROIHeads.forward
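The missing guard is a simple clamp of box coordinates to the image bounds, the kind of operation detectron2's Boxes.clip provides. A plain-Python sketch:

```python
def clip_boxes(boxes, height, width):
    """Clamp (x1, y1, x2, y2) boxes to [0, width] x [0, height] so that
    downstream feature pooling never samples outside the image."""
    return [
        (max(0.0, min(x1, width)), max(0.0, min(y1, height)),
         max(0.0, min(x2, width)), max(0.0, min(y2, height)))
        for x1, y1, x2, y2 in boxes
    ]
```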
Training assumes balanced positive/negative anchor sampling with hardcoded ratios (positive_fraction=0.5, batch_size_per_image=256) but never checks if the dataset actually contains sufficient positive anchors
If this fails: On datasets with very small objects or sparse annotations, most images may have <128 positive anchors available, causing the sampler to pad with duplicate positives or fall back to fewer samples, leading to unstable gradients and poor convergence
detectron2/modeling/proposal_generator/rpn.py:RPN.forward_training
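The sampling scheme can be sketched as follows. This is a simplification of the real sampler: when positives are scarce, the batch here simply skews negative instead of holding the configured 50/50 split, which illustrates the imbalance the text warns about.

```python
import random

def subsample_labels(labels, batch_size_per_image=256, positive_fraction=0.5):
    """Pick up to batch_size * fraction positive anchors (label 1) and
    fill the remainder of the batch with negatives (label 0)."""
    pos = [i for i, l in enumerate(labels) if l == 1]
    neg = [i for i, l in enumerate(labels) if l == 0]
    num_pos = min(len(pos), int(batch_size_per_image * positive_fraction))
    num_neg = min(len(neg), batch_size_per_image - num_pos)
    return random.sample(pos, num_pos), random.sample(neg, num_neg)
```

With only 10 positive anchors available, the 256-anchor batch ends up 10 positives against 246 negatives rather than 128/128.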
DataLoader assumes sufficient GPU memory to hold batch_size * max_image_size * num_workers worth of preprocessed images, but never estimates or validates memory requirements
If this fails: Large images (>2000px) or high batch sizes can cause CUDA out-of-memory errors that manifest as cryptic RuntimeError messages mid-training, losing hours of training progress without clear memory usage guidance
detectron2/engine/defaults.py:DefaultTrainer.build_train_loader
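A back-of-envelope estimate of the batch footprint, the kind of check the loader never performs, is straightforward. Note this counts only the padded input images; activations and gradients multiply the real usage several-fold.

```python
def batch_memory_mb(batch_size, height, width, channels=3, bytes_per_px=4):
    """Rough float32 memory (MiB) for one padded image batch."""
    return batch_size * channels * height * width * bytes_per_px / 2**20
```

A 16-image batch padded to 1024x1024 already costs 192 MiB for the inputs alone, before any feature maps are allocated.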
Checkpoint loading assumes model architecture hasn't changed between save and load - specifically that all parameter names and shapes match exactly
If this fails: If config changes backbone depth (ResNet50->ResNet101) or adds new heads between training runs, checkpoint loading fails with KeyError or shape mismatch, but error messages don't clearly indicate which architectural change caused the incompatibility
detectron2/checkpoint/checkpoint.py:Checkpointer.load
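The incompatibility can be diagnosed before loading by diffing parameter names and shapes. This is a hypothetical helper (the key names in the test are illustrative, not taken from the repo):

```python
def diff_state_dicts(model_shapes, ckpt_shapes):
    """Compare {param_name: shape} maps and report exactly which
    parameters would break a checkpoint load: keys the model expects
    but the checkpoint lacks, keys only the checkpoint has, and keys
    present in both with differing shapes."""
    missing = sorted(set(model_shapes) - set(ckpt_shapes))
    unexpected = sorted(set(ckpt_shapes) - set(model_shapes))
    mismatched = sorted(
        k for k in set(model_shapes) & set(ckpt_shapes)
        if model_shapes[k] != ckpt_shapes[k]
    )
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}
```

Running this on the two state dicts turns an opaque KeyError into a list of the exact architectural changes that caused the mismatch.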
DataLoader iteration assumes dataset records can be accessed in any order via __getitem__(index), but some dataset implementations may expect sequential access or have stateful transforms
If this fails: Multi-worker data loading with random sampling can break datasets that maintain internal state or cache, causing inconsistent augmentations or corrupted batches that lead to training instability
detectron2/data/build.py:build_detection_train_loader
All configs hardcode pixel normalization constants (pixel_mean=[103.530, 116.280, 123.675]) assuming BGR channel order and ImageNet statistics, but never validate actual dataset statistics
If this fails: If dataset uses RGB order, different camera sensors, or domain-specific images (medical, satellite), the hardcoded normalization will shift the data distribution, causing pretrained features to activate incorrectly and reducing model accuracy
configs/common/models/mask_rcnn_fpn.py:model.pixel_mean
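One way to surface this mismatch is to measure the dataset's own channel statistics before training and compare them to the hardcoded pixel_mean. A minimal sketch, with plain nested lists standing in for image tensors (not a detectron2 API):

```python
def channel_means(images):
    """images: list of [C][H][W] nested lists.
    Returns per-channel pixel means across the whole sample."""
    num_channels = len(images[0])
    sums, counts = [0.0] * num_channels, [0] * num_channels
    for img in images:
        for ch in range(num_channels):
            for row in img[ch]:
                sums[ch] += sum(row)
                counts[ch] += len(row)
    return [s / n for s, n in zip(sums, counts)]
```

If the measured means sit far from [103.530, 116.280, 123.675], the channel order or domain likely differs from what the pretrained weights assume.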
COCO evaluation assumes detection scores are well-calibrated probabilities in [0,1] range and uses fixed IoU thresholds (0.5:0.95) without checking score distribution
If this fails: Models that output uncalibrated confidence scores or use different output ranges may appear to perform poorly in evaluation even if spatial predictions are accurate, masking model quality issues
detectron2/evaluation/coco_evaluation.py:COCOEvaluator._eval_predictions
Image batching assumes all input tensors have the same number of channels (3 for RGB/BGR) and will pad spatial dimensions to match the largest image in the batch
If this fails: If batch contains grayscale (1-channel) or RGBA (4-channel) images mixed with RGB, tensor concatenation will fail with shape mismatch errors that don't clearly indicate the channel dimension issue
detectron2/structures/image_list.py:ImageList.from_tensors
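The batching contract can be sketched as a shape computation in the spirit of ImageList.from_tensors, with the channel check the text says is missing made explicit. This is a hypothetical helper, not the real implementation:

```python
def padded_batch_shape(image_shapes, size_divisibility=32):
    """Given per-image (C, H, W) shapes, compute the (C, H, W) a
    from_tensors-style batcher would pad every image to, failing fast
    when channel counts disagree instead of erroring at concat time."""
    channels = {c for c, _, _ in image_shapes}
    if len(channels) != 1:
        raise ValueError(f"mixed channel counts in batch: {sorted(channels)}")
    max_h = max(h for _, h, _ in image_shapes)
    max_w = max(w for _, _, w in image_shapes)
    round_up = lambda x: -(-x // size_divisibility) * size_divisibility
    return (channels.pop(), round_up(max_h), round_up(max_w))
```

Spatial dimensions round up to the largest image (and to a stride-friendly multiple), but the channel dimension must already agree across the batch.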
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Model Registry — Maps model names to pretrained weight URLs and config files for easy model loading
- Training Checkpoints — Store model weights, optimizer state, and training iteration for resuming training
- Dataset Catalog — Registers dataset metadata including paths, class names, and annotation formats
Feedback Loops
- Training Loop (training-loop, reinforcing) — Trigger: Batch of training data. Action: Forward pass computes losses, backward pass updates model weights through optimizer.step(). Exit: Maximum iterations reached or manual stop.
- Learning Rate Scheduling (convergence, balancing) — Trigger: Training iteration milestones. Action: WarmupParamScheduler adjusts learning rate based on MultiStepParamScheduler schedule. Exit: Training completes.
- Gradient Accumulation (gradient-accumulation, reinforcing) — Trigger: Small batch size or large model. Action: Accumulates gradients across multiple forward passes before optimizer step. Exit: Configured accumulation steps reached.
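The learning-rate loop described above can be sketched as a pure function of the iteration count, combining linear warmup with multi-step decay in the spirit of WarmupParamScheduler and MultiStepParamScheduler. The milestone and warmup values below are illustrative defaults, not read from the repo:

```python
def lr_at(iteration, base_lr=0.02, warmup_iters=1000, warmup_factor=0.001,
          milestones=(60000, 80000), gamma=0.1):
    """Learning rate at a given iteration: linear warmup from
    base_lr * warmup_factor up to base_lr, then a gamma-fold drop at
    each milestone."""
    if iteration < warmup_iters:
        alpha = iteration / warmup_iters
        return base_lr * (warmup_factor * (1 - alpha) + alpha)
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** drops
```

The warmup phase protects early training from the large base LR; the milestone drops anneal it as the loss plateaus.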
Delays
- DataLoader Prefetching (async-processing, ~Variable based on augmentation complexity) — Multiple worker processes preload and augment images while GPU processes current batch
- Checkpoint Saving (checkpoint-save, ~1-10 seconds depending on model size) — Periodic saves pause training briefly to write model state to disk
- Evaluation Phase (batch-window, ~Minutes depending on test set size) — Training pauses to run inference on validation set and compute metrics
Control Points
- MODEL.MASK_ON (feature-flag) — Controls: Enables/disables instance segmentation mask prediction in ROI heads. Default: true
- SOLVER.BASE_LR (hyperparameter) — Controls: Base learning rate for SGD optimizer — scales with batch size. Default: 0.02
- MODEL.RESNETS.DEPTH (architecture-switch) — Controls: ResNet backbone depth (50, 101, 152) affecting model capacity and computation. Default: 50
- SOLVER.IMS_PER_BATCH (hyperparameter) — Controls: Total batch size across all GPUs for training. Default: 16
- MODEL.DEVICE (device-selection) — Controls: Compute device selection (cuda, cpu) for model execution. Default: cuda
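SOLVER.BASE_LR and SOLVER.IMS_PER_BATCH are tuned together (0.02 at 16 images per batch), so changing one control point implies rescaling the other. A sketch of the commonly used linear scaling rule; treat it as a rule of thumb rather than something the config system enforces:

```python
def scaled_lr(base_lr=0.02, ims_per_batch=16, reference_batch=16):
    """Linear LR scaling rule: scale the base learning rate in
    proportion to the ratio of actual to reference batch size."""
    return base_lr * (ims_per_batch / reference_batch)
```

Halving the batch to 8 images suggests halving the learning rate to 0.01.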
Technology Stack
- PyTorch — Core deep learning framework providing tensors, autograd, and neural network modules
- torchvision — Provides pretrained ResNet backbones and image transformations for data augmentation
- pycocotools — Handles COCO dataset format parsing and evaluation metric computation
- OmegaConf — Configuration management system enabling YAML configs with variable interpolation and composition
- fvcore — Utilities for parameter counting, FLOP computation, and training parameter scheduling
- cloudpickle — Serialization for complex Python objects, including lambda functions in configs
Key Components
- GeneralizedRCNN (orchestrator) — detectron2/modeling/meta_arch/rcnn.py — Top-level model architecture that coordinates backbone feature extraction, region proposal generation, and ROI head processing for two-stage detectors
- FPN (transformer) — detectron2/modeling/backbone/fpn.py — Feature Pyramid Network that combines multi-scale ResNet features into a pyramid with consistent channel dimensions (256) across all levels
- RPN (generator) — detectron2/modeling/proposal_generator/rpn.py — Region Proposal Network that generates object proposals by classifying anchor boxes as foreground/background and refining their coordinates
- StandardROIHeads (processor) — detectron2/modeling/roi_heads/roi_heads.py — ROI heads that pool features from proposals, run box classification/regression, and optionally mask/keypoint prediction
- DatasetMapper (transformer) — detectron2/data/dataset_mapper.py — Converts raw COCO annotations into training format: applies augmentations, normalizes images, and creates Instances objects with ground truth
- COCOEvaluator (validator) — detectron2/evaluation/coco_evaluation.py — Computes COCO detection metrics (AP, AP50, AP75, etc.) by comparing predicted Instances against ground truth annotations
- DefaultTrainer (orchestrator) — detectron2/engine/defaults.py — Main training loop coordinator that handles optimizer steps, learning rate scheduling, checkpointing, and periodic evaluation
- build_detection_train_loader (factory) — detectron2/data/build.py — Creates a PyTorch DataLoader with dataset sampling, batching, and multi-worker data loading for training
Frequently Asked Questions
What is detectron2 used for?
Trains computer vision models to detect, segment, and analyze objects in images. facebookresearch/detectron2 is an 8-component ML training system written in Python. Data flows through 5 distinct pipeline stages. The codebase contains 512 files.
How is detectron2 architected?
detectron2 is organized into 5 architecture layers: Configuration System, Modeling Framework, Data Pipeline, Training Engine, and 1 more. Data flows through 5 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through detectron2?
Data moves through 5 stages: Load and preprocess images → Extract backbone features → Generate region proposals → ROI feature extraction and classification → Post-processing and evaluation. Images flow through a multi-stage detection pipeline: raw images are loaded and augmented by DatasetMapper, the backbone (ResNet+FPN) extracts multi-scale features, RPN generates region proposals from these features, ROI heads classify proposals and predict masks/keypoints, and finally predictions are post-processed through NMS before evaluation. This pipeline design reflects a complex multi-stage processing system.
What technologies does detectron2 use?
The core stack includes PyTorch (Core deep learning framework providing tensors, autograd, and neural network modules), torchvision (Provides pretrained ResNet backbones and image transformations for data augmentation), pycocotools (Handles COCO dataset format parsing and evaluation metric computation), OmegaConf (Configuration management system enabling YAML configs with variable interpolation and composition), fvcore (Utilities for parameter counting, FLOP computation, and training parameter scheduling), cloudpickle (Serialization for complex Python objects including lambda functions in configs). A focused set of dependencies that keeps the build manageable.
What system dynamics does detectron2 have?
detectron2 exhibits 3 data pools (Model Registry, Training Checkpoints, and one more), 3 feedback loops, 5 control points, and 3 delays. The feedback loops handle the training loop and convergence. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does detectron2 use?
4 design patterns detected: Lazy Configuration, Registry Pattern, Hook System, Structure Objects.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.