facebookresearch/detectron2
Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
Trains computer vision models to detect, segment, and analyze objects in images
Under the hood, the system uses 3 feedback loops, 3 data pools, and 5 control points to manage its runtime behavior.
An 8-component ML training system. 512 files analyzed. Data flows through 5 distinct pipeline stages.
How Data Flows Through the System
Images flow through a multi-stage detection pipeline: raw images are loaded and augmented by DatasetMapper, the backbone (ResNet+FPN) extracts multi-scale features, RPN generates region proposals from these features, ROI heads classify proposals and predict masks/keypoints, and finally predictions are post-processed through NMS before evaluation.
- Load and preprocess images — DatasetMapper loads images from COCO dataset, applies augmentations (resize, flip), normalizes pixel values, and packages with ground truth annotations into training dicts [COCO annotations → TrainingBatch] (config: INPUT.MIN_SIZE_TRAIN, INPUT.MAX_SIZE_TRAIN, dataloader.train.mapper.augmentations)
- Extract backbone features — ResNet backbone processes input images through conv layers to extract features at multiple scales (res2-res5), then FPN combines these into a feature pyramid with consistent 256-channel dimensions [ImageTensor → FeatureMaps] (config: MODEL.BACKBONE.NAME, MODEL.RESNETS.DEPTH, MODEL.FPN.IN_FEATURES)
- Generate region proposals — RPN takes FPN features and generates object proposals by classifying dense anchor grids as foreground/background, then refines anchor coordinates using box regression [FeatureMaps → Boxes] (config: MODEL.RPN.IN_FEATURES, MODEL.ANCHOR_GENERATOR.SIZES, MODEL.RPN.PRE_NMS_TOPK_TRAIN)
- ROI feature extraction and classification — StandardROIHeads pools features from FPN levels for each proposal using ROIAlign, then runs box classification, regression, and optionally mask/keypoint prediction through separate heads [FeatureMaps → Instances] (config: MODEL.ROI_HEADS.NAME, MODEL.ROI_HEADS.NUM_CLASSES, MODEL.MASK_ON)
- Post-processing and evaluation — Raw predictions are filtered through NMS to remove duplicate detections, then COCOEvaluator computes AP metrics by matching predictions to ground truth using IoU thresholds [Instances → Evaluation metrics] (config: MODEL.ROI_HEADS.SCORE_THRESH_TEST, MODEL.ROI_HEADS.NMS_THRESH_TEST, TEST.DETECTIONS_PER_IMAGE)
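The final post-processing stage above can be illustrated with a minimal greedy NMS over (x1, y1, x2, y2) boxes. This is a plain-Python sketch of the idea, not detectron2's batched, per-class GPU implementation:

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlaps above thresh."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= thresh]
    return keep
```

The `thresh` here plays the role of MODEL.ROI_HEADS.NMS_THRESH_TEST: two near-identical detections of the same object collapse to the higher-scoring one.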
Data Models
The data structures that flow between stages — the contracts that hold the system together.
- detectron2/structures/image_list.py — Tensor[B, 3, H, W] in BGR format with pixel values normalized using ImageNet statistics. Raw images are loaded, resized, and normalized into tensors, fed through the backbone, then discarded after feature extraction.
- detectron2/modeling/backbone/fpn.py — Dict[str, Tensor] with keys like 'p2', 'p3', 'p4', 'p5' mapping to feature tensors of shape [B, 256, H/stride, W/stride]. The backbone extracts multi-scale features, FPN combines them into a pyramid, and RPN and ROI heads consume specific pyramid levels.
- detectron2/structures/instances.py — Structure containing pred_boxes: Boxes[N, 4], scores: Tensor[N], pred_classes: Tensor[N], pred_masks: Tensor[N, H, W] (optional). ROI heads predict raw detections, NMS filters overlapping boxes, and the evaluator computes metrics against ground truth.
- detectron2/structures/boxes.py — Tensor[N, 4] in (x1, y1, x2, y2) format representing bounding box coordinates. RPN generates initial proposals, box regression refines coordinates, and NMS removes duplicates.
- detectron2/data/common.py — List[Dict] where each dict contains 'image': Tensor[3, H, W] and 'instances': Instances with gt_boxes, gt_classes, gt_masks. DatasetMapper creates dicts from annotations, the DataLoader batches them, and the model processes each batch and computes losses.
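The training-dict contract can be shown with a toy packaging function. This is a hedged, plain-Python stand-in (nested lists instead of tensors, a dict instead of an Instances object); the helper name and fields are illustrative:

```python
def to_training_dict(image, annotations):
    """Package one example in the shape the data pipeline expects:
    an 'image' array [C][H][W] plus ground-truth boxes and classes."""
    return {
        "image": image,                    # stand-in for Tensor[3, H, W]
        "height": len(image[0]),           # H
        "width": len(image[0][0]),         # W
        "instances": {                     # stand-in for an Instances object
            "gt_boxes": [a["bbox"] for a in annotations],
            "gt_classes": [a["category_id"] for a in annotations],
        },
    }
```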
Hidden Assumptions
Things this code relies on but never validates. These are the things that cause silent failures when the system changes.
All input feature maps from the backbone have spatial dimensions that align with the expected strides (4, 8, 16, 32, 64), but FPN never validates that each pyramid level is exactly half the height/width of the level below it (e.g. that 'p3' is 2x smaller than 'p2')
If this fails: If backbone produces feature maps with unexpected spatial ratios, FPN's lateral connections and top-down pathway will misalign features, causing detection heads to process spatially inconsistent features and produce wrong bounding box coordinates
detectron2/modeling/backbone/fpn.py:FPN.forward
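A validation of the kind the text says is missing could look like this. It is a hypothetical helper, not code from the repo: it checks that consecutive pyramid levels halve in each spatial dimension, matching the stride sequence 4, 8, 16, 32, 64.

```python
def check_level_ratios(feats):
    """feats: ordered dict {'p2': (H, W), 'p3': (H, W), ...}.
    Returns True when each level is exactly half the previous one."""
    sizes = list(feats.values())
    for (h1, w1), (h2, w2) in zip(sizes, sizes[1:]):
        if h1 != 2 * h2 or w1 != 2 * w2:
            return False
    return True
```

Running it on a well-formed pyramid returns True; a backbone that produced a misaligned level would be caught before the lateral connections misregister features.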
Input images are in standard 8-bit RGB/BGR format with pixel values in [0,255] range, but DatasetMapper applies ImageNet normalization (mean=[103.53, 116.28, 123.675], std=[1.0, 1.0, 1.0]) without checking pixel value distribution
If this fails: If input contains HDR images, medical images with 16-bit depth, or pre-normalized images in [0,1] range, the normalization will push pixel values far outside expected ranges, causing backbone features to saturate and models to fail silently
detectron2/data/dataset_mapper.py:DatasetMapper.__call__
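The failure mode is easy to see numerically. A sketch of the per-channel normalization using the Caffe-style BGR means from the configs (std of 1.0 leaves values unscaled):

```python
# Caffe-style BGR channel means from the default configs.
PIXEL_MEAN = [103.530, 116.280, 123.675]
PIXEL_STD = [1.0, 1.0, 1.0]

def normalize_pixel(value, channel):
    """Per-channel normalization as applied to raw [0, 255] pixels."""
    return (value - PIXEL_MEAN[channel]) / PIXEL_STD[channel]

# A raw 8-bit pixel lands in a sane range: normalize_pixel(128, 0) is ~24.5.
# A pre-normalized [0, 1] pixel does not: normalize_pixel(0.5, 0) is ~-103,
# which is the silent distribution shift described above.
```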
Proposal boxes from RPN are in absolute image coordinates (x1,y1,x2,y2) format matching the original image size, but ROI heads never validate that proposal coordinates are within image boundaries
If this fails: If RPN generates out-of-bounds proposals due to anchor misconfiguration or image resizing bugs, ROIAlign will sample features from invalid memory locations, causing crashes or corrupted gradients during training
detectron2/modeling/roi_heads/roi_heads.py:StandardROIHeads.forward
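The missing guard is a simple clamp of box coordinates to the image bounds, the kind of operation detectron2's Boxes.clip provides. A plain-Python sketch:

```python
def clip_boxes(boxes, height, width):
    """Clamp (x1, y1, x2, y2) boxes to [0, width] x [0, height] so that
    downstream feature pooling never samples outside the image."""
    return [
        (max(0.0, min(x1, width)), max(0.0, min(y1, height)),
         max(0.0, min(x2, width)), max(0.0, min(y2, height)))
        for x1, y1, x2, y2 in boxes
    ]
```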
Training assumes balanced positive/negative anchor sampling with hardcoded ratios (positive_fraction=0.5, batch_size_per_image=256) but never checks if the dataset actually contains sufficient positive anchors
If this fails: On datasets with very small objects or sparse annotations, most images may have <128 positive anchors available, causing the sampler to pad with duplicate positives or fall back to fewer samples, leading to unstable gradients and poor convergence
detectron2/modeling/proposal_generator/rpn.py:RPN.forward_training
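The sampling scheme can be sketched as follows. This is a simplification of the real sampler: when positives are scarce, the batch here simply skews negative instead of holding the configured 50/50 split, which illustrates the imbalance the text warns about.

```python
import random

def subsample_labels(labels, batch_size_per_image=256, positive_fraction=0.5):
    """Pick up to batch_size * fraction positive anchors (label 1) and
    fill the remainder of the batch with negatives (label 0)."""
    pos = [i for i, l in enumerate(labels) if l == 1]
    neg = [i for i, l in enumerate(labels) if l == 0]
    num_pos = min(len(pos), int(batch_size_per_image * positive_fraction))
    num_neg = min(len(neg), batch_size_per_image - num_pos)
    return random.sample(pos, num_pos), random.sample(neg, num_neg)
```

With only 10 positive anchors available, the 256-anchor batch ends up 10 positives against 246 negatives rather than 128/128.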
DataLoader assumes sufficient GPU memory to hold batch_size * max_image_size * num_workers worth of preprocessed images, but never estimates or validates memory requirements
If this fails: Large images (>2000px) or high batch sizes can cause CUDA out-of-memory errors that manifest as cryptic RuntimeError messages mid-training, losing hours of training progress without clear memory usage guidance
detectron2/engine/defaults.py:DefaultTrainer.build_train_loader
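A back-of-envelope estimate of the batch footprint, the kind of check the loader never performs, is straightforward. Note this counts only the padded input images; activations and gradients multiply the real usage several-fold.

```python
def batch_memory_mb(batch_size, height, width, channels=3, bytes_per_px=4):
    """Rough float32 memory (MiB) for one padded image batch."""
    return batch_size * channels * height * width * bytes_per_px / 2**20
```

A 16-image batch padded to 1024x1024 already costs 192 MiB for the inputs alone, before any feature maps are allocated.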
Checkpoint loading assumes model architecture hasn't changed between save and load - specifically that all parameter names and shapes match exactly
If this fails: If config changes backbone depth (ResNet50->ResNet101) or adds new heads between training runs, checkpoint loading fails with KeyError or shape mismatch, but error messages don't clearly indicate which architectural change caused the incompatibility
detectron2/checkpoint/checkpoint.py:Checkpointer.load
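The incompatibility can be diagnosed before loading by diffing parameter names and shapes. This is a hypothetical helper (the key names in the test are illustrative, not taken from the repo):

```python
def diff_state_dicts(model_shapes, ckpt_shapes):
    """Compare {param_name: shape} maps and report exactly which
    parameters would break a checkpoint load: keys the model expects
    but the checkpoint lacks, keys only the checkpoint has, and keys
    present in both with differing shapes."""
    missing = sorted(set(model_shapes) - set(ckpt_shapes))
    unexpected = sorted(set(ckpt_shapes) - set(model_shapes))
    mismatched = sorted(
        k for k in set(model_shapes) & set(ckpt_shapes)
        if model_shapes[k] != ckpt_shapes[k]
    )
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}
```

Running this on the two state dicts turns an opaque KeyError into a list of the exact architectural changes that caused the mismatch.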
DataLoader iteration assumes dataset records can be accessed in any order via __getitem__(index), but some dataset implementations may expect sequential access or have stateful transforms
If this fails: Multi-worker data loading with random sampling can break datasets that maintain internal state or cache, causing inconsistent augmentations or corrupted batches that lead to training instability
detectron2/data/build.py:build_detection_train_loader
All configs hardcode pixel normalization constants (pixel_mean=[103.530, 116.280, 123.675]) assuming BGR channel order and ImageNet statistics, but never validate actual dataset statistics
If this fails: If dataset uses RGB order, different camera sensors, or domain-specific images (medical, satellite), the hardcoded normalization will shift the data distribution, causing pretrained features to activate incorrectly and reducing model accuracy
configs/common/models/mask_rcnn_fpn.py:model.pixel_mean
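One way to surface this mismatch is to measure the dataset's own channel statistics before training and compare them to the hardcoded pixel_mean. A minimal sketch, with plain nested lists standing in for image tensors (not a detectron2 API):

```python
def channel_means(images):
    """images: list of [C][H][W] nested lists.
    Returns per-channel pixel means across the whole sample."""
    num_channels = len(images[0])
    sums, counts = [0.0] * num_channels, [0] * num_channels
    for img in images:
        for ch in range(num_channels):
            for row in img[ch]:
                sums[ch] += sum(row)
                counts[ch] += len(row)
    return [s / n for s, n in zip(sums, counts)]
```

If the measured means sit far from [103.530, 116.280, 123.675], the channel order or domain likely differs from what the pretrained weights assume.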
COCO evaluation assumes detection scores are well-calibrated probabilities in [0,1] range and uses fixed IoU thresholds (0.5:0.95) without checking score distribution
If this fails: Models that output uncalibrated confidence scores or use different output ranges may appear to perform poorly in evaluation even if spatial predictions are accurate, masking model quality issues
detectron2/evaluation/coco_evaluation.py:COCOEvaluator._eval_predictions
Image batching assumes all input tensors have the same number of channels (3 for RGB/BGR) and will pad spatial dimensions to match the largest image in the batch
If this fails: If batch contains grayscale (1-channel) or RGBA (4-channel) images mixed with RGB, tensor concatenation will fail with shape mismatch errors that don't clearly indicate the channel dimension issue
detectron2/structures/image_list.py:ImageList.from_tensors
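The batching contract can be sketched as a shape computation in the spirit of ImageList.from_tensors, with the channel check the text says is missing made explicit. This is a hypothetical helper, not the real implementation:

```python
def padded_batch_shape(image_shapes, size_divisibility=32):
    """Given per-image (C, H, W) shapes, compute the (C, H, W) a
    from_tensors-style batcher would pad every image to, failing fast
    when channel counts disagree instead of erroring at concat time."""
    channels = {c for c, _, _ in image_shapes}
    if len(channels) != 1:
        raise ValueError(f"mixed channel counts in batch: {sorted(channels)}")
    max_h = max(h for _, h, _ in image_shapes)
    max_w = max(w for _, _, w in image_shapes)
    round_up = lambda x: -(-x // size_divisibility) * size_divisibility
    return (channels.pop(), round_up(max_h), round_up(max_w))
```

Spatial dimensions round up to the largest image (and to a stride-friendly multiple), but the channel dimension must already agree across the batch.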
System Behavior
How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.
Data Pools
- Model Registry — Maps model names to pretrained weight URLs and config files for easy model loading
- Training Checkpoints — Store model weights, optimizer state, and training iteration for resuming training
- Dataset Catalog — Registers dataset metadata including paths, class names, and annotation formats
Feedback Loops
- Training Loop (training-loop, reinforcing) — Trigger: Batch of training data. Action: Forward pass computes losses, backward pass updates model weights through optimizer.step(). Exit: Maximum iterations reached or manual stop.
- Learning Rate Scheduling (convergence, balancing) — Trigger: Training iteration milestones. Action: WarmupParamScheduler adjusts learning rate based on MultiStepParamScheduler schedule. Exit: Training completes.
- Gradient Accumulation (gradient-accumulation, reinforcing) — Trigger: Small batch size or large model. Action: Accumulates gradients across multiple forward passes before optimizer step. Exit: Configured accumulation steps reached.
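The learning-rate loop described above can be sketched as a pure function of the iteration count, combining linear warmup with multi-step decay in the spirit of WarmupParamScheduler and MultiStepParamScheduler. The milestone and warmup values below are illustrative defaults, not read from the repo:

```python
def lr_at(iteration, base_lr=0.02, warmup_iters=1000, warmup_factor=0.001,
          milestones=(60000, 80000), gamma=0.1):
    """Learning rate at a given iteration: linear warmup from
    base_lr * warmup_factor up to base_lr, then a gamma-fold drop at
    each milestone."""
    if iteration < warmup_iters:
        alpha = iteration / warmup_iters
        return base_lr * (warmup_factor * (1 - alpha) + alpha)
    drops = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** drops
```

The warmup phase protects early training from the large base LR; the milestone drops anneal it as the loss plateaus.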
Delays
- DataLoader Prefetching (async-processing, ~Variable based on augmentation complexity) — Multiple worker processes preload and augment images while GPU processes current batch
- Checkpoint Saving (checkpoint-save, ~1-10 seconds depending on model size) — Periodic saves pause training briefly to write model state to disk
- Evaluation Phase (batch-window, ~Minutes depending on test set size) — Training pauses to run inference on validation set and compute metrics
Control Points
- MODEL.MASK_ON (feature-flag) — Controls: Enables/disables instance segmentation mask prediction in ROI heads. Default: true
- SOLVER.BASE_LR (hyperparameter) — Controls: Base learning rate for SGD optimizer — scales with batch size. Default: 0.02
- MODEL.RESNETS.DEPTH (architecture-switch) — Controls: ResNet backbone depth (50, 101, 152) affecting model capacity and computation. Default: 50
- SOLVER.IMS_PER_BATCH (hyperparameter) — Controls: Total batch size across all GPUs for training. Default: 16
- MODEL.DEVICE (device-selection) — Controls: Compute device selection (cuda, cpu) for model execution. Default: cuda
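SOLVER.BASE_LR and SOLVER.IMS_PER_BATCH are tuned together (0.02 at 16 images per batch), so changing one control point implies rescaling the other. A sketch of the commonly used linear scaling rule; treat it as a rule of thumb rather than something the config system enforces:

```python
def scaled_lr(base_lr=0.02, ims_per_batch=16, reference_batch=16):
    """Linear LR scaling rule: scale the base learning rate in
    proportion to the ratio of actual to reference batch size."""
    return base_lr * (ims_per_batch / reference_batch)
```

Halving the batch to 8 images suggests halving the learning rate to 0.01.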
Technology Stack
- PyTorch — Core deep learning framework providing tensors, autograd, and neural network modules
- torchvision — Provides pretrained ResNet backbones and image transformations for data augmentation
- pycocotools — Handles COCO dataset format parsing and evaluation metric computation
- OmegaConf — Configuration management system enabling YAML configs with variable interpolation and composition
- fvcore — Utilities for parameter counting, FLOP computation, and training parameter scheduling
- cloudpickle — Serialization for complex Python objects, including lambda functions in configs
Key Components
- GeneralizedRCNN (orchestrator) — detectron2/modeling/meta_arch/rcnn.py — Top-level model architecture that coordinates backbone feature extraction, region proposal generation, and ROI head processing for two-stage detectors
- FPN (transformer) — detectron2/modeling/backbone/fpn.py — Feature Pyramid Network that combines multi-scale ResNet features into a pyramid with consistent channel dimensions (256) across all levels
- RPN (generator) — detectron2/modeling/proposal_generator/rpn.py — Region Proposal Network that generates object proposals by classifying anchor boxes as foreground/background and refining their coordinates
- StandardROIHeads (processor) — detectron2/modeling/roi_heads/roi_heads.py — ROI heads that pool features from proposals, run box classification/regression, and optionally mask/keypoint prediction
- DatasetMapper (transformer) — detectron2/data/dataset_mapper.py — Converts raw COCO annotations into training format: applies augmentations, normalizes images, and creates Instances objects with ground truth
- COCOEvaluator (validator) — detectron2/evaluation/coco_evaluation.py — Computes COCO detection metrics (AP, AP50, AP75, etc.) by comparing predicted Instances against ground truth annotations
- DefaultTrainer (orchestrator) — detectron2/engine/defaults.py — Main training loop coordinator that handles optimizer steps, learning rate scheduling, checkpointing, and periodic evaluation
- build_detection_train_loader (factory) — detectron2/data/build.py — Creates a PyTorch DataLoader with dataset sampling, batching, and multi-worker data loading for training
Frequently Asked Questions
What is detectron2 used for?
Trains computer vision models to detect, segment, and analyze objects in images. facebookresearch/detectron2 is an 8-component ML training system written in Python. Data flows through 5 distinct pipeline stages. The codebase contains 512 files.
How is detectron2 architected?
detectron2 is organized into 5 architecture layers: Configuration System, Modeling Framework, Data Pipeline, Training Engine, and 1 more. Data flows through 5 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.
How does data flow through detectron2?
Data moves through 5 stages: Load and preprocess images → Extract backbone features → Generate region proposals → ROI feature extraction and classification → Post-processing and evaluation. Images flow through a multi-stage detection pipeline: raw images are loaded and augmented by DatasetMapper, the backbone (ResNet+FPN) extracts multi-scale features, RPN generates region proposals from these features, ROI heads classify proposals and predict masks/keypoints, and finally predictions are post-processed through NMS before evaluation. This pipeline design reflects a complex multi-stage processing system.
What technologies does detectron2 use?
The core stack includes PyTorch (Core deep learning framework providing tensors, autograd, and neural network modules), torchvision (Provides pretrained ResNet backbones and image transformations for data augmentation), pycocotools (Handles COCO dataset format parsing and evaluation metric computation), OmegaConf (Configuration management system enabling YAML configs with variable interpolation and composition), fvcore (Utilities for parameter counting, FLOP computation, and training parameter scheduling), cloudpickle (Serialization for complex Python objects including lambda functions in configs). A focused set of dependencies that keeps the build manageable.
What system dynamics does detectron2 have?
detectron2 exhibits 3 data pools (Model Registry, Training Checkpoints, and one more), 3 feedback loops, 5 control points, and 3 delays. The feedback loops handle the training loop and convergence. These runtime behaviors shape how the system responds to load, failures, and configuration changes.
What design patterns does detectron2 use?
4 design patterns detected: Lazy Configuration, Registry Pattern, Hook System, Structure Objects.
Analyzed on April 20, 2026 by CodeSea. Written by Karolina Sarna.