openai/whisper

Robust Speech Recognition via Large-Scale Weak Supervision

98,063 stars · Python · 10 components

Converts audio files to transcribed text using transformer models

Audio files are loaded and resampled to 16kHz mono, then converted to 80-channel log mel spectrograms in 30-second chunks. A transformer encoder processes these features while a decoder generates text tokens autoregressively, guided by special tokens for task specification and language identification. Cross-attention weights enable word-level timestamp alignment through dynamic time warping.

Under the hood, the system uses 2 feedback loops, 2 data pools, 4 control points, and 2 delays to manage its runtime behavior.

A 10-component ML inference system spanning 20 analyzed files. Data flows through 7 distinct pipeline stages.

How Data Flows Through the System

  1. Load audio file — load_audio() calls ffmpeg subprocess to convert any audio format to 16kHz mono PCM, returning numpy array normalized to [-1, 1] range
  2. Generate mel spectrogram — log_mel_spectrogram() applies STFT with 400-point FFT and 160-sample hop length, then mel filter bank with 80 channels, producing log-scaled features [AudioTensor → MelSpectrogram]
  3. Encode audio features — AudioEncoder processes mel spectrogram through transformer layers with sinusoidal position embeddings and self-attention, creating contextualized audio representations [MelSpectrogram → encoded audio features]
  4. Configure decoding — DecodingOptions dataclass specifies task (transcribe/translate), target language, search strategy (beam/sampling), temperature, and output preferences (config: task, language, temperature)
  5. Generate text tokens — TextDecoder runs autoregressive generation with cross-attention to audio features, using beam search or sampling guided by special tokens for task control [DecodingOptions → DecodingResult] (config: task, language, temperature)
  6. Align tokens with audio — add_word_timestamps() uses cross-attention weights from specific model heads to compute token-audio alignment via dynamic time warping, creating word boundaries [DecodingResult → WordTiming]
  7. Format final output — transcribe() assembles segments with text, timestamps, token IDs, and confidence scores, applying text normalization and creating structured output dictionary [WordTiming → transcription with segments]
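
In code, the stages map onto Whisper's public Python API roughly as follows (a minimal sketch; "speech.wav" is a placeholder for any input file):

    import whisper

    model = whisper.load_model("base")

    # Stages 1-2: load audio (ffmpeg -> 16 kHz mono float32 in [-1, 1]) and
    # compute the 80-channel log mel spectrogram over a padded 30-second window
    audio = whisper.pad_or_trim(whisper.load_audio("speech.wav"))
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # Stages 4-5: configure decoding, then generate tokens autoregressively
    # (fp16=False keeps the sketch CPU-friendly)
    options = whisper.DecodingOptions(task="transcribe", language="en",
                                      temperature=0.0, fp16=False)
    result = whisper.decode(model, mel, options)
    print(result.text)

    # Stages 1-7 in one call; word_timestamps=True enables the DTW alignment
    output = model.transcribe("speech.wav", word_timestamps=True)
    for segment in output["segments"]:
        print(segment["start"], segment["end"], segment["text"])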

Data Models

The data structures that flow between stages — the contracts that hold the system together.

AudioTensor whisper/audio.py
numpy.ndarray with shape (N_SAMPLES,) where N_SAMPLES = 480000 for 30-second chunks, dtype float32, values normalized to [-1, 1]
Created by ffmpeg subprocess converting audio files to 16kHz mono PCM, consumed by mel spectrogram computation
MelSpectrogram whisper/audio.py
torch.Tensor with shape (n_mels=80, N_FRAMES=3000) representing log mel features over time, dtype float32
Generated from audio waveform using STFT and mel filter bank, fed into transformer encoder
DecodingOptions whisper/decoding.py
dataclass with task: str, language: Optional[str], temperature: float, sample_len: Optional[int], best_of: Optional[int], beam_size: Optional[int], patience: Optional[float]
Created with user preferences for search strategy, language constraints, and output format, used throughout decoding process
DecodingResult whisper/decoding.py
dataclass with audio_features: Tensor, language: str, language_probs: Optional[Dict[str, float]], tokens: List[int], text: str, avg_logprob: float, no_speech_prob: float, temperature: float, compression_ratio: float
Accumulated during token generation with logprobs and confidence metrics, converted to final transcription format
WordTiming whisper/timing.py
dataclass with word: str, tokens: List[int], start: float, end: float, probability: float representing word-level timestamps
Generated by cross-attention alignment between audio and text tokens using dynamic time warping
ModelDimensions whisper/model.py
dataclass with n_mels: int, n_audio_ctx: int, n_audio_state: int, n_audio_head: int, n_audio_layer: int, n_vocab: int, n_text_ctx: int, n_text_state: int, n_text_head: int, n_text_layer: int
Loaded from model checkpoint to configure transformer architecture dimensions
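
These contracts can be exercised directly against the library (a small sketch; "speech.wav" is again a placeholder):

    import whisper
    from whisper.audio import N_FRAMES, N_SAMPLES

    model = whisper.load_model("base")

    audio = whisper.pad_or_trim(whisper.load_audio("speech.wav"))
    assert audio.shape == (N_SAMPLES,)                 # AudioTensor: 480000 samples

    mel = whisper.log_mel_spectrogram(audio)
    assert mel.shape == (model.dims.n_mels, N_FRAMES)  # MelSpectrogram: (80, 3000)

    print(model.dims)                                  # ModelDimensions from the checkpoint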

Hidden Assumptions

Things this code relies on but never validates. These are the things that cause silent failures when the system changes.

critical Domain unguarded

FFmpeg subprocess returns 16-bit signed integer PCM data that can be directly cast to float32 and normalized by dividing by 32768.0, regardless of the input audio's original bit depth or encoding format

If this fails: If input audio uses 24-bit, 32-bit float, or other formats where the dynamic range differs from 16-bit signed integers, the normalization produces incorrect amplitude scaling leading to distorted spectrograms and wrong transcription confidence scores

whisper/audio.py:load_audio
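
The cast in question reduces to a single line; a simplified sketch of the idea (the helper name is hypothetical):

    import numpy as np

    def pcm16_to_float(raw: bytes) -> np.ndarray:
        # Reinterpret ffmpeg's output as 16-bit signed little-endian PCM and
        # rescale to [-1, 1]; if the bytes were really 24-bit or float PCM,
        # this silently produces wrongly scaled amplitudes.
        return np.frombuffer(raw, np.int16).astype(np.float32) / 32768.0
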
critical Resource weakly guarded

Model download URLs remain permanently available at OpenAI's Azure CDN endpoints and SHA256 checksums never change for existing model versions

If this fails: If OpenAI moves, removes, or updates model files at these URLs, the download fails with network errors and users cannot load any models, breaking all inference functionality

whisper/__init__.py:load_model
critical Shape unguarded

Cross-attention weights tensor from model.decoder.blocks[layer].cross_attn has shape (batch_size, n_heads, seq_len, n_audio_ctx) where n_audio_ctx exactly equals the encoder's output length (the mel spectrogram time dimension after the conv stack's 2× downsampling, i.e. 1500 positions for 3000 frames)

If this fails: If the model architecture changes head dimensions or audio context length, the dynamic time warping alignment indexing goes out of bounds, causing crashes during word timestamp computation

whisper/timing.py:add_word_timestamps
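
A defensive shape check before the alignment step would surface this early (a hypothetical guard, not code from the repository):

    import torch

    def check_alignment_shapes(weights: torch.Tensor, mel: torch.Tensor) -> None:
        # Hypothetical guard: the encoder's conv stack halves the mel time
        # axis, so attention maps must end in n_audio_ctx = n_frames // 2.
        n_audio_ctx = mel.shape[-1] // 2
        if weights.shape[-1] != n_audio_ctx:
            raise ValueError(
                f"cross-attention time axis {weights.shape[-1]} != "
                f"audio context {n_audio_ctx}"
            )
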
critical Temporal unguarded

Successive 30-second windows, advanced according to the timestamps of previously decoded segments, provide sufficient context for seamless transcription, and word boundaries never span a window edge

If this fails: Long words, sentences, or phrases that cross chunk boundaries get incorrectly segmented or duplicated, producing fragmented transcripts with timing gaps or repeated text segments

whisper/transcribe.py:transcribe
warning Scale unguarded

Audio processing uses hardcoded constants N_SAMPLES=480000 (30 seconds * 16kHz) and N_FRAMES=3000, but never validates that input audio length matches these expectations

If this fails: Audio shorter than 30 seconds gets zero-padded while longer audio gets truncated, but downstream components expecting exactly 3000 frames crash if the padding/trimming logic changes or fails

whisper/audio.py:N_SAMPLES
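
pad_or_trim is what enforces the fixed length; a quick sketch of that contract:

    import numpy as np
    import whisper
    from whisper.audio import N_SAMPLES, SAMPLE_RATE

    short = np.zeros(5 * SAMPLE_RATE, dtype=np.float32)    # 5-second clip
    long_ = np.zeros(45 * SAMPLE_RATE, dtype=np.float32)   # 45-second clip

    # Both come out at exactly 480000 samples: zero-padded or truncated.
    assert whisper.pad_or_trim(short).shape == (N_SAMPLES,)
    assert whisper.pad_or_trim(long_).shape == (N_SAMPLES,)
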
warning Environment unguarded

FFmpeg executable exists in system PATH with support for the specific input audio format, and the subprocess call will complete successfully without timeout or resource limits

If this fails: If FFmpeg is missing, has different command-line arguments, or hits memory/CPU limits on large files, the subprocess raises CalledProcessError but this isn't handled, crashing the entire transcription pipeline

whisper/audio.py:load_audio
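
Callers can fail fast instead (a hypothetical wrapper, assuming the failure surfaces as described above):

    import shutil
    import subprocess

    import whisper

    def safe_load_audio(path: str):
        # Hypothetical wrapper: check for ffmpeg up front and turn a raw
        # subprocess failure into an actionable error message.
        if shutil.which("ffmpeg") is None:
            raise RuntimeError("ffmpeg not found on PATH; install it first")
        try:
            return whisper.load_audio(path)
        except (subprocess.CalledProcessError, RuntimeError) as e:
            raise RuntimeError(f"ffmpeg could not decode {path}: {e}") from e
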
warning Contract unguarded

The tokenizer's sot_sequence (start-of-transcript tokens) contains language and task tokens in a specific order that matches what the model was trained to expect

If this fails: If the tokenizer produces a different token sequence format or the model expects different special tokens, the decoder generates garbage text because the conditioning tokens don't match the training distribution

whisper/model.py:Whisper.decode
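
The expected prefix can be inspected with the real tokenizer module (a small sketch):

    from whisper.tokenizer import get_tokenizer

    tokenizer = get_tokenizer(multilingual=True, language="en", task="transcribe")
    # The conditioning prefix: <|startoftranscript|>, then the language token,
    # then the task token, in exactly that order.
    print(tokenizer.sot_sequence)
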
warning Ordering weakly guarded

Input tensor x has its time dimension as the last axis, and F.pad with 'reflect' mode is applied along the last dimension for temporal smoothing

If this fails: If cross-attention weights have a different axis ordering (e.g., time on axis 1), the median filtering operates on the wrong dimension, producing meaningless smoothed attention patterns and incorrect word alignments

whisper/timing.py:median_filter
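
The time-on-last-axis convention shows up in a minimal reimplementation of the filter (a sketch of the idea, not the repository's Triton/CUDA path):

    import torch
    import torch.nn.functional as F

    def median_filter_last_axis(x: torch.Tensor, filter_width: int) -> torch.Tensor:
        # x: (..., time), 2-D or 3-D so reflect padding is valid. Pad and
        # slide along the LAST dimension; with time on any other axis this
        # would smooth across heads or tokens instead.
        pad = filter_width // 2
        x = F.pad(x, (pad, pad), mode="reflect")
        return x.unfold(-1, filter_width, 1).median(dim=-1).values
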
warning Domain unguarded

The tiktoken tokenizer vocabulary exactly matches the model's training vocabulary, with identical token IDs for all special tokens including language identifiers, task tokens, and timestamp tokens

If this fails: If there's a version mismatch between tiktoken and the model's expected vocabulary, token IDs map to wrong words or cause out-of-vocabulary errors, producing completely incorrect transcriptions

whisper/tokenizer.py:get_tokenizer
info Resource weakly guarded

PyTorch version supports scaled_dot_product_attention for memory-efficient attention, and falls back gracefully to manual attention computation when SDPA is unavailable

If this fails: On older PyTorch versions or specific hardware where SDPA fails unexpectedly, the fallback manual attention may have different numerical precision or memory usage patterns, causing subtle differences in transcription quality or OOM errors

whisper/model.py:scaled_dot_product_attention
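
The guard itself is tiny (a sketch of the pattern, with the fallback math written in its plainest form):

    import torch
    import torch.nn.functional as F

    SDPA_AVAILABLE = hasattr(F, "scaled_dot_product_attention")

    def attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # Prefer the fused kernel when the installed PyTorch provides it.
        if SDPA_AVAILABLE:
            return F.scaled_dot_product_attention(q, k, v)
        # Manual fallback: mathematically equivalent, but different kernels
        # and numerics, which is exactly the subtlety noted above.
        w = (q @ k.transpose(-1, -2)) / (q.shape[-1] ** 0.5)
        return w.softmax(dim=-1) @ v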

System Behavior

How the system operates at runtime — where data accumulates, what loops, what waits, and what controls what.

Data Pools

Model Cache (file-store)
Downloaded model checkpoints stored locally with SHA256 verification to avoid re-downloading
Cross-attention Weights (in-memory)
Attention matrices from specific decoder heads that correlate with audio-text alignment, used for word timing
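
The cache check reduces to hashing the file and comparing digests (a simplified sketch of the idea; the function name is hypothetical):

    import hashlib
    from pathlib import Path

    def checksum_ok(path: Path, expected_sha256: str) -> bool:
        # Hash the cached checkpoint; on mismatch the loader re-downloads.
        return hashlib.sha256(path.read_bytes()).hexdigest() == expected_sha256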

Technology Stack

PyTorch (framework)
Primary tensor computation framework for transformer model inference and automatic differentiation
tiktoken (library)
Fast tokenization library handling text encoding/decoding with Whisper's vocabulary and special tokens
NumPy (library)
Numerical operations for audio processing, mel spectrogram computation, and array manipulations
ffmpeg (runtime)
External audio processing tool for format conversion, resampling, and mono channel extraction
Triton (compute)
GPU kernel optimization for median filtering and dynamic time warping operations when CUDA available
Numba (compute)
JIT compilation of dynamic time warping algorithms for CPU-based word timestamp alignment

Frequently Asked Questions

What is whisper used for?

openai/whisper converts audio files to transcribed text using transformer models. It is a 10-component ML inference system written in Python; data flows through 7 distinct pipeline stages across a 20-file codebase.

How is whisper architected?

whisper is organized into 3 architecture layers: Audio Processing, Model Inference, and Decoding & Output. Data flows through 7 distinct pipeline stages. This layered structure keeps concerns separated and modules independent.

How does data flow through whisper?

Data moves through 7 stages: Load audio file → Generate mel spectrogram → Encode audio features → Configure decoding → Generate text tokens → Align tokens with audio → Format final output. Audio files are loaded and resampled to 16kHz mono, then converted to 80-channel log mel spectrograms in 30-second chunks. A transformer encoder processes these features while a decoder generates text tokens autoregressively, guided by special tokens for task specification and language identification. Cross-attention weights enable word-level timestamp alignment through dynamic time warping. This pipeline design reflects a complex multi-stage processing system.

What technologies does whisper use?

The core stack includes PyTorch (Primary tensor computation framework for transformer model inference and automatic differentiation), tiktoken (Fast tokenization library handling text encoding/decoding with Whisper's vocabulary and special tokens), NumPy (Numerical operations for audio processing, mel spectrogram computation, and array manipulations), ffmpeg (External audio processing tool for format conversion, resampling, and mono channel extraction), Triton (GPU kernel optimization for median filtering and dynamic time warping operations when CUDA available), Numba (JIT compilation of dynamic time warping algorithms for CPU-based word timestamp alignment). A focused set of dependencies that keeps the build manageable.

What system dynamics does whisper have?

whisper exhibits 2 data pools (Model Cache, Cross-attention Weights), 2 feedback loops, 4 control points, and 2 delays. The feedback loops handle recursive decoding and quality-based retries. These runtime behaviors shape how the system responds to load, failures, and configuration changes.

What design patterns does whisper use?

4 design patterns detected: Encoder-Decoder Transformer, Special Token Protocol, Sliding Window Processing, Quality-based Retry.

Analyzed on April 20, 2026 by CodeSea.