Topics

Browse all published topics in the mathematics of retrieval-augmented generation.

All ann-indexing embedding-geometry generation-grounding neural-retrieval probabilistic-ir rag-information-theory ranking-fusion retrieval-evaluation retrieval-foundations vector-quantization

intermediate probabilistic-ir

BM25 and the Binary Independence Model

How probability theory turns term counts into a ranking function — and why saturation and length normalization are the two ideas that matter

BM25 is the strongest lexical retrieval baseline in information retrieval, and it is not an arbitrary formula: its inverse-document-frequency factor falls out of the Binary Independence Model with a Jeffreys prior, and its term-frequency saturation is motivated by the 2-Poisson eliteness model. We derive the Robertson–Spärck-Jones weight from the Probability Ranking Principle, show how smoothed IDF emerges, motivate the saturating tf transform and the document-length normalization, assemble the BM25 scoring function, and prove its limit behavior (k₁→0 recovers the binary model, k₁→∞ recovers length-normalized raw term frequency, b interpolates length normalization). An interactive scoring laboratory and a from-scratch NumPy implementation whose tests verify these limits accompany the derivation, with a worked finance example over earnings-call transcripts and 10-K filings.

3 prerequisites

advanced rag-information-theory

Capstone: The Mathematics of a Production Multimodal Financial RAG System

The retrieval stack arrived as a dozen separate topics — lexical scoring, dense MIPS, IVF and PQ and HNSW, late interaction served by PLAID, reciprocal-rank fusion; this capstone composes the published stack into one finance pipeline and proves the three laws that govern it end to end — cascade recall multiplies across stages, hybrid fusion gains exactly when its legs de-correlate, and a fixed compute budget is allocated across heterogeneous per-modality legs by water-filling — with the single exact statement that a full-budget pipeline recovers brute-force retrieval, and everything above it a heuristic speed-for-recall trade measured honestly on a finance frontier

The retrieval half of a production RAG system arrived as a dozen topics; this capstone composes the published stack — lexical scoring, dense MIPS, IVF/PQ/HNSW candidate generation, PLAID-served late interaction, reciprocal-rank fusion — into one finance pipeline and proves the three laws that govern it end to end. Cascade recall multiplies across independent stages, a conservative lower bound the correlated real legs sit above; hybrid fusion beats its best leg by a margin that grows as the legs de-correlate, and fails when a false positive is co-endorsed by two legs; and a fixed compute budget is allocated across heterogeneous per-modality legs by water-filling, equalizing the marginal recall-per-cost. The one exact statement is the collapse anchor — a full-budget pipeline is brute-force retrieval — and a tested notebook that imports the fusion, PLAID, IVF, and over-fetch code owns every number on the finance frontier. The generation and grounding layer above retrieval is named as future work, not derived.

5 prerequisites

intermediate embedding-geometry

Chunking as a Segmentation and Optimization Problem

Where to cut a document, posed as a coherence-maximizing segmentation with an exact dynamic-programming optimum — and the proxy it secretly optimizes

Chunking is the first thing a retrieval pipeline does to a document and the least mathematized — the default is to split every few hundred tokens and move on. Posed properly it is a one-dimensional segmentation problem: choose boundaries that maximize within-chunk coherence. We show that coherence has a clean closed form — for L2-normalized sentence embeddings the within-segment cost is the segment length minus the norm of the sum of its embeddings, which is the length times one minus the mean resultant length, the von Mises-Fisher concentration statistic from the prerequisite topic — so minimizing total cost carves the document into tight clusters on the sphere. Because the cost is additive across segments, the globally optimal segmentation is computed exactly by an O(n squared) dynamic program, which we prove optimal and verify against brute force; TextTiling's greedy depth scores and fixed-size chunking are heuristics that cannot beat it. We then confront the honest catch that makes this more than an algorithms exercise: coherence is a proxy for downstream retrieval quality, and the harness shows boundary-recovery F1 peaking at the true section count while the coherence cost keeps falling under over-segmentation. An interactive Chunking Laboratory and a tested implementation accompany the derivation, with a synthetic 10-K filing on which the optimal segmentation recovers the section structure that fixed-size chunking misses.

Topics

BM25 and the Binary Independence Model

Capstone: The Mathematics of a Production Multimodal Financial RAG System

Chunking as a Segmentation and Optimization Problem

Conformal Factuality: Distribution-Free Correctness Guarantees for Generation

Context Selection: Submodular Coverage, MMR, and Determinantal Point Processes

Contrastive Learning for Retrieval: InfoNCE, Temperature, and Negative Sampling

Cross-Encoders and the Reranking Cascade

Cross-Modal Contrastive Alignment and the Modality Gap

Dense Retrieval and Dual Encoders: Architecture, Expressivity, and the Cost of Negatives

Faithfulness and Groundedness as Measurable Quantities

Filtered and Incremental ANN: Predicate Search, Deletion, and Graph Connectivity

GraphRAG: Community Detection and the Modularity of Knowledge

Hard-Negative Mining and Debiased Contrastive Training (ANCE)

High-Dimensional Geometry and the Concentration of Distances

HNSW: Hierarchical Navigable Small-World Construction and Search

How Many Dimensions Does Relevance Need? Sign-Rank and Margin Complexity

Knowledge Distillation for Retrieval: Teacher–Student Transfer (MarginMSE)

LambdaRank, LambdaMART, and Listwise Objectives

Late Interaction and Learned Sparse Retrieval: ColBERT and SPLADE

Learning to Rank: Pointwise, Pairwise, and RankNet

LLM Rerankers: Listwise Permutation Objectives and RankGPT

LLM-as-Judge and Faithfulness: RAGAS as a Family of Estimators

Locality-Sensitive Hashing: Collision Probability and the ρ Exponent

Matryoshka Representations: Jointly Trained Nested Subspaces

MIPS Hardness and the Limits of Exact Nearest-Neighbor Search

Multi-Hop and Iterative Retrieval as Search over an Evidence Space

Multi-Vector ANN: Indexing and Pruning MaxSim at Scale (PLAID)

Navigable Small-World Graphs and the Mathematics of Greedy Routing

NDCG: Graded Relevance and Discount Geometry

Normalization, the Hypersphere, and von Mises–Fisher Geometry

Optimized Product Quantization and Score-Aware Quantization

PCA as Optimal Linear Dimensionality Reduction for Embeddings

Pointwise Mutual Information: What Retrieval Adds to Generation, in Bits

Product Quantization and Asymmetric Distance Computation

Query Transformation and HyDE: Correcting Distribution Shift in Embedding Space

Query-Likelihood Language Models and Smoothing

Random Projections and the Johnson–Lindenstrauss Lemma

Rank Fusion: Reciprocal Rank Fusion and the Geometry of Rank Aggregation

Relevance Feedback and Query Expansion: Rocchio and RM3

Retrieval versus Long Context: Attention Complexity and Positional Bias

Score Calibration, Drift Detection, and Significance Testing for Retrieval

Selective Generation: When a RAG System Should Abstain

Set Metrics: Precision, Recall, MAP, and MRR as Estimators

The Inverted Index and Safe Dynamic Pruning (WAND, BlockMax-WAND)

The Probability Ranking Principle

The Retrieval Problem: Relevance, Similarity, and the Geometry of Scores

The Retriever as a Noisy Channel: Recall, Precision, and Information Limits

The Vector Space Model and TF-IDF

Vector Quantization and the Lloyd–Max Optimality Conditions

Voronoi Partitioning and the Inverted-File Index