Curriculum Roadmap

A planned map of content areas. Published topics are linked; everything else is on the roadmap.

Explore the full dependency graph, or browse by track below.

Tracks: Retrieval Foundations, Embedding Geometry, Probabilistic IR, ANN Indexing, Vector Quantization, Neural Retrieval, Ranking Fusion, Retrieval Evaluation, Generation Grounding, RAG Information Theory, Formalml

Status: Published, Planned, Cross-site

Foundations

Retrieval Foundations

Track complete

Retrieval as ranking by a relevance functional; the metric and inner-product structure of similarity; and the complexity-theoretic limits of exact search. The root of the dependency graph.

foundational retrieval-foundations

The Retrieval Problem: Relevance, Similarity, and the Geometry of Scores

Retrieval as ranking by a relevance functional — and the three similarity scores that agree on the sphere and diverge off it

Retrieval is ranking: given a query, score every document by a relevance functional rel(q, d) and return the top k, a set-valued operator on the resulting order. Because only the order matters, relevance is ordinal even though scores are cardinal — the ranking is invariant under any strictly monotone transform of the score, a fact we will lean on repeatedly. We then study the three similarity functions retrieval actually uses — Euclidean distance, the dot product, and cosine similarity — through the single identity ||a-b||^2 = ||a||^2 + ||b||^2 - 2<a,b>. On the unit sphere this identity collapses the three into one: ranking by Euclidean distance, by dot product, and by cosine all induce the same order. Off the sphere they diverge, because magnitude matters for the dot product but is quotiented away by cosine — the divergence that motivates normalization throughout the rest of the curriculum. We separate which of these are true metrics (Euclidean is; cosine distance violates the triangle inequality; the dot product is not a metric at all) because the triangle inequality is exactly the structure that tree- and graph-based approximate-nearest-neighbor indexes later exploit. The level sets make the picture geometric: equal-score loci are hyperplanes for the dot product, spheres for Euclidean distance, and cones for cosine. An interactive similarity playground and a tested, deterministic implementation accompany the derivation, with a finance example showing the same query ranking documents differently under dot product and cosine when document norms vary.
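
The claims above can be checked numerically. The following is a minimal sketch in plain Python, not the tested implementation that accompanies the topic; the vectors and the helper names (`rank`, `all_rankings`, `cosine_distance`) are hypothetical and chosen only for illustration. It verifies the identity, shows the three rankings disagreeing off the sphere and agreeing after normalization, checks invariance under one strictly monotone transform, and exhibits one triple for which cosine distance breaks the triangle inequality.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def cosine_distance(a, b):
    return 1.0 - cosine(a, b)

def normalize(a):
    n = norm(a)
    return [x / n for x in a]

def rank(query, docs, score):
    """Document indices sorted best-first; a higher score means more relevant."""
    return sorted(range(len(docs)), key=lambda i: score(query, docs[i]), reverse=True)

def all_rankings(query, docs):
    return {
        "dot": rank(query, docs, dot),
        "cosine": rank(query, docs, cosine),
        # Smaller distance is better, so negate it to obtain a score.
        "euclidean": rank(query, docs, lambda q, d: -sq_euclidean(q, d)),
    }

# Hypothetical toy data: three documents with very different norms.
query = [1.0, 0.2]
docs = [[0.9, 0.1], [3.0, 3.0], [0.1, 1.0]]

# The identity ||a-b||^2 = ||a||^2 + ||b||^2 - 2<a,b> holds for any pair.
a, b = query, docs[1]
identity_gap = abs(sq_euclidean(a, b) - (dot(a, a) + dot(b, b) - 2 * dot(a, b)))

off_sphere = all_rankings(query, docs)
on_sphere = all_rankings(normalize(query), [normalize(d) for d in docs])

# A strictly monotone transform of the score (here exp) leaves the order unchanged.
exp_rank = rank(query, docs, lambda q, d: math.exp(dot(q, d)))

# Cosine distance and the triangle inequality: the detour through y
# is shorter than the direct distance from x to z.
x, y, z = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]
direct = cosine_distance(x, z)
detour = cosine_distance(x, y) + cosine_distance(y, z)

print("off sphere:", off_sphere)
print("on sphere: ", on_sphere)
print("exp(dot) ranking equals dot ranking:", exp_rank == off_sphere["dot"])
print("triangle inequality violated:", direct > detour)
```

With these inputs the dot product puts the long document [3.0, 3.0] first, while cosine and Euclidean distance both put [0.9, 0.1] first and disagree on the rest; after normalization all three orders coincide.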

Start here

advanced retrieval-foundations

MIPS Hardness and the Limits of Exact Nearest-Neighbor Search

Why maximum inner-product search is not a metric problem, why exact high-dimensional search has no known truly sublinear algorithm, and why we approximate

Retrieval ranks documents by maximum inner product: return the document maximizing <q, d>. We first show why this is not the metric nearest-neighbor problem it is often mistaken for: the inner product has no triangle inequality, and a vector need not be its own best match, so the space-partitioning intuition that organizes metric search does not transfer. We then give the asymmetric lifting transforms (Bachrach et al.; Shrivastava-Li; Neyshabur-Srebro) that turn MIPS into Euclidean nearest-neighbor search on a lifted sphere, prove that the transform preserves the argmax exactly, and flag that it does not preserve approximation ratios. With the problems reduced to one another, we reach the hardness: high-dimensional nearest-neighbor and closest-pair search have no known algorithm that is simultaneously exact, truly sublinear per query, and near-linear in space. We present the Orthogonal Vectors problem and its reduction to closest/farthest pair, state the Strong Exponential Time Hypothesis precisely, and derive the conditional n^(2-o(1)) lower bound, emphasizing that this is conditional hardness, not a proven impossibility. The payoff is the trade-off triangle that motivates the rest of the curriculum: in high ambient dimension no known method delivers exactness, sublinear time, and near-linear space at once, so you relax one. Either give up exactness for approximate indexes, or exploit the low intrinsic dimension that the concentration topic showed real embeddings actually have. An interactive laboratory and a tested notebook accompany the derivation, with a finance example on why exact MIPS over a multimodal corpus is hopeless at query time.
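
To make the reduction concrete, here is a minimal sketch of one lifting in the style attributed above to Bachrach et al.: append sqrt(M^2 - ||d||^2) to each document, with M the largest document norm, and append 0 to the query. Every lifted document then has norm M, and ||q' - d'||^2 = ||q||^2 + M^2 - 2<q, d>, so the Euclidean nearest neighbor of the lifted query is the inner-product maximizer. The function names and toy vectors are hypothetical, and the brute-force scans stand in for whatever index would be used in practice.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def lift_documents(docs):
    """Append sqrt(M^2 - ||d||^2) to each document, M being the largest norm."""
    max_sq_norm = max(dot(d, d) for d in docs)
    # max(..., 0.0) guards against tiny negative values caused by rounding.
    return [d + [math.sqrt(max(max_sq_norm - dot(d, d), 0.0))] for d in docs]

def lift_query(q):
    """Append a zero coordinate so the query ignores the documents' padding."""
    return q + [0.0]

def argmax_inner_product(q, docs):
    return max(range(len(docs)), key=lambda i: dot(q, docs[i]))

def argmin_euclidean(q, docs):
    return min(range(len(docs)), key=lambda i: sq_euclidean(q, docs[i]))

# Hypothetical toy corpus with unequal norms.
docs = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0], [-1.0, 1.0]]
lifted_docs = lift_documents(docs)

# A vector need not be its own best match under the inner product:
# querying with docs[0] returns docs[1], which points the same way but is longer.
self_match = argmax_inner_product(docs[0], docs)
print("best inner-product match for docs[0]:", self_match)

queries = [[1.0, 0.2], [0.1, 1.0], [-2.0, 0.5]]
for q in queries:
    mips = argmax_inner_product(q, docs)
    plain_nn = argmin_euclidean(q, docs)
    lifted_nn = argmin_euclidean(lift_query(q), lifted_docs)
    print(q, "MIPS:", mips, "plain NN:", plain_nn, "lifted NN:", lifted_nn)
```

For the query [1.0, 0.2], the plain Euclidean nearest neighbor is document 0 while the inner-product maximizer is document 1; only the search over lifted vectors recovers the latter. The sketch checks the argmax only and says nothing about approximation ratios, which the text notes are not preserved.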

Curriculum Roadmap

Foundations

Retrieval Foundations

The Retrieval Problem: Relevance, Similarity, and the Geometry of Scores

MIPS Hardness and the Limits of Exact Nearest-Neighbor Search

Embedding-Space Geometry

Chunking as a Segmentation and Optimization Problem

High-Dimensional Geometry and the Concentration of Distances

Normalization, the Hypersphere, and von Mises–Fisher Geometry

Random Projections and the Johnson–Lindenstrauss Lemma

Matryoshka Representations: Jointly Trained Nested Subspaces

PCA as Optimal Linear Dimensionality Reduction for Embeddings

Retrieval Mechanics

Probabilistic IR

The Vector Space Model and TF-IDF

BM25 and the Binary Independence Model

The Inverted Index and Safe Dynamic Pruning (WAND, BlockMax-WAND)

The Probability Ranking Principle

Relevance Feedback and Query Expansion: Rocchio and RM3

Query-Likelihood Language Models and Smoothing

Vector Quantization

Optimized Product Quantization and Score-Aware Quantization

Product Quantization and Asymmetric Distance Computation

Vector Quantization and the Lloyd–Max Optimality Conditions

ANN Index Structures

Filtered and Incremental ANN: Predicate Search, Deletion, and Graph Connectivity

HNSW: Hierarchical Navigable Small-World Construction and Search

Voronoi Partitioning and the Inverted-File Index

Locality-Sensitive Hashing: Collision Probability and the ρ Exponent

Multi-Vector ANN: Indexing and Pruning MaxSim at Scale (PLAID)

Navigable Small-World Graphs and the Mathematics of Greedy Routing

Learned Retrieval & Ranking

Neural & Learned Retrieval

Cross-Encoders and the Reranking Cascade

Cross-Modal Contrastive Alignment and the Modality Gap

Dense Retrieval and Dual Encoders: Architecture, Expressivity, and the Cost of Negatives

How Many Dimensions Does Relevance Need? Sign-Rank and Margin Complexity

Contrastive Learning for Retrieval: InfoNCE, Temperature, and Negative Sampling

Late Interaction and Learned Sparse Retrieval: ColBERT and SPLADE

Hard-Negative Mining and Debiased Contrastive Training (ANCE)

Knowledge Distillation for Retrieval: Teacher–Student Transfer (MarginMSE)

Ranking, Fusion & Reranking

LambdaRank, LambdaMART, and Listwise Objectives

Learning to Rank: Pointwise, Pairwise, and RankNet

LLM Rerankers: Listwise Permutation Objectives and RankGPT

Rank Fusion: Reciprocal Rank Fusion and the Geometry of Rank Aggregation

Retrieval & RAG Evaluation

NDCG: Graded Relevance and Discount Geometry

Set Metrics: Precision, Recall, MAP, and MRR as Estimators

Conformal Factuality: Distribution-Free Correctness Guarantees for Generation

LLM-as-Judge and Faithfulness: RAGAS as a Family of Estimators

Score Calibration, Drift Detection, and Significance Testing for Retrieval

Generation & Reasoning

Generation & Grounding

Faithfulness and Groundedness as Measurable Quantities

Query Transformation and HyDE: Correcting Distribution Shift in Embedding Space

Retrieval versus Long Context: Attention Complexity and Positional Bias

Selective Generation: When a RAG System Should Abstain

Information Theory of RAG

Capstone: The Mathematics of a Production Multimodal Financial RAG System

Context Selection: Submodular Coverage, MMR, and Determinantal Point Processes

GraphRAG: Community Detection and the Modularity of Knowledge

Multi-Hop and Iterative Retrieval as Search over an Evidence Space

Pointwise Mutual Information: What Retrieval Adds to Generation, in Bits

The Retriever as a Noisy Channel: Recall, Precision, and Information Limits
