Curriculum Roadmap

A map of the planned content areas. Published topics are linked; everything else is still on the roadmap.

Foundations

Retrieval Foundations

2 planned

Retrieval as ranking by a relevance functional; the metric and inner-product structure of similarity; and the complexity-theoretic limits of exact search. The root of the dependency graph.

  • The Retrieval Problem: Relevance, Similarity, and the Geometry of Scores (Planned)
  • MIPS Hardness and the Limits of Exact Nearest-Neighbor Search (Planned)

Embedding-Space Geometry

6 planned

Where embeddings live and what approximate nearest-neighbor (ANN) search must contend with: concentration of measure, hypersphere and von Mises–Fisher geometry, PCA and random projections, Johnson–Lindenstrauss, and chunking as segmentation.

Need the ML foundations? formalml.com →

  • High-Dimensional Geometry and the Concentration of Distances (Planned)
  • Normalization, the Hypersphere, and von Mises–Fisher Geometry (Planned)
  • PCA as Optimal Linear Dimensionality Reduction for Embeddings (Planned)
  • Matryoshka Representations: Jointly Trained Nested Subspaces (Planned)
  • Random Projections and the Johnson–Lindenstrauss Lemma (Planned)
  • Chunking as a Segmentation and Optimization Problem (Planned)

Retrieval Mechanics

Probabilistic IR

1 published / 5 planned

The classical algebraic and probabilistic retrieval models that form the lexical half of hybrid retrieval: the vector space model, the Probability Ranking Principle, BM25, query-likelihood models, and the inverted index.

  • The Vector Space Model and TF-IDF (Planned)
  • The Inverted Index and Safe Dynamic Pruning (WAND, BlockMax-WAND) (Planned)
  • The Probability Ranking Principle (Planned)
  • Query-Likelihood Language Models and Smoothing (Planned)
  • Relevance Feedback and Query Expansion: Rocchio and RM3 (Planned)

Vector Quantization

3 planned

Lossy compression of embedding vectors for memory-bounded search, treated through rate-distortion and estimation theory: Lloyd–Max optimality, product quantization, and score-aware anisotropic quantization.

  • Vector Quantization and the Lloyd–Max Optimality Conditions (Planned)
  • Product Quantization and Asymmetric Distance Computation (Planned)
  • Optimized PQ and Score-Aware Quantization (OPQ, ScaNN) (Planned)

ANN Index Structures

6 planned

The data structures behind sublinear vector retrieval, at the level of their actual mathematics: IVF Voronoi partitioning, LSH sensitivity theory, navigable small-world graphs and HNSW, and multi-vector and filtered ANN.

  • Voronoi Partitioning and the IVF Index (Planned)
  • Locality-Sensitive Hashing: Collision Probability and the ρ Exponent (Planned)
  • Navigable Small-World Graphs and the Mathematics of Greedy Routing (Planned)
  • HNSW: Hierarchical Navigable Small-World Construction and Search (Planned)
  • Filtered and Incremental ANN: Predicate Search, Deletion, and Graph Connectivity (Planned)
  • Multi-Vector ANN: Indexing and Pruning MaxSim at Scale (PLAID) (Planned)

Learned Retrieval & Ranking

Neural & Learned Retrieval

8 planned

Learned representations for retrieval, defined by training objectives and expressivity claims: InfoNCE contrastive training, dense dual encoders, late interaction and learned sparse retrieval, cross-encoders, distillation, and cross-modal alignment.

  • Contrastive Learning for Retrieval: InfoNCE, Temperature, and Negative Sampling (Planned)
  • Hard-Negative Mining and Debiased Contrastive Training (ANCE) (Planned)
  • Dense Retrieval and Dual Encoders (DPR) (Planned)
  • How Many Dimensions Does Relevance Need? Sign-Rank and Margin Complexity (Planned)
  • Late Interaction and Learned Sparse Retrieval: ColBERT and SPLADE (Planned)
  • Cross-Encoders and the Reranking Cascade (Planned)
  • Knowledge Distillation for Retrieval: Teacher–Student Transfer (MarginMSE) (Planned)
  • Cross-Modal Contrastive Alignment and the Modality Gap (Planned)

Ranking, Fusion & Reranking

4 planned

The mathematics of producing, combining, and reordering ranked lists: learning-to-rank, reciprocal rank fusion and its social-choice grounding, cross-encoder cascades, and LLM listwise rerankers.

  • Learning to Rank: Pointwise, Pairwise, and RankNet (Planned)
  • LambdaRank, LambdaMART, and Listwise Objectives (Planned)
  • Rank Fusion: Reciprocal Rank Fusion and the Geometry of Rank Aggregation (Planned)
  • LLM Rerankers: Listwise Permutation Objectives and RankGPT (Planned)

Retrieval & RAG Evaluation

5 planned

Evaluation treated as statistics: ranking metrics as estimators, significance testing, calibration, drift detection, LLM-as-judge reliability, and distribution-free conformal factuality guarantees.

  • Set Metrics: Precision, Recall, MAP, and MRR as Estimators (Planned)
  • NDCG: Graded Relevance and Discount Geometry (Planned)
  • Score Calibration, Drift Detection, and Significance Testing for Retrieval (Planned)
  • LLM-as-Judge and Faithfulness: RAGAS as a Family of Estimators (Planned)
  • Conformal Factuality: Distribution-Free Correctness Guarantees for Generation (Planned)

Generation & Reasoning

Generation & Grounding

4 planned

The mathematics of what happens once context is retrieved: the retrieval-vs-long-context tradeoff, query transformation as distribution-shift correction, faithfulness as a measurable quantity, and selective generation.

  • Retrieval versus Long Context: Attention Complexity and Positional Bias (Planned)
  • Query Transformation and HyDE: Correcting Distribution Shift in Embedding Space (Planned)
  • Faithfulness and Groundedness as Measurable Quantities (Planned)
  • Selective Generation: When a RAG System Should Abstain (Planned)

Information Theory of RAG

6 planned

The "why retrieval works" layer: mutual information between query, context, and answer; the retriever as a noisy channel; submodular and DPP context selection; multi-hop retrieval; GraphRAG; and the multimodal financial capstone.

  • Pointwise Mutual Information: What Retrieval Adds to Generation, in Bits (Planned)
  • The Retriever as a Noisy Channel: Recall, Precision, and Information Limits (Planned)
  • Context Selection: Submodular Coverage, MMR, and Determinantal Point Processes (Planned)
  • Multi-Hop and Iterative Retrieval as Search over an Evidence Space (Planned)
  • GraphRAG: Community Detection and the Modularity of Knowledge (Planned)
  • Capstone: Architecture and Mathematics of a Production Multimodal Financial RAG System (Planned)

Building your foundations?

Many topics here build directly on machine-learning theory (representation learning, information theory, conformal prediction) covered on formalml.com. The deeper foundations live on formalcalculus.com (linear algebra, optimization, analysis) and formalstatistics.com (estimation, testing, calibration), all with the same geometric-first approach.