Curriculum Roadmap
A planned map of content areas. Published topics are linked; everything else is on the roadmap.
Explore the full dependency graph, or browse by track below.
Foundations
Retrieval Foundations
2 planned
Retrieval as ranking by a relevance functional; the metric and inner-product structure of similarity; and the complexity-theoretic limits of exact search. The root of the dependency graph.
- The Retrieval Problem: Relevance, Similarity, and the Geometry of Scores [Planned]
- MIPS Hardness and the Limits of Exact Nearest-Neighbor Search [Planned]
Embedding-Space Geometry
6 planned
Where embeddings live and what ANN must contend with: concentration of measure, hypersphere and von Mises–Fisher geometry, PCA and random projections, Johnson–Lindenstrauss, and chunking as segmentation.
Need the ML foundations? formalml.com →
- High-Dimensional Geometry and the Concentration of Distances [Planned]
- Normalization, the Hypersphere, and von Mises–Fisher Geometry [Planned]
- PCA as Optimal Linear Dimensionality Reduction for Embeddings [Planned]
- Matryoshka Representations: Jointly Trained Nested Subspaces [Planned]
- Random Projections and the Johnson–Lindenstrauss Lemma [Planned]
- Chunking as a Segmentation and Optimization Problem [Planned]
Retrieval Mechanics
Probabilistic IR
1 published / 5 planned
The classical algebraic and probabilistic retrieval models that form the lexical half of hybrid retrieval: the vector space model, the Probability Ranking Principle, BM25, query-likelihood models, and the inverted index.
- The Vector Space Model and TF-IDF [Planned]
- The Inverted Index and Safe Dynamic Pruning (WAND, BlockMax-WAND) [Planned]
- The Probability Ranking Principle [Planned]
- Query-Likelihood Language Models and Smoothing [Planned]
- Relevance Feedback and Query Expansion: Rocchio and RM3 [Planned]
Vector Quantization
3 planned
Lossy compression of embedding vectors for memory-bounded search, framed by rate-distortion and estimation theory: Lloyd–Max optimality, product quantization, and score-aware anisotropic quantization.
- Vector Quantization and the Lloyd–Max Optimality Conditions [Planned]
- Product Quantization and Asymmetric Distance Computation [Planned]
- Optimized PQ and Score-Aware Quantization (OPQ, ScaNN) [Planned]
ANN Index Structures
6 planned
The data structures behind sublinear vector retrieval, at the level of their actual mathematics: IVF Voronoi partitioning, LSH sensitivity theory, navigable small-world graphs and HNSW, multi-vector and filtered ANN.
- Voronoi Partitioning and the IVF Index [Planned]
- Locality-Sensitive Hashing: Collision Probability and the ρ Exponent [Planned]
- Navigable Small-World Graphs and the Mathematics of Greedy Routing [Planned]
- HNSW: Hierarchical Navigable Small-World Construction and Search [Planned]
- Filtered and Incremental ANN: Predicate Search, Deletion, and Graph Connectivity [Planned]
- Multi-Vector ANN: Indexing and Pruning MaxSim at Scale (PLAID) [Planned]
Learned Retrieval & Ranking
Neural & Learned Retrieval
8 planned
Learned representations for retrieval, defined by training objectives and expressivity claims: InfoNCE contrastive training, dense dual encoders, late interaction and learned sparse, cross-encoders, distillation, and cross-modal alignment.
- Contrastive Learning for Retrieval: InfoNCE, Temperature, and Negative Sampling [Planned]
- Hard-Negative Mining and Debiased Contrastive Training (ANCE) [Planned]
- Dense Retrieval and Dual Encoders (DPR) [Planned]
- How Many Dimensions Does Relevance Need? Sign-Rank and Margin Complexity [Planned]
- Late Interaction and Learned Sparse Retrieval: ColBERT and SPLADE [Planned]
- Cross-Encoders and the Reranking Cascade [Planned]
- Knowledge Distillation for Retrieval: Teacher–Student Transfer (MarginMSE) [Planned]
- Cross-Modal Contrastive Alignment and the Modality Gap [Planned]
Ranking, Fusion & Reranking
4 planned
The mathematics of producing, combining, and reordering ranked lists: learning-to-rank, reciprocal rank fusion and its social-choice grounding, cross-encoder cascades, and LLM listwise rerankers.
- Learning to Rank: Pointwise, Pairwise, and RankNet [Planned]
- LambdaRank, LambdaMART, and Listwise Objectives [Planned]
- Rank Fusion: Reciprocal Rank Fusion and the Geometry of Rank Aggregation [Planned]
- LLM Rerankers: Listwise Permutation Objectives and RankGPT [Planned]
Retrieval & RAG Evaluation
5 planned
Evaluation treated as statistics: ranking metrics as estimators, significance testing, calibration, drift detection, LLM-as-judge reliability, and distribution-free conformal factuality guarantees.
- Set Metrics: Precision, Recall, MAP, and MRR as Estimators [Planned]
- NDCG: Graded Relevance and Discount Geometry [Planned]
- Score Calibration, Drift Detection, and Significance Testing for Retrieval [Planned]
- LLM-as-Judge and Faithfulness: RAGAS as a Family of Estimators [Planned]
- Conformal Factuality: Distribution-Free Correctness Guarantees for Generation [Planned]
Generation & Reasoning
Generation & Grounding
4 planned
The mathematics of what happens once context is retrieved: the retrieval-vs-long-context tradeoff, query transformation as distribution-shift correction, faithfulness as a measurable quantity, and selective generation.
- Retrieval versus Long Context: Attention Complexity and Positional Bias [Planned]
- Query Transformation and HyDE: Correcting Distribution Shift in Embedding Space [Planned]
- Faithfulness and Groundedness as Measurable Quantities [Planned]
- Selective Generation: When a RAG System Should Abstain [Planned]
Information Theory of RAG
6 planned
The "why retrieval works" layer: mutual information between query, context, and answer; the retriever as a noisy channel; submodular and DPP context selection; multi-hop retrieval; GraphRAG; and the multimodal financial capstone.
- Pointwise Mutual Information: What Retrieval Adds to Generation, in Bits [Planned]
- The Retriever as a Noisy Channel: Recall, Precision, and Information Limits [Planned]
- Context Selection: Submodular Coverage, MMR, and Determinantal Point Processes [Planned]
- Multi-Hop and Iterative Retrieval as Search over an Evidence Space [Planned]
- GraphRAG: Community Detection and the Modularity of Knowledge [Planned]
- Capstone: Architecture and Mathematics of a Production Multimodal Financial RAG System [Planned]
Building your foundations?
Many topics here build directly on machine-learning theory (representation learning, information theory, conformal prediction) covered on formalml.com. The deeper foundations live on formalcalculus.com (linear algebra, optimization, analysis) and formalstatistics.com (estimation, testing, calibration), all with the same geometric-first approach.