About
Mission
formalRAG treats retrieval-augmented generation as a mathematical subject. The web is saturated with tutorials on how to build a RAG application; almost none treat the mathematics of retrieval with rigor. This site fills that gap. Every topic receives the three-pillar treatment shared across the formal series: rigorous mathematical exposition, interactive visual intuition, and working code you can run immediately.
RAG is half mathematics and half systems engineering. We formalize the genuine mathematics — the geometry of embedding spaces, the algorithms behind approximate nearest-neighbor search, probabilistic information retrieval, ranking theory, the statistics of evaluation, and the information theory of what retrieval adds to generation — and we treat the systems layer wherever it has real mathematical content. Where a celebrated method's guarantees are heuristic rather than proven, we say so plainly.
The formal series
formalRAG is the fourth site in a family of mathematics-heavy explainers. Its prerequisites often live on its sibling sites: formalML (representation learning, information theory, the foundations of ML), formalStatistics (estimation, hypothesis testing, calibration), and formalCalculus (linear algebra, optimization, analysis). Topics link directly to the foundations they rely on.
Author
Jonathan Rocha is a data scientist and researcher specializing in financial NLP and the mathematical foundations of machine learning. He holds an MS in Data Science from SMU, an MA in English from Texas A&M University, and a BA in History from Texas A&M. He is the author of Applied NLP for Finance and builds production retrieval-augmented generation systems, including a multimodal financial RAG system that indexes earnings-call audio, SEC filings, financial charts, and news under unified embeddings. That system is the running case study that threads through this site and anchors its capstone.
DataSalt
formalrag.com is an independent educational project by the founder of DataSalt LLC.