How probability theory turns term counts into a ranking function — and why saturation and length normalization are the two ideas that matter
BM25 is the standard strong lexical baseline in information retrieval, and it is not an arbitrary formula: its inverse-document-frequency factor falls out of the Binary Independence Model with a Jeffreys prior, and its term-frequency saturation is motivated by the 2-Poisson eliteness model. We derive the Robertson–Spärck Jones weight from the Probability Ranking Principle, show how smoothed IDF emerges, motivate the saturating tf transform and the document-length normalization, assemble the BM25 scoring function, and prove its limiting behavior: k₁→0 recovers the binary model, k₁→∞ recovers length-normalized raw term frequency, and b interpolates between no length normalization (b=0) and full length normalization (b=1). The derivation is accompanied by an interactive scoring laboratory, a from-scratch NumPy implementation whose tests verify these limits, and a worked finance example over earnings-call transcripts and 10-K filings.
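To make the limits above concrete, here is a minimal NumPy sketch. It assumes the common textbook form of BM25: the Robertson–Spärck Jones weight with 0.5 smoothing and no relevance information, multiplied by a saturating, length-normalized tf factor. The function names (`rsj_idf`, `bm25_tf`, `bm25_score`) and the defaults `k1=1.2`, `b=0.75` are illustrative assumptions and are not taken from the implementation that accompanies the derivation.

```python
import numpy as np


def rsj_idf(n_docs, doc_freq):
    """RSJ weight with no relevance information: log((N - n + 0.5) / (n + 0.5)).

    Assumption: this unclipped form is used; it goes negative when a term
    occurs in more than half of the documents.
    """
    doc_freq = np.asarray(doc_freq, dtype=float)
    return np.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Saturating tf factor: tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))."""
    tf = np.asarray(tf, dtype=float)
    norm = 1.0 - b + b * (doc_len / avg_len)  # document-length normalization
    num, den = np.broadcast_arrays(tf * (k1 + 1.0), tf + k1 * norm)
    out = np.zeros(num.shape)
    # Guard the 0/0 case (tf = 0 and k1 = 0): an absent term contributes 0.
    np.divide(num, den, out=out, where=den > 0)
    return out


def bm25_score(tf, doc_freq, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """Sum of idf * tf-factor over the query terms of a single document."""
    weights = rsj_idf(n_docs, doc_freq) * bm25_tf(tf, doc_len, avg_len, k1, b)
    return float(np.sum(weights))


if __name__ == "__main__":
    tf = np.array([0, 1, 3, 10])  # hypothetical term counts
    # k1 = 0: every present term counts once (binary model).
    print(bm25_tf(tf, doc_len=100, avg_len=100, k1=0.0))
    # Very large k1: approaches tf / (1 - b + b * dl / avgdl).
    print(bm25_tf(tf, doc_len=200, avg_len=100, k1=1e9, b=1.0))
    # b = 0: document length has no effect on the tf factor.
    print(bm25_tf(tf, doc_len=50, avg_len=100, b=0.0))
```

For any finite `k1 > 0` the tf factor is bounded above by `k1 + 1`. That bound is the saturation property: repeated occurrences of a term add less and less to the score.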