How an inverted index works
A from-scratch visualization of the data structure DuckDB FTS (and almost every text search engine) builds to make keyword queries fast. Click around — it's interactive.
A forward index answers "what tokens does this document contain?" — useful for showing documents. An inverted index answers "what documents contain this token?" — useful for searching. The flip is the whole trick. Plus a tokenization + stopword-removal pipeline before the flip to keep the index small and matchable.
1.The pipeline
An FTS engine doesn't store raw text. It runs every document through a preprocessing pipeline that yields a clean list of "indexable" tokens, then stores the inversion of those tokens:
the, is, to…) that match almost everythingcooked → cook, tickets → ticket. We skip stemming here for clarity.Stopword list used here:
2.The data — five documents
We have five (fictional) iMessage chunks. They're shown below as a forward index: each document paired with the tokens it contains. Stopwords are struck through to show they're filtered out before indexing.
The flip is the whole structural difference. The forward index has one entry per document. The
inverted index has one entry per unique token, and each entry holds a "postings list" of every document that
contains it. Querying "thanksgiving" against the forward index is O(N) — we'd have to read every
document. Against the inverted index it's O(1) — direct lookup of the term's postings list.
3.Try a multi-term query
Type a query below. The engine tokenizes it, drops stopwords, looks up each remaining term in the inverted index, and intersects (AND) or unions (OR) the postings lists. Documents matching are highlighted in the panel above.
The cost of resolving an N-term query is approximately: cost ≈ sum(|postings_list(term)|) — already-tiny because each list is just a list of doc IDs. Real engines (DuckDB included) sort postings lists by doc ID so intersection is a linear merge, not a quadratic loop.
4.From inverted index to BM25 score
An inverted index just tells you which docs contain a term. BM25 builds on that with three extra numbers per match:
- tf (term frequency): how many times the term appears in the doc. More mentions of the query term → likely more relevant.
- df (document frequency): how many docs in the corpus contain the term. Common terms get downweighted via
idf = log(N/df). - doc length: matches in a short, focused doc count for more than matches in a long, sprawling one (BM25 normalizes against avg doc length with parameter b).
All three of these are either already in the inverted index or computed from it. The postings list gives you tf per doc (just count entries for that doc) and df for the term (length of the postings list). Doc length is a separate small lookup table.
BM25(query, doc) = Σ idf(t) · (tf(t,doc) · (k₁+1)) / (tf(t,doc) + k₁ · (1 - b + b · |doc|/avgdl))
t ∈ query
This is exactly what DuckDB's FTS extension computes when you call match_bm25(...) against an indexed table.
The inverted index does the heavy lifting (finding candidate docs); BM25 just plugs in the per-term and per-doc counts
the index has already kept.