be right back · interactive explainer

How an inverted index works

A from-scratch visualization of the data structure DuckDB FTS (and almost every text search engine) builds to make keyword queries fast. Click around — it's interactive.

tl;dr

A forward index answers "what tokens does this document contain?" — useful for showing documents. An inverted index answers "what documents contain this token?" — useful for searching. The flip is the whole trick. Plus a tokenization + stopword-removal pipeline before the flip to keep the index small and matchable.

1.The pipeline

An FTS engine doesn't store raw text. It runs every document through a preprocessing pipeline that yields a clean list of "indexable" tokens, then stores the inversion of those tokens:

step 1

tokenize

Lowercase, split on whitespace and punctuation

step 2

strip stopwords

Drop high-frequency words (the, is, to…) that match almost everything

step 3

stem (optional)

cooked → cook, tickets → ticket. We skip stemming here for clarity.

step 4

invert

For each token, record every (doc_id, position) pair it appears in.

Stopword list used here:

2.The data — five documents

We have five (fictional) iMessage chunks. They're shown below as a forward index: each document paired with the tokens it contains. Stopwords are struck through to show they're filtered out before indexing.

Forward index — documents as token streams

Inverted index — same data, flipped (click a term to highlight)

The flip is the whole structural difference. The forward index has one entry per document. The inverted index has one entry per unique token, and each entry holds a "postings list" of every document that contains it. Querying "thanksgiving" against the forward index is O(N) — we'd have to read every document. Against the inverted index it's O(1) — direct lookup of the term's postings list.

3.Try a multi-term query

Type a query below. The engine tokenizes it, drops stopwords, looks up each remaining term in the inverted index, and intersects (AND) or unions (OR) the postings lists. Documents matching are highlighted in the panel above.

The cost of resolving an N-term query is approximately: cost ≈ sum(|postings_list(term)|) — already-tiny because each list is just a list of doc IDs. Real engines (DuckDB included) sort postings lists by doc ID so intersection is a linear merge, not a quadratic loop.

4.From inverted index to BM25 score

An inverted index just tells you which docs contain a term. BM25 builds on that with three extra numbers per match:

tf (term frequency): how many times the term appears in the doc. More mentions of the query term → likely more relevant.
df (document frequency): how many docs in the corpus contain the term. Common terms get downweighted via idf = log(N/df).
doc length: matches in a short, focused doc count for more than matches in a long, sprawling one (BM25 normalizes against avg doc length with parameter b).

All three of these are either already in the inverted index or computed from it. The postings list gives you tf per doc (just count entries for that doc) and df for the term (length of the postings list). Doc length is a separate small lookup table.

BM25(query, doc) = Σ idf(t) · (tf(t,doc) · (k₁+1)) / (tf(t,doc) + k₁ · (1 - b + b · |doc|/avgdl))
                  t ∈ query

This is exactly what DuckDB's FTS extension computes when you call match_bm25(...) against an indexed table. The inverted index does the heavy lifting (finding candidate docs); BM25 just plugs in the per-term and per-doc counts the index has already kept.