./how-its-built/vector-search-rag
[04]

Vector Search & RAG

How semantic search and retrieval-augmented generation power intelligent conversations.

Overview

The ingestion pipeline left us with our indexed chunks in DuckDB. For me personally, ~250K chunks. That's dope, but no one cares if we can't use them.

Hence, RAG - basically the bridge between raw vector search and a helpful conversational answer.

I'm not going to go through vector search in a ton of detail. A simple one liner is: in a high dimensional space, when we model tokens, fragments, or sentences as vectors then directionality between these vectors carries semantic meaning. For example, you can do things like:

vkingvman+vwomanvqueen v_\text{king} - v_\text{man} + v_\text{woman} \approx v_\text{queen}

In all my reading / research over the past like 5 years, there's been no clearer visualization than a 3B1B video here:

If you play the video above, it should start from the pertinent part.

Hybrid Search: Semantic + BM25 + Keyword

Ok, this for sure has room for lots of improvement. However, our approach runs three strategies in parallel and merges the results:

  1. semantic search - HNSW index lookup, cosine similarity, overfetches 3x the requested limit (so we can dedup)
  2. BM25 full text search - DuckDB's FTS extension with English stemming and stopword removal. This helps us find exact keyword matchces that embedding similarity might rank lower on
  3. keyword search - your vanilla ilike approach against chunk text. tries to catch things that the BM25 stemmer might disregard with normalization

Reciprocal Rank Fusion (RRF)

We'll unpack BM25 in detail below — for now, just treat it as a keyword-relevance score over an inverted index.

RRF is the algorithm that merges ranked lists from different retrieval systems without needing to normalize their scores. This is basically because we do not want to compare apples to oranges. Basically, this helps us handle:

  • cosine similarity from vector search
  • BM25 scores
  • binary ilike search

For each document dd, RRF sums its reciprocal rank across every ranker rr in the set RR:

RRF(d)=rR1k+rankr(d)\text{RRF}(d) = \sum_{r \in R} \frac{1}{k + \text{rank}_r(d)}

where:

  • dd is a document (chunk)
  • RR is the set of rankers (semantic, BM25, ilike)
  • rankr(d)\text{rank}_r(d) is dd's 1-indexed position in ranker rr (or \infty if rr didn't return it)
  • kk is a smoothing constant (we use k=60k = 60 which is OG)

A real example from my own data

Ok an example might be more helpful. This is using my real data, so here's free exposure to my text messages. Names of contacts have been excluded. The C is just for contact (although they are coming from diff contacts).

I ran thanksgiving turkey against my actual DuckDB and analyzed the candidate chunks.

chunk_id conversation date chunk text sem bm25 ilike
175976 Me <> C 2019-11-28 Me: Happy thanksgiving!! I am so thankful for you and appreciate you so much as a person. Love you tons hope you have a good turkey day / Aunt: Let us know when we can face time. Love you so and I am so thankful for you. 4 5 12
181896 Me <> C 2020-11-23 C: Wait actually my sister went out last week before coming to cincy and when she got to my parents apartment she tested positive for covid so everyone is on high alert since I can't do thanksgiving with them now lol. So maybe let's hit like Friday if that's okay so all the thanksgiving stuff is done / Me: Ahh yeah that checks out haha / Me: Maybe Sunday bc I hve my turkey day stuff Saturday? 19 11 5
106915 Me <> C 2024-11-28 C: I'm sorry that is annoying / Me: How are you doing? What are you making for thanksgiving / Me: Also can you tell me what your mom makes every year again that you love so much / C: Turkey is in the oven! / C: And we are making the stuffing and soup now / C: And mashed potatoes and green beans are up next 24 1
172415 Me <> C 2024-05-05 C: You cooked an entire thanksgiving meal / Me: Haha yes I am very pleased 2

Some notes:

  • 175976 and 181896 literally contain both "thanksgiving" and "turkey" as tokens. That's why BM25 and ilike both rank them highly. Surrounding context is also very thanksgiving turkey related so semantic search aligns.
  • 106915 doesn't show up in semantic top-15. The chunk is a bit more about "what are you making" and "stuffing/soup/mashed potatoes," which embed closer to a generic cooking cluster than to my thanksgiving + turkey query. But it's ilike rank 1 because every keyword token is in the chunk text.
  • 172415 is a tiny 2-message chunk. Semantically dead-on ("cooked an entire thanksgiving meal"), but only 1 of the 2 query tokens is present so ilike rejects it, and BM25 — which is length-normalized — rates it less competitively against richer chunks.

Plugging each rank into 1k+rank\frac{1}{k + \text{rank}} with k=60k = 60 (so the denominator is just 60+rank60 + \text{rank}), and summing across the three rankers (absent → 0):

RRF(175976)=160+4+160+5+160+12=164+165+1720.0449RRF(181896)=160+19+160+11+160+5=179+171+1650.0421RRF(106915)=0+160+24+160+1=184+1610.0283RRF(172415)=160+2+0+0=1620.0161\begin{aligned} \text{RRF}(175976) &= \tfrac{1}{60+4} + \tfrac{1}{60+5} + \tfrac{1}{60+12} &&= \tfrac{1}{64} + \tfrac{1}{65} + \tfrac{1}{72} &&\approx 0.0449 \\ \text{RRF}(181896) &= \tfrac{1}{60+19} + \tfrac{1}{60+11} + \tfrac{1}{60+5} &&= \tfrac{1}{79} + \tfrac{1}{71} + \tfrac{1}{65} &&\approx 0.0421 \\ \text{RRF}(106915) &= 0 + \tfrac{1}{60+24} + \tfrac{1}{60+1} &&= \tfrac{1}{84} + \tfrac{1}{61} &&\approx 0.0283 \\ \text{RRF}(172415) &= \tfrac{1}{60+2} + 0 + 0 &&= \tfrac{1}{62} &&\approx 0.0161 \\ \end{aligned}

Pure-RRF order: 175976 > 181896 > 106915 > 172415.

One other note, in our production code, we have our rrf_merge multiply each chunk's RRF score by a recency factor of 11+0.01age_days\tfrac{1}{1 + 0.01 \cdot \text{age\_days}}. This is a deliberate product decision, but people (mainly myself and my partner) found ourselves wanting to find information that arrived more recently, than just vanilla RRF results.

Full-Text Search with DuckDB

One callout I want to make because I don't think it's really clear is that full-text search (FTS) and best-match 25 (BM25) are two different layers. They're connected, but technically different. The best article to really read is from Anthropic (shocker) here: Introducing Contextual Retrieval.

  • FTS is the whole pipeline that turns raw text columns into a searchable structure: tokenize the text into words, run them through a stemmer (i.e. truncate running to run), drop stopwords, build an inverted index (i.e. word to list of documents that contain it), and then provide a query interface on top.
  • BM25 is really JUST a scoring function - a formula for ranking how relevant a document is to a query, given the term frequencies in the inverted index that FTS built.

You can have FTS with a different scoring function OR you can also implement BM25 against any inverted index. They just commonly ship together because BM25 is what everyone wants once they have an inverted index.

DuckDB's FTS extension gives us both: the indexing pipeline and a match_bm25() function as the default scorer.

The FTS index

You'll note that DuckDB does not ship FTS in the core. It's an extension that you have to install and load. Then you create the index on the column you want to search:

INSTALL fts;
LOAD fts;

PRAGMA create_fts_index(
    'message_chunks',     -- table to index
    'chunk_id',           -- primary key column
    'formatted_text',     -- column whose text we want to search
    stemmer='english',    -- "running" → "run"
    stopwords='english',  -- drop "the", "and", "of", ...
    overwrite=1           -- recreate if it already exists
);

This is the literal PRAGMA we run during ingestion — same call on both the Python side (mlx/src/data/storage.py, when we finalize the ingestion run) and the Rust side (crates/ingestion/src/queries.rs::CREATE_FTS_INDEX, which acts as a safety net on app startup if the index is somehow missing).

That PRAGMA is doing all the heavy lifting of "make this column searchable": tokenizing, stemming with the English Snowball stemmer, applying the English stopword list, and building an inverted posting list.

You'll note after doing this that a separate schema is created (fts_main_message_chunks). This holds the index and a match_bm25() function you call from queries. An actual snippet from my code:

    let sql = format!(
        r#"
        SELECT
            chunk_id,
            contact_id,
            contact_name,
            chunk_data,
            message_ids::VARCHAR AS message_ids,
            start_timestamp,
            end_timestamp,
            fts_main_message_chunks.match_bm25(chunk_id, '{sanitized_query}') AS bm25_score{embedding_col}
        FROM message_chunks
        WHERE fts_main_message_chunks.match_bm25(chunk_id, '{sanitized_query}') IS NOT NULL
            {contact_filter}
            {min_date_filter}
            {max_date_filter}
        ORDER BY bm25_score DESC
        LIMIT {limit}
        "#,
        sanitized_query = sanitized_query,
        embedding_col = embedding_col,
        contact_filter = contact_filter,
        min_date_filter = min_date_filter,
        max_date_filter = max_date_filter,
        limit = limit,
    );

A couple of gotchas worth flagging on the indexing side:

  • The index is dropped and rebuilt on every full ingestion. We treat ingestion as a full rebuild rather than incremental — clear_chunks() drops both HNSW and FTS before inserting fresh.
  • DuckDB version pinning matters. Python and Rust must agree on the DuckDB version exactly, or the FTS extension binary won't load the index the other side wrote. We pin to 1.4.4 on both.

BM25, the scoring function

Once the inverted index exists, BM25 ("Best Matching 25") is the formula that takes a query and ranks documents. It's been the default in Lucene/Elasticsearch/Postgres FTS for decades, and embeddings haven't replaced it — they just sit next to it.

The intuition is three terms multiplied together:

BM25(d,q)=tqIDF(t)rarityf(t,d)(k1+1)f(t,d)+k1(1b+bdavgd)normalized term frequency\text{BM25}(d, q) = \sum_{t \in q} \underbrace{\text{IDF}(t)}_{\text{rarity}} \cdot \underbrace{\frac{f(t, d) \cdot (k_1 + 1)}{f(t, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\text{avg}|d|}\right)}}_{\text{normalized term frequency}}

where:

  • dd is a document (one of our chunks)
  • qq is the query and tt is a term inside it
  • f(t,d)f(t, d) is the term frequency - how many times term tt appears in document dd
  • d|d| is the length of dd in tokens; avgd\text{avg}|d| is the average across the corpus
  • k1k_1 controls term-frequency saturation (DuckDB default: 1.21.2)
  • bb controls how aggressively long documents are penalized (DuckDB default: 0.750.75)
  • IDF(t)\text{IDF}(t) is the inverse document frequency of tt - a rarity weight, formally log(Nn(t)+0.5n(t)+0.5+1)\log \left( \tfrac{N - n(t) + 0.5}{n(t) + 0.5} + 1 \right), where NN is the total number of documents and n(t)n(t) is how many of them contain tt. Rare terms get bigger weights; ubiquitous terms get near-zero weights.

So squinting at the formula, you can read it as:

  • Rare terms are worth more (IDF\text{IDF}) - matching on "thanksgiving" is more informative than matching on "the".
  • Terms that show up a lot in a document are worth more, but with diminishing returns (the saturating f(t,d)f(t, d) in the numerator) — the 10th occurrence of "turkey" in a chunk doesn't move the needle the way the 1st did.
  • Long documents get penalized (d/avgd|d| / \text{avg}|d| length normalization) - so a 500-message chunk doesn't dominate just by having more words.

That last term is exactly why chunk 172415 in the RRF example above — the tiny 2-message chunk — got rated less competitively by BM25 than the longer, denser thanksgiving chunks. Length normalization works against very short documents.

DuckDB exposes BM25 as a function you call against the FTS index:

SELECT
    chunk_id,
    formatted_text,
    fts_main_message_chunks.match_bm25(chunk_id, 'thanksgiving turkey') AS bm25_score
FROM message_chunks
WHERE fts_main_message_chunks.match_bm25(chunk_id, 'thanksgiving turkey') IS NOT NULL
ORDER BY bm25_score DESC
LIMIT 30

One gotcha here: match_bm25() doesn't support bind parameters. You have to interpolate the query text as a string literal, which means sanitizing single quotes yourself. Not ideal, but well-trodden territory.

Our own stopword pass

DuckDB's stopwords='english' handles the classic list at index time. But we also pre-strip an additional list of tokens from the query before it ever hits match_bm25():

pub static STOPWORDS: LazyLock<HashSet<&'static str>> = LazyLock::new(|| {
    HashSet::from([
        "the", "a", "an", "is", "was", "were", "been", "be", "have", "has", "had",
        "do", "does", "did", "will", "would", "could", "should", "may", "might",
        // ...
        "i", "me", "you", "we", "they", "he", "she",
    ])
});

Two reasons for the second pass:

  1. Defensive overlap. DuckDB's list is fine, but it gets applied at index time; we want symmetric filtering on the query side too. Pre-filtering means the query that hits match_bm25() is already clean and stable across DuckDB version upgrades.
  2. iMessage-specific noise. Pronouns like "i", "me", "you", "we" aren't in the standard English stopword list (they're not "function words" in the linguistic sense), but in iMessage data they're in basically every chunk — they carry zero signal for search. Same for words like "thing", "stuff", "place".

There's also a guard: if all tokens get stripped (e.g., someone searches "what is the thing"), we fall back to the original query rather than sending an empty string to BM25. Returning nothing is worse than returning fuzzy results.

There's more on DuckDB FTS here DLightweight Text Analytics Workflows with DuckDBduckdb.org and here.

Handling Chunk Overlap

Right, so you'll hopefully recall from ingestion pipeline that we have chunking with some overlap (window = 8, stride = 4). So this means that if a search finds both chunks, then we'd show the message (or at least some portion of the chunk multiple times). We handle this with two strategies that target different kinds of duplication: message overlap filtering catches literal duplication (chunks that share raw message IDs), and MMR catches semantic near-duplicates (chunks that say roughly the same thing even without sharing messages). In practice these run as alternative paths — MMR is the production default when chunk embeddings are available, and message overlap filtering is the lightweight fallback when they aren't (e.g. BM25/keyword-only results).

Message Overlap Filtering

We just utilize and run a check based on how many new message IDs each candidate adds. A chunk that shares 4/8 messages with an already selected chunk has exactly 50% novelty. A chunk sharing 5+ messages gets filtered.

Basically from our source code:

fn deduplicate_chunks(chunks: Vec<SearchResult>, max_results: usize, min_similarity: f32) ->
Vec<SearchResult> {
    let mut seen_messages: HashSet<i64> = HashSet::new();
    let mut results = Vec::new();

    for chunk in chunks {
        if chunk.similarity < min_similarity {
            continue;
        }

        let new_count = chunk
            .message_ids
            .iter()
            .filter(|id| !seen_messages.contains(id))
            .count();

        let novelty_ratio = if chunk.message_ids.is_empty() {
            0.0
        } else {
            new_count as f32 / chunk.message_ids.len() as f32
        };

        if novelty_ratio >= 0.5 || results.is_empty() {
            seen_messages.extend(&chunk.message_ids);
            results.push(chunk);

            if results.len() >= max_results {
                break;
            }
        }
    }

    results
}

Maximal Marginal Relevance (MMR)

This is a more classic approach, but we use maximal marginal relevance to balance relevance against diversity. This can be mathematically shown as:

mmr score=λrelevance(1λ)max similarity to selected\text{mmr score} = \lambda \cdot \text{relevance} - (1 - \lambda) \cdot \text{max similarity to selected}

The λ\lambda controls the trade off of diversity. λ=1\lambda = 1 is pure relevance and λ=0\lambda = 0 is pure diversity.

We pick λ=0.7\lambda = 0.7. We weight 70% towards relevance to the query and 30% towards being different from the results that we already have. This is a greedy algo: at each step, we pick the candidate with the highest MMR, add to our set, and repeat.

Contact Diversity

This is maybe a bit of a 🌶️ product take perhaps, but most likely there are people that you text very frequently and they dominate the distribution of your iMessaging contact. We have a slight penalty after the second result from the same contact so that we're not over indexing always on one persona. If someone's convo is genuinely the most relevant, it'll still make it through.

Cross-Encoder Reranking

Everything up to this point has used a bi-encoder. ModernBERT encodes the query once, encodes each chunk once (at ingestion time, whenever that is) and then we compare with your vanilla cosine similarity. The query and the chunk never actually see each other inside the model. They land in that R768\mathbb{R}^{768} vec independently, and then just dot product for cosine sim.

What's great about this is... it's fast! We precompute on ingestion time, build a HNSW index, query at time you need to embed the query, and then do approx nearest neighbor lookup. I.e. it's O(logn)O(\log n) approximately on the chunk side.

The cross encoder is different - and heavier. It concatenates [CLS] query [SEP] chunk [SEP] and runs the pairs through the transformer together, so attention can model the word-to-word interactions between query and chunk. This is way better - but also! More expensive. Running on pairs at query time is a thicc boi, and we don't have anything precomputed (what - we're gonna do precomputation of every pair between a user gen'd query and chunk somehow?).

So this hybrid approach is what we do to limit the top 30 candidates that hybrid search + RRF + MMR already cut down.

A decent clanker generated diagram can be found here:

The model

We use cross-encoder/ms-marco-MiniLM-L-6-v2 — 22M params, 6 transformer layers, trained on the MS MARCO passage ranking dataset. On Apple Silicon, scoring 30 candidates lands in the ~50–150ms range once the model is loaded. Cold start (first call, model has to load into memory) is ~2s.

We pull this via sentence-transformers rather than running it through MLX, mostly because the model is tiny enough that the CPU overhead doesn't matter and the CrossEncoder API is one line:

from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [[query, chunk_text] for chunk_text in passages]
scores = model.predict(pairs)  # raw logits, not normalized to [0, 1]

You'll note these are logits... So absolute values aren't really comparable to cosine similarities. So we just overwrite the result.similarity = score and resort. I think that's generally best practice / fine given we don't have anything else besides showcasing this to the user after this.

Over-fetching for the reranker

There's a small trick in hybrid_search that's worth pointing out. When rerank = true, we don't dedup down to the user's requested limit — we keep more candidates around:

let dedup_limit = if options.rerank {
    // Over-fetch for reranking: get more candidates, let cross-encoder pick the best
    (options.limit * 3).max(30)
} else {
    options.limit
};

The reasoning: dedup (MMR + message-overlap + contact diversity) is good but coarse. It's optimizing for "don't show the user the same chunk twice", not "is this chunk actually the most relevant". The cross-encoder is the thing that's actually good at the second question, so we want to give it a bigger pool to choose from. After reranking, we truncate to options.limit.

Latency Implications

I default on the rerank portion, but in SearchOptions you can also disable it. The reranking portion with our document size is only going to take 50-150ms instead of the LLM portion which is probably on the order of seconds.

Vector Search & RAG - Be Right Back