Be Right Back

Overview

Ok so with the data-foundation in place, we are now able to read iMessages, decode various Apple specific binary blobs, and snapshot the database. So the next thing is: how do we turn these potentially millions of messages (i'm at ~1M) into something an LLM can search and reason about?

The answer (before we get to any of the RAG / BM25 approach) is a data pipeline that sets up our database structure cleanly. I utilize Rust and Python in a bit of a convoluted manner, but the reason is a mixture of performance, Tauri, ergonomics, and benchmarking. The Rust more or less handles the OS layer (snapshots, metadata, contact rez) and Python handles the ML (chunking, embedding, indexing). They communicate over a single pipe: JSON lines on stdout. I'll discuss this in a lot more detail.

I've distilled this down to 6 stages:

#	Stage	Runner	What It Does
1	Snapshot	Rust	Copy iMessage DB + AddressBook, init metadata
2	Parsing	Rust	Count messages and contacts from snapshot
3	Chunking	Python	Sliding-window chunks with conversation awareness
4	Embedding	Python/MLX	Generate 768-dim vectors on Apple Silicon GPU
5	Saving	Python	Bulk write to DuckDB + create HNSW/FTS indexes
6	Analyzing	Python	UMAP projections + HDBSCAN clustering for Insights

Each stage reports independent 0-100% progress to the FE. There will be six progress bars that are updated in pseudo-realtime.

Stage 1: Snapshot

This stage is where we front-load as much expensive work as possible so that we can have a reasonably fast app startup for the future.

The snapshot creation bit was actually covered in the previous data foundation section. So recall that we're copying the chat.db, grabbing the WAL + shm atomically, grabbing the AddressBook, but then we also do some metadata collection as well. Here are some of the things that we do:

count messages and contacts
open/create brb.db (this is our sqlite metadata database)
record ingestion metadata - snapshot path, message count, date range
pre-compute contact stats - avg response times, date ranges per contact. we compute this once during ingestion so we can load them in $O(1)$
resolve display names - you'd be surprised but the join against the AddressBook to map phone numbers/emails to real names can be expensive.
brb.duckdb - duckdb creation with empty tables ready for embedding

Two Databases

We utilize two separate databases here. And if you're thinking it might be slightly redundant, you kind of have a point, but the access patterns are use cases are different enough that this is entirely justified. Yes, yes, I know / have read about sqlite-vector.

In my mind, it was justified splitting this out. I know it's colored with the bias that I wanted to experiment with DuckDB, but sqlite is used for my OLTP workflows while DuckDB is used for my OLAP workflows. DuckDB is columnar, hence better for analytical operations. This lets us utilize vectorized scans,the HNSW/FTS extension, and bulk inserts.

Our sqlite3 table (brb.db) has metadata, contacts, stats. It has fast ACID transactions, small footprint, and is great for key-value lookups. Our DuckDB table (brb.duckdb) is our main holder of all the vector embeddings with HNSW indexes, full-text search with BM25, and more. This is purpose built for the RAG search portion, which we'll discuss in detail in the next section.

Also note, I pin DuckDB to exactly 1.4.4 on both the Rust and the Python side. Version mismatches here cause the VSS (vector similarity search) extension to crash because the extension binary format isn't stable across versions. This absolutely ruined me for a bit, and was causing my app to recreate indexes across languages when corruption was basically detected.

Stage 2: Parsing

This is pretty straightforward but... obviously like I stated above, we're parsing some metadata from our messages into the sqlite database.

Python Pipeline Notes

So! These next few sections are the most interesting part. The chunking, embedding, and saving all run inside of a single Python subprocess. I want to give a bit more context about that portion. Rust spawns the Python subprocess and reads JSON progress lines from stdout. Basically:

let embedding_result = python::run_python_embeddings_cancellable(
    &snapshot_db,
    &addressbook_db,
    &output_db,
    ingestion_id,
    |stage, percent, message, stats| {
        progress_callback(stage, percent, message, stats);
    },
    &is_cancelled,
)?;

Communication Protocol

This was a design choice that is maybe a bit juvenile and I'm not convinced I'd do it again. Part of it was because of project / scope creep, and still trying to move quick.

The two processes talk over OS pipes. This is actually what my old company used to use at Ab Initio quite a bit for some of their graph protocol and DML components.

Security wise, pipes are relatively safe. They're kernel-mediated. This means that no other process on the system can read from or write to them. There's no networking exposure, no socket file on disk that another process could sniff or anything like that.

The tradeoff is a bit of programming overhead but welcome to the era of AI. For example, we could be kinda burned here if a Python dep prints a stray message to stdout or a segfault cuts a jsonl line in half. That's why on the Rust side, we attempt to handle it gracefully:

        match serde_json::from_str::<PythonMessage>(&line) {
            Ok(PythonMessage::StageProgress {
                stage,
                percent,
                message,
                stats,
                current_contact,
            }) => {
                let ingestion_stage = map_python_stage(&stage);
                let ingestion_stats = convert_stats(&stats, current_contact);
                progress_callback(ingestion_stage, percent, &message, ingestion_stats);
            }
            Ok(PythonMessage::Result { success, data, error }) => {
                if success {
                    final_result = serde_json::from_value(data).ok();
                } else {
                    error_message = error;
                }
            }
            Ok(PythonMessage::Log { level, message }) => match level.as_str() {
                "info" => info!(target: "python_embed", "{}", message),
                "warn" => warn!(target: "python_embed", "{}", message),
                "error" => error!(target: "python_embed", "{}", message),
                _ => debug!(target: "python_embed", "{}", message),
            },
            Err(_) => {
                // Not JSON, might be a raw log line
                debug!(target: "python_embed", "Non-JSON output: {}", line);
            }

Stage 3: Chunking

Raw messages are not useful to an embedding model. We do this for multiple reasons, but we are grouping messages into chunks using a sliding window. In production we use window_size=8, stride=4 (the function signature below shows the original library defaults — the production values get passed in from the caller). You'll note... that this is redundant by design. This is not that uncommon in RAG approaches. Messages at chunk boundaries intentionally appear in two groups so they have enough context.


def create_chunks(
    messages: list[dict],
    window_size: int = 5,
    stride: int = 3,
    min_chunk_chars: int = 10,
    contact_name: str | None = None,
) -> list[dict]:
    """Create conversation-aware, overlapping chunks from messages.

    Uses a sliding window with stride < window_size to create overlapping chunks,
    ensuring messages at chunk boundaries are seen together in at least one chunk.
    """
    chunks = []
    conversations = split_into_conversations(messages)

    for conversation in conversations:
        for i in range(0, len(conversation), stride):
            window = conversation[i : i + window_size]

Contact-Batched Processing

We do not load all the messages, create chunks, and embed everything. That causes memory to absolutely spike and tear through your vram. This was something I spiked into a good bit. We process in batches kinda like this:

# adaptive batch sizing
if pos < _HEAVY_CONTACT_COUNT:  # top 50 by message count
    size = _HEAVY_CONTACT_BATCH_SIZE   # 10 contacts
else:
    size = _NORMAL_CONTACT_BATCH_SIZE  # 50 contacts

Each batch follows the same sequence: read -> chunk -> embed -> write -> free. Memory cleanup happens between batches, garbage collection plus a DuckDB CHECKPOINT every 5 batches to flush the WAL and return some memory back to the OS. Peak RSS SHOULD stay around 200MB (still thick).

Stage 4: Embedding

Each chunk gets embedded into a 768-dimensional vector using mlx-community/nomicai-modernbert-embed-base-8bit. This is a sentence-level transformer model already pre-quantized for Apple Silicon (hence the mlx). mlx runs inference directly on the GPU (metal) which is the whole point of targeting apple os.

Why `nomicai-modernbert`?

There are a couple of reasons.

modern

This encoder model is a late 2024 redesign of the original BERT encoder. It has some of the more recent bells and whistles. Rotary embeddings, alternating local vs global attention, unpadded training, etc. Plus it's got an 8k context. You'll note that I actually truncate on 215 tokens because most iMessages are not that long, but it's nice to have the flexibility.

asymmetric retrieval fine-tuning

Nomic fine-tuned this puppy with the Nomic-embed recipe, which means that we literally have document vs query prefix names. We prefix search_document: or search_query: and as a result, the fine-tuned model learns that these are in slightly different sub-spaces of the embedding manifold. The prefixes help you slightly route into the right one.

8-bit quantized, mlx friendly

The mlx-community fork is 8-bit quantized and packaged for Apple's hardware. This means we get roughly ~2x the throughput of the fp16 version on an M-series GPU chip.

768 dim tradeoff

DuckDB's HNSW index cost scales with dimension count. At ~245K chunks (my current corpus), 768-dim embeddings are ~750MB of raw vectors. 1024 dim would be ~1GB, and the HNSW graph itself adds another multiple on top of that. At higher dims and larger corpora the working set blows past available RAM fast. I don't really think the recall difference is impactful beyond the 768 dim threshold.

Why Sentence Level Transformer?

This has become pretty standard for RAG approaches to utilize sentence level transformers.

A plain language model (gpt, opus, sonnet) outputs one hidden state per token. These hidden states DO carry semantic meaning — they have to, otherwise next-token prediction wouldn't work — BUT the meaning is encoded in the wrong shape. They're operating in a latent space optimized to be useful for the prediction head sitting at the end of the network. The geometry got sculpted for "what tokens help me guess the next token," not for "vectors of similar-meaning text should point in similar directions."

That distinction matters because the entire RAG setup is built on cosine similarity. We turn ~245K chunks into points in space, turn the query into a point in the same space, and ask "which chunk points are closest in direction to the query point?" For that to work, the space has to have a very specific property: semantic closeness has to correspond to directional closeness. That property is not free. It only exists if the model was trained to produce it.

And we still need a single vector per chunk, which means we have to pool the per-token hidden states somehow — mean-pool, take [CLS], take the last token, whatever. None of those pooling strategies were in the training objective of a chat model. The model was never told "the mean of your token vectors should be close to the mean of a semantically similar chunk." So even if we pool, the result lands in a space whose geometry was shaped for a different task. Vanilla BERT's [CLS] embedding is famously terrible at semantic similarity out of the box for exactly this reason.

Sentence transformers fix this directly. They're BERT-family encoders that were additionally fine-tuned with contrastive learning. The training loop looks roughly like: show the model two sentences that mean similar things, compute their embeddings, compute the cosine similarity, and define a loss that says "this number should be high." Show it two dissimilar sentences and the loss says "this number should be low." Run that over millions of pairs and the network's weights get adjusted until the output geometry literally has the property we need - direction equals meaning.

Practically, this shows up in three places:

One vector per chunk. The model takes the full chunk (up to 512 tokens) and emits a single 768-dim vector. No reducer, no "which token do I pick," no guessing.
Cosine is meaningful. Because of the contrastive training objective, two chunks about "dinner plans" land near each other in embedding space. That's the only reason hybrid search works.
Cheap enough to run on ~245K chunks. nomicai-modernbert-embed-base-8bit is ~150M params. Using a 9B chat model as an embedder (pool the last hidden state, last-token style, à la GritLM / E5-Mistral) would be ~60x the FLOPs per chunk for a marginal quality bump. Not worth it at this scale.

You can see the pooling logic in ml/embeddings.py:68 - I do mean_pooling then normalize_embeddings (L2) as a fallback when the model doesn't return text_embeds directly. This is the canonical sentence-transformer recipe.

Why sort?

This was truly one of the biggest perf wins I had - for drastically dropping the vram usage.

It's this line in my pipelines/embedding.py:390

chunks.sort(key=lambda c: len(c.get("text", "")))

To really understand this, we do need to spend some time talking about transformers. At a high level, transformers are a shit ton of matrix operations. Per batch, we have $N$ tensors of [batch_size, seq_len, hidden_dim]. That seq_len has to be a single number!! If they are a different number, then our code has to pad the shorter ones with a [PAD] token so that all the tensors have the same shape. The model then computes everything, and then at the end, uses the PAD tokens to zero out those contributions.

What this means in reality is that in terms of FLOPs, batches cost the max length of their longest tensor seq_len, applied to every tensor.

So when we don't sort, what happens is that we'll have some batch with a small chunk like "yeah" for example, and if the variance on the seq_len is too high, then we get burned by having the insert a ton of PAD tokens and do a ton of useless compute.

Your batch padding changes from potentially as bad as your variance across your whole dataset, to just the variance within a single batch.

In addition! another super fun and interesting win is the vram impact. At a conceptual level, this of course makes sense that it'd be faster, but it's also a nice memory win.

VRAM during the forward pass is dominated by activations. Activations are basically the intermediate tensors each layer produces and keeps around. They scale like: batch_size * seq_len * hidden_dim * num_layers, and attention layers add a quadratic term on top: batch_size * seq_len**2 * num_heads. So without sorting, every batch has the chance to contain those outlier tokens, and the whole batch pays the quadratic cost of the longest sequence in it.

The natural follow-up here is dynamic batch size by token budget — instead of a fixed batch_size=128, you'd target a constant token count per batch (e.g. batch_size * seq_len ≈ 16K), so short-sequence batches pack more rows. I haven't implemented this yet; sorting alone got the win I needed, and dynamic batching is sitting on the follow-up list.

Stage 5: Saving

Saving can basically be distilled down into "writing pertinent data to the database". That is inclusive of our DuckDB and sqlite3 dbs. There are a couple of foundational decisions here that give us a solid win.

Foundation 1: Parquet as an middle ground

Hilarious, but it just keeps coming up as a win. I've talked about Parquet file format here on my blog. So I won't discuss it much further. Feel free to download the Walk in the Parquet MacOS app.

However, one big win I found for actually writing was to create temporary parquet file formats for loading. This bit of code:

def _write_parquet_to_duckdb(
    conn: duckdb.DuckDBPyConnection, table: pa.Table, target_table: str
) -> int:
    with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as f:
        parquet_path = f.name
        pq.write_table(table, parquet_path)
    try:
        conn.execute(f"COPY {target_table} FROM '{parquet_path}'")
        return table.num_rows
    finally:
        Path(parquet_path).unlink(missing_ok=True)

gave us a considerable speedup when writing.

I benchmarked this across four strategies on a 10K message sample on my lovely M-series Mac with MLX on GPU. Here's the comparison at batch_size=128:

strategy	total	embed	write	chunks/sec
`parquet_buffered`	9.12 s	8.96 s	0.124 s	365.7
`parquet_bulk`	13.53 s	12.73 s	0.762 s	246.3
`pandas_insert`	16.79 s	14.55 s	2.197 s	198.6
`direct_executemany`	16.88 s	9.94 s	6.902 s	197.6

Let me further explain what each method is doing:

direct_executemany
- the simplest approach
- we're building a list of Python tuples
- then we call conn.executemany("INSERT INTO message_chunks VALUES (?, ?, ?, ...)", rows)
- DuckDB binds parameters and executes the insert row-by-row
- There is no intermediate file, no Arrow buffer - just Python objects straight to SQL
- Crucially, every 768-float embedding gets shipped through the parameter binding path one row at a time.
pandas_insert
- Here we're building a pandas.DataFrame from the same rows
- then we run INSERT INTO message_chunks select * from dm
- DuckDB's Python integration sees df in the calling scope and reads it as a virtual table via Arrow under the hood
- The df buys us a bit but we still pay for the Python dict -> df -> Arrow conversion on the write
parquet_bulk
- we skip pandas entirely! it's not always the best, like so many people say online
- here's what we do:
  - Build a pyarrow.Table directly (with explicit Arrow types — pa.list_(pa.float32()) for embeddings, pa.timestamp("us", tz="UTC") for timestamps, etc.)
  - write it to a temp .parquet file
  - COPY message_chunks FROM '/tmp/xxx.parquet'
- DuckDB's Parquet reader is a vectorized columnar scanner that reads straight into its execution engine
- one Parquet write per call to write_chunks(), which means once per embed batch.
parquet_buffered
- Same machinery as parquet_bulk, but the strategy accumulates chunks in memory and only flushes a Parquet file when the buffer hits 5K rows (or the run ends)
- At batch_size=128, this is roughly one Parquet write per ~40 batches instead of one per batch.

A few things that are helpful to note:

parquet_buffered writes are ~55x faster than direct_executemany at the same batch size
- i.e. 0.124 s vs 6.9 s
- Going through DuckDB's SQL parser and parameter binding per row is brutal once each row carries a 768-element float array
- executemany doesn't actually batch on the wire
- it's a Python convenience that loops execute() under the hood, so the engine sees N independent statements, each re-validating types and re-allocating the embedding list.
pandas_insert is ~18x slower on the write step.
- Pandas pushes into DuckDB via Arrow under the hood, but the Python object -> DataFrame -> Arrow round-trip eats most of the win
- Building the PyArrow table directly skips that.
Writes are 1.4% of total time on the parquet path, vs 40.9% on the executemany path.
- Once writes are essentially free, the bottleneck shifts back to the GPU (embedding), which is exactly where you want it.
- The embedder stays fed instead of stalling on the DB.

So writing isn't free, but on the parquet path it's close enough that we can stop thinking about it. The whole point of the temp-Parquet dance is to push the bottleneck back onto the GPU where it belongs. This is way more trivial and a smaller example, but there's (unsurprisingly) a great Youtube video on GPU optimization from Jane Street here:

Foundation 2: Indexes are built after

This one is probably a bit more obvious.

We're building HNSW and FTS indexes. These are used for RAG and vector search. More or less, HNSW maintains a graph between vectors, and FTS maintains an inverted index from terms to doc_ids.

We do this at the end because of the cost of indexing. HNSW has to recompute and rebalance the graph per insert. FTS for DuckDB (based on my understanding) does not have a way to recompute on the fly, it's a static index.

These are expensive indexes and doing them constantly takes compute and time. We're trying to be mindful of those resources. The punchline: building the index once at the end (a bulk construction over the final set of vectors) is fundamentally cheaper than maintaining it incrementally across N inserts, because each incremental insert pays for graph rebalancing (HNSW) or index rewrite (FTS) on top of the actual write.

Foundation 3: Two indexes (HNSW and FTS)

We're going to address this a lot more in the next section, but we want both because they fail in opposite ways.

HNSW finds semantic matches "you wanna get dinner?" potentially linking to "meals"
- if you search for someone's exact phone number though, HNSW might miss it because that individual token doesn't mean shit to it
FTS finds the exact word match

We're using a classic approach for handling these hybrid search results with Reciprocal Rank Fusion. I'll discuss this more in the next section as well.

Stage 6: Analyzing

Alright! So this is getting into the portion of how we use the embeddings we just generated and save.

A very important distinction here is that the output of the pipeline above is not for search. It's mainly for the Insights page, which shows a 2D/3D map of your conversational landscape with clusters.

We can decompose this into a couple of foundations as well.

Foundation 1: Dimensionality Jumps (768 -> 50 -> 2)

So we basically get our dimension into something our feeble human minds can better visualize by three separate technics.

PCA: 768 -> 50 dims through PCA. Hover for more info.
UMAP: 50 -> 2 as well as 50 -> 3 for the 3D view
HDBSCAN: clustering in 2D

A reasonable question: why three steps? Why not just UMAP straight from 768 to 2, or cluster directly in 768D? Two reasons.

The curse of dimensionality breaks density. In high dimensions, the ratio of (distance to nearest neighbor) / (distance to farthest neighbor) approaches 1 — every point becomes roughly equidistant from every other point. Density-based clustering relies on "this region is denser than that region," which is meaningless when distances don't separate. So we have to reduce dimensions before HDBSCAN sees anything. This is the formal Beyer et al. (1999) result, "When Is Nearest Neighbor Meaningful?"

PCA is cheap, UMAP is expensive. PCA is linear — pure eigendecomposition, runs in seconds on ~1M × 768. UMAP is nonlinear and iterative — minutes to hours. Doing UMAP on the full 768 dims wastes compute on dimensions PCA could have stripped first. So PCA cheaply trims 768 → 50 (keeping ~80–90% of explained variance), then UMAP does the expensive nonlinear projection on the surviving 50D, then HDBSCAN clusters on the 2D output. Each step uses the right tool for what's expensive about it.

I'll discuss these a bit more below, with a focus on UMAP and HDBSCAN a bit more below.

PCA - Principal Component Analysis

Again, I would encourage you check out docs here. I'm assuming if you've gotten this far in the blog you're at least pretty familiar with this one given it's a pretty core mechanism of linear alg. TLDR, PCA finds the direction your data varies in the most and then lets your reduce the dimensionality by only keeping those top parts. 50 dimensions keep the vast majority of the directions.

It's technically the eigendecomposition of the covariance matrix. I spent awhile reading The Little Book of Linear Algebra, which you can check out here for PCA.

You:

compute the covariance matrix
compute the eigenvec adn eigenvalue
sort by eigenvalue
top $k$ are your first $k$ components
project existing data onto those $k$ components

UMAP - Uniform Manifold Approximation and Projection (for Dimension Reduction)

I sadly haven't understood this one to the degree that I wish yet. You should check out their [main website][umap-main]. I am not familiar with Riemannian manifolds which is one of the core assumptions for the modeling.

At a high level, UMAP finds a low dimensional 2D or 3D layout for points so that points that were nearby in the high dimensional space stay close, and points that were far away, stay far away.

The mechanism is as follows:

build a graph that captures the high dimensional structure
1. for every point, we find $k$ nearest neighbors
2. we connect them with weighted edges
3. weighting is calibrated locally (this is what gives it the uniform bit)
  1. It's basically asking: how close is this neighbor relative to the typical neighbor distance for this point
  2. yes, i know this is kind of handwavy
4. you output this weighted graph where edges represent points on the higher dimension manifold curve
build a matching graph in the lower dimension
1. this typically uses a spectral embedding to generate a 2D point cloud
2. we optimize so that the 2d edge weights largely match the higher dimensional graph, normally have the objective be a cross entropy loss
3. two main controls - n_neighbors and min_dist

My first instinct was to use t-SNE which is apparently dumb and boomer like. UMAP is apparently what the kids are using these days because it preserves more of the global structure (t-SNE is great at local neighborhoods but tends to scatter the macro layout), scales better empirically to larger point clouds, and is more deterministic across runs given the same seed and parameters.

If you want to go deeper, Leland McInnes (UMAP's author) gave a great talk at SciPy 2018 that explains the intuition way better than I can:

HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise

And you thought UMAP was a mouthful eh?

HDBSCAN finds clusters by looking for regions of high point density separated by regions of low point density.

I intuitively thought "oh what about k-means" like a classic rookie, BUT that obviously doesn't work, because we don't know $k$ . And also I want some control around noise where chunks aren't clustered cleanly.

Another good way to think about it - from Claude distilling it down - is that we can imagine the 2d/3d point cloud as a topographic map. If points are dense, then it's kinda high elevation. A "cluster" is basically a peak - it's a connected region above some density threshold. Noise is everything in the valleys.

There is an obvious balance here, hence why Be Right Back let's you pick and adjust these thresholds and parameters.

John Healy (one of HDBSCAN's authors) has a great walkthrough of the how and the why here:

Ok so the mechanics:

compute mutual reachability distance
1. for each pair of points, we find a modified distance that says "how hard is it to walk from point A to point B without leaving dense regions"
2. this inherently penalizes points in sparse regions
3. so basically mutual_reachability(A, B) = max(core_dist(A), core_dist(B), dist(A, B))
build a MST
1. we build this graph with mutual reachability distances as edge weights
2. MST is built using the smallest possible edge weight
build a cluster hierarchy
1. This is really just building a dendogram. See the note below this is a BIT similar to HAC
pick stable clusters
1. this is the magic sauce
2. we compute a stability score for each candidate cluster
  1. basically, how much of the density-threshold-range did this cluster survive
3. and then we pick the set of clusters that maximize total stability
(kinda cleanup task) label noise clusters
1. the ones that did not into any clusters get labelled with a -1 so we know to ignore

This final step is how we get $k$ automatically. $k$ is whatever number of stable clusters exist in the data.

Note

There is a ton of really cool overlap and connections from previous projects. For example, there is obviously some overlap between the first part of HDBSCAN and hierarchical agglomerative clustering (which I talk a lot about here). A really cool take home for Arena. Both produce a dendogram as a first step. Both use single-linkage (or HDBSCAN does and obviously HAC can use different linkages). HDBSCAN walks the whole dendogram and asks every cluster "how stable is this cluster across denisty thresholds"

As far as I can understand the two main differences are: 1) HDBSCAN picks clusters from different levels of the dendrogram (HAC cannot), and 2) HDBSCAN edge weights use mutual reachability distance instead of raw pairwise distance — that's the key conceptual difference, and it's independent of which underlying distance function you start from (Euclidean, Levenshtein, whatever). You could in principle rerun HAC with mutual reachability and it'd still be HAC.

Foundation 2: Why HDBSCAN, not k-means

K-means is the default clustering algorithm for most people, but it's wrong for this task in two specific ways:

K-means requires you to pre-specify K. You don't know how many topics are in someone's iMessage history. 5? 50? 500? The algorithm forces you to guess. HDBSCAN figures it out from the data via a min_cluster_size parameter, which is a much more meaningful knob ("clusters smaller than this aren't real topics, they're noise").
K-means assumes round, equally-sized clusters. It minimizes within-cluster variance under a Euclidean assumption, which means every point gets assigned to some cluster, even outliers. iMessage data has natural noise — one-off conversations that don't belong to any topic. K-means would distort real clusters by pulling outliers in. HDBSCAN labels them as -1 (noise) explicitly, which is what you want.

The "H" in HDBSCAN is for Hierarchical — it builds a cluster tree at varying density thresholds and picks the most stable level automatically. The "DB" is for Density-Based — it defines a cluster as a region of high point density separated by regions of low density. Together: an algorithm that finds variable-shape clusters, handles noise gracefully, and picks K for you.

The code also has some interesting post-processing: merge_similar_clusters (combine clusters that are semantically close) and reassign_noise_points (give noise points their nearest cluster if it's close enough). These are practical compromises — HDBSCAN over-fragments and over-flags noise on real data, so we reel it back in.

Foundation 3: Why the clustering is in 2D instead of 50D

This is a subtle one and it's worth getting right because it sounds wrong at first.

The "pure" approach would be: cluster in 50D PCA space (because density estimation is more meaningful there than in 2D), then project the cluster labels onto the 2D UMAP for visualization. This is what some pipelines do.

But it has a problem: UMAP doesn't preserve global structure perfectly. Two points that are close in 50D might be far apart in the 2D UMAP (and vice versa). If you cluster in 50D and visualize in 2D, you'll see clusters that look incoherent — points labeled the same color but spread across the 2D map.

By clustering in 2D directly, the cluster labels match the visual structure the user sees on the Insights page. Every visible "blob" gets a label; every "hole" is noise. The visualization is consistent with the clustering, not just decorated by it.

The tradeoff: you're trusting UMAP's 2D layout to reflect real structure. UMAP is well-validated for this — it preserves local neighborhood structure by design — but it can produce occasional artifacts (spurious clusters from layout noise). For a UI that's communicating "here are your conversational themes," visual coherence is more important than mathematical purity.

Foundation 4: The LLM labeling step

The clustering produces numeric cluster IDs (0, 1, 2, ...) but the UI shows human-readable labels like "Travel Plans" or "Work Projects". Where do those come from?

The pipeline takes a sample of chunks from each cluster, sends them to a local LLM with a prompt like "what topic do these messages share?", and uses the LLM's response as the cluster label. This is a small but interesting design choice:

It's a one-shot operation — runs once per cluster, ~12 clusters typically, so ~12 LLM calls
It's the slowest part of the analytics pipeline (the progress bar weights labeling at ~25% of total time)
It happens after clustering, so the LLM doesn't influence which chunks go in which cluster — only what the cluster is named

This separation matters: the what of clustering (which chunks go together) is determined by the embedding geometry, which is reproducible and deterministic. The naming is determined by the LLM, which is fuzzy and nondeterministic. If the LLM hallucinates a bad name, the clustering itself is still correct.

Conclusion

And thus! Concludes our main pipeline. We start with the raw chat.db and end with two distinct artifacts: a DuckDB full of 768-dim chunks indexed by HNSW and FTS, and 2D cluster map with LLM-generated labels. There are two pretty different down stream consumers.

The first artifact is what feeds Casper, which I'll talk about in the next portion. This has some fun about query planning, RRF, MMR, cross-encoder re-ranking.

But the main emphasis from this portion, is that there are some huge wins with relatively mundane and simple tricks. Sorting chunks before batching, write through Parquet, build indexes once at the end instead of incrementally, etc. But again, this is what makes 1M messages searchable within minutes on your laptop.

Ingestion Pipeline - From Raw to Embedded