Overview
Ok so with the data-foundation in place, we are now able to read iMessages, decode various Apple specific binary blobs, and snapshot the database. So the next thing is: how do we turn these potentially millions of messages (i'm at ~1M) into something an LLM can search and reason about?
The answer (before we get to any of the RAG / BM25 approach) is a data pipeline that sets up our database structure cleanly. I utilize Rust and Python in a bit of a convoluted manner, but the reason is a mixture of performance, Tauri, ergonomics, and benchmarking. The Rust more or less handles the OS layer (snapshots, metadata, contact rez) and Python handles the ML (chunking, embedding, indexing). They communicate over a single pipe: JSON lines on stdout. I'll discuss this in a lot more detail.
I've distilled this down to 6 stages:
| # | Stage | Runner | What It Does |
|---|---|---|---|
| 1 | Snapshot | Rust | Copy iMessage DB + AddressBook, init metadata |
| 2 | Parsing | Rust | Count messages and contacts from snapshot |
| 3 | Chunking | Python | Sliding-window chunks with conversation awareness |
| 4 | Embedding | Python/MLX | Generate 768-dim vectors on Apple Silicon GPU |
| 5 | Saving | Python | Bulk write to DuckDB + create HNSW/FTS indexes |
| 6 | Analyzing | Python | UMAP projections + HDBSCAN clustering for Insights |
Each stage reports independent 0-100% progress to the FE. There will be six progress bars that are updated in pseudo-realtime.
Stage 1: Snapshot
This stage is where we front-load as much expensive work as possible so that we can have a reasonably fast app startup for the future.
The snapshot creation bit was actually covered in the previous data foundation section. So recall that we're copying the chat.db, grabbing the WAL + shm atomically, grabbing the AddressBook, but then we also do some metadata collection as well. Here are some of the things that we do:
- count messages and contacts
- open/create brb.db (this is our sqlite metadata database)
- record ingestion metadata - snapshot path, message count, date range
- pre-compute contact stats - avg response times, date ranges per contact. we compute this once during ingestion so we can load them in
- resolve display names - you'd be surprised but the join against the
AddressBookto map phone numbers/emails to real names can be expensive. - brb.duckdb - duckdb creation with empty tables ready for embedding
Two Databases
We utilize two separate databases here. And if you're thinking it might be slightly redundant, you kind of have a point, but the access patterns are use cases are different enough that this is entirely justified. Yes, yes, I know / have read about sqlite-vector.
In my mind, it was justified splitting this out. I know it's colored with the bias that I wanted to experiment with DuckDB, but sqlite is used for my OLTP workflows while DuckDB is used for my OLAP workflows. DuckDB is columnar, hence better for analytical operations. This lets us utilize vectorized scans,the HNSW/FTS extension, and bulk inserts.
Our sqlite3 table (brb.db) has metadata, contacts, stats. It has fast ACID transactions, small footprint, and is great for key-value lookups. Our DuckDB table (brb.duckdb) is our main holder of all the vector embeddings with HNSW indexes, full-text search with BM25, and more. This is purpose built for the RAG search portion, which we'll discuss in detail in the next section.
Also note, I pin DuckDB to exactly 1.4.4 on both the Rust and the Python side. Version mismatches here cause the VSS (vector similarity search) extension to crash because the extension binary format isn't stable across versions. This absolutely ruined me for a bit, and was causing my app to recreate indexes across languages when corruption was basically detected.
Stage 2: Parsing
This is pretty straightforward but... obviously like I stated above, we're parsing some metadata from our messages into the sqlite database.
Python Pipeline Notes
So! These next few sections are the most interesting part. The chunking, embedding, and saving all run inside of a single Python subprocess. I want to give a bit more context about that portion. Rust spawns the Python subprocess and reads JSON progress lines from stdout. Basically:
let embedding_result = python::run_python_embeddings_cancellable(
&snapshot_db,
&addressbook_db,
&output_db,
ingestion_id,
|stage, percent, message, stats| {
progress_callback(stage, percent, message, stats);
},
&is_cancelled,
)?;
Communication Protocol
This was a design choice that is maybe a bit juvenile and I'm not convinced I'd do it again. Part of it was because of project / scope creep, and still trying to move quick.
The two processes talk over OS pipes. This is actually what my old company used to use at Ab Initio quite a bit for some of their graph protocol and DML components.
Security wise, pipes are relatively safe. They're kernel-mediated. This means that no other process on the system can read from or write to them. There's no networking exposure, no socket file on disk that another process could sniff or anything like that.
The tradeoff is a bit of programming overhead but welcome to the era of AI. For example, we could be kinda burned here if a Python dep prints a stray message to stdout or a segfault cuts a jsonl line in half. That's why on the Rust side, we attempt to handle it gracefully:
match serde_json::from_str::<PythonMessage>(&line) {
Ok(PythonMessage::StageProgress {
stage,
percent,
message,
stats,
current_contact,
}) => {
let ingestion_stage = map_python_stage(&stage);
let ingestion_stats = convert_stats(&stats, current_contact);
progress_callback(ingestion_stage, percent, &message, ingestion_stats);
}
Ok(PythonMessage::Result { success, data, error }) => {
if success {
final_result = serde_json::from_value(data).ok();
} else {
error_message = error;
}
}
Ok(PythonMessage::Log { level, message }) => match level.as_str() {
"info" => info!(target: "python_embed", "{}", message),
"warn" => warn!(target: "python_embed", "{}", message),
"error" => error!(target: "python_embed", "{}", message),
_ => debug!(target: "python_embed", "{}", message),
},
Err(_) => {
// Not JSON, might be a raw log line
debug!(target: "python_embed", "Non-JSON output: {}", line);
}
Stage 3: Chunking
Raw messages are not useful to an embedding model. We do this for multiple reasons, but we are grouping messages into chunks using a sliding window. In production we use window_size=8, stride=4 (the function signature below shows the original library defaults — the production values get passed in from the caller). You'll note... that this is redundant by design. This is not that uncommon in RAG approaches. Messages at chunk boundaries intentionally appear in two groups so they have enough context.
def create_chunks(
messages: list[dict],
window_size: int = 5,
stride: int = 3,
min_chunk_chars: int = 10,
contact_name: str | None = None,
) -> list[dict]:
"""Create conversation-aware, overlapping chunks from messages.
Uses a sliding window with stride < window_size to create overlapping chunks,
ensuring messages at chunk boundaries are seen together in at least one chunk.
"""
chunks = []
conversations = split_into_conversations(messages)
for conversation in conversations:
for i in range(0, len(conversation), stride):
window = conversation[i : i + window_size]
Contact-Batched Processing
We do not load all the messages, create chunks, and embed everything. That causes memory to absolutely spike and tear through your vram. This was something I spiked into a good bit. We process in batches kinda like this:
# adaptive batch sizing
if pos < _HEAVY_CONTACT_COUNT: # top 50 by message count
size = _HEAVY_CONTACT_BATCH_SIZE # 10 contacts
else:
size = _NORMAL_CONTACT_BATCH_SIZE # 50 contacts
Each batch follows the same sequence: read -> chunk -> embed -> write -> free. Memory cleanup happens between batches, garbage collection plus a DuckDB CHECKPOINT every 5 batches to flush the WAL and return some memory back to the OS. Peak RSS SHOULD stay around 200MB (still thick).
Stage 4: Embedding
Each chunk gets embedded into a 768-dimensional vector using mlx-community/nomicai-modernbert-embed-base-8bit. This is a sentence-level transformer model already pre-quantized for Apple Silicon (hence the mlx). mlx runs inference directly on the GPU (metal) which is the whole point of targeting apple os.
Why nomicai-modernbert?
There are a couple of reasons.
- modern
This encoder model is a late 2024 redesign of the original BERT encoder. It has some of the more recent bells and whistles. Rotary embeddings, alternating local vs global attention, unpadded training, etc. Plus it's got an 8k context. You'll note that I actually truncate on 215 tokens because most iMessages are not that long, but it's nice to have the flexibility.
- asymmetric retrieval fine-tuning
Nomic fine-tuned this puppy with the Nomic-embed recipe, which means that we literally have document vs query prefix names. We prefix search_document: or search_query: and as a result, the fine-tuned model learns that these are in slightly different sub-spaces of the embedding manifold. The prefixes help you slightly route into the right one.
- 8-bit quantized, mlx friendly
The mlx-community fork is 8-bit quantized and packaged for Apple's hardware. This means we get roughly ~2x the throughput of the fp16 version on an M-series GPU chip.
- 768 dim tradeoff
DuckDB's HNSW index cost scales with dimension count. At ~245K chunks (my current corpus), 768-dim embeddings are ~750MB of raw vectors. 1024 dim would be ~1GB, and the HNSW graph itself adds another multiple on top of that. At higher dims and larger corpora the working set blows past available RAM fast. I don't really think the recall difference is impactful beyond the 768 dim threshold.
Why Sentence Level Transformer?
This has become pretty standard for RAG approaches to utilize sentence level transformers.
A plain language model (gpt, opus, sonnet) outputs one hidden state per token. These hidden states DO carry semantic meaning — they have to, otherwise next-token prediction wouldn't work — BUT the meaning is encoded in the wrong shape. They're operating in a latent space optimized to be useful for the prediction head sitting at the end of the network. The geometry got sculpted for "what tokens help me guess the next token," not for "vectors of similar-meaning text should point in similar directions."
That distinction matters because the entire RAG setup is built on cosine similarity. We turn ~245K chunks into points in space, turn the query into a point in the same space, and ask "which chunk points are closest in direction to the query point?" For that to work, the space has to have a very specific property: semantic closeness has to correspond to directional closeness. That property is not free. It only exists if the model was trained to produce it.
And we still need a single vector per chunk, which means we have to pool the per-token hidden states somehow — mean-pool, take [CLS], take the last token, whatever. None of those pooling strategies were in the training objective of a chat model. The model was never told "the mean of your token vectors should be close to the mean of a semantically similar chunk." So even if we pool, the result lands in a space whose geometry was shaped for a different task. Vanilla BERT's [CLS] embedding is famously terrible at semantic similarity out of the box for exactly this reason.
Sentence transformers fix this directly. They're BERT-family encoders that were additionally fine-tuned with contrastive learning. The training loop looks roughly like: show the model two sentences that mean similar things, compute their embeddings, compute the cosine similarity, and define a loss that says "this number should be high." Show it two dissimilar sentences and the loss says "this number should be low." Run that over millions of pairs and the network's weights get adjusted until the output geometry literally has the property we need - direction equals meaning.
Practically, this shows up in three places:
- One vector per chunk. The model takes the full chunk (up to 512 tokens) and emits a single 768-dim vector. No reducer, no "which token do I pick," no guessing.
- Cosine is meaningful. Because of the contrastive training objective, two chunks about "dinner plans" land near each other in embedding space. That's the only reason hybrid search works.
- Cheap enough to run on ~245K chunks.
nomicai-modernbert-embed-base-8bitis ~150M params. Using a 9B chat model as an embedder (pool the last hidden state, last-token style, à la GritLM / E5-Mistral) would be ~60x the FLOPs per chunk for a marginal quality bump. Not worth it at this scale.
You can see the pooling logic in ml/embeddings.py:68 - I do mean_pooling then normalize_embeddings (L2) as a fallback when the model doesn't return text_embeds directly. This is the canonical sentence-transformer recipe.
Why sort?
This was truly one of the biggest perf wins I had - for drastically dropping the vram usage.
It's this line in my pipelines/embedding.py:390
chunks.sort(key=lambda c: len(c.get("text", "")))
To really understand this, we do need to spend some time talking about transformers. At a high level, transformers are a shit ton of matrix operations. Per batch, we have tensors of [batch_size, seq_len, hidden_dim]. That seq_len has to be a single number!! If they are a different number, then our code has to pad the shorter ones with a [PAD] token so that all the tensors have the same shape. The model then computes everything, and then at the end, uses the PAD tokens to zero out those contributions.
What this means in reality is that in terms of FLOPs, batches cost the max length of their longest tensor seq_len, applied to every tensor.
So when we don't sort, what happens is that we'll have some batch with a small chunk like "yeah" for example, and if the variance on the seq_len is too high, then we get burned by having the insert a ton of PAD tokens and do a ton of useless compute.
Your batch padding changes from potentially as bad as your variance across your whole dataset, to just the variance within a single batch.
In addition! another super fun and interesting win is the vram impact. At a conceptual level, this of course makes sense that it'd be faster, but it's also a nice memory win.
VRAM during the forward pass is dominated by activations. Activations are basically the intermediate tensors each layer produces and keeps around. They scale like: batch_size * seq_len * hidden_dim * num_layers, and attention layers add a quadratic term on top: batch_size * seq_len**2 * num_heads. So without sorting, every batch has the chance to contain those outlier tokens, and the whole batch pays the quadratic cost of the longest sequence in it.
The natural follow-up here is dynamic batch size by token budget — instead of a fixed batch_size=128, you'd target a constant token count per batch (e.g. batch_size * seq_len ≈ 16K), so short-sequence batches pack more rows. I haven't implemented this yet; sorting alone got the win I needed, and dynamic batching is sitting on the follow-up list.
Stage 5: Saving
Saving can basically be distilled down into "writing pertinent data to the database". That is inclusive of our DuckDB and sqlite3 dbs. There are a couple of foundational decisions here that give us a solid win.
Foundation 1: Parquet as an middle ground
Hilarious, but it just keeps coming up as a win. I've talked about Parquet file format here on my blog. So I won't discuss it much further. Feel free to download the Walk in the Parquet MacOS app.
However, one big win I found for actually writing was to create temporary parquet file formats for loading. This bit of code:
def _write_parquet_to_duckdb(
conn: duckdb.DuckDBPyConnection, table: pa.Table, target_table: str
) -> int:
with tempfile.NamedTemporaryFile(suffix=".parquet", delete=False) as f:
parquet_path = f.name
pq.write_table(table, parquet_path)
try:
conn.execute(f"COPY {target_table} FROM '{parquet_path}'")
return table.num_rows
finally:
Path(parquet_path).unlink(missing_ok=True)
gave us a considerable speedup when writing.
I benchmarked this across four strategies on a 10K message sample on my lovely M-series Mac with MLX on GPU. Here's the comparison at batch_size=128:
| strategy | total | embed | write | chunks/sec |
|---|---|---|---|---|
parquet_buffered |
9.12 s | 8.96 s | 0.124 s | 365.7 |
parquet_bulk |
13.53 s | 12.73 s | 0.762 s | 246.3 |
pandas_insert |
16.79 s | 14.55 s | 2.197 s | 198.6 |
direct_executemany |
16.88 s | 9.94 s | 6.902 s | 197.6 |
Let me further explain what each method is doing:
direct_executemany- the simplest approach
- we're building a list of Python tuples
- then we call
conn.executemany("INSERT INTO message_chunks VALUES (?, ?, ?, ...)", rows) - DuckDB binds parameters and executes the insert row-by-row
- There is no intermediate file, no Arrow buffer - just Python objects straight to SQL
- Crucially, every 768-float embedding gets shipped through the parameter binding path one row at a time.
pandas_insert- Here we're building a
pandas.DataFramefrom the same rows - then we run
INSERT INTO message_chunks select * from dm - DuckDB's Python integration sees
dfin the calling scope and reads it as a virtual table via Arrow under the hood - The df buys us a bit but we still pay for the Python dict -> df -> Arrow conversion on the write
- Here we're building a
parquet_bulk- we skip pandas entirely! it's not always the best, like so many people say online
- here's what we do:
- Build a
pyarrow.Tabledirectly (with explicit Arrow types —pa.list_(pa.float32())for embeddings,pa.timestamp("us", tz="UTC")for timestamps, etc.) - write it to a temp
.parquetfile COPY message_chunks FROM '/tmp/xxx.parquet'
- Build a
- DuckDB's Parquet reader is a vectorized columnar scanner that reads straight into its execution engine
- one Parquet write per call to
write_chunks(), which means once per embed batch.
parquet_buffered- Same machinery as
parquet_bulk, but the strategy accumulates chunks in memory and only flushes a Parquet file when the buffer hits 5K rows (or the run ends) - At
batch_size=128, this is roughly one Parquet write per ~40 batches instead of one per batch.
- Same machinery as
A few things that are helpful to note:
parquet_bufferedwrites are ~55x faster thandirect_executemanyat the same batch size- i.e.
0.124 s vs 6.9 s - Going through DuckDB's SQL parser and parameter binding per row is brutal once each row carries a 768-element float array
executemanydoesn't actually batch on the wire- it's a Python convenience that loops
execute()under the hood, so the engine sees N independent statements, each re-validating types and re-allocating the embedding list.
- i.e.
pandas_insertis ~18x slower on the write step.- Pandas pushes into DuckDB via Arrow under the hood, but the Python object -> DataFrame -> Arrow round-trip eats most of the win
- Building the PyArrow table directly skips that.
- Writes are 1.4% of total time on the parquet path, vs 40.9% on the
executemanypath.- Once writes are essentially free, the bottleneck shifts back to the GPU (embedding), which is exactly where you want it.
- The embedder stays fed instead of stalling on the DB.
So writing isn't free, but on the parquet path it's close enough that we can stop thinking about it. The whole point of the temp-Parquet dance is to push the bottleneck back onto the GPU where it belongs. This is way more trivial and a smaller example, but there's (unsurprisingly) a great Youtube video on GPU optimization from Jane Street here:
Foundation 2: Indexes are built after
This one is probably a bit more obvious.
We're building HNSW and FTS indexes. These are used for RAG and vector search. More or less, HNSW maintains a graph between vectors, and FTS maintains an inverted index from terms to doc_ids.
We do this at the end because of the cost of indexing. HNSW has to recompute and rebalance the graph per insert. FTS for DuckDB (based on my understanding) does not have a way to recompute on the fly, it's a static index.
These are expensive indexes and doing them constantly takes compute and time. We're trying to be mindful of those resources. The punchline: building the index once at the end (a bulk construction over the final set of vectors) is fundamentally cheaper than maintaining it incrementally across N inserts, because each incremental insert pays for graph rebalancing (HNSW) or index rewrite (FTS) on top of the actual write.
Foundation 3: Two indexes (HNSW and FTS)
We're going to address this a lot more in the next section, but we want both because they fail in opposite ways.
- HNSW finds semantic matches "you wanna get dinner?" potentially linking to "meals"
- if you search for someone's exact phone number though, HNSW might miss it because that individual token doesn't mean shit to it
- FTS finds the exact word match
We're using a classic approach for handling these hybrid search results with Reciprocal Rank Fusion. I'll discuss this more in the next section as well.
Stage 6: Analyzing
Alright! So this is getting into the portion of how we use the embeddings we just generated and save.
A very important distinction here is that the output of the pipeline above is not for search. It's mainly for the Insights page, which shows a 2D/3D map of your conversational landscape with clusters.
We can decompose this into a couple of foundations as well.
Foundation 1: Dimensionality Jumps (768 -> 50 -> 2)
So we basically get our dimension into something our feeble human minds can better visualize by three separate technics.
- PCA: 768 -> 50 dims through PCA. Hover for more info.
- UMAP: 50 -> 2 as well as 50 -> 3 for the 3D view
- HDBSCAN: clustering in 2D
A reasonable question: why three steps? Why not just UMAP straight from 768 to 2, or cluster directly in 768D? Two reasons.
The curse of dimensionality breaks density. In high dimensions, the ratio of (distance to nearest neighbor) / (distance to farthest neighbor) approaches 1 — every point becomes roughly equidistant from every other point. Density-based clustering relies on "this region is denser than that region," which is meaningless when distances don't separate. So we have to reduce dimensions before HDBSCAN sees anything. This is the formal Beyer et al. (1999) result, "When Is Nearest Neighbor Meaningful?"
PCA is cheap, UMAP is expensive. PCA is linear — pure eigendecomposition, runs in seconds on ~1M × 768. UMAP is nonlinear and iterative — minutes to hours. Doing UMAP on the full 768 dims wastes compute on dimensions PCA could have stripped first. So PCA cheaply trims 768 → 50 (keeping ~80–90% of explained variance), then UMAP does the expensive nonlinear projection on the surviving 50D, then HDBSCAN clusters on the 2D output. Each step uses the right tool for what's expensive about it.
I'll discuss these a bit more below, with a focus on UMAP and HDBSCAN a bit more below.
PCA - Principal Component Analysis
Again, I would encourage you check out docs here. I'm assuming if you've gotten this far in the blog you're at least pretty familiar with this one given it's a pretty core mechanism of linear alg. TLDR, PCA finds the direction your data varies in the most and then lets your reduce the dimensionality by only keeping those top parts. 50 dimensions keep the vast majority of the directions.
It's technically the eigendecomposition of the covariance matrix. I spent awhile reading The Little Book of Linear Algebra, which you can check out here for PCA.
You:
- compute the covariance matrix
- compute the eigenvec adn eigenvalue
- sort by eigenvalue
- top are your first components
- project existing data onto those components
UMAP - Uniform Manifold Approximation and Projection (for Dimension Reduction)
I sadly haven't understood this one to the degree that I wish yet. You should check out their [main website][umap-main]. I am not familiar with Riemannian manifolds which is one of the core assumptions for the modeling.
At a high level, UMAP finds a low dimensional 2D or 3D layout for points so that points that were nearby in the high dimensional space stay close, and points that were far away, stay far away.
The mechanism is as follows:
- build a graph that captures the high dimensional structure
- for every point, we find nearest neighbors
- we connect them with weighted edges
- weighting is calibrated locally (this is what gives it the uniform bit)
- It's basically asking: how close is this neighbor relative to the typical neighbor distance for this point
- yes, i know this is kind of handwavy
- you output this weighted graph where edges represent points on the higher dimension manifold curve
- build a matching graph in the lower dimension
- this typically uses a spectral embedding to generate a 2D point cloud
- we optimize so that the 2d edge weights largely match the higher dimensional graph, normally have the objective be a cross entropy loss
- two main controls -
n_neighborsandmin_dist
My first instinct was to use t-SNE which is apparently dumb and boomer like. UMAP is apparently what the kids are using these days because it preserves more of the global structure (t-SNE is great at local neighborhoods but tends to scatter the macro layout), scales better empirically to larger point clouds, and is more deterministic across runs given the same seed and parameters.
HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise
And you thought UMAP was a mouthful eh?
HDBSCAN finds clusters by looking for regions of high point density separated by regions of low point density.
I intuitively thought "oh what about k-means" like a classic rookie, BUT that obviously doesn't work, because we don't know . And also I want some control around noise where chunks aren't clustered cleanly.
Another good way to think about it - from Claude distilling it down - is that we can imagine the 2d/3d point cloud as a topographic map. If points are dense, then it's kinda high elevation. A "cluster" is basically a peak - it's a connected region above some density threshold. Noise is everything in the valleys.
There is an obvious balance here, hence why Be Right Back let's you pick and adjust these thresholds and parameters.
Ok so the mechanics:
- compute mutual reachability distance
- for each pair of points, we find a modified distance that says "how hard is it to walk from point A to point B without leaving dense regions"
- this inherently penalizes points in sparse regions
- so basically
mutual_reachability(A, B) = max(core_dist(A), core_dist(B), dist(A, B))
- build a MST
- we build this graph with mutual reachability distances as edge weights
- MST is built using the smallest possible edge weight
- build a cluster hierarchy
- This is really just building a dendogram. See the note below this is a BIT similar to HAC
- pick stable clusters
- this is the magic sauce
- we compute a stability score for each candidate cluster
- basically, how much of the density-threshold-range did this cluster survive
- and then we pick the set of clusters that maximize total stability
- (kinda cleanup task) label noise clusters
- the ones that did not into any clusters get labelled with a -1 so we know to ignore
This final step is how we get automatically. is whatever number of stable clusters exist in the data.
Foundation 2: Why HDBSCAN, not k-means
K-means is the default clustering algorithm for most people, but it's wrong for this task in two specific ways:
- K-means requires you to pre-specify K. You don't know how many topics are in someone's iMessage history. 5? 50? 500? The algorithm forces you to guess. HDBSCAN figures it out from the data via a
min_cluster_sizeparameter, which is a much more meaningful knob ("clusters smaller than this aren't real topics, they're noise"). - K-means assumes round, equally-sized clusters. It minimizes within-cluster variance under a Euclidean assumption, which means every point gets assigned to some cluster, even outliers. iMessage data has natural noise — one-off conversations that don't belong to any topic. K-means would distort real clusters by pulling outliers in. HDBSCAN labels them as
-1(noise) explicitly, which is what you want.
The "H" in HDBSCAN is for Hierarchical — it builds a cluster tree at varying density thresholds and picks the most stable level automatically. The "DB" is for Density-Based — it defines a cluster as a region of high point density separated by regions of low density. Together: an algorithm that finds variable-shape clusters, handles noise gracefully, and picks K for you.
The code also has some interesting post-processing: merge_similar_clusters (combine clusters that are semantically close) and reassign_noise_points (give noise points their nearest cluster if it's close enough). These are practical compromises — HDBSCAN over-fragments and over-flags noise on real data, so we reel it back in.
Foundation 3: Why the clustering is in 2D instead of 50D
This is a subtle one and it's worth getting right because it sounds wrong at first.
The "pure" approach would be: cluster in 50D PCA space (because density estimation is more meaningful there than in 2D), then project the cluster labels onto the 2D UMAP for visualization. This is what some pipelines do.
But it has a problem: UMAP doesn't preserve global structure perfectly. Two points that are close in 50D might be far apart in the 2D UMAP (and vice versa). If you cluster in 50D and visualize in 2D, you'll see clusters that look incoherent — points labeled the same color but spread across the 2D map.
By clustering in 2D directly, the cluster labels match the visual structure the user sees on the Insights page. Every visible "blob" gets a label; every "hole" is noise. The visualization is consistent with the clustering, not just decorated by it.
The tradeoff: you're trusting UMAP's 2D layout to reflect real structure. UMAP is well-validated for this — it preserves local neighborhood structure by design — but it can produce occasional artifacts (spurious clusters from layout noise). For a UI that's communicating "here are your conversational themes," visual coherence is more important than mathematical purity.
Foundation 4: The LLM labeling step
The clustering produces numeric cluster IDs (0, 1, 2, ...) but the UI shows human-readable labels like "Travel Plans" or "Work Projects". Where do those come from?
The pipeline takes a sample of chunks from each cluster, sends them to a local LLM with a prompt like "what topic do these messages share?", and uses the LLM's response as the cluster label. This is a small but interesting design choice:
- It's a one-shot operation — runs once per cluster, ~12 clusters typically, so ~12 LLM calls
- It's the slowest part of the analytics pipeline (the progress bar weights labeling at ~25% of total time)
- It happens after clustering, so the LLM doesn't influence which chunks go in which cluster — only what the cluster is named
This separation matters: the what of clustering (which chunks go together) is determined by the embedding geometry, which is reproducible and deterministic. The naming is determined by the LLM, which is fuzzy and nondeterministic. If the LLM hallucinates a bad name, the clustering itself is still correct.
Conclusion
And thus! Concludes our main pipeline. We start with the raw chat.db and end with two distinct artifacts: a DuckDB full of 768-dim chunks indexed by HNSW and FTS, and 2D cluster map with LLM-generated labels. There are two pretty different down stream consumers.
The first artifact is what feeds Casper, which I'll talk about in the next portion. This has some fun about query planning, RRF, MMR, cross-encoder re-ranking.
But the main emphasis from this portion, is that there are some huge wins with relatively mundane and simple tricks. Sorting chunks before batching, write through Parquet, build indexes once at the end instead of incrementally, etc. But again, this is what makes 1M messages searchable within minutes on your laptop.