Casper - The Agent System

How Casper orchestrates query planning, hybrid search, reranking, and context building to answer questions about your messages.

Overview

Our vector search section handles retrieval: give it a prompt or query and it hands back matching chunks. That is useful, but it is not quite "agentic" (buzzword flag).

For Casper to be actually helpful, we need tooling and guard rails around that retrieval, so that it surfaces useful information instead of just returning whichever chunk the vector search matched.

Agent Loop

You can kinda think about this section as Casper's heart. It is the core agent loop: it runs with a set of tools, chooses when to invoke them, and eventually returns a response to the user.

Casper - our beloved yet slightly dark mascot for this time suck of a side project - runs up to 5 iterations, although that is configurable. It basically is this...

pub async fn ask_with_history_progress(
    &mut self,
    question: &str,
    history: &[ChatMessage],
    progress: Option<&dyn Fn(AgentProgressEvent)>,
) -> Result<AgentResponse> {
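    // Enrich question with temporal extraction, contact prefetching
    let enriched = self.enrich_question(question).await?;

    // Conversation so far: system prompt, prior history, then the enriched question
    let mut messages = vec![ChatMessage::system(&self.build_system_prompt())];
    messages.extend(history.iter().cloned());
    messages.push(ChatMessage::user(&enriched));
    let config = self.config.generation_config.clone();

    for _iteration in 0..self.config.max_iterations {
        let response = self.engine.generate(&messages, &config).await?;

        // parse_tool_calls returns a Result in the full function, hence the `?`
        let parsed = parse_tool_calls(&response)?;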
        
        if parsed.tool_calls.is_empty() {
            // Model returned a direct answer
            return Ok(AgentResponse::from_text(parsed.text));
        }
        
        // Execute tool calls, append results, continue loop
        for tool_call in &parsed.tool_calls {
            let result = self.execute_tool_call(tool_call).await?;
            messages.push(ChatMessage::tool_response(result));
        }
    }
    
    // Max iterations reached — synthesize from what we have
    self.synthesize_answer(&messages).await
}
Hating on "basically" there? Fine, this is the actual function:
    /// Ask with conversation history and a progress callback.
    ///
    /// Same as `ask_with_history` but emits `AgentProgressEvent`s in real-time
    /// via the provided callback, allowing callers to forward events to a UI.
    pub async fn ask_with_history_progress<F: FnMut(AgentProgressEvent)>(
        &mut self,
        question: &str,
        history: &[ChatMessage],
        mut progress: F,
    ) -> Result<AgentResponse> {
        let start = Instant::now();

        let system_prompt = self.build_system_prompt();
        let max_iterations = self.config.max_iterations;
        let generation_config = self.config.generation_config.clone();
        let mut messages = vec![ChatMessage::system(&system_prompt)];

        // TODO(@larkin): would love not to clone here
        messages.extend(history.iter().cloned());

        // yeah, i know this is bad... 
        // 
        // for inferential questions (relationship, frequency, "who" questions),
        // automatically prefetch top contacts so the LLM has frequency data
        // without needing to call list_contacts itself. This compensates for
        // smaller models that may not follow multi-step tool strategies.
        let enriched_question = if Self::is_inferential_question(question) {
            info!("Detected inferential question, prefetching top contacts");
            match self.prefetch_top_contacts() {
                Some(context) => format!("{}\n\n{}", question, context),
                None => question.to_string(),
            }
        } else {
            question.to_string()
        };

        // Add the current question (with optional context enrichment)
        messages.push(ChatMessage::user(&enriched_question));

        info!(question = question, "Starting agent loop");
        let mut engine = self.engine.take().ok_or(InferenceError::NotInitialized)?;
        debug!("Using Qwen3 model for agent");
        let model_to_load = Some(DEFAULT_MLX_AGENT_MODEL);

        // Load the model (hot-swap if needed)
        if let Err(e) = engine.load_model_with_override(model_to_load) {
            self.engine = Some(engine);
            return Err(e);
        }

        let mut all_tool_calls = Vec::new();
        let mut iterations = 0;
        let mut previous_responses: Vec<String> = Vec::new();
        let mut last_tool_call_json: Option<String> = None;

        let result = loop {
            iterations += 1;

            if iterations > max_iterations {
                warn!(
                    iterations = iterations,
                    max = max_iterations,
                    "Exceeded max iterations, running synthesis pass"
                );

                // if we've hit our max, we'll just synthesize and report back
                messages.push(ChatMessage::user(prompts::SYNTHESIS_PROMPT));
                let synthesis = match engine.generate(&messages, &generation_config).await {
                    Ok(r) => {
                        info!(
                            raw_len = r.len(),
                            raw_preview = %r.chars().take(300).collect::<String>(),
                            "Synthesis pass raw response"
                        );
                        // Parse to extract just the text (ignore any accidental tool calls)
                        let text = match parse_tool_calls(&r) {
                            Ok(parsed) => {
                                info!(
                                    has_thinking = parsed.thinking.is_some(),
                                    has_tool_calls = parsed.has_tool_calls,
                                    text_len = parsed.final_text().len(),
                                    "Synthesis pass parsed"
                                );
                                parsed.final_text()
                            }
                            Err(_) => r,
                        };
                        // Guard against empty synthesis (model generated only <think> block
                        // with no answer text). Fall back to tool result summary.
                        if text.trim().is_empty() {
                            warn!("Synthesis pass returned empty text, falling back to tool result summary");
                            Self::synthesize_from_tool_results(&all_tool_calls)
                        } else {
                            text
                        }
                    }
                    Err(e) => {
                        warn!(error = %e, "Synthesis generation failed, using fallback");
                        "I gathered some information but couldn't complete the full analysis. Please try a more specific question.".to_string()
                    }
                };

                break Ok(AgentResponse {
                    answer: synthesis,
                    tool_calls: all_tool_calls,
                    iterations,
                    duration_ms: start.elapsed().as_millis() as u64,
                    thinking: None,
                });
            }

            progress(AgentProgressEvent::Thinking {
                iteration: iterations,
                max_iterations,
            });

            debug!(iteration = iterations, "Generating LLM response");
            let response = match engine.generate(&messages, &generation_config).await {
                Ok(r) => r,
                Err(e) => {
                    self.engine = Some(engine);
                    return Err(e);
                }
            };

            info!(
                iteration = iterations,
                response_len = response.len(),
                response_preview = %response.chars().take(300).collect::<String>(),
                "LLM response received"
            );

            // Parse the response for tool calls first (before repetition check)
            // This way we can distinguish between repeated tool calls and repeated answers
            let parsed = match parse_tool_calls(&response) {
                Ok(p) => p,
                Err(e) => {
                    self.engine = Some(engine);
                    return Err(e);
                }
            };

            if parsed.has_tool_calls {
                let current_tool_json = serde_json::to_string(&parsed.tool_calls).ok();
                if let (Some(current), Some(last)) = (&current_tool_json, &last_tool_call_json)
                    && current == last
                {
                    let last_result = all_tool_calls
                        .last()
                        .map(|r| {
                            if r.result.success {
                                format!(
                                    "Based on the tool results, I couldn't find the information you're looking for. {}",
                                    r.result.data.get("note").and_then(|n| n.as_str()).unwrap_or("")
                                )
                            } else {
                                "I encountered an error while searching. Please try again.".to_string()
                            }
                        })
                        .unwrap_or_else(|| "I was unable to find the requested information.".to_string());

                    break Ok(AgentResponse {
                        answer: last_result,
                        tool_calls: all_tool_calls,
                        iterations,
                        duration_ms: start.elapsed().as_millis() as u64,
                        thinking: parsed.thinking.clone(),
                    });
                }
                last_tool_call_json = current_tool_json;

                info!(
                    count = parsed.tool_calls.len(),
                    iteration = iterations,
                    "Executing tool calls"
                );

                // Inject embeddings for search_messages calls.
                // Run on a blocking thread because embed_query() does synchronous
                // socket I/O to the Python daemon, which would starve the tokio
                // executor under concurrent queries.
                let question_owned = question.to_string();
                let tool_calls = parsed.tool_calls;
                let (engine_back, calls_result) =
                    tokio::task::spawn_blocking(move || {
                        let result = CasperAgent::inject_embeddings(
                            &mut engine,
                            tool_calls,
                            &question_owned,
                        );
                        (engine, result)
                    })
                    .await
                    .map_err(|e| {
                        InferenceError::BackendError(format!(
                            "embedding task panicked: {e}"
                        ))
                    })?;
                engine = engine_back;
                let calls = match calls_result {
                    Ok(c) => c,
                    Err(e) => {
                        self.engine = Some(engine);
                        return Err(e);
                    }
                };

                let (mut records, _) = self.execute_tool_calls_with_progress(&calls, iterations, &mut progress);

                // here's our lovely cross encoder rerank
                Self::rerank_search_results(&mut engine, &mut records);
                let formatted_response = Self::format_tool_records(&records);
                all_tool_calls.extend(records);

                // Add assistant message with ONLY the tool calls in Qwen3.5 XML format
                // Don't include any hallucinated text that came after the tool call
                let tool_call_xml = parser::format_tool_calls_xml(&calls);
                messages.push(ChatMessage::assistant(&tool_call_xml));
                messages.push(ChatMessage::tool_response(&formatted_response));
                previous_responses.push(response.clone());
            } else {
                let answer = parsed.final_text();

                // yes yes... i know this is also jank
                // small model problems
                //
                // Detect empty responses (model generated only <think></think> + EOS).
                // Common with small quantized models on simple/meta questions.
                // Nudge the model to actually respond.
                if iterations < max_iterations && answer.trim().is_empty() {
                    warn!(
                        iteration = iterations,
                        "Detected empty response, nudging model to answer"
                    );

                    messages.push(ChatMessage::assistant(&response));
                    messages.push(ChatMessage::user(
                        "Please respond to the user's question directly with a helpful answer.",
                    ));
                    previous_responses.push(response.clone());
                    continue;
                }

                // Detect when the model narrates intent to search instead of
                // actually calling tools (common small-model failure mode).
                // If the answer reads like a plan ("Let me search...", "I'll try...")
                // and we haven't exhausted iterations, nudge it to act.
                if iterations < max_iterations && Self::is_narrated_intent(&answer) {
                    warn!(
                        answer_preview = %answer.chars().take(120).collect::<String>(),
                        iteration = iterations,
                        "Detected narrated intent instead of tool call, nudging model to act"
                    );

                    messages.push(ChatMessage::assistant(&response));
                    messages.push(ChatMessage::user(
                        "Don't describe what you plan to do — actually call the tool now. \
                         Use <tool_call> to search.",
                    ));
                    previous_responses.push(response.clone());
                    continue;
                }

                progress(AgentProgressEvent::Answering);
                let mut answer = answer;
                if Self::is_repetitive(&answer, &previous_responses) {
                    warn!(
                        answer_preview = %answer.chars().take(100).collect::<String>(),
                        "Detected repetitive text answer, synthesizing from tool results"
                    );
                    answer = Self::synthesize_from_tool_results(&all_tool_calls);
                }
                if answer.trim().is_empty() {
                    warn!("Final answer is empty, falling back to tool result summary");
                    answer = Self::synthesize_from_tool_results(&all_tool_calls);
                }

                info!(
                    iterations = iterations,
                    answer_len = answer.len(),
                    "Agent reached final answer"
                );

                let duration_ms = start.elapsed().as_millis() as u64;

                break Ok(AgentResponse {
                    answer,
                    tool_calls: all_tool_calls,
                    iterations,
                    duration_ms,
                    thinking: parsed.thinking,
                });
            }
        };
        self.engine = Some(engine);
        result
    }

Loop Protections

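I'm not going to go into them in detail, but if you expand the collapsible section, you'll see the small-model problems we try to handle: identical back-to-back tool calls, empty responses (just a <think> block), narrated intent instead of an actual tool call, repetitive answers, and running out of iterations. I'm confident that by Qwen 4.5 we won't need to handle this... but feel free to rip my / Claude's code apart.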
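The full function calls two helpers, is_narrated_intent and is_repetitive, whose bodies aren't shown on this page. Here is a minimal sketch of what they might look like. The phrase list and the normalize-then-compare check are my assumptions, not Casper's actual heuristics.

```rust
/// Hypothetical stand-in for `is_narrated_intent`: true when the answer
/// opens like a plan ("Let me search...") rather than an actual answer.
/// The phrase list below is an assumption, not Casper's real list.
fn is_narrated_intent(answer: &str) -> bool {
    const PLAN_OPENERS: [&str; 4] = ["let me search", "let me look", "i'll try", "i will search"];
    let lowered = answer.trim().to_lowercase();
    PLAN_OPENERS.iter().any(|opener| lowered.starts_with(opener))
}

/// Collapse whitespace runs and lowercase, so trivial differences don't hide a repeat.
fn normalize(text: &str) -> String {
    text.split_whitespace().collect::<Vec<_>>().join(" ").to_lowercase()
}

/// Hypothetical stand-in for `is_repetitive`: true when the answer matches
/// an earlier response after normalization. Empty answers are not counted
/// here, since the loop handles them separately.
fn is_repetitive(answer: &str, previous_responses: &[String]) -> bool {
    let current = normalize(answer);
    !current.is_empty() && previous_responses.iter().any(|prev| normalize(prev) == current)
}
```

Both checks are deliberately cheap string tests: they run on every iteration, and a false positive only costs one extra nudge or a fallback summary.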

Tools

What is an agent without tools? A pretty fucking bad agent.

Casper has 5 tools - these are all Rust functions.
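Since the model only ever emits a tool name as a string, something has to map that string to a Rust function. One way to do it, sketched below, is an enum over the five tool names listed in this section. The Tool enum and from_name are illustrative names of mine, not Casper's actual dispatch code.

```rust
/// The five tool names from this page, as an enum so dispatch is exhaustive.
/// (Illustrative type; Casper's real dispatch is not shown here.)
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tool {
    SearchMessages,
    ListContacts,
    GetContactStats,
    GetConversationContext,
    GetMessageVolumeByPeriod,
}

impl Tool {
    /// Map the name the model emitted to a tool.
    /// Returns `None` for anything the model made up.
    fn from_name(name: &str) -> Option<Tool> {
        match name {
            "search_messages" => Some(Tool::SearchMessages),
            "list_contacts" => Some(Tool::ListContacts),
            "get_contact_stats" => Some(Tool::GetContactStats),
            "get_conversation_context" => Some(Tool::GetConversationContext),
            "get_message_volume_by_period" => Some(Tool::GetMessageVolumeByPeriod),
            _ => None,
        }
    }
}
```

Returning None for unknown names lets the caller report a failed tool call back to the model instead of panicking on a hallucinated tool.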

search_messages

This is the primary tool. It calls the search pipeline basically with this payload:

{
  "name": "search_messages",
  "parameters": {
    "query": "Thanksgiving dinner plans",
    "contact_id": "+15551234567",
    "limit": 7,
    "min_date": "2025-11-01",
    "max_date": "2025-11-30"
  }
}

Before the search is actually executed, we inject an embedding vector for the query. This is transparent to the LLM: the model writes the query text and the agent handles the embedding.
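A rough sketch of that injection step is below. The ToolCall struct is a simplified stand-in for Casper's real type, the embed closure stands in for the real embedder, and falling back to the user's question when the model omits a query is my assumption about why the real inject_embeddings takes the question.

```rust
/// Simplified tool call; Casper's real type is not shown on this page.
#[derive(Debug, Clone, PartialEq)]
struct ToolCall {
    name: String,
    query: Option<String>,
    /// Filled in by the agent, never by the model.
    embedding: Option<Vec<f32>>,
}

/// Attach an embedding to every `search_messages` call and leave other tools alone.
/// `embed` is a placeholder for the real embedder. Using `question` when the
/// model left out `query` is an assumption.
fn inject_embeddings<F>(mut calls: Vec<ToolCall>, question: &str, mut embed: F) -> Vec<ToolCall>
where
    F: FnMut(&str) -> Vec<f32>,
{
    for call in calls.iter_mut().filter(|c| c.name == "search_messages") {
        let text = call.query.clone().unwrap_or_else(|| question.to_string());
        call.embedding = Some(embed(&text));
    }
    calls
}
```

The model never sees or produces the vector, so the tool schema it has to follow stays small: just text and filters.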

list_contacts

Lists contacts. The model can inspect the listed result.

get_contact_stats

This gets the aggregated, detailed stats for a specific contact: message counts, date ranges, emoji usage, average response time, yadda yadda. This is the precomputed data in brb.db.

get_conversation_context

Retrieves messages around a specific point in time. Useful for "what happened around / after X".
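Conceptually this is a time-window filter. The sketch below works over an in-memory slice to show the idea. The Message struct, its field names, and the symmetric window are all illustrative assumptions, and the real tool presumably queries the database rather than a Vec.

```rust
/// Simplified message row; field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
struct Message {
    /// Unix timestamp in seconds.
    sent_at: i64,
    text: String,
}

/// Return the messages within `window_secs` of `anchor` (on either side), oldest first.
fn conversation_context(messages: &[Message], anchor: i64, window_secs: i64) -> Vec<Message> {
    let mut nearby: Vec<Message> = messages
        .iter()
        .filter(|m| (m.sent_at - anchor).abs() <= window_secs)
        .cloned()
        .collect();
    nearby.sort_by_key(|m| m.sent_at);
    nearby
}
```

For an "after X" question, the same idea works with a one-sided window (anchor up to anchor + window_secs) instead of a symmetric one.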

get_message_volume_by_period

Aggregates message volume per contact in a date range. Used for "life period" questions like "who did I talk to most in college?"
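The aggregation itself is a group-and-count. Here is a minimal in-memory sketch, assuming rows of (contact id, Unix timestamp), a half-open date range, and a descending sort by count. The row shape and the ordering are my assumptions, and the real tool likely does this in the database.

```rust
use std::collections::HashMap;

/// Count messages per contact with `start <= sent_at < end` (Unix seconds),
/// sorted by count descending, ties broken by contact id.
fn message_volume_by_period(rows: &[(String, i64)], start: i64, end: i64) -> Vec<(String, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for (contact, sent_at) in rows {
        if (start..end).contains(sent_at) {
            *counts.entry(contact.as_str()).or_insert(0) += 1;
        }
    }
    let mut ranked: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(contact, n)| (contact.to_string(), n))
        .collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked
}
```

For "who did I talk to most in college", the model would pick the start and end dates for the college years, and the top entry of the ranked list is the answer.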
