Be Right Back

Overview

Our vector search section is helpful for retrieval given various prompts and queries. However, it is not quite at the point where it is "agentic" (buzzword flag).

The point is that for our Casper agent to be a bit more helpful, we need tooling and guard rails that actually surface useful information, not just the best-matching chunk, which is basically all the vector search portion gives us.

Agent Loop

You can kinda think about this section as Casper's heart. This is the core agent loop that runs, has a set of tools, chooses to invoke said tools, and then eventually returns a response to the user.

Casper - our beloved yet slightly dark mascot for this time suck of a side project - runs up to 5 iterations, although that is configurable. It basically is this...

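pub async fn ask_with_history_progress(
    &mut self,
    question: &str,
    history: &[ChatMessage],
    progress: Option<&dyn Fn(AgentProgressEvent)>,
) -> Result<AgentResponse> {
    // Enrich question with temporal extraction, contact prefetching
    let enriched = self.enrich_question(question).await?;

    // Conversation = system prompt + prior turns + the enriched question
    let mut messages = vec![ChatMessage::system(&self.build_system_prompt())];
    messages.extend(history.iter().cloned());
    messages.push(ChatMessage::user(&enriched));

    for iteration in 1..=self.config.max_iterations {
        // Let the caller know which iteration we are on
        if let Some(report) = progress {
            report(AgentProgressEvent::Thinking {
                iteration,
                max_iterations: self.config.max_iterations,
            });
        }

        let response = self
            .engine
            .generate(&messages, &self.config.generation_config)
            .await?;

        let parsed = parse_tool_calls(&response)?;

        if parsed.tool_calls.is_empty() {
            // Model returned a direct answer
            return Ok(AgentResponse::from_text(parsed.text));
        }

        // Execute tool calls, append results, continue loop
        for tool_call in &parsed.tool_calls {
            let result = self.execute_tool_call(tool_call).await?;
            messages.push(ChatMessage::tool_response(result));
        }
    }

    // Max iterations reached: synthesize from what we have
    self.synthesize_answer(&messages).await
}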

Hating on "basically" there? Fine, this is the actual function:

    /// Ask with conversation history and a progress callback.
    ///
    /// Same as `ask_with_history` but emits `AgentProgressEvent`s in real-time
    /// via the provided callback, allowing callers to forward events to a UI.
    pub async fn ask_with_history_progress<F: FnMut(AgentProgressEvent)>(
        &mut self,
        question: &str,
        history: &[ChatMessage],
        mut progress: F,
    ) -> Result<AgentResponse> {
        let start = Instant::now();

        let system_prompt = self.build_system_prompt();
        let max_iterations = self.config.max_iterations;
        let generation_config = self.config.generation_config.clone();
        let mut messages = vec![ChatMessage::system(&system_prompt)];

        // TODO(@larkin): would love not to clone here
        messages.extend(history.iter().cloned());

        // yeah, i know this is bad... 
        // 
        // for inferential questions (relationship, frequency, "who" questions),
        // automatically prefetch top contacts so the LLM has frequency data
        // without needing to call list_contacts itself. This compensates for
        // smaller models that may not follow multi-step tool strategies.
        let enriched_question = if Self::is_inferential_question(question) {
            info!("Detected inferential question, prefetching top contacts");
            match self.prefetch_top_contacts() {
                Some(context) => format!("{}\n\n{}", question, context),
                None => question.to_string(),
            }
        } else {
            question.to_string()
        };

        // Add the current question (with optional context enrichment)
        messages.push(ChatMessage::user(&enriched_question));

        info!(question = question, "Starting agent loop");
        let mut engine = self.engine.take().ok_or(InferenceError::NotInitialized)?;
        debug!("Using Qwen3 model for agent");
        let model_to_load = Some(DEFAULT_MLX_AGENT_MODEL);

        // Load the model (hot-swap if needed)
        if let Err(e) = engine.load_model_with_override(model_to_load) {
            self.engine = Some(engine);
            return Err(e);
        }

        let mut all_tool_calls = Vec::new();
        let mut iterations = 0;
        let mut previous_responses: Vec<String> = Vec::new();
        let mut last_tool_call_json: Option<String> = None;

        let result = loop {
            iterations += 1;

            if iterations > max_iterations {
                warn!(
                    iterations = iterations,
                    max = max_iterations,
                    "Exceeded max iterations, running synthesis pass"
                );

                // if we've hit our max, we'll just synthesize and report back
                messages.push(ChatMessage::user(prompts::SYNTHESIS_PROMPT));
                let synthesis = match engine.generate(&messages, &generation_config).await {
                    Ok(r) => {
                        info!(
                            raw_len = r.len(),
                            raw_preview = %r.chars().take(300).collect::<String>(),
                            "Synthesis pass raw response"
                        );
                        // Parse to extract just the text (ignore any accidental tool calls)
                        let text = match parse_tool_calls(&r) {
                            Ok(parsed) => {
                                info!(
                                    has_thinking = parsed.thinking.is_some(),
                                    has_tool_calls = parsed.has_tool_calls,
                                    text_len = parsed.final_text().len(),
                                    "Synthesis pass parsed"
                                );
                                parsed.final_text()
                            }
                            Err(_) => r,
                        };
                        // Guard against empty synthesis (model generated only <think> block
                        // with no answer text). Fall back to tool result summary.
                        if text.trim().is_empty() {
                            warn!("Synthesis pass returned empty text, falling back to tool result summary");
                            Self::synthesize_from_tool_results(&all_tool_calls)
                        } else {
                            text
                        }
                    }
                    Err(e) => {
                        warn!(error = %e, "Synthesis generation failed, using fallback");
                        "I gathered some information but couldn't complete the full analysis. Please try a more specific question.".to_string()
                    }
                };

                break Ok(AgentResponse {
                    answer: synthesis,
                    tool_calls: all_tool_calls,
                    iterations,
                    duration_ms: start.elapsed().as_millis() as u64,
                    thinking: None,
                });
            }

            progress(AgentProgressEvent::Thinking {
                iteration: iterations,
                max_iterations,
            });

            debug!(iteration = iterations, "Generating LLM response");
            let response = match engine.generate(&messages, &generation_config).await {
                Ok(r) => r,
                Err(e) => {
                    self.engine = Some(engine);
                    return Err(e);
                }
            };

            info!(
                iteration = iterations,
                response_len = response.len(),
                response_preview = %response.chars().take(300).collect::<String>(),
                "LLM response received"
            );

            // Parse the response for tool calls first (before repetition check)
            // This way we can distinguish between repeated tool calls and repeated answers
            let parsed = match parse_tool_calls(&response) {
                Ok(p) => p,
                Err(e) => {
                    self.engine = Some(engine);
                    return Err(e);
                }
            };

            if parsed.has_tool_calls {
                let current_tool_json = serde_json::to_string(&parsed.tool_calls).ok();
                if let (Some(current), Some(last)) = (&current_tool_json, &last_tool_call_json)
                    && current == last
                {
                    let last_result = all_tool_calls
                        .last()
                        .map(|r| {
                            if r.result.success {
                                format!(
                                    "Based on the tool results, I couldn't find the information you're looking for. {}",
                                    r.result.data.get("note").and_then(|n| n.as_str()).unwrap_or("")
                                )
                            } else {
                                "I encountered an error while searching. Please try again.".to_string()
                            }
                        })
                        .unwrap_or_else(|| "I was unable to find the requested information.".to_string());

                    break Ok(AgentResponse {
                        answer: last_result,
                        tool_calls: all_tool_calls,
                        iterations,
                        duration_ms: start.elapsed().as_millis() as u64,
                        thinking: parsed.thinking.clone(),
                    });
                }
                last_tool_call_json = current_tool_json;

                info!(
                    count = parsed.tool_calls.len(),
                    iteration = iterations,
                    "Executing tool calls"
                );

                // Inject embeddings for search_messages calls.
                // Run on a blocking thread because embed_query() does synchronous
                // socket I/O to the Python daemon, which would starve the tokio
                // executor under concurrent queries.
                let question_owned = question.to_string();
                let tool_calls = parsed.tool_calls;
                let (engine_back, calls_result) =
                    tokio::task::spawn_blocking(move || {
                        let result = CasperAgent::inject_embeddings(
                            &mut engine,
                            tool_calls,
                            &question_owned,
                        );
                        (engine, result)
                    })
                    .await
                    .map_err(|e| {
                        InferenceError::BackendError(format!(
                            "embedding task panicked: {e}"
                        ))
                    })?;
                engine = engine_back;
                let calls = match calls_result {
                    Ok(c) => c,
                    Err(e) => {
                        self.engine = Some(engine);
                        return Err(e);
                    }
                };

                let (mut records, _) = self.execute_tool_calls_with_progress(&calls, iterations, &mut progress);

                // here's our lovely cross encoder rerank
                Self::rerank_search_results(&mut engine, &mut records);
                let formatted_response = Self::format_tool_records(&records);
                all_tool_calls.extend(records);

                // Add assistant message with ONLY the tool calls in Qwen3.5 XML format
                // Don't include any hallucinated text that came after the tool call
                let tool_call_xml = parser::format_tool_calls_xml(&calls);
                messages.push(ChatMessage::assistant(&tool_call_xml));
                messages.push(ChatMessage::tool_response(&formatted_response));
                previous_responses.push(response.clone());
            } else {
                let answer = parsed.final_text();

                // yes yes... i know this is also jank
                // small model problems
                //
                // Detect empty responses (model generated only <think></think> + EOS).
                // Common with small quantized models on simple/meta questions.
                // Nudge the model to actually respond.
                if iterations < max_iterations && answer.trim().is_empty() {
                    warn!(
                        iteration = iterations,
                        "Detected empty response, nudging model to answer"
                    );

                    messages.push(ChatMessage::assistant(&response));
                    messages.push(ChatMessage::user(
                        "Please respond to the user's question directly with a helpful answer.",
                    ));
                    previous_responses.push(response.clone());
                    continue;
                }

                // Detect when the model narrates intent to search instead of
                // actually calling tools (common small-model failure mode).
                // If the answer reads like a plan ("Let me search...", "I'll try...")
                // and we haven't exhausted iterations, nudge it to act.
                if iterations < max_iterations && Self::is_narrated_intent(&answer) {
                    warn!(
                        answer_preview = %answer.chars().take(120).collect::<String>(),
                        iteration = iterations,
                        "Detected narrated intent instead of tool call, nudging model to act"
                    );

                    messages.push(ChatMessage::assistant(&response));
                    messages.push(ChatMessage::user(
                        "Don't describe what you plan to do — actually call the tool now. \
                         Use <tool_call> to search.",
                    ));
                    previous_responses.push(response.clone());
                    continue;
                }

                progress(AgentProgressEvent::Answering);
                let mut answer = answer;
                if Self::is_repetitive(&answer, &previous_responses) {
                    warn!(
                        answer_preview = %answer.chars().take(100).collect::<String>(),
                        "Detected repetitive text answer, synthesizing from tool results"
                    );
                    answer = Self::synthesize_from_tool_results(&all_tool_calls);
                }
                if answer.trim().is_empty() {
                    warn!("Final answer is empty, falling back to tool result summary");
                    answer = Self::synthesize_from_tool_results(&all_tool_calls);
                }

                info!(
                    iterations = iterations,
                    answer_len = answer.len(),
                    "Agent reached final answer"
                );

                let duration_ms = start.elapsed().as_millis() as u64;

                break Ok(AgentResponse {
                    answer,
                    tool_calls: all_tool_calls,
                    iterations,
                    duration_ms,
                    thinking: parsed.thinking,
                });
            }
        };
        self.engine = Some(engine);
        result
    }

Loop Protections

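I'm not going to go into each of them, but if you read through the full function above, you'll note that we have some small model problems that we try to handle: empty responses, narrated intent instead of an actual tool call, the same tool call repeated back-to-back, and repetitive answers. I'm confident by Qwen 4.5 we won't need to handle this... but feel free to rip my / Claude's code apart.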
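For a feel of what two of those guards could look like, here is a minimal sketch. The function names match the calls in the loop above, but the bodies are guesses: the phrase lists, the 120-character window, and the exact-match repetition check are all assumptions, not Casper's actual heuristics.

```rust
/// Collapse whitespace and lowercase so trivially different strings compare equal.
fn normalize(text: &str) -> String {
    text.split_whitespace()
        .collect::<Vec<_>>()
        .join(" ")
        .to_lowercase()
}

/// Hypothetical stand-in for `is_narrated_intent`: true when a reply opens
/// like a plan ("Let me search...") instead of an answer.
pub fn is_narrated_intent(answer: &str) -> bool {
    // Assumed phrase lists; the real heuristic may look nothing like this.
    const OPENERS: &[&str] = &["let me ", "i'll ", "i will ", "i'm going to ", "i need to "];
    const ACTIONS: &[&str] = &["search", "look", "check", "try", "find"];

    let text = normalize(answer);
    // Plans announce themselves early, so only inspect the start of the reply.
    let head: String = text.chars().take(120).collect();

    OPENERS.iter().any(|opener| head.starts_with(*opener))
        && ACTIONS.iter().any(|action| head.contains(*action))
}

/// Hypothetical stand-in for `is_repetitive`: true when the answer matches an
/// earlier response once both are normalized.
pub fn is_repetitive(answer: &str, previous: &[String]) -> bool {
    let current = normalize(answer);
    !current.is_empty() && previous.iter().any(|earlier| normalize(earlier) == current)
}
```

Both checks are cheap string work on purpose: they run on every iteration, and a false positive only costs one extra nudge or a fallback summary.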

Tools

What is an agent without tools? A pretty fucking bad agent.

Casper has 5 tools - these are all Rust functions.

`search_messages`

This is the primary tool. It calls the search pipeline basically with this payload:

{
  "name": "search_messages",
  "parameters": {
    "query": "Thanksgiving dinner plans",
    "contact_id": "+15551234567",
    "limit": 7,
    "min_date": "2025-11-01",
    "max_date": "2025-11-30"
  }
}

Before the search is actually executed, we inject an embedding vector for the query. This is transparent to the LLM: the model writes the query text, and the agent handles the embedding.
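A minimal sketch of that injection step, assuming a simplified `ToolCall` shape. The `Value` enum, the `query_embedding` key, the `embed` closure, and the fall-back to the user's question are all hypothetical; the real `inject_embeddings` takes the engine and returns a `Result`.

```rust
use std::collections::HashMap;

/// Simplified parameter value (hypothetical; the real code likely uses JSON values).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Text(String),
    Int(i64),
    Vector(Vec<f32>),
}

#[derive(Debug, Clone)]
pub struct ToolCall {
    pub name: String,
    pub parameters: HashMap<String, Value>,
}

/// Attach a query embedding to every `search_messages` call.
/// `embed` stands in for whatever backend produces the vector.
pub fn inject_embeddings<F>(mut calls: Vec<ToolCall>, question: &str, mut embed: F) -> Vec<ToolCall>
where
    F: FnMut(&str) -> Vec<f32>,
{
    for call in calls.iter_mut().filter(|c| c.name == "search_messages") {
        // Prefer the model's own query text; fall back to the user's question
        // (an assumption about why the question is passed in at all).
        let text = match call.parameters.get("query") {
            Some(Value::Text(q)) if !q.trim().is_empty() => q.clone(),
            _ => question.to_string(),
        };
        call.parameters
            .insert("query_embedding".to_string(), Value::Vector(embed(&text)));
    }
    calls
}
```

The model never sees or produces the vector, so the tool schema it is prompted with stays small.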

`list_contacts`

Lists contacts. The model can inspect the listed result.

`get_contact_stats`

This gets the aggregated, detailed stats for a specific contact: message counts, date ranges, emoji usage, average response time, yadda yadda. This is the precomputed data in brb.db.

`get_conversation_context`

Retrieves messages around a specific point in time. Useful for "what happened around / after X".

`get_message_volume_by_period`

Aggregates message volume per contact in a date range. Used for "life period" questions like "who did i talk to most in college".
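To make the shape of that aggregation concrete, here is an in-memory sketch. The `Message` struct and the function signature are hypothetical, and the real tool presumably queries brb.db rather than scanning a slice.

```rust
use std::collections::HashMap;

/// Hypothetical minimal message row; `date` is an ISO `YYYY-MM-DD` string.
pub struct Message<'a> {
    pub contact_id: &'a str,
    pub date: &'a str,
}

/// Count messages per contact inside `[min_date, max_date]`, busiest first.
pub fn message_volume_by_period(
    messages: &[Message<'_>],
    min_date: &str,
    max_date: &str,
) -> Vec<(String, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    // ISO dates sort lexicographically, so plain string comparison is enough.
    for m in messages
        .iter()
        .filter(|m| m.date >= min_date && m.date <= max_date)
    {
        *counts.entry(m.contact_id).or_insert(0) += 1;
    }

    let mut ranked: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(contact, n)| (contact.to_string(), n))
        .collect();
    // Highest volume first; ties broken by contact id for stable output.
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    ranked
}
```

For a "college" question, the planner would still have to turn the life period into concrete dates before this runs.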

Query Planning

A challenging part of this was - again - dealing with small-model quirks around things you get for free with leading models. For example, our search_messages payload has a min_date and max_date. It would make sense for the model to just see that and infer what should be searched but... I found that qwen / casper didn't always love that approach.

So while a little jank, a planner step takes a crack at extracting the date range. A friend from OpenAI always touted that you'd be stunned how much regex still is propping up core workflows. Our approach is therefore regex-first, then LLM fallback.

Regex first

Again! I know!! This is not the best. I have 18 different regex matching patterns. This is that sweet sweet deterministic code that I can write basic unit tests for and cleanly understand. A snippet:

use std::sync::LazyLock;

use regex::Regex;

// Relative-date phrases, matched case-insensitively on word boundaries.
static RE_YESTERDAY: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"(?i)\byesterday\b").unwrap());
static RE_TODAY: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"(?i)\btoday\b").unwrap());
static RE_LAST_WEEK: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"(?i)\blast\s+week\b").unwrap());
static RE_THIS_WEEK: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"(?i)\bthis\s+week\b").unwrap());
static RE_LAST_MONTH: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"(?i)\blast\s+month\b").unwrap());
static RE_THIS_MONTH: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"(?i)\bthis\s+month\b").unwrap());
static RE_THIS_YEAR: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"(?i)\bthis\s+year\b").unwrap());
static RE_LAST_YEAR: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"(?i)\blast\s+year\b").unwrap());
// Explicit years such as "in 2021"; capture group 1 holds the year.
static RE_IN_YEAR: LazyLock<Regex> = LazyLock::new(|| Regex::new(r"(?i)\bin\s+(20\d{2})\b").unwrap());
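Matching a phrase is only half the job; the match still has to become a min_date / max_date pair. Here is a rough sketch of that second half for the "in 2021" case. To keep it dependency-free it approximates RE_IN_YEAR with plain string scanning, and both function names are made up for illustration.

```rust
/// Std-only approximation of `(?i)\bin\s+(20\d{2})\b`: find "in" followed by
/// a four-digit year starting with 20. Hypothetical helper, not Casper's code.
pub fn extract_in_year(question: &str) -> Option<u16> {
    let lower = question.to_lowercase();
    let mut words = lower.split_whitespace().peekable();

    while let Some(word) = words.next() {
        // Ignore leading punctuation such as an opening quote.
        if word.trim_start_matches(|c: char| !c.is_alphanumeric()) != "in" {
            continue;
        }
        if let Some(next) = words.peek() {
            // Ignore trailing punctuation such as "2021?" or "2021,".
            let token = next.trim_end_matches(|c: char| !c.is_alphanumeric());
            if token.len() == 4
                && token.starts_with("20")
                && token.chars().all(|c| c.is_ascii_digit())
            {
                return token.parse::<u16>().ok();
            }
        }
    }
    None
}

/// Expand a calendar year into the inclusive (min_date, max_date) pair that
/// the `search_messages` payload expects.
pub fn year_range(year: u16) -> (String, String) {
    (format!("{year}-01-01"), format!("{year}-12-31"))
}
```

Relative phrases like "last week" need today's date as an input, which is why they are left out of this sketch.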

Fallback

This regex does not catch everything. Think about a query like "when covid started". I'm not gonna catch that.

I threw a query golden set together, and regex scored 21/31 (68%). So now when regex comes up empty, a temporal_hint check asks "does this mention time?". If it does, we make one LLM call to try and extract the range as JSON.
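The hint check can be very dumb, because all it has to do is decide whether the LLM call is worth making. A minimal sketch, with a made-up function name and word list (Casper's actual list is not shown here):

```rust
/// Hypothetical temporal-hint gate: does the question mention time at all?
pub fn has_temporal_hint(question: &str) -> bool {
    // Assumed vocabulary; tune against the golden set.
    const HINT_WORDS: &[&str] = &[
        "when", "during", "before", "after", "since", "until", "ago", "year",
        "month", "week", "summer", "winter", "spring", "fall", "started", "college",
    ];

    let lower = question.to_lowercase();

    // Whole-word match so "fall" does not fire on "waterfall".
    let mentions_word = lower
        .split(|c: char| !c.is_alphanumeric())
        .any(|word| HINT_WORDS.contains(&word));

    // Any standalone four-digit run that looks like a year (19xx or 20xx).
    let mentions_year = lower
        .split(|c: char| !c.is_ascii_digit())
        .any(|run| run.len() == 4 && (run.starts_with("19") || run.starts_with("20")));

    mentions_word || mentions_year
}
```

A false positive here only costs one extra LLM call, so it makes sense to err on the side of a generous list.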

Merge: regex always wins

Our LLM is again a fallback case, but we do end up hitting it a good chunk of the time. If regex found a hit, we keep it and stop there. But you'll note that our hybrid approach does way better on our golden set, mostly on paraphrases.

Net result on the golden set:

subset            regex only   hybrid
in-distribution   100%         100%
paraphrases       17%          75%
overall           68%          90%
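The merge rule itself fits in a few lines. This sketch passes the three stages in as closures so the ordering is easy to see and test; `DateRange` and `plan_date_range` are illustrative names, not the real planner API.

```rust
/// Inclusive date range in the same shape as the search payload.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DateRange {
    pub min_date: String,
    pub max_date: String,
}

/// Regex first, then (only if the question hints at time) one LLM attempt.
pub fn plan_date_range<R, H, L>(
    question: &str,
    regex_extract: R,
    has_hint: H,
    llm_extract: L,
) -> Option<DateRange>
where
    R: Fn(&str) -> Option<DateRange>,
    H: Fn(&str) -> bool,
    L: FnOnce(&str) -> Option<DateRange>,
{
    // Regex always wins: a deterministic hit is never overridden.
    if let Some(range) = regex_extract(question) {
        return Some(range);
    }
    // No mention of time at all: skip the LLM call entirely.
    if !has_hint(question) {
        return None;
    }
    // Single fallback attempt; a failed extraction just means "no date filter".
    llm_extract(question)
}
```

Taking the LLM stage as `FnOnce` makes the "one call" budget part of the signature.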

Recency Boost

One other thing that was a bit of a product decision: if we detect recency phrases like "last" or "latest", we set prefer_recent = true. After retrieval, results get an exponential decay multiplier based on their age:

\text{decay} = e^{-0.03 \cdot \text{age}_{\text{days}}}
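In code the formula is a one-liner. The sketch below assumes the multiplier is applied to a retrieval score and that negative ages are clamped to zero; both are assumptions, as are the function names. With a rate of 0.03 per day the half-life works out to ln(2) / 0.03, or roughly 23 days.

```rust
/// Decay rate from the formula above, per day of message age.
pub const DECAY_RATE_PER_DAY: f64 = 0.03;

/// decay = e^(-0.03 * age_days). Negative ages (clock skew) are clamped to 0.
pub fn recency_multiplier(age_days: f64) -> f64 {
    (-DECAY_RATE_PER_DAY * age_days.max(0.0)).exp()
}

/// Apply the multiplier only when the planner set `prefer_recent`.
/// Multiplying the retrieval score is an assumption about how it is used.
pub fn boosted_score(score: f64, age_days: f64, prefer_recent: bool) -> f64 {
    if prefer_recent {
        score * recency_multiplier(age_days)
    } else {
        score
    }
}
```

A message from today keeps its full score, one from a month ago keeps about 41%, and anything a year old is effectively out of the running.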

Tool Calling Format

Again, we are on the incredible Qwen3.5 model. However! When I started this project we were using Qwen3. That's how fucking long it's taken me.

The big reason for bringing this up is because of the tool calling support and formatting changes.

Qwen3 wrapped tool calls as JSON inside <tool_call> tags. This is decently familiar, at least from what I've seen.

<tool_call>
{"name": "search_messages", "arguments": {"query": "thanksgiving plans", "limit": 7}}
</tool_call>

Qwen3.5, however, axed that. I'm not sure why - although I have read things about how xml is preferred by language models.

MXML Tool Calls: https://docs.morphllm.com/

However, xml is obviously a bit more token-inefficient than json... Regardless! The equivalent call would be like so:

<tool_call>
<function=search_messages>
<parameter=query>
thanksgiving plans
</parameter>
<parameter=limit>
7
</parameter>
</function>
</tool_call>
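Since the agent loop writes tool calls back into the history in this format, here is a sketch of a formatter that emits exactly the layout above. It is loosely modeled on the idea behind `parser::format_tool_calls_xml`, but the signature and the flat string parameters are simplifications of my own.

```rust
/// Render one tool call in the Qwen3.5-style XML layout shown above.
/// Parameters are passed as (name, value) pairs, already stringified.
pub fn format_tool_call_xml(name: &str, params: &[(&str, &str)]) -> String {
    let mut out = String::from("<tool_call>\n");
    out.push_str(&format!("<function={name}>\n"));
    for (key, value) in params {
        // Each parameter sits on its own lines between open and close tags.
        out.push_str(&format!("<parameter={key}>\n{value}\n</parameter>\n"));
    }
    out.push_str("</function>\n</tool_call>");
    out
}
```

Calling `format_tool_call_xml("search_messages", &[("query", "thanksgiving plans"), ("limit", "7")])` reproduces the example block line for line.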

Parser Dialects

A borderline insane amount of tech debt for a one-year-old side research project. Perhaps it's a symptom of the times, perhaps it's agentic tooling. However, I did not always see reliable format emission from Qwen3.5.

There's a confirmed upstream badcase where 3.5 flip-flops between JSON and XML tool calls — worst on the 9B model with reasoning mode on. This is exactly the configuration Casper runs. The maintainers reproduced it, tagged it badcase-confirmed, and the issue is now closed against the Qwen3.6 repo, so it looks like 3.6 is meant to put this to bed?? Really not sure and haven't yet ripped a local 3.6 release.

But the TLDR is that our parser still handles four formats:

Qwen3.5 XML — <tool_call><function=name><parameter=p>val</parameter></function></tool_call> (the native format)
Legacy wrapped — <|tool_call|>[{...}]<|/tool_call|>
Qwen3 JSON wrapped — <tool_call>{"name": ..., "arguments": ...}</tool_call>
Bare JSON — [{"name": ..., "parameters": ...}] with no tags at all
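One way to structure this is to sniff the dialect first and then hand off to a dialect-specific parser. The sketch below covers only the sniffing step; the `Dialect` enum and `detect_dialect` are hypothetical names, and it assumes any <think> block has already been stripped from the response.

```rust
/// The four tool-call formats listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Dialect {
    Qwen35Xml,
    LegacyWrapped,
    Qwen3Json,
    BareJson,
}

/// Guess which dialect a raw model response uses, or `None` for plain text.
pub fn detect_dialect(response: &str) -> Option<Dialect> {
    const OPEN_TAG: &str = "<tool_call>";

    // Legacy wrapper uses pipe-delimited special tokens.
    if response.contains("<|tool_call|>") {
        return Some(Dialect::LegacyWrapped);
    }

    // Both Qwen dialects share the tag; the first thing inside tells them apart.
    if let Some(start) = response.find(OPEN_TAG) {
        let body = response[start + OPEN_TAG.len()..].trim_start();
        return if body.starts_with("<function=") {
            Some(Dialect::Qwen35Xml)
        } else if body.starts_with('{') {
            Some(Dialect::Qwen3Json)
        } else {
            None
        };
    }

    // No tags at all: accept a JSON array that at least names a tool.
    let trimmed = response.trim();
    if trimmed.starts_with('[') && trimmed.contains("\"name\"") {
        return Some(Dialect::BareJson);
    }
    None
}
```

Checking the tagged formats before bare JSON matters, because a tagged call can contain a JSON array or object that would otherwise look like the untagged dialect.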

Daemon Architecture

Casper runs on its own inference daemon, separate from the persona chat. It loads the base model with no LoRA adapter, because it doesn't need to sound like you - it just needs to be able to use tools. That separation means that you can run a search while a persona reply is mid-generation. The daemon lifecycle itself is its own rabbit hole. I'll discuss that more in the inference system section.

Casper - The Agent System