The first, second, and third posts in the Engineering Neura series covered what Neura's whole-turn LangGraph runtime is designed to do, and how to trace what it actually did. This fourth and final post goes one level deeper: not just which tools ran and which LLM calls fired, but which pieces of the assembled context the model reported drawing on when it produced its response.
In a regulated clinical product, "the model probably used the context" is not acceptable. When a care coordinator asks why the agent surfaced a particular provider, or a compliance reviewer audits a high-stakes recommendation, the answer needs to reference something specific and traceable. This post covers the pattern that makes that possible: an inline citation grammar the model embeds in its responses, combined with a field-type-aware matcher that provides secondary evidence. Neither track proves causal use of the context. Together they produce an inspectable, per-field attribution surface that reviewers can actually work with.
Citations could instead be returned as a separate JSON field via llm.with_structured_output(CitedResponse), with a schema of text: str; citations: list[Citation]. This codebase ships the inline-marker grammar described below for three reasons. First, a marker costs six tokens per citation, whereas structured output adds a JSON wrapper to every response. Second, markers stream naturally inside text deltas, whereas structured output requires JSON-mode buffering that breaks token-by-token UI rendering. Third, after a calibration pass, markers were reliable enough on the small models we use for cheap turns; this needs to be validated per model and prompt. Where streaming and per-turn cost are not constraints, with_structured_output is the simpler shape.
The problem
Context injection is fire-and-forget. The agent's system prompt receives an assembled bundle: extracted client signal, provider attributes, safety status, retrieved patient records. The model can produce a polished response that has nothing to do with any of it. In a regulated domain this matters more than usual. When a clinician asks why the agent recommended a particular provider, the answer needs to be grounded in something specific even if the underlying use is not provable from the response alone.
Naive observability is substring matching: scan the response for any field value from the context and call it a hit. That fails on paraphrase. It fails on numeric fields because 14 appears in 2014. It fails on summary fields because long blobs share too many incidental words with any response. Word boundaries fix the substring-collision class of bug (14 inside 2014, risk inside brisk), but they do not fix the semantic-ambiguity class: under \belevated\b, an enum value of elevated still matches a response that says "elevated mood swings." Common-word enums therefore also need citation support, surrounding-label checks, section-aware matching, or capped confidence rather than the regex alone.
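The two failure classes can be seen side by side in a minimal sketch. The helper names naiveHit and boundaryHit are hypothetical and exist only for illustration; they are not the shipped matchers.

```typescript
// Hypothetical helpers for illustration; not part of the shipped matcher.

// Naive substring check: any occurrence of the value counts as a hit.
function naiveHit(value: string, response: string): boolean {
  return response.toLowerCase().includes(value.toLowerCase());
}

// Word-boundary check: the value must appear as a whole token.
function boundaryHit(value: string, response: string): boolean {
  const escaped = value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`\\b${escaped}\\b`, "i").test(response);
}

// Substring collisions: the naive check fires, the bounded check does not.
const collisions = [
  naiveHit("14", "Diagnosed in 2014."),
  boundaryHit("14", "Diagnosed in 2014."),
  naiveHit("risk", "A brisk walk helps."),
  boundaryHit("risk", "A brisk walk helps."),
];

// Semantic ambiguity: the bounded check still fires on an unrelated phrase.
const ambiguous = boundaryHit("elevated", "She reports elevated mood swings.");
```

The last call is the case the regex cannot resolve on its own: the word is present as a whole token, but nothing in the match says it refers to the risk field.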
The pattern in this post is two-track. Track one is an inline citation grammar the model emits ([[CS:chief_complaint]]) when it reports being informed by a field; this is a strong attribution signal but it is self-reported and can be missing, fabricated, or duplicated. Track two is a field-type-aware matcher over the response text. Both produce the same shape: per-field confidence with a labeled match method, suitable for a reviewer UI.
The citation grammar
The model is given a terse instruction in its system prompt: when your response is informed by a context field, embed a marker right after the clause. Five short prefixes cover the five context tools.
agent_schemas/neura/agent.yaml
context_attribution: |
  When your response is informed by assembled context data, embed inline
  citation markers using the format [[TOOL:field_name]].
  Citation key prefixes:
  - CS = client_signal e.g. [[CS:chief_complaint]]
  - PG = provider_genome e.g. [[PG:severity_level]]
  - PC = patient_context e.g. [[PC:budget_range]]
  - SF = safety e.g. [[SF:risk_level]]
  - TF = therapeutic_fit e.g. [[TF:fit_score]]
  Rules:
  - Only cite fields you actually used. Do not fabricate citations.
  - Place the marker immediately after the clause that references the field.
  - If no assembled context was used for a statement, do not add a marker.
The prefix-and-field grammar is short enough that small models emit it acceptably in our calibration set, which matters because not every chat turn runs on the most expensive tier. Adoption rate per (model, prompt) should be measured before relying on it in production. Markers are embedded inline, immediately after the clause they justify. That gives the UI the option to render citations as inline footnotes without needing the model to produce any extra structure.
The instruction is gated behind an ops toggle, so we can turn citations on and off per environment without redeploying. It is included in the system prompt only when the toggle is on, which keeps the prompt small for products that do not need the feature.
Parsing citations on the receiving end
The frontend strips citation markers before display and keeps the structured citation list separately. One regex covers the entire grammar.
agent-frontend/src/utils/contextAttribution.ts
const CITATION_REGEX = /\[\[(CS|PG|PC|SF|TF):([a-z_][a-z0-9_]*)\]\]/gi;

export function parseCitationMarkers(text: string): ParsedCitations {
  const citations: Citation[] = [];
  let m: RegExpExecArray | null;
  // The regex is global, so reset its cursor before scanning a new string.
  CITATION_REGEX.lastIndex = 0;
  while ((m = CITATION_REGEX.exec(text)) !== null) {
    citations.push({
      tool: m[1].toUpperCase(),
      field: m[2].toLowerCase(),
      position: m.index,
    });
  }
  // Strip the markers, then collapse the whitespace runs they leave behind.
  const cleanText = text.replace(CITATION_REGEX, "").replace(/\s{2,}/g, " ");
  return { cleanText, citations };
}
The same stripping runs in the message renderer, so users never see [[CS:chief_complaint]]. The structured citation list flows into the diagnostics panel as the primary attribution signal. Two cleanups happen here too: duplicate markers for the same (tool, field) within one response collapse to a single citation entry, and the parser is run only on the assistant turn, not on user input, so adversarial text containing citation-like strings in a user message never enters the citation list.
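Because markers sit right after the clause they justify, a UI could also swap each marker for a numbered footnote instead of stripping it. The following is a minimal sketch of that option, not the shipped renderer; renderFootnotes and its return shape are hypothetical.

```typescript
// Hypothetical renderer: replaces each marker with a numbered footnote
// reference and returns the footnote keys in first-seen order.
function renderFootnotes(text: string): { text: string; footnotes: string[] } {
  // Same grammar as the parser, plus any whitespace directly before the
  // marker so the footnote attaches to the preceding word.
  const re = /\s*\[\[(CS|PG|PC|SF|TF):([a-z_][a-z0-9_]*)\]\]/gi;
  const footnotes: string[] = [];
  const out = text.replace(re, (_match: string, tool: string, field: string) => {
    const key = `${tool.toUpperCase()}:${field.toLowerCase()}`;
    let n = footnotes.indexOf(key);
    if (n === -1) {
      footnotes.push(key);
      n = footnotes.length - 1;
    }
    // Repeated markers for the same field reuse the same footnote number.
    return `[${n + 1}]`;
  });
  return { text: out, footnotes };
}

const rendered = renderFootnotes(
  "Anxiety is the main concern [[CS:chief_complaint]]. " +
  "Risk is low [[SF:risk_level]], and anxiety persists [[CS:chief_complaint]]."
);
```

Reusing one number per (tool, field) pair mirrors the duplicate collapse described above, so the footnote list and the citation list stay the same length.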
Field types, not strings
A citation is a strong but self-reported signal. The matcher provides secondary evidence over the response text, and a "match" is field-type-dependent. elevated as an enum value is a word-boundary match. 0.85 as a numeric score wants to be matched as "85%". A 200-character summary needs a keyword threshold to avoid false positives. Treating all of these as substrings is the source of the bad behavior.
const ENUM_FIELDS = new Set([
  "risk_level", "intervention_mode", "recommended_action",
  "severity_level", "modality_preference", "care_path_preference",
  "provider_gender_preference", "complexity_level", "required_role",
]);
// Similar sets exist for NUMERIC_FIELDS (overall_score, fit_score,
// phq9_score, gad7_score, confidence) and SUMMARY_FIELDS (signal_summary,
// genome_summary, context_summary, assessment_summary).

function classifyField(key: string, value: unknown): FieldType {
  const leaf = key.split(".").pop() || key;
  if (ENUM_FIELDS.has(leaf)) return "enum";
  if (NUMERIC_FIELDS.has(leaf)) return "numeric";
  if (SUMMARY_FIELDS.has(leaf)) return "summary";
  if (Array.isArray(value)) return "list";
  return "field_value";
}
The classifier looks at the leaf field name, not the full path. The type registry stays small (a handful of Sets), and the same logical type works regardless of where it sits in the hierarchy.
Type-specific matchers
Each field type has a matcher tuned to how that type fails. Signatures are uniform: input is the value text and the response text, output is matched + confidence in [0, 1].
Enum: word-boundary plus simple normalization
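function matchEnum(value: string, response: string) {
  if (value.length < 2) return { matched: false, confidence: 0 };
  const escapeRe = (s: string) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Exact value as a whole word, case-insensitive.
  if (new RegExp(`\\b${escapeRe(value)}\\b`, "i").test(response)) {
    return { matched: true, confidence: 0.95 };
  }
  // snake_case value rephrased with spaces ("high_risk" -> "high risk").
  // This branch is word-bounded and case-insensitive too; a plain substring
  // check here would let "risk" match inside "brisk" again.
  const spaced = value.replace(/_/g, " ");
  if (spaced !== value &&
      new RegExp(`\\b${escapeRe(spaced)}\\b`, "i").test(response)) {
    return { matched: true, confidence: 0.85 };
  }
  return { matched: false, confidence: 0 };
}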
The word-boundary check eliminates a specific class of false positive: substring collisions like 14 matching inside 2014, or a value risk matching inside brisk. It does not resolve semantic ambiguity. Common enum words like elevated still match an unrelated phrase such as "elevated mood swings" under \belevated\b; that case needs a citation to anchor the field, a surrounding-label check (look for risk nearby), section-aware matching, or capped confidence. The underscore-to-space normalization handles enum values with snake_case names that the model rephrases naturally.
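One way to combine the surrounding-label check with capped confidence is sketched below. The function name, the 40-character window, and the 0.5 cap are illustrative assumptions, not calibrated values from this codebase.

```typescript
// Hypothetical refinement of matchEnum. The window size and the 0.5 cap are
// assumptions for illustration only.
function matchEnumNearLabel(
  value: string,
  label: string, // e.g. "risk" for the risk_level field
  response: string,
  windowChars = 40,
): { matched: boolean; confidence: number } {
  if (value.length < 2 || label.length < 2) {
    return { matched: false, confidence: 0 };
  }
  const escapeRe = (s: string) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const valueRe = new RegExp(`\\b${escapeRe(value)}\\b`, "gi");
  const labelRe = new RegExp(`\\b${escapeRe(label)}\\b`, "gi");

  // Record where the label occurs as a whole word.
  const labelPositions: number[] = [];
  let m: RegExpExecArray | null;
  while ((m = labelRe.exec(response)) !== null) labelPositions.push(m.index);

  // For each whole-word occurrence of the value, look for a label whose
  // start lies within windowChars of the value's start.
  let sawValue = false;
  while ((m = valueRe.exec(response)) !== null) {
    sawValue = true;
    const at = m.index;
    if (labelPositions.some(p => Math.abs(p - at) <= windowChars)) {
      return { matched: true, confidence: 0.95 };
    }
  }
  // Value present but no label nearby: keep the match, cap the confidence.
  return sawValue
    ? { matched: true, confidence: 0.5 }
    : { matched: false, confidence: 0 };
}

const anchored = matchEnumNearLabel("elevated", "risk", "Her risk level is elevated this week.");
const unanchored = matchEnumNearLabel("elevated", "risk", "She reports elevated mood swings.");
const absent = matchEnumNearLabel("elevated", "risk", "All good.");
```

The unanchored case still surfaces in the reviewer UI, but below any cited or label-anchored field, which is the behavior capped confidence is meant to produce.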
Numeric: word-boundary numbers and percentage reformatting
Same shape as matchEnum but with a numeric value escape, and one extra branch: when the value is in [0, 1], also check whether the response surfaces it as a percentage ("85%" or "85 percent") and return matched at slightly lower confidence. Word-boundary regexes prevent 14 from matching in 2014. We deliberately do not implement numeric tolerance ranges; the false-positive rate of fuzzy numeric matching outweighs the recall improvement.
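A minimal sketch of that description follows. It is not the shipped matcher: the 0.95 and 0.85 confidences are assumed to mirror matchEnum's tiers, and the sketch only handles non-negative values.

```typescript
interface MatchResult { matched: boolean; confidence: number; }

// Sketch only. Confidence values are assumptions borrowed from matchEnum;
// negative numbers are out of scope here.
function matchNumeric(value: number, response: string): MatchResult {
  if (!Number.isFinite(value) || value < 0) {
    return { matched: false, confidence: 0 };
  }
  // Literal form with the decimal point escaped, e.g. "14" or "0\.85".
  const literal = String(value).replace(/\./g, "\\.");
  if (new RegExp(`\\b${literal}\\b`).test(response)) {
    return { matched: true, confidence: 0.95 };
  }
  // Values in [0, 1] may be surfaced as a percentage: "85%" or "85 percent".
  if (value <= 1) {
    const pct = Math.round(value * 100);
    if (new RegExp(`\\b${pct}\\s?(?:%|percent\\b)`, "i").test(response)) {
      return { matched: true, confidence: 0.85 };
    }
  }
  return { matched: false, confidence: 0 };
}

const inYear = matchNumeric(14, "Seen since 2014.");
const exact = matchNumeric(14, "PHQ-9 score of 14.");
const asPercent = matchNumeric(0.85, "roughly an 85% fit");
const asWords = matchNumeric(0.85, "about 85 percent fit");
const collision = matchNumeric(0.85, "a 185% increase");
```

There is deliberately no tolerance branch: 0.85 matches "85%" but not "about 80%", in line with the decision above not to implement fuzzy numeric ranges.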
Summary: keyword threshold over meaningful words
function matchSummary(text: string, response: string) {
  if (text.length < 10) return { matched: false, confidence: 0 };
  const responseLower = response.toLowerCase();
  const textLower = text.toLowerCase();
  const words = (textLower.match(/\b[a-z]{4,}\b/g) || [])
    .filter(w => !STOPWORDS.has(w));
  if (words.length === 0) return { matched: false, confidence: 0 };
  const matchCount = words.filter(w => responseLower.includes(w)).length;
  const ratio = matchCount / words.length;
  return ratio >= 0.6
    ? { matched: true, confidence: Math.min(0.7, ratio) }
    : { matched: false, confidence: 0 };
}
Both inputs are lowercased before tokenization; otherwise the [a-z] word regex would silently drop capitalized tokens like "Anxiety", and the includes check would miss the same term in a different case. Lowercasing handles surface-form variation, not paraphrase: responseLower.includes(w) is a plain substring check, so the same word form must appear in the response, and a true rephrase ("feeling on edge" for "anxiety") will not match. Summaries are long blobs of free text, and they share too many incidental words with any plausible response; a flat substring match marks nearly every section as referenced. The 60% threshold over four-character-or-longer words with stopwords removed was the value that maximized agreement with human labels in our calibration set. It is not a universal constant, and it should be re-calibrated against your own labeled set before being trusted for review. The capped confidence (0.7 max) reflects that this signal is weaker than an enum word-boundary match.
Suggested-question matcher
Suggested questions are a different shape: assembled context contains gaps and proposes follow-ups, and we want to know whether the agent's response is asking a similar thing. Same pattern as matchSummary but with an expanded stopword set that includes question-shaped words (what, how, could, please) so two unrelated questions do not look similar just because both are questions.
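A sketch of that variant is below. The stopword list is a small illustrative sample rather than the shipped set, and reusing matchSummary's 0.6 threshold and 0.7 cap is an assumption.

```typescript
// Illustrative stopword sample; the real set is larger. The question-shaped
// words are the part that differs from matchSummary.
const QUESTION_STOPWORDS = new Set([
  "this", "that", "with", "from", "have", "your", "about",
  "what", "when", "where", "which", "could", "would", "should",
  "please", "tell",
]);

function matchSuggestedQuestion(
  question: string,
  response: string,
): { matched: boolean; confidence: number } {
  const responseLower = response.toLowerCase();
  // Words of four or more letters, minus general and question-shaped stopwords.
  const tokens: string[] = question.toLowerCase().match(/\b[a-z]{4,}\b/g) || [];
  const words = tokens.filter(w => !QUESTION_STOPWORDS.has(w));
  if (words.length === 0) return { matched: false, confidence: 0 };

  const hits = words.filter(w => responseLower.includes(w)).length;
  const ratio = hits / words.length;
  // Threshold and cap assumed to match matchSummary.
  return ratio >= 0.6
    ? { matched: true, confidence: Math.min(0.7, ratio) }
    : { matched: false, confidence: 0 };
}

const suggested = "What medication are you currently taking?";
const similar = matchSuggestedQuestion(
  suggested, "Could you tell me which medication you are currently taking?");
const unrelated = matchSuggestedQuestion(
  suggested, "What would you like to focus on today?");
```

With "what" removed, the suggested question reduces to medication, currently, and taking, so the unrelated follow-up scores zero even though both sentences open the same way.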
The combined attribution computation
Walk the leaves of each context section, classify each one, then apply citations first and the matcher second.
export function computeTypedAttribution(context, responseText) {
  const { cleanText, citations } = parseCitationMarkers(responseText);
  const cited = groupCitationsBySection(citations);
  const sections: Record<string, SectionAttribution> = {};
  let hasCitation = false, hasHeuristic = false;

  for (const section of CONTEXT_SECTIONS) {
    const refs: FieldAttribution[] = [];
    for (const leaf of extractTypedLeaves(context[section])) {
      const key = leaf.path.split(".").pop()!;
      if (cited[section]?.has(key)) {
        refs.push({ ...leaf, matchConfidence: 1.0, matchMethod: "citation" });
        hasCitation = true;
        continue;
      }
      const r = matchTypedLeaf(leaf, cleanText);
      if (r.matched) {
        refs.push({ ...leaf, matchConfidence: r.confidence,
                    matchMethod: r.method });
        hasHeuristic = true;
      }
    }
    sections[section] = summarize(refs);
  }

  const attributionSource = hasCitation && hasHeuristic ? "mixed"
    : hasCitation ? "citation" : "heuristic";
  return { sections, attributionSource };
}
The matchConfidence: 1.0 for citations reflects that a self-reported marker is the strongest available signal in this scheme, not that it establishes causal use of the field. Heuristic confidences cap below 1.0, so a UI sorted by confidence surfaces cited fields above heuristically matched ones. The attributionSource output (citation, heuristic, mixed) is the main fact a reviewer needs at a glance.
Citation guardrails
Self-reported attribution earns its keep when the system actively defends against the ways it can be wrong. Four checks run alongside the parser:
- Unknown-field validation. Citations whose (tool, field) pair does not appear in the assembled context for that turn are dropped from the citation list and counted in a drift metric. A spike means the model is fabricating field names or the frontend's known-field set is stale; both are actionable.
- Duplicate collapse. Multiple identical markers for the same field collapse to one citation entry. Position is kept from the first occurrence, so the inline footnote affordance still anchors correctly.
- Citation density cap. If the response contains more than one marker per ~25 tokens of clean text, the diagnostics panel flags it as over-cited. In practice this is the most reliable signal that the model has fallen into a citation loop, and it is a more useful symptom than any single fabricated field.
- Adversarial user text. The parser is only run on assistant turns. User messages containing citation-like substrings ([[CS:foo]]) never feed the citation list; they appear verbatim in the user bubble and are escaped before being shown.
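The first three checks can be sketched as a single pass over the parsed citations. This is an illustration under stated assumptions, not the shipped code: applyCitationGuardrails, the "TOOL:field" key format, and the whitespace-split token estimate are all hypothetical, and the density check counts raw markers before de-duplication.

```typescript
interface Citation { tool: string; field: string; position: number; }

interface GuardrailResult {
  citations: Citation[];   // validated and de-duplicated
  droppedUnknown: number;  // feeds the drift metric
  overCited: boolean;      // density-cap flag for the diagnostics panel
}

// knownFields holds "TOOL:field" keys present in this turn's assembled
// context (hypothetical format). Citations are assumed to arrive in
// position order, as a left-to-right parser would emit them.
function applyCitationGuardrails(
  citations: Citation[],
  knownFields: Set<string>,
  cleanText: string,
  tokensPerMarker = 25,
): GuardrailResult {
  const seen = new Set<string>();
  const kept: Citation[] = [];
  let droppedUnknown = 0;

  for (const c of citations) {
    const key = `${c.tool}:${c.field}`;
    if (!knownFields.has(key)) { droppedUnknown++; continue; } // unknown field
    if (seen.has(key)) continue;                               // duplicate: first occurrence wins
    seen.add(key);
    kept.push(c);
  }

  // Whitespace-separated words as a rough stand-in for tokens (assumption).
  const approxTokens = cleanText.split(/\s+/).filter(Boolean).length;
  // "More than one marker per N tokens" means markers * N > tokens.
  const overCited = citations.length * tokensPerMarker > approxTokens;

  return { citations: kept, droppedUnknown, overCited };
}

const known = new Set(["CS:chief_complaint"]);
const thirtyWords = "word ".repeat(30).trim();
const noisy = applyCitationGuardrails(
  [
    { tool: "CS", field: "chief_complaint", position: 10 },
    { tool: "CS", field: "chief_complaint", position: 50 },
    { tool: "CS", field: "emotional_state", position: 70 },
  ],
  known,
  thirtyWords,
);
const calm = applyCitationGuardrails(
  [{ tool: "CS", field: "chief_complaint", position: 10 }],
  known,
  thirtyWords,
);
```

The fourth check is not in the function because it is a call-site decision: the parser, and therefore this pass, is only ever invoked on assistant turns.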
Why deterministic matchers, not LLM-as-judge
The heuristic fallback could be an LLM evaluator: pass each (response, field) pair to a small model and ask whether the response references the field. We use typed matchers instead because they add no per-turn model calls, run in browser-side TypeScript with no round trip, and produce verdicts a clinical reviewer can inspect directly (matchEnum with a word-boundary regex is auditable; an LLM verdict is not). LLM-as-judge belongs in offline evaluation against curated sets, where cost and latency do not matter and the goal is to calibrate the typed-matcher thresholds. The 60% summary threshold above came from exactly such a calibration loop, scored against 200 human-labeled response-field pairs in a LangSmith dataset.
Provider-native citations are the right tool for document grounding
Anthropic's citations API and OpenAI's response annotations surface first-class citations to retrieved document chunks. For document-grounded RAG, those are the preferred shape: the provider validates the citation against the supplied chunk and the runtime cost of getting it wrong is low. The pattern in this post covers a different surface, structured-field attribution from upstream tools (a numeric fit_score, an enum severity_level, a free-text signal_summary), which provider citation APIs do not model. In a stack with both, the two are complementary: provider-native for retrieved chunks, inline markers for extracted fields.
Edge cases worth knowing about
- Stripping for display. Citation markers are stripped before render and before any downstream model consumes the prior assistant turn. A model that sees its own citations in conversation history quickly learns to over-cite.
- Citation drift, fabricated fields. Models occasionally invent a field name that does not exist ([[CS:emotional_state]] when the actual field is chief_complaint). Unknown fields are dropped from the citation list and counted on a drift counter; a sustained drift rate is a signal that the field-name dictionary is stale or the model is confabulating, not noise to silence.
- Low-completeness gating. When the assembled context's completeness score is below the matching threshold, the prompt formatter hides certain sections entirely. Citations against hidden sections never appear because the model never saw the data.
- Per-message snapshot. The attribution panel reads a snapshot persisted at the moment the response was generated. Otherwise a later assembly run would retroactively change the attribution view of an old message.
- Calibration is not optional. Thresholds (the 60% summary ratio, the citation-density cap, the 0.7 confidence cap) are calibrated against a labeled evaluation set. Re-calibrate when the prompt, the model, or the field schema changes; otherwise the confidence numbers drift away from human judgement silently.
If you only adopt one part of this pattern, adopt the citation grammar with regex stripping plus the four guardrails (unknown-field drop, duplicate collapse, density cap, parser scoped to assistant turns). It costs a few lines of TypeScript, a paragraph in the system prompt, and an ops toggle. It gives reviewers an inspectable per-response answer to "which fields did the model report using." The typed-matcher fallback is worthwhile, but it was harder to get right.
Takeaway checklist
- Treat inline citations as a self-reported attribution signal, not evidence of causal context use. Five short prefixes are enough for most products.
- Strip citation markers before display and before they re-enter conversation history.
- Validate every citation against the assembled context for that turn. Drop unknown fields, collapse duplicates, cap citation density.
- Treat field types as first-class. Enum, numeric, summary, list, and a default field-value matcher cover the realistic surface area.
- Lowercase consistently and use word-boundary regexes for enums and numbers. Word boundaries close the substring-collision class (14 in 2014); semantic overlap on common enum words still needs citations, surrounding-label checks, or capped confidence.
- Cap heuristic confidences below 1.0. Citations carry the strongest available signal in this scheme; the UI reflects that without claiming causality.
- Surface the attribution source (citation, heuristic, mixed) prominently, and recalibrate thresholds when the model, prompt, or field schema changes.
- Use provider-native citation APIs for document-grounded RAG; use inline field markers for structured-field attribution. The two are complementary.