This case study is generalized and anonymized under an NDA. The architecture, constraints, and failure patterns are accurate. Domain specifics, identifiers, and metrics are intentionally abstracted.
Public-sector caseworkers process high volumes of regulated funding decisions for training, coaching, and certification programs. Each decision requires a written report documenting eligibility, program suitability, and compliance with current guidelines.
These reports must be consistent in structure, grounded in current guidelines, and fully auditable.
The constraint wasn’t missing information. It was time and fragmentation: client data, rules, and documentation lived across disconnected systems, and the report was repetitive but high-stakes.
A simple “generate report from client data” prompt fails because:
LLMs can confidently cite guidelines that are outdated or not applicable. In regulated contexts, one incorrect reference can invalidate the whole output.
Reports need a stable, auditable section format. Even with good prompts, models drift, which breaks downstream checks and reviewer expectations.
When auditors ask “why did you approve this?”, the answer cannot be “the model thought so.” Every claim needs a source or a deterministic rule.
Once reviewers find one confident error, they stop trusting the system. Full automation kills engagement and removes the feedback loop you need to improve.
The design separates deterministic and probabilistic logic into explicit layers.
Rule-based checks run first (eligibility criteria, program category constraints, and hard blockers).
Output: structured pass/fail flags with versioned rule references. No LLM involvement. If ineligible, the system stops here.
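A minimal sketch of what this deterministic layer could look like. The rule IDs, field names, and criteria below are hypothetical illustrations, not the actual ruleset; the point is the shape: versioned rule references, structured pass/fail flags, and no model in the loop.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RuleResult:
    rule_id: str   # versioned reference, e.g. "ELIG-001@v2"
    passed: bool
    reason: str

def check_eligibility(client: dict, ruleset_version: str = "v2") -> list[RuleResult]:
    """Deterministic pass/fail checks; no LLM involvement at this layer."""
    return [
        RuleResult(f"ELIG-001@{ruleset_version}",
                   client.get("employment_status") in {"unemployed", "at_risk"},
                   "Applicant must be unemployed or at risk of unemployment."),
        RuleResult(f"ELIG-002@{ruleset_version}",
                   client.get("program_category") in {"training", "coaching", "certification"},
                   "Program must fall in a funded category."),
    ]

def is_eligible(results: list[RuleResult]) -> bool:
    # A single hard blocker stops the pipeline before any generation happens.
    return all(r.passed for r in results)
```

Because every flag carries a versioned rule ID, an auditor can reconstruct exactly which ruleset produced a stop decision.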
For eligible cases, the system retrieves the relevant bundle: client profile, program details, policy excerpts, and prior notes.
Output: a structured context package with source references, logged before generation for auditability.
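One way to structure that context package, sketched under assumed field names (the real schema is abstracted away). The idea that matters is that each policy excerpt keeps its source reference and version, and the whole bundle is hashed and timestamped before generation so the exact model input is reproducible.

```python
import hashlib
import json
from datetime import datetime, timezone

def build_context_package(client_profile: dict, program: dict,
                          policy_excerpts: list[dict],
                          prior_notes: list[str]) -> dict:
    """Assemble the retrieval bundle, logged before any generation happens."""
    package = {
        "client_profile": client_profile,
        "program": program,
        # each excerpt carries its own source reference and policy version
        "policy_excerpts": [
            {"source_id": e["source_id"], "version": e["version"], "text": e["text"]}
            for e in policy_excerpts
        ],
        "prior_notes": prior_notes,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }
    # a content hash pins the exact generation input for the audit trail
    package["content_hash"] = hashlib.sha256(
        json.dumps(package, sort_keys=True).encode()
    ).hexdigest()
    return package
```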
The model drafts specific sections (background summary, justification, expected outcomes, risks/alternatives).
Hard constraint: the model may only synthesize from retrieved content. If evidence is missing, it must say so and request it rather than fill the gap.
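That constraint can be enforced mechanically rather than trusted to the prompt. A sketch, assuming a hypothetical citation convention (`[src:...]` markers and an explicit evidence-missing marker) that the real system may implement differently:

```python
import re

# hypothetical inline citation format, e.g. "[src:POL-7@2024-06]"
CITATION = re.compile(r"\[src:([A-Za-z0-9\-@.]+)\]")

def validate_grounding(draft: str, retrieved_source_ids: set[str]) -> list[str]:
    """Reject drafts that cite sources outside the retrieved bundle,
    or that assert content with no citation and no missing-evidence marker."""
    problems = []
    cited = set(CITATION.findall(draft))
    for src in sorted(cited - retrieved_source_ids):
        problems.append(f"cites unretrieved source: {src}")
    if not cited and "[EVIDENCE MISSING" not in draft:
        problems.append("no citations and no explicit evidence-missing marker")
    return problems
```

A draft that fails validation is regenerated or routed to review with the problems attached, so hallucinated references never reach a final report silently.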
Review happens in a structured interface: edit section drafts, re-generate sections, and flag incorrect outputs.
Edits and flags become structured feedback that improves retrieval, templates, and guardrails over time.
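For edits and flags to feed back into retrieval and guardrails, they need a structured shape. A hypothetical record format (the flag taxonomy and field names are illustrative):

```python
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Optional

class FlagType(str, Enum):
    WRONG_SOURCE = "wrong_source"
    OUTDATED_POLICY = "outdated_policy"
    TONE = "tone"
    MISSING_EVIDENCE = "missing_evidence"

@dataclass
class ReviewFeedback:
    report_id: str
    section: str
    flag: Optional[FlagType]  # None means a plain edit with no flag
    original_text: str
    edited_text: str
    reviewer_id: str

def feedback_to_record(fb: ReviewFeedback) -> dict:
    """Flatten feedback into a structured record for the improvement loop."""
    rec = asdict(fb)
    rec["flag"] = fb.flag.value if fb.flag else None
    rec["changed"] = fb.original_text != fb.edited_text
    return rec
```

Aggregating these records over time shows which sections get flagged most, which is what drives targeted fixes to retrieval and templates rather than blind prompt tweaking.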
Approved sections are assembled with metadata: sources used, ruleset version, model/prompt version, and approval timestamp.
This makes the output reproducible and defensible under audit.
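The assembly step can be as simple as attaching provenance metadata to the approved sections. A sketch with assumed parameter names:

```python
from datetime import datetime, timezone

def assemble_report(sections: dict[str, str], sources_used: list[str],
                    ruleset_version: str, model_version: str,
                    prompt_version: str) -> dict:
    """Attach provenance so the report is reproducible and defensible under audit."""
    return {
        "sections": sections,
        "metadata": {
            "sources_used": sorted(sources_used),
            "ruleset_version": ruleset_version,
            "model_version": model_version,
            "prompt_version": prompt_version,
            "approved_at": datetime.now(timezone.utc).isoformat(),
        },
    }
```

Given the same sources, ruleset version, and model/prompt version, an auditor can rerun the pipeline and compare outputs instead of taking the report on faith.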
The system treats review as a product feature, not a safety net.
The point wasn’t “replace humans.” It was to move humans from repetitive drafting into higher-signal validation, under strict constraints.
What happened: newer policy updates were present but not retrieved reliably because semantic similarity favored older, frequently referenced text.
Fix: added metadata-aware retrieval (recency/version weighting) and required the model to cite the policy version explicitly.
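A minimal sketch of recency/version weighting. The decay curve and the specific weights are hypothetical; the mechanism is what matters: semantic similarity alone is not the ranking score, it is one factor multiplied by metadata signals.

```python
from datetime import date

def weighted_score(semantic_score: float, effective_date: date, today: date,
                   is_current_version: bool, half_life_days: int = 365) -> float:
    """Combine semantic similarity with recency and version metadata so
    current policy text outranks older, frequently cited text."""
    age_days = (today - effective_date).days
    recency = 0.5 ** (age_days / half_life_days)   # exponential decay with age
    version_boost = 1.0 if is_current_version else 0.3  # penalize superseded text
    return semantic_score * version_boost * (0.5 + 0.5 * recency)
```

With weights like these, a superseded passage with higher raw similarity loses to a current passage with moderate similarity, which is exactly the failure the plain-similarity retriever produced.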
What happened: early drafts sounded like decisions (“clearly eligible”, “strongly recommended”), which triggered resistance from reviewers.
Fix: enforced neutral language and separated “evidence summary” from “human decision.” Quality improved when the model stopped trying to persuade.
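Part of that enforcement can be a simple lint pass before a draft reaches reviewers. The phrase list below is a hypothetical illustration seeded from the kinds of wording that triggered resistance:

```python
# hypothetical list of decision-sounding phrases to flag in drafts
PERSUASIVE_PHRASES = [
    "clearly eligible",
    "strongly recommended",
    "obviously",
    "without a doubt",
]

def lint_neutrality(draft: str) -> list[str]:
    """Flag persuasive language; the draft should summarize evidence,
    not argue for an outcome the human has not yet made."""
    lowered = draft.lower()
    return [phrase for phrase in PERSUASIVE_PHRASES if phrase in lowered]
```

Flagged drafts are regenerated with the neutrality instruction reinforced, keeping the "evidence summary" register separate from the human decision.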
What happened: when drafts were too complete, reviewers skimmed and missed issues.
Fix: redesigned outputs into section drafts that require active interaction, and added UI cues for high-risk sections.
What improved: less time spent on repetitive drafting, higher consistency across reports, and better reviewer confidence because every section can be traced back to evidence or explicit rules.
What was sacrificed: more engineering overhead than “just prompt it.” New report types require updates to rules, retrieval, and review flows, not only prompt changes.
Why it was worth it: in regulated environments, one bad output can kill adoption. This system worked because it was designed for constraints, not for demos.