Report Generation & Review System (Regulated Decision Support)

NDA note

This case study is generalized and anonymized due to NDA. The architecture, constraints, and failure patterns are accurate. Domain specifics, identifiers, and metrics are intentionally abstracted.

What you’ll learn

  • How to design “fail-closed” LLM workflows in regulated environments
  • How to structure human review so people actually review (not rubber-stamp)
  • What broke in production and what fixed it

Context and Constraints

Public-sector caseworkers process high volumes of regulated funding decisions for training, coaching, and certification programs. Each decision requires a written report documenting eligibility, program suitability, and compliance with current guidelines.

These reports must be:

  • Legally defensible if challenged
  • Consistent with frequently updated guidance
  • Auditable by external oversight bodies
  • Understandable to non-technical reviewers

The constraint wasn’t missing information. It was time and fragmentation: client data, rules, and documentation lived across disconnected systems, and the reports were repetitive to draft yet high-stakes if wrong.

Why naive LLM usage would fail here

A simple “generate report from client data” prompt fails because:

Hallucinated rules

LLMs can confidently cite guidelines that are outdated or not applicable. In regulated contexts, one incorrect reference can invalidate the whole output.

Inconsistent structure

Reports need a stable, auditable section format. Even with good prompts, models drift, which breaks downstream checks and reviewer expectations.

No traceability

When auditors ask “why did you approve this?”, the answer cannot be “the model thought so.” Every claim needs a source or a deterministic rule.

Trust erosion

Once reviewers find one confident error, they stop trusting the system. Full automation kills engagement and removes the feedback loop you need to improve.

System architecture

The design separates deterministic and probabilistic logic into explicit layers.

Layer 1: Deterministic eligibility engine

Rule-based checks run first (eligibility criteria, program category constraints, and hard blockers).

Output: structured pass/fail flags with versioned rule references. No LLM involvement. If ineligible, the system stops here.
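A minimal sketch of what this layer could look like. The rule IDs, version scheme, and specific checks here are illustrative assumptions, not the actual ruleset; the point is that each check is deterministic, named, and versioned so a fail can be traced to an exact rule.

```python
from dataclasses import dataclass

RULESET_VERSION = "2024.3"  # assumed version scheme, illustrative only

@dataclass(frozen=True)
class RuleResult:
    rule_id: str   # versioned reference, e.g. "ELIG-001@2024.3"
    passed: bool
    reason: str

def check_eligibility(case: dict) -> list[RuleResult]:
    """Run hard, deterministic checks. No LLM involvement at this layer."""
    return [
        RuleResult(
            f"ELIG-001@{RULESET_VERSION}",
            case.get("age", 0) >= 18,
            "Applicant must be an adult",
        ),
        RuleResult(
            f"ELIG-002@{RULESET_VERSION}",
            case.get("program_category") in {"training", "coaching", "certification"},
            "Program category must be in scope",
        ),
    ]

def is_eligible(results: list[RuleResult]) -> bool:
    """Fail-closed: any failed rule stops the pipeline before generation."""
    return all(r.passed for r in results)
```

Because the output is structured flags rather than free text, downstream layers (and auditors) can consume it mechanically.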

Layer 2: Retrieval-augmented context assembly

For eligible cases, the system retrieves the relevant bundle: client profile, program details, policy excerpts, and prior notes.

Output: a structured context package with source references, logged before generation for auditability.
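One way to sketch the context package, assuming a JSON-serializable bundle. Field names are hypothetical; the essential properties are that every policy excerpt carries its source reference and that the package gets a content digest so the exact inputs to generation can be logged and later verified.

```python
import datetime
import hashlib
import json

def assemble_context(client_profile: dict, program: dict,
                     policy_excerpts: list[dict], prior_notes: list[str]) -> dict:
    """Bundle everything the model is allowed to see, with source references,
    and fingerprint the bundle before generation for auditability."""
    package = {
        "client_profile": client_profile,
        "program": program,
        # Each excerpt keeps its source reference and policy version.
        "policy_excerpts": [
            {"text": e["text"], "source": e["source"], "version": e.get("version", "")}
            for e in policy_excerpts
        ],
        "prior_notes": prior_notes,
        "assembled_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    # Digest of the exact package contents; logged alongside the package.
    package["digest"] = hashlib.sha256(
        json.dumps(package, sort_keys=True).encode()
    ).hexdigest()
    return package
```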

Layer 3: LLM section drafting

The model drafts specific sections (background summary, justification, expected outcomes, risks/alternatives).

Hard constraint: the model can only synthesize retrieved content. If evidence is missing, it must say so and request it.
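The constraint can be enforced from both sides: in the prompt, and with a deterministic post-check that rejects citations pointing outside the retrieved evidence. The prompt wording and citation format below are assumptions for illustration.

```python
import re

SECTION_PROMPT = """You are drafting the '{section}' section of a regulated funding report.

Rules:
- Use ONLY the evidence provided below. Do not rely on outside knowledge.
- Cite the source reference for every factual claim, e.g. [POL-12].
- If the evidence does not cover a required point, write
  "MISSING EVIDENCE: <what is needed>" instead of guessing.
- Summarize neutrally. Do not decide or recommend.

Evidence:
{evidence}
"""

def build_section_prompt(section: str, evidence_items: list[dict]) -> str:
    evidence = "\n".join(f"[{e['source']}] {e['text']}" for e in evidence_items)
    return SECTION_PROMPT.format(section=section, evidence=evidence)

def unknown_citations(draft: str, evidence_items: list[dict]) -> list[str]:
    """Post-check: return any [SOURCE] citations in the draft that do not
    match a provided evidence item. Non-empty result => reject the draft."""
    allowed = {e["source"] for e in evidence_items}
    cited = set(re.findall(r"\[([^\]]+)\]", draft))
    return sorted(cited - allowed)
```

The post-check matters more than the prompt: the prompt asks, the check verifies.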

Layer 4: Human review and editing

Review happens in a structured interface: edit section drafts, re-generate sections, and flag incorrect outputs.

Edits and flags become structured feedback that improves retrieval, templates, and guardrails over time.

Layer 5: Final assembly and submission

Approved sections are assembled with metadata: sources used, ruleset version, model/prompt version, and approval timestamp.

This makes the output reproducible and defensible under audit.
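The assembly step could be as simple as attaching a provenance envelope, assuming field names like the following. The `context_digest` would link the report back to the logged context package from Layer 2.

```python
import datetime

def assemble_report(approved_sections: dict, *, context_digest: str,
                    ruleset_version: str, model_version: str,
                    prompt_version: str, approved_by: str) -> dict:
    """Attach the provenance metadata that makes a report reproducible:
    given the same inputs and versions, the output can be re-derived."""
    return {
        "sections": approved_sections,
        "metadata": {
            "context_digest": context_digest,    # links to the logged context package
            "ruleset_version": ruleset_version,  # which deterministic rules applied
            "model_version": model_version,
            "prompt_version": prompt_version,
            "approved_by": approved_by,
            "approved_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        },
    }
```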

Human-in-the-loop workflow

The system treats review as a product feature, not a safety net:

  • Active review by design: the UI nudges users into editing and validating sections, instead of approving a full auto-generated blob.
  • Section-level regeneration: re-draft one section without re-running the entire report.
  • Feedback reinjection: edits and flags feed back into retrieval and templates for future similar cases.
  • Evidence on demand: reviewers can inspect the cited sources behind each section.

The point wasn’t “replace humans.” It was to move humans from repetitive drafting into higher-signal validation, under strict constraints.

Failure cases and fixes

Failure: retrieval blind spots

What happened: newer policy updates were present but not retrieved reliably because semantic similarity favored older, frequently referenced text.

Fix: added metadata-aware retrieval (recency/version weighting) and required the model to cite the policy version explicitly.
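Recency weighting can be sketched as a blend of semantic similarity with an exponential decay on document age. The half-life and weight values below are illustrative assumptions, not the tuned production numbers.

```python
import datetime
import math

def blended_score(semantic_sim: float, doc_date: datetime.date,
                  now: datetime.date, half_life_days: float = 180.0,
                  recency_weight: float = 0.3) -> float:
    """Blend semantic similarity with exponential recency decay so newer
    policy versions are not drowned out by older, frequently cited text."""
    age_days = (now - doc_date).days
    recency = math.exp(-math.log(2) * age_days / half_life_days)
    return (1 - recency_weight) * semantic_sim + recency_weight * recency
```

With these parameters, a two-year-old excerpt needs a substantially higher raw similarity to outrank a recent one, which is exactly the bias the fix introduces.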

Failure: overly confident tone

What happened: early drafts sounded like decisions (“clearly eligible”, “strongly recommended”), which triggered resistance from reviewers.

Fix: enforced neutral language and separated “evidence summary” from “human decision.” Quality improved when the model stopped trying to persuade.
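Part of such an enforcement can be a trivial deterministic check before a draft reaches reviewers. The phrase list below is a hypothetical starting set; in practice it would grow from reviewer flags.

```python
# Illustrative denylist of decisive/persuasive phrasing; flagged drafts
# are rewritten or regenerated before reaching a reviewer.
DECISIVE_PHRASES = [
    "clearly eligible",
    "strongly recommended",
    "must be approved",
    "obviously",
    "without doubt",
]

def flag_decisive_language(text: str) -> list[str]:
    """Return any decisive phrases found in a draft (case-insensitive)."""
    lowered = text.lower()
    return [p for p in DECISIVE_PHRASES if p in lowered]
```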

Failure: “approve-all” behavior

What happened: when drafts were too complete, reviewers skimmed and missed issues.

Fix: redesigned outputs into section drafts that require active interaction, and added UI cues for high-risk sections.

Outcomes and tradeoffs

What improved: less time spent on repetitive drafting, higher consistency across reports, and better reviewer confidence because every section can be traced back to evidence or explicit rules.

What was sacrificed: more engineering overhead than “just prompt it.” New report types require updates to rules, retrieval, and review flows, not only prompt changes.

Why it was worth it: in regulated environments, one bad output can kill adoption. This system worked because it was designed for constraints, not for demos.