German Jobcenters process thousands of funding decisions annually for vocational training, coaching, and certification programs. Each decision requires a written report documenting client eligibility, program suitability, and compliance with federal guidelines.
These reports must be factually accurate, consistently structured, and auditable against federal guidelines.
The problem was not a lack of information but a lack of time: case workers spent 40-60 minutes per report on repetitive documentation while regulations and client data lived across disconnected systems.
A simple “generate report from client data” prompt would fail because:
LLMs confidently cite non-existent funding guidelines or outdated rules. In a regulated environment, one hallucinated policy reference invalidates the entire decision.
Reports must follow strict section structures for auditing. LLMs vary formatting even with detailed prompts, making batch processing and compliance checks unreliable.
When auditors ask “why was this client approved?” the answer cannot be “the AI thought it was appropriate.” Every statement needs a citable source or deterministic rule.
Case workers who find one error stop trusting the system entirely. Full automation would have meant no engagement, no learning, and no improvement path.
The system separates deterministic and probabilistic logic into distinct layers:
Rule-based checks run first: age requirements, residency status, prior funding history, program category eligibility.
Output: Pass/fail flags with cited regulation references. No LLM involvement. If a client is ineligible, the system stops here.
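A minimal sketch of what such a deterministic check might look like; the field names, rule IDs, and regulation references are illustrative placeholders, not the actual Jobcenter schema:

```python
from dataclasses import dataclass

@dataclass
class RuleResult:
    rule_id: str          # internal identifier of the check
    passed: bool          # hard pass/fail, no scores
    regulation_ref: str   # citable source for the rule
    detail: str           # human-readable explanation for the report

def check_eligibility(client: dict) -> list[RuleResult]:
    """Run all deterministic checks; no LLM is involved at this stage."""
    return [
        RuleResult(
            rule_id="AGE_MIN",
            passed=client["age"] >= 18,
            regulation_ref="SGB III reference (illustrative)",
            detail=f"Client age: {client['age']}",
        ),
        RuleResult(
            rule_id="RESIDENCY",
            passed=client["residency_status"] in {"citizen", "permanent_resident"},
            regulation_ref="SGB II reference (illustrative)",
            detail=f"Residency status: {client['residency_status']}",
        ),
        RuleResult(
            rule_id="NO_ACTIVE_FUNDING",
            passed=not client["has_active_funding"],
            regulation_ref="Internal funding-history rule (illustrative)",
            detail="Client must not have an overlapping active measure",
        ),
    ]

def is_eligible(results: list[RuleResult]) -> bool:
    # If any rule fails, the pipeline stops here and never reaches the LLM.
    return all(r.passed for r in results)
```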
For eligible clients, the system retrieves the current client profile, training program details, relevant funding guidelines (via vector search over the regulation corpus), and case worker notes from previous interactions.
Output: Structured context bundle with source references. This gets logged for auditability before LLM processing.
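A sketch of how the context bundle might be assembled and logged, assuming a generic vector store and CRM client; `vector_store.search`, `crm`, `notes_db`, and the hit attributes are placeholders, not a specific library API:

```python
import json
import logging
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

logger = logging.getLogger("report_pipeline")

@dataclass
class ContextBundle:
    client_profile: dict
    program_details: dict
    guideline_chunks: list[dict]      # each chunk keeps its text + source reference
    case_worker_notes: list[str]
    retrieved_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def build_context(client_id: str, program_id: str, vector_store, crm, notes_db) -> ContextBundle:
    profile = crm.get_client(client_id)
    program = crm.get_program(program_id)

    # Vector search over the regulation corpus; every hit carries its source
    # metadata so the LLM can only cite documents that are actually in the bundle.
    query = f"funding eligibility for {program['category']}"
    hits = vector_store.search(query, top_k=5)
    chunks = [{"text": h.text, "source": h.source, "version": h.version} for h in hits]

    bundle = ContextBundle(
        client_profile=profile,
        program_details=program,
        guideline_chunks=chunks,
        case_worker_notes=notes_db.get_notes(client_id),
    )
    # Log the full bundle before any LLM call so the audit trail shows exactly
    # which information the draft was generated from.
    logger.info("context_bundle %s", json.dumps(asdict(bundle), ensure_ascii=False))
    return bundle
```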
The LLM generates draft text for specific report sections: client background summary, program justification, expected outcomes, and potential risks or alternatives.
Critical constraint: The LLM only synthesizes retrieved information. It does not invent facts, cite regulations not in the context, or make approval recommendations.
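One way to enforce that constraint at the prompt level, sketched below; the prompt wording and the `llm_client.complete` call are illustrative and would map onto whatever chat-completion API is in use:

```python
SECTION_PROMPT = """You are drafting the "{section}" section of a funding report.

Rules:
- Use ONLY the information in the CONTEXT block below.
- Cite regulations only if they appear in the context, using their listed version.
- Do NOT recommend approval or rejection; present evidence neutrally.
- If information needed for this section is missing, write "Information not available"
  instead of inferring it.

CONTEXT:
{context}
"""

def draft_section(section: str, bundle, llm_client) -> str:
    guidelines = "\n\n".join(
        f"[{c['source']} v{c['version']}] {c['text']}" for c in bundle.guideline_chunks
    )
    context_text = (
        f"CLIENT PROFILE:\n{bundle.client_profile}\n\n"
        f"PROGRAM:\n{bundle.program_details}\n\n"
        f"GUIDELINES:\n{guidelines}\n\n"
        "CASE WORKER NOTES:\n" + "\n".join(bundle.case_worker_notes)
    )
    prompt = SECTION_PROMPT.format(section=section, context=context_text)
    # The model synthesizes; it does not decide. A low temperature keeps section
    # structure consistent across batch runs.
    return llm_client.complete(prompt, temperature=0.2)
```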
Case workers review drafts in a structured interface. They can edit text, request re-generation for specific sections, or flag outputs as incorrect.
Edits and flags are captured as structured feedback. This data improves retrieval quality and prompt tuning over time.
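A sketch of what a captured feedback record might look like; the event types are assumptions derived from the review actions described above:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

class FeedbackType(Enum):
    EDITED = "edited"            # case worker changed the draft text
    REGENERATED = "regenerated"  # section was re-generated on request
    FLAGGED = "flagged"          # output marked as factually incorrect

@dataclass
class SectionFeedback:
    report_id: str
    section: str
    feedback_type: FeedbackType
    original_text: str
    final_text: Optional[str]    # None for pure flags without an edit
    comment: str
    case_worker_id: str
    timestamp: str = ""

    def __post_init__(self):
        if not self.timestamp:
            self.timestamp = datetime.now(timezone.utc).isoformat()
```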
Approved sections are assembled into the final report with metadata: sources cited, eligibility rule versions applied, LLM model and prompt version used, and the case worker ID and approval timestamp.
This audit trail enables full reproducibility and compliance verification.
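A sketch of the assembly step, showing the kind of metadata that travels with each report; the field names and the content hash are illustrative choices, not the production schema:

```python
import hashlib
from dataclasses import dataclass, asdict

@dataclass
class ReportMetadata:
    sources_cited: list[str]         # regulation documents and versions used
    rule_versions: dict[str, str]    # eligibility rule set versions applied
    llm_model: str                   # model name / snapshot identifier
    prompt_version: str
    case_worker_id: str
    approval_timestamp: str

def assemble_report(sections: dict[str, str], metadata: ReportMetadata) -> dict:
    body = "\n\n".join(f"{title}\n{text}" for title, text in sections.items())
    return {
        "body": body,
        "metadata": asdict(metadata),
        # A content hash lets auditors verify the stored report was not altered
        # after approval.
        "content_hash": hashlib.sha256(body.encode("utf-8")).hexdigest(),
    }
```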
The system treats human review as a core component, not a safety fallback.
This design reduces review time from 60 minutes to 15-20 minutes while maintaining engagement and catching errors the LLM cannot detect.
What happened: Regulation updates were added to the corpus but not indexed with sufficient context. The retrieval system returned outdated rules because semantic similarity favored older, more frequently cited documents.
Impact: Five reports cited superseded guidelines before case workers caught it in review. None of these reports were submitted, but trust in the system dropped.
Fix: Added temporal weighting to retrieval scoring. Documents published within the last quarter receive a relevance boost. Implemented a “regulation version” metadata field that the LLM must cite explicitly.
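A sketch of the temporal weighting applied to retrieval scores; the boost window and factor here are illustrative values, not the production configuration:

```python
from datetime import date
from typing import Optional

RECENCY_WINDOW_DAYS = 90   # "last quarter" boost window (illustrative)
RECENCY_BOOST = 1.25       # multiplicative boost for recent documents (illustrative)

def adjusted_score(similarity: float, published: date, today: Optional[date] = None) -> float:
    """Combine semantic similarity with a recency boost so newer regulation
    versions outrank older, more frequently cited ones."""
    today = today or date.today()
    age_days = (today - published).days
    if age_days <= RECENCY_WINDOW_DAYS:
        return similarity * RECENCY_BOOST
    return similarity
```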
What happened: Early LLM outputs used phrasing like “this client is clearly eligible” or “approval is strongly recommended.” Case workers felt the system was making decisions rather than supporting them.
Impact: Workers either ignored recommendations or deferred to them without critical thinking. Both outcomes were failures.
Fix: Rewrote prompts to enforce neutral language: “based on the following evidence” instead of “clearly.” Added explicit instruction that the LLM presents information but does not make recommendations. Quality improved when the system stopped trying to be persuasive.
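An illustrative before/after of the instruction change; the wording is a reconstruction of the shift described above, not the production prompt:

```python
# Before: persuasive framing that case workers rejected.
OLD_INSTRUCTION = (
    "Summarize why this client is clearly eligible and recommend approval."
)

# After: neutral framing that presents evidence without deciding.
NEW_INSTRUCTION = (
    "Based on the following evidence, summarize the client's situation and the "
    "relevant guideline passages. Do not use evaluative words such as 'clearly' "
    "or 'strongly', and do not recommend approval or rejection."
)
```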
What improved: Report generation time dropped by roughly 70%. Regulation citation accuracy increased because retrieval replaced human memory. Case worker satisfaction improved because they spent less time on documentation and more on client interaction.
What was sacrificed: The system is slower to build and maintain than a pure-LLM solution. Prompt changes require approval from compliance teams. Adding new report types takes days, not hours, because deterministic rules and retrieval must be updated alongside LLM logic.
Why the tradeoff was worth it: Speed of iteration mattered less than zero tolerance for hallucinations. In regulated environments, one bad output can block adoption entirely. The system works because it was designed for the constraints, not despite them.