Report Generation & Review System (Regulated Decision Support)

NDA note

This case study is generalized and anonymized due to NDA. The architecture, constraints, and failure patterns are accurate. Domain specifics, identifiers, and metrics are intentionally abstracted.

What you’ll learn

  • How to design “fail-closed” LLM workflows in regulated environments
  • How to structure human review so people actually review (not rubber-stamp)
  • What broke in production and what fixed it

Context and Constraints

Public-sector caseworkers process high volumes of regulated funding decisions for training, coaching, and certification programs. Each decision requires a written report documenting eligibility, program suitability, and compliance with current guidelines.

These reports must be:

  • Legally defensible if challenged
  • Consistent with frequently updated guidance
  • Auditable by external oversight bodies
  • Understandable to non-technical reviewers

The constraint wasn’t missing information. It was time and fragmentation: client data, rules, and documentation lived across disconnected systems, and the reports were repetitive to draft yet high-stakes if wrong.

Why naive LLM usage would fail here

A simple “generate report from client data” prompt fails because:

Hallucinated rules

LLMs can confidently cite guidelines that are outdated or not applicable. In regulated contexts, one incorrect reference can invalidate the whole output.

Inconsistent structure

Reports need a stable, auditable section format. Even with good prompts, models drift, which breaks downstream checks and reviewer expectations.

No traceability

When auditors ask “why did you approve this?”, the answer cannot be “the model thought so.” Every claim needs a source or a deterministic rule.

Trust erosion

Once reviewers find one confident error, they stop trusting the system. Full automation kills engagement and removes the feedback loop you need to improve.

System architecture

The design separates deterministic and probabilistic logic into explicit layers.

Layer 1: Deterministic eligibility engine

Rule-based checks run first (eligibility criteria, program category constraints, and hard blockers).

Output: structured pass/fail flags with versioned rule references. No LLM involvement. If ineligible, the system stops here.
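A minimal sketch of what this layer could look like. The rule IDs, version scheme, and specific checks here are illustrative assumptions, not the actual ruleset; the point is that each check is deterministic, named, and versioned so a fail can be traced to an exact rule.

```python
from dataclasses import dataclass

RULESET_VERSION = "2024.3"  # assumed version scheme, illustrative only

@dataclass(frozen=True)
class RuleResult:
    rule_id: str   # versioned reference, e.g. "ELIG-001@2024.3"
    passed: bool
    reason: str

def check_eligibility(case: dict) -> list[RuleResult]:
    """Run hard, deterministic checks. No LLM involvement at this layer."""
    return [
        RuleResult(
            f"ELIG-001@{RULESET_VERSION}",
            case.get("age", 0) >= 18,
            "Applicant must be an adult",
        ),
        RuleResult(
            f"ELIG-002@{RULESET_VERSION}",
            case.get("program_category") in {"training", "coaching", "certification"},
            "Program category must be in scope",
        ),
    ]

def is_eligible(results: list[RuleResult]) -> bool:
    """Fail-closed: any failed rule stops the pipeline before generation."""
    return all(r.passed for r in results)
```

Because the output is structured flags rather than free text, downstream layers (and auditors) can consume it mechanically.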

Layer 2: Retrieval-augmented context assembly

For eligible cases, the system retrieves the relevant bundle: client profile, program details, policy excerpts, and prior notes.

Output: a structured context package with source references, logged before generation for auditability.
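One way to sketch the context package, assuming a JSON-serializable bundle. Field names are hypothetical; the essential properties are that every policy excerpt carries its source reference and that the package gets a content digest so the exact inputs to generation can be logged and later verified.

```python
import datetime
import hashlib
import json

def assemble_context(client_profile: dict, program: dict,
                     policy_excerpts: list[dict], prior_notes: list[str]) -> dict:
    """Bundle everything the model is allowed to see, with source references,
    and fingerprint the bundle before generation for auditability."""
    package = {
        "client_profile": client_profile,
        "program": program,
        # Each excerpt keeps its source reference and policy version.
        "policy_excerpts": [
            {"text": e["text"], "source": e["source"], "version": e.get("version", "")}
            for e in policy_excerpts
        ],
        "prior_notes": prior_notes,
        "assembled_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    # Digest of the exact package contents; logged alongside the package.
    package["digest"] = hashlib.sha256(
        json.dumps(package, sort_keys=True).encode()
    ).hexdigest()
    return package
```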

Layer 3: LLM section drafting

The model drafts specific sections (background summary, justification, expected outcomes, risks/alternatives).

Hard constraint: the model can only synthesize retrieved content. If evidence is missing, it must say so and request it.
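The constraint can be enforced from both sides: in the prompt, and with a deterministic post-check that rejects citations pointing outside the retrieved evidence. The prompt wording and citation format below are assumptions for illustration.

```python
import re

SECTION_PROMPT = """You are drafting the '{section}' section of a regulated funding report.

Rules:
- Use ONLY the evidence provided below. Do not rely on outside knowledge.
- Cite the source reference for every factual claim, e.g. [POL-12].
- If the evidence does not cover a required point, write
  "MISSING EVIDENCE: <what is needed>" instead of guessing.
- Summarize neutrally. Do not decide or recommend.

Evidence:
{evidence}
"""

def build_section_prompt(section: str, evidence_items: list[dict]) -> str:
    evidence = "\n".join(f"[{e['source']}] {e['text']}" for e in evidence_items)
    return SECTION_PROMPT.format(section=section, evidence=evidence)

def unknown_citations(draft: str, evidence_items: list[dict]) -> list[str]:
    """Post-check: return any [SOURCE] citations in the draft that do not
    match a provided evidence item. Non-empty result => reject the draft."""
    allowed = {e["source"] for e in evidence_items}
    cited = set(re.findall(r"\[([^\]]+)\]", draft))
    return sorted(cited - allowed)
```

The post-check matters more than the prompt: the prompt asks, the check verifies.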

Layer 4: Human review and editing

Review happens in a structured interface: edit section drafts, re-generate sections, and flag incorrect outputs.

Edits and flags become structured feedback that improves retrieval, templates, and guardrails over time.

Layer 5: Final assembly and submission

Approved sections are assembled with metadata: sources used, ruleset version, model/prompt version, and approval timestamp.

This makes the output reproducible and defensible under audit.
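The assembly step could be as simple as attaching a provenance envelope, assuming field names like the following. The `context_digest` would link the report back to the logged context package from Layer 2.

```python
import datetime

def assemble_report(approved_sections: dict, *, context_digest: str,
                    ruleset_version: str, model_version: str,
                    prompt_version: str, approved_by: str) -> dict:
    """Attach the provenance metadata that makes a report reproducible:
    given the same inputs and versions, the output can be re-derived."""
    return {
        "sections": approved_sections,
        "metadata": {
            "context_digest": context_digest,    # links to the logged context package
            "ruleset_version": ruleset_version,  # which deterministic rules applied
            "model_version": model_version,
            "prompt_version": prompt_version,
            "approved_by": approved_by,
            "approved_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        },
    }
```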

Human-in-the-loop workflow

The system treats review as a product feature, not a safety net:

  • Active review by design: the UI nudges users into editing and validating sections, instead of approving a full auto-generated blob.
  • Section-level regeneration: re-draft one section without re-running the entire report.
  • Feedback reinjection: edits and flags feed back into retrieval and templates for future similar cases.
  • Evidence on demand: reviewers can inspect the cited sources behind each section.

The point wasn’t “replace humans.” It was to move humans from repetitive drafting into higher-signal validation, under strict constraints.

Failure cases and fixes

Failure: retrieval blind spots

What happened: newer policy updates were present but not retrieved reliably because semantic similarity favored older, frequently referenced text.

Fix: added metadata-aware retrieval (recency/version weighting) and required the model to cite the policy version explicitly.
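Recency weighting can be sketched as a blend of semantic similarity with an exponential decay on document age. The half-life and weight values below are illustrative assumptions, not the tuned production numbers.

```python
import datetime
import math

def blended_score(semantic_sim: float, doc_date: datetime.date,
                  now: datetime.date, half_life_days: float = 180.0,
                  recency_weight: float = 0.3) -> float:
    """Blend semantic similarity with exponential recency decay so newer
    policy versions are not drowned out by older, frequently cited text."""
    age_days = (now - doc_date).days
    recency = math.exp(-math.log(2) * age_days / half_life_days)
    return (1 - recency_weight) * semantic_sim + recency_weight * recency
```

With these parameters, a two-year-old excerpt needs a substantially higher raw similarity to outrank a recent one, which is exactly the bias the fix introduces.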

Failure: overly confident tone

What happened: early drafts sounded like decisions (“clearly eligible”, “strongly recommended”), which triggered resistance from reviewers.

Fix: enforced neutral language and separated “evidence summary” from “human decision.” Quality improved when the model stopped trying to persuade.
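Part of such an enforcement can be a trivial deterministic check before a draft reaches reviewers. The phrase list below is a hypothetical starting set; in practice it would grow from reviewer flags.

```python
# Illustrative denylist of decisive/persuasive phrasing; flagged drafts
# are rewritten or regenerated before reaching a reviewer.
DECISIVE_PHRASES = [
    "clearly eligible",
    "strongly recommended",
    "must be approved",
    "obviously",
    "without doubt",
]

def flag_decisive_language(text: str) -> list[str]:
    """Return any decisive phrases found in a draft (case-insensitive)."""
    lowered = text.lower()
    return [p for p in DECISIVE_PHRASES if p in lowered]
```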

Failure: “approve-all” behavior

What happened: when drafts were too complete, reviewers skimmed and missed issues.

Fix: redesigned outputs into section drafts that require active interaction, and added UI cues for high-risk sections.

Outcomes and tradeoffs

What improved: less time spent on repetitive drafting, higher consistency across reports, and better reviewer confidence because every section can be traced back to evidence or explicit rules.

What was sacrificed: more engineering overhead than “just prompt it.” New report types require updates to rules, retrieval, and review flows, not only prompt changes.

Why it was worth it: in regulated environments, one bad output can kill adoption. This system worked because it was designed for constraints, not for demos.