How We Built an AI-Powered Document Compliance Engine - Part 3: The AI Pipeline
Written by: Tatjana Petkovska
Reading time: 6 min read
Published: May 18, 2026
Inside the DocuGenius AI pipeline: secure PDF ingestion, paragraph-level chunking, pgvector semantic search, retrieval-augmented generation, and multi-agent LLM extraction for compliance.

Part 3 of a 6-part series on building DocuGenius, an AI-powered document compliance engine for document workflow automation across regulated industries.
← Back to Part 2: The Problem & Architecture
TL;DR: The core of DocuGenius is a three-stage AI pipeline. Documents are ingested and embedded at the paragraph level (with page references preserved), relevant chunks are retrieved via semantic search for each compliance criterion, and a multi-agent graph runs parallel LLM extractions. This post covers how and why we built each stage the way we did.
The Processing Pipeline
Once a document lands in the system and a processing job is queued (the asynchronous, Lambda-to-Fargate flow covered in Part 2), the worker runs through a sequence of steps that transform a raw PDF into structured compliance verdicts. The first three steps form the AI core of the system: ingestion, retrieval, and multi-agent extraction.
Step 1: Document Ingestion
Once the worker picks up a queued job, ingestion runs three operations:
- Text extraction via PyMuPDF, which produces text and bounding-box metadata per page
- Content hashing, which generates a SHA-256 hash of the document content, used for cache invalidation and deduplication
- Embedding generation, where each page is split into paragraph-level chunks (capped at ~600 characters, broken at line boundaries) and embedded using OpenAI's text-embedding-3-small model (1536 dimensions). Chunks are stored in pgvector with both page and paragraph references.
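The content-hashing operation above can be sketched in a few lines. This is a minimal illustration, not the DocuGenius implementation: it assumes the hash is computed over the extracted page text (the post does not say whether text or raw bytes are hashed), and the function names `content_hash` and `is_duplicate` are hypothetical.

```python
import hashlib


def content_hash(page_texts: list[str]) -> str:
    """Return a SHA-256 hex digest over the extracted text of all pages."""
    digest = hashlib.sha256()
    for page_number, text in enumerate(page_texts, start=1):
        # Include the page number so moving text between pages changes the hash.
        digest.update(f"{page_number}\n".encode("utf-8"))
        digest.update(text.encode("utf-8"))
        # Separator byte so ["a", "b"] and ["ab"] cannot collide trivially.
        digest.update(b"\x00")
    return digest.hexdigest()


def is_duplicate(page_texts: list[str], known_hashes: set[str]) -> bool:
    """True if a document with identical extracted content was already seen."""
    return content_hash(page_texts) in known_hashes
```

The same digest can serve as a cache key: if the content changes, the hash changes, and any cached embeddings or verdicts stored under the old hash are simply never looked up again.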
The paragraph-level chunking is a deliberate retrieval choice. A single page can mix multiple unrelated topics, and embedding the whole page averages those signals into one diluted vector. Paragraph-level chunks keep each embedding focused on a coherent span of text.
Each chunk still carries its source page number, so evidence highlighting on the PDF stays straightforward. The chunk text itself also gives the downstream fuzzy-match step a tight target instead of a whole page to search.
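To make the chunking rule concrete, here is a rough sketch of a paragraph-level chunker that caps chunks at about 600 characters and only breaks at line boundaries. Treating a blank line as a paragraph break is an assumption, as are the `Chunk` type and the `chunk_page` name; the real pipeline works on PyMuPDF output and may differ.

```python
from dataclasses import dataclass

MAX_CHARS = 600  # approximate cap from the text; the exact value is an assumption


@dataclass
class Chunk:
    page: int        # 1-based source page number
    paragraph: int   # 1-based chunk index within the page
    text: str


def chunk_page(page_text: str, page: int, max_chars: int = MAX_CHARS) -> list[Chunk]:
    """Split one page into chunks, breaking only at line boundaries."""
    chunks: list[Chunk] = []
    current: list[str] = []
    current_len = 0

    def flush() -> None:
        nonlocal current, current_len
        text = "\n".join(current).strip()
        if text:
            chunks.append(Chunk(page=page, paragraph=len(chunks) + 1, text=text))
        current, current_len = [], 0

    for line in page_text.splitlines():
        if not line.strip():
            # Assumption: a blank line ends the current paragraph.
            flush()
            continue
        # Start a new chunk if adding this line would exceed the cap.
        # A single line longer than max_chars is kept whole rather than cut.
        if current and current_len + len(line) + 1 > max_chars:
            flush()
        current.append(line)
        current_len += len(line) + 1
    flush()
    return chunks
```

Because every `Chunk` keeps its `page` and `paragraph` indices, a retrieved chunk can be traced back to a location in the PDF without re-parsing the document.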
Step 2: Semantic Page Retrieval (RAG)
When a compliance extraction is requested, the system first narrows down which paragraphs are relevant to each criterion. For a criterion like "Patient has a documented diagnosis of Type 2 diabetes", the system embeds the criterion's retrieval query and runs a cosine similarity search against the document's paragraph embeddings, returning the top matching chunks (each tagged with its source page).
The key retrieval choice: we tune for recall, not precision. We would rather send a few irrelevant chunks to the LLM than miss the one paragraph that contains the evidence. The LLM handles noisy context well. Missing context is much harder to recover from.
Top-K is kept small (around 10 chunks), which is enough for the LLM to reason over while keeping token costs predictable.
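The retrieval step can be illustrated with an in-memory version of the same ranking. This is only a sketch: in production the search runs inside pgvector (where cosine distance is exposed through the `<=>` operator), the embeddings come from the embedding model rather than hand-written vectors, and the `top_k_chunks` helper and its tuple layout are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(
    query_vec: list[float],
    chunks: list[tuple[int, str, list[float]]],
    k: int = 10,
) -> list[tuple[float, int, str]]:
    """Rank (page, text, embedding) chunks by similarity to the query.

    Returns up to k (score, page, text) tuples, best match first.
    """
    scored = [
        (cosine_similarity(query_vec, embedding), page, text)
        for page, text, embedding in chunks
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]
```

Note that there is no similarity threshold here: the function always returns the best k chunks, which matches the recall-first stance described above. A marginal match still reaches the LLM instead of being filtered out early.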
Context Isolation Inside The Pipeline
Part 2 covers the platform-level security model, including encrypted storage, tenant separation, AWS trust boundary, and minimal logging. Inside the AI pipeline itself, one additional rule shapes the design:
Each processing step sees only the minimum context it needs.
- Queue messages carry job identifiers and routing metadata, not document text.
- Each leaf agent receives only the pages relevant to its single criterion, not the full document, and not other criteria's context.
- Intermediate prompts and traces are scoped per criterion, so a failure or log line never exposes the whole document.
This is what makes per-criterion decomposition (covered next) a security property as well as a token-budget one.
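One way to picture the minimum-context rule is through the shapes of the data that cross each boundary. The sketch below is illustrative only: the field names, the `JobMessage` and `LeafContext` types, and the `build_leaf_contexts` helper are all assumptions, not the actual DocuGenius schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobMessage:
    """Queue payload: identifiers and routing metadata only, never document text."""
    job_id: str
    document_id: str
    tenant_id: str


@dataclass(frozen=True)
class LeafContext:
    """Everything a single leaf agent is allowed to see."""
    criterion_id: str
    system_prompt: str
    chunks: tuple[str, ...]


def build_leaf_contexts(
    retrieved: dict[str, list[str]],
    prompts: dict[str, str],
) -> list[LeafContext]:
    """Build one isolated context per criterion from per-criterion retrieval results."""
    return [
        LeafContext(
            criterion_id=criterion_id,
            system_prompt=prompts[criterion_id],
            chunks=tuple(chunks),
        )
        for criterion_id, chunks in retrieved.items()
    ]
```

Since each `LeafContext` is built from one criterion's retrieval results alone, a logged prompt or a stack trace from one leaf can expose at most that criterion's chunks.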
Step 3: Multi-Agent Extraction With Strands
This is the core of the system. Rather than sending one massive prompt with all criteria and all pages to a single LLM call, we decompose the problem into a directed graph of specialized agents:
- Coordinator Agent, which receives the full set of retrieved pages and distributes relevant context to each leaf agent
- Leaf Agents, one per criterion, each receiving only the pages relevant to its criterion plus a criterion-specific system prompt
- Aggregator Agent, which merges all leaf responses into a single validated JSON output

Figure 1: The Strands agent graph. The Coordinator distributes page context to parallel Leaf Agents, which independently extract evidence and verdicts. The Aggregator merges and validates all results into a single structured output.
We built this on the Strands Agents framework, which gives us async graph execution with configurable timeouts. Each leaf agent can use a different model and system prompt, stored in the database and configurable per criterion.
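The Aggregator's merge-and-validate role can be approximated with a deterministic function. This sketch deliberately does not use the Strands API, and in DocuGenius the Aggregator is itself an agent; the required keys (`criterion_id`, `verdict`, `evidence`) and the output layout are a hypothetical schema chosen for illustration.

```python
import json

# Hypothetical schema: the keys every leaf response must contain.
REQUIRED_KEYS = {"criterion_id", "verdict", "evidence"}


def aggregate(leaf_outputs: list[str]) -> str:
    """Merge raw leaf JSON strings into one JSON document.

    Valid responses go under "results"; unparseable or incomplete
    responses are recorded under "errors" instead of being merged.
    """
    results, errors = [], []
    for raw in leaf_outputs:
        try:
            item = json.loads(raw)
        except json.JSONDecodeError:
            errors.append({"error": "invalid_json"})
            continue
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            errors.append({"error": "missing_keys"})
            continue
        # Keep only the expected keys so stray fields do not leak downstream.
        results.append({key: item[key] for key in sorted(REQUIRED_KEYS)})
    return json.dumps({"results": results, "errors": errors})
```

Validating at the merge point means a malformed response from one leaf is reported as an error entry rather than corrupting the combined output.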
Why Multi-Agent Instead Of A Single Prompt?
Three reasons:
- Token budget. A 30-criterion compliance check against a 40-page document can exceed context limits in a single call, depending on the model. Decomposition keeps each agent's context focused.
- Per-criterion customization. Medical criteria need different extraction logic than financial criteria. Per-agent system prompts let domain experts tune behavior without touching code.
- Failure isolation. If one criterion's extraction fails or times out, the others still complete. You get partial results instead of total failure.
The trade-off is latency. A single LLM call would be faster than 30 parallel calls, but we set a dynamic timeout of 60 + N*10 seconds (where N is the criterion count, so 260 seconds for 20 criteria) and run all leaf agents concurrently. In practice, a 20-criterion extraction completes in 40-90 seconds, which is acceptable for an async workflow.
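The timeout formula and the failure-isolation behavior can be sketched with plain `asyncio`. This is not the Strands graph executor: it assumes the timeout applies to the whole batch of leaf agents, and `run_leaf_agents` and its return shape are hypothetical.

```python
import asyncio
from typing import Any, Coroutine


def extraction_timeout(criterion_count: int) -> int:
    """Dynamic timeout in seconds: 60 + N*10, where N is the criterion count."""
    return 60 + criterion_count * 10


async def run_leaf_agents(
    leaves: dict[str, Coroutine[Any, Any, Any]],
    timeout_s: float,
) -> tuple[dict[str, Any], dict[str, str]]:
    """Run all leaf coroutines concurrently under one shared deadline.

    Returns (completed, failed), both keyed by criterion id, so one slow
    or failing leaf does not discard the results of the others.
    """
    if not leaves:
        return {}, {}
    tasks = {cid: asyncio.create_task(coro) for cid, coro in leaves.items()}
    _, pending = await asyncio.wait(set(tasks.values()), timeout=timeout_s)
    for task in pending:
        task.cancel()
    # Let the cancellations finish before inspecting results.
    await asyncio.gather(*pending, return_exceptions=True)

    completed: dict[str, Any] = {}
    failed: dict[str, str] = {}
    for cid, task in tasks.items():
        if task in pending:
            failed[cid] = "timeout"
        elif task.exception() is not None:
            failed[cid] = repr(task.exception())
        else:
            completed[cid] = task.result()
    return completed, failed
```

With this shape, the caller always gets back whatever finished before the deadline, plus a per-criterion reason for anything that did not, which is the "partial results instead of total failure" behavior described above.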
Acknowledgement
This work was shaped by the engineering effort behind DocuGenius, built by the team at ITQuarks, whose product thinking, implementation, and delivery discipline were instrumental in bringing the platform to life.
If you're stress-testing an AI document review pipeline against real claims, eligibility, or supplier files, we'd be glad to compare notes on what scales and what breaks. Talk to the DocuGenius team.
What's Next
With raw evidence extracted from each page, the system still needs to make a final compliance decision. In Part 4, we cover how boolean tree evaluation turns individual LLM verdicts into a top-level compliance outcome, how we highlight the exact evidence in the PDF using fuzzy matching, and how the DMN rule engine lets compliance teams define new rule sets without writing code.
Part 4: Evaluation, Evidence Highlighting & the DMN Rule Engine publishes Wed 2026-05-20. Follow DocuGenius on LinkedIn to get notified the moment it drops, or bookmark the blog.