A two-phase hackathon. Phase 1 — Crack the PDF: prove end-to-end extraction on a single credit memo. Phase 2 — The Loan File: cross-document verification across the full loan package.
This document has two zones.
The Contract is locked. Every implementation must satisfy everything in it, regardless of language, framework, or architectural decomposition. Anything the audience sees at the demo lives here — the input, the output schema, the arithmetic checks, the deliverables.
Reference Implementation is guidance only. It exists to help teams get started, but any layer — language, libraries, agent framework, even the specialist-team architecture itself — can be substituted for an equivalent that hits The Contract. If in doubt, the Proofreader Checks in §8 are the ultimate arbiter of whether an implementation satisfies the requirements.
This demo can run in two modes — unified-team or bake-off. Decide on Saturday morning at kickoff, after the group has read The Contract together.
The rest of this document works for either mode. Where a section differs by mode, both variants are shown.
Everything in this zone is locked. Every implementation must satisfy these sections, regardless of technology choice, architecture, or format. The audience sees all of this at the demo.
Prove the refinery works. One sample credit memo goes in; verified structured data comes out. Every field traces back to a page in the original PDF. Every number reconciles against the proofreader's arithmetic checks. Nothing silently wrong.
This is the credibility anchor for the broader supply chain. If we can show end-to-end extraction with verification on one document in a weekend, the business case for building the full platform writes itself.
A 22-page MBFS Credit Memorandum for a $514,500 multifamily real estate refinance in Philadelphia. It contains the full mixed-content profile the refinery needs to handle: form fields, multi-year financial tables, analyst narrative, a third-party industry infographic, site-visit photographs, and a signed decision form.
Sample-Enhanced-Memo.pdf (22 pages, 1.1 MB)

Multiple documents and cross-document linking (PFS, Rent Roll, Appraisal, scanned docs), cross-document verification and reconciliation, and the origination vs. servicing timeline dimension are all Phase 2 — The Loan File — scope. See §17 at the end of this document for the full preview.
At the end of Sunday, the demo must produce the following observable outcomes, regardless of how the system is built.
Sample-Enhanced-Memo.pdf produces a JSON document matching the schema in §7.

What must be true of the system when the demo runs, regardless of how the system is built: an implementation that uses a single LLM pass, a specialist-team architecture, a state machine, or any other pattern is acceptable as long as every outcome below is observable in the final output.
Every page of the source PDF must be classified by content type (at minimum: form, table, narrative, visual, skip), and the classification must influence how that page is extracted. An implementation that processes every page with one tool and ignores content type does not satisfy this outcome.
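To make "classification must influence extraction" concrete, the routing decision can be a small typed function. The heuristics below are purely illustrative — a real implementation might instead ask an LLM to classify each rendered page:

```python
from typing import Literal

PageType = Literal["form", "table", "narrative", "visual", "skip"]

def classify_page(text: str, image_area_ratio: float) -> PageType:
    """Toy heuristic router; thresholds here are illustrative, not tuned."""
    if image_area_ratio > 0.6:
        return "visual"      # mostly pictures: site photos, infographic
    if not text.strip():
        return "skip"        # blank or unextractable page
    words = text.split()
    numeric = sum(
        tok.replace(",", "").replace("$", "").replace(".", "").isdigit()
        for tok in words
    )
    if numeric > len(words) * 0.3:
        return "table"       # numeric-dense page: likely a financial table
    if ":" in text and len(words) < 200:
        return "form"        # short label:value layout
    return "narrative"
```

The return value then selects the extraction path — for example, `"table"` pages go to a table parser with a vision fallback, while `"skip"` pages are recorded but not extracted.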
The following fields must appear in the output with values correctly drawn from pages 1–2 of the sample. Each field must carry a source-page reference.
Three tables must be extracted into structured form (column → value) with correct numeric values and source-page references:
The following narrative elements must be extracted from the appropriate pages, with source-page references:
All mandatory checks from §8 must be executed against the extracted data and reported in the output, with expected value, observed value, and pass/fail status. Implementations may add additional checks beyond §8 but must not remove any.
The demo must include at least one observable case where an initial extraction attempt produced an incorrect value, the incorrect value was detected, and a corrected value was produced. The recovery must be visible in the final output — a reviewer should be able to see that the system tried one approach, the result failed a check or was flagged, and a different approach succeeded. How the detection and correction happen is an implementation choice.
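One way to make the recovery observable is to record every attempt, not just the winning one. A minimal sketch, where the extractor names and the failing first attempt are hypothetical:

```python
from typing import Any, Callable

def extract_with_recovery(
    extractors: list[tuple[str, Callable[[], Any]]],
    check: Callable[[Any], bool],
) -> list[dict]:
    """Try each extractor in order, keeping a visible trail of every attempt."""
    attempts = []
    for method, extract in extractors:
        value = extract()
        passed = check(value)
        attempts.append({"method": method, "value": value, "passed": passed})
        if passed:
            break
    return attempts

# Hypothetical scenario: lattice parsing drops a digit, vision fallback succeeds.
trail = extract_with_recovery(
    [("pdfplumber-lattice", lambda: 51450), ("claude-vision", lambda: 514500)],
    check=lambda v: v == 514500,
)
```

Emitting the full `trail` into the final output is what lets a reviewer see that one approach failed a check and a different approach succeeded.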
The final output must match the schema in §7. Every populated field must carry a reference back to the source page and the method/agent/tool that produced it. A reviewer must be able to ask "where did this value come from?" for any field and receive a specific page answer.
The schema below is a Python/Pydantic expression. Teams using other languages must produce JSON that is shape-compatible — same field names, same structure, same nesting. The schema is deliberately flat and minimal.
```python
from pydantic import BaseModel, Field
from typing import Literal


class SourceRef(BaseModel):
    page: int
    agent: str  # e.g. "form", "table", "narrative", "visual", or custom
    method: str  # e.g. "pdfplumber-lattice", "claude-vision", "regex"
    confidence: float = 1.0


class Field_(BaseModel):
    value: str | float | int | None
    source: SourceRef


class Guarantor(BaseModel):
    name: Field_
    credit_score: Field_
    ownership_pct: Field_
    stated_net_worth: Field_ | None = None


class LoanTerms(BaseModel):
    amount: Field_
    rate: Field_
    rate_type: Field_
    term_years: Field_
    amortization_years: Field_


class Collateral(BaseModel):
    asset_type: Field_
    address: Field_
    estimated_value: Field_
    ltv_pct: Field_
    lien_position: Field_


class FinancialTable(BaseModel):
    name: str
    rows: list[dict]  # column name -> value
    source: SourceRef


class ValidationCheck(BaseModel):
    name: str
    formula: str
    expected: float | str
    observed: float | str
    status: Literal["pass", "fail", "warning"]
    source_pages: list[int]


class CreditMemo(BaseModel):
    document_id: str
    memo_date: Field_
    borrower_name: Field_
    loan: LoanTerms
    collateral: Collateral
    guarantors: list[Guarantor]
    sources: list[dict] = Field(default_factory=list)
    uses: list[dict] = Field(default_factory=list)
    tables: list[FinancialTable] = Field(default_factory=list)
    strengths: list[str] = Field(default_factory=list)
    weaknesses: list[str] = Field(default_factory=list)
    analyst_notes: str | None = None
    validation: list[ValidationCheck]
```
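For teams not using Python, a shape-compatible fragment can be produced with nothing but a JSON library. The values below are illustrative, not ground truth from the sample:

```python
import json

# Illustrative fragment: same field names and nesting as the schema above.
loan_amount = {
    "value": 514500,
    "source": {
        "page": 1,                       # placeholder page reference
        "agent": "form",
        "method": "pdfplumber-lattice",  # placeholder extraction method
        "confidence": 1.0,
    },
}
fragment = {"document_id": "Sample-Enhanced-Memo", "loan": {"amount": loan_amount}}

# Shape compatibility means this survives a round trip in any JSON tooling.
round_tripped = json.loads(json.dumps(fragment))
```

The check for any non-Python implementation is simply: does the emitted JSON deserialize into the same field names and nesting as the Pydantic models?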
Mandatory arithmetic checks against the sample document. Expected values are the ground truth — if an implementation produces something different, either the extraction is wrong or the check is wrong. Either way, the demo does not ship until both agree. Tolerance: ±1 unit for integer dollar totals, ±0.01 for ratios.
| Check | Formula | Expected (from the sample) |
|---|---|---|
| C1 · Sources = Uses | sum(sources.amount) == sum(uses.amount) | $514,500 == $514,500 |
| C2 · G1 stated net worth | total_assets − total_liabilities | $4,760,177 − $1,070,737 = $3,689,440 |
| C3 · G1 adjusted net worth | assets − liabilities − adjustments | $4,760,177 − $1,070,737 − $425,000 = $3,264,440 |
| C4 · G2 stated net worth | total_assets − total_liabilities | $1,546,440 − $656,140 = $890,300 |
| C5 · Ownership sum | sum(guarantor.ownership_pct) == 100 | 50 + 50 = 100 |
| C6 · G1 DSCR 2020 | gross_cash_flow ÷ total_debt | $111,831 ÷ $33,120 ≈ 3.38 |
| C7 · Global DSCR 2019 | total_combined_cash ÷ total_combined_debt_service | $189,500 ÷ $66,050 ≈ 2.87 |
| C8 · Rent roll sum | sum(unit_rents) × 12 | ($800+$650+$650+$800+$750+$800) × 12 = $53,400 |
| C9 · Cross-page rents | narrative (p8) == transactional (p17) | $53,400 on both pages |
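Each row of the table maps directly onto a `ValidationCheck`-shaped record. A stdlib-only sketch of the numeric case, using C2's ground-truth figures from the sample (the empty `source_pages` is a placeholder — a real run records the actual page numbers):

```python
def run_check(name: str, formula: str, expected: float, observed: float,
              source_pages: list[int], tolerance: float = 1.0) -> dict:
    """Compare observed against expected within tolerance (±1 for dollar totals)."""
    status = "pass" if abs(expected - observed) <= tolerance else "fail"
    return {"name": name, "formula": formula, "expected": expected,
            "observed": observed, "status": status, "source_pages": source_pages}

# C2 — G1 stated net worth, computed from the figures in the table above.
c2 = run_check("C2 · G1 stated net worth", "total_assets - total_liabilities",
               expected=3_689_440, observed=4_760_177 - 1_070_737, source_pages=[])
```

For the ratio checks (C6, C7), the same function applies with `tolerance=0.01`.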
Every implementation — whether in unified-team or bake-off mode — delivers the following by the end of the weekend.
- A one-command demo entry point (`make demo`, `npm run demo`, or equivalent)
- A `README.md` with setup, run instructions, and how to regenerate the sample output
- `sample-output.json` committed alongside the source PDF, matching the §7 schema

Everything below is guidance. It exists to help teams get started, reduce decision fatigue on Saturday morning, and provide a known-good reference point. Any section can be ignored, replaced, or adapted — as long as The Contract above is satisfied.
The strategic brief describes a specialist-team pattern — an orchestrator that routes pages, a set of focused extraction agents, a validator (the proofreader), and a reconciler. This is the reference architecture for the refinery. Teams may adopt it as-is, modify it, or use a different decomposition entirely.
Alternative patterns that would also satisfy The Contract include: a single-LLM-pass approach with a separate validation step, a state-machine approach where each page advances through explicit extraction and verification states, or a pipeline-per-content-type approach. Pick what matches the team's comfort and the problem.
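A minimal sketch of the specialist-team decomposition: an orchestrator dispatching classified pages to registered extractors. The registry mechanism and the stub specialist are illustrative, not prescribed:

```python
from typing import Any, Callable

EXTRACTORS: dict[str, Callable[[Any], dict]] = {}

def specialist(page_type: str):
    """Register an extraction function for one page content type."""
    def decorator(fn):
        EXTRACTORS[page_type] = fn
        return fn
    return decorator

@specialist("table")
def extract_table(page) -> dict:
    # A real specialist would run pdfplumber/camelot here, with a vision fallback.
    return {"rows": []}

def orchestrate(classified_pages: list[tuple[int, str, Any]]) -> list[dict]:
    """Route each (page_number, page_type, payload) to its specialist."""
    results = []
    for number, page_type, payload in classified_pages:
        fn = EXTRACTORS.get(page_type)
        if fn is not None:  # unregistered types (e.g. "skip") are passed over
            results.append({"page": number, "type": page_type, "data": fn(payload)})
    return results
```

The same shape also fits the alternative patterns: a single-LLM-pass implementation just registers one extractor for every type, and a state-machine implementation replaces `orchestrate` with explicit state transitions.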
Recommended defaults. Substitute freely — the only hard requirement is that an LLM is used somewhere in the extraction path, because semantic understanding of the document is the reason this approach works at all.
| Layer | Default | Reasonable alternatives |
|---|---|---|
| Language | Python 3.11+ | TypeScript/Node, Go, Rust — any language with decent PDF and LLM SDK support |
| LLM | Claude (Opus 4.6 / Sonnet 4.6) | Any frontier model with vision and structured output |
| Agent framework | Claude Agent SDK | LangGraph, a homegrown orchestrator, or no framework at all |
| PDF text/tables | pdfplumber | pymupdf, pdfminer.six, tabula-py, Azure Document Intelligence, AWS Textract |
| PDF rendering | pymupdf (fitz) | pdf2image, wand |
| Table fallback | camelot-py | LLM vision on the rendered page, tabula-py |
| Schema | pydantic v2 | zod (TS), attrs, dataclasses, hand-rolled JSON schema |
| UI | Streamlit | FastAPI + minimal HTML, Next.js, CLI only |
| Scaffold | uv or poetry | pnpm, cargo, whatever's in your muscle memory |
A suggested schedule for a team that has not run this before. Bake-off pairs may run their own schedules in parallel.
`CreditMemo` round-tripping through its chosen stack.

For unified-team mode. Bake-off pairs split responsibilities however they like inside each pair.
Owns the hardest extraction paths — typically financial tables and the vision fallback. Knows the chosen PDF library well enough to debug merged-header drift in under 15 minutes.
Owns the orchestration, page routing, and the LLM integration. Defines the interface each extraction component exposes.
Owns the §8 checks, the error-recovery path, and the provenance tracking. This role is the quality gate for the whole demo.
Owns the schema, the demo surface, the integration runner, and the five-minute demo script. Also runs the rehearsal.
A suggested five-minute walkthrough. Teams may adapt the structure; the content points are what matters.
Bake-off variant. Each pair runs the same 5-minute script. After all pairs have demoed, the group spends 15 minutes on a comparative retrospective: which stack extracted fastest, which caught the hardest edge case (the red ($2,918) on page 12 is a good benchmark), which recovery path was cleverest, and what each approach would cost to scale to 20,000 memos.
| Risk | Likelihood | Mitigation |
|---|---|---|
| Over-scoping field extraction — trying to extract every field on every page | High | Extract only what's needed to run the §8 checks. Everything else is a stretch goal. |
| The hardest extraction (merged "Projected" header on page 5) consumes the weekend | High | Expected — this is the demo's punchline. The recovery path is what you're actually building. |
| Vision-based extraction attempts on page 10 (infographic) eat all of Sunday | Medium | Page 10 is explicitly a stretch goal. For the core demo, route it to skip. |
| LLM rate limits or API key issues on Saturday morning | Medium | Generate keys and verify access on Friday night. Cache LLM responses during development. |
| Proofreader checks pass trivially because extracted numbers come from the same wrong source | Medium | Prefer cross-agent or cross-page checks where possible. C9 exists for exactly this reason. |
| Bake-off: contract ambiguity leads to incompatible outputs | Medium (bake-off only) | Spend the first hour of Saturday walking through §7 and §8 together with the PDF open. Any gray area becomes three different interpretations otherwise. |
| Bake-off: unequal pair sizes or skill levels lead to one team finishing and others stalling | Medium (bake-off only) | Pre-balance pairs before splitting. Allow pairs that finish early to help others, or to attempt stretch goals. |
| Demo day: integration breaks 10 minutes before showtime | Always | Record the demo video on Sunday afternoon as a backup. Worst case: play the video. |
Phase 1 proves the refinery on a single credit memo. Phase 2 extends the challenge to the full loan file — multiple document types, cross-document verification, and a time dimension.
Phase 2 introduces a loan file containing several document types beyond the credit memo:
| Document | Description |
|---|---|
| Personal Financial Statement (PFS) | Borrower/guarantor assets, liabilities, income — must reconcile with credit memo net worth figures |
| Rent Roll | Unit-level rent schedule at a point in time — must reconcile with memo gross rents and operating statements |
| Appraisal | Third-party valuation report — must reconcile with memo collateral value, LTV, and property description |
| Scanned documents | Signed forms, tax returns, bank statements — OCR-dependent, lower structure |
The core challenge of Phase 2 is verifying consistency across documents:
Phase 2 adds an origination vs. servicing distinction:
| Tier | Name | Criteria |
|---|---|---|
| Bronze | Pipeline | Process all document types in the loan file independently; extract structured data from each |
| Silver | Crosscheck | Cross-document verification passes — values that appear in multiple documents are reconciled and discrepancies flagged |
| Gold | Timeline | Full timeline support — origination vs. servicing snapshots tracked, with drift detection across time |
Phase 1 tiers (Bronze/Silver/Gold based on checks C1–C9) remain the entry point. A team must achieve at least Phase 1 Silver (6+ checks passing) before attempting Phase 2. The Phase 1 credit memo extraction becomes one component of the larger Phase 2 loan file pipeline.