11 KiB
Research: K-1 PDF Scan Import
Phase 0 Output | Date: 2026-03-18
Decision 1: PDF Text Extraction (Tier 1 — Digital PDFs)
Decision: Use pdf-parse npm package for digitally-generated K-1 PDFs.
Rationale: Digitally-generated PDFs from fund administrators contain embedded text. pdf-parse extracts this text losslessly, is free, fully self-hosted, and instant. It has 3M+ weekly npm downloads and a stable API. No external API calls needed.
Alternatives Considered:
pdfjs-dist(Mozilla pdf.js) — lower-level, requires more boilerplate for text extraction;pdf-parsewraps this already.- Cloud OCR for all PDFs — unnecessary cost and latency for digital PDFs where text extraction is 100% accurate.
Decision 2: OCR for Scanned PDFs (Tier 2)
Decision: Use Azure AI Document Intelligence (Layout model) as primary Tier 2 provider, with tesseract.js as self-hosted fallback.
Rationale:
- Azure has the best tax-form pedigree among cloud providers (prebuilt IRS models for W-2, 1098, 1099)
- Returns per-field confidence scores (0.0–1.0) natively, directly fulfilling FR-006/FR-009
- 500 free pages/month covers typical family office volume (10–50 K-1s/year)
@azure/ai-form-recognizerhas full TypeScript types, aligns with NestJS patternstesseract.jsruns as WASM in Node.js (no system install), provides ~75% accuracy fallback
Alternatives Considered:
- Google Document AI — good form parsing but no tax-specific models, more expensive for custom processors ($30/1K pages)
- AWS Textract — strong table extraction but less established for tax forms, requires IAM setup
- Tesseract.js only — accuracy drops to 70–85% for clean scans, no layout understanding; acceptable as fallback but not primary
Decision 3: Two-Tier Extraction Architecture
Decision: Implement a PDF type detection step that routes digital PDFs to local extraction (free, instant) and scanned PDFs to cloud OCR.
Rationale: Most K-1s from fund administrators are digitally generated. The two-tier approach avoids unnecessary API calls and costs for the majority case, while still supporting scanned documents.
Detection heuristic: Extract text via pdf-parse; if extracted text length < 100 characters or does not contain K-1 keywords ("Schedule K-1", "Form 1065", "Partner's Share"), route to Tier 2 OCR.
Alternatives Considered:
- Cloud OCR for everything — simpler but adds cost ($0.15/page) and latency (3–10s) for digital PDFs that don't need it
- Local OCR only (Tesseract.js) — insufficient accuracy (75%) for production tax data; too many manual corrections needed
Decision 4: K-1 Box Extraction Strategy
Decision: Use regex-based box extraction for Tier 1 (digital text), and key-value pair extraction from the OCR provider for Tier 2. Both feed into a shared K-1 field mapper that applies the cell mapping configuration.
Rationale: The IRS Schedule K-1 (Form 1065) has a consistent, standardized layout:
- Page 1: Header + Part I (partnership info) + Part II (partner info) + Boxes 1–11
- Page 2: Boxes 12–20+ with code/sub-code details
- Box values sit in a numbered two-column grid: number label → description → value field
- Layout has been structurally stable for years, making template/regex extraction reliable
Challenges addressed:
- Multi-line sub-codes (Boxes 11, 13, 15, 16, 17, 18, 20) — handle by extracting code-letter/value pairs within each box section
- Supplemental schedules — out of scope for V1 auto-extraction; captured as additional Document attachments
- Multi-entity PDFs — detect via repeated "Schedule K-1" headers; split and process each K-1 separately
Alternatives Considered:
- Fixed coordinate-based extraction — too brittle across different PDF generators (varying margins, fonts)
- Machine learning model — overkill for V1 given the standardized form layout
Decision 5: Confidence Scoring Approach
Decision: Three-level confidence display (High/Medium/Low) derived from extraction method and validation heuristics.
Rationale:
For Tier 1 (digital text):
- Base confidence: 0.90 (text extraction is inherently reliable)
- +0.05 if box number regex matched cleanly
- +0.05 if value format validated (currency, percentage, integer)
- -0.10 to -0.30 for potential adjacent-box text contamination
For Tier 2 (cloud OCR):
- Use Azure's native per-field confidence score directly
- Layer cross-field validation (e.g., Box 6b ≤ Box 6a, sub-boxes sum to parent)
Display mapping:
- High (≥ 0.85): Green — no user attention needed
- Medium (0.60–0.84): Yellow — optional review
- Low (< 0.60): Red — highlighted, requires manual review (FR-009)
Alternatives Considered:
- Binary confidence (confident/not) — too coarse; doesn't guide the user's review attention
- Numeric score display — too technical for a non-engineer user; three levels with color coding is more actionable
Decision 6: New Database Models
Decision: Add two new Prisma models (K1ImportSession, CellMapping) to support import tracking and cell mapping configuration, alongside the existing K-document models from spec 001.
Rationale:
K1ImportSessiontracks the full import lifecycle (upload → processing → extracted → verified → confirmed/cancelled), enabling import history (FR-022) and re-processing (FR-023)CellMappingstores per-partnership cell label customizations (FR-017 through FR-021) separate from the KDocument data itself
Alternatives Considered:
- Store import sessions as JSON metadata on KDocument — would conflate document data with import workflow state; makes import history harder to query
- Store cell mappings as JSON on Partnership — would work but loses the ability to query/manage mappings independently and doesn't support a global default set
Decision 7: File Storage
Decision: Use the existing uploads/ directory and Document model from spec 001. Uploaded K-1 PDFs are stored on the local filesystem, with metadata in the Document table.
Rationale: The existing upload infrastructure (UploadController with FileInterceptor, Document model, uploads/ directory) is already in place. No need to add a new storage mechanism.
Alternatives Considered:
- S3/cloud storage — would require new infrastructure; the self-hosted philosophy favors local storage
- Database blob storage — increases database size and backup time for binary files
Decision 8: New Environment Variables
Decision: Add two optional environment variables for Azure Document Intelligence, following the existing ConfigurationService pattern with str({ default: '' }).
AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT — Azure resource endpoint URL
AZURE_DOCUMENT_INTELLIGENCE_KEY — Azure API key
Rationale: When both are empty (default), the system falls back to tesseract.js for scanned PDFs. This makes Azure optional — the feature works fully self-hosted with degraded OCR accuracy.
Alternatives Considered:
- Separate feature flag — unnecessary; empty credentials are sufficient to indicate "not configured"
- Google/AWS credentials — Azure recommended as primary; could add additional providers later
Decision 9: New npm Dependencies
Decision: Add the following packages:
| Package | Purpose | Tier |
|---|---|---|
pdf-parse |
Text extraction from digital PDFs | Tier 1 (required) |
@azure/ai-form-recognizer |
Cloud OCR for scanned PDFs | Tier 2 (optional) |
tesseract.js |
Self-hosted OCR fallback | Tier 2 fallback |
Rationale: pdf-parse is essential for the Tier 1 (free, local) path. Azure SDK is optional (only loaded when credentials are configured). tesseract.js provides a zero-config fallback that runs as WASM — no system dependencies needed, works in the existing node:22-slim Docker image.
Alternatives Considered:
pdfjs-distdirectly instead ofpdf-parse— more boilerplate,pdf-parsewraps it with a simpler API- Only cloud OCR — loses the self-hosted story and adds cost for digital PDFs
Decision 10: Cell Aggregation Rules — Dynamic Computation
Decision: Persist only aggregation rule definitions (name, source cells, operation). Compute totals dynamically from raw K-1 box values at display time. Do NOT store computed totals.
Rationale:
- K-1 values can change during the import lifecycle (estimated → final transitions, manual edits after confirmation)
- Storing computed totals creates a denormalization risk — stale aggregates when underlying values change
- Computation is trivial (summing a handful of numbers) with no performance concern at family office scale
- Keeps a single source of truth: the raw box values in K1Data
- Aggregation rules are displayed on both the verification screen (FR-033) and KDocument detail view (FR-036)
Alternatives Considered:
- Persist computed totals alongside raw data — creates stale data risk, requires update triggers
- Persist both (snapshot + live) for audit — adds complexity V1 doesn't need; audit trail exists in import session history
Decision 11: Unmapped Items Handling
Decision: Display extracted values that don't match any configured cell mapping in a separate "Unmapped Items" section on the verification screen. Administrator can assign to an existing cell, create a new custom cell, or discard.
Rationale:
- OCR/extraction may pull supplemental schedule items, footnotes, state-specific addenda
- Silently discarding loses potentially important data
- Auto-creating cells for every unmatched value creates noise
- Explicit user decision preserves data integrity while keeping mapped cells clean
- Assigned unmapped items update the cell mapping for future imports (learning effect)
Alternatives Considered:
- Silent discard — loses data, violates user's expectation of completeness
- Auto-create custom cells — too noisy; PDF footnotes and headers would create junk cells
Decision 12: Verification Auto-Accept Strategy
Decision: Auto-accept (pre-check) high-confidence values on the verification screen. Require explicit review (acknowledge or edit) for medium and low-confidence values before allowing confirmation.
Rationale:
- V1 is "partially manual, partially automated" per user intent
- High-confidence values (≥ 0.85) from digital PDFs are reliably accurate (90%+ per SC-002)
- Forcing explicit review of every cell wastes time on correct values
- Blocking confirmation until medium/low-confidence fields are reviewed catches the errors
- All values remain visible and editable — user can override any pre-accepted value
Alternatives Considered:
- Every cell requires explicit accept — too slow for 15+ fields, doesn't match "partially automated" intent
- Spot-check model (everything auto-accepted) — too risky for tax data; OCR errors would go unreviewed