Research: K-1 PDF Scan Import

Phase 0 Output | Date: 2026-03-18

Decision 1: PDF Text Extraction (Tier 1 — Digital PDFs)

Decision: Use pdf-parse npm package for digitally-generated K-1 PDFs.

Rationale: Digitally-generated PDFs from fund administrators contain embedded text. pdf-parse reads this text directly with no OCR step, so characters are not misrecognized. It is free, runs fully self-hosted with no external API calls, and adds negligible latency. It has 3M+ weekly npm downloads and a stable API.

Alternatives Considered:

pdfjs-dist (Mozilla pdf.js) — lower-level, requires more boilerplate for text extraction; pdf-parse wraps this already.
Cloud OCR for all PDFs — unnecessary cost and latency for digital PDFs, whose embedded text can be read directly without OCR.

Decision 2: OCR for Scanned PDFs (Tier 2)

Decision: Use Azure AI Document Intelligence (Layout model) as primary Tier 2 provider, with tesseract.js as self-hosted fallback.

Rationale:

Azure has the best tax-form pedigree among cloud providers (prebuilt IRS models for W-2, 1098, 1099)
Returns per-field confidence scores (0.0–1.0) natively, directly fulfilling FR-006/FR-009
500 free pages/month covers typical family office volume (10–50 K-1s/year)
@azure/ai-form-recognizer has full TypeScript types, aligns with NestJS patterns
tesseract.js runs as WASM in Node.js (no system install), provides ~75% accuracy fallback

Alternatives Considered:

Google Document AI — good form parsing but no tax-specific models, more expensive for custom processors ($30/1K pages)
AWS Textract — strong table extraction but less established for tax forms, requires IAM setup
Tesseract.js only — accuracy drops to 70–85% for clean scans, no layout understanding; acceptable as fallback but not primary

Decision 3: Two-Tier Extraction Architecture

Decision: Implement a PDF type detection step that routes digital PDFs to local extraction (free, instant) and scanned PDFs to cloud OCR.

Rationale: Most K-1s from fund administrators are digitally generated. The two-tier approach avoids unnecessary API calls and costs for the majority case, while still supporting scanned documents.

Detection heuristic: Extract text via pdf-parse; if extracted text length < 100 characters or does not contain K-1 keywords ("Schedule K-1", "Form 1065", "Partner's Share"), route to Tier 2 OCR.
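The heuristic above could be sketched as a small routing function. This is a minimal illustration, not the project's implementation. `chooseExtractionTier`, `ExtractionTier`, and the constant names are hypothetical. The sketch assumes that matching any one keyword is enough, that matching is case-insensitive, and that curly apostrophes are normalized first. The input is the text already returned by pdf-parse.

```typescript
// Hypothetical names throughout; only the threshold and keywords come from the heuristic.
type ExtractionTier = 'tier1-local' | 'tier2-ocr';

const MIN_TEXT_LENGTH = 100;
const K1_KEYWORDS = ['Schedule K-1', 'Form 1065', "Partner's Share"];

// Assumption: lowercase and straighten curly apostrophes so "Partner’s" matches "Partner's".
function normalizeForMatch(text: string): string {
  return text.replace(/\u2019/g, "'").toLowerCase();
}

function chooseExtractionTier(extractedText: string): ExtractionTier {
  const text = normalizeForMatch(extractedText.trim());
  // Too little embedded text usually means the pages are scanned images.
  if (text.length < MIN_TEXT_LENGTH) return 'tier2-ocr';
  // Assumption: a single keyword hit is enough to treat the text as a K-1.
  const hasKeyword = K1_KEYWORDS.some((keyword) => text.includes(normalizeForMatch(keyword)));
  return hasKeyword ? 'tier1-local' : 'tier2-ocr';
}

const digitalSample = 'Schedule K-1 (Form 1065) ' + 'x'.repeat(200);
console.log(chooseExtractionTier(digitalSample)); // tier1-local
console.log(chooseExtractionTier('')); // tier2-ocr
```

Requiring all keywords rather than any one would be stricter and would send more borderline documents to Tier 2; which variant is intended is not stated here.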

Alternatives Considered:

Cloud OCR for everything — simpler but adds cost ($0.15/page) and latency (3–10s) for digital PDFs that don't need it
Local OCR only (Tesseract.js) — insufficient accuracy (75%) for production tax data; too many manual corrections needed

Decision 4: K-1 Box Extraction Strategy

Decision: Use regex-based box extraction for Tier 1 (digital text), and key-value pair extraction from the OCR provider for Tier 2. Both feed into a shared K-1 field mapper that applies the cell mapping configuration.

Rationale: The IRS Schedule K-1 (Form 1065) has a consistent, standardized layout:

Page 1: Header + Part I (partnership info) + Part II (partner info) + Boxes 1–11
Page 2: Boxes 12–20+ with code/sub-code details
Box values sit in a numbered two-column grid: number label → description → value field
Layout has been structurally stable for years, making template/regex extraction reliable

Challenges addressed:

Multi-line sub-codes (Boxes 11, 13, 15, 16, 17, 18, 20) — handle by extracting code-letter/value pairs within each box section
Supplemental schedules — out of scope for V1 auto-extraction; captured as additional Document attachments
Multi-entity PDFs — detect via repeated "Schedule K-1" headers; split and process each K-1 separately
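As a rough sketch of the code-letter/value extraction described for multi-line sub-codes, the function below scans one box section line by line. The text layout (one code and one amount per line), the one-or-two-letter code pattern, and the names `parseAmount`, `extractSubCodes`, and `SubCodeEntry` are all assumptions made for illustration. Real pdf-parse output may order or wrap lines differently.

```typescript
// Hypothetical shape for one code-letter/value pair inside a box section.
interface SubCodeEntry {
  box: string;
  code: string;
  value: number;
}

// Parses amounts such as "1,250", "$1,234.50", "-300" or "(300)"; parentheses mean negative.
function parseAmount(raw: string): number | null {
  const trimmed = raw.trim();
  const negative = /^\(.*\)$/.test(trimmed) || trimmed.startsWith('-');
  const digits = trimmed.replace(/[()$,\s-]/g, '');
  if (!/^\d+(\.\d+)?$/.test(digits)) return null;
  const value = Number(digits);
  return negative ? -value : value;
}

// Assumption: each sub-code sits on its own line as "<CODE> <AMOUNT>".
function extractSubCodes(box: string, sectionText: string): SubCodeEntry[] {
  const linePattern = /^\s*([A-Z]{1,2})\s+(\(?-?\$?[\d,]+(?:\.\d+)?\)?)\s*$/;
  const entries: SubCodeEntry[] = [];
  for (const line of sectionText.split(/\r?\n/)) {
    const match = linePattern.exec(line);
    if (!match) continue; // headers and free text are skipped
    const [, code = '', rawAmount = ''] = match;
    const value = parseAmount(rawAmount);
    if (value !== null) entries.push({ box, code, value });
  }
  return entries;
}

const box13Section = ['13 Other deductions', 'A 1,250', 'H (300)', 'See attached statement'].join('\n');
console.log(extractSubCodes('13', box13Section));
```

Lines that do not fit the pattern are dropped silently in this sketch. A fuller version would keep them so they can surface as unmapped items rather than disappear.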

Alternatives Considered:

Fixed coordinate-based extraction — too brittle across different PDF generators (varying margins, fonts)
Machine learning model — overkill for V1 given the standardized form layout

Decision 5: Confidence Scoring Approach

Decision: Three-level confidence display (High/Medium/Low) derived from extraction method and validation heuristics.

Rationale:

For Tier 1 (digital text):

Base confidence: 0.90 (text extraction is inherently reliable)
+0.05 if box number regex matched cleanly
+0.05 if value format validated (currency, percentage, integer)
-0.10 to -0.30 for potential adjacent-box text contamination

For Tier 2 (cloud OCR):

Use Azure's native per-field confidence score directly
Layer cross-field validation (e.g., Box 6b ≤ Box 6a, sub-boxes sum to parent)
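The cross-field checks mentioned above might be expressed as small composable rules. This is a sketch under assumptions: the rule helpers `atMost` and `sumsTo`, the `BoxValues` shape, and the one-cent tolerance are hypothetical. How a failed rule should lower a field's confidence is not specified here, so the sketch only reports issues.

```typescript
// Hypothetical shape: box values keyed by box label, undefined when not extracted.
type BoxValues = Record<string, number | undefined>;

interface ValidationIssue {
  boxes: string[];
  message: string;
}

type CrossFieldRule = (values: BoxValues) => ValidationIssue | null;

// One box must not exceed another, e.g. Box 6b <= Box 6a.
function atMost(smaller: string, larger: string): CrossFieldRule {
  return (values) => {
    const small = values[smaller];
    const large = values[larger];
    if (small === undefined || large === undefined || small <= large) return null;
    return { boxes: [smaller, larger], message: `Box ${smaller} exceeds Box ${larger}` };
  };
}

// Sub-boxes must add up to their parent, within a small tolerance (assumed: one cent).
function sumsTo(parent: string, children: string[], tolerance = 0.01): CrossFieldRule {
  return (values) => {
    const total = values[parent];
    if (total === undefined) return null;
    let sum = 0;
    for (const child of children) {
      const part = values[child];
      if (part === undefined) return null; // skip the rule when a sub-box is missing
      sum += part;
    }
    if (Math.abs(sum - total) <= tolerance) return null;
    return { boxes: [parent, ...children], message: `Sub-boxes sum to ${sum}, but Box ${parent} is ${total}` };
  };
}

function runCrossFieldRules(values: BoxValues, rules: CrossFieldRule[]): ValidationIssue[] {
  const found: ValidationIssue[] = [];
  for (const rule of rules) {
    const issue = rule(values);
    if (issue !== null) found.push(issue);
  }
  return found;
}

const dividendIssues = runCrossFieldRules({ '6a': 1000, '6b': 1200 }, [atMost('6b', '6a')]);
console.log(dividendIssues.map((issue) => issue.message)); // [ 'Box 6b exceeds Box 6a' ]
```

Rules return `null` when an input is missing, so a partially extracted form is not flagged for values that were never read.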

Display mapping:

High (≥ 0.85): Green — no user attention needed
Medium (≥ 0.60 and < 0.85): Yellow — lighter-touch review; must still be acknowledged or edited before confirmation (Decision 12)
Low (< 0.60): Red — highlighted, requires manual review (FR-009)
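The Tier 1 scoring rules and the display mapping can be illustrated with two small functions. The names (`scoreTier1`, `toConfidenceLevel`, `Tier1Signals`) are hypothetical. Clamping to the 0 to 1 range and rounding to two decimals are assumptions added to keep floating-point drift from moving a value across a threshold.

```typescript
type ConfidenceLevel = 'High' | 'Medium' | 'Low';

// Hypothetical input shape for the Tier 1 adjustments listed above.
interface Tier1Signals {
  boxRegexMatchedCleanly: boolean;
  valueFormatValid: boolean;
  // 0 when no contamination is suspected, otherwise between 0.10 and 0.30.
  contaminationPenalty: number;
}

function scoreTier1(signals: Tier1Signals): number {
  let score = 0.9; // base confidence for digital text
  if (signals.boxRegexMatchedCleanly) score += 0.05;
  if (signals.valueFormatValid) score += 0.05;
  score -= signals.contaminationPenalty;
  const clamped = Math.min(1, Math.max(0, score));
  return Math.round(clamped * 100) / 100; // assumption: two-decimal precision
}

function toConfidenceLevel(score: number): ConfidenceLevel {
  if (score >= 0.85) return 'High';
  if (score >= 0.6) return 'Medium';
  return 'Low';
}

const clean = scoreTier1({ boxRegexMatchedCleanly: true, valueFormatValid: true, contaminationPenalty: 0 });
console.log(clean, toConfidenceLevel(clean)); // 1 High

const noisy = scoreTier1({ boxRegexMatchedCleanly: false, valueFormatValid: false, contaminationPenalty: 0.3 });
console.log(noisy, toConfidenceLevel(noisy)); // 0.6 Medium
```

With these adjustments a Tier 1 score stays between 0.60 and 1.00, so as sketched a digital-text value is never shown as Low. The Low band is reached only by Tier 2 scores or by additional penalties not listed here.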

Alternatives Considered:

Binary confidence (confident/not) — too coarse; doesn't guide the user's review attention
Numeric score display — too technical for a non-engineer user; three levels with color coding is more actionable

Decision 6: New Database Models

Decision: Add two new Prisma models (K1ImportSession, CellMapping) to support import tracking and cell mapping configuration, alongside the existing KDocument models from spec 001.

Rationale:

K1ImportSession tracks the full import lifecycle (upload → processing → extracted → verified → confirmed/cancelled), enabling import history (FR-022) and re-processing (FR-023)
CellMapping stores per-partnership cell label customizations (FR-017 through FR-021) separate from the KDocument data itself
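The import lifecycle tracked by K1ImportSession can be pictured as a small state machine. The sketch below is illustrative only. The status names follow the lifecycle listed above, but the transition table is an assumption: it lets a session be cancelled from any non-terminal state and omits a re-processing edge for FR-023, because the rules for that are not given here.

```typescript
type ImportStatus = 'upload' | 'processing' | 'extracted' | 'verified' | 'confirmed' | 'cancelled';

// Assumed transitions: forward one step at a time, or cancel before reaching a terminal state.
const ALLOWED_TRANSITIONS: Record<ImportStatus, ImportStatus[]> = {
  upload: ['processing', 'cancelled'],
  processing: ['extracted', 'cancelled'],
  extracted: ['verified', 'cancelled'],
  verified: ['confirmed', 'cancelled'],
  confirmed: [],
  cancelled: [],
};

function canTransition(from: ImportStatus, to: ImportStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}

// Returns the new status, or throws when the move is not allowed.
function transition(from: ImportStatus, to: ImportStatus): ImportStatus {
  if (!canTransition(from, to)) {
    throw new Error(`Invalid import status transition: ${from} -> ${to}`);
  }
  return to;
}

let sessionStatus: ImportStatus = 'upload';
sessionStatus = transition(sessionStatus, 'processing');
sessionStatus = transition(sessionStatus, 'extracted');
console.log(sessionStatus); // extracted
```

Keeping the table in one place makes it straightforward to add a re-processing edge later without touching the callers.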

Alternatives Considered:

Store import sessions as JSON metadata on KDocument — would conflate document data with import workflow state; makes import history harder to query
Store cell mappings as JSON on Partnership — would work but loses the ability to query/manage mappings independently and doesn't support a global default set

Decision 7: File Storage

Decision: Use the existing uploads/ directory and Document model from spec 001. Uploaded K-1 PDFs are stored on the local filesystem, with metadata in the Document table.

Rationale: The existing upload infrastructure (UploadController with FileInterceptor, Document model, uploads/ directory) is already in place. No need to add a new storage mechanism.

Alternatives Considered:

S3/cloud storage — would require new infrastructure; the self-hosted philosophy favors local storage
Database blob storage — increases database size and backup time for binary files

Decision 8: New Environment Variables

Decision: Add two optional environment variables for Azure Document Intelligence, following the existing ConfigurationService pattern with str({ default: '' }).

AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT  — Azure resource endpoint URL
AZURE_DOCUMENT_INTELLIGENCE_KEY       — Azure API key

Rationale: When both are empty (default), the system falls back to tesseract.js for scanned PDFs. This makes Azure optional — the feature works fully self-hosted with degraded OCR accuracy.
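The fallback rule can be sketched as a pure function over the two variables. `selectOcrProvider`, `OcrProvider`, and `OcrEnv` are hypothetical names. Treating a half-configured setup (only one variable set) as "not configured" is an assumption, since only the both-empty case is described above.

```typescript
type OcrProvider = 'azure-document-intelligence' | 'tesseract';

// Only the two variables from this decision; both default to an empty string.
interface OcrEnv {
  AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT?: string;
  AZURE_DOCUMENT_INTELLIGENCE_KEY?: string;
}

function selectOcrProvider(env: OcrEnv): OcrProvider {
  const endpoint = (env.AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT ?? '').trim();
  const key = (env.AZURE_DOCUMENT_INTELLIGENCE_KEY ?? '').trim();
  // Assumption: Azure needs both values; anything less falls back to tesseract.js.
  return endpoint !== '' && key !== '' ? 'azure-document-intelligence' : 'tesseract';
}

console.log(selectOcrProvider({})); // tesseract
console.log(
  selectOcrProvider({
    AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT: 'https://example.invalid/', // placeholder value
    AZURE_DOCUMENT_INTELLIGENCE_KEY: 'placeholder-key',
  }),
); // azure-document-intelligence
```

Passing the environment in as an argument keeps the selection logic testable without touching real credentials.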

Alternatives Considered:

Separate feature flag — unnecessary; empty credentials are sufficient to indicate "not configured"
Google/AWS credentials — Azure recommended as primary; could add additional providers later

Decision 9: New npm Dependencies

Decision: Add the following packages:

| Package | Purpose | Tier |
| --- | --- | --- |
| `pdf-parse` | Text extraction from digital PDFs | Tier 1 (required) |
| `@azure/ai-form-recognizer` | Cloud OCR for scanned PDFs | Tier 2 (optional) |
| `tesseract.js` | Self-hosted OCR fallback | Tier 2 fallback |

Rationale: pdf-parse is essential for the Tier 1 (free, local) path. Azure SDK is optional (only loaded when credentials are configured). tesseract.js provides a zero-config fallback that runs as WASM — no system dependencies needed, works in the existing node:22-slim Docker image.

Alternatives Considered:

pdfjs-dist directly instead of pdf-parse — more boilerplate, pdf-parse wraps it with a simpler API
Only cloud OCR — loses the self-hosted story and adds cost for digital PDFs

Decision 10: Cell Aggregation Rules — Dynamic Computation

Decision: Persist only aggregation rule definitions (name, source cells, operation). Compute totals dynamically from raw K-1 box values at display time. Do NOT store computed totals.

Rationale:

K-1 values can change during the import lifecycle (estimated → final transitions, manual edits after confirmation)
Storing computed totals creates a denormalization risk — stale aggregates when underlying values change
Computation is trivial (summing a handful of numbers) with no performance concern at family office scale
Keeps a single source of truth: the raw box values in K1Data
Aggregation rules are displayed on both the verification screen (FR-033) and KDocument detail view (FR-036)
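A minimal sketch of display-time computation follows. The `AggregationRule` shape mirrors the "name, source cells, operation" description, but the field names are hypothetical. Only a `sum` operation is shown, and treating missing cells as 0 and rounding to cents are assumptions.

```typescript
// Hypothetical rule shape; only the 'sum' operation is sketched here.
interface AggregationRule {
  name: string;
  sourceCells: string[];
  operation: 'sum';
}

type RawBoxValues = Record<string, number | undefined>;

// Computes every rule from the current raw values; nothing is cached or stored.
function computeAggregates(rules: AggregationRule[], values: RawBoxValues): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const rule of rules) {
    // Assumption: a cell with no extracted value contributes 0.
    const sum = rule.sourceCells.reduce((acc, cell) => acc + (values[cell] ?? 0), 0);
    totals[rule.name] = Math.round(sum * 100) / 100; // round to cents
  }
  return totals;
}

const exampleRules: AggregationRule[] = [
  { name: 'Interest and dividends', sourceCells: ['5', '6a'], operation: 'sum' },
];
const k1Values: RawBoxValues = { '5': 100, '6a': 250 };

console.log(computeAggregates(exampleRules, k1Values)); // { 'Interest and dividends': 350 }

k1Values['6a'] = 300; // e.g. an estimated value replaced by the final one
console.log(computeAggregates(exampleRules, k1Values)); // { 'Interest and dividends': 400 }
```

Because the total is recomputed on every call, the edit to Box 6a is reflected immediately and there is no stored aggregate to go stale.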

Alternatives Considered:

Persist computed totals alongside raw data — creates stale data risk, requires update triggers
Persist both (snapshot + live) for audit — adds complexity V1 doesn't need; audit trail exists in import session history

Decision 11: Unmapped Items Handling

Decision: Display extracted values that don't match any configured cell mapping in a separate "Unmapped Items" section on the verification screen. Administrator can assign to an existing cell, create a new custom cell, or discard.

Rationale:

OCR/extraction may pull supplemental schedule items, footnotes, state-specific addenda
Silently discarding loses potentially important data
Auto-creating cells for every unmatched value creates noise
Explicit user decision preserves data integrity while keeping mapped cells clean
Assigned unmapped items update the cell mapping for future imports (learning effect)
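The split between mapped and unmapped values, and the "learning effect" of an assignment, might look like the sketch below. All names (`ExtractedItem`, `CellMappingEntry`, `partitionExtractedItems`, `assignToCell`) are hypothetical. Matching on a trimmed, case-insensitive label is an assumption about how mappings are looked up.

```typescript
interface ExtractedItem {
  label: string;
  value: number;
}

// Hypothetical, simplified view of a cell mapping row.
interface CellMappingEntry {
  cellKey: string;
  label: string;
}

interface PartitionResult {
  mapped: Array<ExtractedItem & { cellKey: string }>;
  unmapped: ExtractedItem[];
}

const normalizeLabel = (label: string): string => label.trim().toLowerCase();

function partitionExtractedItems(items: ExtractedItem[], mappings: CellMappingEntry[]): PartitionResult {
  const cellByLabel = new Map<string, string>();
  for (const mapping of mappings) cellByLabel.set(normalizeLabel(mapping.label), mapping.cellKey);

  const result: PartitionResult = { mapped: [], unmapped: [] };
  for (const item of items) {
    const cellKey = cellByLabel.get(normalizeLabel(item.label));
    if (cellKey === undefined) result.unmapped.push(item);
    else result.mapped.push({ ...item, cellKey });
  }
  return result;
}

// Assigning an unmapped item returns an extended mapping list for future imports.
function assignToCell(mappings: CellMappingEntry[], item: ExtractedItem, cellKey: string): CellMappingEntry[] {
  return [...mappings, { cellKey, label: item.label }];
}

const mappings: CellMappingEntry[] = [{ cellKey: '1', label: 'Ordinary business income (loss)' }];
const extracted: ExtractedItem[] = [
  { label: 'Ordinary business income (loss)', value: 12000 },
  { label: 'State addendum adjustment', value: 150 },
];
console.log(partitionExtractedItems(extracted, mappings).unmapped.length); // 1
```

Discarding an item needs no mapping change in this sketch. Creating a new custom cell is the same `assignToCell` call with a new `cellKey`.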

Alternatives Considered:

Silent discard — loses data, violates user's expectation of completeness
Auto-create custom cells — too noisy; PDF footnotes and headers would create junk cells

Decision 12: Verification Auto-Accept Strategy

Decision: Auto-accept (pre-check) high-confidence values on the verification screen. Require explicit review (acknowledge or edit) for medium and low-confidence values before allowing confirmation.

Rationale:

V1 is "partially manual, partially automated" per user intent
High-confidence values (≥ 0.85) from digital PDFs are reliably accurate (90%+ per SC-002)
Forcing explicit review of every cell wastes time on correct values
Blocking confirmation until medium/low-confidence fields are reviewed catches the errors
All values remain visible and editable — user can override any pre-accepted value
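The gating logic can be sketched as two small functions: one that pre-checks high-confidence fields and one that blocks confirmation while anything is still unreviewed. `VerificationField`, `preAcceptHighConfidence`, and `canConfirm` are hypothetical names. The 0.85 threshold is the High boundary from Decision 5.

```typescript
const HIGH_CONFIDENCE_THRESHOLD = 0.85;

// Hypothetical view of one field on the verification screen.
interface VerificationField {
  box: string;
  confidence: number;
  reviewed: boolean; // true once pre-accepted, acknowledged, or edited
}

// Pre-checks high-confidence fields; other fields keep their current review state.
function preAcceptHighConfidence(fields: VerificationField[]): VerificationField[] {
  return fields.map((field) => ({
    ...field,
    reviewed: field.reviewed || field.confidence >= HIGH_CONFIDENCE_THRESHOLD,
  }));
}

// Fields the user still has to acknowledge or edit.
function fieldsBlockingConfirmation(fields: VerificationField[]): VerificationField[] {
  return fields.filter((field) => !field.reviewed);
}

function canConfirm(fields: VerificationField[]): boolean {
  return fieldsBlockingConfirmation(fields).length === 0;
}

const screen = preAcceptHighConfidence([
  { box: '1', confidence: 0.95, reviewed: false },
  { box: '13', confidence: 0.7, reviewed: false },
]);
console.log(canConfirm(screen)); // false
console.log(fieldsBlockingConfirmation(screen).map((field) => field.box)); // [ '13' ]
```

Pre-accepted fields remain ordinary entries in the list, so the user can still edit any of them before confirming.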

Alternatives Considered:

Every cell requires explicit accept — too slow for 15+ fields, doesn't match "partially automated" intent
Spot-check model (everything auto-accepted) — too risky for tax data; OCR errors would go unreviewed
