# Research: K-1 PDF Scan Import

**Phase 0 Output** | **Date**: 2026-03-18

## Decision 1: PDF Text Extraction (Tier 1 — Digital PDFs)

**Decision**: Use `pdf-parse` npm package for digitally-generated K-1 PDFs.

**Rationale**: Digitally generated PDFs from fund administrators contain embedded text. `pdf-parse` reads that text directly, so there is no recognition step and no OCR error. It is free, fully self-hosted, and effectively instant, has 3M+ weekly npm downloads, and exposes a stable API. No external API calls are needed.

**Alternatives Considered**:
- `pdfjs-dist` (Mozilla pdf.js) — lower-level, requires more boilerplate for text extraction; `pdf-parse` wraps this already.
- Cloud OCR for all PDFs — unnecessary cost and latency for digital PDFs where text extraction is 100% accurate.

---

## Decision 2: OCR for Scanned PDFs (Tier 2)

**Decision**: Use Azure AI Document Intelligence (Layout model) as primary Tier 2 provider, with `tesseract.js` as self-hosted fallback.

**Rationale**:
- Among the cloud providers evaluated, Azure has the strongest tax-form support (prebuilt IRS models for W-2, 1098, and 1099)
- Returns per-field confidence scores (0.0–1.0) natively, directly fulfilling FR-006/FR-009
- 500 free pages/month covers typical family office volume (10–50 K-1s/year)
- `@azure/ai-form-recognizer` ships full TypeScript types and fits the existing NestJS patterns
- `tesseract.js` runs as WASM in Node.js (no system install), provides ~75% accuracy fallback

**Alternatives Considered**:
- Google Document AI — good form parsing but no tax-specific models, more expensive for custom processors ($30/1K pages)
- AWS Textract — strong table extraction but less established for tax forms, requires IAM setup
- Tesseract.js only — accuracy drops to 70–85% for clean scans, no layout understanding; acceptable as fallback but not primary

---

## Decision 3: Two-Tier Extraction Architecture

**Decision**: Implement a PDF type detection step that routes digital PDFs to local extraction (free, instant) and scanned PDFs to cloud OCR.

**Rationale**: Most K-1s from fund administrators are digitally generated. The two-tier approach avoids unnecessary API calls and costs for the majority case, while still supporting scanned documents.

**Detection heuristic**: Extract text via `pdf-parse`. If the extracted text is shorter than 100 characters or contains none of the K-1 keywords ("Schedule K-1", "Form 1065", "Partner's Share"), route the PDF to Tier 2 OCR; otherwise keep it on the Tier 1 path.
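The heuristic can be sketched as a pure function over the text that `pdf-parse` returns. This is a minimal sketch, not the final implementation: the names `chooseTier` and `ExtractionTier` are hypothetical, the keyword check is read as "none of the keywords present", and the whitespace collapsing, case-insensitive match, and curly-apostrophe normalization are assumptions added to make the check tolerant of PDF text quirks.

```typescript
// Thresholds and keywords come from the detection heuristic above.
const MIN_TEXT_LENGTH = 100;
const K1_KEYWORDS = ["Schedule K-1", "Form 1065", "Partner's Share"];

// Hypothetical tier labels used only in this sketch.
type ExtractionTier = "tier1-text" | "tier2-ocr";

function chooseTier(extractedText: string): ExtractionTier {
  // Assumption: collapse whitespace and normalize curly apostrophes so that
  // layout-only output and typographic quotes do not skew the check.
  const text = extractedText
    .replace(/[\u2018\u2019]/g, "'")
    .replace(/\s+/g, " ")
    .trim();

  // Too little text: most likely a scanned image with no text layer.
  if (text.length < MIN_TEXT_LENGTH) return "tier2-ocr";

  // Assumption: one matching keyword is enough to treat the PDF as a digital K-1.
  const lower = text.toLowerCase();
  const hasKeyword = K1_KEYWORDS.some((k) => lower.includes(k.toLowerCase()));
  return hasKeyword ? "tier1-text" : "tier2-ocr";
}
```

Keeping the decision separate from the extraction call makes it easy to unit-test with plain strings and to tune the threshold later without touching the PDF code.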

**Alternatives Considered**:
- Cloud OCR for everything — simpler but adds cost ($0.15/page) and latency (3–10s) for digital PDFs that don't need it
- Local OCR only (Tesseract.js) — insufficient accuracy (75%) for production tax data; too many manual corrections needed

---

## Decision 4: K-1 Box Extraction Strategy

**Decision**: Use regex-based box extraction for Tier 1 (digital text), and key-value pair extraction from the OCR provider for Tier 2. Both feed into a shared K-1 field mapper that applies the cell mapping configuration.

**Rationale**: The IRS Schedule K-1 (Form 1065) has a consistent, standardized layout:
- Page 1: Header + Part I (partnership info) + Part II (partner info) + Boxes 1–11
- Page 2: Boxes 12–20+ with code/sub-code details
- Box values sit in a numbered two-column grid: number label → description → value field
- Layout has been structurally stable for years, making template/regex extraction reliable
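As an illustration of the regex approach for Tier 1, the sketch below parses box lines out of extracted text. It assumes each box arrives on a single line shaped like `<box number> <description> <amount>`, with negatives written in parentheses or with a leading minus; real `pdf-parse` output may order or wrap text differently. The names `BoxValue`, `parseAmount`, and `extractBoxes` are hypothetical.

```typescript
interface BoxValue {
  box: string;    // e.g. "1", "6a", "9c"
  label: string;  // description text between the box number and the amount
  amount: number;
}

// Parse "12,345", "(1,234.50)", "$500" or "-500"; returns null if not numeric.
function parseAmount(raw: string): number | null {
  const s = raw.trim();
  const negative = /^\(.*\)$/.test(s) || s.startsWith("-");
  const digits = s.replace(/[()$,\-\s]/g, "");
  if (!/^\d+(\.\d+)?$/.test(digits)) return null;
  const value = Number(digits);
  return negative ? -value : value;
}

// Assumed line shape: box number, description, then an amount at end of line.
const BOX_LINE = /^(\d{1,2}[a-c]?)\s+(.+?)\s+(\(?-?\$?\d[\d,]*(?:\.\d+)?\)?)$/;

function extractBoxes(text: string): BoxValue[] {
  const boxes: BoxValue[] = [];
  for (const line of text.split(/\r?\n/)) {
    const m = BOX_LINE.exec(line.trim());
    if (!m) continue; // no trailing amount, so the box is treated as blank
    const [, box = "", label = "", rawAmount = ""] = m;
    const amount = parseAmount(rawAmount);
    if (amount === null) continue;
    boxes.push({ box, label, amount });
  }
  return boxes;
}
```

Lines without a trailing amount, such as an empty "Net rental real estate income (loss)" box, are skipped rather than reported as zero, so the shared field mapper can distinguish "blank" from "0".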

**Challenges addressed**:
- Multi-line sub-codes (Boxes 11, 13, 15, 16, 17, 18, 20) — handle by extracting code-letter/value pairs within each box section
- Supplemental schedules — out of scope for V1 auto-extraction; captured as additional Document attachments
- Multi-entity PDFs — detect via repeated "Schedule K-1" headers; split and process each K-1 separately
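Two of these challenges can be sketched together: splitting a multi-entity PDF on repeated headers, and pulling code-letter/value pairs out of a box section. The sketch assumes the "Schedule K-1" header starts a line exactly once per K-1 and that each sub-code sits on its own line as `<code> <amount>` (for example `A 1,200` or `AH (300)`). Both are assumptions about the extracted text, and `splitK1s` and `extractCodedAmounts` are hypothetical names.

```typescript
interface CodedAmount {
  code: string;   // one or two capital letters, e.g. "A" or "AH"
  amount: number;
}

// Split at every line that begins with the header; the lookahead keeps the
// header inside the chunk it introduces. Text before the first header
// (for example a cover letter) is dropped.
function splitK1s(text: string): string[] {
  return text
    .split(/^(?=Schedule K-1)/m)
    .map((part) => part.trim())
    .filter((part) => part.startsWith("Schedule K-1"));
}

// Assumed sub-code line shape: code letters, whitespace, amount.
const CODE_LINE = /^([A-Z]{1,2})\s+(\(|-)?\$?(\d[\d,]*(?:\.\d+)?)\)?$/;

function extractCodedAmounts(section: string): CodedAmount[] {
  const pairs: CodedAmount[] = [];
  for (const line of section.split(/\r?\n/)) {
    const m = CODE_LINE.exec(line.trim());
    if (!m) continue;
    const [, code = "", negativeMarker, digits = ""] = m;
    const value = Number(digits.replace(/,/g, ""));
    // A leading "(" or "-" marks a negative amount.
    pairs.push({ code, amount: negativeMarker ? -value : value });
  }
  return pairs;
}
```

In practice the caller would first slice out the text belonging to one box (say Box 20) before calling `extractCodedAmounts`; how that slice is located is left open here.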

**Alternatives Considered**:
- Fixed coordinate-based extraction — too brittle across different PDF generators (varying margins, fonts)
- Machine learning model — overkill for V1 given the standardized form layout

---

## Decision 5: Confidence Scoring Approach

**Decision**: Three-level confidence display (High/Medium/Low) derived from extraction method and validation heuristics.

**Rationale**:

For **Tier 1** (digital text):
- Base confidence: 0.90 (text extraction is inherently reliable)
- +0.05 if box number regex matched cleanly
- +0.05 if value format validated (currency, percentage, integer)
- -0.10 to -0.30 when text from an adjacent box may have bled into the extracted value

For **Tier 2** (cloud OCR):
- Use Azure's native per-field confidence score directly
- Layer cross-field validation (e.g., Box 6b ≤ Box 6a, sub-boxes sum to parent)
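A cross-field check such as the Box 6b ≤ Box 6a example might look like the sketch below. It only reports issues; how an issue lowers a field's confidence is not specified here and is left to the caller. `BoxAmounts`, `ValidationIssue`, and `validateCrossFields` are hypothetical names, and only the single rule named above is implemented.

```typescript
// Extracted amounts keyed by box id; a missing key means the box was blank.
type BoxAmounts = Record<string, number | undefined>;

interface ValidationIssue {
  boxes: string[];
  message: string;
}

function validateCrossFields(amounts: BoxAmounts): ValidationIssue[] {
  const issues: ValidationIssue[] = [];

  const ordinary = amounts["6a"];
  const qualified = amounts["6b"];
  // Only compare when both boxes were actually extracted.
  if (ordinary !== undefined && qualified !== undefined && qualified > ordinary) {
    issues.push({
      boxes: ["6a", "6b"],
      message: "Box 6b exceeds Box 6a",
    });
  }

  return issues;
}
```

Returning the affected box ids lets the review screen highlight both fields, since either one could be the misread value.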

**Display mapping**:
- High (≥ 0.85): Green — no user attention needed
- Medium (≥ 0.60 and < 0.85): Yellow — optional review
- Low (< 0.60): Red — highlighted, requires manual review (FR-009)
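The Tier 1 scoring rules and the display mapping can be expressed as two small functions. This is a sketch under assumptions: the names `Tier1Signals`, `scoreTier1`, and `toLevel` are hypothetical, the contamination penalty is passed in as a number chosen by the caller (0 when no contamination is suspected), and the clamping and two-decimal rounding are additions to keep floating-point drift away from the band boundaries.

```typescript
type ConfidenceLevel = "high" | "medium" | "low";

interface Tier1Signals {
  boxRegexMatchedCleanly: boolean;
  valueFormatValid: boolean;
  contaminationPenalty: number; // 0, or 0.10 to 0.30 when contamination is suspected
}

function scoreTier1(signals: Tier1Signals): number {
  let score = 0.9; // base confidence for digital text
  if (signals.boxRegexMatchedCleanly) score += 0.05;
  if (signals.valueFormatValid) score += 0.05;
  score -= signals.contaminationPenalty;
  // Assumption: clamp to [0, 1] and round to two decimals.
  return Math.round(Math.min(1, Math.max(0, score)) * 100) / 100;
}

// Works for both tiers: Tier 2 would pass Azure's per-field score directly.
function toLevel(score: number): ConfidenceLevel {
  if (score >= 0.85) return "high";
  if (score >= 0.6) return "medium";
  return "low";
}
```

Under these rules the lowest possible Tier 1 score is 0.90 - 0.30 = 0.60, which still maps to Medium, so a Low rating would only come from Tier 2 scores or from additional penalties not described here.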

**Alternatives Considered**:
- Binary confidence (confident/not) — too coarse; doesn't guide the user's review attention
- Numeric score display — too technical for a non-engineer user; three levels with color coding is more actionable

---

## Decision 6: New Database Models

**Decision**: Add two new Prisma models (`K1ImportSession`, `CellMapping`) to support import tracking and cell mapping configuration, alongside the existing K-document models from spec 001.

**Rationale**:
- `K1ImportSession` tracks the full import lifecycle (upload → processing → extracted → verified → confirmed/cancelled), enabling import history (FR-022) and re-processing (FR-023)
- `CellMapping` stores per-partnership cell label customizations (FR-017 through FR-021) separate from the KDocument data itself
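To make the `K1ImportSession` lifecycle concrete, the sketch below models the statuses as a union type with a transition table. Only the forward path (upload → processing → extracted → verified → confirmed/cancelled) comes from the lifecycle above. Allowing cancellation from every non-terminal state, and modelling re-processing (FR-023) as `extracted → processing`, are assumptions, as are the names `ImportStatus`, `ALLOWED_TRANSITIONS`, and `canTransition`.

```typescript
type ImportStatus =
  | "upload"
  | "processing"
  | "extracted"
  | "verified"
  | "confirmed"
  | "cancelled";

// Assumed transition table; terminal states have no outgoing transitions.
const ALLOWED_TRANSITIONS: Record<ImportStatus, readonly ImportStatus[]> = {
  upload: ["processing", "cancelled"],
  processing: ["extracted", "cancelled"],
  extracted: ["verified", "processing", "cancelled"], // "processing" = assumed re-run path
  verified: ["confirmed", "cancelled"],
  confirmed: [],
  cancelled: [],
};

function canTransition(from: ImportStatus, to: ImportStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}
```

Guarding status updates with a check like this keeps the import history queryable by state without letting a session skip the verification step.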

**Alternatives Considered**:
- Store import sessions as JSON metadata on KDocument — would conflate document data with import workflow state; makes import history harder to query
- Store cell mappings as JSON on Partnership — would work but loses the ability to query/manage mappings independently and doesn't support a global default set

---

## Decision 7: File Storage

**Decision**: Use the existing `uploads/` directory and `Document` model from spec 001. Uploaded K-1 PDFs are stored on the local filesystem, with metadata in the `Document` table.

**Rationale**: The existing upload infrastructure (`UploadController` with `FileInterceptor`, the `Document` model, and the `uploads/` directory) is already in place, so no new storage mechanism is needed.

**Alternatives Considered**:
- S3/cloud storage — would require new infrastructure; the self-hosted philosophy favors local storage
- Database blob storage — increases database size and backup time for binary files

---

## Decision 8: New Environment Variables

**Decision**: Add two optional environment variables for Azure Document Intelligence, following the existing `ConfigurationService` pattern with `str({ default: '' })`.

```
AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT  — Azure resource endpoint URL
AZURE_DOCUMENT_INTELLIGENCE_KEY       — Azure API key
```

**Rationale**: When both are empty (default), the system falls back to `tesseract.js` for scanned PDFs. This makes Azure optional: the feature still works fully self-hosted, with reduced OCR accuracy on scanned PDFs.
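The fallback rule can be sketched as a small selector over the two variables. The function takes a plain object rather than reading `process.env`, so it is independent of how `ConfigurationService` exposes values. `OcrEnv`, `OcrProvider`, and `selectOcrProvider` are hypothetical names, and treating a half-configured pair (only one variable set) as "not configured" is an assumption.

```typescript
type OcrProvider = "azure" | "tesseract";

interface OcrEnv {
  AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT?: string;
  AZURE_DOCUMENT_INTELLIGENCE_KEY?: string;
}

function selectOcrProvider(env: OcrEnv): OcrProvider {
  const endpoint = (env.AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT ?? "").trim();
  const key = (env.AZURE_DOCUMENT_INTELLIGENCE_KEY ?? "").trim();
  // Assumption: Azure is used only when both values are present.
  return endpoint !== "" && key !== "" ? "azure" : "tesseract";
}
```

For example, `selectOcrProvider({})` yields `"tesseract"`, which matches the default of both variables being empty strings.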

**Alternatives Considered**:
- Separate feature flag — unnecessary; empty credentials are sufficient to indicate "not configured"
- Google/AWS credentials — Azure recommended as primary; could add additional providers later

---

## Decision 9: New npm Dependencies

**Decision**: Add the following packages:

| Package | Purpose | Tier |
|---|---|---|
| `pdf-parse` | Text extraction from digital PDFs | Tier 1 (required) |
| `@azure/ai-form-recognizer` | Cloud OCR for scanned PDFs | Tier 2 (optional) |
| `tesseract.js` | Self-hosted OCR fallback | Tier 2 fallback |

**Rationale**: `pdf-parse` is essential for the Tier 1 (free, local) path. The Azure SDK is optional and only loaded when credentials are configured. `tesseract.js` provides a zero-config fallback that runs as WASM, so it needs no system dependencies and works in the existing `node:22-slim` Docker image.

**Alternatives Considered**:
- `pdfjs-dist` directly instead of `pdf-parse` — more boilerplate, `pdf-parse` wraps it with a simpler API
- Only cloud OCR — loses the self-hosted story and adds cost for digital PDFs