From 269eed7faff35be894ed4de43c8816d26c5e3d30 Mon Sep 17 00:00:00 2001 From: Robert Patch Date: Wed, 18 Mar 2026 00:25:12 -0700 Subject: [PATCH] plan: K-1 scan import implementation plan (Phase 0+1) - research.md: 9 architectural decisions (two-tier OCR, Azure+tesseract fallback) - data-model.md: K1ImportSession, CellMapping models, K1ImportStatus enum - contracts/k1-import-api.md: 10 REST endpoints - quickstart.md: file structure, setup guide, workflow - plan.md: summary, technical context, constitution check, project structure - Updated copilot agent context with new tech stack entries --- .github/agents/copilot-instructions.md | 5 +- .../contracts/k1-import-api.md | 381 ++++++++++++++++++ specs/004-k1-scan-import/data-model.md | 241 +++++++++++ specs/004-k1-scan-import/plan.md | 121 ++++++ specs/004-k1-scan-import/quickstart.md | 120 ++++++ specs/004-k1-scan-import/research.md | 154 +++++++ 6 files changed, 1021 insertions(+), 1 deletion(-) create mode 100644 specs/004-k1-scan-import/contracts/k1-import-api.md create mode 100644 specs/004-k1-scan-import/data-model.md create mode 100644 specs/004-k1-scan-import/plan.md create mode 100644 specs/004-k1-scan-import/quickstart.md create mode 100644 specs/004-k1-scan-import/research.md diff --git a/.github/agents/copilot-instructions.md b/.github/agents/copilot-instructions.md index eeb98f9dc..89c495b5e 100644 --- a/.github/agents/copilot-instructions.md +++ b/.github/agents/copilot-instructions.md @@ -1,10 +1,12 @@ # portfolio-management Development Guidelines -Auto-generated from all feature plans. Last updated: 2026-03-16 +Auto-generated from all feature plans. 
Last updated: 2026-03-18 ## Active Technologies - TypeScript 5.9.2, Node.js >= 22.18.0 + Angular 21.1.1, NestJS 11.1.14, Angular Material 21.1.1, Prisma 6.19.0, big.js, date-fns 4.1.0 (003-portfolio-performance-views) - PostgreSQL via Prisma ORM (003-portfolio-performance-views) +- TypeScript 5.9.2, Node.js ≥ 22.18.0 + NestJS 11.x (backend), Angular 21.x (frontend), Prisma 6.x (ORM), pdf-parse (PDF text), @azure/ai-form-recognizer (cloud OCR), tesseract.js (local OCR fallback) (004-k1-scan-import) +- PostgreSQL via Prisma (structured data), local filesystem `uploads/` (PDF files) (004-k1-scan-import) - TypeScript 5.9.2, Node.js ≥22.18.0 + NestJS 11.1.14 (API), Angular 21.1.1 + Angular Material 21.1.1 (client), Prisma 6.19.0 (ORM), Nx 22.5.3 (monorepo), big.js (decimal math), date-fns 4.1.0, chart.js 4.5.1, Bull 4.16.5 (job queues), Redis (caching), yahoo-finance2 3.13.2 (001-family-office-transform) @@ -25,6 +27,7 @@ npm test; npm run lint TypeScript 5.9.2, Node.js ≥22.18.0: Follow standard conventions ## Recent Changes +- 004-k1-scan-import: Added TypeScript 5.9.2, Node.js ≥ 22.18.0 + NestJS 11.x (backend), Angular 21.x (frontend), Prisma 6.x (ORM), pdf-parse (PDF text), @azure/ai-form-recognizer (cloud OCR), tesseract.js (local OCR fallback) - 003-portfolio-performance-views: Added TypeScript 5.9.2, Node.js >= 22.18.0 + Angular 21.1.1, NestJS 11.1.14, Angular Material 21.1.1, Prisma 6.19.0, big.js, date-fns 4.1.0 - 001-family-office-transform: Added TypeScript 5.9.2, Node.js ≥22.18.0 + NestJS 11.1.14 (API), Angular 21.1.1 + Angular Material 21.1.1 (client), Prisma 6.19.0 (ORM), Nx 22.5.3 (monorepo), big.js (decimal math), date-fns 4.1.0, chart.js 4.5.1, Bull 4.16.5 (job queues), Redis (caching), yahoo-finance2 3.13.2 diff --git a/specs/004-k1-scan-import/contracts/k1-import-api.md b/specs/004-k1-scan-import/contracts/k1-import-api.md new file mode 100644 index 000000000..eb9451548 --- /dev/null +++ b/specs/004-k1-scan-import/contracts/k1-import-api.md @@ -0,0 
+1,381 @@ +# API Contracts: K-1 Import + +**Phase 1 Output** | **Date**: 2026-03-18 + +## Base Path + +All endpoints under `/api/v1/k1-import/` + +## Authentication + +All endpoints require JWT authentication (`AuthGuard('jwt')`) and appropriate permissions via `HasPermissionGuard`. + +--- + +## Endpoints + +### POST /api/v1/k1-import/upload + +Upload a K-1 PDF and initiate extraction. + +**Permission**: `createKDocument` + +**Request**: `multipart/form-data` + +| Field | Type | Required | Description | +| --------------- | -------- | -------- | ------------------------------------ | +| `file` | File | Yes | PDF file (max 25 MB, MIME: application/pdf) | +| `partnershipId` | `string` | Yes | Target partnership UUID | +| `taxYear` | `number` | Yes | Tax year for this K-1 | + +**Response**: `201 Created` + +```json +{ + "id": "uuid", + "partnershipId": "uuid", + "status": "PROCESSING", + "taxYear": 2025, + "fileName": "K1-Smith-Capital-2025.pdf", + "fileSize": 245760, + "extractionMethod": "pdf-parse", + "createdAt": "2026-03-18T00:00:00.000Z" +} +``` + +**Errors**: + +| Status | Condition | +| ------ | -------------------------------------- | +| 400 | File is not a valid PDF | +| 400 | File exceeds 25 MB size limit | +| 400 | Partnership not found or not owned by user | +| 400 | Partnership has no active members | +| 400 | Tax year < partnership inception year | + +--- + +### GET /api/v1/k1-import/:id + +Get the current state of an import session, including extraction results. 
+ +**Permission**: `readKDocument` + +**Response**: `200 OK` + +```json +{ + "id": "uuid", + "partnershipId": "uuid", + "status": "EXTRACTED", + "taxYear": 2025, + "fileName": "K1-Smith-Capital-2025.pdf", + "fileSize": 245760, + "extractionMethod": "pdf-parse", + "rawExtraction": { + "metadata": { + "partnershipName": "Smith Capital Partners LP", + "partnershipEin": "12-3456789", + "partnerName": "Smith Family Trust", + "partnerEin": "98-7654321", + "taxYear": 2025, + "isAmended": false, + "isFinal": true + }, + "fields": [ + { + "boxNumber": "1", + "label": "Ordinary business income (loss)", + "customLabel": null, + "rawValue": "$52,340", + "numericValue": 52340, + "confidence": 0.95, + "confidenceLevel": "HIGH", + "isUserEdited": false + } + ], + "overallConfidence": 0.92, + "method": "pdf-parse", + "pagesProcessed": 2 + }, + "verifiedData": null, + "documentId": "uuid", + "kDocumentId": null, + "errorMessage": null, + "createdAt": "2026-03-18T00:00:00.000Z", + "updatedAt": "2026-03-18T00:00:05.000Z" +} +``` + +**Errors**: + +| Status | Condition | +| ------ | ----------------------------------- | +| 404 | Import session not found | +| 403 | Import session belongs to different user | + +--- + +### PUT /api/v1/k1-import/:id/verify + +Submit user-verified/edited extraction data. Transitions status from EXTRACTED to VERIFIED. 
+ +**Permission**: `updateKDocument` + +**Request**: `application/json` + +```json +{ + "taxYear": 2025, + "fields": [ + { + "boxNumber": "1", + "label": "Ordinary business income (loss)", + "customLabel": null, + "rawValue": "$52,340", + "numericValue": 52340, + "confidence": 0.95, + "confidenceLevel": "HIGH", + "isUserEdited": false + }, + { + "boxNumber": "11", + "label": "Other income (loss)", + "customLabel": "Section 1256 contracts", + "rawValue": "$8,200", + "numericValue": 8200, + "confidence": 0.72, + "confidenceLevel": "MEDIUM", + "isUserEdited": true + } + ] +} +``` + +**Response**: `200 OK` — Updated import session with status `VERIFIED` + +**Errors**: + +| Status | Condition | +| ------ | ----------------------------------------------- | +| 400 | Import session is not in EXTRACTED status | +| 400 | Fields array is empty | +| 404 | Import session not found | + +--- + +### POST /api/v1/k1-import/:id/confirm + +Confirm verified data and trigger automatic model object creation (KDocument, Distributions, Document linkage). + +**Permission**: `createKDocument` + +**Request**: `application/json` + +```json +{ + "filingStatus": "DRAFT", + "existingKDocumentAction": null +} +``` + +| Field | Type | Required | Description | +| ------------------------- | ----------------------------- | -------- | ---------------------------------------- | +| `filingStatus` | `"DRAFT" \| "ESTIMATED" \| "FINAL"` | Yes | Status for the created/updated KDocument | +| `existingKDocumentAction` | `"UPDATE" \| "CREATE_NEW" \| null` | No | Action if KDocument already exists | + +**Response**: `201 Created` + +```json +{ + "importSession": { + "id": "uuid", + "status": "CONFIRMED" + }, + "kDocument": { + "id": "uuid", + "partnershipId": "uuid", + "type": "K1", + "taxYear": 2025, + "filingStatus": "DRAFT", + "data": { "ordinaryIncome": 52340, "..." : "..." 
} + }, + "distributions": [ + { + "id": "uuid", + "entityId": "uuid", + "partnershipId": "uuid", + "type": "RETURN_OF_CAPITAL", + "amount": 60000, + "date": "2025-12-31T00:00:00.000Z" + } + ], + "allocations": [ + { + "entityId": "uuid", + "entityName": "Smith Family Trust", + "ownershipPercent": 60, + "allocatedValues": { "ordinaryIncome": 31404, "..." : "..." } + } + ], + "document": { + "id": "uuid", + "type": "K1", + "name": "K1-Smith-Capital-2025.pdf" + } +} +``` + +**Errors**: + +| Status | Condition | +| ------ | ------------------------------------------------------------- | +| 400 | Import session is not in VERIFIED status | +| 400 | Partnership has no active members | +| 409 | KDocument already exists for this partnership/year and no action specified | + +--- + +### POST /api/v1/k1-import/:id/cancel + +Cancel an import session. No model objects are created. + +**Permission**: `updateKDocument` + +**Response**: `200 OK` — Updated import session with status `CANCELLED` + +**Errors**: + +| Status | Condition | +| ------ | --------------------------------------------- | +| 400 | Import session is already CONFIRMED or CANCELLED | +| 404 | Import session not found | + +--- + +### GET /api/v1/k1-import/history + +List import sessions for a partnership, ordered by creation date descending. 
+ +**Permission**: `readKDocument` + +**Query Parameters**: + +| Param | Type | Required | Description | +| --------------- | -------- | -------- | ------------------------------ | +| `partnershipId` | `string` | Yes | Partnership UUID | +| `taxYear` | `number` | No | Filter by tax year | + +**Response**: `200 OK` — Array of import session summaries + +```json +[ + { + "id": "uuid", + "partnershipId": "uuid", + "status": "CONFIRMED", + "taxYear": 2025, + "fileName": "K1-Smith-Capital-2025.pdf", + "extractionMethod": "pdf-parse", + "kDocumentId": "uuid", + "createdAt": "2026-03-18T00:00:00.000Z" + } +] +``` + +--- + +### POST /api/v1/k1-import/:id/reprocess + +Re-run extraction on a previously uploaded PDF using the current cell mapping configuration. + +**Permission**: `updateKDocument` + +**Response**: `200 OK` — New import session with status `PROCESSING` (original session unchanged) + +**Errors**: + +| Status | Condition | +| ------ | ------------------------------------------- | +| 400 | Original import session has no stored document | +| 404 | Import session not found | + +--- + +## Cell Mapping Endpoints + +### GET /api/v1/cell-mapping + +Get cell mappings for a partnership (with global defaults for unmapped boxes). + +**Permission**: `readKDocument` + +**Query Parameters**: + +| Param | Type | Required | Description | +| --------------- | -------- | -------- | ---------------------------------------- | +| `partnershipId` | `string` | No | Partnership UUID (omit for global defaults) | + +**Response**: `200 OK` + +```json +[ + { + "id": "uuid", + "partnershipId": null, + "boxNumber": "1", + "label": "Ordinary business income (loss)", + "description": "IRS Schedule K-1 Box 1", + "isCustom": false, + "sortOrder": 1 + } +] +``` + +--- + +### PUT /api/v1/cell-mapping + +Update or create cell mappings for a partnership. 
+ +**Permission**: `updateKDocument` + +**Request**: `application/json` + +```json +{ + "partnershipId": "uuid", + "mappings": [ + { + "boxNumber": "11", + "label": "Section 1256 contracts", + "description": "Custom label for Box 11", + "isCustom": false + }, + { + "boxNumber": "20-Z", + "label": "Qualified Business Income (Section 199A)", + "description": "Custom additional box", + "isCustom": true + } + ] +} +``` + +**Response**: `200 OK` — Updated mappings array + +--- + +### DELETE /api/v1/cell-mapping/reset + +Reset a partnership's cell mappings to IRS defaults (deletes all custom mappings for the partnership). + +**Permission**: `updateKDocument` + +**Query Parameters**: + +| Param | Type | Required | Description | +| --------------- | -------- | -------- | ------------------ | +| `partnershipId` | `string` | Yes | Partnership UUID | + +**Response**: `200 OK` diff --git a/specs/004-k1-scan-import/data-model.md b/specs/004-k1-scan-import/data-model.md new file mode 100644 index 000000000..419cef2db --- /dev/null +++ b/specs/004-k1-scan-import/data-model.md @@ -0,0 +1,241 @@ +# Data Model: K-1 PDF Scan Import + +**Phase 1 Output** | **Date**: 2026-03-18 + +## Overview + +This feature adds 2 new Prisma models and 1 new enum to support K-1 PDF scanning, import session tracking, and cell mapping configuration. It extends the existing models from spec 001-family-office-transform (KDocument, Distribution, Document, PartnershipMembership) with automatic creation from scanned data. 
+ +### Entity Relationship Diagram (Conceptual) + +``` +User (existing) + └── Partnership (existing from 001) + ├── K1ImportSession[] ──┬── Document (uploaded PDF, existing from 001) + │ (new model) ├── KDocument (auto-created, existing from 001) + │ └── CellMapping (per-partnership config) + ├── PartnershipMembership[] (existing from 001) + │ └── [K-1 allocations computed at confirm time] + ├── KDocument[] (existing from 001) + │ └── Distribution[] (auto-created from Box 19, existing from 001) + └── CellMapping[] (new model, per-partnership overrides) + +Global CellMapping (partnershipId = null) ── IRS default box definitions +``` + +## New Enum + +### K1ImportStatus + +Tracks the lifecycle of a K-1 import session. + +| Value | Description | +| ------------ | -------------------------------------------------------------- | +| `PROCESSING` | PDF uploaded, extraction in progress | +| `EXTRACTED` | Extraction complete, awaiting user review | +| `VERIFIED` | User has reviewed/edited values, ready for confirmation | +| `CONFIRMED` | User confirmed, model objects created (KDocument, Distributions) | +| `CANCELLED` | User cancelled, no model objects created | +| `FAILED` | Extraction failed (invalid PDF, OCR error, etc.) | + +## New Models + +### K1ImportSession + +A record of a single K-1 PDF import attempt, tracking the full lifecycle from upload through confirmation. 
+ +| Field | Type | Constraints | Description | +| ------------------ | ---------------- | ---------------------------- | --------------------------------------------------------------- | +| `id` | `String` | PK, UUID, auto-generated | Unique identifier | +| `partnershipId` | `String` | FK → Partnership.id, indexed | Target partnership for this K-1 import | +| `userId` | `String` | FK → User.id, indexed | User who initiated the import | +| `status` | `K1ImportStatus` | Required, Default: PROCESSING | Current lifecycle status | +| `taxYear` | `Int` | Required | Tax year extracted or specified by user | +| `fileName` | `String` | Required | Original filename of uploaded PDF | +| `fileSize` | `Int` | Required | File size in bytes | +| `extractionMethod` | `String` | Required | Method used: "pdf-parse", "azure", "tesseract" | +| `rawExtraction` | `Json?` | Optional | Raw extraction results before user edits | +| `verifiedData` | `Json?` | Optional | User-verified/edited extraction results (K1ExtractionResult) | +| `documentId` | `String?` | FK → Document.id, optional | Linked uploaded PDF Document record | +| `kDocumentId` | `String?` | FK → KDocument.id, optional | Resulting KDocument (set on CONFIRMED status) | +| `errorMessage` | `String?` | Optional | Error details if status is FAILED | +| `createdAt` | `DateTime` | Default: now() | Upload timestamp | +| `updatedAt` | `DateTime` | Auto-updated | Last modification timestamp | + +**Relations**: + +- `partnership` → `Partnership` (many-to-one, cascade delete) +- `user` → `User` (many-to-one, cascade delete) +- `document` → `Document?` (many-to-one, optional) +- `kDocument` → `KDocument?` (many-to-one, optional) + +**Indexes**: `@@index([partnershipId, taxYear])` for import history queries per partnership/year. + +### CellMapping + +A configuration defining how K-1 box numbers map to labels. Supports a global IRS-default set (partnershipId = null) and per-partnership customizations. 
+ +| Field | Type | Constraints | Description | +| --------------- | ---------- | -------------------------------------- | ---------------------------------------------------- | +| `id` | `String` | PK, UUID, auto-generated | Unique identifier | +| `partnershipId` | `String?` | FK → Partnership.id, optional, indexed | Partnership this mapping applies to (null = global) | +| `boxNumber` | `String` | Required | K-1 box identifier (e.g., "1", "6a", "19a", "20-A") | +| `label` | `String` | Required | Display label (e.g., "Ordinary business income") | +| `description` | `String?` | Optional | Extended description or IRS instructions | +| `isCustom` | `Boolean` | Default: false | Whether this is a user-added custom cell | +| `sortOrder` | `Int` | Required | Display order in the verification screen | +| `createdAt` | `DateTime` | Default: now() | Creation timestamp | +| `updatedAt` | `DateTime` | Auto-updated | Last modification timestamp | + +**Relations**: + +- `partnership` → `Partnership?` (many-to-one, optional, cascade delete) + +**Unique constraint**: `@@unique([partnershipId, boxNumber])` allows one mapping per box per partnership. PostgreSQL treats NULLs as distinct in unique indexes by default, so this constraint alone does not prevent duplicate global rows (partnershipId = null) for the same box; global uniqueness needs an application-level check or a partial unique index. 
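A minimal sketch of how the global-default and per-partnership rows could be combined when loading mappings for one partnership, assuming the rows have already been fetched. The `CellMappingRow` shape and the `resolveCellMappings` helper are hypothetical names for illustration and are not part of this plan.

```typescript
// Hypothetical, trimmed-down view of a CellMapping record.
interface CellMappingRow {
  partnershipId: string | null; // null = global IRS default
  boxNumber: string;
  label: string;
  sortOrder: number;
}

// Returns one row per box: the partnership's own row when it exists,
// otherwise the global default for that box.
function resolveCellMappings(
  rows: CellMappingRow[],
  partnershipId: string
): CellMappingRow[] {
  const byBox = new Map<string, CellMappingRow>();

  // Load global defaults first...
  for (const row of rows) {
    if (row.partnershipId === null) {
      byBox.set(row.boxNumber, row);
    }
  }
  // ...then let partnership-specific rows replace them box by box.
  for (const row of rows) {
    if (row.partnershipId === partnershipId) {
      byBox.set(row.boxNumber, row);
    }
  }

  return Array.from(byBox.values()).sort((a, b) => a.sortOrder - b.sortOrder);
}
```

Rows belonging to other partnerships are ignored, so the same function works whether the caller fetched only the relevant rows or the whole table.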
+ +## Modifications to Existing Models + +### Partnership (from spec 001) + +Add back-references — no column changes: + +| New Field | Type | Description | +| ----------------- | -------------------- | ------------------------------------ | +| `importSessions` | `K1ImportSession[]` | Import attempts for this partnership | +| `cellMappings` | `CellMapping[]` | Custom cell mapping configurations | + +### KDocument (from spec 001) + +Add back-reference — no column changes: + +| New Field | Type | Description | +| ---------------- | ------------------- | ---------------------------------------- | +| `importSession` | `K1ImportSession?` | Import session that created this record | + +## Application-Layer Types + +### K1ExtractionResult (TypeScript interface) + +The structure returned by the extraction service and stored in `K1ImportSession.rawExtraction` and `K1ImportSession.verifiedData`. + +```typescript +interface K1ExtractionResult { + /** Extracted metadata from the K-1 header */ + metadata: { + partnershipName: string | null; + partnershipEin: string | null; + partnerName: string | null; + partnerEin: string | null; + taxYear: number | null; + isAmended: boolean; + isFinal: boolean; + }; + + /** Extracted box values */ + fields: K1ExtractedField[]; + + /** Overall extraction confidence (0.0–1.0) */ + overallConfidence: number; + + /** Extraction method used */ + method: 'pdf-parse' | 'azure' | 'tesseract'; + + /** Number of pages processed */ + pagesProcessed: number; +} + +interface K1ExtractedField { + /** Box identifier (e.g., "1", "6a", "19a") */ + boxNumber: string; + + /** Display label from cell mapping */ + label: string; + + /** Custom label override by user (null if not overridden) */ + customLabel: string | null; + + /** Extracted raw text value */ + rawValue: string; + + /** Parsed numeric value (null if unparseable) */ + numericValue: number | null; + + /** Confidence score (0.0–1.0) */ + confidence: number; + + /** Confidence level for display */ + 
 confidenceLevel: 'HIGH' | 'MEDIUM' | 'LOW'; + + /** Whether user has manually edited this value */ + isUserEdited: boolean; +} +``` + +### K1ConfirmationRequest (TypeScript interface) + +The request body when the user confirms verified K-1 data. + +```typescript +interface K1ConfirmationRequest { + /** Import session ID */ + importSessionId: string; + + /** Tax year (may have been overridden by user) */ + taxYear: number; + + /** Filing status for the new KDocument */ + filingStatus: 'DRAFT' | 'ESTIMATED' | 'FINAL'; + + /** Verified fields with any user edits applied */ + fields: K1ExtractedField[]; + + /** Whether to update an existing KDocument (null = create new) */ + existingKDocumentAction: 'UPDATE' | 'CREATE_NEW' | null; +} +``` + +### Default IRS K-1 Cell Mapping + +The standard box definitions seeded as global CellMapping records (partnershipId = null): + +| boxNumber | label | sortOrder | +| --------- | ----------------------------------------- | --------- | +| 1 | Ordinary business income (loss) | 1 | +| 2 | Net rental real estate income (loss) | 2 | +| 3 | Other net rental income (loss) | 3 | +| 4a | Guaranteed payments for services | 4 | +| 4b | Guaranteed payments for capital | 5 | +| 4c | Total guaranteed payments | 6 | +| 5 | Interest income | 7 | +| 6a | Ordinary dividends | 8 | +| 6b | Qualified dividends | 9 | +| 6c | Dividend equivalents | 10 | +| 7 | Royalties | 11 | +| 8 | Net short-term capital gain (loss) | 12 | +| 9a | Net long-term capital gain (loss) | 13 | +| 9b | Collectibles (28%) gain (loss) | 14 | +| 9c | Unrecaptured section 1250 gain | 15 | +| 10 | Net section 1231 gain (loss) | 16 | +| 11 | Other income (loss) | 17 | +| 12 | Section 179 deduction | 18 | +| 13 | Other deductions | 19 | +| 14 | Self-employment earnings (loss) | 20 | +| 15 | Credits | 21 | +| 16 | Foreign transactions | 22 | +| 17 | Alternative minimum tax (AMT) items | 23 | +| 18 | Tax-exempt income and nondeductible expenses | 24 | +| 19a | Distributions — Cash and 
marketable securities | 25 | +| 19b | Distributions — Other property | 26 | +| 20 | Other information | 27 | +| 21 | Foreign taxes paid or accrued | 28 | + +## Validation Rules + +1. **Import session partnership**: Must reference an existing partnership owned by the current user. +2. **Import session tax year**: Must be ≥ year of the partnership's inception date. +3. **File upload**: Must be a valid PDF, ≤ 25 MB. System rejects non-PDF MIME types. +4. **Extraction status transitions**: Only valid transitions: PROCESSING → EXTRACTED → VERIFIED → CONFIRMED/CANCELLED, or PROCESSING → FAILED. No backwards transitions. +5. **Cell mapping uniqueness**: One mapping per (partnershipId, boxNumber). Custom mappings for a partnership override the global default for that box number. +6. **Confirmation prerequisites**: Can only confirm when status is VERIFIED, partnership has at least one active member, and verifiedData is not null. +7. **Duplicate KDocument check**: Before creating a KDocument, check for existing (partnershipId, type=K1, taxYear). If found, require explicit user decision (update existing or reject). +8. **Distribution allocation**: Box 19a/19b amounts are allocated to members by ownership percentage as of the tax year's fiscal year end. Allocation amounts must sum exactly to the partnership-level total (handle rounding by adjusting the largest member's allocation). 
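The rounding rule in item 8 can be sketched as follows. This is an illustration under stated assumptions, not the planned `k1-allocation.service.ts`: amounts are held as integer cents to avoid floating-point drift (the real service may well use big.js instead), ownership percentages are assumed to sum to 100, and the `MemberShare`, `Allocation`, and `allocateByOwnership` names are hypothetical.

```typescript
// Hypothetical input: one active member and its ownership percentage (60 = 60%).
interface MemberShare {
  entityId: string;
  ownershipPercent: number;
}

// Hypothetical output: the member's share of the total, in integer cents.
interface Allocation {
  entityId: string;
  amountCents: number;
}

function allocateByOwnership(
  totalCents: number,
  members: MemberShare[]
): Allocation[] {
  if (members.length === 0) {
    throw new Error('Partnership has no active members');
  }

  // Round each member's share independently to the nearest cent.
  const allocations: Allocation[] = members.map((m) => ({
    entityId: m.entityId,
    amountCents: Math.round((totalCents * m.ownershipPercent) / 100),
  }));

  // Whatever rounding gained or lost is pushed onto the largest member,
  // so the allocations sum exactly to the partnership-level total.
  const allocated = allocations.reduce((sum, a) => sum + a.amountCents, 0);
  const remainder = totalCents - allocated;

  let largestIndex = 0;
  let largestPercent = -Infinity;
  members.forEach((m, i) => {
    if (m.ownershipPercent > largestPercent) {
      largestPercent = m.ownershipPercent;
      largestIndex = i;
    }
  });

  const target = allocations[largestIndex];
  if (target !== undefined) {
    target.amountCents += remainder;
  }
  return allocations;
}
```

When two members tie for the largest percentage, this sketch adjusts the first one in the list; the plan does not say how ties should be broken, so that choice is an assumption.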
diff --git a/specs/004-k1-scan-import/plan.md b/specs/004-k1-scan-import/plan.md new file mode 100644 index 000000000..c2baf2eec --- /dev/null +++ b/specs/004-k1-scan-import/plan.md @@ -0,0 +1,121 @@ +# Implementation Plan: K-1 PDF Scan Import + +**Branch**: `004-k1-scan-import` | **Date**: 2026-03-18 | **Spec**: [spec.md](spec.md) +**Input**: Feature specification from `/specs/004-k1-scan-import/spec.md` + +## Summary + +Automated K-1 PDF scanning that extracts structured IRS Schedule K-1 (Form 1065) data from uploaded PDFs, presents a verification screen for manual review/correction, and auto-creates downstream model objects (KDocument, Distributions, member allocations, Document). Uses a two-tier extraction approach: `pdf-parse` for digital PDFs (free, instant, local) and Azure AI Document Intelligence / `tesseract.js` fallback for scanned PDFs. Supports per-partnership cell mapping customization and import history with re-processing. + +## Technical Context + +**Language/Version**: TypeScript 5.9.2, Node.js ≥ 22.18.0 +**Primary Dependencies**: NestJS 11.x (backend), Angular 21.x (frontend), Prisma 6.x (ORM), pdf-parse (PDF text), @azure/ai-form-recognizer (cloud OCR), tesseract.js (local OCR fallback) +**Storage**: PostgreSQL via Prisma (structured data), local filesystem `uploads/` (PDF files) +**Testing**: Jest (unit + integration), test K-1 PDF fixtures in `test/import/` +**Target Platform**: Docker (node:22-slim), self-hosted or Railway +**Project Type**: Web application (NestJS API + Angular SPA) — Nx monorepo +**Performance Goals**: PDF extraction < 30 seconds (SC-001), model creation < 5 seconds (SC-005), 90%+ accuracy for digital PDFs (SC-002) +**Constraints**: Self-hosted capable (Azure OCR optional), max PDF size 25 MB, K-1 Form 1065 only (V1) +**Scale/Scope**: Single family office (10–50 partnerships, 10–50 K-1s/year), 2 new API modules, 3 new frontend pages + +## Constitution Check + +_GATE: Must pass before Phase 0 research. 
Re-check after Phase 1 design._ + +No constitution.md exists for this project. Gates assessed against standard engineering principles: + +| Gate | Status | Notes | +|------|--------|-------| +| No unnecessary dependencies | PASS | 3 new packages (`pdf-parse`, `@azure/ai-form-recognizer`, `tesseract.js`) — each serves a distinct, justified purpose per research.md | +| Follows existing patterns | PASS | New NestJS modules follow existing controller/service/DTO pattern (mirrors `k-document`, `upload` modules) | +| No breaking changes | PASS | 2 new Prisma models + 1 enum, back-references only on existing models — no column changes | +| Test coverage | PASS | Unit tests for extractors, mapper, allocation; integration tests for full pipeline | +| Self-hosted compatible | PASS | Core extraction (pdf-parse) is fully local; Azure is optional with tesseract.js fallback | + +**Post-Phase 1 re-check**: PASS — data model adds 2 models/1 enum, no existing schema changes beyond back-references. API contracts follow existing REST patterns. No violations identified. 
+ +## Project Structure + +### Documentation (this feature) + +```text +specs/004-k1-scan-import/ +├── plan.md # This file +├── research.md # Phase 0: OCR provider research & decisions +├── data-model.md # Phase 1: K1ImportSession, CellMapping models +├── quickstart.md # Phase 1: Setup & dev guide +├── contracts/ +│ └── k1-import-api.md # Phase 1: REST API contracts +├── checklists/ +│ └── requirements.md # Spec quality checklist +└── tasks.md # Phase 2 output (created by /speckit.tasks) +``` + +### Source Code (repository root) + +```text +apps/api/src/app/ +├── k1-import/ +│ ├── k1-import.module.ts +│ ├── k1-import.controller.ts +│ ├── k1-import.service.ts +│ ├── dto/ +│ │ ├── upload-k1.dto.ts +│ │ ├── verify-k1.dto.ts +│ │ └── confirm-k1.dto.ts +│ ├── extractors/ +│ │ ├── k1-extractor.interface.ts +│ │ ├── pdf-parse-extractor.ts +│ │ ├── azure-extractor.ts +│ │ └── tesseract-extractor.ts +│ ├── k1-field-mapper.service.ts +│ ├── k1-allocation.service.ts +│ └── k1-confidence.service.ts +├── cell-mapping/ +│ ├── cell-mapping.module.ts +│ ├── cell-mapping.controller.ts +│ └── cell-mapping.service.ts + +apps/client/src/app/ +├── pages/ +│ ├── k1-import/ +│ │ ├── k1-import-page.component.ts +│ │ ├── k1-import-page.html +│ │ ├── k1-import-page.scss +│ │ ├── k1-import-page.routes.ts +│ │ ├── k1-verification/ +│ │ │ ├── k1-verification.component.ts +│ │ │ ├── k1-verification.html +│ │ │ └── k1-verification.scss +│ │ └── k1-confirmation/ +│ │ ├── k1-confirmation.component.ts +│ │ ├── k1-confirmation.html +│ │ └── k1-confirmation.scss +│ └── cell-mapping/ +│ ├── cell-mapping-page.component.ts +│ ├── cell-mapping-page.html +│ └── cell-mapping-page.routes.ts +├── services/ +│ └── k1-import-data.service.ts + +libs/common/src/lib/ +├── interfaces/ +│ └── k1-import.interface.ts +├── dtos/ +│ └── k1-import/ +│ ├── create-k1-import.dto.ts +│ ├── verify-k1-import.dto.ts +│ └── confirm-k1-import.dto.ts + +prisma/ +├── schema.prisma # + K1ImportSession, CellMapping, K1ImportStatus 
+├── migrations/ +│ └── 2026XXXX_added_k1_import/ # New migration + +test/import/ +├── sample-k1-digital.pdf # Test fixture: digital K-1 +└── sample-k1-scanned.pdf # Test fixture: scanned K-1 +``` + +**Structure Decision**: Follows the existing Nx monorepo convention with new NestJS modules under `apps/api/src/app/` and new Angular pages under `apps/client/src/app/pages/`. Shared interfaces and DTOs in `libs/common/`. This mirrors the existing `k-document`, `upload`, and `family-office` module patterns. diff --git a/specs/004-k1-scan-import/quickstart.md b/specs/004-k1-scan-import/quickstart.md new file mode 100644 index 000000000..348f51739 --- /dev/null +++ b/specs/004-k1-scan-import/quickstart.md @@ -0,0 +1,120 @@ +# Quickstart: K-1 PDF Scan Import + +**Phase 1 Output** | **Date**: 2026-03-18 + +## Prerequisites + +1. Spec 001-family-office-transform models are implemented (Entity, Partnership, PartnershipMembership, KDocument, Distribution, Document) +2. At least one Partnership with one or more member Entities exists in the database +3. The existing upload infrastructure (`UploadController`, `uploads/` directory) is functional +4. Node.js ≥ 22.18.0, Docker for PostgreSQL/Redis + +## Environment Setup + +Add to `.env` (optional — for Azure OCR of scanned PDFs): +``` +AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT=https://your-resource.cognitiveservices.azure.com/ +AZURE_DOCUMENT_INTELLIGENCE_KEY=your-api-key +``` + +If these are empty, scanned PDFs fall back to `tesseract.js` (lower accuracy but fully self-hosted). 
+
+## New Dependencies
+
+```bash
+npm install pdf-parse @azure/ai-form-recognizer tesseract.js
+npm install -D @types/pdf-parse
+```
+
+## Database Migration
+
+After adding the new Prisma models (`K1ImportSession`, `CellMapping`, `K1ImportStatus` enum):
+
+```bash
+npx prisma db push          # Development: sync schema
+# OR
+npx prisma migrate dev      # Create a migration file
+```
+
+Seed the default IRS cell mappings (28 rows with partnershipId = null) via the existing seed mechanism or a dedicated seed script.
+
+## Key Files to Create
+
+### Backend (apps/api/src/)
+
+```
+app/k1-import/
+├── k1-import.module.ts            # NestJS module
+├── k1-import.controller.ts        # REST endpoints (see contracts/k1-import-api.md)
+├── k1-import.service.ts           # Orchestration: upload → extract → verify → confirm
+├── dto/
+│   ├── upload-k1.dto.ts           # Multipart upload DTO
+│   ├── verify-k1.dto.ts           # Verification submission DTO
+│   └── confirm-k1.dto.ts          # Confirmation request DTO
+├── extractors/
+│   ├── k1-extractor.interface.ts  # Common extraction interface
+│   ├── pdf-parse-extractor.ts     # Tier 1: digital PDF text extraction
+│   ├── azure-extractor.ts         # Tier 2: Azure Document Intelligence
+│   └── tesseract-extractor.ts     # Tier 2 fallback: tesseract.js OCR
+├── k1-field-mapper.service.ts     # Maps raw extraction → K1ExtractedField[]
+├── k1-allocation.service.ts       # Allocates K-1 amounts to members by ownership %
+└── k1-confidence.service.ts       # Computes confidence scores with validation heuristics
+
+app/cell-mapping/
+├── cell-mapping.module.ts         # NestJS module
+├── cell-mapping.controller.ts     # CRUD for cell mappings
+└── cell-mapping.service.ts        # Cell mapping business logic + seed data
+```
+
+### Shared Types (libs/common/src/lib/)
+
+```
+interfaces/
+├── k1-import.interface.ts         # K1ExtractionResult, K1ExtractedField, K1ConfirmationRequest
+dtos/
+├── k1-import/
+│   ├── create-k1-import.dto.ts
+│   ├── verify-k1-import.dto.ts
+│   └── confirm-k1-import.dto.ts
+```
+
+### Frontend (apps/client/src/app/)
+
+```
+pages/k1-import/
+├── k1-import-page.component.ts          # Upload + history view
+├── k1-import-page.html
+├── k1-import-page.scss
+├── k1-import-page.routes.ts
+├── k1-verification/
+│   ├── k1-verification.component.ts     # Verification/edit screen
+│   ├── k1-verification.html
+│   └── k1-verification.scss
+└── k1-confirmation/
+    ├── k1-confirmation.component.ts     # Confirmation result screen
+    ├── k1-confirmation.html
+    └── k1-confirmation.scss
+
+pages/cell-mapping/
+├── cell-mapping-page.component.ts       # Cell mapping configuration UI
+├── cell-mapping-page.html
+└── cell-mapping-page.routes.ts
+
+services/
+├── k1-import-data.service.ts            # HTTP client for k1-import endpoints
+```
+
+## Verification Workflow
+
+1. **Upload**: User selects PDF → `POST /api/v1/k1-import/upload` → session created with status PROCESSING
+2. **Extract**: Backend detects PDF type (digital vs. scanned) → routes to appropriate extractor → status becomes EXTRACTED
+3. **Review**: Frontend polls/fetches session → displays verification screen with extracted fields, confidence indicators
+4. **Edit**: User corrects values, overrides labels → `PUT /api/v1/k1-import/:id/verify` → status becomes VERIFIED
+5. **Confirm**: User clicks "Confirm & Save" → `POST /api/v1/k1-import/:id/confirm` → KDocument + Distributions + Document created → status becomes CONFIRMED
+
+## Testing Strategy
+
+- **Unit tests**: Extractors (pdf-parse, azure, tesseract), field mapper, confidence scoring, allocation math
+- **Integration tests**: Full upload → extract → verify → confirm flow with test PDF fixtures
+- **Test fixtures**: Include sample K-1 PDFs (digital and scanned) in `test/import/` directory
+- **Allocation accuracy**: Verify rounding behavior — allocated amounts must sum exactly to partnership total
diff --git a/specs/004-k1-scan-import/research.md b/specs/004-k1-scan-import/research.md
new file mode 100644
index 000000000..17be04f4c
--- /dev/null
+++ b/specs/004-k1-scan-import/research.md
@@ -0,0 +1,154 @@
+# Research: K-1 PDF Scan Import
+
+**Phase 0 Output** | **Date**: 2026-03-18
+
+## Decision 1: PDF Text Extraction (Tier 1 — Digital PDFs)
+
+**Decision**: Use `pdf-parse` npm package for digitally-generated K-1 PDFs.
+
+**Rationale**: Digitally-generated PDFs from fund administrators contain embedded text. `pdf-parse` extracts this text losslessly, is free, fully self-hosted, and instant. It has 3M+ weekly npm downloads and a stable API. No external API calls needed.
+
+**Alternatives Considered**:
+- `pdfjs-dist` (Mozilla pdf.js) — lower-level, requires more boilerplate for text extraction; `pdf-parse` wraps this already.
+- Cloud OCR for all PDFs — unnecessary cost and latency for digital PDFs where text extraction is 100% accurate.
+
+---
+
+## Decision 2: OCR for Scanned PDFs (Tier 2)
+
+**Decision**: Use Azure AI Document Intelligence (Layout model) as primary Tier 2 provider, with `tesseract.js` as self-hosted fallback.
+
+**Rationale**:
+- Azure has the best tax-form pedigree among cloud providers (prebuilt IRS models for W-2, 1098, 1099)
+- Returns per-field confidence scores (0.0–1.0) natively, directly fulfilling FR-006/FR-009
+- 500 free pages/month covers typical family office volume (10–50 K-1s/year)
+- `@azure/ai-form-recognizer` has full TypeScript types, aligns with NestJS patterns
+- `tesseract.js` runs as WASM in Node.js (no system install), provides ~75% accuracy fallback
+
+**Alternatives Considered**:
+- Google Document AI — good form parsing but no tax-specific models, more expensive for custom processors ($30/1K pages)
+- AWS Textract — strong table extraction but less established for tax forms, requires IAM setup
+- Tesseract.js only — accuracy drops to 70–85% for clean scans, no layout understanding; acceptable as fallback but not primary
+
+---
+
+## Decision 3: Two-Tier Extraction Architecture
+
+**Decision**: Implement a PDF type detection step that routes digital PDFs to local extraction (free, instant) and scanned PDFs to cloud OCR.
+
+**Rationale**: Most K-1s from fund administrators are digitally generated. The two-tier approach avoids unnecessary API calls and costs for the majority case, while still supporting scanned documents.
+
+**Detection heuristic**: Extract text via `pdf-parse`; if extracted text length < 100 characters or does not contain K-1 keywords ("Schedule K-1", "Form 1065", "Partner's Share"), route to Tier 2 OCR.
+
+**Alternatives Considered**:
+- Cloud OCR for everything — simpler but adds cost ($0.15/page) and latency (3–10s) for digital PDFs that don't need it
+- Local OCR only (Tesseract.js) — insufficient accuracy (75%) for production tax data; too many manual corrections needed
+
+---
+
+## Decision 4: K-1 Box Extraction Strategy
+
+**Decision**: Use regex-based box extraction for Tier 1 (digital text), and key-value pair extraction from the OCR provider for Tier 2. Both feed into a shared K-1 field mapper that applies the cell mapping configuration.
+
+**Rationale**: The IRS Schedule K-1 (Form 1065) has a consistent, standardized layout:
+- Page 1: Header + Part I (partnership info) + Part II (partner info) + Boxes 1–11
+- Page 2: Boxes 12–20+ with code/sub-code details
+- Box values sit in a numbered two-column grid: number label → description → value field
+- Layout has been structurally stable for years, making template/regex extraction reliable
+
+**Challenges addressed**:
+- Multi-line sub-codes (Boxes 11, 13, 15, 16, 17, 18, 20) — handle by extracting code-letter/value pairs within each box section
+- Supplemental schedules — out of scope for V1 auto-extraction; captured as additional Document attachments
+- Multi-entity PDFs — detect via repeated "Schedule K-1" headers; split and process each K-1 separately
+
+**Alternatives Considered**:
+- Fixed coordinate-based extraction — too brittle across different PDF generators (varying margins, fonts)
+- Machine learning model — overkill for V1 given the standardized form layout
+
+---
+
+## Decision 5: Confidence Scoring Approach
+
+**Decision**: Three-level confidence display (High/Medium/Low) derived from extraction method and validation heuristics.
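
The Tier 1 scoring rules and display thresholds given in this decision's rationale can be sketched roughly as follows. This is only an illustration: `Tier1Signals`, `scoreTier1`, and `toConfidenceLevel` are hypothetical names, not part of the plan, and clamping and rounding the score to two decimals is an assumption made here to avoid floating-point noise at the thresholds.

```typescript
// Hypothetical sketch of the Decision 5 rules; all names are illustrative.
type ConfidenceLevel = 'HIGH' | 'MEDIUM' | 'LOW';

interface Tier1Signals {
  boxRegexMatchedCleanly: boolean; // +0.05 when true
  valueFormatValid: boolean;       // +0.05 when true
  contaminationPenalty: number;    // 0 when none, otherwise 0.10 to 0.30
}

// Tier 1 (digital text): start from the 0.90 base and apply the adjustments.
function scoreTier1(signals: Tier1Signals): number {
  let score = 0.9;
  if (signals.boxRegexMatchedCleanly) score += 0.05;
  if (signals.valueFormatValid) score += 0.05;
  score -= signals.contaminationPenalty;
  // Assumption: clamp to [0, 1] and round to two decimals.
  const clamped = Math.min(1, Math.max(0, score));
  return Math.round(clamped * 100) / 100;
}

// Display mapping: High >= 0.85, Medium 0.60 to 0.84, Low < 0.60.
function toConfidenceLevel(score: number): ConfidenceLevel {
  if (score >= 0.85) return 'HIGH';
  if (score >= 0.6) return 'MEDIUM';
  return 'LOW';
}
```

For Tier 2, the same `toConfidenceLevel` mapping could presumably be applied to the provider's native per-field score after any cross-field validation adjustments.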
+
+**Rationale**:
+
+For **Tier 1** (digital text):
+- Base confidence: 0.90 (text extraction is inherently reliable)
+- +0.05 if box number regex matched cleanly
+- +0.05 if value format validated (currency, percentage, integer)
+- -0.10 to -0.30 for potential adjacent-box text contamination
+
+For **Tier 2** (cloud OCR):
+- Use Azure's native per-field confidence score directly
+- Layer cross-field validation (e.g., Box 6b ≤ Box 6a, sub-boxes sum to parent)
+
+**Display mapping**:
+- High (≥ 0.85): Green — no user attention needed
+- Medium (0.60–0.84): Yellow — optional review
+- Low (< 0.60): Red — highlighted, requires manual review (FR-009)
+
+**Alternatives Considered**:
+- Binary confidence (confident/not) — too coarse; doesn't guide the user's review attention
+- Numeric score display — too technical for a non-engineer user; three levels with color coding is more actionable
+
+---
+
+## Decision 6: New Database Models
+
+**Decision**: Add two new Prisma models (`K1ImportSession`, `CellMapping`) to support import tracking and cell mapping configuration, alongside the existing K-document models from spec 001.
+
+**Rationale**:
+- `K1ImportSession` tracks the full import lifecycle (upload → processing → extracted → verified → confirmed/cancelled), enabling import history (FR-022) and re-processing (FR-023)
+- `CellMapping` stores per-partnership cell label customizations (FR-017 through FR-021) separate from the KDocument data itself
+
+**Alternatives Considered**:
+- Store import sessions as JSON metadata on KDocument — would conflate document data with import workflow state; makes import history harder to query
+- Store cell mappings as JSON on Partnership — would work but loses the ability to query/manage mappings independently and doesn't support a global default set
+
+---
+
+## Decision 7: File Storage
+
+**Decision**: Use the existing `uploads/` directory and `Document` model from spec 001. Uploaded K-1 PDFs are stored on the local filesystem, with metadata in the `Document` table.
+
+**Rationale**: The existing upload infrastructure (UploadController with `FileInterceptor`, Document model, `uploads/` directory) is already in place. No need to add a new storage mechanism.
+
+**Alternatives Considered**:
+- S3/cloud storage — would require new infrastructure; the self-hosted philosophy favors local storage
+- Database blob storage — increases database size and backup time for binary files
+
+---
+
+## Decision 8: New Environment Variables
+
+**Decision**: Add two optional environment variables for Azure Document Intelligence, following the existing `ConfigurationService` pattern with `str({ default: '' })`.
+
+```
+AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT — Azure resource endpoint URL
+AZURE_DOCUMENT_INTELLIGENCE_KEY — Azure API key
+```
+
+**Rationale**: When both are empty (default), the system falls back to `tesseract.js` for scanned PDFs. This makes Azure optional — the feature works fully self-hosted with degraded OCR accuracy.
+
+**Alternatives Considered**:
+- Separate feature flag — unnecessary; empty credentials are sufficient to indicate "not configured"
+- Google/AWS credentials — Azure recommended as primary; could add additional providers later
+
+---
+
+## Decision 9: New npm Dependencies
+
+**Decision**: Add the following packages:
+
+| Package | Purpose | Tier |
+|---|---|---|
+| `pdf-parse` | Text extraction from digital PDFs | Tier 1 (required) |
+| `@azure/ai-form-recognizer` | Cloud OCR for scanned PDFs | Tier 2 (optional) |
+| `tesseract.js` | Self-hosted OCR fallback | Tier 2 fallback |
+
+**Rationale**: `pdf-parse` is essential for the Tier 1 (free, local) path. Azure SDK is optional (only loaded when credentials are configured). `tesseract.js` provides a zero-config fallback that runs as WASM — no system dependencies needed, works in the existing `node:22-slim` Docker image.
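
Because the Azure SDK is only used when credentials are configured, the Tier 2 provider choice reduces to a check on the two environment variables from Decision 8. A minimal sketch, assuming the values are passed in as a plain object rather than read through `ConfigurationService`; `OcrEnv` and `selectOcrProvider` are hypothetical names. The plan only specifies the case where both variables are empty, so treating a half-configured pair as "not configured" is an assumption made here.

```typescript
// Hypothetical sketch; names are illustrative, not part of the plan.
type OcrProvider = 'azure' | 'tesseract';

interface OcrEnv {
  AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT?: string;
  AZURE_DOCUMENT_INTELLIGENCE_KEY?: string;
}

// Pick Azure only when both the endpoint and the key are non-empty.
function selectOcrProvider(env: OcrEnv): OcrProvider {
  const endpoint = (env.AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT ?? '').trim();
  const key = (env.AZURE_DOCUMENT_INTELLIGENCE_KEY ?? '').trim();
  // Assumption: a partially configured pair falls back to tesseract.js.
  return endpoint !== '' && key !== '' ? 'azure' : 'tesseract';
}
```

A caller could then load the matching extractor lazily, so that the Azure package is never imported on fully self-hosted installs.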
+
+**Alternatives Considered**:
+- `pdfjs-dist` directly instead of `pdf-parse` — more boilerplate, `pdf-parse` wraps it with a simpler API
+- Only cloud OCR — loses the self-hosted story and adds cost for digital PDFs
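
Tying the dependencies back to the two-tier routing, the detection heuristic from Decision 3 can be sketched as a pure function over the text that `pdf-parse` returns. `detectTier` and the constant names are hypothetical. The heuristic does not say whether one keyword or all three must be present, nor whether matching is case-sensitive, so this sketch assumes that any single keyword, matched case-insensitively, is enough to stay on Tier 1.

```typescript
// Hypothetical sketch of the Decision 3 heuristic; names are illustrative.
const K1_KEYWORDS: string[] = ['Schedule K-1', 'Form 1065', "Partner's Share"];
const MIN_TEXT_LENGTH = 100;

type ExtractionTier = 'TIER_1_TEXT' | 'TIER_2_OCR';

function detectTier(extractedText: string): ExtractionTier {
  const text = extractedText.trim();
  // Too little embedded text suggests a scanned (image-only) PDF.
  if (text.length < MIN_TEXT_LENGTH) return 'TIER_2_OCR';
  const lower = text.toLowerCase();
  // Assumption: at least one keyword, case-insensitive, is sufficient.
  const hasKeyword = K1_KEYWORDS.some(
    (keyword) => lower.indexOf(keyword.toLowerCase()) !== -1,
  );
  return hasKeyword ? 'TIER_1_TEXT' : 'TIER_2_OCR';
}
```

Keeping the check free of I/O makes it straightforward to unit-test against the digital and scanned fixtures mentioned in the testing strategy.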