plan: K-1 scan import implementation plan (Phase 0+1)

- research.md: 9 architectural decisions (two-tier OCR, Azure+tesseract fallback)
- data-model.md: K1ImportSession, CellMapping models, K1ImportStatus enum
- contracts/k1-import-api.md: 10 REST endpoints
- quickstart.md: file structure, setup guide, workflow
- plan.md: summary, technical context, constitution check, project structure
- Updated copilot agent context with new tech stack entries
pull/6701/head
Robert Patch 2 months ago
parent
commit
269eed7faf
  1. .github/agents/copilot-instructions.md (5 changed lines)
  2. specs/004-k1-scan-import/contracts/k1-import-api.md (381 changed lines)
  3. specs/004-k1-scan-import/data-model.md (241 changed lines)
  4. specs/004-k1-scan-import/plan.md (121 changed lines)
  5. specs/004-k1-scan-import/quickstart.md (120 changed lines)
  6. specs/004-k1-scan-import/research.md (154 changed lines)

5
.github/agents/copilot-instructions.md

@@ -1,10 +1,12 @@
# portfolio-management Development Guidelines
Auto-generated from all feature plans. Last updated: 2026-03-18
## Active Technologies
- TypeScript 5.9.2, Node.js >= 22.18.0 + Angular 21.1.1, NestJS 11.1.14, Angular Material 21.1.1, Prisma 6.19.0, big.js, date-fns 4.1.0 (003-portfolio-performance-views)
- PostgreSQL via Prisma ORM (003-portfolio-performance-views)
- TypeScript 5.9.2, Node.js ≥ 22.18.0 + NestJS 11.x (backend), Angular 21.x (frontend), Prisma 6.x (ORM), pdf-parse (PDF text), @azure/ai-form-recognizer (cloud OCR), tesseract.js (local OCR fallback) (004-k1-scan-import)
- PostgreSQL via Prisma (structured data), local filesystem `uploads/` (PDF files) (004-k1-scan-import)
- TypeScript 5.9.2, Node.js ≥22.18.0 + NestJS 11.1.14 (API), Angular 21.1.1 + Angular Material 21.1.1 (client), Prisma 6.19.0 (ORM), Nx 22.5.3 (monorepo), big.js (decimal math), date-fns 4.1.0, chart.js 4.5.1, Bull 4.16.5 (job queues), Redis (caching), yahoo-finance2 3.13.2 (001-family-office-transform)
@@ -25,6 +27,7 @@ npm test; npm run lint
TypeScript 5.9.2, Node.js ≥22.18.0: Follow standard conventions
## Recent Changes
- 004-k1-scan-import: Added TypeScript 5.9.2, Node.js ≥ 22.18.0 + NestJS 11.x (backend), Angular 21.x (frontend), Prisma 6.x (ORM), pdf-parse (PDF text), @azure/ai-form-recognizer (cloud OCR), tesseract.js (local OCR fallback)
- 003-portfolio-performance-views: Added TypeScript 5.9.2, Node.js >= 22.18.0 + Angular 21.1.1, NestJS 11.1.14, Angular Material 21.1.1, Prisma 6.19.0, big.js, date-fns 4.1.0
- 001-family-office-transform: Added TypeScript 5.9.2, Node.js ≥22.18.0 + NestJS 11.1.14 (API), Angular 21.1.1 + Angular Material 21.1.1 (client), Prisma 6.19.0 (ORM), Nx 22.5.3 (monorepo), big.js (decimal math), date-fns 4.1.0, chart.js 4.5.1, Bull 4.16.5 (job queues), Redis (caching), yahoo-finance2 3.13.2

381
specs/004-k1-scan-import/contracts/k1-import-api.md

@@ -0,0 +1,381 @@
# API Contracts: K-1 Import
**Phase 1 Output** | **Date**: 2026-03-18
## Base Path
K-1 import endpoints live under `/api/v1/k1-import/`; the cell mapping endpoints live under `/api/v1/cell-mapping`.
## Authentication
All endpoints require JWT authentication (`AuthGuard('jwt')`) and appropriate permissions via `HasPermissionGuard`.
---
## Endpoints
### POST /api/v1/k1-import/upload
Upload a K-1 PDF and initiate extraction.
**Permission**: `createKDocument`
**Request**: `multipart/form-data`
| Field | Type | Required | Description |
| --------------- | -------- | -------- | ------------------------------------ |
| `file` | File | Yes | PDF file (max 25 MB, MIME: application/pdf) |
| `partnershipId` | `string` | Yes | Target partnership UUID |
| `taxYear` | `number` | Yes | Tax year for this K-1 |
**Response**: `201 Created`
```json
{
  "id": "uuid",
  "partnershipId": "uuid",
  "status": "PROCESSING",
  "taxYear": 2025,
  "fileName": "K1-Smith-Capital-2025.pdf",
  "fileSize": 245760,
  "extractionMethod": "pdf-parse",
  "createdAt": "2026-03-18T00:00:00.000Z"
}
```
**Errors**:
| Status | Condition |
| ------ | -------------------------------------- |
| 400 | File is not a valid PDF |
| 400 | File exceeds 25 MB size limit |
| 400 | Partnership not found or not owned by user |
| 400 | Partnership has no active members |
| 400 | Tax year < partnership inception year |
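The checks in the error table above that need no database access can be sketched as a pure function. This is a minimal illustration under stated assumptions: `validateUpload`, `UploadCheckInput`, and the order of the checks are hypothetical, and the partnership ownership check is left out because it depends on a lookup.

```typescript
// Hypothetical helper: names and check order are illustrative, not from the spec.
const MAX_PDF_BYTES = 25 * 1024 * 1024; // 25 MB limit from the contract

interface UploadCheckInput {
  bytes: Uint8Array;         // raw file content
  mimeType: string;          // MIME type reported for the upload
  taxYear: number;
  inceptionYear: number;     // year of the partnership's inception date
  activeMemberCount: number; // active members of the target partnership
}

/** Returns the 400 condition that applies, or null when the upload is acceptable. */
function validateUpload(input: UploadCheckInput): string | null {
  // A PDF file starts with the ASCII bytes "%PDF-".
  let header = "";
  for (let i = 0; i < 5 && i < input.bytes.length; i++) {
    header += String.fromCharCode(input.bytes[i]);
  }
  if (input.mimeType !== "application/pdf" || header !== "%PDF-") {
    return "File is not a valid PDF";
  }
  if (input.bytes.length > MAX_PDF_BYTES) {
    return "File exceeds 25 MB size limit";
  }
  if (input.activeMemberCount === 0) {
    return "Partnership has no active members";
  }
  if (input.taxYear < input.inceptionYear) {
    return "Tax year < partnership inception year";
  }
  return null;
}
```

Checking the leading bytes in addition to the MIME type guards against a renamed non-PDF file, since the MIME type alone is supplied by the client.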
---
### GET /api/v1/k1-import/:id
Get the current state of an import session, including extraction results.
**Permission**: `readKDocument`
**Response**: `200 OK`
```json
{
  "id": "uuid",
  "partnershipId": "uuid",
  "status": "EXTRACTED",
  "taxYear": 2025,
  "fileName": "K1-Smith-Capital-2025.pdf",
  "fileSize": 245760,
  "extractionMethod": "pdf-parse",
  "rawExtraction": {
    "metadata": {
      "partnershipName": "Smith Capital Partners LP",
      "partnershipEin": "12-3456789",
      "partnerName": "Smith Family Trust",
      "partnerEin": "98-7654321",
      "taxYear": 2025,
      "isAmended": false,
      "isFinal": true
    },
    "fields": [
      {
        "boxNumber": "1",
        "label": "Ordinary business income (loss)",
        "customLabel": null,
        "rawValue": "$52,340",
        "numericValue": 52340,
        "confidence": 0.95,
        "confidenceLevel": "HIGH",
        "isUserEdited": false
      }
    ],
    "overallConfidence": 0.92,
    "method": "pdf-parse",
    "pagesProcessed": 2
  },
  "verifiedData": null,
  "documentId": "uuid",
  "kDocumentId": null,
  "errorMessage": null,
  "createdAt": "2026-03-18T00:00:00.000Z",
  "updatedAt": "2026-03-18T00:00:05.000Z"
}
```
**Errors**:
| Status | Condition |
| ------ | ----------------------------------- |
| 404 | Import session not found |
| 403 | Import session belongs to different user |
---
### PUT /api/v1/k1-import/:id/verify
Submit user-verified/edited extraction data. Transitions status from EXTRACTED to VERIFIED.
**Permission**: `updateKDocument`
**Request**: `application/json`
```json
{
  "taxYear": 2025,
  "fields": [
    {
      "boxNumber": "1",
      "label": "Ordinary business income (loss)",
      "customLabel": null,
      "rawValue": "$52,340",
      "numericValue": 52340,
      "confidence": 0.95,
      "confidenceLevel": "HIGH",
      "isUserEdited": false
    },
    {
      "boxNumber": "11",
      "label": "Other income (loss)",
      "customLabel": "Section 1256 contracts",
      "rawValue": "$8,200",
      "numericValue": 8200,
      "confidence": 0.72,
      "confidenceLevel": "MEDIUM",
      "isUserEdited": true
    }
  ]
}
```
**Response**: `200 OK` — Updated import session with status `VERIFIED`
**Errors**:
| Status | Condition |
| ------ | ----------------------------------------------- |
| 400 | Import session is not in EXTRACTED status |
| 400 | Fields array is empty |
| 404 | Import session not found |
---
### POST /api/v1/k1-import/:id/confirm
Confirm verified data and trigger automatic model object creation (KDocument, Distributions, Document linkage).
**Permission**: `createKDocument`
**Request**: `application/json`
```json
{
  "filingStatus": "DRAFT",
  "existingKDocumentAction": null
}
```
| Field | Type | Required | Description |
| ------------------------- | ----------------------------- | -------- | ---------------------------------------- |
| `filingStatus` | `"DRAFT" \| "ESTIMATED" \| "FINAL"` | Yes | Status for the created/updated KDocument |
| `existingKDocumentAction` | `"UPDATE" \| "CREATE_NEW" \| null` | No | Action if KDocument already exists |
**Response**: `201 Created`
```json
{
  "importSession": {
    "id": "uuid",
    "status": "CONFIRMED"
  },
  "kDocument": {
    "id": "uuid",
    "partnershipId": "uuid",
    "type": "K1",
    "taxYear": 2025,
    "filingStatus": "DRAFT",
    "data": { "ordinaryIncome": 52340, "..." : "..." }
  },
  "distributions": [
    {
      "id": "uuid",
      "entityId": "uuid",
      "partnershipId": "uuid",
      "type": "RETURN_OF_CAPITAL",
      "amount": 60000,
      "date": "2025-12-31T00:00:00.000Z"
    }
  ],
  "allocations": [
    {
      "entityId": "uuid",
      "entityName": "Smith Family Trust",
      "ownershipPercent": 60,
      "allocatedValues": { "ordinaryIncome": 31404, "..." : "..." }
    }
  ],
  "document": {
    "id": "uuid",
    "type": "K1",
    "name": "K1-Smith-Capital-2025.pdf"
  }
}
```
**Errors**:
| Status | Condition |
| ------ | ------------------------------------------------------------- |
| 400 | Import session is not in VERIFIED status |
| 400 | Partnership has no active members |
| 409 | KDocument already exists for this partnership/year and no action specified |
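The 409 row above and the `existingKDocumentAction` field together imply a small decision table. The sketch below is one possible reading, not the service's actual logic; `resolveConfirm` and `ConfirmOutcome` are hypothetical names.

```typescript
// Hypothetical decision helper for the confirm step; names are illustrative.
type ExistingKDocumentAction = "UPDATE" | "CREATE_NEW" | null;

type ConfirmOutcome =
  | { kind: "create" }                          // create a new KDocument
  | { kind: "update"; kDocumentId: string }     // overwrite the existing KDocument
  | { kind: "conflict"; status: 409 };          // caller must pick an action first

function resolveConfirm(
  existingKDocumentId: string | null,
  action: ExistingKDocumentAction,
): ConfirmOutcome {
  // No KDocument yet for this partnership/year: the action field is irrelevant.
  if (existingKDocumentId === null) return { kind: "create" };
  if (action === "UPDATE") return { kind: "update", kDocumentId: existingKDocumentId };
  if (action === "CREATE_NEW") return { kind: "create" };
  // A KDocument exists and the request did not say what to do with it.
  return { kind: "conflict", status: 409 };
}
```

Keeping this as a pure function makes the 409 path easy to unit test without a database.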
---
### POST /api/v1/k1-import/:id/cancel
Cancel an import session. No model objects are created.
**Permission**: `updateKDocument`
**Response**: `200 OK` — Updated import session with status `CANCELLED`
**Errors**:
| Status | Condition |
| ------ | --------------------------------------------- |
| 400 | Import session is already CONFIRMED or CANCELLED |
| 404 | Import session not found |
---
### GET /api/v1/k1-import/history
List import sessions for a partnership, ordered by creation date descending.
**Permission**: `readKDocument`
**Query Parameters**:
| Param | Type | Required | Description |
| --------------- | -------- | -------- | ------------------------------ |
| `partnershipId` | `string` | Yes | Partnership UUID |
| `taxYear` | `number` | No | Filter by tax year |
**Response**: `200 OK` — Array of import session summaries
```json
[
  {
    "id": "uuid",
    "partnershipId": "uuid",
    "status": "CONFIRMED",
    "taxYear": 2025,
    "fileName": "K1-Smith-Capital-2025.pdf",
    "extractionMethod": "pdf-parse",
    "kDocumentId": "uuid",
    "createdAt": "2026-03-18T00:00:00.000Z"
  }
]
```
---
### POST /api/v1/k1-import/:id/reprocess
Re-run extraction on a previously uploaded PDF using the current cell mapping configuration.
**Permission**: `updateKDocument`
**Response**: `200 OK` — New import session with status `PROCESSING` (original session unchanged)
**Errors**:
| Status | Condition |
| ------ | ------------------------------------------- |
| 400 | Original import session has no stored document |
| 404 | Import session not found |
---
## Cell Mapping Endpoints
### GET /api/v1/cell-mapping
Get cell mappings for a partnership (with global defaults for unmapped boxes).
**Permission**: `readKDocument`
**Query Parameters**:
| Param | Type | Required | Description |
| --------------- | -------- | -------- | ---------------------------------------- |
| `partnershipId` | `string` | No | Partnership UUID (omit for global defaults) |
**Response**: `200 OK`
```json
[
  {
    "id": "uuid",
    "partnershipId": null,
    "boxNumber": "1",
    "label": "Ordinary business income (loss)",
    "description": "IRS Schedule K-1 Box 1",
    "isCustom": false,
    "sortOrder": 1
  }
]
```
---
### PUT /api/v1/cell-mapping
Update or create cell mappings for a partnership.
**Permission**: `updateKDocument`
**Request**: `application/json`
```json
{
  "partnershipId": "uuid",
  "mappings": [
    {
      "boxNumber": "11",
      "label": "Section 1256 contracts",
      "description": "Custom label for Box 11",
      "isCustom": false
    },
    {
      "boxNumber": "20-Z",
      "label": "Qualified Business Income (Section 199A)",
      "description": "Custom additional box",
      "isCustom": true
    }
  ]
}
```
**Response**: `200 OK` — Updated mappings array
---
### DELETE /api/v1/cell-mapping/reset
Reset a partnership's cell mappings to IRS defaults (deletes all custom mappings for the partnership).
**Permission**: `updateKDocument`
**Query Parameters**:
| Param | Type | Required | Description |
| --------------- | -------- | -------- | ------------------ |
| `partnershipId` | `string` | Yes | Partnership UUID |
**Response**: `200 OK`

241
specs/004-k1-scan-import/data-model.md

@@ -0,0 +1,241 @@
# Data Model: K-1 PDF Scan Import
**Phase 1 Output** | **Date**: 2026-03-18
## Overview
This feature adds 2 new Prisma models and 1 new enum to support K-1 PDF scanning, import session tracking, and cell mapping configuration. It extends the existing models from spec 001-family-office-transform (KDocument, Distribution, Document, PartnershipMembership) with automatic creation from scanned data.
### Entity Relationship Diagram (Conceptual)
```
User (existing)
 └── Partnership (existing from 001)
      ├── K1ImportSession[] ──┬── Document (uploaded PDF, existing from 001)
      │    (new model)        ├── KDocument (auto-created, existing from 001)
      │                       └── CellMapping (per-partnership config)
      ├── PartnershipMembership[] (existing from 001)
      │    └── [K-1 allocations computed at confirm time]
      ├── KDocument[] (existing from 001)
      │    └── Distribution[] (auto-created from Box 19, existing from 001)
      └── CellMapping[] (new model, per-partnership overrides)

Global CellMapping (partnershipId = null) ── IRS default box definitions
```
## New Enum
### K1ImportStatus
Tracks the lifecycle of a K-1 import session.
| Value | Description |
| ------------ | -------------------------------------------------------------- |
| `PROCESSING` | PDF uploaded, extraction in progress |
| `EXTRACTED` | Extraction complete, awaiting user review |
| `VERIFIED` | User has reviewed/edited values, ready for confirmation |
| `CONFIRMED` | User confirmed, model objects created (KDocument, Distributions) |
| `CANCELLED` | User cancelled, no model objects created |
| `FAILED` | Extraction failed (invalid PDF, OCR error, etc.) |
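The lifecycle can be expressed as a transition table. The following is a sketch, not the project's implementation: `ALLOWED_TRANSITIONS` and `canTransition` are hypothetical names, and allowing `CANCELLED` from `PROCESSING` and `EXTRACTED` is an assumption drawn from the cancel endpoint, which only rejects sessions that are already CONFIRMED or CANCELLED.

```typescript
type K1ImportStatus =
  | "PROCESSING" | "EXTRACTED" | "VERIFIED"
  | "CONFIRMED" | "CANCELLED" | "FAILED";

// Forward-only transitions; terminal states map to an empty list.
// Assumption: cancellation is allowed from any non-terminal, non-failed state.
const ALLOWED_TRANSITIONS: Record<K1ImportStatus, readonly K1ImportStatus[]> = {
  PROCESSING: ["EXTRACTED", "FAILED", "CANCELLED"],
  EXTRACTED: ["VERIFIED", "CANCELLED"],
  VERIFIED: ["CONFIRMED", "CANCELLED"],
  CONFIRMED: [],
  CANCELLED: [],
  FAILED: [],
};

/** True when moving a session from `from` to `to` is a permitted step. */
function canTransition(from: K1ImportStatus, to: K1ImportStatus): boolean {
  return ALLOWED_TRANSITIONS[from].indexOf(to) !== -1;
}
```

A service method could call `canTransition` before every status update and return a 400 when it yields false, which keeps the "no backwards transitions" rule in one place.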
## New Models
### K1ImportSession
A record of a single K-1 PDF import attempt, tracking the full lifecycle from upload through confirmation.
| Field | Type | Constraints | Description |
| ------------------ | ---------------- | ---------------------------- | --------------------------------------------------------------- |
| `id` | `String` | PK, UUID, auto-generated | Unique identifier |
| `partnershipId` | `String` | FK → Partnership.id, indexed | Target partnership for this K-1 import |
| `userId` | `String` | FK → User.id, indexed | User who initiated the import |
| `status` | `K1ImportStatus` | Required, Default: PROCESSING | Current lifecycle status |
| `taxYear` | `Int` | Required | Tax year extracted or specified by user |
| `fileName` | `String` | Required | Original filename of uploaded PDF |
| `fileSize` | `Int` | Required | File size in bytes |
| `extractionMethod` | `String` | Required | Method used: "pdf-parse", "azure", "tesseract" |
| `rawExtraction` | `Json?` | Optional | Raw extraction results before user edits |
| `verifiedData` | `Json?` | Optional | User-verified/edited extraction results (K1ExtractionResult) |
| `documentId` | `String?` | FK → Document.id, optional | Linked uploaded PDF Document record |
| `kDocumentId` | `String?` | FK → KDocument.id, optional | Resulting KDocument (set on CONFIRMED status) |
| `errorMessage` | `String?` | Optional | Error details if status is FAILED |
| `createdAt` | `DateTime` | Default: now() | Upload timestamp |
| `updatedAt` | `DateTime` | Auto-updated | Last modification timestamp |
**Relations**:
- `partnership` → `Partnership` (many-to-one, cascade delete)
- `user` → `User` (many-to-one, cascade delete)
- `document` → `Document?` (many-to-one, optional)
- `kDocument` → `KDocument?` (many-to-one, optional)
**Indexes**: `@@index([partnershipId, taxYear])` for import history queries per partnership/year.
### CellMapping
A configuration defining how K-1 box numbers map to labels. Supports a global IRS-default set (partnershipId = null) and per-partnership customizations.
| Field | Type | Constraints | Description |
| --------------- | ---------- | -------------------------------------- | ---------------------------------------------------- |
| `id` | `String` | PK, UUID, auto-generated | Unique identifier |
| `partnershipId` | `String?` | FK → Partnership.id, optional, indexed | Partnership this mapping applies to (null = global) |
| `boxNumber` | `String` | Required | K-1 box identifier (e.g., "1", "6a", "19a", "20-A") |
| `label` | `String` | Required | Display label (e.g., "Ordinary business income") |
| `description` | `String?` | Optional | Extended description or IRS instructions |
| `isCustom` | `Boolean` | Default: false | Whether this is a user-added custom cell |
| `sortOrder` | `Int` | Required | Display order in the verification screen |
| `createdAt` | `DateTime` | Default: now() | Creation timestamp |
| `updatedAt` | `DateTime` | Auto-updated | Last modification timestamp |
**Relations**:
- `partnership` → `Partnership?` (many-to-one, optional, cascade delete)
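**Unique constraint**: `@@unique([partnershipId, boxNumber])`, giving one mapping per box per partnership. PostgreSQL treats NULL values as distinct in unique indexes by default, so this constraint alone does not prevent duplicate global rows (partnershipId = null) for the same box; the seed script or a partial unique index on `boxNumber` where `partnershipId` is null needs to enforce that.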
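How per-partnership rows override the global defaults can be sketched as a merge keyed on `boxNumber`. This is an illustrative sketch only; `CellMappingRow` is a trimmed-down stand-in for the model above and `resolveMappings` is a hypothetical name.

```typescript
// Trimmed-down view of a CellMapping row, for illustration only.
interface CellMappingRow {
  partnershipId: string | null; // null = global IRS default
  boxNumber: string;
  label: string;
  sortOrder: number;
}

/** Returns the effective mappings for one partnership, ordered for display. */
function resolveMappings(rows: CellMappingRow[], partnershipId: string): CellMappingRow[] {
  const byBox = new Map<string, CellMappingRow>();
  // First pass: global defaults.
  for (const row of rows) {
    if (row.partnershipId === null) byBox.set(row.boxNumber, row);
  }
  // Second pass: partnership rows replace the default for the same box,
  // and custom boxes with no default are simply added.
  for (const row of rows) {
    if (row.partnershipId === partnershipId) byBox.set(row.boxNumber, row);
  }
  return Array.from(byBox.values()).sort((a, b) => a.sortOrder - b.sortOrder);
}
```

Rows belonging to other partnerships are ignored, so the same function works on an unfiltered query result.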
## Modifications to Existing Models
### Partnership (from spec 001)
Add back-references — no column changes:
| New Field | Type | Description |
| ----------------- | -------------------- | ------------------------------------ |
| `importSessions` | `K1ImportSession[]` | Import attempts for this partnership |
| `cellMappings` | `CellMapping[]` | Custom cell mapping configurations |
### KDocument (from spec 001)
Add back-reference — no column changes:
| New Field | Type | Description |
| ---------------- | ------------------- | ---------------------------------------- |
| `importSession` | `K1ImportSession?` | Import session that created this record |
## Application-Layer Types
### K1ExtractionResult (TypeScript interface)
The structure returned by the extraction service and stored in `K1ImportSession.rawExtraction` and `K1ImportSession.verifiedData`.
```typescript
interface K1ExtractionResult {
  /** Extracted metadata from the K-1 header */
  metadata: {
    partnershipName: string | null;
    partnershipEin: string | null;
    partnerName: string | null;
    partnerEin: string | null;
    taxYear: number | null;
    isAmended: boolean;
    isFinal: boolean;
  };
  /** Extracted box values */
  fields: K1ExtractedField[];
  /** Overall extraction confidence (0.0–1.0) */
  overallConfidence: number;
  /** Extraction method used */
  method: 'pdf-parse' | 'azure' | 'tesseract';
  /** Number of pages processed */
  pagesProcessed: number;
}

interface K1ExtractedField {
  /** Box identifier (e.g., "1", "6a", "19a") */
  boxNumber: string;
  /** Display label from cell mapping */
  label: string;
  /** Custom label override by user (null if not overridden) */
  customLabel: string | null;
  /** Extracted raw text value */
  rawValue: string;
  /** Parsed numeric value (null if unparseable) */
  numericValue: number | null;
  /** Confidence score (0.0–1.0) */
  confidence: number;
  /** Confidence level for display */
  confidenceLevel: 'HIGH' | 'MEDIUM' | 'LOW';
  /** Whether user has manually edited this value */
  isUserEdited: boolean;
}
```
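Two derived fields in `K1ExtractedField` lend themselves to small pure helpers: `numericValue` from `rawValue`, and `confidenceLevel` from `confidence`. The sketch below is an assumption-laden illustration. The function names are hypothetical, the 0.9 and 0.7 thresholds are not specified anywhere in this spec (they merely agree with the 0.95 → HIGH and 0.72 → MEDIUM examples in the API contract), and treating parenthesised amounts as losses is an assumed convention.

```typescript
type ConfidenceLevel = "HIGH" | "MEDIUM" | "LOW";

/** Maps a 0.0–1.0 score to a display level. Thresholds are assumed defaults. */
function toConfidenceLevel(confidence: number, high = 0.9, medium = 0.7): ConfidenceLevel {
  if (confidence >= high) return "HIGH";
  if (confidence >= medium) return "MEDIUM";
  return "LOW";
}

/** Parses text such as "$52,340" or "(1,200)" into a number, or null if unparseable. */
function parseRawValue(rawValue: string): number | null {
  const trimmed = rawValue.trim();
  // Assumption: parentheses or a leading minus sign mark a loss.
  const negative = /^\(.*\)$/.test(trimmed) || trimmed.startsWith("-");
  // Strip currency symbol, thousands separators, whitespace, parentheses, minus.
  const digits = trimmed.replace(/[()$,\s-]/g, "");
  if (!/^\d+(\.\d+)?$/.test(digits)) return null;
  const value = Number(digits);
  return negative ? -value : value;
}
```

Returning `null` rather than `0` for unparseable text keeps the verification screen able to distinguish "no value read" from a genuine zero.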
### K1ConfirmationRequest (TypeScript interface)
The request body when the user confirms verified K-1 data.
```typescript
interface K1ConfirmationRequest {
  /** Import session ID */
  importSessionId: string;
  /** Tax year (may have been overridden by user) */
  taxYear: number;
  /** Filing status for the new KDocument */
  filingStatus: 'DRAFT' | 'ESTIMATED' | 'FINAL';
  /** Verified fields with any user edits applied */
  fields: K1ExtractedField[];
  /** Whether to update an existing KDocument (null = create new) */
  existingKDocumentAction: 'UPDATE' | 'CREATE_NEW' | null;
}
```
### Default IRS K-1 Cell Mapping
The standard box definitions seeded as global CellMapping records (partnershipId = null):
| boxNumber | label | sortOrder |
| --------- | ----------------------------------------- | --------- |
| 1 | Ordinary business income (loss) | 1 |
| 2 | Net rental real estate income (loss) | 2 |
| 3 | Other net rental income (loss) | 3 |
| 4a | Guaranteed payments for services | 4 |
| 4b | Guaranteed payments for capital | 5 |
| 4c | Total guaranteed payments | 6 |
| 5 | Interest income | 7 |
| 6a | Ordinary dividends | 8 |
| 6b | Qualified dividends | 9 |
| 6c | Dividend equivalents | 10 |
| 7 | Royalties | 11 |
| 8 | Net short-term capital gain (loss) | 12 |
| 9a | Net long-term capital gain (loss) | 13 |
| 9b | Collectibles (28%) gain (loss) | 14 |
| 9c | Unrecaptured section 1250 gain | 15 |
| 10 | Net section 1231 gain (loss) | 16 |
| 11 | Other income (loss) | 17 |
| 12 | Section 179 deduction | 18 |
| 13 | Other deductions | 19 |
| 14 | Self-employment earnings (loss) | 20 |
| 15 | Credits | 21 |
| 16 | Foreign transactions | 22 |
| 17 | Alternative minimum tax (AMT) items | 23 |
| 18 | Tax-exempt income and nondeductible expenses | 24 |
| 19a | Distributions — Cash and marketable securities | 25 |
| 19b | Distributions — Other property | 26 |
| 20 | Other information | 27 |
| 21 | Foreign taxes paid or accrued | 28 |
## Validation Rules
1. **Import session partnership**: Must reference an existing partnership owned by the current user.
2. **Import session tax year**: Must be ≥ year of the partnership's inception date.
3. **File upload**: Must be a valid PDF, ≤ 25 MB. System rejects non-PDF MIME types.
4. **Extraction status transitions**: Only valid transitions: PROCESSING → EXTRACTED → VERIFIED → CONFIRMED/CANCELLED, or PROCESSING → FAILED. No backwards transitions.
5. **Cell mapping uniqueness**: One mapping per (partnershipId, boxNumber). Custom mappings for a partnership override the global default for that box number.
6. **Confirmation prerequisites**: Can only confirm when status is VERIFIED, partnership has at least one active member, and verifiedData is not null.
7. **Duplicate KDocument check**: Before creating a KDocument, check for existing (partnershipId, type=K1, taxYear). If found, require an explicit user decision via `existingKDocumentAction` (`UPDATE` or `CREATE_NEW`); with no decision, the confirm request is rejected with 409.
8. **Distribution allocation**: Box 19a/19b amounts are allocated to members by ownership percentage as of the tax year's fiscal year end. Allocation amounts must sum exactly to the partnership-level total (handle rounding by adjusting the largest member's allocation).
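Rule 8 can be sketched with integer cents so that the sum check is exact. This is a minimal illustration, not the `k1-allocation.service.ts` implementation: the names are hypothetical, amounts are assumed to be converted to cents beforehand, and ownership percentages are assumed to total 100. The project lists big.js for decimal math, which a real implementation may prefer.

```typescript
// Hypothetical shapes for illustration.
interface MemberShare { entityId: string; ownershipPercent: number; }
interface Allocation { entityId: string; amountCents: number; }

/**
 * Splits a partnership-level total among members by ownership percentage.
 * Each share is rounded to the nearest cent; any rounding remainder is
 * applied to the member with the largest ownership so the parts sum exactly.
 */
function allocateByOwnership(totalCents: number, members: MemberShare[]): Allocation[] {
  if (members.length === 0) throw new Error("Partnership has no active members");

  const allocations: Allocation[] = members.map((m) => ({
    entityId: m.entityId,
    amountCents: Math.round((totalCents * m.ownershipPercent) / 100),
  }));

  const allocated = allocations.reduce((sum, a) => sum + a.amountCents, 0);
  const remainder = totalCents - allocated;

  // Find the largest member (first one wins on ties).
  let largest = 0;
  for (let i = 1; i < members.length; i++) {
    if (members[i].ownershipPercent > members[largest].ownershipPercent) largest = i;
  }
  allocations[largest].amountCents += remainder;
  return allocations;
}
```

For example, a 60% member of a partnership reporting $52,340.00 receives 3,140,400 cents ($31,404.00), which matches the `allocatedValues` example in the confirm response.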

121
specs/004-k1-scan-import/plan.md

@@ -0,0 +1,121 @@
# Implementation Plan: K-1 PDF Scan Import
**Branch**: `004-k1-scan-import` | **Date**: 2026-03-18 | **Spec**: [spec.md](spec.md)
**Input**: Feature specification from `/specs/004-k1-scan-import/spec.md`
## Summary
Automated K-1 PDF scanning that extracts structured IRS Schedule K-1 (Form 1065) data from uploaded PDFs, presents a verification screen for manual review/correction, and auto-creates downstream model objects (KDocument, Distributions, member allocations, Document). Uses a two-tier extraction approach: `pdf-parse` for digital PDFs (free, instant, local) and Azure AI Document Intelligence / `tesseract.js` fallback for scanned PDFs. Supports per-partnership cell mapping customization and import history with re-processing.
## Technical Context
**Language/Version**: TypeScript 5.9.2, Node.js ≥ 22.18.0
**Primary Dependencies**: NestJS 11.x (backend), Angular 21.x (frontend), Prisma 6.x (ORM), pdf-parse (PDF text), @azure/ai-form-recognizer (cloud OCR), tesseract.js (local OCR fallback)
**Storage**: PostgreSQL via Prisma (structured data), local filesystem `uploads/` (PDF files)
**Testing**: Jest (unit + integration), test K-1 PDF fixtures in `test/import/`
**Target Platform**: Docker (node:22-slim), self-hosted or Railway
**Project Type**: Web application (NestJS API + Angular SPA) — Nx monorepo
**Performance Goals**: PDF extraction < 30 seconds (SC-001), model creation < 5 seconds (SC-005), 90%+ accuracy for digital PDFs (SC-002)
**Constraints**: Self-hosted capable (Azure OCR optional), max PDF size 25 MB, K-1 Form 1065 only (V1)
**Scale/Scope**: Single family office (10–50 partnerships, 10–50 K-1s/year), 2 new API modules, 3 new frontend pages
## Constitution Check
_GATE: Must pass before Phase 0 research. Re-check after Phase 1 design._
No constitution.md exists for this project. Gates assessed against standard engineering principles:
| Gate | Status | Notes |
|------|--------|-------|
| No unnecessary dependencies | PASS | 3 new packages (`pdf-parse`, `@azure/ai-form-recognizer`, `tesseract.js`) — each serves a distinct, justified purpose per research.md |
| Follows existing patterns | PASS | New NestJS modules follow existing controller/service/DTO pattern (mirrors `k-document`, `upload` modules) |
| No breaking changes | PASS | 2 new Prisma models + 1 enum, back-references only on existing models — no column changes |
| Test coverage | PASS | Unit tests for extractors, mapper, allocation; integration tests for full pipeline |
| Self-hosted compatible | PASS | Core extraction (pdf-parse) is fully local; Azure is optional with tesseract.js fallback |
**Post-Phase 1 re-check**: PASS — data model adds 2 models/1 enum, no existing schema changes beyond back-references. API contracts follow existing REST patterns. No violations identified.
## Project Structure
### Documentation (this feature)
```text
specs/004-k1-scan-import/
├── plan.md              # This file
├── research.md          # Phase 0: OCR provider research & decisions
├── data-model.md        # Phase 1: K1ImportSession, CellMapping models
├── quickstart.md        # Phase 1: Setup & dev guide
├── contracts/
│   └── k1-import-api.md # Phase 1: REST API contracts
├── checklists/
│   └── requirements.md  # Spec quality checklist
└── tasks.md             # Phase 2 output (created by /speckit.tasks)
```
### Source Code (repository root)
```text
apps/api/src/app/
├── k1-import/
│   ├── k1-import.module.ts
│   ├── k1-import.controller.ts
│   ├── k1-import.service.ts
│   ├── dto/
│   │   ├── upload-k1.dto.ts
│   │   ├── verify-k1.dto.ts
│   │   └── confirm-k1.dto.ts
│   ├── extractors/
│   │   ├── k1-extractor.interface.ts
│   │   ├── pdf-parse-extractor.ts
│   │   ├── azure-extractor.ts
│   │   └── tesseract-extractor.ts
│   ├── k1-field-mapper.service.ts
│   ├── k1-allocation.service.ts
│   └── k1-confidence.service.ts
├── cell-mapping/
│   ├── cell-mapping.module.ts
│   ├── cell-mapping.controller.ts
│   └── cell-mapping.service.ts
apps/client/src/app/
├── pages/
│   ├── k1-import/
│   │   ├── k1-import-page.component.ts
│   │   ├── k1-import-page.html
│   │   ├── k1-import-page.scss
│   │   ├── k1-import-page.routes.ts
│   │   ├── k1-verification/
│   │   │   ├── k1-verification.component.ts
│   │   │   ├── k1-verification.html
│   │   │   └── k1-verification.scss
│   │   └── k1-confirmation/
│   │       ├── k1-confirmation.component.ts
│   │       ├── k1-confirmation.html
│   │       └── k1-confirmation.scss
│   └── cell-mapping/
│       ├── cell-mapping-page.component.ts
│       ├── cell-mapping-page.html
│       └── cell-mapping-page.routes.ts
├── services/
│   └── k1-import-data.service.ts
libs/common/src/lib/
├── interfaces/
│   └── k1-import.interface.ts
├── dtos/
│   └── k1-import/
│       ├── create-k1-import.dto.ts
│       ├── verify-k1-import.dto.ts
│       └── confirm-k1-import.dto.ts
prisma/
├── schema.prisma                  # + K1ImportSession, CellMapping, K1ImportStatus
├── migrations/
│   └── 2026XXXX_added_k1_import/  # New migration
test/import/
├── sample-k1-digital.pdf          # Test fixture: digital K-1
└── sample-k1-scanned.pdf          # Test fixture: scanned K-1
```
**Structure Decision**: Follows the existing Nx monorepo convention with new NestJS modules under `apps/api/src/app/` and new Angular pages under `apps/client/src/app/pages/`. Shared interfaces and DTOs in `libs/common/`. This mirrors the existing `k-document`, `upload`, and `family-office` module patterns.

120
specs/004-k1-scan-import/quickstart.md

@@ -0,0 +1,120 @@
# Quickstart: K-1 PDF Scan Import
**Phase 1 Output** | **Date**: 2026-03-18
## Prerequisites
1. Spec 001-family-office-transform models are implemented (Entity, Partnership, PartnershipMembership, KDocument, Distribution, Document)
2. At least one Partnership with one or more member Entities exists in the database
3. The existing upload infrastructure (`UploadController`, `uploads/` directory) is functional
4. Node.js ≥ 22.18.0, Docker for PostgreSQL/Redis
## Environment Setup
Add to `.env` (optional — for Azure OCR of scanned PDFs):
```
AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT=https://your-resource.cognitiveservices.azure.com/
AZURE_DOCUMENT_INTELLIGENCE_KEY=your-api-key
```
If these are empty, scanned PDFs fall back to `tesseract.js` (lower accuracy but fully self-hosted).
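The fallback rule can be sketched as a single check on the two variables. This is a hypothetical helper (`selectOcrProvider` is not a name from the spec), and treating whitespace-only values as empty is an assumption.

```typescript
type OcrProvider = "azure" | "tesseract";

/**
 * Picks the Tier 2 OCR provider from environment-style settings.
 * Azure is used only when both the endpoint and the key are set.
 */
function selectOcrProvider(env: Record<string, string | undefined>): OcrProvider {
  const endpoint = (env["AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT"] ?? "").trim();
  const key = (env["AZURE_DOCUMENT_INTELLIGENCE_KEY"] ?? "").trim();
  return endpoint !== "" && key !== "" ? "azure" : "tesseract";
}

// In a Node.js service this would typically be called as:
//   const provider = selectOcrProvider(process.env);
```

Requiring both values avoids a half-configured deployment attempting Azure calls that are certain to fail.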
## New Dependencies
```bash
npm install pdf-parse @azure/ai-form-recognizer tesseract.js
npm install -D @types/pdf-parse
```
## Database Migration
After adding the new Prisma models (`K1ImportSession`, `CellMapping`, `K1ImportStatus` enum):
```bash
npx prisma db push # Development: sync schema
# OR
npx prisma migrate dev # Create a migration file
```
Seed the default IRS cell mappings (28 rows with partnershipId = null) via the existing seed mechanism or a dedicated seed script.
## Key Files to Create
### Backend (apps/api/src/)
```
app/k1-import/
├── k1-import.module.ts # NestJS module
├── k1-import.controller.ts # REST endpoints (see contracts/k1-import-api.md)
├── k1-import.service.ts # Orchestration: upload → extract → verify → confirm
├── dto/
│   ├── upload-k1.dto.ts # Multipart upload DTO
│   ├── verify-k1.dto.ts # Verification submission DTO
│   └── confirm-k1.dto.ts # Confirmation request DTO
├── extractors/
│   ├── k1-extractor.interface.ts # Common extraction interface
│   ├── pdf-parse-extractor.ts # Tier 1: digital PDF text extraction
│   ├── azure-extractor.ts # Tier 2: Azure Document Intelligence
│   └── tesseract-extractor.ts # Tier 2 fallback: tesseract.js OCR
├── k1-field-mapper.service.ts # Maps raw extraction → K1ExtractedField[]
├── k1-allocation.service.ts # Allocates K-1 amounts to members by ownership %
└── k1-confidence.service.ts # Computes confidence scores with validation heuristics
app/cell-mapping/
├── cell-mapping.module.ts # NestJS module
├── cell-mapping.controller.ts # CRUD for cell mappings
└── cell-mapping.service.ts # Cell mapping business logic + seed data
```
### Shared Types (libs/common/src/lib/)
```
interfaces/
├── k1-import.interface.ts # K1ExtractionResult, K1ExtractedField, K1ConfirmationRequest
dtos/
├── k1-import/
│   ├── create-k1-import.dto.ts
│   ├── verify-k1-import.dto.ts
│   └── confirm-k1-import.dto.ts
```
### Frontend (apps/client/src/app/)
```
pages/k1-import/
├── k1-import-page.component.ts # Upload + history view
├── k1-import-page.html
├── k1-import-page.scss
├── k1-import-page.routes.ts
├── k1-verification/
│   ├── k1-verification.component.ts # Verification/edit screen
│   ├── k1-verification.html
│   └── k1-verification.scss
└── k1-confirmation/
    ├── k1-confirmation.component.ts # Confirmation result screen
    ├── k1-confirmation.html
    └── k1-confirmation.scss
pages/cell-mapping/
├── cell-mapping-page.component.ts # Cell mapping configuration UI
├── cell-mapping-page.html
└── cell-mapping-page.routes.ts
services/
├── k1-import-data.service.ts # HTTP client for k1-import endpoints
```
## Verification Workflow
1. **Upload**: User selects PDF → `POST /api/v1/k1-import/upload` → session created with status PROCESSING
2. **Extract**: Backend detects PDF type (digital vs. scanned) → routes to appropriate extractor → status becomes EXTRACTED
3. **Review**: Frontend polls/fetches session → displays verification screen with extracted fields, confidence indicators
4. **Edit**: User corrects values, overrides labels → `PUT /api/v1/k1-import/:id/verify` → status becomes VERIFIED
5. **Confirm**: User clicks "Confirm & Save" → `POST /api/v1/k1-import/:id/confirm` → KDocument + Distributions + Document created → status becomes CONFIRMED
## Testing Strategy
- **Unit tests**: Extractors (pdf-parse, azure, tesseract), field mapper, confidence scoring, allocation math
- **Integration tests**: Full upload → extract → verify → confirm flow with test PDF fixtures
- **Test fixtures**: Include sample K-1 PDFs (digital and scanned) in `test/import/` directory
- **Allocation accuracy**: Verify rounding behavior — allocated amounts must sum exactly to partnership total

154
specs/004-k1-scan-import/research.md

@ -0,0 +1,154 @@
# Research: K-1 PDF Scan Import
**Phase 0 Output** | **Date**: 2026-03-18
## Decision 1: PDF Text Extraction (Tier 1 — Digital PDFs)
**Decision**: Use `pdf-parse` npm package for digitally-generated K-1 PDFs.
**Rationale**: Digitally-generated PDFs from fund administrators contain embedded text. `pdf-parse` extracts this text losslessly, is free, fully self-hosted, and instant. It has 3M+ weekly npm downloads and a stable API. No external API calls needed.
**Alternatives Considered**:
- `pdfjs-dist` (Mozilla pdf.js) — lower-level, requires more boilerplate for text extraction; `pdf-parse` wraps this already.
- Cloud OCR for all PDFs — unnecessary cost and latency for digital PDFs where text extraction is 100% accurate.
---
## Decision 2: OCR for Scanned PDFs (Tier 2)
**Decision**: Use Azure AI Document Intelligence (Layout model) as primary Tier 2 provider, with `tesseract.js` as self-hosted fallback.
**Rationale**:
- Azure has the best tax-form pedigree among cloud providers (prebuilt IRS models for W-2, 1098, 1099)
- Returns per-field confidence scores (0.0–1.0) natively, directly fulfilling FR-006/FR-009
- 500 free pages/month covers typical family office volume (10–50 K-1s/year)
- `@azure/ai-form-recognizer` has full TypeScript types, aligns with NestJS patterns
- `tesseract.js` runs as WASM in Node.js (no system install), provides ~75% accuracy fallback
**Alternatives Considered**:
- Google Document AI — good form parsing but no tax-specific models, more expensive for custom processors ($30/1K pages)
- AWS Textract — strong table extraction but less established for tax forms, requires IAM setup
- Tesseract.js only — accuracy drops to 70–85% for clean scans, no layout understanding; acceptable as fallback but not primary
---
## Decision 3: Two-Tier Extraction Architecture
**Decision**: Implement a PDF type detection step that routes digital PDFs to local extraction (free, instant) and scanned PDFs to cloud OCR.
**Rationale**: Most K-1s from fund administrators are digitally generated. The two-tier approach avoids unnecessary API calls and costs for the majority case, while still supporting scanned documents.
**Detection heuristic**: Extract text via `pdf-parse`; if the extracted text is shorter than 100 characters or does not contain K-1 keywords ("Schedule K-1", "Form 1065", "Partner's Share"), route to Tier 2 OCR.
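A minimal sketch of this heuristic as a pure function follows. It assumes the text has already been extracted (for example, from the result of `pdf-parse`) and reads "does not contain K-1 keywords" as "contains none of them"; the case-insensitive match, whitespace normalization, and the names `chooseTier` and `ExtractionTier` are illustrative assumptions.

```typescript
// Keywords and threshold come from the heuristic described above.
const K1_KEYWORDS = ['Schedule K-1', 'Form 1065', "Partner's Share"];
const MIN_TEXT_LENGTH = 100;

type ExtractionTier = 'TIER_1_TEXT' | 'TIER_2_OCR';

// Hypothetical router: decides which extractor should handle a PDF,
// given the text that Tier 1 extraction produced for it.
function chooseTier(extractedText: string): ExtractionTier {
  // Collapse whitespace and normalize curly apostrophes so that layout
  // differences between PDF generators do not hide a keyword (assumption).
  const text = extractedText
    .replace(/\u2019/g, "'")
    .replace(/\s+/g, ' ')
    .trim();

  if (text.length < MIN_TEXT_LENGTH) {
    return 'TIER_2_OCR'; // little or no embedded text: likely a scan
  }

  const lower = text.toLowerCase();
  const hasKeyword = K1_KEYWORDS.some((k) => lower.includes(k.toLowerCase()));
  return hasKeyword ? 'TIER_1_TEXT' : 'TIER_2_OCR';
}
```

Keeping the decision separate from the PDF library makes it easy to unit test with plain strings.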
**Alternatives Considered**:
- Cloud OCR for everything — simpler but adds cost ($0.15/page) and latency (3–10s) for digital PDFs that don't need it
- Local OCR only (Tesseract.js) — insufficient accuracy (75%) for production tax data; too many manual corrections needed
---
## Decision 4: K-1 Box Extraction Strategy
**Decision**: Use regex-based box extraction for Tier 1 (digital text), and key-value pair extraction from the OCR provider for Tier 2. Both feed into a shared K-1 field mapper that applies the cell mapping configuration.
**Rationale**: The IRS Schedule K-1 (Form 1065) has a consistent, standardized layout:
- Page 1: Header + Part I (partnership info) + Part II (partner info) + Boxes 1–11
- Page 2: Boxes 12–20+ with code/sub-code details
- Box values sit in a numbered two-column grid: number label → description → value field
- Layout has been structurally stable for years, making template/regex extraction reliable
**Challenges addressed**:
- Multi-line sub-codes (Boxes 11, 13, 15, 16, 17, 18, 20) — handle by extracting code-letter/value pairs within each box section
- Supplemental schedules — out of scope for V1 auto-extraction; captured as additional Document attachments
- Multi-entity PDFs — detect via repeated "Schedule K-1" headers; split and process each K-1 separately
**Alternatives Considered**:
- Fixed coordinate-based extraction — too brittle across different PDF generators (varying margins, fonts)
- Machine learning model — overkill for V1 given the standardized form layout
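
The Tier 1 regex approach could look roughly like the sketch below. It assumes the extracted text arrives one box per line in the form `<box number> <description> <amount>`, with sub-code lines of the form `<code letter> <amount>` under a box header; real `pdf-parse` output may be laid out differently, and all names here (`extractBoxes`, `parseAmount`, `ExtractedBox`) are hypothetical.

```typescript
interface ExtractedBox {
  box: string; // e.g. "1", "6a", "20"
  code?: string; // sub-code letter, e.g. "A"
  rawValue: string;
  value: number;
}

// Assumed line shapes; amounts may use commas, a leading "$" or "-",
// or parentheses for losses.
const BOX_WITH_VALUE = /^(\d{1,2}[a-c]?)\s+(.+?)\s+(\(?-?\$?[\d,]+(?:\.\d+)?\)?)$/;
const BOX_HEADER = /^(\d{1,2}[a-c]?)\s+\D/;
const CODE_WITH_VALUE = /^([A-Z]{1,2})\s+(\(?-?\$?[\d,]+(?:\.\d+)?\)?)$/;

// Parse "12,345.67", "(1,234)" or "-500"; parentheses mean a negative amount.
function parseAmount(raw: string): number | null {
  const s = raw.trim();
  const negative = /^\(.*\)$/.test(s) || s.startsWith('-');
  const digits = s.replace(/[()$,\-\s]/g, '');
  if (!/^\d+(\.\d+)?$/.test(digits)) return null;
  const n = Number(digits);
  return negative ? -n : n;
}

function extractBoxes(text: string): ExtractedBox[] {
  const results: ExtractedBox[] = [];
  let currentBox: string | null = null;

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '') continue;

    const boxMatch = BOX_WITH_VALUE.exec(line);
    if (boxMatch) {
      currentBox = boxMatch[1];
      const value = parseAmount(boxMatch[3]);
      if (value !== null) {
        results.push({ box: currentBox, rawValue: boxMatch[3], value });
      }
      continue;
    }

    // A box header without an amount opens a section for code/value pairs.
    const headerMatch = BOX_HEADER.exec(line);
    if (headerMatch) {
      currentBox = headerMatch[1];
      continue;
    }

    const codeMatch = CODE_WITH_VALUE.exec(line);
    if (codeMatch && currentBox !== null) {
      const value = parseAmount(codeMatch[2]);
      if (value !== null) {
        results.push({ box: currentBox, code: codeMatch[1], rawValue: codeMatch[2], value });
      }
    }
  }
  return results;
}
```

The output is deliberately label-free: mapping a box number to a display label would be the job of the shared field mapper and the cell mapping configuration.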
---
## Decision 5: Confidence Scoring Approach
**Decision**: Three-level confidence display (High/Medium/Low) derived from extraction method and validation heuristics.
**Rationale**:
For **Tier 1** (digital text):
- Base confidence: 0.90 (text extraction is inherently reliable)
- +0.05 if box number regex matched cleanly
- +0.05 if value format validated (currency, percentage, integer)
- -0.10 to -0.30 for potential adjacent-box text contamination
For **Tier 2** (cloud OCR):
- Use Azure's native per-field confidence score directly
- Layer cross-field validation (e.g., Box 6b ≤ Box 6a, sub-boxes sum to parent)
**Display mapping**:
- High (≥ 0.85): Green — no user attention needed
- Medium (0.60 to < 0.85): Yellow — optional review
- Low (< 0.60): Red highlighted, requires manual review (FR-009)
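
The Tier 1 scoring rules and the display mapping above can be sketched as two small functions. This is illustrative only: the `Tier1Signals` shape and function names are hypothetical, the contamination penalty is passed in by the caller because the source does not say how it is chosen, and scores between 0.84 and 0.85 are assumed to fall in Medium.

```typescript
type ConfidenceLevel = 'HIGH' | 'MEDIUM' | 'LOW';

// Hypothetical inputs describing how a Tier 1 field was extracted.
interface Tier1Signals {
  boxRegexMatchedCleanly: boolean;
  valueFormatValid: boolean;
  // 0 when no adjacent-box contamination is suspected, otherwise 0.10 to 0.30.
  contaminationPenalty: number;
}

function scoreTier1(signals: Tier1Signals): number {
  let score = 0.9; // base confidence for embedded text
  if (signals.boxRegexMatchedCleanly) score += 0.05;
  if (signals.valueFormatValid) score += 0.05;
  score -= signals.contaminationPenalty;

  // Round to two decimals to avoid floating-point noise, then clamp to [0, 1].
  const rounded = Math.round(score * 100) / 100;
  return Math.min(1, Math.max(0, rounded));
}

// Maps a numeric score (Tier 1 or a provider's native score) to a display level.
function toConfidenceLevel(score: number): ConfidenceLevel {
  if (score >= 0.85) return 'HIGH';
  if (score >= 0.6) return 'MEDIUM';
  return 'LOW';
}
```

Because `toConfidenceLevel` only takes a number, the same mapping could serve Tier 2 fields that carry a per-field confidence from the OCR provider.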
**Alternatives Considered**:
- Binary confidence (confident/not) — too coarse; doesn't guide the user's review attention
- Numeric score display — too technical for a non-engineer user; three levels with color coding is more actionable
---
## Decision 6: New Database Models
**Decision**: Add two new Prisma models (`K1ImportSession`, `CellMapping`) to support import tracking and cell mapping configuration, alongside the existing K-document models from spec 001.
**Rationale**:
- `K1ImportSession` tracks the full import lifecycle (upload → processing → extracted → verified → confirmed/cancelled), enabling import history (FR-022) and re-processing (FR-023)
- `CellMapping` stores per-partnership cell label customizations (FR-017 through FR-021) separate from the KDocument data itself
**Alternatives Considered**:
- Store import sessions as JSON metadata on KDocument — would conflate document data with import workflow state; makes import history harder to query
- Store cell mappings as JSON on Partnership — would work but loses the ability to query/manage mappings independently and doesn't support a global default set
---
## Decision 7: File Storage
**Decision**: Use the existing `uploads/` directory and `Document` model from spec 001. Uploaded K-1 PDFs are stored on the local filesystem, with metadata in the `Document` table.
**Rationale**: The existing upload infrastructure (UploadController with `FileInterceptor`, Document model, `uploads/` directory) is already in place. No need to add a new storage mechanism.
**Alternatives Considered**:
- S3/cloud storage — would require new infrastructure; the self-hosted philosophy favors local storage
- Database blob storage — increases database size and backup time for binary files
---
## Decision 8: New Environment Variables
**Decision**: Add two optional environment variables for Azure Document Intelligence, following the existing `ConfigurationService` pattern with `str({ default: '' })`.
```
AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT — Azure resource endpoint URL
AZURE_DOCUMENT_INTELLIGENCE_KEY — Azure API key
```
**Rationale**: When both are empty (default), the system falls back to `tesseract.js` for scanned PDFs. This makes Azure optional — the feature works fully self-hosted with degraded OCR accuracy.
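A minimal sketch of the provider selection this implies is shown below. The `OcrProvider` type and `selectOcrProvider` function are hypothetical, and treating a half-configured setup (only one variable set) as "not configured" is an assumption; the source only specifies the case where both are empty.

```typescript
type OcrProvider = 'azure' | 'tesseract';

// Only the two variables named above are read; everything else is ignored.
interface OcrEnv {
  AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT?: string;
  AZURE_DOCUMENT_INTELLIGENCE_KEY?: string;
}

// Hypothetical helper: choose the Tier 2 provider from configuration.
function selectOcrProvider(env: OcrEnv): OcrProvider {
  const endpoint = (env.AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT ?? '').trim();
  const key = (env.AZURE_DOCUMENT_INTELLIGENCE_KEY ?? '').trim();

  // Azure is used only when both values are present (assumption for the
  // case where just one is set); otherwise fall back to tesseract.js.
  return endpoint !== '' && key !== '' ? 'azure' : 'tesseract';
}
```

In the application the two values would presumably be read through `ConfigurationService` rather than passed in directly.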
**Alternatives Considered**:
- Separate feature flag — unnecessary; empty credentials are sufficient to indicate "not configured"
- Google/AWS credentials — Azure recommended as primary; could add additional providers later
---
## Decision 9: New npm Dependencies
**Decision**: Add the following packages:
| Package | Purpose | Tier |
|---|---|---|
| `pdf-parse` | Text extraction from digital PDFs | Tier 1 (required) |
| `@azure/ai-form-recognizer` | Cloud OCR for scanned PDFs | Tier 2 (optional) |
| `tesseract.js` | Self-hosted OCR fallback | Tier 2 fallback |
**Rationale**: `pdf-parse` is essential for the Tier 1 (free, local) path. Azure SDK is optional (only loaded when credentials are configured). `tesseract.js` provides a zero-config fallback that runs as WASM — no system dependencies needed, works in the existing `node:22-slim` Docker image.
**Alternatives Considered**:
- `pdfjs-dist` directly instead of `pdf-parse` — more boilerplate, `pdf-parse` wraps it with a simpler API
- Only cloud OCR — loses the self-hosted story and adds cost for digital PDFs