
plan: update K-1 scan import plan with clarification decisions

Post-clarification updates (5 decisions integrated):
- research.md: +3 decisions (aggregation, unmapped items, auto-accept)
- data-model.md: +CellAggregationRule model, K1UnmappedItem interface, isReviewed field
- contracts/k1-import-api.md: +3 aggregation rule endpoints, unmapped items in verify
- quickstart.md: +aggregation service, updated workflow, updated tests
- plan.md: rebuilt with updated summary, 3 models/1 enum, extended project structure
- copilot-instructions.md: agent context refreshed
pull/6701/head
Robert Patch 2 months ago
parent
commit
a759b94ada
  1. .github/agents/copilot-instructions.md (2 changes)
  2. specs/004-k1-scan-import/contracts/k1-import-api.md (150 changes)
  3. specs/004-k1-scan-import/data-model.md (69 changes)
  4. specs/004-k1-scan-import/plan.md (33 changes)
  5. specs/004-k1-scan-import/quickstart.md (28 changes)
  6. specs/004-k1-scan-import/research.md (51 changes)

2
.github/agents/copilot-instructions.md

@ -28,9 +28,9 @@ TypeScript 5.9.2, Node.js ≥22.18.0: Follow standard conventions
## Recent Changes ## Recent Changes
- 004-k1-scan-import: Added TypeScript 5.9.2, Node.js ≥ 22.18.0 + NestJS 11.x (backend), Angular 21.x (frontend), Prisma 6.x (ORM), pdf-parse (PDF text), @azure/ai-form-recognizer (cloud OCR), tesseract.js (local OCR fallback) - 004-k1-scan-import: Added TypeScript 5.9.2, Node.js ≥ 22.18.0 + NestJS 11.x (backend), Angular 21.x (frontend), Prisma 6.x (ORM), pdf-parse (PDF text), @azure/ai-form-recognizer (cloud OCR), tesseract.js (local OCR fallback)
- 004-k1-scan-import: Added TypeScript 5.9.2, Node.js ≥ 22.18.0 + NestJS 11.x (backend), Angular 21.x (frontend), Prisma 6.x (ORM), pdf-parse (PDF text), @azure/ai-form-recognizer (cloud OCR), tesseract.js (local OCR fallback)
- 003-portfolio-performance-views: Added TypeScript 5.9.2, Node.js >= 22.18.0 + Angular 21.1.1, NestJS 11.1.14, Angular Material 21.1.1, Prisma 6.19.0, big.js, date-fns 4.1.0 - 003-portfolio-performance-views: Added TypeScript 5.9.2, Node.js >= 22.18.0 + Angular 21.1.1, NestJS 11.1.14, Angular Material 21.1.1, Prisma 6.19.0, big.js, date-fns 4.1.0
- 001-family-office-transform: Added TypeScript 5.9.2, Node.js ≥22.18.0 + NestJS 11.1.14 (API), Angular 21.1.1 + Angular Material 21.1.1 (client), Prisma 6.19.0 (ORM), Nx 22.5.3 (monorepo), big.js (decimal math), date-fns 4.1.0, chart.js 4.5.1, Bull 4.16.5 (job queues), Redis (caching), yahoo-finance2 3.13.2
<!-- MANUAL ADDITIONS START --> <!-- MANUAL ADDITIONS START -->
<!-- MANUAL ADDITIONS END --> <!-- MANUAL ADDITIONS END -->

150
specs/004-k1-scan-import/contracts/k1-import-api.md

@ -1,6 +1,6 @@
# API Contracts: K-1 Import # API Contracts: K-1 Import
**Phase 1 Output** | **Date**: 2026-03-18 **Phase 1 Output** | **Date**: 2026-03-18 | **Updated**: 2026-03-18 (post-clarification)
## Base Path ## Base Path
@ -136,7 +136,8 @@ Submit user-verified/edited extraction data. Transitions status from EXTRACTED t
"numericValue": 52340, "numericValue": 52340,
"confidence": 0.95, "confidence": 0.95,
"confidenceLevel": "HIGH", "confidenceLevel": "HIGH",
"isUserEdited": false "isUserEdited": false,
"isReviewed": true
}, },
{ {
"boxNumber": "11", "boxNumber": "11",
@ -146,7 +147,19 @@ Submit user-verified/edited extraction data. Transitions status from EXTRACTED t
"numericValue": 8200, "numericValue": 8200,
"confidence": 0.72, "confidence": 0.72,
"confidenceLevel": "MEDIUM", "confidenceLevel": "MEDIUM",
"isUserEdited": true "isUserEdited": true,
"isReviewed": true
}
],
"unmappedItems": [
{
"rawLabel": "State tax adjustment",
"rawValue": "$1,200",
"numericValue": 1200,
"confidence": 0.65,
"pageNumber": 3,
"resolution": "discarded",
"assignedBoxNumber": null
} }
] ]
} }
@ -160,6 +173,8 @@ Submit user-verified/edited extraction data. Transitions status from EXTRACTED t
| ------ | ----------------------------------------------- | | ------ | ----------------------------------------------- |
| 400 | Import session is not in EXTRACTED status | | 400 | Import session is not in EXTRACTED status |
| 400 | Fields array is empty | | 400 | Fields array is empty |
| 400 | Medium/low-confidence fields not all reviewed (isReviewed must be true) |
| 400 | Unmapped items not all resolved (each must be 'assigned' or 'discarded') |
| 404 | Import session not found | | 404 | Import session not found |
--- ---
@ -379,3 +394,132 @@ Reset a partnership's cell mappings to IRS defaults (deletes all custom mappings
| `partnershipId` | `string` | Yes | Partnership UUID | | `partnershipId` | `string` | Yes | Partnership UUID |
**Response**: `200 OK` **Response**: `200 OK`
---
## Aggregation Rule Endpoints
### GET /api/v1/cell-mapping/aggregation-rules
Get aggregation rules for a partnership (with global defaults for partnerships without custom rules).
**Permission**: `readKDocument`
**Query Parameters**:
| Param | Type | Required | Description |
| --------------- | -------- | -------- | ---------------------------------------------- |
| `partnershipId` | `string` | No | Partnership UUID (omit for global defaults) |
**Response**: `200 OK`
```json
[
{
"id": "uuid",
"partnershipId": null,
"name": "Total Ordinary Income",
"operation": "SUM",
"sourceCells": ["1"],
"sortOrder": 1
},
{
"id": "uuid",
"partnershipId": null,
"name": "Total Capital Gains",
"operation": "SUM",
"sourceCells": ["8", "9a", "9b", "9c", "10"],
"sortOrder": 2
},
{
"id": "uuid",
"partnershipId": null,
"name": "Total Deductions",
"operation": "SUM",
"sourceCells": ["12", "13"],
"sortOrder": 3
}
]
```
---
### PUT /api/v1/cell-mapping/aggregation-rules
Create or update aggregation rules for a partnership.
**Permission**: `updateKDocument`
**Request**: `application/json`
```json
{
"partnershipId": "uuid",
"rules": [
{
"name": "Income Summary",
"operation": "SUM",
"sourceCells": ["1", "2", "3", "4b", "5", "6a", "7"]
},
{
"name": "Total Capital Gains",
"operation": "SUM",
"sourceCells": ["8", "9a", "10"]
}
]
}
```
**Response**: `200 OK` — Updated rules array
**Errors**:
| Status | Condition |
| ------ | ---------------------------------------------------- |
| 400 | Source cell box number not found in cell mappings |
| 400 | Duplicate rule name for the same partnership |
| 400 | Empty sourceCells array |
---
### GET /api/v1/cell-mapping/aggregation-rules/compute
Compute aggregation values for a specific KDocument's data. Returns the dynamically calculated totals.
**Permission**: `readKDocument`
**Query Parameters**:
| Param | Type | Required | Description |
| --------------- | -------- | -------- | ---------------------------------------------- |
| `kDocumentId` | `string` | Yes | KDocument UUID to compute aggregates for |
| `partnershipId` | `string` | No | Override which partnership's rules to use |
**Response**: `200 OK`
```json
[
{
"ruleId": "uuid",
"name": "Income Summary",
"operation": "SUM",
"sourceCells": ["1", "2", "3", "4b", "5", "6a", "7"],
"computedValue": 187520.00,
"breakdown": {
"1": 52340,
"2": 35000,
"3": 0,
"4b": 15000,
"5": 8200,
"6a": 72980,
"7": 4000
}
}
]
```
**Errors**:
| Status | Condition |
| ------ | --------------------------- |
| 404 | KDocument not found |
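
The compute response above can be produced by a small pure function. The following is a minimal sketch, not the service implementation: the names `AggregationRule`, `ComputedAggregate`, and `computeAggregate` are hypothetical, and it assumes box values arrive as a plain map keyed by box number. Source cells with no value contribute 0, as the data model's validation rules require.

```typescript
// Hypothetical shapes; field names mirror the contract above.
interface AggregationRule {
  id: string;
  name: string;
  operation: "SUM"; // V1 supports SUM only
  sourceCells: string[];
}

interface ComputedAggregate {
  ruleId: string;
  name: string;
  operation: "SUM";
  sourceCells: string[];
  computedValue: number;
  breakdown: Record<string, number>;
}

// Derives one aggregate from raw box values; nothing is persisted.
function computeAggregate(
  rule: AggregationRule,
  boxValues: Record<string, number | null | undefined>
): ComputedAggregate {
  const breakdown: Record<string, number> = {};
  let total = 0;
  for (const box of rule.sourceCells) {
    // A source cell with no value in the KDocument contributes 0.
    const value = boxValues[box] ?? 0;
    breakdown[box] = value;
    total += value;
  }
  return {
    ruleId: rule.id,
    name: rule.name,
    operation: rule.operation,
    sourceCells: [...rule.sourceCells],
    computedValue: total,
    breakdown,
  };
}
```

Plain `number` arithmetic is used here for brevity; with fractional dollar amounts, a decimal library such as the `big.js` dependency already listed for this repository may be the safer choice.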

69
specs/004-k1-scan-import/data-model.md

@ -1,10 +1,10 @@
# Data Model: K-1 PDF Scan Import # Data Model: K-1 PDF Scan Import
**Phase 1 Output** | **Date**: 2026-03-18 **Phase 1 Output** | **Date**: 2026-03-18 | **Updated**: 2026-03-18 (post-clarification)
## Overview ## Overview
This feature adds 2 new Prisma models and 1 new enum to support K-1 PDF scanning, import session tracking, and cell mapping configuration. It extends the existing models from spec 001-family-office-transform (KDocument, Distribution, Document, PartnershipMembership) with automatic creation from scanned data. This feature adds 3 new Prisma models and 1 new enum to support K-1 PDF scanning, import session tracking, cell mapping configuration, and aggregation rules. It extends the existing models from spec 001-family-office-transform (KDocument, Distribution, Document, PartnershipMembership) with automatic creation from scanned data.
### Entity Relationship Diagram (Conceptual) ### Entity Relationship Diagram (Conceptual)
@ -18,9 +18,12 @@ User (existing)
│ └── [K-1 allocations computed at confirm time] │ └── [K-1 allocations computed at confirm time]
├── KDocument[] (existing from 001) ├── KDocument[] (existing from 001)
│ └── Distribution[] (auto-created from Box 19, existing from 001) │ └── Distribution[] (auto-created from Box 19, existing from 001)
└── CellMapping[] (new model, per-partnership overrides) ├── CellMapping[] (new model, per-partnership overrides)
└── CellAggregationRule[] (new model, per-partnership or global)
└── [computed totals derived dynamically from raw box values]
Global CellMapping (partnershipId = null) ── IRS default box definitions Global CellMapping (partnershipId = null) ── IRS default box definitions
Global CellAggregationRule (partnershipId = null) ── default summary rules
``` ```
## New Enum ## New Enum
@ -93,6 +96,29 @@ A configuration defining how K-1 box numbers map to labels. Supports a global IR
**Unique constraint**: `@@unique([partnershipId, boxNumber])` — one mapping per box per partnership (or per box globally when partnershipId is null). **Unique constraint**: `@@unique([partnershipId, boxNumber])` — one mapping per box per partnership (or per box globally when partnershipId is null).
### CellAggregationRule
A named rule that combines multiple K-1 cells into a computed summary value. Computed totals are NOT stored — they are derived dynamically from raw box values each time they are displayed (FR-039).
| Field | Type | Constraints | Description |
| --------------- | ---------- | -------------------------------------- | --------------------------------------------------------------- |
| `id` | `String` | PK, UUID, auto-generated | Unique identifier |
| `partnershipId` | `String?` | FK → Partnership.id, optional, indexed | Partnership this rule applies to (null = global default) |
| `name` | `String` | Required | Display name (e.g., "Income Summary", "Total Capital Gains") |
| `operation` | `String` | Required, Default: "SUM" | Aggregation operation (SUM for V1; future: AVG, MIN, MAX) |
| `sourceCells` | `Json` | Required | Array of box numbers to aggregate (e.g., ["1", "2", "3"]) |
| `sortOrder` | `Int` | Required | Display order in the aggregation summary section |
| `createdAt` | `DateTime` | Default: now() | Creation timestamp |
| `updatedAt` | `DateTime` | Auto-updated | Last modification timestamp |
**Relations**:
- `partnership` → `Partnership?` (many-to-one, optional, cascade delete)
**Unique constraint**: `@@unique([partnershipId, name])` — one rule per name per partnership (or globally).
**Note**: No `computedValue` column. Totals are always computed on-the-fly from the KDocument's raw box values using the `sourceCells` array and `operation`. This ensures summaries auto-update when underlying values change (e.g., estimated→final K-1 transition).
## Modifications to Existing Models ## Modifications to Existing Models
### Partnership (from spec 001) ### Partnership (from spec 001)
@ -100,9 +126,10 @@ A configuration defining how K-1 box numbers map to labels. Supports a global IR
Add back-references — no column changes: Add back-references — no column changes:
| New Field | Type | Description | | New Field | Type | Description |
| ----------------- | -------------------- | ------------------------------------ | | -------------------- | ------------------------ | ------------------------------------ |
| `importSessions` | `K1ImportSession[]` | Import attempts for this partnership | | `importSessions` | `K1ImportSession[]` | Import attempts for this partnership |
| `cellMappings` | `CellMapping[]` | Custom cell mapping configurations | | `cellMappings` | `CellMapping[]` | Custom cell mapping configurations |
| `aggregationRules` | `CellAggregationRule[]` | Custom aggregation rule definitions |
### KDocument (from spec 001) ### KDocument (from spec 001)
@ -131,9 +158,12 @@ interface K1ExtractionResult {
isFinal: boolean; isFinal: boolean;
}; };
/** Extracted box values */ /** Extracted box values — mapped to known cells */
fields: K1ExtractedField[]; fields: K1ExtractedField[];
/** Extracted values that didn't match any configured cell mapping */
unmappedItems: K1UnmappedItem[];
/** Overall extraction confidence (0.0–1.0) */ /** Overall extraction confidence (0.0–1.0) */
overallConfidence: number; overallConfidence: number;
@ -168,6 +198,32 @@ interface K1ExtractedField {
/** Whether user has manually edited this value */ /** Whether user has manually edited this value */
isUserEdited: boolean; isUserEdited: boolean;
/** Whether user has explicitly reviewed this field (required for medium/low confidence) */
isReviewed: boolean;
}
interface K1UnmappedItem {
/** Raw text label extracted from the PDF */
rawLabel: string;
/** Raw text value extracted */
rawValue: string;
/** Parsed numeric value (null if unparseable) */
numericValue: number | null;
/** Confidence score (0.0–1.0) */
confidence: number;
/** Page number where this was extracted */
pageNumber: number;
/** User action: 'assigned' (to a cell), 'discarded', or null (pending) */
resolution: 'assigned' | 'discarded' | null;
/** If assigned, the box number it was assigned to */
assignedBoxNumber: string | null;
} }
``` ```
@ -239,3 +295,6 @@ The standard box definitions seeded as global CellMapping records (partnershipId
6. **Confirmation prerequisites**: Can only confirm when status is VERIFIED, partnership has at least one active member, and verifiedData is not null. 6. **Confirmation prerequisites**: Can only confirm when status is VERIFIED, partnership has at least one active member, and verifiedData is not null.
7. **Duplicate KDocument check**: Before creating a KDocument, check for existing (partnershipId, type=K1, taxYear). If found, require explicit user decision (update existing or reject). 7. **Duplicate KDocument check**: Before creating a KDocument, check for existing (partnershipId, type=K1, taxYear). If found, require explicit user decision (update existing or reject).
8. **Distribution allocation**: Box 19a/19b amounts are allocated to members by ownership percentage as of the tax year's fiscal year end. Allocation amounts must sum exactly to the partnership-level total (handle rounding by adjusting the largest member's allocation). 8. **Distribution allocation**: Box 19a/19b amounts are allocated to members by ownership percentage as of the tax year's fiscal year end. Allocation amounts must sum exactly to the partnership-level total (handle rounding by adjusting the largest member's allocation).
9. **Aggregation rule source cells**: All box numbers in `sourceCells` must reference valid cell mapping entries. If a source cell has no value in the KDocument, it contributes 0 to the aggregate.
10. **Unmapped items resolution**: All unmapped items must be resolved (assigned to a cell or discarded) before the import session can transition to VERIFIED status.
11. **Review requirement**: All medium and low-confidence fields must have `isReviewed: true` before confirmation is allowed (FR-035). High-confidence fields are auto-set to `isReviewed: true`.
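
Rule 8 (distribution allocation) is the one rule in this list with non-obvious arithmetic. The sketch below is one possible reading of it, under stated assumptions: amounts are handled as integer cents, ownership percentages sum to 100, and the names `MemberShare` and `allocateCents` are hypothetical. Looking up ownership as of the fiscal year end is out of scope here.

```typescript
// Hypothetical input shape: one entry per active member.
interface MemberShare {
  memberId: string;
  ownershipPercent: number; // e.g. 33.34 for 33.34%
}

// Allocates a partnership-level total (in cents) by ownership percentage.
// Any rounding remainder is added to the largest member's allocation so
// the allocations sum exactly to the total.
function allocateCents(
  totalCents: number,
  members: MemberShare[]
): Map<string, number> {
  const allocations = new Map<string, number>();
  if (members.length === 0) {
    return allocations;
  }

  let allocated = 0;
  for (const member of members) {
    const share = Math.round((totalCents * member.ownershipPercent) / 100);
    allocations.set(member.memberId, share);
    allocated += share;
  }

  // Pick the member with the highest ownership (first one wins on ties).
  const largest = members.reduce((a, b) =>
    b.ownershipPercent > a.ownershipPercent ? b : a
  );
  const remainder = totalCents - allocated;
  allocations.set(
    largest.memberId,
    (allocations.get(largest.memberId) ?? 0) + remainder
  );
  return allocations;
}
```

Working in integer cents keeps the "sum exactly" check free of floating-point drift in the final comparison; a decimal library would serve the same purpose.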

33
specs/004-k1-scan-import/plan.md

@ -5,7 +5,7 @@
## Summary ## Summary
Automated K-1 PDF scanning that extracts structured IRS Schedule K-1 (Form 1065) data from uploaded PDFs, presents a verification screen for manual review/correction, and auto-creates downstream model objects (KDocument, Distributions, member allocations, Document). Uses a two-tier extraction approach: `pdf-parse` for digital PDFs (free, instant, local) and Azure AI Document Intelligence / `tesseract.js` fallback for scanned PDFs. Supports per-partnership cell mapping customization and import history with re-processing. Automated K-1 PDF scanning that extracts structured IRS Schedule K-1 (Form 1065) data from uploaded PDFs, presents a verification screen with auto-accepted high-confidence values and explicit review for medium/low-confidence fields, and auto-creates downstream model objects (KDocument, Distributions, member allocations, Document). Uses a two-tier extraction approach: `pdf-parse` for digital PDFs (free, instant, local) and Azure AI Document Intelligence / `tesseract.js` fallback for scanned PDFs. Supports per-partnership cell mapping customization, administrator-defined aggregation rules (dynamically computed summaries displayed on verification screen and KDocument detail view), an "Unmapped Items" section for unrecognized extractions, and import history with re-processing.
## Technical Context ## Technical Context
@ -17,7 +17,7 @@ Automated K-1 PDF scanning that extracts structured IRS Schedule K-1 (Form 1065)
**Project Type**: Web application (NestJS API + Angular SPA) — Nx monorepo **Project Type**: Web application (NestJS API + Angular SPA) — Nx monorepo
**Performance Goals**: PDF extraction < 30 seconds (SC-001), model creation < 5 seconds (SC-005), 90%+ accuracy for digital PDFs (SC-002) **Performance Goals**: PDF extraction < 30 seconds (SC-001), model creation < 5 seconds (SC-005), 90%+ accuracy for digital PDFs (SC-002)
**Constraints**: Self-hosted capable (Azure OCR optional), max PDF size 25 MB, K-1 Form 1065 only (V1) **Constraints**: Self-hosted capable (Azure OCR optional), max PDF size 25 MB, K-1 Form 1065 only (V1)
**Scale/Scope**: Single family office (10–50 partnerships, 10–50 K-1s/year), 2 new API modules, 3 new frontend pages **Scale/Scope**: Single family office (10–50 partnerships, 10–50 K-1s/year), 2 new API modules, 4 new frontend pages
## Constitution Check ## Constitution Check
@ -29,11 +29,11 @@ No constitution.md exists for this project. Gates assessed against standard engi
|------|--------|-------| |------|--------|-------|
| No unnecessary dependencies | PASS | 3 new packages (`pdf-parse`, `@azure/ai-form-recognizer`, `tesseract.js`) — each serves a distinct, justified purpose per research.md | | No unnecessary dependencies | PASS | 3 new packages (`pdf-parse`, `@azure/ai-form-recognizer`, `tesseract.js`) — each serves a distinct, justified purpose per research.md |
| Follows existing patterns | PASS | New NestJS modules follow existing controller/service/DTO pattern (mirrors `k-document`, `upload` modules) | | Follows existing patterns | PASS | New NestJS modules follow existing controller/service/DTO pattern (mirrors `k-document`, `upload` modules) |
| No breaking changes | PASS | 2 new Prisma models + 1 enum, back-references only on existing models — no column changes | | No breaking changes | PASS | 3 new Prisma models + 1 enum, back-references only on existing models — no column changes |
| Test coverage | PASS | Unit tests for extractors, mapper, allocation; integration tests for full pipeline | | Test coverage | PASS | Unit tests for extractors, mapper, allocation, aggregation; integration tests for full pipeline |
| Self-hosted compatible | PASS | Core extraction (pdf-parse) is fully local; Azure is optional with tesseract.js fallback | | Self-hosted compatible | PASS | Core extraction (pdf-parse) is fully local; Azure is optional with tesseract.js fallback |
**Post-Phase 1 re-check**: PASS — data model adds 2 models/1 enum, no existing schema changes beyond back-references. API contracts follow existing REST patterns. No violations identified. **Post-Phase 1 re-check**: PASS — data model adds 3 models/1 enum (K1ImportSession, CellMapping, CellAggregationRule, K1ImportStatus). No existing schema changes beyond back-references. API contracts follow existing REST patterns. Aggregation rules are dynamically computed — no stored denormalization. No violations identified.
## Project Structure ## Project Structure
@ -43,7 +43,7 @@ No constitution.md exists for this project. Gates assessed against standard engi
specs/004-k1-scan-import/ specs/004-k1-scan-import/
├── plan.md # This file ├── plan.md # This file
├── research.md # Phase 0: OCR provider research & decisions ├── research.md # Phase 0: OCR provider research & decisions
├── data-model.md # Phase 1: K1ImportSession, CellMapping models ├── data-model.md # Phase 1: K1ImportSession, CellMapping, CellAggregationRule models
├── quickstart.md # Phase 1: Setup & dev guide ├── quickstart.md # Phase 1: Setup & dev guide
├── contracts/ ├── contracts/
│ └── k1-import-api.md # Phase 1: REST API contracts │ └── k1-import-api.md # Phase 1: REST API contracts
@ -71,10 +71,11 @@ apps/api/src/app/
│ │ └── tesseract-extractor.ts │ │ └── tesseract-extractor.ts
│ ├── k1-field-mapper.service.ts │ ├── k1-field-mapper.service.ts
│ ├── k1-allocation.service.ts │ ├── k1-allocation.service.ts
│ └── k1-confidence.service.ts │ ├── k1-confidence.service.ts
│ └── k1-aggregation.service.ts # Dynamically computes aggregation summaries
├── cell-mapping/ ├── cell-mapping/
│ ├── cell-mapping.module.ts │ ├── cell-mapping.module.ts
│ ├── cell-mapping.controller.ts │ ├── cell-mapping.controller.ts # Cell mapping + aggregation rule CRUD
│ └── cell-mapping.service.ts │ └── cell-mapping.service.ts
apps/client/src/app/ apps/client/src/app/
@ -85,17 +86,19 @@ apps/client/src/app/
│ │ ├── k1-import-page.scss │ │ ├── k1-import-page.scss
│ │ ├── k1-import-page.routes.ts │ │ ├── k1-import-page.routes.ts
│ │ ├── k1-verification/ │ │ ├── k1-verification/
│ │ │ ├── k1-verification.component.ts │ │ │ ├── k1-verification.component.ts # Mapped cells + unmapped items + aggregations
│ │ │ ├── k1-verification.html │ │ │ ├── k1-verification.html
│ │ │ └── k1-verification.scss │ │ │ └── k1-verification.scss
│ │ └── k1-confirmation/ │ │ └── k1-confirmation/
│ │ ├── k1-confirmation.component.ts │ │ ├── k1-confirmation.component.ts
│ │ ├── k1-confirmation.html │ │ ├── k1-confirmation.html
│ │ └── k1-confirmation.scss │ │ └── k1-confirmation.scss
│ └── cell-mapping/ │ ├── cell-mapping/
│ ├── cell-mapping-page.component.ts │ │ ├── cell-mapping-page.component.ts # Cell mapping + aggregation rule config
│ ├── cell-mapping-page.html │ │ ├── cell-mapping-page.html
│ └── cell-mapping-page.routes.ts │ │ └── cell-mapping-page.routes.ts
│ └── k-document/ # Existing page — extended
│ └── k-document-detail/ # Add aggregation summary section (FR-036)
├── services/ ├── services/
│ └── k1-import-data.service.ts │ └── k1-import-data.service.ts
@ -109,7 +112,7 @@ libs/common/src/lib/
│ └── confirm-k1-import.dto.ts │ └── confirm-k1-import.dto.ts
prisma/ prisma/
├── schema.prisma # + K1ImportSession, CellMapping, K1ImportStatus ├── schema.prisma # + K1ImportSession, CellMapping, CellAggregationRule, K1ImportStatus
├── migrations/ ├── migrations/
│ └── 2026XXXX_added_k1_import/ # New migration │ └── 2026XXXX_added_k1_import/ # New migration
@ -118,4 +121,4 @@ test/import/
└── sample-k1-scanned.pdf # Test fixture: scanned K-1 └── sample-k1-scanned.pdf # Test fixture: scanned K-1
``` ```
**Structure Decision**: Follows the existing Nx monorepo convention with new NestJS modules under `apps/api/src/app/` and new Angular pages under `apps/client/src/app/pages/`. Shared interfaces and DTOs in `libs/common/`. This mirrors the existing `k-document`, `upload`, and `family-office` module patterns. **Structure Decision**: Follows the existing Nx monorepo convention with new NestJS modules under `apps/api/src/app/` and new Angular pages under `apps/client/src/app/pages/`. Shared interfaces and DTOs in `libs/common/`. This mirrors the existing `k-document`, `upload`, and `family-office` module patterns. The KDocument detail view is extended (not replaced) to display aggregation summaries.

28
specs/004-k1-scan-import/quickstart.md

@ -1,6 +1,6 @@
# Quickstart: K-1 PDF Scan Import # Quickstart: K-1 PDF Scan Import
**Phase 1 Output** | **Date**: 2026-03-18 **Phase 1 Output** | **Date**: 2026-03-18 | **Updated**: 2026-03-18 (post-clarification)
## Prerequisites ## Prerequisites
@ -28,7 +28,7 @@ npm install -D @types/pdf-parse
## Database Migration ## Database Migration
After adding the new Prisma models (`K1ImportSession`, `CellMapping`, `K1ImportStatus` enum): After adding the new Prisma models (`K1ImportSession`, `CellMapping`, `CellAggregationRule`, `K1ImportStatus` enum):
```bash ```bash
npx prisma db push # Development: sync schema npx prisma db push # Development: sync schema
@ -36,7 +36,7 @@ npx prisma db push # Development: sync schema
npx prisma migrate dev # Create a migration file npx prisma migrate dev # Create a migration file
``` ```
Seed the default IRS cell mappings (28 rows with partnershipId = null) via the existing seed mechanism or a dedicated seed script. Seed the default IRS cell mappings (28 rows with partnershipId = null) and default aggregation rules (e.g., "Total Ordinary Income", "Total Capital Gains", "Total Deductions") via the existing seed mechanism or a dedicated seed script.
## Key Files to Create ## Key Files to Create
@ -58,12 +58,13 @@ app/k1-import/
│ └── tesseract-extractor.ts # Tier 2 fallback: tesseract.js OCR │ └── tesseract-extractor.ts # Tier 2 fallback: tesseract.js OCR
├── k1-field-mapper.service.ts # Maps raw extraction → K1ExtractedField[] ├── k1-field-mapper.service.ts # Maps raw extraction → K1ExtractedField[]
├── k1-allocation.service.ts # Allocates K-1 amounts to members by ownership % ├── k1-allocation.service.ts # Allocates K-1 amounts to members by ownership %
└── k1-confidence.service.ts # Computes confidence scores with validation heuristics ├── k1-confidence.service.ts # Computes confidence scores with validation heuristics
└── k1-aggregation.service.ts # Dynamically computes aggregation summaries from rules
app/cell-mapping/ app/cell-mapping/
├── cell-mapping.module.ts # NestJS module ├── cell-mapping.module.ts # NestJS module
├── cell-mapping.controller.ts # CRUD for cell mappings ├── cell-mapping.controller.ts # CRUD for cell mappings + aggregation rules
└── cell-mapping.service.ts # Cell mapping business logic + seed data └── cell-mapping.service.ts # Cell mapping + aggregation rule business logic + seed data
``` ```
### Shared Types (libs/common/src/lib/) ### Shared Types (libs/common/src/lib/)
@ -87,7 +88,7 @@ pages/k1-import/
├── k1-import-page.scss ├── k1-import-page.scss
├── k1-import-page.routes.ts ├── k1-import-page.routes.ts
├── k1-verification/ ├── k1-verification/
│ ├── k1-verification.component.ts # Verification/edit screen │ ├── k1-verification.component.ts # Verification/edit screen (mapped + unmapped + aggregations)
│ ├── k1-verification.html │ ├── k1-verification.html
│ └── k1-verification.scss │ └── k1-verification.scss
└── k1-confirmation/ └── k1-confirmation/
@ -96,7 +97,7 @@ pages/k1-import/
└── k1-confirmation.scss └── k1-confirmation.scss
pages/cell-mapping/ pages/cell-mapping/
├── cell-mapping-page.component.ts # Cell mapping configuration UI ├── cell-mapping-page.component.ts # Cell mapping + aggregation rule configuration UI
├── cell-mapping-page.html ├── cell-mapping-page.html
└── cell-mapping-page.routes.ts └── cell-mapping-page.routes.ts
@ -108,13 +109,18 @@ services/
1. **Upload**: User selects PDF → `POST /api/v1/k1-import/upload` → session created with status PROCESSING 1. **Upload**: User selects PDF → `POST /api/v1/k1-import/upload` → session created with status PROCESSING
2. **Extract**: Backend detects PDF type (digital vs. scanned) → routes to appropriate extractor → status becomes EXTRACTED 2. **Extract**: Backend detects PDF type (digital vs. scanned) → routes to appropriate extractor → status becomes EXTRACTED
3. **Review**: Frontend polls/fetches session → displays verification screen with extracted fields, confidence indicators 3. **Review**: Frontend polls/fetches session → displays verification screen with:
4. **Edit**: User corrects values, overrides labels → `PUT /api/v1/k1-import/:id/verify` → status becomes VERIFIED - **Mapped cells**: extracted fields with confidence indicators. High-confidence values are pre-accepted. Medium/low-confidence values require explicit review (acknowledge or edit).
- **Unmapped items**: separate section for values that didn't match any cell. User assigns to a cell or discards.
- **Aggregation summaries**: dynamically computed from mapped values using aggregation rules. Recalculate live when cell values are edited.
4. **Verify**: User reviews all medium/low fields and resolves unmapped items → `PUT /api/v1/k1-import/:id/verify` → status becomes VERIFIED
5. **Confirm**: User clicks "Confirm & Save" → `POST /api/v1/k1-import/:id/confirm` → KDocument + Distributions + Document created → status becomes CONFIRMED 5. **Confirm**: User clicks "Confirm & Save" → `POST /api/v1/k1-import/:id/confirm` → KDocument + Distributions + Document created → status becomes CONFIRMED
## Testing Strategy ## Testing Strategy
- **Unit tests**: Extractors (pdf-parse, azure, tesseract), field mapper, confidence scoring, allocation math - **Unit tests**: Extractors (pdf-parse, azure, tesseract), field mapper, confidence scoring, allocation math, aggregation computation
- **Integration tests**: Full upload → extract → verify → confirm flow with test PDF fixtures - **Integration tests**: Full upload → extract → verify → confirm flow with test PDF fixtures
- **Test fixtures**: Include sample K-1 PDFs (digital and scanned) in `test/import/` directory - **Test fixtures**: Include sample K-1 PDFs (digital and scanned) in `test/import/` directory
- **Allocation accuracy**: Verify rounding behavior — allocated amounts must sum exactly to partnership total - **Allocation accuracy**: Verify rounding behavior — allocated amounts must sum exactly to partnership total
- **Aggregation tests**: Verify dynamic computation from rules, auto-recalculation on value edit, behavior when source cells are empty
- **Review enforcement**: Verify confirmation blocked when medium/low-confidence fields not reviewed or unmapped items unresolved
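
The review-enforcement case above could be exercised against a pure validation helper along these lines. This is a sketch only: `findVerifyBlockers` and the trimmed-down interfaces are hypothetical, and the messages are illustrative rather than the API's actual error text.

```typescript
type ConfidenceLevel = "HIGH" | "MEDIUM" | "LOW";

// Trimmed-down views of K1ExtractedField and K1UnmappedItem.
interface ReviewableField {
  boxNumber: string;
  confidenceLevel: ConfidenceLevel;
  isReviewed: boolean;
}

interface ResolvableItem {
  rawLabel: string;
  resolution: "assigned" | "discarded" | null;
}

// Returns human-readable reasons why a verify request should be rejected.
// An empty array means the payload passes these checks.
function findVerifyBlockers(
  fields: ReviewableField[],
  unmappedItems: ResolvableItem[]
): string[] {
  const blockers: string[] = [];
  if (fields.length === 0) {
    blockers.push("Fields array is empty");
  }
  for (const field of fields) {
    // Only medium/low-confidence fields need explicit review.
    if (field.confidenceLevel !== "HIGH" && !field.isReviewed) {
      blockers.push(`Box ${field.boxNumber} has not been reviewed`);
    }
  }
  for (const item of unmappedItems) {
    if (item.resolution === null) {
      blockers.push(`Unmapped item "${item.rawLabel}" is unresolved`);
    }
  }
  return blockers;
}
```

A controller could map a non-empty result to the `400` responses listed in the API contract.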

51
specs/004-k1-scan-import/research.md

@ -152,3 +152,54 @@ AZURE_DOCUMENT_INTELLIGENCE_KEY — Azure API key
**Alternatives Considered**: **Alternatives Considered**:
- `pdfjs-dist` directly instead of `pdf-parse` — more boilerplate, `pdf-parse` wraps it with a simpler API - `pdfjs-dist` directly instead of `pdf-parse` — more boilerplate, `pdf-parse` wraps it with a simpler API
- Only cloud OCR — loses the self-hosted story and adds cost for digital PDFs - Only cloud OCR — loses the self-hosted story and adds cost for digital PDFs
---
## Decision 10: Cell Aggregation Rules — Dynamic Computation
**Decision**: Persist only aggregation rule definitions (name, source cells, operation). Compute totals dynamically from raw K-1 box values at display time. Do NOT store computed totals.
**Rationale**:
- K-1 values can change during the import lifecycle (estimated → final transitions, manual edits after confirmation)
- Storing computed totals creates a denormalization risk — stale aggregates when underlying values change
- Computation is trivial (summing a handful of numbers) with no performance concern at family office scale
- Keeps a single source of truth: the raw box values in K1Data
- Aggregation rules are displayed on both the verification screen (FR-033) and KDocument detail view (FR-036)
**Alternatives Considered**:
- Persist computed totals alongside raw data — creates stale data risk, requires update triggers
- Persist both (snapshot + live) for audit — adds complexity V1 doesn't need; audit trail exists in import session history
---
## Decision 11: Unmapped Items Handling
**Decision**: Display extracted values that don't match any configured cell mapping in a separate "Unmapped Items" section on the verification screen. Administrator can assign to an existing cell, create a new custom cell, or discard.
**Rationale**:
- OCR/extraction may pull supplemental schedule items, footnotes, state-specific addenda
- Silently discarding loses potentially important data
- Auto-creating cells for every unmatched value creates noise
- Explicit user decision preserves data integrity while keeping mapped cells clean
- Assigned unmapped items update the cell mapping for future imports (learning effect)
**Alternatives Considered**:
- Silent discard — loses data, violates user's expectation of completeness
- Auto-create custom cells — too noisy; PDF footnotes and headers would create junk cells
---
## Decision 12: Verification Auto-Accept Strategy
**Decision**: Auto-accept (pre-check) high-confidence values on the verification screen. Require explicit review (acknowledge or edit) for medium and low-confidence values before allowing confirmation.
**Rationale**:
- V1 is "partially manual, partially automated" per user intent
- High-confidence values (≥ 0.85) from digital PDFs are reliably accurate (90%+ per SC-002)
- Forcing explicit review of every cell wastes time on correct values
- Blocking confirmation until medium/low-confidence fields are reviewed catches the errors
- All values remain visible and editable — user can override any pre-accepted value
**Alternatives Considered**:
- Every cell requires explicit accept — too slow for 15+ fields, doesn't match "partially automated" intent
- Spot-check model (everything auto-accepted) — too risky for tax data; OCR errors would go unreviewed
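
Decision 12 amounts to a one-line rule applied when the verification screen is populated. A minimal sketch, assuming the 0.85 threshold quoted above and a hypothetical helper named `applyAutoAccept`:

```typescript
// Threshold taken from Decision 12; the constant name is illustrative.
const HIGH_CONFIDENCE_THRESHOLD = 0.85;

// Trimmed-down view of K1ExtractedField.
interface AutoAcceptField {
  boxNumber: string;
  confidence: number; // 0.0 to 1.0
  isReviewed: boolean;
}

// Pre-checks high-confidence fields; everything else keeps its current
// review state and must be acknowledged or edited by the user.
function applyAutoAccept(fields: AutoAcceptField[]): AutoAcceptField[] {
  return fields.map((field) => ({
    ...field,
    isReviewed:
      field.isReviewed || field.confidence >= HIGH_CONFIDENCE_THRESHOLD,
  }));
}
```

The function returns new objects instead of mutating its input, so the original extraction result stays available for comparison if the user later overrides a pre-accepted value.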
