This feature adds 2 new Prisma models and 1 new enum to support K-1 PDF scanning, import session tracking, and cell mapping configuration. It extends the existing models from spec 001-family-office-transform (KDocument, Distribution, Document, PartnershipMembership) with automatic creation from scanned data.
This feature adds 3 new Prisma models and 1 new enum to support K-1 PDF scanning, import session tracking, cell mapping configuration, and aggregation rules. It extends the existing models from spec 001-family-office-transform (KDocument, Distribution, Document, PartnershipMembership) with automatic creation from scanned data.
### Entity Relationship Diagram (Conceptual)
@ -18,9 +18,12 @@ User (existing)
│ └── [K-1 allocations computed at confirm time]
├── KDocument[] (existing from 001)
│ └── Distribution[] (auto-created from Box 19, existing from 001)
└── CellMapping[] (new model, per-partnership overrides)
├── CellMapping[] (new model, per-partnership overrides)
└── CellAggregationRule[] (new model, per-partnership or global)
└── [computed totals derived dynamically from raw box values]
Global CellMapping (partnershipId = null) ── IRS default box definitions
Global CellAggregationRule (partnershipId = null) ── default summary rules
```
## New Enum
@ -93,6 +96,29 @@ A configuration defining how K-1 box numbers map to labels. Supports a global IR
**Unique constraint**: `@@unique([partnershipId, boxNumber])` — one mapping per box per partnership (or per box globally when partnershipId is null).
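
The override-then-default lookup implied by this constraint can be sketched as follows. This is a minimal illustration, not the project's service code: the `CellMappingRow` shape, its `label` column, and the `resolveCellMapping` helper are assumptions made for the example; only `partnershipId` and `boxNumber` come from the spec.

```typescript
// Hypothetical row shape; only partnershipId and boxNumber are named in the spec.
interface CellMappingRow {
  partnershipId: string | null; // null marks a global IRS default
  boxNumber: string;
  label: string; // assumed column name
}

// Prefer the partnership-specific mapping for a box and fall back to the
// global default (partnershipId = null) when no override exists.
function resolveCellMapping(
  rows: CellMappingRow[],
  partnershipId: string,
  boxNumber: string,
): CellMappingRow | undefined {
  const forBox = rows.filter((row) => row.boxNumber === boxNumber);
  return (
    forBox.find((row) => row.partnershipId === partnershipId) ??
    forBox.find((row) => row.partnershipId === null)
  );
}
```

A box with no override and no global row resolves to `undefined`, which is one way an extracted value could end up among the unmapped items.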
### CellAggregationRule
A named rule that combines multiple K-1 cells into a computed summary value. Computed totals are NOT stored — they are derived dynamically from raw box values each time they are displayed (FR-039).
**Unique constraint**: `@@unique([partnershipId, name])` — one rule per name per partnership (or globally).
**Note**: No `computedValue` column. Totals are always computed on-the-fly from the KDocument's raw box values using the `sourceCells` array and `operation`. This ensures summaries auto-update when underlying values change (e.g., estimated→final K-1 transition).
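
As a rough sketch of the on-the-fly computation described in the note, assuming raw box values are available as a map keyed by box number: the spec does not enumerate the allowed `operation` values, so the single `'SUM'` operation, the `AggregationRule` shape, and the `computeAggregate` helper below are illustrative assumptions.

```typescript
// Assumed: the spec names `operation` but does not list its values.
type AggregationOperation = 'SUM';

// Hypothetical in-memory view of a CellAggregationRule.
interface AggregationRule {
  name: string;
  sourceCells: string[]; // box numbers to combine
  operation: AggregationOperation;
}

// Raw box values keyed by box number; a missing or null entry means "no value".
type BoxValues = Record<string, number | null | undefined>;

// Derive the summary at display time; nothing is persisted.
function computeAggregate(rule: AggregationRule, boxValues: BoxValues): number {
  if (rule.operation !== 'SUM') {
    throw new Error(`Unsupported operation: ${String(rule.operation)}`);
  }
  // A source cell without a value contributes 0 to the aggregate.
  return rule.sourceCells.reduce(
    (total, boxNumber) => total + (boxValues[boxNumber] ?? 0),
    0,
  );
}
```

Because the function reads the current box values on every call, a summary follows an estimated-to-final K-1 update without any recalculation step.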
## Modifications to Existing Models
### Partnership (from spec 001)
@ -100,9 +126,10 @@ A configuration defining how K-1 box numbers map to labels. Supports a global IR
/** Extracted box values — mapped to known cells */
fields: K1ExtractedField[];
/** Extracted values that didn't match any configured cell mapping */
unmappedItems: K1UnmappedItem[];
/** Overall extraction confidence (0.0–1.0) */
overallConfidence: number;
@ -168,6 +198,32 @@ interface K1ExtractedField {
/** Whether user has manually edited this value */
isUserEdited: boolean;
/** Whether user has explicitly reviewed this field (required for medium/low confidence) */
isReviewed: boolean;
}
interface K1UnmappedItem {
/** Raw text label extracted from the PDF */
rawLabel: string;
/** Raw text value extracted */
rawValue: string;
/** Parsed numeric value (null if unparseable) */
numericValue: number | null;
/** Confidence score (0.0–1.0) */
confidence: number;
/** Page number where this was extracted */
pageNumber: number;
/** User action: 'assigned' (to a cell), 'discarded', or null (pending) */
resolution: 'assigned' | 'discarded' | null;
/** If assigned, the box number it was assigned to */
assignedBoxNumber: string | null;
}
```
@ -239,3 +295,6 @@ The standard box definitions seeded as global CellMapping records (partnershipId
6. **Confirmation prerequisites**: Can only confirm when status is VERIFIED, partnership has at least one active member, and verifiedData is not null.
7. **Duplicate KDocument check**: Before creating a KDocument, check for existing (partnershipId, type=K1, taxYear). If found, require explicit user decision (update existing or reject).
8. **Distribution allocation**: Box 19a/19b amounts are allocated to members by ownership percentage as of the tax year's fiscal year end. Allocation amounts must sum exactly to the partnership-level total (handle rounding by adjusting the largest member's allocation).
9. **Aggregation rule source cells**: All box numbers in `sourceCells` must reference valid cell mapping entries. If a source cell has no value in the KDocument, it contributes 0 to the aggregate.
10. **Unmapped items resolution**: All unmapped items must be resolved (assigned to a cell or discarded) before the import session can transition to VERIFIED status.
11. **Review requirement**: All medium and low-confidence fields must have `isReviewed: true` before confirmation is allowed (FR-035). High-confidence fields are auto-set to `isReviewed: true`.
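
The distribution allocation rule (rule 8) can be illustrated with a short sketch. It assumes amounts are handled as integer cents and that ownership percentages sum to 100; the `MemberShare` shape and the `allocateDistribution` helper are hypothetical names, not part of the existing codebase.

```typescript
// Hypothetical input shape: ownership as of the fiscal year end, in percent (0-100).
interface MemberShare {
  memberId: string;
  ownershipPercent: number;
}

// Allocate a partnership-level amount (integer cents) by ownership percentage.
// Any rounding remainder is added to the largest member's allocation so the
// parts sum exactly to the total. Assumes the percentages sum to 100.
function allocateDistribution(
  totalCents: number,
  members: MemberShare[],
): Map<string, number> {
  if (members.length === 0) {
    throw new Error('At least one member is required');
  }
  const allocations = new Map<string, number>();
  let allocated = 0;
  let largest = members[0];
  for (const member of members) {
    const cents = Math.round((totalCents * member.ownershipPercent) / 100);
    allocations.set(member.memberId, cents);
    allocated += cents;
    if (member.ownershipPercent > largest.ownershipPercent) {
      largest = member;
    }
  }
  const remainder = totalCents - allocated;
  allocations.set(
    largest.memberId,
    (allocations.get(largest.memberId) ?? 0) + remainder,
  );
  return allocations;
}
```

Working in integer cents keeps the "sum exactly" requirement free of floating-point drift; if percentages do not sum to 100, this sketch would push the whole shortfall onto the largest member, so a real implementation would likely validate that first.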
Automated K-1 PDF scanning that extracts structured IRS Schedule K-1 (Form 1065) data from uploaded PDFs, presents a verification screen for manual review/correction, and auto-creates downstream model objects (KDocument, Distributions, member allocations, Document). Uses a two-tier extraction approach: `pdf-parse` for digital PDFs (free, instant, local) and Azure AI Document Intelligence / `tesseract.js` fallback for scanned PDFs. Supports per-partnership cell mapping customization and import history with re-processing.
Automated K-1 PDF scanning that extracts structured IRS Schedule K-1 (Form 1065) data from uploaded PDFs, presents a verification screen with auto-accepted high-confidence values and explicit review for medium/low-confidence fields, and auto-creates downstream model objects (KDocument, Distributions, member allocations, Document). Uses a two-tier extraction approach: `pdf-parse` for digital PDFs (free, instant, local) and Azure AI Document Intelligence / `tesseract.js` fallback for scanned PDFs. Supports per-partnership cell mapping customization, administrator-defined aggregation rules (dynamically computed summaries displayed on verification screen and KDocument detail view), an "Unmapped Items" section for unrecognized extractions, and import history with re-processing.
## Technical Context
@ -17,7 +17,7 @@ Automated K-1 PDF scanning that extracts structured IRS Schedule K-1 (Form 1065)
**Project Type**: Web application (NestJS API + Angular SPA) — Nx monorepo
**Performance Goals**: PDF extraction < 30 seconds (SC-001), model creation < 5 seconds (SC-005), 90%+ accuracy for digital PDFs (SC-002)
**Constraints**: Self-hosted capable (Azure OCR optional), max PDF size 25 MB, K-1 Form 1065 only (V1)
**Scale/Scope**: Single family office (10–50 partnerships, 10–50 K-1s/year), 2 new API modules, 3 new frontend pages
**Scale/Scope**: Single family office (10–50 partnerships, 10–50 K-1s/year), 2 new API modules, 4 new frontend pages
## Constitution Check
@ -29,11 +29,11 @@ No constitution.md exists for this project. Gates assessed against standard engi
|------|--------|-------|
| No unnecessary dependencies | PASS | 3 new packages (`pdf-parse`, `@azure/ai-form-recognizer`, `tesseract.js`) — each serves a distinct, justified purpose per research.md |
└── sample-k1-scanned.pdf # Test fixture: scanned K-1
```
**Structure Decision**: Follows the existing Nx monorepo convention with new NestJS modules under `apps/api/src/app/` and new Angular pages under `apps/client/src/app/pages/`. Shared interfaces and DTOs in `libs/common/`. This mirrors the existing `k-document`, `upload`, and `family-office` module patterns.
**Structure Decision**: Follows the existing Nx monorepo convention with new NestJS modules under `apps/api/src/app/` and new Angular pages under `apps/client/src/app/pages/`. Shared interfaces and DTOs in `libs/common/`. This mirrors the existing `k-document`, `upload`, and `family-office` module patterns. The KDocument detail view is extended (not replaced) to display aggregation summaries.
Seed the default IRS cell mappings (28 rows with partnershipId = null) via the existing seed mechanism or a dedicated seed script.
Seed the default IRS cell mappings (28 rows with partnershipId = null) and default aggregation rules (e.g., "Total Ordinary Income", "Total Capital Gains", "Total Deductions") via the existing seed mechanism or a dedicated seed script.
**Decision**: Persist only aggregation rule definitions (name, source cells, operation). Compute totals dynamically from raw K-1 box values at display time. Do NOT store computed totals.
**Rationale**:
- K-1 values can change during the import lifecycle (estimated → final transitions, manual edits after confirmation)
- Storing computed totals creates a denormalization risk — stale aggregates when underlying values change
- Computation is trivial (summing a handful of numbers) with no performance concern at family office scale
- Keeps a single source of truth: the raw box values in K1Data
- Aggregation rules are displayed on both the verification screen (FR-033) and KDocument detail view (FR-036)
**Alternatives Considered**:
- Persist computed totals alongside raw data — creates stale data risk, requires update triggers
- Persist both (snapshot + live) for audit — adds complexity V1 doesn't need; audit trail exists in import session history
---
## Decision 11: Unmapped Items Handling
**Decision**: Display extracted values that don't match any configured cell mapping in a separate "Unmapped Items" section on the verification screen. The administrator can assign each item to an existing cell, create a new custom cell for it, or discard it.
**Rationale**:
- OCR/extraction may pull supplemental schedule items, footnotes, state-specific addenda
- Silently discarding loses potentially important data
- Auto-creating cells for every unmatched value creates noise
- Explicit user decision preserves data integrity while keeping mapped cells clean
- Assigned unmapped items update the cell mapping for future imports (learning effect)
**Alternatives Considered**:
- Auto-create custom cells: too noisy; PDF footnotes and headers would create junk cells
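
The resolution flow for unmapped items could look roughly like the sketch below. The `K1UnmappedItem` interface is copied from the data model; the helper functions are hypothetical and leave out persistence and the cell-mapping update that an assignment would trigger.

```typescript
interface K1UnmappedItem {
  rawLabel: string;
  rawValue: string;
  numericValue: number | null;
  confidence: number;
  pageNumber: number;
  resolution: 'assigned' | 'discarded' | null;
  assignedBoxNumber: string | null;
}

// Hypothetical helper: record that the administrator assigned the item to a box.
function assignUnmappedItem(item: K1UnmappedItem, boxNumber: string): K1UnmappedItem {
  return { ...item, resolution: 'assigned', assignedBoxNumber: boxNumber };
}

// Hypothetical helper: record that the administrator discarded the item.
function discardUnmappedItem(item: K1UnmappedItem): K1UnmappedItem {
  return { ...item, resolution: 'discarded', assignedBoxNumber: null };
}

// The import session may only move to VERIFIED once nothing is pending.
function allUnmappedResolved(items: K1UnmappedItem[]): boolean {
  return items.every((item) => item.resolution !== null);
}
```

Returning new objects rather than mutating keeps the original extraction result intact, which may be useful when an import is re-processed.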
---
## Decision 12: Verification Auto-Accept Strategy
**Decision**: Auto-accept (pre-check) high-confidence values on the verification screen. Require explicit review (acknowledge or edit) for medium and low-confidence values before allowing confirmation.
**Rationale**:
- V1 is "partially manual, partially automated" per user intent
- High-confidence values (≥ 0.85) from digital PDFs are reliably accurate (90%+ per SC-002)
- Forcing explicit review of every cell wastes time on correct values
- Blocking confirmation until medium/low-confidence fields are reviewed catches the errors
- All values remain visible and editable — user can override any pre-accepted value
**Alternatives Considered**:
- Every cell requires explicit accept — too slow for 15+ fields, doesn't match "partially automated" intent
- Spot-check model (everything auto-accepted) — too risky for tax data; OCR errors would go unreviewed
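
A minimal sketch of the auto-accept strategy, using the 0.85 high-confidence threshold from the rationale above: the `ReviewableField` shape is a simplified stand-in for `K1ExtractedField` (its `boxNumber` and `confidence` properties are assumed here), and both helper names are hypothetical.

```typescript
// Threshold for "high confidence" as stated in the rationale.
const HIGH_CONFIDENCE_THRESHOLD = 0.85;

// Simplified stand-in for K1ExtractedField; boxNumber and confidence are assumed.
interface ReviewableField {
  boxNumber: string;
  confidence: number;
  isUserEdited: boolean;
  isReviewed: boolean;
}

// Pre-check high-confidence fields; medium/low-confidence fields stay pending
// until the user acknowledges or edits them.
function preAcceptHighConfidence(fields: ReviewableField[]): ReviewableField[] {
  return fields.map((field) => ({
    ...field,
    isReviewed: field.isReviewed || field.confidence >= HIGH_CONFIDENCE_THRESHOLD,
  }));
}

// Confirmation stays blocked while this list is non-empty.
function fieldsAwaitingReview(fields: ReviewableField[]): ReviewableField[] {
  return fields.filter((field) => !field.isReviewed);
}
```

Pre-accepted fields keep `isUserEdited: false`, so a later manual override can still be told apart from an automatic acceptance.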