Data Model: Fix K-1 PDF Parser
Feature: 005-k1-parser-fix | Date: 2026-03-18
Overview
This feature modifies no database tables. All changes are to in-memory TypeScript interfaces in @ghostfolio/common. The extraction result flows through: PDF → extractor → K1ExtractionResult → review UI → confirm → persist to existing KDocument/K1Cell tables.
Entity Changes
K1ExtractedField (modified)
Existing interface at libs/common/src/lib/interfaces/k1-import.interface.ts. Three new fields added:
| Field | Type | Required | Description |
|---|---|---|---|
| boxNumber | string | yes | Existing: "1", "6a", "19", "20" |
| label | string | yes | Existing: display label from cell mapping |
| customLabel | string \| null | no | Existing: user override |
| rawValue | string | yes | Existing: raw extracted text ("498,211", "(409,811)", "SEE STMT", "X") |
| numericValue | number \| null | no | Existing: parsed numeric value |
| confidence | number | yes | Existing: 0.0–1.0 |
| confidenceLevel | 'HIGH' \| 'MEDIUM' \| 'LOW' | yes | Existing |
| isUserEdited | boolean | yes | Existing: default false |
| isReviewed | boolean | yes | Existing: default false |
| subtype | string | null | no | NEW: subtype code (e.g., "ZZ*", "A", "B", "*"). Null for simple boxes. |
| fieldCategory | string | yes | NEW: "PART_III", "METADATA", "SECTION_J", "SECTION_K", "SECTION_L", "SECTION_M", "SECTION_N", "CHECKBOX" |
| isCheckbox | boolean | yes | NEW: true if field is a boolean checkbox value. Default false. |
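As a rough sketch, the table above could translate to the following TypeScript shape. It assumes that "Required: no" maps to an optional property and that `fieldCategory` stays a plain `string`, as the table states. The `K1ConfidenceLevel` alias and the sample values are illustrative only and are not taken from the actual source file.

```typescript
// Hypothetical alias for readability; the real interface may inline the union.
type K1ConfidenceLevel = 'HIGH' | 'MEDIUM' | 'LOW';

// Sketch of K1ExtractedField, transcribed from the field table.
// Assumption: "Required: no" is modelled as an optional property.
interface K1ExtractedField {
  boxNumber: string;
  label: string;
  customLabel?: string | null;
  rawValue: string;
  numericValue?: number | null;
  confidence: number; // 0.0 to 1.0
  confidenceLevel: K1ConfidenceLevel;
  isUserEdited: boolean;
  isReviewed: boolean;
  subtype?: string | null; // NEW: null for simple boxes
  fieldCategory: string; // NEW: "PART_III", "METADATA", "SECTION_J", ...
  isCheckbox: boolean; // NEW: default false
}

// Illustrative instance: a box 20 entry with subtype code "A".
// The label and numbers are made-up sample data.
const box20A: K1ExtractedField = {
  boxNumber: '20',
  label: 'Other information',
  rawValue: '(409,811)',
  numericValue: -409811,
  confidence: 0.95,
  confidenceLevel: 'HIGH',
  isUserEdited: false,
  isReviewed: false,
  subtype: 'A',
  fieldCategory: 'PART_III',
  isCheckbox: false
};
```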
K1UnmappedItem (modified)
Existing interface. Three new fields for position debugging:
| Field | Type | Required | Description |
|---|---|---|---|
| rawLabel | string | yes | Existing |
| rawValue | string | yes | Existing |
| numericValue | number \| null | no | Existing |
| confidence | number | yes | Existing |
| pageNumber | number | yes | Existing |
| resolution | 'assigned' \| 'discarded' \| null | no | Existing |
| assignedBoxNumber | string \| null | no | Existing |
| x | number | yes | NEW: x position in PDF points |
| y | number | yes | NEW: y position in PDF points |
| fontName | string | yes | NEW: PDF font identifier |
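The sketch below shows how the new position fields might be used when debugging an unmapped item. The interface is transcribed from the table, again assuming "Required: no" means optional. `describeUnmappedItem` is a hypothetical helper, not part of the existing codebase, and the sample item is made-up data.

```typescript
// Sketch of K1UnmappedItem, transcribed from the field table.
interface K1UnmappedItem {
  rawLabel: string;
  rawValue: string;
  numericValue?: number | null;
  confidence: number;
  pageNumber: number;
  resolution?: 'assigned' | 'discarded' | null;
  assignedBoxNumber?: string | null;
  x: number; // NEW: x position in PDF points
  y: number; // NEW: y position in PDF points
  fontName: string; // NEW: PDF font identifier
}

// Hypothetical debug helper: renders one line per unmapped item so that
// stray text can be compared against the region map by position.
function describeUnmappedItem(item: K1UnmappedItem): string {
  const position = `(${item.x.toFixed(1)}, ${item.y.toFixed(1)})`;
  return `p${item.pageNumber} @ ${position} [${item.fontName}] "${item.rawLabel}" = "${item.rawValue}"`;
}

// Made-up sample item for illustration.
const sampleUnmapped: K1UnmappedItem = {
  rawLabel: 'Unknown',
  rawValue: '1,234',
  numericValue: 1234,
  confidence: 0.55,
  pageNumber: 1,
  x: 312.4,
  y: 540,
  fontName: 'F1'
};

const sampleDescription = describeUnmappedItem(sampleUnmapped);
```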
K1ExtractionResult (unchanged)
No changes to the top-level extraction result interface. The metadata, fields, unmappedItems, overallConfidence, method, and pagesProcessed structure remains the same.
K1PositionRegion (new — internal to extractor)
This is NOT a shared interface — it lives inside the extractor module. It defines a bounding box for a K-1 form field region.
| Field | Type | Description |
|---|---|---|
| fieldId | string | Unique identifier (e.g., "BOX_1", "J_PROFIT_BEGIN", "FINAL_K1") |
| boxNumber | string | K-1 box number for Part III fields; section identifier for others |
| label | string | Display label |
| fieldCategory | string | "PART_III", "METADATA", "SECTION_J", etc. |
| valueType | string | "numeric", "text", "checkbox", "percentage" |
| xMin | number | Left edge in PDF points |
| xMax | number | Right edge in PDF points |
| yMin | number | Bottom edge in PDF points |
| yMax | number | Top edge in PDF points |
| hasSubtype | boolean | Whether this region supports subtype codes |
| subtypeXMin | number \| null | Code column left edge (if hasSubtype) |
| subtypeXMax | number \| null | Code column right edge (if hasSubtype) |
K1PositionRegion Count: 73 regions
See research.md Decision 3 for the complete region map.
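To illustrate how a bounding-box region could be matched against a text item's position, here is a minimal sketch. The interface follows the table above. `findRegion` and `isInSubtypeColumn` are hypothetical helpers, the inclusive edge handling is an assumption, and the coordinates in the sample region are invented rather than taken from the region map in research.md.

```typescript
// Sketch of K1PositionRegion, transcribed from the field table.
interface K1PositionRegion {
  fieldId: string;
  boxNumber: string;
  label: string;
  fieldCategory: string;
  valueType: string; // "numeric", "text", "checkbox", "percentage"
  xMin: number;
  xMax: number;
  yMin: number;
  yMax: number;
  hasSubtype: boolean;
  subtypeXMin: number | null;
  subtypeXMax: number | null;
}

// Hypothetical lookup: first region whose bounding box contains the point.
// Assumption: all four edges are inclusive.
function findRegion(
  regions: readonly K1PositionRegion[],
  x: number,
  y: number
): K1PositionRegion | undefined {
  return regions.find(
    (r) => x >= r.xMin && x <= r.xMax && y >= r.yMin && y <= r.yMax
  );
}

// Hypothetical check for whether an x position falls in the code column.
function isInSubtypeColumn(region: K1PositionRegion, x: number): boolean {
  return (
    region.hasSubtype &&
    region.subtypeXMin !== null &&
    region.subtypeXMax !== null &&
    x >= region.subtypeXMin &&
    x <= region.subtypeXMax
  );
}

// Invented coordinates, for illustration only.
const sampleRegions: K1PositionRegion[] = [
  {
    fieldId: 'BOX_20',
    boxNumber: '20',
    label: 'Other information',
    fieldCategory: 'PART_III',
    valueType: 'numeric',
    xMin: 450,
    xMax: 580,
    yMin: 100,
    yMax: 160,
    hasSubtype: true,
    subtypeXMin: 450,
    subtypeXMax: 475
  }
];

const hit = findRegion(sampleRegions, 460, 120);
const miss = findRegion(sampleRegions, 10, 10);
```

A linear scan is likely adequate for a map of 73 regions; a spatial index would only matter for much larger maps.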
Validation Rules
- boxNumber must be a valid K-1 box identifier (1-21, a/b/c sub-boxes, or section identifiers J/K/L/M/N)
- numericValue must be null for "SEE STMT" and checkbox fields
- isCheckbox: true requires rawValue: "X" and numericValue: null
- subtype is only set for boxes that support subtypes (11, 12, 13, 14, 15, 17, 19, 20, 21)
- Parenthesized values MUST have negative numericValue
- Percentage values (Section J) MUST preserve decimal precision (no rounding)
- confidence must be 0.0–1.0 with HIGH ≥ 0.90, MEDIUM 0.70–0.89, LOW 0.50–0.69
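Several of the rules above can be sketched as small helpers. All three function names (`parseK1Numeric`, `confidenceLevelFor`, `validateField`) are hypothetical, as is the trimmed-down `FieldToValidate` type. Two points are assumptions because the rules do not cover them: confidence below 0.50 returns `null`, and values between 0.89 and 0.90 are treated as MEDIUM. The box-identifier and percentage-precision rules are not covered here.

```typescript
type K1ConfidenceLevel = 'HIGH' | 'MEDIUM' | 'LOW';

// Hypothetical subset of K1ExtractedField needed by these checks.
interface FieldToValidate {
  boxNumber: string;
  rawValue: string;
  numericValue?: number | null;
  confidence: number;
  subtype?: string | null;
  isCheckbox: boolean;
}

// Boxes that support subtype codes, per the validation rules.
const SUBTYPE_BOXES = new Set(['11', '12', '13', '14', '15', '17', '19', '20', '21']);

// Parses "498,211" or "(409,811)"; returns null for non-numeric text
// such as "SEE STMT" or "X". Parenthesized values become negative.
function parseK1Numeric(rawValue: string): number | null {
  const trimmed = rawValue.trim();
  const negative = /^\(.*\)$/.test(trimmed);
  const digits = (negative ? trimmed.slice(1, -1) : trimmed).replace(/,/g, '');
  if (!/^-?\d+(\.\d+)?$/.test(digits)) return null;
  const value = Number(digits);
  return negative ? -Math.abs(value) : value;
}

// Maps a score to a level. Assumption: below 0.50 has no level (null).
function confidenceLevelFor(confidence: number): K1ConfidenceLevel | null {
  if (confidence >= 0.9) return 'HIGH';
  if (confidence >= 0.7) return 'MEDIUM';
  if (confidence >= 0.5) return 'LOW';
  return null;
}

// Returns a list of rule violations; an empty list means the field passed.
function validateField(field: FieldToValidate): string[] {
  const errors: string[] = [];
  const raw = field.rawValue.trim();
  const numeric = field.numericValue ?? null;
  if (!(field.confidence >= 0 && field.confidence <= 1)) {
    errors.push('confidence must be within 0.0 to 1.0');
  }
  if (field.isCheckbox && (raw !== 'X' || numeric !== null)) {
    errors.push('checkbox fields require rawValue "X" and numericValue null');
  }
  if (raw === 'SEE STMT' && numeric !== null) {
    errors.push('numericValue must be null for "SEE STMT"');
  }
  if (/^\(.*\)$/.test(raw) && !(numeric !== null && numeric < 0)) {
    errors.push('parenthesized values must have a negative numericValue');
  }
  if (field.subtype != null && !SUBTYPE_BOXES.has(field.boxNumber)) {
    errors.push('subtype is only allowed on boxes that support subtypes');
  }
  return errors;
}
```

Returning a list of messages rather than throwing lets the review UI show every violation on a field at once; that is a design choice of this sketch, not something the spec requires.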
State Transitions
No state machine changes. The existing K1ImportSession status flow remains:
UPLOADING → EXTRACTING → NEEDS_REVIEW → CONFIRMED → COMPLETED
                      ↘ EXTRACTION_FAILED
Database Impact
None. No Prisma schema changes. The existing K1Cell table stores boxNumber, value, label, etc. The new subtype field on K1ExtractedField can either be concatenated into the existing boxNumber field for storage (e.g., "11-ZZ*", "20-A") or stored via the existing metadata JSON column.
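If the concatenation option is chosen, the round trip might look like the sketch below. `encodeBoxNumber` and `decodeBoxNumber` are hypothetical helper names. The sketch assumes that plain box numbers never contain a hyphen, so the first hyphen can safely be treated as the separator.

```typescript
// Hypothetical helper: joins box number and subtype for storage,
// e.g. ("11", "ZZ*") becomes "11-ZZ*". Simple boxes are stored unchanged.
function encodeBoxNumber(boxNumber: string, subtype?: string | null): string {
  return subtype ? `${boxNumber}-${subtype}` : boxNumber;
}

// Hypothetical inverse: splits on the first hyphen only.
// Assumption: box numbers themselves never contain "-".
function decodeBoxNumber(stored: string): { boxNumber: string; subtype: string | null } {
  const separatorIndex = stored.indexOf('-');
  if (separatorIndex === -1) {
    return { boxNumber: stored, subtype: null };
  }
  return {
    boxNumber: stored.slice(0, separatorIndex),
    subtype: stored.slice(separatorIndex + 1)
  };
}
```

Storing the subtype in the metadata JSON column instead would avoid overloading boxNumber, at the cost of making subtype harder to query directly.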