Data Model: Fix K-1 PDF Parser

Feature: 005-k1-parser-fix | Date: 2026-03-18

Overview

This feature modifies no database tables. All changes are to in-memory TypeScript interfaces in @ghostfolio/common. The extraction result flows through: PDF → extractor → K1ExtractionResult → review UI → confirm → persist to existing KDocument/K1Cell tables.

Entity Changes

K1ExtractedField (modified)

Existing interface at libs/common/src/lib/interfaces/k1-import.interface.ts. Three new fields are added (marked NEW below):

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| boxNumber | `string` | yes | Existing: "1", "6a", "19", "20" |
| label | `string` | yes | Existing: display label from cell mapping |
| customLabel | `string \| null` | no | Existing: user override |
| rawValue | `string` | yes | Existing: raw extracted text ("498,211", "(409,811)", "SEE STMT", "X") |
| numericValue | `number \| null` | no | Existing: parsed numeric value |
| confidence | `number` | yes | Existing: 0.0–1.0 |
| confidenceLevel | `'HIGH' \| 'MEDIUM' \| 'LOW'` | yes | Existing |
| isUserEdited | `boolean` | yes | Existing: default false |
| isReviewed | `boolean` | yes | Existing: default false |
| subtype | `string \| null` | no | NEW: subtype code (e.g., "ZZ*", "A", "B", "*"). Null for simple boxes. |
| fieldCategory | `string` | yes | NEW: "PART_III", "METADATA", "SECTION_J", "SECTION_K", "SECTION_L", "SECTION_M", "SECTION_N", "CHECKBOX" |
| isCheckbox | `boolean` | yes | NEW: true if field is a boolean checkbox value. Default false. |
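
As a rough sketch, the table above could translate into the following TypeScript shape. Mapping "Required: no" to an optional property is an assumption, the `K1ConfidenceLevel` alias is introduced here for readability, and the sample object uses illustrative values rather than real extraction output.

```typescript
// Alias introduced for this sketch; the source only lists the three literals.
type K1ConfidenceLevel = 'HIGH' | 'MEDIUM' | 'LOW';

interface K1ExtractedField {
  boxNumber: string;
  label: string;
  customLabel?: string | null; // assumption: "Required: no" maps to optional
  rawValue: string;
  numericValue?: number | null;
  confidence: number; // 0.0 to 1.0
  confidenceLevel: K1ConfidenceLevel;
  isUserEdited: boolean;
  isReviewed: boolean;
  subtype?: string | null; // NEW: null for simple boxes
  fieldCategory: string; // NEW: e.g. "PART_III", "SECTION_J"
  isCheckbox: boolean; // NEW: default false
}

// Illustrative sample only: a box 20 entry carrying subtype code "A".
const sampleField: K1ExtractedField = {
  boxNumber: '20',
  label: 'Other information',
  customLabel: null,
  rawValue: '498,211',
  numericValue: 498211,
  confidence: 0.95,
  confidenceLevel: 'HIGH',
  isUserEdited: false,
  isReviewed: false,
  subtype: 'A',
  fieldCategory: 'PART_III',
  isCheckbox: false
};
```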

K1UnmappedItem (modified)

Existing interface. Three new fields are added for position debugging:

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| rawLabel | `string` | yes | Existing |
| rawValue | `string` | yes | Existing |
| numericValue | `number \| null` | no | Existing |
| confidence | `number` | yes | Existing |
| pageNumber | `number` | yes | Existing |
| resolution | `'assigned' \| 'discarded' \| null` | no | Existing |
| assignedBoxNumber | `string \| null` | no | Existing |
| x | `number` | yes | NEW: x position in PDF points |
| y | `number` | yes | NEW: y position in PDF points |
| fontName | `string` | yes | NEW: PDF font identifier |
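
A minimal sketch of how the position fields might be used when debugging unmapped text. The interface follows the table above (optional properties for "Required: no" are an assumption); `describeUnmappedItem`, the font identifier, and the coordinates are hypothetical and not part of the feature.

```typescript
interface K1UnmappedItem {
  rawLabel: string;
  rawValue: string;
  numericValue?: number | null;
  confidence: number;
  pageNumber: number;
  resolution?: 'assigned' | 'discarded' | null;
  assignedBoxNumber?: string | null;
  x: number; // NEW: x position in PDF points
  y: number; // NEW: y position in PDF points
  fontName: string; // NEW: PDF font identifier
}

// Hypothetical helper: one log line per unmapped item, including its position.
function describeUnmappedItem(item: K1UnmappedItem): string {
  return `p${item.pageNumber} (${item.x}, ${item.y}) [${item.fontName}] ${item.rawLabel}: ${item.rawValue}`;
}

// Illustrative values only.
const strayItem: K1UnmappedItem = {
  rawLabel: 'Other',
  rawValue: '1,234',
  numericValue: 1234,
  confidence: 0.6,
  pageNumber: 1,
  resolution: null,
  assignedBoxNumber: null,
  x: 312,
  y: 540,
  fontName: 'g_d0_f2'
};

const strayItemLine = describeUnmappedItem(strayItem);
// strayItemLine is "p1 (312, 540) [g_d0_f2] Other: 1,234"
```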

K1ExtractionResult (unchanged)

No changes to the top-level extraction result interface. The metadata, fields, unmappedItems, overallConfidence, method, and pagesProcessed structure remains the same.

K1PositionRegion (new — internal to extractor)

This is NOT a shared interface — it lives inside the extractor module. It defines a bounding box for a K-1 form field region.

| Field | Type | Description |
| --- | --- | --- |
| fieldId | `string` | Unique identifier (e.g., "BOX_1", "J_PROFIT_BEGIN", "FINAL_K1") |
| boxNumber | `string` | K-1 box number for Part III fields; section identifier for others |
| label | `string` | Display label |
| fieldCategory | `string` | "PART_III", "METADATA", "SECTION_J", etc. |
| valueType | `string` | "numeric", "text", "checkbox", "percentage" |
| xMin | `number` | Left edge in PDF points |
| xMax | `number` | Right edge in PDF points |
| yMin | `number` | Bottom edge in PDF points |
| yMax | `number` | Top edge in PDF points |
| hasSubtype | `boolean` | Whether this region supports subtype codes |
| subtypeXMin | `number \| null` | Code column left edge (if hasSubtype) |
| subtypeXMax | `number \| null` | Code column right edge (if hasSubtype) |

K1PositionRegion Count: 73 regions

See research.md Decision 3 for the complete region map.
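
The bounding-box fields of K1PositionRegion suggest a simple point-in-rectangle lookup. The sketch below is one way that lookup might work, not the extractor's actual code: `PositionedText`, `findRegion`, and `isInSubtypeColumn` are hypothetical names, edges are assumed to be inclusive, and the two sample regions use made-up coordinates rather than entries from the real region map.

```typescript
interface K1PositionRegion {
  fieldId: string;
  boxNumber: string;
  label: string;
  fieldCategory: string;
  valueType: string; // "numeric" | "text" | "checkbox" | "percentage"
  xMin: number;
  xMax: number;
  yMin: number; // bottom edge (PDF y grows upward)
  yMax: number; // top edge
  hasSubtype: boolean;
  subtypeXMin: number | null;
  subtypeXMax: number | null;
}

// Hypothetical shape of a text item coming out of the PDF layer.
interface PositionedText {
  text: string;
  x: number;
  y: number;
}

// Returns the first region whose bounding box contains the item (edges inclusive).
function findRegion(
  regions: readonly K1PositionRegion[],
  item: PositionedText
): K1PositionRegion | undefined {
  return regions.find(
    (r) => item.x >= r.xMin && item.x <= r.xMax && item.y >= r.yMin && item.y <= r.yMax
  );
}

// True when x falls in the region's subtype code column, if it has one.
function isInSubtypeColumn(region: K1PositionRegion, x: number): boolean {
  return (
    region.hasSubtype &&
    region.subtypeXMin !== null &&
    region.subtypeXMax !== null &&
    x >= region.subtypeXMin &&
    x <= region.subtypeXMax
  );
}

// Made-up coordinates for illustration; see research.md for the real map.
const sampleRegions: K1PositionRegion[] = [
  {
    fieldId: 'BOX_1', boxNumber: '1', label: 'Ordinary business income (loss)',
    fieldCategory: 'PART_III', valueType: 'numeric',
    xMin: 440, xMax: 580, yMin: 600, yMax: 620,
    hasSubtype: false, subtypeXMin: null, subtypeXMax: null
  },
  {
    fieldId: 'BOX_20', boxNumber: '20', label: 'Other information',
    fieldCategory: 'PART_III', valueType: 'numeric',
    xMin: 400, xMax: 580, yMin: 200, yMax: 260,
    hasSubtype: true, subtypeXMin: 400, subtypeXMax: 430
  }
];

const hit = findRegion(sampleRegions, { text: '498,211', x: 500, y: 610 });
const miss = findRegion(sampleRegions, { text: 'stray', x: 50, y: 50 });
```

With these sample regions, `hit` resolves to the BOX_1 region and `miss` is `undefined`, which is the case that would end up in unmappedItems.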

Validation Rules

  1. boxNumber must be a valid K-1 box identifier (1-21, a/b/c sub-boxes, or section identifiers J/K/L/M/N)
  2. numericValue must be null for "SEE STMT" and checkbox fields
  3. isCheckbox: true requires rawValue: "X" and numericValue: null
  4. subtype is only set for boxes that support subtypes (11, 12, 13, 14, 15, 17, 19, 20, 21)
  5. Parenthesized values MUST have negative numericValue
  6. Percentage values (Section J) MUST preserve decimal precision (no rounding)
  7. confidence must be 0.0–1.0 with HIGH ≥ 0.90, MEDIUM 0.70–0.89, LOW 0.50–0.69
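
Rules 2, 3, 5, and 7 can be sketched as small pure functions. The function names are hypothetical. Two behaviours are assumptions because the rules above do not cover them: scores below 0.50 are treated as LOW, and scores between 0.89 and 0.90 are treated as MEDIUM. Stripping `$` and `%` before parsing is also an assumption about the raw text.

```typescript
type K1ConfidenceLevel = 'HIGH' | 'MEDIUM' | 'LOW';

// Rules 2 and 5: "SEE STMT" and checkboxes yield null; "(409,811)" yields -409811.
function parseK1Numeric(rawValue: string, isCheckbox = false): number | null {
  const text = rawValue.trim();
  if (isCheckbox || text.toUpperCase() === 'SEE STMT') {
    return null;
  }
  const isParenthesized = /^\(.*\)$/.test(text);
  // Drop parentheses, thousands separators, and (assumed) currency/percent signs.
  const digits = text.replace(/[(),$%\s]/g, '');
  if (!/^-?\d+(\.\d+)?$/.test(digits)) {
    return null; // not a number, e.g. free text
  }
  const value = Number(digits); // no rounding applied (rule 6)
  return isParenthesized ? -Math.abs(value) : value;
}

// Rule 3: a checkbox field must carry rawValue "X" and a null numericValue.
function isValidCheckbox(field: {
  isCheckbox: boolean;
  rawValue: string;
  numericValue?: number | null;
}): boolean {
  if (!field.isCheckbox) {
    return true;
  }
  return field.rawValue === 'X' && (field.numericValue ?? null) === null;
}

// Rule 7: map a 0.0 to 1.0 score onto a level.
function toConfidenceLevel(confidence: number): K1ConfidenceLevel {
  if (!(confidence >= 0 && confidence <= 1)) {
    throw new RangeError(`confidence out of range: ${confidence}`);
  }
  if (confidence >= 0.9) return 'HIGH';
  if (confidence >= 0.7) return 'MEDIUM';
  return 'LOW'; // assumption: also covers scores below 0.50
}
```

For example, `parseK1Numeric('(409,811)')` returns `-409811`, while `parseK1Numeric('SEE STMT')` and `parseK1Numeric('X', true)` both return `null`.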

State Transitions

No state machine changes. The existing K1ImportSession status flow remains:

UPLOADING → EXTRACTING → NEEDS_REVIEW → CONFIRMED → COMPLETED
                       ↘ EXTRACTION_FAILED
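
For reference, the flow above can be written as a transition table. This is only a sketch: the `K1ImportStatus` type name and `canTransition` helper are hypothetical, and treating COMPLETED and EXTRACTION_FAILED as terminal is an assumption, since the diagram shows no outgoing arrows from them.

```typescript
type K1ImportStatus =
  | 'UPLOADING'
  | 'EXTRACTING'
  | 'NEEDS_REVIEW'
  | 'CONFIRMED'
  | 'COMPLETED'
  | 'EXTRACTION_FAILED';

// Allowed next states, read directly off the diagram.
const NEXT_STATUSES: Record<K1ImportStatus, readonly K1ImportStatus[]> = {
  UPLOADING: ['EXTRACTING'],
  EXTRACTING: ['NEEDS_REVIEW', 'EXTRACTION_FAILED'],
  NEEDS_REVIEW: ['CONFIRMED'],
  CONFIRMED: ['COMPLETED'],
  COMPLETED: [], // assumed terminal
  EXTRACTION_FAILED: [] // assumed terminal
};

function canTransition(from: K1ImportStatus, to: K1ImportStatus): boolean {
  return NEXT_STATUSES[from].indexOf(to) !== -1;
}
```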

Database Impact

None. No Prisma schema changes. The existing K1Cell table already stores boxNumber, value, label, etc. The new subtype field on K1ExtractedField can be persisted in one of two ways: concatenated into the existing boxNumber field (e.g., "11-ZZ*", "20-A"), or stored in the existing metadata JSON column.
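
The concatenation option could look like the sketch below. Both helper names are hypothetical, and the round trip assumes that a bare box number never contains a dash, so the first "-" always separates box number from subtype.

```typescript
// Builds the stored key: "20" + "A" -> "20-A"; simple boxes stay unchanged.
function toStoredBoxNumber(boxNumber: string, subtype: string | null): string {
  return subtype ? `${boxNumber}-${subtype}` : boxNumber;
}

// Splits a stored key back into its parts at the first dash.
function fromStoredBoxNumber(stored: string): { boxNumber: string; subtype: string | null } {
  const dashIndex = stored.indexOf('-');
  if (dashIndex === -1) {
    return { boxNumber: stored, subtype: null };
  }
  return {
    boxNumber: stored.slice(0, dashIndex),
    subtype: stored.slice(dashIndex + 1)
  };
}
```

For example, `toStoredBoxNumber('11', 'ZZ*')` gives `"11-ZZ*"`, and `fromStoredBoxNumber('20-A')` gives box number `"20"` with subtype `"A"`.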