
Implementation Plan: Fix K-1 PDF Parser — Position-Based Extraction

Branch: 005-k1-parser-fix | Date: 2026-03-20 | Spec: spec.md
Input: Feature specification from /specs/005-k1-parser-fix/spec.md

Note: This template is filled in by the /speckit.plan command. See .specify/templates/plan-template.md for the execution workflow.

Summary

Rewrite the K-1 PDF parser to replace regex-based label matching with position-based text extraction using pdfjs-dist. The current regex parser incorrectly matches cell numbers instead of the actual data values. The new parser will use font discrimination (data fonts vs. template fonts) and map each text item's (x, y) coordinates to the bounding-box region of a K-1 form field. This fixes extraction for all Part I/II metadata, Part III boxes 1-21 (including subtypes, multi-value fields, and SEE STMT references), checkboxes, and Sections J/K/L/M/N. The existing PdfParseExtractor already implements position-based extraction; this spec refines its accuracy and adds confidence scoring, unmapped item handling, and dynamic font identification.
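
As a rough illustration of the approach summarized above, the sketch below filters text items by font and then assigns each remaining item to the first bounding box that contains its coordinates. It is not the actual PdfParseExtractor code. The `PositionedText`, `Region`, and `MappingResult` types, the `mapItemsToRegions` function, the field id `box1`, the font ids, and all coordinates are assumptions made for this example. With pdfjs-dist, the `str`, `fontName`, and position values would typically come from the items returned by `getTextContent()`, where `transform[4]` and `transform[5]` carry x and y.

```typescript
// Hypothetical shapes for illustration; the real interfaces live in
// libs/common/src/lib/interfaces/k1-import.interface.ts and may differ.
interface PositionedText {
  str: string; // text content of the item
  x: number; // horizontal position in PDF points
  y: number; // vertical position in PDF points (origin at bottom-left)
  fontName: string; // font id reported by the PDF text layer
}

interface Region {
  fieldId: string; // illustrative identifier such as "box1"
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

interface MappingResult {
  fields: Record<string, string[]>; // values collected per field
  unmapped: PositionedText[]; // data-font items outside every region
}

function mapItemsToRegions(
  items: PositionedText[],
  regions: Region[],
  dataFonts: Set<string>
): MappingResult {
  const fields: Record<string, string[]> = {};
  const unmapped: PositionedText[] = [];

  for (const item of items) {
    // Font discrimination: skip text drawn in template fonts (labels, cell numbers).
    if (!dataFonts.has(item.fontName)) {
      continue;
    }

    // Position mapping: first region whose bounding box contains the item.
    const region = regions.find(
      (r) => item.x >= r.x0 && item.x <= r.x1 && item.y >= r.y0 && item.y <= r.y1
    );

    if (region) {
      const bucket = fields[region.fieldId] ?? [];
      bucket.push(item.str);
      fields[region.fieldId] = bucket;
    } else {
      // Keep unmatched data so nothing is silently dropped.
      unmapped.push(item);
    }
  }

  return { fields, unmapped };
}

// Usage with made-up coordinates and font ids.
const sampleResult = mapItemsToRegions(
  [
    { str: "1", x: 300, y: 700, fontName: "g_d0_f1" }, // template cell number
    { str: "12,345", x: 320, y: 700, fontName: "g_d0_f2" }, // filled-in value
    { str: "SEE STMT", x: 50, y: 50, fontName: "g_d0_f2" } // data outside any region
  ],
  [{ fieldId: "box1", x0: 310, y0: 690, x1: 400, y1: 710 }],
  new Set(["g_d0_f2"])
);
```

In this sketch the template-font item is ignored, the value inside the region is collected under `box1`, and the out-of-region data item ends up in `unmapped`, which is one way to support the zero-data-loss constraint.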

Technical Context

Language/Version: TypeScript 5.x, Node.js ≥22.18.0
Primary Dependencies: NestJS 11+, Angular 21+, pdfjs-dist (position-based text extraction), Prisma ORM
Storage: PostgreSQL (via Prisma), Redis (caching), filesystem (uploaded PDFs)
Testing: Jest (unit + integration)
Target Platform: Linux server (Docker) / local dev (Windows/macOS)
Project Type: Web application (Nx monorepo: api + client + common + ui)
Performance Goals: <5 seconds for single-page K-1 extraction (SC-009)
Constraints: Zero data loss during extraction (SC-007); preserve existing API contract (FR-025)
Scale/Scope: Single-user family office; ~10-50 K-1 PDFs per tax year

Constitution Check

GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.

| Gate | Rule | Status | Notes |
|------|------|--------|-------|
| Nx boundary | Features respect project boundaries (api/client/common/ui) | PASS | Parser in @ghostfolio/api, interfaces in @ghostfolio/common, UI in @ghostfolio/client |
| NestJS module pattern | Module + Controller + Service structure | PASS | K1ImportModule already exists with proper DI |
| Prisma data layer | No direct SQL; use PrismaService | PASS | All DB access via Prisma ORM |
| TypeScript strict | No unused locals/params, path aliases | PASS | Existing codebase conventions followed |
| Simplicity first | YAGNI, minimal abstractions | PASS | Modifying existing PdfParseExtractor, not adding new layers |
| Interface-first design | Shared interfaces in @ghostfolio/common | PASS | K1ExtractionResult, K1ExtractedField, K1UnmappedItem already defined |
| Max 3 Nx projects per feature | api + common typical | PASS | Touches api + common only (client UI already exists, no changes needed) |

All gates pass. No violations requiring justification.

Project Structure

Documentation (this feature)

specs/005-k1-parser-fix/
├── plan.md              # This file
├── research.md          # Phase 0 output
├── data-model.md        # Phase 1 output
├── quickstart.md        # Phase 1 output
├── contracts/           # Phase 1 output
└── tasks.md             # Phase 2 output (/speckit.tasks)

Source Code (repository root)

apps/api/src/app/k1-import/
├── extractors/
│   ├── k1-extractor.interface.ts        # K1Extractor contract (no changes)
│   ├── k1-position-regions.ts           # MODIFY: refine bounding boxes, add tolerance config
│   ├── pdf-parse-extractor.ts           # MODIFY: core rewrite — font discrimination, position mapping
│   ├── azure-extractor.ts               # No changes (Tier 2)
│   └── tesseract-extractor.ts           # No changes (Tier 2 fallback)
├── k1-import.service.ts                 # Minor: add warning generation for unmapped items
├── k1-import.controller.ts              # No changes
├── k1-field-mapper.service.ts           # Minor: handle new confidence levels
├── k1-confidence.service.ts             # MODIFY: integrate position-match confidence
└── k1-import.module.ts                  # No changes

libs/common/src/lib/interfaces/
└── k1-import.interface.ts               # Minor: add fontName/position to K1UnmappedItem if needed

prisma/
└── schema.prisma                        # No changes (existing schema sufficient)

Structure Decision: Existing Nx monorepo structure is used. The core change is within apps/api/src/app/k1-import/extractors/ — specifically pdf-parse-extractor.ts and k1-position-regions.ts. No new modules, no new Nx projects.
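
The tolerance config planned for k1-position-regions.ts and the position-match confidence planned for k1-confidence.service.ts could interact roughly as sketched below. This is only one possible design. The `Box` and `MatchConfidence` types, the three confidence levels, the helper names, and the tolerance value are all assumptions, since the plan does not specify them.

```typescript
// All names and values here are hypothetical, for illustration only.
type MatchConfidence = "high" | "medium" | "none";

interface Box {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

// True when the point lies inside the box (edges included).
function contains(box: Box, x: number, y: number): boolean {
  return x >= box.x0 && x <= box.x1 && y >= box.y0 && y <= box.y1;
}

// Grow the box by `tolerance` points on every side.
function expand(box: Box, tolerance: number): Box {
  return {
    x0: box.x0 - tolerance,
    y0: box.y0 - tolerance,
    x1: box.x1 + tolerance,
    y1: box.y1 + tolerance
  };
}

// Strict hit => "high"; hit only within the tolerance margin => "medium";
// otherwise "none", in which case the item would be treated as unmapped.
function positionConfidence(
  box: Box,
  x: number,
  y: number,
  tolerance: number
): MatchConfidence {
  if (contains(box, x, y)) {
    return "high";
  }
  if (contains(expand(box, tolerance), x, y)) {
    return "medium";
  }
  return "none";
}

// Usage with a made-up region and a 2-point tolerance.
const sampleBox: Box = { x0: 0, y0: 0, x1: 10, y1: 10 };
const strictHit = positionConfidence(sampleBox, 5, 5, 2);
const marginHit = positionConfidence(sampleBox, 11, 5, 2);
const miss = positionConfidence(sampleBox, 13, 5, 2);
```

Separating the strict check from the tolerance check keeps slightly misaligned values (for example, from PDFs produced by different tax software) extractable while flagging them for review with a lower confidence.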

Complexity Tracking

No violations detected. All changes fit within existing module boundaries.