Implementation Plan: Fix K-1 PDF Parser — Position-Based Extraction

Branch: 005-k1-parser-fix | Date: 2026-03-20 | Spec: spec.md
Input: Feature specification from /specs/005-k1-parser-fix/spec.md

Note: This template is filled in by the /speckit.plan command. See .specify/templates/plan-template.md for the execution workflow.

Summary

Rewrite the K-1 PDF parser from regex-based label matching to position-based text extraction using pdfjs-dist. The current regex parser incorrectly matches cell numbers instead of the actual data values. The new parser uses font discrimination (data fonts vs. template fonts) and maps each text item's (x, y) coordinates to a bounding-box region defined for each K-1 form field. This fixes extraction for all Part I/II metadata, Part III boxes 1-21 (including subtypes, multi-value fields, and SEE STMT references), checkboxes, and Sections J/K/L/M/N. The existing PdfParseExtractor already implements position-based extraction; this spec refines its accuracy and adds confidence scoring, unmapped item handling, and dynamic font identification.
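
The font discrimination and region mapping described above can be sketched roughly as follows. This is a minimal illustration, not the actual PdfParseExtractor code: the `PositionedText`, `Region`, and `RegionMatchResult` shapes, the `mapItemsToRegions` function, the font names, and all coordinates are hypothetical. It also assumes the set of data fonts is already known, so dynamic font identification is out of scope here.

```typescript
// Hypothetical shapes for illustration only; not the project's real interfaces.
interface PositionedText {
  text: string;
  x: number; // horizontal position in PDF units
  y: number; // vertical position in PDF units
  fontName: string;
}

interface Region {
  fieldName: string;
  xMin: number;
  xMax: number;
  yMin: number;
  yMax: number;
}

interface RegionMatchResult {
  fields: Record<string, string[]>;
  unmapped: PositionedText[];
}

function mapItemsToRegions(
  items: PositionedText[],
  regions: Region[],
  dataFonts: Set<string>
): RegionMatchResult {
  const fields: Record<string, string[]> = {};
  const unmapped: PositionedText[] = [];

  for (const item of items) {
    // Font discrimination: skip template text such as labels and box numbers.
    if (!dataFonts.has(item.fontName)) {
      continue;
    }

    // Position mapping: find the first region whose bounding box contains the item.
    const region = regions.find(
      (r) =>
        item.x >= r.xMin &&
        item.x <= r.xMax &&
        item.y >= r.yMin &&
        item.y <= r.yMax
    );

    if (region) {
      if (!fields[region.fieldName]) {
        fields[region.fieldName] = [];
      }
      // A field may hold several values (multi-value boxes).
      fields[region.fieldName].push(item.text);
    } else {
      // Data-font text outside every region is kept, not dropped.
      unmapped.push(item);
    }
  }

  return { fields, unmapped };
}

// Made-up coordinates and font names.
const result = mapItemsToRegions(
  [
    { text: '1', x: 310, y: 650, fontName: 'TemplateFont' },
    { text: '12,500', x: 340, y: 650, fontName: 'DataFont' },
    { text: 'SEE STMT', x: 500, y: 100, fontName: 'DataFont' }
  ],
  [{ fieldName: 'box1', xMin: 300, xMax: 420, yMin: 640, yMax: 660 }],
  new Set(['DataFont'])
);
```

In this example the box number `1` lies inside the `box1` region but is dropped because it uses a template font, which is the failure mode the regex parser could not avoid. Data-font text that falls outside every region is collected as unmapped rather than discarded, in keeping with the zero-data-loss constraint.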

Technical Context

Language/Version: TypeScript 5.x, Node.js ≥22.18.0
Primary Dependencies: NestJS 11+, Angular 21+, pdfjs-dist (position-based text extraction), Prisma ORM
Storage: PostgreSQL (via Prisma), Redis (caching), filesystem (uploaded PDFs)
Testing: Jest (unit + integration)
Target Platform: Linux server (Docker) / local dev (Windows/macOS)
Project Type: Web application (Nx monorepo: api + client + common + ui)
Performance Goals: <5 seconds for single-page K-1 extraction (SC-009)
Constraints: Zero data loss during extraction (SC-007); preserve existing API contract (FR-025)
Scale/Scope: Single-user family office; ~10-50 K-1 PDFs per tax year

Constitution Check

GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.

Gate	Rule	Status	Notes
Nx boundary	Features respect project boundaries (api/client/common/ui)	✅ PASS	Parser in `@ghostfolio/api`, interfaces in `@ghostfolio/common`, UI in `@ghostfolio/client`
NestJS module pattern	Module + Controller + Service structure	✅ PASS	`K1ImportModule` already exists with proper DI
Prisma data layer	No direct SQL; use PrismaService	✅ PASS	All DB access via Prisma ORM
TypeScript strict	No unused locals/params, path aliases	✅ PASS	Existing codebase conventions followed
Simplicity first	YAGNI, minimal abstractions	✅ PASS	Modifying existing `PdfParseExtractor`, not adding new layers
Interface-first design	Shared interfaces in `@ghostfolio/common`	✅ PASS	`K1ExtractionResult`, `K1ExtractedField`, `K1UnmappedItem` already defined
Max 3 Nx projects per feature	api + common typical	✅ PASS	Touches api + common only (client UI already exists, no changes needed)

All gates pass. No violations requiring justification.

Project Structure

Documentation (this feature)

specs/005-k1-parser-fix/
├── plan.md              # This file
├── research.md          # Phase 0 output
├── data-model.md        # Phase 1 output
├── quickstart.md        # Phase 1 output
├── contracts/           # Phase 1 output
└── tasks.md             # Phase 2 output (/speckit.tasks)

Source Code (repository root)

apps/api/src/app/k1-import/
├── extractors/
│   ├── k1-extractor.interface.ts        # K1Extractor contract (no changes)
│   ├── k1-position-regions.ts           # MODIFY: refine bounding boxes, add tolerance config
│   ├── pdf-parse-extractor.ts           # MODIFY: core rewrite — font discrimination, position mapping
│   ├── azure-extractor.ts               # No changes (Tier 2)
│   └── tesseract-extractor.ts           # No changes (Tier 2 fallback)
├── k1-import.service.ts                 # Minor: add warning generation for unmapped items
├── k1-import.controller.ts              # No changes
├── k1-field-mapper.service.ts           # Minor: handle new confidence levels
├── k1-confidence.service.ts             # MODIFY: integrate position-match confidence
└── k1-import.module.ts                  # No changes

libs/common/src/lib/interfaces/
└── k1-import.interface.ts               # Minor: add fontName/position to K1UnmappedItem if needed

prisma/
└── schema.prisma                        # No changes (existing schema sufficient)

Structure Decision: Existing Nx monorepo structure is used. The core change is within apps/api/src/app/k1-import/extractors/ — specifically pdf-parse-extractor.ts and k1-position-regions.ts. No new modules, no new Nx projects.
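
One way the tolerance config in k1-position-regions.ts and the position-match confidence in k1-confidence.service.ts might fit together is sketched below. The plan does not define the confidence levels or the tolerance semantics, so the `MatchConfidence` values, the `Box` shape, and the `contains`, `expand`, and `scorePositionMatch` helpers are all assumptions made for illustration.

```typescript
// Hypothetical confidence levels; the real service may use different ones.
type MatchConfidence = 'high' | 'medium' | 'none';

interface Box {
  xMin: number;
  xMax: number;
  yMin: number;
  yMax: number;
}

// True when the point lies inside the box (edges inclusive).
function contains(box: Box, x: number, y: number): boolean {
  return x >= box.xMin && x <= box.xMax && y >= box.yMin && y <= box.yMax;
}

// Grow the box by the same tolerance on every side.
function expand(box: Box, tolerance: number): Box {
  return {
    xMin: box.xMin - tolerance,
    xMax: box.xMax + tolerance,
    yMin: box.yMin - tolerance,
    yMax: box.yMax + tolerance
  };
}

// Assumed rule: strict hit is 'high', hit only within tolerance is 'medium'.
function scorePositionMatch(
  box: Box,
  x: number,
  y: number,
  tolerance: number
): MatchConfidence {
  if (contains(box, x, y)) {
    return 'high';
  }
  if (contains(expand(box, tolerance), x, y)) {
    return 'medium';
  }
  return 'none';
}

// Made-up region and coordinates.
const box1: Box = { xMin: 300, xMax: 420, yMin: 640, yMax: 660 };
const strictHit = scorePositionMatch(box1, 340, 650, 3);
const nearMiss = scorePositionMatch(box1, 298, 650, 3);
const farMiss = scorePositionMatch(box1, 280, 650, 3);
```

Keeping the tolerance as a parameter rather than baking it into the region definitions would let the bounding boxes stay tight while still absorbing small offsets between PDFs; whether that matches the final design is left to the research phase.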

Complexity Tracking

No violations detected. All changes fit within existing module boundaries.
