# Implementation Plan: Fix K-1 PDF Parser — Position-Based Extraction **Branch**: `005-k1-parser-fix` | **Date**: 2026-03-18 | **Spec**: [spec.md](spec.md) **Input**: Feature specification from `/specs/005-k1-parser-fix/spec.md` **Note**: This template is filled in by the `/speckit.plan` command. See `.specify/templates/plan-template.md` for the execution workflow. ## Summary Rewrite the K-1 PDF extractor from a broken regex-based label matcher to a position-based extraction engine using pdfjs-dist. The core approach: use `page.getTextContent()` to get all text items with (x, y) coordinates and font info, discriminate data values from template text by font, then map each data value to a K-1 form field based on position regions (bounding boxes). Supports Part III boxes 1-21 with subtype codes, Part I/II metadata, sections J/K/L/M/N, and checkboxes. Unmapped values go to a fallback list for manual user assignment. ## Technical Context **Language/Version**: TypeScript 5.x (Node.js runtime) **Primary Dependencies**: NestJS 11.x, pdfjs-dist 5.4.x (already installed via pdf-parse), pdf-parse 2.4.x (kept for `isDigitalK1` detection) **Storage**: PostgreSQL via Prisma ORM (existing K1ImportSession, Document tables) **Testing**: Jest (unit tests for extraction logic, position mapping, value parsing) **Target Platform**: Node.js server (NestJS API), Angular 21 client (existing review UI) **Project Type**: Web service (monorepo: api + common libs) **Performance Goals**: < 5 seconds extraction for a single-page K-1 PDF **Constraints**: Must preserve existing `K1Extractor` interface contract; no new npm dependencies (pdfjs-dist is already transitive) **Scale/Scope**: Single-file parser rewrite + interface expansion in common lib; ~2 files modified, ~1 new file ## Constitution Check _GATE: Must pass before Phase 0 research. Re-check after Phase 1 design._ | Principle | Status | Notes | |-----------|--------|-------| | I. Nx Monorepo Structure | PASS | Changes in `apps/api` (extractor) and `libs/common` (interfaces). No new projects. | | II. NestJS Module Pattern | PASS | PdfParseExtractor is already a `@Injectable()` provider in K1ImportModule. Rewriting internals only. | | III. Prisma Data Layer | PASS | No schema changes. Existing tables sufficient. | | IV. TypeScript Strict Conventions | PASS | Will follow `noUnusedLocals`, `noUnusedParameters`, path aliases. | | V. Simplicity First | PASS | Rewriting one file, expanding one interface. No new architectural layers. | | VI. Interface-First Design | PASS | K1ExtractedField interface expanded first, then implementation follows. | No gate violations. Proceeding to Phase 0. ## Project Structure ### Documentation (this feature) ```text specs/005-k1-parser-fix/ ├── plan.md # This file ├── research.md # Phase 0 output ├── data-model.md # Phase 1 output ├── quickstart.md # Phase 1 output ├── contracts/ # Phase 1 output │ └── extraction.md # Extractor interface contract └── tasks.md # Phase 2 output (created by /speckit.tasks) ``` ### Source Code (repository root) ```text apps/api/src/app/k1-import/ ├── extractors/ │ ├── k1-extractor.interface.ts # Unchanged │ ├── pdf-parse-extractor.ts # REWRITE: position-based extraction │ ├── k1-position-regions.ts # NEW: bounding box definitions for K-1 form fields │ ├── azure-extractor.ts # Unchanged │ └── tesseract-extractor.ts # Unchanged ├── k1-import.module.ts # Unchanged ├── k1-import.service.ts # Minor: handle new subtype field in K1ExtractedField ├── k1-import.controller.ts # Unchanged └── ... libs/common/src/lib/interfaces/ └── k1-import.interface.ts # MODIFY: add subtype, fieldCategory, isCheckbox to K1ExtractedField tests/ └── apps/api/src/app/k1-import/ └── extractors/ └── pdf-parse-extractor.spec.ts # NEW: unit tests ``` **Structure Decision**: Minimalist approach — rewrite one extractor file, add one position-region data file, expand one interface. Follows the existing module structure with no new architectural patterns. ## Complexity Tracking No constitution violations. Table intentionally empty.