You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 
 
 
 

4.3 KiB

Implementation Plan: Fix K-1 PDF Parser — Position-Based Extraction

Branch: 005-k1-parser-fix | Date: 2026-03-18 | Spec: spec.md Input: Feature specification from /specs/005-k1-parser-fix/spec.md

Note: This template is filled in by the /speckit.plan command. See .specify/templates/plan-template.md for the execution workflow.

Summary

Rewrite the K-1 PDF extractor from a broken regex-based label matcher to a position-based extraction engine using pdfjs-dist. The core approach: use page.getTextContent() to get all text items with (x, y) coordinates and font info, discriminate data values from template text by font, then map each data value to a K-1 form field based on position regions (bounding boxes). Supports Part III boxes 1-21 with subtype codes, Part I/II metadata, sections J/K/L/M/N, and checkboxes. Unmapped values go to a fallback list for manual user assignment.

Technical Context

Language/Version: TypeScript 5.x (Node.js runtime) Primary Dependencies: NestJS 11.x, pdfjs-dist 5.4.x (already installed via pdf-parse), pdf-parse 2.4.x (kept for isDigitalK1 detection) Storage: PostgreSQL via Prisma ORM (existing K1ImportSession, Document tables) Testing: Jest (unit tests for extraction logic, position mapping, value parsing) Target Platform: Node.js server (NestJS API), Angular 21 client (existing review UI) Project Type: Web service (monorepo: api + common libs) Performance Goals: < 5 seconds extraction for a single-page K-1 PDF Constraints: Must preserve existing K1Extractor interface contract; no new npm dependencies (pdfjs-dist is already transitive) Scale/Scope: Single-file parser rewrite + interface expansion in common lib; ~2 files modified, ~1 new file

Constitution Check

GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.

Principle Status Notes
I. Nx Monorepo Structure PASS Changes in apps/api (extractor) and libs/common (interfaces). No new projects.
II. NestJS Module Pattern PASS PdfParseExtractor is already a @Injectable() provider in K1ImportModule. Rewriting internals only.
III. Prisma Data Layer PASS No schema changes. Existing tables sufficient.
IV. TypeScript Strict Conventions PASS Will follow noUnusedLocals, noUnusedParameters, path aliases.
V. Simplicity First PASS Rewriting one file, expanding one interface. No new architectural layers.
VI. Interface-First Design PASS K1ExtractedField interface expanded first, then implementation follows.

No gate violations. Proceeding to Phase 0.

Project Structure

Documentation (this feature)

specs/005-k1-parser-fix/
├── plan.md              # This file
├── research.md          # Phase 0 output
├── data-model.md        # Phase 1 output
├── quickstart.md        # Phase 1 output
├── contracts/           # Phase 1 output
│   └── extraction.md    # Extractor interface contract
└── tasks.md             # Phase 2 output (created by /speckit.tasks)

Source Code (repository root)

apps/api/src/app/k1-import/
├── extractors/
│   ├── k1-extractor.interface.ts      # Unchanged
│   ├── pdf-parse-extractor.ts         # REWRITE: position-based extraction
│   ├── k1-position-regions.ts         # NEW: bounding box definitions for K-1 form fields
│   ├── azure-extractor.ts             # Unchanged
│   └── tesseract-extractor.ts         # Unchanged
├── k1-import.module.ts                # Unchanged
├── k1-import.service.ts               # Minor: handle new subtype field in K1ExtractedField
├── k1-import.controller.ts            # Unchanged
└── ...

libs/common/src/lib/interfaces/
└── k1-import.interface.ts             # MODIFY: add subtype, fieldCategory, isCheckbox to K1ExtractedField

tests/
└── apps/api/src/app/k1-import/
    └── extractors/
        └── pdf-parse-extractor.spec.ts  # NEW: unit tests

Structure Decision: Minimalist approach — rewrite one extractor file, add one position-region data file, expand one interface. Follows the existing module structure with no new architectural patterns.

Complexity Tracking

No constitution violations. Table intentionally empty.