feat(k1-import): rewrite K-1 PDF parser with position-based pdfjs-dist extraction

pull/6701/head
Robert Patch 2 months ago
parent
commit
f65386b3b6
  1. .github/agents/copilot-instructions.md (4 changes)
  2. .specify/memory/constitution.md (55 changes)
  3. apps/api/src/app/k1-import/extractors/k1-position-regions.ts (1222 changes)
  4. apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts (1038 changes)
  5. apps/api/src/app/k1-import/k1-import.service.ts (6 changes)
  6. libs/common/src/lib/dtos/k1-import.dto.ts (24 changes)
  7. libs/common/src/lib/interfaces/k1-import.interface.ts (18 changes)
  8. specs/005-k1-parser-fix/checklists/requirements.md (36 changes)
  9. specs/005-k1-parser-fix/contracts/extraction.md (107 changes)
  10. specs/005-k1-parser-fix/data-model.md (94 changes)
  11. specs/005-k1-parser-fix/plan.md (82 changes)
  12. specs/005-k1-parser-fix/quickstart.md (64 changes)
  13. specs/005-k1-parser-fix/research.md (221 changes)
  14. specs/005-k1-parser-fix/spec.md (202 changes)
  15. specs/005-k1-parser-fix/tasks.md (237 changes)

4
.github/agents/copilot-instructions.md

@ -7,6 +7,8 @@ Auto-generated from all feature plans. Last updated: 2026-03-18
- PostgreSQL via Prisma ORM (003-portfolio-performance-views)
- TypeScript 5.9.2, Node.js ≥ 22.18.0 + NestJS 11.x (backend), Angular 21.x (frontend), Prisma 6.x (ORM), pdf-parse (PDF text), @azure/ai-form-recognizer (cloud OCR), tesseract.js (local OCR fallback) (004-k1-scan-import)
- PostgreSQL via Prisma (structured data), local filesystem `uploads/` (PDF files) (004-k1-scan-import)
- TypeScript 5.x (Node.js runtime) + NestJS 11.x, pdfjs-dist 5.4.x (already installed via pdf-parse), pdf-parse 2.4.x (kept for `isDigitalK1` detection) (005-k1-parser-fix)
- PostgreSQL via Prisma ORM (existing K1ImportSession, Document tables) (005-k1-parser-fix)
- TypeScript 5.9.2, Node.js ≥22.18.0 + NestJS 11.1.14 (API), Angular 21.1.1 + Angular Material 21.1.1 (client), Prisma 6.19.0 (ORM), Nx 22.5.3 (monorepo), big.js (decimal math), date-fns 4.1.0, chart.js 4.5.1, Bull 4.16.5 (job queues), Redis (caching), yahoo-finance2 3.13.2 (001-family-office-transform)
@ -27,9 +29,9 @@ npm test; npm run lint
TypeScript 5.9.2, Node.js ≥22.18.0: Follow standard conventions
## Recent Changes
- 005-k1-parser-fix: Added TypeScript 5.x (Node.js runtime) + NestJS 11.x, pdfjs-dist 5.4.x (already installed via pdf-parse), pdf-parse 2.4.x (kept for `isDigitalK1` detection)
- 004-k1-scan-import: Added TypeScript 5.9.2, Node.js ≥ 22.18.0 + NestJS 11.x (backend), Angular 21.x (frontend), Prisma 6.x (ORM), pdf-parse (PDF text), @azure/ai-form-recognizer (cloud OCR), tesseract.js (local OCR fallback)
- 004-k1-scan-import: Added TypeScript 5.9.2, Node.js ≥ 22.18.0 + NestJS 11.x (backend), Angular 21.x (frontend), Prisma 6.x (ORM), pdf-parse (PDF text), @azure/ai-form-recognizer (cloud OCR), tesseract.js (local OCR fallback)
- 003-portfolio-performance-views: Added TypeScript 5.9.2, Node.js >= 22.18.0 + Angular 21.1.1, NestJS 11.1.14, Angular Material 21.1.1, Prisma 6.19.0, big.js, date-fns 4.1.0
<!-- MANUAL ADDITIONS START -->

55
.specify/memory/constitution.md

@ -0,0 +1,55 @@
# Ghostfolio Constitution
## Core Principles
### I. Nx Monorepo Structure
Ghostfolio uses an Nx monorepo with apps (`api`, `client`) and libs (`common`, `ui`). Features must respect project boundaries:
- `@ghostfolio/common` — shared interfaces, types, constants (no framework dependencies)
- `@ghostfolio/ui` — shared Angular UI components
- `@ghostfolio/api` — NestJS backend services, controllers, modules
- `@ghostfolio/client` — Angular frontend pages, services, components
### II. NestJS Module Pattern
Backend features are organized as NestJS modules with:
- Module file registering providers, controllers, imports, exports
- Controller for HTTP endpoints (no business logic)
- Service for business logic
- Interfaces in `@ghostfolio/common` for shared types
### III. Prisma Data Layer
Database access uses Prisma ORM exclusively. Schema changes require migrations. No direct SQL queries. The `PrismaService` is injected via `PrismaModule`.
### IV. TypeScript Strict Conventions
- `noUnusedLocals: true`, `noUnusedParameters: true` — no dead code allowed
- `esModuleInterop: true` — use default imports for CommonJS modules
- Path aliases: `@ghostfolio/api/*`, `@ghostfolio/common/*`, `@ghostfolio/client/*`, `@ghostfolio/ui/*`
### V. Simplicity First
- Start with the simplest solution that works
- YAGNI — don't add abstractions until needed
- Prefer modifying existing files over creating new architectural layers
- Maximum 3 Nx projects per feature (api + common is typical, client when UI needed)
### VI. Interface-First Design
- Shared interfaces live in `@ghostfolio/common`
- API endpoints return typed DTOs
- Feature contracts defined before implementation
## Additional Constraints
- **Angular 21+**: Standalone components, signals preferred
- **NestJS 11+**: Module-based DI, versioned API (URI-based v1)
- **Testing**: Jest for unit/integration tests
- **Docker**: Development via docker-compose (PostgreSQL 5434, Redis 6380)
## Governance
Constitution principles guide all feature development. Complexity beyond these patterns must be justified in the plan's Complexity Tracking table.
**Version**: 1.0.0 | **Ratified**: 2026-03-18 | **Last Amended**: 2026-03-18

1222
apps/api/src/app/k1-import/extractors/k1-position-regions.ts

File diff suppressed because it is too large

1038
apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts

File diff suppressed because it is too large

6
apps/api/src/app/k1-import/k1-import.service.ts

@ -633,7 +633,11 @@ export class K1ImportService {
// Build KDocument data from verified fields
const kDocumentData: Record<string, number | null> = {};
for (const field of verifiedData.fields) {
- kDocumentData[field.boxNumber] = field.numericValue ?? null;
+ // For subtype fields (e.g., box 11 "ZZ*", box 20 "A"), create unique key
+ const key = field.subtype
+   ? `${field.boxNumber}-${field.subtype}`
+   : field.boxNumber;
+ kDocumentData[key] = field.numericValue ?? null;
}
// FR-012: Create or update KDocument
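
The key change in this hunk matters because a box such as 20 can carry several subtype rows; keyed by `boxNumber` alone, later rows would overwrite earlier ones. A minimal standalone sketch of the same keying logic follows. `FieldLike` and `buildKDocumentData` are hypothetical names introduced here for illustration, and the sample values are made up.

```typescript
// Hypothetical, trimmed-down field shape; the real K1ExtractedField has more properties.
interface FieldLike {
  boxNumber: string;
  subtype?: string | null;
  numericValue: number | null;
}

// Mirrors the keying in the hunk above: "<box>-<subtype>" when a subtype exists,
// otherwise the plain box number.
function buildKDocumentData(fields: FieldLike[]): Record<string, number | null> {
  const data: Record<string, number | null> = {};
  for (const field of fields) {
    const key = field.subtype
      ? `${field.boxNumber}-${field.subtype}`
      : field.boxNumber;
    data[key] = field.numericValue ?? null;
  }
  return data;
}

// Illustrative values only.
const sampleData = buildKDocumentData([
  { boxNumber: '1', numericValue: 1000 },
  { boxNumber: '20', subtype: 'A', numericValue: 10 },
  { boxNumber: '20', subtype: 'B', numericValue: 20 }
]);
// sampleData has three keys: "1", "20-A", "20-B"
```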

24
libs/common/src/lib/dtos/k1-import.dto.ts

@ -50,6 +50,18 @@ export class K1ExtractedFieldDto {
@IsBoolean()
isReviewed: boolean;
@IsOptional()
@IsString()
subtype?: string | null;
@IsOptional()
@IsString()
fieldCategory?: string;
@IsOptional()
@IsBoolean()
isCheckbox?: boolean;
}
export class K1UnmappedItemDto {
@ -75,6 +87,18 @@ export class K1UnmappedItemDto {
@IsOptional()
@IsString()
assignedBoxNumber?: string;
@IsOptional()
@IsNumber()
x?: number;
@IsOptional()
@IsNumber()
y?: number;
@IsOptional()
@IsString()
fontName?: string;
}
export class VerifyK1ImportDto {

18
libs/common/src/lib/interfaces/k1-import.interface.ts

@ -53,6 +53,15 @@ export interface K1ExtractedField {
/** Whether user has explicitly reviewed this field (required for medium/low confidence) */
isReviewed: boolean;
/** Subtype code for boxes that support them (e.g., "ZZ*", "A", "B", "*"). Null for simple boxes. */
subtype?: string | null;
/** Field category: PART_III, METADATA, SECTION_J, SECTION_K, SECTION_L, SECTION_M, SECTION_N, CHECKBOX */
fieldCategory?: string;
/** Whether this field is a boolean checkbox value */
isCheckbox?: boolean;
}
export interface K1UnmappedItem {
@ -76,6 +85,15 @@ export interface K1UnmappedItem {
/** If assigned, the box number it was assigned to */
assignedBoxNumber: string | null;
/** X position in PDF points */
x?: number;
/** Y position in PDF points */
y?: number;
/** PDF font identifier for debugging */
fontName?: string;
}
export interface K1ConfirmationRequest {

36
specs/005-k1-parser-fix/checklists/requirements.md

@ -0,0 +1,36 @@
# Specification Quality Checklist: Fix K-1 PDF Parser
**Purpose**: Validate specification completeness and quality before proceeding to planning
**Created**: 2025-07-21
**Feature**: [spec.md](../spec.md)
## Content Quality
- [x] No implementation details (languages, frameworks, APIs)
- [x] Focused on user value and business needs
- [x] Written for non-technical stakeholders
- [x] All mandatory sections completed
## Requirement Completeness
- [x] No [NEEDS CLARIFICATION] markers remain
- [x] Requirements are testable and unambiguous
- [x] Success criteria are measurable
- [x] Success criteria are technology-agnostic (no implementation details)
- [x] All acceptance scenarios are defined
- [x] Edge cases are identified
- [x] Scope is clearly bounded
- [x] Dependencies and assumptions identified
## Feature Readiness
- [x] All functional requirements have clear acceptance criteria
- [x] User scenarios cover primary flows
- [x] Feature meets measurable outcomes defined in Success Criteria
- [x] No implementation details leak into specification
## Notes
- All items pass validation. Specification is ready for `/speckit.clarify` or `/speckit.plan`.
- The spec references "position coordinates" and "font discrimination" in the Background section as domain concepts (how K-1 PDFs work), not as implementation instructions. This is intentional — it describes the problem domain, not the solution approach.
- No [NEEDS CLARIFICATION] markers exist — reasonable defaults were applied for all decisions based on the user's detailed field mapping and explicit guidance.

107
specs/005-k1-parser-fix/contracts/extraction.md

@ -0,0 +1,107 @@
# Contract: K1 Extractor Interface
**Feature**: 005-k1-parser-fix | **Date**: 2026-03-18
## Overview
The K1 extraction system uses a strategy pattern where multiple extractors implement the `K1Extractor` interface. This feature rewrites the `PdfParseExtractor` (Tier 1) internals while preserving the interface contract.
## K1Extractor Interface (unchanged)
```typescript
interface K1Extractor {
  extract(buffer: Buffer, fileName: string): Promise<K1ExtractionResult>;
  isAvailable(): boolean;
}
```
### extract(buffer, fileName)
**Input**:
- `buffer`: Raw PDF file content as a Node.js Buffer
- `fileName`: Original filename of the uploaded PDF (for logging/diagnostics)
**Output**: `K1ExtractionResult` containing:
- `metadata`: Partnership/partner info, tax year, filing status
- `fields`: Array of `K1ExtractedField` (mapped values)
- `unmappedItems`: Array of `K1UnmappedItem` (values that couldn't be mapped)
- `overallConfidence`: 0.0–1.0 aggregate confidence
- `method`: `'pdf-parse'` (this extractor)
- `pagesProcessed`: number (typically 1)
**Error handling**:
- Throws on non-PDF input (invalid buffer)
- Returns empty fields + low confidence for non-K-1 PDFs
- Never crashes on unexpected PDF content
### isAvailable()
Returns `true` always (no external dependencies or API keys needed).
## K1ExtractionResult Shape (expanded)
```typescript
interface K1ExtractionResult {
  metadata: {
    partnershipName: string | null;
    partnershipEin: string | null;
    partnerName: string | null;
    partnerEin: string | null;
    taxYear: number | null;
    isAmended: boolean;
    isFinal: boolean;
  };
  fields: K1ExtractedField[];
  unmappedItems: K1UnmappedItem[];
  overallConfidence: number;
  method: 'pdf-parse' | 'azure' | 'tesseract';
  pagesProcessed: number;
}
```
## K1ExtractedField Shape (expanded)
```typescript
interface K1ExtractedField {
  boxNumber: string; // "1", "6a", "19", "20", "J_PROFIT_BEGIN", etc.
  label: string; // Display label
  customLabel: string | null; // User override
  rawValue: string; // Raw text: "498,211", "(409,811)", "SEE STMT", "X"
  numericValue: number | null; // Parsed: 498211, -409811, null, null
  confidence: number; // 0.0–1.0
  confidenceLevel: 'HIGH' | 'MEDIUM' | 'LOW';
  isUserEdited: boolean; // Default false
  isReviewed: boolean; // Default false
  subtype: string | null; // NEW: "ZZ*", "A", "B", "*", null
  fieldCategory: string; // NEW: "PART_III", "METADATA", "SECTION_J", etc.
  isCheckbox: boolean; // NEW: true for checkbox fields
}
```
## K1UnmappedItem Shape (expanded)
```typescript
interface K1UnmappedItem {
  rawLabel: string;
  rawValue: string;
  numericValue: number | null;
  confidence: number;
  pageNumber: number;
  resolution: 'assigned' | 'discarded' | null;
  assignedBoxNumber: string | null;
  x: number; // NEW: x position in PDF points
  y: number; // NEW: y position in PDF points
  fontName: string; // NEW: PDF font identifier
}
```
## Behavioral Contract
1. **Font discrimination**: The extractor MUST dynamically identify which fonts carry data values vs. template text. It MUST NOT hardcode specific font names.
2. **Position matching**: Each data value MUST be mapped to a K-1 field by checking its (x, y) against defined bounding box regions.
3. **Subtype pairing**: For subtype boxes, code and value items at the same y-position (±8 pts) MUST be paired.
4. **Multi-subtype**: Boxes with multiple subtypes (e.g., box 20) MUST produce separate `K1ExtractedField` entries for each subtype row.
5. **Value parsing**: Parenthesized values MUST become negative. Commas MUST be stripped. "SEE STMT" MUST remain as-is with null numericValue.
6. **Unmapped fallback**: Any data value not matching a region MUST appear in `unmappedItems` — zero data loss.
7. **Cleanup**: The PDF document MUST be destroyed after extraction to free worker resources.
8. **Page scope**: Only page 1 is processed. Multi-page K-1s have supplemental statements on subsequent pages (out of scope).

94
specs/005-k1-parser-fix/data-model.md

@ -0,0 +1,94 @@
# Data Model: Fix K-1 PDF Parser
**Feature**: 005-k1-parser-fix | **Date**: 2026-03-18
## Overview
This feature modifies no database tables. All changes are to in-memory TypeScript interfaces in `@ghostfolio/common`. The extraction result flows through: PDF → extractor → K1ExtractionResult → review UI → confirm → persist to existing KDocument/K1Cell tables.
## Entity Changes
### K1ExtractedField (modified)
Existing interface at `libs/common/src/lib/interfaces/k1-import.interface.ts`. Three new fields added:
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| boxNumber | string | yes | Existing: "1", "6a", "19", "20" |
| label | string | yes | Existing: display label from cell mapping |
| customLabel | string \| null | no | Existing: user override |
| rawValue | string | yes | Existing: raw extracted text ("498,211", "(409,811)", "SEE STMT", "X") |
| numericValue | number \| null | no | Existing: parsed numeric value |
| confidence | number | yes | Existing: 0.0–1.0 |
| confidenceLevel | 'HIGH' \| 'MEDIUM' \| 'LOW' | yes | Existing |
| isUserEdited | boolean | yes | Existing: default false |
| isReviewed | boolean | yes | Existing: default false |
| **subtype** | **string \| null** | **no** | **NEW**: subtype code (e.g., "ZZ*", "A", "B", "*"). Null for simple boxes. |
| **fieldCategory** | **string** | **yes** | **NEW**: "PART_III", "METADATA", "SECTION_J", "SECTION_K", "SECTION_L", "SECTION_M", "SECTION_N", "CHECKBOX" |
| **isCheckbox** | **boolean** | **yes** | **NEW**: true if field is a boolean checkbox value. Default false. |
### K1UnmappedItem (modified)
Existing interface. Three new fields for position debugging:
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| rawLabel | string | yes | Existing |
| rawValue | string | yes | Existing |
| numericValue | number \| null | no | Existing |
| confidence | number | yes | Existing |
| pageNumber | number | yes | Existing |
| resolution | 'assigned' \| 'discarded' \| null | no | Existing |
| assignedBoxNumber | string \| null | no | Existing |
| **x** | **number** | **yes** | **NEW**: x position in PDF points |
| **y** | **number** | **yes** | **NEW**: y position in PDF points |
| **fontName** | **string** | **yes** | **NEW**: PDF font identifier |
### K1ExtractionResult (unchanged)
No changes to the top-level extraction result interface. The `metadata`, `fields`, `unmappedItems`, `overallConfidence`, `method`, and `pagesProcessed` structure remains the same.
### K1PositionRegion (new — internal to extractor)
This is NOT a shared interface — it lives inside the extractor module. It defines a bounding box for a K-1 form field region.
| Field | Type | Description |
|-------|------|-------------|
| fieldId | string | Unique identifier (e.g., "BOX_1", "J_PROFIT_BEGIN", "FINAL_K1") |
| boxNumber | string | K-1 box number for Part III fields; section identifier for others |
| label | string | Display label |
| fieldCategory | string | "PART_III", "METADATA", "SECTION_J", etc. |
| valueType | string | "numeric", "text", "checkbox", "percentage" |
| xMin | number | Left edge in PDF points |
| xMax | number | Right edge in PDF points |
| yMin | number | Bottom edge in PDF points |
| yMax | number | Top edge in PDF points |
| hasSubtype | boolean | Whether this region supports subtype codes |
| subtypeXMin | number \| null | Code column left edge (if hasSubtype) |
| subtypeXMax | number \| null | Code column right edge (if hasSubtype) |
### K1PositionRegion Count: 73 regions
See [research.md](research.md) Decision 3 for the complete region map.
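
As a rough illustration of how a bounding box like this can be used, the sketch below looks up the region containing a given (x, y) point. `Region` is a trimmed-down, hypothetical version of `K1PositionRegion`, `findRegion` is a name invented here, and the bounds are illustrative (±15 pt around the Section J anchor points recorded in research.md), not the real region map.

```typescript
// Hypothetical subset of K1PositionRegion: only the fields needed for a lookup.
interface Region {
  fieldId: string;
  xMin: number;
  xMax: number;
  yMin: number; // bottom edge (PDF points, origin bottom-left)
  yMax: number; // top edge
}

// Returns the first region whose bounding box contains the point, or null.
// A null result corresponds to the "unmapped fallback" path.
function findRegion(x: number, y: number, regions: Region[]): Region | null {
  for (const region of regions) {
    const insideX = x >= region.xMin && x <= region.xMax;
    const insideY = y >= region.yMin && y <= region.yMax;
    if (insideX && insideY) {
      return region;
    }
  }
  return null;
}

// Illustrative bounds only, not the committed region definitions.
const sampleRegions: Region[] = [
  { fieldId: 'J_PROFIT_BEGIN', xMin: 124.1, xMax: 154.1, yMin: 324.1, yMax: 354.1 },
  { fieldId: 'J_PROFIT_END', xMin: 235.1, xMax: 265.1, yMin: 324.1, yMax: 354.1 }
];

const hit = findRegion(139.1, 339.1, sampleRegions); // J_PROFIT_BEGIN
const miss = findRegion(10, 10, sampleRegions); // null, would go to unmappedItems
```

A linear scan is assumed to be adequate here because the plan describes roughly 30 data items checked against 73 regions.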
## Validation Rules
1. `boxNumber` must be a valid K-1 box identifier (1-21, a/b/c sub-boxes, or section identifiers J/K/L/M/N)
2. `numericValue` must be null for "SEE STMT" and checkbox fields
3. `isCheckbox: true` requires `rawValue: "X"` and `numericValue: null`
4. `subtype` is only set for boxes that support subtypes (11, 12, 13, 14, 15, 17, 19, 20, 21)
5. Parenthesized values MUST have negative `numericValue`
6. Percentage values (Section J) MUST preserve decimal precision (no rounding)
7. `confidence` must be 0.0–1.0 with HIGH ≥ 0.90, MEDIUM 0.70–0.89, LOW 0.50–0.69
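
Several of these rules are mechanical enough to check in code. The sketch below covers rules 2, 3, 5, and 7 only; `CheckableField`, `toConfidenceLevel`, and `validateField` are hypothetical names, and treating confidence below 0.50 as LOW is an assumption, since the rules above do not say what happens in that range.

```typescript
type ConfidenceLevel = 'HIGH' | 'MEDIUM' | 'LOW';

// Hypothetical subset of K1ExtractedField used for validation.
interface CheckableField {
  rawValue: string;
  numericValue: number | null;
  isCheckbox?: boolean;
  confidence: number;
}

// Rule 7 thresholds. Assumption: anything below 0.70 (including < 0.50) maps to LOW.
function toConfidenceLevel(confidence: number): ConfidenceLevel {
  if (confidence >= 0.9) return 'HIGH';
  if (confidence >= 0.7) return 'MEDIUM';
  return 'LOW';
}

// Returns a list of human-readable rule violations (empty when the field passes).
function validateField(field: CheckableField): string[] {
  const errors: string[] = [];
  const raw = field.rawValue.trim();

  // Rule 2: "SEE STMT" must not carry a numeric value.
  if (raw === 'SEE STMT' && field.numericValue !== null) {
    errors.push('SEE STMT must have numericValue null');
  }
  // Rule 3: checkbox fields are "X" with no numeric value.
  if (field.isCheckbox && (raw !== 'X' || field.numericValue !== null)) {
    errors.push('checkbox must have rawValue "X" and numericValue null');
  }
  // Rule 5: parenthesized raw values must parse to a negative number.
  if (/^\(.*\)$/.test(raw) && (field.numericValue === null || field.numericValue >= 0)) {
    errors.push('parenthesized value must be negative');
  }
  // Rule 7: confidence must lie in [0, 1].
  if (field.confidence < 0 || field.confidence > 1) {
    errors.push('confidence must be between 0.0 and 1.0');
  }
  return errors;
}
```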
## State Transitions
No state machine changes. The existing K1ImportSession status flow remains:
```
UPLOADING → EXTRACTING → NEEDS_REVIEW → CONFIRMED → COMPLETED
                ↘ EXTRACTION_FAILED
```
## Database Impact
**None.** No Prisma schema changes. The existing `K1Cell` table stores `boxNumber`, `value`, `label` etc. The new `subtype` field on K1ExtractedField can be concatenated into the existing boxNumber field for storage (e.g., "11-ZZ*", "20-A") or stored via the existing `metadata` JSON column.

82
specs/005-k1-parser-fix/plan.md

@ -0,0 +1,82 @@
# Implementation Plan: Fix K-1 PDF Parser — Position-Based Extraction
**Branch**: `005-k1-parser-fix` | **Date**: 2026-03-18 | **Spec**: [spec.md](spec.md)
**Input**: Feature specification from `/specs/005-k1-parser-fix/spec.md`
**Note**: This template is filled in by the `/speckit.plan` command. See `.specify/templates/plan-template.md` for the execution workflow.
## Summary
Rewrite the K-1 PDF extractor from a broken regex-based label matcher to a position-based extraction engine using pdfjs-dist. The core approach: use `page.getTextContent()` to get all text items with (x, y) coordinates and font info, discriminate data values from template text by font, then map each data value to a K-1 form field based on position regions (bounding boxes). Supports Part III boxes 1-21 with subtype codes, Part I/II metadata, sections J/K/L/M/N, and checkboxes. Unmapped values go to a fallback list for manual user assignment.
## Technical Context
**Language/Version**: TypeScript 5.x (Node.js runtime)
**Primary Dependencies**: NestJS 11.x, pdfjs-dist 5.4.x (already installed via pdf-parse), pdf-parse 2.4.x (kept for `isDigitalK1` detection)
**Storage**: PostgreSQL via Prisma ORM (existing K1ImportSession, Document tables)
**Testing**: Jest (unit tests for extraction logic, position mapping, value parsing)
**Target Platform**: Node.js server (NestJS API), Angular 21 client (existing review UI)
**Project Type**: Web service (monorepo: api + common libs)
**Performance Goals**: < 5 seconds extraction for a single-page K-1 PDF
**Constraints**: Must preserve existing `K1Extractor` interface contract; no new npm dependencies (pdfjs-dist is already transitive)
**Scale/Scope**: Single-file parser rewrite + interface expansion in common lib; ~2 files modified, ~1 new file
## Constitution Check
_GATE: Must pass before Phase 0 research. Re-check after Phase 1 design._
| Principle | Status | Notes |
|-----------|--------|-------|
| I. Nx Monorepo Structure | PASS | Changes in `apps/api` (extractor) and `libs/common` (interfaces). No new projects. |
| II. NestJS Module Pattern | PASS | PdfParseExtractor is already a `@Injectable()` provider in K1ImportModule. Rewriting internals only. |
| III. Prisma Data Layer | PASS | No schema changes. Existing tables sufficient. |
| IV. TypeScript Strict Conventions | PASS | Will follow `noUnusedLocals`, `noUnusedParameters`, path aliases. |
| V. Simplicity First | PASS | Rewriting one file, expanding one interface. No new architectural layers. |
| VI. Interface-First Design | PASS | K1ExtractedField interface expanded first, then implementation follows. |
No gate violations. Proceeding to Phase 0.
## Project Structure
### Documentation (this feature)
```text
specs/005-k1-parser-fix/
├── plan.md # This file
├── research.md # Phase 0 output
├── data-model.md # Phase 1 output
├── quickstart.md # Phase 1 output
├── contracts/ # Phase 1 output
│ └── extraction.md # Extractor interface contract
└── tasks.md # Phase 2 output (created by /speckit.tasks)
```
### Source Code (repository root)
```text
apps/api/src/app/k1-import/
├── extractors/
│ ├── k1-extractor.interface.ts # Unchanged
│ ├── pdf-parse-extractor.ts # REWRITE: position-based extraction
│ ├── k1-position-regions.ts # NEW: bounding box definitions for K-1 form fields
│ ├── azure-extractor.ts # Unchanged
│ └── tesseract-extractor.ts # Unchanged
├── k1-import.module.ts # Unchanged
├── k1-import.service.ts # Minor: handle new subtype field in K1ExtractedField
├── k1-import.controller.ts # Unchanged
└── ...
libs/common/src/lib/interfaces/
└── k1-import.interface.ts # MODIFY: add subtype, fieldCategory, isCheckbox to K1ExtractedField
tests/
└── apps/api/src/app/k1-import/
└── extractors/
└── pdf-parse-extractor.spec.ts # NEW: unit tests
```
**Structure Decision**: Minimalist approach — rewrite one extractor file, add one position-region data file, expand one interface. Follows the existing module structure with no new architectural patterns.
## Complexity Tracking
No constitution violations. Table intentionally empty.

64
specs/005-k1-parser-fix/quickstart.md

@ -0,0 +1,64 @@
# Quickstart: Fix K-1 PDF Parser
**Feature**: 005-k1-parser-fix | **Date**: 2026-03-18
## Prerequisites
- Node.js ≥ 22.18.0 with npm
- Docker running (PostgreSQL + Redis via docker-compose)
- Existing `004-k1-scan-import` feature branch merged or available
## Setup
```bash
# 1. Switch to feature branch
git checkout 005-k1-parser-fix
# 2. Install dependencies (should be no-op — no new packages)
npm install
# 3. Start dev infrastructure
docker compose -f docker/docker-compose.dev.yml up -d
# 4. Run database setup
npm run database:setup
# 5. Start API server
npm run start:server
# 6. Start client (separate terminal)
npm run start:client
```
## Files to Modify
| File | Action | Description |
|------|--------|-------------|
| `libs/common/src/lib/interfaces/k1-import.interface.ts` | MODIFY | Add `subtype`, `fieldCategory`, `isCheckbox` to K1ExtractedField; add `x`, `y`, `fontName` to K1UnmappedItem |
| `apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts` | REWRITE | Replace regex-based extraction with pdfjs-dist position-based extraction |
| `apps/api/src/app/k1-import/extractors/k1-position-regions.ts` | CREATE | Define 73 bounding box regions for all K-1 form fields |
## Testing
```bash
# Upload a K-1 PDF via the API
curl -X POST http://localhost:3333/api/v1/k1-import/upload \
-H "Authorization: Bearer <token>" \
-F "file=@path/to/k1.pdf"
# Check extraction results
curl http://localhost:3333/api/v1/k1-import/session/<session-id> \
-H "Authorization: Bearer <token>"
```
## Verification Checklist
- [ ] Box 11 extracted with subtype "ZZ*" and value -409615
- [ ] Box 19 extracted with subtype "A" and value 4493757
- [ ] Box 20 extracted with 4 separate subtype entries (A, B, Z, *)
- [ ] Box 21 extracted with subtype "*" and value 196
- [ ] Section J percentages extracted (3.032900, 0.000000)
- [ ] Section L capital values extracted with correct signs
- [ ] Final K-1 checkbox detected as true
- [ ] Unmapped items list is empty (all values mapped) for the reference PDF
- [ ] Non-K-1 PDF produces empty fields with low confidence, not garbage data

221
specs/005-k1-parser-fix/research.md

@ -0,0 +1,221 @@
# Research: Fix K-1 PDF Parser
**Feature**: 005-k1-parser-fix | **Date**: 2026-03-18
## Research Summary
All technical unknowns resolved. Five decisions are recorded below; the three key ones are:
1. **pdfjs-dist** for position-based text extraction (already installed)
2. **Font discrimination + position region mapping** as the extraction strategy
3. **73 bounding box regions** defined covering all K-1 form fields
---
## Decision 1: PDF Parsing Library
**Decision**: Use `pdfjs-dist` v5.4.296 directly (already installed as transitive dependency of pdf-parse v2.4.5)
**Rationale**:
- Already installed — no new npm dependencies
- `page.getTextContent()` returns `TextItem` objects with precise (x, y) coordinates, font name, width, height
- `@napi-rs/canvas` v0.1.80 (also already installed) provides DOMMatrix polyfill for Node.js via the legacy build
- The legacy build at `pdfjs-dist/legacy/build/pdf.mjs` auto-polyfills `DOMMatrix`, `ImageData`, `Path2D`, and `navigator`
**Alternatives considered**:
- **pdf-parse v2.4.5** (currently used): Wraps pdfjs-dist but does NOT expose position coordinates. Only returns concatenated text strings. Insufficient for position-based extraction.
- **pdf-lib**: Can read AcroForm fields, but K-1 PDFs have zero AcroForm fields (values are text overlays). Not useful.
- **pdf2json**: Older PDF.js fork with positioned text. Redundant — pdfjs-dist v5.4 is already available and more current.
### API Details
**Import** (must use dynamic import — API project compiles to CommonJS via webpack):
```typescript
const { getDocument, GlobalWorkerOptions } = await import('pdfjs-dist/legacy/build/pdf.mjs');
```
**Worker configuration** (required in v5.4.x):
```typescript
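import { resolve } from 'node:path';
import { pathToFileURL } from 'node:url';
// pathToFileURL yields a well-formed file:// URL on both Windows and POSIX
GlobalWorkerOptions.workerSrc = pathToFileURL(resolve('node_modules/pdfjs-dist/legacy/build/pdf.worker.mjs')).href;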
```
**Document loading**:
```typescript
const loadingTask = getDocument({
  data: new Uint8Array(buffer),
  standardFontDataUrl: resolve('node_modules/pdfjs-dist/standard_fonts') + '/',
  cMapUrl: resolve('node_modules/pdfjs-dist/cmaps') + '/',
  cMapPacked: true,
  isEvalSupported: false,
  disableFontFace: true,
});
```
**Text extraction**:
```typescript
const pdfDoc = await loadingTask.promise; // resolve the loading task to a document proxy
const page = await pdfDoc.getPage(1);
const textContent = await page.getTextContent({ includeMarkedContent: false });
// textContent.items: TextItem[] with { str, transform, width, height, fontName, hasEOL, dir }
// textContent.styles: { [fontName]: { fontFamily, ascent, descent, vertical } }
// transform[4] = x, transform[5] = y (PDF points, origin bottom-left)
```
**Cleanup** (required):
```typescript
await pdfDoc.destroy(); // Terminates worker, frees resources
```
### Gotchas
1. Must use `pdfjs-dist/legacy/build/pdf.mjs` — main build crashes with `DOMMatrix is not defined`
2. Must set `GlobalWorkerOptions.workerSrc` to the worker file path — empty string no longer works in v5.4.x
3. `workerSrc` must be a `file://` URL on Windows
4. Use `await import()` not static `import` — CommonJS compat via webpack
5. Y-coordinates are bottom-up: `transform[5]` = 792 is top of page, 0 is bottom
6. `page.view` gives `[0, 0, 612, 792]` — standard US Letter
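
To make the coordinate conventions above concrete, here is a minimal sketch that turns raw text items into positioned items. `TextItemLike`, `PositionedItem`, and `toPositionedItems` are hypothetical names; the item shape is reduced to the `str`, `transform`, and `fontName` properties listed earlier, and the `yFromTop` field is an added convenience, not something the research notes require.

```typescript
// Hypothetical subset of the pdfjs-dist text item shape described above.
interface TextItemLike {
  str: string;
  transform: number[]; // [a, b, c, d, x, y]
  fontName: string;
}

interface PositionedItem {
  text: string;
  x: number; // PDF points from the left edge
  y: number; // PDF points from the bottom edge
  yFromTop: number; // convenience: distance from the top edge
  fontName: string;
}

// pageView is the [x0, y0, x1, y1] box, e.g. [0, 0, 612, 792] for US Letter.
function toPositionedItems(items: TextItemLike[], pageView: number[]): PositionedItem[] {
  const pageTop = pageView[3] ?? 792;
  return items
    .filter((item) => item.str.trim() !== '') // drop whitespace-only items
    .map((item) => {
      const x = item.transform[4] ?? 0;
      const y = item.transform[5] ?? 0;
      return { text: item.str.trim(), x, y, yFromTop: pageTop - y, fontName: item.fontName };
    });
}

// Illustrative input modelled on the FINAL_K1 checkbox anchor point.
const positioned = toPositionedItems(
  [
    { str: 'X', transform: [1, 0, 0, 1, 324.3, 746.2], fontName: 'g_d0_f7' },
    { str: '  ', transform: [1, 0, 0, 1, 10, 10], fontName: 'g_d0_f1' }
  ],
  [0, 0, 612, 792]
);
// positioned holds one item at x = 324.3, y = 746.2 (about 45.8 pt below the top edge)
```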
---
## Decision 2: Extraction Strategy
**Decision**: Hybrid approach — font discrimination (primary) + position-based region mapping (secondary)
**Rationale**:
- Font filtering instantly isolates ~30 data values from 467 total text items on page 1
- Position mapping then determines exactly which K-1 field each value belongs to
- Two-phase filtering is more robust than either approach alone
- Resilient to minor position variations across different K-1 generators
**Alternatives considered**:
- **Regex label matching** (current approach): Fundamentally broken — pdf-parse outputs all template labels first, then all data values separately. Labels and values are never adjacent in the text stream.
- **Sequential positional parsing** (text order): Fragile — depends on exact text ordering which varies between generators. Also can't distinguish data values from template text.
- **Pure position-based** (no font check): Would work but requires matching against all 73 regions for all 467 items. Font filtering first reduces the problem to ~30 items × 73 regions.
### Font Discrimination Details
From the sample K-1 PDF, text items use these fonts:
| fontName | fontFamily | Usage | Count |
|----------|-----------|-------|-------|
| g_d0_f1 | serif | Template labels, headers | ~350 items |
| g_d0_f2 | sans-serif | "20" in tax year | 1 item |
| g_d0_f3 | sans-serif | "25" in tax year (data) | 1 item |
| g_d0_f5 | serif | Footnotes, small text | ~80 items |
| g_d0_f6 | sans-serif | Data values | ~10 items |
| g_d0_f7 | monospace | Checkboxes/codes | ~5 items |
| g_d0_f8 | sans-serif | Data values (primary) | ~20 items |
**Key insight**: Template labels exclusively use `serif` fonts. Data values exclusively use `sans-serif` or `monospace` fonts. Filtering by `fontFamily !== 'serif'` isolates all data values.
**Dynamic detection**: Since font names vary across generators, the algorithm should:
1. Get all unique fonts from `textContent.styles`
2. Identify template fonts: the fonts used by known template text items (items matching "Schedule K-1", "Form 1065", "Ordinary business income", etc.)
3. Non-template fonts = data fonts
4. Filter items to only those using data fonts
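
The four steps above can be sketched roughly as follows. This is a simplified illustration, not the committed implementation: `detectDataFonts` and `filterDataItems` are hypothetical names, the marker strings are just the examples quoted in step 2, and fonts are collected from the items themselves rather than from `textContent.styles`. One limitation to note: a template font that none of the markers happens to use (for example a footnote font) would be misclassified as a data font, so a real marker list would need to be broader.

```typescript
// Hypothetical minimal item shape: text plus the font it was drawn with.
interface FontItem {
  str: string;
  fontName: string;
}

// Assumption: marker strings taken from the examples in step 2.
const TEMPLATE_MARKERS = ['Schedule K-1', 'Form 1065', 'Ordinary business income'];

// Steps 1-3: fonts used by known template text are template fonts; the rest are data fonts.
function detectDataFonts(items: FontItem[]): Set<string> {
  const templateFonts = new Set<string>();
  for (const item of items) {
    if (TEMPLATE_MARKERS.some((marker) => item.str.includes(marker))) {
      templateFonts.add(item.fontName);
    }
  }
  const dataFonts = new Set<string>();
  for (const item of items) {
    if (!templateFonts.has(item.fontName)) {
      dataFonts.add(item.fontName);
    }
  }
  return dataFonts;
}

// Step 4: keep only non-empty items drawn with a data font.
function filterDataItems(items: FontItem[]): FontItem[] {
  const dataFonts = detectDataFonts(items);
  return items.filter((item) => dataFonts.has(item.fontName) && item.str.trim() !== '');
}

// Illustrative input using font names from the table above.
const dataItems = filterDataItems([
  { str: 'Schedule K-1', fontName: 'g_d0_f1' },
  { str: 'Ordinary business income (loss)', fontName: 'g_d0_f1' },
  { str: '498,211', fontName: 'g_d0_f8' },
  { str: 'X', fontName: 'g_d0_f7' }
]);
// dataItems contains "498,211" and "X"
```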
---
## Decision 3: Position Region Map
**Decision**: Define 73 bounding box regions covering all K-1 form fields with ±15 pt tolerance
**Rationale**:
- K-1 form layout is standardized by the IRS — position regions are consistent across generators
- 22 positions verified from actual PDF extraction with exact coordinates
- Remaining ~51 positions interpolated from verified anchors and standard IRS form spacing
- ±15 pt tolerance handles minor variations between generators
### Verified Anchor Points (from actual K-1 PDF)
| Value | x | y | Field |
|-------|-----|-------|-------|
| "X" | 324.3 | 746.2 | FINAL_K1 |
| "X" | 180.3 | 446.6 | G_LIMITED |
| "X" | 58.0 | 422.9 | H1_DOMESTIC |
| "3.032900" | 139.1 | 339.1 | J_PROFIT_BEGIN |
| "0.000000" | 250.1 | 339.1 | J_PROFIT_END |
| "498,211" | 180.8 | 254.5 | K_NONRECOURSE_BEGIN |
| "X" | 294.9 | 205.8 | K2_CHECKBOX |
| "4,903,568" | 257.8 | 157.4 | L_BEG_CAPITAL |
| "(409,811)" | 259.3 | 133.7 | L_CURR_YR_INCOME |
| "4,493,757" | 257.8 | 109.4 | L_WITHDRAWALS |
| "X" | 101.2 | 74.2 | M_NO |
| "(5,373)" | 271.5 | 49.7 | N_BEGINNING |
| "(409,811)" | 92.1 | 2.8 | N_ENDING |
| "ZZ*" | 314.2 | 314.4 | BOX_11_CODE |
| "(409,615)" | 403.9 | 314.4 | BOX_11_VALUE |
| "X" | 563.3 | 603.8 | BOX_16_K3 |
| "A" | 455.2 | 423.2 | BOX_19_CODE |
| "4,493,757" | 530.6 | 422.0 | BOX_19_VALUE |
| "*" | 456.4 | 267.1 | BOX_21_CODE |
| "196" | 555.6 | 266.1 | BOX_21_VALUE |
### Region Layout Summary
| Group | X range | Y range | Fields |
|-------|---------|---------|--------|
| Header | 120–450 | 731–785 | 5: TAX_YEAR, TAX_YEAR_BEGIN/END, FINAL_K1, AMENDED_K1 |
| Part I | 30–290 | 610–735 | 4: A_EIN, B_NAME, B_ADDR, C_IRS_CENTER |
| Part II | 30–306 | 350–610 | 12: D through I2 |
| Section J | 120–305 | 285–354 | 7: profit/loss/capital begin/end + decrease sale |
| Section K | 155–310 | 176–270 | 8: nonrecourse/qual/recourse begin/end + K2/K3 checkboxes |
| Section L | 220–306 | 83–173 | 6: beg/contributed/income/other/withdrawals/end |
| Section M | 50–120 | 59–89 | 2: M_YES, M_NO |
| Section N | 60–306 | 0–65 | 2: N_BEGINNING, N_ENDING |
| Part III Left | 300–455 | 245–698 | 19: boxes 1–13 (including a/b/c sub-boxes) |
| Part III Right | 440–595 | 245–710 | 8: boxes 14–21 |
### Subtype Handling
Boxes 11, 12, 13 (left column) and 14, 15, 17, 19, 20, 21 (right column) can have subtype codes:
- **Left column**: code at x ≈ 300–350, value at x ≈ 370–455
- **Right column**: code at x ≈ 440–475, value at x ≈ 510–595
Pairing algorithm: find code and value items on the same y-line (within ±8 pts).
Box 20 supports multiple subtype rows (A, B, V/Z, *) spaced ~23 pts apart within y range 275–395.
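The pairing step above can be sketched roughly as follows. This is a minimal illustration, not the extractor's actual code: `PositionedText`, `SubtypePair`, `RIGHT_COLUMN`, and `pairSubtypes` are hypothetical names, and the x-ranges and the ±8 pt tolerance are copied from the notes above.

```typescript
// Hypothetical item shape: text plus its (x, y) position in PDF points.
interface PositionedText {
  text: string;
  x: number;
  y: number;
}

interface SubtypePair {
  subtype: string;
  rawValue: string;
  y: number;
}

type XRange = readonly [number, number];

// Right-column x-ranges taken from the Subtype Handling notes.
const RIGHT_COLUMN = { codeX: [440, 475], valueX: [510, 595] } as const;

function inRange(n: number, [min, max]: XRange): boolean {
  return n >= min && n <= max;
}

// Pair each code item with the closest value item on the same y-line.
function pairSubtypes(
  items: PositionedText[],
  column: { codeX: XRange; valueX: XRange },
  yTolerance = 8
): SubtypePair[] {
  const codes = items.filter((i) => inRange(i.x, column.codeX));
  const values = items.filter((i) => inRange(i.x, column.valueX));
  const pairs: SubtypePair[] = [];
  for (const code of codes) {
    let best: PositionedText | undefined;
    for (const v of values) {
      const dy = Math.abs(v.y - code.y);
      if (dy <= yTolerance && (!best || dy < Math.abs(best.y - code.y))) {
        best = v;
      }
    }
    if (best) {
      pairs.push({ subtype: code.text, rawValue: best.text, y: code.y });
    }
  }
  return pairs;
}
```

Choosing the closest value rather than the first match keeps adjacent rows apart when rows sit about 23 pts from each other, as in box 20.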
---
## Decision 4: Numeric Value Parsing
**Decision**: Parse all K-1 values using consistent rules
**Rationale**: IRS K-1 forms use standard US financial formatting, so the parsing rules are unambiguous.
**Rules**:
1. Remove commas: "4,903,568" → "4903568"
2. Parenthesized = negative: "(409,811)" → "-409811" → -409811
3. Leading minus = negative: "-5,373" → -5373
4. Dollar sign: strip "$" if present
5. Decimal percentages: "3.032900" → 3.032900 (preserve precision, do not round)
6. "SEE STMT" / "STMT" → `numericValue: null`, `rawValue: "SEE STMT"`
7. "X" (checkbox) → boolean true, `rawValue: "X"`
8. Empty / whitespace → omit field or `numericValue: 0`
9. "E-FILE" and other text values → `numericValue: null`, preserve as rawValue
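A minimal sketch of these rules, assuming a hypothetical `ParsedK1Value` result shape and a hypothetical helper named `parseK1Value`. For rule 8 this sketch takes the "omit" option and returns `null` for empty input. Note that a JavaScript `number` cannot carry trailing zeros, so "3.032900" keeps its exact digits only in `rawValue`.

```typescript
// Hypothetical result shape for illustration only.
interface ParsedK1Value {
  rawValue: string;
  numericValue: number | null;
  isCheckbox: boolean;
}

function parseK1Value(raw: string): ParsedK1Value | null {
  const rawValue = raw.trim();
  // Rule 8: empty or whitespace-only input is omitted in this sketch.
  if (rawValue === '') return null;
  // Rule 7: a lone "X" is a checked checkbox.
  if (rawValue === 'X') {
    return { rawValue, numericValue: null, isCheckbox: true };
  }
  // Rules 1 and 4: drop thousands separators and dollar signs.
  let text = rawValue.replace(/[$,]/g, '');
  let negative = false;
  const paren = /^\((.*)\)$/.exec(text);
  if (paren) {
    // Rule 2: parentheses mean negative.
    negative = true;
    text = paren[1] ?? '';
  } else if (text.startsWith('-')) {
    // Rule 3: leading minus means negative.
    negative = true;
    text = text.slice(1);
  }
  // Rules 6 and 9: anything that is not a plain decimal stays text-only.
  if (!/^\d+(\.\d+)?$/.test(text)) {
    return { rawValue, numericValue: null, isCheckbox: false };
  }
  // Rule 5: no rounding is applied to decimal percentages.
  const n = Number(text);
  return { rawValue, numericValue: negative && n !== 0 ? -n : n, isCheckbox: false };
}
```

If exact decimal arithmetic on Section J percentages matters downstream, a decimal library could replace `Number` here; that choice is outside what this document specifies.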
---
## Decision 5: Interface Expansion
**Decision**: Add `subtype`, `fieldCategory`, and `isCheckbox` to `K1ExtractedField`; add position info to `K1UnmappedItem`
**Rationale**: The existing interface lacks fields needed for subtype codes (box 11 "ZZ*", box 20 "A"/"B"), field categorization (Part III vs Section J vs metadata), and checkbox discrimination. Adding these fields is backward-compatible (all optional/nullable).
**New fields on K1ExtractedField**:
- `subtype: string | null` — subtype code (e.g., "ZZ*", "A", "B", "*")
- `fieldCategory: 'PART_III' | 'METADATA' | 'SECTION_J' | 'SECTION_K' | 'SECTION_L' | 'SECTION_M' | 'SECTION_N' | 'CHECKBOX'`
- `isCheckbox: boolean` — whether this field is a boolean checkbox value
**New fields on K1UnmappedItem**:
- `x: number` — x position in PDF points
- `y: number` — y position in PDF points
- `fontName: string` — font identifier for debugging
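As a rough illustration, the additions listed above could be written as the following types. The names `K1FieldCategory`, `K1ExtractedFieldAdditions`, and `K1UnmappedItemAdditions` are hypothetical and cover only the new fields; the existing members of `K1ExtractedField` and `K1UnmappedItem` are not reproduced here.

```typescript
// Hypothetical alias for the category union listed above.
type K1FieldCategory =
  | 'PART_III'
  | 'METADATA'
  | 'SECTION_J'
  | 'SECTION_K'
  | 'SECTION_L'
  | 'SECTION_M'
  | 'SECTION_N'
  | 'CHECKBOX';

// Only the new fields; existing K1ExtractedField members are omitted.
interface K1ExtractedFieldAdditions {
  subtype: string | null;
  fieldCategory: K1FieldCategory;
  isCheckbox: boolean;
}

// Only the new fields; existing K1UnmappedItem members are omitted.
interface K1UnmappedItemAdditions {
  x: number; // PDF points
  y: number; // PDF points
  fontName: string;
}

// Example values taken from the verified anchor table (box 11, code "ZZ*").
const box11Additions: K1ExtractedFieldAdditions = {
  subtype: 'ZZ*',
  fieldCategory: 'PART_III',
  isCheckbox: false,
};
```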
---
## Open Items
None. All NEEDS CLARIFICATION items resolved.

202
specs/005-k1-parser-fix/spec.md

@ -0,0 +1,202 @@
# Feature Specification: Fix K-1 PDF Parser — Position-Based Extraction
**Feature Branch**: `005-k1-parser-fix`
**Created**: 2025-07-21
**Status**: Draft
**Input**: User description: "Fix K-1 PDF parser to correctly extract positional values from IRS Schedule K-1 (Form 1065) PDFs. The current regex-based parser matches cell numbers as values instead of actual data. Rewrite using position-based extraction with pdfjs-dist to reliably map form field values by their (x, y) coordinates and font discrimination. Support all Part I/II metadata fields, Part III income/deduction boxes (1-21), subtype codes, checkboxes, percentages, and 'SEE STMT' references. Allow users to manually map any ambiguous or unrecognized fields."
## Background
E-filed IRS Schedule K-1 (Form 1065) PDFs have a specific structure: form template text (labels, headings, instructions) and data values are rendered as separate text overlays on the same page. When extracted as plain text, all template text appears first, followed by all data values in a flat positional list — without any labels attached to the values. The current regex-based parser attempts to match labels to adjacent values, which fundamentally fails because labels and values are in completely different sections of the extracted text.
### PDF Structure Discovery
Analysis of a real e-filed K-1 PDF reveals:
- **467 total text items** on page 1
- **Zero AcroForm fields** — values are positioned text overlays, not fillable form fields
- **Font discrimination**: All data values use a distinct font (e.g., `g_d0_f8`) that differs from template text fonts
- **Position coordinates**: Each text item has precise (x, y) coordinates via the PDF transformation matrix
- **K-1 form layout**: Three distinct regions — Part I/II (left column, partnership/partner info), Part III left column (boxes 1-13), Part III right column (boxes 14-21)
- **Subtype codes**: Some boxes (11, 19, 20, 21) have letter/symbol codes as separate text items in the same y-band as their values
- **Checkboxes**: Represented as "X" text items at checkbox positions
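To make the coordinate point concrete: in pdfjs-dist, each text item carries a six-element `transform` array whose last two entries are the x and y translation. The sketch below uses a structural stand-in (`TextItemLike`) rather than importing the library, and `toPositionedItems` is a hypothetical helper name.

```typescript
// Structural subset of a pdfjs-dist text item; defined locally so the
// sketch runs without the library installed.
interface TextItemLike {
  str: string;
  transform: number[]; // [a, b, c, d, e, f]; e and f are the x/y translation
  fontName: string;
}

interface PositionedItem {
  text: string;
  x: number;
  y: number;
  fontName: string;
}

// Drop whitespace-only items and expose (x, y) and font for each remaining one.
function toPositionedItems(items: TextItemLike[]): PositionedItem[] {
  return items
    .filter((item) => item.str.trim() !== '')
    .map((item) => ({
      text: item.str.trim(),
      x: item.transform[4],
      y: item.transform[5],
      fontName: item.fontName,
    }));
}
```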
## User Scenarios & Testing _(mandatory)_
### User Story 1 — Accurate K-1 Value Extraction (Priority: P1)
As an investor uploading an e-filed K-1 PDF, I want the system to correctly extract all Part III box values (boxes 1-21) so that my income, deductions, and credits are accurately captured without manual correction.
**Why this priority**: This is the core value proposition. If Part III box values are wrong, the entire K-1 import feature is unusable. Every K-1 has Part III data, and getting it right eliminates the most painful manual data entry.
**Independent Test**: Upload a sample e-filed K-1 PDF and verify that all Part III boxes with values are correctly extracted with the right box number, value, and sign (parenthesized values as negative).
**Acceptance Scenarios**:
1. **Given** an e-filed K-1 PDF with box 1 value "498,211", **When** the PDF is uploaded and parsed, **Then** box 1 is extracted with `rawValue: "498,211"` and `numericValue: 498211`
2. **Given** an e-filed K-1 PDF with box 11 having subtype "ZZ*" and value "(409,615)", **When** parsed, **Then** box 11 is extracted with `boxNumber: "11"`, `subtype: "ZZ*"`, `rawValue: "(409,615)"`, and `numericValue: -409615`
3. **Given** an e-filed K-1 PDF with box 19 subtype "A" and value "4,493,757", **When** parsed, **Then** box 19 is extracted with `boxNumber: "19"`, `subtype: "A"`, `rawValue: "4,493,757"`, and `numericValue: 4493757`
4. **Given** an e-filed K-1 PDF with box 20 subtypes A, B, Z, and * all showing "SEE STMT", **When** parsed, **Then** four separate fields are extracted for box 20, each with the correct subtype code and `rawValue: "SEE STMT"`, `numericValue: null`
5. **Given** an e-filed K-1 PDF with box 21 having subtype "*" and value "196", **When** parsed, **Then** box 21 is extracted with `boxNumber: "21"`, `subtype: "*"`, `rawValue: "196"`, and `numericValue: 196`
6. **Given** a value in parentheses like "(409,811)", **When** parsed, **Then** the numericValue is `-409811` (negative)
7. **Given** an empty box (no value present), **When** parsed, **Then** the box is either omitted from results or included with `numericValue: 0`
---
### User Story 2 — Partnership & Partner Metadata Extraction (Priority: P1)
As an investor, I want the system to extract Part I (partnership info) and Part II (partner info) metadata — including names, EINs, addresses, tax year, and filing status — so I can match K-1 documents to the correct partnership and tax period.
**Why this priority**: Metadata is essential for identifying which partnership and partner the K-1 belongs to, and for the tax year assignment. Without this, K-1 data cannot be properly filed.
**Independent Test**: Upload a K-1 PDF and verify that the partnership name and EIN, the partner name and EIN, the tax year, and the final/amended status are correctly extracted.
**Acceptance Scenarios**:
1. **Given** an e-filed K-1 that is marked "Final K-1", **When** parsed, **Then** `metadata.isFinal` is `true`
2. **Given** an e-filed K-1 for tax year 2025, **When** parsed, **Then** `metadata.taxYear` is `2025`
3. **Given** a K-1 with IRS Center field showing "E-FILE", **When** parsed, **Then** the IRS center metadata field is captured as "E-FILE"
---
### User Story 3 — Part I/II Financial Fields Extraction (Priority: P2)
As an investor, I want the system to extract Part I/II financial fields — Section J (profit/loss/capital percentages), Section K (liabilities), Section L (capital account analysis), Section M (contributed property), and Section N (partner share of net income) — so that my partnership interest details are fully captured.
**Why this priority**: These fields provide the partnership interest context (ownership percentages, capital account, liabilities) needed for tax reporting. They are secondary to Part III income boxes but still required for a complete K-1 record.
**Independent Test**: Upload a K-1 PDF and verify J/K/L/M/N sections are extracted with correct begin/end values and signs.
**Acceptance Scenarios**:
1. **Given** a K-1 with Section J showing profit beginning "3.032900" and ending "0.000000", **When** parsed, **Then** fields are extracted: J_PROFIT_BEGIN = 3.032900, J_PROFIT_END = 0.000000
2. **Given** a K-1 with Section J loss and capital rows identical to profit, **When** parsed, **Then** J_LOSS_BEGIN, J_LOSS_END, J_CAPITAL_BEGIN, J_CAPITAL_END are all correctly extracted
3. **Given** a K-1 with Section K showing nonrecourse beginning "498,211", **When** parsed, **Then** K_NONRECOURSE_BEGIN = 498211
4. **Given** a K-1 with Section L showing beginning capital "4,903,568", withdrawals "4,493,757", and current year net income/loss "(409,811)", **When** parsed, **Then** L_BEG_CAPITAL = 4903568, L_WITHDRAWALS = 4493757, L_CURR_YR_INCOME = -409811
5. **Given** a K-1 with Section M checkbox "No" marked, **When** parsed, **Then** M_CONTRIBUTED_PROPERTY = false (or "NO")
6. **Given** a K-1 with Section N showing beginning "(5,373)" and ending "(409,811)", **When** parsed, **Then** N_BEGINNING = -5373, N_ENDING = -409811
---
### User Story 4 — Checkbox and Boolean Field Extraction (Priority: P2)
As an investor, I want checkbox fields (Final K-1, Amended K-1, General/Limited partner, Domestic/Foreign partner, K-2/K-3 attached indicators) to be correctly identified as boolean values so they accurately reflect my filing status.
**Why this priority**: Checkboxes determine filing status and partner classification. Misidentifying them can lead to incorrect tax treatment. They are simpler to extract (just "X" at a position) but critical to get right.
**Independent Test**: Upload a K-1 PDF with known checkbox states and verify all checkboxes are correctly identified as checked or unchecked.
**Acceptance Scenarios**:
1. **Given** a K-1 with "Final K-1" checked and "Amended K-1" unchecked, **When** parsed, **Then** `isFinal: true`, `isAmended: false`
2. **Given** a K-1 with "Limited partner" checked, **When** parsed, **Then** the partner type field reflects "Limited"
3. **Given** a K-1 with "Domestic partner" checked, **When** parsed, **Then** the partner domestic/foreign field reflects "Domestic"
4. **Given** a K-1 with box 16 "K-3 attached" checked, **When** parsed, **Then** box 16 reflects `true`
---
### User Story 5 — Manual Mapping Fallback for Ambiguous Fields (Priority: P3)
As an investor, when the parser cannot confidently map a value to a specific K-1 field (due to unexpected positioning, font, or layout variation), I want to see those values listed as "unmapped" so I can manually assign them to the correct fields through the review interface.
**Why this priority**: No parser is perfect. Different K-1 generators may produce slightly different layouts. Providing a manual mapping fallback ensures data is never lost and users always have control, even when automatic extraction is imperfect.
**Independent Test**: Upload a K-1 PDF where some values fall outside expected position regions, and verify those values appear in the unmapped items list for manual assignment.
**Acceptance Scenarios**:
1. **Given** a K-1 PDF where a value appears at an unexpected position, **When** parsed, **Then** that value appears in the `unmappedItems` list with its raw text, position, and page number
2. **Given** an unmapped item in the review interface, **When** the user assigns it to box "4", **Then** it moves to the extracted fields list as box 4 with the assigned value
3. **Given** an unmapped item, **When** the user marks it as "discarded", **Then** it is excluded from the final import
---
### Edge Cases
- **Multi-page K-1**: Some K-1s span multiple pages. The parser should handle page 1 (the standard K-1 form) and recognize that subsequent pages are supplemental statements, not additional K-1 data to parse.
- **All-empty K-1**: A K-1 with zero data values (all boxes empty) should produce an extraction result with no fields and no errors.
- **Negative values**: Parenthesized values like "(40,029)" must be parsed as negative numbers (-40029). Plain minus signs (e.g., "-5,373") should also be handled.
- **"SEE STMT" references**: Some boxes contain "SEE STMT" (See Statement) instead of a numeric value. These should be captured as-is with `numericValue: null`.
- **Tab-separated subtype/value pairs**: Values like "ZZ*\t(409,615)" or "A\t4,493,757" contain a subtype code tab-separated from the value. Both parts must be captured.
- **Multiple subtypes per box**: Box 20 can have multiple rows (A, B, Z, *), each with its own value. All must be extracted as separate fields.
- **Non-standard fonts**: Different K-1 generators may use different font names. The parser should identify data fonts dynamically rather than hardcoding a specific font name.
- **Corrupted or non-K-1 PDFs**: If a PDF has no recognizable K-1 structure (no matching template text), extraction should fail gracefully with a meaningful error message, not crash.
- **Percentage values**: Section J values are decimal percentages (e.g., "3.032900"). These should be preserved as-is without rounding.
## Requirements _(mandatory)_
### Functional Requirements
#### Core Extraction
- **FR-001**: System MUST extract all Part III box values (boxes 1 through 21) from e-filed K-1 PDFs using position-based text extraction rather than regex label matching
- **FR-002**: System MUST extract each text item's position coordinates (x, y) and font information from the PDF to determine which form field a value belongs to
- **FR-003**: System MUST discriminate between template text (labels/headings) and data values using font characteristics as the primary differentiator, with position and content pattern as secondary signals
- **FR-004**: System MUST define position regions (bounding boxes) for each K-1 form field and map extracted data values to the correct field based on which region their coordinates fall within
- **FR-005**: System MUST parse parenthesized values as negative numbers: "(409,811)" → -409811
- **FR-006**: System MUST handle comma-separated thousands in numeric values: "4,903,568" → 4903568
- **FR-007**: System MUST preserve "SEE STMT" values as raw text with a null numeric value and not attempt numeric parsing
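A minimal sketch of the region lookup described in FR-004. `RegionSketch`, `GROUP_REGIONS`, and `findRegionForPoint` are hypothetical names; the three bounding boxes are the group-level ranges from the Region Layout Summary in research.md, not the per-field regions the extractor would actually need.

```typescript
interface RegionSketch {
  fieldId: string;
  xMin: number;
  xMax: number;
  yMin: number;
  yMax: number;
}

// Group-level ranges copied from the Region Layout Summary (illustrative only).
const GROUP_REGIONS: RegionSketch[] = [
  { fieldId: 'SECTION_K', xMin: 155, xMax: 310, yMin: 176, yMax: 270 },
  { fieldId: 'SECTION_L', xMin: 220, xMax: 306, yMin: 83, yMax: 173 },
  { fieldId: 'PART_III_RIGHT', xMin: 440, xMax: 595, yMin: 245, yMax: 710 },
];

// Return the region containing (x, y). If several regions match (possible
// once a tolerance is applied), prefer the one whose center is nearest.
function findRegionForPoint(
  x: number,
  y: number,
  regions: RegionSketch[],
  tolerance = 0
): RegionSketch | null {
  let best: RegionSketch | null = null;
  let bestDistance = Infinity;
  for (const r of regions) {
    const inside =
      x >= r.xMin - tolerance &&
      x <= r.xMax + tolerance &&
      y >= r.yMin - tolerance &&
      y <= r.yMax + tolerance;
    if (!inside) continue;
    const distance = Math.hypot(x - (r.xMin + r.xMax) / 2, y - (r.yMin + r.yMax) / 2);
    if (distance < bestDistance) {
      best = r;
      bestDistance = distance;
    }
  }
  return best;
}
```

A value for which this returns `null` would be a candidate for the unmapped items list.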
#### Subtype and Multi-Value Fields
- **FR-008**: System MUST extract subtype codes for boxes that support them (boxes 11, 12, 13, 14, 15, 17, 19, 20, 21) where a letter or symbol code appears as a separate text item in the same vertical band as the value
- **FR-009**: System MUST support multiple subtype rows per box (e.g., box 20 with subtypes A, B, Z, and *)
- **FR-010**: System MUST capture tab-separated subtype/value pairs where the code and value appear on the same text line
#### Metadata and Part I/II
- **FR-011**: System MUST extract Part I/II metadata including: partnership name, partnership EIN, partner name, partner EIN, tax year, IRS center, and filing status (final/amended)
- **FR-012**: System MUST extract Section J percentage fields (profit, loss, capital — beginning and ending)
- **FR-013**: System MUST extract Section K liability fields (nonrecourse, qualified nonrecourse, recourse — beginning and ending as available)
- **FR-014**: System MUST extract Section L capital account fields (beginning capital, capital contributed, current year net income/loss, other increase/decrease, withdrawals/distributions, ending capital)
- **FR-015**: System MUST extract Section M (contributed property indicator) and Section N (partner share of net unrecognized 704(c) gain/loss — beginning and ending)
#### Checkbox Fields
- **FR-016**: System MUST identify checkbox fields marked with "X" at known checkbox positions (Final K-1, Amended K-1, General/Limited partner, Domestic/Foreign partner, K-2/K-3 attached)
- **FR-017**: System MUST represent checkbox values as boolean (true = "X" present at the checkbox position, false = absent)
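One way FR-016 and FR-017 could look in code, as a sketch under stated assumptions: `CheckboxAnchor`, `CHECKBOX_ANCHORS`, and `detectCheckboxes` are hypothetical names, the coordinates are three entries from the verified anchor table in research.md, and the ±15 pt default mirrors the tolerance mentioned in the task list.

```typescript
interface PositionedText {
  text: string;
  x: number;
  y: number;
}

interface CheckboxAnchor {
  fieldId: string;
  x: number;
  y: number;
}

// Coordinates from the verified anchor table in research.md.
const CHECKBOX_ANCHORS: CheckboxAnchor[] = [
  { fieldId: 'FINAL_K1', x: 324.3, y: 746.2 },
  { fieldId: 'G_LIMITED', x: 180.3, y: 446.6 },
  { fieldId: 'H1_DOMESTIC', x: 58.0, y: 422.9 },
];

// true when an "X" item sits near the anchor, false otherwise.
function detectCheckboxes(
  items: PositionedText[],
  anchors: CheckboxAnchor[],
  tolerance = 15
): Map<string, boolean> {
  const result = new Map<string, boolean>();
  for (const anchor of anchors) {
    const checked = items.some(
      (item) =>
        item.text.trim() === 'X' &&
        Math.abs(item.x - anchor.x) <= tolerance &&
        Math.abs(item.y - anchor.y) <= tolerance
    );
    result.set(anchor.fieldId, checked);
  }
  return result;
}
```

Every anchor gets an explicit `true` or `false`, which keeps an unchecked box distinguishable from a field that was never looked for.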
#### Confidence and Unmapped Items
- **FR-018**: System MUST assign a confidence level (HIGH, MEDIUM, LOW) to each extracted field based on how precisely the value's position matches the expected region
- **FR-019**: System MUST place any extracted value that does not fall within a defined position region into the "unmapped items" list, capturing the raw text, position, and page number
- **FR-020**: System MUST allow users to manually assign unmapped items to specific box numbers through the existing review interface
- **FR-021**: System MUST allow users to discard unmapped items they determine are irrelevant
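For FR-018, the task list (T024) gives offset thresholds of ±5, ±10, and ±15 pts for HIGH, MEDIUM, and LOW. The sketch below is one possible reading of that rule, taking the larger of the two axis offsets from the expected position; `confidenceForOffset` is a hypothetical name, and how the offset is measured is an assumption.

```typescript
type ConfidenceLevel = 'HIGH' | 'MEDIUM' | 'LOW';

// dx and dy are the offsets (in PDF points) between the value's position
// and the expected position for the field. Returns null when the value is
// outside the tolerance, so it can be treated as unmapped.
function confidenceForOffset(dx: number, dy: number): ConfidenceLevel | null {
  const offset = Math.max(Math.abs(dx), Math.abs(dy));
  if (offset <= 5) return 'HIGH';
  if (offset <= 10) return 'MEDIUM';
  if (offset <= 15) return 'LOW';
  return null;
}
```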
#### Robustness
- **FR-022**: System MUST handle K-1 PDFs from different e-filing generators that may use different font names by dynamically identifying which font is used for data values
- **FR-023**: System MUST gracefully handle PDFs that are not K-1 forms or have unrecognizable layouts, returning a meaningful error rather than crashing
- **FR-024**: System MUST process only page 1 of the K-1 PDF for standard form data extraction (supplemental statement pages are out of scope for this feature)
- **FR-025**: System MUST preserve the existing extraction interface contract so that upstream services (K1 import service, review UI) continue to work without changes
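For FR-022, the task list describes classifying fonts by the `fontFamily` reported in the text content's `styles` map (serif for template text, sans-serif or monospace for data). A minimal sketch of that idea follows; `TextStyleLike`, `findDataFonts`, and `keepDataItems` are hypothetical names, and whether this family split holds for every K-1 generator is an open assumption.

```typescript
// Structural subset of the per-font style entry exposed alongside text items.
interface TextStyleLike {
  fontFamily: string;
}

interface FontTaggedItem {
  text: string;
  fontName: string;
}

// Collect font names whose family suggests data values rather than template text.
function findDataFonts(styles: Record<string, TextStyleLike>): Set<string> {
  const dataFonts = new Set<string>();
  for (const [fontName, style] of Object.entries(styles)) {
    const family = style.fontFamily.toLowerCase();
    // Check "sans-serif" explicitly: it contains the substring "serif".
    if (family.includes('sans-serif') || family.includes('monospace')) {
      dataFonts.add(fontName);
    }
  }
  return dataFonts;
}

// Keep only the items rendered in one of the data fonts.
function keepDataItems<T extends FontTaggedItem>(items: T[], dataFonts: Set<string>): T[] {
  return items.filter((item) => dataFonts.has(item.fontName));
}
```

Because the font names are discovered per document, nothing like `g_d0_f8` needs to be hardcoded.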
### Key Entities
- **K1ExtractedField**: A single parsed value from the K-1 form. Key attributes: box number, optional subtype code, raw text value, parsed numeric value, confidence level, field category (Part III box, Part I/II metadata, Section J/K/L/M/N), and whether it's a checkbox.
- **K1PositionRegion**: A defined bounding area on the K-1 form page corresponding to a specific field. Attributes: field identifier, x-min, x-max, y-min, y-max, expected value type (numeric, text, checkbox, percentage).
- **K1UnmappedItem**: A data value extracted from the PDF that couldn't be mapped to any defined position region. Attributes: raw text, x/y position, page number, user resolution (assigned/discarded/pending).
- **K1ExtractionResult**: The complete output of parsing a K-1 PDF. Contains metadata (partnership, partner, tax year, filing status), mapped fields array, unmapped items array, overall confidence, and extraction method identifier.
## Success Criteria _(mandatory)_
### Measurable Outcomes
- **SC-001**: For a standard e-filed K-1 PDF, all Part III boxes with values are extracted with the correct box number and value in a single upload — no manual corrections needed for the reference test PDF
- **SC-002**: Numeric values including negative (parenthesized) amounts are parsed correctly with 100% accuracy for well-formed values
- **SC-003**: All subtype codes (e.g., box 11 "ZZ*", box 19 "A", box 20 "A"/"B"/"Z"/"*") are correctly paired with their values
- **SC-004**: Part I/II metadata (tax year, filing status, partner type) is extracted correctly
- **SC-005**: Section J percentages, Section K liabilities, Section L capital account, and Section N values are extracted with correct signs and decimal precision
- **SC-006**: Users can review and correct any extraction result through the existing review interface within 2 minutes
- **SC-007**: Values that cannot be automatically mapped appear in the unmapped items list, ensuring zero data loss during extraction
- **SC-008**: Non-K-1 PDFs produce a clear error message rather than incorrect/garbage data
- **SC-009**: Extraction completes within 5 seconds for a single-page K-1 PDF
## Assumptions
- All K-1 PDFs follow the standard IRS Schedule K-1 (Form 1065) layout for 2025 and adjacent tax years. Custom or non-standard K-1 formats are not in scope.
- E-filed K-1 PDFs render values as positioned text overlays (not AcroForm fields). The system does not need to support fillable PDF form field extraction.
- The existing review/confirmation UI and data flow (upload → extract → review → confirm) remains unchanged. Only the extraction logic is being rewritten.
- Font names vary across K-1 generators; the parser will dynamically identify the data font rather than hardcoding a specific font name.
- "SEE STMT" references indicate supplemental statement pages exist but parsing those supplemental pages is out of scope for this feature.
- PDF page coordinates use the standard PDF coordinate system (origin at bottom-left, y increases upward).
- The position region map is calibrated for the standard IRS K-1 form layout; minor position adjustments may be needed over time as different generators are encountered.

237
specs/005-k1-parser-fix/tasks.md

@ -0,0 +1,237 @@
# Tasks: Fix K-1 PDF Parser — Position-Based Extraction
**Input**: Design documents from `/specs/005-k1-parser-fix/`
**Prerequisites**: plan.md, spec.md, research.md, data-model.md, contracts/extraction.md, quickstart.md
**Tests**: Not explicitly requested — test tasks omitted.
**Organization**: Tasks grouped by user story to enable independent implementation and testing.
## Format: `[ID] [P?] [Story] Description`
- **[P]**: Can run in parallel (different files, no dependencies)
- **[Story]**: Which user story this task belongs to (US1–US5)
- Exact file paths included in all descriptions
## Path Conventions
- **Monorepo (Nx)**: `apps/api/src/`, `libs/common/src/`
- **Extractor module**: `apps/api/src/app/k1-import/extractors/`
- **Shared interfaces**: `libs/common/src/lib/interfaces/`
---
## Phase 1: Setup
**Purpose**: Expand shared interfaces to support new extraction fields
- [x] T001 Add `subtype: string | null`, `fieldCategory: string`, and `isCheckbox: boolean` to K1ExtractedField interface, and add `x: number`, `y: number`, `fontName: string` to K1UnmappedItem interface in libs/common/src/lib/interfaces/k1-import.interface.ts
---
## Phase 2: Foundational (Blocking Prerequisites)
**Purpose**: Core extraction infrastructure that ALL user stories depend on — pdfjs-dist integration, position regions, font discrimination, value parsing
**⚠️ CRITICAL**: No user story work can begin until this phase is complete
- [x] T002 [P] Create K1PositionRegion interface and export all 73 bounding box region definitions (Header, Part I, Part II, Sections J/K/L/M/N, Part III left boxes 1-13, Part III right boxes 14-21) with ±15pt tolerance using verified anchor coordinates from research.md in apps/api/src/app/k1-import/extractors/k1-position-regions.ts
- [x] T003 Replace existing regex-based extraction with pdfjs-dist scaffold: dynamic `await import('pdfjs-dist/legacy/build/pdf.mjs')`, GlobalWorkerOptions.workerSrc set to `file://` path of pdf.worker.mjs, getDocument() with buffer, getPage(1), getTextContent(), and pdfDoc.destroy() cleanup in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T004 Implement dynamic font discrimination using textContent.styles: classify each font as template (serif fontFamily) or data (sans-serif/monospace fontFamily), filter text items to only data-font items in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T005 Implement findRegionForPosition() function that takes (x, y) coordinates and returns the matching K1PositionRegion from the 73-region map using ±15pt bounding box tolerance, or null if no match in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T006 Implement parseK1Value() utility: strip commas, parenthesized values → negative number, leading minus → negative, "SEE STMT" → numericValue null, "X" → checkbox true, dollar sign strip, preserve decimal percentages without rounding in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
**Checkpoint**: Foundation ready — pdfjs-dist loads PDFs, data-font items are isolated, positions match to regions, values parse correctly
---
## Phase 3: User Story 1 — Accurate K-1 Value Extraction (Priority: P1) 🎯 MVP
**Goal**: Extract all Part III box values (boxes 1-21) with correct box numbers, values, signs, and subtype codes
**Independent Test**: Upload a sample K-1 PDF and verify Part III boxes are correctly extracted — box 1 = 498,211; box 11 ZZ* = (409,615); box 19 A = 4,493,757; box 20 with 4 subtypes; box 21 * = 196
### Implementation for User Story 1
- [x] T007 [US1] Implement Part III extraction loop: iterate data-font items, match to Part III regions (left column boxes 1-13, right column boxes 14-21), build K1ExtractedField with boxNumber, rawValue, numericValue, fieldCategory='PART_III' in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T008 [US1] Implement subtype code pairing: for regions with hasSubtype=true, find code text item and value text item at same y-band (±8pts) using subtypeXMin/XMax ranges from k1-position-regions.ts, set subtype field on K1ExtractedField in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T009 [US1] Handle multi-subtype boxes (box 20 with A, B, Z, * at ~23pt vertical spacing): produce separate K1ExtractedField entry for each subtype/value pair within the box's y-range in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T010 [US1] Wire Part III extraction into the main extract() method: call extraction after font filtering and position matching, merge Part III fields into K1ExtractionResult.fields array in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
**Checkpoint**: Part III boxes 1-21 fully extracted with subtypes — User Story 1 independently testable via upload
---
## Phase 4: User Story 2 — Partnership & Partner Metadata Extraction (Priority: P1)
**Goal**: Extract Part I (partnership info) and Part II (partner info) metadata — names, EINs, addresses, tax year, filing status
**Independent Test**: Upload a K-1 PDF and verify partnership name, EIN, partner name, tax year, and final/amended status are correctly populated on K1ExtractionResult.metadata
### Implementation for User Story 2
- [x] T011 [US2] Implement header region extraction: match data items to Header regions for tax year (combine "20" + "25"), tax year begin/end dates, Final K-1 flag, Amended K-1 flag in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T012 [US2] Implement Part I extraction: match data items to Part I regions for partnership EIN (field A), partnership name and address (field B), and IRS Center (field C) in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T013 [US2] Implement Part II extraction: match data items to Part II regions for partner EIN (field D), partner name (field E), address (field F), and partner type general/limited (field G) and domestic/foreign (field H) in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T014 [US2] Assemble K1ExtractionResult.metadata object from extracted header, Part I, and Part II fields, setting partnershipName, partnershipEin, partnerName, partnerEin, taxYear, isFinal, isAmended in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
**Checkpoint**: Metadata fully populated — User Story 2 independently testable via upload
---
## Phase 5: User Story 3 — Part I/II Financial Fields Extraction (Priority: P2)
**Goal**: Extract Sections J (percentages), K (liabilities), L (capital account), M (contributed property), N (net 704(c) gain/loss)
**Independent Test**: Upload a K-1 PDF and verify Section J percentages (3.032900 / 0.000000), Section K nonrecourse (498,211), Section L capital values with correct signs, Section N values are extracted
### Implementation for User Story 3
- [x] T015 [US3] Implement Section J extraction: match data items to 7 Section J regions for profit/loss/capital beginning and ending percentages, plus decrease-in-sale field, with fieldCategory='SECTION_J' in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T016 [US3] Implement Section K extraction: match data items to 8 Section K regions for nonrecourse/qualified nonrecourse/recourse beginning and ending liabilities, plus K-2/K-3 checkbox regions, with fieldCategory='SECTION_K' in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T017 [US3] Implement Section L extraction: match data items to 6 Section L regions for beginning capital, capital contributed, current year net income/loss, other increase/decrease, withdrawals/distributions, ending capital with fieldCategory='SECTION_L' in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T018 [US3] Implement Section M (contributed property yes/no checkbox) and Section N (beginning and ending net 704(c) gain/loss values) extraction with fieldCategory='SECTION_M' and 'SECTION_N' respectively in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
**Checkpoint**: All J/K/L/M/N financial fields extracted — User Story 3 independently testable
---
## Phase 6: User Story 4 — Checkbox and Boolean Field Extraction (Priority: P2)
**Goal**: Detect all checkbox fields (Final K-1, Amended K-1, General/Limited, Domestic/Foreign, K-2/K-3 attached) as boolean values
**Independent Test**: Upload a K-1 PDF with known checkbox states and verify Final K-1 = true, Limited partner = true, Domestic = true, box 16 K-3 attached = true
### Implementation for User Story 4
- [x] T019 [US4] Implement checkbox detection: for all regions with valueType='checkbox', check if an "X" text item exists at the checkbox position, build K1ExtractedField with rawValue="X", numericValue=null, isCheckbox=true, fieldCategory='CHECKBOX' for checked boxes in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T020 [US4] Ensure unchecked checkboxes are either omitted or included with rawValue="" and isCheckbox=true to distinguish from missing data, and verify checkbox fields set on K1ExtractionResult.metadata (isFinal, isAmended) in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
**Checkpoint**: All checkbox fields correctly detected as boolean values — User Story 4 independently testable
---
## Phase 7: User Story 5 — Manual Mapping Fallback for Ambiguous Fields (Priority: P3)
**Goal**: Data-font values that don't match any region appear in unmappedItems with position info for manual assignment
**Independent Test**: Upload a K-1 PDF where some values fall outside expected regions and verify those values appear in unmappedItems with raw text, x, y, fontName, pageNumber
### Implementation for User Story 5
- [x] T021 [US5] After all region matching is complete, collect remaining unmatched data-font items into K1UnmappedItem[] with rawLabel='', rawValue, numericValue (parsed), confidence=0.5, pageNumber=1, x, y, fontName in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T022 [US5] Verify unmapped items integrate with existing review UI manual assignment flow: ensure assignedBoxNumber and resolution fields on K1UnmappedItem work with the confirmation endpoint in apps/api/src/app/k1-import/k1-import.service.ts
**Checkpoint**: Zero data loss — all extracted values either mapped to fields or available in unmappedItems for manual assignment
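The collection step in T021 might look roughly like this. `DataItem`, `UnmappedSketch`, and `collectUnmapped` are hypothetical names; the fixed values (`rawLabel: ''`, `confidence: 0.5`, `pageNumber: 1`) come from the task description, and numeric parsing of the raw value is left out of the sketch.

```typescript
interface DataItem {
  text: string;
  x: number;
  y: number;
  fontName: string;
}

// Subset of the unmapped-item fields named in T021.
interface UnmappedSketch {
  rawLabel: string;
  rawValue: string;
  confidence: number;
  pageNumber: number;
  x: number;
  y: number;
  fontName: string;
}

// Everything that no region claimed becomes an unmapped item.
function collectUnmapped(
  items: DataItem[],
  matched: ReadonlySet<DataItem>
): UnmappedSketch[] {
  return items
    .filter((item) => !matched.has(item))
    .map((item) => ({
      rawLabel: '',
      rawValue: item.text,
      confidence: 0.5,
      pageNumber: 1,
      x: item.x,
      y: item.y,
      fontName: item.fontName,
    }));
}
```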
---
## Phase 8: Polish & Cross-Cutting Concerns
**Purpose**: Error handling, confidence scoring, cleanup, and service integration
- [x] T023 Implement graceful error handling: wrap extraction in try/catch, return empty fields + low confidence + meaningful error for non-K-1 and corrupted PDFs, never crash on unexpected content in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T024 Implement confidence scoring: HIGH (≥0.90) when value center is within region center ±5pts, MEDIUM (0.70-0.89) within ±10pts, LOW (0.50-0.69) at tolerance boundary ±15pts; compute overallConfidence as weighted average in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T025 Ensure pdfDoc.destroy() cleanup runs in all code paths (success, error, empty result) using try/finally in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T026 [P] Update k1-import.service.ts to handle new subtype field when building K1Cell records — concatenate subtype into boxNumber (e.g., "11-ZZ*", "20-A") or store via metadata JSON column in apps/api/src/app/k1-import/k1-import.service.ts
- [x] T027 Run quickstart.md verification checklist: upload test K-1 PDF, verify all 9 checklist items pass (box 11/19/20/21, Section J/L, Final K-1 checkbox, unmapped empty, non-K-1 error)
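T026 mentions concatenating the subtype into the box number (for example "11-ZZ*" or "20-A"). A small sketch of that option, using hypothetical helper names `toCellBoxNumber` and `splitCellBoxNumber`; it assumes box numbers themselves never contain a hyphen.

```typescript
// Build the combined key, e.g. ("11", "ZZ*") becomes "11-ZZ*".
function toCellBoxNumber(boxNumber: string, subtype: string | null): string {
  return subtype ? `${boxNumber}-${subtype}` : boxNumber;
}

// Split a combined key back into its parts at the first hyphen.
function splitCellBoxNumber(key: string): { boxNumber: string; subtype: string | null } {
  const index = key.indexOf('-');
  if (index === -1) return { boxNumber: key, subtype: null };
  return { boxNumber: key.slice(0, index), subtype: key.slice(index + 1) };
}
```

Storing the subtype in a separate metadata column, the other option T026 names, avoids having to parse the key at all.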
---
## Dependencies & Execution Order
### Phase Dependencies
- **Setup (Phase 1)**: No dependencies — can start immediately
- **Foundational (Phase 2)**: T002 can run in parallel with T001 (different files). T003-T006 depend on T001 (interface types) and execute sequentially in pdf-parse-extractor.ts
- **User Stories (Phase 3-7)**: ALL depend on Setup and Foundational completion (T001-T006)
  - US1 (Phase 3) and US2 (Phase 4): Both P1, execute sequentially (same file)
  - US3 (Phase 5) and US4 (Phase 6): Both P2, execute after US1+US2 (same file)
  - US5 (Phase 7): P3, executes last of user stories
- **Polish (Phase 8)**: T023-T025 depend on all user stories. T026 is independent (different file, marked [P])
### User Story Dependencies
- **US1 (P1)**: Depends only on Foundational. No dependency on other stories.
- **US2 (P1)**: Depends only on Foundational. No dependency on US1 (metadata vs Part III are separate regions).
- **US3 (P2)**: Depends only on Foundational. J/K/L/M/N regions are independent of Part III.
- **US4 (P2)**: Depends only on Foundational. Checkbox detection is position-based, independent of value extraction. Some overlap with US2 (Final/Amended checkboxes set metadata flags).
- **US5 (P3)**: Depends on US1-US4 being done (unmapped = whatever's left after all matching).
### Within Each User Story
- Region matching before subtype pairing
- Subtype pairing before multi-subtype handling
- Core extraction before wiring into extract()
- Story complete before moving to next priority
### Parallel Opportunities
- **T001 + T002**: Interface expansion and position regions file — different files, no dependencies
- **T026**: Service update — different file from extractor, can run in parallel with T023-T025
- **US1-US4**: While all modify the same extractor file (sequential), each story's extraction logic is a self-contained function that could theoretically be developed in parallel branches
---
## Parallel Example: Foundational Phase
```
# These two tasks can run simultaneously:
Task T001: "Expand interfaces in k1-import.interface.ts"
Task T002: "Create position regions in k1-position-regions.ts"
# Then sequentially in pdf-parse-extractor.ts:
Task T003: "Scaffold pdfjs-dist infrastructure"
Task T004: "Font discrimination logic"
Task T005: "Position matching engine"
Task T006: "Value parsing utility"
```
## Parallel Example: Polish Phase
```
# These can run simultaneously (different files):
Task T023-T025: "Error handling, confidence, cleanup in pdf-parse-extractor.ts"
Task T026: "Service subtype handling in k1-import.service.ts"
# Final validation after all above:
Task T027: "Run quickstart.md verification checklist"
```
---
## Implementation Strategy
### MVP First (User Story 1 Only)
1. Complete Phase 1: Setup (T001) — interface expansion
2. Complete Phase 2: Foundational (T002-T006) — pdfjs-dist + regions + font + parsing
3. Complete Phase 3: User Story 1 (T007-T010) — Part III boxes 1-21
4. **STOP and VALIDATE**: Upload test K-1 PDF, verify Part III extraction
5. This delivers the core value — accurate box values replace broken regex parser
### Incremental Delivery
1. Setup + Foundational → Infrastructure ready
2. Add US1 (Part III) → Test independently → **MVP!**
3. Add US2 (Metadata) → Test independently → Metadata populated
4. Add US3 (J/K/L/M/N) → Test independently → Financial fields complete
5. Add US4 (Checkboxes) → Test independently → Boolean fields detected
6. Add US5 (Unmapped) → Test independently → Zero data loss guaranteed
7. Polish → Error handling, confidence, service integration
### Single Developer Flow
All user story tasks modify the same extractor file, so execute sequentially:
Phase 1 → Phase 2 → Phase 3 (US1) → Phase 4 (US2) → Phase 5 (US3) → Phase 6 (US4) → Phase 7 (US5) → Phase 8 (Polish)
---
## Notes
- All 73 position regions are defined in T002 upfront — individual story phases use them
- No new npm dependencies required (pdfjs-dist already installed via pdf-parse)
- The extractor rewrite preserves the existing K1Extractor interface contract (extract + isAvailable)
- Keep isDigitalK1() from the existing extractor — it's used by isAvailable()
- Font names are dynamic — never hardcode specific font names like "g_d0_f8"
- Total: 27 tasks across 8 phases covering 5 user stories
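To illustrate the note about dynamic font names, here is one possible heuristic for picking data fonts at runtime instead of hardcoding them. This is an assumption for illustration and not necessarily how the extractor's font discrimination works: fonts whose items are mostly numeric-looking are treated as data fonts. `TextItemLike` mirrors the `str` and `fontName` fields of pdfjs-dist text items; `detectDataFonts` and the 0.5 threshold are hypothetical.

```typescript
// Subset of a pdfjs-dist text item that this sketch needs.
interface TextItemLike {
  str: string;
  fontName: string;
}

// Numeric-looking values such as "12,345", "(678)", "-42.50", "$1,000".
const NUMERIC = /^\(?-?\$?[\d,]+(\.\d+)?\)?$/;

// Returns the font names whose share of numeric-looking items meets the threshold.
function detectDataFonts(
  items: TextItemLike[],
  minNumericShare = 0.5
): Set<string> {
  const stats = new Map<string, { total: number; numeric: number }>();
  for (const item of items) {
    const text = item.str.trim();
    if (text === '') continue;
    const s = stats.get(item.fontName) || { total: 0, numeric: 0 };
    s.total += 1;
    if (NUMERIC.test(text)) s.numeric += 1;
    stats.set(item.fontName, s);
  }
  const dataFonts = new Set<string>();
  stats.forEach((s, font) => {
    if (s.numeric / s.total >= minNumericShare) dataFonts.add(font);
  });
  return dataFonts;
}
```

A share-based rule like this tolerates a few text values (for example "STMT") in the data font, but it could misclassify a form font that is used mostly for box numbers, so a real implementation would likely need additional signals.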