16 KiB
Tasks: Fix K-1 PDF Parser — Position-Based Extraction
Input: Design documents from /specs/005-k1-parser-fix/
Prerequisites: plan.md, spec.md, research.md, data-model.md, contracts/extraction.md, quickstart.md
Tests: Not explicitly requested — test tasks omitted.
Organization: Tasks grouped by user story to enable independent implementation and testing.
Format: [ID] [P?] [Story] Description
- [P]: Can run in parallel (different files, no dependencies)
- [Story]: Which user story this task belongs to (US1–US5)
- Exact file paths included in all descriptions
Path Conventions
- Monorepo (Nx):
apps/api/src/,libs/common/src/ - Extractor module:
apps/api/src/app/k1-import/extractors/ - Shared interfaces:
libs/common/src/lib/interfaces/
Phase 1: Setup
Purpose: Expand shared interfaces to support new extraction fields
- T001 Add
subtype: string | null,fieldCategory: string, andisCheckbox: booleanto K1ExtractedField interface, and addx: number,y: number,fontName: stringto K1UnmappedItem interface in libs/common/src/lib/interfaces/k1-import.interface.ts
Phase 2: Foundational (Blocking Prerequisites)
Purpose: Core extraction infrastructure that ALL user stories depend on — pdfjs-dist integration, position regions, font discrimination, value parsing
⚠️ CRITICAL: No user story work can begin until this phase is complete
- T002 [P] Create K1PositionRegion interface and export all 73 bounding box region definitions (Header, Part I, Part II, Sections J/K/L/M/N, Part III left boxes 1-13, Part III right boxes 14-21) with ±15pt tolerance using verified anchor coordinates from research.md in apps/api/src/app/k1-import/extractors/k1-position-regions.ts
- T003 Replace existing regex-based extraction with pdfjs-dist scaffold: dynamic
await import('pdfjs-dist/legacy/build/pdf.mjs'), GlobalWorkerOptions.workerSrc set tofile://path of pdf.worker.mjs, getDocument() with buffer, getPage(1), getTextContent(), and pdfDoc.destroy() cleanup in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts - T004 Implement dynamic font discrimination using textContent.styles: classify each font as template (serif fontFamily) or data (sans-serif/monospace fontFamily), filter text items to only data-font items in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- T005 Implement findRegionForPosition() function that takes (x, y) coordinates and returns the matching K1PositionRegion from the 73-region map using ±15pt bounding box tolerance, or null if no match in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- T006 Implement parseK1Value() utility: strip commas, parenthesized values → negative number, leading minus → negative, "SEE STMT" → numericValue null, "X" → checkbox true, dollar sign strip, preserve decimal percentages without rounding in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
Checkpoint: Foundation ready — pdfjs-dist loads PDFs, data-font items are isolated, positions match to regions, values parse correctly
Phase 3: User Story 1 — Accurate K-1 Value Extraction (Priority: P1) 🎯 MVP
Goal: Extract all Part III box values (boxes 1-21) with correct box numbers, values, signs, and subtype codes
Independent Test: Upload a sample K-1 PDF and verify Part III boxes are correctly extracted — box 1 = 498,211; box 11 ZZ* = (409,615); box 19 A = 4,493,757; box 20 with 4 subtypes; box 21 * = 196
Implementation for User Story 1
- T007 [US1] Implement Part III extraction loop: iterate data-font items, match to Part III regions (left column boxes 1-13, right column boxes 14-21), build K1ExtractedField with boxNumber, rawValue, numericValue, fieldCategory='PART_III' in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- T008 [US1] Implement subtype code pairing: for regions with hasSubtype=true, find code text item and value text item at same y-band (±8pts) using subtypeXMin/XMax ranges from k1-position-regions.ts, set subtype field on K1ExtractedField in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- T009 [US1] Handle multi-subtype boxes (box 20 with A, B, Z, * at ~23pt vertical spacing): produce separate K1ExtractedField entry for each subtype/value pair within the box's y-range in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- T010 [US1] Wire Part III extraction into the main extract() method: call extraction after font filtering and position matching, merge Part III fields into K1ExtractionResult.fields array in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
Checkpoint: Part III boxes 1-21 fully extracted with subtypes — User Story 1 independently testable via upload
Phase 4: User Story 2 — Partnership & Partner Metadata Extraction (Priority: P1)
Goal: Extract Part I (partnership info) and Part II (partner info) metadata — names, EINs, addresses, tax year, filing status
Independent Test: Upload a K-1 PDF and verify partnership name, EIN, partner name, tax year, and final/amended status are correctly populated on K1ExtractionResult.metadata
Implementation for User Story 2
- T011 [US2] Implement header region extraction: match data items to Header regions for tax year (combine "20" + "25"), tax year begin/end dates, Final K-1 flag, Amended K-1 flag in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- T012 [US2] Implement Part I extraction: match data items to Part I regions for partnership EIN (field A), partnership name and address (field B), and IRS Center (field C) in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- T013 [US2] Implement Part II extraction: match data items to Part II regions for partner EIN (field D), partner name (field E), address (field F), and partner type general/limited (field G) and domestic/foreign (field H) in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- T014 [US2] Assemble K1ExtractionResult.metadata object from extracted header, Part I, and Part II fields, setting partnershipName, partnershipEin, partnerName, partnerEin, taxYear, isFinal, isAmended in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
Checkpoint: Metadata fully populated — User Story 2 independently testable via upload
Phase 5: User Story 3 — Part I/II Financial Fields Extraction (Priority: P2)
Goal: Extract Sections J (percentages), K (liabilities), L (capital account), M (contributed property), N (net 704(c) gain/loss)
Independent Test: Upload a K-1 PDF and verify Section J percentages (3.032900 / 0.000000), Section K nonrecourse (498,211), Section L capital values with correct signs, Section N values are extracted
Implementation for User Story 3
- T015 [US3] Implement Section J extraction: match data items to 7 Section J regions for profit/loss/capital beginning and ending percentages, plus decrease-in-sale field, with fieldCategory='SECTION_J' in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- T016 [US3] Implement Section K extraction: match data items to 8 Section K regions for nonrecourse/qualified nonrecourse/recourse beginning and ending liabilities, plus K-2/K-3 checkbox regions, with fieldCategory='SECTION_K' in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- T017 [US3] Implement Section L extraction: match data items to 6 Section L regions for beginning capital, capital contributed, current year net income/loss, other increase/decrease, withdrawals/distributions, ending capital with fieldCategory='SECTION_L' in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- T018 [US3] Implement Section M (contributed property yes/no checkbox) and Section N (beginning and ending net 704(c) gain/loss values) extraction with fieldCategory='SECTION_M' and 'SECTION_N' respectively in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
Checkpoint: All J/K/L/M/N financial fields extracted — User Story 3 independently testable
Phase 6: User Story 4 — Checkbox and Boolean Field Extraction (Priority: P2)
Goal: Detect all checkbox fields (Final K-1, Amended K-1, General/Limited, Domestic/Foreign, K-2/K-3 attached) as boolean values
Independent Test: Upload a K-1 PDF with known checkbox states and verify Final K-1 = true, Limited partner = true, Domestic = true, box 16 K-3 attached = true
Implementation for User Story 4
- T019 [US4] Implement checkbox detection: for all regions with valueType='checkbox', check if an "X" text item exists at the checkbox position, build K1ExtractedField with rawValue="X", numericValue=null, isCheckbox=true, fieldCategory='CHECKBOX' for checked boxes in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- T020 [US4] Ensure unchecked checkboxes are either omitted or included with rawValue="" and isCheckbox=true to distinguish from missing data, and verify checkbox fields set on K1ExtractionResult.metadata (isFinal, isAmended) in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
Checkpoint: All checkbox fields correctly detected as boolean values — User Story 4 independently testable
Phase 7: User Story 5 — Manual Mapping Fallback for Ambiguous Fields (Priority: P3)
Goal: Data-font values that don't match any region appear in unmappedItems with position info for manual assignment
Independent Test: Upload a K-1 PDF where some values fall outside expected regions and verify those values appear in unmappedItems with raw text, x, y, fontName, pageNumber
Implementation for User Story 5
- T021 [US5] After all region matching is complete, collect remaining unmatched data-font items into K1UnmappedItem[] with rawLabel='', rawValue, numericValue (parsed), confidence=0.5, pageNumber=1, x, y, fontName in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- T022 [US5] Verify unmapped items integrate with existing review UI manual assignment flow: ensure assignedBoxNumber and resolution fields on K1UnmappedItem work with the confirmation endpoint in apps/api/src/app/k1-import/k1-import.service.ts
Checkpoint: Zero data loss — all extracted values either mapped to fields or available in unmappedItems for manual assignment
Phase 8: Polish & Cross-Cutting Concerns
Purpose: Error handling, confidence scoring, cleanup, and service integration
- T023 Implement graceful error handling: wrap extraction in try/catch, return empty fields + low confidence + meaningful error for non-K-1 and corrupted PDFs, never crash on unexpected content in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- T024 Implement confidence scoring: HIGH (≥0.90) when value center is within region center ±5pts, MEDIUM (0.70-0.89) within ±10pts, LOW (0.50-0.69) at tolerance boundary ±15pts; compute overallConfidence as weighted average in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- T025 Ensure pdfDoc.destroy() cleanup runs in all code paths (success, error, empty result) using try/finally in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- T026 [P] Update k1-import.service.ts to handle new subtype field when building K1Cell records — concatenate subtype into boxNumber (e.g., "11-ZZ*", "20-A") or store via metadata JSON column in apps/api/src/app/k1-import/k1-import.service.ts
- T027 Run quickstart.md verification checklist: upload test K-1 PDF, verify all 9 checklist items pass (box 11/19/20/21, Section J/L, Final K-1 checkbox, unmapped empty, non-K-1 error)
Dependencies & Execution Order
Phase Dependencies
- Setup (Phase 1): No dependencies — can start immediately
- Foundational (Phase 2): T002 can run in parallel with T001 (different files). T003-T006 depend on T001 (interface types) and execute sequentially in pdf-parse-extractor.ts
- User Stories (Phase 3-7): ALL depend on Foundational phase completion (T001-T006)
- US1 (Phase 3) and US2 (Phase 4): Both P1, execute sequentially (same file)
- US3 (Phase 5) and US4 (Phase 6): Both P2, execute after US1+US2 (same file)
- US5 (Phase 7): P3, executes last of user stories
- Polish (Phase 8): T023-T025 depend on all user stories. T026 is independent (different file, marked [P])
User Story Dependencies
- US1 (P1): Depends only on Foundational. No dependency on other stories.
- US2 (P1): Depends only on Foundational. No dependency on US1 (metadata vs Part III are separate regions).
- US3 (P2): Depends only on Foundational. J/K/L/M/N regions are independent of Part III.
- US4 (P2): Depends only on Foundational. Checkbox detection is position-based, independent of value extraction. Some overlap with US2 (Final/Amended checkboxes set metadata flags).
- US5 (P3): Depends on US1-US4 being done (unmapped = whatever's left after all matching).
Within Each User Story
- Region matching before subtype pairing
- Subtype pairing before multi-subtype handling
- Core extraction before wiring into extract()
- Story complete before moving to next priority
Parallel Opportunities
- T001 + T002: Interface expansion and position regions file — different files, no dependencies
- T026: Service update — different file from extractor, can run in parallel with T023-T025
- US1-US4: While all modify the same extractor file (sequential), each story's extraction logic is a self-contained function that could theoretically be developed in parallel branches
Parallel Example: Foundational Phase
# These two tasks can run simultaneously:
Task T001: "Expand interfaces in k1-import.interface.ts"
Task T002: "Create position regions in k1-position-regions.ts"
# Then sequentially in pdf-parse-extractor.ts:
Task T003: "Scaffold pdfjs-dist infrastructure"
Task T004: "Font discrimination logic"
Task T005: "Position matching engine"
Task T006: "Value parsing utility"
Parallel Example: Polish Phase
# These can run simultaneously (different files):
Task T023-T025: "Error handling, confidence, cleanup in pdf-parse-extractor.ts"
Task T026: "Service subtype handling in k1-import.service.ts"
# Final validation after all above:
Task T027: "Run quickstart.md verification checklist"
Implementation Strategy
MVP First (User Story 1 Only)
- Complete Phase 1: Setup (T001) — interface expansion
- Complete Phase 2: Foundational (T002-T006) — pdfjs-dist + regions + font + parsing
- Complete Phase 3: User Story 1 (T007-T010) — Part III boxes 1-21
- STOP and VALIDATE: Upload test K-1 PDF, verify Part III extraction
- This delivers the core value — accurate box values replace broken regex parser
Incremental Delivery
- Setup + Foundational → Infrastructure ready
- Add US1 (Part III) → Test independently → MVP!
- Add US2 (Metadata) → Test independently → Metadata populated
- Add US3 (J/K/L/M/N) → Test independently → Financial fields complete
- Add US4 (Checkboxes) → Test independently → Boolean fields detected
- Add US5 (Unmapped) → Test independently → Zero data loss guaranteed
- Polish → Error handling, confidence, service integration
Single Developer Flow
All user story tasks modify the same extractor file, so execute sequentially: Phase 1 → Phase 2 → Phase 3 (US1) → Phase 4 (US2) → Phase 5 (US3) → Phase 6 (US4) → Phase 7 (US5) → Phase 8 (Polish)
Notes
- All 73 position regions are defined in T002 upfront — individual story phases use them
- No new npm dependencies required (pdfjs-dist already installed via pdf-parse)
- The extractor rewrite preserves the existing K1Extractor interface contract (extract + isAvailable)
- Keep isDigitalK1() from the existing extractor — it's used by isAvailable()
- Font names are dynamic — never hardcode specific font names like "g_d0_f8"
- Total: 27 tasks across 8 phases covering 5 user stories