# Tasks: Fix K-1 PDF Parser — Position-Based Extraction

**Input**: Design documents from `/specs/005-k1-parser-fix/`
**Prerequisites**: plan.md, spec.md, research.md, data-model.md, contracts/extraction.md, quickstart.md

**Tests**: Not explicitly requested — test tasks omitted.

**Organization**: Tasks grouped by user story to enable independent implementation and testing.

## Format: `[ID] [P?] [Story] Description`

- **[P]**: Can run in parallel (different files, no dependencies)
- **[Story]**: Which user story this task belongs to (US1–US5)
- Exact file paths included in all descriptions

## Path Conventions

- **Monorepo (Nx)**: `apps/api/src/`, `libs/common/src/`
- **Extractor module**: `apps/api/src/app/k1-import/extractors/`
- **Shared interfaces**: `libs/common/src/lib/interfaces/`

---

## Phase 1: Setup

**Purpose**: Expand shared interfaces to support new extraction fields

- [x] T001 Add `subtype: string | null`, `fieldCategory: string`, and `isCheckbox: boolean` to K1ExtractedField interface, and add `x: number`, `y: number`, `fontName: string` to K1UnmappedItem interface in libs/common/src/lib/interfaces/k1-import.interface.ts
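
The additions in T001 can be sketched as follows. This is a minimal illustration that shows only the new members; the `K1ExtractedFieldAdditions` and `K1UnmappedItemAdditions` names are hypothetical stand-ins, since the existing members of `K1ExtractedField` and `K1UnmappedItem` are not listed in this document.

```typescript
// Hypothetical names: only the members added by T001 are shown.
// In the real file they would be added to K1ExtractedField and K1UnmappedItem.
interface K1ExtractedFieldAdditions {
  subtype: string | null; // e.g. "ZZ*" for box 11, null when the box has no code
  fieldCategory: string; // e.g. 'PART_III', 'SECTION_J', 'CHECKBOX'
  isCheckbox: boolean;
}

interface K1UnmappedItemAdditions {
  x: number; // horizontal position on the page, in PDF points
  y: number; // vertical position on the page, in PDF points
  fontName: string; // dynamic font id reported by the PDF library
}

// Example value for a Part III field that carries a subtype code.
const exampleFieldAdditions: K1ExtractedFieldAdditions = {
  subtype: 'ZZ*',
  fieldCategory: 'PART_III',
  isCheckbox: false,
};
```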

---

## Phase 2: Foundational (Blocking Prerequisites)

**Purpose**: Core extraction infrastructure that ALL user stories depend on — pdfjs-dist integration, position regions, font discrimination, value parsing

**⚠️ CRITICAL**: No user story work can begin until this phase is complete

- [x] T002 [P] Create K1PositionRegion interface and export all 73 bounding box region definitions (Header, Part I, Part II, Sections J/K/L/M/N, Part III left boxes 1-13, Part III right boxes 14-21) with ±15pt tolerance using verified anchor coordinates from research.md in apps/api/src/app/k1-import/extractors/k1-position-regions.ts
- [x] T003 Replace existing regex-based extraction with pdfjs-dist scaffold: dynamic `await import('pdfjs-dist/legacy/build/pdf.mjs')`, GlobalWorkerOptions.workerSrc set to `file://` path of pdf.worker.mjs, getDocument() with buffer, getPage(1), getTextContent(), and pdfDoc.destroy() cleanup in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T004 Implement dynamic font discrimination using textContent.styles: classify each font as template (serif fontFamily) or data (sans-serif/monospace fontFamily), filter text items to only data-font items in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T005 Implement findRegionForPosition() function that takes (x, y) coordinates and returns the matching K1PositionRegion from the 73-region map using ±15pt bounding box tolerance, or null if no match in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T006 Implement parseK1Value() utility: strip commas, parenthesized values → negative number, leading minus → negative, "SEE STMT" → numericValue null, "X" → checkbox true, dollar sign strip, preserve decimal percentages without rounding in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
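
The parsing rules in T006 could look roughly like the sketch below. The `ParsedK1Value` return shape is an assumption made for illustration, as is stripping a trailing `%`; the task only names the rules, not the signature.

```typescript
// Assumed return shape; the real function may populate K1ExtractedField directly.
interface ParsedK1Value {
  rawValue: string;
  numericValue: number | null;
  isCheckbox: boolean;
}

function parseK1Value(text: string): ParsedK1Value {
  const rawValue = text.trim();
  const upper = rawValue.toUpperCase();

  // A lone "X" marks a checked checkbox, not a number.
  if (upper === 'X') return { rawValue, numericValue: null, isCheckbox: true };
  // "SEE STMT" refers to an attached statement, so there is no numeric value.
  if (upper === 'SEE STMT') return { rawValue, numericValue: null, isCheckbox: false };

  // Accounting notation: (409,615) means -409615.
  const parenthesized = /^\(.*\)$/.test(rawValue);
  // Strip parentheses, dollar signs, commas, percent signs (assumption) and spaces.
  const cleaned = rawValue.replace(/[()$,%\s]/g, '');
  if (!/^-?\d+(\.\d+)?$/.test(cleaned)) {
    return { rawValue, numericValue: null, isCheckbox: false };
  }

  // Number() keeps every decimal digit it is given; no rounding is applied.
  const value = Number(cleaned);
  return {
    rawValue,
    numericValue: parenthesized ? -Math.abs(value) : value,
    isCheckbox: false,
  };
}
```

Keeping `rawValue` untouched alongside `numericValue` means a percentage such as `3.032900` retains its trailing zeros for display even though the number itself is `3.0329`.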

**Checkpoint**: Foundation ready — pdfjs-dist loads PDFs, data-font items are isolated, positions match to regions, values parse correctly
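
A rough sketch of how T004 and T005 fit together, under stated assumptions. `K1TextItem` and `K1FontStyle` mirror only the subset of the pdfjs-dist text content shape used here (`str`, `fontName`, `transform`, and `styles[fontName].fontFamily`). `K1RegionSketch` is a hypothetical anchor-point shape; the real `K1PositionRegion` in k1-position-regions.ts may use explicit bounds instead. Choosing the nearest anchor when several regions match is also an assumption.

```typescript
// Subset of a pdfjs-dist text item: transform is [a, b, c, d, x, y].
interface K1TextItem {
  str: string;
  fontName: string;
  transform: number[];
}
interface K1FontStyle {
  fontFamily: string;
}
// Hypothetical region shape used only in this sketch.
interface K1RegionSketch {
  key: string;
  anchorX: number;
  anchorY: number;
}

const REGION_TOLERANCE_PT = 15;

// Keep only items drawn in a data font (sans-serif or monospace family).
function filterDataFontItems(
  items: K1TextItem[],
  styles: Record<string, K1FontStyle>,
): K1TextItem[] {
  return items.filter((item) => {
    const family = styles[item.fontName]?.fontFamily ?? '';
    // "sans-serif" contains "serif", so test for the data families explicitly.
    return family.includes('sans-serif') || family.includes('monospace');
  });
}

// Return the closest region whose anchor is within tolerance on both axes.
function findRegionForPosition(
  x: number,
  y: number,
  regions: K1RegionSketch[],
): K1RegionSketch | null {
  let best: K1RegionSketch | null = null;
  let bestDistance = Infinity;
  for (const region of regions) {
    const dx = Math.abs(x - region.anchorX);
    const dy = Math.abs(y - region.anchorY);
    if (dx > REGION_TOLERANCE_PT || dy > REGION_TOLERANCE_PT) continue;
    const distance = Math.hypot(dx, dy);
    if (distance < bestDistance) {
      best = region;
      bestDistance = distance;
    }
  }
  return best;
}
```

Font ids such as `f1` are looked up in `styles` rather than compared by name, which is in line with the note that font names are dynamic.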

---

## Phase 3: User Story 1 — Accurate K-1 Value Extraction (Priority: P1) 🎯 MVP

**Goal**: Extract all Part III box values (boxes 1-21) with correct box numbers, values, signs, and subtype codes

**Independent Test**: Upload a sample K-1 PDF and verify Part III boxes are correctly extracted — box 1 = 498,211; box 11 ZZ* = (409,615); box 19 A = 4,493,757; box 20 with 4 subtypes; box 21 * = 196

### Implementation for User Story 1

- [x] T007 [US1] Implement Part III extraction loop: iterate data-font items, match to Part III regions (left column boxes 1-13, right column boxes 14-21), build K1ExtractedField with boxNumber, rawValue, numericValue, fieldCategory='PART_III' in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T008 [US1] Implement subtype code pairing: for regions with hasSubtype=true, find code text item and value text item at same y-band (±8pts) using subtypeXMin/XMax ranges from k1-position-regions.ts, set subtype field on K1ExtractedField in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T009 [US1] Handle multi-subtype boxes (box 20 with A, B, Z, * at ~23pt vertical spacing): produce separate K1ExtractedField entry for each subtype/value pair within the box's y-range in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T010 [US1] Wire Part III extraction into the main extract() method: call extraction after font filtering and position matching, merge Part III fields into K1ExtractionResult.fields array in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts

**Checkpoint**: Part III boxes 1-21 fully extracted with subtypes — User Story 1 independently testable via upload
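
The subtype pairing in T008 and T009 might be sketched as below. It assumes items have already been reduced to `{ str, x, y }` and limited to one box's region, and that values sit to the right of the code column; both are illustrative assumptions, and `K1PositionedText` and `K1SubtypePair` are hypothetical names.

```typescript
interface K1PositionedText {
  str: string;
  x: number;
  y: number;
}
interface K1SubtypePair {
  subtype: string;
  rawValue: string;
}

const Y_BAND_PT = 8;

// Pair each code item with the value item closest to it in the same y-band.
function pairSubtypeCodes(
  items: K1PositionedText[],
  subtypeXMin: number,
  subtypeXMax: number,
): K1SubtypePair[] {
  const codes = items
    .filter((item) => item.x >= subtypeXMin && item.x <= subtypeXMax)
    // PDF y grows upward, so descending y reads top to bottom.
    .sort((a, b) => b.y - a.y);
  const values = items.filter((item) => item.x > subtypeXMax);

  const pairs: K1SubtypePair[] = [];
  for (const code of codes) {
    let best: K1PositionedText | null = null;
    for (const value of values) {
      const dy = Math.abs(value.y - code.y);
      if (dy > Y_BAND_PT) continue;
      if (best === null || dy < Math.abs(best.y - code.y)) best = value;
    }
    if (best !== null) {
      pairs.push({ subtype: code.str.trim(), rawValue: best.str.trim() });
    }
  }
  return pairs;
}
```

With rows about 23pt apart, as described for box 20, the ±8pt band keeps each code from pairing with a value on a neighboring row.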

---

## Phase 4: User Story 2 — Partnership & Partner Metadata Extraction (Priority: P1)

**Goal**: Extract Part I (partnership info) and Part II (partner info) metadata — names, EINs, addresses, tax year, filing status

**Independent Test**: Upload a K-1 PDF and verify partnership name, EIN, partner name, tax year, and final/amended status are correctly populated on K1ExtractionResult.metadata

### Implementation for User Story 2

- [x] T011 [US2] Implement header region extraction: match data items to Header regions for tax year (combine "20" + "25"), tax year begin/end dates, Final K-1 flag, Amended K-1 flag in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T012 [US2] Implement Part I extraction: match data items to Part I regions for partnership EIN (field A), partnership name and address (field B), and IRS Center (field C) in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T013 [US2] Implement Part II extraction: match data items to Part II regions for partner EIN (field D), partner name (field E), address (field F), and partner type general/limited (field G) and domestic/foreign (field H) in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T014 [US2] Assemble K1ExtractionResult.metadata object from extracted header, Part I, and Part II fields, setting partnershipName, partnershipEin, partnerName, partnerEin, taxYear, isFinal, isAmended in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts

**Checkpoint**: Metadata fully populated — User Story 2 independently testable via upload
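
The tax year assembly mentioned in T011 (combining "20" + "25") can be illustrated with a small helper. This is only a sketch: `combineTaxYear` and `K1YearFragment` are hypothetical names, and it assumes the fragments have already been matched to the tax year region.

```typescript
interface K1YearFragment {
  str: string;
  x: number;
}

// Join fragments left to right and accept only a four-digit result.
function combineTaxYear(fragments: K1YearFragment[]): number | null {
  const joined = [...fragments]
    .sort((a, b) => a.x - b.x)
    .map((fragment) => fragment.str.trim())
    .join('');
  return /^\d{4}$/.test(joined) ? Number(joined) : null;
}
```

Returning `null` for anything other than four digits leaves the decision about a missing or partial year to the caller instead of guessing a century.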

---

## Phase 5: User Story 3 — Part I/II Financial Fields Extraction (Priority: P2)

**Goal**: Extract Sections J (percentages), K (liabilities), L (capital account), M (contributed property), N (net 704(c) gain/loss)

**Independent Test**: Upload a K-1 PDF and verify extraction of Section J percentages = 3.032900 / 0.000000, Section K nonrecourse = 498,211, Section L capital values with correct signs, and Section N values

### Implementation for User Story 3

- [x] T015 [US3] Implement Section J extraction: match data items to 7 Section J regions for profit/loss/capital beginning and ending percentages, plus decrease-in-sale field, with fieldCategory='SECTION_J' in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T016 [US3] Implement Section K extraction: match data items to 8 Section K regions for nonrecourse/qualified nonrecourse/recourse beginning and ending liabilities, plus K-2/K-3 checkbox regions, with fieldCategory='SECTION_K' in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T017 [US3] Implement Section L extraction: match data items to 6 Section L regions for beginning capital, capital contributed, current year net income/loss, other increase/decrease, withdrawals/distributions, ending capital with fieldCategory='SECTION_L' in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T018 [US3] Implement Section M (contributed property yes/no checkbox) and Section N (beginning and ending net 704(c) gain/loss values) extraction with fieldCategory='SECTION_M' and 'SECTION_N' respectively in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts

**Checkpoint**: All J/K/L/M/N financial fields extracted — User Story 3 independently testable

---

## Phase 6: User Story 4 — Checkbox and Boolean Field Extraction (Priority: P2)

**Goal**: Detect all checkbox fields (Final K-1, Amended K-1, General/Limited, Domestic/Foreign, K-2/K-3 attached) as boolean values

**Independent Test**: Upload a K-1 PDF with known checkbox states and verify Final K-1 = true, Limited partner = true, Domestic = true, box 16 K-3 attached = true

### Implementation for User Story 4

- [x] T019 [US4] Implement checkbox detection: for all regions with valueType='checkbox', check if an "X" text item exists at the checkbox position, build K1ExtractedField with rawValue="X", numericValue=null, isCheckbox=true, fieldCategory='CHECKBOX' for checked boxes in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T020 [US4] Ensure unchecked checkboxes are either omitted or included with rawValue="" and isCheckbox=true to distinguish from missing data, and verify checkbox fields set on K1ExtractionResult.metadata (isFinal, isAmended) in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts

**Checkpoint**: All checkbox fields correctly detected as boolean values — User Story 4 independently testable
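
A minimal sketch of the checkbox detection in T019 and T020. T020 allows unchecked boxes to be either omitted or emitted with `rawValue=""`; this sketch takes the second option purely for illustration. The `K1CheckboxRegionSketch` shape, the use of the region key as `boxNumber`, and the ±15pt default tolerance are assumptions.

```typescript
interface K1CheckboxRegionSketch {
  key: string;
  x: number;
  y: number;
}
interface K1MarkItem {
  str: string;
  x: number;
  y: number;
}
interface K1CheckboxFieldSketch {
  boxNumber: string;
  rawValue: string;
  numericValue: null;
  isCheckbox: true;
  fieldCategory: 'CHECKBOX';
}

// Emit one field per checkbox region: "X" when checked, "" when unchecked.
function detectCheckboxes(
  items: K1MarkItem[],
  regions: K1CheckboxRegionSketch[],
  tolerancePt = 15,
): K1CheckboxFieldSketch[] {
  return regions.map((region): K1CheckboxFieldSketch => {
    const checked = items.some(
      (item) =>
        item.str.trim().toUpperCase() === 'X' &&
        Math.abs(item.x - region.x) <= tolerancePt &&
        Math.abs(item.y - region.y) <= tolerancePt,
    );
    return {
      boxNumber: region.key,
      rawValue: checked ? 'X' : '',
      numericValue: null,
      isCheckbox: true,
      fieldCategory: 'CHECKBOX',
    };
  });
}
```

Emitting unchecked boxes explicitly lets downstream code tell "seen and unchecked" apart from "region never matched", which is the distinction T020 calls out.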

---

## Phase 7: User Story 5 — Manual Mapping Fallback for Ambiguous Fields (Priority: P3)

**Goal**: Data-font values that don't match any region appear in unmappedItems with position info for manual assignment

**Independent Test**: Upload a K-1 PDF where some values fall outside expected regions and verify those values appear in unmappedItems with raw text, x, y, fontName, pageNumber

### Implementation for User Story 5

- [x] T021 [US5] After all region matching is complete, collect remaining unmatched data-font items into K1UnmappedItem[] with rawLabel='', rawValue, numericValue (parsed), confidence=0.5, pageNumber=1, x, y, fontName in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T022 [US5] Verify unmapped items integrate with existing review UI manual assignment flow: ensure assignedBoxNumber and resolution fields on K1UnmappedItem work with the confirmation endpoint in apps/api/src/app/k1-import/k1-import.service.ts

**Checkpoint**: Zero data loss — all extracted values either mapped to fields or available in unmappedItems for manual assignment
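
The collection step in T021 could be sketched as follows. The item shape, the `matched` set, and the injected `parseNumeric` callback are assumptions made to keep the example self-contained; in the extractor the callback would presumably be backed by `parseK1Value()`.

```typescript
interface K1DataItemSketch {
  str: string;
  x: number;
  y: number;
  fontName: string;
}
// Only the fields named in T021 are shown.
interface K1UnmappedItemSketch {
  rawLabel: string;
  rawValue: string;
  numericValue: number | null;
  confidence: number;
  pageNumber: number;
  x: number;
  y: number;
  fontName: string;
}

// Everything that no region claimed becomes an unmapped item.
function collectUnmappedItems(
  dataItems: K1DataItemSketch[],
  matched: Set<K1DataItemSketch>,
  parseNumeric: (text: string) => number | null,
): K1UnmappedItemSketch[] {
  return dataItems
    .filter((item) => !matched.has(item))
    .map(
      (item): K1UnmappedItemSketch => ({
        rawLabel: '',
        rawValue: item.str.trim(),
        numericValue: parseNumeric(item.str),
        confidence: 0.5,
        pageNumber: 1,
        x: item.x,
        y: item.y,
        fontName: item.fontName,
      }),
    );
}
```

Because the function works on the complement of `matched`, every data-font item ends up either in a field or in `unmappedItems`, which is the zero data loss property the checkpoint describes.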

---

## Phase 8: Polish & Cross-Cutting Concerns

**Purpose**: Error handling, confidence scoring, cleanup, and service integration

- [x] T023 Implement graceful error handling: wrap extraction in try/catch, return empty fields + low confidence + meaningful error for non-K-1 and corrupted PDFs, never crash on unexpected content in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T024 Implement confidence scoring: HIGH (≥0.90) when value center is within region center ±5pts, MEDIUM (0.70-0.89) within ±10pts, LOW (0.50-0.69) at tolerance boundary ±15pts; compute overallConfidence as weighted average in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T025 Ensure pdfDoc.destroy() cleanup runs in all code paths (success, error, empty result) using try/finally in apps/api/src/app/k1-import/extractors/pdf-parse-extractor.ts
- [x] T026 [P] Update k1-import.service.ts to handle new subtype field when building K1Cell records — concatenate subtype into boxNumber (e.g., "11-ZZ*", "20-A") or store via metadata JSON column in apps/api/src/app/k1-import/k1-import.service.ts
- [x] T027 Run quickstart.md verification checklist: upload test K-1 PDF, verify all 9 checklist items pass (box 11/19/20/21, Section J/L, Final K-1 checkbox, unmapped empty, non-K-1 error)
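
The confidence bands in T024 can be sketched as below. The band edges come from the task description; interpolating linearly inside each band, measuring the offset as the larger of the two axis distances, and defaulting to equal weights in the average are all assumptions, since the task does not specify them.

```typescript
// Map the offset from the region center (in points) to a confidence score.
function scorePositionConfidence(dx: number, dy: number): number {
  const offset = Math.max(Math.abs(dx), Math.abs(dy));
  if (offset <= 5) return 0.9 + 0.1 * (1 - offset / 5); // HIGH: 0.90 to 1.00
  if (offset <= 10) return 0.7 + 0.19 * (1 - (offset - 5) / 5); // MEDIUM: 0.70 to 0.89
  if (offset <= 15) return 0.5 + 0.19 * (1 - (offset - 10) / 5); // LOW: 0.50 to 0.69
  return 0; // outside the ±15pt tolerance, so not a match
}

// Weighted average of per-field scores; weights default to 1 (assumption).
function computeOverallConfidence(scores: number[], weights?: number[]): number {
  let total = 0;
  let weightSum = 0;
  scores.forEach((score, index) => {
    const weight = weights?.[index] ?? 1;
    total += score * weight;
    weightSum += weight;
  });
  return weightSum === 0 ? 0 : total / weightSum;
}
```

Returning 0 for an empty score list keeps the non-K-1 and corrupted PDF paths from T023 in the "low confidence" range without a special case.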

---

## Dependencies & Execution Order

### Phase Dependencies

- **Setup (Phase 1)**: No dependencies — can start immediately
- **Foundational (Phase 2)**: T002 can run in parallel with T001 (different files). T003-T006 depend on T001 (interface types) and execute sequentially in pdf-parse-extractor.ts; T005 also depends on T002 for the region map
- **User Stories (Phase 3-7)**: All depend on completion of the Setup and Foundational phases (T001-T006)
  - US1 (Phase 3) and US2 (Phase 4): Both P1, execute sequentially (same file)
  - US3 (Phase 5) and US4 (Phase 6): Both P2, execute after US1+US2 (same file)
  - US5 (Phase 7): P3, executes last of user stories
- **Polish (Phase 8)**: T023-T025 depend on all user stories. T026 is independent (different file, marked [P]). T027 runs last, as final validation after T023-T026

### User Story Dependencies

- **US1 (P1)**: Depends only on Foundational. No dependency on other stories.
- **US2 (P1)**: Depends only on Foundational. No dependency on US1 (metadata vs Part III are separate regions).
- **US3 (P2)**: Depends only on Foundational. J/K/L/M/N regions are independent of Part III.
- **US4 (P2)**: Depends only on Foundational. Checkbox detection is position-based, independent of value extraction. Some overlap with US2 (Final/Amended checkboxes set metadata flags).
- **US5 (P3)**: Depends on US1-US4 being done (unmapped = whatever's left after all matching).

### Within Each User Story

- Region matching before subtype pairing
- Subtype pairing before multi-subtype handling
- Core extraction before wiring into extract()
- Story complete before moving to next priority

### Parallel Opportunities

- **T001 + T002**: Interface expansion and position regions file — different files, no dependencies
- **T026**: Service update — different file from extractor, can run in parallel with T023-T025
- **US1-US4**: While all modify the same extractor file (sequential), each story's extraction logic is a self-contained function that could theoretically be developed in parallel branches

---

## Parallel Example: Setup and Foundational Phases

```
# These two tasks can run simultaneously:
Task T001: "Expand interfaces in k1-import.interface.ts"
Task T002: "Create position regions in k1-position-regions.ts"

# Then sequentially in pdf-parse-extractor.ts:
Task T003: "Scaffold pdfjs-dist infrastructure"
Task T004: "Font discrimination logic"
Task T005: "Position matching engine"
Task T006: "Value parsing utility"
```

## Parallel Example: Polish Phase

```
# These can run simultaneously (different files):
Task T023-T025: "Error handling, confidence, cleanup in pdf-parse-extractor.ts"
Task T026: "Service subtype handling in k1-import.service.ts"

# Final validation after all above:
Task T027: "Run quickstart.md verification checklist"
```

---

## Implementation Strategy

### MVP First (User Story 1 Only)

1. Complete Phase 1: Setup (T001) — interface expansion
2. Complete Phase 2: Foundational (T002-T006) — pdfjs-dist + regions + font + parsing
3. Complete Phase 3: User Story 1 (T007-T010) — Part III boxes 1-21
4. **STOP and VALIDATE**: Upload test K-1 PDF, verify Part III extraction
5. This delivers the core value — accurate box values replace broken regex parser

### Incremental Delivery

1. Setup + Foundational → Infrastructure ready
2. Add US1 (Part III) → Test independently → **MVP!**
3. Add US2 (Metadata) → Test independently → Metadata populated
4. Add US3 (J/K/L/M/N) → Test independently → Financial fields complete
5. Add US4 (Checkboxes) → Test independently → Boolean fields detected
6. Add US5 (Unmapped) → Test independently → Zero data loss guaranteed
7. Polish → Error handling, confidence, service integration

### Single Developer Flow

All user story tasks modify the same extractor file, so execute sequentially:
Phase 1 → Phase 2 → Phase 3 (US1) → Phase 4 (US2) → Phase 5 (US3) → Phase 6 (US4) → Phase 7 (US5) → Phase 8 (Polish)

---

## Notes

- All 73 position regions are defined in T002 upfront — individual story phases use them
- No new npm dependencies required (pdfjs-dist already installed via pdf-parse)
- The extractor rewrite preserves the existing K1Extractor interface contract (extract + isAvailable)
- Keep isDigitalK1() from the existing extractor — it's used by isAvailable()
- Font names are dynamic — never hardcode specific font names like "g_d0_f8"
- Total: 27 tasks across 8 phases covering 5 user stories