# Research: Fix K-1 PDF Parser **Feature**: 005-k1-parser-fix | **Date**: 2026-03-18 ## Research Summary All technical unknowns resolved. Three key decisions made: 1. **pdfjs-dist** for position-based text extraction (already installed) 2. **Font discrimination + position region mapping** as the extraction strategy 3. **73 bounding box regions** defined covering all K-1 form fields --- ## Decision 1: PDF Parsing Library **Decision**: Use `pdfjs-dist` v5.4.296 directly (already installed as transitive dependency of pdf-parse v2.4.5) **Rationale**: - Already installed — no new npm dependencies - `page.getTextContent()` returns `TextItem` objects with precise (x, y) coordinates, font name, width, height - `@napi-rs/canvas` v0.1.80 (also already installed) provides DOMMatrix polyfill for Node.js via the legacy build - The legacy build at `pdfjs-dist/legacy/build/pdf.mjs` auto-polyfills `DOMMatrix`, `ImageData`, `Path2D`, and `navigator` **Alternatives considered**: - **pdf-parse v2.4.5** (currently used): Wraps pdfjs-dist but does NOT expose position coordinates. Only returns concatenated text strings. Insufficient for position-based extraction. - **pdf-lib**: Can read AcroForm fields, but K-1 PDFs have zero AcroForm fields (values are text overlays). Not useful. - **pdf2json**: Older PDF.js fork with positioned text. Redundant — pdfjs-dist v5.4 is already available and more current. ### API Details **Import** (must use dynamic import — API project compiles to CommonJS via webpack): ```typescript const { getDocument, GlobalWorkerOptions } = await import('pdfjs-dist/legacy/build/pdf.mjs'); ``` **Worker configuration** (required in v5.4.x): ```typescript const workerPath = 'file:///' + resolve('node_modules/pdfjs-dist/legacy/build/pdf.worker.mjs').replace(/\\/g, '/'); GlobalWorkerOptions.workerSrc = workerPath; ``` **Document loading**: ```typescript const loadingTask = getDocument({ data: new Uint8Array(buffer), standardFontDataUrl: resolve('node_modules/pdfjs-dist/standard_fonts') + '/', cMapUrl: resolve('node_modules/pdfjs-dist/cmaps') + '/', cMapPacked: true, isEvalSupported: false, disableFontFace: true, }); ``` **Text extraction**: ```typescript const page = await pdfDoc.getPage(1); const textContent = await page.getTextContent({ includeMarkedContent: false }); // textContent.items: TextItem[] with { str, transform, width, height, fontName, hasEOL, dir } // textContent.styles: { [fontName]: { fontFamily, ascent, descent, vertical } } // transform[4] = x, transform[5] = y (PDF points, origin bottom-left) ``` **Cleanup** (required): ```typescript await pdfDoc.destroy(); // Terminates worker, frees resources ``` ### Gotchas 1. Must use `pdfjs-dist/legacy/build/pdf.mjs` — main build crashes with `DOMMatrix is not defined` 2. Must set `GlobalWorkerOptions.workerSrc` to the worker file path — empty string no longer works in v5.4.x 3. `workerSrc` must be a `file://` URL on Windows 4. Use `await import()` not static `import` — CommonJS compat via webpack 5. Y-coordinates are bottom-up: `transform[5]` = 792 is top of page, 0 is bottom 6. `page.view` gives `[0, 0, 612, 792]` — standard US Letter --- ## Decision 2: Extraction Strategy **Decision**: Hybrid approach — font discrimination (primary) + position-based region mapping (secondary) **Rationale**: - Font filtering instantly isolates ~30 data values from 467 total text items on page 1 - Position mapping then determines exactly which K-1 field each value belongs to - Two-phase filtering is more robust than either approach alone - Resilient to minor position variations across different K-1 generators **Alternatives considered**: - **Regex label matching** (current approach): Fundamentally broken — pdf-parse outputs all template labels first, then all data values separately. Labels and values are never adjacent in the text stream. - **Sequential positional parsing** (text order): Fragile — depends on exact text ordering which varies between generators. Also can't distinguish data values from template text. - **Pure position-based** (no font check): Would work but requires matching against all 73 regions for all 467 items. Font filtering first reduces the problem to ~30 items × 73 regions. ### Font Discrimination Details From the sample K-1 PDF, text items use these fonts: | fontName | fontFamily | Usage | Count | |----------|-----------|-------|-------| | g_d0_f1 | serif | Template labels, headers | ~350 items | | g_d0_f2 | sans-serif | "20" in tax year | 1 item | | g_d0_f3 | sans-serif | "25" in tax year (data) | 1 item | | g_d0_f5 | serif | Footnotes, small text | ~80 items | | g_d0_f6 | sans-serif | Data values | ~10 items | | g_d0_f7 | monospace | Checkboxes/codes | ~5 items | | g_d0_f8 | sans-serif | Data values (primary) | ~20 items | **Key insight**: Template labels exclusively use `serif` fonts. Data values exclusively use `sans-serif` or `monospace` fonts. Filtering by `fontFamily !== 'serif'` isolates all data values. **Dynamic detection**: Since font names vary across generators, the algorithm should: 1. Get all unique fonts from `textContent.styles` 2. Identify template fonts: the fonts used by known template text items (items matching "Schedule K-1", "Form 1065", "Ordinary business income", etc.) 3. Non-template fonts = data fonts 4. Filter items to only those using data fonts --- ## Decision 3: Position Region Map **Decision**: Define 73 bounding box regions covering all K-1 form fields with ±15 pt tolerance **Rationale**: - K-1 form layout is standardized by the IRS — position regions are consistent across generators - 22 positions verified from actual PDF extraction with exact coordinates - Remaining ~51 positions interpolated from verified anchors and standard IRS form spacing - ±15 pt tolerance handles minor variations between generators ### Verified Anchor Points (from actual K-1 PDF) | Value | x | y | Field | |-------|-----|-------|-------| | "X" | 324.3 | 746.2 | FINAL_K1 | | "X" | 180.3 | 446.6 | G_LIMITED | | "X" | 58.0 | 422.9 | H1_DOMESTIC | | "3.032900" | 139.1 | 339.1 | J_PROFIT_BEGIN | | "0.000000" | 250.1 | 339.1 | J_PROFIT_END | | "498,211" | 180.8 | 254.5 | K_NONRECOURSE_BEGIN | | "X" | 294.9 | 205.8 | K2_CHECKBOX | | "4,903,568" | 257.8 | 157.4 | L_BEG_CAPITAL | | "(409,811)" | 259.3 | 133.7 | L_CURR_YR_INCOME | | "4,493,757" | 257.8 | 109.4 | L_WITHDRAWALS | | "X" | 101.2 | 74.2 | M_NO | | "(5,373)" | 271.5 | 49.7 | N_BEGINNING | | "(409,811)" | 92.1 | 2.8 | N_ENDING | | "ZZ*" | 314.2 | 314.4 | BOX_11_CODE | | "(409,615)" | 403.9 | 314.4 | BOX_11_VALUE | | "X" | 563.3 | 603.8 | BOX_16_K3 | | "A" | 455.2 | 423.2 | BOX_19_CODE | | "4,493,757" | 530.6 | 422.0 | BOX_19_VALUE | | "*" | 456.4 | 267.1 | BOX_21_CODE | | "196" | 555.6 | 266.1 | BOX_21_VALUE | ### Region Layout Summary | Group | X range | Y range | Fields | |-------|---------|---------|--------| | Header | 120–450 | 731–785 | 5: TAX_YEAR, TAX_YEAR_BEGIN/END, FINAL_K1, AMENDED_K1 | | Part I | 30–290 | 610–735 | 4: A_EIN, B_NAME, B_ADDR, C_IRS_CENTER | | Part II | 30–306 | 350–610 | 12: D through I2 | | Section J | 120–305 | 285–354 | 7: profit/loss/capital begin/end + decrease sale | | Section K | 155–310 | 176–270 | 8: nonrecourse/qual/recourse begin/end + K2/K3 checkboxes | | Section L | 220–306 | 83–173 | 6: beg/contributed/income/other/withdrawals/end | | Section M | 50–120 | 59–89 | 2: M_YES, M_NO | | Section N | 60–306 | 0–65 | 2: N_BEGINNING, N_ENDING | | Part III Left | 300–455 | 245–698 | 19: boxes 1–13 (including a/b/c sub-boxes) | | Part III Right | 440–595 | 245–710 | 8: boxes 14–21 | ### Subtype Handling Boxes 11, 12, 13 (left column) and 14, 15, 17, 19, 20, 21 (right column) can have subtype codes: - **Left column**: code at x ≈ 300–350, value at x ≈ 370–455 - **Right column**: code at x ≈ 440–475, value at x ≈ 510–595 Pairing algorithm: find code and value items on the same y-line (within ±8 pts). Box 20 supports multiple subtype rows (A, B, V/Z, *) spaced ~23 pts apart within y range 275–395. --- ## Decision 4: Numeric Value Parsing **Decision**: Parse all K-1 values using consistent rules **Rationale**: IRS K-1 forms use standard US financial formatting. No ambiguity in the parsing rules. **Rules**: 1. Remove commas: "4,903,568" → "4903568" 2. Parenthesized = negative: "(409,811)" → "-409811" → -409811 3. Leading minus = negative: "-5,373" → -5373 4. Dollar sign: strip "$" if present 5. Decimal percentages: "3.032900" → 3.032900 (preserve precision, do not round) 6. "SEE STMT" / "STMT" → `numericValue: null`, `rawValue: "SEE STMT"` 7. "X" (checkbox) → boolean true, `rawValue: "X"` 8. Empty / whitespace → omit field or `numericValue: 0` 9. "E-FILE" and other text values → `numericValue: null`, preserve as rawValue --- ## Decision 5: Interface Expansion **Decision**: Add `subtype`, `fieldCategory`, and `isCheckbox` to `K1ExtractedField`; add position info to `K1UnmappedItem` **Rationale**: The existing interface lacks fields needed for subtype codes (box 11 "ZZ*", box 20 "A"/"B"), field categorization (Part III vs Section J vs metadata), and checkbox discrimination. Adding these fields is backward-compatible (all optional/nullable). **New fields on K1ExtractedField**: - `subtype: string | null` — subtype code (e.g., "ZZ*", "A", "B", "*") - `fieldCategory: 'PART_III' | 'METADATA' | 'SECTION_J' | 'SECTION_K' | 'SECTION_L' | 'SECTION_M' | 'SECTION_N' | 'CHECKBOX'` - `isCheckbox: boolean` — whether this field is a boolean checkbox value **New fields on K1UnmappedItem**: - `x: number` — x position in PDF points - `y: number` — y position in PDF points - `fontName: string` — font identifier for debugging --- ## Open Items None. All NEEDS CLARIFICATION items resolved.