Research: Fix K-1 PDF Parser

Feature: 005-k1-parser-fix | Date: 2026-03-18

Research Summary

All technical unknowns are resolved. Five decisions were made; the three key ones are:

  1. pdfjs-dist for position-based text extraction (already installed)
  2. Font discrimination + position region mapping as the extraction strategy
  3. 73 bounding box regions defined covering all K-1 form fields

Decision 1: PDF Parsing Library

Decision: Use pdfjs-dist v5.4.296 directly (already installed as a transitive dependency of pdf-parse v2.4.5)

Rationale:

  • Already installed — no new npm dependencies
  • page.getTextContent() returns TextItem objects with precise (x, y) coordinates, font name, width, height
  • @napi-rs/canvas v0.1.80 (also already installed) provides DOMMatrix polyfill for Node.js via the legacy build
  • The legacy build at pdfjs-dist/legacy/build/pdf.mjs auto-polyfills DOMMatrix, ImageData, Path2D, and navigator

Alternatives considered:

  • pdf-parse v2.4.5 (currently used): Wraps pdfjs-dist but does NOT expose position coordinates. Only returns concatenated text strings. Insufficient for position-based extraction.
  • pdf-lib: Can read AcroForm fields, but K-1 PDFs have zero AcroForm fields (values are text overlays). Not useful.
  • pdf2json: Older PDF.js fork with positioned text. Redundant — pdfjs-dist v5.4 is already available and more current.

API Details

Import (must use dynamic import — API project compiles to CommonJS via webpack):

const { getDocument, GlobalWorkerOptions } = await import('pdfjs-dist/legacy/build/pdf.mjs');

Worker configuration (required in v5.4.x):

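// Assumes: import { resolve } from 'node:path'; import { pathToFileURL } from 'node:url';
// pathToFileURL yields a valid file:// URL on both Windows and POSIX.
const workerPath = pathToFileURL(resolve('node_modules/pdfjs-dist/legacy/build/pdf.worker.mjs')).href;
GlobalWorkerOptions.workerSrc = workerPath;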

Document loading:

const loadingTask = getDocument({
  data: new Uint8Array(buffer),
  standardFontDataUrl: resolve('node_modules/pdfjs-dist/standard_fonts') + '/',
  cMapUrl: resolve('node_modules/pdfjs-dist/cmaps') + '/',
  cMapPacked: true,
  isEvalSupported: false,
  disableFontFace: true,
});
const pdfDoc = await loadingTask.promise; // resolves to the PDFDocumentProxy used below

Text extraction:

const page = await pdfDoc.getPage(1);
const textContent = await page.getTextContent({ includeMarkedContent: false });
// textContent.items: TextItem[] with { str, transform, width, height, fontName, hasEOL, dir }
// textContent.styles: { [fontName]: { fontFamily, ascent, descent, vertical } }
// transform[4] = x, transform[5] = y (PDF points, origin bottom-left)

Cleanup (required):

await pdfDoc.destroy(); // Terminates worker, frees resources

Gotchas

  1. Must use pdfjs-dist/legacy/build/pdf.mjs — main build crashes with DOMMatrix is not defined
  2. Must set GlobalWorkerOptions.workerSrc to the worker file path — empty string no longer works in v5.4.x
  3. workerSrc must be a file:// URL on Windows
  4. Use await import() not static import — CommonJS compat via webpack
  5. Y-coordinates are bottom-up: transform[5] = 792 is top of page, 0 is bottom
  6. page.view gives [0, 0, 612, 792] — standard US Letter
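
Gotchas 5 and 6 imply a small conversion step before any position matching. The following is a minimal sketch of that step, not the project's implementation: `TextItemLike`, `PositionedText`, and `toPositioned` are hypothetical names, and the transform scale and font name in the sample call are illustrative. Only the x and y values come from the verified FINAL_K1 anchor.

```typescript
// Subset of the pdfjs-dist TextItem fields listed above (local type, not imported).
interface TextItemLike {
  str: string;
  transform: number[]; // [a, b, c, d, e, f]; e = x, f = y in PDF points
  fontName: string;
}

// Hypothetical shape for downstream matching.
interface PositionedText {
  text: string;
  x: number;
  y: number;        // bottom-up, as reported by pdfjs-dist
  yFromTop: number; // top-down, convenient when comparing against the printed form
  fontName: string;
}

// pageHeight defaults to 792 (US Letter); in practice it could be read from page.view[3].
function toPositioned(item: TextItemLike, pageHeight = 792): PositionedText {
  const x = item.transform[4];
  const y = item.transform[5];
  return { text: item.str.trim(), x, y, yFromTop: pageHeight - y, fontName: item.fontName };
}

// Sample call using the FINAL_K1 checkbox position; scale and font name are made up.
const finalK1 = toPositioned({ str: "X", transform: [8, 0, 0, 8, 324.3, 746.2], fontName: "g_d0_f8" });
console.log(finalK1.x, finalK1.y, finalK1.yFromTop);
```

Keeping both `y` and `yFromTop` avoids repeated conversions; the region tables in this document use the bottom-up `y`.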

Decision 2: Extraction Strategy

Decision: Hybrid approach — font discrimination (primary) + position-based region mapping (secondary)

Rationale:

  • Font filtering instantly isolates ~30 data values from 467 total text items on page 1
  • Position mapping then determines exactly which K-1 field each value belongs to
  • Two-phase filtering is more robust than either approach alone
  • Resilient to minor position variations across different K-1 generators

Alternatives considered:

  • Regex label matching (current approach): Fundamentally broken — pdf-parse outputs all template labels first, then all data values separately. Labels and values are never adjacent in the text stream.
  • Sequential positional parsing (text order): Fragile — depends on exact text ordering which varies between generators. Also can't distinguish data values from template text.
  • Pure position-based (no font check): Would work but requires matching against all 73 regions for all 467 items. Font filtering first reduces the problem to ~30 items × 73 regions.

Font Discrimination Details

From the sample K-1 PDF, text items use these fonts:

| fontName | fontFamily | Usage | Count |
|---|---|---|---|
| g_d0_f1 | serif | Template labels, headers | ~350 items |
| g_d0_f2 | sans-serif | "20" in tax year | 1 item |
| g_d0_f3 | sans-serif | "25" in tax year (data) | 1 item |
| g_d0_f5 | serif | Footnotes, small text | ~80 items |
| g_d0_f6 | sans-serif | Data values | ~10 items |
| g_d0_f7 | monospace | Checkboxes/codes | ~5 items |
| g_d0_f8 | sans-serif | Data values (primary) | ~20 items |

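Key insight: Template labels use serif fonts, while data values use sans-serif or monospace fonts, so filtering by fontFamily !== 'serif' isolates the data values. In this sample the filter also passes the "20" year prefix (g_d0_f2), which is sans-serif but is not marked as data in the table, so that item may need to be excluded separately.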

Dynamic detection: Since font names vary across generators, the algorithm should:

  1. Get all unique fonts from textContent.styles
  2. Identify template fonts: the fonts used by known template text items (items matching "Schedule K-1", "Form 1065", "Ordinary business income", etc.)
  3. Non-template fonts = data fonts
  4. Filter items to only those using data fonts
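
The four steps above can be sketched roughly as follows. This is an illustration under assumptions: `FontItem`, `TEMPLATE_MARKERS`, `detectDataFonts`, and `filterDataItems` are hypothetical names, the marker list contains only the three phrases quoted in step 2, and the font set is derived from the items themselves rather than from `textContent.styles`.

```typescript
interface FontItem {
  str: string;
  fontName: string;
}

// Assumed marker phrases; the list in the text ends with "etc.", so extend as needed.
const TEMPLATE_MARKERS = ["Schedule K-1", "Form 1065", "Ordinary business income"];

// Steps 1-3: collect fonts used by known template text; every other font is a data font.
function detectDataFonts(items: FontItem[], markers: string[] = TEMPLATE_MARKERS): Set<string> {
  const templateFonts = new Set<string>();
  for (const item of items) {
    if (markers.some((m) => item.str.includes(m))) templateFonts.add(item.fontName);
  }
  const dataFonts = new Set<string>();
  for (const item of items) {
    if (!templateFonts.has(item.fontName)) dataFonts.add(item.fontName);
  }
  return dataFonts;
}

// Step 4: keep only items rendered in a data font.
function filterDataItems<T extends FontItem>(items: T[]): T[] {
  const dataFonts = detectDataFonts(items);
  return items.filter((item) => dataFonts.has(item.fontName));
}

// Tiny sample using font names from the table above.
const sample: FontItem[] = [
  { str: "Schedule K-1", fontName: "g_d0_f1" },
  { str: "Ordinary business income (loss)", fontName: "g_d0_f1" },
  { str: "4,903,568", fontName: "g_d0_f8" },
  { str: "X", fontName: "g_d0_f6" },
];
const dataItems = filterDataItems(sample);
console.log(dataItems.map((i) => i.str));
```

One caveat of this approach: a template font is only recognized if at least one marker phrase is rendered in it, so the marker list would need a phrase for every template font (for example, the footnote font).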

Decision 3: Position Region Map

Decision: Define 73 bounding box regions covering all K-1 form fields with ±15 pt tolerance

Rationale:

  • K-1 form layout is standardized by the IRS — position regions are consistent across generators
  • 22 positions verified from actual PDF extraction with exact coordinates
  • Remaining ~51 positions interpolated from verified anchors and standard IRS form spacing
  • ±15 pt tolerance handles minor variations between generators

Verified Anchor Points (from actual K-1 PDF)

| Value | x | y | Field |
|---|---|---|---|
| "X" | 324.3 | 746.2 | FINAL_K1 |
| "X" | 180.3 | 446.6 | G_LIMITED |
| "X" | 58.0 | 422.9 | H1_DOMESTIC |
| "3.032900" | 139.1 | 339.1 | J_PROFIT_BEGIN |
| "0.000000" | 250.1 | 339.1 | J_PROFIT_END |
| "498,211" | 180.8 | 254.5 | K_NONRECOURSE_BEGIN |
| "X" | 294.9 | 205.8 | K2_CHECKBOX |
| "4,903,568" | 257.8 | 157.4 | L_BEG_CAPITAL |
| "(409,811)" | 259.3 | 133.7 | L_CURR_YR_INCOME |
| "4,493,757" | 257.8 | 109.4 | L_WITHDRAWALS |
| "X" | 101.2 | 74.2 | M_NO |
| "(5,373)" | 271.5 | 49.7 | N_BEGINNING |
| "(409,811)" | 92.1 | 2.8 | N_ENDING |
| "ZZ*" | 314.2 | 314.4 | BOX_11_CODE |
| "(409,615)" | 403.9 | 314.4 | BOX_11_VALUE |
| "X" | 563.3 | 603.8 | BOX_16_K3 |
| "A" | 455.2 | 423.2 | BOX_19_CODE |
| "4,493,757" | 530.6 | 422.0 | BOX_19_VALUE |
| "*" | 456.4 | 267.1 | BOX_21_CODE |
| "196" | 555.6 | 266.1 | BOX_21_VALUE |

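To illustrate how the ±15 pt tolerance could be applied, the sketch below matches a position against a handful of the verified anchors. It simplifies the real design: each anchor is treated as the center of a ±15 pt box rather than using the full bounding-box regions, and `Anchor`, `ANCHORS`, and `matchField` are hypothetical names.

```typescript
interface Anchor {
  field: string;
  x: number;
  y: number;
}

// A few of the verified anchor points listed in this section.
const ANCHORS: Anchor[] = [
  { field: "J_PROFIT_BEGIN", x: 139.1, y: 339.1 },
  { field: "J_PROFIT_END", x: 250.1, y: 339.1 },
  { field: "L_BEG_CAPITAL", x: 257.8, y: 157.4 },
  { field: "L_CURR_YR_INCOME", x: 259.3, y: 133.7 },
  { field: "BOX_21_VALUE", x: 555.6, y: 266.1 },
];

const TOLERANCE_PT = 15;

// Returns the closest anchor whose x and y are both within the tolerance, or null.
function matchField(x: number, y: number, anchors: Anchor[] = ANCHORS): string | null {
  let best: Anchor | null = null;
  let bestDist = Infinity;
  for (const a of anchors) {
    const dx = Math.abs(x - a.x);
    const dy = Math.abs(y - a.y);
    if (dx > TOLERANCE_PT || dy > TOLERANCE_PT) continue;
    const dist = Math.hypot(dx, dy);
    if (dist < bestDist) {
      best = a;
      bestDist = dist;
    }
  }
  return best ? best.field : null;
}

console.log(matchField(258.5, 150.0)); // only L_BEG_CAPITAL is within tolerance
console.log(matchField(258.0, 145.0)); // two candidates; the nearer one wins
console.log(matchField(400.0, 700.0)); // no anchor nearby
```

The nearest-anchor tie-break matters because adjacent Section L rows in the anchor data are only about 24 pt apart (157.4 vs. 133.7), so two ±15 pt boxes can both contain the same point.
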
Region Layout Summary

| Group | X range | Y range | Count | Fields |
|---|---|---|---|---|
| Header | 120–450 | 731–785 | 5 | TAX_YEAR, TAX_YEAR_BEGIN/END, FINAL_K1, AMENDED_K1 |
| Part I | 30–290 | 610–735 | 4 | A_EIN, B_NAME, B_ADDR, C_IRS_CENTER |
| Part II | 30–306 | 350–610 | 12 | D through I2 |
| Section J | 120–305 | 285–354 | 7 | profit/loss/capital begin/end + decrease sale |
| Section K | 155–310 | 176–270 | 8 | nonrecourse/qual/recourse begin/end + K2/K3 checkboxes |
| Section L | 220–306 | 83–173 | 6 | beg/contributed/income/other/withdrawals/end |
| Section M | 50–120 | 59–89 | 2 | M_YES, M_NO |
| Section N | 60–306 | 0–65 | 2 | N_BEGINNING, N_ENDING |
| Part III Left | 300–455 | 245–698 | 19 | boxes 1–13 (including a/b/c sub-boxes) |
| Part III Right | 440–595 | 245–710 | 8 | boxes 14–21 |

Subtype Handling

Boxes 11, 12, 13 (left column) and 14, 15, 17, 19, 20, 21 (right column) can have subtype codes:

  • Left column: code at x ≈ 300–350, value at x ≈ 370–455
  • Right column: code at x ≈ 440–475, value at x ≈ 510–595

Pairing algorithm: find code and value items on the same y-line (within ±8 pts).

Box 20 supports multiple subtype rows (A, B, V/Z, *) spaced ~23 pts apart within y range 275–395.
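
The pairing algorithm could look roughly like the sketch below. The column x-ranges and the ±8 pt line tolerance come from the text above; `Pt`, `SubtypePair`, `COLUMNS`, and `pairSubtypes` are hypothetical names, and choosing the value with the smallest y difference is an assumed tie-break.

```typescript
interface Pt {
  text: string;
  x: number;
  y: number;
}

interface SubtypePair {
  code: string;
  value: string;
  y: number;
}

// Column x-ranges taken from the bullets above.
const COLUMNS = {
  left: { code: [300, 350], value: [370, 455] },
  right: { code: [440, 475], value: [510, 595] },
} as const;

const Y_LINE_TOLERANCE_PT = 8;

const inRange = (v: number, [lo, hi]: readonly [number, number]): boolean => v >= lo && v <= hi;

// Pairs each code item with the value item closest to it on the same y-line.
function pairSubtypes(items: Pt[], column: keyof typeof COLUMNS): SubtypePair[] {
  const { code, value } = COLUMNS[column];
  const codes = items.filter((i) => inRange(i.x, code));
  const values = items.filter((i) => inRange(i.x, value));
  const pairs: SubtypePair[] = [];
  for (const c of codes) {
    let best: Pt | null = null;
    for (const v of values) {
      const dy = Math.abs(v.y - c.y);
      if (dy <= Y_LINE_TOLERANCE_PT && (best === null || dy < Math.abs(best.y - c.y))) best = v;
    }
    if (best) pairs.push({ code: c.text, value: best.text, y: c.y });
  }
  return pairs;
}

// Items taken from the verified anchor points (boxes 11, 19, and 21).
const partIII: Pt[] = [
  { text: "ZZ*", x: 314.2, y: 314.4 },
  { text: "(409,615)", x: 403.9, y: 314.4 },
  { text: "A", x: 455.2, y: 423.2 },
  { text: "4,493,757", x: 530.6, y: 422.0 },
  { text: "*", x: 456.4, y: 267.1 },
  { text: "196", x: 555.6, y: 266.1 },
];
const leftPairs = pairSubtypes(partIII, "left");
const rightPairs = pairSubtypes(partIII, "right");
console.log(leftPairs, rightPairs);
```

Note that the left value range (370–455) and the right code range (440–475) overlap between x = 440 and 455, so an item in that band could be claimed by both columns; a real implementation would need a rule for that case.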


Decision 4: Numeric Value Parsing

Decision: Parse all K-1 values using consistent rules

Rationale: IRS K-1 forms use standard US financial formatting. No ambiguity in the parsing rules.

Rules:

  1. Remove commas: "4,903,568" → "4903568"
  2. Parenthesized = negative: "(409,811)" → "-409811" → -409811
  3. Leading minus = negative: "-5,373" → -5373
  4. Dollar sign: strip "$" if present
  5. Decimal percentages: "3.032900" → 3.032900 (preserve precision, do not round)
  6. "SEE STMT" / "STMT" → numericValue: null, rawValue: "SEE STMT"
  7. "X" (checkbox) → boolean true, rawValue: "X"
  8. Empty / whitespace → omit field or numericValue: 0
  9. "E-FILE" and other text values → numericValue: null, preserve as rawValue
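
A compact sketch of these rules follows. `ParsedValue`, `booleanValue`, and `parseK1Value` are hypothetical names, and for rule 8 the sketch picks the "omit field" option by returning null; the alternative (numericValue: 0) is left to the real implementation.

```typescript
interface ParsedValue {
  rawValue: string;
  numericValue: number | null;
  booleanValue?: boolean; // hypothetical member for checkbox values (rule 7)
}

function parseK1Value(raw: string): ParsedValue | null {
  const rawValue = raw.trim();
  if (rawValue === "") return null; // rule 8: this sketch omits empty fields
  if (rawValue === "X") return { rawValue, numericValue: null, booleanValue: true }; // rule 7

  // Rules 1 and 4: drop commas and dollar signs before inspecting the number.
  const stripped = rawValue.replace(/[$,]/g, "");
  // Rule 2: a fully parenthesized value is negative.
  const paren = /^\((.+)\)$/.exec(stripped);
  const digits = paren ? paren[1] : stripped;

  // Rules 3 and 5: optional leading minus, optional decimal part.
  if (/^-?\d+(\.\d+)?$/.test(digits)) {
    const n = Number(digits);
    return { rawValue, numericValue: paren && n !== 0 ? -n : n };
  }
  // Rules 6 and 9: "SEE STMT", "E-FILE", and other text keep only the raw value.
  return { rawValue, numericValue: null };
}

console.log(parseK1Value("(409,811)"));
console.log(parseK1Value("SEE STMT"));
```

Because `numericValue` is a JavaScript number, trailing zeros in "3.032900" are not retained there; they remain available through `rawValue`.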

Decision 5: Interface Expansion

Decision: Add subtype, fieldCategory, and isCheckbox to K1ExtractedField; add position info to K1UnmappedItem

Rationale: The existing interface lacks fields needed for subtype codes (box 11 "ZZ*", box 20 "A"/"B"), field categorization (Part III vs Section J vs metadata), and checkbox discrimination. Adding these fields is backward-compatible (all optional/nullable).

New fields on K1ExtractedField:

  • subtype: string | null — subtype code (e.g., "ZZ*", "A", "B", "*")
  • fieldCategory: 'PART_III' | 'METADATA' | 'SECTION_J' | 'SECTION_K' | 'SECTION_L' | 'SECTION_M' | 'SECTION_N' | 'CHECKBOX'
  • isCheckbox: boolean — whether this field is a boolean checkbox value

New fields on K1UnmappedItem:

  • x: number — x position in PDF points
  • y: number — y position in PDF points
  • fontName: string — font identifier for debugging
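
As a rough picture of the expanded interfaces, the sketch below declares the new members as optional so that existing object literals still type-check. Only the new members come from this decision; `fieldName` and `text` are placeholder assumptions for whatever the existing interfaces already contain, and `K1FieldCategory` is a hypothetical alias.

```typescript
type K1FieldCategory =
  | "PART_III"
  | "METADATA"
  | "SECTION_J"
  | "SECTION_K"
  | "SECTION_L"
  | "SECTION_M"
  | "SECTION_N"
  | "CHECKBOX";

interface K1ExtractedField {
  fieldName: string;           // assumed existing member
  rawValue: string;            // named in the parsing rules
  numericValue: number | null; // named in the parsing rules
  subtype?: string | null;     // new: e.g. "ZZ*", "A", "B", "*"
  fieldCategory?: K1FieldCategory; // new
  isCheckbox?: boolean;        // new
}

interface K1UnmappedItem {
  text: string;      // assumed existing member
  x?: number;        // new: x position in PDF points
  y?: number;        // new: y position in PDF points
  fontName?: string; // new: font identifier for debugging
}

// A field using the new members (values from the verified anchors).
const box11: K1ExtractedField = {
  fieldName: "BOX_11_VALUE",
  rawValue: "(409,615)",
  numericValue: -409615,
  subtype: "ZZ*",
  fieldCategory: "PART_III",
  isCheckbox: false,
};

// An object written against the old shape still type-checks.
const legacyField: K1ExtractedField = { fieldName: "L_BEG_CAPITAL", rawValue: "4,903,568", numericValue: 4903568 };

const unmapped: K1UnmappedItem = { text: "E-FILE", x: 0, y: 0, fontName: "g_d0_f6" };
console.log(box11, legacyField, unmapped);
```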

Open Items

None. All NEEDS CLARIFICATION items resolved.