
Contract: K1 Extractor Interface

Feature: 005-k1-parser-fix | Date: 2026-03-18

Overview

The K1 extraction system uses a strategy pattern where multiple extractors implement the K1Extractor interface. This feature rewrites the PdfParseExtractor (Tier 1) internals while preserving the interface contract.

K1Extractor Interface (unchanged)

interface K1Extractor {
  extract(buffer: Buffer, fileName: string): Promise<K1ExtractionResult>;
  isAvailable(): boolean;
}

extract(buffer, fileName)

Input:

  • buffer: Raw PDF file content as a Node.js Buffer
  • fileName: Original filename of the uploaded PDF (for logging/diagnostics)

Output: K1ExtractionResult containing:

  • metadata: Partnership/partner info, tax year, filing status
  • fields: Array of K1ExtractedField (mapped values)
  • unmappedItems: Array of K1UnmappedItem (values that couldn't be mapped)
  • overallConfidence: 0.0–1.0 aggregate confidence
  • method: 'pdf-parse' (this extractor)
  • pagesProcessed: number of pages processed (1 for this extractor, which reads only page 1)

Error handling:

  • Throws when the buffer is not a valid PDF
  • Returns empty fields and a low overallConfidence for PDFs that are not K-1 forms
  • Does not throw on unexpected content inside a valid PDF
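
The error-handling rules above could be sketched as a guard that runs before any parsing. This is only an illustration, not the actual PdfParseExtractor code: the names looksLikePdf, assertPdf, and emptyResult are hypothetical, the trimmed result shape is a stand-in for K1ExtractionResult, and using 0 for "low confidence" is an assumption. The check looks only at the "%PDF-" signature in the first five bytes, so a real implementation would still need to handle PDFs that fail later during parsing.

```typescript
// Hypothetical helper: true when the bytes start with the "%PDF-" signature.
// A Node.js Buffer is a Uint8Array, so a Buffer can be passed in directly.
function looksLikePdf(bytes: Uint8Array): boolean {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  if (bytes.length < signature.length) return false;
  return signature.every((byte, i) => bytes[i] === byte);
}

// Hypothetical guard: throws on non-PDF input, as the contract requires.
function assertPdf(bytes: Uint8Array, fileName: string): void {
  if (!looksLikePdf(bytes)) {
    throw new Error(`Not a PDF: ${fileName}`);
  }
}

// Trimmed-down stand-in for K1ExtractionResult, used by this sketch only.
interface EmptyResultSketch {
  fields: unknown[];
  unmappedItems: unknown[];
  overallConfidence: number;
  method: 'pdf-parse';
  pagesProcessed: number;
}

// Assumed representation of "empty fields + low confidence" for a non-K-1 PDF.
function emptyResult(): EmptyResultSketch {
  return {
    fields: [],
    unmappedItems: [],
    overallConfidence: 0, // assumption: the contract does not define "low"
    method: 'pdf-parse',
    pagesProcessed: 1,
  };
}
```

An extract implementation might call assertPdf(buffer, fileName) first, then return emptyResult() when the page yields no recognizable K-1 values.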

isAvailable()

Always returns true, because this extractor needs no external dependencies or API keys.

K1ExtractionResult Shape (expanded)

interface K1ExtractionResult {
  metadata: {
    partnershipName: string | null;
    partnershipEin: string | null;
    partnerName: string | null;
    partnerEin: string | null;
    taxYear: number | null;
    isAmended: boolean;
    isFinal: boolean;
  };
  fields: K1ExtractedField[];
  unmappedItems: K1UnmappedItem[];
  overallConfidence: number;
  method: 'pdf-parse' | 'azure' | 'tesseract';
  pagesProcessed: number;
}

K1ExtractedField Shape (expanded)

interface K1ExtractedField {
  boxNumber: string;           // "1", "6a", "19", "20", "J_PROFIT_BEGIN", etc.
  label: string;               // Display label
  customLabel: string | null;  // User override
  rawValue: string;            // Raw text: "498,211", "(409,811)", "SEE STMT", "X"
  numericValue: number | null; // Parsed: 498211, -409811, null, null
  confidence: number;          // 0.0–1.0
  confidenceLevel: 'HIGH' | 'MEDIUM' | 'LOW';
  isUserEdited: boolean;       // Default false
  isReviewed: boolean;         // Default false
  subtype: string | null;      // NEW: "ZZ*", "A", "B", "*", null
  fieldCategory: string;       // NEW: "PART_III", "METADATA", "SECTION_J", etc.
  isCheckbox: boolean;         // NEW: true for checkbox fields
}
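
The rawValue and numericValue comments above imply a small parsing rule: strip commas, treat parentheses as a negative sign, and return null for anything that is not a number (such as "SEE STMT" or a checkbox "X"). A minimal sketch follows. The function name parseNumericValue is hypothetical, and the sketch assumes values contain only digits, commas, an optional decimal part, and optional surrounding parentheses. Currency symbols and leading minus signs are not handled, since the contract does not mention them.

```typescript
// Hypothetical parser for the rawValue -> numericValue rule.
function parseNumericValue(rawValue: string): number | null {
  const trimmed = rawValue.trim();

  // Parentheses denote a negative amount: "(409,811)" -> -409811
  const negative = /^\(.*\)$/.test(trimmed);
  const inner = negative ? trimmed.slice(1, -1) : trimmed;

  // Thousands separators are dropped: "498,211" -> "498211"
  const digits = inner.replace(/,/g, '');

  // Anything that is not a plain number ("SEE STMT", "X", "") stays null
  if (!/^\d+(\.\d+)?$/.test(digits)) return null;

  const value = Number(digits);
  return negative ? -value : value;
}
```

With the sample values from the comments, "498,211" parses to 498211, "(409,811)" to -409811, and both "SEE STMT" and "X" to null, while rawValue keeps the original text unchanged.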

K1UnmappedItem Shape (expanded)

interface K1UnmappedItem {
  rawLabel: string;
  rawValue: string;
  numericValue: number | null;
  confidence: number;
  pageNumber: number;
  resolution: 'assigned' | 'discarded' | null;
  assignedBoxNumber: string | null;
  x: number;                   // NEW: x position in PDF points
  y: number;                   // NEW: y position in PDF points
  fontName: string;            // NEW: PDF font identifier
}

Behavioral Contract

  1. Font discrimination: The extractor MUST dynamically identify which fonts carry data values vs. template text. It MUST NOT hardcode specific font names.
  2. Position matching: Each data value MUST be mapped to a K-1 field by checking its (x, y) against defined bounding box regions.
  3. Subtype pairing: For subtype boxes, a code item and a value item whose y-positions are within 8 pts of each other MUST be paired.
  4. Multi-subtype: Boxes with multiple subtypes (e.g., box 20) MUST produce separate K1ExtractedField entries for each subtype row.
  5. Value parsing: Parenthesized values MUST become negative. Commas MUST be stripped. "SEE STMT" MUST remain as-is with null numericValue.
  6. Unmapped fallback: Any data value not matching a region MUST appear in unmappedItems, so that no extracted value is lost.
  7. Cleanup: The PDF document MUST be destroyed after extraction to free worker resources.
  8. Page scope: Only page 1 MUST be processed. Supplemental statements on subsequent pages of a multi-page K-1 are out of scope.
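
Rules 2, 3, and 6 above can be illustrated with a short sketch. It is not the extractor's actual code: the TextItem and Region types, the function names, and the "closest unused value wins" tie-break are all assumptions made for illustration, and the contract does not list the real region coordinates. Only the ±8 pt tolerance comes from the contract.

```typescript
// Hypothetical types for this sketch only.
interface TextItem { text: string; x: number; y: number; }
interface Region { boxNumber: string; xMin: number; xMax: number; yMin: number; yMax: number; }

// Rule 2: find the first region whose bounding box contains the item's (x, y).
function findRegion(item: TextItem, regions: Region[]): Region | null {
  for (const r of regions) {
    const insideX = item.x >= r.xMin && item.x <= r.xMax;
    const insideY = item.y >= r.yMin && item.y <= r.yMax;
    if (insideX && insideY) return r;
  }
  return null;
}

// Rule 6: every item lands in exactly one of the two output lists.
function partitionByRegion(items: TextItem[], regions: Region[]) {
  const mapped: Array<{ item: TextItem; boxNumber: string }> = [];
  const unmapped: TextItem[] = [];
  for (const item of items) {
    const region = findRegion(item, regions);
    if (region !== null) mapped.push({ item, boxNumber: region.boxNumber });
    else unmapped.push(item);
  }
  return { mapped, unmapped };
}

// Rule 3: tolerance taken from the contract (±8 pts).
const Y_TOLERANCE = 8;

// Pair each subtype code with the closest unused value on the same row.
// Picking the closest candidate is an assumption; the contract only
// requires that items within the tolerance be paired.
function pairCodesWithValues(codes: TextItem[], values: TextItem[]) {
  const pairs: Array<{ code: TextItem; value: TextItem }> = [];
  const used = new Set<TextItem>();
  for (const code of codes) {
    let best: TextItem | null = null;
    let bestDist = Infinity;
    for (const value of values) {
      const dist = Math.abs(value.y - code.y);
      if (!used.has(value) && dist <= Y_TOLERANCE && dist < bestDist) {
        best = value;
        bestDist = dist;
      }
    }
    if (best !== null) {
      used.add(best);
      pairs.push({ code, value: best });
    }
  }
  return pairs;
}
```

Each pair would then become its own K1ExtractedField with the code as subtype (rule 4), and any value left unpaired or outside every region would be reported through unmappedItems.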