# Contract: K1 Extractor Interface

**Feature**: 005-k1-parser-fix | **Date**: 2026-03-18

## Overview

The K1 extraction system uses a strategy pattern in which multiple extractors implement the `K1Extractor` interface. This feature rewrites the internals of `PdfParseExtractor` (Tier 1) while preserving the interface contract.

## K1Extractor Interface (unchanged)

```typescript
interface K1Extractor {
  extract(buffer: Buffer, fileName: string): Promise<K1ExtractionResult>;
  isAvailable(): boolean;
}
```

### extract(buffer, fileName)

**Input**:
- `buffer`: Raw PDF file content as a Node.js Buffer
- `fileName`: Original filename of the uploaded PDF (for logging/diagnostics)

**Output**: `K1ExtractionResult` containing:
- `metadata`: Partnership/partner info, tax year, filing status
- `fields`: Array of `K1ExtractedField` (mapped values)
- `unmappedItems`: Array of `K1UnmappedItem` (values that couldn't be mapped)
- `overallConfidence`: 0.0–1.0 aggregate confidence
- `method`: `'pdf-parse'` (this extractor)
- `pagesProcessed`: number (typically 1)

**Error handling**:
- Throws on non-PDF input (invalid buffer)
- Returns empty fields + low confidence for non-K-1 PDFs
- Never crashes on unexpected PDF content

### isAvailable()

Always returns `true` (no external dependencies or API keys needed).

## K1ExtractionResult Shape (expanded)

```typescript
interface K1ExtractionResult {
  metadata: {
    partnershipName: string | null;
    partnershipEin: string | null;
    partnerName: string | null;
    partnerEin: string | null;
    taxYear: number | null;
    isAmended: boolean;
    isFinal: boolean;
  };
  fields: K1ExtractedField[];
  unmappedItems: K1UnmappedItem[];
  overallConfidence: number;
  method: 'pdf-parse' | 'azure' | 'tesseract';
  pagesProcessed: number;
}
```

## K1ExtractedField Shape (expanded)

```typescript
interface K1ExtractedField {
  boxNumber: string;                          // "1", "6a", "19", "20", "J_PROFIT_BEGIN", etc.
  label: string;                              // Display label
  customLabel: string | null;                 // User override
  rawValue: string;                           // Raw text: "498,211", "(409,811)", "SEE STMT", "X"
  numericValue: number | null;                // Parsed: 498211, -409811, null, null
  confidence: number;                         // 0.0–1.0
  confidenceLevel: 'HIGH' | 'MEDIUM' | 'LOW';
  isUserEdited: boolean;                      // Default false
  isReviewed: boolean;                        // Default false
  subtype: string | null;                     // NEW: "ZZ*", "A", "B", "*", null
  fieldCategory: string;                      // NEW: "PART_III", "METADATA", "SECTION_J", etc.
  isCheckbox: boolean;                        // NEW: true for checkbox fields
}
```

## K1UnmappedItem Shape (expanded)

```typescript
interface K1UnmappedItem {
  rawLabel: string;
  rawValue: string;
  numericValue: number | null;
  confidence: number;
  pageNumber: number;
  resolution: 'assigned' | 'discarded' | null;
  assignedBoxNumber: string | null;
  x: number;                                  // NEW: x position in PDF points
  y: number;                                  // NEW: y position in PDF points
  fontName: string;                           // NEW: PDF font identifier
}
```

## Behavioral Contract

1. **Font discrimination**: The extractor MUST dynamically identify which fonts carry data values vs. template text. It MUST NOT hardcode specific font names.
2. **Position matching**: Each data value MUST be mapped to a K-1 field by checking its (x, y) position against the defined bounding box regions.
3. **Subtype pairing**: For subtype boxes, code and value items at the same y-position (±8 pts) MUST be paired.
4. **Multi-subtype**: Boxes with multiple subtypes (e.g., box 20) MUST produce a separate `K1ExtractedField` entry for each subtype row.
5. **Value parsing**: Parenthesized values MUST become negative. Commas MUST be stripped. "SEE STMT" MUST remain as-is with a null `numericValue`.
6. **Unmapped fallback**: Any data value that does not match a region MUST appear in `unmappedItems` (zero data loss).
7. **Cleanup**: The PDF document MUST be destroyed after extraction to free worker resources.
8. **Page scope**: Only page 1 is processed. Supplemental statements on subsequent pages of multi-page K-1s are out of scope.
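
The value-parsing rule of the behavioral contract (rule 5) can be illustrated with a minimal sketch. The helper name `parseK1Value` and the `ParsedValue` type are hypothetical and not part of the contract; the sketch also assumes that values carry no currency symbol or leading minus sign, so such inputs are treated as non-numeric here.

```typescript
// Hypothetical return shape: mirrors the rawValue/numericValue pair of K1ExtractedField.
interface ParsedValue {
  rawValue: string;
  numericValue: number | null;
}

// Hypothetical helper illustrating rule 5; not the extractor's actual implementation.
function parseK1Value(raw: string): ParsedValue {
  const text = raw.trim();

  // Parenthesized amounts such as "(409,811)" are negative.
  const parenthesized = /^\((.*)\)$/.exec(text);

  // Strip thousands separators before numeric conversion.
  const body = (parenthesized ? parenthesized[1] : text).replace(/,/g, "");

  // Anything that is not plain digits (optionally with a decimal part),
  // e.g. "SEE STMT" or a checkbox "X", keeps its raw text and gets null.
  if (!/^\d+(\.\d+)?$/.test(body)) {
    return { rawValue: raw, numericValue: null };
  }

  const magnitude = Number(body);
  // Avoid producing -0 for a parenthesized zero.
  const numericValue = parenthesized && magnitude !== 0 ? -magnitude : magnitude;
  return { rawValue: raw, numericValue };
}

// Usage with the sample values from the K1ExtractedField comments:
const examples = ["498,211", "(409,811)", "SEE STMT", "X"].map(parseK1Value);
// numericValue for each, in order: 498211, -409811, null, null
```

Returning the untouched `rawValue` alongside the parsed number keeps the original text available for review, which matches the `rawValue`/`numericValue` split in `K1ExtractedField`.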