mirror of https://github.com/ghostfolio/ghostfolio
Contract: K1 Extractor Interface
Feature: 005-k1-parser-fix | Date: 2026-03-18
Overview
The K1 extraction system uses a strategy pattern where multiple extractors implement the K1Extractor interface. This feature rewrites the PdfParseExtractor (Tier 1) internals while preserving the interface contract.
K1Extractor Interface (unchanged)
interface K1Extractor {
  extract(buffer: Buffer, fileName: string): Promise<K1ExtractionResult>;
  isAvailable(): boolean;
}
extract(buffer, fileName)
Input:
- buffer: Raw PDF file content as a Node.js Buffer
- fileName: Original filename of the uploaded PDF (for logging/diagnostics)
Output: K1ExtractionResult containing:
- metadata: Partnership/partner info, tax year, filing status
- fields: Array of K1ExtractedField (mapped values)
- unmappedItems: Array of K1UnmappedItem (values that couldn't be mapped)
- overallConfidence: 0.0–1.0 aggregate confidence
- method: 'pdf-parse' (this extractor)
- pagesProcessed: number (typically 1)
Error handling:
- Throws on non-PDF input (invalid buffer)
- Returns an empty fields array and a low overallConfidence for non-K-1 PDFs
- Never crashes on unexpected PDF content
isAvailable()
Always returns true (no external dependencies or API keys needed).
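The contract does not say how a caller chooses between extractors, so the following is only a minimal sketch of one way the strategy pattern could be wired up: try the tiers in order and use the first one whose isAvailable() returns true. The names ExtractorLike, selectExtractor, pdfParseStub and azureStub are hypothetical, and the "missing API key" reason is an assumption for illustration.

```typescript
// Minimal stand-in for the K1Extractor contract. The buffer is typed as
// Uint8Array (which Node's Buffer extends) so the sketch compiles without
// Node typings; the real interface uses Buffer.
interface ExtractorLike<R> {
  extract(buffer: Uint8Array, fileName: string): Promise<R>;
  isAvailable(): boolean;
}

// Hypothetical helper, not part of the contract: walk the tiers in order
// and return the first extractor that reports itself available.
function selectExtractor<R>(tiers: ExtractorLike<R>[]): ExtractorLike<R> {
  const chosen = tiers.find((tier) => tier.isAvailable());
  if (!chosen) {
    throw new Error('No K1 extractor is available');
  }
  return chosen;
}

// Stub extractors for illustration only.
const pdfParseStub: ExtractorLike<string> = {
  extract: async () => 'pdf-parse',
  isAvailable: () => true // Tier 1 needs no external dependencies
};
const azureStub: ExtractorLike<string> = {
  extract: async () => 'azure',
  isAvailable: () => false // assumed: e.g. no API key configured
};
```

Because PdfParseExtractor is always available, a selection like this would never fail as long as it is included in the list of tiers.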
K1ExtractionResult Shape (expanded)
interface K1ExtractionResult {
  metadata: {
    partnershipName: string | null;
    partnershipEin: string | null;
    partnerName: string | null;
    partnerEin: string | null;
    taxYear: number | null;
    isAmended: boolean;
    isFinal: boolean;
  };
  fields: K1ExtractedField[];
  unmappedItems: K1UnmappedItem[];
  overallConfidence: number;
  method: 'pdf-parse' | 'azure' | 'tesseract';
  pagesProcessed: number;
}
K1ExtractedField Shape (expanded)
interface K1ExtractedField {
  boxNumber: string; // "1", "6a", "19", "20", "J_PROFIT_BEGIN", etc.
  label: string; // Display label
  customLabel: string | null; // User override
  rawValue: string; // Raw text: "498,211", "(409,811)", "SEE STMT", "X"
  numericValue: number | null; // Parsed: 498211, -409811, null, null
  confidence: number; // 0.0–1.0
  confidenceLevel: 'HIGH' | 'MEDIUM' | 'LOW';
  isUserEdited: boolean; // Default false
  isReviewed: boolean; // Default false
  subtype: string | null; // NEW: "ZZ*", "A", "B", "*", null
  fieldCategory: string; // NEW: "PART_III", "METADATA", "SECTION_J", etc.
  isCheckbox: boolean; // NEW: true for checkbox fields
}
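The rawValue and numericValue comments above imply a parsing rule: strip commas, treat parenthesized amounts as negative, and leave non-numeric text such as "SEE STMT" or a checkbox "X" with a null numericValue. A minimal sketch of that rule follows; parseK1Value is a hypothetical name, and accepting a leading minus sign or decimals is an assumption the contract does not state.

```typescript
// Hypothetical helper: converts a raw K-1 cell string to a number, or
// returns null when the text is not numeric (e.g. "SEE STMT" or "X").
function parseK1Value(rawValue: string): number | null {
  let body = rawValue.trim();

  // Parenthesized amounts are negative: "(409,811)" -> -409811
  const negativeByParens = /^\(.*\)$/.test(body);
  if (negativeByParens) {
    body = body.slice(1, -1).trim();
  }

  // Assumption: a leading minus sign is also treated as negative.
  const negativeBySign = body.startsWith('-');
  if (negativeBySign) {
    body = body.slice(1);
  }

  // Commas are thousands separators and are stripped.
  const digits = body.replace(/,/g, '');

  // Accept only plain integers or decimals; anything else stays null.
  if (!/^\d+(\.\d+)?$/.test(digits)) {
    return null;
  }

  const value = Number(digits);
  const isNegative = negativeByParens || negativeBySign;
  return isNegative && value !== 0 ? -value : value;
}
```

With this sketch, the four rawValue examples in the comment map to the four numericValue examples in the same order.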
K1UnmappedItem Shape (expanded)
interface K1UnmappedItem {
  rawLabel: string;
  rawValue: string;
  numericValue: number | null;
  confidence: number;
  pageNumber: number;
  resolution: 'assigned' | 'discarded' | null;
  assignedBoxNumber: string | null;
  x: number; // NEW: x position in PDF points
  y: number; // NEW: y position in PDF points
  fontName: string; // NEW: PDF font identifier
}
Behavioral Contract
- Font discrimination: The extractor MUST dynamically identify which fonts carry data values vs. template text. It MUST NOT hardcode specific font names.
- Position matching: Each data value MUST be mapped to a K-1 field by checking its (x, y) against defined bounding box regions.
- Subtype pairing: For subtype boxes, code and value items at the same y-position (±8 pts) MUST be paired.
- Multi-subtype: Boxes with multiple subtypes (e.g., box 20) MUST produce separate K1ExtractedField entries for each subtype row.
- Value parsing: Parenthesized values MUST become negative. Commas MUST be stripped. "SEE STMT" MUST remain as-is with null numericValue.
- Unmapped fallback: Any data value not matching a region MUST appear in unmappedItems (zero data loss).
- Cleanup: The PDF document MUST be destroyed after extraction to free worker resources.
- Page scope: Only page 1 is processed. Supplemental statements on subsequent pages of multi-page K-1s are out of scope.
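The position-matching, unmapped-fallback and subtype-pairing rules above can be sketched as follows. This is an illustration under stated assumptions, not the extractor's actual code: the TextItem and Region shapes and the function names are hypothetical, and the contract does not define how bounding boxes are stored or how ties between several candidate values on one row are broken (here the closest y wins).

```typescript
// Hypothetical shapes: the contract does not define how regions are stored.
interface TextItem { text: string; x: number; y: number; }
interface Region {
  boxNumber: string;
  xMin: number; xMax: number;
  yMin: number; yMax: number;
}

// Returns the first region whose bounding box contains the item, or null.
function findRegion(item: TextItem, regions: Region[]): Region | null {
  const hit = regions.find(
    (r) => item.x >= r.xMin && item.x <= r.xMax &&
           item.y >= r.yMin && item.y <= r.yMax
  );
  return hit ? hit : null;
}

// Splits items into mapped and unmapped so that no value is dropped.
function partitionByRegion(items: TextItem[], regions: Region[]) {
  const mapped: { boxNumber: string; item: TextItem }[] = [];
  const unmapped: TextItem[] = [];
  for (const item of items) {
    const region = findRegion(item, regions);
    if (region) {
      mapped.push({ boxNumber: region.boxNumber, item });
    } else {
      unmapped.push(item); // unmapped fallback
    }
  }
  return { mapped, unmapped };
}

// From the contract: code and value share a row when their y differs by at most 8 pts.
const Y_TOLERANCE_PTS = 8;

// Pairs each subtype code with the closest unused value on the same row.
function pairSubtypes(codes: TextItem[], values: TextItem[]) {
  const remaining = [...values];
  const pairs: { subtype: string; rawValue: string }[] = [];
  for (const code of codes) {
    let bestIndex = -1;
    let bestDelta = Infinity;
    remaining.forEach((value, index) => {
      const delta = Math.abs(value.y - code.y);
      if (delta <= Y_TOLERANCE_PTS && delta < bestDelta) {
        bestIndex = index;
        bestDelta = delta;
      }
    });
    if (bestIndex >= 0) {
      const [value] = remaining.splice(bestIndex, 1);
      pairs.push({ subtype: code.text, rawValue: value.text });
    }
  }
  return pairs;
}
```

Each pair returned by pairSubtypes would become its own K1ExtractedField entry for the box, which is how a multi-subtype box such as box 20 ends up with one entry per row. Values that pairSubtypes leaves unpaired are not returned here; to honor the zero-data-loss rule, a caller would route them to unmappedItems.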