Quickstart: K-1 PDF Scan Import
Phase 1 Output | Date: 2026-03-18 | Updated: 2026-03-18 (post-clarification)
Prerequisites
- Spec 001-family-office-transform models are implemented (Entity, Partnership, PartnershipMembership, KDocument, Distribution, Document)
- At least one Partnership with one or more member Entities exists in the database
- The existing upload infrastructure (UploadController, uploads/ directory) is functional
- Node.js ≥ 22.18.0, Docker for PostgreSQL/Redis
Environment Setup
Add to .env (optional — for Azure OCR of scanned PDFs):
AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT=https://your-resource.cognitiveservices.azure.com/
AZURE_DOCUMENT_INTELLIGENCE_KEY=your-api-key
If these are empty, scanned PDFs fall back to tesseract.js (lower accuracy but fully self-hosted).
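The fallback rule above can be sketched as a small pure function. This is only an illustration: the names ExtractorKind, OcrEnv and chooseExtractor, and the hasTextLayer flag, are hypothetical and not part of the spec, and requiring both Azure variables to be non-empty is an assumption.

```typescript
// Hypothetical names for illustration; the real service may be structured differently.
type ExtractorKind = 'pdf-parse' | 'azure' | 'tesseract';

interface OcrEnv {
  AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT?: string;
  AZURE_DOCUMENT_INTELLIGENCE_KEY?: string;
}

function chooseExtractor(hasTextLayer: boolean, env: OcrEnv): ExtractorKind {
  // Digital PDFs already carry a text layer, so no OCR is needed.
  if (hasTextLayer) {
    return 'pdf-parse';
  }

  const endpoint = (env.AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT ?? '').trim();
  const key = (env.AZURE_DOCUMENT_INTELLIGENCE_KEY ?? '').trim();

  // Assumption: Azure is used only when both values are set;
  // otherwise scanned PDFs fall back to the self-hosted tesseract.js path.
  return endpoint !== '' && key !== '' ? 'azure' : 'tesseract';
}
```

In a Node.js process the second argument would typically be process.env, which keeps the function easy to unit-test with plain objects.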
New Dependencies
npm install pdf-parse @azure/ai-form-recognizer tesseract.js
npm install -D @types/pdf-parse
Database Migration
After adding the new Prisma models (K1ImportSession, CellMapping, CellAggregationRule, K1ImportStatus enum):
npx prisma db push # Development: sync schema
# OR
npx prisma migrate dev # Create a migration file
Seed the default IRS cell mappings (28 rows with partnershipId = null) and default aggregation rules (e.g., "Total Ordinary Income", "Total Capital Gains", "Total Deductions") via the existing seed mechanism or a dedicated seed script.
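To make the idea of an aggregation rule concrete, here is a minimal sketch of how a summary such as "Total Capital Gains" might be computed from mapped cell values. The AggregationRule shape, the cell identifiers, and the choice to treat empty cells as zero are all assumptions; the actual CellAggregationRule model may differ.

```typescript
// Hypothetical shapes; not the real Prisma models.
interface AggregationRule {
  name: string;
  sourceCells: string[]; // cell identifiers, for example K-1 box numbers
}

type CellValues = Record<string, number | null | undefined>;

function computeAggregations(
  rules: AggregationRule[],
  cells: CellValues
): Record<string, number> {
  const totals: Record<string, number> = {};

  for (const rule of rules) {
    // Assumption: empty or missing source cells contribute zero.
    totals[rule.name] = rule.sourceCells.reduce(
      (sum, cell) => sum + (cells[cell] ?? 0),
      0
    );
  }

  return totals;
}

// Illustrative rule only; the seeded rules may use different source cells.
const exampleRules: AggregationRule[] = [
  { name: 'Total Capital Gains', sourceCells: ['8', '9a'] }
];
```

Because the function is pure, calling it again after a cell edit gives the recalculated summaries without any extra state.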
Key Files to Create
Backend (apps/api/src/)
app/k1-import/
├── k1-import.module.ts # NestJS module
├── k1-import.controller.ts # REST endpoints (see contracts/k1-import-api.md)
├── k1-import.service.ts # Orchestration: upload → extract → verify → confirm
├── dto/
│ ├── upload-k1.dto.ts # Multipart upload DTO
│ ├── verify-k1.dto.ts # Verification submission DTO
│ └── confirm-k1.dto.ts # Confirmation request DTO
├── extractors/
│ ├── k1-extractor.interface.ts # Common extraction interface
│ ├── pdf-parse-extractor.ts # Tier 1: digital PDF text extraction
│ ├── azure-extractor.ts # Tier 2: Azure Document Intelligence
│ └── tesseract-extractor.ts # Tier 2 fallback: tesseract.js OCR
├── k1-field-mapper.service.ts # Maps raw extraction → K1ExtractedField[]
├── k1-allocation.service.ts # Allocates K-1 amounts to members by ownership %
├── k1-confidence.service.ts # Computes confidence scores with validation heuristics
└── k1-aggregation.service.ts # Dynamically computes aggregation summaries from rules
app/cell-mapping/
├── cell-mapping.module.ts # NestJS module
├── cell-mapping.controller.ts # CRUD for cell mappings + aggregation rules
└── cell-mapping.service.ts # Cell mapping + aggregation rule business logic + seed data
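The allocation service listed above has to split a partnership-level amount across members by ownership percentage without losing cents to rounding. One possible approach is the largest-remainder method on integer cents, sketched below. The MemberShare shape, the basis-point representation of ownership, and the function name are assumptions made for this example, not the spec's design.

```typescript
// Hypothetical shape; ownership is expressed in basis points (10000 = 100%).
interface MemberShare {
  entityId: string;
  ownershipBps: number;
}

function allocateCents(
  totalCents: number,
  members: MemberShare[]
): Record<string, number> {
  const totalBps = members.reduce((sum, m) => sum + m.ownershipBps, 0);
  if (totalBps <= 0) {
    throw new Error('Members must hold a positive total ownership');
  }

  const sign = totalCents < 0 ? -1 : 1;
  const magnitude = Math.abs(totalCents);

  // Floor every share and remember the integer remainder of the division.
  const shares = members.map((m, index) => {
    const numerator = magnitude * m.ownershipBps;
    return {
      index,
      entityId: m.entityId,
      cents: Math.floor(numerator / totalBps),
      remainder: numerator % totalBps
    };
  });

  // Hand the leftover cents to the largest remainders (ties: original order).
  let leftover = magnitude - shares.reduce((sum, s) => sum + s.cents, 0);
  const byRemainder = [...shares].sort(
    (a, b) => b.remainder - a.remainder || a.index - b.index
  );
  for (const share of byRemainder) {
    if (leftover === 0) break;
    share.cents += 1;
    leftover -= 1;
  }

  const result: Record<string, number> = {};
  for (const share of shares) {
    result[share.entityId] = share.cents === 0 ? 0 : sign * share.cents;
  }
  return result;
}
```

Working in integer cents avoids floating-point drift, and the allocated amounts always sum exactly to the input total.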
Shared Types (libs/common/src/lib/)
interfaces/
├── k1-import.interface.ts # K1ExtractionResult, K1ExtractedField, K1ConfirmationRequest
dtos/
├── k1-import/
│ ├── create-k1-import.dto.ts
│ ├── verify-k1-import.dto.ts
│ └── confirm-k1-import.dto.ts
Frontend (apps/client/src/app/)
pages/k1-import/
├── k1-import-page.component.ts # Upload + history view
├── k1-import-page.html
├── k1-import-page.scss
├── k1-import-page.routes.ts
├── k1-verification/
│ ├── k1-verification.component.ts # Verification/edit screen (mapped + unmapped + aggregations)
│ ├── k1-verification.html
│ └── k1-verification.scss
└── k1-confirmation/
    ├── k1-confirmation.component.ts # Confirmation result screen
    ├── k1-confirmation.html
    └── k1-confirmation.scss
pages/cell-mapping/
├── cell-mapping-page.component.ts # Cell mapping + aggregation rule configuration UI
├── cell-mapping-page.html
└── cell-mapping-page.routes.ts
services/
├── k1-import-data.service.ts # HTTP client for k1-import endpoints
Verification Workflow
- Upload: User selects PDF → POST /api/v1/k1-import/upload → session created with status PROCESSING
- Extract: Backend detects PDF type (digital vs. scanned) → routes to appropriate extractor → status becomes EXTRACTED
- Review: Frontend polls/fetches session → displays verification screen with:
  - Mapped cells: extracted fields with confidence indicators. High-confidence values are pre-accepted. Medium/low-confidence values require explicit review (acknowledge or edit).
  - Unmapped items: separate section for values that didn't match any cell. User assigns to a cell or discards.
  - Aggregation summaries: dynamically computed from mapped values using aggregation rules. Recalculate live when cell values are edited.
- Verify: User reviews all medium/low fields and resolves unmapped items → PUT /api/v1/k1-import/:id/verify → status becomes VERIFIED
- Confirm: User clicks "Confirm & Save" → POST /api/v1/k1-import/:id/confirm → KDocument + Distributions + Document created → status becomes CONFIRMED
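The status progression in the workflow above can be modelled as a small transition table. This sketch covers only the four statuses named in this document; the real K1ImportStatus enum may include others (for example, a failure state), and the helper names are hypothetical.

```typescript
// Only the statuses mentioned in this quickstart; the Prisma enum may have more.
type K1ImportStatus = 'PROCESSING' | 'EXTRACTED' | 'VERIFIED' | 'CONFIRMED';

// Each status has at most one legal successor in the happy path.
const NEXT_STATUS: Record<K1ImportStatus, K1ImportStatus | null> = {
  PROCESSING: 'EXTRACTED',
  EXTRACTED: 'VERIFIED',
  VERIFIED: 'CONFIRMED',
  CONFIRMED: null
};

function advanceStatus(
  current: K1ImportStatus,
  requested: K1ImportStatus
): K1ImportStatus {
  // Reject skipped or backwards steps, e.g. EXTRACTED -> CONFIRMED.
  if (NEXT_STATUS[current] !== requested) {
    throw new Error(`Invalid transition: ${current} -> ${requested}`);
  }
  return requested;
}
```

A guard like this in the service layer would keep the verify and confirm endpoints from being called out of order.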
Testing Strategy
- Unit tests: Extractors (pdf-parse, azure, tesseract), field mapper, confidence scoring, allocation math, aggregation computation
- Integration tests: Full upload → extract → verify → confirm flow with test PDF fixtures
- Test fixtures: Include sample K-1 PDFs (digital and scanned) in the test/import/ directory
- Allocation accuracy: Verify rounding behavior; allocated amounts must sum exactly to the partnership total
- Aggregation tests: Verify dynamic computation from rules, auto-recalculation on value edit, behavior when source cells are empty
- Review enforcement: Verify confirmation blocked when medium/low-confidence fields not reviewed or unmapped items unresolved
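As a starting point for the review-enforcement tests, the rule could be expressed as a function that lists everything still blocking confirmation. The field shapes, confidence labels, and function name below are assumptions for illustration; the real DTOs may look different.

```typescript
// Hypothetical shapes for the verification payload.
type Confidence = 'HIGH' | 'MEDIUM' | 'LOW';

interface MappedField {
  cell: string;
  confidence: Confidence;
  reviewed: boolean;
}

interface UnmappedItem {
  label: string;
  resolution: 'ASSIGNED' | 'DISCARDED' | null; // null = not yet resolved
}

function findConfirmationBlockers(
  fields: MappedField[],
  unmapped: UnmappedItem[]
): string[] {
  const blockers: string[] = [];

  for (const field of fields) {
    // High-confidence values are pre-accepted; the rest need explicit review.
    if (field.confidence !== 'HIGH' && !field.reviewed) {
      blockers.push(`Cell ${field.cell} needs review`);
    }
  }

  for (const item of unmapped) {
    if (item.resolution === null) {
      blockers.push(`Unmapped item "${item.label}" is unresolved`);
    }
  }

  return blockers;
}
```

Confirmation would be allowed only when the returned list is empty, and the list itself can double as the error payload shown to the user.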