# Research: Normalized Relational Model for K-1 Financial Data **Phase 0 Output** | **Date**: 2026-03-20 | **Research Only — No Code** --- ## Context The current system stores K-1 box data as a flat JSON blob on `KDocument.data`: ```json {"1": 50000, "9a": -1200, "11-ZZ*": 500, "20-A": 1200} ``` Aggregations are computed on-the-fly in `k1-aggregation.service.ts` by iterating JSON keys. `CellMapping` provides label metadata, and `CellAggregationRule` defines which box keys to SUM. The system currently has ~80+ possible K-1 fields (boxes 1–21 with subtypes, Sections J/K/L/M/N, metadata fields like A–I). The goal is to evaluate whether and how to transform this into a normalized relational model. --- ## Topic 1: Wide vs Normalized Financial Data Models ### Decision **Move to a normalized fact table** (`K1LineItem`) for Part III financial data (boxes 1–21), but **keep a JSON metadata column** for Part I/II identity fields (A–I, J–N) that are queried infrequently. ### Rationale The current JSON blob approach has these specific weaknesses for analytics: **Query limitations observed in this codebase:** 1. **No SQL-level filtering or aggregation** — The `computeForKDocument()` method in `k1-aggregation.service.ts` must fetch the entire `KDocument` row, deserialize JSON, and loop through `Object.entries(data)` in application code. This means you cannot write `SELECT SUM(amount) FROM ... WHERE box_number = '1' AND tax_year BETWEEN 2020 AND 2025` — every aggregation requires fetching and deserializing all rows. 2. **No indexes on values** — Cannot index `data->'1'` effectively in PostgreSQL JSONB for range queries. While GIN indexes support containment (`@>`), they don't help with `>`, `<`, or `BETWEEN` on numeric values within the JSON. 3. **No referential integrity** — A typo like `"9A"` vs `"9a"` silently creates bad data. The current `CellMapping` table defines valid box numbers, but nothing enforces that `KDocument.data` keys match them. 4. **Cross-document aggregation is O(n) deserialization** — To compute "total ordinary income (Box 1) across all partnerships for 2025," every KDocument row matching the year must be fetched and parsed. With 50+ partnerships × 5 years, this is 250+ JSON deserializations for one number. 5. **No partial update tracking** — When a KDocument transitions from ESTIMATED → FINAL, the entire JSON blob is replaced. `previousData` preserves the old blob but provides no field-level diff. 6. **Schema evolution is invisible** — If the IRS adds a Box 6d in 2027, there's no migration — it just appears as a new JSON key. This sounds convenient but means no validation, no type checking, and no discoverability for future NL-to-SQL. **When the wide/JSON model is acceptable:** - Archival storage of the complete raw extraction (already served by `K1ImportSession.rawExtraction`) - Rarely-queried metadata fields (Part I/II: partnership name, EIN, addresses) - Configurations and user preferences (already used for `Settings.settings`) - Fewer than ~10 documents with no cross-document queries needed **When it breaks down (the current situation):** - Cross-entity/cross-year aggregation (core family office use case) - Performance analytics over time (partnership returns by year) - Tax planning queries ("show me all partnerships with Section 1231 losses > $10K") - Audit trail at field granularity - LLM-generated SQL queries (LLMs cannot reliably generate JSONB path expressions) ### Alternatives Considered | Alternative | Pros | Cons | |---|---|---| | **Keep JSON blob** (status quo) | No migration, flexible schema | All query limitations above; blocks analytics roadmap | | **JSONB with generated columns** | No schema change for K-1 fields; PostgreSQL 12+ supports `GENERATED ALWAYS AS (data->>'1')::numeric` | Max ~30 generated columns practical; doesn't scale to 80+ fields; still no FK integrity | | **Wide table with 80+ columns** | Simple queries, strong typing | Extremely sparse (most K-1s populate ~20 of 80+ boxes); ALTER TABLE for every IRS form change; NULL-heavy | | **Normalized fact table** (chosen) | SQL aggregation, indexes, FK integrity, LLM-friendly, field-level audit trail | More JOINs; migration effort; slightly more complex insert logic | --- ## Topic 2: EAV vs Normalized Tables for Tax Document Fields ### Decision **Use a hybrid approach**: a single EAV-style fact table (`K1LineItem`) for all Part III financial line items, combined with a reference/dimension table (`K1BoxDefinition`) that provides metadata, typing, and validation rules. Keep Part I/II identity metadata as structured JSON on the KDocument. This is technically EAV but with strong constraints — it's closer to a **typed fact table** pattern than classic unconstrained EAV. ### Rationale **Why EAV is appropriate here (and usually isn't):** Classic EAV fails because it loses type safety, makes queries verbose, and resists validation. K-1 data avoids these pitfalls because: 1. **Uniform value type** — All Part III financial values (boxes 1–21) are `Decimal` amounts. Unlike generic EAV where attributes might be strings, dates, booleans, or blobs, K-1 line items are uniformly monetary amounts with a known currency. This eliminates the "value_string / value_number / value_date" anti-pattern. 2. **Closed attribute set** — The IRS defines ~50 Part III line items. This is not open-ended. The `K1BoxDefinition` reference table enumerates all valid attributes, so there's no unbounded attribute sprawl. 3. **Natural query pattern** — The primary queries are aggregations across one attribute dimension: `SUM(amount) WHERE box_key = '1'`. This is exactly what EAV is good at — pivot-style aggregation across a known set of attributes. 4. **Sparse data** — A typical K-1 populates 15–25 of ~50 possible line items. A wide table would be 50–70% NULL. The EAV/fact table stores only populated fields, which is both space-efficient and semantically clearer. **Proposed structure (conceptual):** ``` K1BoxDefinition (reference/dimension table) ├── boxKey VARCHAR PK -- "1", "9a", "11-ZZ*", "20-A" ├── label VARCHAR -- "Ordinary business income (loss)" ├── section VARCHAR -- "PART_III", "PART_I", "SECTION_J" ├── dataType VARCHAR -- "CURRENCY", "PERCENTAGE", "BOOLEAN", "TEXT" ├── sortOrder INT ├── irsFormLine VARCHAR -- "Box 1", "Box 9a", "Section J, Line 1" └── description TEXT K1LineItem (fact table — one row per box per KDocument) ├── id UUID PK ├── kDocumentId UUID FK → KDocument.id ├── boxKey VARCHAR FK → K1BoxDefinition.boxKey ├── amount DECIMAL(15,2) -- financial value (null for non-monetary) ├── textValue VARCHAR -- for text/boolean fields if needed ├── sourceConfidence DECIMAL(3,2) -- 0.00–1.00, from extraction ├── sourcePageNumber INT -- PDF page where extracted ├── sourceCoordinates JSON -- {x, y, width, height} on the page ├── isUserEdited BOOLEAN -- true if user modified during verification ├── createdAt TIMESTAMP ├── updatedAt TIMESTAMP └── @@unique([kDocumentId, boxKey]) ``` **Why not separate normalized tables for each box category:** An alternative is dedicated tables: `K1IncomeItems`, `K1DeductionItems`, `K1CreditItems`, `K1CapitalAccount`, etc. This was rejected because: - K-1 boxes don't cleanly partition into fixed categories (Box 11 "Other income" spans multiple categories via sub-codes) - Sub-code boxes (11-A through 11-ZZ*, 13-A through 13-ZZ*, 20-A through 20-ZZ*) have partnership-specific meaning — the same structural pattern repeats across boxes - It would require 6–8 tables with identical column shapes, making queries harder, not easier - The `K1BoxDefinition` reference table provides the categorical metadata without needing separate physical tables **Treatment of Part I/II metadata fields:** Fields like Partnership EIN (Box A), Partner name (Box F), Section J percentages, and Section L capital account data are better stored as structured JSON on `KDocument` in a `metadata` column because: - They're queried for display, not for aggregation - They have heterogeneous types (strings, booleans, percentages, addresses) - They identify the document rather than representing financial facts - There are ~30 of them, and they're almost all populated (not sparse) ### Alternatives Considered | Alternative | Pros | Cons | |---|---|---| | **Pure EAV (no reference table)** | Maximum flexibility | No validation of box keys; `CellMapping` already serves this role but without FK enforcement | | **Wide table (one column per box)** | Simple SELECTs for specific boxes | 80+ columns; 50–70% NULLs; ALTER TABLE for new boxes; poor for cross-box aggregation | | **Separate tables per box category** | Strong typing per category | 6–8 near-identical tables; complex UNION queries; sub-code boxes don't fit cleanly | | **Hybrid EAV + reference table** (chosen) | Uniform fact table; strong FK validation; sparse-friendly; single query pattern for aggregation; field-level provenance | Pivot queries needed for "show one K-1 as a form"; slightly more complex writes | --- ## Topic 3: Financial Fact Tables for Tax Data ### Decision **Model K-1 line items as a financial fact table** in a star-schema-inspired design, with KDocument as the central bridge to dimension tables (Partnership, Entity, TaxYear). Monetary values stored as `DECIMAL(15,2)` with explicit currency. ### Rationale Financial data warehouses consistently use a fact/dimension pattern for tax line items: **Star schema mapping for K-1 data:** ``` ┌──────────────┐ │ Partnership │ (dimension) │ ────────── │ │ id, name, │ │ type, ein │ └──────┬───────┘ │ ┌──────────────┐ ┌──────┴───────┐ ┌──────────────────┐ │ Entity │────│ KDocument │────│ K1BoxDefinition │ (dimension) │ (dimension) │ │ (bridge) │ │ ────────────────│ │ ────────── │ │ ────────── │ │ boxKey, label, │ │ id, name, │ │ id, taxYear,│ │ section, type │ │ type, taxId │ │ status │ └──────────────────┘ └──────────────┘ └──────┬───────┘ │ ┌──────┴───────┐ │ K1LineItem │ (FACT) │ ────────── │ │ amount, │ │ boxKey, │ │ confidence │ └──────────────┘ ``` **Best practices from financial data warehousing applied here:** 1. **Additive facts only** — `K1LineItem.amount` is fully additive: you can SUM across tax years, partnerships, entities, or box types. Non-additive data (percentages, booleans, text) is stored separately in `textValue` or on the KDocument metadata. 2. **Grain = one box value per K-1 document** — Each row in `K1LineItem` represents one financial amount from one K-1 for one tax year. This is the atomic grain. Aggregation rules from `CellAggregationRule` operate on this grain. 3. **Slowly changing dimensions** — `PartnershipMembership` already handles SCD Type 2 (effective dates) for ownership percentages. `K1BoxDefinition` is SCD Type 1 (overwritten on IRS form changes, with version tracking if needed). 4. **Conformed dimensions** — `Partnership` and `Entity` serve as conformed dimensions shared between K-1 facts, Distribution facts, and Valuation facts. A single `Entity` dimension joins to multiple fact tables. 5. **Currency handling** — Store amounts in the source currency with a `currency` column. The KDocument inherits currency from Partnership. Conversion to reporting currency happens at query time or in materialized views, never by mutating the fact. 6. **Decimal precision** — `DECIMAL(15,2)` covers amounts up to $9,999,999,999,999.99. K-1 amounts from large partnerships (PE funds, hedge funds) can reach tens of millions. 15 digits provides headroom. Use 2 decimal places to match IRS reporting precision. **Aggregation queries enabled by this model:** ```sql -- Total ordinary income across all partnerships for 2025 SELECT SUM(li.amount) FROM k1_line_item li JOIN k_document kd ON li.k_document_id = kd.id WHERE li.box_key = '1' AND kd.tax_year = 2025; -- Income breakdown by entity for tax year 2025 SELECT e.name, li.box_key, SUM(li.amount) FROM k1_line_item li JOIN k_document kd ON li.k_document_id = kd.id JOIN partnership p ON kd.partnership_id = p.id JOIN partnership_membership pm ON pm.partnership_id = p.id JOIN entity e ON pm.entity_id = e.id WHERE kd.tax_year = 2025 GROUP BY e.name, li.box_key; -- Partnership performance: Box 1 over time SELECT kd.tax_year, p.name, li.amount FROM k1_line_item li JOIN k_document kd ON li.k_document_id = kd.id JOIN partnership p ON kd.partnership_id = p.id WHERE li.box_key = '1' ORDER BY kd.tax_year; ``` These queries are impossible or impractical with the current JSON blob model. ### Alternatives Considered | Alternative | Pros | Cons | |---|---|---| | **Snowflake schema (more normalization)** | Normalized box categories into sub-dimensions | Over-normalized for ~50 box types; extra JOINs for no benefit | | **Flat denormalized reporting table** | Fastest reads; no JOINs | Write complexity; data duplication; hard to keep consistent | | **OLAP cube / column store** | Best aggregation performance | Overkill for <10K rows; adds infrastructure complexity | | **Star-schema-inspired fact table** (chosen) | Natural fit for K-1 aggregation queries; leverages existing dimensions; PostgreSQL handles this scale trivially | Requires JOINs for full context (acceptable) | --- ## Topic 4: Source Traceability in Financial Systems ### Decision **Store extraction provenance at the line-item grain** — each `K1LineItem` records the source page number, bounding-box coordinates, raw extracted text, confidence score, and whether it was user-edited. The `K1ImportSession` retains the complete raw extraction as an immutable JSON snapshot. ### Rationale The audit trail must support this flow: ``` Displayed aggregated number → K1LineItem (individual box value) → KDocument (which K-1, which year, which partnership) → K1ImportSession (extraction record) → Document (source PDF file) → Specific page + coordinates on that page → Raw extracted text before parsing ``` **Granularity levels and what to store where:** | Level | Table | Fields | Purpose | |---|---|---|---| | **Aggregation** | Computed at query time | SUM/formula from `CellAggregationRule` | "Where does this total come from?" → list of K1LineItems | | **Line item** | `K1LineItem` | `amount`, `boxKey`, `sourceConfidence`, `sourcePageNumber`, `sourceCoordinates`, `rawExtractedText`, `isUserEdited` | "What exactly was extracted and from where?" | | **Document** | `K1ImportSession` | `rawExtraction` (full JSON), `extractionMethod`, `fileName` | "What did the system originally see?" (immutable after extraction) | | **File** | `Document` | `filePath`, `fileSize`, `mimeType` | "Where is the original PDF?" | **Key design principles:** 1. **Immutability of raw extraction** — `K1ImportSession.rawExtraction` is written once at extraction time and never modified. `verifiedData` captures user edits. This provides a complete before/after audit trail. 2. **Coordinate-level provenance** — Current `k1-positions-dump.txt` shows the parser already extracts `x, y` coordinates for each text element. Storing `sourceCoordinates: {x, y, width, height}` on each `K1LineItem` enables a future "click to highlight in PDF" feature. 3. **Confidence as first-class data** — The system already computes confidence scores (0.0–1.0) during extraction. Persisting this on the line item (not just in the import session JSON) enables queries like "show me all low-confidence values across all partnerships" and supports audit prioritization. 4. **User edit tracking** — `isUserEdited: boolean` distinguishes machine-extracted values from human-verified overrides. This is critical for audit and for training future extraction models. 5. **No deletion of source data** — When a KDocument transitions from ESTIMATED → FINAL, the old line items should be soft-versioned (via `KDocument.previousData` or a separate version table), not deleted. **What NOT to store at line-item level:** - Full PDF binary (stay on Document/filesystem) - Complete OCR output for the entire page (stay on K1ImportSession.rawExtraction) - Rendering coordinates for non-K-1 text on the page (not relevant) ### Alternatives Considered | Alternative | Pros | Cons | |---|---|---| | **Provenance only at document level** | Simpler; fewer columns | Cannot trace an individual number back to a specific location on a page | | **Separate provenance table** (K1LineItemProvenance) | Clean separation of concerns | Extra JOIN for every audit query; 1:1 relationship is usually better as columns | | **Store full page image crops per line item** | Visual proof | Massive storage; PDF coordinates + original file are sufficient for re-rendering | | **Provenance on line item** (chosen) | Direct traceability; no extra JOINs; enables "highlight in PDF"; supports audit queries | Slightly wider rows (acceptable for <10K rows) | --- ## Topic 5: PostgreSQL Materialized Views for Financial Reporting ### Decision **Use materialized views for cross-partnership/cross-year aggregation dashboards**, refreshed on a schedule or triggered by KDocument changes. Use regular views for single-document or single-partnership queries. Do **not** use denormalized reporting tables. ### Rationale **When to use each approach in this system:** | Scenario | Approach | Reason | |---|---|---| | "Show Box 1–21 for one K-1" | Regular query on `K1LineItem` | Small result set; no aggregation; fast enough | | "Total income by box for one partnership across years" | Regular SQL `GROUP BY` | <20 rows × <10 years = <200 rows; trivial for PostgreSQL | | "Dashboard: all partnerships × all entities × 5 years" | **Materialized view** | Cross-joins across dimensions; 50 partnerships × 5 entities × 5 years × 20 boxes = 25,000 aggregated values; worth pre-computing | | "Tax planning: find partnerships with specific loss patterns" | Materialized view or indexed view | Complex filtering across many K-1s | | "YoY change in Box 1 by partnership" | Materialized view | Window functions over multiple years | **Proposed materialized views:** ```sql -- MV 1: K-1 Summary by Partnership/Year CREATE MATERIALIZED VIEW mv_k1_partnership_year_summary AS SELECT kd.partnership_id, kd.tax_year, li.box_key, bd.label, bd.section, SUM(li.amount) AS total_amount, COUNT(*) AS line_count, kd.filing_status FROM k1_line_item li JOIN k_document kd ON li.k_document_id = kd.id JOIN k1_box_definition bd ON li.box_key = bd.box_key GROUP BY kd.partnership_id, kd.tax_year, li.box_key, bd.label, bd.section, kd.filing_status; -- MV 2: Entity-level Income Aggregation CREATE MATERIALIZED VIEW mv_entity_income_summary AS SELECT e.id AS entity_id, e.name AS entity_name, kd.tax_year, li.box_key, SUM(li.amount * pm.ownership_percent / 100) AS allocated_amount FROM k1_line_item li JOIN k_document kd ON li.k_document_id = kd.id JOIN partnership_membership pm ON pm.partnership_id = kd.partnership_id JOIN entity e ON pm.entity_id = e.id WHERE pm.effective_date <= make_date(kd.tax_year, 12, 31) AND (pm.end_date IS NULL OR pm.end_date > make_date(kd.tax_year, 12, 31)) GROUP BY e.id, e.name, kd.tax_year, li.box_key; ``` **Refresh strategy:** - **Trigger-based refresh**: After any KDocument insert/update/delete or status change to FINAL, refresh affected materialized views. In NestJS, this is a `@OnEvent('k-document.changed')` handler that calls `REFRESH MATERIALIZED VIEW CONCURRENTLY`. - **`CONCURRENTLY` keyword**: Allows reads during refresh (requires a unique index on the MV). Essential for a multi-user system. - **Frequency**: For a family office with <100 K-1s updated per year, refresh takes <1 second. No scheduling needed — event-driven refresh is sufficient. **Why not denormalized reporting tables:** Denormalized tables (duplicating data into a flat reporting structure) require write-time consistency management — every KDocument change must update the reporting table transactionally. This is the pattern used in high-write OLTP systems, but K-1 data is low-write (<100 writes/year) and high-read (dashboards queried many times). Materialized views handle this perfectly with zero application-level sync logic. **Why not computed/generated columns:** PostgreSQL generated columns cannot reference other tables. Since aggregations span KDocument → K1LineItem → Partnership → Entity, generated columns are structurally insufficient. ### Alternatives Considered | Alternative | Pros | Cons | |---|---|---| | **Application-level caching** (Redis/in-memory) | No DB schema changes | Cache invalidation complexity; doesn't help SQL-based analytics | | **Denormalized reporting tables** | Fastest reads; works at any scale | Write-time maintenance burden; consistency bugs; overkill for <10K rows | | **Regular views** (not materialized) | Always fresh; no refresh needed | Recomputed on every query; slow for cross-entity dashboards | | **Materialized views** (chosen) | Pre-computed; concurrent reads; event-driven refresh; zero application-level sync | Slight staleness (mitigated by event-driven refresh); requires unique indexes for CONCURRENTLY | --- ## Topic 6: Migration Strategy from JSON Blob to Normalized Tables ### Decision **Phase the migration in 3 steps**: (1) Create new tables alongside existing JSON, (2) Dual-write to both during a transition period, (3) Make normalized tables authoritative. **Keep the JSON blob immutable as an archive** — never delete it. ### Rationale **Step 1: Additive schema changes (zero breaking changes)** ``` Migration 1: Create K1BoxDefinition table, seed with IRS default box definitions Migration 2: Create K1LineItem table with FK to KDocument and K1BoxDefinition Migration 3: Backfill K1LineItem from existing KDocument.data JSON blobs ``` The backfill migration for Step 3: ```sql -- Pseudocode: For each KDocument, iterate JSON keys and insert K1LineItems INSERT INTO k1_line_item (id, k_document_id, box_key, amount, created_at, updated_at) SELECT gen_random_uuid(), kd.id, je.key, (je.value)::decimal, kd.created_at, NOW() FROM k_document kd, jsonb_each(kd.data::jsonb) AS je(key, value) WHERE jsonb_typeof(je.value) = 'number'; ``` **Step 2: Dual-write transition period** During the transition: - `k1-import.service.ts` `confirmImport()` writes to **both** `KDocument.data` (JSON) and `K1LineItem` (rows) - Read operations gradually migrate from JSON-based to K1LineItem-based - `k1-aggregation.service.ts` switches from JSON iteration to `SELECT SUM` on K1LineItem - Run validation queries comparing JSON-derived totals to K1LineItem-derived totals **Step 3: K1LineItem becomes authoritative** - New features (dashboards, tax planning, LLM queries) read only from K1LineItem - `KDocument.data` is retained as immutable archive but no longer written to for new documents - `CellAggregationRule.sourceCells` continues to work — the boxKey values are the same strings - `CellMapping` evolves into or is replaced by `K1BoxDefinition` **Should the old JSON be kept immutable?** **Yes, permanently.** Reasons: 1. **Audit requirement** — The JSON blob is the original imported representation. Regulatory and audit standards require preserving source data in its original form. 2. **Rollback safety** — If the migration has bugs, the JSON blob is the recovery source. 3. **Storage is trivial** — A JSON blob with ~30 key-value pairs is <1 KB. Even 1,000 KDocuments = <1 MB total. There's no storage pressure to delete it. 4. **Import session already preserves extraction** — `K1ImportSession.rawExtraction` holds the pre-verification extraction. `KDocument.data` holds the post-verification snapshot. Both should survive indefinitely. **Backward compatibility considerations:** - The `KDocument.data` column type stays `Json` (not nullable, not removed) - The existing `k-document-form.component.ts` UI reads from `KDocument.data` — it continues to work during transition - The `computeForKDocument()` aggregation service works against JSON through the transition, then switches to K1LineItem queries - No existing API contracts change — `GET /k-documents/:id` returns the same shape **Handling the CellMapping → K1BoxDefinition transition:** The existing `CellMapping` table (per-partnership box definitions) maps closely to the proposed `K1BoxDefinition`. The migration strategy: - `K1BoxDefinition` absorbs the global (partnershipId = null) CellMapping records - Per-partnership CellMapping overrides become per-partnership `K1BoxDefinition` rows (or remain as display-layer configuration separate from the data model) - `CellMapping` fields like `isIgnored`, `isCustom` are presentation concerns that may not belong on the data-layer `K1BoxDefinition` ### Alternatives Considered | Alternative | Pros | Cons | |---|---|---| | **Big-bang migration** (drop JSON, create tables, migrate in one step) | Clean; no dual-write complexity | Risk of data loss; requires full feature freeze; hard to validate | | **Dual-write indefinitely** | Maximum safety | Permanent write overhead; divergence risk between JSON and rows | | **Keep JSON as authoritative, add views** | No migration of writes | Doesn't solve the core query limitation; views over JSONB are slow | | **Phased migration with immutable archive** (chosen) | Zero-downtime; incremental validation; rollback possible; preserves audit trail | Dual-write period adds complexity (bounded to weeks, not permanent) | --- ## Topic 7: Schema Design for Future LLM NL-to-SQL ### Decision **Design tables with self-documenting names, add PostgreSQL `COMMENT ON` annotations for every table and column, use consistent naming conventions, and avoid ambiguity between similarly-named entities.** ### Rationale LLMs generating SQL (via text-to-SQL or NL-to-SQL) work by receiving the schema as context and mapping natural language to table/column references. The schema itself is the prompt. Research from the Spider benchmark (Yale), BIRD benchmark, and production NL-to-SQL systems (e.g., Vanna.ai, DataHerald) identifies these factors as most impactful: **1. Naming conventions that LLMs parse correctly:** | Current Name | Problem | Proposed Name | Why Better | |---|---|---|---| | `KDocument` | "K" is ambiguous to LLMs | `k1_document` | Explicitly says "K-1" | | `KDocument.data` | "data" is the most generic possible name | `k1_document.raw_data_json` | Describes what it holds | | `K1LineItem.amount` | Could be confused with Distribution.amount | `k1_line_item.reported_amount` | Disambiguates | | `CellMapping` | "Cell" is a spreadsheet term, not a tax term | `k1_box_definition` | Domain-specific | | `CellAggregationRule` | LLMs may not connect "cell" to K-1 boxes | `k1_aggregation_rule` | Clearer context | **Naming conventions to adopt:** - `snake_case` for all table and column names (PostgreSQL convention; LLMs trained on more snake_case SQL than camelCase) - Prefix K-1-specific tables with `k1_` to create a namespace - Use `_id` suffix for all foreign keys - Avoid abbreviations (`partnership_id` not `ptnr_id`) - Use `_at` suffix for timestamps (`created_at`, `updated_at`) - Use descriptive names over short names (`tax_year` not `yr`, `filing_status` not `status`) **2. PostgreSQL COMMENT annotations:** ```sql COMMENT ON TABLE k1_line_item IS 'Individual financial line item from an IRS Schedule K-1 (Form 1065). One row per box number per K-1 document.'; COMMENT ON COLUMN k1_line_item.box_key IS 'IRS K-1 box identifier such as "1" for ordinary income, "9a" for long-term capital gains, or "20-A" for other information code A.'; COMMENT ON COLUMN k1_line_item.reported_amount IS 'Dollar amount reported on this K-1 line item, in the partnership base currency. Negative values represent losses.'; COMMENT ON TABLE k1_box_definition IS 'Reference table of IRS Schedule K-1 box definitions. Maps box identifiers to human-readable labels and categories.'; ``` LLM NL-to-SQL systems extract these comments as schema context. A model asked "what is total ordinary income?" can map "ordinary income" → `k1_box_definition.label = 'Ordinary business income (loss)'` → `box_key = '1'` → join to `k1_line_item`. **3. Avoiding ambiguity:** Current pain points for LLM-generated SQL: - `Distribution.amount` vs `K1LineItem.amount` — an LLM asked "total distributions" might query the wrong table. Solution: `k1_line_item.reported_amount` vs `distribution.distribution_amount`. - `Partnership` has `distributions`, `kDocuments`, `valuations` — naming all FK columns `partnership_id` is correct and expected by LLMs. - `Entity` is overloaded (database entities, legal entities). The table comment must clarify: "A legal person or structure (trust, LLC, individual) that owns assets and receives K-1 allocations." **4. Schema metadata table for LLM context:** Consider a lightweight `schema_metadata` table or a markdown document that provides the LLM with: - Table relationships in natural language - Common query patterns with examples - Business rules ("Box 19a distributions are allocated to entities by ownership percentage") - Valid values for enum columns This is cheaper than fine-tuning and more maintainable than few-shot prompts. **5. Avoid patterns that confuse LLMs:** | Anti-pattern | Why It Confuses LLMs | Alternative | |---|---|---| | JSON columns for queryable data | LLMs generate `->` / `->>` operators inconsistently | Normalized columns | | Composite primary keys | LLMs often forget one part of the key in JOINs | Surrogate UUID PK + unique constraint | | Polymorphic FKs (one FK, multiple target tables) | LLMs can't determine which table to JOIN | Separate FK columns | | Generic column names (`type`, `status`, `data`, `value`) | Ambiguous across tables | Prefix with table context (`filing_status`, `box_data_type`) | | Soft deletes (`is_deleted`) | LLMs forget the `WHERE is_deleted = false` filter | Use `end_date IS NULL` pattern (already in use for memberships) | ### Alternatives Considered | Alternative | Pros | Cons | |---|---|---| | **No schema changes for LLM** | No work | LLM accuracy drops significantly with ambiguous/generic names; JSONB columns are nearly unusable for NL-to-SQL | | **Fine-tune LLM on this schema** | Can handle any naming convention | Expensive; needs retraining on every schema change; vendor lock-in | | **RAG over schema docs** | Flexible; schema-aware | Still limited by underlying schema quality; garbage-in-garbage-out | | **Self-documenting schema + COMMENT annotations** (chosen) | Works with any LLM; zero runtime cost; maintainable; improves human readability too | Requires discipline to maintain comments on schema changes | --- ## Summary of Decisions | # | Topic | Decision | |---|---|---| | 1 | Wide vs Normalized | Normalized fact table for Part III financial data; JSON retained for Part I/II metadata | | 2 | EAV vs Normalized | Hybrid: typed EAV fact table (`K1LineItem`) with reference dimension (`K1BoxDefinition`); uniform `DECIMAL` value type avoids classic EAV pitfalls | | 3 | Financial fact tables | Star-schema-inspired design with `K1LineItem` as fact, `KDocument`/`Partnership`/`Entity` as dimensions | | 4 | Source traceability | Per-line-item provenance (page, coordinates, confidence, raw text, user-edit flag); K1ImportSession.rawExtraction as immutable full extraction archive | | 5 | Materialized views | Event-driven materialized views for cross-entity dashboards; regular queries for single-document access | | 6 | Migration strategy | 3-phase: additive tables → dual-write → K1LineItem authoritative; JSON blob kept immutable forever | | 7 | LLM NL-to-SQL | Self-documenting `snake_case` names, `COMMENT ON` annotations, disambiguation of similar columns, `k1_` table prefix namespace |