Research: Normalized Relational Model for K-1 Financial Data

Phase 0 Output | Date: 2026-03-20 | Research Only — No Code


Context

The current system stores K-1 box data as a flat JSON blob on KDocument.data:

{"1": 50000, "9a": -1200, "11-ZZ*": 500, "20-A": 1200}

Aggregations are computed on-the-fly in k1-aggregation.service.ts by iterating JSON keys. CellMapping provides label metadata, and CellAggregationRule defines which box keys to SUM. The system currently has ~80+ possible K-1 fields (boxes 1–21 with subtypes, Sections J/K/L/M/N, metadata fields like A–I).

The goal is to evaluate whether and how to transform this into a normalized relational model.


Topic 1: Wide vs Normalized Financial Data Models

Decision

Move to a normalized fact table (K1LineItem) for Part III financial data (boxes 1–21), but keep a JSON metadata column for Part I/II identity fields (A–I, J–N) that are queried infrequently.

Rationale

The current JSON blob approach has these specific weaknesses for analytics:

Query limitations observed in this codebase:

  1. No SQL-level filtering or aggregation — The computeForKDocument() method in k1-aggregation.service.ts must fetch the entire KDocument row, deserialize JSON, and loop through Object.entries(data) in application code. This means you cannot write SELECT SUM(amount) FROM ... WHERE box_number = '1' AND tax_year BETWEEN 2020 AND 2025 — every aggregation requires fetching and deserializing all rows.
  2. No practical indexes on values: GIN indexes on JSONB support containment (@>) and key-existence operators, but they don't help with >, <, or BETWEEN on numeric values within the JSON. Range queries would need a separate B-tree expression index per key, such as ((data->>'1')::numeric), which does not scale to 80+ keys.
  3. No referential integrity — A typo like "9A" vs "9a" silently creates bad data. The current CellMapping table defines valid box numbers, but nothing enforces that KDocument.data keys match them.
  4. Cross-document aggregation is O(n) deserialization — To compute "total ordinary income (Box 1) across all partnerships for 2025," every KDocument row matching the year must be fetched and parsed. With 50+ partnerships × 5 years, this is 250+ JSON deserializations for one number.
  5. No partial update tracking — When a KDocument transitions from ESTIMATED → FINAL, the entire JSON blob is replaced. previousData preserves the old blob but provides no field-level diff.
  6. Schema evolution is invisible — If the IRS adds a Box 6d in 2027, there's no migration — it just appears as a new JSON key. This sounds convenient but means no validation, no type checking, and no discoverability for future NL-to-SQL.
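
To make weaknesses 1, 3, and 4 concrete, the following is a minimal TypeScript sketch of application-side aggregation over JSON blobs. The KDocumentRow shape and the sumBoxAcrossDocuments helper are hypothetical stand-ins for illustration, not the actual code in k1-aggregation.service.ts.

```typescript
// Hypothetical row shape: `data` is the flat box-key -> amount blob.
interface KDocumentRow {
  id: string;
  taxYear: number;
  data: Record<string, unknown>;
}

// Sum one box across documents in application code.
// Every row has to be loaded and inspected, even to produce one number.
function sumBoxAcrossDocuments(
  rows: KDocumentRow[],
  boxKey: string,
  fromYear: number,
  toYear: number,
): number {
  let total = 0;
  for (const row of rows) {
    if (row.taxYear < fromYear || row.taxYear > toYear) continue;
    const value = row.data[boxKey];
    if (typeof value === "number") total += value;
  }
  return total;
}

const sampleRows: KDocumentRow[] = [
  { id: "a", taxYear: 2024, data: { "1": 50000, "9a": -1200 } },
  // "9A" is a typo for "9a"; nothing rejects it at write time.
  { id: "b", taxYear: 2025, data: { "1": 30000, "9A": 700 } },
];

const box1Total = sumBoxAcrossDocuments(sampleRows, "1", 2020, 2025);
const box9aTotal = sumBoxAcrossDocuments(sampleRows, "9a", 2020, 2025);
```

In this sample the mistyped "9A" value never reaches the "9a" total, and no error is raised anywhere. That silent data loss is what a foreign key on box keys is meant to prevent.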

When the wide/JSON model is acceptable:

  • Archival storage of the complete raw extraction (already served by K1ImportSession.rawExtraction)
  • Rarely-queried metadata fields (Part I/II: partnership name, EIN, addresses)
  • Configurations and user preferences (already used for Settings.settings)
  • Fewer than ~10 documents with no cross-document queries needed

When it breaks down (the current situation):

  • Cross-entity/cross-year aggregation (core family office use case)
  • Performance analytics over time (partnership returns by year)
  • Tax planning queries ("show me all partnerships with Section 1231 losses > $10K")
  • Audit trail at field granularity
  • LLM-generated SQL queries (LLMs cannot reliably generate JSONB path expressions)

Alternatives Considered

Alternative | Pros | Cons
Keep JSON blob (status quo) | No migration, flexible schema | All query limitations above; blocks analytics roadmap
JSONB with generated columns | No schema change for K-1 fields; PostgreSQL 12+ supports GENERATED ALWAYS AS ((data->>'1')::numeric) STORED | Max ~30 generated columns practical; doesn't scale to 80+ fields; still no FK integrity
Wide table with 80+ columns | Simple queries, strong typing | Extremely sparse (most K-1s populate ~20 of 80+ boxes); ALTER TABLE for every IRS form change; NULL-heavy
Normalized fact table (chosen) | SQL aggregation, indexes, FK integrity, LLM-friendly, field-level audit trail | More JOINs; migration effort; slightly more complex insert logic

Topic 2: EAV vs Normalized Tables for Tax Document Fields

Decision

Use a hybrid approach: a single EAV-style fact table (K1LineItem) for all Part III financial line items, combined with a reference/dimension table (K1BoxDefinition) that provides metadata, typing, and validation rules. Keep Part I/II identity metadata as structured JSON on the KDocument.

This is technically EAV but with strong constraints — it's closer to a typed fact table pattern than classic unconstrained EAV.

Rationale

Why EAV is appropriate here (and usually isn't):

Classic EAV fails because it loses type safety, makes queries verbose, and resists validation. K-1 data avoids these pitfalls because:

  1. Uniform value type — All Part III financial values (boxes 1–21) are Decimal amounts. Unlike generic EAV where attributes might be strings, dates, booleans, or blobs, K-1 line items are uniformly monetary amounts with a known currency. This eliminates the "value_string / value_number / value_date" anti-pattern.

  2. Closed attribute set — The IRS defines ~50 Part III line items. This is not open-ended. The K1BoxDefinition reference table enumerates all valid attributes, so there's no unbounded attribute sprawl.

  3. Natural query pattern — The primary queries are aggregations across one attribute dimension: SUM(amount) WHERE box_key = '1'. This is exactly what EAV is good at — pivot-style aggregation across a known set of attributes.

  4. Sparse data — A typical K-1 populates 15–25 of ~50 possible line items. A wide table would be 50–70% NULL. The EAV/fact table stores only populated fields, which is both space-efficient and semantically clearer.

Proposed structure (conceptual):

K1BoxDefinition (reference/dimension table)
├── boxKey        VARCHAR PK   -- "1", "9a", "11-ZZ*", "20-A"
├── label         VARCHAR      -- "Ordinary business income (loss)"
├── section       VARCHAR      -- "PART_III", "PART_I", "SECTION_J"
├── dataType      VARCHAR      -- "CURRENCY", "PERCENTAGE", "BOOLEAN", "TEXT"
├── sortOrder     INT
├── irsFormLine   VARCHAR      -- "Box 1", "Box 9a", "Section J, Line 1"
└── description   TEXT

K1LineItem (fact table — one row per box per KDocument)
├── id              UUID PK
├── kDocumentId     UUID FK → KDocument.id
├── boxKey          VARCHAR FK → K1BoxDefinition.boxKey
├── amount          DECIMAL(15,2)    -- financial value (null for non-monetary)
├── textValue       VARCHAR          -- for text/boolean fields if needed
├── sourceConfidence DECIMAL(3,2)    -- 0.00–1.00, from extraction
├── sourcePageNumber INT             -- PDF page where extracted
├── sourceCoordinates JSON           -- {x, y, width, height} on the page
├── rawExtractedText VARCHAR         -- raw text before parsing (see Topic 4)
├── isUserEdited     BOOLEAN         -- true if user modified during verification
├── createdAt       TIMESTAMP
├── updatedAt       TIMESTAMP
└── @@unique([kDocumentId, boxKey])
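
As a rough illustration of how the reference table constrains the fact table, the sketch below mirrors part of the conceptual structure in TypeScript and checks a candidate line item against a set of box definitions. The validateLineItem helper and the trimmed-down interfaces are assumptions made for this example; in the database the same guarantee would come from the FK on boxKey.

```typescript
type BoxDataType = "CURRENCY" | "PERCENTAGE" | "BOOLEAN" | "TEXT";

// Trimmed-down, hypothetical versions of the proposed tables.
interface BoxDefinition {
  boxKey: string;
  label: string;
  dataType: BoxDataType;
}

interface LineItemInput {
  kDocumentId: string;
  boxKey: string;
  amount: number | null;
  textValue: string | null;
}

// Returns a list of problems; an empty list means the row is acceptable.
function validateLineItem(
  item: LineItemInput,
  definitions: Map<string, BoxDefinition>,
): string[] {
  const definition = definitions.get(item.boxKey);
  if (definition === undefined) {
    return [`Unknown box key "${item.boxKey}"`];
  }
  const problems: string[] = [];
  if (definition.dataType === "CURRENCY") {
    if (item.amount === null || !Number.isFinite(item.amount)) {
      problems.push(`Box ${item.boxKey} requires a numeric amount`);
    }
  } else if (definition.dataType === "TEXT" || definition.dataType === "BOOLEAN") {
    if (item.amount !== null) {
      problems.push(`Box ${item.boxKey} is non-monetary and should use textValue`);
    }
  }
  return problems;
}

const boxDefinitions = new Map<string, BoxDefinition>([
  ["1", { boxKey: "1", label: "Ordinary business income (loss)", dataType: "CURRENCY" }],
  ["9a", { boxKey: "9a", label: "Net long-term capital gain (loss)", dataType: "CURRENCY" }],
]);

const typoProblems = validateLineItem(
  { kDocumentId: "doc-1", boxKey: "9A", amount: 700, textValue: null },
  boxDefinitions,
);
const validProblems = validateLineItem(
  { kDocumentId: "doc-1", boxKey: "9a", amount: -1200, textValue: null },
  boxDefinitions,
);
```

Here the "9A" typo is rejected up front instead of silently becoming a new key, which is the main behavioral difference from the JSON blob.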

Why not separate normalized tables for each box category:

An alternative is dedicated tables: K1IncomeItems, K1DeductionItems, K1CreditItems, K1CapitalAccount, etc. This was rejected because:

  • K-1 boxes don't cleanly partition into fixed categories (Box 11 "Other income" spans multiple categories via sub-codes)
  • Sub-code boxes (11-A through 11-ZZ*, 13-A through 13-ZZ*, 20-A through 20-ZZ*) have partnership-specific meaning — the same structural pattern repeats across boxes
  • It would require 6–8 tables with identical column shapes, making queries harder, not easier
  • The K1BoxDefinition reference table provides the categorical metadata without needing separate physical tables

Treatment of Part I/II metadata fields:

Fields like Partnership EIN (Box A), Partner name (Box F), Section J percentages, and Section L capital account data are better stored as structured JSON on KDocument in a metadata column because:

  • They're queried for display, not for aggregation
  • They have heterogeneous types (strings, booleans, percentages, addresses)
  • They identify the document rather than representing financial facts
  • There are ~30 of them, and they're almost all populated (not sparse)

Alternatives Considered

Alternative | Pros | Cons
Pure EAV (no reference table) | Maximum flexibility | No validation of box keys; CellMapping already serves this role but without FK enforcement
Wide table (one column per box) | Simple SELECTs for specific boxes | 80+ columns; 50–70% NULLs; ALTER TABLE for new boxes; poor for cross-box aggregation
Separate tables per box category | Strong typing per category | 6–8 near-identical tables; complex UNION queries; sub-code boxes don't fit cleanly
Hybrid EAV + reference table (chosen) | Uniform fact table; strong FK validation; sparse-friendly; single query pattern for aggregation; field-level provenance | Pivot queries needed for "show one K-1 as a form"; slightly more complex writes

Topic 3: Financial Fact Tables for Tax Data

Decision

Model K-1 line items as a financial fact table in a star-schema-inspired design, with KDocument as the central bridge to dimension tables (Partnership, Entity, TaxYear). Monetary values stored as DECIMAL(15,2) with explicit currency.

Rationale

Financial data warehouses consistently use a fact/dimension pattern for tax line items:

Star schema mapping for K-1 data:

                    ┌──────────────┐
                    │  Partnership │  (dimension)
                    │  ──────────  │
                    │  id, name,   │
                    │  type, ein   │
                    └──────┬───────┘
                           │
┌──────────────┐    ┌──────┴───────┐    ┌──────────────────┐
│   Entity     │────│  KDocument   │────│ K1BoxDefinition  │  (dimension)
│  (dimension) │    │  (bridge)    │    │  ────────────────│
│  ──────────  │    │  ──────────  │    │  boxKey, label,  │
│  id, name,   │    │  id, taxYear,│    │  section, type   │
│  type, taxId │    │  status      │    └──────────────────┘
└──────────────┘    └──────┬───────┘
                           │
                    ┌──────┴───────┐
                    │  K1LineItem   │  (FACT)
                    │  ──────────  │
                    │  amount,     │
                    │  boxKey,     │
                    │  confidence  │
                    └──────────────┘

Best practices from financial data warehousing applied here:

  1. Additive facts only: K1LineItem.amount is fully additive: you can SUM across tax years, partnerships, entities, or box types. Non-additive data (percentages, booleans, text) is stored separately in textValue or on the KDocument metadata.

  2. Grain = one box value per K-1 document — Each row in K1LineItem represents one financial amount from one K-1 for one tax year. This is the atomic grain. Aggregation rules from CellAggregationRule operate on this grain.

  3. Slowly changing dimensions: PartnershipMembership already handles SCD Type 2 (effective dates) for ownership percentages. K1BoxDefinition is SCD Type 1 (overwritten on IRS form changes, with version tracking if needed).

  4. Conformed dimensions: Partnership and Entity serve as conformed dimensions shared between K-1 facts, Distribution facts, and Valuation facts. A single Entity dimension joins to multiple fact tables.

  5. Currency handling — Store amounts in the source currency with a currency column. The KDocument inherits currency from Partnership. Conversion to reporting currency happens at query time or in materialized views, never by mutating the fact.

  6. Decimal precision: DECIMAL(15,2) covers amounts up to $9,999,999,999,999.99. K-1 amounts from large partnerships (PE funds, hedge funds) can reach tens of millions. 15 digits provides headroom. Use 2 decimal places to match IRS reporting precision.
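
The precision point also matters once amounts leave the database. Below is a minimal sketch of summing amounts as integer cents, so that application-side totals match what SUM over a DECIMAL(15,2) column would return. It assumes amounts arrive as decimal strings; the toCents, formatCents, and sumAmounts helpers are hypothetical names for this example.

```typescript
// Convert a decimal string such as "-1200.50" to integer cents.
function toCents(amount: string): number {
  const match = /^(-?)(\d+)(?:\.(\d{1,2}))?$/.exec(amount.trim());
  if (match === null) {
    throw new Error(`Not a two-decimal amount: ${amount}`);
  }
  const sign = match[1] === "-" ? -1 : 1;
  const whole = Number(match[2]);
  const fraction = Number((match[3] ?? "").padEnd(2, "0"));
  return sign * (whole * 100 + fraction);
}

// Render integer cents back to a two-decimal string.
function formatCents(cents: number): string {
  const sign = cents < 0 ? "-" : "";
  const abs = Math.abs(cents);
  const whole = Math.floor(abs / 100);
  const fraction = String(abs % 100).padStart(2, "0");
  return `${sign}${whole}.${fraction}`;
}

// Add amounts exactly by working in cents instead of floating-point dollars.
function sumAmounts(amounts: string[]): string {
  let totalCents = 0;
  for (const amount of amounts) {
    totalCents += toCents(amount);
  }
  return formatCents(totalCents);
}

const exactTotal = sumAmounts(["0.10", "0.20", "-1200.00", "50000"]);
```

A single DECIMAL(15,2) value expressed in cents stays below Number.MAX_SAFE_INTEGER, but very large sums could exceed it. A decimal library or BigInt would be the safer choice if totals of that size are expected.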

Aggregation queries enabled by this model:

-- Total ordinary income across all partnerships for 2025
SELECT SUM(li.amount)
FROM k1_line_item li
JOIN k_document kd ON li.k_document_id = kd.id
WHERE li.box_key = '1' AND kd.tax_year = 2025;

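-- Income breakdown by entity for tax year 2025
-- (unweighted: each member entity is credited the partnership's full amount;
--  MV 2 in Topic 5 applies ownership_percent and membership dates)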
SELECT e.name, li.box_key, SUM(li.amount)
FROM k1_line_item li
JOIN k_document kd ON li.k_document_id = kd.id
JOIN partnership p ON kd.partnership_id = p.id
JOIN partnership_membership pm ON pm.partnership_id = p.id
JOIN entity e ON pm.entity_id = e.id
WHERE kd.tax_year = 2025
GROUP BY e.name, li.box_key;

-- Partnership performance: Box 1 over time
SELECT kd.tax_year, p.name, li.amount
FROM k1_line_item li
JOIN k_document kd ON li.k_document_id = kd.id
JOIN partnership p ON kd.partnership_id = p.id
WHERE li.box_key = '1'
ORDER BY kd.tax_year;

With the current JSON blob model these queries are impractical: each would need per-key JSONB extraction and casting in SQL, or row-by-row iteration in application code.

Alternatives Considered

Alternative | Pros | Cons
Snowflake schema (more normalization) | Normalized box categories into sub-dimensions | Over-normalized for ~50 box types; extra JOINs for no benefit
Flat denormalized reporting table | Fastest reads; no JOINs | Write complexity; data duplication; hard to keep consistent
OLAP cube / column store | Best aggregation performance | Overkill for <10K rows; adds infrastructure complexity
Star-schema-inspired fact table (chosen) | Natural fit for K-1 aggregation queries; leverages existing dimensions; PostgreSQL handles this scale trivially | Requires JOINs for full context (acceptable)

Topic 4: Source Traceability in Financial Systems

Decision

Store extraction provenance at the line-item grain — each K1LineItem records the source page number, bounding-box coordinates, raw extracted text, confidence score, and whether it was user-edited. The K1ImportSession retains the complete raw extraction as an immutable JSON snapshot.

Rationale

The audit trail must support this flow:

Displayed aggregated number
  → K1LineItem (individual box value)
    → KDocument (which K-1, which year, which partnership)
      → K1ImportSession (extraction record)
        → Document (source PDF file)
          → Specific page + coordinates on that page
            → Raw extracted text before parsing
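
The first hop of this chain (aggregated number to contributing line items) can be sketched in TypeScript as follows. The TracedLineItem shape, the explainAggregate helper, and the review threshold are assumptions for illustration; the field names follow the proposed K1LineItem columns.

```typescript
// Hypothetical subset of K1LineItem fields needed for a drill-down.
interface TracedLineItem {
  lineItemId: string;
  kDocumentId: string;
  boxKey: string;
  amount: number;
  sourcePageNumber: number | null;
  sourceConfidence: number | null; // 0.0 to 1.0
  isUserEdited: boolean;
}

interface AggregateExplanation {
  total: number;
  contributors: TracedLineItem[];
  needsReview: TracedLineItem[];
}

// Answers "where does this total come from?" for a set of source box keys.
function explainAggregate(
  items: TracedLineItem[],
  sourceBoxKeys: string[],
  reviewThreshold: number,
): AggregateExplanation {
  const contributors = items.filter(
    (item) => sourceBoxKeys.indexOf(item.boxKey) !== -1,
  );
  const total = contributors.reduce((sum, item) => sum + item.amount, 0);
  // Machine-extracted values below the threshold are flagged for audit.
  const needsReview = contributors.filter(
    (item) =>
      !item.isUserEdited &&
      item.sourceConfidence !== null &&
      item.sourceConfidence < reviewThreshold,
  );
  return { total, contributors, needsReview };
}

const tracedItems: TracedLineItem[] = [
  { lineItemId: "li-1", kDocumentId: "doc-a", boxKey: "1", amount: 50000, sourcePageNumber: 1, sourceConfidence: 0.98, isUserEdited: false },
  { lineItemId: "li-2", kDocumentId: "doc-b", boxKey: "1", amount: 30000, sourcePageNumber: 2, sourceConfidence: 0.61, isUserEdited: false },
  { lineItemId: "li-3", kDocumentId: "doc-b", boxKey: "9a", amount: -1200, sourcePageNumber: 2, sourceConfidence: 0.55, isUserEdited: true },
];

const explanation = explainAggregate(tracedItems, ["1"], 0.8);
```

Each contributor carries its kDocumentId and page number, so the remaining hops to the import session and the source PDF are plain key lookups.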

Granularity levels and what to store where:

Level | Table | Fields | Purpose
Aggregation | Computed at query time | SUM/formula from CellAggregationRule | "Where does this total come from?" → list of K1LineItems
Line item | K1LineItem | amount, boxKey, sourceConfidence, sourcePageNumber, sourceCoordinates, rawExtractedText, isUserEdited | "What exactly was extracted and from where?"
Document | K1ImportSession | rawExtraction (full JSON), extractionMethod, fileName | "What did the system originally see?" (immutable after extraction)
File | Document | filePath, fileSize, mimeType | "Where is the original PDF?"

Key design principles:

  1. Immutability of raw extraction: K1ImportSession.rawExtraction is written once at extraction time and never modified. verifiedData captures user edits. This provides a complete before/after audit trail.

  2. Coordinate-level provenance — Current k1-positions-dump.txt shows the parser already extracts x, y coordinates for each text element. Storing sourceCoordinates: {x, y, width, height} on each K1LineItem enables a future "click to highlight in PDF" feature.

  3. Confidence as first-class data — The system already computes confidence scores (0.0–1.0) during extraction. Persisting this on the line item (not just in the import session JSON) enables queries like "show me all low-confidence values across all partnerships" and supports audit prioritization.

  4. User edit tracking: isUserEdited: boolean distinguishes machine-extracted values from human-verified overrides. This is critical for audit and for training future extraction models.

  5. No deletion of source data — When a KDocument transitions from ESTIMATED → FINAL, the old line items should be soft-versioned (via KDocument.previousData or a separate version table), not deleted.
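
For principle 5, one way to turn an ESTIMATED → FINAL transition into a field-level record is to diff the old and new box values. The sketch below assumes both versions are available as flat box-key maps; diffBoxValues and the FieldChange shape are hypothetical.

```typescript
type BoxValues = Record<string, number>;

interface FieldChange {
  boxKey: string;
  before: number | null; // null when the box did not exist before
  after: number | null;  // null when the box was removed
}

function valueOrNull(values: BoxValues, boxKey: string): number | null {
  const value: number | undefined = Object.prototype.hasOwnProperty.call(values, boxKey)
    ? values[boxKey]
    : undefined;
  return value === undefined ? null : value;
}

// List every box whose value was added, removed, or changed.
function diffBoxValues(previous: BoxValues, current: BoxValues): FieldChange[] {
  const keys = Array.from(
    new Set([...Object.keys(previous), ...Object.keys(current)]),
  ).sort();
  const changes: FieldChange[] = [];
  for (const boxKey of keys) {
    const before = valueOrNull(previous, boxKey);
    const after = valueOrNull(current, boxKey);
    if (before !== after) {
      changes.push({ boxKey, before, after });
    }
  }
  return changes;
}

const estimated: BoxValues = { "1": 48000, "9a": -1200, "20-A": 1200 };
const finalValues: BoxValues = { "1": 50000, "9a": -1200, "11-ZZ*": 500 };
const fieldChanges = diffBoxValues(estimated, finalValues);
```

In this example Box 1 changes, "11-ZZ*" is added, "20-A" is removed, and "9a" is left out of the result because it is unchanged.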

What NOT to store at line-item level:

  • Full PDF binary (stay on Document/filesystem)
  • Complete OCR output for the entire page (stay on K1ImportSession.rawExtraction)
  • Rendering coordinates for non-K-1 text on the page (not relevant)

Alternatives Considered

Alternative | Pros | Cons
Provenance only at document level | Simpler; fewer columns | Cannot trace an individual number back to a specific location on a page
Separate provenance table (K1LineItemProvenance) | Clean separation of concerns | Extra JOIN for every audit query; 1:1 relationship is usually better as columns
Store full page image crops per line item | Visual proof | Massive storage; PDF coordinates + original file are sufficient for re-rendering
Provenance on line item (chosen) | Direct traceability; no extra JOINs; enables "highlight in PDF"; supports audit queries | Slightly wider rows (acceptable for <10K rows)

Topic 5: PostgreSQL Materialized Views for Financial Reporting

Decision

Use materialized views for cross-partnership/cross-year aggregation dashboards, refreshed on a schedule or triggered by KDocument changes. Use regular views for single-document or single-partnership queries. Do not use denormalized reporting tables.

Rationale

When to use each approach in this system:

Scenario | Approach | Reason
"Show Box 1–21 for one K-1" | Regular query on K1LineItem | Small result set; no aggregation; fast enough
"Total income by box for one partnership across years" | Regular SQL GROUP BY | <20 rows × <10 years = <200 rows; trivial for PostgreSQL
"Dashboard: all partnerships × all entities × 5 years" | Materialized view | Joins across several dimensions; 50 partnerships × 5 entities × 5 years × 20 boxes = 25,000 aggregated values; worth pre-computing
"Tax planning: find partnerships with specific loss patterns" | Materialized view with indexes on the filtered columns | Complex filtering across many K-1s
"YoY change in Box 1 by partnership" | Materialized view | Window functions over multiple years

Proposed materialized views:

-- MV 1: K-1 Summary by Partnership/Year
CREATE MATERIALIZED VIEW mv_k1_partnership_year_summary AS
SELECT
    kd.partnership_id,
    kd.tax_year,
    li.box_key,
    bd.label,
    bd.section,
    SUM(li.amount) AS total_amount,
    COUNT(*) AS line_count,
    kd.filing_status
FROM k1_line_item li
JOIN k_document kd ON li.k_document_id = kd.id
JOIN k1_box_definition bd ON li.box_key = bd.box_key
GROUP BY kd.partnership_id, kd.tax_year, li.box_key, bd.label, bd.section, kd.filing_status;

-- MV 2: Entity-level Income Aggregation
CREATE MATERIALIZED VIEW mv_entity_income_summary AS
SELECT
    e.id AS entity_id,
    e.name AS entity_name,
    kd.tax_year,
    li.box_key,
    SUM(li.amount * pm.ownership_percent / 100) AS allocated_amount
FROM k1_line_item li
JOIN k_document kd ON li.k_document_id = kd.id
JOIN partnership_membership pm ON pm.partnership_id = kd.partnership_id
JOIN entity e ON pm.entity_id = e.id
WHERE pm.effective_date <= make_date(kd.tax_year, 12, 31)
  AND (pm.end_date IS NULL OR pm.end_date > make_date(kd.tax_year, 12, 31))
GROUP BY e.id, e.name, kd.tax_year, li.box_key;
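
The allocation rule in MV 2 (membership active on December 31 of the tax year, amount weighted by ownership percentage) can be restated in TypeScript to make the date logic easier to check. The Membership shape and helper names are assumptions for this sketch; dates are assumed to be ISO strings, which compare correctly as text.

```typescript
interface Membership {
  entityId: string;
  partnershipId: string;
  ownershipPercent: number; // 0 to 100
  effectiveDate: string;    // ISO date, e.g. "2020-01-01"
  endDate: string | null;   // null while the membership is still active
}

// Same condition as the WHERE clause in MV 2.
function isActiveAtYearEnd(m: Membership, taxYear: number): boolean {
  const yearEnd = `${taxYear}-12-31`;
  return m.effectiveDate <= yearEnd && (m.endDate === null || m.endDate > yearEnd);
}

// Allocate one reported amount to the entities holding the partnership at year end.
function allocateToEntities(
  amount: number,
  partnershipId: string,
  taxYear: number,
  memberships: Membership[],
): Map<string, number> {
  const allocated = new Map<string, number>();
  for (const m of memberships) {
    if (m.partnershipId !== partnershipId || !isActiveAtYearEnd(m, taxYear)) continue;
    const share = (amount * m.ownershipPercent) / 100;
    allocated.set(m.entityId, (allocated.get(m.entityId) ?? 0) + share);
  }
  return allocated;
}

const memberships: Membership[] = [
  { entityId: "trust-a", partnershipId: "p1", ownershipPercent: 60, effectiveDate: "2020-01-01", endDate: null },
  { entityId: "llc-b", partnershipId: "p1", ownershipPercent: 40, effectiveDate: "2020-01-01", endDate: "2024-06-30" },
  { entityId: "llc-c", partnershipId: "p1", ownershipPercent: 40, effectiveDate: "2024-07-01", endDate: null },
];

const allocation2023 = allocateToEntities(50000, "p1", 2023, memberships);
const allocation2025 = allocateToEntities(50000, "p1", 2025, memberships);
```

Because the filter is a year-end snapshot, a membership that ends mid-year receives nothing for that year. Whether that matches the intended allocation policy is worth confirming before the view is relied on.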

Refresh strategy:

  • Trigger-based refresh: After any KDocument insert/update/delete or status change to FINAL, refresh affected materialized views. In NestJS, this is a @OnEvent('k-document.changed') handler that calls REFRESH MATERIALIZED VIEW CONCURRENTLY.
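  • CONCURRENTLY keyword: Allows reads during refresh. It requires at least one unique index on the MV that uses only column names and covers all rows, and the MV must already be populated. Essential for a multi-user system.
  • Frequency: For a family office with <100 K-1s updated per year, each refresh should complete in well under a second at this data volume, so no scheduled job is needed; event-driven refresh is sufficient.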
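
A bulk import can fire many change events in a row, so it may be worth coalescing them into a single refresh per view. The sketch below shows that idea without any framework; the MaterializedViewRefresher class and the SqlExecutor type are hypothetical, and a real handler would run the statements asynchronously through the database client.

```typescript
// Hypothetical executor: in a real service this would run SQL against PostgreSQL.
type SqlExecutor = (statement: string) => void;

class MaterializedViewRefresher {
  private readonly viewNames: string[];
  private dirty = false;

  // View names are trusted constants here; they are interpolated, not parameterized.
  constructor(viewNames: string[]) {
    this.viewNames = viewNames;
  }

  // Called from the change-event handler; cheap and idempotent.
  markChanged(): void {
    this.dirty = true;
  }

  // Issues one refresh per view if anything changed since the last flush.
  flush(execute: SqlExecutor): number {
    if (!this.dirty) return 0;
    for (const viewName of this.viewNames) {
      execute(`REFRESH MATERIALIZED VIEW CONCURRENTLY ${viewName}`);
    }
    this.dirty = false;
    return this.viewNames.length;
  }
}

const executedStatements: string[] = [];
const refresher = new MaterializedViewRefresher([
  "mv_k1_partnership_year_summary",
  "mv_entity_income_summary",
]);

// Three document changes arrive before the next flush.
refresher.markChanged();
refresher.markChanged();
refresher.markChanged();

const firstFlushCount = refresher.flush((sql) => { executedStatements.push(sql); });
const secondFlushCount = refresher.flush((sql) => { executedStatements.push(sql); });
```

Three changes produce two refresh statements (one per view), and a flush with no pending change issues none.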

Why not denormalized reporting tables:

Denormalized tables (duplicating data into a flat reporting structure) require write-time consistency management: every KDocument change must update the reporting table transactionally. This is the pattern used in high-write OLTP systems, but K-1 data is low-write (<100 writes/year) and high-read (dashboards queried many times). Materialized views fit that profile and need no application-level sync logic beyond triggering the refresh.

Why not computed/generated columns:

PostgreSQL generated columns cannot reference other tables. Since aggregations span KDocument → K1LineItem → Partnership → Entity, generated columns are structurally insufficient.

Alternatives Considered

Alternative | Pros | Cons
Application-level caching (Redis/in-memory) | No DB schema changes | Cache invalidation complexity; doesn't help SQL-based analytics
Denormalized reporting tables | Fastest reads; works at any scale | Write-time maintenance burden; consistency bugs; overkill for <10K rows
Regular views (not materialized) | Always fresh; no refresh needed | Recomputed on every query; slow for cross-entity dashboards
Materialized views (chosen) | Pre-computed; concurrent reads; event-driven refresh; zero application-level sync | Slight staleness (mitigated by event-driven refresh); requires unique indexes for CONCURRENTLY

Topic 6: Migration Strategy from JSON Blob to Normalized Tables

Decision

Phase the migration in 3 steps: (1) Create new tables alongside existing JSON, (2) Dual-write to both during a transition period, (3) Make normalized tables authoritative. Keep the JSON blob immutable as an archive — never delete it.

Rationale

Step 1: Additive schema changes (zero breaking changes)

Migration 1: Create K1BoxDefinition table, seed with IRS default box definitions
Migration 2: Create K1LineItem table with FK to KDocument and K1BoxDefinition
Migration 3: Backfill K1LineItem from existing KDocument.data JSON blobs

The backfill for Migration 3:

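-- Backfill: one K1LineItem row per numeric JSON key on each KDocument.
-- Keys with no matching k1_box_definition row are skipped, because the FK
-- would reject them; review those keys separately.
-- gen_random_uuid() is built in from PostgreSQL 13 (pgcrypto before that).
INSERT INTO k1_line_item (id, k_document_id, box_key, amount, created_at, updated_at)
SELECT
    gen_random_uuid(),
    kd.id,
    je.key,
    (je.value)::decimal,
    kd.created_at,
    NOW()
FROM k_document kd
CROSS JOIN LATERAL jsonb_each(kd.data::jsonb) AS je(key, value)
JOIN k1_box_definition bd ON bd.box_key = je.key
WHERE jsonb_typeof(je.value) = 'number'
-- Relies on the unique (k_document_id, box_key) constraint; safe to re-run.
ON CONFLICT (k_document_id, box_key) DO NOTHING;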
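
The same transformation will be needed in application code once dual-write begins. A minimal TypeScript sketch follows, assuming the set of valid box keys has been loaded from K1BoxDefinition; blobToLineItems and its result shape are hypothetical.

```typescript
interface LineItemRow {
  kDocumentId: string;
  boxKey: string;
  amount: number;
}

interface ConversionResult {
  rows: LineItemRow[];
  skippedKeys: string[]; // non-numeric values or keys with no box definition
}

// Mirror of the SQL backfill: keep numeric values whose key is a known box.
function blobToLineItems(
  kDocumentId: string,
  data: Record<string, unknown>,
  validBoxKeys: Set<string>,
): ConversionResult {
  const rows: LineItemRow[] = [];
  const skippedKeys: string[] = [];
  for (const boxKey of Object.keys(data)) {
    const value = data[boxKey];
    if (typeof value === "number" && Number.isFinite(value) && validBoxKeys.has(boxKey)) {
      rows.push({ kDocumentId, boxKey, amount: value });
    } else {
      skippedKeys.push(boxKey);
    }
  }
  return { rows, skippedKeys };
}

const knownBoxKeys = new Set<string>(["1", "9a", "11-ZZ*", "20-A"]);
const conversion = blobToLineItems(
  "doc-1",
  { "1": 50000, "9a": -1200, "11-ZZ*": 500, "20-A": 1200, "9A": 700, note: "estimated" },
  knownBoxKeys,
);
```

Returning the skipped keys, rather than dropping them quietly, gives the migration a list of values that need a human decision.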

Step 2: Dual-write transition period

During the transition:

  • k1-import.service.ts confirmImport() writes to both KDocument.data (JSON) and K1LineItem (rows)
  • Read operations gradually migrate from JSON-based to K1LineItem-based
  • k1-aggregation.service.ts switches from JSON iteration to SELECT SUM on K1LineItem
  • Run validation queries comparing JSON-derived totals to K1LineItem-derived totals
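
The validation step in the last bullet could look roughly like the sketch below, which totals each box from both representations and reports any disagreement. The shapes and the reconcileTotals helper are assumptions; totals are compared in rounded cents to avoid floating-point noise.

```typescript
interface JsonDoc {
  id: string;
  data: Record<string, unknown>;
}

interface LineItem {
  kDocumentId: string;
  boxKey: string;
  amount: number;
}

interface Mismatch {
  boxKey: string;
  jsonCents: number;
  lineItemCents: number;
}

function addCents(totals: Map<string, number>, boxKey: string, amount: number): void {
  totals.set(boxKey, (totals.get(boxKey) ?? 0) + Math.round(amount * 100));
}

// Compare per-box totals derived from JSON blobs with totals from line-item rows.
function reconcileTotals(docs: JsonDoc[], items: LineItem[]): Mismatch[] {
  const fromJson = new Map<string, number>();
  const fromRows = new Map<string, number>();
  for (const doc of docs) {
    for (const boxKey of Object.keys(doc.data)) {
      const value = doc.data[boxKey];
      if (typeof value === "number") addCents(fromJson, boxKey, value);
    }
  }
  for (const item of items) addCents(fromRows, item.boxKey, item.amount);

  const allKeys = Array.from(
    new Set([...Array.from(fromJson.keys()), ...Array.from(fromRows.keys())]),
  ).sort();
  const mismatches: Mismatch[] = [];
  for (const boxKey of allKeys) {
    const jsonCents = fromJson.get(boxKey) ?? 0;
    const lineItemCents = fromRows.get(boxKey) ?? 0;
    if (jsonCents !== lineItemCents) {
      mismatches.push({ boxKey, jsonCents, lineItemCents });
    }
  }
  return mismatches;
}

const jsonDocs: JsonDoc[] = [
  { id: "a", data: { "1": 50000, "9a": -1200 } },
  { id: "b", data: { "1": 30000 } },
];
const migratedItems: LineItem[] = [
  { kDocumentId: "a", boxKey: "1", amount: 50000 },
  { kDocumentId: "a", boxKey: "9a", amount: -1200 },
  { kDocumentId: "b", boxKey: "1", amount: 3000 }, // deliberately wrong to show a mismatch
];

const mismatches = reconcileTotals(jsonDocs, migratedItems);
```

An empty result over the full data set would be a reasonable exit criterion for the dual-write period.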

Step 3: K1LineItem becomes authoritative

  • New features (dashboards, tax planning, LLM queries) read only from K1LineItem
  • KDocument.data is retained as immutable archive but no longer written to for new documents
  • CellAggregationRule.sourceCells continues to work — the boxKey values are the same strings
  • CellMapping evolves into or is replaced by K1BoxDefinition

Should the old JSON be kept immutable?

Yes, permanently. Reasons:

  1. Audit requirement — The JSON blob is the original imported representation. Regulatory and audit standards require preserving source data in its original form.
  2. Rollback safety — If the migration has bugs, the JSON blob is the recovery source.
  3. Storage is trivial — A JSON blob with ~30 key-value pairs is <1 KB. Even 1,000 KDocuments = <1 MB total. There's no storage pressure to delete it.
  4. Import session already preserves extraction: K1ImportSession.rawExtraction holds the pre-verification extraction. KDocument.data holds the post-verification snapshot. Both should survive indefinitely.

Backward compatibility considerations:

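  • The KDocument.data column type stays Json (not nullable, not removed); if new documents stop writing it in Step 3, the column needs a default such as {} so the not-null constraint still holds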
  • The existing k-document-form.component.ts UI reads from KDocument.data — it continues to work during transition
  • The computeForKDocument() aggregation service works against JSON through the transition, then switches to K1LineItem queries
  • No existing API contracts change — GET /k-documents/:id returns the same shape
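
Keeping the API shape stable after the switch implies rebuilding the flat box map from line items at read time. A minimal sketch of that pivot, with a hypothetical lineItemsToBlob helper:

```typescript
interface StoredLineItem {
  boxKey: string;
  amount: number | null; // null for non-monetary rows
}

// Rebuild the legacy { boxKey: amount } shape from normalized rows.
function lineItemsToBlob(items: StoredLineItem[]): Record<string, number> {
  const blob: Record<string, number> = {};
  for (const item of items) {
    if (item.amount !== null) {
      blob[item.boxKey] = item.amount;
    }
  }
  return blob;
}

const legacyShape = lineItemsToBlob([
  { boxKey: "1", amount: 50000 },
  { boxKey: "9a", amount: -1200 },
  { boxKey: "20-A", amount: 1200 },
  { boxKey: "G", amount: null },
]);
```

Rows without an amount are left out here, on the assumption that the legacy blob only ever carried numeric box values.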

Handling the CellMapping → K1BoxDefinition transition:

The existing CellMapping table (per-partnership box definitions) maps closely to the proposed K1BoxDefinition. The migration strategy:

  • K1BoxDefinition absorbs the global (partnershipId = null) CellMapping records
  • Per-partnership CellMapping overrides become per-partnership K1BoxDefinition rows (or remain as display-layer configuration separate from the data model)
  • CellMapping fields like isIgnored, isCustom are presentation concerns that may not belong on the data-layer K1BoxDefinition

Alternatives Considered

Alternative | Pros | Cons
Big-bang migration (drop JSON, create tables, migrate in one step) | Clean; no dual-write complexity | Risk of data loss; requires full feature freeze; hard to validate
Dual-write indefinitely | Maximum safety | Permanent write overhead; divergence risk between JSON and rows
Keep JSON as authoritative, add views | No migration of writes | Doesn't solve the core query limitation; views over JSONB are slow
Phased migration with immutable archive (chosen) | Zero-downtime; incremental validation; rollback possible; preserves audit trail | Dual-write period adds complexity (bounded to weeks, not permanent)

Topic 7: Schema Design for Future LLM NL-to-SQL

Decision

Design tables with self-documenting names, add PostgreSQL COMMENT ON annotations for every table and column, use consistent naming conventions, and avoid ambiguity between similarly-named entities.

Rationale

LLMs generating SQL (via text-to-SQL or NL-to-SQL) work by receiving the schema as context and mapping natural language to table/column references. The schema itself is the prompt. Research from the Spider benchmark (Yale), BIRD benchmark, and production NL-to-SQL systems (e.g., Vanna.ai, DataHerald) identifies these factors as most impactful:

1. Naming conventions that LLMs parse correctly:

Current Name | Problem | Proposed Name | Why Better
KDocument | "K" is ambiguous to LLMs | k1_document | Explicitly says "K-1"
KDocument.data | "data" is the most generic possible name | k1_document.raw_data_json | Describes what it holds
K1LineItem.amount | Could be confused with Distribution.amount | k1_line_item.reported_amount | Disambiguates
CellMapping | "Cell" is a spreadsheet term, not a tax term | k1_box_definition | Domain-specific
CellAggregationRule | LLMs may not connect "cell" to K-1 boxes | k1_aggregation_rule | Clearer context

Naming conventions to adopt:

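  • snake_case for all table and column names (PostgreSQL folds unquoted identifiers to lowercase, so camelCase names must be double-quoted in every query; snake_case avoids that and is the more common convention in the SQL that LLMs are likely to have seen)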
  • Prefix K-1-specific tables with k1_ to create a namespace
  • Use _id suffix for all foreign keys
  • Avoid abbreviations (partnership_id not ptnr_id)
  • Use _at suffix for timestamps (created_at, updated_at)
  • Use descriptive names over short names (tax_year not yr, filing_status not status)
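
The mechanical part of these conventions, converting existing camelCase or PascalCase names, can be sketched with a small helper. toSnakeCase is a hypothetical name, and the function only handles ASCII identifiers; semantic renames such as KDocument to k1_document still have to be decided by hand.

```typescript
// Convert camelCase or PascalCase identifiers to snake_case.
function toSnakeCase(name: string): string {
  return name
    // Split an uppercase run from a following capitalized word: "KDocument" -> "K_Document"
    .replace(/([A-Z]+)([A-Z][a-z])/g, "$1_$2")
    // Split a lowercase letter or digit from a following capital: "lineItem" -> "line_Item"
    .replace(/([a-z0-9])([A-Z])/g, "$1_$2")
    .toLowerCase();
}

const convertedNames = [
  "K1LineItem",
  "kDocumentId",
  "sourcePageNumber",
  "KDocument",
].map(toSnakeCase);
```

For the names above this yields k1_line_item, k_document_id, source_page_number, and k_document.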

2. PostgreSQL COMMENT annotations:

COMMENT ON TABLE k1_line_item IS 'Individual financial line item from an IRS Schedule K-1 (Form 1065). One row per box number per K-1 document.';
COMMENT ON COLUMN k1_line_item.box_key IS 'IRS K-1 box identifier such as "1" for ordinary income, "9a" for long-term capital gains, or "20-A" for other information code A.';
COMMENT ON COLUMN k1_line_item.reported_amount IS 'Dollar amount reported on this K-1 line item, in the partnership base currency. Negative values represent losses.';
COMMENT ON TABLE k1_box_definition IS 'Reference table of IRS Schedule K-1 box definitions. Maps box identifiers to human-readable labels and categories.';

LLM NL-to-SQL systems extract these comments as schema context. A model asked "what is total ordinary income?" can map "ordinary income" → k1_box_definition.label = 'Ordinary business income (loss)' → box_key = '1' → join to k1_line_item.
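
Comments drift out of date unless they are generated from one source. One option is to keep the descriptions in code and emit the COMMENT ON statements from them, as in this sketch. The SchemaDoc shape and helper names are hypothetical, identifiers are assumed to be plain snake_case that needs no quoting, and string escaping assumes PostgreSQL's default standard_conforming_strings = on.

```typescript
interface SchemaDoc {
  table: string;
  column?: string; // omit for a table-level comment
  description: string;
}

// Escape a string as a SQL literal by doubling single quotes.
function quoteLiteral(text: string): string {
  return "'" + text.replace(/'/g, "''") + "'";
}

function commentStatement(doc: SchemaDoc): string {
  const target =
    doc.column === undefined
      ? `TABLE ${doc.table}`
      : `COLUMN ${doc.table}.${doc.column}`;
  return `COMMENT ON ${target} IS ${quoteLiteral(doc.description)};`;
}

const schemaDocs: SchemaDoc[] = [
  {
    table: "k1_line_item",
    description: "Individual financial line item from an IRS Schedule K-1 (Form 1065).",
  },
  {
    table: "k1_line_item",
    column: "reported_amount",
    description: "Partner's reported dollar amount. Negative values represent losses.",
  },
];

const commentSql = schemaDocs.map(commentStatement);
```

The same list of descriptions could also be rendered into the prompt context given to the LLM, so the database comments and the prompt cannot disagree.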

3. Avoiding ambiguity:

Current pain points for LLM-generated SQL:

  • Distribution.amount vs K1LineItem.amount — an LLM asked "total distributions" might query the wrong table. Solution: k1_line_item.reported_amount vs distribution.distribution_amount.
  • Partnership has distributions, kDocuments, valuations — naming all FK columns partnership_id is correct and expected by LLMs.
  • Entity is overloaded (database entities, legal entities). The table comment must clarify: "A legal person or structure (trust, LLC, individual) that owns assets and receives K-1 allocations."

4. Schema metadata table for LLM context:

Consider a lightweight schema_metadata table or a markdown document that provides the LLM with:

  • Table relationships in natural language
  • Common query patterns with examples
  • Business rules ("Box 19a distributions are allocated to entities by ownership percentage")
  • Valid values for enum columns

This is cheaper than fine-tuning and more maintainable than few-shot prompts.

5. Avoid patterns that confuse LLMs:

Anti-pattern | Why It Confuses LLMs | Alternative
JSON columns for queryable data | LLMs generate -> / ->> operators inconsistently | Normalized columns
Composite primary keys | LLMs often forget one part of the key in JOINs | Surrogate UUID PK + unique constraint
Polymorphic FKs (one FK, multiple target tables) | LLMs can't determine which table to JOIN | Separate FK columns
Generic column names (type, status, data, value) | Ambiguous across tables | Prefix with table context (filing_status, box_data_type)
Soft deletes (is_deleted) | LLMs forget the WHERE is_deleted = false filter | Use end_date IS NULL pattern (already in use for memberships)

Alternatives Considered

Alternative | Pros | Cons
No schema changes for LLM | No work | LLM accuracy drops significantly with ambiguous/generic names; JSONB columns are nearly unusable for NL-to-SQL
Fine-tune LLM on this schema | Can handle any naming convention | Expensive; needs retraining on every schema change; vendor lock-in
RAG over schema docs | Flexible; schema-aware | Still limited by underlying schema quality; garbage-in-garbage-out
Self-documenting schema + COMMENT annotations (chosen) | Works with any LLM; zero runtime cost; maintainable; improves human readability too | Requires discipline to maintain comments on schema changes

Summary of Decisions

# | Topic | Decision
1 | Wide vs Normalized | Normalized fact table for Part III financial data; JSON retained for Part I/II metadata
2 | EAV vs Normalized | Hybrid: typed EAV fact table (K1LineItem) with reference dimension (K1BoxDefinition); uniform DECIMAL value type avoids classic EAV pitfalls
3 | Financial fact tables | Star-schema-inspired design with K1LineItem as fact and KDocument as the bridge to the Partnership and Entity dimensions
4 | Source traceability | Per-line-item provenance (page, coordinates, confidence, raw text, user-edit flag); K1ImportSession.rawExtraction as immutable full extraction archive
5 | Materialized views | Event-driven materialized views for cross-entity dashboards; regular queries for single-document access
6 | Migration strategy | 3-phase: additive tables → dual-write → K1LineItem authoritative; JSON blob kept immutable forever
7 | LLM NL-to-SQL | Self-documenting snake_case names, COMMENT ON annotations, disambiguation of similar columns, k1_ table prefix namespace