Research: Normalized Relational Model for K-1 Financial Data
Phase 0 Output | Date: 2026-03-20 | Research Only — No Code
Context
The current system stores K-1 box data as a flat JSON blob on KDocument.data:
{"1": 50000, "9a": -1200, "11-ZZ*": 500, "20-A": 1200}
Aggregations are computed on-the-fly in k1-aggregation.service.ts by iterating JSON keys. CellMapping provides label metadata, and CellAggregationRule defines which box keys to SUM. The system currently has ~80+ possible K-1 fields (boxes 1–21 with subtypes, Sections J/K/L/M/N, metadata fields like A–I).
The goal is to evaluate whether and how to transform this into a normalized relational model.
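As a point of reference, the on-the-fly aggregation described above can be sketched as follows. This is only an illustration, not the actual service code: the AggregationRule shape and the computeAggregate name are assumptions, and only the sample blob comes from this document.

```typescript
// Shape of KDocument.data as described above: box key -> amount.
type K1BoxData = Record<string, number>;

// Hypothetical, simplified stand-in for a CellAggregationRule row.
interface AggregationRule {
  ruleName: string;
  sourceCells: string[]; // box keys to SUM
}

// Sums the listed box keys by iterating the deserialized JSON blob.
function computeAggregate(boxData: K1BoxData, rule: AggregationRule): number {
  let total = 0;
  for (const [boxKey, amount] of Object.entries(boxData)) {
    if (rule.sourceCells.indexOf(boxKey) !== -1) {
      total += amount;
    }
  }
  return total;
}

const sampleData: K1BoxData = { "1": 50000, "9a": -1200, "11-ZZ*": 500, "20-A": 1200 };
const exampleRule: AggregationRule = {
  ruleName: "Example total",
  sourceCells: ["1", "9a", "11-ZZ*"],
};
const exampleTotal = computeAggregate(sampleData, exampleRule);
console.log(exampleTotal); // 49300
```

Every total computed this way requires the whole blob in application memory, which is the cost the rest of this document tries to remove.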
Topic 1: Wide vs Normalized Financial Data Models
Decision
Move to a normalized fact table (K1LineItem) for Part III financial data (boxes 1–21), but keep a JSON metadata column for Part I/II identity fields (A–I, J–N) that are queried infrequently.
Rationale
The current JSON blob approach has these specific weaknesses for analytics:
Query limitations observed in this codebase:
- No SQL-level filtering or aggregation: The computeForKDocument() method in k1-aggregation.service.ts must fetch the entire KDocument row, deserialize JSON, and loop through Object.entries(data) in application code. A query such as SELECT SUM(amount) FROM ... WHERE box_number = '1' AND tax_year BETWEEN 2020 AND 2025 has no direct equivalent, so every aggregation requires fetching and deserializing all rows.
- No practical indexes on values: GIN indexes on JSONB support containment (@>) but do not help with >, <, or BETWEEN on numeric values inside the JSON. A B-tree expression index on a single key, such as ((data->>'1')::numeric), does support range queries, but one index is needed per key, which does not scale to 80+ fields.
- No referential integrity: A typo like "9A" vs "9a" silently creates bad data. The current CellMapping table defines valid box numbers, but nothing enforces that KDocument.data keys match them.
- Cross-document aggregation is O(n) deserialization: To compute total ordinary income (Box 1) across all partnerships, every matching KDocument row must be fetched and parsed. With 50+ partnerships × 5 years, a five-year total requires 250+ JSON deserializations for one number.
- No partial update tracking: When a KDocument transitions from ESTIMATED → FINAL, the entire JSON blob is replaced. previousData preserves the old blob but provides no field-level diff.
- Schema evolution is invisible: If the IRS adds a Box 6d in 2027, there's no migration; it just appears as a new JSON key. This sounds convenient but means no validation, no type checking, and no discoverability for future NL-to-SQL.
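The referential-integrity gap in particular is easy to demonstrate. The sketch below checks a blob's keys against a set of known box keys; the function name and the short key list are made up for illustration, and in the real system the valid keys would come from CellMapping.

```typescript
// Illustrative subset of valid box keys (would be loaded from CellMapping).
const validBoxKeys = new Set<string>(["1", "9a", "11-ZZ*", "20-A"]);

// Returns the keys of a K-1 JSON blob that match no known box definition.
function findUnknownBoxKeys(
  boxData: Record<string, number>,
  validKeys: Set<string>,
): string[] {
  return Object.keys(boxData).filter((key) => !validKeys.has(key));
}

// "9A" is a typo for "9a"; the JSON column accepts it without complaint.
const blobWithTypo: Record<string, number> = { "1": 50000, "9A": -1200 };
const unknownKeys = findUnknownBoxKeys(blobWithTypo, validBoxKeys);
console.log(unknownKeys); // [ '9A' ]
```

With the JSON model this check has to be remembered and run by application code; a foreign key on a normalized table would make the database reject the bad key.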
When the wide/JSON model is acceptable:
- Archival storage of the complete raw extraction (already served by K1ImportSession.rawExtraction)
- Rarely-queried metadata fields (Part I/II: partnership name, EIN, addresses)
- Configurations and user preferences (already used for Settings.settings)
- Fewer than ~10 documents with no cross-document queries needed
When it breaks down (the current situation):
- Cross-entity/cross-year aggregation (core family office use case)
- Performance analytics over time (partnership returns by year)
- Tax planning queries ("show me all partnerships with Section 1231 losses > $10K")
- Audit trail at field granularity
- LLM-generated SQL queries (LLMs cannot reliably generate JSONB path expressions)
Alternatives Considered
| Alternative | Pros | Cons |
|---|---|---|
| Keep JSON blob (status quo) | No migration, flexible schema | All query limitations above; blocks analytics roadmap |
| JSONB with generated columns | No schema change for K-1 fields; PostgreSQL 12+ supports GENERATED ALWAYS AS ((data->>'1')::numeric) STORED | One generated column per box key; practical for at most ~30 columns; doesn't scale to 80+ fields; still no FK integrity |
| Wide table with 80+ columns | Simple queries, strong typing | Extremely sparse (most K-1s populate ~20 of 80+ boxes); ALTER TABLE for every IRS form change; NULL-heavy |
| Normalized fact table (chosen) | SQL aggregation, indexes, FK integrity, LLM-friendly, field-level audit trail | More JOINs; migration effort; slightly more complex insert logic |
Topic 2: EAV vs Normalized Tables for Tax Document Fields
Decision
Use a hybrid approach: a single EAV-style fact table (K1LineItem) for all Part III financial line items, combined with a reference/dimension table (K1BoxDefinition) that provides metadata, typing, and validation rules. Keep Part I/II identity metadata as structured JSON on the KDocument.
This is technically EAV but with strong constraints — it's closer to a typed fact table pattern than classic unconstrained EAV.
Rationale
Why EAV is appropriate here (and usually isn't):
Classic EAV fails because it loses type safety, makes queries verbose, and resists validation. K-1 data avoids these pitfalls because:
- Uniform value type: All Part III financial values (boxes 1–21) are Decimal amounts. Unlike generic EAV where attributes might be strings, dates, booleans, or blobs, K-1 line items are uniformly monetary amounts with a known currency. This eliminates the "value_string / value_number / value_date" anti-pattern.
- Closed attribute set: The IRS defines ~50 Part III line items. This is not open-ended. The K1BoxDefinition reference table enumerates all valid attributes, so there's no unbounded attribute sprawl.
- Natural query pattern: The primary queries are aggregations across one attribute dimension: SUM(amount) WHERE box_key = '1'. This is exactly what EAV is good at: pivot-style aggregation across a known set of attributes.
- Sparse data: A typical K-1 populates 15–25 of ~50 possible line items. A wide table would be 50–70% NULL. The EAV/fact table stores only populated fields, which is both space-efficient and semantically clearer.
Proposed structure (conceptual):
K1BoxDefinition (reference/dimension table)
├── boxKey VARCHAR PK -- "1", "9a", "11-ZZ*", "20-A"
├── label VARCHAR -- "Ordinary business income (loss)"
├── section VARCHAR -- "PART_III", "PART_I", "SECTION_J"
├── dataType VARCHAR -- "CURRENCY", "PERCENTAGE", "BOOLEAN", "TEXT"
├── sortOrder INT
├── irsFormLine VARCHAR -- "Box 1", "Box 9a", "Section J, Line 1"
└── description TEXT
K1LineItem (fact table — one row per box per KDocument)
├── id UUID PK
├── kDocumentId UUID FK → KDocument.id
├── boxKey VARCHAR FK → K1BoxDefinition.boxKey
├── amount DECIMAL(15,2) -- financial value (null for non-monetary)
├── textValue VARCHAR -- for text/boolean fields if needed
├── sourceConfidence DECIMAL(3,2) -- 0.00–1.00, from extraction
├── sourcePageNumber INT -- PDF page where extracted
├── sourceCoordinates JSON -- {x, y, width, height} on the page
├── rawExtractedText VARCHAR -- text as extracted, before parsing (see Topic 4)
├── isUserEdited BOOLEAN -- true if user modified during verification
├── createdAt TIMESTAMP
├── updatedAt TIMESTAMP
└── @@unique([kDocumentId, boxKey])
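To make the structure concrete, here is a rough sketch of how a JSON blob could be expanded into typed line items, rejecting keys that have no definition. The interfaces mirror a subset of the conceptual columns above; toLineItems and ConversionResult are hypothetical names, and amount is a plain number here only for brevity (a real implementation would use a decimal type).

```typescript
type BoxDataType = "CURRENCY" | "PERCENTAGE" | "BOOLEAN" | "TEXT";

interface K1BoxDefinition {
  boxKey: string;
  label: string;
  section: string;
  dataType: BoxDataType;
}

interface K1LineItem {
  kDocumentId: string;
  boxKey: string;
  amount: number | null; // assumption: float for brevity, decimal in practice
  textValue: string | null;
  isUserEdited: boolean;
}

interface ConversionResult {
  items: K1LineItem[];
  rejectedKeys: string[]; // keys with no matching K1BoxDefinition
}

// Expands one KDocument.data blob into sparse line items (one per populated box).
function toLineItems(
  kDocumentId: string,
  boxData: Record<string, unknown>,
  definitions: Map<string, K1BoxDefinition>,
): ConversionResult {
  const result: ConversionResult = { items: [], rejectedKeys: [] };
  for (const [boxKey, value] of Object.entries(boxData)) {
    if (!definitions.has(boxKey)) {
      result.rejectedKeys.push(boxKey); // what the FK would reject
      continue;
    }
    const isNumber = typeof value === "number";
    result.items.push({
      kDocumentId,
      boxKey,
      amount: isNumber ? (value as number) : null,
      textValue: isNumber ? null : String(value),
      isUserEdited: false,
    });
  }
  return result;
}

const boxDefinitions = new Map<string, K1BoxDefinition>([
  ["1", { boxKey: "1", label: "Ordinary business income (loss)", section: "PART_III", dataType: "CURRENCY" }],
  ["9a", { boxKey: "9a", label: "Long-term capital gains", section: "PART_III", dataType: "CURRENCY" }],
  ["20-A", { boxKey: "20-A", label: "Other information, code A", section: "PART_III", dataType: "CURRENCY" }],
]);

const converted = toLineItems("kdoc-1", { "1": 50000, "9A": -1200, "20-A": 1200 }, boxDefinitions);
console.log(converted.items.length, converted.rejectedKeys); // 2 [ '9A' ]
```

Only populated boxes produce rows, which is the sparse-storage property argued for above.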
Why not separate normalized tables for each box category:
An alternative is dedicated tables: K1IncomeItems, K1DeductionItems, K1CreditItems, K1CapitalAccount, etc. This was rejected because:
- K-1 boxes don't cleanly partition into fixed categories (Box 11 "Other income" spans multiple categories via sub-codes)
- Sub-code boxes (11-A through 11-ZZ*, 13-A through 13-ZZ*, 20-A through 20-ZZ*) have partnership-specific meaning — the same structural pattern repeats across boxes
- It would require 6–8 tables with identical column shapes, making queries harder, not easier
- The K1BoxDefinition reference table provides the categorical metadata without needing separate physical tables
Treatment of Part I/II metadata fields:
Fields like Partnership EIN (Box A), Partner name (Box F), Section J percentages, and Section L capital account data are better stored as structured JSON on KDocument in a metadata column because:
- They're queried for display, not for aggregation
- They have heterogeneous types (strings, booleans, percentages, addresses)
- They identify the document rather than representing financial facts
- There are ~30 of them, and they're almost all populated (not sparse)
Alternatives Considered
| Alternative | Pros | Cons |
|---|---|---|
| Pure EAV (no reference table) | Maximum flexibility | No validation of box keys; CellMapping already serves this role but without FK enforcement |
| Wide table (one column per box) | Simple SELECTs for specific boxes | 80+ columns; 50–70% NULLs; ALTER TABLE for new boxes; poor for cross-box aggregation |
| Separate tables per box category | Strong typing per category | 6–8 near-identical tables; complex UNION queries; sub-code boxes don't fit cleanly |
| Hybrid EAV + reference table (chosen) | Uniform fact table; strong FK validation; sparse-friendly; single query pattern for aggregation; field-level provenance | Pivot queries needed for "show one K-1 as a form"; slightly more complex writes |
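The pivot cost noted for the chosen option is small in practice. A minimal sketch, assuming line items have already been loaded into memory and that a display order is supplied by the caller (pivotToForm and FormCell are illustrative names):

```typescript
interface LineItemRow {
  kDocumentId: string;
  boxKey: string;
  amount: number;
}

interface FormCell {
  boxKey: string;
  amount: number | null; // null = box not populated on this K-1
}

// Turns the sparse rows for one K-1 back into an ordered, form-like list.
function pivotToForm(
  rows: LineItemRow[],
  kDocumentId: string,
  boxOrder: string[],
): FormCell[] {
  const amountsByBox = new Map<string, number>();
  for (const row of rows) {
    if (row.kDocumentId === kDocumentId) {
      amountsByBox.set(row.boxKey, row.amount);
    }
  }
  return boxOrder.map((boxKey) => {
    const amount = amountsByBox.get(boxKey);
    return { boxKey, amount: amount === undefined ? null : amount };
  });
}

const lineItemRows: LineItemRow[] = [
  { kDocumentId: "kdoc-1", boxKey: "1", amount: 50000 },
  { kDocumentId: "kdoc-1", boxKey: "9a", amount: -1200 },
  { kDocumentId: "kdoc-2", boxKey: "1", amount: 7000 },
];
const formView = pivotToForm(lineItemRows, "kdoc-1", ["1", "2", "9a"]);
console.log(formView);
```

In the proposed model the box order would come from K1BoxDefinition.sortOrder rather than a hard-coded array.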
Topic 3: Financial Fact Tables for Tax Data
Decision
Model K-1 line items as a financial fact table in a star-schema-inspired design, with KDocument as the central bridge to dimension tables (Partnership, Entity, TaxYear). Monetary values stored as DECIMAL(15,2) with explicit currency.
Rationale
Financial data warehouses consistently use a fact/dimension pattern for tax line items:
Star schema mapping for K-1 data:
                    ┌──────────────┐
                    │ Partnership  │  (dimension)
                    │ ──────────   │
                    │ id, name,    │
                    │ type, ein    │
                    └──────┬───────┘
                           │
┌──────────────┐    ┌──────┴───────┐    ┌──────────────────┐
│ Entity       │────│ KDocument    │────│ K1BoxDefinition  │  (dimension)
│ (dimension)  │    │ (bridge)     │    │ ──────────────── │
│ ──────────   │    │ ──────────   │    │ boxKey, label,   │
│ id, name,    │    │ id, taxYear, │    │ section, type    │
│ type, taxId  │    │ status       │    └──────────────────┘
└──────────────┘    └──────┬───────┘
                           │
                    ┌──────┴───────┐
                    │ K1LineItem   │  (FACT)
                    │ ──────────   │
                    │ amount,      │
                    │ boxKey,      │
                    │ confidence   │
                    └──────────────┘
Best practices from financial data warehousing applied here:
- Additive facts only: K1LineItem.amount is additive across tax years, partnerships, and entities for a given box. Sums across different boxes are meaningful only where a CellAggregationRule defines them. Non-additive data (percentages, booleans, text) is stored separately in textValue or on the KDocument metadata.
- Grain = one box value per K-1 document: Each row in K1LineItem represents one financial amount from one K-1 for one tax year. This is the atomic grain. Aggregation rules from CellAggregationRule operate on this grain.
- Slowly changing dimensions: PartnershipMembership already handles SCD Type 2 (effective dates) for ownership percentages. K1BoxDefinition is SCD Type 1 (overwritten on IRS form changes, with version tracking if needed).
- Conformed dimensions: Partnership and Entity serve as conformed dimensions shared between K-1 facts, Distribution facts, and Valuation facts. A single Entity dimension joins to multiple fact tables.
- Currency handling: Store amounts in the source currency with a currency column. The KDocument inherits currency from Partnership. Conversion to reporting currency happens at query time or in materialized views, never by mutating the fact.
- Decimal precision: DECIMAL(15,2) covers amounts up to $9,999,999,999,999.99. K-1 amounts from large partnerships (PE funds, hedge funds) can reach tens of millions. 15 digits provides headroom. Use 2 decimal places to match IRS reporting precision.
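The precision point also applies on the application side: once amounts leave the database, binary floats can drift. One possible approach (an assumption of this sketch, not something the codebase is known to do) is to carry amounts as integer cents, which stay exact because the largest DECIMAL(15,2) value in cents is below Number.MAX_SAFE_INTEGER. The helper names are illustrative.

```typescript
// Largest magnitude DECIMAL(15,2) can hold, in cents: 9,999,999,999,999.99
const MAX_DECIMAL_15_2_CENTS = 999999999999999;

// Parses a decimal string with up to two fraction digits into integer cents.
function toCents(text: string): number {
  const match = /^(-?)(\d+)(?:\.(\d{1,2}))?$/.exec(text);
  if (match === null) {
    throw new Error("Not a 2-decimal amount: " + text);
  }
  const fraction = ((match[3] ?? "") + "00").slice(0, 2);
  const cents = Number(match[2]) * 100 + Number(fraction);
  if (cents > MAX_DECIMAL_15_2_CENTS) {
    throw new Error("Amount exceeds DECIMAL(15,2): " + text);
  }
  return match[1] === "-" ? -cents : cents;
}

// Formats integer cents back into a two-decimal string.
function formatCents(cents: number): string {
  const sign = cents < 0 ? "-" : "";
  const abs = Math.abs(cents);
  const whole = Math.floor(abs / 100);
  const fraction = abs % 100;
  return sign + whole + "." + (fraction < 10 ? "0" : "") + fraction;
}

const floatSum = 0.1 + 0.2; // not exactly 0.3 in binary floating point
const centsSum = toCents("0.10") + toCents("0.20"); // exactly 30
const boxTotalCents = ["50000.00", "-1200.50", "500.25"]
  .map((text) => toCents(text))
  .reduce((sum, cents) => sum + cents, 0);
console.log(floatSum === 0.3, centsSum === 30, formatCents(boxTotalCents)); // false true 49299.75
```

A decimal library would serve the same purpose; the point is only that sums of fact rows should not pass through floats.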
Aggregation queries enabled by this model:
-- Total ordinary income across all partnerships for 2025
SELECT SUM(li.amount)
FROM k1_line_item li
JOIN k_document kd ON li.k_document_id = kd.id
WHERE li.box_key = '1' AND kd.tax_year = 2025;
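-- Income breakdown by entity for tax year 2025
-- (memberships are limited to those active at year end so that
--  historical SCD Type 2 rows are not counted twice)
SELECT e.name, li.box_key, SUM(li.amount) AS total_amount
FROM k1_line_item li
JOIN k_document kd ON li.k_document_id = kd.id
JOIN partnership_membership pm ON pm.partnership_id = kd.partnership_id
JOIN entity e ON pm.entity_id = e.id
WHERE kd.tax_year = 2025
AND pm.effective_date <= make_date(kd.tax_year, 12, 31)
AND (pm.end_date IS NULL OR pm.end_date > make_date(kd.tax_year, 12, 31))
GROUP BY e.name, li.box_key;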
-- Partnership performance: Box 1 over time
SELECT kd.tax_year, p.name, li.amount
FROM k1_line_item li
JOIN k_document kd ON li.k_document_id = kd.id
JOIN partnership p ON kd.partnership_id = p.id
WHERE li.box_key = '1'
ORDER BY kd.tax_year;
These queries are impractical with the current JSON blob model: each would need per-key JSONB extraction and casts, and none could use an ordinary column index.
Alternatives Considered
| Alternative | Pros | Cons |
|---|---|---|
| Snowflake schema (more normalization) | Normalized box categories into sub-dimensions | Over-normalized for ~50 box types; extra JOINs for no benefit |
| Flat denormalized reporting table | Fastest reads; no JOINs | Write complexity; data duplication; hard to keep consistent |
| OLAP cube / column store | Best aggregation performance | Overkill for <10K rows; adds infrastructure complexity |
| Star-schema-inspired fact table (chosen) | Natural fit for K-1 aggregation queries; leverages existing dimensions; PostgreSQL handles this scale trivially | Requires JOINs for full context (acceptable) |
Topic 4: Source Traceability in Financial Systems
Decision
Store extraction provenance at the line-item grain — each K1LineItem records the source page number, bounding-box coordinates, raw extracted text, confidence score, and whether it was user-edited. The K1ImportSession retains the complete raw extraction as an immutable JSON snapshot.
Rationale
The audit trail must support this flow:
Displayed aggregated number
→ K1LineItem (individual box value)
→ KDocument (which K-1, which year, which partnership)
→ K1ImportSession (extraction record)
→ Document (source PDF file)
→ Specific page + coordinates on that page
→ Raw extracted text before parsing
Granularity levels and what to store where:
| Level | Table | Fields | Purpose |
|---|---|---|---|
| Aggregation | Computed at query time | SUM/formula from CellAggregationRule | "Where does this total come from?" → list of K1LineItems |
| Line item | K1LineItem | amount, boxKey, sourceConfidence, sourcePageNumber, sourceCoordinates, rawExtractedText, isUserEdited | "What exactly was extracted and from where?" |
| Document | K1ImportSession | rawExtraction (full JSON), extractionMethod, fileName | "What did the system originally see?" (immutable after extraction) |
| File | Document | filePath, fileSize, mimeType | "Where is the original PDF?" |
Key design principles:
- Immutability of raw extraction: K1ImportSession.rawExtraction is written once at extraction time and never modified. verifiedData captures user edits. This provides a complete before/after audit trail.
- Coordinate-level provenance: The current k1-positions-dump.txt shows the parser already extracts x, y coordinates for each text element. Storing sourceCoordinates: {x, y, width, height} on each K1LineItem enables a future "click to highlight in PDF" feature.
- Confidence as first-class data: The system already computes confidence scores (0.0–1.0) during extraction. Persisting this on the line item (not just in the import session JSON) enables queries like "show me all low-confidence values across all partnerships" and supports audit prioritization.
- User edit tracking: isUserEdited: boolean distinguishes machine-extracted values from human-verified overrides. This is critical for audit and for training future extraction models.
- No deletion of source data: When a KDocument transitions from ESTIMATED → FINAL, the old line items should be soft-versioned (via KDocument.previousData or a separate version table), not deleted.
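The first two hops of the audit flow (aggregate → line items → page location) can be sketched in a few lines. The field names follow the proposed K1LineItem columns; explainTotal, needsReview, and the 0.8 threshold are illustrative assumptions.

```typescript
interface SourceCoordinates {
  x: number;
  y: number;
  width: number;
  height: number;
}

interface TracedLineItem {
  kDocumentId: string;
  boxKey: string;
  amount: number;
  sourceConfidence: number; // 0.00 to 1.00
  sourcePageNumber: number;
  sourceCoordinates: SourceCoordinates;
  isUserEdited: boolean;
}

// Answers "where does this total come from?" by returning the contributing rows.
function explainTotal(
  items: TracedLineItem[],
  sourceCells: string[],
): { total: number; contributors: TracedLineItem[] } {
  const contributors = items.filter((item) => sourceCells.indexOf(item.boxKey) !== -1);
  const total = contributors.reduce((sum, item) => sum + item.amount, 0);
  return { total, contributors };
}

// Machine-extracted values below the threshold that no human has verified yet.
function needsReview(items: TracedLineItem[], threshold: number): TracedLineItem[] {
  return items.filter((item) => item.sourceConfidence < threshold && !item.isUserEdited);
}

const box = { x: 72, y: 340, width: 90, height: 12 };
const tracedItems: TracedLineItem[] = [
  { kDocumentId: "kdoc-1", boxKey: "1", amount: 50000, sourceConfidence: 0.98, sourcePageNumber: 1, sourceCoordinates: box, isUserEdited: false },
  { kDocumentId: "kdoc-1", boxKey: "9a", amount: -1200, sourceConfidence: 0.62, sourcePageNumber: 1, sourceCoordinates: box, isUserEdited: false },
  { kDocumentId: "kdoc-1", boxKey: "20-A", amount: 1200, sourceConfidence: 0.55, sourcePageNumber: 2, sourceCoordinates: box, isUserEdited: true },
];

const explained = explainTotal(tracedItems, ["1", "9a"]);
const reviewQueue = needsReview(tracedItems, 0.8);
console.log(explained.total, reviewQueue.map((item) => item.boxKey)); // 48800 [ '9a' ]
```

Each contributor carries its own page number and coordinates, so a UI could jump from the total straight to the highlighted region of the PDF.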
What NOT to store at line-item level:
- Full PDF binary (stay on Document/filesystem)
- Complete OCR output for the entire page (stay on K1ImportSession.rawExtraction)
- Rendering coordinates for non-K-1 text on the page (not relevant)
Alternatives Considered
| Alternative | Pros | Cons |
|---|---|---|
| Provenance only at document level | Simpler; fewer columns | Cannot trace an individual number back to a specific location on a page |
| Separate provenance table (K1LineItemProvenance) | Clean separation of concerns | Extra JOIN for every audit query; 1:1 relationship is usually better as columns |
| Store full page image crops per line item | Visual proof | Massive storage; PDF coordinates + original file are sufficient for re-rendering |
| Provenance on line item (chosen) | Direct traceability; no extra JOINs; enables "highlight in PDF"; supports audit queries | Slightly wider rows (acceptable for <10K rows) |
Topic 5: PostgreSQL Materialized Views for Financial Reporting
Decision
Use materialized views for cross-partnership/cross-year aggregation dashboards, refreshed on a schedule or triggered by KDocument changes. Use regular views for single-document or single-partnership queries. Do not use denormalized reporting tables.
Rationale
When to use each approach in this system:
| Scenario | Approach | Reason |
|---|---|---|
| "Show Box 1–21 for one K-1" | Regular query on K1LineItem | Small result set; no aggregation; fast enough |
| "Total income by box for one partnership across years" | Regular SQL GROUP BY | <20 rows × <10 years = <200 rows; trivial for PostgreSQL |
| "Dashboard: all partnerships × all entities × 5 years" | Materialized view | Joins across dimensions; 50 partnerships × 5 entities × 5 years × 20 boxes = 25,000 aggregated values; worth pre-computing |
| "Tax planning: find partnerships with specific loss patterns" | Materialized view with indexes | Complex filtering across many K-1s |
| "YoY change in Box 1 by partnership" | Materialized view | Window functions over multiple years |
Proposed materialized views:
-- MV 1: K-1 Summary by Partnership/Year
CREATE MATERIALIZED VIEW mv_k1_partnership_year_summary AS
SELECT
kd.partnership_id,
kd.tax_year,
li.box_key,
bd.label,
bd.section,
SUM(li.amount) AS total_amount,
COUNT(*) AS line_count,
kd.filing_status
FROM k1_line_item li
JOIN k_document kd ON li.k_document_id = kd.id
JOIN k1_box_definition bd ON li.box_key = bd.box_key
GROUP BY kd.partnership_id, kd.tax_year, li.box_key, bd.label, bd.section, kd.filing_status;
-- MV 2: Entity-level Income Aggregation
CREATE MATERIALIZED VIEW mv_entity_income_summary AS
SELECT
e.id AS entity_id,
e.name AS entity_name,
kd.tax_year,
li.box_key,
SUM(li.amount * pm.ownership_percent / 100) AS allocated_amount
FROM k1_line_item li
JOIN k_document kd ON li.k_document_id = kd.id
JOIN partnership_membership pm ON pm.partnership_id = kd.partnership_id
JOIN entity e ON pm.entity_id = e.id
WHERE pm.effective_date <= make_date(kd.tax_year, 12, 31)
AND (pm.end_date IS NULL OR pm.end_date > make_date(kd.tax_year, 12, 31))
GROUP BY e.id, e.name, kd.tax_year, li.box_key;
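The membership filter in MV 2 is the subtle part: it selects the ownership rows that were active on December 31 of the tax year. The same logic in application code, as a sketch for reasoning about edge cases (the Membership shape and function names are assumptions; dates are ISO YYYY-MM-DD strings, which compare correctly as text):

```typescript
interface Membership {
  entityId: string;
  ownershipPercent: number;
  effectiveDate: string; // ISO date, YYYY-MM-DD
  endDate: string | null; // null = still active
}

// Mirrors: effective_date <= Dec 31 AND (end_date IS NULL OR end_date > Dec 31)
function isActiveAtYearEnd(membership: Membership, taxYear: number): boolean {
  const yearEnd = taxYear + "-12-31";
  return (
    membership.effectiveDate <= yearEnd &&
    (membership.endDate === null || membership.endDate > yearEnd)
  );
}

// Mirrors: SUM(amount * ownership_percent / 100) grouped by entity.
function allocateByOwnership(
  amount: number,
  memberships: Membership[],
  taxYear: number,
): Record<string, number> {
  const allocated: Record<string, number> = {};
  for (const membership of memberships) {
    if (!isActiveAtYearEnd(membership, taxYear)) {
      continue;
    }
    const share = (amount * membership.ownershipPercent) / 100;
    allocated[membership.entityId] = (allocated[membership.entityId] ?? 0) + share;
  }
  return allocated;
}

const membershipHistory: Membership[] = [
  { entityId: "trust-a", ownershipPercent: 60, effectiveDate: "2020-01-01", endDate: null },
  { entityId: "llc-b", ownershipPercent: 40, effectiveDate: "2020-01-01", endDate: "2024-06-30" },
  { entityId: "llc-c", ownershipPercent: 40, effectiveDate: "2024-07-01", endDate: null },
];

const allocation2025 = allocateByOwnership(50000, membershipHistory, 2025);
const allocation2023 = allocateByOwnership(50000, membershipHistory, 2023);
console.log(allocation2025, allocation2023);
```

Note that a year-end snapshot ignores mid-year ownership changes; whether that is acceptable is a business decision outside this sketch.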
Refresh strategy:
- Event-driven refresh: After any KDocument insert/update/delete or status change to FINAL, refresh affected materialized views. In NestJS, this is a @OnEvent('k-document.changed') handler that calls REFRESH MATERIALIZED VIEW CONCURRENTLY.
- CONCURRENTLY keyword: Allows reads during refresh (requires a unique index on the MV, and the MV must already be populated). Essential for a multi-user system.
- Frequency: For a family office with <100 K-1s updated per year, a refresh should take well under a second at this data volume. No scheduling needed; event-driven refresh is sufficient.
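One detail worth sketching is coalescing: a bulk import may emit many change events, and each view only needs to be refreshed once afterwards. The class below is a framework-free illustration under that assumption; MaterializedViewRefresher and SqlExecutor are hypothetical names, and the executor is injected so no database client API is assumed (a real one would be asynchronous).

```typescript
// Stand-in for whatever runs SQL in the application (assumed, synchronous here).
type SqlExecutor = (sql: string) => void;

class MaterializedViewRefresher {
  private readonly dirtyViews = new Set<string>();

  // knownViews acts as an allow-list, so view names are never taken from user input.
  constructor(private readonly knownViews: string[]) {}

  // Called from the change handler; repeated calls collapse into one pending refresh.
  markDirty(viewName: string): void {
    if (this.knownViews.indexOf(viewName) === -1) {
      throw new Error("Unknown materialized view: " + viewName);
    }
    this.dirtyViews.add(viewName);
  }

  // Issues one REFRESH per dirty view and returns how many were refreshed.
  flush(execute: SqlExecutor): number {
    const pending = Array.from(this.dirtyViews);
    this.dirtyViews.clear();
    pending.forEach((viewName) => {
      execute("REFRESH MATERIALIZED VIEW CONCURRENTLY " + viewName);
    });
    return pending.length;
  }
}

const refresher = new MaterializedViewRefresher([
  "mv_k1_partnership_year_summary",
  "mv_entity_income_summary",
]);
const executedSql: string[] = [];

// Three document changes arrive before the next flush.
for (let i = 0; i < 3; i++) {
  refresher.markDirty("mv_k1_partnership_year_summary");
  refresher.markDirty("mv_entity_income_summary");
}
const refreshedCount = refresher.flush((sql) => {
  executedSql.push(sql);
});
console.log(refreshedCount, executedSql);
```

In a NestJS service the event handler would call markDirty and a short timer or end-of-request hook would call flush; that wiring is left out here.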
Why not denormalized reporting tables:
Denormalized tables (duplicating data into a flat reporting structure) require write-time consistency management: every KDocument change must update the reporting table transactionally. That cost is justified when reads must always reflect the latest write, but K-1 data is low-write (<100 writes/year) and high-read (dashboards queried many times). Materialized views fit this profile: the only application-level sync logic is a single refresh call.
Why not computed/generated columns:
PostgreSQL generated columns cannot reference other tables. Since aggregations span KDocument → K1LineItem → Partnership → Entity, generated columns are structurally insufficient.
Alternatives Considered
| Alternative | Pros | Cons |
|---|---|---|
| Application-level caching (Redis/in-memory) | No DB schema changes | Cache invalidation complexity; doesn't help SQL-based analytics |
| Denormalized reporting tables | Fastest reads; works at any scale | Write-time maintenance burden; consistency bugs; overkill for <10K rows |
| Regular views (not materialized) | Always fresh; no refresh needed | Recomputed on every query; slow for cross-entity dashboards |
| Materialized views (chosen) | Pre-computed; concurrent reads; event-driven refresh; zero application-level sync | Slight staleness (mitigated by event-driven refresh); requires unique indexes for CONCURRENTLY |
Topic 6: Migration Strategy from JSON Blob to Normalized Tables
Decision
Phase the migration in 3 steps: (1) Create new tables alongside existing JSON, (2) Dual-write to both during a transition period, (3) Make normalized tables authoritative. Keep the JSON blob immutable as an archive — never delete it.
Rationale
Step 1: Additive schema changes (zero breaking changes)
Migration 1: Create K1BoxDefinition table, seed with IRS default box definitions
Migration 2: Create K1LineItem table with FK to KDocument and K1BoxDefinition
Migration 3: Backfill K1LineItem from existing KDocument.data JSON blobs
The backfill for Migration 3:
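-- Sketch: for each KDocument, expand the JSON keys into K1LineItem rows.
-- Assumes PostgreSQL 13+ for built-in gen_random_uuid() (earlier versions need pgcrypto).
INSERT INTO k1_line_item (id, k_document_id, box_key, amount, created_at, updated_at)
SELECT
gen_random_uuid(),
kd.id,
je.key,
(je.value)::numeric,
kd.created_at,
NOW()
FROM k_document kd
CROSS JOIN LATERAL jsonb_each(kd.data::jsonb) AS je(key, value)
-- Only keys with a definition can satisfy the box_key FK; report the rest separately.
JOIN k1_box_definition bd ON bd.box_key = je.key
WHERE jsonb_typeof(je.value) = 'number'
-- Safe to re-run: existing (k_document_id, box_key) pairs are skipped.
ON CONFLICT (k_document_id, box_key) DO NOTHING;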
Step 2: Dual-write transition period
During the transition:
- k1-import.service.ts confirmImport() writes to both KDocument.data (JSON) and K1LineItem (rows)
- Read operations gradually migrate from JSON-based to K1LineItem-based
- k1-aggregation.service.ts switches from JSON iteration to SELECT SUM on K1LineItem
- Run validation queries comparing JSON-derived totals to K1LineItem-derived totals
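The validation step during dual-write can be as simple as a per-document diff between the JSON blob and the rows. A minimal sketch, assuming both sides have been loaded for one KDocument (compareDocument and the Mismatch shape are illustrative):

```typescript
interface LineItemAmount {
  boxKey: string;
  amount: number;
}

interface Mismatch {
  boxKey: string;
  jsonAmount: number | null; // null = key missing from KDocument.data
  rowAmount: number | null; // null = no K1LineItem row
}

// Lists every box whose value differs between the JSON blob and the rows.
function compareDocument(
  jsonData: Record<string, number>,
  rows: LineItemAmount[],
): Mismatch[] {
  const rowAmounts = new Map<string, number>();
  rows.forEach((row) => rowAmounts.set(row.boxKey, row.amount));

  // Union of keys: everything in the JSON, plus row-only keys.
  const allKeys = Object.keys(jsonData);
  rows.forEach((row) => {
    if (jsonData[row.boxKey] === undefined) {
      allKeys.push(row.boxKey);
    }
  });

  const mismatches: Mismatch[] = [];
  for (const boxKey of allKeys) {
    const jsonValue = jsonData[boxKey];
    const rowValue = rowAmounts.get(boxKey);
    const jsonAmount = jsonValue === undefined ? null : jsonValue;
    const rowAmount = rowValue === undefined ? null : rowValue;
    if (jsonAmount !== rowAmount) {
      mismatches.push({ boxKey, jsonAmount, rowAmount });
    }
  }
  return mismatches;
}

const dualWriteMismatches = compareDocument(
  { "1": 50000, "9a": -1200, "11-ZZ*": 500 },
  [
    { boxKey: "1", amount: 50000 },
    { boxKey: "9a", amount: -1250 },
    { boxKey: "20-A", amount: 1200 },
  ],
);
console.log(dualWriteMismatches.map((m) => m.boxKey)); // [ '9a', '11-ZZ*', '20-A' ]
```

An empty result across all documents for a sustained period would be a reasonable signal that Step 3 is safe to begin.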
Step 3: K1LineItem becomes authoritative
- New features (dashboards, tax planning, LLM queries) read only from K1LineItem
- KDocument.data is retained as immutable archive but no longer written to for new documents
- CellAggregationRule.sourceCells continues to work: the boxKey values are the same strings
- CellMapping evolves into or is replaced by K1BoxDefinition
Should the old JSON be kept immutable?
Yes, permanently. Reasons:
- Audit requirement — The JSON blob is the original imported representation. Regulatory and audit standards require preserving source data in its original form.
- Rollback safety — If the migration has bugs, the JSON blob is the recovery source.
- Storage is trivial — A JSON blob with ~30 key-value pairs is <1 KB. Even 1,000 KDocuments = <1 MB total. There's no storage pressure to delete it.
- Import session already preserves extraction: K1ImportSession.rawExtraction holds the pre-verification extraction. KDocument.data holds the post-verification snapshot. Both should survive indefinitely.
Backward compatibility considerations:
- The KDocument.data column type stays Json (not nullable, not removed)
- The existing k-document-form.component.ts UI reads from KDocument.data; it continues to work during transition
- The computeForKDocument() aggregation service works against JSON through the transition, then switches to K1LineItem queries
- No existing API contracts change: GET /k-documents/:id returns the same shape
Handling the CellMapping → K1BoxDefinition transition:
The existing CellMapping table (per-partnership box definitions) maps closely to the proposed K1BoxDefinition. The migration strategy:
- K1BoxDefinition absorbs the global (partnershipId = null) CellMapping records
- Per-partnership CellMapping overrides become per-partnership K1BoxDefinition rows (or remain as display-layer configuration separate from the data model)
- CellMapping fields like isIgnored, isCustom are presentation concerns that may not belong on the data-layer K1BoxDefinition
Alternatives Considered
| Alternative | Pros | Cons |
|---|---|---|
| Big-bang migration (drop JSON, create tables, migrate in one step) | Clean; no dual-write complexity | Risk of data loss; requires full feature freeze; hard to validate |
| Dual-write indefinitely | Maximum safety | Permanent write overhead; divergence risk between JSON and rows |
| Keep JSON as authoritative, add views | No migration of writes | Doesn't solve the core query limitation; views over JSONB are slow |
| Phased migration with immutable archive (chosen) | Zero-downtime; incremental validation; rollback possible; preserves audit trail | Dual-write period adds complexity (bounded to weeks, not permanent) |
Topic 7: Schema Design for Future LLM NL-to-SQL
Decision
Design tables with self-documenting names, add PostgreSQL COMMENT ON annotations for every table and column, use consistent naming conventions, and avoid ambiguity between similarly-named entities.
Rationale
LLMs generating SQL (via text-to-SQL or NL-to-SQL) work by receiving the schema as context and mapping natural language to table/column references. The schema itself is the prompt. Research from the Spider benchmark (Yale), BIRD benchmark, and production NL-to-SQL systems (e.g., Vanna.ai, DataHerald) identifies these factors as most impactful:
1. Naming conventions that LLMs parse correctly:
| Current Name | Problem | Proposed Name | Why Better |
|---|---|---|---|
| KDocument | "K" is ambiguous to LLMs | k1_document | Explicitly says "K-1" |
| KDocument.data | "data" is the most generic possible name | k1_document.raw_data_json | Describes what it holds |
| K1LineItem.amount | Could be confused with Distribution.amount | k1_line_item.reported_amount | Disambiguates |
| CellMapping | "Cell" is a spreadsheet term, not a tax term | k1_box_definition | Domain-specific |
| CellAggregationRule | LLMs may not connect "cell" to K-1 boxes | k1_aggregation_rule | Clearer context |
Naming conventions to adopt:
- snake_case for all table and column names (PostgreSQL convention, since unquoted identifiers fold to lowercase and camelCase names must be double-quoted; LLMs have also likely seen more snake_case SQL than camelCase)
- Prefix K-1-specific tables with k1_ to create a namespace
- Use _id suffix for all foreign keys
- Avoid abbreviations (partnership_id not ptnr_id)
- Use _at suffix for timestamps (created_at, updated_at)
- Use descriptive names over short names (tax_year not yr, filing_status not status)
2. PostgreSQL COMMENT annotations:
COMMENT ON TABLE k1_line_item IS 'Individual financial line item from an IRS Schedule K-1 (Form 1065). One row per box number per K-1 document.';
COMMENT ON COLUMN k1_line_item.box_key IS 'IRS K-1 box identifier such as "1" for ordinary income, "9a" for long-term capital gains, or "20-A" for other information code A.';
COMMENT ON COLUMN k1_line_item.reported_amount IS 'Dollar amount reported on this K-1 line item, in the partnership base currency. Negative values represent losses.';
COMMENT ON TABLE k1_box_definition IS 'Reference table of IRS Schedule K-1 box definitions. Maps box identifiers to human-readable labels and categories.';
LLM NL-to-SQL systems extract these comments as schema context. A model asked "what is total ordinary income?" can map "ordinary income" → k1_box_definition.label = 'Ordinary business income (loss)' → box_key = '1' → join to k1_line_item.
3. Avoiding ambiguity:
Current pain points for LLM-generated SQL:
- Distribution.amount vs K1LineItem.amount: an LLM asked "total distributions" might query the wrong table. Solution: k1_line_item.reported_amount vs distribution.distribution_amount.
- Partnership has distributions, kDocuments, valuations: naming all FK columns partnership_id is correct and expected by LLMs.
- Entity is overloaded (database entities, legal entities). The table comment must clarify: "A legal person or structure (trust, LLC, individual) that owns assets and receives K-1 allocations."
4. Schema metadata table for LLM context:
Consider a lightweight schema_metadata table or a markdown document that provides the LLM with:
- Table relationships in natural language
- Common query patterns with examples
- Business rules ("Box 19a distributions are allocated to entities by ownership percentage")
- Valid values for enum columns
This is cheaper than fine-tuning and more maintainable than few-shot prompts.
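As a rough illustration of the markdown-document option, the context handed to the LLM could be rendered from a small in-code description of the schema. The TableDoc shape and renderSchemaContext are assumptions of this sketch; the comments and the business rule are taken from this document.

```typescript
interface ColumnDoc {
  column: string;
  comment: string;
}

interface TableDoc {
  table: string;
  comment: string;
  columns: ColumnDoc[];
}

// Renders table/column comments and business rules as plain text for a prompt.
function renderSchemaContext(tables: TableDoc[], businessRules: string[]): string {
  const lines: string[] = [];
  for (const tableDoc of tables) {
    lines.push("Table " + tableDoc.table + ": " + tableDoc.comment);
    for (const columnDoc of tableDoc.columns) {
      lines.push("  - " + tableDoc.table + "." + columnDoc.column + ": " + columnDoc.comment);
    }
  }
  if (businessRules.length > 0) {
    lines.push("Business rules:");
    for (const rule of businessRules) {
      lines.push("  - " + rule);
    }
  }
  return lines.join("\n");
}

const schemaContext = renderSchemaContext(
  [
    {
      table: "k1_line_item",
      comment: "Individual financial line item from an IRS Schedule K-1 (Form 1065). One row per box number per K-1 document.",
      columns: [
        { column: "box_key", comment: "IRS K-1 box identifier such as \"1\" for ordinary income." },
        { column: "reported_amount", comment: "Dollar amount reported on this K-1 line item. Negative values represent losses." },
      ],
    },
  ],
  ["Box 19a distributions are allocated to entities by ownership percentage"],
);
console.log(schemaContext);
```

In practice the table and column comments could be read back from the database catalog instead of being duplicated in code, so that COMMENT ON remains the single source of truth.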
5. Avoid patterns that confuse LLMs:
| Anti-pattern | Why It Confuses LLMs | Alternative |
|---|---|---|
| JSON columns for queryable data | LLMs generate -> / ->> operators inconsistently | Normalized columns |
| Composite primary keys | LLMs often forget one part of the key in JOINs | Surrogate UUID PK + unique constraint |
| Polymorphic FKs (one FK, multiple target tables) | LLMs can't determine which table to JOIN | Separate FK columns |
| Generic column names (type, status, data, value) | Ambiguous across tables | Prefix with table context (filing_status, box_data_type) |
| Soft deletes (is_deleted) | LLMs forget the WHERE is_deleted = false filter | Use end_date IS NULL pattern (already in use for memberships) |
Alternatives Considered
| Alternative | Pros | Cons |
|---|---|---|
| No schema changes for LLM | No work | LLM accuracy drops significantly with ambiguous/generic names; JSONB columns are nearly unusable for NL-to-SQL |
| Fine-tune LLM on this schema | Can handle any naming convention | Expensive; needs retraining on every schema change; vendor lock-in |
| RAG over schema docs | Flexible; schema-aware | Still limited by underlying schema quality; garbage-in-garbage-out |
| Self-documenting schema + COMMENT annotations (chosen) | Works with any LLM; zero runtime cost; maintainable; improves human readability too | Requires discipline to maintain comments on schema changes |
Summary of Decisions
| # | Topic | Decision |
|---|---|---|
| 1 | Wide vs Normalized | Normalized fact table for Part III financial data; JSON retained for Part I/II metadata |
| 2 | EAV vs Normalized | Hybrid: typed EAV fact table (K1LineItem) with reference dimension (K1BoxDefinition); uniform DECIMAL value type avoids classic EAV pitfalls |
| 3 | Financial fact tables | Star-schema-inspired design with K1LineItem as fact, KDocument as bridge, and Partnership/Entity/K1BoxDefinition as dimensions |
| 4 | Source traceability | Per-line-item provenance (page, coordinates, confidence, raw text, user-edit flag); K1ImportSession.rawExtraction as immutable full extraction archive |
| 5 | Materialized views | Event-driven materialized views for cross-entity dashboards; regular queries for single-document access |
| 6 | Migration strategy | 3-phase: additive tables → dual-write → K1LineItem authoritative; JSON blob kept immutable forever |
| 7 | LLM NL-to-SQL | Self-documenting snake_case names, COMMENT ON annotations, disambiguation of similar columns, k1_ table prefix namespace |