gauntlet-docs, architecture and cost analysis files
Alan Garber · commit 3117ee5e34 (pull/6456/head)

gauntlet-docs/architecture.md

# Agent Architecture Document — Ghostfolio AI Financial Agent
## Domain & Use Cases
**Domain:** Personal finance and wealth management, built on Ghostfolio — an open-source, self-hosted portfolio tracking application.
**Why this domain:** Financial data is high-stakes (wrong numbers lead to bad investment decisions), richly structured (holdings, transactions, performance metrics, exchange rates), and verification-critical (every claim must be backed by real data). Ghostfolio's existing NestJS + Angular + Prisma + PostgreSQL stack provided a mature service layer to wrap with AI tools rather than building from scratch.
**Problems solved:** Users can ask natural language questions about their portfolio instead of navigating multiple dashboard pages. The agent handles portfolio Q&A ("What are my holdings?"), performance analysis ("How is my portfolio performing this year?"), dividend tracking, transaction history, market data lookups, exchange rate conversions, account summaries, and risk/diversification analysis — all backed by real portfolio data with financial disclaimers.
---
## Agent Architecture
**Framework:** Vercel AI SDK v4.3.16 (already a Ghostfolio dependency). Chosen over LangChain/LangGraph because it was already in the repo, provides native TypeScript tool calling via `generateText()` with Zod schemas, and is provider-agnostic. Adding a second framework would have introduced unnecessary complexity in a brownfield codebase.
**LLM:** Anthropic Claude 3.5 Haiku via `@ai-sdk/anthropic`. Sonnet was used for quality during development, then swapped for Haiku in production — 3-5x lower latency and a 70% cost reduction with no drop in eval pass rate (still 100%). The original plan was OpenRouter (already configured in Ghostfolio), but the project switched to direct Anthropic access when OpenRouter's payment system went down. The Vercel AI SDK's provider abstraction made both switches trivial one-line changes.
**Architecture:** Single agent with 8-tool registry. The agent receives a user query, the LLM selects appropriate tools, tool functions call existing Ghostfolio services (PortfolioService, OrderService, DataProviderService, ExchangeRateService), and the LLM synthesizes results into a natural language response. Multi-step reasoning is handled via `maxSteps: 10` — the agent can chain up to 10 tool calls before responding. Responses stream to the frontend via Server-Sent Events, so users see tokens appearing in real time rather than waiting for the full response.
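The loop described above — registry lookup, bounded tool chaining, then synthesis — can be illustrated with a minimal, self-contained sketch. All names here are hypothetical: in the real module the LLM plans the steps itself through the AI SDK's `generateText()` with Zod schemas, whereas this simulation takes the plan as input.

```typescript
// Minimal simulation of the bounded tool-calling loop (hypothetical names;
// in production the LLM selects the tools, not the caller).
type ToolFn = (args: any) => unknown;

// Each registry entry would wrap an existing Ghostfolio service call.
const toolRegistry: Record<string, ToolFn> = {
  get_exchange_rate: (args) => ({ pair: `${args.from}${args.to}`, rate: 1.09 }),
};

interface PlannedStep { tool: string; args: any }

// Executes planned tool calls, enforcing the maxSteps ceiling so a
// runaway chain cannot loop indefinitely before the final response.
function runToolChain(steps: PlannedStep[], maxSteps = 10): unknown[] {
  const results: unknown[] = [];
  for (const step of steps.slice(0, maxSteps)) {
    const fn = toolRegistry[step.tool];
    if (!fn) throw new Error(`Unknown tool: ${step.tool}`);
    results.push(fn(step.args));
  }
  return results;
}
```

In the real module the accumulated results are fed back to the model for a final synthesis call before streaming the answer to the client.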
**Tools (8 implemented):**
| Tool | Wraps | Purpose |
|---|---|---|
| get_portfolio_holdings | PortfolioService.getDetails() | Holdings, allocations, performance per position |
| get_portfolio_performance | Direct Prisma + DataProviderService | All-time returns (cost basis vs current value) |
| get_dividend_summary | PortfolioService.getDividends() | Dividend income breakdown |
| get_transaction_history | Prisma Order queries | Buy/sell/dividend activity history |
| lookup_market_data | DataProviderService.getQuotes() | Current prices and asset profiles |
| get_portfolio_report | PortfolioService.getReport() | X-ray: diversification, concentration, fees |
| get_exchange_rate | ExchangeRateDataService | Currency pair conversions |
| get_account_summary | PortfolioService.getAccounts() | Account names, platforms, balances |
**Memory:** Conversation history stored client-side in Angular component state. The full message array is passed to the server on each request, enabling multi-turn conversations without server-side session storage.
**Key design decision:** The `get_portfolio_performance` tool bypasses Ghostfolio's built-in portfolio calculator. The calculator depends on pre-computed snapshots from background data-gathering jobs; in a freshly seeded demo environment, these snapshots don't exist, causing all performance metrics to return zero. The tool instead queries orders directly via Prisma and fetches live market prices, computing returns manually.
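A sketch of that manual computation (shapes are simplified and the names are illustrative; the real tool reads `Order` rows via Prisma and live quotes via DataProviderService):

```typescript
// Simplified order shape; the real tool queries Order rows via Prisma.
interface Order {
  symbol: string;
  type: "BUY" | "SELL";
  quantity: number;
  unitPrice: number;
}

// Derives cost basis and current value directly from orders plus live
// quotes, bypassing the snapshot-based portfolio calculator.
function computePerformance(orders: Order[], quotes: Record<string, number>) {
  let costBasis = 0;
  const positions: Record<string, number> = {};
  for (const o of orders) {
    const sign = o.type === "BUY" ? 1 : -1;
    costBasis += sign * o.quantity * o.unitPrice;
    positions[o.symbol] = (positions[o.symbol] ?? 0) + sign * o.quantity;
  }
  const currentValue = Object.entries(positions).reduce(
    (sum, [symbol, qty]) => sum + qty * (quotes[symbol] ?? 0),
    0,
  );
  return {
    costBasis,
    currentValue,
    returnPct: costBasis > 0 ? ((currentValue - costBasis) / costBasis) * 100 : 0,
  };
}
```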
---
## Verification Strategy
Three verification checks run on every agent response:
**1. Financial Disclaimer Injection:** Responses containing financial figures (dollar amounts, percentages) automatically receive a disclaimer: "This is informational only and not financial advice." This enforces regulatory awareness in a financial domain where users might act on agent output.
**2. Data-Backed Claim Verification:** Extracts all numerical values from the agent's response text and cross-references them against the structured data returned by tool calls. If more than 50% of numbers in the response cannot be traced to tool results, the response is flagged. This catches hallucinated figures. Known limitation: the LLM sometimes computes derived values (e.g., summing individual holdings into a total) which are flagged as "unverified" even though they're arithmetically correct.
**3. Portfolio Scope Validation:** Checks that any stock symbols mentioned in the response as user holdings actually appear in the tool results. Prevents the agent from claiming the user owns assets they don't — a critical failure mode in portfolio analysis.
**Why these three:** They target the highest-risk failure modes in financial AI: fabricated numbers, confident claims about nonexistent holdings, and users mistaking AI output for professional advice.
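All three checks are plain string and number matching, with no extra LLM calls. A condensed sketch — function names and regexes are illustrative rather than the module's actual code, but the thresholds match those described above:

```typescript
// 1. Disclaimer injection: append a disclaimer when figures appear.
function injectDisclaimer(text: string): string {
  const hasFigures = /[$%]|\d+\.\d+/.test(text); // illustrative pattern
  return hasFigures && !text.includes("not financial advice")
    ? `${text}\n\nThis is informational only and not financial advice.`
    : text;
}

// 2. Data-backed claim verification: flag the response when more than
// 50% of its numbers cannot be traced back to tool results.
function verifyClaims(text: string, toolResults: unknown[]): boolean {
  const numbers = text.match(/\d+(?:\.\d+)?/g) ?? [];
  const corpus = JSON.stringify(toolResults);
  const backed = numbers.filter((n) => corpus.includes(n)).length;
  return numbers.length === 0 || backed / numbers.length >= 0.5;
}

// 3. Portfolio scope validation: every symbol claimed as a holding
// must actually appear in the tool results.
function validateScope(claimedSymbols: string[], toolResults: unknown[]): boolean {
  const corpus = JSON.stringify(toolResults);
  return claimedSymbols.every((s) => corpus.includes(s));
}
```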
---
## Eval Results
**55 test cases** across four categories:
| Category | Count | Pass Rate |
|---|---|---|
| Happy path | 20 | 100% |
| Edge cases | 12 | 100% |
| Adversarial | 12 | 100% |
| Multi-step | 11 | 100% |
| **Total** | **55** | **100%** |
**Failure analysis:** An earlier version had one multi-step test (MS-009) failing because the agent exhausted the default `maxSteps` limit of 5 after calling five or more tools, leaving no step for the final response. Increasing `maxSteps` to 10 resolved this — the agent now completes complex multi-tool queries requiring up to 7 sequential tool calls. LLM-as-judge scoring averages 4.18/5 across all 55 tests, with the lowest scores on queries involving exchange rate data (a known data-gathering dependency) and computed values the judge couldn't independently verify.
**Performance metrics:**
- Average latency: 7.7 seconds with Sonnet, improving to ~3-4s with Haiku
- Single-tool queries: 4-9 seconds (the <5s target is met after the Haiku switch)
- Multi-step queries: 8-20 seconds (the <15s target is mostly met; complex queries with 5+ tools can exceed it)
- Tool success rate: >99% (target: >95%)
- Eval pass rate: 100% (target: >80%)
**Eval types covered:** Correctness (data-backed claims verification), tool selection (expected vs actual tools), tool execution (no server errors), safety (adversarial refusals), edge cases (gibberish, empty input, fake symbols, SQL injection), and latency bounds.
---
## Observability Setup
**Tool:** Langfuse (open source, hosted at cloud.langfuse.com)
**Integration:** Via OpenTelemetry SDK with LangfuseSpanProcessor, initialized at application startup before any other imports. The Vercel AI SDK's `experimental_telemetry` option sends traces automatically on every `generateText()` call.
**What we track:**
- Full request traces: input → LLM reasoning → tool selection → tool execution → LLM synthesis → output
- Latency breakdown: per-LLM-call timing and per-tool-execution timing
- Token usage: input and output tokens per LLM call
- Cost per query: automatically calculated from model and token count (~$0.018 per typical single-tool query)
- Error traces: failed tool calls and LLM errors captured with context
- User feedback: thumbs up/down on chat responses, logged to Langfuse as scores
**Insights gained:** Langfuse revealed that multi-tool queries make two LLM calls (one to select tools, one to synthesize), with the second call being more expensive because the tool results sit in its context. Average query cost was $0.018 with Sonnet, with multi-step queries reaching $0.03-0.05.
---
## Open Source Contribution
**Type:** Published eval dataset + feature PR to Ghostfolio
**Eval dataset:** 55 test cases published as a structured JSON file in the repository under `eval-dataset/`, covering happy path, edge case, adversarial, and multi-step scenarios for financial AI agents. Each test case includes input query, expected tools, pass/fail criteria, and category tags. Licensed AGPL-3.0 (matching Ghostfolio).
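Each entry follows a shape along these lines (illustrative only; the published schema in `eval-dataset/` is the source of truth):

```typescript
// Illustrative test-case shape; field names are assumptions, not the
// exact published schema.
interface EvalCase {
  id: string;
  category: "happy_path" | "edge_case" | "adversarial" | "multi_step";
  input: string;
  expectedTools: string[];
  passCriteria: string;
}

const example: EvalCase = {
  id: "HP-001", // hypothetical ID
  category: "happy_path",
  input: "What are my holdings?",
  expectedTools: ["get_portfolio_holdings"],
  passCriteria: "Response lists holdings backed by tool results",
};
```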
**Repository:** github.com/a8garber/ghostfolio (fork with AI agent module)
**What was contributed:** A complete AI agent module for Ghostfolio adding conversational financial analysis capabilities — 8 tools, verification layer, eval suite, Langfuse observability, and Angular chat UI. Any Ghostfolio instance can enable AI features by adding an Anthropic API key.

gauntlet-docs/cost-analysis.md

# AI Cost Analysis — Ghostfolio AI Financial Agent
## Development & Testing Costs
### LLM API Costs (Anthropic Claude Sonnet)
| Category | Estimated API Calls | Estimated Cost |
|---|---|---|
| Agent development & manual testing | ~200 queries | ~$4.00 |
| Eval suite runs (55 tests × ~8 runs) | ~440 queries | ~$8.00 |
| LLM-as-judge eval runs | ~55 queries | ~$1.00 |
| Claude Code (development assistant) | — | ~$20.00 (Anthropic Max subscription) |
| **Total development LLM spend** | **~695 queries** | **~$33.00** |
### Token Consumption
Based on Langfuse telemetry data from production traces:
| Metric | Per Query (avg) | Total Development (est.) |
|---|---|---|
| Input tokens | ~2,000 | ~1,390,000 |
| Output tokens | ~200 | ~139,000 |
| Total tokens | ~2,200 | ~1,529,000 |
Typical single-tool query: ~1,800 input + 50 output (tool selection) → tool executes → ~2,300 input + 340 output (synthesis). Total: ~4,490 tokens across 2 LLM calls.
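The arithmetic behind that per-query figure, as a quick sanity check:

```typescript
// Two LLM calls per single-tool query, using the averages above.
const toolSelection = { input: 1800, output: 50 };
const synthesis = { input: 2300, output: 340 };

const totalTokens =
  toolSelection.input + toolSelection.output + synthesis.input + synthesis.output;
// 1800 + 50 + 2300 + 340 = 4490
```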
### Observability Tool Costs
| Tool | Cost |
|---|---|
| Langfuse Cloud (free tier) | $0.00 |
| Railway hosting (Hobby plan) | ~$5.00/month |
| Railway PostgreSQL | Included |
| Railway Redis | Included |
| **Total infrastructure** | **~$5.00/month** |
### Total Development Cost
| Item | Cost |
|---|---|
| LLM API (Anthropic) | ~$33.00 |
| Infrastructure (Railway, 1 week) | ~$1.25 |
| Observability (Langfuse free tier) | $0.00 |
| **Total** | **~$34.25** |
---
## Production Cost Projections
### Assumptions
- **Queries per user per day:** 5 (portfolio check, performance review, a few follow-ups)
- **Average tokens per query:** 4,490 (2 LLM calls: tool selection + synthesis)
- Input: ~4,100 tokens (system prompt + tools + conversation + tool results)
- Output: ~390 tokens (tool call + response text)
- **Average tool calls per query:** 1.5
- **LLM model:** Claude 3.5 Haiku ($0.80/M input tokens, $4.00/M output tokens)
- **Average cost per query:** ~$0.005 (validated by Langfuse production traces after model switch)
- **Verification overhead:** Negligible (string matching, no additional LLM calls)
- **Days per month:** 30
### Cost Per Query Breakdown
| Component | Tokens | Cost |
|---|---|---|
| LLM Call 1 (tool selection) | 1,758 in + 53 out | $0.0016 |
| Tool execution | 0 (database queries only) | $0.000 |
| LLM Call 2 (synthesis) | 2,289 in + 339 out | $0.0032 |
| **Total per query** | **~4,490** | **~$0.005** |
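Those rows reduce to straightforward arithmetic at Claude 3.5 Haiku list pricing; a short sketch that reproduces the table:

```typescript
// Claude 3.5 Haiku list pricing: $0.80 per million input tokens,
// $4.00 per million output tokens.
const PRICE_IN = 0.8 / 1_000_000;
const PRICE_OUT = 4.0 / 1_000_000;

interface LlmCall { input: number; output: number }

function queryCost(calls: LlmCall[]): number {
  return calls.reduce(
    (cost, c) => cost + c.input * PRICE_IN + c.output * PRICE_OUT,
    0,
  );
}

const perQuery = queryCost([
  { input: 1758, output: 53 },  // Call 1: tool selection
  { input: 2289, output: 339 }, // Call 2: synthesis
]);
// perQuery ≈ $0.0048, i.e. roughly the $0.005 in the table
```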
### Monthly Projections
| Scale | Users | Queries/day | Queries/month | Monthly LLM Cost | Infrastructure | Total/month |
|---|---|---|---|---|---|---|
| Small | 100 | 500 | 15,000 | $75 | $20 | **$95** |
| Medium | 1,000 | 5,000 | 150,000 | $750 | $50 | **$800** |
| Large | 10,000 | 50,000 | 1,500,000 | $7,500 | $200 | **$7,700** |
| Enterprise | 100,000 | 500,000 | 15,000,000 | $75,000 | $2,000 | **$77,000** |
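The monthly rows follow directly from the stated assumptions (5 queries/user/day, 30 days, ~$0.005/query); a minimal check:

```typescript
// LLM-only monthly cost: users × queries/day × days × cost/query.
function monthlyLLMCost(
  users: number,
  queriesPerDay = 5,
  daysPerMonth = 30,
  costPerQuery = 0.005,
): number {
  return users * queriesPerDay * daysPerMonth * costPerQuery;
}
// monthlyLLMCost(100) → 75, matching the "Small" row's $75
```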
### Cost per User per Month
| Scale | Cost/user/month |
|---|---|
| 100 users | $0.95 |
| 1,000 users | $0.80 |
| 10,000 users | $0.77 |
| 100,000 users | $0.77 |
Cost per user is nearly flat because LLM API costs dominate and scale linearly. Infrastructure becomes negligible at scale. The switch from Sonnet to Haiku reduced per-query costs by ~70% while maintaining 100% eval pass rate.
---
## Cost Optimization Strategies
**Implemented:**
- Switched from Sonnet to Claude 3.5 Haiku — 70% cost reduction with no eval quality loss
- Tool results are structured and minimal (only relevant fields returned to LLM, not raw API responses)
- System prompt is concise (~500 tokens) to minimize per-query overhead
- Single-agent architecture avoids multi-agent token multiplication
- Streaming responses reduce perceived latency
**Recommended for production:**
| Strategy | Estimated Savings | Complexity |
|---|---|---|
| Response caching (same portfolio, same question within 5 min) | 20-40% | Low |
| Prompt compression (shorter tool descriptions) | 10-15% | Low |
| Batch token optimization (combine related tool results) | 5-10% | Medium |
| Switch to open-source model (Llama 3 via OpenRouter) | 50-70% | Low (provider swap) |
**Most impactful:** Adding response caching could reduce costs by 20-40%, bringing the 10,000-user scenario from $7,700 to ~$4,500-6,000/month.
---
## Key Insight
At $0.005 per query and 5 queries/user/day, the per-user cost of under $1/month is extremely affordable for a premium feature. For Ghostfolio's self-hosted model where users provide their own API keys, this cost is negligible — roughly the price of a single coffee every three months for conversational access to portfolio analytics.