From 3a769c8f496504c49f4d3a7eeab603d6db6d8924 Mon Sep 17 00:00:00 2001 From: Ryan Waits Date: Sun, 1 Mar 2026 22:56:57 -0600 Subject: [PATCH] docs: add agent architecture, cost analysis, and pre-search documents Document agent architecture with request flow, tool inventory, verification pipeline, and deployment considerations. Add cost analysis and pre-search research documents. --- docs/architecture.md | 143 ++++++++++++++++++++++++++++++++++++++++++ docs/cost-analysis.md | 106 +++++++++++++++++++++++++++++++ docs/pre-search.md | 124 ++++++++++++++++++++++++++++++++++++ 3 files changed, 373 insertions(+) create mode 100644 docs/architecture.md create mode 100644 docs/cost-analysis.md create mode 100644 docs/pre-search.md diff --git a/docs/architecture.md b/docs/architecture.md new file mode 100644 index 000000000..dd60c65fa --- /dev/null +++ b/docs/architecture.md @@ -0,0 +1,143 @@ +# Agent Architecture Document + +## Domain & Use Cases + +**Domain**: Personal finance portfolio management (Ghostfolio fork) + +**Problems solved**: Ghostfolio's existing UI requires manual navigation across multiple pages to understand portfolio state. The agent provides a conversational interface that synthesizes holdings, performance, market data, and transaction history into coherent natural language — and can execute write operations (account management, activity logging, watchlist, tags) with native tool approval gates. + +**Target users**: Self-directed investors tracking multi-asset portfolios (stocks, ETFs, crypto) who want quick portfolio insights without clicking through dashboards. 
+ +## Agent Architecture + +### Framework & Stack + +| Layer | Choice | Rationale | +| --------------- | ------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- | +| Runtime | NestJS (TypeScript) | Native to Ghostfolio codebase | +| Agent framework | Vercel AI SDK v6 (`ToolLoopAgent`) | Native TS, streaming SSE, built-in tool dispatch | +| LLM | Claude Sonnet 4.6 (default) | Strong function calling, structured output, 200K context | +| Model options | Haiku 4.5 ($0.005/chat), Sonnet 4.6 ($0.015/chat), Opus 4.6 ($0.077/chat) | User-selectable per session | +| Schemas | Zod v4 | Required by AI SDK v6 `inputSchema` | +| Database | Prisma + Postgres | Shared with Ghostfolio, plus agent-specific tables | +| Cache warming | `warmPortfolioCache` helper | Redis + BullMQ (`PortfolioSnapshotService`) — ensures portfolio reads reflect recent writes | + +### Request Flow + +``` +User message + │ + ▼ +POST /api/v1/agent/chat (JWT auth) + Body: { messages: UIMessage[], toolHistory?, model?, approvedActions? 
} + │ + ▼ +ToolLoopAgent created → pipeAgentUIStreamToResponse() + │ + ├─► prepareStep() + │ ├─ Injects current date into system prompt + │ ├─ All 10 tools available from step 1 (activity_manage auto-resolves accountId) + │ └─ Loads contextual SKILL.md files based on tool history + │ + ├─► LLM reasoning → tool selection + │ └─ Up to 10 steps (stopWhen: stepCountIs(10)) + │ + ├─► Tool execution (try/catch per tool) + │ └─ Returns structured JSON to LLM for synthesis + │ + ├─► Approval gate (write tools only) + │ ├─ needsApproval() evaluates per invocation + │ ├─ Skips: list actions, previously-approved signatures, SKIP_APPROVAL env + │ └─ If required: stream pauses → client shows approval card → resumes on approve/deny + │ + ├─► Post-write cache warming (activity_manage, account_manage) + │ └─ warmPortfolioCache: clear Redis → drain stale jobs → enqueue HIGH priority → await (30s timeout) + │ + ├─► SSE stream to client (UIMessage protocol) + │ └─ Events: text-delta, tool-input-start, tool-input-available, tool-approval-request, tool-output-available, finish + │ + └─► onFinish callback + ├─ Verification pipeline (3 systems) + ├─ Metrics recording (in-memory + Postgres) + └─ Structured log: chat_complete +``` + +### Tool Design + +10 tools organized into read (6) and write (4) categories: + +| Tool | Type | Purpose | +| ----------------------- | ----- | ----------------------------------------------------------------- | +| `portfolio_analysis` | Read | Holdings, allocations, total value, account breakdown | +| `portfolio_performance` | Read | Returns, net performance, chart data (downsampled to ~20 points) | +| `holdings_lookup` | Read | Deep dive on single position: dividends, fees, sectors, countries | +| `market_data` | Read | Live quotes for 1-10 symbols (FMP + CoinGecko) | +| `symbol_search` | Read | Disambiguate crypto vs stock, find correct data source | +| `transaction_history` | Read | Buys, sells, dividends, fees, deposits, withdrawals | +| `account_manage` 
| Write | CRUD accounts + transfers between accounts | +| `activity_manage` | Write | CRUD transactions (BUY/SELL/DIVIDEND/FEE/INTEREST/LIABILITY) | +| `watchlist_manage` | Write | Add/remove/list watchlist items | +| `tag_manage` | Write | CRUD tags for transaction organization | + +**Auto-resolution**: `activity_manage` auto-resolves `accountId` when omitted on creates — matches accounts by asset type keywords (crypto → "crypto"/"wallet" accounts, stocks → "stock"/"brokerage" accounts) with fallback to highest-activity account. No tool gating; all 10 tools available from step 1. + +**Approval gates**: All 4 write tools define `needsApproval` — a function-based gate evaluated per invocation. Read-only actions (`list`) and previously-approved action signatures are auto-skipped. `SKIP_APPROVAL=true` env var disables all gates (used in evals). Action signatures follow the pattern `tool_name:action:identifier` (e.g., `activity_manage:create:AAPL`). + +| Tool | Approval Rule | +| ------------------ | -------------------------------------------------------------------------------- | +| `activity_manage` | `create` and `delete` only (not `update`), unless signature in `approvedActions` | +| `account_manage` | Everything except `list`, unless signature in `approvedActions` | +| `tag_manage` | Everything except `list` | +| `watchlist_manage` | Everything except `list` | + +**Cache warming**: After every write operation in `activity_manage` and `account_manage`, `warmPortfolioCache()` runs — clears stale Redis portfolio snapshots, drains in-flight BullMQ jobs, enqueues fresh computation at HIGH priority, and awaits completion with a 30s timeout. Ensures subsequent read tools return up-to-date portfolio data. Injected via `PortfolioSnapshotService`, `RedisCacheService`, and `UserService`. 
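As a rough sketch, the signature-based gate described above could look like the following; the type and function names (`WriteInvocation`, `buildSignature`, the standalone `needsApproval`) are illustrative assumptions, not the actual callbacks defined on the write tools:

```typescript
// Illustrative sketch of the signature-based approval gate. Names are
// hypothetical; the real logic lives in each write tool's needsApproval hook.

interface WriteInvocation {
  tool: string;        // e.g. "activity_manage"
  action: string;      // e.g. "create" | "update" | "delete" | "list"
  identifier?: string; // e.g. a ticker symbol or account name
}

// Signatures follow the documented pattern `tool_name:action:identifier`,
// e.g. `activity_manage:create:AAPL`.
function buildSignature({ tool, action, identifier }: WriteInvocation): string {
  return [tool, action, identifier ?? ""].join(":");
}

function needsApproval(
  invocation: WriteInvocation,
  approvedActions: string[],
  skipApproval: boolean // mirrors the SKIP_APPROVAL env flag used in evals
): boolean {
  if (skipApproval) return false;                 // eval mode: gates disabled
  if (invocation.action === "list") return false; // read-only actions auto-skip
  if (invocation.tool === "activity_manage" && invocation.action === "update") {
    return false; // per the table above, activity updates are not gated
  }
  // previously-approved signatures are auto-skipped
  return !approvedActions.includes(buildSignature(invocation));
}
```

Under this sketch, `needsApproval({ tool: "activity_manage", action: "create", identifier: "AAPL" }, [], false)` returns `true` on the first attempt and `false` once `"activity_manage:create:AAPL"` lands in `approvedActions`, which is what lets the stream resume after a single approval instead of re-prompting.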
+ +**Skill injection**: Contextual SKILL.md documents (transaction workflow, market data patterns) are injected into the system prompt based on which tools have been used, providing the LLM with domain-specific guidance without bloating every request. + +## Verification Strategy + +Three deterministic systems run on every response in the `onFinish` callback: + +| System | What It Checks | Weight in Composite | +| --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- | ------------------- | +| **Output Validation** | Min/max length, numeric data present when tools called, disclaimer on forward-looking language, write claims backed by write tool calls | 0.3 | +| **Hallucination Detection** | Ticker symbols in response exist in tool data, dollar amounts match within 5% or $1, holding claims reference actual tool results | 0.3 | +| **Confidence Score** | Composite 0-1: tool success rate (0.3), step efficiency (0.1), output validity (0.3), hallucination score (0.3) | — | + +**Why deterministic**: No additional LLM calls = zero latency overhead and zero cost. Cross-references response text against actual tool result data. Catches the most common financial hallucination patterns (phantom tickers, fabricated dollar amounts) with high precision. + +**Persistence**: Results stored on `AgentChatLog` (`verificationScore`, `verificationResult`) and exposed via `GET /agent/verification/:requestId`. + +## Eval Results + +**Framework**: evalite — dedicated eval runner with scoring, UI, and CI integration. 
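The deterministic GoldenCheck idea can be sketched framework-agnostically as a binary pass/fail over tool routing and forbidden content; the case shape and names below are illustrative assumptions, not evalite's actual scorer API:

```typescript
// Framework-agnostic sketch of a deterministic golden check in the spirit of
// the GoldenCheck scorer; the GoldenCase shape is hypothetical.

interface GoldenCase {
  expectedTools: string[];    // tools the agent must have called
  forbiddenPhrases: string[]; // content that must not appear in the reply
}

function goldenCheck(
  spec: GoldenCase,
  calledTools: string[],
  responseText: string
): boolean {
  const routingOk = spec.expectedTools.every((t) => calledTools.includes(t));
  const guardrailsOk = spec.forbiddenPhrases.every(
    (phrase) => !responseText.toLowerCase().includes(phrase.toLowerCase())
  );
  return routingOk && guardrailsOk; // binary pass/fail, no LLM judging
}
```

A case passes only when every expected tool was called and no forbidden phrase appears, which is what keeps the golden gate deterministic, reproducible in CI, and free of judge-model cost.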
+ +| Suite | Cases | Pass Rate | Scorers | +| ---------- | ------ | --------- | ---------------------------------------------------------------- | +| Golden Set | 19 | **100%** | GoldenCheck (deterministic binary) | +| Scenarios | 67 | **91%** | ToolCallAccuracy (F1), HasResponse, ResponseQuality (LLM-judged) | +| **Total** | **86** | | | + +**Category breakdown**: single-tool (10), multi-tool (10), ambiguous (6), account management (8), activity management (10), watchlist (4), tag (4), multi-step write (4), adversarial write (4), edge cases (7), golden routing (7), golden structural (4), golden behavioral (2), golden guardrails (4). + +**CI gate**: GitHub Actions runs golden set on every push to main. Threshold: 100%. Hits deployed Render instance. + +**Remaining 9% scenario gap**: Agent calls extra tools on ambiguous queries (thorough, not wrong). Write-operation approval is now handled by native `needsApproval` gates (bypassed in evals via `SKIP_APPROVAL=true`). + +## Observability Setup + +| Capability | Implementation | +| -------------------- | ------------------------------------------------------------------------------------------------------------------------------- | +| **Trace logging** | Structured JSON: `chat_start`, `step_finish` (tool names, token usage), `verification`, `chat_complete` | +| **Latency tracking** | Total `latencyMs` per request, persisted to Postgres | +| **Token usage** | prompt/completion/total per request, averaged in metrics summary | +| **Cost tracking** | Model-specific pricing (Sonnet $3/$15, Haiku $1/$5, Opus $15/$75 per MTok), `cost.totalUsd` and `cost.avgPerChatUsd` in metrics | +| **Error tracking** | Errors recorded with sanitized messages (DB URLs, API keys redacted) | +| **User feedback** | Thumbs up/down per message, unique per (requestId, userId), satisfaction rate in metrics | +| **Metrics endpoint** | `GET /agent/metrics?since=1h` — summary, feedback, recent chats | + +**Key insight**: Avg $0.015/chat with Sonnet 
4.6, ~5.9s latency, ~1.9 steps per query. Portfolio analysis and market data are the most-used tools. + +## Open Source Contribution + +**Eval dataset**: 86-case structured JSON dataset (`evals/dataset.json`) covering tool routing, multi-tool chaining, write operations, adversarial inputs, and edge cases for a financial AI agent. Each case includes input query, expected tools, expected behavioral assertions, and category labels. Published in the repository for reuse by other Ghostfolio agent implementations. diff --git a/docs/cost-analysis.md b/docs/cost-analysis.md new file mode 100644 index 000000000..b41906ba9 --- /dev/null +++ b/docs/cost-analysis.md @@ -0,0 +1,106 @@ +# AI Cost Analysis + +## Development & Testing Costs + +### Observed Usage (Local + Production) + +| Metric | Value | +| -------------------------- | --------------------- | +| Total chats logged | 89 | +| Total prompt tokens | 367,389 | +| Total completion tokens | 17,771 | +| Total tokens | 385,160 | +| Avg prompt tokens/chat | 4,128 | +| Avg completion tokens/chat | 200 | +| Avg total tokens/chat | 4,328 | +| Error rate | 9.0% (8/89) | +| Development period | Feb 26 – Feb 28, 2026 | + +### Development API Costs + +Primary model: Claude Sonnet 4.6 ($3/MTok input, $15/MTok output) + +| Component | Tokens | Cost | +| ------------------------ | ------- | --------- | +| Prompt tokens (367K) | 367,389 | $1.10 | +| Completion tokens (18K) | 17,771 | $0.27 | +| **Total dev/test spend** | 385,160 | **$1.37** | + +Additional costs: + +- Eval scoring (Haiku 4.5): ~86 cases x ~500 tokens/judgment = ~43K tokens = $0.05 +- Multiple eval runs during development: ~10 runs x $0.05 = $0.50 +- **Estimated total dev spend: ~$2.00** + +### Per-Chat Cost Breakdown + +| Model | Avg Input | Avg Output | Cost/Chat | +| -------------------- | --------- | ---------- | --------- | +| Sonnet 4.6 (default) | 4,128 tok | 200 tok | $0.0154 | +| Haiku 4.5 | 4,128 tok | 200 tok | $0.0051 | +| Opus 4.6 | 4,128 tok | 200 tok | 
$0.0769 |

## Production Cost Projections

### Assumptions

- **Queries per user per day**: 5 (portfolio check, performance, transactions, market data, misc)
- **Avg tokens per query**: 4,328 (4,128 input + 200 output) — observed from production data
- **Tool call frequency**: 1.9 tool calls/chat average (observed)
- **Verification overhead**: Negligible (deterministic, no extra LLM calls)
- **Cache warming overhead**: `warmPortfolioCache` runs after activity/account writes — Redis + BullMQ job, zero LLM tokens, up to 30s added latency per write operation
- **Model mix**: 90% Sonnet 4.6, 8% Haiku 4.5, 2% Opus 4.6
- **Blended cost per chat**: $0.0154 x 0.90 + $0.0051 x 0.08 + $0.0769 x 0.02 ≈ $0.0158

### Monthly Projections

| Scale | Users | Chats/Month | Token Volume | Monthly Cost |
| ------------- | ------- | ----------- | ------------ | ------------ |
| 100 users | 100 | 15,000 | 64.9M tokens | **$237** |
| 1,000 users | 1,000 | 150,000 | 649M tokens | **$2,370** |
| 10,000 users | 10,000 | 1,500,000 | 6.49B tokens | **$23,700** |
| 100,000 users | 100,000 | 15,000,000 | 64.9B tokens | **$237,000** |

### Cost Optimization Levers

| Strategy | Estimated Savings | Trade-off |
| --------------------------------------- | ----------------- | ------------------------------- |
| Default to Haiku 4.5 for simple queries | 60-70% | Slightly less nuanced responses |
| Prompt caching (repeated system prompt) | 30-40% | Requires API support |
| Response caching for market data | 10-20% | Staleness window |
| Reduce system prompt size | 15-25% | Less detailed agent behavior |

### Infrastructure Costs (Render)

| Service | Plan | Monthly Cost |
| -------------------- | -------- | ------------- |
| Web (Standard) | Standard | $25 |
| Redis (Standard) | Standard | $10 |
| Postgres (Basic 1GB) | Basic | $7 |
| **Total infra** | | **$42/month** |

### Total Cost of Ownership

| Scale | AI Cost | Infra | Total/Month | Cost/User/Month |
| ------------- | -------- | ------ | ----------- | --------------- |
| 100 users | $237 | $42 | $279 | $2.79 |
| 1,000 users | $2,370 | $42 | $2,412 | $2.41 |
| 10,000 users | $23,700 | $100\* | $23,800 | $2.38 |
| 100,000 users | $237,000 | $500\* | $237,500 | $2.38 |

\*Infrastructure scales with traffic — estimated for higher tiers.

### Real-Time Cost Tracking

Cost is tracked live via `GET /api/v1/agent/metrics`:

```json
{
  "cost": {
    "totalUsd": 0.0226,
    "avgPerChatUsd": 0.0226
  }
}
```

Computed per request using model-specific pricing (Sonnet/Haiku/Opus rates) applied to actual prompt and completion token counts. diff --git a/docs/pre-search.md b/docs/pre-search.md new file mode 100644 index 000000000..19f49439c --- /dev/null +++ b/docs/pre-search.md @@ -0,0 +1,124 @@ +# Pre-Search Document

Completed before writing agent code. Decisions informed all subsequent architecture choices.

## Phase 1: Define Your Constraints

### 1. Domain Selection

- **Domain**: Finance (Ghostfolio — open source portfolio tracker)
- **Use cases**: Portfolio analysis, performance tracking, market data lookups, transaction management, account CRUD, watchlist management, tagging
- **Verification requirements**: Hallucination detection (dollar amounts, ticker symbols), output validation, confidence scoring. Financial data must match tool results exactly.
- **Data sources**: Ghostfolio's existing Prisma/Postgres models, Financial Modeling Prep API (stocks/ETFs), CoinGecko API (crypto)

### 2. Scale & Performance

- **Expected query volume**: 100-1,000 chats/day during demo period
- **Acceptable latency**: <5s single-tool, <15s multi-step
- **Concurrent users**: ~10-50 simultaneous (Render Standard plan)
- **Cost constraints**: <$50/month for demo, Sonnet 4.6 at ~$0.015/chat is viable

### 3. Reliability Requirements

- **Cost of wrong answer**: Medium — incorrect portfolio values or transaction amounts erode trust. 
Not life-threatening (unlike healthcare), but financial data must be accurate. +- **Non-negotiable verification**: Dollar amounts in response must match tool data (within 5% or $1). Ticker symbols must reference actual holdings. +- **Human-in-the-loop**: Write operations (buy/sell/delete) require confirmation before execution. Agent asks clarifying questions rather than assuming. +- **Audit needs**: All chats logged to AgentChatLog with requestId, tokens, latency, verification scores. + +### 4. Team & Skill Constraints + +- **Agent frameworks**: Familiar with Vercel AI SDK, LangChain. Chose Vercel AI SDK for native TypeScript + NestJS integration. +- **Domain experience**: Familiar with portfolio management concepts. Ghostfolio codebase already provides all financial data models. +- **Eval/testing**: Experience with Jest/Vitest. Chose evalite for dedicated eval framework with UI and scoring. + +## Phase 2: Architecture Discovery + +### 5. Agent Framework Selection + +- **Choice**: Vercel AI SDK v6 (`ToolLoopAgent`) +- **Why not LangChain**: Ghostfolio is NestJS/TypeScript — Vercel AI SDK is native TS, lighter weight, no Python dependency. LangChain's JS version is less mature. +- **Why not custom**: Vercel AI SDK provides streaming, tool dispatch, step management out of the box. No need to reinvent. +- **Architecture**: Single agent with tool gating via `prepareStep`. Not multi-agent — single domain, single user context. +- **State management**: Conversation history passed per turn. Tool gating state tracked via `toolHistory` array across turns. + +### 6. 
LLM Selection + +- **Choice**: Claude Sonnet 4.6 (default), with Haiku 4.5 and Opus 4.6 selectable per request +- **Why Claude**: Strong function calling, excellent at structured financial output, good instruction following for system prompt rules +- **Context window**: Sonnet 4.6 has 200K context — more than sufficient for portfolio data + conversation history +- **Cost per query**: $0.015 avg with Sonnet — acceptable for demo and small-scale production + +### 7. Tool Design + +- **Tools built**: 10 total (6 read + 4 write) + - Read: portfolio_analysis, portfolio_performance, holdings_lookup, market_data, symbol_search, transaction_history + - Write: account_manage, activity_manage, watchlist_manage, tag_manage +- **External APIs**: Financial Modeling Prep (stocks/ETFs), CoinGecko (crypto) — both via Ghostfolio's existing data provider layer +- **Mock vs real**: Real data from seeded demo portfolio (AAPL, MSFT, VOO, GOOGL, BTC, TSLA, NVDA). No mocks in dev or eval. +- **Error handling**: Every tool wrapped in try/catch, returns structured error messages. Agent gracefully surfaces errors to user. + +### 8. Observability Strategy + +- **Choice**: Self-hosted structured JSON logging + Postgres persistence + in-memory metrics +- **Why not LangSmith/Braintrust**: Adds external dependency and cost. Structured logs + Postgres give full control and queryability. Metrics endpoint provides real-time dashboard. +- **Key metrics**: Latency, token usage (prompt/completion/total), cost per chat, tool usage frequency, error rate, verification scores +- **Real-time monitoring**: `GET /agent/metrics?since=1h` endpoint with summary + recent chats +- **Cost tracking**: Token counts × model-specific pricing computed in metrics summary + +### 9. 
Eval Approach + +- **Framework**: evalite (dedicated eval runner with UI, separate from unit tests) +- **Correctness**: GoldenCheck scorer — deterministic binary pass/fail on tool routing, output patterns, forbidden content +- **Quality**: ResponseQuality scorer — LLM-judged (Haiku 4.5) scoring relevance, data-groundedness, conciseness, formatting +- **Tool accuracy**: ToolCallAccuracy scorer — F1-style partial credit for expected vs actual tool calls +- **Ground truth**: Real API responses from seeded demo portfolio +- **CI integration**: GitHub Actions runs golden evals on push, threshold 100% + +### 10. Verification Design + +- **Claims verified**: Dollar amounts, ticker symbols, holding references +- **Fact-checking sources**: Tool result data (the agent's own tool calls serve as ground truth) +- **Confidence thresholds**: Composite 0-1 score (tool success rate 0.3, step efficiency 0.1, output validity 0.3, hallucination score 0.3) +- **Escalation triggers**: Low confidence score logged but no automated escalation (deterministic verification only) + +## Phase 3: Post-Stack Refinement + +### 11. Failure Mode Analysis + +- **Tool failures**: Try/catch per tool, error surfaced to user with suggestion to retry. Agent continues conversation. +- **Ambiguous queries**: Agent uses multiple tools when unsure (e.g., "How am I doing?" calls both performance and analysis). Over-fetching preferred over under-fetching. +- **Rate limiting**: No explicit rate limiting implemented. Render Standard plan handles moderate traffic. Would add Redis-based rate limiting at scale. +- **Graceful degradation**: If market data API fails, agent acknowledges and proceeds with available data. + +### 12. Security Considerations + +- **Prompt injection**: System prompt instructs agent to stay in role. Eval suite includes adversarial inputs (jailbreak, developer mode, system prompt extraction). +- **Data leakage**: Agent scoped to authenticated user's data only. JWT auth on all endpoints. 
+- **API key management**: All keys in environment variables, not committed. Error messages sanitized (DB URLs, API keys redacted). +- **Audit logging**: Every chat logged with requestId, userId, tools used, token counts, verification results. + +### 13. Testing Strategy + +- **Unit tests**: 16 tests for prepareStep tool gating logic +- **Eval tests**: 86 cases across golden (19) and scenarios (67) +- **Adversarial testing**: 11 cases covering prompt injection, jailbreak, resource abuse, system prompt extraction +- **Regression**: CI gate enforces 100% golden pass rate on every push + +### 14. Open Source Planning + +- **Release**: Eval dataset (86 cases) published as `evals/dataset.json` — structured JSON with input, expected tools, expected behavior, categories +- **License**: Follows Ghostfolio's existing AGPL-3.0 license +- **Documentation**: Agent README (280 lines), architecture doc, cost analysis, this pre-search document + +### 15. Deployment & Operations + +- **Hosting**: Render (Docker) — web + Redis + Postgres, Oregon region +- **CI/CD**: GitHub Actions for eval gate. Render auto-deploys from main branch. +- **Monitoring**: Structured JSON logs + `/agent/metrics` endpoint +- **Rollback**: Render provides instant rollback to previous deploy + +### 16. Iteration Planning + +- **User feedback**: Thumbs up/down per message, stored in AgentFeedback table, surfaced in metrics +- **Eval-driven improvement**: Failures in eval suite drive agent refinement (e.g., stale test cases updated when write tools added) +- **Future work**: Automated feedback → eval pipeline, LLM-judged hallucination detection, useChat migration if UI complexity grows