mirror of https://github.com/ghostfolio/ghostfolio

Document agent architecture with request flow, tool inventory, verification pipeline, and deployment considerations. Add cost analysis and pre-search research documents. (pull/6458/head)

3 changed files with 373 additions and 0 deletions
# Agent Architecture Document

## Domain & Use Cases

**Domain**: Personal finance portfolio management (Ghostfolio fork)

**Problems solved**: Ghostfolio's existing UI requires manual navigation across multiple pages to understand portfolio state. The agent provides a conversational interface that synthesizes holdings, performance, market data, and transaction history into coherent natural language — and can execute write operations (account management, activity logging, watchlist, tags) with native tool approval gates.

**Target users**: Self-directed investors tracking multi-asset portfolios (stocks, ETFs, crypto) who want quick portfolio insights without clicking through dashboards.

## Agent Architecture

### Framework & Stack

| Layer | Choice | Rationale |
| --- | --- | --- |
| Runtime | NestJS (TypeScript) | Native to Ghostfolio codebase |
| Agent framework | Vercel AI SDK v6 (`ToolLoopAgent`) | Native TS, streaming SSE, built-in tool dispatch |
| LLM | Claude Sonnet 4.6 (default) | Strong function calling, structured output, 200K context |
| Model options | Haiku 4.5 ($0.005/chat), Sonnet 4.6 ($0.015/chat), Opus 4.6 ($0.077/chat) | User-selectable per session |
| Schemas | Zod v4 | Required by AI SDK v6 `inputSchema` |
| Database | Prisma + Postgres | Shared with Ghostfolio, plus agent-specific tables |
| Cache warming | `warmPortfolioCache` helper | Redis + BullMQ (`PortfolioSnapshotService`) — ensures portfolio reads reflect recent writes |

### Request Flow

```
User message
  │
  ▼
POST /api/v1/agent/chat (JWT auth)
  Body: { messages: UIMessage[], toolHistory?, model?, approvedActions? }
  │
  ▼
ToolLoopAgent created → pipeAgentUIStreamToResponse()
  │
  ├─► prepareStep()
  │   ├─ Injects current date into system prompt
  │   ├─ All 10 tools available from step 1 (activity_manage auto-resolves accountId)
  │   └─ Loads contextual SKILL.md files based on tool history
  │
  ├─► LLM reasoning → tool selection
  │   └─ Up to 10 steps (stopWhen: stepCountIs(10))
  │
  ├─► Tool execution (try/catch per tool)
  │   └─ Returns structured JSON to LLM for synthesis
  │
  ├─► Approval gate (write tools only)
  │   ├─ needsApproval() evaluates per invocation
  │   ├─ Skips: list actions, previously-approved signatures, SKIP_APPROVAL env
  │   └─ If required: stream pauses → client shows approval card → resumes on approve/deny
  │
  ├─► Post-write cache warming (activity_manage, account_manage)
  │   └─ warmPortfolioCache: clear Redis → drain stale jobs → enqueue HIGH priority → await (30s timeout)
  │
  ├─► SSE stream to client (UIMessage protocol)
  │   └─ Events: text-delta, tool-input-start, tool-input-available, tool-approval-request, tool-output-available, finish
  │
  └─► onFinish callback
      ├─ Verification pipeline (3 systems)
      ├─ Metrics recording (in-memory + Postgres)
      └─ Structured log: chat_complete
```

### Tool Design

10 tools organized into read (6) and write (4) categories:

| Tool | Type | Purpose |
| --- | --- | --- |
| `portfolio_analysis` | Read | Holdings, allocations, total value, account breakdown |
| `portfolio_performance` | Read | Returns, net performance, chart data (downsampled to ~20 points) |
| `holdings_lookup` | Read | Deep dive on a single position: dividends, fees, sectors, countries |
| `market_data` | Read | Live quotes for 1-10 symbols (FMP + CoinGecko) |
| `symbol_search` | Read | Disambiguate crypto vs stock, find correct data source |
| `transaction_history` | Read | Buys, sells, dividends, fees, deposits, withdrawals |
| `account_manage` | Write | CRUD accounts + transfers between accounts |
| `activity_manage` | Write | CRUD transactions (BUY/SELL/DIVIDEND/FEE/INTEREST/LIABILITY) |
| `watchlist_manage` | Write | Add/remove/list watchlist items |
| `tag_manage` | Write | CRUD tags for transaction organization |

**Auto-resolution**: `activity_manage` auto-resolves `accountId` when omitted on creates — it matches accounts by asset type keywords (crypto → "crypto"/"wallet" accounts, stocks → "stock"/"brokerage" accounts) with fallback to the highest-activity account. No tool gating; all 10 tools are available from step 1.
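The auto-resolution heuristic can be sketched as follows. This is an illustrative reconstruction, not the actual implementation: the `Account` shape, keyword lists, and function name are assumptions.

```typescript
// Hypothetical sketch of accountId auto-resolution by asset-type keywords,
// with fallback to the account that has the most recorded activities.
interface Account {
  id: string;
  name: string;
  activityCount: number;
}

const KEYWORDS: Record<string, string[]> = {
  crypto: ["crypto", "wallet"],
  stock: ["stock", "brokerage"],
};

function resolveAccountId(
  assetType: "crypto" | "stock",
  accounts: Account[]
): string | undefined {
  if (accounts.length === 0) return undefined;
  const keywords = KEYWORDS[assetType] ?? [];
  // Prefer an account whose name mentions an asset-type keyword
  const match = accounts.find((a) =>
    keywords.some((k) => a.name.toLowerCase().includes(k))
  );
  if (match) return match.id;
  // Fallback: highest-activity account
  return [...accounts].sort((a, b) => b.activityCount - a.activityCount)[0].id;
}
```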

**Approval gates**: All 4 write tools define `needsApproval` — a function-based gate evaluated per invocation. Read-only actions (`list`) and previously-approved action signatures are auto-skipped. `SKIP_APPROVAL=true` env var disables all gates (used in evals). Action signatures follow the pattern `tool_name:action:identifier` (e.g., `activity_manage:create:AAPL`).

| Tool | Approval Rule |
| --- | --- |
| `activity_manage` | `create` and `delete` only (not `update`), unless signature in `approvedActions` |
| `account_manage` | Everything except `list`, unless signature in `approvedActions` |
| `tag_manage` | Everything except `list` |
| `watchlist_manage` | Everything except `list` |
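A minimal sketch of the signature-based gate described above. The types and the exact rule ordering are illustrative assumptions; only the signature pattern and the rules in the table come from this document.

```typescript
// Illustrative approval gate: skip list actions, previously-approved
// signatures, and (for activity_manage) updates; gate everything else.
type Invocation = { tool: string; action: string; identifier?: string };

function actionSignature(inv: Invocation): string {
  // Pattern from the doc: tool_name:action:identifier
  return [inv.tool, inv.action, inv.identifier ?? ""].join(":");
}

function needsApproval(
  inv: Invocation,
  approvedActions: Set<string>,
  skipApproval = false
): boolean {
  if (skipApproval) return false; // SKIP_APPROVAL=true (evals)
  if (inv.action === "list") return false; // read-only actions never gated
  // activity_manage gates create/delete but not update
  if (inv.tool === "activity_manage" && inv.action === "update") return false;
  return !approvedActions.has(actionSignature(inv));
}
```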

**Cache warming**: After every write operation in `activity_manage` and `account_manage`, `warmPortfolioCache()` runs — it clears stale Redis portfolio snapshots, drains in-flight BullMQ jobs, enqueues fresh computation at HIGH priority, and awaits completion with a 30s timeout. This ensures subsequent read tools return up-to-date portfolio data. Injected via `PortfolioSnapshotService`, `RedisCacheService`, and `UserService`.
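The warming sequence can be sketched like this. The `SnapshotQueue` interface is a stand-in for the real `RedisCacheService`/BullMQ wiring, which this document does not show; only the clear → drain → enqueue → await-with-timeout order comes from the text.

```typescript
// Sketch of the post-write cache-warming sequence with a 30s timeout.
interface SnapshotQueue {
  clearCachedSnapshot(userId: string): Promise<void>;
  drainStaleJobs(userId: string): Promise<void>;
  // Resolves once the fresh snapshot has been computed
  enqueueHighPriority(userId: string): Promise<void>;
}

async function warmPortfolioCache(
  queue: SnapshotQueue,
  userId: string,
  timeoutMs = 30_000
): Promise<boolean> {
  await queue.clearCachedSnapshot(userId); // drop stale Redis snapshot
  await queue.drainStaleJobs(userId); // cancel in-flight recomputations
  const timeout = new Promise<false>((resolve) =>
    setTimeout(() => resolve(false), timeoutMs)
  );
  const done = queue.enqueueHighPriority(userId).then(() => true as const);
  return Promise.race([done, timeout]); // true = warmed, false = timed out
}
```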

**Skill injection**: Contextual SKILL.md documents (transaction workflow, market data patterns) are injected into the system prompt based on which tools have been used, providing the LLM with domain-specific guidance without bloating every request.
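The selection logic amounts to a lookup from tool names to skill files. The mapping and file paths below are hypothetical examples; the document only states that skills are chosen from tool history.

```typescript
// Illustrative tool-history → SKILL.md selection (paths are assumptions).
const SKILLS: Record<string, string> = {
  activity_manage: "skills/transactions/SKILL.md",
  market_data: "skills/market-data/SKILL.md",
  symbol_search: "skills/market-data/SKILL.md",
};

function selectSkills(toolHistory: string[]): string[] {
  // Deduplicate so each skill file is injected at most once per request
  const files = toolHistory
    .map((tool) => SKILLS[tool])
    .filter((f): f is string => Boolean(f));
  return [...new Set(files)];
}
```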

## Verification Strategy

Three deterministic systems run on every response in the `onFinish` callback:

| System | What It Checks | Weight in Composite |
| --- | --- | --- |
| **Output Validation** | Min/max length, numeric data present when tools called, disclaimer on forward-looking language, write claims backed by write tool calls | 0.3 |
| **Hallucination Detection** | Ticker symbols in response exist in tool data, dollar amounts match within 5% or $1, holding claims reference actual tool results | 0.3 |
| **Confidence Score** | Composite 0-1: tool success rate (0.3), step efficiency (0.1), output validity (0.3), hallucination score (0.3) | — |

**Why deterministic**: No additional LLM calls means zero latency overhead and zero cost. Verification cross-references response text against actual tool result data, catching the most common financial hallucination patterns (phantom tickers, fabricated dollar amounts) with high precision.
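The core checks reduce to simple comparisons against tool output. The tolerances and weights below are taken from the table above; the function names and shapes are illustrative.

```typescript
// Dollar amounts must match tool data within $1 or 5%.
function amountMatches(claimed: number, actual: number): boolean {
  const diff = Math.abs(claimed - actual);
  return diff <= 1 || diff <= Math.abs(actual) * 0.05;
}

// Every ticker mentioned in the response must appear in tool results.
function tickersGrounded(
  responseTickers: string[],
  toolTickers: Set<string>
): boolean {
  return responseTickers.every((t) => toolTickers.has(t));
}

// Composite 0-1 confidence with the documented weights.
function confidenceScore(
  toolSuccessRate: number,
  stepEfficiency: number,
  outputValidity: number,
  hallucinationScore: number
): number {
  return (
    0.3 * toolSuccessRate +
    0.1 * stepEfficiency +
    0.3 * outputValidity +
    0.3 * hallucinationScore
  );
}
```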

**Persistence**: Results stored on `AgentChatLog` (`verificationScore`, `verificationResult`) and exposed via `GET /agent/verification/:requestId`.

## Eval Results

**Framework**: evalite — dedicated eval runner with scoring, UI, and CI integration.

| Suite | Cases | Pass Rate | Scorers |
| --- | --- | --- | --- |
| Golden Set | 19 | **100%** | GoldenCheck (deterministic binary) |
| Scenarios | 67 | **91%** | ToolCallAccuracy (F1), HasResponse, ResponseQuality (LLM-judged) |
| **Total** | **86** | **~93%** | |

**Category breakdown**: single-tool (10), multi-tool (10), ambiguous (6), account management (8), activity management (10), watchlist (4), tag (4), multi-step write (4), adversarial write (4), edge cases (7), golden routing (7), golden structural (4), golden behavioral (2), golden guardrails (4).

**CI gate**: GitHub Actions runs the golden set on every push to main, with a 100% pass threshold, against the deployed Render instance.

**Remaining 9% scenario gap**: The agent calls extra tools on ambiguous queries (thorough, not wrong). Write-operation approval is now handled by native `needsApproval` gates (bypassed in evals via `SKIP_APPROVAL=true`).

## Observability Setup

| Capability | Implementation |
| --- | --- |
| **Trace logging** | Structured JSON: `chat_start`, `step_finish` (tool names, token usage), `verification`, `chat_complete` |
| **Latency tracking** | Total `latencyMs` per request, persisted to Postgres |
| **Token usage** | prompt/completion/total per request, averaged in metrics summary |
| **Cost tracking** | Model-specific pricing (Sonnet $3/$15, Haiku $1/$5, Opus $15/$75 per MTok), `cost.totalUsd` and `cost.avgPerChatUsd` in metrics |
| **Error tracking** | Errors recorded with sanitized messages (DB URLs, API keys redacted) |
| **User feedback** | Thumbs up/down per message, unique per (requestId, userId), satisfaction rate in metrics |
| **Metrics endpoint** | `GET /agent/metrics?since=1h` — summary, feedback, recent chats |
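The error-message sanitization row above can be sketched as a couple of redaction passes. The exact patterns are assumptions about what "DB URLs, API keys redacted" covers, not the project's actual rules.

```typescript
// Hypothetical sanitizer: redact connection strings and key-like tokens
// before an error message is logged or persisted.
function sanitizeError(message: string): string {
  return message
    .replace(/postgres(ql)?:\/\/\S+/gi, "[REDACTED_DB_URL]")
    .replace(/redis:\/\/\S+/gi, "[REDACTED_DB_URL]")
    .replace(/\b(sk|key|token)[-_][A-Za-z0-9_-]{8,}\b/gi, "[REDACTED_KEY]");
}
```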

**Key insight**: Avg $0.015/chat with Sonnet 4.6, ~5.9s latency, ~1.9 steps per query. Portfolio analysis and market data are the most-used tools.

## Open Source Contribution

**Eval dataset**: 86-case structured JSON dataset (`evals/dataset.json`) covering tool routing, multi-tool chaining, write operations, adversarial inputs, and edge cases for a financial AI agent. Each case includes input query, expected tools, expected behavioral assertions, and category labels. Published in the repository for reuse by other Ghostfolio agent implementations.

# AI Cost Analysis

## Development & Testing Costs

### Observed Usage (Local + Production)

| Metric | Value |
| --- | --- |
| Total chats logged | 89 |
| Total prompt tokens | 367,389 |
| Total completion tokens | 17,771 |
| Total tokens | 385,160 |
| Avg prompt tokens/chat | 4,128 |
| Avg completion tokens/chat | 200 |
| Avg total tokens/chat | 4,328 |
| Error rate | 9.0% (8/89) |
| Development period | Feb 26 – Feb 28, 2026 |

### Development API Costs

Primary model: Claude Sonnet 4.6 ($3/MTok input, $15/MTok output)

| Component | Tokens | Cost |
| --- | --- | --- |
| Prompt tokens | 367,389 | $1.10 |
| Completion tokens | 17,771 | $0.27 |
| **Total dev/test spend** | 385,160 | **$1.37** |

Additional costs:

- Eval scoring (Haiku 4.5): ~86 cases × ~500 tokens/judgment = ~43K tokens ≈ $0.05
- Multiple eval runs during development: ~10 runs × $0.05 = $0.50
- **Estimated total dev spend: ~$2.00**

### Per-Chat Cost Breakdown

| Model | Avg Input | Avg Output | Cost/Chat |
| --- | --- | --- | --- |
| Sonnet 4.6 (default) | 4,128 tok | 200 tok | $0.0154 |
| Haiku 4.5 | 4,128 tok | 200 tok | $0.0051 |
| Opus 4.6 | 4,128 tok | 200 tok | $0.0769 |
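The per-chat figures follow directly from the per-MTok prices listed in the observability section. A worked sketch (model keys are illustrative labels, not API identifiers):

```typescript
// Per-chat cost = (input tokens × input price + output tokens × output price) / 1M.
// Prices in USD per million tokens, as listed in this document.
const PRICING: Record<string, { input: number; output: number }> = {
  "sonnet-4.6": { input: 3, output: 15 },
  "haiku-4.5": { input: 1, output: 5 },
  "opus-4.6": { input: 15, output: 75 },
};

function costPerChat(
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  const p = PRICING[model];
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}

// Blended cost under a 90% Sonnet / 8% Haiku / 2% Opus mix.
function blendedCost(inputTokens: number, outputTokens: number): number {
  return (
    0.9 * costPerChat("sonnet-4.6", inputTokens, outputTokens) +
    0.08 * costPerChat("haiku-4.5", inputTokens, outputTokens) +
    0.02 * costPerChat("opus-4.6", inputTokens, outputTokens)
  );
}
```

With the observed 4,128 input / 200 output tokens, `costPerChat` reproduces the table values ($0.0154, $0.0051, $0.0769), and the blend comes out near $0.016/chat.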

## Production Cost Projections

### Assumptions

- **Queries per user per day**: 5 (portfolio check, performance, transactions, market data, misc)
- **Avg tokens per query**: 4,328 (4,128 input + 200 output) — observed from production data
- **Tool call frequency**: 1.9 tool calls/chat average (observed)
- **Verification overhead**: Negligible (deterministic, no extra LLM calls)
- **Cache warming overhead**: `warmPortfolioCache` runs after activity/account writes — a Redis + BullMQ job costing zero LLM tokens but adding up to 30s of latency per write operation
- **Model mix**: 90% Sonnet 4.6, 8% Haiku 4.5, 2% Opus 4.6
- **Blended cost per chat**: $0.0154 × 0.90 + $0.0051 × 0.08 + $0.0769 × 0.02 ≈ $0.0158

### Monthly Projections

| Scale | Users | Chats/Month | Token Volume | Monthly Cost |
| --- | --- | --- | --- | --- |
| 100 users | 100 | 15,000 | 64.9M tokens | **$235** |
| 1,000 users | 1,000 | 150,000 | 649M tokens | **$2,355** |
| 10,000 users | 10,000 | 1,500,000 | 6.49B tokens | **$23,550** |
| 100,000 users | 100,000 | 15,000,000 | 64.9B tokens | **$235,500** |

### Cost Optimization Levers

| Strategy | Estimated Savings | Trade-off |
| --- | --- | --- |
| Default to Haiku 4.5 for simple queries | 60-70% | Slightly less nuanced responses |
| Prompt caching (repeated system prompt) | 30-40% | Requires API support |
| Response caching for market data | 10-20% | Staleness window |
| Reduce system prompt size | 15-25% | Less detailed agent behavior |

### Infrastructure Costs (Render)

| Service | Plan | Monthly Cost |
| --- | --- | --- |
| Web | Standard | $25 |
| Redis | Standard | $10 |
| Postgres | Basic 1GB | $7 |
| **Total infra** | | **$42/month** |

### Total Cost of Ownership

| Scale | AI Cost | Infra | Total/Month | Cost/User/Month |
| --- | --- | --- | --- | --- |
| 100 users | $235 | $42 | $277 | $2.77 |
| 1,000 users | $2,355 | $42 | $2,397 | $2.40 |
| 10,000 users | $23,550 | $100\* | $23,650 | $2.37 |
| 100,000 users | $235,500 | $500\* | $236,000 | $2.36 |

\*Infrastructure scales with traffic — estimated for higher tiers.

### Real-Time Cost Tracking

Cost is tracked live via `GET /api/v1/agent/metrics`:

```json
{
  "cost": {
    "totalUsd": 0.0226,
    "avgPerChatUsd": 0.0226
  }
}
```

Computed per request using model-specific pricing (Sonnet/Haiku/Opus rates) applied to actual prompt and completion token counts.

# Pre-Search Document

Completed before writing agent code. Decisions informed all subsequent architecture choices.

## Phase 1: Define Your Constraints

### 1. Domain Selection

- **Domain**: Finance (Ghostfolio — open source portfolio tracker)
- **Use cases**: Portfolio analysis, performance tracking, market data lookups, transaction management, account CRUD, watchlist management, tagging
- **Verification requirements**: Hallucination detection (dollar amounts, ticker symbols), output validation, confidence scoring. Financial data must match tool results exactly.
- **Data sources**: Ghostfolio's existing Prisma/Postgres models, Financial Modeling Prep API (stocks/ETFs), CoinGecko API (crypto)

### 2. Scale & Performance

- **Expected query volume**: 100-1,000 chats/day during demo period
- **Acceptable latency**: <5s single-tool, <15s multi-step
- **Concurrent users**: ~10-50 simultaneous (Render Standard plan)
- **Cost constraints**: <$50/month for demo; Sonnet 4.6 at ~$0.015/chat is viable

### 3. Reliability Requirements

- **Cost of wrong answer**: Medium — incorrect portfolio values or transaction amounts erode trust. Not life-threatening (unlike healthcare), but financial data must be accurate.
- **Non-negotiable verification**: Dollar amounts in response must match tool data (within 5% or $1). Ticker symbols must reference actual holdings.
- **Human-in-the-loop**: Write operations (buy/sell/delete) require confirmation before execution. Agent asks clarifying questions rather than assuming.
- **Audit needs**: All chats logged to AgentChatLog with requestId, tokens, latency, verification scores.

### 4. Team & Skill Constraints

- **Agent frameworks**: Familiar with Vercel AI SDK, LangChain. Chose Vercel AI SDK for native TypeScript + NestJS integration.
- **Domain experience**: Familiar with portfolio management concepts. Ghostfolio codebase already provides all financial data models.
- **Eval/testing**: Experience with Jest/Vitest. Chose evalite for a dedicated eval framework with UI and scoring.

## Phase 2: Architecture Discovery

### 5. Agent Framework Selection

- **Choice**: Vercel AI SDK v6 (`ToolLoopAgent`)
- **Why not LangChain**: Ghostfolio is NestJS/TypeScript — Vercel AI SDK is native TS, lighter weight, no Python dependency. LangChain's JS version is less mature.
- **Why not custom**: Vercel AI SDK provides streaming, tool dispatch, step management out of the box. No need to reinvent.
- **Architecture**: Single agent with tool gating via `prepareStep`. Not multi-agent — single domain, single user context.
- **State management**: Conversation history passed per turn. Tool gating state tracked via `toolHistory` array across turns.

### 6. LLM Selection

- **Choice**: Claude Sonnet 4.6 (default), with Haiku 4.5 and Opus 4.6 selectable per request
- **Why Claude**: Strong function calling, excellent at structured financial output, good instruction following for system prompt rules
- **Context window**: Sonnet 4.6 has 200K context — more than sufficient for portfolio data + conversation history
- **Cost per query**: $0.015 avg with Sonnet — acceptable for demo and small-scale production

### 7. Tool Design

- **Tools built**: 10 total (6 read + 4 write)
  - Read: portfolio_analysis, portfolio_performance, holdings_lookup, market_data, symbol_search, transaction_history
  - Write: account_manage, activity_manage, watchlist_manage, tag_manage
- **External APIs**: Financial Modeling Prep (stocks/ETFs), CoinGecko (crypto) — both via Ghostfolio's existing data provider layer
- **Mock vs real**: Real data from seeded demo portfolio (AAPL, MSFT, VOO, GOOGL, BTC, TSLA, NVDA). No mocks in dev or eval.
- **Error handling**: Every tool wrapped in try/catch, returns structured error messages. Agent gracefully surfaces errors to user.

### 8. Observability Strategy

- **Choice**: Self-hosted structured JSON logging + Postgres persistence + in-memory metrics
- **Why not LangSmith/Braintrust**: Adds external dependency and cost. Structured logs + Postgres give full control and queryability. Metrics endpoint provides a real-time dashboard.
- **Key metrics**: Latency, token usage (prompt/completion/total), cost per chat, tool usage frequency, error rate, verification scores
- **Real-time monitoring**: `GET /agent/metrics?since=1h` endpoint with summary + recent chats
- **Cost tracking**: Token counts × model-specific pricing computed in metrics summary

### 9. Eval Approach

- **Framework**: evalite (dedicated eval runner with UI, separate from unit tests)
- **Correctness**: GoldenCheck scorer — deterministic binary pass/fail on tool routing, output patterns, forbidden content
- **Quality**: ResponseQuality scorer — LLM-judged (Haiku 4.5) scoring relevance, data-groundedness, conciseness, formatting
- **Tool accuracy**: ToolCallAccuracy scorer — F1-style partial credit for expected vs actual tool calls
- **Ground truth**: Real API responses from seeded demo portfolio
- **CI integration**: GitHub Actions runs golden evals on push, threshold 100%
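An F1-style tool-call scorer of the kind described above can be sketched as follows; the actual evalite scorer implementation may differ.

```typescript
// F1 partial credit for expected vs actual tool calls:
// precision = hits / called, recall = hits / expected, F1 = harmonic mean.
function toolCallF1(expected: string[], actual: string[]): number {
  if (expected.length === 0 && actual.length === 0) return 1;
  const exp = new Set(expected);
  const truePositives = actual.filter((t) => exp.has(t)).length;
  if (truePositives === 0) return 0;
  const precision = truePositives / actual.length;
  const recall = truePositives / expected.length;
  return (2 * precision * recall) / (precision + recall);
}
```

Calling an extra tool lowers precision but not recall, which matches the observation elsewhere in this document that over-fetching on ambiguous queries costs score without being outright wrong.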

### 10. Verification Design

- **Claims verified**: Dollar amounts, ticker symbols, holding references
- **Fact-checking sources**: Tool result data (the agent's own tool calls serve as ground truth)
- **Confidence thresholds**: Composite 0-1 score (tool success rate 0.3, step efficiency 0.1, output validity 0.3, hallucination score 0.3)
- **Escalation triggers**: Low confidence score logged but no automated escalation (deterministic verification only)

## Phase 3: Post-Stack Refinement

### 11. Failure Mode Analysis

- **Tool failures**: Try/catch per tool, error surfaced to user with a suggestion to retry. Agent continues the conversation.
- **Ambiguous queries**: Agent uses multiple tools when unsure (e.g., "How am I doing?" calls both performance and analysis). Over-fetching preferred over under-fetching.
- **Rate limiting**: No explicit rate limiting implemented. Render Standard plan handles moderate traffic. Would add Redis-based rate limiting at scale.
- **Graceful degradation**: If the market data API fails, the agent acknowledges it and proceeds with available data.

### 12. Security Considerations

- **Prompt injection**: System prompt instructs the agent to stay in role. Eval suite includes adversarial inputs (jailbreak, developer mode, system prompt extraction).
- **Data leakage**: Agent scoped to the authenticated user's data only. JWT auth on all endpoints.
- **API key management**: All keys in environment variables, not committed. Error messages sanitized (DB URLs, API keys redacted).
- **Audit logging**: Every chat logged with requestId, userId, tools used, token counts, verification results.

### 13. Testing Strategy

- **Unit tests**: 16 tests for prepareStep tool gating logic
- **Eval tests**: 86 cases across golden (19) and scenarios (67)
- **Adversarial testing**: 11 cases covering prompt injection, jailbreak, resource abuse, system prompt extraction
- **Regression**: CI gate enforces 100% golden pass rate on every push

### 14. Open Source Planning

- **Release**: Eval dataset (86 cases) published as `evals/dataset.json` — structured JSON with input, expected tools, expected behavior, categories
- **License**: Follows Ghostfolio's existing AGPL-3.0 license
- **Documentation**: Agent README (280 lines), architecture doc, cost analysis, this pre-search document

### 15. Deployment & Operations

- **Hosting**: Render (Docker) — web + Redis + Postgres, Oregon region
- **CI/CD**: GitHub Actions for eval gate. Render auto-deploys from main branch.
- **Monitoring**: Structured JSON logs + `/agent/metrics` endpoint
- **Rollback**: Render provides instant rollback to the previous deploy

### 16. Iteration Planning

- **User feedback**: Thumbs up/down per message, stored in the AgentFeedback table, surfaced in metrics
- **Eval-driven improvement**: Failures in the eval suite drive agent refinement (e.g., stale test cases updated when write tools were added)
- **Future work**: Automated feedback → eval pipeline, LLM-judged hallucination detection, useChat migration if UI complexity grows