
docs: add agent architecture, cost analysis, and pre-search documents

Document agent architecture with request flow, tool inventory,
verification pipeline, and deployment considerations.
Add cost analysis and pre-search research documents.
Ryan Waits · 4 weeks ago · commit 3a769c8f49 (pull/6458/head)
Files changed:
1. docs/architecture.md (+143)
2. docs/cost-analysis.md (+106)
3. docs/pre-search.md (+124)

docs/architecture.md (new file, +143 lines)

# Agent Architecture Document
## Domain & Use Cases
**Domain**: Personal finance portfolio management (Ghostfolio fork)
**Problems solved**: Ghostfolio's existing UI requires manual navigation across multiple pages to understand portfolio state. The agent provides a conversational interface that synthesizes holdings, performance, market data, and transaction history into coherent natural language — and can execute write operations (account management, activity logging, watchlist, tags) with native tool approval gates.
**Target users**: Self-directed investors tracking multi-asset portfolios (stocks, ETFs, crypto) who want quick portfolio insights without clicking through dashboards.
## Agent Architecture
### Framework & Stack
| Layer | Choice | Rationale |
| --------------- | ------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- |
| Runtime | NestJS (TypeScript) | Native to Ghostfolio codebase |
| Agent framework | Vercel AI SDK v6 (`ToolLoopAgent`) | Native TS, streaming SSE, built-in tool dispatch |
| LLM | Claude Sonnet 4.6 (default) | Strong function calling, structured output, 200K context |
| Model options | Haiku 4.5 ($0.005/chat), Sonnet 4.6 ($0.015/chat), Opus 4.6 ($0.077/chat) | User-selectable per session |
| Schemas | Zod v4 | Required by AI SDK v6 `inputSchema` |
| Database | Prisma + Postgres | Shared with Ghostfolio, plus agent-specific tables |
| Cache warming | `warmPortfolioCache` helper | Redis + BullMQ (`PortfolioSnapshotService`) — ensures portfolio reads reflect recent writes |
### Request Flow
```
User message
POST /api/v1/agent/chat (JWT auth)
Body: { messages: UIMessage[], toolHistory?, model?, approvedActions? }
ToolLoopAgent created → pipeAgentUIStreamToResponse()
├─► prepareStep()
│ ├─ Injects current date into system prompt
│ ├─ All 10 tools available from step 1 (activity_manage auto-resolves accountId)
│ └─ Loads contextual SKILL.md files based on tool history
├─► LLM reasoning → tool selection
│ └─ Up to 10 steps (stopWhen: stepCountIs(10))
├─► Tool execution (try/catch per tool)
│ └─ Returns structured JSON to LLM for synthesis
├─► Approval gate (write tools only)
│ ├─ needsApproval() evaluates per invocation
│ ├─ Skips: list actions, previously-approved signatures, SKIP_APPROVAL env
│ └─ If required: stream pauses → client shows approval card → resumes on approve/deny
├─► Post-write cache warming (activity_manage, account_manage)
│ └─ warmPortfolioCache: clear Redis → drain stale jobs → enqueue HIGH priority → await (30s timeout)
├─► SSE stream to client (UIMessage protocol)
│ └─ Events: text-delta, tool-input-start, tool-input-available, tool-approval-request, tool-output-available, finish
└─► onFinish callback
├─ Verification pipeline (3 systems)
├─ Metrics recording (in-memory + Postgres)
└─ Structured log: chat_complete
```
### Tool Design
10 tools organized into read (6) and write (4) categories:
| Tool | Type | Purpose |
| ----------------------- | ----- | ----------------------------------------------------------------- |
| `portfolio_analysis` | Read | Holdings, allocations, total value, account breakdown |
| `portfolio_performance` | Read | Returns, net performance, chart data (downsampled to ~20 points) |
| `holdings_lookup` | Read | Deep dive on single position: dividends, fees, sectors, countries |
| `market_data` | Read | Live quotes for 1-10 symbols (FMP + CoinGecko) |
| `symbol_search` | Read | Disambiguate crypto vs stock, find correct data source |
| `transaction_history` | Read | Buys, sells, dividends, fees, deposits, withdrawals |
| `account_manage` | Write | CRUD accounts + transfers between accounts |
| `activity_manage` | Write | CRUD transactions (BUY/SELL/DIVIDEND/FEE/INTEREST/LIABILITY) |
| `watchlist_manage` | Write | Add/remove/list watchlist items |
| `tag_manage` | Write | CRUD tags for transaction organization |
**Auto-resolution**: `activity_manage` auto-resolves `accountId` when omitted on creates — matches accounts by asset type keywords (crypto → "crypto"/"wallet" accounts, stocks → "stock"/"brokerage" accounts) with fallback to highest-activity account. No tool gating; all 10 tools available from step 1.
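The auto-resolution behavior can be sketched as follows. This is an illustrative reconstruction, not the actual implementation; the `Account` shape and function name are assumptions.

```typescript
interface Account {
  id: string;
  name: string;
  activityCount: number;
}

// Keyword lists per asset type, as described above
const ASSET_KEYWORDS: Record<"crypto" | "stock", string[]> = {
  crypto: ["crypto", "wallet"],
  stock: ["stock", "brokerage"],
};

function resolveAccountId(
  accounts: Account[],
  assetType: "crypto" | "stock"
): string | undefined {
  if (accounts.length === 0) return undefined;
  const keywords = ASSET_KEYWORDS[assetType];
  // First pass: match account names against asset-type keywords
  const match = accounts.find((a) =>
    keywords.some((k) => a.name.toLowerCase().includes(k))
  );
  if (match) return match.id;
  // Fallback: the account with the most recorded activities
  return [...accounts].sort((a, b) => b.activityCount - a.activityCount)[0].id;
}
```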
**Approval gates**: All 4 write tools define `needsApproval` — a function-based gate evaluated per invocation. Read-only actions (`list`) and previously-approved action signatures are auto-skipped. `SKIP_APPROVAL=true` env var disables all gates (used in evals). Action signatures follow the pattern `tool_name:action:identifier` (e.g., `activity_manage:create:AAPL`).
| Tool | Approval Rule |
| ------------------ | -------------------------------------------------------------------------------- |
| `activity_manage` | `create` and `delete` only (not `update`), unless signature in `approvedActions` |
| `account_manage` | Everything except `list`, unless signature in `approvedActions` |
| `tag_manage` | Everything except `list` |
| `watchlist_manage` | Everything except `list` |
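A minimal sketch of the signature-based gate described above, assuming a simplified `ToolCall` shape (the real `needsApproval` hooks live inside the AI SDK tool definitions):

```typescript
interface ToolCall {
  tool: string;
  action: string;
  identifier?: string;
}

// Signature pattern from the doc: tool_name:action:identifier
function actionSignature(call: ToolCall): string {
  return `${call.tool}:${call.action}:${call.identifier ?? ""}`;
}

function needsApproval(
  call: ToolCall,
  approvedActions: Set<string>,
  skipApproval = false
): boolean {
  if (skipApproval) return false; // SKIP_APPROVAL=true (used in evals)
  if (call.action === "list") return false; // read-only actions skip the gate
  // activity_manage gates only create/delete, not update
  if (call.tool === "activity_manage" && call.action === "update") return false;
  return !approvedActions.has(actionSignature(call));
}
```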
**Cache warming**: After every write operation in `activity_manage` and `account_manage`, `warmPortfolioCache()` runs — clears stale Redis portfolio snapshots, drains in-flight BullMQ jobs, enqueues fresh computation at HIGH priority, and awaits completion with a 30s timeout. Ensures subsequent read tools return up-to-date portfolio data. Injected via `PortfolioSnapshotService`, `RedisCacheService`, and `UserService`.
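The "await with a 30s timeout" step can be sketched with a generic helper (the Redis/BullMQ orchestration itself is not shown; `withTimeout` is an illustrative name, not the actual code):

```typescript
// Race a promise against a deadline; rejects if the deadline fires first.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`cache warm timed out after ${ms}ms`)),
      ms
    );
  });
  // Clear the timer either way so the process is not kept alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

In this sketch, `warmPortfolioCache` would call something like `withTimeout(snapshotJob.waitUntilFinished(queueEvents), 30_000)` so a slow snapshot computation cannot stall the chat stream indefinitely.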
**Skill injection**: Contextual SKILL.md documents (transaction workflow, market data patterns) are injected into the system prompt based on which tools have been used, providing the LLM with domain-specific guidance without bloating every request.
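The selection logic amounts to a tool-to-document lookup over the session's tool history. A sketch, with a hypothetical mapping (the actual skill paths are assumptions):

```typescript
// Hypothetical tool → SKILL.md mapping; real paths live in the repository.
const SKILL_DOCS: Record<string, string> = {
  activity_manage: "skills/transaction-workflow/SKILL.md",
  market_data: "skills/market-data-patterns/SKILL.md",
};

// Return the distinct skill documents relevant to the tools used so far.
function selectSkills(toolHistory: string[]): string[] {
  const docs = new Set<string>();
  for (const tool of toolHistory) {
    const doc = SKILL_DOCS[tool];
    if (doc) docs.add(doc);
  }
  return [...docs];
}
```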
## Verification Strategy
Three deterministic systems run on every response in the `onFinish` callback:
| System | What It Checks | Weight in Composite |
| --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- | ------------------- |
| **Output Validation** | Min/max length, numeric data present when tools called, disclaimer on forward-looking language, write claims backed by write tool calls | 0.3 |
| **Hallucination Detection** | Ticker symbols in response exist in tool data, dollar amounts match within 5% or $1, holding claims reference actual tool results | 0.3 |
| **Confidence Score** | Composite 0-1: tool success rate (0.3), step efficiency (0.1), output validity (0.3), hallucination score (0.3) | — |
**Why deterministic**: No additional LLM calls = zero latency overhead and zero cost. Cross-references response text against actual tool result data. Catches the most common financial hallucination patterns (phantom tickers, fabricated dollar amounts) with high precision.
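Two of the checks above reduce to simple arithmetic, sketched here with the documented tolerances and weights (function names are illustrative):

```typescript
// "Within 5% or $1" tolerance from the hallucination-detection row
function amountMatches(claimed: number, actual: number): boolean {
  const diff = Math.abs(claimed - actual);
  return diff <= 1 || diff / Math.abs(actual) <= 0.05;
}

// Composite 0-1 confidence score with the documented weights
function confidenceScore(parts: {
  toolSuccessRate: number; // weight 0.3
  stepEfficiency: number; // weight 0.1
  outputValidity: number; // weight 0.3
  hallucinationScore: number; // weight 0.3
}): number {
  return (
    0.3 * parts.toolSuccessRate +
    0.1 * parts.stepEfficiency +
    0.3 * parts.outputValidity +
    0.3 * parts.hallucinationScore
  );
}
```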
**Persistence**: Results stored on `AgentChatLog` (`verificationScore`, `verificationResult`) and exposed via `GET /agent/verification/:requestId`.
## Eval Results
**Framework**: evalite — dedicated eval runner with scoring, UI, and CI integration.
| Suite | Cases | Pass Rate | Scorers |
| ---------- | ------ | --------- | ---------------------------------------------------------------- |
| Golden Set | 19 | **100%** | GoldenCheck (deterministic binary) |
| Scenarios | 67 | **91%** | ToolCallAccuracy (F1), HasResponse, ResponseQuality (LLM-judged) |
| **Total** | **86** | | |
**Category breakdown**: single-tool (10), multi-tool (10), ambiguous (6), account management (8), activity management (10), watchlist (4), tag (4), multi-step write (4), adversarial write (4), edge cases (7), golden routing (7), golden structural (4), golden behavioral (2), golden guardrails (4).
**CI gate**: GitHub Actions runs golden set on every push to main. Threshold: 100%. Hits deployed Render instance.
**Remaining 9% scenario gap**: Agent calls extra tools on ambiguous queries (thorough, not wrong). Write-operation approval is now handled by native `needsApproval` gates (bypassed in evals via `SKIP_APPROVAL=true`).
## Observability Setup
| Capability | Implementation |
| -------------------- | ------------------------------------------------------------------------------------------------------------------------------- |
| **Trace logging** | Structured JSON: `chat_start`, `step_finish` (tool names, token usage), `verification`, `chat_complete` |
| **Latency tracking** | Total `latencyMs` per request, persisted to Postgres |
| **Token usage** | prompt/completion/total per request, averaged in metrics summary |
| **Cost tracking** | Model-specific pricing (Sonnet $3/$15, Haiku $1/$5, Opus $15/$75 per MTok), `cost.totalUsd` and `cost.avgPerChatUsd` in metrics |
| **Error tracking** | Errors recorded with sanitized messages (DB URLs, API keys redacted) |
| **User feedback** | Thumbs up/down per message, unique per (requestId, userId), satisfaction rate in metrics |
| **Metrics endpoint** | `GET /agent/metrics?since=1h` — summary, feedback, recent chats |
**Key insight**: Avg $0.015/chat with Sonnet 4.6, ~5.9s latency, ~1.9 steps per query. Portfolio analysis and market data are the most-used tools.
## Open Source Contribution
**Eval dataset**: 86-case structured JSON dataset (`evals/dataset.json`) covering tool routing, multi-tool chaining, write operations, adversarial inputs, and edge cases for a financial AI agent. Each case includes input query, expected tools, expected behavioral assertions, and category labels. Published in the repository for reuse by other Ghostfolio agent implementations.

docs/cost-analysis.md (new file, +106 lines)

# AI Cost Analysis
## Development & Testing Costs
### Observed Usage (Local + Production)
| Metric | Value |
| -------------------------- | --------------------- |
| Total chats logged | 89 |
| Total prompt tokens | 367,389 |
| Total completion tokens | 17,771 |
| Total tokens | 385,160 |
| Avg prompt tokens/chat | 4,128 |
| Avg completion tokens/chat | 200 |
| Avg total tokens/chat | 4,328 |
| Error rate | 9.0% (8/89) |
| Development period | Feb 26 – Feb 28, 2026 |
### Development API Costs
Primary model: Claude Sonnet 4.6 ($3/MTok input, $15/MTok output)
| Component | Tokens | Cost |
| ------------------------ | ------- | --------- |
| Prompt tokens (367K) | 367,389 | $1.10 |
| Completion tokens (18K) | 17,771 | $0.27 |
| **Total dev/test spend** | 385,160 | **$1.37** |
Additional costs:
- Eval scoring (Haiku 4.5): ~86 cases x ~500 tokens/judgment = ~43K tokens = $0.05
- Multiple eval runs during development: ~10 runs x $0.05 = $0.50
- **Estimated total dev spend: ~$2.00**
### Per-Chat Cost Breakdown
| Model | Avg Input | Avg Output | Cost/Chat |
| -------------------- | --------- | ---------- | --------- |
| Sonnet 4.6 (default) | 4,128 tok | 200 tok | $0.0154 |
| Haiku 4.5 | 4,128 tok | 200 tok | $0.0051 |
| Opus 4.6 | 4,128 tok | 200 tok | $0.0769 |
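The per-chat figures above follow directly from the per-MTok pricing. A sketch of the arithmetic (the `PRICING` keys are illustrative labels, not API model identifiers):

```typescript
// Prices in USD per million tokens, as listed elsewhere in these docs
const PRICING: Record<string, { input: number; output: number }> = {
  "sonnet-4.6": { input: 3, output: 15 },
  "haiku-4.5": { input: 1, output: 5 },
  "opus-4.6": { input: 15, output: 75 },
};

function costPerChat(
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  const p = PRICING[model];
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}
```

For example, `costPerChat("sonnet-4.6", 4128, 200)` works out to (4,128 × $3 + 200 × $15) / 1M ≈ $0.0154, matching the table.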
## Production Cost Projections
### Assumptions
- **Queries per user per day**: 5 (portfolio check, performance, transactions, market data, misc)
- **Avg tokens per query**: 4,328 (4,128 input + 200 output) — observed from production data
- **Tool call frequency**: 1.9 tool calls/chat average (observed)
- **Verification overhead**: Negligible (deterministic, no extra LLM calls)
- **Cache warming overhead**: `warmPortfolioCache` runs after activity/account writes — Redis + BullMQ job, zero LLM tokens, up to 30s added latency per write operation
- **Model mix**: 90% Sonnet 4.6, 8% Haiku 4.5, 2% Opus 4.6
- **Blended cost per chat**: $0.0154 × 0.90 + $0.0051 × 0.08 + $0.0769 × 0.02 ≈ $0.0158
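The blended-rate arithmetic can be checked directly; note the weighted sum comes out to ≈ $0.0158 per chat:

```typescript
// Model mix and per-chat costs from the assumptions above
const MODEL_MIX = [
  { costPerChat: 0.0154, share: 0.9 }, // Sonnet 4.6
  { costPerChat: 0.0051, share: 0.08 }, // Haiku 4.5
  { costPerChat: 0.0769, share: 0.02 }, // Opus 4.6
];

// Weighted average across the mix
const blendedCostPerChat = MODEL_MIX.reduce(
  (sum, m) => sum + m.costPerChat * m.share,
  0
); // ≈ 0.0158 USD per chat
```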
### Monthly Projections
| Scale | Users | Chats/Month | Token Volume | Monthly Cost |
| ------------- | ------- | ----------- | ------------ | ------------ |
| 100 users | 100 | 15,000 | 64.9M tokens | **$237** |
| 1,000 users | 1,000 | 150,000 | 649M tokens | **$2,370** |
| 10,000 users | 10,000 | 1,500,000 | 6.49B tokens | **$23,700** |
| 100,000 users | 100,000 | 15,000,000 | 64.9B tokens | **$237,000** |
### Cost Optimization Levers
| Strategy | Estimated Savings | Trade-off |
| --------------------------------------- | ----------------- | ------------------------------- |
| Default to Haiku 4.5 for simple queries | 60-70% | Slightly less nuanced responses |
| Prompt caching (repeated system prompt) | 30-40% | Requires API support |
| Response caching for market data | 10-20% | Staleness window |
| Reduce system prompt size | 15-25% | Less detailed agent behavior |
### Infrastructure Costs (Render)
| Service | Plan | Monthly Cost |
| -------------------- | -------- | ------------- |
| Web (Standard) | Standard | $25 |
| Redis (Standard) | Standard | $10 |
| Postgres (Basic 1GB) | Basic | $7 |
| **Total infra** | | **$42/month** |
### Total Cost of Ownership
| Scale | AI Cost | Infra | Total/Month | Cost/User/Month |
| ------------- | -------- | ------ | ----------- | --------------- |
| 100 users | $237 | $42 | $279 | $2.79 |
| 1,000 users | $2,370 | $42 | $2,412 | $2.41 |
| 10,000 users | $23,700 | $100\* | $23,800 | $2.38 |
| 100,000 users | $237,000 | $500\* | $237,500 | $2.38 |
\*Infrastructure scales with traffic — estimated for higher tiers.
### Real-Time Cost Tracking
Cost is tracked live via `GET /api/v1/agent/metrics`:
```json
{
"cost": {
"totalUsd": 0.0226,
"avgPerChatUsd": 0.0226
}
}
```
Computed per request using model-specific pricing (Sonnet/Haiku/Opus rates) applied to actual prompt and completion token counts.

docs/pre-search.md (new file, +124 lines)

# Pre-Search Document
Completed before writing agent code. Decisions informed all subsequent architecture choices.
## Phase 1: Define Your Constraints
### 1. Domain Selection
- **Domain**: Finance (Ghostfolio — open source portfolio tracker)
- **Use cases**: Portfolio analysis, performance tracking, market data lookups, transaction management, account CRUD, watchlist management, tagging
- **Verification requirements**: Hallucination detection (dollar amounts, ticker symbols), output validation, confidence scoring. Financial data must match tool results exactly.
- **Data sources**: Ghostfolio's existing Prisma/Postgres models, Financial Modeling Prep API (stocks/ETFs), CoinGecko API (crypto)
### 2. Scale & Performance
- **Expected query volume**: 100-1,000 chats/day during demo period
- **Acceptable latency**: <5s single-tool, <15s multi-step
- **Concurrent users**: ~10-50 simultaneous (Render Standard plan)
- **Cost constraints**: <$50/month for demo, Sonnet 4.6 at ~$0.015/chat is viable
### 3. Reliability Requirements
- **Cost of wrong answer**: Medium — incorrect portfolio values or transaction amounts erode trust. Not life-threatening (unlike healthcare), but financial data must be accurate.
- **Non-negotiable verification**: Dollar amounts in response must match tool data (within 5% or $1). Ticker symbols must reference actual holdings.
- **Human-in-the-loop**: Write operations (buy/sell/delete) require confirmation before execution. Agent asks clarifying questions rather than assuming.
- **Audit needs**: All chats logged to AgentChatLog with requestId, tokens, latency, verification scores.
### 4. Team & Skill Constraints
- **Agent frameworks**: Familiar with Vercel AI SDK, LangChain. Chose Vercel AI SDK for native TypeScript + NestJS integration.
- **Domain experience**: Familiar with portfolio management concepts. Ghostfolio codebase already provides all financial data models.
- **Eval/testing**: Experience with Jest/Vitest. Chose evalite for dedicated eval framework with UI and scoring.
## Phase 2: Architecture Discovery
### 5. Agent Framework Selection
- **Choice**: Vercel AI SDK v6 (`ToolLoopAgent`)
- **Why not LangChain**: Ghostfolio is NestJS/TypeScript — Vercel AI SDK is native TS, lighter weight, no Python dependency. LangChain's JS version is less mature.
- **Why not custom**: Vercel AI SDK provides streaming, tool dispatch, step management out of the box. No need to reinvent.
- **Architecture**: Single agent with tool gating via `prepareStep`. Not multi-agent — single domain, single user context.
- **State management**: Conversation history passed per turn. Tool gating state tracked via `toolHistory` array across turns.
### 6. LLM Selection
- **Choice**: Claude Sonnet 4.6 (default), with Haiku 4.5 and Opus 4.6 selectable per request
- **Why Claude**: Strong function calling, excellent at structured financial output, good instruction following for system prompt rules
- **Context window**: Sonnet 4.6 has 200K context — more than sufficient for portfolio data + conversation history
- **Cost per query**: $0.015 avg with Sonnet — acceptable for demo and small-scale production
### 7. Tool Design
- **Tools built**: 10 total (6 read + 4 write)
- Read: portfolio_analysis, portfolio_performance, holdings_lookup, market_data, symbol_search, transaction_history
- Write: account_manage, activity_manage, watchlist_manage, tag_manage
- **External APIs**: Financial Modeling Prep (stocks/ETFs), CoinGecko (crypto) — both via Ghostfolio's existing data provider layer
- **Mock vs real**: Real data from seeded demo portfolio (AAPL, MSFT, VOO, GOOGL, BTC, TSLA, NVDA). No mocks in dev or eval.
- **Error handling**: Every tool wrapped in try/catch, returns structured error messages. Agent gracefully surfaces errors to user.
### 8. Observability Strategy
- **Choice**: Self-hosted structured JSON logging + Postgres persistence + in-memory metrics
- **Why not LangSmith/Braintrust**: Adds external dependency and cost. Structured logs + Postgres give full control and queryability. Metrics endpoint provides real-time dashboard.
- **Key metrics**: Latency, token usage (prompt/completion/total), cost per chat, tool usage frequency, error rate, verification scores
- **Real-time monitoring**: `GET /agent/metrics?since=1h` endpoint with summary + recent chats
- **Cost tracking**: Token counts × model-specific pricing computed in metrics summary
### 9. Eval Approach
- **Framework**: evalite (dedicated eval runner with UI, separate from unit tests)
- **Correctness**: GoldenCheck scorer — deterministic binary pass/fail on tool routing, output patterns, forbidden content
- **Quality**: ResponseQuality scorer — LLM-judged (Haiku 4.5) scoring relevance, data-groundedness, conciseness, formatting
- **Tool accuracy**: ToolCallAccuracy scorer — F1-style partial credit for expected vs actual tool calls
- **Ground truth**: Real API responses from seeded demo portfolio
- **CI integration**: GitHub Actions runs golden evals on push, threshold 100%
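The F1-style partial credit described for ToolCallAccuracy can be sketched as follows (an illustrative scorer over tool names, not evalite's actual API):

```typescript
// Harmonic mean of precision and recall over expected vs. actual tool names.
function toolCallF1(expected: string[], actual: string[]): number {
  if (expected.length === 0 && actual.length === 0) return 1;
  const expectedSet = new Set(expected);
  const truePositives = actual.filter((t) => expectedSet.has(t)).length;
  if (truePositives === 0) return 0;
  const precision = truePositives / actual.length;
  const recall = truePositives / expected.length;
  return (2 * precision * recall) / (precision + recall);
}
```

This gives full credit for an exact match, zero for a complete miss, and partial credit when the agent calls extra tools, which is why "thorough, not wrong" over-fetching shows up as a sub-100% scenario score rather than a hard failure.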
### 10. Verification Design
- **Claims verified**: Dollar amounts, ticker symbols, holding references
- **Fact-checking sources**: Tool result data (the agent's own tool calls serve as ground truth)
- **Confidence thresholds**: Composite 0-1 score (tool success rate 0.3, step efficiency 0.1, output validity 0.3, hallucination score 0.3)
- **Escalation triggers**: Low confidence score logged but no automated escalation (deterministic verification only)
## Phase 3: Post-Stack Refinement
### 11. Failure Mode Analysis
- **Tool failures**: Try/catch per tool, error surfaced to user with suggestion to retry. Agent continues conversation.
- **Ambiguous queries**: Agent uses multiple tools when unsure (e.g., "How am I doing?" calls both performance and analysis). Over-fetching preferred over under-fetching.
- **Rate limiting**: No explicit rate limiting implemented. Render Standard plan handles moderate traffic. Would add Redis-based rate limiting at scale.
- **Graceful degradation**: If market data API fails, agent acknowledges and proceeds with available data.
### 12. Security Considerations
- **Prompt injection**: System prompt instructs agent to stay in role. Eval suite includes adversarial inputs (jailbreak, developer mode, system prompt extraction).
- **Data leakage**: Agent scoped to authenticated user's data only. JWT auth on all endpoints.
- **API key management**: All keys in environment variables, not committed. Error messages sanitized (DB URLs, API keys redacted).
- **Audit logging**: Every chat logged with requestId, userId, tools used, token counts, verification results.
### 13. Testing Strategy
- **Unit tests**: 16 tests for prepareStep tool gating logic
- **Eval tests**: 86 cases across golden (19) and scenarios (67)
- **Adversarial testing**: 11 cases covering prompt injection, jailbreak, resource abuse, system prompt extraction
- **Regression**: CI gate enforces 100% golden pass rate on every push
### 14. Open Source Planning
- **Release**: Eval dataset (86 cases) published as `evals/dataset.json` — structured JSON with input, expected tools, expected behavior, categories
- **License**: Follows Ghostfolio's existing AGPL-3.0 license
- **Documentation**: Agent README (280 lines), architecture doc, cost analysis, this pre-search document
### 15. Deployment & Operations
- **Hosting**: Render (Docker) — web + Redis + Postgres, Oregon region
- **CI/CD**: GitHub Actions for eval gate. Render auto-deploys from main branch.
- **Monitoring**: Structured JSON logs + `/agent/metrics` endpoint
- **Rollback**: Render provides instant rollback to previous deploy
### 16. Iteration Planning
- **User feedback**: Thumbs up/down per message, stored in AgentFeedback table, surfaced in metrics
- **Eval-driven improvement**: Failures in eval suite drive agent refinement (e.g., stale test cases updated when write tools added)
- **Future work**: Automated feedback → eval pipeline, LLM-judged hallucination detection, useChat migration if UI complexity grows