Agent Architecture Document

Domain & Use Cases

Domain: Personal finance portfolio management (Ghostfolio fork)

Problems solved: Ghostfolio's existing UI requires manual navigation across multiple pages to understand portfolio state. The agent provides a conversational interface that synthesizes holdings, performance, market data, and transaction history into coherent natural language — and can execute write operations (account management, activity logging, watchlist, tags) with native tool approval gates.

Target users: Self-directed investors tracking multi-asset portfolios (stocks, ETFs, crypto) who want quick portfolio insights without clicking through dashboards.

Agent Architecture

Framework & Stack

| Layer | Choice | Rationale |
| --- | --- | --- |
| Runtime | NestJS (TypeScript) | Native to Ghostfolio codebase |
| Agent framework | Vercel AI SDK v6 (`ToolLoopAgent`) | Native TS, streaming SSE, built-in tool dispatch |
| LLM | Claude Sonnet 4.6 (default) | Strong function calling, structured output, 200K context |
| Model options | Haiku 4.5 ($0.005/chat), Sonnet 4.6 ($0.015/chat), Opus 4.6 ($0.077/chat) | User-selectable per session |
| Schemas | Zod v4 | Required by AI SDK v6 `inputSchema` |
| Database | Prisma + Postgres | Shared with Ghostfolio, plus agent-specific tables |
| Cache warming | `warmPortfolioCache` helper | Redis + BullMQ (`PortfolioSnapshotService`); ensures portfolio reads reflect recent writes |

Request Flow

User message
    │
    ▼
POST /api/v1/agent/chat (JWT auth)
    Body: { messages: UIMessage[], toolHistory?, model?, approvedActions? }
    │
    ▼
ToolLoopAgent created → pipeAgentUIStreamToResponse()
    │
    ├─► prepareStep()
    │     ├─ Injects current date into system prompt
    │     ├─ All 10 tools available from step 1 (activity_manage auto-resolves accountId)
    │     └─ Loads contextual SKILL.md files based on tool history
    │
    ├─► LLM reasoning → tool selection
    │     └─ Up to 10 steps (stopWhen: stepCountIs(10))
    │
    ├─► Tool execution (try/catch per tool)
    │     └─ Returns structured JSON to LLM for synthesis
    │
    ├─► Approval gate (write tools only)
    │     ├─ needsApproval() evaluates per invocation
    │     ├─ Skips: list actions, previously-approved signatures, SKIP_APPROVAL env
    │     └─ If required: stream pauses → client shows approval card → resumes on approve/deny
    │
    ├─► Post-write cache warming (activity_manage, account_manage)
    │     └─ warmPortfolioCache: clear Redis → drain stale jobs → enqueue HIGH priority → await (30s timeout)
    │
    ├─► SSE stream to client (UIMessage protocol)
    │     └─ Events: text-delta, tool-input-start, tool-input-available, tool-approval-request, tool-output-available, finish
    │
    └─► onFinish callback
          ├─ Verification pipeline (3 systems)
          ├─ Metrics recording (in-memory + Postgres)
          └─ Structured log: chat_complete

Tool Design

10 tools organized into read (6) and write (4) categories:

| Tool | Type | Purpose |
| --- | --- | --- |
| `portfolio_analysis` | Read | Holdings, allocations, total value, account breakdown |
| `portfolio_performance` | Read | Returns, net performance, chart data (downsampled to ~20 points) |
| `holdings_lookup` | Read | Deep dive on single position: dividends, fees, sectors, countries |
| `market_data` | Read | Live quotes for 1-10 symbols (FMP + CoinGecko) |
| `symbol_search` | Read | Disambiguate crypto vs stock, find correct data source |
| `transaction_history` | Read | Buys, sells, dividends, fees, deposits, withdrawals |
| `account_manage` | Write | CRUD accounts + transfers between accounts |
| `activity_manage` | Write | CRUD transactions (BUY/SELL/DIVIDEND/FEE/INTEREST/LIABILITY) |
| `watchlist_manage` | Write | Add/remove/list watchlist items |
| `tag_manage` | Write | CRUD tags for transaction organization |
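
The `portfolio_performance` row mentions downsampling chart data to about 20 points. One simple way to do that is evenly spaced index sampling that always keeps the first and last points. The sketch below assumes that approach; `ChartPoint` and `downsample` are illustrative names, not taken from the fork's code.

```typescript
// Hypothetical shape of one chart sample; the field names are assumptions.
interface ChartPoint {
  date: string;
  value: number;
}

// Evenly spaced index sampling that always keeps the first and last point.
// Series that already fit within maxPoints are returned as a shallow copy.
function downsample<T>(points: T[], maxPoints = 20): T[] {
  if (maxPoints < 2 || points.length <= maxPoints) {
    return points.slice();
  }
  const step = (points.length - 1) / (maxPoints - 1);
  const sampled: T[] = [];
  for (let i = 0; i < maxPoints; i++) {
    sampled.push(points[Math.round(i * step)]);
  }
  return sampled;
}
```

Keeping the endpoints means the start and end values of the selected period survive the reduction, which matters when the LLM summarizes a return over that period.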

Auto-resolution: activity_manage auto-resolves accountId when it is omitted on creates. It matches accounts by asset-type keywords (crypto → "crypto"/"wallet" accounts, stocks → "stock"/"brokerage" accounts) and falls back to the highest-activity account. There is no tool gating: all 10 tools are available from step 1.
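
The auto-resolution rule can be sketched as a keyword match on account names followed by a fallback. This is a minimal illustration, assuming the match is a case-insensitive substring test on the account name and that "highest-activity" means the largest activity count; `AccountSummary` and `resolveAccountId` are hypothetical names.

```typescript
type AssetKind = "crypto" | "stock";

// Hypothetical account view; the real tool reads Ghostfolio account records.
interface AccountSummary {
  id: string;
  name: string;
  activityCount: number;
}

// Keywords taken from the description above.
const ACCOUNT_KEYWORDS: Record<AssetKind, string[]> = {
  crypto: ["crypto", "wallet"],
  stock: ["stock", "brokerage"],
};

function resolveAccountId(
  kind: AssetKind,
  accounts: AccountSummary[],
): string | undefined {
  const keywords = ACCOUNT_KEYWORDS[kind];
  // First preference: an account whose name contains a matching keyword.
  const byName = accounts.find((account) =>
    keywords.some((keyword) => account.name.toLowerCase().includes(keyword)),
  );
  if (byName) {
    return byName.id;
  }
  // Fallback: the account with the most recorded activities.
  let busiest: AccountSummary | undefined;
  for (const account of accounts) {
    if (!busiest || account.activityCount > busiest.activityCount) {
      busiest = account;
    }
  }
  return busiest ? busiest.id : undefined;
}
```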

Approval gates: All 4 write tools define needsApproval — a function-based gate evaluated per invocation. Read-only actions (list) and previously-approved action signatures are auto-skipped. SKIP_APPROVAL=true env var disables all gates (used in evals). Action signatures follow the pattern tool_name:action:identifier (e.g., activity_manage:create:AAPL).

| Tool | Approval Rule |
| --- | --- |
| `activity_manage` | `create` and `delete` only (not `update`), unless signature in `approvedActions` |
| `account_manage` | Everything except `list`, unless signature in `approvedActions` |
| `tag_manage` | Everything except `list` |
| `watchlist_manage` | Everything except `list` |
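
The approval rules above reduce to a small pure decision function. The sketch below is not the SDK's `needsApproval` callback itself; it is a standalone illustration that assumes the approved-signature check applies to every write tool and that the `SKIP_APPROVAL` flag is passed in as a boolean. `ApprovalContext` and `requiresApproval` are hypothetical names.

```typescript
type WriteTool =
  | "activity_manage"
  | "account_manage"
  | "tag_manage"
  | "watchlist_manage";

interface ApprovalContext {
  approvedActions: string[]; // signatures already approved by the user
  skipApproval: boolean; // mirrors SKIP_APPROVAL=true
}

// Signature pattern from the text: tool_name:action:identifier
function actionSignature(tool: WriteTool, action: string, identifier: string): string {
  return `${tool}:${action}:${identifier}`;
}

function requiresApproval(
  tool: WriteTool,
  action: string,
  identifier: string,
  ctx: ApprovalContext,
): boolean {
  if (ctx.skipApproval) return false; // all gates disabled (evals)
  if (action === "list") return false; // read-only action
  // activity_manage gates only create and delete, not update.
  if (tool === "activity_manage" && action !== "create" && action !== "delete") {
    return false;
  }
  // Previously approved signatures are not asked again.
  return !ctx.approvedActions.includes(actionSignature(tool, action, identifier));
}
```

Evaluating the gate per invocation, rather than per tool, is what lets a `list` call pass straight through while a `delete` on the same tool pauses the stream.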

Cache warming: After every write operation in activity_manage and account_manage, warmPortfolioCache() runs: it clears stale Redis portfolio snapshots, drains in-flight BullMQ jobs, enqueues a fresh computation at HIGH priority, and awaits completion with a 30s timeout. This ensures that subsequent read tools return up-to-date portfolio data. The helper depends on the injected PortfolioSnapshotService, RedisCacheService, and UserService.
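
The clear, drain, enqueue, await sequence can be sketched against a small dependency interface. The `WarmDeps` methods below are hypothetical stand-ins and do not reflect the real method names on the Ghostfolio services; the sketch only shows the ordering and the bounded wait.

```typescript
// Hypothetical dependency surface; the real services expose different methods.
interface WarmDeps {
  clearSnapshots(userId: string): Promise<void>;
  drainStaleJobs(userId: string): Promise<void>;
  // Assumed to resolve once the high-priority snapshot job has completed.
  computeSnapshot(userId: string): Promise<void>;
}

// Resolves with the work's value, or with "timeout" if it takes longer than ms.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T | "timeout"> {
  return new Promise<T | "timeout">((resolve, reject) => {
    const timer = setTimeout(() => resolve("timeout"), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

async function warmPortfolioCache(
  deps: WarmDeps,
  userId: string,
  timeoutMs = 30000,
): Promise<"warmed" | "timeout"> {
  await deps.clearSnapshots(userId); // 1. drop stale cached snapshots
  await deps.drainStaleJobs(userId); // 2. remove in-flight jobs built on old data
  const job = deps.computeSnapshot(userId).then(() => "warmed" as const);
  return withTimeout(job, timeoutMs); // 3-4. enqueue, then wait with a bound
}
```

Returning `"timeout"` instead of throwing is a design choice in this sketch: the write itself has already succeeded, so a slow recomputation arguably should not fail the tool call.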

Skill injection: Contextual SKILL.md documents (transaction workflow, market data patterns) are injected into the system prompt based on which tools have been used, providing the LLM with domain-specific guidance without bloating every request.
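
A minimal sketch of how skill selection from tool history might look, together with the date injection done in prepareStep. The tool-to-skill mapping and the file paths are assumptions made for illustration; only the idea of loading SKILL.md documents based on tool history comes from this document.

```typescript
// Hypothetical mapping from tool names to skill documents; paths are illustrative.
const SKILLS_BY_TOOL = new Map<string, string>([
  ["activity_manage", "skills/transaction-workflow/SKILL.md"],
  ["market_data", "skills/market-data/SKILL.md"],
  ["symbol_search", "skills/market-data/SKILL.md"],
]);

// Returns each relevant skill path once, in first-use order.
function selectSkills(toolHistory: string[]): string[] {
  const selected = new Set<string>();
  for (const tool of toolHistory) {
    const skill = SKILLS_BY_TOOL.get(tool);
    if (skill !== undefined) {
      selected.add(skill);
    }
  }
  return Array.from(selected);
}

// Assembles the system prompt: base text, current date, then loaded skill bodies.
function buildSystemPrompt(base: string, today: string, skillDocs: string[]): string {
  return [base, `Current date: ${today}`, ...skillDocs].join("\n\n");
}
```

A request that has only used read tools without a mapped skill gets the base prompt plus the date, which keeps the prompt short for simple queries.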

Verification Strategy

Three deterministic systems run on every response in the onFinish callback:

| System | What It Checks | Weight in Composite |
| --- | --- | --- |
| Output Validation | Min/max length, numeric data present when tools called, disclaimer on forward-looking language, write claims backed by write tool calls | 0.3 |
| Hallucination Detection | Ticker symbols in response exist in tool data, dollar amounts match within 5% or $1, holding claims reference actual tool results | 0.3 |
| Confidence Score | Composite 0-1: tool success rate (0.3), step efficiency (0.1), output validity (0.3), hallucination score (0.3) | n/a |
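
The composite in the last row is a weighted sum of four sub-scores. The sketch below uses the weights from the table and assumes every input is already normalized to 0-1 with higher meaning better (so a hallucination score of 1 means no hallucination signals were found). How each sub-score is derived is not shown here.

```typescript
interface ConfidenceInputs {
  toolSuccessRate: number; // successful tool calls / total tool calls
  stepEfficiency: number; // assumed normalized to 0-1 by the caller
  outputValidity: number; // output validation score
  hallucinationScore: number; // assumed: 1 = no hallucination signals
}

// Weights from the table above; they sum to 1.
const CONFIDENCE_WEIGHTS: ConfidenceInputs = {
  toolSuccessRate: 0.3,
  stepEfficiency: 0.1,
  outputValidity: 0.3,
  hallucinationScore: 0.3,
};

function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

function confidenceScore(inputs: ConfidenceInputs): number {
  return (
    CONFIDENCE_WEIGHTS.toolSuccessRate * clamp01(inputs.toolSuccessRate) +
    CONFIDENCE_WEIGHTS.stepEfficiency * clamp01(inputs.stepEfficiency) +
    CONFIDENCE_WEIGHTS.outputValidity * clamp01(inputs.outputValidity) +
    CONFIDENCE_WEIGHTS.hallucinationScore * clamp01(inputs.hallucinationScore)
  );
}
```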

Why deterministic: The checks make no additional LLM calls, so they add no model latency and no API cost. They cross-reference the response text against the actual tool result data and catch the most common financial hallucination patterns (phantom tickers, fabricated dollar amounts) with high precision.
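
As an illustration of such a deterministic cross-reference, the sketch below checks dollar amounts in a response against numbers found in tool results. It assumes "within 5% or $1" means that either tolerance is sufficient, and it uses a simple regular expression that handles plain amounts like `$1,234.56` only. The function names are hypothetical.

```typescript
// Pulls amounts such as "$1,234.56" or "$3" out of free text.
function extractDollarAmounts(text: string): number[] {
  const matches = text.match(/\$\d[\d,]*(?:\.\d+)?/g) || [];
  return matches.map((m) => Number(m.slice(1).replace(/,/g, "")));
}

// A claimed amount is supported if some known value is within 5% or $1 of it.
function amountSupported(claimed: number, known: number[]): boolean {
  return known.some(
    (value) => Math.abs(claimed - value) <= Math.max(1, Math.abs(value) * 0.05),
  );
}

// Returns the amounts in the response that no tool result backs up.
function unsupportedAmounts(response: string, toolAmounts: number[]): number[] {
  return extractDollarAmounts(response).filter(
    (amount) => !amountSupported(amount, toolAmounts),
  );
}

// Returns mentioned tickers that never appeared in tool data.
function phantomTickers(mentioned: string[], toolSymbols: string[]): string[] {
  const knownSymbols = new Set(toolSymbols.map((s) => s.toUpperCase()));
  return mentioned.filter((ticker) => !knownSymbols.has(ticker.toUpperCase()));
}
```

The $1 floor keeps small values such as fees from failing on rounding alone, while the 5% band tolerates prices that moved slightly between tool calls.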

Persistence: Results stored on AgentChatLog (verificationScore, verificationResult) and exposed via GET /agent/verification/:requestId.

Eval Results

Framework: evalite — dedicated eval runner with scoring, UI, and CI integration.

| Suite | Cases | Pass Rate | Scorers |
| --- | --- | --- | --- |
| Golden Set | 19 | 100% | GoldenCheck (deterministic binary) |
| Scenarios | 67 | 91% | ToolCallAccuracy (F1), HasResponse, ResponseQuality (LLM-judged) |
| Total | 86 | | |
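
For readers unfamiliar with F1 as a tool-routing metric, here is a minimal sketch of how a ToolCallAccuracy-style score could be computed. It assumes the comparison is between the set of expected tool names and the set of tools actually called; the scorer in the repository may differ in details such as ordering or duplicate handling.

```typescript
// F1 over the sets of expected and actually-called tool names.
function toolCallF1(expected: string[], actual: string[]): number {
  const expectedSet = new Set(expected);
  const actualSet = new Set(actual);
  // No tools expected and none called counts as a perfect match.
  if (expectedSet.size === 0 && actualSet.size === 0) return 1;

  let truePositives = 0;
  actualSet.forEach((tool) => {
    if (expectedSet.has(tool)) truePositives++;
  });
  if (truePositives === 0) return 0;

  const precision = truePositives / actualSet.size; // penalizes extra tools
  const recall = truePositives / expectedSet.size; // penalizes missing tools
  return (2 * precision * recall) / (precision + recall);
}
```

Under this definition, calling one extra tool on a single-tool case gives precision 0.5 and recall 1, so F1 is about 0.67 even though the expected tool was used.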

Category breakdown: single-tool (10), multi-tool (10), ambiguous (6), account management (8), activity management (10), watchlist (4), tag (4), multi-step write (4), adversarial write (4), edge cases (7), golden routing (7), golden structural (4), golden behavioral (2), golden guardrails (4).

CI gate: GitHub Actions runs the golden set on every push to main against the deployed Render instance. The pass threshold is 100%.

Remaining 9% scenario gap: The agent calls extra tools on ambiguous queries, which lowers the ToolCallAccuracy score even though the answers are thorough rather than wrong. Write-operation approval is now handled by native needsApproval gates (bypassed in evals via SKIP_APPROVAL=true).

Observability Setup

| Capability | Implementation |
| --- | --- |
| Trace logging | Structured JSON: `chat_start`, `step_finish` (tool names, token usage), `verification`, `chat_complete` |
| Latency tracking | Total `latencyMs` per request, persisted to Postgres |
| Token usage | prompt/completion/total per request, averaged in metrics summary |
| Cost tracking | Model-specific pricing (Sonnet $3/$15, Haiku $1/$5, Opus $15/$75 per MTok), `cost.totalUsd` and `cost.avgPerChatUsd` in metrics |
| Error tracking | Errors recorded with sanitized messages (DB URLs, API keys redacted) |
| User feedback | Thumbs up/down per message, unique per (requestId, userId), satisfaction rate in metrics |
| Metrics endpoint | `GET /agent/metrics?since=1h` — summary, feedback, recent chats |
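
The error-tracking row mentions redacting DB URLs and API keys before errors are recorded. A minimal sketch of that kind of sanitizer follows. The patterns are assumptions chosen for illustration (common database URL schemes and `sk-` style keys), not the patterns used in the repository.

```typescript
// Illustrative patterns only; a production list would likely be broader.
const DB_URL_PATTERN = /\b(?:postgres(?:ql)?|mysql|redis|mongodb(?:\+srv)?):\/\/\S+/gi;
const API_KEY_PATTERN = /\bsk-[A-Za-z0-9_-]{8,}/g;

// Replaces connection strings and key-like tokens before the message is stored.
function sanitizeErrorMessage(message: string): string {
  return message
    .replace(DB_URL_PATTERN, "[REDACTED_DB_URL]")
    .replace(API_KEY_PATTERN, "[REDACTED_API_KEY]");
}
```

Redacting at record time, rather than at display time, means the secret never reaches the logs or the Postgres metrics tables in the first place.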

Key insight: With Sonnet 4.6, a chat averages $0.015 in cost, ~5.9s latency, and ~1.9 steps per query. Portfolio analysis and market data are the most-used tools.
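
Per-chat cost follows directly from token usage and the per-MTok prices listed in the observability table. The sketch below uses those prices; the token counts in the usage note are illustrative values, not measurements from this project, and `chatCostUsd` and `avgPerChatUsd` are hypothetical helper names.

```typescript
type ModelTier = "haiku" | "sonnet" | "opus";

// USD per million tokens (input/output), as listed in the observability table.
const PRICING_PER_MTOK: Record<ModelTier, { input: number; output: number }> = {
  haiku: { input: 1, output: 5 },
  sonnet: { input: 3, output: 15 },
  opus: { input: 15, output: 75 },
};

function chatCostUsd(
  model: ModelTier,
  promptTokens: number,
  completionTokens: number,
): number {
  const price = PRICING_PER_MTOK[model];
  return (promptTokens * price.input + completionTokens * price.output) / 1000000;
}

// Average over recorded chats; returns 0 when nothing has been recorded.
function avgPerChatUsd(costs: number[]): number {
  if (costs.length === 0) return 0;
  return costs.reduce((sum, cost) => sum + cost, 0) / costs.length;
}
```

For example, a hypothetical Sonnet chat with 3,000 prompt tokens and 400 completion tokens costs (3,000 × $3 + 400 × $15) / 1,000,000 = $0.015.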

Open Source Contribution

Eval dataset: 86-case structured JSON dataset (evals/dataset.json) covering tool routing, multi-tool chaining, write operations, adversarial inputs, and edge cases for a financial AI agent. Each case includes input query, expected tools, expected behavioral assertions, and category labels. Published in the repository for reuse by other Ghostfolio agent implementations.
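
To make the case structure concrete, here is a sketch of what one dataset entry and a matching runtime check might look like. The field names (`id`, `category`, `input`, `expectedTools`, `assertions`) are assumptions derived from the description above; the actual keys in evals/dataset.json may differ.

```typescript
// Hypothetical shape of one eval case; key names are assumptions.
interface EvalCase {
  id: string;
  category: string;
  input: string;
  expectedTools: string[];
  assertions: string[];
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Runtime guard for entries parsed from JSON.
function isEvalCase(value: unknown): value is EvalCase {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as Record<string, unknown>;
  return (
    typeof candidate["id"] === "string" &&
    typeof candidate["category"] === "string" &&
    typeof candidate["input"] === "string" &&
    isStringArray(candidate["expectedTools"]) &&
    isStringArray(candidate["assertions"])
  );
}

// Illustrative entry, not copied from the dataset.
const exampleCase: EvalCase = {
  id: "single-tool-example",
  category: "single-tool",
  input: "What is my portfolio worth?",
  expectedTools: ["portfolio_analysis"],
  assertions: ["response contains a numeric total value"],
};
```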
