# Pre-Search Document

Completed before writing agent code. Decisions informed all subsequent architecture choices.

## Phase 1: Define Your Constraints

### 1. Domain Selection

- **Domain**: Finance (Ghostfolio, an open source portfolio tracker)
- **Use cases**: Portfolio analysis, performance tracking, market data lookups, transaction management, account CRUD, watchlist management, tagging
- **Verification requirements**: Hallucination detection (dollar amounts, ticker symbols), output validation, confidence scoring. Financial data must match tool results exactly.
- **Data sources**: Ghostfolio's existing Prisma/Postgres models, Financial Modeling Prep API (stocks/ETFs), CoinGecko API (crypto)

### 2. Scale & Performance

- **Expected query volume**: 100-1,000 chats/day during demo period
- **Acceptable latency**: <5s single-tool, <15s multi-step
- **Concurrent users**: ~10-50 simultaneous (Render Standard plan)
- **Cost constraints**: <$50/month for demo, Sonnet 4.6 at ~$0.015/chat is viable

### 3. Reliability Requirements

- **Cost of wrong answer**: Medium. Incorrect portfolio values or transaction amounts erode trust. Not life-threatening (unlike healthcare), but financial data must be accurate.
- **Non-negotiable verification**: Dollar amounts in response must match tool data (within 5% or $1). Ticker symbols must reference actual holdings.
- **Human-in-the-loop**: Write operations (buy/sell/delete) require confirmation before execution. Agent asks clarifying questions rather than assuming.
- **Audit needs**: All chats logged to AgentChatLog with requestId, tokens, latency, verification scores.

### 4. Team & Skill Constraints

- **Agent frameworks**: Familiar with Vercel AI SDK, LangChain. Chose Vercel AI SDK for native TypeScript + NestJS integration.
- **Domain experience**: Familiar with portfolio management concepts. Ghostfolio codebase already provides all financial data models.
- **Eval/testing**: Experience with Jest/Vitest. Chose evalite for dedicated eval framework with UI and scoring.

## Phase 2: Architecture Discovery

### 5. Agent Framework Selection

- **Choice**: Vercel AI SDK v6 (`ToolLoopAgent`)
- **Why not LangChain**: Ghostfolio is NestJS/TypeScript; Vercel AI SDK is native TS, lighter weight, no Python dependency. LangChain's JS version is less mature.
- **Why not custom**: Vercel AI SDK provides streaming, tool dispatch, step management out of the box. No need to reinvent.
- **Architecture**: Single agent with tool gating via `prepareStep`. Not multi-agent: single domain, single user context.
- **State management**: Conversation history passed per turn. Tool gating state tracked via `toolHistory` array across turns.

### 6. LLM Selection

- **Choice**: Claude Sonnet 4.6 (default), with Haiku 4.5 and Opus 4.6 selectable per request
- **Why Claude**: Strong function calling, excellent at structured financial output, good instruction following for system prompt rules
- **Context window**: Sonnet 4.6 has 200K context, more than sufficient for portfolio data + conversation history
- **Cost per query**: $0.015 avg with Sonnet, acceptable for demo and small-scale production

### 7. Tool Design

- **Tools built**: 10 total (6 read + 4 write)
  - Read: portfolio_analysis, portfolio_performance, holdings_lookup, market_data, symbol_search, transaction_history
  - Write: account_manage, activity_manage, watchlist_manage, tag_manage
- **External APIs**: Financial Modeling Prep (stocks/ETFs), CoinGecko (crypto), both via Ghostfolio's existing data provider layer
- **Mock vs real**: Real data from seeded demo portfolio (AAPL, MSFT, VOO, GOOGL, BTC, TSLA, NVDA). No mocks in dev or eval.
- **Error handling**: Every tool wrapped in try/catch, returns structured error messages. Agent gracefully surfaces errors to user.

### 8. Observability Strategy

- **Choice**: Self-hosted structured JSON logging + Postgres persistence + in-memory metrics
- **Why not LangSmith/Braintrust**: Adds external dependency and cost. Structured logs + Postgres give full control and queryability. Metrics endpoint provides real-time dashboard.
- **Key metrics**: Latency, token usage (prompt/completion/total), cost per chat, tool usage frequency, error rate, verification scores
- **Real-time monitoring**: `GET /agent/metrics?since=1h` endpoint with summary + recent chats
- **Cost tracking**: Token counts × model-specific pricing computed in metrics summary

### 9. Eval Approach

- **Framework**: evalite (dedicated eval runner with UI, separate from unit tests)
- **Correctness**: GoldenCheck scorer (deterministic binary pass/fail on tool routing, output patterns, forbidden content)
- **Quality**: ResponseQuality scorer (LLM-judged by Haiku 4.5, scoring relevance, data-groundedness, conciseness, formatting)
- **Tool accuracy**: ToolCallAccuracy scorer (F1-style partial credit for expected vs actual tool calls)
- **Ground truth**: Real API responses from seeded demo portfolio
- **CI integration**: GitHub Actions runs golden evals on push, threshold 100%

### 10. Verification Design

- **Claims verified**: Dollar amounts, ticker symbols, holding references
- **Fact-checking sources**: Tool result data (the agent's own tool calls serve as ground truth)
- **Confidence thresholds**: Composite 0-1 score (tool success rate 0.3, step efficiency 0.1, output validity 0.3, hallucination score 0.3)
- **Escalation triggers**: Low confidence score logged but no automated escalation (deterministic verification only)

## Phase 3: Post-Stack Refinement

### 11. Failure Mode Analysis

- **Tool failures**: Try/catch per tool, error surfaced to user with suggestion to retry. Agent continues conversation.
- **Ambiguous queries**: Agent uses multiple tools when unsure (e.g., "How am I doing?" calls both performance and analysis). Over-fetching preferred over under-fetching.
- **Rate limiting**: No explicit rate limiting implemented. Render Standard plan handles moderate traffic. Would add Redis-based rate limiting at scale.
- **Graceful degradation**: If market data API fails, agent acknowledges and proceeds with available data.

### 12. Security Considerations

- **Prompt injection**: System prompt instructs agent to stay in role. Eval suite includes adversarial inputs (jailbreak, developer mode, system prompt extraction).
- **Data leakage**: Agent scoped to authenticated user's data only. JWT auth on all endpoints.
- **API key management**: All keys in environment variables, not committed. Error messages sanitized (DB URLs, API keys redacted).
- **Audit logging**: Every chat logged with requestId, userId, tools used, token counts, verification results.

### 13. Testing Strategy

- **Unit tests**: 16 tests for prepareStep tool gating logic
- **Eval tests**: 86 cases across golden (19) and scenarios (67)
- **Adversarial testing**: 11 cases covering prompt injection, jailbreak, resource abuse, system prompt extraction
- **Regression**: CI gate enforces 100% golden pass rate on every push

### 14. Open Source Planning

- **Release**: Eval dataset (86 cases) published as `evals/dataset.json`, structured JSON with input, expected tools, expected behavior, categories
- **License**: Follows Ghostfolio's existing AGPL-3.0 license
- **Documentation**: Agent README (280 lines), architecture doc, cost analysis, this pre-search document

### 15. Deployment & Operations

- **Hosting**: Render (Docker) with web + Redis + Postgres, Oregon region
- **CI/CD**: GitHub Actions for eval gate. Render auto-deploys from main branch.
- **Monitoring**: Structured JSON logs + `/agent/metrics` endpoint
- **Rollback**: Render provides instant rollback to previous deploy

### 16. Iteration Planning

- **User feedback**: Thumbs up/down per message, stored in AgentFeedback table, surfaced in metrics
- **Eval-driven improvement**: Failures in eval suite drive agent refinement (e.g., stale test cases updated when write tools added)
- **Future work**: Automated feedback → eval pipeline, LLM-judged hallucination detection, useChat migration if UI complexity grows
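
The dollar-amount rule from section 3 (amounts must match tool data within 5% or $1) could be checked along the following lines. This is a minimal sketch, not the project's actual verifier: the function names are hypothetical, the regex only handles plain `$1,234.56`-style amounts, and it assumes "5% or $1" means an amount passes if either tolerance is met.

```typescript
// Hypothetical helper: pull dollar amounts such as "$1,234.56" out of a response.
function extractDollarAmounts(text: string): number[] {
  const pattern = /\$\s?(\d{1,3}(?:,\d{3})+|\d+)(\.\d+)?/g;
  const amounts: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    // Join integer and fractional parts, then drop thousands separators.
    amounts.push(Number((match[1] + (match[2] || "")).replace(/,/g, "")));
  }
  return amounts;
}

// Assumption: a claimed amount is grounded if it is within $1 OR within 5%
// of at least one numeric value returned by a tool.
function isGrounded(claimed: number, toolValues: number[]): boolean {
  return toolValues.some((value) => {
    const diff = Math.abs(claimed - value);
    return diff <= 1 || diff <= 0.05 * Math.abs(value);
  });
}

// Returns the amounts in the response that no tool value supports.
function findUngroundedAmounts(response: string, toolValues: number[]): number[] {
  return extractDollarAmounts(response).filter(
    (amount) => !isGrounded(amount, toolValues)
  );
}
```

An empty result from `findUngroundedAmounts` would mean every dollar figure in the response traces back to tool output under these assumptions.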
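
The cost tracking described in section 8 (token counts × model-specific pricing) amounts to a small calculation. The sketch below assumes prices are quoted per million tokens; the type names and the pricing values in the usage comment are placeholders for illustration, not the rates used by the project.

```typescript
// Hypothetical shapes; field names are illustrative.
interface TokenUsage {
  promptTokens: number;
  completionTokens: number;
}

interface ModelPricing {
  inputPerMTok: number;  // USD per 1,000,000 prompt tokens
  outputPerMTok: number; // USD per 1,000,000 completion tokens
}

// Cost of one chat in USD: each token count times its per-token price.
function chatCostUsd(usage: TokenUsage, pricing: ModelPricing): number {
  const inputCost = (usage.promptTokens / 1_000_000) * pricing.inputPerMTok;
  const outputCost = (usage.completionTokens / 1_000_000) * pricing.outputPerMTok;
  return inputCost + outputCost;
}

// Usage with placeholder pricing (not authoritative rates):
// chatCostUsd({ promptTokens: 2000, completionTokens: 500 },
//             { inputPerMTok: 3, outputPerMTok: 15 });
```

Keeping pricing as an argument rather than a constant makes it straightforward to price the same usage under each selectable model.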
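
Section 9 describes ToolCallAccuracy as F1-style partial credit for expected vs actual tool calls. One way such a score could be computed is shown below. This is a sketch under assumptions: it compares sets of tool names, ignores call order and arguments, and treats "nothing expected, nothing called" as a perfect score. The actual scorer may differ.

```typescript
// F1 over the sets of expected and actual tool names (hypothetical scorer).
function toolCallF1(expected: string[], actual: string[]): number {
  // De-duplicate so repeated calls to the same tool count once.
  const uniq = (xs: string[]): string[] => xs.filter((x, i) => xs.indexOf(x) === i);
  const exp = uniq(expected);
  const act = uniq(actual);

  // Assumption: no tools expected and none called is a full pass.
  if (exp.length === 0 && act.length === 0) return 1;

  const hits = act.filter((tool) => exp.indexOf(tool) !== -1).length;
  if (hits === 0) return 0;

  const precision = hits / act.length; // share of calls that were expected
  const recall = hits / exp.length;    // share of expected tools that were called
  return (2 * precision * recall) / (precision + recall);
}
```

For the ambiguous query in section 11 ("How am I doing?"), calling only `portfolio_performance` when both `portfolio_performance` and `portfolio_analysis` are expected gives precision 1 and recall 0.5, so the agent earns partial credit of about 0.67 rather than a flat fail.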
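
The composite confidence score in section 10 is a weighted sum whose weights (0.3, 0.1, 0.3, 0.3) add up to 1. A minimal sketch follows; the interface and field names are hypothetical, and it assumes every input is already normalized to 0-1 with higher meaning better (so a hallucination score of 1 means no hallucinations were detected).

```typescript
// Hypothetical input shape; each signal is assumed to be in the range 0-1.
interface VerificationSignals {
  toolSuccessRate: number;
  stepEfficiency: number;
  outputValidity: number;
  hallucinationScore: number; // assumption: 1 = nothing flagged
}

// Weights as listed in the Verification Design section.
const WEIGHTS = {
  toolSuccessRate: 0.3,
  stepEfficiency: 0.1,
  outputValidity: 0.3,
  hallucinationScore: 0.3,
} as const;

function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Weighted sum of the clamped signals; the result stays within 0-1.
function compositeConfidence(signals: VerificationSignals): number {
  return (
    WEIGHTS.toolSuccessRate * clamp01(signals.toolSuccessRate) +
    WEIGHTS.stepEfficiency * clamp01(signals.stepEfficiency) +
    WEIGHTS.outputValidity * clamp01(signals.outputValidity) +
    WEIGHTS.hallucinationScore * clamp01(signals.hallucinationScore)
  );
}
```

Because step efficiency carries only 0.1 of the weight, an inefficient but correct run still scores close to 1, while a failed hallucination check alone removes up to 0.3.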
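
Section 12 notes that error messages are sanitized so DB URLs and API keys are redacted. The sketch below shows one way that could look. The function name, the URL schemes covered, and the key-name patterns are all assumptions; a real sanitizer would need patterns matched to the secrets actually in use.

```typescript
// Hypothetical sanitizer: redact connection URLs and key-like assignments
// before an error message is shown to the user or written to logs.
function sanitizeErrorMessage(message: string): string {
  return message
    // Connection strings such as postgresql://user:pw@host:5432/db
    .replace(/\b(?:postgres(?:ql)?|redis|mysql):\/\/\S+/gi, "[REDACTED_URL]")
    // Assignments such as apiKey=..., API_KEY: ..., token=..., secret=...
    .replace(/\b((?:api[_-]?key|token|secret)\s*[=:]\s*)\S+/gi, "$1[REDACTED]");
}
```

Keeping the key name and redacting only the value preserves enough context for debugging without exposing the secret itself.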