---

# AgentForge Pre-Search

## Finance Domain · Ghostfolio · AI Portfolio Intelligence Agent

### G4 Cohort · Week 2 · February 2026

> **Note on plan vs delivery:** This pre-search was completed before development began. The final implementation evolved from this plan:
>
> - 16 tools were built (vs 5 planned), with real estate portfolio tracking added to the original portfolio analysis scope
> - Human-in-the-loop was implemented via an awaiting_confirmation state in the graph
> - Wash-sale enforcement was implemented in tax_estimate.py
> - Open source contribution was delivered as a public GitHub eval dataset (183 tests) rather than an npm package
> - See AGENT_README.md for the final implementation

---

## Phase 1: Domain & Constraints

### Use Cases

- Portfolio health check: "How diversified am I? Where am I overexposed?"
- Performance Q&A: "What's my YTD return vs S&P 500?"
- Tax-loss harvesting: surface unrealized losses to offset gains; flag wash-sale violations (30-day IRS rule)
- Natural language activity log: "Show me all my MSFT trades this year" → query Ghostfolio activities
- Compliance alerts: concentration risk above 20% in a single holding, missing cost basis warnings

### Ghostfolio — Actual API Integration Points

Stack: Angular frontend · NestJS backend · PostgreSQL + Prisma ORM · Redis cache · TypeScript

Data sources: Ghostfolio PostgreSQL DB (primary), Yahoo Finance API (market data), IRS published tax rules (compliance). All external data is fetched at runtime; no static snapshots.

### Team & Skill Constraints

- Domain experience: licensed real estate agent — strong knowledge of real estate use cases
- Framework familiarity: new to LangGraph, learned during the build
- Biggest risk: the Python context switch on Day 1; mitigation: a minimal LangGraph hello world before touching domain logic

### Scale & Performance

- Expected query volume: 5-20 queries/user/day
- Acceptable latency: under 10 seconds for LLM synthesis
- Concurrent users: FastAPI async handles moderate load
- Cost constraint: under $0.02 per query

### Reliability Requirements

- Cost of a wrong answer: high — financial decisions have real consequences
- Non-negotiable: all claims cite sources, a confidence score on every response, no specific advice without a disclaimer
- Human-in-the-loop: implemented for high-risk queries
- Audit: LangSmith traces every request

### Performance Targets

- Single-tool latency: <5s target (actual: 8-10s due to Claude Sonnet synthesis, documented)
- Tool success rate: >95%
- Eval pass rate: >80% (actual: 100%)
- Hallucination rate: <5% (citation enforcement)

---

## Phase 2: Architecture

### Framework & LLM Decisions

- Framework: LangGraph (Python)
- LLM: Claude Sonnet (claude-sonnet-4-20250514)
- Observability: LangSmith
- Backend: FastAPI
- Database: PostgreSQL (Ghostfolio) + SQLite (properties)
- Deployment: Railway

### Why LangGraph

Chosen over plain LangChain because financial workflows require explicit state management: loop-back for human confirmation, conditional branching by query type, and mid-graph verification before any response returns.
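As a minimal sketch of that shape (node names, routing keywords, and the trimmed state are illustrative, not the project's actual code), the routing, verification, and confirmation flow might be wired like this:

```python
# Minimal sketch only -- not the project's actual graph.
from typing import TypedDict

from langgraph.checkpoint.memory import MemorySaver
from langgraph.graph import StateGraph, START, END


class AgentState(TypedDict):
    query: str
    query_type: str          # set by the classifier, drives routing
    tool_result: dict        # structured output from the tool executor
    confidence_score: float  # 0.0-1.0, set by the verification node


def classify(state: AgentState) -> dict:
    # Keyword classifier; portfolio_analysis is the safe default.
    q = state["query"].lower()
    return {"query_type": "tax_estimate" if "tax" in q else "portfolio_analysis"}


def run_tool(state: AgentState) -> dict:
    # Dispatch to the executor for state["query_type"] (omitted here).
    return {"tool_result": {"status": "ok"}}


def verify(state: AgentState) -> dict:
    # Compare numeric claims against the request's portfolio snapshot.
    return {"confidence_score": 0.9}


def confirm(state: AgentState) -> dict:
    # Human-in-the-loop node; the graph pauses before entering it.
    return {}


def route_after_verify(state: AgentState) -> str:
    # High-risk query types loop back through human confirmation.
    return "confirm" if state["query_type"] == "tax_estimate" else END


builder = StateGraph(AgentState)
builder.add_node("classify", classify)
builder.add_node("run_tool", run_tool)
builder.add_node("verify", verify)
builder.add_node("confirm", confirm)
builder.add_edge(START, "classify")
builder.add_edge("classify", "run_tool")
builder.add_edge("run_tool", "verify")
builder.add_conditional_edges("verify", route_after_verify, {"confirm": "confirm", END: END})
builder.add_edge("confirm", END)

# interrupt_before pauses execution at "confirm"; with a checkpointer the
# run can resume on the same thread_id after explicit user confirmation.
app = builder.compile(checkpointer=MemorySaver(), interrupt_before=["confirm"])
```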
### LangGraph State Schema

Core state persists across every node:

- query_type: routes to the correct tool executor
- tool_result: structured output from the tool call
- confidence_score: quantified certainty per response
- awaiting_confirmation: pauses the graph for high-risk queries
- portfolio_snapshot: immutable per request, used for verification
- messages: full conversation history for LLM context

Key design decisions: portfolio_snapshot is immutable once set — the verification node compares all numeric claims against it. awaiting_confirmation pauses the graph at the human-in-the-loop node and resumes only on explicit user confirmation. A confidence_score below 0.6 routes to the clarification node.

### Agent Tools (Final: 16 tools across 7 files)

The original plan called for 5 core portfolio tools. The implementation expanded to 16, adding real estate, wealth planning, and life decision capabilities. See AGENT_README.md for the complete tool table.

Integration pattern: the LangGraph agent authenticates with Ghostfolio via the anonymous token endpoint, then calls the portfolio/activity endpoints with a Bearer token. No NestJS modification required.

Error handling: every tool returns a structured error dict on failure — it never throws to the agent.

### Verification Stack (3 Layers Implemented)

1. Confidence scoring: every response is scored 0.0-1.0. Below 0.80, verified=false is returned to the client.
2. Citation enforcement: the system prompt requires every factual claim to name its data source. The LLM cannot return a number without a citation.
3. Domain constraint check: a pre-return scan for high-risk phrases flags responses that make specific investment recommendations without disclaimers.

Note: the pre-search planned a fact-check node with tool_result_id tagging. Citation enforcement via the system prompt proved more reliable in practice because it cannot be bypassed by routing logic.

### Observability Plan

Tool: LangSmith — native to LangGraph. Metrics per request:

- Full reasoning trace
- LLM + tool latency breakdown
- Input/output tokens + rolling cost
- Tool success/failure rate by tool name
- Verification outcome (pass/flag/escalate)
- User thumbs up/down linked to trace_id
- /metrics endpoint for aggregate stats
- eval_history.json for regression detection

---

## Phase 3: Evaluation Framework

### Test Suite (183 tests, 100% pass rate)

Categories:

- Happy path: 20 tests
- Edge cases: 14 tests
- Adversarial: 14 tests
- Multi-step: 13 tests
- Additional tool unit tests: 122 tests

All integration and eval tests run against deterministic data — fully reproducible, which eliminates flakiness from live market data.
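To make "deterministic data" concrete, a happy-path case in this style might look like the sketch below; run_agent is a hypothetical stand-in for the real entrypoint, and the fixture numbers are invented for illustration:

```python
# Sketch of one deterministic happy-path eval case -- run_agent is a
# stand-in for the real entrypoint; the fixture values are invented.
FIXTURE_PORTFOLIO = {
    "MSFT": {"quantity": 10, "cost_basis": 3000.0, "market_value": 4200.0},
}


def run_agent(query: str, portfolio: dict) -> dict:
    # Stand-in: the real agent would classify, call a tool, and verify.
    gain = sum(h["market_value"] - h["cost_basis"] for h in portfolio.values())
    return {
        "text": f"Unrealized gain: ${gain:,.2f} (source: portfolio snapshot)",
        "confidence_score": 0.95,
    }


def test_unrealized_gain_is_cited_and_confident():
    answer = run_agent("What is my unrealized gain?", FIXTURE_PORTFOLIO)
    assert "$1,200.00" in answer["text"]       # exact, because data is frozen
    assert "source:" in answer["text"]         # citation enforcement
    assert answer["confidence_score"] >= 0.80  # verification gate
```

Because the fixture is frozen, the expected dollar amount can be asserted exactly; a live market feed would make the same assertion flaky.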
### Testing Strategy

- Unit tests for every tool
- Integration tests for multi-step chains
- Adversarial tests: SQL injection, prompt injection, extreme values, malformed inputs
- Regression: save_eval_results.py stores pass-rate history in eval_history.json and flags drops

---

## Failure Modes & Security

### Failure Modes

- Tool fails: returns an error dict; the LLM synthesizes a graceful response
- Ambiguous query: the keyword classifier falls back to portfolio_analysis as a safe default
- Rate limiting: not implemented at MVP
- Graceful degradation: every tool has try/except; log_error() captures context and the stack trace

### Security

- API keys: environment variables only, never logged
- Data leakage: portfolio data is scoped per authenticated user; agent context is cleared between sessions
- Prompt injection: user input is sanitized before prompt construction; the system prompt resists override instructions
- Audit logging: every tool call + result is stored with a timestamp; LangSmith retains traces

---

## AI Cost Analysis

### Development Spend Estimate

- Estimated queries during development: ~2,000
- Average tokens per query: 1,600 input + 400 output
- Cost per query: ~$0.01; at Claude Sonnet's ~$3/M input and ~$15/M output rates, (1,600 × $3/M) + (400 × $15/M) ≈ $0.011
- Total development cost estimate: ~$20-40
- LangSmith: free tier
- Railway: free tier

### Production Cost Projections

Assumptions: 10 queries/user/day, 2,000 input tokens, 500 output tokens, 1.5 avg tool calls

| Scale | Monthly Cost |
|-------|--------------|
| 100 users | ~$18/mo |
| 1,000 users | ~$180/mo |
| 10,000 users | ~$1,800/mo |
| 100,000 users | ~$18,000/mo* |

*At 100k users, semantic caching plus routing simple queries to Haiku cuts LLM cost by ~40%, for a real target of ~$11k/mo.

---

## Open Source Contribution

Delivered: a 183-test public eval dataset for finance AI agents — the first eval suite for Ghostfolio agents. MIT licensed, accepts contributions.

- Location: agent/evals/ on the submission/final branch
- Documentation: agent/evals/EVAL_DATASET_README.md

Note: the pre-search planned an npm package + Hugging Face. The GitHub eval dataset was chosen instead — it is more directly useful to developers forking Ghostfolio, since they can run the test suite immediately.

---

## Deployment & Operations

- Hosting: Railway (FastAPI agent + Ghostfolio)
- CI/CD: manual deploys via the Railway GitHub integration
- Monitoring: LangSmith dashboard + /metrics endpoint
- Rollback: git revert; Railway auto-deploys from main

---

## Iteration Planning

- User feedback: thumbs up/down in the chat UI; each vote is stored with its LangSmith trace_id
- Improvement cycle: eval failures → tool fixes → re-run suite → confirm improvement
- Regression gate: a new feature must not drop the eval pass rate (see the sketch below)
- Model updates: the full eval suite runs against a new model before switching it into production
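A minimal sketch of that regression gate, assuming a JSON list of run records (only the eval_history.json filename comes from this document; the record shape and the zero-tolerance threshold are assumptions):

```python
# Sketch of the regression gate: append each run's pass rate to a history
# file and fail loudly on any drop. Record shape and threshold assumed.
import json
import sys
from datetime import datetime, timezone
from pathlib import Path

HISTORY_FILE = Path("eval_history.json")


def record_and_gate(passed: int, total: int, max_drop: float = 0.0) -> None:
    """Append this run's pass rate and exit non-zero on a regression."""
    history = json.loads(HISTORY_FILE.read_text()) if HISTORY_FILE.exists() else []
    rate = passed / total
    previous = history[-1]["pass_rate"] if history else None
    history.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "passed": passed,
        "total": total,
        "pass_rate": rate,
    })
    HISTORY_FILE.write_text(json.dumps(history, indent=2))
    if previous is not None and rate < previous - max_drop:
        sys.exit(f"Regression: pass rate fell from {previous:.1%} to {rate:.1%}")


if __name__ == "__main__":
    record_and_gate(passed=183, total=183)
```

---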