# AgentForge Pre-Search

## Finance Domain · Ghostfolio · AI Portfolio Intelligence Agent

### G4 Cohort · Week 2 · February 2026

> Note on plan vs delivery: This pre-search was completed before development began. The final implementation evolved from this plan:
>
> - 16 tools were built (vs 5 planned), with real estate portfolio tracking added to the original portfolio analysis scope
> - Human-in-the-loop was implemented via an awaiting_confirmation state in the graph
> - Wash-sale enforcement was implemented in tax_estimate.py
> - Open source contribution was delivered as a public GitHub eval dataset (183 tests) rather than an npm package
> - See AGENT_README.md for the final implementation

Phase 1: Domain & Constraints

Use Cases

  • Portfolio health check: "How diversified am I? Where am I overexposed?"
  • Performance Q&A: "What's my YTD return vs S&P 500?"
  • Tax-loss harvesting: Surface unrealized losses to offset gains; flag wash-sale violations (30-day IRS rule)
  • Natural language activity log: "Show me all my MSFT trades this year" → query Ghostfolio activities
  • Compliance alerts: Concentration risk >20% single holding, missing cost basis warnings
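The concentration-risk alert above reduces to a share-of-portfolio check; a minimal sketch, where the holdings and the 20% threshold are illustrative assumptions:

```python
# Hypothetical holdings: symbol -> market value in USD (illustrative numbers)
holdings = {"MSFT": 45_000, "AAPL": 20_000, "VTI": 35_000}

def concentration_alerts(holdings: dict, threshold: float = 0.20) -> dict:
    """Return holdings whose share of total portfolio value exceeds the threshold."""
    total = sum(holdings.values())
    if total == 0:
        return {}
    return {sym: val / total for sym, val in holdings.items() if val / total > threshold}

alerts = concentration_alerts(holdings)
# MSFT (45%) and VTI (35%) exceed the 20% threshold; AAPL (exactly 20%) does not
```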

Ghostfolio — Actual API Integration Points

Stack: Angular frontend · NestJS backend · PostgreSQL + Prisma ORM · Redis cache · TypeScript

Data sources: Ghostfolio PostgreSQL DB (primary), Yahoo Finance API (market data), IRS published tax rules (compliance). All external data fetched at runtime — no static snapshots.

Team & Skill Constraints

  • Domain experience: Licensed real estate agent — strong knowledge of real estate use cases
  • Framework familiarity: New to LangGraph, learned during build
  • Biggest risk: Python context switch on Day 1. Mitigation: minimal LangGraph hello world before touching domain logic

Scale & Performance

  • Expected query volume: 5-20 queries/user/day
  • Acceptable latency: Under 10 seconds for LLM synthesis
  • Concurrent users: FastAPI async handles moderate load
  • Cost constraint: Under $0.02 per query

Reliability Requirements

  • Cost of wrong answer: High — financial decisions have real consequences
  • Non-negotiable: All claims cite sources, confidence score on every response, no specific advice without disclaimer
  • Human-in-the-loop: Implemented for high-risk queries
  • Audit: LangSmith traces every request

Performance Targets

  • Single-tool latency: <5s target (actual: 8-10s due to Claude Sonnet synthesis, documented)
  • Tool success rate: >95%
  • Eval pass rate: >80% (actual: 100%)
  • Hallucination rate: <5% (citation enforcement)

Phase 2: Architecture

Framework & LLM Decisions

  • Framework: LangGraph (Python)
  • LLM: Claude Sonnet (claude-sonnet-4-20250514)
  • Observability: LangSmith
  • Backend: FastAPI
  • Database: PostgreSQL (Ghostfolio) + SQLite (properties)
  • Deployment: Railway

Why LangGraph

Chosen over plain LangChain because financial workflows require explicit state management: loop-back for human confirmation, conditional branching by query type, and mid-graph verification before any response returns.

LangGraph State Schema

Core state persists across every node:

  • query_type: routes to correct tool executor
  • tool_result: structured output from tool call
  • confidence_score: quantified certainty per response
  • awaiting_confirmation: pauses graph for high-risk queries
  • portfolio_snapshot: immutable per request for verification
  • messages: full conversation history for LLM context

Key design: portfolio_snapshot is immutable once set — verification node compares all numeric claims against it. awaiting_confirmation pauses the graph at the human-in-the-loop node; resumes only on explicit user confirmation. confidence_score below 0.6 routes to clarification node.
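A minimal sketch of this state schema and the routing rule in plain typed Python (field names come from the list above; the node names are assumptions):

```python
from typing import TypedDict

class AgentState(TypedDict, total=False):
    query_type: str              # routes to the correct tool executor
    tool_result: dict            # structured output from the tool call
    confidence_score: float      # quantified certainty per response
    awaiting_confirmation: bool  # pauses the graph for high-risk queries
    portfolio_snapshot: dict     # immutable per request, used by verification
    messages: list               # full conversation history for LLM context

def route_after_scoring(state: AgentState) -> str:
    """Conditional edge: where the graph goes after a response is scored."""
    if state.get("awaiting_confirmation"):
        return "human_in_the_loop"  # resumes only on explicit user confirmation
    if state.get("confidence_score", 0.0) < 0.6:
        return "clarification"      # low confidence -> ask the user to clarify
    return "respond"
```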

Agent Tools (Final: 16 tools across 7 files)

Original plan was 5 core portfolio tools. Implementation expanded to 16 tools adding real estate, wealth planning, and life decision capabilities.

See AGENT_README.md for complete tool table.

Integration pattern: LangGraph agent authenticates with Ghostfolio via anonymous token endpoint, then calls portfolio/activity endpoints with Bearer token. No NestJS modification required.
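The Bearer-token call pattern can be sketched with the standard library; the base URL and endpoint path here are illustrative assumptions, not Ghostfolio's documented routes:

```python
import urllib.request

GHOSTFOLIO_URL = "http://localhost:3333"  # assumed local Ghostfolio instance

def build_authed_request(path: str, token: str) -> urllib.request.Request:
    """Build a request to a Ghostfolio endpoint carrying the Bearer token.
    The path passed in is illustrative; actual endpoint paths may differ."""
    return urllib.request.Request(
        f"{GHOSTFOLIO_URL}{path}",
        headers={"Authorization": f"Bearer {token}"},
    )

req = build_authed_request("/api/v1/portfolio/details", "ANON_TOKEN")
```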

Error handling: every tool returns structured error dict on failure — never throws to the agent.
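The never-throws contract can be sketched as a decorator (names and the example tool body are illustrative, not the project's actual code):

```python
def safe_tool(fn):
    """Wrap a tool so any failure becomes a structured error dict instead of an exception."""
    def wrapper(*args, **kwargs):
        try:
            return {"ok": True, "data": fn(*args, **kwargs)}
        except Exception as exc:  # never let a tool raise into the agent
            return {"ok": False, "error": type(exc).__name__, "detail": str(exc)}
    return wrapper

@safe_tool
def get_quote(symbol: str) -> dict:
    # Hypothetical tool body: fails loudly on bad input
    if not symbol:
        raise ValueError("symbol is required")
    return {"symbol": symbol, "price": 123.45}
```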

Verification Stack (3 Layers Implemented)

  1. Confidence Scoring: Every response scored 0.0-1.0. Below 0.80 returns verified=false to client.
  2. Citation Enforcement: System prompt requires every factual claim to name its data source. LLM cannot return a number without citation.
  3. Domain Constraint Check: Pre-return scan for high-risk phrases. Flags responses making specific investment recommendations without disclaimers.
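Layer 3 can be sketched as a simple pre-return scan; the phrase list and disclaimer marker below are assumptions, not the deployed filter:

```python
# Illustrative high-risk phrasings; the real list would be broader
HIGH_RISK_PHRASES = ("you should buy", "you should sell", "guaranteed return")
DISCLAIMER_MARKER = "not financial advice"

def passes_constraint_check(response: str) -> bool:
    """Flag responses that make specific recommendations without a disclaimer."""
    text = response.lower()
    risky = any(phrase in text for phrase in HIGH_RISK_PHRASES)
    return (not risky) or (DISCLAIMER_MARKER in text)
```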

Note: Pre-search planned fact-check node with tool_result_id tagging. Citation enforcement via system prompt proved more reliable in practice because it cannot be bypassed by routing logic.

Observability Plan

Tool: LangSmith — native to LangGraph. Metrics per request:

  • Full reasoning trace
  • LLM + tool latency breakdown
  • Input/output tokens + rolling cost
  • Tool success/failure rate by tool name
  • Verification outcome (pass/flag/escalate)
  • User thumbs up/down linked to trace_id
  • /metrics endpoint for aggregate stats
  • eval_history.json for regression detection
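As a sketch, the aggregate stats behind a /metrics-style endpoint could be rolled up like this (the per-request event shape is an assumption):

```python
from collections import Counter

def aggregate_metrics(events: list) -> dict:
    """Roll per-request events up into aggregate stats."""
    total = len(events)
    outcomes = Counter(e["verification"] for e in events)  # pass / flag / escalate
    tool_ok = sum(1 for e in events if e["tool_success"])
    return {
        "requests": total,
        "tool_success_rate": tool_ok / total if total else 0.0,
        "verification_outcomes": dict(outcomes),
        "total_cost_usd": round(sum(e["cost_usd"] for e in events), 4),
    }

# Illustrative events, as they might be logged per request
events = [
    {"verification": "pass", "tool_success": True, "cost_usd": 0.011},
    {"verification": "flag", "tool_success": True, "cost_usd": 0.009},
]
```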

Phase 3: Evaluation Framework

Test Suite (183 tests, 100% pass rate)

Categories:

  • Happy path: 20 tests
  • Edge cases: 14 tests
  • Adversarial: 14 tests
  • Multi-step: 13 tests
  • Additional tool unit tests: 122 tests

All integration and eval tests run against deterministic data — fully reproducible, which eliminates flakiness from live market data.

Testing Strategy

  • Unit tests for every tool
  • Integration tests for multi-step chains
  • Adversarial tests: SQL injection, prompt injection, extreme values, malformed inputs
  • Regression: save_eval_results.py stores pass rate history in eval_history.json, flags drops
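The regression flag reduces to a comparison against the last recorded run; a minimal sketch, with the eval_history.json entry format assumed:

```python
import json

def flag_regression(history: list, new_pass_rate: float) -> bool:
    """True if the new eval run's pass rate drops below the previous run's."""
    if not history:
        return False
    return new_pass_rate < history[-1]["pass_rate"]

# History as it might appear in eval_history.json (illustrative entries)
history = json.loads('[{"run": 1, "pass_rate": 0.95}, {"run": 2, "pass_rate": 1.0}]')
```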

Failure Modes & Security

Failure Modes

  • Tool fails: returns error dict, LLM synthesizes graceful response
  • Ambiguous query: keyword classifier falls back to portfolio_analysis as safe default
  • Rate limiting: not implemented at MVP
  • Graceful degradation: every tool has try/except, log_error() captures context and stack trace
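The keyword classifier with its safe default can be sketched as follows (the keywords and route names are assumptions):

```python
# Illustrative keyword -> route table; first match wins
ROUTES = {
    "tax": "tax_estimate",
    "loss": "tax_estimate",
    "trade": "activity_query",
    "return": "performance_qa",
    "diversif": "portfolio_analysis",
}

def classify(query: str) -> str:
    """Pick a tool route by keyword; fall back to portfolio_analysis as the safe default."""
    text = query.lower()
    for keyword, route in ROUTES.items():
        if keyword in text:
            return route
    return "portfolio_analysis"
```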

Security

  • API keys: environment variables only, never logged
  • Data leakage: portfolio data scoped per authenticated user, agent context cleared between sessions
  • Prompt injection: user input sanitized before prompt construction; system prompt resists override instructions
  • Audit logging: every tool call + result stored with timestamp; LangSmith retains traces
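One common shape for the input-sanitization step (the pattern list is an illustrative assumption, not the project's actual filter):

```python
import re

# Phrasings commonly associated with prompt-injection attempts (illustrative, not exhaustive)
INJECTION_PATTERNS = [
    r"ignore (all |any )?(previous|prior) instructions",
    r"you are now",
    r"reveal (the |your )?system prompt",
]

def sanitize(user_input: str) -> str:
    """Neutralize known injection phrasings before prompt construction."""
    out = user_input
    for pattern in INJECTION_PATTERNS:
        out = re.sub(pattern, "[filtered]", out, flags=re.IGNORECASE)
    return out
```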

AI Cost Analysis

Development Spend Estimate

  • Estimated queries during development: ~2,000
  • Average tokens per query: 1,600 input + 400 output
  • Cost per query: ~$0.01 (Claude Sonnet pricing)
  • Total development cost estimate: ~$20-40
  • LangSmith: Free tier
  • Railway: Free tier
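The ~$0.01/query figure follows from the token assumptions above; a sketch assuming Claude Sonnet list prices of $3 per million input tokens and $15 per million output tokens (those prices are assumptions here):

```python
PRICE_IN_PER_TOKEN = 3.00 / 1_000_000    # assumed $/input token
PRICE_OUT_PER_TOKEN = 15.00 / 1_000_000  # assumed $/output token

def query_cost(tokens_in: int, tokens_out: int) -> float:
    """Cost of one LLM call at the assumed per-token prices."""
    return tokens_in * PRICE_IN_PER_TOKEN + tokens_out * PRICE_OUT_PER_TOKEN

per_query = query_cost(1_600, 400)  # ~$0.0108 per query
dev_total = 2_000 * per_query       # ~$21.60 across ~2,000 dev queries
```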

Production Cost Projections

Assumptions: 10 queries/user/day, 2,000 input tokens, 500 output tokens, 1.5 avg tool calls

| Scale | Monthly Cost |
| --- | --- |
| 100 users | ~$18/mo |
| 1,000 users | ~$180/mo |
| 10,000 users | ~$1,800/mo |
| 100,000 users | ~$18,000/mo* |

*At 100k users: semantic caching + Haiku for simple queries cuts LLM cost ~40%. Real target ~$11k/mo.
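The projection scales linearly in users; as a sketch, with the table's implied per-100-users cost and the ~40% caching reduction as assumed parameters:

```python
def monthly_llm_cost(users: int, cost_per_100_users: float = 18.0) -> float:
    """Linear projection matching the table above (~$18/mo per 100 users)."""
    return users / 100 * cost_per_100_users

def with_caching(cost: float, reduction: float = 0.40) -> float:
    """Apply the ~40% cut from semantic caching + Haiku routing for simple queries."""
    return cost * (1 - reduction)

base = monthly_llm_cost(100_000)  # $18,000/mo at 100k users
target = with_caching(base)       # ~$10,800/mo, i.e. the ~$11k real target
```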


Open Source Contribution

Delivered: 183-test public eval dataset for finance AI agents — first eval suite for Ghostfolio agents. MIT licensed, accepts contributions.

Location: agent/evals/ on the submission/final branch
Documentation: agent/evals/EVAL_DATASET_README.md

Note: Pre-search planned npm package + Hugging Face. GitHub eval dataset was chosen instead — more directly useful to developers forking Ghostfolio since they can run the test suite immediately.


Deployment & Operations

  • Hosting: Railway (FastAPI agent + Ghostfolio)
  • CI/CD: Manual deploy via Railway GitHub integration
  • Monitoring: LangSmith dashboard + /metrics endpoint
  • Rollback: git revert + Railway auto-deploys from main

Iteration Planning

  • User feedback: thumbs up/down in chat UI, each vote stored with LangSmith trace_id
  • Improvement cycle: eval failures → tool fixes → re-run suite → confirm improvement
  • Regression gate: new feature must not drop eval pass rate
  • Model updates: full eval suite runs against new model before switching in production