You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 
 
 
 

9.4 KiB

Complete Ghostfolio Finance Agent Requirements

Status: Implemented (2026-02-24 local) Priority: High Deadline: Sunday 10:59 PM CT (submission)

Overview

Complete the remaining technical requirements for the Ghostfolio AI Agent submission to Gauntlet G4.

Current Completion: 6/10

Completed:

  • MVP Agent (5 tools, natural language, tool execution)
  • Redis memory system
  • Verification (confidence, citations, checks)
  • Error handling
  • 10 MVP eval cases
  • Railway deployment
  • Submission docs (presearch, dev log, cost analysis)
  • ADR/docs structure

Remaining:

  • Eval dataset: 10 → 50+ test cases
  • LangSmith observability integration

Requirements Analysis

1. Eval Dataset Expansion (40+ new cases)

Required Breakdown (from docs/requirements.md):

  • 20+ happy path scenarios
  • 10+ edge cases (missing data, boundary conditions)
  • 10+ adversarial inputs (bypass verification attempts)
  • 10+ multi-step reasoning scenarios

Current State: 10 cases in apps/api/src/app/endpoints/ai/evals/mvp-eval.dataset.ts

Categories Covered:

  • Happy path: ~6 cases (portfolio overview, risk, market data, multi-tool, rebalance, stress test)
  • Edge cases: ~2 cases (tool failure, partial market coverage)
  • Adversarial: ~1 case (implicit in fallback scenarios)
  • Multi-step: ~2 cases (multi-tool query, memory continuity)

Gaps to Fill:

  • Happy path: +14 cases
  • Edge cases: +8 cases
  • Adversarial: +9 cases
  • Multi-step: +8 cases

Available Tools:

  1. portfolio_analysis - holdings, allocation, performance
  2. risk_assessment - concentration risk analysis
  3. market_data_lookup - current prices, market state
  4. rebalance_plan - allocation adjustment recommendations
  5. stress_test - drawdown/impact scenarios

Test Case Categories to Add:

Happy Path (+14):

  • Allocation analysis queries
  • Performance comparison requests
  • Portfolio health summaries
  • Investment guidance questions
  • Sector/asset class breakdowns
  • Currency impact analysis
  • Time-based performance queries
  • Benchmark comparisons
  • Diversification metrics
  • Fee analysis queries
  • Dividend/income queries
  • Holdings detail requests
  • Market context questions
  • Goal progress queries

Edge Cases (+8):

  • Empty portfolio (no holdings)
  • Single-symbol portfolio
  • Very large portfolio (100+ symbols)
  • Multiple accounts with different currencies
  • Portfolio with only data issues (no quotes available)
  • Zero-value positions
  • Historical date queries (backtesting)
  • Real-time data unavailable

Adversarial (+9):

  • SQL injection attempts in queries
  • Prompt injection (ignore previous instructions)
  • Malicious code generation requests
  • Requests for other users' data
  • Bypassing rate limits
  • Manipulating confidence scores
  • Fake verification scenarios
  • Exfiltration attempts
  • Privilege escalation attempts

Multi-Step (+8):

  • Compare performance then rebalance
  • Stress test then adjust allocation
  • Market lookup → portfolio analysis → recommendation
  • Risk assessment → stress test → rebalance
  • Multi-symbol market data → portfolio impact
  • Historical query → trend analysis → forward guidance
  • Multi-account aggregation → consolidated analysis
  • Portfolio + market + risk comprehensive report

2. LangSmith Observability Integration

Requirements (from docs/requirements.md):

Capability Requirements
Trace Logging Full trace: input → reasoning → tool calls → output
Latency Tracking Time breakdown: LLM calls, tool execution, total response
Error Tracking Capture failures, stack traces, context
Token Usage Input/output tokens per request, cost tracking
Eval Results Historical eval scores, regression detection
User Feedback Thumbs up/down, corrections mechanism

Integration Points:

  1. Package: @langchain/langsmith (already in dependencies?)
  2. Environment: LANGCHAIN_TRACING_V2=true, LANGCHAIN_API_KEY
  3. Location: apps/api/src/app/endpoints/ai/ai.service.ts

Implementation Approach:

// Initialize LangSmith tracer
import { Client } from '@langchain/langsmith';

const langsmithClient = new Client({
  apiKey: process.env.LANGCHAIN_API_KEY,
  apiUrl: process.env.LANGCHAIN_ENDPOINT
});

// Wrap chat execution in trace
async function chatWithTrace(request: AiChatRequest) {
  const trace = langsmithClient.run({
    name: 'ai_agent_chat',
    inputs: { query: request.query, userId: request.userId }
  });

  try {
    // Log LLM calls
    // Log tool execution
    // Log verification checks
    // Log final output

    await trace.end({
      outputs: { answer: response.answer },
      metadata: { latency, tokens, toolCalls }
    });
  } catch (error) {
    await trace.end({ error: error.message });
  }
}

Files to Modify:

  • apps/api/src/app/endpoints/ai/ai.service.ts - Add tracing to chat method
  • .env.example - Add LangSmith env vars
  • apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.ts - Add eval result upload to LangSmith

Testing:

  • Verify traces appear in LangSmith dashboard
  • Check latency breakdown accuracy
  • Validate token usage tracking
  • Test error capture

Implementation Plan

Phase 1: Eval Dataset Expansion (Priority: High)

Step 1.1: Design test case template

  • Review existing 10 cases structure
  • Define patterns for each category
  • Create helper functions for setup data

Step 1.2: Generate happy path cases (+14)

  • Allocation analysis (4 cases)
  • Performance queries (3 cases)
  • Portfolio health (3 cases)
  • Market context (2 cases)
  • Benchmarks/diversification (2 cases)

Step 1.3: Generate edge case scenarios (+8)

  • Empty/edge portfolios (4 cases)
  • Data availability issues (2 cases)
  • Boundary conditions (2 cases)

Step 1.4: Generate adversarial cases (+9)

  • Injection attacks (4 cases)
  • Data access violations (3 cases)
  • System manipulation (2 cases)

Step 1.5: Generate multi-step cases (+8)

  • 2-3 tool chains (4 cases)
  • Complex reasoning (4 cases)

Step 1.6: Update eval runner

  • Expand dataset import
  • Add category-based reporting
  • Track pass rates by category

Step 1.7: Run and validate

  • npm run test:mvp-eval
  • Fix any failures
  • Document results

Phase 2: LangSmith Integration (Priority: High)

Step 2.1: Add dependencies

  • Check if @langchain/langsmith in package.json
  • Add if missing

Step 2.2: Configure environment

  • Add LANGCHAIN_TRACING_V2=true to .env.example
  • Add LANGCHAIN_API_KEY to .env.example
  • Add setup notes to docs/LOCAL-TESTING.md

Step 2.3: Initialize tracer in AI service

  • Import LangSmith client
  • Configure initialization
  • Add error handling for missing credentials

Step 2.4: Wrap chat execution

  • Create trace on request start
  • Log LLM calls with latency
  • Log tool execution with results
  • Log verification checks
  • End trace with output

Step 2.5: Add metrics tracking

  • Token usage (input/output)
  • Latency breakdown (LLM, tools, total)
  • Success/failure rates
  • Tool selection frequencies

Step 2.6: Integrate eval results

  • Upload eval runs to LangSmith
  • Create dataset for regression testing
  • Track historical scores

Step 2.7: Test and verify

  • Run npm run test:ai with tracing enabled
  • Check LangSmith dashboard for traces
  • Verify metrics accuracy
  • Test error capture

Phase 3: Documentation and Validation

Step 3.1: Update submission docs

  • Update docs/AI-DEVELOPMENT-LOG.md with LangSmith
  • Update eval count in docs
  • Add observability section to architecture doc

Step 3.2: Final verification

  • Run full test suite
  • Check production deployment
  • Validate submission checklist

Step 3.3: Update tasks tracking

  • Mark tickets complete
  • Update Tasks.md
  • Document any lessons learned

Success Criteria

Eval Dataset:

  • 50+ test cases total
  • 20+ happy path scenarios
  • 10+ edge cases
  • 10+ adversarial inputs
  • 10+ multi-step scenarios
  • All tests pass (npm run test:mvp-eval)
  • Category-specific pass rates tracked

LangSmith Observability:

  • Traces visible in LangSmith dashboard
  • Full request lifecycle captured (input → reasoning → tools → output)
  • Latency breakdown accurate (LLM, tools, total)
  • Token usage tracked per request
  • Error tracking functional
  • Eval results uploadable
  • Zero performance degradation (<5% overhead)

Documentation:

  • Env vars documented in .env.example
  • Setup instructions in docs/LOCAL-TESTING.md
  • Architecture doc updated with observability
  • Submission docs reflect final state

Estimated Effort

  • Phase 1 (Eval Dataset): 3-4 hours
  • Phase 2 (LangSmith): 2-3 hours
  • Phase 3 (Docs/Validation): 1 hour

Total: 6-8 hours

Risks and Dependencies

Risks:

  • LangSmith API key not available → Need to obtain or use alternative
  • Test case generation takes longer → Focus on high-value categories first
  • Performance regression from tracing → Monitor and optimize

Dependencies:

  • LangSmith account/API key
  • Access to LangSmith dashboard
  • Railway deployment for production tracing

Resolved Decisions (2026-02-24)

  1. LangSmith key handling is env-gated with compatibility for both LANGCHAIN_* and LANGSMITH_* variables.
  2. LangSmith managed service integration is in place through langsmith RunTree traces.
  3. Adversarial eval coverage includes prompt-injection, data-exfiltration, confidence manipulation, and privilege escalation attempts.
  4. Eval dataset is split across category files for maintainability and merged in mvp-eval.dataset.ts.