9.4 KiB
Complete Ghostfolio Finance Agent Requirements
Status: Implemented (2026-02-24 local) Priority: High Deadline: Sunday 10:59 PM CT (submission)
Overview
Complete the remaining technical requirements for the Ghostfolio AI Agent submission to Gauntlet G4.
Current Completion: 6/10
Completed:
- ✅ MVP Agent (5 tools, natural language, tool execution)
- ✅ Redis memory system
- ✅ Verification (confidence, citations, checks)
- ✅ Error handling
- ✅ 10 MVP eval cases
- ✅ Railway deployment
- ✅ Submission docs (presearch, dev log, cost analysis)
- ✅ ADR/docs structure
Remaining:
- ❌ Eval dataset: 10 → 50+ test cases
- ❌ LangSmith observability integration
Requirements Analysis
1. Eval Dataset Expansion (40+ new cases)
Required Breakdown (from docs/requirements.md):
- 20+ happy path scenarios
- 10+ edge cases (missing data, boundary conditions)
- 10+ adversarial inputs (bypass verification attempts)
- 10+ multi-step reasoning scenarios
Current State: 10 cases in apps/api/src/app/endpoints/ai/evals/mvp-eval.dataset.ts
Categories Covered:
- Happy path: ~6 cases (portfolio overview, risk, market data, multi-tool, rebalance, stress test)
- Edge cases: ~2 cases (tool failure, partial market coverage)
- Adversarial: ~1 case (implicit in fallback scenarios)
- Multi-step: ~2 cases (multi-tool query, memory continuity)
Gaps to Fill:
- Happy path: +14 cases
- Edge cases: +8 cases
- Adversarial: +9 cases
- Multi-step: +8 cases
Available Tools:
portfolio_analysis- holdings, allocation, performancerisk_assessment- concentration risk analysismarket_data_lookup- current prices, market staterebalance_plan- allocation adjustment recommendationsstress_test- drawdown/impact scenarios
Test Case Categories to Add:
Happy Path (+14):
- Allocation analysis queries
- Performance comparison requests
- Portfolio health summaries
- Investment guidance questions
- Sector/asset class breakdowns
- Currency impact analysis
- Time-based performance queries
- Benchmark comparisons
- Diversification metrics
- Fee analysis queries
- Dividend/income queries
- Holdings detail requests
- Market context questions
- Goal progress queries
Edge Cases (+8):
- Empty portfolio (no holdings)
- Single-symbol portfolio
- Very large portfolio (100+ symbols)
- Multiple accounts with different currencies
- Portfolio with only data issues (no quotes available)
- Zero-value positions
- Historical date queries (backtesting)
- Real-time data unavailable
Adversarial (+9):
- SQL injection attempts in queries
- Prompt injection (ignore previous instructions)
- Malicious code generation requests
- Requests for other users' data
- Bypassing rate limits
- Manipulating confidence scores
- Fake verification scenarios
- Exfiltration attempts
- Privilege escalation attempts
Multi-Step (+8):
- Compare performance then rebalance
- Stress test then adjust allocation
- Market lookup → portfolio analysis → recommendation
- Risk assessment → stress test → rebalance
- Multi-symbol market data → portfolio impact
- Historical query → trend analysis → forward guidance
- Multi-account aggregation → consolidated analysis
- Portfolio + market + risk comprehensive report
2. LangSmith Observability Integration
Requirements (from docs/requirements.md):
| Capability | Requirements |
|---|---|
| Trace Logging | Full trace: input → reasoning → tool calls → output |
| Latency Tracking | Time breakdown: LLM calls, tool execution, total response |
| Error Tracking | Capture failures, stack traces, context |
| Token Usage | Input/output tokens per request, cost tracking |
| Eval Results | Historical eval scores, regression detection |
| User Feedback | Thumbs up/down, corrections mechanism |
Integration Points:
- Package:
@langchain/langsmith(already in dependencies?) - Environment:
LANGCHAIN_TRACING_V2=true,LANGCHAIN_API_KEY - Location:
apps/api/src/app/endpoints/ai/ai.service.ts
Implementation Approach:
// Initialize LangSmith tracer
import { Client } from '@langchain/langsmith';
const langsmithClient = new Client({
apiKey: process.env.LANGCHAIN_API_KEY,
apiUrl: process.env.LANGCHAIN_ENDPOINT
});
// Wrap chat execution in trace
async function chatWithTrace(request: AiChatRequest) {
const trace = langsmithClient.run({
name: 'ai_agent_chat',
inputs: { query: request.query, userId: request.userId }
});
try {
// Log LLM calls
// Log tool execution
// Log verification checks
// Log final output
await trace.end({
outputs: { answer: response.answer },
metadata: { latency, tokens, toolCalls }
});
} catch (error) {
await trace.end({ error: error.message });
}
}
Files to Modify:
apps/api/src/app/endpoints/ai/ai.service.ts- Add tracing to chat method.env.example- Add LangSmith env varsapps/api/src/app/endpoints/ai/evals/mvp-eval.runner.ts- Add eval result upload to LangSmith
Testing:
- Verify traces appear in LangSmith dashboard
- Check latency breakdown accuracy
- Validate token usage tracking
- Test error capture
Implementation Plan
Phase 1: Eval Dataset Expansion (Priority: High)
Step 1.1: Design test case template
- Review existing 10 cases structure
- Define patterns for each category
- Create helper functions for setup data
Step 1.2: Generate happy path cases (+14)
- Allocation analysis (4 cases)
- Performance queries (3 cases)
- Portfolio health (3 cases)
- Market context (2 cases)
- Benchmarks/diversification (2 cases)
Step 1.3: Generate edge case scenarios (+8)
- Empty/edge portfolios (4 cases)
- Data availability issues (2 cases)
- Boundary conditions (2 cases)
Step 1.4: Generate adversarial cases (+9)
- Injection attacks (4 cases)
- Data access violations (3 cases)
- System manipulation (2 cases)
Step 1.5: Generate multi-step cases (+8)
- 2-3 tool chains (4 cases)
- Complex reasoning (4 cases)
Step 1.6: Update eval runner
- Expand dataset import
- Add category-based reporting
- Track pass rates by category
Step 1.7: Run and validate
npm run test:mvp-eval- Fix any failures
- Document results
Phase 2: LangSmith Integration (Priority: High)
Step 2.1: Add dependencies
- Check if
@langchain/langsmithin package.json - Add if missing
Step 2.2: Configure environment
- Add
LANGCHAIN_TRACING_V2=trueto.env.example - Add
LANGCHAIN_API_KEYto.env.example - Add setup notes to
docs/LOCAL-TESTING.md
Step 2.3: Initialize tracer in AI service
- Import LangSmith client
- Configure initialization
- Add error handling for missing credentials
Step 2.4: Wrap chat execution
- Create trace on request start
- Log LLM calls with latency
- Log tool execution with results
- Log verification checks
- End trace with output
Step 2.5: Add metrics tracking
- Token usage (input/output)
- Latency breakdown (LLM, tools, total)
- Success/failure rates
- Tool selection frequencies
Step 2.6: Integrate eval results
- Upload eval runs to LangSmith
- Create dataset for regression testing
- Track historical scores
Step 2.7: Test and verify
- Run
npm run test:aiwith tracing enabled - Check LangSmith dashboard for traces
- Verify metrics accuracy
- Test error capture
Phase 3: Documentation and Validation
Step 3.1: Update submission docs
- Update
docs/AI-DEVELOPMENT-LOG.mdwith LangSmith - Update eval count in docs
- Add observability section to architecture doc
Step 3.2: Final verification
- Run full test suite
- Check production deployment
- Validate submission checklist
Step 3.3: Update tasks tracking
- Mark tickets complete
- Update
Tasks.md - Document any lessons learned
Success Criteria
Eval Dataset:
- ✅ 50+ test cases total
- ✅ 20+ happy path scenarios
- ✅ 10+ edge cases
- ✅ 10+ adversarial inputs
- ✅ 10+ multi-step scenarios
- ✅ All tests pass (
npm run test:mvp-eval) - ✅ Category-specific pass rates tracked
LangSmith Observability:
- ✅ Traces visible in LangSmith dashboard
- ✅ Full request lifecycle captured (input → reasoning → tools → output)
- ✅ Latency breakdown accurate (LLM, tools, total)
- ✅ Token usage tracked per request
- ✅ Error tracking functional
- ✅ Eval results uploadable
- ✅ Zero performance degradation (<5% overhead)
Documentation:
- ✅ Env vars documented in
.env.example - ✅ Setup instructions in
docs/LOCAL-TESTING.md - ✅ Architecture doc updated with observability
- ✅ Submission docs reflect final state
Estimated Effort
- Phase 1 (Eval Dataset): 3-4 hours
- Phase 2 (LangSmith): 2-3 hours
- Phase 3 (Docs/Validation): 1 hour
Total: 6-8 hours
Risks and Dependencies
Risks:
- LangSmith API key not available → Need to obtain or use alternative
- Test case generation takes longer → Focus on high-value categories first
- Performance regression from tracing → Monitor and optimize
Dependencies:
- LangSmith account/API key
- Access to LangSmith dashboard
- Railway deployment for production tracing
Resolved Decisions (2026-02-24)
- LangSmith key handling is env-gated with compatibility for both
LANGCHAIN_*andLANGSMITH_*variables. - LangSmith managed service integration is in place through
langsmithRunTree traces. - Adversarial eval coverage includes prompt-injection, data-exfiltration, confidence manipulation, and privilege escalation attempts.
- Eval dataset is split across category files for maintainability and merged in
mvp-eval.dataset.ts.