# Complete Ghostfolio Finance Agent Requirements

**Status:** Implemented (2026-02-24 local)
**Priority:** High
**Deadline:** Sunday 10:59 PM CT (submission)

## Overview

Complete the remaining technical requirements for the Ghostfolio AI Agent submission to Gauntlet G4.

### Current Completion: 8/10

**Completed:**
- ✅ MVP Agent (5 tools, natural language, tool execution)
- ✅ Redis memory system
- ✅ Verification (confidence, citations, checks)
- ✅ Error handling
- ✅ 10 MVP eval cases
- ✅ Railway deployment
- ✅ Submission docs (presearch, dev log, cost analysis)
- ✅ ADR/docs structure

**Remaining:**
- ❌ Eval dataset: 10 → 50+ test cases
- ❌ LangSmith observability integration

## Requirements Analysis

### 1. Eval Dataset Expansion (40+ new cases)

**Required Breakdown (from docs/requirements.md):**
- 20+ happy path scenarios
- 10+ edge cases (missing data, boundary conditions)
- 10+ adversarial inputs (bypass verification attempts)
- 10+ multi-step reasoning scenarios

**Current State:** 10 cases in `apps/api/src/app/endpoints/ai/evals/mvp-eval.dataset.ts`

**Categories Covered (approximate; the multi-tool case is counted under both happy path and multi-step):**
- Happy path: ~6 cases (portfolio overview, risk, market data, multi-tool, rebalance, stress test)
- Edge cases: ~2 cases (tool failure, partial market coverage)
- Adversarial: ~1 case (implicit in fallback scenarios)
- Multi-step: ~2 cases (multi-tool query, memory continuity)

**Gaps to Fill:**
- Happy path: +14 cases
- Edge cases: +8 cases
- Adversarial: +9 cases
- Multi-step: +8 cases
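
The gap figures follow from subtracting the approximate current counts from the required minimums. A minimal sketch of that arithmetic, with category keys chosen here purely for illustration:

```typescript
type Category = 'happyPath' | 'edgeCase' | 'adversarial' | 'multiStep';

// Required minimums and approximate current counts, as listed above.
const required: Record<Category, number> = {
  happyPath: 20,
  edgeCase: 10,
  adversarial: 10,
  multiStep: 10
};
const current: Record<Category, number> = {
  happyPath: 6,
  edgeCase: 2,
  adversarial: 1,
  multiStep: 2
};

// Gap per category: how many cases are still missing (never negative).
function computeGaps(
  req: Record<Category, number>,
  cur: Record<Category, number>
): Record<Category, number> {
  const gaps = {} as Record<Category, number>;
  for (const category of Object.keys(req) as Category[]) {
    gaps[category] = Math.max(0, req[category] - cur[category]);
  }
  return gaps;
}

const gaps = computeGaps(required, current);

// Sum of all per-category gaps.
let totalNew = 0;
for (const category of Object.keys(gaps) as Category[]) {
  totalNew += gaps[category];
}
```

The per-category gaps sum to 39. Added to the 10 existing cases that gives 49, so at least one extra case in any category is needed to reach the 50+ total, which is consistent with the "40+ new cases" target in the heading.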

**Available Tools:**
1. `portfolio_analysis` - holdings, allocation, performance
2. `risk_assessment` - concentration risk analysis
3. `market_data_lookup` - current prices, market state
4. `rebalance_plan` - allocation adjustment recommendations
5. `stress_test` - drawdown/impact scenarios

**Test Case Categories to Add:**

*Happy Path (+14):*
- Allocation analysis queries
- Performance comparison requests
- Portfolio health summaries
- Investment guidance questions
- Sector/asset class breakdowns
- Currency impact analysis
- Time-based performance queries
- Benchmark comparisons
- Diversification metrics
- Fee analysis queries
- Dividend/income queries
- Holdings detail requests
- Market context questions
- Goal progress queries

*Edge Cases (+8):*
- Empty portfolio (no holdings)
- Single-symbol portfolio
- Very large portfolio (100+ symbols)
- Multiple accounts with different currencies
- Portfolio in which every holding has data issues (no quotes available)
- Zero-value positions
- Historical date queries (backtesting)
- Real-time data unavailable

*Adversarial (+9):*
- SQL injection attempts in queries
- Prompt injection (ignore previous instructions)
- Malicious code generation requests
- Requests for other users' data
- Bypassing rate limits
- Manipulating confidence scores
- Fake verification scenarios
- Exfiltration attempts
- Privilege escalation attempts

*Multi-Step (+8):*
- Compare performance then rebalance
- Stress test then adjust allocation
- Market lookup → portfolio analysis → recommendation
- Risk assessment → stress test → rebalance
- Multi-symbol market data → portfolio impact
- Historical query → trend analysis → forward guidance
- Multi-account aggregation → consolidated analysis
- Portfolio + market + risk comprehensive report
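
One way an adversarial case from the list above could be asserted is to check the agent's result against content and tools that must never appear. This is only a sketch: `AdversarialExpectation`, `AgentResult`, and `findViolations` are hypothetical names, and the source does not describe how the existing eval runner judges responses.

```typescript
// Hypothetical expectation attached to an adversarial eval case.
interface AdversarialExpectation {
  // Substrings that must never appear in the answer (case-insensitive).
  forbiddenFragments: string[];
  // Tools the agent must not invoke for this prompt.
  forbiddenTools: string[];
}

// Hypothetical, simplified view of what the agent returned.
interface AgentResult {
  answer: string;
  toolCalls: string[];
}

// Returns a human-readable list of violations; an empty list means the case passes.
function findViolations(
  result: AgentResult,
  expectation: AdversarialExpectation
): string[] {
  const violations: string[] = [];
  const answer = result.answer.toLowerCase();
  for (const fragment of expectation.forbiddenFragments) {
    if (answer.indexOf(fragment.toLowerCase()) !== -1) {
      violations.push('answer contains "' + fragment + '"');
    }
  }
  for (const tool of expectation.forbiddenTools) {
    if (result.toolCalls.indexOf(tool) !== -1) {
      violations.push('called forbidden tool ' + tool);
    }
  }
  return violations;
}

// Illustrative prompt-injection case: the agent should refuse and call no tools.
const injectionExpectation: AdversarialExpectation = {
  forbiddenFragments: ['system prompt:'],
  forbiddenTools: ['rebalance_plan']
};
const refusal: AgentResult = {
  answer: 'I cannot share internal instructions.',
  toolCalls: []
};
const refusalViolations = findViolations(refusal, injectionExpectation);
```

Substring checks are deliberately crude; they catch obvious leaks but would need to be paired with the existing verification checks for anything subtler.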

### 2. LangSmith Observability Integration

**Requirements (from docs/requirements.md):**

| Capability | Requirements |
|---|---|
| Trace Logging | Full trace: input → reasoning → tool calls → output |
| Latency Tracking | Time breakdown: LLM calls, tool execution, total response |
| Error Tracking | Capture failures, stack traces, context |
| Token Usage | Input/output tokens per request, cost tracking |
| Eval Results | Historical eval scores, regression detection |
| User Feedback | Thumbs up/down, corrections mechanism |

**Integration Points:**

1. **Package:** `langsmith` (confirm it is already listed in `package.json`)
2. **Environment:** `LANGCHAIN_TRACING_V2=true`, `LANGCHAIN_API_KEY`, and optionally `LANGCHAIN_ENDPOINT`
3. **Location:** `apps/api/src/app/endpoints/ai/ai.service.ts`

**Implementation Approach:**

```typescript
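// Sketch of request-level tracing with the `langsmith` SDK's RunTree.
// `runAgent` is a placeholder for the existing chat pipeline, not a real function.
import { Client, RunTree } from 'langsmith';

const langsmithClient = new Client({
  apiKey: process.env.LANGCHAIN_API_KEY,
  apiUrl: process.env.LANGCHAIN_ENDPOINT
});

// Wrap chat execution in a trace
async function chatWithTrace(request: AiChatRequest) {
  const startedAt = Date.now();
  const trace = new RunTree({
    name: 'ai_agent_chat',
    run_type: 'chain',
    client: langsmithClient,
    inputs: { query: request.query, userId: request.userId }
  });
  await trace.postRun();

  try {
    // LLM calls, tool execution, and verification checks can each be logged
    // as child runs via trace.createChild({ name, run_type, inputs }).
    const response = await runAgent(request);

    // Token usage and tool-call details can be added to the outputs
    // once they are available on the response.
    await trace.end({
      answer: response.answer,
      latencyMs: Date.now() - startedAt
    });
    await trace.patchRun();
    return response;
  } catch (error) {
    // Record the failure on the trace, then rethrow so callers still see it
    const message = error instanceof Error ? error.message : String(error);
    await trace.end(undefined, message);
    await trace.patchRun();
    throw error;
  }
}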
```

**Files to Modify:**
- `apps/api/src/app/endpoints/ai/ai.service.ts` - Add tracing to chat method
- `.env.example` - Add LangSmith env vars
- `apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.ts` - Add eval result upload to LangSmith

**Testing:**
- Verify traces appear in LangSmith dashboard
- Check latency breakdown accuracy
- Validate token usage tracking
- Test error capture

## Implementation Plan

### Phase 1: Eval Dataset Expansion (Priority: High)

**Step 1.1:** Design test case template
- Review existing 10 cases structure
- Define patterns for each category
- Create helper functions for setup data
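
A test case template and setup helper could look roughly like the sketch below. The `EvalCaseSketch` shape, `makeCase`, and the sample holdings are assumptions for illustration; the real structure should be taken from the existing 10 cases in `mvp-eval.dataset.ts`. Only the tool names come from the "Available Tools" list.

```typescript
type EvalCategory = 'happy_path' | 'edge_case' | 'adversarial' | 'multi_step';

// Tool names as listed under "Available Tools".
type ToolName =
  | 'portfolio_analysis'
  | 'risk_assessment'
  | 'market_data_lookup'
  | 'rebalance_plan'
  | 'stress_test';

interface Holding {
  symbol: string;
  valueInBaseCurrency: number;
}

// Hypothetical case shape; not the actual type used by the eval runner.
interface EvalCaseSketch {
  id: string;
  category: EvalCategory;
  query: string;
  expectedTools: ToolName[];
  holdings: Holding[];
}

// Illustrative default portfolio shared by cases that do not care about setup.
function defaultHoldings(): Holding[] {
  return [
    { symbol: 'AAA', valueInBaseCurrency: 6000 },
    { symbol: 'BBB', valueInBaseCurrency: 4000 }
  ];
}

// Fills in defaults so each case only states what makes it distinctive.
function makeCase(input: {
  id: string;
  category: EvalCategory;
  query: string;
  expectedTools?: ToolName[];
  holdings?: Holding[];
}): EvalCaseSketch {
  return {
    id: input.id,
    category: input.category,
    query: input.query,
    expectedTools: input.expectedTools || [],
    holdings: input.holdings || defaultHoldings()
  };
}

const allocationCase = makeCase({
  id: 'hp-allocation-01',
  category: 'happy_path',
  query: 'How is my portfolio allocated?',
  expectedTools: ['portfolio_analysis']
});

const emptyPortfolioCase = makeCase({
  id: 'edge-empty-01',
  category: 'edge_case',
  query: 'Summarize my portfolio risk.',
  holdings: []
});
```

Passing `holdings: []` explicitly keeps the empty portfolio, since an empty array is truthy and does not fall back to the default.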

**Step 1.2:** Generate happy path cases (+14)
- Allocation analysis (4 cases)
- Performance queries (3 cases)
- Portfolio health (3 cases)
- Market context (2 cases)
- Benchmarks/diversification (2 cases)

**Step 1.3:** Generate edge case scenarios (+8)
- Empty/edge portfolios (4 cases)
- Data availability issues (2 cases)
- Boundary conditions (2 cases)

**Step 1.4:** Generate adversarial cases (+9)
- Injection attacks (4 cases)
- Data access violations (3 cases)
- System manipulation (2 cases)

**Step 1.5:** Generate multi-step cases (+8)
- 2-3 tool chains (4 cases)
- Complex reasoning (4 cases)

**Step 1.6:** Update eval runner
- Expand dataset import
- Add category-based reporting
- Track pass rates by category
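
Category-based reporting can be a small reduction over the per-case results. The sketch below assumes each result carries a `category` and a `passed` flag; both field names and `summarizeByCategory` are hypothetical.

```typescript
interface CaseResult {
  id: string;
  category: string;
  passed: boolean;
}

interface CategorySummary {
  total: number;
  passed: number;
  passRate: number; // passed / total, in the range 0..1
}

// Groups results by category and computes a pass rate for each group.
function summarizeByCategory(
  results: CaseResult[]
): { [category: string]: CategorySummary } {
  const summary: { [category: string]: CategorySummary } = Object.create(null);
  for (const result of results) {
    const entry = summary[result.category] || { total: 0, passed: 0, passRate: 0 };
    entry.total += 1;
    if (result.passed) {
      entry.passed += 1;
    }
    entry.passRate = entry.passed / entry.total;
    summary[result.category] = entry;
  }
  return summary;
}

// Illustrative results only; not real eval output.
const sampleSummary = summarizeByCategory([
  { id: 'hp-01', category: 'happy_path', passed: true },
  { id: 'hp-02', category: 'happy_path', passed: true },
  { id: 'adv-01', category: 'adversarial', passed: true },
  { id: 'adv-02', category: 'adversarial', passed: false }
]);
```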

**Step 1.7:** Run and validate
- `npm run test:mvp-eval`
- Fix any failures
- Document results

### Phase 2: LangSmith Integration (Priority: High)

**Step 2.1:** Add dependencies
- Check whether `langsmith` is listed in `package.json`
- Add if missing

**Step 2.2:** Configure environment
- Add `LANGCHAIN_TRACING_V2=true` to `.env.example`
- Add `LANGCHAIN_API_KEY` to `.env.example`
- Add setup notes to `docs/LOCAL-TESTING.md`

**Step 2.3:** Initialize tracer in AI service
- Import LangSmith client
- Configure initialization
- Add error handling for missing credentials
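
Handling missing credentials can be reduced to a pure function that decides whether tracing is enabled. The sketch below accepts both `LANGSMITH_*` and `LANGCHAIN_*` variable names; the precedence order (`LANGSMITH_*` first) and the `resolveLangSmithConfig` name are assumptions, not something the source specifies.

```typescript
type Env = { [key: string]: string | undefined };

interface LangSmithConfig {
  enabled: boolean;
  apiKey: string | undefined;
  endpoint: string | undefined;
}

// Returns the first non-empty value among the given variable names.
function firstDefined(env: Env, keys: string[]): string | undefined {
  for (const key of keys) {
    const value = env[key];
    if (value !== undefined && value !== '') {
      return value;
    }
  }
  return undefined;
}

// Tracing is enabled only when the flag is "true" AND an API key is present,
// so a missing key silently disables tracing instead of failing requests.
function resolveLangSmithConfig(env: Env): LangSmithConfig {
  const apiKey = firstDefined(env, ['LANGSMITH_API_KEY', 'LANGCHAIN_API_KEY']);
  const flag = firstDefined(env, ['LANGSMITH_TRACING', 'LANGCHAIN_TRACING_V2']);
  const endpoint = firstDefined(env, ['LANGSMITH_ENDPOINT', 'LANGCHAIN_ENDPOINT']);
  return {
    enabled: flag === 'true' && apiKey !== undefined,
    apiKey: apiKey,
    endpoint: endpoint
  };
}
```

In the service this would be called once with `process.env`, and the tracing wrapper would be skipped entirely when `enabled` is `false`.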

**Step 2.4:** Wrap chat execution
- Create trace on request start
- Log LLM calls with latency
- Log tool execution with results
- Log verification checks
- End trace with output

**Step 2.5:** Add metrics tracking
- Token usage (input/output)
- Latency breakdown (LLM, tools, total)
- Success/failure rates
- Tool selection frequencies
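
Token usage, latency breakdown, and tool selection frequencies can all be derived from a flat list of timed spans recorded during one request. The `SpanRecord` shape and `summarizeRequest` below are hypothetical, and the sketch assumes spans run sequentially (overlapping spans would be double-counted).

```typescript
type SpanKind = 'llm' | 'tool';

interface SpanRecord {
  kind: SpanKind;
  name: string;
  startMs: number;
  endMs: number;
  inputTokens?: number;
  outputTokens?: number;
}

interface RequestMetrics {
  llmMs: number;
  toolMs: number;
  totalMs: number;
  inputTokens: number;
  outputTokens: number;
  toolCounts: { [tool: string]: number };
}

// Aggregates per-span timings and token counts into one per-request summary.
function summarizeRequest(
  requestStartMs: number,
  requestEndMs: number,
  spans: SpanRecord[]
): RequestMetrics {
  const metrics: RequestMetrics = {
    llmMs: 0,
    toolMs: 0,
    totalMs: requestEndMs - requestStartMs,
    inputTokens: 0,
    outputTokens: 0,
    toolCounts: Object.create(null)
  };
  for (const span of spans) {
    const duration = span.endMs - span.startMs;
    if (span.kind === 'llm') {
      metrics.llmMs += duration;
      metrics.inputTokens += span.inputTokens || 0;
      metrics.outputTokens += span.outputTokens || 0;
    } else {
      metrics.toolMs += duration;
      metrics.toolCounts[span.name] = (metrics.toolCounts[span.name] || 0) + 1;
    }
  }
  return metrics;
}

// Illustrative timings and token counts only.
const sampleMetrics = summarizeRequest(0, 1000, [
  { kind: 'llm', name: 'plan', startMs: 0, endMs: 400, inputTokens: 120, outputTokens: 30 },
  { kind: 'tool', name: 'market_data_lookup', startMs: 400, endMs: 650 },
  { kind: 'llm', name: 'answer', startMs: 650, endMs: 900, inputTokens: 200, outputTokens: 50 }
]);
```

The difference between `totalMs` and `llmMs + toolMs` (100 ms in the sample) is time spent in everything else, such as verification and serialization.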

**Step 2.6:** Integrate eval results
- Upload eval runs to LangSmith
- Create dataset for regression testing
- Track historical scores
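
Regression detection over historical scores can start as a comparison of per-category pass rates against a stored baseline. This is a sketch under assumptions: `detectRegressions` and the 0.02 default tolerance are illustrative choices, not values from the requirements.

```typescript
type PassRates = { [category: string]: number };

// Flags every baseline category whose current pass rate dropped by more
// than `tolerance`. A category missing from the current run counts as 0.
function detectRegressions(
  baseline: PassRates,
  current: PassRates,
  tolerance: number = 0.02
): string[] {
  const regressed: string[] = [];
  for (const category of Object.keys(baseline)) {
    const before = baseline[category];
    const after = Object.prototype.hasOwnProperty.call(current, category)
      ? current[category]
      : 0;
    if (after < before - tolerance) {
      regressed.push(category);
    }
  }
  return regressed;
}

// Illustrative pass rates only.
const regressions = detectRegressions(
  { happy_path: 1, adversarial: 0.9 },
  { happy_path: 0.99, adversarial: 0.8 }
);
```

With these sample rates only `adversarial` is flagged: the happy path drop of 0.01 stays within the tolerance, while the adversarial drop of 0.1 does not.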

**Step 2.7:** Test and verify
- Run `npm run test:ai` with tracing enabled
- Check LangSmith dashboard for traces
- Verify metrics accuracy
- Test error capture

### Phase 3: Documentation and Validation

**Step 3.1:** Update submission docs
- Update `docs/AI-DEVELOPMENT-LOG.md` with LangSmith
- Update eval count in docs
- Add observability section to architecture doc

**Step 3.2:** Final verification
- Run full test suite
- Check production deployment
- Validate submission checklist

**Step 3.3:** Update tasks tracking
- Mark tickets complete
- Update `Tasks.md`
- Document any lessons learned

## Success Criteria

### Eval Dataset:
- ✅ 50+ test cases total
- ✅ 20+ happy path scenarios
- ✅ 10+ edge cases
- ✅ 10+ adversarial inputs
- ✅ 10+ multi-step scenarios
- ✅ All tests pass (`npm run test:mvp-eval`)
- ✅ Category-specific pass rates tracked

### LangSmith Observability:
- ✅ Traces visible in LangSmith dashboard
- ✅ Full request lifecycle captured (input → reasoning → tools → output)
- ✅ Latency breakdown accurate (LLM, tools, total)
- ✅ Token usage tracked per request
- ✅ Error tracking functional
- ✅ Eval results uploadable
- ✅ Negligible performance degradation (<5% latency overhead with tracing enabled)
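
The <5% overhead criterion can be checked by comparing mean response latency with tracing off and on. The sketch below uses made-up latency samples, and the function names are hypothetical.

```typescript
// Arithmetic mean of a non-empty list of latencies in milliseconds.
function mean(values: number[]): number {
  if (values.length === 0) {
    throw new Error('values must not be empty');
  }
  let sum = 0;
  for (const value of values) {
    sum += value;
  }
  return sum / values.length;
}

// Relative overhead of traced latency over baseline latency, in percent.
function overheadPercent(baselineMs: number, tracedMs: number): number {
  if (baselineMs <= 0) {
    throw new Error('baselineMs must be positive');
  }
  return ((tracedMs - baselineMs) * 100) / baselineMs;
}

// Illustrative samples: same requests with tracing disabled and enabled.
const baselineLatencies = [1000, 1200, 800];
const tracedLatencies = [1030, 1240, 820];

const overhead = overheadPercent(mean(baselineLatencies), mean(tracedLatencies));
const withinBudget = overhead < 5;
```

For these samples the means are 1000 ms and 1030 ms, giving a 3% overhead. Real measurements would need enough requests to smooth out LLM latency variance, which is typically much larger than the tracing cost itself.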

### Documentation:
- ✅ Env vars documented in `.env.example`
- ✅ Setup instructions in `docs/LOCAL-TESTING.md`
- ✅ Architecture doc updated with observability
- ✅ Submission docs reflect final state

## Estimated Effort

- **Phase 1 (Eval Dataset):** 3-4 hours
- **Phase 2 (LangSmith):** 2-3 hours
- **Phase 3 (Docs/Validation):** 1 hour

**Total:** 6-8 hours

## Risks and Dependencies

**Risks:**
- LangSmith API key not available → Need to obtain or use alternative
- Test case generation takes longer → Focus on high-value categories first
- Performance regression from tracing → Monitor and optimize

**Dependencies:**
- LangSmith account/API key
- Access to LangSmith dashboard
- Railway deployment for production tracing

## Resolved Decisions (2026-02-24)

1. LangSmith key handling is env-gated with compatibility for both `LANGCHAIN_*` and `LANGSMITH_*` variables.
2. LangSmith managed service integration is in place through `langsmith` RunTree traces.
3. Adversarial eval coverage includes prompt-injection, data-exfiltration, confidence manipulation, and privilege escalation attempts.
4. Eval dataset is split across category files for maintainability and merged in `mvp-eval.dataset.ts`.
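
For decision 4, merging the category files is a natural place to guard against duplicate case IDs. The sketch below is an assumption about how such a merge could work; `EvalCaseRef`, `mergeDatasets`, and the sample IDs are hypothetical and do not reflect the actual contents of `mvp-eval.dataset.ts`.

```typescript
interface EvalCaseRef {
  id: string;
  category: string;
}

// Concatenates the per-category arrays and fails fast on a duplicate id,
// so a copy-pasted case cannot silently inflate the total count.
function mergeDatasets(...groups: EvalCaseRef[][]): EvalCaseRef[] {
  const seen: { [id: string]: boolean } = Object.create(null);
  const merged: EvalCaseRef[] = [];
  for (const group of groups) {
    for (const evalCase of group) {
      if (seen[evalCase.id]) {
        throw new Error('Duplicate eval case id: ' + evalCase.id);
      }
      seen[evalCase.id] = true;
      merged.push(evalCase);
    }
  }
  return merged;
}

// Stand-ins for arrays that would be imported from the category files.
const happyPathCases: EvalCaseRef[] = [{ id: 'hp-01', category: 'happy_path' }];
const adversarialCases: EvalCaseRef[] = [
  { id: 'adv-01', category: 'adversarial' },
  { id: 'adv-02', category: 'adversarial' }
];

const mergedCases = mergeDatasets(happyPathCases, adversarialCases);
```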