mirror of https://github.com/ghostfolio/ghostfolio

Document agent architecture with request flow, tool inventory, verification pipeline, and deployment considerations. Add cost analysis and pre-search research documents. (pull/6458/head)

3 changed files with 373 additions and 0 deletions
# Agent Architecture Document

## Domain & Use Cases

**Domain**: Personal finance portfolio management (Ghostfolio fork)

**Problems solved**: Ghostfolio's existing UI requires manual navigation across multiple pages to understand portfolio state. The agent provides a conversational interface that synthesizes holdings, performance, market data, and transaction history into coherent natural language — and can execute write operations (account management, activity logging, watchlist, tags) with native tool approval gates.

**Target users**: Self-directed investors tracking multi-asset portfolios (stocks, ETFs, crypto) who want quick portfolio insights without clicking through dashboards.

## Agent Architecture

### Framework & Stack

| Layer | Choice | Rationale |
| --- | --- | --- |
| Runtime | NestJS (TypeScript) | Native to Ghostfolio codebase |
| Agent framework | Vercel AI SDK v6 (`ToolLoopAgent`) | Native TS, streaming SSE, built-in tool dispatch |
| LLM | Claude Sonnet 4.6 (default) | Strong function calling, structured output, 200K context |
| Model options | Haiku 4.5 ($0.005/chat), Sonnet 4.6 ($0.015/chat), Opus 4.6 ($0.077/chat) | User-selectable per session |
| Schemas | Zod v4 | Required by AI SDK v6 `inputSchema` |
| Database | Prisma + Postgres | Shared with Ghostfolio, plus agent-specific tables |
| Cache warming | `warmPortfolioCache` helper | Redis + BullMQ (`PortfolioSnapshotService`) — ensures portfolio reads reflect recent writes |

### Request Flow

```
User message
  │
  ▼
POST /api/v1/agent/chat (JWT auth)
  Body: { messages: UIMessage[], toolHistory?, model?, approvedActions? }
  │
  ▼
ToolLoopAgent created → pipeAgentUIStreamToResponse()
  │
  ├─► prepareStep()
  │   ├─ Injects current date into system prompt
  │   ├─ All 10 tools available from step 1 (activity_manage auto-resolves accountId)
  │   └─ Loads contextual SKILL.md files based on tool history
  │
  ├─► LLM reasoning → tool selection
  │   └─ Up to 10 steps (stopWhen: stepCountIs(10))
  │
  ├─► Tool execution (try/catch per tool)
  │   └─ Returns structured JSON to LLM for synthesis
  │
  ├─► Approval gate (write tools only)
  │   ├─ needsApproval() evaluates per invocation
  │   ├─ Skips: list actions, previously-approved signatures, SKIP_APPROVAL env
  │   └─ If required: stream pauses → client shows approval card → resumes on approve/deny
  │
  ├─► Post-write cache warming (activity_manage, account_manage)
  │   └─ warmPortfolioCache: clear Redis → drain stale jobs → enqueue HIGH priority → await (30s timeout)
  │
  ├─► SSE stream to client (UIMessage protocol)
  │   └─ Events: text-delta, tool-input-start, tool-input-available, tool-approval-request, tool-output-available, finish
  │
  └─► onFinish callback
      ├─ Verification pipeline (3 systems)
      ├─ Metrics recording (in-memory + Postgres)
      └─ Structured log: chat_complete
```

### Tool Design

10 tools organized into read (6) and write (4) categories:

| Tool | Type | Purpose |
| --- | --- | --- |
| `portfolio_analysis` | Read | Holdings, allocations, total value, account breakdown |
| `portfolio_performance` | Read | Returns, net performance, chart data (downsampled to ~20 points) |
| `holdings_lookup` | Read | Deep dive on a single position: dividends, fees, sectors, countries |
| `market_data` | Read | Live quotes for 1-10 symbols (FMP + CoinGecko) |
| `symbol_search` | Read | Disambiguate crypto vs stock, find correct data source |
| `transaction_history` | Read | Buys, sells, dividends, fees, deposits, withdrawals |
| `account_manage` | Write | CRUD accounts + transfers between accounts |
| `activity_manage` | Write | CRUD transactions (BUY/SELL/DIVIDEND/FEE/INTEREST/LIABILITY) |
| `watchlist_manage` | Write | Add/remove/list watchlist items |
| `tag_manage` | Write | CRUD tags for transaction organization |

**Auto-resolution**: `activity_manage` auto-resolves `accountId` when omitted on creates — it matches accounts by asset type keywords (crypto → "crypto"/"wallet" accounts, stocks → "stock"/"brokerage" accounts) with fallback to the highest-activity account. No tool gating; all 10 tools are available from step 1.
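The auto-resolution heuristic can be sketched as follows. This is an illustrative reconstruction, not the actual implementation: the `Account` shape, keyword lists, and function name are assumptions.

```typescript
// Hypothetical sketch of accountId auto-resolution by asset-type keywords,
// with fallback to the account that has the most recorded activities.
interface Account {
  id: string;
  name: string;
  activityCount: number;
}

const KEYWORDS: Record<string, string[]> = {
  crypto: ["crypto", "wallet"],
  stock: ["stock", "brokerage"],
};

function resolveAccountId(
  assetType: "crypto" | "stock",
  accounts: Account[]
): string | undefined {
  if (accounts.length === 0) return undefined;
  const keywords = KEYWORDS[assetType] ?? [];
  // Prefer an account whose name mentions an asset-type keyword
  const match = accounts.find((a) =>
    keywords.some((k) => a.name.toLowerCase().includes(k))
  );
  if (match) return match.id;
  // Fallback: highest-activity account
  return [...accounts].sort((a, b) => b.activityCount - a.activityCount)[0].id;
}
```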

**Approval gates**: All 4 write tools define `needsApproval` — a function-based gate evaluated per invocation. Read-only actions (`list`) and previously-approved action signatures are auto-skipped. `SKIP_APPROVAL=true` env var disables all gates (used in evals). Action signatures follow the pattern `tool_name:action:identifier` (e.g., `activity_manage:create:AAPL`).

| Tool | Approval Rule |
| --- | --- |
| `activity_manage` | `create` and `delete` only (not `update`), unless signature in `approvedActions` |
| `account_manage` | Everything except `list`, unless signature in `approvedActions` |
| `tag_manage` | Everything except `list` |
| `watchlist_manage` | Everything except `list` |
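A minimal sketch of the signature-based gate described above. The types and the exact rule ordering are illustrative assumptions; only the signature pattern and the rules in the table come from this document.

```typescript
// Illustrative approval gate: skip list actions, previously-approved
// signatures, and (for activity_manage) updates; gate everything else.
type Invocation = { tool: string; action: string; identifier?: string };

function actionSignature(inv: Invocation): string {
  // Pattern from the doc: tool_name:action:identifier
  return [inv.tool, inv.action, inv.identifier ?? ""].join(":");
}

function needsApproval(
  inv: Invocation,
  approvedActions: Set<string>,
  skipApproval = false
): boolean {
  if (skipApproval) return false; // SKIP_APPROVAL=true (evals)
  if (inv.action === "list") return false; // read-only actions never gated
  // activity_manage gates create/delete but not update
  if (inv.tool === "activity_manage" && inv.action === "update") return false;
  return !approvedActions.has(actionSignature(inv));
}
```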

**Cache warming**: After every write operation in `activity_manage` and `account_manage`, `warmPortfolioCache()` runs — it clears stale Redis portfolio snapshots, drains in-flight BullMQ jobs, enqueues fresh computation at HIGH priority, and awaits completion with a 30s timeout. This ensures subsequent read tools return up-to-date portfolio data. Injected via `PortfolioSnapshotService`, `RedisCacheService`, and `UserService`.
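The warming sequence can be sketched like this. The `SnapshotQueue` interface is a stand-in for the real `RedisCacheService`/BullMQ wiring, which this document does not show; only the clear → drain → enqueue → await-with-timeout order comes from the text.

```typescript
// Sketch of the post-write cache-warming sequence with a 30s timeout.
interface SnapshotQueue {
  clearCachedSnapshot(userId: string): Promise<void>;
  drainStaleJobs(userId: string): Promise<void>;
  // Resolves once the fresh snapshot has been computed
  enqueueHighPriority(userId: string): Promise<void>;
}

async function warmPortfolioCache(
  queue: SnapshotQueue,
  userId: string,
  timeoutMs = 30_000
): Promise<boolean> {
  await queue.clearCachedSnapshot(userId); // drop stale Redis snapshot
  await queue.drainStaleJobs(userId); // cancel in-flight recomputations
  const timeout = new Promise<false>((resolve) =>
    setTimeout(() => resolve(false), timeoutMs)
  );
  const done = queue.enqueueHighPriority(userId).then(() => true as const);
  return Promise.race([done, timeout]); // true = warmed, false = timed out
}
```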

**Skill injection**: Contextual SKILL.md documents (transaction workflow, market data patterns) are injected into the system prompt based on which tools have been used, providing the LLM with domain-specific guidance without bloating every request.
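The selection logic amounts to a lookup from tool names to skill files. The mapping and file paths below are hypothetical examples; the document only states that skills are chosen from tool history.

```typescript
// Illustrative tool-history → SKILL.md selection (paths are assumptions).
const SKILLS: Record<string, string> = {
  activity_manage: "skills/transactions/SKILL.md",
  market_data: "skills/market-data/SKILL.md",
  symbol_search: "skills/market-data/SKILL.md",
};

function selectSkills(toolHistory: string[]): string[] {
  // Deduplicate so each skill file is injected at most once per request
  const files = toolHistory
    .map((tool) => SKILLS[tool])
    .filter((f): f is string => Boolean(f));
  return [...new Set(files)];
}
```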

## Verification Strategy

Three deterministic systems run on every response in the `onFinish` callback:

| System | What It Checks | Weight in Composite |
| --- | --- | --- |
| **Output Validation** | Min/max length, numeric data present when tools called, disclaimer on forward-looking language, write claims backed by write tool calls | 0.3 |
| **Hallucination Detection** | Ticker symbols in response exist in tool data, dollar amounts match within 5% or $1, holding claims reference actual tool results | 0.3 |
| **Confidence Score** | Composite 0-1: tool success rate (0.3), step efficiency (0.1), output validity (0.3), hallucination score (0.3) | — |

**Why deterministic**: No additional LLM calls means zero latency overhead and zero cost. Verification cross-references response text against actual tool result data, catching the most common financial hallucination patterns (phantom tickers, fabricated dollar amounts) with high precision.
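The core checks reduce to simple comparisons against tool output. The tolerances and weights below are taken from the table above; the function names and shapes are illustrative.

```typescript
// Dollar amounts must match tool data within $1 or 5%.
function amountMatches(claimed: number, actual: number): boolean {
  const diff = Math.abs(claimed - actual);
  return diff <= 1 || diff <= Math.abs(actual) * 0.05;
}

// Every ticker mentioned in the response must appear in tool results.
function tickersGrounded(
  responseTickers: string[],
  toolTickers: Set<string>
): boolean {
  return responseTickers.every((t) => toolTickers.has(t));
}

// Composite 0-1 confidence with the documented weights.
function confidenceScore(
  toolSuccessRate: number,
  stepEfficiency: number,
  outputValidity: number,
  hallucinationScore: number
): number {
  return (
    0.3 * toolSuccessRate +
    0.1 * stepEfficiency +
    0.3 * outputValidity +
    0.3 * hallucinationScore
  );
}
```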

**Persistence**: Results stored on `AgentChatLog` (`verificationScore`, `verificationResult`) and exposed via `GET /agent/verification/:requestId`.

## Eval Results

**Framework**: evalite — dedicated eval runner with scoring, UI, and CI integration.

| Suite | Cases | Pass Rate | Scorers |
| --- | --- | --- | --- |
| Golden Set | 19 | **100%** | GoldenCheck (deterministic binary) |
| Scenarios | 67 | **91%** | ToolCallAccuracy (F1), HasResponse, ResponseQuality (LLM-judged) |
| **Total** | **86** | **~93%** | |

**Category breakdown**: single-tool (10), multi-tool (10), ambiguous (6), account management (8), activity management (10), watchlist (4), tag (4), multi-step write (4), adversarial write (4), edge cases (7), golden routing (7), golden structural (4), golden behavioral (2), golden guardrails (4).

**CI gate**: GitHub Actions runs the golden set on every push to main, with a 100% pass threshold, against the deployed Render instance.

**Remaining 9% scenario gap**: The agent calls extra tools on ambiguous queries (thorough, not wrong). Write-operation approval is now handled by native `needsApproval` gates (bypassed in evals via `SKIP_APPROVAL=true`).

## Observability Setup

| Capability | Implementation |
| --- | --- |
| **Trace logging** | Structured JSON: `chat_start`, `step_finish` (tool names, token usage), `verification`, `chat_complete` |
| **Latency tracking** | Total `latencyMs` per request, persisted to Postgres |
| **Token usage** | prompt/completion/total per request, averaged in metrics summary |
| **Cost tracking** | Model-specific pricing (Sonnet $3/$15, Haiku $1/$5, Opus $15/$75 per MTok), `cost.totalUsd` and `cost.avgPerChatUsd` in metrics |
| **Error tracking** | Errors recorded with sanitized messages (DB URLs, API keys redacted) |
| **User feedback** | Thumbs up/down per message, unique per (requestId, userId), satisfaction rate in metrics |
| **Metrics endpoint** | `GET /agent/metrics?since=1h` — summary, feedback, recent chats |
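The error-message sanitization row above can be sketched as a couple of redaction passes. The exact patterns are assumptions about what "DB URLs, API keys redacted" covers, not the project's actual rules.

```typescript
// Hypothetical sanitizer: redact connection strings and key-like tokens
// before an error message is logged or persisted.
function sanitizeError(message: string): string {
  return message
    .replace(/postgres(ql)?:\/\/\S+/gi, "[REDACTED_DB_URL]")
    .replace(/redis:\/\/\S+/gi, "[REDACTED_DB_URL]")
    .replace(/\b(sk|key|token)[-_][A-Za-z0-9_-]{8,}\b/gi, "[REDACTED_KEY]");
}
```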

**Key insight**: Avg $0.015/chat with Sonnet 4.6, ~5.9s latency, ~1.9 steps per query. Portfolio analysis and market data are the most-used tools.

## Open Source Contribution

**Eval dataset**: 86-case structured JSON dataset (`evals/dataset.json`) covering tool routing, multi-tool chaining, write operations, adversarial inputs, and edge cases for a financial AI agent. Each case includes input query, expected tools, expected behavioral assertions, and category labels. Published in the repository for reuse by other Ghostfolio agent implementations.

# AI Cost Analysis

## Development & Testing Costs

### Observed Usage (Local + Production)

| Metric | Value |
| --- | --- |
| Total chats logged | 89 |
| Total prompt tokens | 367,389 |
| Total completion tokens | 17,771 |
| Total tokens | 385,160 |
| Avg prompt tokens/chat | 4,128 |
| Avg completion tokens/chat | 200 |
| Avg total tokens/chat | 4,328 |
| Error rate | 9.0% (8/89) |
| Development period | Feb 26 – Feb 28, 2026 |

### Development API Costs

Primary model: Claude Sonnet 4.6 ($3/MTok input, $15/MTok output)

| Component | Tokens | Cost |
| --- | --- | --- |
| Prompt tokens | 367,389 | $1.10 |
| Completion tokens | 17,771 | $0.27 |
| **Total dev/test spend** | 385,160 | **$1.37** |

Additional costs:

- Eval scoring (Haiku 4.5): ~86 cases × ~500 tokens/judgment = ~43K tokens ≈ $0.05
- Multiple eval runs during development: ~10 runs × $0.05 = $0.50
- **Estimated total dev spend: ~$2.00**

### Per-Chat Cost Breakdown

| Model | Avg Input | Avg Output | Cost/Chat |
| --- | --- | --- | --- |
| Sonnet 4.6 (default) | 4,128 tok | 200 tok | $0.0154 |
| Haiku 4.5 | 4,128 tok | 200 tok | $0.0051 |
| Opus 4.6 | 4,128 tok | 200 tok | $0.0769 |
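The per-chat figures follow directly from the per-MTok prices listed in the observability section. A worked sketch (model keys are illustrative labels, not API identifiers):

```typescript
// Per-chat cost = (input tokens × input price + output tokens × output price) / 1M.
// Prices in USD per million tokens, as listed in this document.
const PRICING: Record<string, { input: number; output: number }> = {
  "sonnet-4.6": { input: 3, output: 15 },
  "haiku-4.5": { input: 1, output: 5 },
  "opus-4.6": { input: 15, output: 75 },
};

function costPerChat(
  model: string,
  inputTokens: number,
  outputTokens: number
): number {
  const p = PRICING[model];
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}

// Blended cost under a 90% Sonnet / 8% Haiku / 2% Opus mix.
function blendedCost(inputTokens: number, outputTokens: number): number {
  return (
    0.9 * costPerChat("sonnet-4.6", inputTokens, outputTokens) +
    0.08 * costPerChat("haiku-4.5", inputTokens, outputTokens) +
    0.02 * costPerChat("opus-4.6", inputTokens, outputTokens)
  );
}
```

With the observed 4,128 input / 200 output tokens, `costPerChat` reproduces the table values ($0.0154, $0.0051, $0.0769), and the blend comes out near $0.016/chat.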

## Production Cost Projections

### Assumptions

- **Queries per user per day**: 5 (portfolio check, performance, transactions, market data, misc)
- **Avg tokens per query**: 4,328 (4,128 input + 200 output) — observed from production data
- **Tool call frequency**: 1.9 tool calls/chat average (observed)
- **Verification overhead**: Negligible (deterministic, no extra LLM calls)
- **Cache warming overhead**: `warmPortfolioCache` runs after activity/account writes — a Redis + BullMQ job costing zero LLM tokens but adding up to 30s of latency per write operation
- **Model mix**: 90% Sonnet 4.6, 8% Haiku 4.5, 2% Opus 4.6
- **Blended cost per chat**: $0.0154 × 0.90 + $0.0051 × 0.08 + $0.0769 × 0.02 ≈ $0.0158

### Monthly Projections

| Scale | Users | Chats/Month | Token Volume | Monthly Cost |
| --- | --- | --- | --- | --- |
| 100 users | 100 | 15,000 | 64.9M tokens | **$235** |
| 1,000 users | 1,000 | 150,000 | 649M tokens | **$2,355** |
| 10,000 users | 10,000 | 1,500,000 | 6.49B tokens | **$23,550** |
| 100,000 users | 100,000 | 15,000,000 | 64.9B tokens | **$235,500** |

### Cost Optimization Levers

| Strategy | Estimated Savings | Trade-off |
| --- | --- | --- |
| Default to Haiku 4.5 for simple queries | 60-70% | Slightly less nuanced responses |
| Prompt caching (repeated system prompt) | 30-40% | Requires API support |
| Response caching for market data | 10-20% | Staleness window |
| Reduce system prompt size | 15-25% | Less detailed agent behavior |

### Infrastructure Costs (Render)

| Service | Plan | Monthly Cost |
| --- | --- | --- |
| Web | Standard | $25 |
| Redis | Standard | $10 |
| Postgres | Basic 1GB | $7 |
| **Total infra** | | **$42/month** |

### Total Cost of Ownership

| Scale | AI Cost | Infra | Total/Month | Cost/User/Month |
| --- | --- | --- | --- | --- |
| 100 users | $235 | $42 | $277 | $2.77 |
| 1,000 users | $2,355 | $42 | $2,397 | $2.40 |
| 10,000 users | $23,550 | $100\* | $23,650 | $2.37 |
| 100,000 users | $235,500 | $500\* | $236,000 | $2.36 |

\*Infrastructure scales with traffic — estimated for higher tiers.

### Real-Time Cost Tracking

Cost is tracked live via `GET /api/v1/agent/metrics`:

```json
{
  "cost": {
    "totalUsd": 0.0226,
    "avgPerChatUsd": 0.0226
  }
}
```

Computed per request using model-specific pricing (Sonnet/Haiku/Opus rates) applied to actual prompt and completion token counts.

# Pre-Search Document

Completed before writing agent code. Decisions informed all subsequent architecture choices.

## Phase 1: Define Your Constraints

### 1. Domain Selection

- **Domain**: Finance (Ghostfolio — open source portfolio tracker)
- **Use cases**: Portfolio analysis, performance tracking, market data lookups, transaction management, account CRUD, watchlist management, tagging
- **Verification requirements**: Hallucination detection (dollar amounts, ticker symbols), output validation, confidence scoring. Financial data must match tool results exactly.
- **Data sources**: Ghostfolio's existing Prisma/Postgres models, Financial Modeling Prep API (stocks/ETFs), CoinGecko API (crypto)

### 2. Scale & Performance

- **Expected query volume**: 100-1,000 chats/day during demo period
- **Acceptable latency**: <5s single-tool, <15s multi-step
- **Concurrent users**: ~10-50 simultaneous (Render Standard plan)
- **Cost constraints**: <$50/month for demo; Sonnet 4.6 at ~$0.015/chat is viable

### 3. Reliability Requirements

- **Cost of wrong answer**: Medium — incorrect portfolio values or transaction amounts erode trust. Not life-threatening (unlike healthcare), but financial data must be accurate.
- **Non-negotiable verification**: Dollar amounts in response must match tool data (within 5% or $1). Ticker symbols must reference actual holdings.
- **Human-in-the-loop**: Write operations (buy/sell/delete) require confirmation before execution. Agent asks clarifying questions rather than assuming.
- **Audit needs**: All chats logged to AgentChatLog with requestId, tokens, latency, verification scores.

### 4. Team & Skill Constraints

- **Agent frameworks**: Familiar with Vercel AI SDK, LangChain. Chose Vercel AI SDK for native TypeScript + NestJS integration.
- **Domain experience**: Familiar with portfolio management concepts. Ghostfolio codebase already provides all financial data models.
- **Eval/testing**: Experience with Jest/Vitest. Chose evalite for a dedicated eval framework with UI and scoring.

## Phase 2: Architecture Discovery

### 5. Agent Framework Selection

- **Choice**: Vercel AI SDK v6 (`ToolLoopAgent`)
- **Why not LangChain**: Ghostfolio is NestJS/TypeScript — Vercel AI SDK is native TS, lighter weight, no Python dependency. LangChain's JS version is less mature.
- **Why not custom**: Vercel AI SDK provides streaming, tool dispatch, step management out of the box. No need to reinvent.
- **Architecture**: Single agent with tool gating via `prepareStep`. Not multi-agent — single domain, single user context.
- **State management**: Conversation history passed per turn. Tool gating state tracked via `toolHistory` array across turns.

### 6. LLM Selection

- **Choice**: Claude Sonnet 4.6 (default), with Haiku 4.5 and Opus 4.6 selectable per request
- **Why Claude**: Strong function calling, excellent at structured financial output, good instruction following for system prompt rules
- **Context window**: Sonnet 4.6 has 200K context — more than sufficient for portfolio data + conversation history
- **Cost per query**: $0.015 avg with Sonnet — acceptable for demo and small-scale production

### 7. Tool Design

- **Tools built**: 10 total (6 read + 4 write)
  - Read: portfolio_analysis, portfolio_performance, holdings_lookup, market_data, symbol_search, transaction_history
  - Write: account_manage, activity_manage, watchlist_manage, tag_manage
- **External APIs**: Financial Modeling Prep (stocks/ETFs), CoinGecko (crypto) — both via Ghostfolio's existing data provider layer
- **Mock vs real**: Real data from seeded demo portfolio (AAPL, MSFT, VOO, GOOGL, BTC, TSLA, NVDA). No mocks in dev or eval.
- **Error handling**: Every tool wrapped in try/catch, returns structured error messages. Agent gracefully surfaces errors to user.

### 8. Observability Strategy

- **Choice**: Self-hosted structured JSON logging + Postgres persistence + in-memory metrics
- **Why not LangSmith/Braintrust**: Adds external dependency and cost. Structured logs + Postgres give full control and queryability. Metrics endpoint provides a real-time dashboard.
- **Key metrics**: Latency, token usage (prompt/completion/total), cost per chat, tool usage frequency, error rate, verification scores
- **Real-time monitoring**: `GET /agent/metrics?since=1h` endpoint with summary + recent chats
- **Cost tracking**: Token counts × model-specific pricing computed in metrics summary

### 9. Eval Approach

- **Framework**: evalite (dedicated eval runner with UI, separate from unit tests)
- **Correctness**: GoldenCheck scorer — deterministic binary pass/fail on tool routing, output patterns, forbidden content
- **Quality**: ResponseQuality scorer — LLM-judged (Haiku 4.5) scoring relevance, data-groundedness, conciseness, formatting
- **Tool accuracy**: ToolCallAccuracy scorer — F1-style partial credit for expected vs actual tool calls
- **Ground truth**: Real API responses from seeded demo portfolio
- **CI integration**: GitHub Actions runs golden evals on push, threshold 100%
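An F1-style tool-call scorer of the kind described above can be sketched as follows; the actual evalite scorer implementation may differ.

```typescript
// F1 partial credit for expected vs actual tool calls:
// precision = hits / called, recall = hits / expected, F1 = harmonic mean.
function toolCallF1(expected: string[], actual: string[]): number {
  if (expected.length === 0 && actual.length === 0) return 1;
  const exp = new Set(expected);
  const truePositives = actual.filter((t) => exp.has(t)).length;
  if (truePositives === 0) return 0;
  const precision = truePositives / actual.length;
  const recall = truePositives / expected.length;
  return (2 * precision * recall) / (precision + recall);
}
```

Calling an extra tool lowers precision but not recall, which matches the observation elsewhere in this document that over-fetching on ambiguous queries costs score without being outright wrong.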

### 10. Verification Design

- **Claims verified**: Dollar amounts, ticker symbols, holding references
- **Fact-checking sources**: Tool result data (the agent's own tool calls serve as ground truth)
- **Confidence thresholds**: Composite 0-1 score (tool success rate 0.3, step efficiency 0.1, output validity 0.3, hallucination score 0.3)
- **Escalation triggers**: Low confidence score logged but no automated escalation (deterministic verification only)

## Phase 3: Post-Stack Refinement

### 11. Failure Mode Analysis

- **Tool failures**: Try/catch per tool, error surfaced to user with a suggestion to retry. Agent continues the conversation.
- **Ambiguous queries**: Agent uses multiple tools when unsure (e.g., "How am I doing?" calls both performance and analysis). Over-fetching preferred over under-fetching.
- **Rate limiting**: No explicit rate limiting implemented. Render Standard plan handles moderate traffic. Would add Redis-based rate limiting at scale.
- **Graceful degradation**: If the market data API fails, the agent acknowledges it and proceeds with available data.

### 12. Security Considerations

- **Prompt injection**: System prompt instructs the agent to stay in role. Eval suite includes adversarial inputs (jailbreak, developer mode, system prompt extraction).
- **Data leakage**: Agent scoped to the authenticated user's data only. JWT auth on all endpoints.
- **API key management**: All keys in environment variables, not committed. Error messages sanitized (DB URLs, API keys redacted).
- **Audit logging**: Every chat logged with requestId, userId, tools used, token counts, verification results.

### 13. Testing Strategy

- **Unit tests**: 16 tests for prepareStep tool gating logic
- **Eval tests**: 86 cases across golden (19) and scenarios (67)
- **Adversarial testing**: 11 cases covering prompt injection, jailbreak, resource abuse, system prompt extraction
- **Regression**: CI gate enforces 100% golden pass rate on every push

### 14. Open Source Planning

- **Release**: Eval dataset (86 cases) published as `evals/dataset.json` — structured JSON with input, expected tools, expected behavior, categories
- **License**: Follows Ghostfolio's existing AGPL-3.0 license
- **Documentation**: Agent README (280 lines), architecture doc, cost analysis, this pre-search document

### 15. Deployment & Operations

- **Hosting**: Render (Docker) — web + Redis + Postgres, Oregon region
- **CI/CD**: GitHub Actions for eval gate. Render auto-deploys from main branch.
- **Monitoring**: Structured JSON logs + `/agent/metrics` endpoint
- **Rollback**: Render provides instant rollback to the previous deploy

### 16. Iteration Planning

- **User feedback**: Thumbs up/down per message, stored in the AgentFeedback table, surfaced in metrics
- **Eval-driven improvement**: Failures in the eval suite drive agent refinement (e.g., stale test cases updated when write tools were added)
- **Future work**: Automated feedback → eval pipeline, LLM-judged hallucination detection, useChat migration if UI complexity grows