---
description: AI Agents architecture patterns, ReAct loops, tool design, evaluation, and production guardrails reference
globs:
alwaysApply: true
---
# AI Agents — Architecture & Production Reference

## Core Terminology

- **LLM Call**: Single request-response; no tools, no iteration
- **LLM + Tools**: LLM can call functions, but still single-turn
- **Agentic System**: Multi-turn loop with a Reasoning → Action → Observation cycle

We are building an **Agentic System**, not just an LLM wrapper.
## The ReAct Loop

```
THOUGHT ("What do I need?") → ACTION (Call tool) → OBSERVATION (Process result) → REPEAT or ANSWER
```
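The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: `model` is a stand-in for an LLM call, and the tuple protocol it returns is an assumption made purely for the example.

```python
def react_loop(model, tools, task, max_steps=10):
    """Minimal ReAct loop. `model(history)` is a stand-in for an LLM call:
    it returns either ("answer", text) or ("tool", name, kwargs)."""
    history = [("task", task)]
    for _ in range(max_steps):
        step = model(history)                 # THOUGHT + chosen ACTION
        if step[0] == "answer":
            return step[1]                    # final ANSWER
        _, name, kwargs = step                # ACTION: call the named tool
        observation = tools[name](**kwargs)
        history.append(("observation", observation))  # OBSERVATION → repeat

    raise RuntimeError("max_steps reached without an answer")


# Usage with a stubbed model: call a tool once, then answer.
def stub_model(history):
    if any(kind == "observation" for kind, *_ in history):
        return ("answer", f"Result: {history[-1][1]}")
    return ("tool", "add", {"a": 2, "b": 3})

result = react_loop(stub_model, {"add": lambda a, b: a + b}, "add 2 and 3")
```

The same structure reappears later in the `BaseAgent` pattern, with guardrails added around each iteration.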
## When to Use Agentic Patterns

**Use agents when:**

- Unknown information needs (you can't predict the data sources)
- Multi-step reasoning with dependencies
- Complex analysis requiring iteration
- Dynamic decision trees

**Do NOT use agents when:**

- Deterministic workflows (use code / if-else)
- Batch processing (use pipelines)
- Simple classification (use a single LLM call)
- Speed is critical (<1s needed)

**Key insight**: Start simple. Most problems don't need agents. Match complexity to uncertainty.
## Complexity Spectrum (Cost & Latency)

| Pattern | Cost | Latency |
|---|---|---|
| Single LLM | 1x | <1s |
| LLM + Tools | 2-3x | 1-3s |
| ReAct Agent | 5-10x | 5-15s |
| Planning | 10-20x | 15-30s |
| Multi-Agent | 20-50x | 30s+ |
## Tool Design Principles

Tools are the building blocks. Each tool must be:

- **Atomic**: One clear purpose; does ONE thing
- **Idempotent**: Safe to retry
- **Well-documented**: The LLM reads the description to decide when to use it
- **Error-handled**: Returns structured errors (ToolResult with status, data, message)
- **Verified**: Check results before returning

**Anti-patterns**: Tools that are too broad ("manage_patient"), missing error states, undocumented behavior, hidden side effects, unverified raw API data.
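The ToolResult shape and the tool principles above can be sketched together. The field names and the example tool are illustrative assumptions, not a fixed API:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class ToolResult:
    """Structured result every tool returns (hypothetical shape)."""
    status: str          # "ok" or "error"
    data: Any = None     # payload on success
    message: str = ""    # LLM-readable error or note

def get_patient_allergies(patient_id: str, records: dict) -> ToolResult:
    """Look up a patient's allergy list.

    Atomic: does ONE thing. Idempotent: retrying returns the same answer.
    The docstring doubles as the description the LLM reads."""
    record = records.get(patient_id)
    if record is None:
        # Structured error instead of a raised exception or a raw None
        return ToolResult(status="error", message=f"unknown patient {patient_id}")
    return ToolResult(status="ok", data=record.get("allergies", []))
```

Contrast with the "manage_patient" anti-pattern: one broad tool forces the LLM to guess which of many behaviors it will get, while small atomic tools make tool selection legible.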
## Production Guardrails (Non-Negotiable)

All four must be implemented:

1. **MAX_ITERATIONS (10-15)**: Prevents infinite loops and runaway costs
2. **TIMEOUT (30-45s)**: User-experience limit / API gateway timeout
3. **COST_LIMIT ($1/query)**: Prevents bill explosions; alert on anomalies
4. **CIRCUIT_BREAKER**: Same action 3x → abort and log for debugging

Without these, you get $10K bills from runaway loops, 5-minute timeouts, and hammered downstream services.
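One way to wire all four checks into the loop is a single gate called every iteration. The function name and thresholds below are illustrative; the limits mirror the ranges stated above:

```python
import time
from collections import Counter

class GuardrailTripped(Exception):
    """Raised when any production guardrail is exceeded."""

def check_guardrails(iteration, start_time, cost_usd, action_counts,
                     max_iterations=10, timeout_s=30.0, max_cost_usd=1.0,
                     max_repeats=3):
    """Call once per loop iteration; raises if any of the four limits trips."""
    if iteration >= max_iterations:
        raise GuardrailTripped("MAX_ITERATIONS reached")
    if time.monotonic() - start_time > timeout_s:
        raise GuardrailTripped("TIMEOUT exceeded")
    if cost_usd > max_cost_usd:
        raise GuardrailTripped("COST_LIMIT exceeded")
    # Circuit breaker: the same tool call repeated 3x signals a stuck loop
    if action_counts and max(action_counts.values()) >= max_repeats:
        raise GuardrailTripped("CIRCUIT_BREAKER: repeated action")
```

`action_counts` would be a `Counter` keyed by (tool name, serialized arguments), incremented after each tool call, so identical repeated calls are what trips the breaker.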
## Verification Layer (3+ Required)

| Type | Description |
|---|---|
| Fact Checking | Cross-reference claims against authoritative sources |
| Hallucination Detection | Flag unsupported claims; require source attribution |
| Confidence Scoring | Quantify certainty (0-1); surface low-confidence results |
| Domain Constraints | Enforce business rules (dosage limits, trade limits) |
| Human-in-the-Loop | Escalation triggers for high-risk decisions |

Implementation pattern:

```
ToolResult(status, data, verification: VerificationResult(passed, confidence, warnings, errors, sources))
```
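A sketch of the `VerificationResult` shape above, plus one domain-constraint check as an example verifier. Field names follow the pattern shown; the dosage-limit logic is an illustrative assumption:

```python
from dataclasses import dataclass, field

@dataclass
class VerificationResult:
    passed: bool
    confidence: float                        # 0.0-1.0
    warnings: list = field(default_factory=list)
    errors: list = field(default_factory=list)
    sources: list = field(default_factory=list)

def verify_dosage(drug: str, dose_mg: float, max_doses: dict) -> VerificationResult:
    """Domain-constraint verifier: enforce a per-drug dosage ceiling."""
    limit = max_doses.get(drug)
    if limit is None:
        # Unknown drug → fail closed with zero confidence
        return VerificationResult(passed=False, confidence=0.0,
                                  errors=[f"no dosage limit on file for {drug}"])
    if dose_mg > limit:
        return VerificationResult(passed=False, confidence=1.0,
                                  errors=[f"{dose_mg}mg exceeds limit of {limit}mg"])
    return VerificationResult(passed=True, confidence=1.0,
                              sources=["formulary table"])
```

Note the fail-closed default: an unverifiable claim fails verification rather than passing silently, which is what surfaces it for human review.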
## Evaluation Framework (50+ Test Cases Required)

**What to measure**: Correctness, tool selection, tool execution, safety, consistency, edge cases, latency, cost.

**Test case breakdown**:

- 20+ happy-path scenarios with expected outcomes
- 10+ edge cases (missing data, boundary conditions)
- 10+ adversarial inputs (bypass attempts)
- 10+ multi-step reasoning scenarios

Each test case includes: input query, expected tool calls, expected output, and pass/fail criteria.

**Targets**: Pass rate >80% (good), >90% (excellent). Run evals daily.
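A test case and a pass-rate runner might look like the sketch below. The case fields mirror the list above; the agent interface (a callable returning an answer plus its tool calls) is an assumption for illustration:

```python
def run_evals(agent, cases):
    """Score `agent` against eval cases and return the pass rate (0.0-1.0).

    `agent(query)` returns (answer_text, tool_calls). A case passes only
    if the expected output appears AND the right tools were called."""
    passed = 0
    for case in cases:
        answer, tool_calls = agent(case["input"])
        ok = (case["expected_output"] in answer
              and tool_calls == case["expected_tool_calls"])
        passed += ok
    return passed / len(cases)   # target: >0.80 good, >0.90 excellent

# One happy-path case (illustrative values, not real dosage advice)
cases = [{
    "input": "What is the max daily dose of aspirin?",
    "expected_tool_calls": ["lookup_drug"],
    "expected_output": "4000",
}]

def stub_agent(query):
    return "Max daily dose is 4000 mg.", ["lookup_drug"]
```

Checking tool calls separately from the final answer is what lets the eval distinguish "right answer by luck" from "right answer via the right tools".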
## Performance Targets

| Metric | Target |
|---|---|
| Single-tool latency | <5 seconds |
| Multi-step latency (3+ tools) | <15 seconds |
| Tool success rate | >95% |
| Eval pass rate | >80% |
| Hallucination rate | <5% |
| Verification accuracy | >90% |
## Observability (Required)

- **Trace Logging**: Full trace per request (input → reasoning → tool calls → output)
- **Latency Tracking**: Time breakdown for LLM calls, tool execution, total response
- **Error Tracking**: Capture and categorize failures with context
- **Token Usage**: Input/output tokens per request, cost tracking
- **Eval Results**: Historical scores, regression detection
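In lieu of a hosted tracing tool, the per-request trace record can be as simple as one JSON line per request. The record fields below are illustrative, chosen to cover the bullets above:

```python
import json
import time

def log_trace(trace_file, *, request_id, input_text, steps,
              input_tokens, output_tokens, cost_usd, started_at):
    """Append one structured trace record per request (JSON Lines format)."""
    record = {
        "request_id": request_id,
        "input": input_text,
        "steps": steps,          # reasoning + tool calls, in order
        "tokens": {"input": input_tokens, "output": output_tokens},
        "cost_usd": cost_usd,
        "latency_s": round(time.monotonic() - started_at, 3),
    }
    trace_file.write(json.dumps(record) + "\n")
```

JSON Lines keeps traces greppable during debugging and trivially loadable into a dataframe for latency/cost breakdowns and regression detection.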
## BaseAgent Implementation Pattern

```python
class BaseAgent:
    def __init__(self, model, max_iterations=10, timeout_seconds=30.0, max_cost_usd=1.0):
        # Register tools, set guardrails
        self.model = model
        self.tools = {}
        self.max_iterations = max_iterations
        self.timeout_seconds = timeout_seconds
        self.max_cost_usd = max_cost_usd

    def run(self, task):
        # ReAct loop:
        # 1. LLM call with tool schemas
        # 2. If no tool_calls → return final answer
        # 3. Execute tool calls, append results
        # 4. Check guardrails (iterations, timeout, cost, circuit breaker)
        # 5. Repeat
        raise NotImplementedError
```
## Cost Analysis (Required for Submission)

Track during development:

- LLM API costs, total tokens (input/output), number of API calls, observability tool costs

Project to production at 100 / 1,000 / 10,000 / 100,000 users/month.
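A simple linear projection across the four tiers; the per-query cost and queries-per-user figures below are placeholder assumptions you would replace with numbers measured during development:

```python
def project_monthly_cost(cost_per_query_usd, queries_per_user_month, users):
    """Linear projection: per-query cost × per-user usage × user count."""
    return cost_per_query_usd * queries_per_user_month * users

tiers = [100, 1_000, 10_000, 100_000]
# Placeholder assumptions: $0.05/query, 20 queries/user/month
projections = {u: project_monthly_cost(0.05, 20, u) for u in tiers}
```

A linear model is a floor, not a forecast: caching can pull real costs below it, while retries and guardrail-tripped runs push them above, which is why anomaly alerts on COST_LIMIT matter.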
## Build Strategy (Priority Order)

1. Basic agent — single tool call working end-to-end
2. Tool expansion — add remaining tools, verify each works
3. Multi-step reasoning — agent chains tools appropriately
4. Observability — integrate tracing
5. Eval framework — build test suite, measure baseline
6. Verification layer — domain-specific checks
7. Iterate on evals — improve agent based on failures
8. Open source prep — package and document
## Submission Deliverables

- GitHub repo with setup guide, architecture overview, deployed link
- Demo video (3-5 min)
- Pre-Search document
- Agent architecture doc (1-2 pages)
- AI cost analysis (dev spend + projections)
- Eval dataset (50+ test cases with results)
- Open source contribution (package, PR, or public dataset)
- Deployed application (publicly accessible)
- Social post (X or LinkedIn, tag @GauntletAI)
|