---
description: AI Agents architecture patterns, ReAct loops, tool design, evaluation, and production guardrails reference
globs:
alwaysApply: true
---

# AI Agents — Architecture & Production Reference

## Core Terminology

- **LLM Call**: Single request-response, no tools, no iteration
- **LLM + Tools**: LLM can call functions, still single-turn
- **Agentic System**: Multi-turn loop with a Reasoning → Action → Observation cycle

We are building an **Agentic System**, not just an LLM wrapper.

## The ReAct Loop

```
THOUGHT ("What do I need?")
  → ACTION (Call tool)
  → OBSERVATION (Process result)
  → REPEAT or ANSWER
```

## When to Use Agentic Patterns

**Use agents when:**

- Unknown information needs (you can't predict data sources in advance)
- Multi-step reasoning with dependencies
- Complex analysis requiring iteration
- Dynamic decision trees

**Do NOT use agents when:**

- Deterministic workflows (use code / if-else)
- Batch processing (use pipelines)
- Simple classification (use a single LLM call)
- Speed is critical (<1s needed)

**Key insight**: Start simple. Most problems don't need agents. Match complexity to uncertainty.

## Complexity Spectrum (Cost & Latency)

| Pattern | Cost | Latency |
|---|---|---|
| Single LLM | 1x | <1s |
| LLM + Tools | 2-3x | 1-3s |
| ReAct Agent | 5-10x | 5-15s |
| Planning | 10-20x | 15-30s |
| Multi-Agent | 20-50x | 30s+ |

## Tool Design Principles

Tools are the building blocks. Each tool must be:

- **Atomic**: One clear purpose, does ONE thing
- **Idempotent**: Safe to retry
- **Well-documented**: The LLM reads the description to decide usage
- **Error-handled**: Returns structured errors (`ToolResult` with status, data, message)
- **Verified**: Check results before returning

**Anti-patterns**: Too-broad tools ("manage_patient"), missing error states, undocumented behavior, side effects, unverified raw API data.

## Production Guardrails (Non-Negotiable)

All four must be implemented:

1. **MAX_ITERATIONS (10-15)**: Prevents infinite loops and runaway costs
2. **TIMEOUT (30-45s)**: User-experience limit / API gateway timeout
3. **COST_LIMIT ($1/query)**: Prevents bill explosions; alert on anomalies
4. **CIRCUIT_BREAKER**: Same action 3x → abort and log for debugging

Without these: $10K bills from loops, 5-minute timeouts, hammered downstream services.

## Verification Layer (3+ Required)

| Type | Description |
|---|---|
| Fact Checking | Cross-reference claims against authoritative sources |
| Hallucination Detection | Flag unsupported claims, require source attribution |
| Confidence Scoring | Quantify certainty (0-1), surface low-confidence answers |
| Domain Constraints | Enforce business rules (dosage limits, trade limits) |
| Human-in-the-Loop | Escalation triggers for high-risk decisions |

Implementation pattern:

```
ToolResult(status, data, verification: VerificationResult(passed, confidence, warnings, errors, sources))
```

## Evaluation Framework (50+ Test Cases Required)

**What to measure**: Correctness, tool selection, tool execution, safety, consistency, edge cases, latency, cost.

**Test case breakdown**:

- 20+ happy-path scenarios with expected outcomes
- 10+ edge cases (missing data, boundary conditions)
- 10+ adversarial inputs (bypass attempts)
- 10+ multi-step reasoning scenarios

Each test case includes: input query, expected tool calls, expected output, pass/fail criteria.

**Targets**: Pass rate >80% (good), >90% (excellent). Run evals daily.
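A minimal sketch of one such test case and its pass/fail check. The names here (`EvalCase`, `grade`, the example tool `get_lab_results`) are illustrative, not a prescribed API:

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One eval test case: input query, expected tool calls, pass/fail criteria."""
    query: str
    expected_tools: list                               # tool names, in expected order
    must_contain: list = field(default_factory=list)   # substrings required in the answer

def grade(case: EvalCase, actual_tools: list, answer: str) -> bool:
    """Pass only if the right tools were called AND the answer meets the criteria."""
    tools_ok = actual_tools == case.expected_tools
    content_ok = all(s.lower() in answer.lower() for s in case.must_contain)
    return tools_ok and content_ok

# Example: a happy-path case
case = EvalCase(
    query="What is patient 42's latest HbA1c?",
    expected_tools=["get_lab_results"],
    must_contain=["HbA1c"],
)
print(grade(case, ["get_lab_results"], "Latest HbA1c: 6.1%"))  # True
print(grade(case, ["search_web"], "Latest HbA1c: 6.1%"))       # False (wrong tool)
```

Grading on both tool selection and answer content catches the common failure mode where the agent produces a plausible answer without ever consulting the right source.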
## Performance Targets

| Metric | Target |
|---|---|
| Single-tool latency | <5 seconds |
| Multi-step latency (3+ tools) | <15 seconds |
| Tool success rate | >95% |
| Eval pass rate | >80% |
| Hallucination rate | <5% |
| Verification accuracy | >90% |

## Observability (Required)

- **Trace Logging**: Full trace per request (input → reasoning → tool calls → output)
- **Latency Tracking**: Time breakdown for LLM calls, tool execution, total response
- **Error Tracking**: Capture and categorize failures with context
- **Token Usage**: Input/output tokens per request, cost tracking
- **Eval Results**: Historical scores, regression detection

## BaseAgent Implementation Pattern

```python
class BaseAgent:
    def __init__(self, model, max_iterations=10, timeout_seconds=30.0, max_cost_usd=1.0):
        # Register tools, set guardrails
        ...

    def run(self, task):
        # ReAct loop:
        # 1. LLM call with tool schemas
        # 2. If no tool_calls → return final answer
        # 3. Execute tool calls, append results
        # 4. Check guardrails (iterations, timeout, cost, circuit breaker)
        # 5. Repeat
        ...
```

## Cost Analysis (Required for Submission)

Track during development:

- LLM API costs
- Total tokens (input/output)
- Number of API calls
- Observability tool costs

Project to production at 100 / 1,000 / 10,000 / 100,000 users/month.

## Build Strategy (Priority Order)

1. Basic agent — single tool call working end-to-end
2. Tool expansion — add remaining tools, verify each works
3. Multi-step reasoning — agent chains tools appropriately
4. Observability — integrate tracing
5. Eval framework — build test suite, measure baseline
6. Verification layer — domain-specific checks
7. Iterate on evals — improve agent based on failures
8. Open source prep — package and document

## Submission Deliverables

- GitHub repo with setup guide, architecture overview, deployed link
- Demo video (3-5 min)
- Pre-Search document
- Agent architecture doc (1-2 pages)
- AI cost analysis (dev spend + projections)
- Eval dataset (50+ test cases with results)
- Open source contribution (package, PR, or public dataset)
- Deployed application (publicly accessible)
- Social post (X or LinkedIn, tag @GauntletAI)
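As a reference, the four non-negotiable guardrails described above can be enforced with a single check invoked on each ReAct iteration. This is a sketch under assumed names (`Guardrails`, `GuardrailExceeded`, `check`), not a prescribed implementation:

```python
import time
from collections import Counter

class GuardrailExceeded(Exception):
    """Raised when any of the four production limits is hit."""

class Guardrails:
    """Tracks iterations, wall-clock time, spend, and repeated actions."""

    def __init__(self, max_iterations=10, timeout_seconds=30.0,
                 max_cost_usd=1.0, max_repeats=3):
        self.max_iterations = max_iterations
        self.timeout_seconds = timeout_seconds
        self.max_cost_usd = max_cost_usd
        self.max_repeats = max_repeats
        self.start = time.monotonic()
        self.iterations = 0
        self.cost_usd = 0.0
        self.action_counts = Counter()

    def check(self, action=None, call_cost_usd=0.0):
        """Call once per loop iteration, after each LLM/tool step."""
        self.iterations += 1
        self.cost_usd += call_cost_usd
        if action is not None:
            self.action_counts[action] += 1
        if self.iterations > self.max_iterations:
            raise GuardrailExceeded("MAX_ITERATIONS reached")
        if time.monotonic() - self.start > self.timeout_seconds:
            raise GuardrailExceeded("TIMEOUT exceeded")
        if self.cost_usd > self.max_cost_usd:
            raise GuardrailExceeded("COST_LIMIT exceeded")
        if action is not None and self.action_counts[action] >= self.max_repeats:
            # Circuit breaker: same action 3x → abort, log for debugging
            raise GuardrailExceeded(f"CIRCUIT_BREAKER tripped on {action!r}")
```

Inside `BaseAgent.run`, the loop would call `guardrails.check(action=tool_name, call_cost_usd=step_cost)` after each step and convert `GuardrailExceeded` into a logged abort rather than an unhandled crash.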