community: add AI agent eval dataset (183 test cases)

Contributes a community eval dataset for testing AI agents built on the Ghostfolio API. Includes 183 test cases covering portfolio analysis, market data, tax estimation, real estate, life decisions, and adversarial inputs.

Full implementation available at: https://github.com/lakshmipunukollu-ai/ghostfolio-agent-priya

Contributed by Gauntlet AI bounty submission.

Made-with: Cursor
1 month ago · 018af2c09c
2 changed files with 133 additions and 0 deletions
--- a/community/agent-evals/EVAL_SUMMARY.md
+++ b/community/agent-evals/EVAL_SUMMARY.md
@@ -0,0 +1,72 @@
+# Ghostfolio AI Agent — Eval Test Dataset
+
+A community-contributed evaluation dataset for
+testing AI agents built on top of the Ghostfolio API.
+
+## Overview
+
+183 test cases covering the full Ghostfolio API
+surface, contributed as part of a Gauntlet AI
+bounty submission (March 2026).
+
+## Categories
+
+| Category | Count | Description |
+|----------|-------|-------------|
+| Happy Path | 20 | Standard queries that should always work |
+| Edge Cases | 14 | Boundary conditions, empty data, typos |
+| Adversarial | 14 | Prompt injection, nonsense, abuse attempts |
+| Multi-Step | 13 | Conversations spanning multiple tool calls |
+| Domain Specific | 122 | Portfolio, market, property, tax, relocation |
+
+## Query Types Covered
+
+- Portfolio analysis (holdings, allocation, performance)
+- Market data (live stock prices, YTD returns)
+- Tax estimation (capital gains, tax-loss harvesting)
+- Risk & compliance (concentration, diversification)
+- Transaction history (trade history, recent activity)
+- Real estate (property CRUD, equity analysis)
+- Life decisions (job offers, relocation, retirement)
+- Wealth gap analysis (Fed SCF benchmarking)
+
+## How to Use
+
+These test cases can be used to evaluate any
+AI agent built on top of the Ghostfolio API.
+Each test case includes:
+- Input query (natural language)
+- Expected tool call
+- Expected response characteristics
+- Verification criteria
+
+## Full Dataset
+
+The full test dataset and its code are available
+in the `agent/evals/` directory of:
+https://github.com/lakshmipunukollu-ai/ghostfolio-agent-priya
+
+## Routing Test Cases
+
+| Query | Expected Tool |
+|-------|--------------|
+| how is my portfolio | portfolio_analysis |
+| check apple for me | market_data |
+| how many shares of APPL do I have | portfolio_analysis |
+| can I afford to retire | wealth_gap |
+| tell me about my house | property_tracker |
+| what can this do | capabilities |
+| give me a full portfolio summary | portfolio_analysis |
+| is 180k in seattle a raise from 120k in austin | relocation_runway |
+| estimate my capital gains | tax_estimate |
+| am I over concentrated | compliance_check |
+| what is the price of NVDA | market_data |
+| show me my transactions | transaction_query |
+| how much is AAPL | market_data |
+| can I afford a house | wealth_down_payment |
+
+## Contributing
+
+If you build an AI agent on Ghostfolio and want
+to contribute additional test cases, please open
+a PR adding them to this directory.
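
The test-case fields listed under "How to Use" and the routing table in EVAL_SUMMARY.md could be exercised roughly as follows. This is a minimal sketch, not the dataset's actual code: `EvalCase`, `route_query`, `routing_failures`, and the keyword rules are hypothetical stand-ins for a real agent's router. Only the queries and tool names are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One routing case: a natural-language query and the tool it should trigger."""
    query: str
    expected_tool: str


# Queries and tool names copied from the routing table in EVAL_SUMMARY.md.
ROUTING_CASES = [
    EvalCase("how is my portfolio", "portfolio_analysis"),
    EvalCase("check apple for me", "market_data"),
    EvalCase("estimate my capital gains", "tax_estimate"),
    EvalCase("show me my transactions", "transaction_query"),
    EvalCase("what is the price of NVDA", "market_data"),
]

# Hypothetical keyword rules standing in for the agent's real router.
KEYWORD_RULES = [
    ("capital gains", "tax_estimate"),
    ("transactions", "transaction_query"),
    ("portfolio", "portfolio_analysis"),
    ("price", "market_data"),
    ("check", "market_data"),
]


def route_query(query: str) -> str:
    """Toy router: return the tool of the first rule whose keyword appears in the query."""
    lowered = query.lower()
    for keyword, tool in KEYWORD_RULES:
        if keyword in lowered:
            return tool
    return "capabilities"  # assumed fallback when no rule matches


def routing_failures(cases):
    """Return (query, expected, actual) for every case the router gets wrong."""
    failures = []
    for case in cases:
        actual = route_query(case.query)
        if actual != case.expected_tool:
            failures.append((case.query, case.expected_tool, actual))
    return failures
```

Collecting failures as tuples, instead of asserting on the first mismatch, makes it easier to see every misrouted query in one run. A real harness would replace `route_query` with a call into the agent under test.
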
--- a/community/agent-evals/README.md
+++ b/community/agent-evals/README.md
@@ -0,0 +1,61 @@
+# Finance AI Agent — Public Eval Dataset
+
+183 test cases for AI agents built on personal finance and portfolio management software.
+
+Built on top of Ghostfolio, an open-source wealth management platform.
+
+Released publicly as a resource for developers building finance AI agents.
+
+## Test Categories
+
+| Category | Count |
+|----------|-------|
+| Happy Path | 20 |
+| Edge Cases | 14 |
+| Adversarial | 14 |
+| Multi-Step | 13 |
+| Domain Specific | 122 |
+| **Total** | **183** |
+
+## How To Run
+```bash
+git clone https://github.com/lakshmipunukollu-ai/ghostfolio
+cd ghostfolio
+git checkout submission/final
+pip install -r agent/requirements.txt
+python -m pytest agent/evals/ -v
+```
+
+## Test Structure
+
+Every test in `test_eval_dataset.py` follows this pattern:
+```python
+# TYPE: happy_path | edge_case | adversarial | multi_step
+# INPUT: what is being tested
+# EXPECTED: what the tool should return
+# CRITERIA: the specific assertion
+def test_name():
+    from tools.tool_name import function_name
+    result = function_name(params)
+    assert "key" in result
+```
+
+## Results
+
+- Tests: 183
+- Pass rate: 100%
+- Runtime: ~30 seconds
+
+## Contribute
+
+Submit a PR with new test cases.
+Follow the TYPE/INPUT/EXPECTED/CRITERIA pattern.
+
+## License
+
+MIT
+
+## Author
+
+Priya Lakshmipunukollu — AgentForge, February 2026
+https://github.com/lakshmipunukollu-ai/ghostfolio
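
A concrete instance of the TYPE/INPUT/EXPECTED/CRITERIA pattern from the README's "Test Structure" section might look like the sketch below. `summarize_holdings` is a hypothetical stand-in defined inline so the example runs on its own. The real tools live in the linked repository, and their names and signatures are not shown in this commit.

```python
def summarize_holdings(holdings):
    """Hypothetical tool: total value and per-symbol allocation for a list of holdings."""
    total = sum(h["value"] for h in holdings)
    if total == 0:
        # Empty or zero-value portfolio: report it instead of dividing by zero.
        return {"total_value": 0.0, "allocation": {}, "error": "no holdings"}
    allocation = {h["symbol"]: h["value"] / total for h in holdings}
    return {"total_value": float(total), "allocation": allocation}


# TYPE: edge_case
# INPUT: a portfolio with no holdings
# EXPECTED: a result with zero total value and an explanatory error, not an exception
# CRITERIA: "error" key is present and total_value is 0
def test_empty_portfolio_reports_error():
    result = summarize_holdings([])
    assert "error" in result
    assert result["total_value"] == 0.0


# TYPE: happy_path
# INPUT: two holdings of equal value
# EXPECTED: each symbol gets half of the allocation
# CRITERIA: allocation fractions sum to 1 and each equals 0.5
def test_two_equal_holdings_split_evenly():
    result = summarize_holdings([
        {"symbol": "AAPL", "value": 100},
        {"symbol": "NVDA", "value": 100},
    ])
    assert result["allocation"] == {"AAPL": 0.5, "NVDA": 0.5}
    assert sum(result["allocation"].values()) == 1.0
```

Keeping the four comment lines directly above each test function means the category and intent of a case can be read, or extracted by a script, without running it.
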