mirror of https://github.com/ghostfolio/ghostfolio
14 changed files with 1042 additions and 1181 deletions
@@ -0,0 +1,196 @@
# Ghostfolio Agent: Eval Results

**Run Date:** Friday, February 27, 2026
**Agent:** `http://localhost:8000` · version `2.1.0-complete-showcase`

---

## Baseline vs. Final Score

| Metric | Baseline (before fixes) | Final (after fixes) | Improvement |
|---|---|---|---|
| Agent Eval Suite pass rate | **91.7%** (55 / 60) | **100%** (60 / 60) | +8.3 pp · +5 cases |
| Adversarial pass rate | 100% (10 / 10) | 100% (10 / 10) | 0 pp |
| Golden Sets pass rate | 100% (10 / 10) | 100% (10 / 10) | 0 pp |

Five cases failed at baseline; all were fixed by targeted changes to the classifier in `graph.py` (see the Fixes Applied section below).
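
The improvement figures in the table follow from simple arithmetic on the pass counts. A minimal sketch of that calculation, where the helper name `pass_rate` is an illustrative assumption and not part of `run_evals.py`:

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


# Counts taken from the Agent Eval Suite row of the table.
baseline = pass_rate(55, 60)
final = pass_rate(60, 60)

# A difference between two percentages is expressed in percentage points (pp).
improvement_pp = final - baseline

print(f"{baseline:.1f}% -> {final:.0f}% (+{improvement_pp:.1f} pp, +{60 - 55} cases)")
# prints "91.7% -> 100% (+8.3 pp, +5 cases)"
```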

---

## Summary

| Suite | Passed | Total | Pass Rate |
|---|---|---|---|
| Pytest Unit/Integration Tests | 182 | 182 | **100%** |
| Agent Eval Suite (`run_evals.py`) | 60 | 60 | **100%** |
| Golden Sets (`run_golden_sets.py`) | 10 | 10 | **100%** |
| Labeled Scenarios (`run_golden_sets.py`) | 15 | 15 | **100%** |
| **Overall** | **267** | **267** | **100%** |

---

## 1. Pytest Unit & Integration Tests

**182 / 182 passed · 1 warning · 30.47s**

| Test File | Tests | Result |
|---|---|---|
| `test_equity_advisor.py` | 4 | ✅ All passed |
| `test_eval_dataset.py` | 57 | ✅ All passed |
| `test_family_planner.py` | 6 | ✅ All passed |
| `test_life_decision_advisor.py` | 5 | ✅ All passed |
| `test_portfolio.py` | 51 | ✅ All passed |
| `test_property_onboarding.py` | 4 | ✅ All passed |
| `test_property_tracker.py` | 12 | ✅ All passed |
| `test_real_estate.py` | 8 | ✅ All passed |
| `test_realestate_strategy.py` | 7 | ✅ All passed |
| `test_relocation_runway.py` | 5 | ✅ All passed |
| `test_wealth_bridge.py` | 8 | ✅ All passed |
| `test_wealth_visualizer.py` | 6 | ✅ All passed |

**Warning:** `test_ms_job_offer_then_runway` emitted `RuntimeWarning: coroutine 'get_city_housing_data' was never awaited` at `tools/relocation_runway.py:104`.
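
Python emits this `RuntimeWarning` when an `async def` function is called without `await`, so a coroutine object is created but never run. The sketch below illustrates the general pattern and its usual fix. The body, signature, and return value of `get_city_housing_data` here are hypothetical stand-ins, as are the two caller functions; the actual code in `tools/relocation_runway.py` is not shown in this report.

```python
import asyncio


async def get_city_housing_data(city: str) -> dict:
    # Hypothetical stand-in: the real function's signature and payload are assumptions.
    await asyncio.sleep(0)
    return {"city": city, "median_price": None}


async def runway_without_await(city: str) -> bool:
    # Bug pattern: calling the coroutine function only creates a coroutine object.
    data = get_city_housing_data(city)
    is_coro = asyncio.iscoroutine(data)
    data.close()  # closed explicitly so this demo itself stays warning-free
    return is_coro


async def runway_with_await(city: str) -> dict:
    # Fix pattern: await the call so the coroutine actually executes.
    return await get_city_housing_data(city)


unawaited_was_coroutine = asyncio.run(runway_without_await("Austin"))
awaited_result = asyncio.run(runway_with_await("Austin"))
print(unawaited_was_coroutine, awaited_result)
```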

---

## 2. Agent Eval Suite (`run_evals.py`)

**60 / 60 passed (100%) · 60 test cases**

### Results by Category

| Category | Passed | Total | Pass Rate |
|---|---|---|---|
| adversarial | 10 | 10 | ✅ 100% |
| edge_case | 10 | 10 | ✅ 100% |
| happy_path | 20 | 20 | ✅ 100% |
| multi_step | 10 | 10 | ✅ 100% |
| write | 10 | 10 | ✅ 100% |

### All Test Cases

| ID | Category | Latency | Result |
|---|---|---|---|
| HP001 | happy_path | 5.8s | ✅ PASS |
| HP002 | happy_path | 6.4s | ✅ PASS |
| HP003 | happy_path | 6.6s | ✅ PASS |
| HP004 | happy_path | 2.0s | ✅ PASS |
| HP005 | happy_path | 7.0s | ✅ PASS |
| HP006 | happy_path | 10.2s | ✅ PASS |
| HP007 | happy_path | 5.6s | ✅ PASS |
| HP008 | happy_path | 3.7s | ✅ PASS |
| HP009 | happy_path | 4.3s | ✅ PASS |
| HP010 | happy_path | 5.8s | ✅ PASS |
| HP011 | happy_path | 3.2s | ✅ PASS |
| HP012 | happy_path | 3.8s | ✅ PASS |
| HP013 | happy_path | 7.0s | ✅ PASS |
| HP014 | happy_path | 4.0s | ✅ PASS |
| HP015 | happy_path | 4.5s | ✅ PASS |
| HP016 | happy_path | 10.2s | ✅ PASS |
| HP017 | happy_path | 2.1s | ✅ PASS |
| HP018 | happy_path | 8.1s | ✅ PASS |
| HP019 | happy_path | 2.7s | ✅ PASS |
| HP020 | happy_path | 10.3s | ✅ PASS |
| EC001 | edge_case | 0.0s | ✅ PASS |
| EC002 | edge_case | 3.4s | ✅ PASS |
| EC003 | edge_case | 4.9s | ✅ PASS |
| EC004 | edge_case | 5.7s | ✅ PASS |
| EC005 | edge_case | 6.1s | ✅ PASS |
| EC006 | edge_case | 0.0s | ✅ PASS |
| EC007 | edge_case | 3.7s | ✅ PASS |
| EC008 | edge_case | 3.7s | ✅ PASS |
| EC009 | edge_case | 0.0s | ✅ PASS |
| EC010 | edge_case | 13.6s | ✅ PASS |
| ADV001 | adversarial | 0.0s | ✅ PASS |
| ADV002 | adversarial | 0.0s | ✅ PASS |
| ADV003 | adversarial | 0.0s | ✅ PASS |
| ADV004 | adversarial | 0.0s | ✅ PASS |
| ADV005 | adversarial | 8.6s | ✅ PASS |
| ADV006 | adversarial | 0.0s | ✅ PASS |
| ADV007 | adversarial | 0.0s | ✅ PASS |
| ADV008 | adversarial | 3.6s | ✅ PASS |
| ADV009 | adversarial | 0.0s | ✅ PASS |
| ADV010 | adversarial | 0.0s | ✅ PASS |
| MS001 | multi_step | 6.9s | ✅ PASS |
| MS002 | multi_step | 7.9s | ✅ PASS |
| MS003 | multi_step | 15.7s | ✅ PASS |
| MS004 | multi_step | 8.3s | ✅ PASS |
| MS005 | multi_step | 4.9s | ✅ PASS |
| MS006 | multi_step | 9.7s | ✅ PASS |
| MS007 | multi_step | 12.7s | ✅ PASS |
| MS008 | multi_step | 3.9s | ✅ PASS |
| MS009 | multi_step | 10.8s | ✅ PASS |
| MS010 | multi_step | 15.3s | ✅ PASS |
| WR001 | write | 0.2s | ✅ PASS |
| WR002 | write | 0.0s | ✅ PASS |
| WR003 | write | 5.9s | ✅ PASS |
| WR004 | write | 0.0s | ✅ PASS |
| WR005 | write | 0.0s | ✅ PASS |
| WR006 | write | 0.0s | ✅ PASS |
| WR007 | write | 0.2s | ✅ PASS |
| WR008 | write | 0.0s | ✅ PASS |
| WR009 | write | 6.9s | ✅ PASS |
| WR010 | write | 0.0s | ✅ PASS |
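
The per-category tallies can be derived from per-case rows like the ones listed here. A rough sketch of that aggregation follows; the `(id, category, passed)` record shape and the function name `tally_by_category` are assumptions for illustration, since the output format of `run_evals.py` is not shown in this report.

```python
from collections import defaultdict

# A few rows copied from the test-case table; the full run has 60.
rows = [
    ("HP001", "happy_path", True),
    ("EC001", "edge_case", True),
    ("ADV001", "adversarial", True),
    ("MS001", "multi_step", True),
    ("WR001", "write", True),
    ("WR002", "write", True),
]


def tally_by_category(rows):
    """Group case results into {category: (passed, total)}, sorted by category."""
    totals = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for _case_id, category, passed in rows:
        totals[category][1] += 1
        if passed:
            totals[category][0] += 1
    return {cat: tuple(counts) for cat, counts in sorted(totals.items())}


summary = tally_by_category(rows)
for category, (passed, total) in summary.items():
    print(f"{category}: {passed}/{total} ({100 * passed / total:.0f}%)")
```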

---

## 3. Golden Sets (`run_golden_sets.py`)

### Golden Sets: 10 / 10 passed (100%)

| ID | Latency | Tools Used | Result |
|---|---|---|---|
| gs-001 | 3.1s | `portfolio_analysis`, `compliance_check` | ✅ PASS |
| gs-002 | 7.0s | `transaction_query` | ✅ PASS |
| gs-003 | 6.5s | `portfolio_analysis`, `compliance_check` | ✅ PASS |
| gs-004 | 2.3s | `market_data` | ✅ PASS |
| gs-005 | 7.5s | `portfolio_analysis`, `transaction_query`, `tax_estimate` | ✅ PASS |
| gs-006 | 7.6s | `portfolio_analysis`, `compliance_check` | ✅ PASS |
| gs-007 | 0.0s | (none) | ✅ PASS |
| gs-008 | 12.1s | `market_data`, `portfolio_analysis`, `transaction_query`, `compliance_check` | ✅ PASS |
| gs-009 | 0.0s | (none) | ✅ PASS |
| gs-010 | 5.0s | `portfolio_analysis`, `compliance_check` | ✅ PASS |

### Labeled Scenarios: 15 / 15 passed (100%)

#### Results by Difficulty

| Difficulty | Passed | Total |
|---|---|---|
| straightforward | 7 | 7 |
| ambiguous | 5 | 5 |
| edge_case | 2 | 2 |
| adversarial | 1 | 1 |

#### All Scenarios

| ID | Difficulty | Subcategory | Latency | Result |
|---|---|---|---|---|
| sc-001 | straightforward | performance | 4.0s | ✅ PASS |
| sc-002 | straightforward | transaction_and_market | 8.2s | ✅ PASS |
| sc-003 | straightforward | compliance_and_tax | 9.1s | ✅ PASS |
| sc-004 | ambiguous | performance | 8.7s | ✅ PASS |
| sc-005 | edge_case | transaction | 3.3s | ✅ PASS |
| sc-006 | adversarial | prompt_injection | 0.0s | ✅ PASS |
| sc-007 | straightforward | performance_and_compliance | 5.7s | ✅ PASS |
| sc-008 | straightforward | transaction_and_analysis | 9.1s | ✅ PASS |
| sc-009 | ambiguous | tax_and_performance | 9.2s | ✅ PASS |
| sc-010 | ambiguous | compliance | 7.9s | ✅ PASS |
| sc-011 | straightforward | full_position_analysis | 10.4s | ✅ PASS |
| sc-012 | edge_case | performance | 0.0s | ✅ PASS |
| sc-013 | ambiguous | performance | 6.6s | ✅ PASS |
| sc-014 | straightforward | full_report | 13.1s | ✅ PASS |
| sc-015 | ambiguous | performance | 7.2s | ✅ PASS |

---

## Fixes Applied

All 5 previous failures were resolved with targeted changes to the classifier in `graph.py`:

| Case | Root Cause | Fix |
|---|---|---|
| HP007 | `"biggest"` not in any keyword list | Added `"biggest holding"`, `"biggest position"`, `"top holdings"`, etc. to `natural_performance_kws` and `performance_kws` |
| HP013 | `"drawdown"` not in any keyword list | Added `"drawdown"` and `"max drawdown"` to `performance_kws` |
| MS005 | `"sf"` matched as a substring of `"msft"` → false-positive city detection → routed to `real_estate` | Changed city matching for tokens ≤ 4 chars to require a word boundary (`\b...\b`) |
| MS010 | `full_report_kws` routed to `"compliance"` (only `portfolio_analysis` + `compliance_check`), missing `transaction_query` for "recent activity" | Changed route from `"compliance"` to `"performance+compliance+activity"` |
| sc-004 | Typo `"portflio"` ≠ `"portfolio"` → no keyword matched | Added common `portfolio` misspellings to `natural_performance_kws` |
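
The MS005 fix can be illustrated with a short sketch of word-boundary matching for short city tokens. This is not the code in `graph.py`: the alias list `CITY_ALIASES`, the function name `detect_city`, and the `short_len` parameter are hypothetical names chosen for the example.

```python
import re

# Hypothetical city aliases; the real list in graph.py is not shown in this report.
CITY_ALIASES = ["sf", "nyc", "austin", "seattle"]


def detect_city(query: str, aliases=CITY_ALIASES, short_len: int = 4):
    """Return the first alias found in the query, or None."""
    text = query.lower()
    for alias in aliases:
        if len(alias) <= short_len:
            # Short tokens must stand alone, so "sf" no longer matches inside "msft".
            if re.search(rf"\b{re.escape(alias)}\b", text):
                return alias
        elif alias in text:
            # Longer names keep plain substring matching.
            return alias
    return None


print(detect_city("How is MSFT doing?"))           # None
print(detect_city("Can I afford a house in SF?"))  # sf
```

Restricting the boundary check to short tokens keeps substring matching for longer names, where accidental matches inside ticker symbols are much less likely.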