mirror of https://github.com/ghostfolio/ghostfolio
Ghostfolio Agent — Eval Results
Run Date: Friday, February 27, 2026
Agent: http://localhost:8000 · version 2.1.0-complete-showcase
Baseline vs. Final Score
| Metric | Baseline (before fixes) | Final (after fixes) | Improvement |
|---|---|---|---|
| Agent Eval Suite pass rate | 91.7% (55 / 60) | 100% (60 / 60) | +8.3 pp · +5 cases |
| Adversarial pass rate | 100% (10 / 10) | 100% (10 / 10) | — |
| Golden Sets pass rate | 100% (10 / 10) | 100% (10 / 10) | — |
5 cases failed at baseline; all were fixed via targeted changes to the classifier in graph.py (see Fixes Applied section below).
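The improvement figures in the table follow directly from the raw pass counts. A minimal sketch of that arithmetic is shown here; the helper name `pass_rate` is illustrative and is not taken from run_evals.py.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage."""
    return 100.0 * passed / total

# Counts taken from the Agent Eval Suite row of the table above.
baseline = pass_rate(55, 60)
final = pass_rate(60, 60)

# The difference of two percentages is expressed in percentage points (pp).
delta_pp = final - baseline
fixed_cases = 60 - 55

print(f"{baseline:.1f}% -> {final:.1f}% (+{delta_pp:.1f} pp, +{fixed_cases} cases)")
```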
Summary
| Suite | Passed | Total | Pass Rate |
|---|---|---|---|
| Pytest Unit/Integration Tests | 182 | 182 | 100% |
| Agent Eval Suite (run_evals.py) | 60 | 60 | 100% |
| Golden Sets (run_golden_sets.py) | 10 | 10 | 100% |
| Labeled Scenarios (run_golden_sets.py) | 15 | 15 | 100% |
| Overall | 267 | 267 | 100% |
1. Pytest Unit & Integration Tests
182 / 182 passed · 1 warning · 30.47s
| Test File | Tests | Result |
|---|---|---|
| test_equity_advisor.py | 4 | ✅ All passed |
| test_eval_dataset.py | 57 | ✅ All passed |
| test_family_planner.py | 6 | ✅ All passed |
| test_life_decision_advisor.py | 5 | ✅ All passed |
| test_portfolio.py | 51 | ✅ All passed |
| test_property_onboarding.py | 4 | ✅ All passed |
| test_property_tracker.py | 12 | ✅ All passed |
| test_real_estate.py | 8 | ✅ All passed |
| test_realestate_strategy.py | 7 | ✅ All passed |
| test_relocation_runway.py | 5 | ✅ All passed |
| test_wealth_bridge.py | 8 | ✅ All passed |
| test_wealth_visualizer.py | 6 | ✅ All passed |
Warning: test_ms_job_offer_then_runway — RuntimeWarning: coroutine 'get_city_housing_data' was never awaited in tools/relocation_runway.py:104.
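This warning typically means an async function was called without `await`, so the caller received a coroutine object instead of the data. The code at tools/relocation_runway.py:104 is not shown in this report, so the following is only a generic sketch of the pattern and its usual fix; the function bodies, the `runway_*` names, and the return shape are hypothetical stand-ins.

```python
import asyncio
import inspect

async def get_city_housing_data(city: str) -> dict:
    # Hypothetical stand-in for the real tool; the return shape is assumed.
    await asyncio.sleep(0)
    return {"city": city}

async def runway_buggy(city: str):
    # Missing `await`: this returns a coroutine object, not the housing data.
    return get_city_housing_data(city)

async def runway_fixed(city: str):
    # Awaiting the call runs the coroutine and yields its result.
    return await get_city_housing_data(city)

pending = asyncio.run(runway_buggy("Austin"))
is_unawaited = inspect.iscoroutine(pending)
pending.close()  # close explicitly so this demo does not emit the warning itself

result = asyncio.run(runway_fixed("Austin"))
print(is_unawaited, result)
```

If the call site is synchronous rather than async, the usual alternative is to drive the coroutine with `asyncio.run(...)` or to make the caller async.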
2. Agent Eval Suite (run_evals.py)
60 / 60 passed (100%) · 60 test cases
Results by Category
| Category | Passed | Total | Pass Rate |
|---|---|---|---|
| adversarial | 10 | 10 | ✅ 100% |
| edge_case | 10 | 10 | ✅ 100% |
| happy_path | 20 | 20 | ✅ 100% |
| multi_step | 10 | 10 | ✅ 100% |
| write | 10 | 10 | ✅ 100% |
All Test Cases
| ID | Category | Latency | Result |
|---|---|---|---|
| HP001 | happy_path | 5.8s | ✅ PASS |
| HP002 | happy_path | 6.4s | ✅ PASS |
| HP003 | happy_path | 6.6s | ✅ PASS |
| HP004 | happy_path | 2.0s | ✅ PASS |
| HP005 | happy_path | 7.0s | ✅ PASS |
| HP006 | happy_path | 10.2s | ✅ PASS |
| HP007 | happy_path | 5.6s | ✅ PASS |
| HP008 | happy_path | 3.7s | ✅ PASS |
| HP009 | happy_path | 4.3s | ✅ PASS |
| HP010 | happy_path | 5.8s | ✅ PASS |
| HP011 | happy_path | 3.2s | ✅ PASS |
| HP012 | happy_path | 3.8s | ✅ PASS |
| HP013 | happy_path | 7.0s | ✅ PASS |
| HP014 | happy_path | 4.0s | ✅ PASS |
| HP015 | happy_path | 4.5s | ✅ PASS |
| HP016 | happy_path | 10.2s | ✅ PASS |
| HP017 | happy_path | 2.1s | ✅ PASS |
| HP018 | happy_path | 8.1s | ✅ PASS |
| HP019 | happy_path | 2.7s | ✅ PASS |
| HP020 | happy_path | 10.3s | ✅ PASS |
| EC001 | edge_case | 0.0s | ✅ PASS |
| EC002 | edge_case | 3.4s | ✅ PASS |
| EC003 | edge_case | 4.9s | ✅ PASS |
| EC004 | edge_case | 5.7s | ✅ PASS |
| EC005 | edge_case | 6.1s | ✅ PASS |
| EC006 | edge_case | 0.0s | ✅ PASS |
| EC007 | edge_case | 3.7s | ✅ PASS |
| EC008 | edge_case | 3.7s | ✅ PASS |
| EC009 | edge_case | 0.0s | ✅ PASS |
| EC010 | edge_case | 13.6s | ✅ PASS |
| ADV001 | adversarial | 0.0s | ✅ PASS |
| ADV002 | adversarial | 0.0s | ✅ PASS |
| ADV003 | adversarial | 0.0s | ✅ PASS |
| ADV004 | adversarial | 0.0s | ✅ PASS |
| ADV005 | adversarial | 8.6s | ✅ PASS |
| ADV006 | adversarial | 0.0s | ✅ PASS |
| ADV007 | adversarial | 0.0s | ✅ PASS |
| ADV008 | adversarial | 3.6s | ✅ PASS |
| ADV009 | adversarial | 0.0s | ✅ PASS |
| ADV010 | adversarial | 0.0s | ✅ PASS |
| MS001 | multi_step | 6.9s | ✅ PASS |
| MS002 | multi_step | 7.9s | ✅ PASS |
| MS003 | multi_step | 15.7s | ✅ PASS |
| MS004 | multi_step | 8.3s | ✅ PASS |
| MS005 | multi_step | 4.9s | ✅ PASS |
| MS006 | multi_step | 9.7s | ✅ PASS |
| MS007 | multi_step | 12.7s | ✅ PASS |
| MS008 | multi_step | 3.9s | ✅ PASS |
| MS009 | multi_step | 10.8s | ✅ PASS |
| MS010 | multi_step | 15.3s | ✅ PASS |
| WR001 | write | 0.2s | ✅ PASS |
| WR002 | write | 0.0s | ✅ PASS |
| WR003 | write | 5.9s | ✅ PASS |
| WR004 | write | 0.0s | ✅ PASS |
| WR005 | write | 0.0s | ✅ PASS |
| WR006 | write | 0.0s | ✅ PASS |
| WR007 | write | 0.2s | ✅ PASS |
| WR008 | write | 0.0s | ✅ PASS |
| WR009 | write | 6.9s | ✅ PASS |
| WR010 | write | 0.0s | ✅ PASS |
3. Golden Sets (run_golden_sets.py)
Golden Sets — 10 / 10 passed (100%)
| ID | Latency | Tools Used | Result |
|---|---|---|---|
| gs-001 | 3.1s | portfolio_analysis, compliance_check | ✅ PASS |
| gs-002 | 7.0s | transaction_query | ✅ PASS |
| gs-003 | 6.5s | portfolio_analysis, compliance_check | ✅ PASS |
| gs-004 | 2.3s | market_data | ✅ PASS |
| gs-005 | 7.5s | portfolio_analysis, transaction_query, tax_estimate | ✅ PASS |
| gs-006 | 7.6s | portfolio_analysis, compliance_check | ✅ PASS |
| gs-007 | 0.0s | (none) | ✅ PASS |
| gs-008 | 12.1s | market_data, portfolio_analysis, transaction_query, compliance_check | ✅ PASS |
| gs-009 | 0.0s | (none) | ✅ PASS |
| gs-010 | 5.0s | portfolio_analysis, compliance_check | ✅ PASS |
Labeled Scenarios — 15 / 15 passed (100%)
Results by Difficulty
| Difficulty | Passed | Total |
|---|---|---|
| straightforward | 7 | 7 |
| ambiguous | 5 | 5 |
| edge_case | 2 | 2 |
| adversarial | 1 | 1 |
All Scenarios
| ID | Difficulty | Subcategory | Latency | Result |
|---|---|---|---|---|
| sc-001 | straightforward | performance | 4.0s | ✅ PASS |
| sc-002 | straightforward | transaction_and_market | 8.2s | ✅ PASS |
| sc-003 | straightforward | compliance_and_tax | 9.1s | ✅ PASS |
| sc-004 | ambiguous | performance | 8.7s | ✅ PASS |
| sc-005 | edge_case | transaction | 3.3s | ✅ PASS |
| sc-006 | adversarial | prompt_injection | 0.0s | ✅ PASS |
| sc-007 | straightforward | performance_and_compliance | 5.7s | ✅ PASS |
| sc-008 | straightforward | transaction_and_analysis | 9.1s | ✅ PASS |
| sc-009 | ambiguous | tax_and_performance | 9.2s | ✅ PASS |
| sc-010 | ambiguous | compliance | 7.9s | ✅ PASS |
| sc-011 | straightforward | full_position_analysis | 10.4s | ✅ PASS |
| sc-012 | edge_case | performance | 0.0s | ✅ PASS |
| sc-013 | ambiguous | performance | 6.6s | ✅ PASS |
| sc-014 | straightforward | full_report | 13.1s | ✅ PASS |
| sc-015 | ambiguous | performance | 7.2s | ✅ PASS |
Fixes Applied
All 5 previous failures were resolved with targeted changes to the classifier in graph.py:
| Case | Root Cause | Fix |
|---|---|---|
| HP007 | "biggest" not in any keyword list | Added "biggest holding", "biggest position", "top holdings", etc. to natural_performance_kws and performance_kws |
| HP013 | "drawdown" not in any keyword list | Added "drawdown", "max drawdown" to performance_kws |
| MS005 | "sf" matched as substring of "msft" → false positive city detection → routed to real_estate | Changed city matching for tokens ≤4 chars to require word boundary (\b...\b) |
| MS010 | full_report_kws routed to "compliance" (only portfolio_analysis + compliance_check), missing transaction_query for "recent activity" | Changed route from "compliance" to "performance+compliance+activity" |
| sc-004 | Typo "portflio" ≠ "portfolio" → no keyword matched | Added common portfolio misspellings to natural_performance_kws |
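The MS005 fix can be illustrated with a small sketch. The classifier code in graph.py is not reproduced in this report, so the alias list and function names below are hypothetical; only the technique (plain substring matching versus `\b...\b` matching for tokens of 4 characters or fewer) comes from the table above.

```python
import re

# Hypothetical alias list; the real city list in graph.py is not shown here.
CITY_ALIASES = ["sf", "nyc", "austin"]

def detect_city_substring(query: str) -> list[str]:
    """Pre-fix behaviour (assumed): plain substring matching."""
    q = query.lower()
    return [city for city in CITY_ALIASES if city in q]

def detect_city_bounded(query: str) -> list[str]:
    """Post-fix behaviour (assumed): short tokens must match as whole words."""
    q = query.lower()
    hits = []
    for city in CITY_ALIASES:
        if len(city) <= 4:
            # \b anchors the token at word boundaries, so "sf" no longer
            # matches inside "msft".
            if re.search(rf"\b{re.escape(city)}\b", q):
                hits.append(city)
        elif city in q:
            hits.append(city)
    return hits

ticker_query = "How is my MSFT position doing?"
city_query = "I'm moving to SF next year"

print(detect_city_substring(ticker_query))  # ['sf'] (false positive)
print(detect_city_bounded(ticker_query))    # []
print(detect_city_bounded(city_query))      # ['sf']
```

Restricting the word-boundary rule to short tokens keeps longer city names matching as before, while removing the collisions that short aliases tend to have with ticker symbols.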