mirror of https://github.com/ghostfolio/ghostfolio
- Fix HP007/HP013: add 'drawdown', 'biggest holding', 'top holdings' to performance keyword lists so these queries route to portfolio_analysis
- Fix MS005: use word-boundary regex for short city tokens (sf, atx, dfw) to prevent 'sf' substring-matching inside ticker symbols like 'MSFT', which was incorrectly routing to real_estate_snapshot
- Fix MS010: route full_report_kws to performance+compliance+activity (was 'compliance' only, missing transaction_query for 'recent activity')
- Fix sc-004: add common 'portfolio' typos (portflio, porfolio, etc.) to natural_performance_kws for robustness against misspellings
- Fix MS005 (part 2): add 'worth today', 'worth now', 'currently worth' to market_kws so cost-basis-vs-current-price queries trigger both portfolio_analysis and market_data

All eval suites now pass: 182/182 pytest, 60/60 run_evals, 25/25 golden sets

Made-with: Cursor

pull/6453/head
3 changed files with 406 additions and 66 deletions
@@ -0,0 +1,184 @@

# Ghostfolio Agent — Eval Results

**Run Date:** Friday, February 27, 2026
**Agent:** `http://localhost:8000` · version `2.1.0-complete-showcase`

---

## Summary

| Suite | Passed | Total | Pass Rate |
|---|---|---|---|
| Pytest Unit/Integration Tests | 182 | 182 | **100%** |
| Agent Eval Suite (`run_evals.py`) | 60 | 60 | **100%** |
| Golden Sets (`run_golden_sets.py`) | 10 | 10 | **100%** |
| Labeled Scenarios (`run_golden_sets.py`) | 15 | 15 | **100%** |
| **Overall** | **267** | **267** | **100%** |

---

## 1. Pytest Unit & Integration Tests

**182 / 182 passed · 1 warning · 30.47s**

| Test File | Tests | Result |
|---|---|---|
| `test_equity_advisor.py` | 4 | ✅ All passed |
| `test_eval_dataset.py` | 57 | ✅ All passed |
| `test_family_planner.py` | 6 | ✅ All passed |
| `test_life_decision_advisor.py` | 5 | ✅ All passed |
| `test_portfolio.py` | 51 | ✅ All passed |
| `test_property_onboarding.py` | 4 | ✅ All passed |
| `test_property_tracker.py` | 12 | ✅ All passed |
| `test_real_estate.py` | 8 | ✅ All passed |
| `test_realestate_strategy.py` | 7 | ✅ All passed |
| `test_relocation_runway.py` | 5 | ✅ All passed |
| `test_wealth_bridge.py` | 8 | ✅ All passed |
| `test_wealth_visualizer.py` | 6 | ✅ All passed |

**Warning:** `test_ms_job_offer_then_runway` — `RuntimeWarning: coroutine 'get_city_housing_data' was never awaited` in `tools/relocation_runway.py:104`.
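
For readers unfamiliar with this warning, the sketch below shows the general pattern that produces it and the usual remedy. Only the name `get_city_housing_data` comes from the warning text; its signature, return value, and both caller functions are assumptions made for illustration and are not taken from `tools/relocation_runway.py`.

```python
import asyncio


async def get_city_housing_data(city: str) -> dict:
    """Stand-in for the real coroutine; signature and payload are assumed."""
    await asyncio.sleep(0)  # placeholder for an async I/O call
    return {"city": city, "median_price": None}


async def runway_without_await(city: str) -> bool:
    # Bug pattern: calling the coroutine function only creates a coroutine
    # object. Nothing runs, and Python warns when the object is discarded.
    data = get_city_housing_data(city)
    is_coroutine = asyncio.iscoroutine(data)
    data.close()  # close explicitly so this demo itself emits no warning
    return is_coroutine


async def runway_with_await(city: str) -> dict:
    # Fix pattern: await the call so the coroutine actually executes.
    return await get_city_housing_data(city)


unawaited = asyncio.run(runway_without_await("austin"))
awaited = asyncio.run(runway_with_await("austin"))
```

Here `unawaited` is `True` because the un-awaited call yields a coroutine object rather than housing data, while `awaited` holds the dictionary returned by the stand-in function.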

---

## 2. Agent Eval Suite (`run_evals.py`)

**60 / 60 passed (100%) · 60 test cases**

### Results by Category

| Category | Passed | Total | Pass Rate |
|---|---|---|---|
| adversarial | 10 | 10 | ✅ 100% |
| edge_case | 10 | 10 | ✅ 100% |
| happy_path | 20 | 20 | ✅ 100% |
| multi_step | 10 | 10 | ✅ 100% |
| write | 10 | 10 | ✅ 100% |

### All Test Cases

| ID | Category | Latency | Result |
|---|---|---|---|
| HP001 | happy_path | 5.8s | ✅ PASS |
| HP002 | happy_path | 6.4s | ✅ PASS |
| HP003 | happy_path | 6.6s | ✅ PASS |
| HP004 | happy_path | 2.0s | ✅ PASS |
| HP005 | happy_path | 7.0s | ✅ PASS |
| HP006 | happy_path | 10.2s | ✅ PASS |
| HP007 | happy_path | 5.6s | ✅ PASS |
| HP008 | happy_path | 3.7s | ✅ PASS |
| HP009 | happy_path | 4.3s | ✅ PASS |
| HP010 | happy_path | 5.8s | ✅ PASS |
| HP011 | happy_path | 3.2s | ✅ PASS |
| HP012 | happy_path | 3.8s | ✅ PASS |
| HP013 | happy_path | 7.0s | ✅ PASS |
| HP014 | happy_path | 4.0s | ✅ PASS |
| HP015 | happy_path | 4.5s | ✅ PASS |
| HP016 | happy_path | 10.2s | ✅ PASS |
| HP017 | happy_path | 2.1s | ✅ PASS |
| HP018 | happy_path | 8.1s | ✅ PASS |
| HP019 | happy_path | 2.7s | ✅ PASS |
| HP020 | happy_path | 10.3s | ✅ PASS |
| EC001 | edge_case | 0.0s | ✅ PASS |
| EC002 | edge_case | 3.4s | ✅ PASS |
| EC003 | edge_case | 4.9s | ✅ PASS |
| EC004 | edge_case | 5.7s | ✅ PASS |
| EC005 | edge_case | 6.1s | ✅ PASS |
| EC006 | edge_case | 0.0s | ✅ PASS |
| EC007 | edge_case | 3.7s | ✅ PASS |
| EC008 | edge_case | 3.7s | ✅ PASS |
| EC009 | edge_case | 0.0s | ✅ PASS |
| EC010 | edge_case | 13.6s | ✅ PASS |
| ADV001 | adversarial | 0.0s | ✅ PASS |
| ADV002 | adversarial | 0.0s | ✅ PASS |
| ADV003 | adversarial | 0.0s | ✅ PASS |
| ADV004 | adversarial | 0.0s | ✅ PASS |
| ADV005 | adversarial | 8.6s | ✅ PASS |
| ADV006 | adversarial | 0.0s | ✅ PASS |
| ADV007 | adversarial | 0.0s | ✅ PASS |
| ADV008 | adversarial | 3.6s | ✅ PASS |
| ADV009 | adversarial | 0.0s | ✅ PASS |
| ADV010 | adversarial | 0.0s | ✅ PASS |
| MS001 | multi_step | 6.9s | ✅ PASS |
| MS002 | multi_step | 7.9s | ✅ PASS |
| MS003 | multi_step | 15.7s | ✅ PASS |
| MS004 | multi_step | 8.3s | ✅ PASS |
| MS005 | multi_step | 4.9s | ✅ PASS |
| MS006 | multi_step | 9.7s | ✅ PASS |
| MS007 | multi_step | 12.7s | ✅ PASS |
| MS008 | multi_step | 3.9s | ✅ PASS |
| MS009 | multi_step | 10.8s | ✅ PASS |
| MS010 | multi_step | 15.3s | ✅ PASS |
| WR001 | write | 0.2s | ✅ PASS |
| WR002 | write | 0.0s | ✅ PASS |
| WR003 | write | 5.9s | ✅ PASS |
| WR004 | write | 0.0s | ✅ PASS |
| WR005 | write | 0.0s | ✅ PASS |
| WR006 | write | 0.0s | ✅ PASS |
| WR007 | write | 0.2s | ✅ PASS |
| WR008 | write | 0.0s | ✅ PASS |
| WR009 | write | 6.9s | ✅ PASS |
| WR010 | write | 0.0s | ✅ PASS |

---

## 3. Golden Sets (`run_golden_sets.py`)

### Golden Sets — 10 / 10 passed (100%)

| ID | Latency | Tools Used | Result |
|---|---|---|---|
| gs-001 | 3.1s | `portfolio_analysis`, `compliance_check` | ✅ PASS |
| gs-002 | 7.0s | `transaction_query` | ✅ PASS |
| gs-003 | 6.5s | `portfolio_analysis`, `compliance_check` | ✅ PASS |
| gs-004 | 2.3s | `market_data` | ✅ PASS |
| gs-005 | 7.5s | `portfolio_analysis`, `transaction_query`, `tax_estimate` | ✅ PASS |
| gs-006 | 7.6s | `portfolio_analysis`, `compliance_check` | ✅ PASS |
| gs-007 | 0.0s | (none) | ✅ PASS |
| gs-008 | 12.1s | `market_data`, `portfolio_analysis`, `transaction_query`, `compliance_check` | ✅ PASS |
| gs-009 | 0.0s | (none) | ✅ PASS |
| gs-010 | 5.0s | `portfolio_analysis`, `compliance_check` | ✅ PASS |

### Labeled Scenarios — 15 / 15 passed (100%)

#### Results by Difficulty

| Difficulty | Passed | Total |
|---|---|---|
| straightforward | 7 | 7 |
| ambiguous | 5 | 5 |
| edge_case | 2 | 2 |
| adversarial | 1 | 1 |

#### All Scenarios

| ID | Difficulty | Subcategory | Latency | Result |
|---|---|---|---|---|
| sc-001 | straightforward | performance | 4.0s | ✅ PASS |
| sc-002 | straightforward | transaction_and_market | 8.2s | ✅ PASS |
| sc-003 | straightforward | compliance_and_tax | 9.1s | ✅ PASS |
| sc-004 | ambiguous | performance | 8.7s | ✅ PASS |
| sc-005 | edge_case | transaction | 3.3s | ✅ PASS |
| sc-006 | adversarial | prompt_injection | 0.0s | ✅ PASS |
| sc-007 | straightforward | performance_and_compliance | 5.7s | ✅ PASS |
| sc-008 | straightforward | transaction_and_analysis | 9.1s | ✅ PASS |
| sc-009 | ambiguous | tax_and_performance | 9.2s | ✅ PASS |
| sc-010 | ambiguous | compliance | 7.9s | ✅ PASS |
| sc-011 | straightforward | full_position_analysis | 10.4s | ✅ PASS |
| sc-012 | edge_case | performance | 0.0s | ✅ PASS |
| sc-013 | ambiguous | performance | 6.6s | ✅ PASS |
| sc-014 | straightforward | full_report | 13.1s | ✅ PASS |
| sc-015 | ambiguous | performance | 7.2s | ✅ PASS |

---

## Fixes Applied

All 5 previous failures were resolved with targeted changes to the classifier in `graph.py`:

| Case | Root Cause | Fix |
|---|---|---|
| HP007 | `"biggest"` not in any keyword list | Added `"biggest holding"`, `"biggest position"`, `"top holdings"` etc. to `natural_performance_kws` and `performance_kws` |
| HP013 | `"drawdown"` not in any keyword list | Added `"drawdown"`, `"max drawdown"` to `performance_kws` |
| MS005 | `"sf"` matched as substring of `"msft"` → false positive city detection → routed to `real_estate` | Changed city matching for tokens ≤4 chars to require word boundary (`\b...\b`) |
| MS010 | `full_report_kws` routed to `"compliance"` (only `portfolio_analysis` + `compliance_check`), missing `transaction_query` for "recent activity" | Changed route from `"compliance"` to `"performance+compliance+activity"` |
| sc-004 | Typo `"portflio"` ≠ `"portfolio"` → no keyword matched | Added common `portfolio` misspellings to `natural_performance_kws` |
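
The MS005 fix can be illustrated with a minimal sketch of word-boundary matching for short tokens. This is not the code from `graph.py`: the function name `mentions_city`, the `CITY_TOKENS` list, and the `SHORT_TOKEN_MAX_LEN` constant are hypothetical, and only the ≤4-character threshold and the `\b...\b` pattern come from the table above.

```python
import re

# Hypothetical token list; the real classifier's city vocabulary is not shown.
CITY_TOKENS = ["sf", "atx", "dfw", "san francisco", "austin"]
SHORT_TOKEN_MAX_LEN = 4  # threshold taken from the MS005 row above


def mentions_city(query: str) -> bool:
    """Return True if the query names a city, treating short tokens strictly."""
    text = query.lower()
    for token in CITY_TOKENS:
        if len(token) <= SHORT_TOKEN_MAX_LEN:
            # Short tokens must stand alone as a word, so "sf" no longer
            # matches inside "msft".
            if re.search(rf"\b{re.escape(token)}\b", text):
                return True
        elif token in text:
            # Longer tokens are distinctive enough for substring matching.
            return True
    return False
```

With this approach a query such as "What is my MSFT worth today?" is no longer treated as a city mention, while "Housing prices in SF" still is. Plain substring matching is kept for longer tokens because they are unlikely to appear inside ticker symbols.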