Ghostfolio Agent — Eval Results

Run Date: Friday, February 27, 2026
Agent: http://localhost:8000 · version 2.1.0-complete-showcase


Baseline vs. Final Score

| Metric | Baseline (before fixes) | Final (after fixes) | Improvement |
| --- | --- | --- | --- |
| Agent Eval Suite pass rate | 91.7% (55 / 60) | 100% (60 / 60) | +8.3 pp · +5 cases |
| Adversarial pass rate | 100% (10 / 10) | 100% (10 / 10) | |
| Golden Sets pass rate | 100% (10 / 10) | 100% (10 / 10) | |

5 cases failed at baseline; all were fixed via targeted changes to the classifier in graph.py (see Fixes Applied section below).
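
The improvement figure in the table can be reproduced from the raw counts. The snippet below is a minimal sketch of that arithmetic; the helper name `pass_rate` is an illustrative assumption and is not taken from run_evals.py.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the pass rate as a percentage (hypothetical helper)."""
    return 100.0 * passed / total


# Counts taken from the Agent Eval Suite row above.
baseline = pass_rate(55, 60)
final = pass_rate(60, 60)

# A difference between two percentages is expressed in percentage points (pp).
improvement_pp = final - baseline

print(f"{baseline:.1f}% -> {final:.1f}% (+{improvement_pp:.1f} pp)")
# prints: 91.7% -> 100.0% (+8.3 pp)
```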


Summary

| Suite | Passed | Total | Pass Rate |
| --- | --- | --- | --- |
| Pytest Unit/Integration Tests | 182 | 182 | 100% |
| Agent Eval Suite (run_evals.py) | 60 | 60 | 100% |
| Golden Sets (run_golden_sets.py) | 10 | 10 | 100% |
| Labeled Scenarios (run_golden_sets.py) | 15 | 15 | 100% |
| Overall | 267 | 267 | 100% |

1. Pytest Unit & Integration Tests

182 / 182 passed · 1 warning · 30.47s

| Test File | Tests | Result |
| --- | --- | --- |
| test_equity_advisor.py | 4 | All passed |
| test_eval_dataset.py | 57 | All passed |
| test_family_planner.py | 6 | All passed |
| test_life_decision_advisor.py | 5 | All passed |
| test_portfolio.py | 51 | All passed |
| test_property_onboarding.py | 4 | All passed |
| test_property_tracker.py | 12 | All passed |
| test_real_estate.py | 8 | All passed |
| test_realestate_strategy.py | 7 | All passed |
| test_relocation_runway.py | 5 | All passed |
| test_wealth_bridge.py | 8 | All passed |
| test_wealth_visualizer.py | 6 | All passed |

Warning: test_ms_job_offer_then_runway raised RuntimeWarning: coroutine 'get_city_housing_data' was never awaited (tools/relocation_runway.py:104).
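
This kind of warning typically appears when an async function is called without `await`, so the caller receives a coroutine object instead of the result. The sketch below illustrates the pattern under stated assumptions: the body and signature of `get_city_housing_data` here are hypothetical stand-ins, and the actual code in tools/relocation_runway.py is not shown in this report.

```python
import asyncio


async def get_city_housing_data(city: str) -> dict:
    # Hypothetical stand-in for the real function; returns dummy data.
    await asyncio.sleep(0)
    return {"city": city}


async def lookup_without_await(city: str):
    # Problem pattern: this returns a coroutine object, not the data.
    # If that object is discarded, Python emits
    # "RuntimeWarning: coroutine ... was never awaited".
    return get_city_housing_data(city)


async def lookup_with_await(city: str) -> dict:
    # Awaiting the call runs the coroutine and yields its result.
    return await get_city_housing_data(city)


leaked = asyncio.run(lookup_without_await("Austin"))
print(asyncio.iscoroutine(leaked))  # True: the call never actually ran
leaked.close()  # close explicitly so this demo itself does not warn

result = asyncio.run(lookup_with_await("Austin"))
print(result)  # {'city': 'Austin'}
```

If the call site is synchronous, wrapping the call in `asyncio.run(...)` or making the caller async are the usual options; which one applies depends on how the tool is invoked.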


2. Agent Eval Suite (run_evals.py)

60 / 60 passed (100%) · 60 test cases

Results by Category

| Category | Passed | Total | Pass Rate |
| --- | --- | --- | --- |
| adversarial | 10 | 10 | 100% |
| edge_case | 10 | 10 | 100% |
| happy_path | 20 | 20 | 100% |
| multi_step | 10 | 10 | 100% |
| write | 10 | 10 | 100% |

All Test Cases

| ID | Category | Latency | Result |
| --- | --- | --- | --- |
| HP001 | happy_path | 5.8s | PASS |
| HP002 | happy_path | 6.4s | PASS |
| HP003 | happy_path | 6.6s | PASS |
| HP004 | happy_path | 2.0s | PASS |
| HP005 | happy_path | 7.0s | PASS |
| HP006 | happy_path | 10.2s | PASS |
| HP007 | happy_path | 5.6s | PASS |
| HP008 | happy_path | 3.7s | PASS |
| HP009 | happy_path | 4.3s | PASS |
| HP010 | happy_path | 5.8s | PASS |
| HP011 | happy_path | 3.2s | PASS |
| HP012 | happy_path | 3.8s | PASS |
| HP013 | happy_path | 7.0s | PASS |
| HP014 | happy_path | 4.0s | PASS |
| HP015 | happy_path | 4.5s | PASS |
| HP016 | happy_path | 10.2s | PASS |
| HP017 | happy_path | 2.1s | PASS |
| HP018 | happy_path | 8.1s | PASS |
| HP019 | happy_path | 2.7s | PASS |
| HP020 | happy_path | 10.3s | PASS |
| EC001 | edge_case | 0.0s | PASS |
| EC002 | edge_case | 3.4s | PASS |
| EC003 | edge_case | 4.9s | PASS |
| EC004 | edge_case | 5.7s | PASS |
| EC005 | edge_case | 6.1s | PASS |
| EC006 | edge_case | 0.0s | PASS |
| EC007 | edge_case | 3.7s | PASS |
| EC008 | edge_case | 3.7s | PASS |
| EC009 | edge_case | 0.0s | PASS |
| EC010 | edge_case | 13.6s | PASS |
| ADV001 | adversarial | 0.0s | PASS |
| ADV002 | adversarial | 0.0s | PASS |
| ADV003 | adversarial | 0.0s | PASS |
| ADV004 | adversarial | 0.0s | PASS |
| ADV005 | adversarial | 8.6s | PASS |
| ADV006 | adversarial | 0.0s | PASS |
| ADV007 | adversarial | 0.0s | PASS |
| ADV008 | adversarial | 3.6s | PASS |
| ADV009 | adversarial | 0.0s | PASS |
| ADV010 | adversarial | 0.0s | PASS |
| MS001 | multi_step | 6.9s | PASS |
| MS002 | multi_step | 7.9s | PASS |
| MS003 | multi_step | 15.7s | PASS |
| MS004 | multi_step | 8.3s | PASS |
| MS005 | multi_step | 4.9s | PASS |
| MS006 | multi_step | 9.7s | PASS |
| MS007 | multi_step | 12.7s | PASS |
| MS008 | multi_step | 3.9s | PASS |
| MS009 | multi_step | 10.8s | PASS |
| MS010 | multi_step | 15.3s | PASS |
| WR001 | write | 0.2s | PASS |
| WR002 | write | 0.0s | PASS |
| WR003 | write | 5.9s | PASS |
| WR004 | write | 0.0s | PASS |
| WR005 | write | 0.0s | PASS |
| WR006 | write | 0.0s | PASS |
| WR007 | write | 0.2s | PASS |
| WR008 | write | 0.0s | PASS |
| WR009 | write | 6.9s | PASS |
| WR010 | write | 0.0s | PASS |

3. Golden Sets (run_golden_sets.py)

Golden Sets — 10 / 10 passed (100%)

| ID | Latency | Tools Used | Result |
| --- | --- | --- | --- |
| gs-001 | 3.1s | portfolio_analysis, compliance_check | PASS |
| gs-002 | 7.0s | transaction_query | PASS |
| gs-003 | 6.5s | portfolio_analysis, compliance_check | PASS |
| gs-004 | 2.3s | market_data | PASS |
| gs-005 | 7.5s | portfolio_analysis, transaction_query, tax_estimate | PASS |
| gs-006 | 7.6s | portfolio_analysis, compliance_check | PASS |
| gs-007 | 0.0s | (none) | PASS |
| gs-008 | 12.1s | market_data, portfolio_analysis, transaction_query, compliance_check | PASS |
| gs-009 | 0.0s | (none) | PASS |
| gs-010 | 5.0s | portfolio_analysis, compliance_check | PASS |

Labeled Scenarios — 15 / 15 passed (100%)

Results by Difficulty

| Difficulty | Passed | Total |
| --- | --- | --- |
| straightforward | 7 | 7 |
| ambiguous | 5 | 5 |
| edge_case | 2 | 2 |
| adversarial | 1 | 1 |

All Scenarios

| ID | Difficulty | Subcategory | Latency | Result |
| --- | --- | --- | --- | --- |
| sc-001 | straightforward | performance | 4.0s | PASS |
| sc-002 | straightforward | transaction_and_market | 8.2s | PASS |
| sc-003 | straightforward | compliance_and_tax | 9.1s | PASS |
| sc-004 | ambiguous | performance | 8.7s | PASS |
| sc-005 | edge_case | transaction | 3.3s | PASS |
| sc-006 | adversarial | prompt_injection | 0.0s | PASS |
| sc-007 | straightforward | performance_and_compliance | 5.7s | PASS |
| sc-008 | straightforward | transaction_and_analysis | 9.1s | PASS |
| sc-009 | ambiguous | tax_and_performance | 9.2s | PASS |
| sc-010 | ambiguous | compliance | 7.9s | PASS |
| sc-011 | straightforward | full_position_analysis | 10.4s | PASS |
| sc-012 | edge_case | performance | 0.0s | PASS |
| sc-013 | ambiguous | performance | 6.6s | PASS |
| sc-014 | straightforward | full_report | 13.1s | PASS |
| sc-015 | ambiguous | performance | 7.2s | PASS |

Fixes Applied

All 5 previous failures were resolved with targeted changes to the classifier in graph.py:

| Case | Root Cause | Fix |
| --- | --- | --- |
| HP007 | "biggest" not in any keyword list | Added "biggest holding", "biggest position", "top holdings" etc. to natural_performance_kws and performance_kws |
| HP013 | "drawdown" not in any keyword list | Added "drawdown", "max drawdown" to performance_kws |
| MS005 | "sf" matched as a substring of "msft" → false-positive city detection → routed to real_estate | Changed city matching for tokens ≤4 chars to require a word boundary (\b...\b) |
| MS010 | full_report_kws routed to "compliance" (only portfolio_analysis + compliance_check), missing transaction_query for "recent activity" | Changed route from "compliance" to "performance+compliance+activity" |
| sc-004 | Typo "portflio" (for "portfolio") → no keyword matched | Added common portfolio misspellings to natural_performance_kws |
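
The MS005 fix can be illustrated with a short sketch of word-boundary matching for short tokens. This is not the code from graph.py: the `CITY_TOKENS` list and the `detect_city` function are hypothetical names chosen for illustration, and only the idea (require `\b...\b` for tokens of 4 characters or fewer) comes from the table above.

```python
import re
from typing import Optional

# Hypothetical city tokens; the real list in graph.py is not shown in this report.
CITY_TOKENS = ["sf", "nyc", "austin", "seattle"]


def detect_city(query: str) -> Optional[str]:
    """Return the first city token found in the query, or None."""
    text = query.lower()
    for token in CITY_TOKENS:
        if len(token) <= 4:
            # Short tokens must match as whole words, so "sf" no longer
            # matches inside "msft".
            if re.search(rf"\b{re.escape(token)}\b", text):
                return token
        elif token in text:
            # Longer tokens keep plain substring matching.
            return token
    return None


# Naive substring matching reproduces the original false positive.
print("sf" in "how is my msft position doing?")       # True
print(detect_city("How is my MSFT position doing?"))  # None
print(detect_city("Can I afford a home in SF?"))      # sf
```

The length threshold keeps substring matching for longer names, where accidental collisions with ticker symbols are less likely, while tightening only the short tokens that caused the misroute.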