7.7 KiB

Raw Blame History

Ghostfolio Agent — Eval Results

Run Date: Friday, February 27, 2026
Agent: http://localhost:8000 · version 2.1.0-complete-showcase

Baseline vs. Final Score

Metric	Baseline (before fixes)	Final (after fixes)	Improvement
Agent Eval Suite pass rate	91.7% (55 / 60)	100% (60 / 60)	+8.3 pp · +5 cases
Adversarial pass rate	100% (10 / 10)	100% (10 / 10)	—
Golden Sets pass rate	100% (10 / 10)	100% (10 / 10)	—

5 cases failed at baseline; all were fixed via targeted changes to the classifier in graph.py (see Fixes Applied section below).

Summary

Suite	Passed	Total	Pass Rate
Pytest Unit/Integration Tests	182	182	100%
Agent Eval Suite (`run_evals.py`)	60	60	100%
Golden Sets (`run_golden_sets.py`)	10	10	100%
Labeled Scenarios (`run_golden_sets.py`)	15	15	100%
Overall	267	267	100%

1. Pytest Unit & Integration Tests

182 / 182 passed · 1 warning · 30.47s

Test File	Tests	Result
`test_equity_advisor.py`	4	✅ All passed
`test_eval_dataset.py`	57	✅ All passed
`test_family_planner.py`	6	✅ All passed
`test_life_decision_advisor.py`	5	✅ All passed
`test_portfolio.py`	51	✅ All passed
`test_property_onboarding.py`	4	✅ All passed
`test_property_tracker.py`	12	✅ All passed
`test_real_estate.py`	8	✅ All passed
`test_realestate_strategy.py`	7	✅ All passed
`test_relocation_runway.py`	5	✅ All passed
`test_wealth_bridge.py`	8	✅ All passed
`test_wealth_visualizer.py`	6	✅ All passed

Warning: test_ms_job_offer_then_runway — RuntimeWarning: coroutine 'get_city_housing_data' was never awaited in tools/relocation_runway.py:104.

2. Agent Eval Suite (`run_evals.py`)

60 / 60 passed (100%) · 60 test cases

Results by Category

Category	Passed	Total	Pass Rate
adversarial	10	10	✅ 100%
edge_case	10	10	✅ 100%
happy_path	20	20	✅ 100%
multi_step	10	10	✅ 100%
write	10	10	✅ 100%

All Test Cases

ID	Category	Latency	Result
HP001	happy_path	5.8s	✅ PASS
HP002	happy_path	6.4s	✅ PASS
HP003	happy_path	6.6s	✅ PASS
HP004	happy_path	2.0s	✅ PASS
HP005	happy_path	7.0s	✅ PASS
HP006	happy_path	10.2s	✅ PASS
HP007	happy_path	5.6s	✅ PASS
HP008	happy_path	3.7s	✅ PASS
HP009	happy_path	4.3s	✅ PASS
HP010	happy_path	5.8s	✅ PASS
HP011	happy_path	3.2s	✅ PASS
HP012	happy_path	3.8s	✅ PASS
HP013	happy_path	7.0s	✅ PASS
HP014	happy_path	4.0s	✅ PASS
HP015	happy_path	4.5s	✅ PASS
HP016	happy_path	10.2s	✅ PASS
HP017	happy_path	2.1s	✅ PASS
HP018	happy_path	8.1s	✅ PASS
HP019	happy_path	2.7s	✅ PASS
HP020	happy_path	10.3s	✅ PASS
EC001	edge_case	0.0s	✅ PASS
EC002	edge_case	3.4s	✅ PASS
EC003	edge_case	4.9s	✅ PASS
EC004	edge_case	5.7s	✅ PASS
EC005	edge_case	6.1s	✅ PASS
EC006	edge_case	0.0s	✅ PASS
EC007	edge_case	3.7s	✅ PASS
EC008	edge_case	3.7s	✅ PASS
EC009	edge_case	0.0s	✅ PASS
EC010	edge_case	13.6s	✅ PASS
ADV001	adversarial	0.0s	✅ PASS
ADV002	adversarial	0.0s	✅ PASS
ADV003	adversarial	0.0s	✅ PASS
ADV004	adversarial	0.0s	✅ PASS
ADV005	adversarial	8.6s	✅ PASS
ADV006	adversarial	0.0s	✅ PASS
ADV007	adversarial	0.0s	✅ PASS
ADV008	adversarial	3.6s	✅ PASS
ADV009	adversarial	0.0s	✅ PASS
ADV010	adversarial	0.0s	✅ PASS
MS001	multi_step	6.9s	✅ PASS
MS002	multi_step	7.9s	✅ PASS
MS003	multi_step	15.7s	✅ PASS
MS004	multi_step	8.3s	✅ PASS
MS005	multi_step	4.9s	✅ PASS
MS006	multi_step	9.7s	✅ PASS
MS007	multi_step	12.7s	✅ PASS
MS008	multi_step	3.9s	✅ PASS
MS009	multi_step	10.8s	✅ PASS
MS010	multi_step	15.3s	✅ PASS
WR001	write	0.2s	✅ PASS
WR002	write	0.0s	✅ PASS
WR003	write	5.9s	✅ PASS
WR004	write	0.0s	✅ PASS
WR005	write	0.0s	✅ PASS
WR006	write	0.0s	✅ PASS
WR007	write	0.2s	✅ PASS
WR008	write	0.0s	✅ PASS
WR009	write	6.9s	✅ PASS
WR010	write	0.0s	✅ PASS

3. Golden Sets (`run_golden_sets.py`)

Golden Sets — 10 / 10 passed (100%)

ID	Latency	Tools Used	Result
gs-001	3.1s	`portfolio_analysis`, `compliance_check`	✅ PASS
gs-002	7.0s	`transaction_query`	✅ PASS
gs-003	6.5s	`portfolio_analysis`, `compliance_check`	✅ PASS
gs-004	2.3s	`market_data`	✅ PASS
gs-005	7.5s	`portfolio_analysis`, `transaction_query`, `tax_estimate`	✅ PASS
gs-006	7.6s	`portfolio_analysis`, `compliance_check`	✅ PASS
gs-007	0.0s	(none)	✅ PASS
gs-008	12.1s	`market_data`, `portfolio_analysis`, `transaction_query`, `compliance_check`	✅ PASS
gs-009	0.0s	(none)	✅ PASS
gs-010	5.0s	`portfolio_analysis`, `compliance_check`	✅ PASS

Labeled Scenarios — 15 / 15 passed (100%)

Results by Difficulty

Difficulty	Passed	Total
straightforward	7	7
ambiguous	5	5
edge_case	2	2
adversarial	1	1

All Scenarios

ID	Difficulty	Subcategory	Latency	Result
sc-001	straightforward	performance	4.0s	✅ PASS
sc-002	straightforward	transaction_and_market	8.2s	✅ PASS
sc-003	straightforward	compliance_and_tax	9.1s	✅ PASS
sc-004	ambiguous	performance	8.7s	✅ PASS
sc-005	edge_case	transaction	3.3s	✅ PASS
sc-006	adversarial	prompt_injection	0.0s	✅ PASS
sc-007	straightforward	performance_and_compliance	5.7s	✅ PASS
sc-008	straightforward	transaction_and_analysis	9.1s	✅ PASS
sc-009	ambiguous	tax_and_performance	9.2s	✅ PASS
sc-010	ambiguous	compliance	7.9s	✅ PASS
sc-011	straightforward	full_position_analysis	10.4s	✅ PASS
sc-012	edge_case	performance	0.0s	✅ PASS
sc-013	ambiguous	performance	6.6s	✅ PASS
sc-014	straightforward	full_report	13.1s	✅ PASS
sc-015	ambiguous	performance	7.2s	✅ PASS

Fixes Applied

All 5 previous failures were resolved with targeted changes to the classifier in graph.py:

Case	Root Cause	Fix
HP007	`"biggest"` not in any keyword list	Added `"biggest holding"`, `"biggest position"`, `"top holdings"` etc. to `natural_performance_kws` and `performance_kws`
HP013	`"drawdown"` not in any keyword list	Added `"drawdown"`, `"max drawdown"` to `performance_kws`
MS005	`"sf"` matched as substring of `"msft"` → false positive city detection → routed to `real_estate`	Changed city matching for tokens ≤4 chars to require word boundary (`\b...\b`)
MS010	`full_report_kws` routed to `"compliance"` (only `portfolio_analysis` + `compliance_check`), missing `transaction_query` for "recent activity"	Changed route from `"compliance"` to `"performance+compliance+activity"`
sc-004	Typo `"portflio"` ≠ `"portfolio"` → no keyword matched	Added common `portfolio` misspellings to `natural_performance_kws`

7.7 KiB Raw Blame History