diff --git a/Tasks.md b/Tasks.md
index 88fffdb05..0253ba8d5 100644
--- a/Tasks.md
+++ b/Tasks.md
@@ -14,7 +14,7 @@ Last updated: 2026-02-24
 | T-006 | Full eval dataset (50+) | Complete | `apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.spec.ts` | Local implementation |
 | T-007 | Observability wiring (LangSmith traces and metrics) | Complete | `apps/api/src/app/endpoints/ai/ai.service.spec.ts`, `apps/api/src/app/endpoints/ai/ai-feedback.service.spec.ts`, `apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.spec.ts` | Local implementation |
 | T-008 | Deployment and submission bundle | Complete | `npm run test:ai` + Railway healthcheck + submission docs checklist | `2b6506de8` |
-| T-009 | Open source eval framework contribution | Ready for Publish | `@ghostfolio/finance-agent-evals` package scaffold + dataset export + smoke/pack checks | thoughts/shared/plans/open-source-eval-framework.md |
+| T-009 | Open source eval framework contribution | In Review | `@ghostfolio/finance-agent-evals` package scaffold + dataset export + smoke/pack checks | openai/evals PR #1625 + langchain PR #35421 |

 ## Notes

@@ -27,6 +27,9 @@ Last updated: 2026-02-24
 - Reply quality gate (2026-02-24): `npm run test:ai:quality` added with deterministic anti-disclaimer and actionability checks.
 - Eval quality metrics (2026-02-24): hallucination-rate (`<=5%`) and verification-accuracy (`>=90%`) tracked and asserted in MVP eval suite.
 - Open-source package scaffold (2026-02-24): `tools/evals/finance-agent-evals/` with dataset export, runner, smoke test, and pack dry-run.
+- External OSS PRs (2026-02-24):
+  - https://github.com/openai/evals/pull/1625
+  - https://github.com/langchain-ai/langchain/pull/35421
 - Condensed architecture doc (2026-02-24): `docs/ARCHITECTURE-CONDENSED.md`.
 - Railway crash recovery (2026-02-23): `railway.toml` start command corrected to `node dist/apps/api/main.js`, deployed to Railway (`4f26063a-97e5-43dd-b2dd-360e9e12a951`), and validated with production health check.
 - Tool gating hardening (2026-02-24): planner unknown-intent fallback changed to no-tools, executor policy gate added (`direct|tools|clarify`), and policy metrics emitted via verification and observability logs.
diff --git a/docs/CRITICAL-REQUIREMENTS-STATUS.md b/docs/CRITICAL-REQUIREMENTS-STATUS.md
index c743aa9eb..2d11dad92 100644
--- a/docs/CRITICAL-REQUIREMENTS-STATUS.md
+++ b/docs/CRITICAL-REQUIREMENTS-STATUS.md
@@ -94,7 +94,12 @@ These are still outstanding at submission level:

 - Demo video (3-5 min)
 - Social post with `@GauntletAI`
-- Open-source release link (local scaffold complete at `tools/evals/finance-agent-evals/`, external publish/PR link still pending)
+- Open-source package npm publish (local scaffold complete at `tools/evals/finance-agent-evals/`, upstream PRs opened)
+
+External contribution links:
+
+- https://github.com/openai/evals/pull/1625
+- https://github.com/langchain-ai/langchain/pull/35421

 Open-source scaffold verification commands:

diff --git a/docs/tasks/tasks.md b/docs/tasks/tasks.md
index ec7ca2d5c..48bc6b3b1 100644
--- a/docs/tasks/tasks.md
+++ b/docs/tasks/tasks.md
@@ -14,7 +14,7 @@ Last updated: 2026-02-24
 | T-006 | Full eval dataset (50+) | Complete | `apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.spec.ts` | Local implementation |
 | T-007 | Observability wiring (LangSmith traces and metrics) | Complete | `apps/api/src/app/endpoints/ai/ai.service.spec.ts`, `apps/api/src/app/endpoints/ai/ai-feedback.service.spec.ts`, `apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.spec.ts` | Local implementation |
 | T-008 | Deployment and submission bundle | Complete | `npm run test:ai` + Railway healthcheck + submission docs checklist | `2b6506de8` |
-| T-009 | Open source eval framework contribution | Ready for Publish | `@ghostfolio/finance-agent-evals` package scaffold + dataset export + smoke/pack checks | `thoughts/shared/plans/open-source-eval-framework.md` |
+| T-009 | Open source eval framework contribution | In Review | `@ghostfolio/finance-agent-evals` package scaffold + dataset export + smoke/pack checks | openai/evals PR #1625 + langchain PR #35421 |

 ## Notes

diff --git a/tasks/lessons.md b/tasks/lessons.md
index 5f9c5238a..0b7243c73 100644
--- a/tasks/lessons.md
+++ b/tasks/lessons.md
@@ -31,3 +31,7 @@ Updated: 2026-02-24
 7. Context: AI routing hardening in deterministic tool orchestration
    Mistake: Considered model-structured output guards before validating actual failure surface
    Rule: When tool routing is deterministic, prioritize planner fallback correctness and executor policy gating before adding LLM classifier layers.
+
+8. Context: Open-source submission strategy after publish constraints
+   Mistake: Treated npm publication as the only completion path for contribution evidence
+   Rule: When package publication is blocked, ship the tool in-repo and open upstream PRs in high-signal repositories to preserve external contribution progress.
diff --git a/tasks/tasks.md b/tasks/tasks.md
index 8debd45e0..7d56682ef 100644
--- a/tasks/tasks.md
+++ b/tasks/tasks.md
@@ -27,7 +27,7 @@ Last updated: 2026-02-24
 | T-006 | Full eval dataset (50+) | Complete | `apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.spec.ts` | Local implementation |
 | T-007 | Observability wiring (LangSmith traces and metrics) | Complete | `apps/api/src/app/endpoints/ai/ai.service.spec.ts`, `apps/api/src/app/endpoints/ai/ai-feedback.service.spec.ts`, `apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.spec.ts` | Local implementation |
 | T-008 | Deployment and submission bundle | Complete | `npm run test:ai` + Railway healthcheck + submission docs checklist | `2b6506de8` |
-| T-009 | Open source eval framework contribution | Ready for Publish | `@ghostfolio/finance-agent-evals` package scaffold + dataset export + smoke/pack checks | `thoughts/shared/plans/open-source-eval-framework.md` |
+| T-009 | Open source eval framework contribution | In Review | `@ghostfolio/finance-agent-evals` package scaffold + dataset export + smoke/pack checks | openai/evals PR #1625 + langchain PR #35421 |

 ## Notes

@@ -185,6 +185,14 @@ Last updated: 2026-02-24
 - [x] Add/adjust unit tests for planner fallback, policy enforcement, and no-tool execution path.
 - [x] Run focused verification (`npm run test:ai`, `npm run test:mvp-eval`) and capture evidence.

+## Session Plan (2026-02-24, OSS Publish + External PRs)
+
+- [x] Confirm in-repo open-source tool package is committed and documented for direct repo consumption.
+- [x] Create high-signal upstream PR in `openai/evals` fork with finance-agent eval dataset integration docs/template.
+- [x] Create high-signal upstream PR in `langchain` fork with a focused docs contribution for finance-agent eval interoperability.
+- [x] Push fork branches and open PRs against upstream repositories.
+- [x] Update `Tasks.md` and plan artifact with PR links and current status.
+
 ## Verification Notes

 - `nx run api:lint` completed successfully (existing workspace warnings only).
diff --git a/thoughts/shared/plans/open-source-eval-framework.md b/thoughts/shared/plans/open-source-eval-framework.md
index 38dba81aa..24ad4d647 100644
--- a/thoughts/shared/plans/open-source-eval-framework.md
+++ b/thoughts/shared/plans/open-source-eval-framework.md
@@ -1,6 +1,6 @@
 # Open Source Eval Framework Contribution Plan

-**Status:** In Progress (Track 1 scaffold complete locally)
+**Status:** In Progress (Track 1 complete locally, external PRs opened)
 **Priority:** High
 **Task:** Publish 53-case eval framework as open source package
 **Created:** 2026-02-24
@@ -20,9 +20,13 @@ Completed locally:
 Remaining for external completion:

 - Publish npm package
-- Open PR to LangChain
 - Submit benchmark/dataset links

+External PR evidence:
+
+- OpenAI Evals: https://github.com/openai/evals/pull/1625
+- LangChain: https://github.com/langchain-ai/langchain/pull/35421
+
 ## Overview

 Contribute the Ghostfolio AI Agent's 53-case evaluation framework to the open source community, meeting the Gauntlet G4 open source contribution requirement.