docs: record external OSS PR evidence and update task status

5 months ago · 3479a0eb24
6 changed files with 30 additions and 6 deletions
--- a/Tasks.md
+++ b/Tasks.md
@@ -14,7 +14,7 @@ Last updated: 2026-02-24
 | T-006 | Full eval dataset (50+) | Complete | `apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.spec.ts` | Local implementation |
 | T-007 | Observability wiring (LangSmith traces and metrics) | Complete | `apps/api/src/app/endpoints/ai/ai.service.spec.ts`, `apps/api/src/app/endpoints/ai/ai-feedback.service.spec.ts`, `apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.spec.ts` | Local implementation |
 | T-008 | Deployment and submission bundle | Complete | `npm run test:ai` + Railway healthcheck + submission docs checklist | `2b6506de8` |
-| T-009 | Open source eval framework contribution | Ready for Publish | `@ghostfolio/finance-agent-evals` package scaffold + dataset export + smoke/pack checks | thoughts/shared/plans/open-source-eval-framework.md |
+| T-009 | Open source eval framework contribution | In Review | `@ghostfolio/finance-agent-evals` package scaffold + dataset export + smoke/pack checks | openai/evals PR #1625 + langchain PR #35421 |

 ## Notes

@@ -27,6 +27,9 @@ Last updated: 2026-02-24
 - Reply quality gate (2026-02-24): `npm run test:ai:quality` added with deterministic anti-disclaimer and actionability checks.
 - Eval quality metrics (2026-02-24): hallucination-rate (`<=5%`) and verification-accuracy (`>=90%`) tracked and asserted in MVP eval suite.
 - Open-source package scaffold (2026-02-24): `tools/evals/finance-agent-evals/` with dataset export, runner, smoke test, and pack dry-run.
+- External OSS PRs (2026-02-24):
+  - https://github.com/openai/evals/pull/1625
+  - https://github.com/langchain-ai/langchain/pull/35421
 - Condensed architecture doc (2026-02-24): `docs/ARCHITECTURE-CONDENSED.md`.
 - Railway crash recovery (2026-02-23): `railway.toml` start command corrected to `node dist/apps/api/main.js`, deployed to Railway (`4f26063a-97e5-43dd-b2dd-360e9e12a951`), and validated with production health check.
 - Tool gating hardening (2026-02-24): planner unknown-intent fallback changed to no-tools, executor policy gate added (`direct|tools|clarify`), and policy metrics emitted via verification and observability logs.
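The tool gating note above describes two behaviors: the planner falls back to no tools on an unknown intent, and the executor applies a policy gate with the values `direct|tools|clarify`. A minimal sketch of how those two pieces could fit together follows. Every name here (`planTools`, `gate`, `KNOWN_INTENTS`, the `portfolio_analysis` tool, the `queryIsAmbiguous` flag) is hypothetical and not taken from the repository; only the three policy values and the no-tools fallback come from the note.

```typescript
// Sketch only: names and the intent table are assumptions, not the repo's code.
type Policy = "direct" | "tools" | "clarify";

interface Plan {
  intent: string;
  tools: string[];
}

// Hypothetical intent-to-tool table; the real planner's mapping is not shown in this commit.
const KNOWN_INTENTS = new Map<string, string[]>([
  ["portfolio_summary", ["portfolio_analysis"]],
  ["greeting", []],
]);

// Planner: an intent missing from the table falls back to an empty tool list.
function planTools(intent: string): Plan {
  const tools = KNOWN_INTENTS.get(intent) ?? [];
  return { intent, tools: [...tools] };
}

// Executor policy gate: ask for clarification first, then decide between
// calling tools and answering directly.
function gate(plan: Plan, queryIsAmbiguous: boolean): Policy {
  if (queryIsAmbiguous) {
    return "clarify";
  }
  return plan.tools.length > 0 ? "tools" : "direct";
}
```

Under these assumptions, `gate(planTools("something_unrecognized"), false)` yields `"direct"`, so an unrecognized intent can never trigger a tool call on its own.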
--- a/docs/CRITICAL-REQUIREMENTS-STATUS.md
+++ b/docs/CRITICAL-REQUIREMENTS-STATUS.md
@@ -94,7 +94,12 @@ These are still outstanding at submission level:

 - Demo video (3-5 min)
 - Social post with `@GauntletAI`
-- Open-source release link (local scaffold complete at `tools/evals/finance-agent-evals/`, external publish/PR link still pending)
+- Open-source package npm publish (local scaffold complete at `tools/evals/finance-agent-evals/`, upstream PRs opened)
+
+External contribution links:
+
+- https://github.com/openai/evals/pull/1625
+- https://github.com/langchain-ai/langchain/pull/35421

 Open-source scaffold verification commands:

--- a/docs/tasks/tasks.md
+++ b/docs/tasks/tasks.md
@@ -14,7 +14,7 @@ Last updated: 2026-02-24
 | T-006 | Full eval dataset (50+) | Complete | `apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.spec.ts` | Local implementation |
 | T-007 | Observability wiring (LangSmith traces and metrics) | Complete | `apps/api/src/app/endpoints/ai/ai.service.spec.ts`, `apps/api/src/app/endpoints/ai/ai-feedback.service.spec.ts`, `apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.spec.ts` | Local implementation |
 | T-008 | Deployment and submission bundle | Complete | `npm run test:ai` + Railway healthcheck + submission docs checklist | `2b6506de8` |
-| T-009 | Open source eval framework contribution | Ready for Publish | `@ghostfolio/finance-agent-evals` package scaffold + dataset export + smoke/pack checks | `thoughts/shared/plans/open-source-eval-framework.md` |
+| T-009 | Open source eval framework contribution | In Review | `@ghostfolio/finance-agent-evals` package scaffold + dataset export + smoke/pack checks | openai/evals PR #1625 + langchain PR #35421 |

 ## Notes

--- a/tasks/lessons.md
+++ b/tasks/lessons.md
@@ -31,3 +31,7 @@ Updated: 2026-02-24
 7. Context: AI routing hardening in deterministic tool orchestration
   Mistake: Considered model-structured output guards before validating actual failure surface
   Rule: When tool routing is deterministic, prioritize planner fallback correctness and executor policy gating before adding LLM classifier layers.
+
+8. Context: Open-source submission strategy after publish constraints
+   Mistake: Treated npm publication as the only completion path for contribution evidence
+   Rule: When package publication is blocked, ship the tool in-repo and open upstream PRs in high-signal repositories to preserve external contribution progress.
--- a/tasks/tasks.md
+++ b/tasks/tasks.md
@@ -27,7 +27,7 @@ Last updated: 2026-02-24
 | T-006 | Full eval dataset (50+) | Complete | `apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.spec.ts` | Local implementation |
 | T-007 | Observability wiring (LangSmith traces and metrics) | Complete | `apps/api/src/app/endpoints/ai/ai.service.spec.ts`, `apps/api/src/app/endpoints/ai/ai-feedback.service.spec.ts`, `apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.spec.ts` | Local implementation |
 | T-008 | Deployment and submission bundle | Complete | `npm run test:ai` + Railway healthcheck + submission docs checklist | `2b6506de8` |
-| T-009 | Open source eval framework contribution | Ready for Publish | `@ghostfolio/finance-agent-evals` package scaffold + dataset export + smoke/pack checks | `thoughts/shared/plans/open-source-eval-framework.md` |
+| T-009 | Open source eval framework contribution | In Review | `@ghostfolio/finance-agent-evals` package scaffold + dataset export + smoke/pack checks | openai/evals PR #1625 + langchain PR #35421 |

 ## Notes

@@ -185,6 +185,14 @@ Last updated: 2026-02-24
 - [x] Add/adjust unit tests for planner fallback, policy enforcement, and no-tool execution path.
 - [x] Run focused verification (`npm run test:ai`, `npm run test:mvp-eval`) and capture evidence.

+## Session Plan (2026-02-24, OSS Publish + External PRs)
+
+- [x] Confirm in-repo open-source tool package is committed and documented for direct repo consumption.
+- [x] Create high-signal upstream PR in `openai/evals` fork with finance-agent eval dataset integration docs/template.
+- [x] Create high-signal upstream PR in `langchain` fork with a focused docs contribution for finance-agent eval interoperability.
+- [x] Push fork branches and open PRs against upstream repositories.
+- [x] Update `Tasks.md` and plan artifact with PR links and current status.
+
 ## Verification Notes

 - `nx run api:lint` completed successfully (existing workspace warnings only).
--- a/thoughts/shared/plans/open-source-eval-framework.md
+++ b/thoughts/shared/plans/open-source-eval-framework.md
@@ -1,6 +1,6 @@
 # Open Source Eval Framework Contribution Plan

-**Status:** In Progress (Track 1 scaffold complete locally)
+**Status:** In Progress (Track 1 complete locally, external PRs opened)
 **Priority:** High
 **Task:** Publish 53-case eval framework as open source package
 **Created:** 2026-02-24
@@ -20,9 +20,13 @@ Completed locally:
 Remaining for external completion:

 - Publish npm package
-- Open PR to LangChain
 - Submit benchmark/dataset links

+External PR evidence:
+
+- OpenAI Evals: https://github.com/openai/evals/pull/1625
+- LangChain: https://github.com/langchain-ai/langchain/pull/35421
+
 ## Overview

 Contribute the Ghostfolio AI Agent's 53-case evaluation framework to the open source community, meeting the Gauntlet G4 open source contribution requirement.
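
The `Tasks.md` notes in this commit record two quality thresholds for the eval suite: a hallucination rate of at most 5% and a verification accuracy of at least 90%. As a rough illustration of what asserting those thresholds over a set of case results might look like, here is a minimal sketch. The `EvalCaseResult` shape, the `summarize` function, and the definition of each metric as a simple fraction of cases are assumptions for illustration, not the framework's actual implementation.

```typescript
// Sketch only: result shape and metric definitions are assumptions.
interface EvalCaseResult {
  hallucinated: boolean;        // assumed per-case flag
  verificationCorrect: boolean; // assumed per-case flag
}

// Thresholds as stated in the notes.
const MAX_HALLUCINATION_RATE = 0.05;
const MIN_VERIFICATION_ACCURACY = 0.9;

interface EvalSummary {
  hallucinationRate: number;
  verificationAccuracy: number;
  passed: boolean;
}

function summarize(results: EvalCaseResult[]): EvalSummary {
  if (results.length === 0) {
    throw new Error("no eval results to summarize");
  }
  const total = results.length;
  const hallucinationRate =
    results.filter((r) => r.hallucinated).length / total;
  const verificationAccuracy =
    results.filter((r) => r.verificationCorrect).length / total;
  return {
    hallucinationRate,
    verificationAccuracy,
    passed:
      hallucinationRate <= MAX_HALLUCINATION_RATE &&
      verificationAccuracy >= MIN_VERIFICATION_ACCURACY,
  };
}
```

With a 53-case dataset and these definitions, two hallucinated cases (2/53, about 3.8%) would still pass, while three (3/53, about 5.7%) would fail the gate.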