
docs: record external OSS PR evidence and update task status

pull/6395/head
Max P 1 month ago
parent
commit
3479a0eb24
  1. Tasks.md (5 lines changed)
  2. docs/CRITICAL-REQUIREMENTS-STATUS.md (7 lines changed)
  3. docs/tasks/tasks.md (2 lines changed)
  4. tasks/lessons.md (4 lines changed)
  5. tasks/tasks.md (10 lines changed)
  6. thoughts/shared/plans/open-source-eval-framework.md (8 lines changed)

5
Tasks.md

@@ -14,7 +14,7 @@ Last updated: 2026-02-24
 | T-006 | Full eval dataset (50+) | Complete | `apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.spec.ts` | Local implementation |
 | T-007 | Observability wiring (LangSmith traces and metrics) | Complete | `apps/api/src/app/endpoints/ai/ai.service.spec.ts`, `apps/api/src/app/endpoints/ai/ai-feedback.service.spec.ts`, `apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.spec.ts` | Local implementation |
 | T-008 | Deployment and submission bundle | Complete | `npm run test:ai` + Railway healthcheck + submission docs checklist | `2b6506de8` |
-| T-009 | Open source eval framework contribution | Ready for Publish | `@ghostfolio/finance-agent-evals` package scaffold + dataset export + smoke/pack checks | thoughts/shared/plans/open-source-eval-framework.md |
+| T-009 | Open source eval framework contribution | In Review | `@ghostfolio/finance-agent-evals` package scaffold + dataset export + smoke/pack checks | openai/evals PR #1625 + langchain PR #35421 |
 ## Notes
@@ -27,6 +27,9 @@ Last updated: 2026-02-24
 - Reply quality gate (2026-02-24): `npm run test:ai:quality` added with deterministic anti-disclaimer and actionability checks.
 - Eval quality metrics (2026-02-24): hallucination-rate (`<=5%`) and verification-accuracy (`>=90%`) tracked and asserted in MVP eval suite.
 - Open-source package scaffold (2026-02-24): `tools/evals/finance-agent-evals/` with dataset export, runner, smoke test, and pack dry-run.
+- External OSS PRs (2026-02-24):
+- https://github.com/openai/evals/pull/1625
+- https://github.com/langchain-ai/langchain/pull/35421
 - Condensed architecture doc (2026-02-24): `docs/ARCHITECTURE-CONDENSED.md`.
 - Railway crash recovery (2026-02-23): `railway.toml` start command corrected to `node dist/apps/api/main.js`, deployed to Railway (`4f26063a-97e5-43dd-b2dd-360e9e12a951`), and validated with production health check.
 - Tool gating hardening (2026-02-24): planner unknown-intent fallback changed to no-tools, executor policy gate added (`direct|tools|clarify`), and policy metrics emitted via verification and observability logs.
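
The tool gating note above describes a planner that falls back to no tools on an unknown intent and an executor gate that picks one of `direct|tools|clarify`. A minimal sketch of that flow might look like the following. Every identifier here (`PlannerResult`, `planTools`, `applyPolicyGate`, the intent and tool names) is hypothetical and chosen for illustration; none of them is taken from the repository, and the real routing rules may differ.

```typescript
// Hypothetical sketch: names and routing rules are assumptions, not the repo's code.
type PolicyRoute = 'direct' | 'tools' | 'clarify';

interface PlannerResult {
  intent: string;   // resolved intent, or 'unknown' when nothing matched
  tools: string[];  // tools the planner proposes to call
}

// Illustrative intent-to-tool mapping (assumed, not from the source).
const KNOWN_INTENT_TOOLS = new Map<string, string[]>([
  ['portfolio_summary', ['getPortfolio']],
  ['market_quote', ['getQuote']]
]);

// Planner: an unrecognized intent falls back to an empty tool list.
function planTools(intent: string): PlannerResult {
  const tools = KNOWN_INTENT_TOOLS.get(intent);
  if (tools === undefined) {
    return { intent: 'unknown', tools: [] };
  }
  return { intent, tools: [...tools] };
}

// Executor policy gate: ambiguous requests ask for clarification,
// plans with tools run them, everything else is answered directly.
function applyPolicyGate(plan: PlannerResult, isAmbiguous: boolean): PolicyRoute {
  if (isAmbiguous) {
    return 'clarify';
  }
  return plan.tools.length > 0 ? 'tools' : 'direct';
}
```

Under these assumptions, `applyPolicyGate(planTools('weather'), false)` resolves to `'direct'` because the unknown intent yields no tools, which is the fallback behavior the note records.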

7
docs/CRITICAL-REQUIREMENTS-STATUS.md

@@ -94,7 +94,12 @@ These are still outstanding at submission level:
 - Demo video (3-5 min)
 - Social post with `@GauntletAI`
-- Open-source release link (local scaffold complete at `tools/evals/finance-agent-evals/`, external publish/PR link still pending)
+- Open-source package npm publish (local scaffold complete at `tools/evals/finance-agent-evals/`, upstream PRs opened)
+External contribution links:
+- https://github.com/openai/evals/pull/1625
+- https://github.com/langchain-ai/langchain/pull/35421
 Open-source scaffold verification commands:

2
docs/tasks/tasks.md

@@ -14,7 +14,7 @@ Last updated: 2026-02-24
 | T-006 | Full eval dataset (50+) | Complete | `apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.spec.ts` | Local implementation |
 | T-007 | Observability wiring (LangSmith traces and metrics) | Complete | `apps/api/src/app/endpoints/ai/ai.service.spec.ts`, `apps/api/src/app/endpoints/ai/ai-feedback.service.spec.ts`, `apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.spec.ts` | Local implementation |
 | T-008 | Deployment and submission bundle | Complete | `npm run test:ai` + Railway healthcheck + submission docs checklist | `2b6506de8` |
-| T-009 | Open source eval framework contribution | Ready for Publish | `@ghostfolio/finance-agent-evals` package scaffold + dataset export + smoke/pack checks | `thoughts/shared/plans/open-source-eval-framework.md` |
+| T-009 | Open source eval framework contribution | In Review | `@ghostfolio/finance-agent-evals` package scaffold + dataset export + smoke/pack checks | openai/evals PR #1625 + langchain PR #35421 |
 ## Notes

4
tasks/lessons.md

@@ -31,3 +31,7 @@ Updated: 2026-02-24
 7. Context: AI routing hardening in deterministic tool orchestration
 Mistake: Considered model-structured output guards before validating actual failure surface
 Rule: When tool routing is deterministic, prioritize planner fallback correctness and executor policy gating before adding LLM classifier layers.
+8. Context: Open-source submission strategy after publish constraints
+Mistake: Treated npm publication as the only completion path for contribution evidence
+Rule: When package publication is blocked, ship the tool in-repo and open upstream PRs in high-signal repositories to preserve external contribution progress.

10
tasks/tasks.md

@@ -27,7 +27,7 @@ Last updated: 2026-02-24
 | T-006 | Full eval dataset (50+) | Complete | `apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.spec.ts` | Local implementation |
 | T-007 | Observability wiring (LangSmith traces and metrics) | Complete | `apps/api/src/app/endpoints/ai/ai.service.spec.ts`, `apps/api/src/app/endpoints/ai/ai-feedback.service.spec.ts`, `apps/api/src/app/endpoints/ai/evals/mvp-eval.runner.spec.ts` | Local implementation |
 | T-008 | Deployment and submission bundle | Complete | `npm run test:ai` + Railway healthcheck + submission docs checklist | `2b6506de8` |
-| T-009 | Open source eval framework contribution | Ready for Publish | `@ghostfolio/finance-agent-evals` package scaffold + dataset export + smoke/pack checks | `thoughts/shared/plans/open-source-eval-framework.md` |
+| T-009 | Open source eval framework contribution | In Review | `@ghostfolio/finance-agent-evals` package scaffold + dataset export + smoke/pack checks | openai/evals PR #1625 + langchain PR #35421 |
 ## Notes
@@ -185,6 +185,14 @@ Last updated: 2026-02-24
 - [x] Add/adjust unit tests for planner fallback, policy enforcement, and no-tool execution path.
 - [x] Run focused verification (`npm run test:ai`, `npm run test:mvp-eval`) and capture evidence.
+## Session Plan (2026-02-24, OSS Publish + External PRs)
+- [x] Confirm in-repo open-source tool package is committed and documented for direct repo consumption.
+- [x] Create high-signal upstream PR in `openai/evals` fork with finance-agent eval dataset integration docs/template.
+- [x] Create high-signal upstream PR in `langchain` fork with a focused docs contribution for finance-agent eval interoperability.
+- [x] Push fork branches and open PRs against upstream repositories.
+- [x] Update `Tasks.md` and plan artifact with PR links and current status.
 ## Verification Notes
 - `nx run api:lint` completed successfully (existing workspace warnings only).
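
The eval quality thresholds recorded in the `Tasks.md` notes (hallucination rate `<=5%`, verification accuracy `>=90%`) could be asserted along the lines of the sketch below. This is an illustration only: `EvalCaseResult`, `computeQualityMetrics`, and `passesQualityGate` are hypothetical names, and how the actual MVP eval suite computes or asserts these metrics is not shown in this commit.

```typescript
// Hypothetical sketch: the per-case result shape and function names are assumptions.
interface EvalCaseResult {
  hallucinated: boolean;         // the reply contained an unsupported claim
  verificationCorrect: boolean;  // the verification step reached the right verdict
}

interface QualityMetrics {
  hallucinationRate: number;     // fraction of cases in [0, 1]
  verificationAccuracy: number;  // fraction of cases in [0, 1]
}

// Thresholds taken from the notes: <=5% and >=90%.
const MAX_HALLUCINATION_RATE = 0.05;
const MIN_VERIFICATION_ACCURACY = 0.9;

function computeQualityMetrics(results: EvalCaseResult[]): QualityMetrics {
  if (results.length === 0) {
    throw new Error('cannot compute metrics for an empty result set');
  }
  const hallucinated = results.filter((r) => r.hallucinated).length;
  const verified = results.filter((r) => r.verificationCorrect).length;
  return {
    hallucinationRate: hallucinated / results.length,
    verificationAccuracy: verified / results.length
  };
}

// The gate passes only when both thresholds hold.
function passesQualityGate(metrics: QualityMetrics): boolean {
  return (
    metrics.hallucinationRate <= MAX_HALLUCINATION_RATE &&
    metrics.verificationAccuracy >= MIN_VERIFICATION_ACCURACY
  );
}
```

In a test runner, the boolean returned by `passesQualityGate` would typically be wrapped in the suite's own assertion so that a regression in either metric fails the run.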

8
thoughts/shared/plans/open-source-eval-framework.md

@@ -1,6 +1,6 @@
 # Open Source Eval Framework Contribution Plan
-**Status:** In Progress (Track 1 scaffold complete locally)
+**Status:** In Progress (Track 1 complete locally, external PRs opened)
 **Priority:** High
 **Task:** Publish 53-case eval framework as open source package
 **Created:** 2026-02-24
@@ -20,9 +20,13 @@ Completed locally:
 Remaining for external completion:
 - Publish npm package
-- Open PR to LangChain
 - Submit benchmark/dataset links
+External PR evidence:
+- OpenAI Evals: https://github.com/openai/evals/pull/1625
+- LangChain: https://github.com/langchain-ai/langchain/pull/35421
 ## Overview
 Contribute the Ghostfolio AI Agent's 53-case evaluation framework to the open source community, meeting the Gauntlet G4 open source contribution requirement.
