task 11 reproducibility manifest#6
Conversation
There was a problem hiding this comment.
@luistafoi if results are fine then comment "good to merge"
There was a problem hiding this comment.
@luistafoi if results are fine then comment "good to merge"
There was a problem hiding this comment.
@luistafoi if results are fine then comment "good to merge"
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a reproducibility bundle for the Chem2TextQA “gold” release by freezing prompts/scripting and documenting the exact environment/models/randomness assumptions so reviewers can verify the pipeline can be replayed.
Changes:
- Add frozen Phase 1–3 prompt/script snapshots under
reproducibility_manifest/prompts_frozen/. - Add a conda environment snapshot (
environment.lock.yml) for dependency pinning. - Add a detailed reproducibility manifest (
REPRODUCIBILITY.md) describing models, RNG surface, datasets, and replay steps.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| reproducibility_manifest/prompts_frozen/phase_3_judge_2026-04-24.py | Frozen Phase 3 judge script/prompt used for agreement validation. |
| reproducibility_manifest/prompts_frozen/phase_2_independent_2026-04-24.py | Frozen Phase 2 independent-answer script/prompt used for blind re-answering. |
| reproducibility_manifest/prompts_frozen/phase_1_prompts_2026-04-24.py | Frozen Phase 1 Q&A generation prompts/taxonomy used to generate questions and answers. |
| reproducibility_manifest/environment.lock.yml | Pinned environment snapshot for replaying the pipeline. |
| reproducibility_manifest/REPRODUCIBILITY.md | Written manifest enumerating models, randomness sources, data snapshots, and replay procedure. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| tasks = [_process(rec) for rec in todo] | ||
| await tqdm_asyncio.gather(*tasks, desc="Phase 3") |
There was a problem hiding this comment.
This schedules one task per record upfront; for large todo (e.g., hundreds of thousands of QAs) it can create very high memory pressure and slowdowns even though the HTTP calls are semaphored. Prefer a bounded-concurrency pattern (e.g., an asyncio.Queue with workers consumer tasks, or a small set of worker coroutines pulling from todo) so only O(workers) tasks exist at once.
| tasks = [_process(rec, qa_index, qa) for rec, qa_index, qa in todo] | ||
| await tqdm_asyncio.gather(*tasks, desc="Phase 2") |
There was a problem hiding this comment.
Same issue as Phase 3: creating one task per question upfront can blow up memory on large runs. Consider switching to a bounded worker pool (queue + N workers) so the number of live tasks stays proportional to workers rather than len(todo).
| async def _process(rec): | ||
| result, err, source = await judge_one_pair( | ||
| client, session, rec, use_heuristic=use_heuristic, | ||
| ) | ||
| async with out_lock: | ||
| if result is None: | ||
| stats.failed += 1 | ||
| return |
There was a problem hiding this comment.
When judging fails, err is computed but discarded and no error record is written. Operationally this makes it hard to debug failures and also causes indefinite retries on reruns (since (cid, qa_index) never lands in validated.jsonl). Consider appending failures to a dedicated Phase 3 error JSONL (like Phase 2 does) with cid, qa_index, and err, or emitting a failure record that can be excluded downstream but still marks the key as processed.
| out_f = output_path.open("a", encoding="utf-8") | ||
| out_lock = asyncio.Lock() |
There was a problem hiding this comment.
The output file is manually opened/closed; if tqdm_asyncio.gather(...) raises (cancellation, unexpected exception), out_f.close() is skipped. Use a context manager (with output_path.open(...) as out_f:) to ensure the handle is always closed on error.
| tasks = [_process(rec) for rec in todo] | ||
| await tqdm_asyncio.gather(*tasks, desc="Phase 3") | ||
|
|
||
| out_f.close() |
There was a problem hiding this comment.
The output file is manually opened/closed; if tqdm_asyncio.gather(...) raises (cancellation, unexpected exception), out_f.close() is skipped. Use a context manager (with output_path.open(...) as out_f:) to ensure the handle is always closed on error.
| The lock file in the repo was produced with the above sequence on the | ||
| release machine; exact OS and CUDA details are in its header. |
There was a problem hiding this comment.
environment.lock.yml in this PR doesn’t include a header with OS/CUDA metadata (it starts directly with name:). Either add the referenced header to the lock file, or adjust this text so it accurately reflects what is actually captured.
| Set `OPENROUTER_API_KEY` in `.env` before running the QA phases (see | ||
| `.env.example`). The four QA phases each write append-only JSONL and |
There was a problem hiding this comment.
This section is internally inconsistent: it refers to “four QA phases” while also saying Phase 0 is “no LLM” earlier, and it asks to set OPENROUTER_API_KEY before running “the QA phases” even though Phase 0 doesn’t require it. Clarify which phases require the API key and exactly which phases are counted as “QA phases” (e.g., Phase 1–3 only).
Document every source of randomness, model version, and configuration so the paper's reviewer can answer "is this reproducible?" with yes.