task 11 reproducibility manifest by TazoDaGreat · Pull Request #6 · vinash85/Chem2TextQA

TazoDaGreat · 2026-04-24T23:11:48Z

Document every source of randomness, model version, and configuration so the paper's reviewer can answer "is this reproducible?" with yes.

kushalviit · 2026-04-25T00:41:05Z

@luistafoi if results are fine then comment "good to merge"

kushalviit · 2026-04-25T00:41:19Z

@luistafoi if results are fine then comment "good to merge"

kushalviit · 2026-04-25T00:41:34Z

@luistafoi if results are fine then comment "good to merge"

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a reproducibility bundle for the Chem2TextQA “gold” release: it freezes the prompts and scripts and documents the exact environment, models, and randomness assumptions so that reviewers can verify the pipeline can be replayed.

Changes:

Add frozen Phase 1–3 prompt/script snapshots under reproducibility_manifest/prompts_frozen/.
Add a conda environment snapshot (environment.lock.yml) for dependency pinning.
Add a detailed reproducibility manifest (REPRODUCIBILITY.md) describing models, RNG surface, datasets, and replay steps.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.


File	Description
reproducibility_manifest/prompts_frozen/phase_3_judge_2026-04-24.py	Frozen Phase 3 judge script/prompt used for agreement validation.
reproducibility_manifest/prompts_frozen/phase_2_independent_2026-04-24.py	Frozen Phase 2 independent-answer script/prompt used for blind re-answering.
reproducibility_manifest/prompts_frozen/phase_1_prompts_2026-04-24.py	Frozen Phase 1 Q&A generation prompts/taxonomy used to generate questions and answers.
reproducibility_manifest/environment.lock.yml	Pinned environment snapshot for replaying the pipeline.
reproducibility_manifest/REPRODUCIBILITY.md	Written manifest enumerating models, randomness sources, data snapshots, and replay procedure.


Copilot · 2026-04-25T02:31:47Z

+        tasks = [_process(rec) for rec in todo]
+        await tqdm_asyncio.gather(*tasks, desc="Phase 3")


This schedules one task per record up front. For a large todo (e.g., hundreds of thousands of QAs), that can cause very high memory pressure and slowdowns even though the HTTP calls are gated by a semaphore. Prefer a bounded-concurrency pattern (e.g., an asyncio.Queue drained by a fixed number of worker tasks, or a small set of worker coroutines pulling from todo) so that only O(workers) tasks exist at once.
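A minimal sketch of the queue-plus-workers pattern suggested above, using only the standard library. The names run_bounded, handler, and workers are hypothetical and do not come from the frozen scripts; progress reporting via tqdm is omitted, and the handler is assumed to be a coroutine such as _process.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")


async def run_bounded(
    items: Iterable[T],
    handler: Callable[[T], Awaitable[None]],
    workers: int = 8,
) -> List[Exception]:
    """Run handler over items with at most `workers` handler calls in flight.

    Returns the exceptions raised by handler calls (empty if none failed).
    """
    # A bounded queue also caps how many pending items are held in memory.
    queue: asyncio.Queue = asyncio.Queue(maxsize=workers * 2)
    errors: List[Exception] = []

    async def _worker() -> None:
        while True:
            item = await queue.get()
            try:
                await handler(item)
            except Exception as exc:
                # Keep the worker alive so the queue keeps draining.
                errors.append(exc)
            finally:
                queue.task_done()

    pool = [asyncio.create_task(_worker()) for _ in range(workers)]
    try:
        for item in items:
            # Blocks while the queue is full, so `items` is consumed lazily.
            await queue.put(item)
        await queue.join()  # wait until every queued item has been handled
    finally:
        for task in pool:
            task.cancel()
        await asyncio.gather(*pool, return_exceptions=True)
    return errors
```

With this shape the number of live tasks equals workers regardless of len(todo), and todo itself could be a generator rather than a fully materialized list.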

Copilot · 2026-04-25T02:31:48Z

+        tasks = [_process(rec, qa_index, qa) for rec, qa_index, qa in todo]
+        await tqdm_asyncio.gather(*tasks, desc="Phase 2")


Same issue as Phase 3: creating one task per question upfront can blow up memory on large runs. Consider switching to a bounded worker pool (queue + N workers) so the number of live tasks stays proportional to workers rather than len(todo).

Copilot · 2026-04-25T02:31:48Z

+        async def _process(rec):
+            result, err, source = await judge_one_pair(
+                client, session, rec, use_heuristic=use_heuristic,
+            )
+            async with out_lock:
+                if result is None:
+                    stats.failed += 1
+                    return


When judging fails, err is computed but discarded and no error record is written. Operationally this makes it hard to debug failures and also causes indefinite retries on reruns (since (cid, qa_index) never lands in validated.jsonl). Consider appending failures to a dedicated Phase 3 error JSONL (like Phase 2 does) with cid, qa_index, and err, or emitting a failure record that can be excluded downstream but still marks the key as processed.
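One way the dedicated error file could look, as a sketch only. The file layout, the field names, and the helpers append_failure and load_failed_keys are assumptions for illustration; the type of cid is also assumed (a string here) and should match whatever the pipeline actually uses.

```python
import json
from pathlib import Path
from typing import Set, Tuple


def append_failure(errors_path: Path, cid: str, qa_index: int, err: str) -> None:
    """Append one failure record as a single JSON line."""
    record = {"cid": cid, "qa_index": qa_index, "error": err}
    with errors_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_failed_keys(errors_path: Path) -> Set[Tuple[str, int]]:
    """Return the (cid, qa_index) keys recorded as failed by earlier runs."""
    keys: Set[Tuple[str, int]] = set()
    if not errors_path.exists():
        return keys
    with errors_path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():  # tolerate blank lines
                rec = json.loads(line)
                keys.add((rec["cid"], rec["qa_index"]))
    return keys
```

A rerun could then choose explicitly whether to skip the keys from load_failed_keys or to retry them, instead of retrying every failure implicitly.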

Copilot · 2026-04-25T02:31:48Z

+        out_f = output_path.open("a", encoding="utf-8")
+        out_lock = asyncio.Lock()


The output file is manually opened/closed; if tqdm_asyncio.gather(...) raises (cancellation, unexpected exception), out_f.close() is skipped. Use a context manager (with output_path.open(...) as out_f:) to ensure the handle is always closed on error.
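A sketch of the context-manager form, assuming the writer appends one JSON object per line. write_results and the sleep standing in for the judge call are hypothetical; the gather call is kept only to mirror the quoted hunk and is not a recommendation for large inputs.

```python
import asyncio
import json
from pathlib import Path
from typing import Iterable


async def write_results(output_path: Path, records: Iterable[dict]) -> int:
    """Append records as JSONL; the handle is closed even if a task raises."""
    out_lock = asyncio.Lock()
    written = 0

    # The with-block closes out_f on normal exit, exception, or cancellation.
    with output_path.open("a", encoding="utf-8") as out_f:

        async def _process(rec: dict) -> None:
            nonlocal written
            await asyncio.sleep(0)  # stand-in for the real judge call
            async with out_lock:
                out_f.write(json.dumps(rec) + "\n")
                written += 1

        await asyncio.gather(*(_process(r) for r in records))
    return written
```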

Copilot · 2026-04-25T02:31:48Z

+        tasks = [_process(rec) for rec in todo]
+        await tqdm_asyncio.gather(*tasks, desc="Phase 3")
+
+        out_f.close()


The output file is manually opened/closed; if tqdm_asyncio.gather(...) raises (cancellation, unexpected exception), out_f.close() is skipped. Use a context manager (with output_path.open(...) as out_f:) to ensure the handle is always closed on error.

Copilot · 2026-04-25T02:31:49Z

+The lock file in the repo was produced with the above sequence on the
+release machine; exact OS and CUDA details are in its header.


environment.lock.yml in this PR doesn’t include a header with OS/CUDA metadata (it starts directly with name:). Either add the referenced header to the lock file, or adjust this text so it accurately reflects what is actually captured.
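If the header route is taken, something like the following could prepend the metadata as YAML comments. This is a sketch: build_lock_header and with_header are hypothetical helpers, the chosen fields are an assumption, and the CUDA version is passed in by the caller because how it is detected depends on the release machine.

```python
import platform
from datetime import datetime, timezone
from typing import Optional


def build_lock_header(cuda_version: Optional[str] = None) -> str:
    """Return comment lines describing the machine that produced the lock file."""
    generated = datetime.now(timezone.utc).isoformat(timespec="seconds")
    lines = [
        f"# generated: {generated}",
        f"# os: {platform.system()} {platform.release()} ({platform.machine()})",
        f"# python: {platform.python_version()}",
        f"# cuda: {cuda_version or 'not recorded'}",
    ]
    return "\n".join(lines) + "\n"


def with_header(lock_text: str, cuda_version: Optional[str] = None) -> str:
    """Prepend the header; '#' lines are comments in YAML, so parsing is unaffected."""
    return build_lock_header(cuda_version) + lock_text
```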

Copilot · 2026-04-25T02:31:49Z

+Set `OPENROUTER_API_KEY` in `.env` before running the QA phases (see
+`.env.example`). The four QA phases each write append-only JSONL and


This section is internally inconsistent: it refers to “four QA phases” while also saying Phase 0 is “no LLM” earlier, and it asks to set OPENROUTER_API_KEY before running “the QA phases” even though Phase 0 doesn’t require it. Clarify which phases require the API key and exactly which phases are counted as “QA phases” (e.g., Phase 1–3 only).
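The same distinction could also be enforced in code, so the manifest and the scripts cannot drift apart. The sketch below assumes that only Phases 1 to 3 call the LLM and that Phase 0 does not; LLM_PHASES and require_api_key are hypothetical names.

```python
import os
from typing import Mapping

# Assumption: Phases 1-3 call the LLM; Phase 0 does not.
LLM_PHASES = frozenset({1, 2, 3})


def require_api_key(phase: int, env: Mapping[str, str] = os.environ) -> None:
    """Raise early if an LLM phase is started without OPENROUTER_API_KEY set."""
    if phase in LLM_PHASES and not env.get("OPENROUTER_API_KEY"):
        raise RuntimeError(
            f"Phase {phase} needs OPENROUTER_API_KEY; set it in .env "
            "(see .env.example)."
        )
```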

task 11 reproducibility manifest

4745fdc

TazoDaGreat requested review from avi-lab and luistafoi April 24, 2026 23:12

kushalviit reviewed Apr 25, 2026


Macaulay001 requested a review from Copilot April 25, 2026 02:20

Copilot started reviewing on behalf of Macaulay001 April 25, 2026 02:21

Copilot AI reviewed Apr 25, 2026

TazoDaGreat wants to merge 1 commit into main from task11_Reproducibility_manifest

3 participants
