
task 11 reproducibility manifest #6

Open
TazoDaGreat wants to merge 1 commit into main from task11_Reproducibility_manifest

Conversation

@TazoDaGreat

Collaborator

Document every source of randomness, model version, and configuration so the paper's reviewer can answer "is this reproducible?" with yes.


@luistafoi, if the results look fine, please comment "good to merge".


Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a reproducibility bundle for the Chem2TextQA “gold” release by freezing prompts/scripting and documenting the exact environment/models/randomness assumptions so reviewers can verify the pipeline can be replayed.

Changes:

  • Add frozen Phase 1–3 prompt/script snapshots under reproducibility_manifest/prompts_frozen/.
  • Add a conda environment snapshot (environment.lock.yml) for dependency pinning.
  • Add a detailed reproducibility manifest (REPRODUCIBILITY.md) describing models, RNG surface, datasets, and replay steps.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| reproducibility_manifest/prompts_frozen/phase_3_judge_2026-04-24.py | Frozen Phase 3 judge script/prompt used for agreement validation. |
| reproducibility_manifest/prompts_frozen/phase_2_independent_2026-04-24.py | Frozen Phase 2 independent-answer script/prompt used for blind re-answering. |
| reproducibility_manifest/prompts_frozen/phase_1_prompts_2026-04-24.py | Frozen Phase 1 Q&A generation prompts/taxonomy used to generate questions and answers. |
| reproducibility_manifest/environment.lock.yml | Pinned environment snapshot for replaying the pipeline. |
| reproducibility_manifest/REPRODUCIBILITY.md | Written manifest enumerating models, randomness sources, data snapshots, and replay procedure. |


Comment on lines +267 to +268
tasks = [_process(rec) for rec in todo]
await tqdm_asyncio.gather(*tasks, desc="Phase 3")

Copilot AI Apr 25, 2026


This schedules one task per record upfront; for a large todo (e.g., hundreds of thousands of QAs) it can cause very high memory pressure and slowdowns even though the HTTP calls are gated by a semaphore. Prefer a bounded-concurrency pattern (e.g., an asyncio.Queue drained by a fixed number of consumer tasks, or a small set of worker coroutines pulling from todo) so only O(workers) tasks exist at once.
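
A minimal sketch of the queue-based variant suggested above. The helper name `run_bounded`, its parameters, and the choice to collect handler errors in a list are assumptions made for illustration; they are not taken from the frozen Phase 3 script.

```python
import asyncio


async def run_bounded(items, handler, workers=8):
    """Process `items` with at most `workers` live tasks (hypothetical helper)."""
    # A bounded queue keeps the producer from racing ahead of the consumers.
    queue = asyncio.Queue(maxsize=workers)
    errors = []

    async def worker():
        while True:
            item = await queue.get()
            try:
                await handler(item)
            except Exception as exc:
                # Keep the worker alive and remember what failed.
                errors.append((item, exc))
            finally:
                queue.task_done()

    consumers = [asyncio.create_task(worker()) for _ in range(workers)]
    try:
        for item in items:
            await queue.put(item)  # waits while the queue is full
        await queue.join()  # returns once every queued item was handled
    finally:
        for task in consumers:
            task.cancel()
        await asyncio.gather(*consumers, return_exceptions=True)
    return errors
```

With this shape the number of live tasks is `workers`, independent of how many records are pending, and failures come back to the caller instead of being lost.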

Copilot uses AI. Check for mistakes.
Comment on lines +261 to +262
tasks = [_process(rec, qa_index, qa) for rec, qa_index, qa in todo]
await tqdm_asyncio.gather(*tasks, desc="Phase 2")

Copilot AI Apr 25, 2026


Same issue as Phase 3: creating one task per question upfront can blow up memory on large runs. Consider switching to a bounded worker pool (queue + N workers) so the number of live tasks stays proportional to workers rather than len(todo).
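
A queue is not strictly required. As a rough sketch, a fixed set of workers can pull from one shared iterator over the pending tuples. The name `run_with_shared_iterator` is hypothetical, and `process` stands in for the script's `_process(rec, qa_index, qa)`.

```python
import asyncio


async def run_with_shared_iterator(todo, process, workers=8):
    """Call `process(*args)` for each tuple in `todo` using `workers` coroutines."""
    pending = iter(todo)  # one iterator shared by every worker

    async def worker():
        # asyncio runs these coroutines on a single thread, and there is no
        # await between fetching an item and starting it, so each tuple is
        # handed to exactly one worker.
        for args in pending:
            await process(*args)

    await asyncio.gather(*(worker() for _ in range(workers)))
```

Note that an exception raised by `process` ends that worker and propagates out of `gather`, so a real script would probably catch and record errors inside the worker loop.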

Comment on lines +242 to +249
async def _process(rec):
    result, err, source = await judge_one_pair(
        client, session, rec, use_heuristic=use_heuristic,
    )
    async with out_lock:
        if result is None:
            stats.failed += 1
            return

Copilot AI Apr 25, 2026


When judging fails, err is computed but discarded and no error record is written. Operationally this makes it hard to debug failures and also causes indefinite retries on reruns (since (cid, qa_index) never lands in validated.jsonl). Consider appending failures to a dedicated Phase 3 error JSONL (like Phase 2 does) with cid, qa_index, and err, or emitting a failure record that can be excluded downstream but still marks the key as processed.
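
One possible shape for such an error log, as a sketch only. The function names and the JSON field names (`cid`, `qa_index`, `error`) are assumptions; the actual record layout in validated.jsonl may differ.

```python
import json
from pathlib import Path


def append_error(error_path, cid, qa_index, err):
    """Append one failure record as a JSON line (hypothetical helper)."""
    record = {"cid": cid, "qa_index": qa_index, "error": str(err)}
    with Path(error_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_keys(*paths):
    """Collect the (cid, qa_index) keys already present in the given JSONL files."""
    keys = set()
    for path in map(Path, paths):
        if not path.exists():
            continue
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    rec = json.loads(line)
                    keys.add((rec["cid"], rec["qa_index"]))
    return keys
```

Passing both the validated file and the error file to `load_keys` would mark failed pairs as processed on a rerun; passing only the validated file keeps them eligible for retry while still leaving a trace of why they failed.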

Comment on lines +239 to +240
out_f = output_path.open("a", encoding="utf-8")
out_lock = asyncio.Lock()

Copilot AI Apr 25, 2026


The output file is manually opened/closed; if tqdm_asyncio.gather(...) raises (cancellation, unexpected exception), out_f.close() is skipped. Use a context manager (with output_path.open(...) as out_f:) to ensure the handle is always closed on error.
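
A hedged sketch of the context-manager form. `write_results` and `process` are illustrative stand-ins, not the script's own functions, and `asyncio.gather` is used only to keep the example short.

```python
import asyncio
import json
from pathlib import Path


async def write_results(output_path, records, process):
    """Append one JSON line per processed record; the file is closed even on error."""
    out_lock = asyncio.Lock()
    with Path(output_path).open("a", encoding="utf-8") as out_f:

        async def _process(rec):
            result = await process(rec)
            async with out_lock:
                out_f.write(json.dumps(result) + "\n")

        # return_exceptions=True makes gather wait for every task, so no task
        # can still be writing when the with-block closes the file.
        outcomes = await asyncio.gather(
            *(_process(rec) for rec in records), return_exceptions=True
        )
    # Hand failures back to the caller instead of dropping them.
    return [o for o in outcomes if isinstance(o, BaseException)]
```

Without `return_exceptions=True`, `gather` would raise on the first failure while sibling tasks keep running, and those tasks could then try to write to an already closed handle.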

tasks = [_process(rec) for rec in todo]
await tqdm_asyncio.gather(*tasks, desc="Phase 3")

out_f.close()

Comment on lines +127 to +128
The lock file in the repo was produced with the above sequence on the
release machine; exact OS and CUDA details are in its header.

Copilot AI Apr 25, 2026


environment.lock.yml in this PR doesn’t include a header with OS/CUDA metadata (it starts directly with name:). Either add the referenced header to the lock file, or adjust this text so it accurately reflects what is actually captured.
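
If the header route is taken, something like the following could generate it. This is a sketch under assumptions: the helper names are hypothetical, the set of fields is a guess at what "OS and CUDA details" should cover, and the CUDA version is supplied by the caller rather than detected.

```python
import platform
from datetime import datetime, timezone


def build_lock_header(cuda_version="unknown"):
    """Return YAML comment lines describing the machine that produced the lock file."""
    lines = [
        f"# generated: {datetime.now(timezone.utc).isoformat(timespec='seconds')}",
        f"# os: {platform.system()} {platform.release()}",
        f"# machine: {platform.machine()}",
        f"# python: {platform.python_version()}",
        f"# cuda: {cuda_version}",  # assumption: passed in by the caller
    ]
    return "\n".join(lines) + "\n"


def prepend_header(lock_text, header):
    """Put the comment header in front of the existing lock file contents."""
    return header + lock_text
```

Because the header consists only of `#` comment lines, the file remains valid YAML and the existing `name:` entry is left untouched.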

Comment on lines +176 to +177
Set `OPENROUTER_API_KEY` in `.env` before running the QA phases (see
`.env.example`). The four QA phases each write append-only JSONL and

Copilot AI Apr 25, 2026


This section is internally inconsistent: it refers to “four QA phases” while also saying Phase 0 is “no LLM” earlier, and it asks to set OPENROUTER_API_KEY before running “the QA phases” even though Phase 0 doesn’t require it. Clarify which phases require the API key and exactly which phases are counted as “QA phases” (e.g., Phase 1–3 only).
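
The distinction could also be enforced in code. The sketch below assumes phases are numbered 0 to 3 and that only Phases 1 to 3 call the LLM, which is the reading suggested above but not confirmed by the manifest; `require_api_key` and `LLM_PHASES` are hypothetical names.

```python
import os

# Assumption: Phase 0 makes no LLM calls; Phases 1-3 go through OpenRouter.
LLM_PHASES = frozenset({1, 2, 3})


def require_api_key(phase, env=None):
    """Fail early if an LLM phase is started without OPENROUTER_API_KEY."""
    env = os.environ if env is None else env
    if phase not in LLM_PHASES:
        return None  # no key needed for this phase
    key = env.get("OPENROUTER_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            f"Phase {phase} needs OPENROUTER_API_KEY (see .env.example)"
        )
    return key
```

A check like this lets Phase 0 run on a machine without credentials while the LLM phases stop with a clear message before any request is made.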


3 participants