Task 7: Chem2TextQA Diversity Analysis by avi-lab · Pull Request #9 · vinash85/Chem2TextQA

avi-lab · 2026-04-25T02:05:37Z

Comprehensive diversity analysis comparing Chem2TextQA against SMolInstruct baseline.

Key Results:

Chem2TextQA: 46,992 question vocab, 328,159 answer vocab
SMolInstruct: 79 question vocab, 32 answer vocab
Diversity gap: 595× larger vocab, 2,480× more unique question stems
Topics: 2,169 fine-grained (vs. 14 templates in baseline)

Deliverables:

4 analysis scripts (C1-C4 pipeline)
10 figures (3 final manuscript-ready PDFs + 7 intermediate)
LaTeX diversity table
JSON caches with all statistics

Scripts:

C1: Lexical diversity, topic distribution, structural analysis
C2: Semantic embeddings and cosine similarity
C3: Type-Token Ratio decay curves
C4: Final composite figures + LaTeX table

See ANALYSIS_SUMMARY.md and README.md for full details.


Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a full C1–C4 analysis pipeline to quantify and visualize lexical/semantic/topic diversity for Chem2TextQA vs. a SMolInstruct baseline, along with cached outputs and manuscript-ready artifacts.

Changes:

Introduces 4 analysis scripts (C1–C4) to compute lexical stats, semantic cosine similarity, TTR decay curves, and final composite figures/table.
Adds cached JSON outputs plus a LaTeX table for inclusion in a manuscript.
Adds documentation summarizing results and how to run the pipeline.

Reviewed changes

Copilot reviewed 9 out of 29 changed files in this pull request and generated 14 comments.


File	Description
Task7_DiversityAnalysis/C1_diversity_analysis.py	Computes lexical/topic/structural stats; writes caches, figures, and draft LaTeX table
Task7_DiversityAnalysis/C2_semantic_diversity.py	Computes embedding-based cosine similarity stats and histogram figures
Task7_DiversityAnalysis/C3_ttr_curve.py	Generates TTR decay curve figures for questions/answers
Task7_DiversityAnalysis/C4_combined_figures.py	Assembles final manuscript figures and writes the final LaTeX table
Task7_DiversityAnalysis/README.md	Provides run instructions and headline results
Task7_DiversityAnalysis/ANALYSIS_SUMMARY.md	Detailed narrative summary of findings and outputs
Task7_DiversityAnalysis/diversity_summary.json	Cached lexical/topic/structural statistics from C1
Task7_DiversityAnalysis/semantic_diversity_stats.json	Cached semantic cosine similarity statistics from C2
Task7_DiversityAnalysis/tables/diversity_analysis.tex	Publication-ready table for manuscript inclusion


Copilot · 2026-04-25T02:33:08Z

+stats_list = [
+    ("Chem2TextQA Questions (188k)", q_chem_stats),
+    ("SMolInstruct Questions (all)", q_smol_stats),
+    ("Chem2TextQA Answers (188k)", a_chem_stats),
+    ("SMolInstruct Answers (all)", a_smol_stats),
+]


summary_c2 is keyed by "Chem2TextQA Questions", "SMolInstruct Questions", "Chem2TextQA Answers", "SMolInstruct Answers", but C4 looks up cosine stats using display strings like "Chem2TextQA Questions (188k)". This guarantees mean_cos falls back to 0, which matches the committed LaTeX table showing 0.000 cosines. Fix by using canonical keys for lookup (and keep display labels separate), e.g., store (display_name, stats, semantic_key) or a mapping from display name → semantic JSON key.

Copilot · 2026-04-25T02:33:08Z

+    ttr_str = f"{s['sampled_ttr']:.3f}$\\pm${s['sampled_ttr_std']:.3f}"
+    mean_cos = summary_c2.get(name, {}).get("mean_cosine", 0)
+    mean_cos_str = f"{mean_cos:.3f}"
+    mlen = f"{s['mean_length']:.1f}"
+    latex += f"{name} & {n} & {vocab} & {tri} & {ttr_str} & {mean_cos_str} & {mlen} \\\\\n"


summary_c2 is keyed by "Chem2TextQA Questions", "SMolInstruct Questions", "Chem2TextQA Answers", "SMolInstruct Answers", but C4 looks up cosine stats using display strings like "Chem2TextQA Questions (188k)". This guarantees mean_cos falls back to 0, which matches the committed LaTeX table showing 0.000 cosines. Fix by using canonical keys for lookup (and keep display labels separate), e.g., store (display_name, stats, semantic_key) or a mapping from display name → semantic JSON key.
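One way to apply the suggested fix is to carry the canonical semantic key alongside each display label, and to render a missing entry as a placeholder instead of a silent 0.000. The sketch below is only an illustration: the `summary_c2` values are made-up placeholders (not results from this PR), and `mean_cosine_for` and `format_cosine` are hypothetical helper names that do not appear in C4.

```python
# Placeholder data standing in for the JSON cache written by C2.
# The numbers are illustrative only, not values from the PR.
summary_c2 = {
    "Chem2TextQA Questions": {"mean_cosine": 0.25},
    "SMolInstruct Questions": {"mean_cosine": 0.75},
}

# Each entry pairs a display label with the canonical key used in summary_c2.
# (The lexical stats dicts from C4 are omitted to keep the sketch short.)
stats_list = [
    ("Chem2TextQA Questions (188k)", "Chem2TextQA Questions"),
    ("SMolInstruct Questions (all)", "SMolInstruct Questions"),
]


def mean_cosine_for(semantic_key, summary):
    """Return the mean cosine for a canonical key, or None if it is absent."""
    entry = summary.get(semantic_key)
    return None if entry is None else entry.get("mean_cosine")


def format_cosine(value):
    """Render a missing value as '--' rather than a misleading 0.000."""
    return "--" if value is None else f"{value:.3f}"


rows = [
    (label, format_cosine(mean_cosine_for(key, summary_c2)))
    for label, key in stats_list
]
```

Looking up by display label, as the current code does, would yield `"--"` in this sketch, which makes the mismatch visible in the table instead of hiding it behind a zero.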

Copilot · 2026-04-25T02:33:08Z

+OUT_DIR = Path("/data/asahu/projects/mutqa/chem2textqa_diversity")
+FIG_DIR = OUT_DIR / "figures"
+TAB_DIR = OUT_DIR / "tables"


FIG_DIR and TAB_DIR are never created in C4 before writing files. Unlike C1/C2/C3, this script can fail at runtime with FileNotFoundError when saving figures or the LaTeX table. Add FIG_DIR.mkdir(parents=True, exist_ok=True) and TAB_DIR.mkdir(parents=True, exist_ok=True) early in the script.

Copilot · 2026-04-25T02:33:09Z

+)
+
+plt.suptitle("", fontsize=1)  # Suppress default suptitle
+plt.savefig(FIG_DIR / "figure1_question_diversity.pdf", dpi=300, bbox_inches="tight")


FIG_DIR and TAB_DIR are never created in C4 before writing files. Unlike C1/C2/C3, this script can fail at runtime with FileNotFoundError when saving figures or the LaTeX table. Add FIG_DIR.mkdir(parents=True, exist_ok=True) and TAB_DIR.mkdir(parents=True, exist_ok=True) early in the script.

Copilot · 2026-04-25T02:33:09Z

+\end{table}
+"""
+
+with open(TAB_DIR / "diversity_analysis.tex", "w") as f:


FIG_DIR and TAB_DIR are never created in C4 before writing files. Unlike C1/C2/C3, this script can fail at runtime with FileNotFoundError when saving figures or the LaTeX table. Add FIG_DIR.mkdir(parents=True, exist_ok=True) and TAB_DIR.mkdir(parents=True, exist_ok=True) early in the script.
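A minimal sketch of the directory setup described above. It uses a temporary base directory as a stand-in for the PR's `OUT_DIR` so that it can run anywhere; that substitution and the placeholder table body are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Stand-in base directory for illustration; the PR hard-codes its own OUT_DIR.
OUT_DIR = Path(tempfile.mkdtemp()) / "chem2textqa_diversity"
FIG_DIR = OUT_DIR / "figures"
TAB_DIR = OUT_DIR / "tables"

# Create both output directories up front, before any savefig() or open() call.
for directory in (FIG_DIR, TAB_DIR):
    directory.mkdir(parents=True, exist_ok=True)

# Writing now succeeds even on a fresh checkout with no pre-existing folders.
table_path = TAB_DIR / "diversity_analysis.tex"
table_path.write_text("% placeholder table body\n")
```

Because `exist_ok=True` makes the call idempotent, running C4 after C1–C3 (which already create these folders) is unaffected.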

Copilot · 2026-04-25T02:33:10Z

+    for _ in range(n_pairs):
+        i, j = random.sample(range(n_texts), 2)


Computing 500k similarities with a pure-Python loop and random.sample(range(n_texts), 2) each iteration is a major runtime hotspot. Consider generating index arrays in bulk (e.g., with NumPy RNG), computing dot products vectorized, and then slicing down to n_pairs; this is substantially faster and, with a seeded generator, reproducible.

Copilot · 2026-04-25T02:33:10Z

+        cos_sim = float(np.dot(embeddings[i], embeddings[j]))
+        cosine_sims.append(cos_sim)


Computing 500k similarities with a pure-Python loop and random.sample(range(n_texts), 2) each iteration is a major runtime hotspot. Consider generating index arrays in bulk (e.g., with NumPy RNG), computing dot products vectorized, and then slicing down to n_pairs; this is substantially faster and, with a seeded generator, reproducible.
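The vectorized approach could look roughly like the sketch below. It assumes the embedding rows are already L2-normalized (as the plain dot product in C2 implies), and it draws the second index as a non-zero offset from the first, which avoids self-pairs without the oversample-and-slice step the comment mentions. The function name `sample_pairwise_cosine`, the `batch_size` parameter, and the random demo embeddings are all hypothetical.

```python
import numpy as np


def sample_pairwise_cosine(embeddings, n_pairs, seed=0, batch_size=100_000):
    """Sample n_pairs index pairs with i != j and return their dot products.

    Assumes rows are L2-normalized, so each dot product is a cosine similarity.
    """
    n_texts = embeddings.shape[0]
    if n_texts < 2:
        raise ValueError("need at least two embeddings to form a pair")
    rng = np.random.default_rng(seed)  # seeded generator for reproducibility
    i = rng.integers(0, n_texts, size=n_pairs)
    # An offset in [1, n_texts - 1] guarantees j differs from i.
    offset = rng.integers(1, n_texts, size=n_pairs)
    j = (i + offset) % n_texts

    sims = np.empty(n_pairs, dtype=np.float64)
    # Process in batches so the gathered rows do not exhaust memory.
    for start in range(0, n_pairs, batch_size):
        stop = start + batch_size
        sims[start:stop] = np.einsum(
            "ij,ij->i", embeddings[i[start:stop]], embeddings[j[start:stop]]
        )
    return sims


# Demo on random unit vectors (illustrative data, not the PR's embeddings).
demo_rng = np.random.default_rng(42)
embeddings = demo_rng.normal(size=(200, 16))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
sims = sample_pairwise_cosine(embeddings, n_pairs=1000, seed=0)
```

Batching matters at the scale discussed in the review: gathering 500k rows on both sides at once would temporarily duplicate a large share of the embedding matrix.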

Copilot · 2026-04-25T02:33:10Z

+        if i >= next_checkpoint:
+            ttr = len(vocab) / len(vocab.union(set(all_tokens[i-checkpoint_interval:i+1])))
+            token_counts.append(i)
+            ttr_values.append(len(vocab) / (i + 1))


The local variable ttr is computed but never used, and the expression itself is confusing (it mixes unique types with a windowed token slice). Removing the unused computation reduces confusion and avoids unnecessary per-checkpoint overhead.
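With the unused computation removed, the checkpoint loop reduces to a running vocabulary and a cumulative ratio. The sketch below is an assumption about the surrounding function, which is not shown in the hunk: `ttr_curve` is a hypothetical name, and it records the token count as `i + 1` so that the x-value and the denominator agree (the hunk appends `i` but divides by `i + 1`).

```python
def ttr_curve(all_tokens, checkpoint_interval):
    """Return (token_counts, ttr_values) sampled every checkpoint_interval tokens.

    TTR at a checkpoint is the number of unique types seen so far divided by
    the number of tokens seen so far.
    """
    vocab = set()
    token_counts, ttr_values = [], []
    for i, token in enumerate(all_tokens):
        vocab.add(token)
        n_seen = i + 1
        if n_seen % checkpoint_interval == 0:
            token_counts.append(n_seen)
            ttr_values.append(len(vocab) / n_seen)
    return token_counts, ttr_values


# Toy input: 3 types in the first 4 tokens, 4 types in all 8.
counts, ttrs = ttr_curve("a b a c a b d a".split(), checkpoint_interval=4)
# counts == [4, 8], ttrs == [0.75, 0.5]
```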

Copilot · 2026-04-25T02:33:11Z

+    for _ in range(n_bootstrap):
+        random.shuffle(texts)


sample_ttr() mutates its input list in-place by shuffling texts. Since corpus_stats() passes through the original lists used elsewhere in the script, this introduces hidden side effects and can make results harder to reproduce/debug. Shuffle a copy (e.g., texts_shuf = texts[:]) or sample indices instead.
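A side-effect-free variant might sample from the list rather than shuffle it. The sketch below is a guess at the shape of `sample_ttr()`, whose full body is not shown in the hunk: the signature, the lowercase whitespace tokenization, and the use of population standard deviation are all assumptions.

```python
import random
import statistics


def sample_ttr(texts, sample_size, n_bootstrap=10, seed=0):
    """Estimate TTR mean and std over repeated subsamples without mutating texts."""
    if not texts:
        raise ValueError("texts must not be empty")
    rng = random.Random(seed)  # local generator; global random state is untouched
    k = min(sample_size, len(texts))
    ttrs = []
    for _ in range(n_bootstrap):
        subset = rng.sample(texts, k)  # returns a new list; input order is preserved
        tokens = [tok for text in subset for tok in text.lower().split()]
        if tokens:
            ttrs.append(len(set(tokens)) / len(tokens))
    if not ttrs:
        raise ValueError("no tokens found in any subsample")
    return statistics.fmean(ttrs), statistics.pstdev(ttrs)


texts = ["a b", "a c", "b d", "e f"]
original_order = list(texts)
mean_ttr, std_ttr = sample_ttr(texts, sample_size=2, n_bootstrap=5)
```

Using a local `random.Random(seed)` also keeps this function from perturbing any other code that relies on the module-level generator.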

Copilot · 2026-04-25T02:33:11Z

+CHEM2TEXT_PATH = "/data/luis/Chem2TextHackathon/full_premium_kimi/dataset_gold.jsonl"
+OUT_DIR = Path("/data/asahu/projects/mutqa/chem2textqa_diversity")


Hard-coded absolute paths make the pipeline non-portable and prevent running it from the repo checkout (and also mean the committed caches/tables in Task7_DiversityAnalysis/ aren’t what the scripts will read/write by default). Make paths configurable (CLI args via argparse, environment variables, or default to Path(__file__).resolve().parent) so reviewers/CI can run the analysis consistently.
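One possible shape for the configurable paths is sketched below, combining the three options the comment lists. The flag names, the `CHEM2TEXT_PATH` environment variable, and the `parse_paths` helper are hypothetical choices for illustration, not names taken from the PR.

```python
import argparse
import os
from pathlib import Path

# Default output location: the script's own folder, falling back to the
# current directory when __file__ is unavailable (e.g. in a notebook).
SCRIPT_DIR = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()


def parse_paths(argv=None):
    """Parse dataset and output paths from CLI flags, with env/default fallbacks."""
    parser = argparse.ArgumentParser(description="Chem2TextQA diversity analysis")
    parser.add_argument(
        "--chem2text-path",
        type=Path,
        default=os.environ.get("CHEM2TEXT_PATH"),  # hypothetical env var name
        help="Path to the Chem2TextQA JSONL file",
    )
    parser.add_argument(
        "--out-dir",
        type=Path,
        default=SCRIPT_DIR,
        help="Directory for caches, figures, and tables",
    )
    args = parser.parse_args(argv)
    if args.chem2text_path is None:
        parser.error("provide --chem2text-path or set CHEM2TEXT_PATH")
    return args


args = parse_paths(["--chem2text-path", "data/dataset_gold.jsonl", "--out-dir", "out"])
```

Defaulting `--out-dir` to the script's folder would make the scripts read and write the same caches and tables that are committed under Task7_DiversityAnalysis/.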

avi-lab requested review from kushalviit and luistafoi April 25, 2026 02:05

Macaulay001 requested a review from Copilot April 25, 2026 02:19

Copilot started reviewing on behalf of Macaulay001 April 25, 2026 02:20

avi-lab wants to merge 1 commit into main from Task7_DiversityAnalysis
