Task 7: Chem2TextQA Diversity Analysis #9
Comprehensive diversity analysis comparing Chem2TextQA against the SMolInstruct baseline.

Key Results:
- Chem2TextQA: 46,992 question vocab, 328,159 answer vocab
- SMolInstruct: 79 question vocab, 32 answer vocab
- Diversity gap: 595× larger vocab, 2,480× more unique question stems
- Topics: 2,169 fine-grained (vs. 14 templates in baseline)

Deliverables:
- 4 analysis scripts (C1-C4 pipeline)
- 10 figures (3 final manuscript-ready PDFs + 7 intermediate)
- LaTeX diversity table
- JSON caches with all statistics

Scripts:
- C1: Lexical diversity, topic distribution, structural analysis
- C2: Semantic embeddings and cosine similarity
- C3: Type-Token Ratio decay curves
- C4: Final composite figures + LaTeX table

See ANALYSIS_SUMMARY.md and README.md for full details.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a full C1–C4 analysis pipeline to quantify and visualize lexical/semantic/topic diversity for Chem2TextQA vs. a SMolInstruct baseline, along with cached outputs and manuscript-ready artifacts.
Changes:
- Introduces 4 analysis scripts (C1–C4) to compute lexical stats, semantic cosine similarity, TTR decay curves, and final composite figures/table.
- Adds cached JSON outputs plus a LaTeX table for inclusion in a manuscript.
- Adds documentation summarizing results and how to run the pipeline.
Reviewed changes
Copilot reviewed 9 out of 29 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| Task7_DiversityAnalysis/C1_diversity_analysis.py | Computes lexical/topic/structural stats; writes caches, figures, and draft LaTeX table |
| Task7_DiversityAnalysis/C2_semantic_diversity.py | Computes embedding-based cosine similarity stats and histogram figures |
| Task7_DiversityAnalysis/C3_ttr_curve.py | Generates TTR decay curve figures for questions/answers |
| Task7_DiversityAnalysis/C4_combined_figures.py | Assembles final manuscript figures and writes the final LaTeX table |
| Task7_DiversityAnalysis/README.md | Provides run instructions and headline results |
| Task7_DiversityAnalysis/ANALYSIS_SUMMARY.md | Detailed narrative summary of findings and outputs |
| Task7_DiversityAnalysis/diversity_summary.json | Cached lexical/topic/structural statistics from C1 |
| Task7_DiversityAnalysis/semantic_diversity_stats.json | Cached semantic cosine similarity statistics from C2 |
| Task7_DiversityAnalysis/tables/diversity_analysis.tex | Publication-ready table for manuscript inclusion |
| stats_list = [ | ||
| ("Chem2TextQA Questions (188k)", q_chem_stats), | ||
| ("SMolInstruct Questions (all)", q_smol_stats), | ||
| ("Chem2TextQA Answers (188k)", a_chem_stats), | ||
| ("SMolInstruct Answers (all)", a_smol_stats), | ||
| ] |
summary_c2 is keyed by "Chem2TextQA Questions", "SMolInstruct Questions", "Chem2TextQA Answers", "SMolInstruct Answers", but C4 looks up cosine stats using display strings like "Chem2TextQA Questions (188k)". This guarantees mean_cos falls back to 0, which matches the committed LaTeX table showing 0.000 cosines. Fix by using canonical keys for lookup (and keep display labels separate), e.g., store (display_name, stats, semantic_key) or a mapping from display name → semantic JSON key.
| ttr_str = f"{s['sampled_ttr']:.3f}$\\pm${s['sampled_ttr_std']:.3f}" | ||
| mean_cos = summary_c2.get(name, {}).get("mean_cosine", 0) | ||
| mean_cos_str = f"{mean_cos:.3f}" | ||
| mlen = f"{s['mean_length']:.1f}" | ||
| latex += f"{name} & {n} & {vocab} & {tri} & {ttr_str} & {mean_cos_str} & {mlen} \\\\\n" |
summary_c2 is keyed by "Chem2TextQA Questions", "SMolInstruct Questions", "Chem2TextQA Answers", "SMolInstruct Answers", but C4 looks up cosine stats using display strings like "Chem2TextQA Questions (188k)". This guarantees mean_cos falls back to 0, which matches the committed LaTeX table showing 0.000 cosines. Fix by using canonical keys for lookup (and keep display labels separate), e.g., store (display_name, stats, semantic_key) or a mapping from display name → semantic JSON key.
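One way the suggested fix could look is sketched below. It pairs each display label with the canonical key used in the semantic JSON and raises on a missing key instead of silently defaulting to 0. The helper name `mean_cosine_for` and the numeric values are placeholders for illustration, not values from the committed caches.

```python
# Placeholder stand-in for the C2 cache; the key names come from the review
# comment, the numbers are dummy values and NOT real results.
summary_c2 = {
    "Chem2TextQA Questions": {"mean_cosine": 0.1},
    "SMolInstruct Questions": {"mean_cosine": 0.2},
    "Chem2TextQA Answers": {"mean_cosine": 0.3},
    "SMolInstruct Answers": {"mean_cosine": 0.4},
}

# Each entry keeps the display label separate from the canonical lookup key.
stats_list = [
    ("Chem2TextQA Questions (188k)", "Chem2TextQA Questions"),
    ("SMolInstruct Questions (all)", "SMolInstruct Questions"),
    ("Chem2TextQA Answers (188k)", "Chem2TextQA Answers"),
    ("SMolInstruct Answers (all)", "SMolInstruct Answers"),
]


def mean_cosine_for(semantic_key, summary):
    """Hypothetical helper: return the cached mean cosine or fail loudly."""
    try:
        return summary[semantic_key]["mean_cosine"]
    except KeyError as exc:
        raise KeyError(f"No semantic stats cached for {semantic_key!r}") from exc


# Build (display label, formatted cosine) pairs for the table rows.
rows = [
    (display, f"{mean_cosine_for(key, summary_c2):.3f}")
    for display, key in stats_list
]
```

Raising on a missing key is a design choice: a 0.000 in the table is indistinguishable from a real value, whereas an exception surfaces the mismatch as soon as C4 runs.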
| OUT_DIR = Path("/data/asahu/projects/mutqa/chem2textqa_diversity") | ||
| FIG_DIR = OUT_DIR / "figures" | ||
| TAB_DIR = OUT_DIR / "tables" |
FIG_DIR and TAB_DIR are never created in C4 before writing files. Unlike C1/C2/C3, this script can fail at runtime with FileNotFoundError when saving figures or the LaTeX table. Add FIG_DIR.mkdir(parents=True, exist_ok=True) and TAB_DIR.mkdir(parents=True, exist_ok=True) early in the script.
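A minimal sketch of the directory setup, assuming the `figures` and `tables` subdirectory names shown in the hunk above. The helper name `ensure_output_dirs` is hypothetical, and the demo writes to a temporary directory rather than the real output path.

```python
import tempfile
from pathlib import Path


def ensure_output_dirs(out_dir):
    """Create the figures/ and tables/ subdirectories if they are missing."""
    fig_dir = Path(out_dir) / "figures"
    tab_dir = Path(out_dir) / "tables"
    for directory in (fig_dir, tab_dir):
        # parents=True creates out_dir as well; exist_ok=True makes reruns safe.
        directory.mkdir(parents=True, exist_ok=True)
    return fig_dir, tab_dir


# Demo against a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    FIG_DIR, TAB_DIR = ensure_output_dirs(Path(tmp) / "chem2textqa_diversity")
    (TAB_DIR / "diversity_analysis.tex").write_text("% placeholder\n")
```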
| ) | ||
| plt.suptitle("", fontsize=1) # Suppress default suptitle | ||
| plt.savefig(FIG_DIR / "figure1_question_diversity.pdf", dpi=300, bbox_inches="tight") |
FIG_DIR and TAB_DIR are never created in C4 before writing files. Unlike C1/C2/C3, this script can fail at runtime with FileNotFoundError when saving figures or the LaTeX table. Add FIG_DIR.mkdir(parents=True, exist_ok=True) and TAB_DIR.mkdir(parents=True, exist_ok=True) early in the script.
| \end{table} | ||
| """ | ||
| with open(TAB_DIR / "diversity_analysis.tex", "w") as f: |
FIG_DIR and TAB_DIR are never created in C4 before writing files. Unlike C1/C2/C3, this script can fail at runtime with FileNotFoundError when saving figures or the LaTeX table. Add FIG_DIR.mkdir(parents=True, exist_ok=True) and TAB_DIR.mkdir(parents=True, exist_ok=True) early in the script.
| for _ in range(n_pairs): | ||
| i, j = random.sample(range(n_texts), 2) |
Computing 500k similarities with a pure-Python loop and random.sample(range(n_texts), 2) each iteration is a major runtime hotspot. Consider generating index arrays in bulk (e.g., with NumPy RNG), computing dot products vectorized, and then slicing down to n_pairs; this is substantially faster and more reproducible.
| cos_sim = float(np.dot(embeddings[i], embeddings[j])) | ||
| cosine_sims.append(cos_sim) |
Computing 500k similarities with a pure-Python loop and random.sample(range(n_texts), 2) each iteration is a major runtime hotspot. Consider generating index arrays in bulk (e.g., with NumPy RNG), computing dot products vectorized, and then slicing down to n_pairs; this is substantially faster and more reproducible.
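A vectorized version might look like the sketch below. It assumes, as the `np.dot` call in the hunk implies, that the rows of `embeddings` are already L2-normalized, so a dot product equals cosine similarity. Instead of oversampling and slicing down to `n_pairs`, this variant draws a nonzero offset for the second index, which guarantees `i != j` without rejection. The function name, `seed`, and `batch_size` are illustrative choices, not part of the PR.

```python
import numpy as np


def sample_pairwise_cosine(embeddings, n_pairs, seed=0, batch_size=50_000):
    """Cosine similarity for n_pairs random pairs of distinct rows.

    Assumes each row of `embeddings` is L2-normalized.
    """
    n_texts = embeddings.shape[0]
    if n_texts < 2:
        raise ValueError("need at least two embeddings to form a pair")

    rng = np.random.default_rng(seed)  # seeded generator for reproducibility
    i = rng.integers(0, n_texts, size=n_pairs)
    # An offset in [1, n_texts - 1] guarantees j != i after the modulo.
    offset = rng.integers(1, n_texts, size=n_pairs)
    j = (i + offset) % n_texts

    sims = np.empty(n_pairs, dtype=np.float64)
    # Process in batches so the gathered rows do not have to fit in memory at once.
    for start in range(0, n_pairs, batch_size):
        sl = slice(start, start + batch_size)
        # Row-wise dot product of the two gathered blocks.
        sims[sl] = np.einsum("ij,ij->i", embeddings[i[sl]], embeddings[j[sl]])
    return sims
```

Passing an explicit seed means the sampled pairs, and therefore the reported mean cosine, are the same on every run.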
| if i >= next_checkpoint: | ||
| ttr = len(vocab) / len(vocab.union(set(all_tokens[i-checkpoint_interval:i+1]))) | ||
| token_counts.append(i) | ||
| ttr_values.append(len(vocab) / (i + 1)) |
The local variable ttr is computed but never used, and the expression itself is confusing (it mixes unique types with a windowed token slice). Removing the unused computation reduces confusion and avoids unnecessary per-checkpoint overhead.
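With the unused computation removed, the checkpoint loop reduces to a cumulative type count divided by a cumulative token count. The sketch below is one way to write it; the function name `ttr_curve` is hypothetical, and it records the 1-based number of tokens seen rather than the zero-based index used in the hunk.

```python
def ttr_curve(all_tokens, checkpoint_interval):
    """Return (token_counts, ttr_values) sampled every checkpoint_interval tokens."""
    vocab = set()
    token_counts = []
    ttr_values = []
    for n_seen, token in enumerate(all_tokens, start=1):
        vocab.add(token)
        if n_seen % checkpoint_interval == 0:
            token_counts.append(n_seen)
            # TTR = unique types seen so far / total tokens seen so far.
            ttr_values.append(len(vocab) / n_seen)
    return token_counts, ttr_values


counts, ttrs = ttr_curve("a b a c b a".split(), checkpoint_interval=2)
# counts == [2, 4, 6]; ttrs == [1.0, 0.75, 0.5]
```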
| for _ in range(n_bootstrap): | ||
| random.shuffle(texts) |
sample_ttr() mutates its input list in-place by shuffling texts. Since corpus_stats() passes through the original lists used elsewhere in the script, this introduces hidden side effects and can make results harder to reproduce/debug. Shuffle a copy (e.g., texts_shuf = texts[:]) or sample indices instead.
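A side-effect-free variant could sample from the list instead of shuffling it, as in the sketch below. The body of the real `sample_ttr()` is not shown in the diff, so the whitespace tokenization, the `sample_size` parameter, and the use of a private seeded `random.Random` instance are all assumptions made for illustration.

```python
import random
import statistics


def sample_ttr(texts, sample_size, n_bootstrap=10, seed=0):
    """Mean and std of TTR over bootstrap samples, leaving `texts` untouched."""
    rng = random.Random(seed)  # private RNG; does not disturb global random state
    k = min(sample_size, len(texts))
    ttrs = []
    for _ in range(n_bootstrap):
        # rng.sample returns a new list, so the caller's list is never reordered.
        sample = rng.sample(texts, k)
        tokens = [tok for text in sample for tok in text.split()]
        if tokens:
            ttrs.append(len(set(tokens)) / len(tokens))
    if not ttrs:
        raise ValueError("no tokens found in any bootstrap sample")
    return statistics.mean(ttrs), statistics.pstdev(ttrs)
```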
| CHEM2TEXT_PATH = "/data/luis/Chem2TextHackathon/full_premium_kimi/dataset_gold.jsonl" | ||
| OUT_DIR = Path("/data/asahu/projects/mutqa/chem2textqa_diversity") |
Hard-coded absolute paths make the pipeline non-portable and prevent running it from the repo checkout (and also mean the committed caches/tables in Task7_DiversityAnalysis/ aren’t what the scripts will read/write by default). Make paths configurable (CLI args via argparse, environment variables, or default to Path(__file__).resolve().parent) so reviewers/CI can run the analysis consistently.
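One possible shape for configurable paths is sketched below, combining `argparse` with environment-variable fallbacks and a default output directory next to the script. The flag names and the environment variable names `CHEM2TEXT_PATH` and `DIVERSITY_OUT_DIR` are hypothetical, not names used by the PR.

```python
import argparse
import os
from pathlib import Path


def parse_paths(argv=None):
    """Resolve input/output paths from CLI flags, then env vars, then defaults."""
    # Default to the script's own directory; fall back to the CWD when
    # __file__ is unavailable (for example in an interactive session).
    default_root = (
        Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()
    )
    parser = argparse.ArgumentParser(description="Chem2TextQA diversity analysis")
    parser.add_argument(
        "--chem2text-path",
        type=Path,
        default=os.environ.get("CHEM2TEXT_PATH"),  # hypothetical env var
        help="Path to the Chem2TextQA JSONL file",
    )
    parser.add_argument(
        "--out-dir",
        type=Path,
        default=os.environ.get("DIVERSITY_OUT_DIR", str(default_root)),  # hypothetical env var
        help="Directory for caches, figures, and tables",
    )
    args = parser.parse_args(argv)
    if args.chem2text_path is None:
        parser.error("provide --chem2text-path or set CHEM2TEXT_PATH")
    return args
```

A caller would then derive the subdirectories from the parsed value, for example `FIG_DIR = args.out_dir / "figures"`, so the committed caches and tables are the ones the scripts read and write by default.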