Task 7: Chem2TextQA Diversity Analysis #9

Open
avi-lab wants to merge 1 commit into main from Task7_DiversityAnalysis

avi-lab (Collaborator) commented Apr 25, 2026

Comprehensive diversity analysis comparing Chem2TextQA against the SMolInstruct baseline.

Key Results:

  • Chem2TextQA: 46,992 question vocab, 328,159 answer vocab
  • SMolInstruct: 79 question vocab, 32 answer vocab
  • Diversity gap: 595× larger vocab, 2,480× more unique question stems
  • Topics: 2,169 fine-grained (vs. 14 templates in baseline)

Deliverables:

  • 4 analysis scripts (C1-C4 pipeline)
  • 10 figures (3 final manuscript-ready PDFs + 7 intermediate)
  • LaTeX diversity table
  • JSON caches with all statistics

Scripts:

  • C1: Lexical diversity, topic distribution, structural analysis
  • C2: Semantic embeddings and cosine similarity
  • C3: Type-Token Ratio decay curves
  • C4: Final composite figures + LaTeX table

See ANALYSIS_SUMMARY.md and README.md for full details.

Copilot AI left a comment

Pull request overview

Note: Copilot was unable to run its full agentic suite in this review.

Adds a full C1–C4 analysis pipeline to quantify and visualize lexical/semantic/topic diversity for Chem2TextQA vs. a SMolInstruct baseline, along with cached outputs and manuscript-ready artifacts.

Changes:

  • Introduces 4 analysis scripts (C1–C4) to compute lexical stats, semantic cosine similarity, TTR decay curves, and final composite figures/table.
  • Adds cached JSON outputs plus a LaTeX table for inclusion in a manuscript.
  • Adds documentation summarizing results and how to run the pipeline.

Reviewed changes

Copilot reviewed 9 out of 29 changed files in this pull request and generated 14 comments.

| File | Description |
| --- | --- |
| Task7_DiversityAnalysis/C1_diversity_analysis.py | Computes lexical/topic/structural stats; writes caches, figures, and draft LaTeX table |
| Task7_DiversityAnalysis/C2_semantic_diversity.py | Computes embedding-based cosine similarity stats and histogram figures |
| Task7_DiversityAnalysis/C3_ttr_curve.py | Generates TTR decay curve figures for questions/answers |
| Task7_DiversityAnalysis/C4_combined_figures.py | Assembles final manuscript figures and writes the final LaTeX table |
| Task7_DiversityAnalysis/README.md | Provides run instructions and headline results |
| Task7_DiversityAnalysis/ANALYSIS_SUMMARY.md | Detailed narrative summary of findings and outputs |
| Task7_DiversityAnalysis/diversity_summary.json | Cached lexical/topic/structural statistics from C1 |
| Task7_DiversityAnalysis/semantic_diversity_stats.json | Cached semantic cosine similarity statistics from C2 |
| Task7_DiversityAnalysis/tables/diversity_analysis.tex | Publication-ready table for manuscript inclusion |

Comment on lines +360 to +365
stats_list = [
    ("Chem2TextQA Questions (188k)", q_chem_stats),
    ("SMolInstruct Questions (all)", q_smol_stats),
    ("Chem2TextQA Answers (188k)", a_chem_stats),
    ("SMolInstruct Answers (all)", a_smol_stats),
]

Copilot AI Apr 25, 2026

summary_c2 is keyed by "Chem2TextQA Questions", "SMolInstruct Questions", "Chem2TextQA Answers", "SMolInstruct Answers", but C4 looks up cosine stats using display strings like "Chem2TextQA Questions (188k)". This guarantees mean_cos falls back to 0, which matches the committed LaTeX table showing 0.000 cosines. Fix by using canonical keys for lookup (and keep display labels separate), e.g., store (display_name, stats, semantic_key) or a mapping from display name → semantic JSON key.

Copilot uses AI. Check for mistakes.
Comment on lines +387 to +391
ttr_str = f"{s['sampled_ttr']:.3f}$\\pm${s['sampled_ttr_std']:.3f}"
mean_cos = summary_c2.get(name, {}).get("mean_cosine", 0)
mean_cos_str = f"{mean_cos:.3f}"
mlen = f"{s['mean_length']:.1f}"
latex += f"{name} & {n} & {vocab} & {tri} & {ttr_str} & {mean_cos_str} & {mlen} \\\\\n"

Copilot AI Apr 25, 2026

summary_c2 is keyed by "Chem2TextQA Questions", "SMolInstruct Questions", "Chem2TextQA Answers", "SMolInstruct Answers", but C4 looks up cosine stats using display strings like "Chem2TextQA Questions (188k)". This guarantees mean_cos falls back to 0, which matches the committed LaTeX table showing 0.000 cosines. Fix by using canonical keys for lookup (and keep display labels separate), e.g., store (display_name, stats, semantic_key) or a mapping from display name → semantic JSON key.
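
The suggested fix can be sketched as follows. This is a minimal illustration, not the PR's code: the `mean_cosine_for` helper and the numeric values in `summary_c2` are placeholders invented for the example, and only the key strings come from the comment above. Each entry carries a display label and a separate canonical key, and the lookup raises instead of silently falling back to 0.

```python
# Placeholder stand-in for the cached C2 JSON; the numbers are NOT results from the PR.
summary_c2 = {
    "Chem2TextQA Questions": {"mean_cosine": 0.1},
    "SMolInstruct Questions": {"mean_cosine": 0.2},
    "Chem2TextQA Answers": {"mean_cosine": 0.3},
    "SMolInstruct Answers": {"mean_cosine": 0.4},
}


def mean_cosine_for(semantic_key, summary):
    """Hypothetical helper: look up a mean cosine and fail loudly on a missing key."""
    try:
        return summary[semantic_key]["mean_cosine"]
    except KeyError as exc:
        raise KeyError(f"no semantic stats for {semantic_key!r}") from exc


# (display_name, semantic_key): the label shown in the table vs. the JSON key.
stats_list = [
    ("Chem2TextQA Questions (188k)", "Chem2TextQA Questions"),
    ("SMolInstruct Questions (all)", "SMolInstruct Questions"),
    ("Chem2TextQA Answers (188k)", "Chem2TextQA Answers"),
    ("SMolInstruct Answers (all)", "SMolInstruct Answers"),
]

rows = [
    (display, f"{mean_cosine_for(key, summary_c2):.3f}")
    for display, key in stats_list
]
```

Raising on a missing key is a design choice for the sketch: a default of 0 is what let the 0.000 column reach the committed table unnoticed.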

Comment on lines +23 to +25
OUT_DIR = Path("/data/asahu/projects/mutqa/chem2textqa_diversity")
FIG_DIR = OUT_DIR / "figures"
TAB_DIR = OUT_DIR / "tables"

Copilot AI Apr 25, 2026

FIG_DIR and TAB_DIR are never created in C4 before writing files. Unlike C1/C2/C3, this script can fail at runtime with FileNotFoundError when saving figures or the LaTeX table. Add FIG_DIR.mkdir(parents=True, exist_ok=True) and TAB_DIR.mkdir(parents=True, exist_ok=True) early in the script.
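
A minimal sketch of that change, assuming the same `figures`/`tables` layout under the output directory. The `ensure_output_dirs` helper is hypothetical, and a temporary directory stands in for the real `OUT_DIR` so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def ensure_output_dirs(out_dir: Path):
    """Hypothetical helper: create figures/ and tables/ under out_dir if missing."""
    fig_dir = out_dir / "figures"
    tab_dir = out_dir / "tables"
    for directory in (fig_dir, tab_dir):
        # parents=True creates out_dir itself; exist_ok=True makes reruns harmless.
        directory.mkdir(parents=True, exist_ok=True)
    return fig_dir, tab_dir


with tempfile.TemporaryDirectory() as tmp:
    fig_dir, tab_dir = ensure_output_dirs(Path(tmp) / "chem2textqa_diversity")
    # Writing now succeeds even though the directories did not exist beforehand.
    (tab_dir / "diversity_analysis.tex").write_text("% placeholder table\n")
    created = fig_dir.is_dir() and tab_dir.is_dir()
```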

)

plt.suptitle("", fontsize=1) # Suppress default suptitle
plt.savefig(FIG_DIR / "figure1_question_diversity.pdf", dpi=300, bbox_inches="tight")

Copilot AI Apr 25, 2026

FIG_DIR and TAB_DIR are never created in C4 before writing files. Unlike C1/C2/C3, this script can fail at runtime with FileNotFoundError when saving figures or the LaTeX table. Add FIG_DIR.mkdir(parents=True, exist_ok=True) and TAB_DIR.mkdir(parents=True, exist_ok=True) early in the script.

\end{table}
"""

with open(TAB_DIR / "diversity_analysis.tex", "w") as f:

Copilot AI Apr 25, 2026

FIG_DIR and TAB_DIR are never created in C4 before writing files. Unlike C1/C2/C3, this script can fail at runtime with FileNotFoundError when saving figures or the LaTeX table. Add FIG_DIR.mkdir(parents=True, exist_ok=True) and TAB_DIR.mkdir(parents=True, exist_ok=True) early in the script.

Comment on lines +104 to +105
for _ in range(n_pairs):
    i, j = random.sample(range(n_texts), 2)

Copilot AI Apr 25, 2026

Computing 500k similarities with a pure-Python loop and random.sample(range(n_texts), 2) each iteration is a major runtime hotspot. Consider generating index arrays in bulk (e.g., with NumPy RNG), computing dot products vectorized, and then slicing down to n_pairs; this is substantially faster and more reproducible.

Comment on lines +109 to +110
cos_sim = float(np.dot(embeddings[i], embeddings[j]))
cosine_sims.append(cos_sim)

Copilot AI Apr 25, 2026

Computing 500k similarities with a pure-Python loop and random.sample(range(n_texts), 2) each iteration is a major runtime hotspot. Consider generating index arrays in bulk (e.g., with NumPy RNG), computing dot products vectorized, and then slicing down to n_pairs; this is substantially faster and more reproducible.
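
One way to vectorize the sampling is sketched below. It assumes the embedding rows are already L2-normalized (as the plain `np.dot` in the original implies), so a row-wise dot product equals cosine similarity. The function name `sample_pairwise_cosine` and the `seed` parameter are illustrative, not taken from the PR. Instead of oversampling and slicing, this variant draws `j` as a nonzero offset from `i`, which guarantees `i != j` without a filtering step.

```python
import numpy as np


def sample_pairwise_cosine(embeddings, n_pairs, seed=0):
    """Cosine similarity for n_pairs random pairs of distinct rows.

    Assumes each row of `embeddings` has unit L2 norm.
    """
    rng = np.random.default_rng(seed)  # seeded generator for reproducibility
    n_texts = embeddings.shape[0]
    i = rng.integers(0, n_texts, size=n_pairs)
    # An offset in [1, n_texts - 1] added modulo n_texts can never land back on i.
    offset = rng.integers(1, n_texts, size=n_pairs)
    j = (i + offset) % n_texts
    # Row-wise dot products computed in one vectorized call.
    return np.einsum("ij,ij->i", embeddings[i], embeddings[j])


# Synthetic unit-norm embeddings, only to exercise the function.
demo_rng = np.random.default_rng(1)
emb = demo_rng.normal(size=(1000, 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
sims = sample_pairwise_cosine(emb, n_pairs=5000, seed=42)
```

Passing the same `seed` returns the same pairs, which addresses the reproducibility point as well as the runtime one.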

Comment on lines +90 to +93
if i >= next_checkpoint:
    ttr = len(vocab) / len(vocab.union(set(all_tokens[i-checkpoint_interval:i+1])))
    token_counts.append(i)
    ttr_values.append(len(vocab) / (i + 1))

Copilot AI Apr 25, 2026

The local variable ttr is computed but never used, and the expression itself is confusing (it mixes unique types with a windowed token slice). Removing the unused computation reduces confusion and avoids unnecessary per-checkpoint overhead.
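
With the unused variable dropped, the checkpoint loop reduces to a cumulative type/token ratio. The sketch below is an assumption about the surrounding function, since only four lines of it are visible in the review; `ttr_curve` and its parameters are illustrative names. It also records the token count as `i + 1` so the x-value matches the denominator, which differs slightly from the original's `token_counts.append(i)`.

```python
def ttr_curve(all_tokens, checkpoint_interval):
    """Cumulative type-token ratio sampled every `checkpoint_interval` tokens."""
    vocab = set()
    token_counts, ttr_values = [], []
    next_checkpoint = checkpoint_interval
    for i, token in enumerate(all_tokens):
        vocab.add(token)
        n_seen = i + 1  # tokens consumed so far
        if n_seen >= next_checkpoint:
            token_counts.append(n_seen)
            ttr_values.append(len(vocab) / n_seen)  # unique types / total tokens
            next_checkpoint += checkpoint_interval
    return token_counts, ttr_values


counts, values = ttr_curve(["a", "b", "a", "c", "a", "b"], checkpoint_interval=2)
# counts -> [2, 4, 6], values -> [1.0, 0.75, 0.5]
```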

Comment on lines +51 to +52
for _ in range(n_bootstrap):
    random.shuffle(texts)

Copilot AI Apr 25, 2026

sample_ttr() mutates its input list in-place by shuffling texts. Since corpus_stats() passes through the original lists used elsewhere in the script, this introduces hidden side effects and can make results harder to reproduce/debug. Shuffle a copy (e.g., texts_shuf = texts[:]) or sample indices instead.
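
A side-effect-free variant might look like the sketch below. The real signature of `sample_ttr()` is not shown in the review, so the parameters here (`sample_size`, `n_bootstrap`, `seed`) and the whitespace tokenization are assumptions. `random.Random.sample` draws without replacement and leaves the input list untouched.

```python
import random
import statistics


def sample_ttr(texts, sample_size, n_bootstrap=10, seed=0):
    """Mean and std of TTR over bootstrap subsamples; does not modify `texts`."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched too
    ttrs = []
    for _ in range(n_bootstrap):
        subset = rng.sample(texts, min(sample_size, len(texts)))
        # Assumed tokenization: lowercase and split on whitespace.
        tokens = [tok for text in subset for tok in text.lower().split()]
        if tokens:
            ttrs.append(len(set(tokens)) / len(tokens))
    if not ttrs:
        raise ValueError("no tokens found in any subsample")
    return statistics.mean(ttrs), statistics.pstdev(ttrs)


texts = ["what is the boiling point", "name the functional group", "is it aromatic"]
original_order = list(texts)
mean_ttr, std_ttr = sample_ttr(texts, sample_size=2, n_bootstrap=5, seed=7)
```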

Comment on lines +23 to +24
CHEM2TEXT_PATH = "/data/luis/Chem2TextHackathon/full_premium_kimi/dataset_gold.jsonl"
OUT_DIR = Path("/data/asahu/projects/mutqa/chem2textqa_diversity")

Copilot AI Apr 25, 2026

Hard-coded absolute paths make the pipeline non-portable and prevent running it from the repo checkout (and also mean the committed caches/tables in Task7_DiversityAnalysis/ aren’t what the scripts will read/write by default). Make paths configurable (CLI args via argparse, environment variables, or default to Path(__file__).resolve().parent) so reviewers/CI can run the analysis consistently.
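
The options named above can be combined, as in this sketch. The flag names (`--chem2text-path`, `--out-dir`) and the environment variables (`CHEM2TEXT_PATH`, `CHEM2TEXT_OUT_DIR`) are hypothetical choices for illustration; the PR does not define them.

```python
import argparse
import os
from pathlib import Path


def default_out_dir() -> Path:
    """Directory containing this script, or the working directory if run interactively."""
    try:
        return Path(__file__).resolve().parent
    except NameError:  # __file__ is undefined in a REPL or notebook
        return Path.cwd()


def parse_paths(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Chem2TextQA diversity analysis paths")
    parser.add_argument(
        "--chem2text-path",
        type=Path,
        default=os.environ.get("CHEM2TEXT_PATH"),  # hypothetical env var
        required="CHEM2TEXT_PATH" not in os.environ,
        help="Path to the Chem2TextQA JSONL file",
    )
    parser.add_argument(
        "--out-dir",
        type=Path,
        default=os.environ.get("CHEM2TEXT_OUT_DIR", default_out_dir()),
        help="Directory for caches, figures/ and tables/",
    )
    return parser.parse_args(argv)


# Explicit argv so the example does not depend on how the script was launched.
args = parse_paths(["--chem2text-path", "data/dataset_gold.jsonl", "--out-dir", "results"])
fig_dir = args.out_dir / "figures"
tab_dir = args.out_dir / "tables"
```

Defaulting the output directory to the script's own folder means the committed caches and tables in `Task7_DiversityAnalysis/` are what a fresh checkout reads and writes.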
