GitHub - YuvMilo/MechanisticAccountofSinks

Code for reproducing the experiments in the paper "A Mechanistic Account of Attention Sinks in GPT-2: One Circuit, Broader Implications for Mitigation".

Setup

conda create -n sinks python=3.11 -y
conda activate sinks
pip install -r requirements.txt

Reproducing the Figures and Tables

Figure 1 — Source-Agnostic Shift Histogram (truncated) + Appendix Full Histogram

python experiments_statistical.py --mode bias-term --output-dir results

Outputs:

results/bias_term_statistical/bq_k_aggregate_plot_truncated.png → Fig 1
results/bias_term_statistical/bq_k_aggregate_plot.png → Fig 6 (appendix, full histogram)

Figure 2 — EPE-Bias Projection Alignment

python experiments_single_input.py --mode epe-bias-proj --output-dir results

Output:

results/epe_bias_proj/epe_alignment.png → Fig 2

Figure 3 — EPE Captures the Net Positional Contribution

python experiments_statistical.py --mode epe-validation --output-dir results

Outputs:

results/epe_validation_statistical/epe_validation_plot.png → Fig 3
results/epe_validation_statistical/epe_validation_precentiles (numerical values for experiments)

Figure 4 — Coordinate-Level Alignment Histogram (truncated) + Appendix Full Histogram

python experiments_statistical.py --mode coord-alignment --output-dir results

Outputs:

results/coord_alignment_statistical/coord_alignment_histogram_truncated.png → Fig 4
results/coord_alignment_statistical/coord_alignment_histogram.png → Fig 8 (appendix, full histogram)

Figure 5 — Intervention Attention Maps

python intervention_analysis.py --mode sentence --output-dir results

Outputs:

results/sentence_analysis/layer_04_avg.png through layer_11_avg.png → Fig 5 (layers 4-11)
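
To confirm that the sentence-mode run produced every per-layer map, the expected files can be enumerated and checked for existence. This is a minimal sketch, not part of the repository: the helper names are hypothetical, and it assumes the files between the two named endpoints follow the same zero-padded `layer_NN_avg.png` pattern.

```python
from pathlib import Path


def expected_layer_maps(output_dir="results"):
    """Return the per-layer attention-map paths expected for Fig 5 (layers 4 to 11)."""
    base = Path(output_dir) / "sentence_analysis"
    # Assumption: intermediate files use the same zero-padded naming as
    # layer_04_avg.png and layer_11_avg.png, which the README names explicitly.
    return [base / f"layer_{layer:02d}_avg.png" for layer in range(4, 12)]


def missing_layer_maps(output_dir="results"):
    """Return the expected Fig 5 files that are not present on disk."""
    return [path for path in expected_layer_maps(output_dir) if not path.is_file()]


if __name__ == "__main__":
    missing = missing_layer_maps()
    if missing:
        print("Missing Fig 5 outputs:")
        for path in missing:
            print(f"  {path}")
    else:
        print("All Fig 5 layer maps are present.")
```

Running the script from the repository root after the command above should report no missing files if the run completed.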

Table 1 — BOS Attention Statistics

python intervention_analysis.py --mode dataset --output-dir results

Outputs:

results/dataset_analysis/bos_attention_summary_mid_layers.txt → Table 1
results/dataset_analysis/bos_attention_summary_mid_layers.csv
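
The CSV copy of Table 1 can be inspected programmatically. The sketch below is illustrative only: the README does not document the column layout, so the code reads the header and rows generically and makes no assumption about column names. `load_summary` is a hypothetical helper, not a function from the repository.

```python
import csv
from pathlib import Path

DEFAULT_CSV = Path("results/dataset_analysis/bos_attention_summary_mid_layers.csv")


def load_summary(csv_path=DEFAULT_CSV):
    """Read the summary CSV and return (header, rows) as lists of strings."""
    with Path(csv_path).open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])              # first line is treated as the header
        rows = [row for row in reader if row]  # skip blank lines
    return header, rows


if __name__ == "__main__":
    if DEFAULT_CSV.is_file():
        header, rows = load_summary()
        print(", ".join(header))
        for row in rows:
            print(", ".join(row))
    else:
        print(f"{DEFAULT_CSV} not found; run the dataset mode first.")
```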

Figure 7 (appendix) — Massive Activations in EPE_1

python experiments_single_input.py --mode massive-activations --output-dir results

Output:

results/massive_activations/massive_activations_in_ppe.png → Fig 7
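
To regenerate everything in one pass, the commands above can be chained from a small driver. This is a sketch under stated assumptions rather than a script shipped with the repository: the script names, `--mode` values, and `--output-dir` flag are copied from the commands in this README, while `build_commands`, `run_all`, and the wrapper's own `--run` switch are hypothetical. By default it only prints the commands.

```python
import shlex
import subprocess
import sys

# (script, mode) pairs copied from the commands in this README, in figure order.
STEPS = [
    ("experiments_statistical.py", "bias-term"),             # Fig 1, Fig 6
    ("experiments_single_input.py", "epe-bias-proj"),        # Fig 2
    ("experiments_statistical.py", "epe-validation"),        # Fig 3
    ("experiments_statistical.py", "coord-alignment"),       # Fig 4, Fig 8
    ("intervention_analysis.py", "sentence"),                # Fig 5
    ("intervention_analysis.py", "dataset"),                 # Table 1
    ("experiments_single_input.py", "massive-activations"),  # Fig 7
]


def build_commands(output_dir="results", python=sys.executable):
    """Return one argv list per reproduction step."""
    return [
        [python, script, "--mode", mode, "--output-dir", output_dir]
        for script, mode in STEPS
    ]


def run_all(output_dir="results", dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for cmd in build_commands(output_dir):
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failing step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "--run" is a switch of this wrapper only (hypothetical), not of the repo scripts.
    run_all(dry_run="--run" not in sys.argv)
```

Saved under any name in the repository root, the script lists the seven commands when invoked without arguments and executes them in order when passed `--run`.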
