Replication package for "ToxiShield: Promoting Inclusive Developer Communication through Real-Time Toxicity Filtering" (FSE 2026).
ToxiShield is a Chrome extension for real-time toxicity detection and detoxification in GitHub pull request reviews. It is built around three ML modules, evaluated end-to-end through a user study with 10 professional developers.
| Paper Section | Folder | What you can reproduce |
|---|---|---|
| §2 – Toxicity Filter | [toxicity-filter/](toxicity-filter/) | Train BERT binary classifier; verify 98% accuracy / F1=0.97 |
| §3 – Communication Coach | [communication-coach/](communication-coach/) | Verify Claude 3.5 Sonnet Macro F1=0.42, MCC=0.39 |
| §4 – The Reframer | [reframer/](reframer/) | Fine-tune Llama 3.2 3B; evaluate J-Score=84% |
| §5 – Browser Extension | [browser-extension/](browser-extension/) | Install and run the full system |
| §5.1 – User Study | [survey/](survey/) | Inspect TAM survey responses |
| Validation | [manual-validation/](manual-validation/) | Reproduce Cohen's κ for both annotation tasks |

ToxiShield operates as a three-module pipeline triggered when a developer types a GitHub PR comment:
Developer types PR comment
│
▼
┌──────────────────────┐
│ Module 1 │ BERT-base-uncased (INT8 ONNX, runs in-browser)
│ Toxicity Filter │ Binary: toxic / non-toxic
└──────┬───────────────┘
│ if toxic
▼
┌──────────────────────┐
│ Module 2 │ Claude 3.5 Sonnet (LLM, zero-shot)
│ Communication │ 12-class subcategory + explanation
│ Coach │
└──────┬───────────────┘
│
▼
┌──────────────────────┐
│ Module 3 │ Llama 3.2 3B (LoRA fine-tuned, served via Ollama)
│ The Reframer │ Generates detoxified alternative + rationale
└──────────────────────┘
│
▼
Inline suggestion shown to developer (accept / discard / rate)
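
The control flow in the diagram above can be sketched as follows. This is a minimal illustration, not the extension's actual code: `run_pipeline`, `Suggestion`, the toy stand-in functions, and the `insult` label are all hypothetical names, and passing only the raw comment to the Reframer is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Suggestion:
    subcategory: str  # label assigned by the Communication Coach
    explanation: str  # why the comment was flagged
    rewrite: str      # detoxified alternative from the Reframer


def run_pipeline(
    comment: str,
    is_toxic: Callable[[str], bool],
    coach: Callable[[str], Tuple[str, str]],
    reframe: Callable[[str], str],
) -> Optional[Suggestion]:
    """Return a suggestion for a toxic comment, or None if it passes the filter."""
    # Module 1: binary check; non-toxic comments stop here.
    if not is_toxic(comment):
        return None
    # Module 2: subcategory plus explanation.
    subcategory, explanation = coach(comment)
    # Module 3: detoxified rewrite.
    rewrite = reframe(comment)
    return Suggestion(subcategory, explanation, rewrite)


# Toy stand-ins (hypothetical) so the sketch runs without any model.
def toy_filter(text: str) -> bool:
    return "stupid" in text.lower()


def toy_coach(text: str) -> Tuple[str, str]:
    return ("insult", "The comment attacks the author rather than the code.")


def toy_reframe(text: str) -> str:
    return "Could you explain the reasoning behind this change?"


clean = run_pipeline("Looks good to me.", toy_filter, toy_coach, toy_reframe)
flagged = run_pipeline("This is a stupid change.", toy_filter, toy_coach, toy_reframe)
```

Passing the three modules in as callables keeps the gating logic separate from the models, so each stage can be swapped for a stub when testing.
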
| Module | Model | Primary Metric | Result |
|---|---|---|---|
| Toxicity Filter | BERT-base-uncased (fine-tuned) | F1 (toxic class) | 0.97 |
| Communication Coach | Claude 3.5 Sonnet | Macro F1 / MCC | 0.42 / 0.39 |
| The Reframer | Llama 3.2 3B (LoRA) | J-Score | 84.00% |
What it does: Binary classification of PR comments (toxic vs non-toxic). Fine-tunes bert-base-uncased on a curated dataset of 38,761 labelled PR comments from 15M GitHub PRs. Best model exported to ONNX INT8 for in-browser inference.
Dataset: 10,120 toxic samples (stratified sampling across ToxiCR probability bins) + 28,641 non-toxic samples. Available at toxishield/38k-dataset-labelled.
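
One way to picture stratified sampling across probability bins is sketched below. The bin count, the per-bin quota, and the function name `stratified_sample` are assumptions made for illustration; the package's actual sampling procedure may differ.

```python
import random
from collections import defaultdict


def stratified_sample(scored, n_bins=10, per_bin=2, seed=0):
    """Draw up to `per_bin` items from each equal-width probability bin.

    `scored` is a list of (comment, probability) pairs with probability in [0, 1].
    """
    bins = defaultdict(list)
    for comment, prob in scored:
        # Clamp so that a probability of exactly 1.0 lands in the last bin.
        idx = min(int(prob * n_bins), n_bins - 1)
        bins[idx].append((comment, prob))

    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    sample = []
    for idx in sorted(bins):
        members = bins[idx]
        sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sample


# Hypothetical scores: three low, two mid-range, one high.
scored = [("a", 0.05), ("b", 0.06), ("c", 0.07), ("d", 0.55), ("e", 0.56), ("f", 0.95)]
picked = stratified_sample(scored, n_bins=10, per_bin=2)
```

Sampling per bin rather than uniformly keeps borderline and high-confidence cases both represented, even when one range of scores dominates the raw data.
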
To verify results (no training needed):
cd toxicity-filter
pip install -r requirements.txt
cat results/kfold-metrics/cross_validation_results.csv

To re-run training (GPU required, ~10 hours):
jupyter notebook notebooks/train-classifier.ipynb
# Navigate to Section 2 (10-fold CV block); do not use Run All

Key files:
| Path | Description |
|---|---|
| data/38k-detection-dataset-full.csv | Full labelled dataset (38,761 samples) |
| notebooks/train-classifier.ipynb | Training + 10-fold CV + ONNX INT8 export |
| results/kfold-metrics/cross_validation_results.csv | Per-fold TP/TN/FP/FN, accuracy, F1 |
| results/kfold-misclassifications/misclassification_{1-10}.csv | All 701 misclassified instances |
| comparison/openai-detection-inference/ | GPT-4o zero-shot baseline (Table 2) |
| comparison/openai-detection-inference/compute_table2_metrics.py | Script to compute Table 2 metrics from JSONL |

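
The per-fold TP/TN/FP/FN counts are enough to recompute accuracy and F1 by hand. The sketch below uses made-up counts, since the CSV's column names are not documented here. The pooled-accuracy line assumes the 701 misclassifications cover all ten test folds, with each of the 38,761 samples tested exactly once.

```python
def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall, and F1 (positive class) from confusion counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative counts only, not taken from the results CSV.
example = confusion_metrics(tp=90, tn=95, fp=5, fn=10)

# Accuracy implied by the totals quoted in this README (assumption: pooled over folds).
pooled_accuracy = 1 - 701 / 38_761
```

Under that assumption the pooled accuracy comes out near 0.982, in line with the reported 98%.
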
What it does: Classifies toxic PR comments into 12 subcategories using prompt-engineered LLMs. No fine-tuning — uses in-context learning with iteratively refined prompts across 14 evolutionary stages.
Prompt stages and paper mapping:
| Iterations | Paper Stage | Key change |
|---|---|---|
| iter_1 | Stage 1 (zero-shot baseline) | Class names only, no guidance |
| iter_2–iter_5 | Stages 2–5 | Added behavior-based definitions, sarcasm handling, lexical cues, rare-category examples |
| iter_6–iter_8 | Stage 2.1–2.3 | Sub-refinements of Stage 2 |
| iter_9–iter_11 | Stage 3.1–3.3 | Sub-refinements of Stage 3 |
| iter_12–iter_14 | Stage 4.1–4.3 | Sub-refinements of Stage 4 |
Best prompt for cross-model comparison: iterations/iter_4/prompt.py (Table 4 results)
To verify results (no API key needed):
cd communication-coach
pip install -r requirements.txt
python scripts/evaluate.py
# Prints: Macro F1=0.42, Macro MCC=0.39 (matches Table 4)

To re-run inference (API key required):
cp .env.example .env # add OPENAI_API_KEY or ANTHROPIC_API_KEY
# 1. Set model and iteration in scripts/config.py (default: gpt-4o, iter_4)
# 2. Run inference:
python scripts/openai-inference.py
# Results saved to iterations/iter_4/results.csv
# To evaluate results:
python scripts/evaluate.py
# Writes mcc_by_prompt.json and per_label_confusion_matrices.pdf

Note: Table 4's best result (Claude 3.5 Sonnet) requires the Anthropic API client; openai-inference.py runs GPT models only. Stored Claude results are at iterations/iter_4/results-claude-3.5-sonnet.csv and can be evaluated directly without re-running inference.

Key files:
| Path | Description |
|---|---|
| scripts/config.py | Central config: model, iteration, dataset path, output path, labels |
| scripts/openai-inference.py | Runs inference for the configured model/iteration |
| scripts/evaluate.py | Computes Macro F1, MCC, exact match; writes confusion matrix PDF |
| notebooks/evaluate.ipynb | Notebook-based evaluation (set FILE_PATH to any results CSV) |
| data/multiclass-dataset-full.csv | 1,200-sample labelled multiclass dataset |
| iterations/iter_4/results-claude-3.5-sonnet.csv | Claude 3.5 Sonnet results (Table 4, best) |
| iterations/iter_4/results.csv | GPT-4o results (Table 4) |
| iterations/iter_4/mcc_by_prompt.json | MCC scores across iterations |

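
For readers unfamiliar with the metrics, the sketch below shows one common way to compute Macro F1 and Macro MCC: score each label as its own binary problem, then average. Whether scripts/evaluate.py does exactly this is an assumption, inferred from the "Macro MCC" wording and the per-label confusion matrices. The label names and data are hypothetical.

```python
from math import sqrt


def per_label_counts(y_true, y_pred, label):
    """Binary confusion counts for one label; each sample is a set of labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        in_truth, in_pred = label in truth, label in pred
        if in_truth and in_pred:
            tp += 1
        elif in_pred:
            fp += 1
        elif in_truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def macro_f1_mcc(y_true, y_pred, labels):
    """Unweighted mean of per-label F1 and per-label MCC."""
    f1s, mccs = [], []
    for label in labels:
        tp, fp, fn, tn = per_label_counts(y_true, y_pred, label)
        f1_den = 2 * tp + fp + fn
        f1s.append(2 * tp / f1_den if f1_den else 0.0)
        mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        mccs.append((tp * tn - fp * fn) / mcc_den if mcc_den else 0.0)
    return sum(f1s) / len(f1s), sum(mccs) / len(mccs)


# Hypothetical labels and predictions, for illustration only.
y_true = [{"insult"}, {"sarcasm"}, {"insult"}, {"sarcasm"}]
y_pred = [{"insult"}, {"insult"}, {"insult"}, {"sarcasm"}]
macro_f1, macro_mcc = macro_f1_mcc(y_true, y_pred, ["insult", "sarcasm"])
```

Macro averaging weights every subcategory equally, so rare categories pull the score down as much as frequent ones.
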
What it does: Text style transfer — rewrites toxic PR comments into professional alternatives while preserving technical meaning. Uses teacher-student knowledge distillation: GPT-4o generates training pairs, Llama 3.2 3B learns from them via LoRA.
Parallel dataset: 10,117 (toxic, detoxified) pairs generated by gpt-4o-2024-05-13. Available at toxishield/20k-dataset-parallel.
Evaluation metrics:
| Metric | What it measures | Notebook |
|---|---|---|
| DETOX | % reduction in toxicity score (via ToxiCR) | metric-incivility-decrease.ipynb |
| FL | Fluency via CoLA acceptability classifier | metric-style-acc-flu-sim.ipynb |
| PRESERVE | Semantic similarity via sentence-transformers | metric-style-acc-flu-sim.ipynb |
| J-Score | Harmonic mean of DETOX × FL × PRESERVE | metric-style-acc-flu-sim.ipynb |
To evaluate stored results (no GPU needed):
cd reframer/detoxifier-fine-tuning
pip install -r ../requirements.txt
jupyter notebook notebooks/metric-style-acc-flu-sim.ipynb
# Change INPUT_FILE to evaluate different models

To re-run fine-tuning (GPU ≥16 GB VRAM required, ~6 hours):
jupyter notebook notebooks/fine-tune-huggingface.ipynb
# Base model: unsloth/Llama-3.2-3B-Instruct-bnb-4bit
# Uses num_train_epochs=10, LoRA rank=16, lr=2e-4

Result files in detoxifier-fine-tuning/data/toxicr-outputs/:
| File | Table 6 model |
|---|---|
| toxicr-teacher-gpt-4o-05-13-llama-3.2-3b-input-output-cleaned.xlsx | Llama 3.2 3B (best, J=84%) |
| toxicr-teacher-gpt-4o-05-13-llama-3.1-8b-input-output-cleaned.xlsx | Llama 3.1 8B |
| toxicr-phi-3.5-10k-test-teacher-gpt-4o-05-13-input-output.xlsx | Phi 3.5 |
| toxicr-gemma-2b-10k-test-teacher-gpt-4o-05-13-input-output.xlsx | Gemma 2B |
| toxicr-ft-gpt-4o-mini-toxishield-test-inference-10k-input-output.csv | GPT-4o mini FT |
| toxicr-ft-gpt-35-baseline-test-inference-10k-input-output.csv | GPT-3.5 FT (baseline) |

Note: Qwen 2.5 Instruct 7B output file is not included in this package.
Sub-directory overview:
reframer/
├── parallel-dataset/ # Teacher model data generation scripts and raw outputs
├── openai-fine-tuning/ # GPT-4o-mini and GPT-3.5 fine-tuning data (comparison baselines)
├── detoxifier-fine-tuning/ # Llama 3.2 3B LoRA fine-tuning (primary model)
│ ├── notebooks/ # Training + all evaluation notebooks
│ └── data/toxicr-outputs/ # ToxiCR-scored input/output pairs for all 6 models
└── baseline-comparison/ # Prior-work baseline dataset and comparison results
What it does: Chrome extension (Manifest V3, React/Vite) that integrates all three modules into the developer workflow. Runs the BERT classifier locally via ONNX Runtime Web; calls the backend API for detoxification.
# 1. Start the backend
cd browser-extension/detoxifier-backend
cp .env.example .env # fill in OPENAI_API_KEY, DB credentials, OLLAMA_BASE_URL
npm install
npm start # starts on http://localhost:3000
# 2. Load the extension (pre-built artifact)
# Chrome → chrome://extensions → Developer mode → Load unpacked
# → select browser-extension/toxishield/

Extension source layout:
| Path | Description |
|---|---|
| browser-extension/src/ | React side-panel UI, inference logic, TypeScript types |
| browser-extension/public/content.js | Injected content script: intercepts GitHub PR comment forms |
| browser-extension/public/service_worker.js | Background worker: routes messages between content script and panel |
| browser-extension/public/static/vocab.json | BERT WordPiece vocabulary for in-browser tokenisation |
| browser-extension/toxishield/ | Pre-built unpacked extension (load this in Chrome) |
| browser-extension/detoxifier-backend/ | Node.js/Express API: detoxification inference + usage logging |

ONNX model: classifier_int8.onnx is not included (binary, ~80 MB). Generate it by running the ONNX export section of toxicity-filter/notebooks/train-classifier.ipynb, then place the file at browser-extension/public/bert-base/classifier_int8.onnx.
See [browser-extension/README.md](browser-extension/README.md) for full setup and development build instructions.
10 professional software developers (US and Bangladesh) used ToxiShield on real GitHub repositories for two weeks. IRB-approved. Survey materials and anonymised responses are in [survey/](survey/).
| File | Description |
|---|---|
| survey.xlsx | Anonymised responses (10 participants, 9 Likert items + open-ended) |
| ToxiShield_Developer_Survey_Guide.pdf | Study protocol and task instructions given to participants |
| ToxiShield _ Post-Study-Feedback-Form.pdf | Post-study TAM questionnaire |
| test-repository.txt | GitHub repository used during the two-week study |

cd manual-validation
pip install -r requirements.txt
jupyter notebook notebooks/kappa.ipynb

| Task | κ | Interpretation |
|---|---|---|
| Multiclass subcategory (100 samples) | 0.67 | Substantial |
| Detox quality — minimal change | 0.82 | Almost perfect |
| Detox quality — context preservation | 0.72 | Substantial |
| Detox quality — communication style | 0.77 | Substantial |
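
As a reminder of what the κ values above measure, here is a minimal pure-Python sketch of Cohen's κ for two raters. It is not the code in notebooks/kappa.ipynb. The interpretation bands follow the widely used Landis and Koch scale, which matches the labels in the table; that the authors used this scale is an assumption. The ratings are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters labelled independently at their own rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


def interpret(kappa):
    """Landis and Koch agreement bands."""
    for lower, label in [(0.80, "Almost perfect"), (0.60, "Substantial"),
                         (0.40, "Moderate"), (0.20, "Fair")]:
        if kappa > lower:
            return label
    return "Slight" if kappa >= 0 else "Poor"


# Hypothetical binary ratings from two annotators.
a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
kappa = cohen_kappa(a, b)
```

Here the raters agree on 8 of 10 items while chance agreement is 0.5, which gives κ of about 0.6.
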
Python 3.10+ required. Each module has its own requirements.txt.
pip install -r toxicity-filter/requirements.txt # Module 1
pip install -r communication-coach/requirements.txt # Module 2
pip install -r reframer/requirements.txt # Module 3
pip install -r manual-validation/requirements.txt # Validation

Browser extension and backend: Node.js 18+.
API keys (set in environment or in a module-level .env):
OPENAI_API_KEY=... # Module 2 (GPT runs), Module 3 (parallel dataset generation)
ANTHROPIC_API_KEY=... # Module 2 (Claude runs — best result in Table 4)
HF_TOKEN=... # Optional: only needed to push models to HuggingFace Hub
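
Before starting a paid inference run it can help to check that the right key is set. The helper below is a hypothetical convenience, not part of the package; the mapping from run type to key follows the comments above.

```python
import os

# Which environment variable each kind of run needs (per the comments above).
REQUIRED_KEYS = {
    "gpt": ["OPENAI_API_KEY"],
    "claude": ["ANTHROPIC_API_KEY"],
}


def missing_keys(run, env=None):
    """Return the required variables that are unset or empty for this run type."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS[run] if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
fake_env = {"OPENAI_API_KEY": "sk-example"}
gpt_missing = missing_keys("gpt", fake_env)
claude_missing = missing_keys("claude", fake_env)
```

Passing the environment in as a mapping keeps the check testable without touching real credentials.
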
| Finding | How to verify (stored artifacts) | How to re-run |
|---|---|---|
| BERT F1=0.97 (Table 2) | toxicity-filter/results/kfold-metrics/cross_validation_results.csv | ~10 GPU-hours, A100 |
| Claude MCC=0.39 (Table 4) | communication-coach/iterations/iter_4/results-claude-3.5-sonnet.csv → run evaluate.py | Anthropic API |
| Llama J=84% (Table 6) | reframer/detoxifier-fine-tuning/data/toxicr-outputs/ → run metric notebooks | ~6 GPU-hours, A100 |

All LLM inference used temperature=0 for determinism. Exact API model versions are recorded in the response_full column of each results CSV.
Authors: Md Awsaf Alam Anindya (awsafalam@gmail.com), Showvik Biswas (showvikdbz@gmail.com), Jaydeb Sarker (jsarker@unomaha.edu), and Amiangshu Bosu (abosu@wayne.edu)
This program is free software; you can redistribute it and/or modify it under the terms of the Apache License, Version 2.0, as published by the Apache Software Foundation.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache License, Version 2.0, for more details.
If you use our work, please cite our paper:
FSE 2026 (Research Track): "ToxiShield: Promoting Inclusive Developer Communication through Real-Time Toxicity Filtering"
@article{anindya2026toxishield,
title={ToxiShield: Promoting Inclusive Developer Communication through Real-Time Toxicity Filtering},
author={Anindya, Md Awsaf Alam and Biswas, Showvik and Iqbal, Anindya and Sarker, Jaydeb and Bosu, Amiangshu},
journal={Proceedings of the ACM on Software Engineering},
volume={},
number={FSE},
pages={TBD},
year={2026},
publisher={ACM New York, NY, USA}
}
FSE 2026 (Poster Track): "Real-Time Toxicity Filtering for Open-Source Code Reviews"
@inproceedings{poster2026toxishield,
title={Real-Time Toxicity Filtering for Open-Source Code Reviews},
author={Anindya, Md Awsaf Alam and Biswas, Showvik and Iqbal, Anindya and Sarker, Jaydeb and Bosu, Amiangshu},
booktitle={Proceedings of the 34th ACM International Conference on the Foundations of Software Engineering},
pages={},
year={2026},
location = {Montreal, Canada},
series = {FSE Companion '26},
publisher={ACM New York, NY, USA}
}