WSU-SEAL/ToxiShield

ToxiShield — Replication Package

Replication package for "ToxiShield: Promoting Inclusive Developer Communication through Real-Time Toxicity Filtering" (FSE 2026).

ToxiShield is a Chrome extension for real-time toxicity detection and detoxification in GitHub pull request reviews. It is built around three ML modules, evaluated end-to-end through a user study with 10 professional developers.


Quick Navigation

| Paper Section | Folder | What you can reproduce |
|---|---|---|
| §2 – Toxicity Filter | [toxicity-filter/](toxicity-filter/) | Train BERT binary classifier; verify 98% accuracy / F1=0.97 |
| §3 – Communication Coach | [communication-coach/](communication-coach/) | Verify Claude 3.5 Sonnet Macro F1=0.42, MCC=0.39 |
| §4 – The Reframer | [reframer/](reframer/) | Fine-tune Llama 3.2 3B; evaluate J-Score=84% |
| §5 – Browser Extension | [browser-extension/](browser-extension/) | Install and run the full system |
| §5.1 – User Study | [survey/](survey/) | Inspect TAM survey responses |
| Validation | [manual-validation/](manual-validation/) | Reproduce Cohen's κ for both annotation tasks |

Architecture Overview

ToxiShield operates as a three-module pipeline triggered when a developer types a GitHub PR comment. Modules 2 and 3 run only if Module 1 flags the comment as toxic:

Developer types PR comment
        │
        ▼
┌──────────────────────┐
│   Module 1           │  BERT-base-uncased (INT8 ONNX, runs in-browser)
│   Toxicity Filter    │  Binary: toxic / non-toxic
└──────┬───────────────┘
       │ if toxic
       ▼
┌──────────────────────┐
│   Module 2           │  Claude 3.5 Sonnet (LLM, zero-shot)
│   Communication      │  12-class subcategory + explanation
│   Coach              │
└──────┬───────────────┘
       │
       ▼
┌──────────────────────┐
│   Module 3           │  Llama 3.2 3B (LoRA fine-tuned, served via Ollama)
│   The Reframer       │  Generates detoxified alternative + rationale
└──────────────────────┘
       │
       ▼
Inline suggestion shown to developer (accept / discard / rate)
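
The control flow in the diagram can be sketched as follows. This is a minimal illustration, not the extension's actual code: `run_pipeline`, `Suggestion`, and the three callables are hypothetical names, and the sketch assumes the Reframer receives only the original comment.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Suggestion:
    """What the developer would see inline (hypothetical container)."""
    subcategory: str   # Module 2: one of the 12 subcategories
    explanation: str   # Module 2: why the comment was flagged
    rewrite: str       # Module 3: detoxified alternative
    rationale: str     # Module 3: why the rewrite is better


def run_pipeline(
    comment: str,
    is_toxic: Callable[[str], bool],
    coach: Callable[[str], Tuple[str, str]],
    reframe: Callable[[str], Tuple[str, str]],
) -> Optional[Suggestion]:
    # Module 1 gates everything: non-toxic comments pass through untouched.
    if not is_toxic(comment):
        return None
    subcategory, explanation = coach(comment)   # Module 2
    rewrite, rationale = reframe(comment)       # Module 3
    return Suggestion(subcategory, explanation, rewrite, rationale)


# Stub modules, for illustration only.
stub_filter = lambda text: "stupid" in text.lower()
stub_coach = lambda text: ("example-category", "Contains a personal attack.")
stub_reframe = lambda text: ("Could we simplify this function?", "Keeps the request, drops the insult.")

print(run_pipeline("Looks good to me.", stub_filter, stub_coach, stub_reframe))
print(run_pipeline("This function is stupid.", stub_filter, stub_coach, stub_reframe))
```

Keeping the cheap in-browser classifier as the gate means the LLM-backed modules are only called for the minority of comments that are flagged.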

Key Results at a Glance

| Module | Model | Primary Metric | Result |
|---|---|---|---|
| Toxicity Filter | BERT-base-uncased (fine-tuned) | F1 (toxic class) | 0.97 |
| Communication Coach | Claude 3.5 Sonnet | Macro F1 / MCC | 0.42 / 0.39 |
| The Reframer | Llama 3.2 3B (LoRA) | J-Score | 84.00% |

Module 1: Toxicity Filter (§2)

What it does: Binary classification of PR comments (toxic vs non-toxic). Fine-tunes bert-base-uncased on a curated dataset of 38,761 labelled PR comments from 15M GitHub PRs. Best model exported to ONNX INT8 for in-browser inference.

Dataset: 10,120 toxic samples (stratified sampling across ToxiCR probability bins) + 28,641 non-toxic samples. Available at toxishield/38k-dataset-labelled.
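
As a quick sanity check on the figures above, the two class counts should sum to the stated dataset size. The snippet below uses only numbers quoted in this README and also reports the class balance.

```python
# Counts quoted in the dataset description above.
TOXIC = 10_120
NON_TOXIC = 28_641

total = TOXIC + NON_TOXIC
toxic_share = TOXIC / total

print(total)                  # 38761
print(f"{toxic_share:.1%}")   # 26.1%
```

Roughly one comment in four is toxic, which is why the results are reported as F1 on the toxic class rather than accuracy alone.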

To verify results (no training needed):

cd toxicity-filter
pip install -r requirements.txt
cat results/kfold-metrics/cross_validation_results.csv

To re-run training (GPU required, ~10 hours):

jupyter notebook notebooks/train-classifier.ipynb
# Navigate to Section 2 (10-fold CV block) — do not Run All

Key files:

| Path | Description |
|---|---|
| `data/38k-detection-dataset-full.csv` | Full labelled dataset (38,761 samples) |
| `notebooks/train-classifier.ipynb` | Training + 10-fold CV + ONNX INT8 export |
| `results/kfold-metrics/cross_validation_results.csv` | Per-fold TP/TN/FP/FN, accuracy, F1 |
| `results/kfold-misclassifications/misclassification_{1-10}.csv` | All 701 misclassified instances |
| `comparison/openai-detection-inference/` | GPT-4o zero-shot baseline (Table 2) |
| `comparison/openai-detection-inference/compute_table2_metrics.py` | Script to compute Table 2 metrics from JSONL |
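
The per-fold TP/TN/FP/FN counts are enough to recompute accuracy and toxic-class F1 by hand. The sketch below shows the arithmetic; `fold_metrics` is a hypothetical helper and the counts in the example call are made up, so substitute the values from `cross_validation_results.csv`. The last lines assume each sample is held out exactly once across the 10 folds, in which case the 701 misclassifications imply the pooled accuracy.

```python
def fold_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, precision, recall and F1 (positive = toxic) from one fold's counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up counts, for illustration only.
print(fold_metrics(tp=90, tn=95, fp=5, fn=10))

# Pooled accuracy implied by the misclassification files (assumption:
# every one of the 38,761 samples appears in exactly one held-out fold).
pooled_accuracy = 1 - 701 / 38_761
print(f"{pooled_accuracy:.4f}")   # 0.9819
```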

Module 2: Communication Coach (§3)

What it does: Classifies toxic PR comments into 12 subcategories using prompt-engineered LLMs. No fine-tuning — uses in-context learning with iteratively refined prompts across 14 evolutionary stages.

Prompt stages and paper mapping:

| Iterations | Paper Stage | Key change |
|---|---|---|
| iter_1 | Stage 1 (zero-shot baseline) | Class names only, no guidance |
| iter_2–iter_5 | Stages 2–5 | Added behavior-based definitions, sarcasm handling, lexical cues, rare-category examples |
| iter_6–iter_8 | Stage 2.1–2.3 | Sub-refinements of Stage 2 |
| iter_9–iter_11 | Stage 3.1–3.3 | Sub-refinements of Stage 3 |
| iter_12–iter_14 | Stage 4.1–4.3 | Sub-refinements of Stage 4 |

Best prompt for cross-model comparison: iterations/iter_4/prompt.py (Table 4 results)

To verify results (no API key needed):

cd communication-coach
pip install -r requirements.txt
python scripts/evaluate.py
# Prints: Macro F1=0.42, Macro MCC=0.39 — matches Table 4

To re-run inference (API key required):

cp .env.example .env    # add OPENAI_API_KEY or ANTHROPIC_API_KEY

# 1. Set model and iteration in scripts/config.py (default: gpt-4o, iter_4)
# 2. Run inference:
python scripts/openai-inference.py
# Results saved to iterations/iter_4/results.csv

# To evaluate results:
python scripts/evaluate.py
# Writes mcc_by_prompt.json and per_label_confusion_matrices.pdf

Note: Table 4's best result (Claude 3.5 Sonnet) requires the Anthropic API client — openai-inference.py runs GPT models only. Stored Claude results are at iterations/iter_4/results-claude-3.5-sonnet.csv and can be evaluated directly without re-running inference.
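
For readers who want to see what the reported numbers mean, here is a dependency-free sketch of macro-averaged F1 and MCC. It assumes a multi-label setup in which each comment carries a set of subcategories and each label is scored as its own binary problem before averaging; that reading is suggested by the per-label confusion matrices but is an assumption, and `scripts/evaluate.py` remains the authoritative implementation. The label names in the example are hypothetical.

```python
import math
from typing import List, Sequence, Set, Tuple


def counts(gold: Sequence[Set[str]], pred: Sequence[Set[str]], label: str) -> Tuple[int, int, int, int]:
    """Binary confusion counts (tp, tn, fp, fn) for one label."""
    tp = tn = fp = fn = 0
    for g, p in zip(gold, pred):
        in_gold, in_pred = label in g, label in p
        if in_gold and in_pred:
            tp += 1
        elif in_gold:
            fn += 1
        elif in_pred:
            fp += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def macro_scores(gold, pred, labels: List[str]) -> Tuple[float, float]:
    """Unweighted mean of per-label F1 and per-label MCC."""
    f1s, mccs = [], []
    for label in labels:
        tp, tn, fp, fn = counts(gold, pred, label)
        f1s.append(f1(tp, fp, fn))
        mccs.append(mcc(tp, tn, fp, fn))
    return sum(f1s) / len(labels), sum(mccs) / len(labels)


# Hypothetical labels and predictions, for illustration only.
labels = ["insult", "profanity"]
gold = [{"insult"}, {"profanity"}, {"insult"}, {"profanity"}]
pred = [{"insult"}, {"profanity"}, {"profanity"}, {"profanity"}]
print(macro_scores(gold, pred, labels))
```

Because macro averaging weights every subcategory equally, rare categories pull the score down as much as common ones, which helps explain why a 12-class Macro F1 of 0.42 can coexist with strong binary detection.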

Key files:

| Path | Description |
|---|---|
| `scripts/config.py` | Central config: model, iteration, dataset path, output path, labels |
| `scripts/openai-inference.py` | Runs inference for the configured model/iteration |
| `scripts/evaluate.py` | Computes Macro F1, MCC, exact match; writes confusion matrix PDF |
| `notebooks/evaluate.ipynb` | Notebook-based evaluation (set FILE_PATH to any results CSV) |
| `data/multiclass-dataset-full.csv` | 1,200-sample labelled multiclass dataset |
| `iterations/iter_4/results-claude-3.5-sonnet.csv` | Claude 3.5 Sonnet results (Table 4, best) |
| `iterations/iter_4/results.csv` | GPT-4o results (Table 4) |
| `iterations/iter_4/mcc_by_prompt.json` | MCC scores across iterations |

Module 3: The Reframer (§4)

What it does: Text style transfer — rewrites toxic PR comments into professional alternatives while preserving technical meaning. Uses teacher-student knowledge distillation: GPT-4o generates training pairs, Llama 3.2 3B learns from them via LoRA.

Parallel dataset: 10,117 (toxic, detoxified) pairs generated by gpt-4o-2024-05-13. Available at toxishield/20k-dataset-parallel.

Evaluation metrics:

| Metric | What it measures | Notebook |
|---|---|---|
| DETOX | % reduction in toxicity score (via ToxiCR) | `metric-incivility-decrease.ipynb` |
| FL | Fluency via CoLA acceptability classifier | `metric-style-acc-flu-sim.ipynb` |
| PRESERVE | Semantic similarity via sentence-transformers | `metric-style-acc-flu-sim.ipynb` |
| J-Score | Harmonic mean of DETOX × FL × PRESERVE | `metric-style-acc-flu-sim.ipynb` |
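
The J-Score combines the three metrics into one number so that a rewrite must be non-toxic, fluent, and faithful at the same time. The sketch below follows one common convention in text style transfer evaluation: multiply the three per-sample scores and average the products over the test set. Whether the notebook aggregates in exactly this way is an assumption; `metric-style-acc-flu-sim.ipynb` is authoritative. The function name and sample values are hypothetical.

```python
from typing import Iterable, Tuple


def joint_score(samples: Iterable[Tuple[float, float, float]]) -> float:
    """Mean over samples of detox * fluency * preserve, each score in [0, 1]."""
    products = [detox * fluency * preserve for detox, fluency, preserve in samples]
    if not products:
        raise ValueError("need at least one sample")
    return sum(products) / len(products)


# Made-up per-sample scores: (detox, fluency, preserve).
samples = [
    (1.0, 1.0, 1.0),   # ideal rewrite
    (1.0, 0.5, 0.8),   # detoxified but less fluent
    (0.0, 1.0, 1.0),   # fluent and faithful, but still toxic
]
print(round(joint_score(samples), 4))   # 0.4667
```

The third sample shows why a product is used: failing any single criterion zeroes that sample's contribution, no matter how well it does on the others.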

To evaluate stored results (no GPU needed):

cd reframer/detoxifier-fine-tuning
pip install -r ../requirements.txt
jupyter notebook notebooks/metric-style-acc-flu-sim.ipynb
# Change INPUT_FILE to evaluate different models

To re-run fine-tuning (GPU ≥16 GB VRAM required, ~6 hours):

jupyter notebook notebooks/fine-tune-huggingface.ipynb
# Base model: unsloth/Llama-3.2-3B-Instruct-bnb-4bit
# Uses num_train_epochs=10, LoRA rank=16, lr=2e-4
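
To see why LoRA at rank 16 fits on a single 16 GB GPU, it helps to count the trainable parameters it adds. For a frozen weight matrix of shape d_out × d_in, LoRA trains two small matrices of shapes d_out × r and r × d_in. The sketch below applies that formula; the 3072 × 3072 projection size is an illustrative assumption about the base model, not a value taken from the notebook.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


RANK = 16             # from the fine-tuning settings above
D_IN = D_OUT = 3072   # assumed projection size, for illustration only

full = D_IN * D_OUT
adapter = lora_params(D_IN, D_OUT, RANK)

print(full)                      # 9437184
print(adapter)                   # 98304
print(f"{adapter / full:.2%}")   # 1.04%
```

Under these assumptions each adapted matrix trains about 1% of the parameters a full fine-tune would touch.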

Result files in detoxifier-fine-tuning/data/toxicr-outputs/:

| File | Table 6 model |
|---|---|
| `toxicr-teacher-gpt-4o-05-13-llama-3.2-3b-input-output-cleaned.xlsx` | Llama 3.2 3B (best, J=84%) |
| `toxicr-teacher-gpt-4o-05-13-llama-3.1-8b-input-output-cleaned.xlsx` | Llama 3.1 8B |
| `toxicr-phi-3.5-10k-test-teacher-gpt-4o-05-13-input-output.xlsx` | Phi 3.5 |
| `toxicr-gemma-2b-10k-test-teacher-gpt-4o-05-13-input-output.xlsx` | Gemma 2B |
| `toxicr-ft-gpt-4o-mini-toxishield-test-inference-10k-input-output.csv` | GPT-4o mini FT |
| `toxicr-ft-gpt-35-baseline-test-inference-10k-input-output.csv` | GPT-3.5 FT (baseline) |

Note: Qwen 2.5 Instruct 7B output file is not included in this package.

Sub-directory overview:

reframer/
├── parallel-dataset/         # Teacher model data generation scripts and raw outputs
├── openai-fine-tuning/       # GPT-4o-mini and GPT-3.5 fine-tuning data (comparison baselines)
├── detoxifier-fine-tuning/   # Llama 3.2 3B LoRA fine-tuning (primary model)
│   ├── notebooks/            # Training + all evaluation notebooks
│   └── data/toxicr-outputs/  # ToxiCR-scored input/output pairs for all 6 models
└── baseline-comparison/      # Prior-work baseline dataset and comparison results

Module 4: Browser Extension + Backend (§5)

What it does: Chrome extension (Manifest V3, React/Vite) that integrates all three modules into the developer workflow. Runs the BERT classifier locally via ONNX Runtime Web; calls the backend API for detoxification.

# 1. Start the backend
cd browser-extension/detoxifier-backend
cp .env.example .env   # fill in OPENAI_API_KEY, DB credentials, OLLAMA_BASE_URL
npm install
npm start              # starts on http://localhost:3000

# 2. Load the extension (pre-built artifact)
# Chrome → chrome://extensions → Developer mode → Load unpacked
# → select browser-extension/toxishield/

Extension source layout:

| Path | Description |
|---|---|
| `browser-extension/src/` | React side-panel UI, inference logic, TypeScript types |
| `browser-extension/public/content.js` | Injected content script: intercepts GitHub PR comment forms |
| `browser-extension/public/service_worker.js` | Background worker: routes messages between content script and panel |
| `browser-extension/public/static/vocab.json` | BERT WordPiece vocabulary for in-browser tokenisation |
| `browser-extension/toxishield/` | Pre-built unpacked extension (load this in Chrome) |
| `browser-extension/detoxifier-backend/` | Node.js/Express API: detoxification inference + usage logging |
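
The vocabulary file exists because BERT expects WordPiece tokens, so the extension has to tokenise comments itself before running the ONNX model. The sketch below illustrates the greedy longest-match-first WordPiece algorithm in Python for readability; it is not the extension's TypeScript code. The toy vocabulary is hypothetical, and the sketch splits on whitespace only, whereas a full BERT tokeniser also separates punctuation.

```python
from typing import List, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece   # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]               # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Lowercase (uncased model), split on whitespace, and add BERT's special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Toy vocabulary, for illustration only.
toy_vocab = {"[CLS]", "[SEP]", "[UNK]", "this", "code", "is", "un", "##read", "##able"}
print(tokenize("This code is unreadable", toy_vocab))
# ['[CLS]', 'this', 'code', 'is', 'un', '##read', '##able', '[SEP]']
```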

ONNX model: classifier_int8.onnx is not included (binary, ~80 MB). Generate it by running the ONNX export section of toxicity-filter/notebooks/train-classifier.ipynb, then place the file at browser-extension/public/bert-base/classifier_int8.onnx.

See [browser-extension/README.md](browser-extension/README.md) for full setup and development build instructions.


User Study (§5.1)

10 professional software developers (US and Bangladesh) used ToxiShield on real GitHub repositories for two weeks. IRB-approved. Survey materials and anonymised responses are in [survey/](survey/).

| File | Description |
|---|---|
| `survey.xlsx` | Anonymised responses (10 participants, 9 Likert items + open-ended) |
| `ToxiShield_Developer_Survey_Guide.pdf` | Study protocol and task instructions given to participants |
| `ToxiShield _ Post-Study-Feedback-Form.pdf` | Post-study TAM questionnaire |
| `test-repository.txt` | GitHub repository used during the two-week study |

Inter-Annotator Agreement (§3.3, §4.4)

cd manual-validation
pip install -r requirements.txt
jupyter notebook notebooks/kappa.ipynb

| Task | κ | Interpretation |
|---|---|---|
| Multiclass subcategory (100 samples) | 0.67 | Substantial |
| Detox quality: minimal change | 0.82 | Almost perfect |
| Detox quality: context preservation | 0.72 | Substantial |
| Detox quality: communication style | 0.77 | Substantial |
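
Cohen's κ corrects raw agreement for the agreement two annotators would reach by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it from two label lists and maps the value to the Landis and Koch bands used in the Interpretation column. It is an illustration of the formula, not a copy of `kappa.ipynb`, and the example annotations are made up.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and the same length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0   # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def interpret(kappa: float) -> str:
    """Landis and Koch (1977) agreement bands."""
    if kappa < 0:
        return "Poor"
    if kappa <= 0.20:
        return "Slight"
    if kappa <= 0.40:
        return "Fair"
    if kappa <= 0.60:
        return "Moderate"
    if kappa <= 0.80:
        return "Substantial"
    return "Almost perfect"


# Made-up annotations, for illustration only.
rater_1 = ["y", "y", "n", "n"]
rater_2 = ["y", "n", "n", "n"]
k = cohens_kappa(rater_1, rater_2)
print(k, interpret(k))   # 0.5 Moderate
```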

Environment Setup

Python 3.10+ required. Each module has its own requirements.txt.

pip install -r toxicity-filter/requirements.txt      # Module 1
pip install -r communication-coach/requirements.txt  # Module 2
pip install -r reframer/requirements.txt             # Module 3
pip install -r manual-validation/requirements.txt    # Validation

Browser extension and backend: Node.js 18+.

API keys (set in environment or in a module-level .env):

OPENAI_API_KEY=...       # Module 2 (GPT runs), Module 3 (parallel dataset generation)
ANTHROPIC_API_KEY=...    # Module 2 (Claude runs — best result in Table 4)
HF_TOKEN=...             # Optional: only needed to push models to HuggingFace Hub
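
Before starting a paid inference run it can be worth checking that the right key is set. The sketch below is a small pre-flight check built from the key-to-module mapping listed above; the task names and the `missing_keys` helper are hypothetical and not part of the package.

```python
import os
from typing import Dict, List, Mapping

# Which environment variables each run needs (task names are hypothetical).
REQUIRED_KEYS: Dict[str, List[str]] = {
    "module2-gpt": ["OPENAI_API_KEY"],
    "module2-claude": ["ANTHROPIC_API_KEY"],
    "module3-parallel-dataset": ["OPENAI_API_KEY"],
}


def missing_keys(task: str, env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty for the given task."""
    return [key for key in REQUIRED_KEYS[task] if not env.get(key)]


if __name__ == "__main__":
    for task in REQUIRED_KEYS:
        gaps = missing_keys(task)
        status = "ok" if not gaps else "missing " + ", ".join(gaps)
        print(f"{task}: {status}")
```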

Reproducibility Summary

| Finding | How to verify (stored artifacts) | How to re-run |
|---|---|---|
| BERT F1=0.97 (Table 2) | `toxicity-filter/results/kfold-metrics/cross_validation_results.csv` | ~10 GPU-hours, A100 |
| Claude MCC=0.39 (Table 4) | `communication-coach/iterations/iter_4/results-claude-3.5-sonnet.csv` → run `evaluate.py` | Anthropic API |
| Llama J=84% (Table 6) | `reframer/detoxifier-fine-tuning/data/toxicr-outputs/` → run metric notebooks | ~6 GPU-hours, A100 |

All LLM inference used temperature=0 for determinism. Exact API model versions are recorded in the response_full column of each results CSV.

Copyright Information

Authors: Md Awsaf Alam Anindya (awsafalam@gmail.com), Showvik Biswas (showvikdbz@gmail.com), Jaydeb Sarker (jsarker@unomaha.edu), and Amiangshu Bosu (abosu@wayne.edu)

This program is free software; you can redistribute it and/or modify it under the terms of the Apache License, Version 2.0, as published by the Apache Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Apache License, Version 2.0 for more details.

Citation for our papers

If you use our work, please cite our papers:

FSE 2026 (Research Track): "ToxiShield: Promoting Inclusive Developer Communication through Real-Time Toxicity Filtering"

@article{anindya2026toxishield,
  title={ToxiShield: Promoting Inclusive Developer Communication through Real-Time Toxicity Filtering},
  author={Anindya, Md Awsaf Alam and Biswas, Showvik and Iqbal, Anindya and Sarker, Jaydeb and Bosu, Amiangshu},
  journal={Proceedings of the ACM on Software Engineering},
  volume={},
  number={FSE},
  pages={TBD},
  year={2026},
  publisher={ACM New York, NY, USA}
}

FSE 2026 (Poster Track): "Real-Time Toxicity Filtering for Open-Source Code Reviews"

@inproceedings{poster2026toxishield,
  title={Real-Time Toxicity Filtering for Open-Source Code Reviews},
  author={Anindya, Md Awsaf Alam and Biswas, Showvik and Iqbal, Anindya and Sarker, Jaydeb and Bosu, Amiangshu},
  booktitle={Proceedings of the 34th ACM International Conference on the Foundations of Software Engineering},
  pages={},
  year={2026},
  location={Montreal, Canada},
  series={FSE Companion '26},
  publisher={ACM New York, NY, USA}
}
