Reproduction Package for "Black-Box Adversarial Attacks on LLM-Based Code Completion" [ICML 2025]
Updated Jun 16, 2025 - Python
FGSM (Fast Gradient Sign Method) is an adversarial attack technique that perturbs an input by a small step in the direction of the sign of the loss gradient with respect to that input, in order to fool neural networks such as CNNs. Proposed by Ian Goodfellow et al. in 2014, it generates adversarial examples that mislead the model's predictions.
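As a rough illustration of the FGSM update, x_adv = x + eps * sign(dL/dx), here is a minimal sketch on a tiny logistic-regression classifier rather than a CNN. The weights, input, and `eps` value are made up for the example, and the helper names (`predict`, `input_gradient`, `fgsm`) are hypothetical; none of this comes from the repository itself.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # Probability of class 1 under a logistic-regression model.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def input_gradient(w, b, x, y):
    # For binary cross-entropy loss on a logistic model,
    # the gradient of the loss with respect to the input is (p - y) * w.
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    # One FGSM step: move each feature by eps in the direction
    # that increases the loss for the true label y.
    grad = input_gradient(w, b, x, y)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Toy model and input (assumed values, for illustration only).
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1

x_adv = fgsm(w, b, x, y, eps=0.5)
print("clean prediction:      ", predict(w, b, x))
print("adversarial prediction:", predict(w, b, x_adv))
```

With these toy values the clean input is classified as class 1 (probability above 0.5), while the perturbed input falls below 0.5. In a real image setting `eps` is usually much smaller relative to the input range, and the gradient comes from automatic differentiation rather than a closed form.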
Auditing a content moderation model using DistilBERT on the Jigsaw dataset. Covers bias analysis, adversarial attacks (character evasion and label poisoning), mitigation techniques, and a guardrail pipeline to improve fairness, robustness, and real-world reliability.
An empirical investigation into the robustness-efficiency tradeoff of PEFT methods against jailbreak attacks