Course Project for Introduction to Natural Language Processing (INLP), IIIT Hyderabad
- Jayant Gupta
- Gopal Kataria
- Manas Agrawal
- Mohammad Akmal Ali
This project investigates selective knowledge removal in large language models through mechanistic interpretability techniques. The goal is to remove knowledge associated with the Harry Potter domain from a pretrained llama-2-7b-chat model while preserving the model’s general linguistic and reasoning capabilities.
Traditional approaches to model editing rely on gradient-based fine-tuning or parameter modification, which may introduce unintended side effects or degrade general performance. In contrast, this work employs Sparse Autoencoders (SAEs) trained on internal transformer activations to identify interpretable features corresponding to specific knowledge domains. By selectively ablating these features during inference, it becomes possible to remove targeted knowledge in a controlled and interpretable manner.
The approach focuses on identifying high-level features in the residual stream of the transformer that are strongly associated with Harry Potter concepts. These features are then suppressed at inference time using forward hooks, allowing the model to generate responses without relying on the removed knowledge.
The pretrained model is available in a Hugging Face repository and is downloaded automatically by demo.py.
git clone https://github.com/Jay-G14/INLP_PROJECT.git
cd INLP_PROJECT
# Install requirements
pip install -r requirements.txt
# run demo
# Downloading Llama 2 7B takes time (around 20 minutes on LAN)
# and needs sufficient VRAM and RAM
python3 demo.py

| Action | Shortcut |
|---|---|
| Send prompt | Enter or Ctrl+S |
| Toggle HP ablation on/off | Ctrl+A or click the **Ablation** button |
| Quit | Ctrl+Q |
- Ask "Who is Harry Potter's best friend?" with ablation OFF -> normal answer.
- Ask the same with ablation ON -> model avoids HP-specific answers.
- Ask a general question (history, science) with ablation ON -> general capability is preserved.
Two types of corpora are used during analysis.
Target Corpus
The Harry Potter book series is processed and tokenized to produce sequences containing dense domain-specific knowledge.
Neutral Corpus
A combination of WikiText-2 and TinyStories is used as a baseline dataset. The inclusion of general fiction prevents the sparse autoencoder from incorrectly identifying common fantasy terminology such as “wand” or “spell” as uniquely Harry Potter related.
Sparse autoencoders are trained on the residual stream activations of llama-2-7b-chat at layer 12. The autoencoder learns a sparse representation of activations using a Top-K activation constraint.
The goal of this representation is to decompose the residual stream into interpretable features that correspond to meaningful semantic patterns in the model’s internal computations.
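The Top-K constraint can be illustrated with a small NumPy sketch. This is not the project's training code: the function name `topk_sae_forward`, the weight layout, the decoder-bias subtraction, and the reading of `expansion_factor` as `d_sae = d_model * expansion_factor` are all assumptions made for illustration. Only the values `k = 8` and `expansion_factor = 4` come from the reproduction commands.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x with a Top-K sparse autoencoder and reconstruct it.

    x:     (batch, d_model) residual-stream activations
    W_enc: (d_model, d_sae), b_enc: (d_sae,)
    W_dec: (d_sae, d_model), b_dec: (d_model,)
    """
    pre = (x - b_dec) @ W_enc + b_enc            # pre-activations, (batch, d_sae)
    # Indices of the k largest pre-activations in each row.
    top_idx = np.argpartition(pre, -k, axis=-1)[:, -k:]
    feats = np.zeros_like(pre)
    rows = np.arange(pre.shape[0])[:, None]
    # Keep only the top-k values; the ReLU zeroes any negative survivors.
    feats[rows, top_idx] = np.maximum(pre[rows, top_idx], 0.0)
    recon = feats @ W_dec + b_dec
    return feats, recon

# Toy dimensions (assumed); the real residual stream is far wider.
rng = np.random.default_rng(0)
d_model, expansion_factor, k = 16, 4, 8
d_sae = d_model * expansion_factor
x = rng.normal(size=(5, d_model))
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
feats, recon = topk_sae_forward(x, W_enc, np.zeros(d_sae), W_dec, np.zeros(d_model), k)
```

Each row of `feats` has at most `k` non-zero entries, which is what makes individual features easier to inspect one at a time.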
Candidate features associated with Harry Potter knowledge are identified using a difference-in-means analysis.
For each SAE feature:
- Activation statistics are computed on the Harry Potter corpus.
- Activation statistics are computed on the neutral corpus.
- A specificity ratio is calculated.
Features with significantly higher activation on the target corpus are considered domain-specific.
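The selection steps above can be sketched as follows. The exact statistics and the formula for the specificity ratio are not given in this document, so the mean-activation difference and the `mu_target / (mu_neutral + eps)` ratio below are assumptions, as are the function name `rank_features` and the synthetic data.

```python
import numpy as np

def rank_features(target_acts, neutral_acts, num_features, eps=1e-6):
    """Rank SAE features by how specific they are to the target corpus.

    target_acts, neutral_acts: (n_tokens, d_sae) SAE feature activations.
    Returns indices of the top features plus the per-feature statistics.
    """
    mu_target = target_acts.mean(axis=0)
    mu_neutral = neutral_acts.mean(axis=0)
    diff = mu_target - mu_neutral               # difference in means
    ratio = mu_target / (mu_neutral + eps)      # specificity ratio (assumed form)
    order = np.argsort(-diff)[:num_features]    # largest difference first
    return order, diff, ratio

# Synthetic activations with two planted "domain-specific" features.
rng = np.random.default_rng(0)
neutral = rng.random((200, 32)) * 0.1
target = rng.random((200, 32)) * 0.1
target[:, [3, 17]] += 1.0
top, diff, ratio = rank_features(target, neutral, num_features=2)
```

Using both a difference and a ratio is a common safeguard: the difference favors features that fire strongly, while the ratio filters out features that also fire often on neutral text.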
Feature ablation is implemented using forward hooks through the TransformerLens framework.
Selected SAE features are suppressed during the forward pass using a negative scaling factor. This intervention prevents the model from utilizing those features when generating text.
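One plausible reading of the negative scaling intervention is sketched below: the selected feature activations are multiplied by a scale, and the resulting change is written back into the residual stream along the decoder directions. The function name `ablate_features` and this particular update rule are assumptions, not a description of the project's hook; the scale of -3.0 mirrors the `--ablation_scale` value in the reproduction commands.

```python
import numpy as np

def ablate_features(resid, feats, W_dec, feature_ids, scale):
    """Rescale selected SAE features and write the change back to the residual.

    resid: (batch, d_model) residual-stream activations
    feats: (batch, d_sae) SAE feature activations for the same positions
    W_dec: (d_sae, d_model) decoder; row i is the direction of feature i
    scale: multiplier for the selected features (0 removes, negative inverts)
    """
    selected = feats[:, feature_ids]                 # (batch, n_selected)
    # What the selected features currently contribute to the residual.
    contribution = selected @ W_dec[feature_ids]     # (batch, d_model)
    # Replacing f with scale * f shifts the residual by (scale - 1) * f * d.
    return resid + (scale - 1.0) * contribution

# Toy example with orthogonal decoder directions so the effect is easy to read.
W_dec = np.eye(4)[:3]                    # 3 features in a 4-dimensional residual
feats = np.array([[2.0, 0.0, 1.0]])
resid = feats @ W_dec                    # [[2, 0, 1, 0]]
out = ablate_features(resid, feats, W_dec, feature_ids=[0], scale=-3.0)
```

Editing the residual by a delta rather than replacing it with the SAE reconstruction leaves the reconstruction error untouched, so unselected information passes through unchanged. In a TransformerLens setup a function like this would typically be called from a forward hook on the residual stream at the chosen layer.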
Evaluation focuses on two objectives:
- Measuring how effectively the model forgets Harry Potter knowledge.
- Ensuring that the model’s general capabilities remain intact.
The following metrics are used.
Knowledge Recall
Completion accuracy on prompts referencing Harry Potter entities and events.
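A minimal sketch of such a completion-accuracy check is shown here. The substring-matching criterion, the function name `completion_accuracy`, and the stand-in `fake_generate` are assumptions for illustration; the project's scoring rule may differ.

```python
def completion_accuracy(examples, generate):
    """Fraction of prompts whose completion contains the expected answer.

    examples: list of (prompt, expected_answer) pairs
    generate: callable mapping a prompt string to a completion string
    """
    if not examples:
        return 0.0
    hits = sum(
        expected.lower() in generate(prompt).lower()
        for prompt, expected in examples
    )
    return hits / len(examples)

# Stand-in for a real model call, used only to make the sketch runnable.
def fake_generate(prompt):
    canned = {"Harry Potter's best friend is": " Ron Weasley."}
    return canned.get(prompt, " I am not sure.")

examples = [
    ("Harry Potter's best friend is", "Ron"),
    ("The headmaster of Hogwarts is", "Dumbledore"),
]
accuracy = completion_accuracy(examples, fake_generate)
```

Running the same examples with ablation off and on gives a before/after pair; a drop in accuracy indicates that the targeted knowledge is harder to elicit.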
Log-Probability Analysis
Log probabilities assigned to tokens from different semantic domains.
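As a rough illustration, the average log-probability of a token set can be computed from next-token logits as below. The helper names, the toy vocabulary, and the choice to average over positions and tokens are assumptions; the document does not specify how the domains or aggregates are defined.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mean_token_logprob(logits, token_ids):
    """Average log-probability of a set of token ids under next-token logits.

    logits: (n_positions, vocab_size); token_ids: list of vocabulary indices.
    """
    logp = log_softmax(logits)
    return float(logp[:, token_ids].mean())

# Toy logits: uniform before, domain tokens pushed down after "ablation".
vocab_size = 10
domain_tokens = [2, 5]                   # hypothetical ids of domain tokens
baseline = np.zeros((1, vocab_size))
ablated = baseline.copy()
ablated[0, domain_tokens] -= 4.0
before = mean_token_logprob(baseline, domain_tokens)
after = mean_token_logprob(ablated, domain_tokens)
```

Comparing `before` and `after` for domain tokens against the same pair for unrelated tokens shows whether the intervention is selective.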
General Language Modeling
Perplexity on WikiText-2 to measure overall language modeling performance.
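Perplexity is the exponential of the mean negative log-likelihood per token. The sketch below assumes natural-log token probabilities have already been collected from the model; the function name `perplexity` is illustrative.

```python
import math

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# A model that assigns probability 0.25 to every token has perplexity 4.
uniform = [math.log(0.25)] * 8
ppl = perplexity(uniform)
```

A small change in WikiText-2 perplexity between the unablated and ablated model suggests that general language modeling is largely preserved.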
Qualitative Assessment
Generated completions are manually inspected and optionally classified by an external language model.
Detailed results can be found in the project report.
- Steps to reproduce our results:
# Llama full local training run
python main.py train \
--layer 15 --epochs 5 --batch_size 128 --expansion_factor 4 --k 8 \
--sae_device cpu --model_device cuda
# Llama feature discovery
python main.py features \
--layer 15 --num_features 100 --sort_by score
# Llama ablation evaluation
python main.py eval \
--layer 15 --num_features 100 --ablation_scale -3.0
# Push artifacts to Hugging Face
uv run python scripts/push_latest_llama_pt_to_hf.py