Knowledge Unlearning in Language Models using Sparse Autoencoders

Course Project for Introduction to Natural Language Processing (INLP), IIIT Hyderabad

Team

Jayant Gupta
Gopal Kataria
Manas Agrawal
Mohammad Akmal Ali

Project Overview

This project investigates selective knowledge removal in large language models through mechanistic interpretability techniques. The goal is to remove knowledge associated with the Harry Potter domain from a pretrained llama-2-7b-chat model while preserving the model’s general linguistic and reasoning capabilities.

Traditional approaches to model editing rely on gradient-based fine-tuning or parameter modification, which may introduce unintended side effects or degrade general performance. In contrast, this work employs Sparse Autoencoders (SAEs) trained on internal transformer activations to identify interpretable features corresponding to specific knowledge domains. By selectively ablating these features during inference, it becomes possible to remove targeted knowledge in a controlled and interpretable manner.

The approach focuses on identifying high-level features in the residual stream of the transformer that are strongly associated with Harry Potter concepts. These features are then suppressed at inference time using forward hooks, allowing the model to generate responses without relying on the removed knowledge.

Training Instructions

Running the Project

The pretrained model is available in the Hugging Face repo. demo.py downloads it automatically.

Run Demo

git clone https://github.com/Jay-G14/INLP_PROJECT.git
cd INLP_PROJECT

# Install requirements
pip install -r requirements.txt

# Run the demo
# Downloading Llama 2 7B takes a while (around 20 minutes on LAN)
# and needs sufficient VRAM and RAM
python3 demo.py

Controls

Action	Shortcut
Send prompt	`Enter` or `Ctrl+S`
Toggle HP ablation on/off	`Ctrl+A` or click Ablation button
Quit	`Ctrl+Q`

What to Try

Ask "Who is Harry Potter's best friend?" with ablation OFF -> normal answer.
Ask the same with ablation ON -> model avoids HP-specific answers.
Ask a general question (history, science) with ablation ON -> general capability is preserved.

Methodology

Data Preparation

Two types of corpora are used during analysis.

Target Corpus

The Harry Potter book series is processed and tokenized to produce sequences containing dense domain-specific knowledge.

Neutral Corpus

A combination of WikiText-2 and TinyStories is used as a baseline dataset. The inclusion of general fiction prevents the sparse autoencoder from incorrectly identifying common fantasy terminology such as “wand” or “spell” as uniquely Harry Potter related.

Sparse Autoencoder Training

Sparse autoencoders are trained on the residual stream activations of llama-2-7b-chat at layer 12. The autoencoder learns a sparse representation of activations using a Top-K activation constraint.

The goal of this representation is to decompose the residual stream into interpretable features that correspond to meaningful semantic patterns in the model’s internal computations.
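
As a rough illustration of the Top-K constraint described above, the following dependency-free sketch runs one SAE forward pass on a single activation vector. The function names, the weight layout, and the toy weights are assumptions made for illustration; the project's actual SAE operates on batched tensors and may differ in detail (for example in bias handling or normalization).

```python
def top_k_sparsify(pre_activations, k):
    """Keep the k largest pre-activations and set every other entry to zero."""
    if k <= 0:
        return [0.0] * len(pre_activations)
    # Stable sort: on ties, the lower feature index is kept first.
    order = sorted(range(len(pre_activations)),
                   key=lambda i: pre_activations[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_activations)]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec, k):
    """Encode x into sparse features, then reconstruct it.

    Assumed layout: w_enc has one row per feature (n_features x d_model),
    w_dec has one row per residual dimension (d_model x n_features).
    """
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = top_k_sparsify(pre, k)
    reconstruction = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, reconstruction
```

With `k` active features per token, every activation vector is explained by at most `k` dictionary directions, which is what makes individual features easier to inspect. The `--k` and `--expansion_factor` flags in the training command presumably control this sparsity level and the dictionary size relative to the model width.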

Feature Identification

Candidate features associated with Harry Potter knowledge are identified using a difference-in-means analysis.

For each SAE feature:

Activation statistics are computed on the Harry Potter corpus.
Activation statistics are computed on the neutral corpus.
A specificity ratio is calculated.

Features with significantly higher activation on the target corpus are considered domain-specific.
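
The scoring steps above can be sketched as follows. The exact formulas are not given here, so this example assumes the difference is `mean_target - mean_neutral` and the specificity ratio is `mean_target / (mean_neutral + eps)`; the function name, the `eps` smoothing term, and the choice to rank by the difference are all illustrative assumptions.

```python
from statistics import fmean


def rank_features(target_acts, neutral_acts, top_n, eps=1e-6):
    """Rank SAE features by how much more they fire on the target corpus.

    target_acts / neutral_acts: lists of per-token feature activation
    vectors, one vector per token, all of the same length.
    """
    n_features = len(target_acts[0])
    scored = []
    for j in range(n_features):
        mean_target = fmean(row[j] for row in target_acts)
        mean_neutral = fmean(row[j] for row in neutral_acts)
        scored.append({
            "feature": j,
            "diff": mean_target - mean_neutral,            # difference in means
            "ratio": mean_target / (mean_neutral + eps),   # assumed specificity ratio
        })
    # Largest positive difference first.
    scored.sort(key=lambda s: s["diff"], reverse=True)
    return scored[:top_n]
```

Reporting both numbers is useful because a feature can have a large ratio while barely firing at all; the difference in means guards against selecting such near-silent features.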

Intervention

Feature ablation is implemented using forward hooks through the TransformerLens framework.

Selected SAE features are suppressed during the forward pass using a negative scaling factor. This intervention prevents the model from utilizing those features when generating text.
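
To make the arithmetic of this intervention concrete, here is a minimal sketch on a single residual-stream vector. It assumes that "negative scaling" means replacing each selected feature activation `a_i` with `scale * a_i` (for example `scale = -3.0`, as in the `--ablation_scale` flag) and writing the resulting change back into the residual stream along the feature's decoder direction. The function name and data layout are hypothetical; in the project this logic would run inside a forward hook on batched tensors.

```python
def scale_features_in_residual(residual, feature_acts, decoder_rows,
                               feature_ids, scale):
    """Rescale selected SAE features inside one residual-stream vector.

    decoder_rows[i] is the decoder direction (length d_model) of feature i.
    Replacing activation a_i with scale * a_i changes the reconstruction by
    (scale - 1) * a_i * decoder_rows[i]; only that delta is added, so the
    part of the residual the SAE does not explain is left untouched.
    """
    edited = list(residual)
    for i in feature_ids:
        delta = (scale - 1.0) * feature_acts[i]
        if delta == 0.0:
            continue  # feature inactive on this token, or scale == 1
        for d, weight in enumerate(decoder_rows[i]):
            edited[d] += delta * weight
    return edited
```

A scale of `0.0` would simply remove a feature's contribution, while a negative scale pushes the residual in the opposite direction. Tokens on which the selected features are inactive pass through unchanged, which is one reason general text is expected to be largely unaffected.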

Evaluation

Evaluation focuses on two objectives:

Measuring how effectively the model forgets Harry Potter knowledge.
Ensuring that the model’s general capabilities remain intact.

The following metrics are used.

Knowledge Recall

Completion accuracy on prompts referencing Harry Potter entities and events.
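
One simple way to score this metric is sketched below, assuming that a completion counts as correct when it contains the expected answer as a case-insensitive substring. This matching rule and the function name are assumptions for illustration; the project's evaluation may score completions differently.

```python
def completion_accuracy(completions, expected_answers):
    """Fraction of completions that contain their expected answer."""
    if len(completions) != len(expected_answers):
        raise ValueError("completions and expected_answers must align")
    if not completions:
        raise ValueError("need at least one completion")
    hits = sum(answer.lower() in completion.lower()
               for completion, answer in zip(completions, expected_answers))
    return hits / len(completions)
```

Comparing this score with ablation off and on gives a direct measure of forgetting: a successful intervention should lower it sharply on Harry Potter prompts.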

Log-Probability Analysis

Log probabilities assigned to tokens from different semantic domains.

General Language Modeling

Perplexity on WikiText-2 to measure overall language modeling performance.
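
For reference, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below computes it from a list of natural-log token probabilities; how those log probabilities are obtained from the model is outside its scope, and the function name is illustrative.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

If perplexity on WikiText-2 stays close to the unablated baseline, the intervention has left general language modeling largely intact.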

Qualitative Assessment

Generated completions are manually inspected and optionally classified by an external language model.

Detailed results can be found in the project report.

Detailed Training Instructions

Steps to reproduce our results:

# Llama full local training run
python main.py train \
  --layer 15 --epochs 5 --batch_size 128 --expansion_factor 4 --k 8 \
  --sae_device cpu --model_device cuda

# Llama feature discovery
python main.py features \
  --layer 15 --num_features 100 --sort_by score

# Llama ablation evaluation
python main.py eval \
  --layer 15 --num_features 100 --ablation_scale -3.0

# Push artifacts to Hugging Face
uv run python scripts/push_latest_llama_pt_to_hf.py
