This is the repository for the paper "Tracing Persona Vectors Through LLM Pretraining". It contains the code and data to reproduce the experiments and results presented in the paper.
We study when and how persona vectors — linear directions in activation space that encode high-level behavioral traits such as evil, sycophantic, impolite, and humorous — emerge during LLM pretraining. Across OLMo-3-7B and Apertus-8B checkpoints we show that persona vectors form remarkably early (within the first ~0.22% of OLMo-3 pretraining tokens), remain effective for steering the fully trained base and post-trained (Instruct/SFT/DPO/RLVR) models, and keep refining throughout training. The repository provides the full pipeline used in the paper: persona-vector extraction from base-model activations, checkpoint sweeps for emergence and transfer analyses, geometry and facet analyses, and the discourse-type / control-vector ablations.
This repository builds upon the code from the original persona vector paper, which can be found in this GitHub repository. Files adopted from the original repo are indicated in their initial comment.
The code requires Python 3.11. External libraries include google-genai, matplotlib, numpy, openai, pandas, seaborn, transformers, and vllm libraries. All dependencies and version numbers are listed in the requirements.txt file.
We used a single NVIDIA A100 GPU for most experiments. If the compute cluster supports fractional GPU allocation, 24GB VRAM should be sufficient for the 7B and 8B models.
pretraining_persona/
├── requirements.txt # Python dependencies
├── .env.example # Template for API keys
├── analysis/ # Analysis and plotting scripts
├── data/ # All data (input and generations)
│ ├── trait_data_extract/ # Prompt pairs used to extract persona vectors
│ ├── trait_data_eval/ # Prompt sets used to evaluate steering
│ ├── upstream_trait_data_eval/ # Original eval data from the upstream repo
│ ├── persona_vectors/ # Extracted persona vectors (.pt), keyed by model and trait
│ └── model_responses/ # Model responses and LLM-as-judge verdicts (CSV)
├── pipeline/ # End-to-end orchestration scripts
├── results/ # Aggregated summary CSVs, keyed by model
└── source/ # Core Python library, partially adapted from the original repo
├── config.py # Shared configuration (paths, model names, hyperparameters)
├── generate_vec.py # Extract persona vectors from model activations
├── activation_steer.py # Apply persona vectors via activation addition
├── eval_persona.py # Run steered generation and collect responses
├── judge.py # GPT-based judge
├── deepseek_judge.py # DeepSeek-based trait expression judge
├── model_utils.py # Model and tokenizer loading helpers
├── prompts.py # Prompt templates shared across scripts
└── utils.py # General utility functions
An OpenAI API key is necessary for using GPT4.1 as a judge like in our report. A Gemini API key is necessary for creating the character description extract and eval dataset. A DeepSeek API key is necessary for using the DeepSeek-based trait expression judge. The keys should be specified in the .env file. We recommend to use a Docker container.
- Create a virtual environment.
python3.11 -m venv pretraining-persona
source pretraining-persona/bin/activate- Install dependencies:
pip install -r requirements.txt- Configure environment and fill in your API keys in the .env file
cp .env.example .env- All scripts should be run from the main directory
- Output directories are created automatically
- Most scripts skip runs if output files already exist (check individual scripts) unless specified overwrite flags
- Ablation scripts automatically aggregate results (see bottom of script files)
To reproduce the baseline evaluation results, please run the following script:
./pipeline/baseline_evaluation.shThese arguments can be specified via environment variables:
GPU(default: 0) - GPU device IDMODEL(default: "allenai/OLMo-3-1025-7B") - Model name or pathN_PER_QUESTION(default: 5) - Number of responses per questionMAX_TOKENS(default: 64) - Maximum number of tokensJUDGE_MODEL(default: "gpt-4.1-mini-2025-04-14") - Judge model for evaluationOUTPUT_BASE_DIR(default: "ablation_results") - Base directory for outputREPETITION_PENALTY(default: 1.1) - Repetition penalty
Output
- model responses and their judgments in
data/model_responses/{MODEL}/baseline - aggregated statistics in
results/{MODEL}baseline_summary.csv
Run the following script calls to generate responses for the instructed prompts using the base models. The outputs will be saved in data/model_responses/extract.
TRAIT=evil_character_neutral_q ./pipeline/instructed_continuations.sh
These arguments can be specified via environment variables:
GPU(default: 0) - GPU device IDMODEL(default: "allenai/OLMo-3-1025-7B") - Model name or pathTRAIT(default: "evil") - Trait to evaluateREVISION(default: "") - Model checkpointJUDGE_MODEL(default: "gpt-4.1-mini-2025-04-14") - Judge model for evaluationN_PER_QUESTION(default: 10) - Number of responses per questionMAX_TOKENS(default: 64) - Maximum number of tokensREPETITION_PENALTY(default: 1.1) - Repetition penaltySKIP_JUDGE(default: False) - Whether to skip judging the responses for debugging
Generated files:
data/model_responses/extract/{MODEL}/{REVISION}/{TRAIT}_pos_instruct.csvdata/model_responses/extract/{MODEL}/{REVISION}/{TRAIT}_neg_instruct.csv
To run the combined persona vector extraction below, execute python pipeline/merge_persona_vectors.py.
Params:
--model: Model name (e.g., "allenai/OLMo-3-1025-7B")--revision: Model revision (default: "")--methods: Comma-separated list of methods to merge (default: all neutral question methods)
Generated file:
data/trait_data_eval/{MODEL}/evil_combined.json
Execute the following script calls to extract persona vectors from a base model. The outputs will be saved in data/persona_vectors.
TRAIT=evil_character_neutral_q ./pipeline/generate_vec.sh
These arguments can be specified via environment variables:
GPU(default: 0) - GPU device IDMODEL(default: "allenai/OLMo-3-1025-7B") - Model name or pathTRAIT(default: "evil") - Trait to evaluate, options see aboveTHRESHOLD(default: 50) - Threshold value for evilness
Generated Files:
data/persona_vectors/{MODEL}/{TRAIT}_prompt_avg_diff.pt: Average prompt activations differencedata/persona_vectors/{MODEL}/{TRAIT}_response_avg_diff.pt: Average response activations differencedata/persona_vectors/{MODEL}/{TRAIT}_prompt_last_diff.pt: Last prompt token activations difference
Each vector has shape: [layers × hidden_dim]
pipeline/checkpoint_grids.sh defines the canonical checkpoint lists used across all sweep experiments. It is sourced automatically by scripts that loop over checkpoints — do not run it directly.
| Variable | Model | Checkpoints |
|---|---|---|
OLMO3_UNIVERSAL_TRANSFER_CHECKPOINT_GRID |
OLMo-3-1025-7B | 16 checkpoints: stage1-step3000 … main |
APERTUS_UNIVERSAL_TRANSFER_CHECKPOINT_GRID |
Apertus-8B-2509 | 15 checkpoints: step50000-tokens210B … main |
It also sets OLMO3_SHARED_NORM_CALIBRATION_TEXT_FILES pointing to the shared prompt file used for hidden-state norm calibration.
The core experiment of the paper (Figures 1 and 7). For each pretraining checkpoint in the model's grid, persona vectors are extracted and used to steer the same checkpoint. The reported metric is the steered-minus-baseline trait-expression delta (∆τ). Per checkpoint the script runs four steps in sequence: generate instructed continuations → extract vectors → baseline eval → steered eval.
Steering hyperparameters are fixed per model–trait pair (Table 7). Apertus evil uses c=0.2 for checkpoints up to 4200B tokens and c=0.5 afterwards.
# OLMo-3-7B (default)
./pipeline/checkpoint_sweep.sh
# Apertus-8B-2509
MODEL=Apertus/Apertus-8B-2509 MODEL_NAME_FOR_PATHS=Apertus-8B-2509 \
./pipeline/checkpoint_sweep.shEnvironment Variables (optional):
GPU(default: 0)MODEL(default:"allenai/OLMo-3-1025-7B")MODEL_NAME_FOR_PATHS— filesystem-safe model name (default: basename ofMODEL)TRAITS— comma-separated trait list (default: all four for OLMo-3, three for Apertus)CHECKPOINT_GRID— override the checkpoint list (default: model-appropriate grid fromcheckpoint_grids.sh)RUN_TAG(default:"same_checkpoint_v1") — output subdirectory labelN_PER_QUESTION_EXTRACT(default: 10) /N_PER_QUESTION_EVAL(default: 10)MAX_TOKENS(default: 64) /REPETITION_PENALTY(default: 1.1)JUDGE_MODEL(default:"gpt-4.1-mini-2025-04-14")THRESHOLD(default: 50) — minimum score for a generation to be included in vector extractionSKIP_JUDGE(default:False) — skip judging during extraction (useful for debugging)OVERWRITE(default:False) — re-run steps even if output files already exist
Steering hyperparameters (Table 7, hard-coded per trait):
| Model | Trait | Layer | Coef |
|---|---|---|---|
| OLMo-3-7B | evil, sycophantic, impolite | 16 | 0.5 |
| OLMo-3-7B | humorous | 20 | 0.3 |
| Apertus-8B | evil | 16 | 0.2 (≤4200B tokens) / 0.5 (>4200B) |
| Apertus-8B | sycophantic | 16 | 0.2 |
| Apertus-8B | impolite | 20 | 0.15 |
Output layout (one directory per checkpoint):
data/model_responses/extract/{MODEL}/{revision}/{trait}_{pos|neg}_instruct.csv
data/persona_vectors/{MODEL}/{revision}/{trait}_response_avg_diff.pt
data/model_responses/eval/{MODEL}/{revision}/{RUN_TAG}/baseline_{trait}.csv
data/model_responses/eval/{MODEL}/{revision}/{RUN_TAG}/steered_{trait}_layer{L}_coef{C}.csv
results/{MODEL}/checkpoint_grid/{RUN_TAG}/
The emergence figure (Figure 1) and its companion LaTeX table are produced by analysis/make_emergence_plot.py and analysis/make_emergence_table.py. See analysis/README.md → Emergence Plots.
Applies persona vectors already extracted from a base model's pretraining checkpoints (produced by checkpoint_sweep.sh) to a fixed post-trained target (instruct/SFT/DPO/RLVR). This tests whether vectors learned during pretraining generalise across the fine-tuning boundary. The baseline evaluation runs once per trait on the instruct model; the steered evaluation iterates over all extract checkpoints.
Prerequisites: persona vectors at every checkpoint must already exist in data/persona_vectors/{EXTRACT_MODEL}/{revision}/{trait}_response_avg_diff.pt. Run checkpoint_sweep.sh first if they are missing.
# OLMo-3 base vectors → Olmo-3-7B-Instruct (default)
./pipeline/transfer_sweep.sh
# OLMo-3 base vectors → Olmo-3-7B-Instruct-SFT
EVAL_MODEL=allenai/Olmo-3-7B-Instruct-SFT \
RUN_TAG=evil_transfer_to_sft_v1 \
./pipeline/transfer_sweep.sh
# Apertus base vectors → Apertus-8B-Instruct-2509
EXTRACT_MODEL=Apertus/Apertus-8B-2509 \
EXTRACT_MODEL_NAME_FOR_PATHS=Apertus-8B-2509 \
EVAL_MODEL=swiss-ai/Apertus-8B-Instruct-2509 \
EVAL_MODEL_NAME_FOR_PATHS=Apertus-8B-Instruct-2509 \
./pipeline/transfer_sweep.shEnvironment Variables (optional):
GPU(default: 0)EXTRACT_MODEL(default:"allenai/OLMo-3-1025-7B") — base model that produced the vectorsEXTRACT_MODEL_NAME_FOR_PATHS— filesystem-safe name forEXTRACT_MODEL(default: basename ofEXTRACT_MODEL)EVAL_MODEL(default:"allenai/Olmo-3-7B-Instruct") — post-trained model to steerEVAL_MODEL_NAME_FOR_PATHS— filesystem-safe name forEVAL_MODEL(default: basename ofEVAL_MODEL)TRAITS— comma-separated list (default: all four for OLMo-3, three for Apertus)CHECKPOINT_GRID— override checkpoint list (default: model-appropriate grid fromcheckpoint_grids.sh)RUN_TAG(default:"transfer_v1") — output subdirectory labelVECTOR_NORM(default: empty) — if set, rescales every vector to this L2 norm before steering, matching the instruct model's hidden-state scale; the value is embedded in output filenames astargetnorm{VECTOR_NORM//./p}N_PER_QUESTION(default: 10) /MAX_TOKENS(default: 64) /REPETITION_PENALTY(default: 1.1)JUDGE_MODEL(default:"gpt-4.1-mini-2025-04-14")OVERWRITE(default:False) — re-run steps even if output files already exist
Steering hyperparameters used in the paper (per-target transfer, Table 7):
| Base model | Trait | Layer | Coef (→ Instruct) |
|---|---|---|---|
| OLMo-3-7B | evil | 16 | 0.55 |
| OLMo-3-7B | sycophantic | 16 | 0.50 |
| OLMo-3-7B | impolite | 20 | 0.75 |
| OLMo-3-7B | humorous | 20 | 0.30 |
| Apertus-8B | evil | 16 | 0.3 |
| Apertus-8B | sycophantic | 16 | 0.5 |
| Apertus-8B | impolite | 20 | 0.7 |
Output layout:
data/model_responses/eval/{EVAL_MODEL}/instruct_transfer/{RUN_TAG}/
baselines/
baseline_{trait}.csv # shared baseline (one per trait)
steered/
{extract_revision}/
steering_results_{trait}_layer{L}_coef{C}.csv
results/{EVAL_MODEL}/instruct_transfer/{RUN_TAG}/
The transfer figures (per-target Δ trait score curves) and companion LaTeX tables are produced by analysis/make_transfer_plot.py and analysis/make_transfer_table.py. See analysis/README.md → Transfer Plots.
Geometry plots from Section 5.1 (Figures 4, 5, 10, 11) — cosine similarity across checkpoints, MDS embeddings, per-vector L2 norms — are produced by analysis/plot_persona_vectors.py. The two-panel cosine trajectory figure is produced by analysis/make_cosine_trajectory_plot.py. See analysis/README.md → Persona Vector Geometry Analysis.
Sub-facet annotation of steered/instructed responses across pretraining checkpoints:
- Baumeister's Roots of Evil —
analysis/baumeister_gpt_annotation.py. See analysis/README.md → Baumeister's Roots of Evil. - ELEPHANT Sycophancy Facets —
analysis/sycophancy_gpt_annotation.py. See analysis/README.md → ELEPHANT Sycophancy Facets. - Joint Facet Plot —
analysis/facet_joint_plot.py. See analysis/README.md → Joint Facet Plot.
This approach requires the original trait artifacts in data/trait_data_eval and data/trait_data_extract. Run python pipeline/extract_data_generation/character_desc_transform.py to generate character descriptions from the original JSON files. This file requires you to set the GEMINI_API_KEY environment variable. The output will be saved in data/eval_persona_{eval|extract}/evil_character_neutral_q.json.
Arguments:
--trait: Trait name (default: "evil_character")--output-file-name: Output file name (default: "evil_character_neutral_q.json")--model: LLM model for generation (default: "gemini-2.5-pro")--temperature: Temperature for generation (default: 0.0)
Narration prompts are generated in the pipeline/extract_data_generation/ directory using a template-based approach. The process works as follows:
-
Template files (e.g.,
evil_template.txt,ethical_template.txt) contain instructions and a{QUESTION}placeholder that defines the prompt structure and constraints. -
Question files (e.g.,
evil_questions.txt,ethical_questions.txt) contain specific storytelling prompts, one per line, that describe scenarios to generate stories about. -
Generate prompts: Run
python pipeline/extract_data_generation/generate_prompts.pyto replace the{QUESTION}placeholder in the template with each question from the questions file. This creates a complete prompt for each question, saved tooutputs/generated_prompts.jsonandoutputs/generated_prompts.txt. -
Generate stories: Use
python pipeline/extract_data_generation/infer_vllm.py --prompts-file outputs/generated_prompts.jsonto generate story responses using vLLM. The script loads the prompts, runs inference with the specified model, and saves responses to a JSON file.
For detailed usage instructions, see pipeline/extract_data_generation/README.md.
Dialogue-based extraction and evaluation data are specified directly as JSON files in: data/trait_data_{extract/eval}/evil_dialogue_{neutral/evil}_q.json Unlike the character descriptions and storytelling prompts, these dialogue prompts are manually curated (with the help of an LLM assistant) and do not require a separate generation script.
The dialogue JSON files fully specify both the 2-shot instructions and the neutral or evil dialogue questions; no additional prompt-generation step is needed beyond providing the trait name and options to source.eval_persona.
Tests different vector types (few_shot, person, dialogue) with fixed layer and coefficient across all evaluation sets.
Usage:
./pipeline/vector_type_evaluation.shEnvironment Variables (optional):
MODEL(default: "allenai/OLMo-3-1025-7B")STEERING_TYPE(default: "response")LAYER(default: 16) - Fixed layer for all runsCOEF(default: 2.0) - Fixed coefficient for all runsN_PER_QUESTION(default: 10)MAX_TOKENS(default: 64)JUDGE_MODEL(default: "gpt-4.1-mini-2025-04-14")OUTPUT_BASE_DIR(default: "data/model_responses/eval")
Vector Types Tested:
stories-data/persona_vectors/OLMo-3-1025-7B/evil_stories_neutral_q_response_avg_diff.ptcharacter-data/persona_vectors/OLMo-3-1025-7B/evil_character_neutral_q_response_avg_diff.ptdialogue-data/persona_vectors/OLMo-3-1025-7B/evil_dialogue_neutral_q_response_avg_diff.pt
Eval Sets: evil_stories_neutral_q, evil_character_neutral_q, evil_dialogue_neutral_q
Total Runs: 3 vector types × 3 eval sets = 9 runs
Output:
- Results saved to:
data/model_responses/eval/{model_name}_vector_types/{vec_type}_{eval_set}.csv - Summary:
results/{model_name}/vector_type_summary.csv
Example:
LAYER=16 COEF=2.0 ./pipeline/vector_type_evaluation.shRuns negative controls to verify that the observed steering effects are persona-specific and not an artefact of any non-zero perturbation. Two null vectors are built per checkpoint and compared against the real persona vector at identical layer, coefficient, and normalization:
- Random-direction control: a Gaussian unit vector sampled independently of the extraction data.
- Label-shuffled control: the difference-of-means vector recomputed after randomly flipping each sample's pos/neg label with probability 0.5.
The script sources pipeline/checkpoint_grids.sh for the shared checkpoint lists and calls source.build_control_vectors to produce the null vectors, then evaluates them with source.eval_persona.
./pipeline/control_vectors.shEnvironment Variables (optional):
GPU(default: 0)MODEL(default:"allenai/Olmo-3-1025-7B")TRAIT(default:"evil_character_neutral_q")CONTROL_EXTRACT_REVISIONS(default: 5 representative checkpoints) - comma-separated checkpoint IDsCONTROL_MODES(default:"random,shuffled")SEEDS(default:"0,1,2") - random seeds for reproducibilityLAYER(default: 16) /COEF(default: 0.5) - must match the real persona-vector runTHRESHOLD(default: 50) - extraction quality filter (shuffled mode only)N_PER_QUESTION_EVALUATE(default: 3) /MAX_QUESTIONS_EVALUATE(default: 20)RUN_TAG(default:"evil_controls_layer16_coef0p5_v1") - output subdirectory labelOVERWRITE(default:False) /RUN_BASELINES(default:True)
Output:
- Control vectors:
data/persona_vectors/{MODEL}-controls/{revision}/{TRAIT}__{mode}__seed{N}_response_avg_diff.pt - Steered responses:
data/model_responses/eval/{MODEL}/{revision}/controls/{RUN_TAG}/steered/{mode}/seed{N}/ - Baseline responses:
data/model_responses/eval/{MODEL}/{revision}/controls/{RUN_TAG}/baselines/ - Summary directory:
results/{MODEL}/controls/{RUN_TAG}/
If you find this work useful, please cite our paper:
@misc{moskvoretskii2026tracingpersonavectorsllm,
title={Tracing Persona Vectors Through LLM Pretraining},
author={Viktor Moskvoretskii and Dominik Glandorf and Jorge Medina Moreira and Tanja Käser and Robert West},
year={2026},
eprint={2605.13329},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2605.13329},
}