sandbagging

Here are 3 public repositories matching this topic...

roldanjorge / sdf-belief-dissociation

Synthetic-document fine-tuning on Qwen2.5-7B: a controlled study of whether SDF installs sandbagging, finding a layered recognition/generation/behavior dissociation.

model-organisms ai-safety ai-alignment sandbagging evaluation-awareness synthetic-document-finetuning

Updated May 25, 2026
Python

gelisam / sandbagging-detection-via-static-analysis

Proving that the neural network is honest about its lack of capabilities

ai-safety sandbagging ai-safety-research

Updated Feb 7, 2026
TypeScript

Habib-AAhsan / gptoss-redteam-pack

Reproducible red-team findings for openai/gpt-oss-20b: five minimal harnesses with checks, zips & manifest (v0.9.3).

evaluation kaggle reproducibility red-team ai-safety llm chain-of-thought gpt-oss-20b sandbagging

Updated Aug 26, 2025
HTML
