mint-philosophy/llm_coherence

Incoherent Values? Probing LLM Preferences Through Parametric Variation

Code and data for testing whether LLM forced-choice preferences remain ordered across controlled seven-tier outcome ladders.

This repository provides the validated inputs, model-run wrappers, and analysis code for measuring monotonic preference coherence and predictive utility across LLMs.
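
As a rough illustration of what "remaining ordered" could mean, the sketch below scores one ladder by the share of tier pairs whose majority choice agrees with the ladder's intended order. The function name, the input layout, and the 0.5 majority threshold are assumptions made for this example. Step 11 (src/llm_coherence/analysis/analyze_7tier_coherence.py) is the authoritative implementation and may compute coherence differently.

```python
from itertools import combinations


def ordered_pair_fraction(agreement_share):
    """Share of tier pairs whose majority choice follows the ladder order.

    agreement_share maps a tier pair (i, j) with i < j to the observed share
    of trials in which the model picked the member that the ladder ranks as
    preferred. This layout is hypothetical, not the repository's format.
    """
    if not agreement_share:
        raise ValueError("no tier pairs supplied")
    agree = sum(1 for share in agreement_share.values() if share > 0.5)
    return agree / len(agreement_share)


# Toy data for a seven-tier ladder: every pair follows the order except (3, 4).
toy = {pair: 0.9 for pair in combinations(range(1, 8), 2)}
toy[(3, 4)] = 0.2

score = ordered_pair_fraction(toy)
print(f"{score:.3f}")  # 20 of the 21 pairs agree with the ladder order
```

A score of 1.0 under this toy definition would mean every tier pair is chosen in ladder order; lower values indicate pairwise reversals.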

Complete experiment artifacts are hosted on Hugging Face. Git tracks the reproducible code, canonical inputs, and lightweight public summaries.

Experiment Data

All datasets created during the experiment—including canonical inputs under data/ and model-run payloads under outputs/—are available on Hugging Face:

🤗 Dataset: https://huggingface.co/datasets/MINTLABJHUANU/LLMCoherence_Var_100

Clone or download that dataset repo to populate data/ and outputs/ locally without rerunning API calls.
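
One possible way to script that download is with the huggingface_hub client, sketched below. The helper names are hypothetical, and the allow_patterns filter assumes the dataset repo exposes top-level data/ and outputs/ folders as described above. Check the dataset page before relying on it.

```python
from pathlib import Path

REPO_ID = "MINTLABJHUANU/LLMCoherence_Var_100"  # dataset named in this README


def download_kwargs(local_dir=".", folders=("data", "outputs")):
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    return {
        "repo_id": REPO_ID,
        "repo_type": "dataset",
        "local_dir": str(Path(local_dir)),
        # Assumption: artifacts sit under these top-level folders in the repo.
        "allow_patterns": [f"{name}/*" for name in folders],
    }


def download_artifacts(local_dir="."):
    """Fetch the selected folders into local_dir and return the local path."""
    # Imported lazily so download_kwargs works without the Hub client installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_kwargs(local_dir))


if __name__ == "__main__":
    # Print the request that would be made; call download_artifacts() to fetch.
    print(download_kwargs())
```

Calling download_artifacts() from the repository root would place the files next to the tracked code. Large payloads may take a while to transfer.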

Repository Structure

| Path | Purpose |
| --- | --- |
| data/ | Canonical inputs and intermediate instrument data. Numbered subfolders (01_ through 06_) follow the experiment order. |
| outputs/ | Model-run payloads, per-model analysis, and checkpoints. Ignored by Git. |
| results/ | Paper figures, tables, and small tracked summaries. |
| api_keys/ | Local provider API keys (api_key_<provider>.txt). Ignored by Git. |
| scripts/ | Numbered command wrappers for rerunning the pipeline. |
| src/llm_coherence/ | Importable Python package used by the wrappers. |

Use scripts/ to run the pipeline. Use src/llm_coherence/ to edit or audit the implementation. Wrapper files are intentionally small and delegate to main() functions in src/.

Tracked Inputs

The canonical validated ladder set is:

data/05_ladder_validation/phase6b_variations_pruned_final.json

The canonical forced-choice inputs are:

data/06_forced_choice_inputs/phase6b_variations_pruned/

The main count progression is:

| Stage | Count |
| --- | --- |
| Source outcomes | 510 |
| Screened candidate outcomes | 181 |
| Generated ladder candidates | 146 |
| Final validated ladders | 100 |
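
To make the attrition between stages concrete, the snippet below computes the share retained at each step and overall. It treats every stage as a simple filter of the one before it, which is an assumption of this sketch rather than something the table states.

```python
# Counts copied from the table above.
stages = [
    ("Source outcomes", 510),
    ("Screened candidate outcomes", 181),
    ("Generated ladder candidates", 146),
    ("Final validated ladders", 100),
]


def retention(stages):
    """Return (name, count, share of previous stage, share of first stage)."""
    first = stages[0][1]
    prev = first
    rows = []
    for name, count in stages:
        rows.append((name, count, count / prev, count / first))
        prev = count
    return rows


rows = retention(stages)
for name, count, step, overall in rows:
    print(f"{name:<30}{count:>5}{step:>8.1%}{overall:>8.1%}")
```

Under that assumption, about 19.6% of the source outcomes end up as final validated ladders.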

Installation

Required dependencies

  • Python >=3.11,<3.13 (use 3.11 or 3.12) — required for all local analysis, replication from the Hub dataset, and API-based model runs.
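
A quick way to confirm that an interpreter falls inside the supported range before creating the environment is sketched below. The helper name is hypothetical and is not part of the repository.

```python
import sys


def is_supported(version=None):
    """Return True when the (major, minor) version satisfies >=3.11,<3.13."""
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 11) <= major_minor < (3, 13)


if __name__ == "__main__":
    status = "supported" if is_supported() else "unsupported"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```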

Optional dependencies are needed only if you re-run glm-45-base-logprobs from scratch (self-hosted vLLM on GPU; not routed through OpenRouter). Skip them if you download existing outputs from MINTLABJHUANU/LLMCoherence_Var_100 or run only other models via API:

  • Docker: Docker Desktop and a Docker Hub account (docker login) to build and push Dockerfile.hf_jobs.
  • Hugging Face: set api_keys/hf_token (or run hf auth login) to submit GLM jobs with --submit-hf-job.

Create an isolated environment and install the package:

bash scripts/00_repository/00_create_environment.sh
source .venv/bin/activate

The environment script installs the dependencies declared in pyproject.toml, including the NumPy pin used by the analysis stack. If you already have a clean Python 3.11 or 3.12 environment, the manual equivalent is:

python -m pip install -e .

HF Jobs submission helpers require the optional Hub dependency:

python -m pip install -e ".[hf-jobs]"

Validate tracked inputs and lightweight indexes:

PYTHONPATH=src python scripts/00_repository/validate_artifacts.py

Refresh browsable indexes after adding local model-run payloads under outputs/:

PYTHONPATH=src python scripts/00_repository/validate_artifacts.py --write-indexes

API Keys

Model-run steps require provider access. Set environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY, OPENROUTER_API_KEY) or create local files under api_keys/:

api_keys/api_key_openai.txt
api_keys/api_key_anthropic.txt
api_keys/api_key_openrouter.txt

Keys are loaded through src/llm_coherence/runtime/api_keys.py and are not included in this repository.
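
The lookup described above can be pictured roughly as follows. This is a minimal sketch, not the code in src/llm_coherence/runtime/api_keys.py: the function name is hypothetical, and checking the environment variable before the file is an assumed precedence.

```python
import os
from pathlib import Path

# Environment variable names listed in this README.
ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def load_api_key(provider, key_dir="api_keys", environ=None):
    """Return the key for provider from the environment or a local file."""
    environ = os.environ if environ is None else environ
    env_name = ENV_VARS[provider]

    # Assumed precedence: a non-empty environment variable wins.
    value = environ.get(env_name, "").strip()
    if value:
        return value

    # Fall back to api_keys/api_key_<provider>.txt.
    key_file = Path(key_dir) / f"api_key_{provider}.txt"
    if key_file.is_file():
        value = key_file.read_text(encoding="utf-8").strip()
        if value:
            return value

    raise RuntimeError(
        f"No key for {provider}: set {env_name} or create {key_file}"
    )
```

Stripping whitespace guards against a trailing newline in the key file, a common cause of authentication failures.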

Quick Smoke Test

For ordinary replication, run a bounded smoke test before launching full model runs. The example below starts from the tracked validated ladders, creates a small forced-choice slice, runs both model experiments (step 10a: within-ladder tier-pair preferences; step 10b: ladder-vs-comparison forced choice), and runs both analysis stages.

PYTHONPATH=src python scripts/03_forced_choice_inputs/09_generate_forced_choice_inputs.py \
  --variations data/05_ladder_validation/phase6b_variations_pruned_final.json \
  --comparison-sample data/06_forced_choice_inputs/comparison_sample.json \
  --max-variations 2 \
  --max-comparison-samples 10 \
  --output-dir data/06_forced_choice_inputs/phase6b_variations_pruned_smoke_tiny10
PYTHONPATH=src python scripts/04_model_runs/10a_run_within_ladder_experiment.py \
  --model ministral-3b-2512-openrouter \
  --smoke
PYTHONPATH=src python scripts/04_model_runs/10b_run_7tier_experiment.py \
  --model ministral-3b-2512-openrouter \
  --trials 1 \
  --data-dir data/06_forced_choice_inputs/phase6b_variations_pruned_smoke_tiny10 \
  --max-variation-sets 2 \
  --max-concurrent 1 \
  --infrastructure openrouter \
  --smoke \
  --resume
PYTHONPATH=src python scripts/05_analysis/11_analyze_7tier_coherence.py \
  --model ministral-3b-2512-openrouter \
  --data-dir data/06_forced_choice_inputs/phase6b_variations_pruned_smoke_tiny10 \
  --results-dir outputs/ministral-3b-2512-openrouter/smoke_ministral-3b-2512-openrouter/ladder_vs_comparison_statements
# Optional on a tiny smoke slice (may produce no rows if too few comparison pairs).
PYTHONPATH=src python scripts/05_analysis/12_predictive_utility.py \
  --model ministral-3b-2512-openrouter \
  --results-dir outputs/ministral-3b-2512-openrouter/smoke_ministral-3b-2512-openrouter/ladder_vs_comparison_statements \
  --out-dir outputs/ministral-3b-2512-openrouter/smoke_ministral-3b-2512-openrouter/ladder_vs_comparison_statements/pred_utility_test \
  --n-perm 20
PYTHONPATH=src python scripts/06_reporting/13_make_fig_table.py \
  --model ministral-3b-2512-openrouter \
  --results-dir outputs

For a full rerun, remove or increase the smoke bounds (--max-variation-sets, --max-variations, and --max-comparison-samples), omit --smoke, and set the desired trial count.
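
The difference between the smoke and full invocations of Step 10b can be sketched as a small argument builder. The helper is hypothetical and uses only flags shown in this README; the trial count of 10 for the full run is an example value, not a requirement.

```python
import shlex

SCRIPT = "scripts/04_model_runs/10b_run_7tier_experiment.py"


def step_10b_command(model, data_dir, trials, smoke=False, max_variation_sets=None):
    """Build the argv list for a Step 10b run (hypothetical helper)."""
    argv = [
        "python", SCRIPT,
        "--model", model,
        "--trials", str(trials),
        "--data-dir", data_dir,
    ]
    # Smoke bounds are optional: leaving them out processes everything.
    if max_variation_sets is not None:
        argv += ["--max-variation-sets", str(max_variation_sets)]
    if smoke:
        argv.append("--smoke")
    return argv


smoke_cmd = step_10b_command(
    "ministral-3b-2512-openrouter",
    "data/06_forced_choice_inputs/phase6b_variations_pruned_smoke_tiny10",
    trials=1,
    smoke=True,
    max_variation_sets=2,
)
full_cmd = step_10b_command(
    "ministral-3b-2512-openrouter",
    "data/06_forced_choice_inputs/phase6b_variations_pruned",
    trials=10,  # example value; set the desired trial count
)

print(shlex.join(smoke_cmd))
print(shlex.join(full_cmd))
```

Either list can be passed to subprocess.run from the repository root with PYTHONPATH=src set in the environment.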

Pipeline

Run scripts from the repository root with PYTHONPATH=src python <script>.

| Step | Command wrapper | Implementation |
| --- | --- | --- |
| 1 | scripts/01_instrument_design/01_create_filtered_dataset.py | src/llm_coherence/generation/create_filtered_dataset.py |
| 2 | scripts/01_instrument_design/02_screen_outcomes.py | src/llm_coherence/generation/filter_statements.py |
| 3 | scripts/01_instrument_design/03_generate_7tier_ladders.py | src/llm_coherence/generation/generate_7tier_variations.py |
| 4 | scripts/02_ladder_validation/04_within_ladder_pruning.py | src/llm_coherence/validation/within_ladder_pruning.py |
| 5 | scripts/02_ladder_validation/05_property_ladder_pruning.py | src/llm_coherence/validation/property_ladder_pruning.py |
| 7 | scripts/02_ladder_validation/07_ranking_ladder_pruning.py | src/llm_coherence/validation/ranking_ladder_pruning.py |
| 8 | scripts/02_ladder_validation/08_build_final_pruned_variations.py | src/llm_coherence/validation/build_final_pruned_variations.py |
| 9 | scripts/03_forced_choice_inputs/09_generate_forced_choice_inputs.py | src/llm_coherence/generation/generate_7tier_comparisons.py |
| 10a | scripts/04_model_runs/10a_run_within_ladder_experiment.py | src/llm_coherence/experiments/within_ladder/run_within_ladder_experiment.py |
| 10b | scripts/04_model_runs/10b_run_7tier_experiment.py | src/llm_coherence/experiments/ladder_statement_pair/run_7tier_experiment.py |
| 11 | scripts/05_analysis/11_analyze_7tier_coherence.py | src/llm_coherence/analysis/analyze_7tier_coherence.py |
| 12 | scripts/05_analysis/12_predictive_utility.py | src/llm_coherence/analysis/predictive_utility.py |
| 13 | scripts/06_reporting/13_make_fig_table.py | src/llm_coherence/reporting/make_fig_table.py |

The early instrument-design and ladder-audit stages require API access and are not necessary for most replication workflows. Most users should start from the tracked validated ladders and forced-choice inputs.

GLM Base on HF Jobs

glm-45-base-logprobs is the self-hosted GLM base run. It is not routed through OpenRouter. Use the same Step 10a and Step 10b scripts as every other model; select GLM base with --model glm-45-base-logprobs and submit to HF Jobs with --submit-hf-job.

Build and push the HF Jobs image from the repository root:

IMAGE=your-dockerhub-user/llm-coherence-vllm:glm-base-YYYYMMDD
bash scripts/00_repository/01_build_hf_jobs_image.sh "$IMAGE"

Within-ladder GLM experiment (Instance 1 / Step 10a)

Before entering the H200 queue, exercise the same within-ladder vLLM scoring and upload path on one inexpensive L4 with the auxiliary Qwen 0.5B smoke model:

PYTHONPATH=src python scripts/04_model_runs/10a_run_within_ladder_experiment.py \
  --submit-hf-job \
  --model qwen25-05b-instruct-smoke \
  --image "$IMAGE" \
  --namespace MINTLABJHUANU \
  --flavor l4x1 \
  --timeout 1h \
  --max-variation-sets 1 \
  --hub-dataset MINTLABJHUANU/LLMCoherence_Var_100 \
  --job-tag qwen-l4-scoring-smoke \
  --path-in-repo smoke/qwen25-05b-instruct/scoring-smoke/within_ladder

This proxy run validates the container, 42-request one-ladder input, exact constrained A/B logprob scoring, analysis, and Hub upload. It does not validate that GLM fits or loads on the selected hardware; GLM remains an H200x8 run.
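
The README does not spell out where the 42 requests come from. One reading consistent with a seven-tier ladder, offered here as an assumption, is that each tier pair is presented in both A/B orders:

```python
from itertools import combinations, permutations

TIERS = range(1, 8)  # seven tiers per ladder

unordered_pairs = list(combinations(TIERS, 2))  # each tier pair once
ordered_pairs = list(permutations(TIERS, 2))    # each pair in both A/B orders

print(len(unordered_pairs), len(ordered_pairs))  # prints "21 42"
```

If that reading is right, presenting both orders would let the analysis separate a genuine tier preference from a bias toward whichever option is labeled A.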

Submit a one-ladder within-ladder smoke job first:

PYTHONPATH=src python scripts/04_model_runs/10a_run_within_ladder_experiment.py \
  --submit-hf-job \
  --model glm-45-base-logprobs \
  --image "$IMAGE" \
  --namespace MINTLABJHUANU \
  --flavor h200x8 \
  --timeout 1h \
  --max-variation-sets 1 \
  --hub-dataset MINTLABJHUANU/LLMCoherence_Var_100 \
  --job-tag glm-smoke-YYYYMMDD \
  --path-in-repo smoke/glm-45-base-logprobs/glm-smoke-YYYYMMDD/within_ladder

For the full within-ladder run across all 100 ladders, omit --max-variation-sets and upload to the canonical output path:

PYTHONPATH=src python scripts/04_model_runs/10a_run_within_ladder_experiment.py \
  --submit-hf-job \
  --model glm-45-base-logprobs \
  --image "$IMAGE" \
  --namespace MINTLABJHUANU \
  --flavor h200x8 \
  --timeout 12h \
  --hub-dataset MINTLABJHUANU/LLMCoherence_Var_100 \
  --job-tag glm-within-ladder-full-YYYYMMDD \
  --path-in-repo outputs/glm-45-base-logprobs/within_ladder

The within-ladder HF job runs Step 10a inside the container as --generate, --run-local, then --analyze. When --hub-dataset is supplied without an explicit --path-in-repo, outputs are uploaded to:

outputs/glm-45-base-logprobs/within_ladder/

100-ladder GLM experiment (Instance 2 / Step 10b)

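Step 10b runs the 7-tier ladder-vs-comparison experiment and accepts the same --model flag. As with Step 10a, first exercise the path on one L4 with the auxiliary Qwen 0.5B smoke model: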

PYTHONPATH=src python scripts/04_model_runs/10b_run_7tier_experiment.py \
  --submit-hf-job \
  --model qwen25-05b-instruct-smoke \
  --trials 1 \
  --image "$IMAGE" \
  --namespace MINTLABJHUANU \
  --flavor l4x1 \
  --timeout 1h \
  --max-variation-sets 1 \
  --smoke \
  --hub-dataset MINTLABJHUANU/LLMCoherence_Var_100 \
  --path-in-repo smoke/qwen25-05b-instruct/scoring-smoke/ladder_vs_comparison_statements

After that proxy path succeeds, run the GLM smoke on H200x8:

PYTHONPATH=src python scripts/04_model_runs/10b_run_7tier_experiment.py \
  --submit-hf-job \
  --model glm-45-base-logprobs \
  --trials 1 \
  --image "$IMAGE" \
  --namespace MINTLABJHUANU \
  --flavor h200x8 \
  --timeout 1h \
  --max-variation-sets 1 \
  --smoke \
  --hub-dataset MINTLABJHUANU/LLMCoherence_Var_100 \
  --path-in-repo smoke/glm-45-base-logprobs/glm-smoke-YYYYMMDD/ladder_vs_comparison_statements

After the GLM smoke succeeds, submit the complete 100-ladder Step 10b run with 10 trials per comparison:

PYTHONPATH=src python scripts/04_model_runs/10b_run_7tier_experiment.py \
  --submit-hf-job \
  --model glm-45-base-logprobs \
  --trials 10 \
  --image "$IMAGE" \
  --namespace MINTLABJHUANU \
  --flavor h200x8 \
  --timeout 12h \
  --hub-dataset MINTLABJHUANU/LLMCoherence_Var_100 \
  --job-tag glm-7tier-full-YYYYMMDD \
  --path-in-repo outputs/glm-45-base-logprobs/ladder_vs_comparison_statements

This command runs the actual GLM ladder-versus-comparison experiment, not the within-ladder validation. It processes the full manifest because it does not set --max-variation-sets. Results are uploaded to:

outputs/glm-45-base-logprobs/ladder_vs_comparison_statements/

Do not use --model-volume for the normal GLM run. Large sharded checkpoints can load very slowly from FUSE-mounted model volumes. By default, the model is downloaded to the job's local /data cache. --model-volume remains available as an experimental override and enables vLLM's prefetch loading strategy.

Outputs and External Artifacts

Tracked GitHub contents are sufficient to inspect the instrument and rerun the pipeline. For exact reproduction without rerunning APIs, download the full artifact tree from the Hugging Face dataset (data/ and outputs/).

Expected local output layout:

| Path | Contents |
| --- | --- |
| data/05_ladder_validation/ | Ladder validation: pruned ladder JSONs, audit reports, and judge run folders (within_ladder_validation_tier/, property/, ranking/). |
| outputs/<model_key>/within_ladder/ | Instance 1 (step 10a): tier-pair preferences, cost logs, summary.json. |
| outputs/<model_key>/ladder_vs_comparison_statements/ | Instance 2 (step 10b): per-ladder results.json, reasoning traces, cost logs. |
| outputs/<model_key>/ladder_vs_comparison_statements/coherence_test/ | Step 11: phase6b_coherence_*.json, justification analysis, per-category summaries. |
| outputs/<model_key>/ladder_vs_comparison_statements/pred_utility_test/ | Step 12: predictive-utility CSVs and summaries. |
| outputs/checkpoints/<model_key>/ | Resumable checkpoints for step 10b. |
| results/figures/ | Generated figures (step 13). |
| results/tables/ | Generated tables (step 13). |

Smoke runs for step 10b write under outputs/<model_key>/smoke_<model_key>/ladder_vs_comparison_statements/ instead of the full-run path above.
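
The two step 10b locations can be summarized with a small path helper. The function is hypothetical and only mirrors the layout documented here; the repository may build these paths differently.

```python
from pathlib import PurePosixPath


def step_10b_output_dir(model_key, smoke=False, root="outputs"):
    """Return the documented step 10b results directory for a model."""
    base = PurePosixPath(root) / model_key
    if smoke:
        # Smoke runs are nested one level deeper, under smoke_<model_key>/.
        base = base / f"smoke_{model_key}"
    return base / "ladder_vs_comparison_statements"


print(step_10b_output_dir("ministral-3b-2512-openrouter", smoke=True))
print(step_10b_output_dir("ministral-3b-2512-openrouter"))
```

The smoke path printed first matches the --results-dir value used in the Quick Smoke Test commands.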

The tracked results/model_run_index.json snapshot inventories local payloads under outputs/<model_key>/. Refresh it with validate_artifacts.py --write-indexes after copying or generating model-run artifacts.

To publish artifacts to the Hugging Face dataset, use the upload helper:

# Update the dataset README on Hugging Face
python scripts/00_repository/hf_upload/hf_dataset.py upload readme

# Upload outputs/ (resume incomplete models)
python scripts/00_repository/hf_upload/hf_dataset.py upload outputs --skip-existing

# Stage a local bundle with manifest (optional)
python scripts/00_repository/hf_upload/hf_dataset.py prepare /path/to/artifact_bundle

Public Summaries

Two small summary files are tracked for inspection:

results/phase6b_coherence_summary.json
results/model_run_index.json

These are not substitutes for the raw model-response artifact bundle on Hugging Face.

License

Released under the MIT License. See LICENSE.

Citation

If you use this repository or its experiment artifacts, please cite:

Ajayi, E., Chowdhury, A., & Lazar, S. (2026). Incoherent Values? Probing LLM Preferences Through Parametric Variation. arXiv:2606.21102. https://arxiv.org/abs/2606.21102

@misc{ajayi_chowdhury_lazar_2026_incoherent_values,
  author        = {Ajayi, Elena and Chowdhury, Angelica and Lazar, Seth},
  title         = {Incoherent Values? Probing LLM Preferences Through Parametric Variation},
  year          = {2026},
  eprint        = {2606.21102},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CY},
  url           = {https://arxiv.org/abs/2606.21102}
}
