Incoherent Values? Probing LLM Preferences Through Parametric Variation

Code and data for testing whether LLM forced-choice preferences remain ordered across controlled seven-tier outcome ladders.

This repository provides the validated inputs, model-run wrappers, and analysis code for measuring monotonic preference coherence and predictive utility across LLMs.

Complete experiment artifacts are hosted on Hugging Face. Git tracks the reproducible code, canonical inputs, and lightweight public summaries.

Experiment Data

All datasets created during the experiment—including canonical inputs under data/ and model-run payloads under outputs/—are available on Hugging Face:

🤗 Dataset: https://huggingface.co/datasets/MINTLABJHUANU/LLMCoherence_Var_100

Clone or download that dataset repo to populate data/ and outputs/ locally without rerunning API calls.
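
One way to script that download is sketched below. It assumes the third-party huggingface_hub package is installed and that the dataset repo stores data/ and outputs/ at its top level; the helper names snapshot_kwargs and populate are illustrative and not part of this repository.

```python
# Sketch only: assumes `pip install huggingface_hub` and that data/ and
# outputs/ sit at the top level of the dataset repo.
REPO_ID = "MINTLABJHUANU/LLMCoherence_Var_100"


def snapshot_kwargs(local_dir=".", subtrees=("data", "outputs")):
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    return {
        "repo_id": REPO_ID,
        "repo_type": "dataset",  # the artifacts live in a dataset repo
        "local_dir": str(local_dir),
        # Restrict the download to the requested top-level folders.
        "allow_patterns": [f"{name}/*" for name in subtrees],
    }


def populate(local_dir="."):
    """Download the selected subtrees into local_dir (needs network access)."""
    from huggingface_hub import snapshot_download  # lazy third-party import

    return snapshot_download(**snapshot_kwargs(local_dir))
```

Calling populate() from the repository root would place the files under ./data and ./outputs. Passing subtrees=("data",) to snapshot_kwargs limits the download to the canonical inputs.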

Repository Structure

Path	Purpose
`data/`	Canonical inputs and intermediate instrument data. Numbered subfolders (`01_`–`06_`) follow the experiment order.
`outputs/`	Model-run payloads, per-model analysis, and checkpoints. Ignored by Git.
`results/`	Paper figures, tables, and small tracked summaries.
`api_keys/`	Local provider API keys (`api_key_<provider>.txt`). Ignored by Git.
`scripts/`	Numbered command wrappers for rerunning the pipeline.
`src/llm_coherence/`	Importable Python package used by the wrappers.

Use scripts/ to run the pipeline. Use src/llm_coherence/ to edit or audit the implementation. Wrapper files are intentionally small and delegate to main() functions in src/.

Tracked Inputs

The canonical validated ladder set is:

data/05_ladder_validation/phase6b_variations_pruned_final.json

The canonical forced-choice inputs are:

data/06_forced_choice_inputs/phase6b_variations_pruned/
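
A quick pre-flight check that both canonical inputs are present can be sketched as follows. The function name missing_inputs is hypothetical, and the check looks only at existence, not content; scripts/00_repository/validate_artifacts.py remains the repository's own validator.

```python
from pathlib import Path

# The two canonical input locations named above.
LADDERS = Path("data/05_ladder_validation/phase6b_variations_pruned_final.json")
FORCED_CHOICE_DIR = Path("data/06_forced_choice_inputs/phase6b_variations_pruned")


def missing_inputs(repo_root="."):
    """Return the canonical inputs that are absent under repo_root."""
    root = Path(repo_root)
    missing = []
    if not (root / LADDERS).is_file():
        missing.append(LADDERS.as_posix())
    if not (root / FORCED_CHOICE_DIR).is_dir():
        missing.append(FORCED_CHOICE_DIR.as_posix())
    return missing


if missing_inputs():
    print("Missing canonical inputs:", ", ".join(missing_inputs()))
```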

The main count progression is:

Stage	Count
Source outcomes	510
Screened candidate outcomes	181
Generated ladder candidates	146
Final validated ladders	100
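
To read the progression as stage-to-stage retention, the arithmetic can be sketched as below. This treats each count as derived from the previous stage, which is an assumption about how the stages relate rather than something the table states.

```python
# Counts copied from the table above.
STAGES = [
    ("Source outcomes", 510),
    ("Screened candidate outcomes", 181),
    ("Generated ladder candidates", 146),
    ("Final validated ladders", 100),
]


def retention(stages):
    """Return (stage, count, share kept relative to the previous stage)."""
    rows = []
    previous = None
    for name, count in stages:
        kept = None if previous is None else count / previous
        rows.append((name, count, kept))
        previous = count
    return rows


for name, count, kept in retention(STAGES):
    note = "" if kept is None else f" ({kept:.1%} of previous stage)"
    print(f"{name}: {count}{note}")
```

Under that reading, roughly 35% of source outcomes pass screening, and 100 of the 146 generated ladder candidates (about 68%) survive validation.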

Installation

Required dependencies

Python >=3.11,<3.13 (use 3.11 or 3.12) — required for all local analysis, replication from the Hub dataset, and API-based model runs.
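
A small guard for the interpreter version, written as a sketch; the helper name is_supported is illustrative and does not exist in the package.

```python
import sys


def is_supported(version=None):
    """True when the interpreter satisfies Python >=3.11,<3.13."""
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 11) <= major_minor < (3, 13)


if not is_supported():
    print(f"Unsupported Python {sys.version.split()[0]}; use 3.11 or 3.12.")
```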

Optional dependencies are needed only if you re-run glm-45-base-logprobs from scratch (self-hosted vLLM on GPU; not routed through OpenRouter). Skip them if you download existing outputs from MINTLABJHUANU/LLMCoherence_Var_100 or run only other models via API:

Docker: Docker Desktop and a Docker Hub account (docker login) to build and push Dockerfile.hf_jobs.
Hugging Face: a token in api_keys/hf_token (or hf auth login) to submit GLM jobs with --submit-hf-job.

Create an isolated environment and install the package:

bash scripts/00_repository/00_create_environment.sh
source .venv/bin/activate

The environment script installs the dependencies declared in pyproject.toml, including the NumPy pin used by the analysis stack. If you already have a clean Python 3.11 or 3.12 environment, the manual equivalent is:

python -m pip install -e .

HF Jobs submission helpers require the optional Hub dependency:

python -m pip install -e ".[hf-jobs]"

Validate tracked inputs and lightweight indexes:

PYTHONPATH=src python scripts/00_repository/validate_artifacts.py

Refresh browsable indexes after adding local model-run payloads under outputs/:

PYTHONPATH=src python scripts/00_repository/validate_artifacts.py --write-indexes

API Keys

Model-run steps require provider access. Set environment variables (OPENAI_API_KEY, ANTHROPIC_API_KEY, OPENROUTER_API_KEY) or create local files under api_keys/:

api_keys/api_key_openai.txt
api_keys/api_key_anthropic.txt
api_keys/api_key_openrouter.txt

Keys are loaded through src/llm_coherence/runtime/api_keys.py and are not included in this repository.
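
The lookup described above could work roughly as in the sketch below. This is not the code in src/llm_coherence/runtime/api_keys.py: the function name load_api_key and the precedence (environment variable first, then file) are assumptions made for illustration.

```python
import os
from pathlib import Path


def load_api_key(provider, key_dir="api_keys", env=None):
    """Return a provider key from <PROVIDER>_API_KEY or api_key_<provider>.txt.

    Assumption: the environment variable wins over the file when both exist.
    """
    env = os.environ if env is None else env
    value = env.get(f"{provider.upper()}_API_KEY", "").strip()
    if value:
        return value
    key_file = Path(key_dir) / f"api_key_{provider}.txt"
    if key_file.is_file():
        # Strip the trailing newline that editors usually add.
        return key_file.read_text(encoding="utf-8").strip()
    raise FileNotFoundError(
        f"No key for {provider!r}: set {provider.upper()}_API_KEY or create {key_file}"
    )
```

For example, load_api_key("openrouter") would read OPENROUTER_API_KEY if set and otherwise fall back to api_keys/api_key_openrouter.txt.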

Quick Smoke Test

For ordinary replication, run a bounded smoke test before launching full model runs. The example below starts from the tracked validated ladders, creates a small forced-choice slice, runs both model experiments (step 10a: within-ladder tier-pair preferences; step 10b: ladder-vs-comparison forced choice), and runs both analysis stages.

PYTHONPATH=src python scripts/03_forced_choice_inputs/09_generate_forced_choice_inputs.py \
  --variations data/05_ladder_validation/phase6b_variations_pruned_final.json \
  --comparison-sample data/06_forced_choice_inputs/comparison_sample.json \
  --max-variations 2 \
  --max-comparison-samples 10 \
  --output-dir data/06_forced_choice_inputs/phase6b_variations_pruned_smoke_tiny10

PYTHONPATH=src python scripts/04_model_runs/10a_run_within_ladder_experiment.py \
  --model ministral-3b-2512-openrouter \
  --smoke

PYTHONPATH=src python scripts/04_model_runs/10b_run_7tier_experiment.py \
  --model ministral-3b-2512-openrouter \
  --trials 1 \
  --data-dir data/06_forced_choice_inputs/phase6b_variations_pruned_smoke_tiny10 \
  --max-variation-sets 2 \
  --max-concurrent 1 \
  --infrastructure openrouter \
  --smoke \
  --resume

PYTHONPATH=src python scripts/05_analysis/11_analyze_7tier_coherence.py \
  --model ministral-3b-2512-openrouter \
  --data-dir data/06_forced_choice_inputs/phase6b_variations_pruned_smoke_tiny10 \
  --results-dir outputs/ministral-3b-2512-openrouter/smoke_ministral-3b-2512-openrouter/ladder_vs_comparison_statements

# Optional on a tiny smoke slice (may produce no rows if too few comparison pairs).
PYTHONPATH=src python scripts/05_analysis/12_predictive_utility.py \
  --model ministral-3b-2512-openrouter \
  --results-dir outputs/ministral-3b-2512-openrouter/smoke_ministral-3b-2512-openrouter/ladder_vs_comparison_statements \
  --out-dir outputs/ministral-3b-2512-openrouter/smoke_ministral-3b-2512-openrouter/ladder_vs_comparison_statements/pred_utility_test \
  --n-perm 20

PYTHONPATH=src python scripts/06_reporting/13_make_fig_table.py \
  --model ministral-3b-2512-openrouter \
  --results-dir outputs

For a full rerun, remove or increase the smoke bounds (--max-variation-sets, --max-variations, and --max-comparison-samples), omit --smoke, and set the desired trial count.

Pipeline

Run scripts from the repository root with PYTHONPATH=src python <script>.

Step	Command wrapper	Implementation
1	`scripts/01_instrument_design/01_create_filtered_dataset.py`	`src/llm_coherence/generation/create_filtered_dataset.py`
2	`scripts/01_instrument_design/02_screen_outcomes.py`	`src/llm_coherence/generation/filter_statements.py`
3	`scripts/01_instrument_design/03_generate_7tier_ladders.py`	`src/llm_coherence/generation/generate_7tier_variations.py`
4	`scripts/02_ladder_validation/04_within_ladder_pruning.py`	`src/llm_coherence/validation/within_ladder_pruning.py`
5	`scripts/02_ladder_validation/05_property_ladder_pruning.py`	`src/llm_coherence/validation/property_ladder_pruning.py`
7	`scripts/02_ladder_validation/07_ranking_ladder_pruning.py`	`src/llm_coherence/validation/ranking_ladder_pruning.py`
8	`scripts/02_ladder_validation/08_build_final_pruned_variations.py`	`src/llm_coherence/validation/build_final_pruned_variations.py`
9	`scripts/03_forced_choice_inputs/09_generate_forced_choice_inputs.py`	`src/llm_coherence/generation/generate_7tier_comparisons.py`
10a	`scripts/04_model_runs/10a_run_within_ladder_experiment.py`	`src/llm_coherence/experiments/within_ladder/run_within_ladder_experiment.py`
10b	`scripts/04_model_runs/10b_run_7tier_experiment.py`	`src/llm_coherence/experiments/ladder_statement_pair/run_7tier_experiment.py`
11	`scripts/05_analysis/11_analyze_7tier_coherence.py`	`src/llm_coherence/analysis/analyze_7tier_coherence.py`
12	`scripts/05_analysis/12_predictive_utility.py`	`src/llm_coherence/analysis/predictive_utility.py`
13	`scripts/06_reporting/13_make_fig_table.py`	`src/llm_coherence/reporting/make_fig_table.py`

The early instrument-design and ladder-audit stages require API access and are not necessary for most replication workflows. Most users should start from the tracked validated ladders and forced-choice inputs.
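
When driving several steps from Python instead of the shell, the PYTHONPATH=src convention can be reproduced with a small helper. This is a sketch: wrapper_invocation and run_wrapper are hypothetical names, not part of the package.

```python
import os
import subprocess
import sys
from pathlib import Path


def wrapper_invocation(script, *args, repo_root="."):
    """Build the command, environment, and working directory for one wrapper."""
    root = Path(repo_root).resolve()
    # Mirrors `PYTHONPATH=src python <script>` run from the repository root.
    env = dict(os.environ, PYTHONPATH=str(root / "src"))
    cmd = [sys.executable, str(root / script), *args]
    return cmd, env, root


def run_wrapper(script, *args, repo_root="."):
    """Run one wrapper and raise CalledProcessError if it exits non-zero."""
    cmd, env, root = wrapper_invocation(script, *args, repo_root=repo_root)
    return subprocess.run(cmd, cwd=root, env=env, check=True)
```

For instance, run_wrapper("scripts/00_repository/validate_artifacts.py", "--write-indexes") corresponds to the index refresh command shown in the installation section.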

GLM Base on HF Jobs

glm-45-base-logprobs is the self-hosted GLM base run. It is not routed through OpenRouter. Use the same Step 10a and Step 10b scripts as every other model; select GLM base with --model glm-45-base-logprobs and submit to HF Jobs with --submit-hf-job.

Build and push the HF Jobs image from the repository root:

IMAGE=your-dockerhub-user/llm-coherence-vllm:glm-base-YYYYMMDD
bash scripts/00_repository/01_build_hf_jobs_image.sh "$IMAGE"

Within-ladder GLM experiment (Instance 1 / Step 10a)

Before entering the H200 queue, exercise the same within-ladder vLLM scoring and upload path on one inexpensive L4 with the auxiliary Qwen 0.5B smoke model:

PYTHONPATH=src python scripts/04_model_runs/10a_run_within_ladder_experiment.py \
  --submit-hf-job \
  --model qwen25-05b-instruct-smoke \
  --image "$IMAGE" \
  --namespace MINTLABJHUANU \
  --flavor l4x1 \
  --timeout 1h \
  --max-variation-sets 1 \
  --hub-dataset MINTLABJHUANU/LLMCoherence_Var_100 \
  --job-tag qwen-l4-scoring-smoke \
  --path-in-repo smoke/qwen25-05b-instruct/scoring-smoke/within_ladder

This proxy run validates the container, 42-request one-ladder input, exact constrained A/B logprob scoring, analysis, and Hub upload. It does not validate that GLM fits or loads on the selected hardware; GLM remains an H200x8 run.
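
The figure of 42 requests matches the number of ordered tier pairs in one seven-tier ladder (7 × 6). The sketch below assumes that is where the count comes from, with each unordered pair presented in both A/B orders; the source does not spell out the derivation.

```python
from itertools import permutations

TIERS = range(1, 8)  # seven tiers, numbered 1..7 for illustration


def ordered_tier_pairs(tiers=TIERS):
    """All (option_a, option_b) pairs of distinct tiers, in both orders."""
    return list(permutations(tiers, 2))


pairs = ordered_tier_pairs()
print(len(pairs))  # 42
```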

Submit a one-ladder within-ladder smoke job first:

PYTHONPATH=src python scripts/04_model_runs/10a_run_within_ladder_experiment.py \
  --submit-hf-job \
  --model glm-45-base-logprobs \
  --image "$IMAGE" \
  --namespace MINTLABJHUANU \
  --flavor h200x8 \
  --timeout 1h \
  --max-variation-sets 1 \
  --hub-dataset MINTLABJHUANU/LLMCoherence_Var_100 \
  --job-tag glm-smoke-YYYYMMDD \
  --path-in-repo smoke/glm-45-base-logprobs/glm-smoke-YYYYMMDD/within_ladder
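
The YYYYMMDD placeholders need to be filled in before submitting. A minimal sketch for producing a matching job tag and upload path follows; it assumes the stamp is the submission date, and smoke_job_names is an illustrative helper, not a repository function.

```python
from datetime import date


def smoke_job_names(model="glm-45-base-logprobs", experiment="within_ladder", day=None):
    """Return (job_tag, path_in_repo) with the YYYYMMDD placeholder filled in."""
    stamp = (day or date.today()).strftime("%Y%m%d")
    tag = f"glm-smoke-{stamp}"
    return tag, f"smoke/{model}/{tag}/{experiment}"


tag, path_in_repo = smoke_job_names()
print(f"--job-tag {tag} --path-in-repo {path_in_repo}")
```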

For the full within-ladder run across all 100 ladders, omit --max-variation-sets and upload to the canonical output path:

PYTHONPATH=src python scripts/04_model_runs/10a_run_within_ladder_experiment.py \
  --submit-hf-job \
  --model glm-45-base-logprobs \
  --image "$IMAGE" \
  --namespace MINTLABJHUANU \
  --flavor h200x8 \
  --timeout 12h \
  --hub-dataset MINTLABJHUANU/LLMCoherence_Var_100 \
  --job-tag glm-within-ladder-full-YYYYMMDD \
  --path-in-repo outputs/glm-45-base-logprobs/within_ladder

The within-ladder HF job runs Step 10a inside the container as --generate, --run-local, then --analyze. When --hub-dataset is supplied without an explicit --path-in-repo, outputs are uploaded to:

outputs/glm-45-base-logprobs/within_ladder/

100-ladder GLM experiment (Instance 2 / Step 10b)

Step 10b runs the 7-tier ladder-vs-comparison experiment. As with Step 10a, first exercise the scoring and upload path on an inexpensive L4 with the auxiliary Qwen 0.5B smoke model:

PYTHONPATH=src python scripts/04_model_runs/10b_run_7tier_experiment.py \
  --submit-hf-job \
  --model qwen25-05b-instruct-smoke \
  --trials 1 \
  --image "$IMAGE" \
  --namespace MINTLABJHUANU \
  --flavor l4x1 \
  --timeout 1h \
  --max-variation-sets 1 \
  --smoke \
  --hub-dataset MINTLABJHUANU/LLMCoherence_Var_100 \
  --path-in-repo smoke/qwen25-05b-instruct/scoring-smoke/ladder_vs_comparison_statements

After that proxy path succeeds, run the GLM smoke on H200x8:

PYTHONPATH=src python scripts/04_model_runs/10b_run_7tier_experiment.py \
  --submit-hf-job \
  --model glm-45-base-logprobs \
  --trials 1 \
  --image "$IMAGE" \
  --namespace MINTLABJHUANU \
  --flavor h200x8 \
  --timeout 1h \
  --max-variation-sets 1 \
  --smoke \
  --hub-dataset MINTLABJHUANU/LLMCoherence_Var_100 \
  --path-in-repo smoke/glm-45-base-logprobs/glm-smoke-YYYYMMDD/ladder_vs_comparison_statements

After the GLM smoke succeeds, submit the complete 100-ladder Step 10b run with 10 trials per comparison:

PYTHONPATH=src python scripts/04_model_runs/10b_run_7tier_experiment.py \
  --submit-hf-job \
  --model glm-45-base-logprobs \
  --trials 10 \
  --image "$IMAGE" \
  --namespace MINTLABJHUANU \
  --flavor h200x8 \
  --timeout 12h \
  --hub-dataset MINTLABJHUANU/LLMCoherence_Var_100 \
  --job-tag glm-7tier-full-YYYYMMDD \
  --path-in-repo outputs/glm-45-base-logprobs/ladder_vs_comparison_statements

This command runs the actual GLM ladder-versus-comparison experiment, not the within-ladder validation. It processes the full manifest because it does not set --max-variation-sets. Results are uploaded to:

outputs/glm-45-base-logprobs/ladder_vs_comparison_statements/

Do not use --model-volume for the normal GLM run. Large sharded checkpoints can load very slowly from FUSE-mounted model volumes. By default, the model is downloaded to the job's local /data cache. --model-volume remains available as an experimental override and enables vLLM's prefetch loading strategy.

Outputs and External Artifacts

Tracked GitHub contents are sufficient to inspect the instrument and rerun the pipeline. For exact reproduction without rerunning APIs, download the full artifact tree from the Hugging Face dataset (data/ and outputs/).

Expected local output layout:

Path	Contents
`data/05_ladder_validation/`	Ladder validation: pruned ladder JSONs, audit reports, and judge run folders (`within_ladder_validation_tier/`, `property/`, `ranking/`).
`outputs/<model_key>/within_ladder/`	Instance 1 (step 10a): tier-pair preferences, cost logs, `summary.json`.
`outputs/<model_key>/ladder_vs_comparison_statements/`	Instance 2 (step 10b): per-ladder `results.json`, reasoning traces, cost logs.
`outputs/<model_key>/ladder_vs_comparison_statements/coherence_test/`	Step 11: `phase6b_coherence_*.json`, justification analysis, per-category summaries.
`outputs/<model_key>/ladder_vs_comparison_statements/pred_utility_test/`	Step 12: predictive-utility CSVs and summaries.
`outputs/checkpoints/<model_key>/`	Resumable checkpoints for step 10b.
`results/figures/`	Generated figures (step 13).
`results/tables/`	Generated tables (step 13).

Smoke runs for step 10b write under outputs/<model_key>/smoke_<model_key>/ladder_vs_comparison_statements/ instead of the full-run path above.
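
The layout rules above can be expressed as a short path helper, which is handy when passing --results-dir and --out-dir to steps 11 and 12. This is a sketch of the documented layout; the function names are hypothetical.

```python
from pathlib import PurePosixPath


def step10b_results_dir(model_key, smoke=False, outputs_root="outputs"):
    """Directory that step 10b writes to for a full or smoke run."""
    base = PurePosixPath(outputs_root) / model_key
    if smoke:
        # Smoke runs are nested one level deeper, under smoke_<model_key>/.
        base = base / f"smoke_{model_key}"
    return base / "ladder_vs_comparison_statements"


def analysis_dirs(results_dir):
    """Output folders for steps 11 and 12 beneath a step 10b results dir."""
    return {
        "step11": results_dir / "coherence_test",
        "step12": results_dir / "pred_utility_test",
    }


smoke_dir = step10b_results_dir("ministral-3b-2512-openrouter", smoke=True)
print(smoke_dir)
print(analysis_dirs(smoke_dir)["step12"])
```

The two printed paths correspond to the --results-dir and --out-dir values used in the quick smoke test.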

The tracked results/model_run_index.json snapshot inventories local payloads under outputs/<model_key>/. Refresh it with validate_artifacts.py --write-indexes after copying or generating model-run artifacts.
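
For a quick look at which payloads are present locally before refreshing the index, a directory scan along these lines may help. It is only a sketch: it does not reproduce the schema of results/model_run_index.json, and the inventory function is hypothetical.

```python
from pathlib import Path

# Experiment folders expected under outputs/<model_key>/ per the table above.
EXPERIMENT_DIRS = ("within_ladder", "ladder_vs_comparison_statements")


def inventory(outputs_root="outputs"):
    """Map each model key to the experiment folders that exist locally."""
    root = Path(outputs_root)
    index = {}
    if not root.is_dir():
        return index
    for model_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if model_dir.name == "checkpoints":
            continue  # resumable checkpoints, not a model key
        index[model_dir.name] = [
            name for name in EXPERIMENT_DIRS if (model_dir / name).is_dir()
        ]
    return index


for model_key, experiments in inventory().items():
    print(model_key, experiments)
```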

To publish local artifacts to the Hugging Face dataset, use the upload helper:

# Update the dataset README on Hugging Face
python scripts/00_repository/hf_upload/hf_dataset.py upload readme

# Upload outputs/ (resume incomplete models)
python scripts/00_repository/hf_upload/hf_dataset.py upload outputs --skip-existing

# Stage a local bundle with manifest (optional)
python scripts/00_repository/hf_upload/hf_dataset.py prepare /path/to/artifact_bundle

Public Summaries

Two small summary files are tracked for inspection:

results/phase6b_coherence_summary.json
results/model_run_index.json

These are not substitutes for the raw model-response artifact bundle on Hugging Face.

License

Released under the MIT License. See LICENSE.

Citation

If you use this repository or its experiment artifacts, please cite:

Ajayi, E., Chowdhury, A., & Lazar, S. (2026). Incoherent Values? Probing LLM Preferences Through Parametric Variation. arXiv:2606.21102. https://arxiv.org/abs/2606.21102

@misc{ajayi_chowdhury_lazar_2026_incoherent_values,
  author        = {Ajayi, Elena and Chowdhury, Angelica and Lazar, Seth},
  title         = {Incoherent Values? Probing LLM Preferences Through Parametric Variation},
  year          = {2026},
  eprint        = {2606.21102},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CY},
  url           = {https://arxiv.org/abs/2606.21102}
}
