Official codebase for ORMind: A Cognitive-Inspired End-to-End Reasoning Framework for Operations Research (ACL 2025 Industry Track).
The default pipeline (--mode paper) is a faithful implementation of the
paper's Algorithm 1 with the prompt templates published verbatim in
Appendix C:
| Paper component (Section 3.3) | Code |
|---|---|
| Semantic Encoder | agent_team/semantic_encoder.py |
| Formalization Thinking | agent_team/formalization_thinking.py |
| Executive Compiler | agent_team/executive_compiler.py |
| Metacognitive Supervisor (forward/backward) | agent_team/supervisor.py |
| System 2 Reasoner (counterfactual + syntax analysis) | agent_team/reasoner.py |
| Memory pool P (Section 3.2) | utils/comment_pool.py |
| Algorithm 1 control flow | agent_team/ormind_pipeline.py |
Control flow per problem: encode -> formalize -> compile -> supervisor
formats the final program -> run on problem inputs -> if execution fails,
the System 2 Reasoner diagnoses the error and the Supervisor revises
(--max_repair_rounds, default 1 as in Algorithm 1) -> on clean
execution, the System 2 Reasoner generates a counterfactual checker from
the problem description; if it reports a discrepancy, the Supervisor
revises once more.
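The control flow above can be sketched as follows. This is a minimal illustration, not the code in agent_team/ormind_pipeline.py: the `Stages` container, its field names, and the `solve` function are hypothetical, and each stage is a plain callable standing in for an LLM-backed agent. The sketch also assumes the program is not re-executed after the counterfactual revision, which the description above does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Stages:
    """Hypothetical stage callables; real ones would wrap LLM calls."""
    encode: Callable[[str], str]                # Semantic Encoder
    formalize: Callable[[str], str]             # Formalization Thinking
    compile: Callable[[str], str]               # Executive Compiler
    finalize: Callable[[str], str]              # Supervisor formats the final program
    execute: Callable[[str], Optional[str]]     # error text, or None on a clean run
    diagnose: Callable[[str, str], str]         # Reasoner: (program, error) -> diagnosis
    revise: Callable[[str, str], str]           # Supervisor: (program, feedback) -> program
    counterfactual: Callable[[str, str], Optional[str]]  # (problem, program) -> discrepancy


def solve(problem: str, s: Stages, max_repair_rounds: int = 1) -> Tuple[str, List[str]]:
    trace: List[str] = []
    program = s.finalize(s.compile(s.formalize(s.encode(problem))))

    # Error-triggered repair, bounded by max_repair_rounds.
    error = s.execute(program)
    rounds = 0
    while error is not None and rounds < max_repair_rounds:
        program = s.revise(program, s.diagnose(program, error))
        trace.append("error_repair")
        error = s.execute(program)
        rounds += 1

    # Counterfactual check runs only after a clean execution.
    if error is None:
        discrepancy = s.counterfactual(problem, program)
        if discrepancy is not None:
            program = s.revise(program, discrepancy)
            trace.append("counterfactual_revision")
    return program, trace


# Toy run: the first program fails to execute and is repaired once.
stub = Stages(
    encode=lambda p: p,
    formalize=lambda e: e,
    compile=lambda f: "bad",
    finalize=lambda c: c,
    execute=lambda prog: "SyntaxError" if prog == "bad" else None,
    diagnose=lambda prog, err: "fix: " + err,
    revise=lambda prog, feedback: "good",
    counterfactual=lambda prob, prog: None,
)
program, trace = solve("toy problem", stub)
print(program, trace)
```

Passing the stages in as callables keeps the loop testable without an API key, which is the same idea the offline regression suite relies on.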
Two additional modes exist:
- --mode standard: single-prompt baseline (the "w/o All modules" row of Table 2).
- --mode extended: post-publication research extensions (adaptive search, dual-view formalization, online preference verifier, experience memory). Not used for any number reported in the paper; see "Extended mode" below.

The harness is built so that benchmark numbers are auditable:
- The solving pipeline receives the problem text, the interface code
example, and raw test inputs only. Reference outputs never reach
any pipeline, prompt, or stored artifact (solve_problem in main.py
takes test_inputs with the labels already stripped).
- Repair is triggered exclusively by execution errors and by the
counterfactual check against the problem description, never by
comparing to reference outputs.
- Grading happens exactly once per problem, after the pipeline has
finished (test_generated_code in utils/test_generated_code.py).
- tests/test_offline_pipeline.py contains a regression test
(scenario_no_label_leakage) asserting that the ground-truth optimum
does not appear in any prompt.

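The label-isolation idea can be illustrated with a small sketch. It is not the repository's scenario_no_label_leakage test: the sample keys "input" and "output" and both helper names are assumptions made for the example, and the substring check is deliberately naive (a reference value that also occurs in the problem text would be flagged).

```python
from typing import Any, Dict, List, Tuple


def strip_labels(samples: List[Dict[str, Any]]) -> Tuple[List[Any], List[Any]]:
    """Split graded samples into solver-visible inputs and grader-only labels."""
    inputs = [sample["input"] for sample in samples]
    labels = [sample["output"] for sample in samples]
    return inputs, labels


def assert_no_label_leakage(prompts: List[str], labels: List[Any]) -> None:
    """Fail if any reference output appears verbatim in any prompt."""
    for label in labels:
        needle = str(label)
        for prompt in prompts:
            if needle in prompt:
                raise AssertionError(f"reference output {needle!r} leaked into a prompt")


samples = [{"input": {"capacity": 10}, "output": 1234.5}]
inputs, labels = strip_labels(samples)

# Prompts are built from the inputs only, so the check passes.
prompts = [f"Solve the problem for inputs {inputs[0]}"]
assert_no_label_leakage(prompts, labels)

# A prompt that contains the optimum is rejected.
try:
    assert_no_label_leakage(["The answer is 1234.5"], labels)
    leak_detected = False
except AssertionError:
    leak_detected = True
print(leak_detected)
```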
Metric operationalization (numeric tolerance: rel_tol=1e-3,
abs_tol=0.2 on the objective value):
| Paper metric | Definition in this harness |
|---|---|
| SR | graded ACCEPT: objective matches the reference optimum |
| MFFR | graded MODEL_FAILURE: program ran but the model was invalid (solver status not Optimal, or no objective produced) |
| IEFR | graded COMPILE_ERROR or RUNTIME_ERROR |
| (residual) | graded WRONG_ANSWER: feasible model, wrong optimum |
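
The table above can be read as a small decision procedure. The sketch below is one possible rendering, not the grader in utils/test_generated_code.py: the `grade` function and its arguments are hypothetical, and it assumes the two tolerances combine the way Python's `math.isclose` combines them (a match under either bound is accepted).

```python
import math
from typing import Optional


def grade(status: str, objective: Optional[float], reference: float,
          error: Optional[str] = None,
          rel_tol: float = 1e-3, abs_tol: float = 0.2) -> str:
    """Map one execution result onto the grades used for SR / MFFR / IEFR."""
    if error is not None:
        # "COMPILE_ERROR" or "RUNTIME_ERROR": counted towards IEFR.
        return error
    if status != "Optimal" or objective is None:
        # The program ran, but the model was invalid: counted towards MFFR.
        return "MODEL_FAILURE"
    # Assumption: tolerance semantics follow math.isclose.
    if math.isclose(objective, reference, rel_tol=rel_tol, abs_tol=abs_tol):
        return "ACCEPT"  # counted towards SR
    return "WRONG_ANSWER"  # feasible model, wrong optimum


print(grade("Optimal", 100.05, 100.0))    # within abs_tol
print(grade("Optimal", 5000.0, 5004.0))   # within rel_tol
print(grade("Optimal", 10.0, 10.5))       # outside both tolerances
print(grade("Infeasible", None, 10.0))
print(grade("", None, 10.0, error="RUNTIME_ERROR"))
```

Under this reading, abs_tol dominates for small objectives and rel_tol for large ones, which is why a difference of 4 is accepted at an optimum near 5000 while a difference of 0.5 is rejected at an optimum near 10.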
git clone https://github.com/XiaoAI1989/ORMind.git
cd ORMind
pip install -r requirements.txt

Create env.local in the project root (loaded automatically):
# Paper configuration (Section 5.2): GPT-3.5-turbo, temperature 0.
# These are also the built-in defaults — with no env.local at all, the
# runner targets gpt-3.5-turbo on the OpenAI API; only the key is needed.
OPENAI_API_KEY=your_key_here
OPENAI_BASE_URL=https://api.openai.com/v1
ORMIND_MODEL=gpt-3.5-turbo
# Or any OpenAI-compatible endpoint, e.g. OpenRouter:
# OPENROUTER_API_KEY=your_key_here
# OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
# ORMIND_MODEL=deepseek/deepseek-chat-v3.1
# Strictly opt-in: retry a failed call once on a second model. OFF unless
# set. When it fires, the runner prints a warning and the per-problem
# logs record completions per model ("Models Used"), so a mixed-model run
# is always visible. Do not enable when reproducing paper numbers.
# ORMIND_FALLBACK_MODEL=gpt-4o-mini

Table 1 (main results). ORMind rows:
python run_exp.py # LPWP (the NL4Opt LP word problems)
python run_exp_ComplexOR.py # ComplexOR

Baseline rows are not produced by this repository: OptiMUS numbers are cited from its original paper, and the prompting baselines (CoT/ReAct/Reflexion/CoE, ...) come from their respective public implementations.

Table 2 (ablations). Every row maps to a flag:
python run_exp.py # ORMind (Full)
python run_exp.py --with_conductor # w/ Conductor
python run_exp.py --with_terminology_interpreter # w/ Terminology Interpreter
python run_exp.py --with_code_reviewer # w/ Code Reviewer
python run_exp.py --without_semantic_encoder # w/o Semantic Encoder
python run_exp.py --without_formalization # w/o Formalization Thinking
python run_exp.py --without_counterfactual # w/o Counterfactual Analysis
python run_exp.py --without_syntax_analysis # w/o Syntax Error Analysis
python run_exp.py --mode standard # w/o All modules

(Use run_exp_ComplexOR.py with the same flags for the ComplexOR
column.)

Table 3 (model robustness). python run_exp.py --model gpt-4.
Figure 3 (temperature analysis). python run_exp.py --temperature 0.5.
Table 4 (prompt-length statistics). Each *_test_log.txt records
Prompt Tokens: N for the full problem run; aggregate with:
python data_process/count_token.py --folder <run_dir> # mean and std

SR/MFFR/IEFR can be recomputed from any run directory with
python data_process/correct_rate.py --folder <run_dir>.
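
For readers who want to check the Table 4 aggregation by hand, the sketch below shows one way to compute it. It is not data_process/count_token.py: the function name is hypothetical, it assumes each log contains a line of the form "Prompt Tokens: N" and uses the first such line, and it reports the population standard deviation (the script may use the sample version instead).

```python
import re
import statistics
import tempfile
from pathlib import Path

PROMPT_TOKENS = re.compile(r"Prompt Tokens:\s*(\d+)")


def prompt_token_stats(run_dir):
    """Return (mean, population std, log count) over all *_test_log.txt files."""
    counts = []
    for log in sorted(Path(run_dir).rglob("*_test_log.txt")):
        match = PROMPT_TOKENS.search(log.read_text(encoding="utf-8"))
        if match:
            counts.append(int(match.group(1)))
    if not counts:
        raise ValueError(f"no 'Prompt Tokens: N' entries found under {run_dir}")
    return statistics.mean(counts), statistics.pstdev(counts), len(counts)


# Toy run directory with two logs.
with tempfile.TemporaryDirectory() as run_dir:
    Path(run_dir, "prob_1_test_log.txt").write_text("Prompt Tokens: 100\n", encoding="utf-8")
    Path(run_dir, "prob_2_test_log.txt").write_text("Prompt Tokens: 300\n", encoding="utf-8")
    mean, std, n = prompt_token_stats(run_dir)

print(f"{n} logs: mean={mean:.1f}, std={std:.1f}")
```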
Useful options:
python run_exp.py --problem prob_12 # single problem
python run_exp_ComplexOR.py --problem steel3
python run_exp.py --max_repair_rounds 2 # allow a second error repair

Grading options that deviate from the published protocol (all defaults = published behaviour; see "Known protocol caveats"):
python run_exp_ComplexOR.py --accept_infeasible # status match on "Infeasible" counts as ACCEPT
python run_exp.py --rel_tol 1e-4 # stricter objective tolerance

- LPWP: the NL4Opt competition LP word problems. This release contains 288 problems; the paper reports 289 test samples, so the release is one problem short.
- ComplexOR: 37 industrial optimization problems. 11 of them carry "Infeasible" as the reference output (see "Known protocol caveats").

Both follow the data format of Appendix F. Per dataset convention
(shared with Chain-of-Experts), each problem ships a code_example.py
that fixes the function interface the generated program must implement;
for ComplexOR the scaffold also pre-declares the decision variables, and
the generated code fills in the TODO region (objective + constraints).
All systems compared under this protocol receive the same scaffold.
Dataset notes:
- The input.json files inside some ComplexOR problem folders are legacy artifacts of the upstream data format; nothing in this harness reads them (data.json carries the graded samples, input_targets.json the problem statement).
- prob_135 declares one of its arguments as a string (constraint3: "twice") in its own docstring. This is intentional per the upstream data, not a typing error.

--mode extended runs agent_team/reflective_orchestrator.py, which
adds components developed after the paper was published: an adaptive
search controller, dual-view formalization with value-level consistency
scoring, an online preference verifier over candidate programs, an
experience memory, and consensus counterfactual verification — the
Section 3.3.4 mechanism deepened: two checkers audit the solution
through different verification lenses and return quantified,
machine-comparable violation reports. When both checkers produce valid
reports, only violations confirmed by both (matched on canonical
constraint expressions, ranked by violation magnitude) trigger a
revision, which suppresses checker-hallucinated repairs; if exactly one
checker yields a valid report, its findings are used as-is rather than
dropping verification entirely (utils/adaptive_search.py,
utils/online_preference.py, utils/experience_distiller.py, prompts
in agent_team/extended_experts.py).
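
The consensus rule described above can be sketched in a few lines. This is an illustration only, not the logic in agent_team/reflective_orchestrator.py: the report shape (a dict from constraint expression to violation magnitude, with None for an invalid report), the whitespace-stripping `canonical` helper, and the choice to keep the larger of two magnitudes are all assumptions.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical report shape: constraint expression -> violation magnitude.
# None stands for a checker that failed to produce a valid report.
Report = Optional[Dict[str, float]]


def canonical(expr: str) -> str:
    """Toy canonical form: drop all whitespace."""
    return "".join(expr.split())


def consensus_violations(first: Report, second: Report) -> List[Tuple[str, float]]:
    valid = [{canonical(k): v for k, v in r.items()}
             for r in (first, second) if r is not None]
    if len(valid) == 2:
        a, b = valid
        # Both reports valid: keep only violations confirmed by both checkers.
        selected = {k: max(a[k], b[k]) for k in a.keys() & b.keys()}
    elif len(valid) == 1:
        # Exactly one valid report: use its findings as-is.
        selected = valid[0]
    else:
        selected = {}
    # Rank by violation magnitude, largest first.
    return sorted(selected.items(), key=lambda kv: kv[1], reverse=True)


checker_a = {"x + y <= 10": 3.0, "x >= 0": 1.0}
checker_b = {"x+y<=10": 2.5, "z <= 4": 7.0}

both = consensus_violations(checker_a, checker_b)
single = consensus_violations(checker_a, None)
print(both)
print(single)
```

In the toy run, the violation on "z <= 4" is reported by only one checker and is therefore dropped when both reports are valid, which is how unconfirmed findings are kept from triggering a revision.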
Its learning signals (preference updates, memory records) are derived
from execution status and counterfactual checks only — like the paper
pipeline, it never sees reference outputs. Numbers produced in this mode
are not comparable to the paper and should be reported separately. The
flow diagram is in docs/ormind_algorithm_flow.excalidraw.
python run_exp.py --mode extended --num_candidates 4

python tests/test_offline_pipeline.py

Offline regression suite (no API key needed; the LLM transport is
stubbed). Covers the Algorithm 1 happy path, syntax repair,
counterfactual revision, a causality check on the counterfactual loop
(the same wrong program is revised under a flagging checker and left
untouched under a clean one), consensus counterfactual verification,
every ablation flag, the standard baseline, extended mode, grader
classification (SR/MFFR/IEFR) including the --accept_infeasible and
tolerance flags, empty-input boundaries, cross-problem memory isolation,
fallback-model visibility, the paper-default runtime configuration, and
the no-label-leakage invariant.
tools/replay_driver.py runs the pipeline with a record/replay
transport: each LLM call is written to disk and satisfied from a
response file, so a full run can be driven by any completion source and
audited call by call afterwards.
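
A record/replay transport of this kind can be sketched as below. This is not tools/replay_driver.py: the class name, the `complete` method, the JSON list used as the response file, and the call_NNNN.json naming are assumptions chosen for the example.

```python
import json
import tempfile
from pathlib import Path


class ReplayTransport:
    """Serve completions from a response file and record every call to disk."""

    def __init__(self, session_dir, response_file):
        self.session_dir = Path(session_dir)
        self.session_dir.mkdir(parents=True, exist_ok=True)
        # Assumed format: a JSON list of response strings, in call order.
        self.responses = json.loads(Path(response_file).read_text(encoding="utf-8"))
        self.calls = 0

    def complete(self, prompt: str) -> str:
        if self.calls >= len(self.responses):
            raise RuntimeError(f"no recorded response for call {self.calls}")
        response = self.responses[self.calls]
        record = {"call": self.calls, "prompt": prompt, "response": response}
        path = self.session_dir / f"call_{self.calls:04d}.json"
        path.write_text(json.dumps(record, indent=2), encoding="utf-8")
        self.calls += 1
        return response


with tempfile.TemporaryDirectory() as tmp:
    response_file = Path(tmp, "responses.json")
    response_file.write_text(json.dumps(["encoded", "formalized"]), encoding="utf-8")

    transport = ReplayTransport(Path(tmp, "session"), response_file)
    first = transport.complete("Encode the problem")
    second = transport.complete("Formalize the encoding")

    # Every call can be audited afterwards from the session directory.
    recorded = sorted(p.name for p in Path(tmp, "session").glob("call_*.json"))
    audit = json.loads(Path(tmp, "session", "call_0001.json").read_text(encoding="utf-8"))

print(first, second, recorded)
```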
docs/replay_sessions/ contains four complete recorded runs (every
prompt, response, generated checker, trace, and blind grading log),
including a reproduction of the Appendix A counterfactual-repair case
study and the Appendix B syntax-repair failure mode on a live model. See
docs/replay_sessions/README.md for what each session demonstrates.
- The experiments in the paper were run on LangChain 0.2.7 (Appendix E). The pipeline here runs the same workflow and prompts on a direct OpenAI-compatible client, which removes the heavyweight dependency and lets any OpenAI-compatible endpoint serve as the backbone.
- gpt-3.5-turbo (the paper's default backbone) is the built-in default, but provider-side model snapshots drift over time; expect variance against the published numbers. Per-problem logs record the completions served by each model ("Models Used").
- The LPWP release contains 288 of the 289 problems used in the paper.
- Algorithm 1 specifies a single error-triggered revision; --max_repair_rounds generalizes this (default 1 = Algorithm 1).
- When the Conductor ablation's LLM reply names no remaining expert, the release falls back to the first expert in the fixed Algorithm 1 order (deterministic) instead of a random choice.
- The counterfactual checker pairs the candidate solution with the data
of the input that produced it. If the LLM-written checker itself
crashes, the problem is recorded as checker_failed in its trace and
the run-level summary, and no revision is attempted for it.

@inproceedings{wang-etal-2025-ormind,
title = "{ORM}ind: A Cognitive-Inspired End-to-End Reasoning Framework for Operations Research",
author = "Wang, Zhiyuan and
Chen, Bokui and
Huang, Yinya and
Cao, Qingxing and
He, Ming and
Fan, Jianping and
Liang, Xiaodan",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-industry.10/",
doi = "10.18653/v1/2025.acl-industry.10",
pages = "104--131"
}