This repository contains the implementation of AdverMCTS, a novel dual-tree MCTS framework that combats pseudo-correctness in LLM-based code generation through adversarial test case generation and dynamic evaluation.
Large Language Models (LLMs) often generate code that passes public test cases but fails on edge cases—a phenomenon known as pseudo-correctness. AdverMCTS addresses this by:
- Solver Tree: Generates and refines code solutions using MCTS-based exploration
- Attacker Tree: Generates discriminative test cases to identify code weaknesses
- Dynamic Test Memory: Maintains a pool of challenging tests discovered during search
- Adversarial Feedback: Propagates failure information to guide future code generation
- Python 3.8+
- PyTorch 2.0+
- transformers
- vLLM (for local model inference)
- Ray (for distributed execution)
AdverMCTS supports 2 benchmark datasets:
Download the APPS dataset and place it in your data directory:
# Ensure your data path is configured in utils/utils.py
# The expected structure is:
# /path/to/data/APPS/test/{problem_id}/
# ├── question.txt
# ├── solutions.json
# ├── input_output.json
# └── starter_code.py (optional)Download the TACO dataset and Preprocess the TACO dataset:
cd data_preprocessing
python process_taco.py --local_data_path /path/to/TACO/Before running experiments, configure the following in utils/utils.py:
- Data Path: Update
get_raw_data_path()to return your data directory path - API Keys: If using API-based models, configure your OpenAI API key in
run.py
Basic usage with default settings:
python run.py \
--dataset apps \
--model adver_mcts \
--arch Qwen3-4B-Instruct-2507 \
--test_gen_model_name Qwen3-4B-Instruct-2507 \
--APPDifficulty competition \
--attacker_rollout 2 \
--rollout 16| Argument | Description | Default |
|---|---|---|
--dataset |
Dataset to use (apps, codeforces, taco) |
apps |
--model |
Model to run (see below) | adver_mcts |
--arch |
LLM architecture | Qwen3-4B-Instruct-2507 |
--APPDifficulty |
Problem difficulty level | competition |
--rollout |
Number of Solver rollouts | 16 |
--attacker_rollout |
Number of Attacker rollouts | 2 |
--width |
Maximum children per node | 3 |
--ray_parallel |
Enable distributed execution | False |
--output_judge_mode |
Output judgment mode (voting, oracle) |
oracle |
--adversarial_feedback_mode |
Feedback mode (none, concrete, abstract) |
none |
Main Method:
adver_mcts- AdverMCTS (our method)
Baselines:
bs- Beam Searchmcts_code- MCTS for code generationmcts_thought- MCTS for thought generationverbal_mcts- Verbal MCTSlats- Language Agent Tree Searchtot- Tree of Thoughtsmcts_token- MCTS at token level (PG-TD)
Ablation Studies:
cut_solver_attacker- Attacker without solver code accessbest_of_n_attacker- Best-of-N test selectionrandom_select_attacker- Random test selection
For parallel processing across multiple problems:
python run.py \
--dataset apps \
--model adver_mcts \
--ray_parallel True \
--ray_num_workers 100After experiments complete, analyze results using:
python ExpPro.pyConfigure the dataset and experiment IDs in ExpPro.py:
datasets = ['apps', 'taco'] # Datasets to analyze
playlist = {
'apps': [0], # Experiment IDs
'taco': [0]
}
compare_flag = False # Set True to compare two experimentsAdverMCTS/
├── run.py # Main entry point
├── Processor.py # Problem processing and execution
├── ExpPro.py # Results analysis
├── models/
│ ├── base_model.py # Base MCTS model class
│ ├── adver_mcts.py # AdverMCTS implementation (main)
│ ├── bs.py # Beam Search baseline
│ ├── lats.py # LATS baseline
│ ├── tot.py # Tree of Thoughts baseline
│ ├── mcts_code.py # MCTS code generation
│ ├── mcts_thought.py # MCTS thought generation
│ ├── verbal_mcts.py # Verbal MCTS
│ └── analysis_models/ # Ablation study models
├── data_handlers/
│ ├── apps_handler.py # APPS dataset handler
│ ├── codeforces_handler.py# Codeforces dataset handler
│ └── taco_handler.py # TACO dataset handler
├── data_preprocessing/
│ ├── process_codeforces.py# Codeforces preprocessing
│ └── process_taco.py # TACO preprocessing
├── executors/
│ ├── APPSExecutor.py # Code execution engine
│ └── code_wrapper.py # Sandboxed execution wrapper
├── llm_service/
│ ├── llm_generator.py # LLM generation interface
│ └── service_creation.py # Model loading utilities
└── utils/
├── utils.py # Utility functions
└── prompts.py # LLM prompt templates
The core AdverMCTS algorithm operates as follows:
- Initialization: Create Solver Tree (for code) and Attacker Tree (for tests)
- Solver Expansion: Generate code candidates via MCTS with verbal feedback
- Attacker Expansion: Generate testing strategies and concrete test cases
- Divergence Detection: Execute code pool against generated tests
- Oracle Judgment: Determine correct outputs via voting or LLM reasoning
- Backpropagation: Update both trees with rewards and verbal feedback
- Global Filtering Hub (Dynamic Memory): Store discriminative tests for future evaluation