AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search

This repository contains the implementation of AdverMCTS, a novel dual-tree MCTS framework that combats pseudo-correctness in LLM-based code generation through adversarial test case generation and dynamic evaluation.

Overview

Large Language Models (LLMs) often generate code that passes public test cases but fails on edge cases—a phenomenon known as pseudo-correctness. AdverMCTS addresses this by:

Solver Tree: Generates and refines code solutions using MCTS-based exploration
Attacker Tree: Generates discriminative test cases to identify code weaknesses
Dynamic Test Memory: Maintains a pool of challenging tests discovered during search
Adversarial Feedback: Propagates failure information to guide future code generation

Requirements

Python 3.8+
PyTorch 2.0+
transformers
vLLM (for local model inference)
Ray (for distributed execution)

Data Preparation

AdverMCTS supports 2 benchmark datasets:

1. APPS Dataset

Download the APPS dataset and place it in your data directory:

# Ensure your data path is configured in utils/utils.py
# The expected structure is:
# /path/to/data/APPS/test/{problem_id}/
#   ├── question.txt
#   ├── solutions.json
#   ├── input_output.json
#   └── starter_code.py (optional)

2. TACO Dataset

Download the TACO dataset and Preprocess the TACO dataset:

cd data_preprocessing
python process_taco.py --local_data_path /path/to/TACO/

Configuration

Before running experiments, configure the following in utils/utils.py:

Data Path: Update get_raw_data_path() to return your data directory path
API Keys: If using API-based models, configure your OpenAI API key in run.py

Usage

Running AdverMCTS

Basic usage with default settings:

python run.py \
    --dataset apps \
    --model adver_mcts \
    --arch Qwen3-4B-Instruct-2507 \
    --test_gen_model_name Qwen3-4B-Instruct-2507 \
    --APPDifficulty competition \
    --attacker_rollout 2 \
    --rollout 16

Key Arguments

Argument	Description	Default
`--dataset`	Dataset to use (`apps`, `codeforces`, `taco`)	`apps`
`--model`	Model to run (see below)	`adver_mcts`
`--arch`	LLM architecture	`Qwen3-4B-Instruct-2507`
`--APPDifficulty`	Problem difficulty level	`competition`
`--rollout`	Number of Solver rollouts	`16`
`--attacker_rollout`	Number of Attacker rollouts	`2`
`--width`	Maximum children per node	`3`
`--ray_parallel`	Enable distributed execution	`False`
`--output_judge_mode`	Output judgment mode (`voting`, `oracle`)	`oracle`
`--adversarial_feedback_mode`	Feedback mode (`none`, `concrete`, `abstract`)	`none`

Available Models

Main Method:

adver_mcts - AdverMCTS (our method)

Baselines:

bs - Beam Search
mcts_code - MCTS for code generation
mcts_thought - MCTS for thought generation
verbal_mcts - Verbal MCTS
lats - Language Agent Tree Search
tot - Tree of Thoughts
mcts_token - MCTS at token level (PG-TD)

Ablation Studies:

cut_solver_attacker - Attacker without solver code access
best_of_n_attacker - Best-of-N test selection
random_select_attacker - Random test selection

Distributed Execution with Ray

For parallel processing across multiple problems:

python run.py \
    --dataset apps \
    --model adver_mcts \
    --ray_parallel True \
    --ray_num_workers 100

Results Analysis

After experiments complete, analyze results using:

python ExpPro.py

Configure the dataset and experiment IDs in ExpPro.py:

datasets = ['apps', 'taco']  # Datasets to analyze
playlist = {
    'apps': [0],  # Experiment IDs
    'taco': [0]
}
compare_flag = False  # Set True to compare two experiments

Project Structure

AdverMCTS/
├── run.py                   # Main entry point
├── Processor.py             # Problem processing and execution
├── ExpPro.py                # Results analysis
├── models/
│   ├── base_model.py        # Base MCTS model class
│   ├── adver_mcts.py        # AdverMCTS implementation (main)
│   ├── bs.py                # Beam Search baseline
│   ├── lats.py              # LATS baseline
│   ├── tot.py               # Tree of Thoughts baseline
│   ├── mcts_code.py         # MCTS code generation
│   ├── mcts_thought.py      # MCTS thought generation
│   ├── verbal_mcts.py       # Verbal MCTS
│   └── analysis_models/     # Ablation study models
├── data_handlers/
│   ├── apps_handler.py      # APPS dataset handler
│   ├── codeforces_handler.py# Codeforces dataset handler
│   └── taco_handler.py      # TACO dataset handler
├── data_preprocessing/
│   ├── process_codeforces.py# Codeforces preprocessing
│   └── process_taco.py      # TACO preprocessing
├── executors/
│   ├── APPSExecutor.py      # Code execution engine
│   └── code_wrapper.py      # Sandboxed execution wrapper
├── llm_service/
│   ├── llm_generator.py     # LLM generation interface
│   └── service_creation.py  # Model loading utilities
└── utils/
    ├── utils.py             # Utility functions
    └── prompts.py           # LLM prompt templates

AdverMCTS Algorithm

The core AdverMCTS algorithm operates as follows:

Initialization: Create Solver Tree (for code) and Attacker Tree (for tests)
Solver Expansion: Generate code candidates via MCTS with verbal feedback
Attacker Expansion: Generate testing strategies and concrete test cases
Divergence Detection: Execute code pool against generated tests
Oracle Judgment: Determine correct outputs via voting or LLM reasoning
Backpropagation: Update both trees with rewards and verbal feedback
Global Filtering Hub (Dynamic Memory): Store discriminative tests for future evaluation

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search

Overview

Requirements

Data Preparation

1. APPS Dataset

2. TACO Dataset

Configuration

Usage

Running AdverMCTS

Key Arguments

Available Models

Distributed Execution with Ray

Results Analysis

Project Structure

AdverMCTS Algorithm

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
data_handlers		data_handlers
data_preprocessing		data_preprocessing
executors		executors
llm_service		llm_service
models		models
utils		utils
.gitignore		.gitignore
ExpPro.py		ExpPro.py
Processor.py		Processor.py
README.md		README.md
run.py		run.py

Folders and files

Latest commit

History

Repository files navigation

AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search

Overview

Requirements

Data Preparation

1. APPS Dataset

2. TACO Dataset

Configuration

Usage

Running AdverMCTS

Key Arguments

Available Models

Distributed Execution with Ray

Results Analysis

Project Structure

AdverMCTS Algorithm

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages