Skip to content

SIMONLQY/AdverMCTS

Repository files navigation

AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search

This repository contains the implementation of AdverMCTS, a novel dual-tree MCTS framework that combats pseudo-correctness in LLM-based code generation through adversarial test case generation and dynamic evaluation.

Overview

Large Language Models (LLMs) often generate code that passes public test cases but fails on edge cases—a phenomenon known as pseudo-correctness. AdverMCTS addresses this by:

  1. Solver Tree: Generates and refines code solutions using MCTS-based exploration
  2. Attacker Tree: Generates discriminative test cases to identify code weaknesses
  3. Dynamic Test Memory: Maintains a pool of challenging tests discovered during search
  4. Adversarial Feedback: Propagates failure information to guide future code generation

Requirements

  • Python 3.8+
  • PyTorch 2.0+
  • transformers
  • vLLM (for local model inference)
  • Ray (for distributed execution)

Data Preparation

AdverMCTS supports 2 benchmark datasets:

1. APPS Dataset

Download the APPS dataset and place it in your data directory:

# Ensure your data path is configured in utils/utils.py
# The expected structure is:
# /path/to/data/APPS/test/{problem_id}/
#   ├── question.txt
#   ├── solutions.json
#   ├── input_output.json
#   └── starter_code.py (optional)

2. TACO Dataset

Download the TACO dataset and Preprocess the TACO dataset:

cd data_preprocessing
python process_taco.py --local_data_path /path/to/TACO/

Configuration

Before running experiments, configure the following in utils/utils.py:

  1. Data Path: Update get_raw_data_path() to return your data directory path
  2. API Keys: If using API-based models, configure your OpenAI API key in run.py

Usage

Running AdverMCTS

Basic usage with default settings:

python run.py \
    --dataset apps \
    --model adver_mcts \
    --arch Qwen3-4B-Instruct-2507 \
    --test_gen_model_name Qwen3-4B-Instruct-2507 \
    --APPDifficulty competition \
    --attacker_rollout 2 \
    --rollout 16

Key Arguments

Argument Description Default
--dataset Dataset to use (apps, codeforces, taco) apps
--model Model to run (see below) adver_mcts
--arch LLM architecture Qwen3-4B-Instruct-2507
--APPDifficulty Problem difficulty level competition
--rollout Number of Solver rollouts 16
--attacker_rollout Number of Attacker rollouts 2
--width Maximum children per node 3
--ray_parallel Enable distributed execution False
--output_judge_mode Output judgment mode (voting, oracle) oracle
--adversarial_feedback_mode Feedback mode (none, concrete, abstract) none

Available Models

Main Method:

  • adver_mcts - AdverMCTS (our method)

Baselines:

  • bs - Beam Search
  • mcts_code - MCTS for code generation
  • mcts_thought - MCTS for thought generation
  • verbal_mcts - Verbal MCTS
  • lats - Language Agent Tree Search
  • tot - Tree of Thoughts
  • mcts_token - MCTS at token level (PG-TD)

Ablation Studies:

  • cut_solver_attacker - Attacker without solver code access
  • best_of_n_attacker - Best-of-N test selection
  • random_select_attacker - Random test selection

Distributed Execution with Ray

For parallel processing across multiple problems:

python run.py \
    --dataset apps \
    --model adver_mcts \
    --ray_parallel True \
    --ray_num_workers 100

Results Analysis

After experiments complete, analyze results using:

python ExpPro.py

Configure the dataset and experiment IDs in ExpPro.py:

datasets = ['apps', 'taco']  # Datasets to analyze
playlist = {
    'apps': [0],  # Experiment IDs
    'taco': [0]
}
compare_flag = False  # Set True to compare two experiments

Project Structure

AdverMCTS/
├── run.py                   # Main entry point
├── Processor.py             # Problem processing and execution
├── ExpPro.py                # Results analysis
├── models/
│   ├── base_model.py        # Base MCTS model class
│   ├── adver_mcts.py        # AdverMCTS implementation (main)
│   ├── bs.py                # Beam Search baseline
│   ├── lats.py              # LATS baseline
│   ├── tot.py               # Tree of Thoughts baseline
│   ├── mcts_code.py         # MCTS code generation
│   ├── mcts_thought.py      # MCTS thought generation
│   ├── verbal_mcts.py       # Verbal MCTS
│   └── analysis_models/     # Ablation study models
├── data_handlers/
│   ├── apps_handler.py      # APPS dataset handler
│   ├── codeforces_handler.py# Codeforces dataset handler
│   └── taco_handler.py      # TACO dataset handler
├── data_preprocessing/
│   ├── process_codeforces.py# Codeforces preprocessing
│   └── process_taco.py      # TACO preprocessing
├── executors/
│   ├── APPSExecutor.py      # Code execution engine
│   └── code_wrapper.py      # Sandboxed execution wrapper
├── llm_service/
│   ├── llm_generator.py     # LLM generation interface
│   └── service_creation.py  # Model loading utilities
└── utils/
    ├── utils.py             # Utility functions
    └── prompts.py           # LLM prompt templates

AdverMCTS Algorithm

The core AdverMCTS algorithm operates as follows:

  1. Initialization: Create Solver Tree (for code) and Attacker Tree (for tests)
  2. Solver Expansion: Generate code candidates via MCTS with verbal feedback
  3. Attacker Expansion: Generate testing strategies and concrete test cases
  4. Divergence Detection: Execute code pool against generated tests
  5. Oracle Judgment: Determine correct outputs via voting or LLM reasoning
  6. Backpropagation: Update both trees with rewards and verbal feedback
  7. Global Filtering Hub (Dynamic Memory): Store discriminative tests for future evaluation

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages