📖 LongNovel: A Multi-Scale Benchmark for Hallucination Detection in Long-Context Novel Summarization

Overview

While modern LLMs feature expanded context windows, hallucination remains a critical bottleneck—especially in long-context summarization. LongNovel is a multi-scale, bilingual (Chinese and English) benchmark specifically designed to study and detect hallucinations in long narrative texts.

Why Novels?

Unlike short news articles or academic papers, long novels feature dense, intrinsic narrative structures, intricate event descriptions, and complex dialogues. This makes them the ideal testing ground for tracking how hallucinations evolve as context length scales.

Key Features & Contributions

Bilingual & Multi-Scale Dataset: Built using 29 Chinese novels (ranging from 16k to 100k tokens) alongside chapter-level English data from the BookSum dataset.
Granular Taxonomy: Defines 8 distinct hallucination types to categorize model errors precisely.

Updates

2026-06: We release the LongNovel benchmark.

Dataset

Load Data

You can download and load the LongNovel data from the Hugging Face Hub:

from datasets import load_dataset

# Load the dataset
dataset = load_dataset('SII-BDML/LongNovel', split='test')

Data Format

The data format in LongNovel is structured as follows:

{
    "id": "Unique identifier for each piece of data",
    "language": "The language of the data, which can be 'zh' for Chinese or 'en' for English",
    "article": "The original text corresponding to the summary to be detected",
    "summary": "The summary text that needs to be checked for hallucinations",
    "instruction": "The prompt or instruction used during the hallucination detection process",
    "input": "The complete string fed into the model, which contains both the article and the summary",
    "output": "The hallucination detection result"
}
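As a quick sanity check on the schema above, the following sketch validates that a record carries the seven documented fields and filters records by language. The sample record is made up for illustration, and `validate_record` and `filter_by_language` are hypothetical helper names, not part of the LongNovel code.

```python
from typing import Dict, Iterable, List

# Field names taken from the data format described above.
EXPECTED_FIELDS = (
    "id", "language", "article", "summary", "instruction", "input", "output",
)


def validate_record(record: Dict[str, str]) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems = [
        f"missing field: {name}" for name in EXPECTED_FIELDS if name not in record
    ]
    language = record.get("language")
    if language is not None and language not in ("zh", "en"):
        problems.append(f"unexpected language: {language!r}")
    return problems


def filter_by_language(
    records: Iterable[Dict[str, str]], language: str
) -> List[Dict[str, str]]:
    """Keep only the records whose 'language' field matches."""
    return [r for r in records if r.get("language") == language]


# Made-up record for illustration only; real records come from the dataset.
sample = {
    "id": "example-0",
    "language": "en",
    "article": "Chapter text goes here.",
    "summary": "Summary text goes here.",
    "instruction": "Detect hallucinations in the summary.",
    "input": "Chapter text goes here.\nSummary text goes here.",
    "output": "Detection result goes here.",
}

print(validate_record(sample))
print(len(filter_by_language([sample], "zh")))
```

The same helpers can be applied to the rows returned by load_dataset, since each row is a dictionary with these keys.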

Evaluation

The inference script automatically configures rope_scaling based on the model path and context length. By default, it uses a Tensor Parallelism (TP) size of 8, which targets multi-GPU setups.
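This README does not spell out how rope_scaling is derived, so the following is only one plausible sketch and not the script's actual logic. It assumes YaRN-style scaling, a hypothetical native window of 32,768 tokens, and that a label such as 64k means 64 x 1024 tokens. The function names are illustrative; the dictionary keys follow the style of a Hugging Face model config.

```python
from typing import Dict, Optional, Union

# Assumption: native context window of the model (hypothetical default).
NATIVE_CONTEXT = 32_768


def parse_context(size: str) -> int:
    """Convert a label such as '64k' into a token count (assumes 1k = 1024)."""
    if not size.lower().endswith("k"):
        raise ValueError(f"expected a size like '64k', got {size!r}")
    return int(size[:-1]) * 1024


def rope_scaling_for(
    size: str, native: int = NATIVE_CONTEXT
) -> Optional[Dict[str, Union[str, float, int]]]:
    """Return a YaRN-style rope_scaling entry, or None if no scaling is needed."""
    target = parse_context(size)
    if target <= native:
        # The requested context fits in the native window: leave RoPE untouched.
        return None
    return {
        "rope_type": "yarn",
        "factor": target / native,
        "original_max_position_embeddings": native,
    }


for label in ("16k", "32k", "64k", "100k"):
    print(label, rope_scaling_for(label))
```

Under these assumptions, 16k and 32k need no scaling, while 64k and 100k get factors of 2.0 and 3.125 respectively.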

Run Model Inference

Run a standard 64k evaluation:

python run_inference.py --model Qwen3-32B --long_context 64k

Run with Chain of Thought reasoning on a 32k validation set:

python run_inference.py --model Qwen3-32B --long_context 32k --mode CoT --data_split validation

Args

--model <name_or_path>

The identifier for the model. It defaults to Qwen3-32B.

--long_context <size>

Sets the context length and determines which dataset file to load. Options: 16k, 32k, 64k, 100k.

--mode <type>

Defines the prompting strategy.

"" (Default): Standard Hallucination Detection inference.

CoT: Enables Chain of Thought processing for better reasoning.

PromptB: Reorders the input, placing the summary before the article text.

--data_split <split>

Specifies which data subset to evaluate.

test: (Default) Uses test sets.

validation: Uses validation sets (only supported for 16k/32k).
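The documented arguments can be mirrored with a small argparse sketch. This is an illustration of the options listed above, not the contents of run_inference.py. Making --long_context required is an assumption, since no default is documented, and `build_parser` and `parse_args` are hypothetical names.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(description="LongNovel inference (sketch)")
    parser.add_argument("--model", default="Qwen3-32B",
                        help="Model name or path.")
    # Assumption: required, because the README documents no default.
    parser.add_argument("--long_context", required=True,
                        choices=["16k", "32k", "64k", "100k"],
                        help="Context length; also selects the dataset file.")
    parser.add_argument("--mode", default="", choices=["", "CoT", "PromptB"],
                        help="Prompting strategy; empty string is standard.")
    parser.add_argument("--data_split", default="test",
                        choices=["test", "validation"],
                        help="Data subset to evaluate.")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse arguments and enforce the validation-split restriction."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.data_split == "validation" and args.long_context not in ("16k", "32k"):
        # parser.error prints a message and exits with status 2.
        parser.error("the validation split is only supported for 16k/32k")
    return args


args = parse_args(["--long_context", "32k", "--mode", "CoT",
                   "--data_split", "validation"])
print(args.model, args.long_context, args.mode, args.data_split)
```

Rejecting the unsupported combination at parse time surfaces the 16k/32k restriction before any model is loaded.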

Results

We introduce LongNovel, a bilingual (Chinese and English) long-context dataset for hallucination detection in novels, based on human-annotated summaries. It comprises four subsets ranging from 16k to 100k tokens. Our extensive experiments on LongNovel reveal that current large language models still lack sufficient capability in long-context hallucination detection tasks. We hope that LongNovel will provide useful insights for future research in this field.

Recall performance of various LLMs across different hallucination types.
Evt: Event, Ent: Entity, Rel: Relation, Num: Numerical, Tmp: Temporal, Cau: Causal, Log: Logical Inversion, Fab: Fabrication.
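For readers who want to reproduce a per-type breakdown, the sketch below computes recall for each hallucination type, assuming the standard definition (detected hallucinations divided by all annotated hallucinations of that type). The input layout of (type, detected) pairs and the name `recall_by_type` are assumptions for illustration, and the toy data is made up; it is not taken from the benchmark results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def recall_by_type(examples: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute recall per hallucination type.

    Each example is (hallucination_type, was_detected) for a summary that
    is known to contain a hallucination of that type.
    """
    detected: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for hallucination_type, was_detected in examples:
        total[hallucination_type] += 1
        if was_detected:
            detected[hallucination_type] += 1
    # Recall = detected hallucinations / all annotated hallucinations.
    return {t: detected[t] / total[t] for t in total}


# Toy data for illustration only.
toy = [("Evt", True), ("Evt", False), ("Num", True), ("Tmp", False)]
print(recall_by_type(toy))
```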

License

This project is licensed under the Apache License 2.0.

Acknowledgement

We would like to express our gratitude to the annotators from iQIYI for their high-quality manual labeling and correction. We also thank iQIYI for providing the GPU resources that supported this work. Furthermore, we thank the creators and maintainers of the BookSum dataset. Our work utilizes the processed version from the ubaada/booksum-complete-cleaned repository. We deeply appreciate both the original BookSum authors and the community contributors for making these valuable resources available.

Citation

If you find our benchmark and code useful, please consider citing our work:

The arXiv link will be added soon.
