📖 LongNovel: A Multi-Scale Benchmark for Hallucination Detection in Long-Context Novel Summarization
While modern LLMs feature expanded context windows, hallucination remains a critical bottleneck—especially in long-context summarization. LongNovel is a multi-scale, bilingual (Chinese and English) benchmark specifically designed to study and detect hallucinations in long narrative texts.
Unlike short news articles or academic papers, long novels feature dense, intrinsic narrative structures, intricate event descriptions, and complex dialogues. This makes them the ideal testing ground for tracking how hallucinations evolve as context length scales.
- Bilingual & Multi-Scale Dataset: Built using 29 Chinese novels (ranging from 16k to 100k tokens) alongside chapter-level English data from the BookSum dataset.
- Granular Taxonomy: Defines 8 distinct hallucination types to categorize model errors precisely.
2026-06: We release the LongNovel benchmark.
You can download and load the LongNovel data through this link:
from datasets import load_dataset
# Load the test split of the dataset
dataset = load_dataset('SII-BDML/LongNovel', split='test')
The data format in LongNovel is structured as follows:
{
"id": "Unique identifier for each piece of data",
"language": "The language of the data, which can be 'zh' for Chinese or 'en' for English",
"article": "The original text corresponding to the summary to be detected",
"summary": "The summary text that needs to be checked for hallucinations",
"instruction": "The prompt or instruction used during the hallucination detection process",
"input": "The complete string fed into the model, which contains both the article and the summary",
"output": "The hallucination detection result"
}
The inference script (run_inference.py) automatically configures rope_scaling based on the model path and context length. By default, it uses a Tensor Parallelism (TP) degree of 8, making it suitable for multi-GPU setups.
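The README does not say how rope_scaling is derived, so the following is only a sketch of one plausible approach. It assumes YaRN-style scaling as described for Qwen3 checkpoints, a native window of 32,768 tokens, and that a label such as 64k means 64 × 1024 tokens. The helper names `parse_context` and `yarn_rope_scaling` are hypothetical and are not taken from run_inference.py.

```python
from typing import Optional

# Assumption: native context window of the checkpoint (32,768 for Qwen3).
NATIVE_CONTEXT = 32_768


def parse_context(size: str) -> int:
    """Convert a label such as '64k' into a token count (assumes 1k = 1024)."""
    if not size.lower().endswith("k"):
        raise ValueError(f"unsupported context label: {size!r}")
    return int(size[:-1]) * 1024


def yarn_rope_scaling(size: str, native: int = NATIVE_CONTEXT) -> Optional[dict]:
    """Return a YaRN rope_scaling dict, or None if no scaling is needed."""
    target = parse_context(size)
    if target <= native:
        # The requested length fits in the native window: leave RoPE untouched.
        return None
    return {
        "rope_type": "yarn",
        "factor": target / native,
        "original_max_position_embeddings": native,
    }


if __name__ == "__main__":
    for label in ("16k", "32k", "64k", "100k"):
        print(label, yarn_rope_scaling(label))
```

Under these assumptions, 16k and 32k need no scaling, while 64k and 100k are scaled by the ratio of the target length to the native window. The real script may pick different factors or a different scaling method.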
Run a standard 64k evaluation:
python run_inference.py --model Qwen3-32B --long_context 64k
Run with Chain of Thought reasoning on a 32k validation set:
python run_inference.py --model Qwen3-32B --long_context 32k --mode CoT --data_split validation
--model <name_or_path>
The identifier for the model. It defaults to Qwen3-32B.
--long_context <size>
Sets the context length and determines which dataset file to load. Options: 16k, 32k, 64k, 100k.
--mode <type>
Defines the prompting strategy.
- "" (Default): Standard hallucination detection inference.
- CoT: Enables Chain of Thought processing for better reasoning.
- PromptB: Reorders the input, placing the summary before the article text.
--data_split <split>
Specifies which data subset to evaluate.
- test (Default): Uses the test sets.
- validation: Uses the validation sets (only supported for 16k/32k).
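The options above can be mirrored in a small argparse sketch. This is a hypothetical reconstruction from the documented flags, not the parser in run_inference.py. The README gives no default for --long_context, so treating it as required is an assumption, as is rejecting the validation split for 64k/100k at parse time.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented command-line flags."""
    parser = argparse.ArgumentParser(description="LongNovel inference (sketch)")
    parser.add_argument("--model", default="Qwen3-32B",
                        help="Model name or path.")
    # Assumption: no default is documented, so the flag is required here.
    parser.add_argument("--long_context", required=True,
                        choices=["16k", "32k", "64k", "100k"],
                        help="Context length; selects the dataset file.")
    parser.add_argument("--mode", default="",
                        choices=["", "CoT", "PromptB"],
                        help="Prompting strategy.")
    parser.add_argument("--data_split", default="test",
                        choices=["test", "validation"],
                        help="Data subset to evaluate.")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse arguments and enforce the documented validation-split limit."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.data_split == "validation" and args.long_context not in ("16k", "32k"):
        # parser.error prints a usage message and exits with status 2.
        parser.error("validation split is only supported for 16k/32k")
    return args


if __name__ == "__main__":
    print(parse_args())
```

Checking the split restriction at parse time surfaces an unsupported combination before any model is loaded.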
We introduce LongNovel, a bilingual (Chinese and English) long-context dataset for hallucination detection in novels, based on human-annotated summaries. It comprises four subsets ranging from 16k to 100k tokens. Our extensive experiments on LongNovel reveal that current large language models still lack sufficient capability in long-context hallucination detection tasks. We hope that LongNovel will provide useful insights for future research in this field.
Recall performance of various LLMs across different hallucination types.
Evt: Event, Ent: Entity, Rel: Relation, Num: Numerical,
Tmp: Temporal, Cau: Causal, Log: Logical Inversion, Fab: Fabrication.
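For readers reproducing per-type numbers, recall for a hallucination type is the share of summaries containing that type that the model flags as hallucinated. The sketch below assumes each evaluated example has already been reduced to a `(type, detected)` pair. That pair format and the `recall_by_type` helper are hypothetical, since the exact shape of the detection output is not described here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Abbreviations used for the eight hallucination types.
TYPE_NAMES = {
    "Evt": "Event", "Ent": "Entity", "Rel": "Relation", "Num": "Numerical",
    "Tmp": "Temporal", "Cau": "Causal", "Log": "Logical Inversion",
    "Fab": "Fabrication",
}


def recall_by_type(examples: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute recall per hallucination type.

    Each example is (type_abbreviation, detected), where `detected` is True
    if the model flagged the hallucinated summary as hallucinated.
    """
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for hall_type, detected in examples:
        if hall_type not in TYPE_NAMES:
            raise ValueError(f"unknown hallucination type: {hall_type!r}")
        totals[hall_type] += 1
        hits[hall_type] += int(detected)
    # Recall = detected hallucinations / all hallucinations of that type.
    return {t: hits[t] / totals[t] for t in totals}


if __name__ == "__main__":
    demo = [("Evt", True), ("Evt", False), ("Num", True)]
    print(recall_by_type(demo))
```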
This project is licensed under the Apache License 2.0.
We would like to express our gratitude to the annotators from iQIYI for their high-quality manual labeling and correction. We also thank iQIYI for providing the GPU resources that supported this work. Furthermore, we thank the creators and maintainers of the BookSum dataset. Our work utilizes the processed version from the ubaada/booksum-complete-cleaned repository. We deeply appreciate both the original BookSum authors and the community contributors for making these valuable resources available.
If you find our benchmark and code useful, please consider citing our work:
The arXiv link will be added soon.
