- [2026.06.10]: 🔥 The OmniDocLayout-1M dataset is now available for download on HuggingFace.
- [2026.04.09]: OmniDocLayout has been selected as a "Highlight" paper at CVPR 2026! 🎉🎉🎉
- [2026.02.21]: OmniDocLayout has been accepted by CVPR 2026! 🎉🎉🎉
- [2025.11.24]: We have released our paper on arXiv.
Document AI has advanced rapidly and attracted increasing attention in both academia and industry. However, while most existing efforts focus on document layout analysis (DLA), its generative counterpart, document layout generation, remains relatively underexplored.
Compared with traditional graphic layout design or room layout planning, document layout generation is more challenging because each page usually contains a larger number of elements and exhibits more diverse structural patterns. Existing document layout generation datasets are often dominated by simple academic paper layouts, while modern and complex document types such as newspapers, magazines, textbooks, exam papers, and slides remain underrepresented.
To address these limitations, we introduce OmniDocLayout, a new framework for diverse document layout generation. It has two main components:
- OmniDocLayout-1M: the first million-scale dataset for diverse document layout generation, covering six common document types and approximately 48M annotated layout elements.
- OmniDocLayout-LLM: a lightweight 0.5B LLM trained with a two-stage Coarse-to-Fine learning paradigm, which first learns general layout principles from OmniDocLayout-1M and then adapts to fine-grained complex document domains.
- We introduce OmniDocLayout-1M, the first million-scale document layout dataset for diverse document layout generation, covering six common document types: textbook, newspaper, magazine, exam, academic paper, and slide.
- We propose OmniDocLayout-LLM, a lightweight 0.5B model trained with a Coarse-to-Fine learning paradigm, enabling effective transfer from coarse document layout principles to fine-grained complex domains.
- Extensive experiments on M6Doc demonstrate that OmniDocLayout-LLM achieves strong performance across multiple document types and layout generation tasks.
OmniDocLayout-1M is designed to support large-scale training for document layout generation. It covers six common document types from real-world scenarios:
| Type | File | Volume |
|---|---|---|
| Textbook | textbook.json | 200,000 |
| Newspaper | newspaper.json | 207,679 |
| Magazine | magazine.json | 195,008 |
| Exam paper | exam.json | 90,360 |
| Academic paper | academic.json | 200,000 |
| Slide | slide.json | 100,000 |
| Total | - | 993,047 |
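
As a quick sanity check on the table above, the per-type volumes can be summed and turned into shares of the whole dataset. This is a minimal sketch: the counts are copied from the table, while `VOLUMES` and `type_shares` are hypothetical names used only for illustration and are not part of the released dataset or its tooling.

```python
# Page counts per document type, copied from the table above.
# The dict and function names are illustrative, not part of the dataset.
VOLUMES = {
    "textbook.json": 200_000,
    "newspaper.json": 207_679,
    "magazine.json": 195_008,
    "exam.json": 90_360,
    "academic.json": 200_000,
    "slide.json": 100_000,
}


def type_shares(volumes):
    """Return the total page count and each file's fraction of it."""
    total = sum(volumes.values())
    return total, {name: count / total for name, count in volumes.items()}


total, shares = type_shares(VOLUMES)
print(f"total pages: {total:,}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The sum matches the Total row, and the shares show that exam papers and slides are the two smallest subsets.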
The dataset is collected from 36 public and copyright-clean sources, including academic databases, publishers, and document-sharing platforms. It covers diverse domains such as academia, education, news, economics, and more.
- Large Scale: approximately 1M document pages and about 48M annotated layout elements.
- Diverse Types: 6 challenging and complicated document types from 36 public and copyright-clean sources.
- Reading Order: annotations follow a natural reading order, which is important for autoregressive layout generation.
- Quality Assessment: in a blind human evaluation, more than 92% of sampled annotations were perceived as comparable in quality to manual annotations.
The core idea of OmniDocLayout-LLM is a two-stage Coarse-to-Fine learning paradigm.
In the first stage, the model learns universal document layout principles from OmniDocLayout-1M with coarse-grained labels. This stage helps the model acquire transferable spatial priors, such as:
- Alignment
- Non-overlapping arrangement
- Reading order
- Spatial grouping
- ...
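
To make two of these priors concrete, the sketch below checks a toy page for overlapping elements and groups elements by shared left edges. It is an illustration only, not the paper's training objective or evaluation code: the `(x, y, w, h)` box format, the pixel tolerance, and every function name are assumptions.

```python
from itertools import combinations


def overlap_area(a, b):
    """Intersection area of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    dx = min(ax + aw, bx + bw) - max(ax, bx)
    dy = min(ay + ah, by + bh) - max(ay, by)
    return max(dx, 0) * max(dy, 0)


def has_overlap(boxes):
    """True if any pair of boxes intersects with positive area."""
    return any(overlap_area(a, b) > 0 for a, b in combinations(boxes, 2))


def left_aligned_groups(boxes, tol=2):
    """Group boxes whose left edge is within `tol` of the group's first box."""
    groups = []
    for box in sorted(boxes, key=lambda b: b[0]):
        if groups and box[0] - groups[-1][0][0] <= tol:
            groups[-1].append(box)
        else:
            groups.append([box])
    return groups


# A toy two-column page: one full-width title above two text columns.
page = [(50, 40, 500, 60), (50, 120, 240, 300), (310, 120, 240, 300)]
print(has_overlap(page))
print(len(left_aligned_groups(page)))
```

On this toy page the title and the left column share a left edge, so the three boxes fall into two alignment groups and no pair overlaps.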
In the second stage, the model is adapted to a specific complex document domain with fine-grained labels. For example, a coarse category such as title can be mapped to fine-grained domain-specific categories such as:
- Title
- Headline
- First-level title
- Second-level title
- ...
This design allows the model to benefit from large-scale coarse-grained layout knowledge while requiring only limited fine-grained annotations for adaptation.
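
The relation between coarse and fine labels can be pictured as a many-to-one mapping. The sketch below uses only the title example from the text; the dictionary, the lowercase normalization, and the helper names are hypothetical, and the actual label sets used in the paper may differ.

```python
# Many-to-one mapping from fine-grained labels to a coarse label.
# Only the "title" example from the text is included; names are illustrative.
FINE_TO_COARSE = {
    "title": "title",
    "headline": "title",
    "first-level title": "title",
    "second-level title": "title",
}


def to_coarse(label):
    """Collapse a fine-grained label to its coarse category."""
    key = label.strip().lower()
    if key not in FINE_TO_COARSE:
        raise KeyError(f"unknown fine-grained label: {label!r}")
    return FINE_TO_COARSE[key]


def fine_candidates(coarse):
    """List the fine-grained labels that refine a coarse category."""
    return sorted(f for f, c in FINE_TO_COARSE.items() if c == coarse)


print(to_coarse("Headline"))
print(fine_candidates("title"))
```

Under this view, the first stage only needs the coarse side of the mapping, and the second stage learns to choose among the fine-grained candidates for a given domain.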
Our model supports five layout generation tasks:
| Task | Description |
|---|---|
| U-Cond | Unconditional layout generation without external constraints. |
| C→S+P | Given element categories, predict sizes and positions. |
| C+S→P | Given element categories and sizes, predict positions. |
| Completion | Complete the remaining layout given a subset of existing elements. |
| Refinement | Recover a clean layout from perturbed geometric attributes. |
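
The five tasks differ only in which attributes of a layout are given as the condition and which must be generated. The sketch below derives the condition for each task from a complete layout. It is a hedged illustration: the element schema (`category`, `x`, `y`, `w`, `h`), the ASCII task names, the uniform integer noise, and the `split_task` function are all assumptions, and the model's actual input serialization is not described here.

```python
import random

GEOMETRY = ("x", "y", "w", "h")


def split_task(elements, task, keep=1, noise=5, seed=0):
    """Return (condition, target) for one task; the target is the full layout."""
    rng = random.Random(seed)
    if task == "U-Cond":
        given = []  # no external constraints
    elif task == "C->S+P":
        given = [{"category": e["category"]} for e in elements]
    elif task == "C+S->P":
        given = [{"category": e["category"], "w": e["w"], "h": e["h"]}
                 for e in elements]
    elif task == "Completion":
        given = [dict(e) for e in elements[:keep]]  # a subset of elements
    elif task == "Refinement":
        # Perturb geometry with bounded integer noise (an assumed noise model).
        given = [{**e, **{k: e[k] + rng.randint(-noise, noise) for k in GEOMETRY}}
                 for e in elements]
    else:
        raise ValueError(f"unknown task: {task}")
    return given, [dict(e) for e in elements]


layout = [
    {"category": "title", "x": 50, "y": 40, "w": 500, "h": 60},
    {"category": "paragraph", "x": 50, "y": 120, "w": 500, "h": 300},
]
for name in ("U-Cond", "C->S+P", "C+S->P", "Completion", "Refinement"):
    condition, _ = split_task(layout, name)
    print(name, condition)
```

Keeping the target identical across tasks makes it easy to see that each task is a different masking or corruption of the same underlying layout.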
We compare visual examples generated by various methods on the U-Cond task. For general-purpose LLMs, we adopt their strongest setting (5-shot).
We thank the developers of the following projects and tools:
- MinerU for document parsing and layout annotation.
- DocLayout-YOLO for dense document layout detection.
- SWIFT for model training and inference.
If you find this project useful for your research, please consider giving us a star and citing our paper:
@inproceedings{kang2026omnidoclayout,
title={OmniDocLayout: Towards Diverse Document Layout Generation via Coarse-to-Fine LLM Learning},
author={Kang, Hengrui and Gu, Zhuangcheng and Zhao, Zhiyuan and Wen, Zichen and Wang, Bin and Li, Weijia and He, Conghui},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={3208--3218},
year={2026}
}


