Generative Data Transformation: From Mixed to Unified Data

Official implementation of "Generative Data Transformation: From Mixed to Unified Data".

Taesar is a data-centric framework for target-aligned sequential regeneration. It addresses the domain gap in multi-domain sequential recommendation by regenerating target-aligned sequences with adaptive contrastive decoding, allowing standard sequential models to benefit from cross-domain context without relying on increasingly complex model-centric fusion architectures.

1. Paper

Jiaqing Zhang, Mingjia Yin, Hao Wang, Yuxin Tian, Yuyang Ye, Yawen Li, Wei Guo, Yong Liu, and Enhong Chen. Generative Data Transformation: From Mixed to Unified Data. In Proceedings of the ACM Web Conference 2026 (WWW '26), April 13-17, 2026, Dubai, United Arab Emirates.


In short, Taesar transforms mixed-domain user behavior into target-aligned training sequences through tri-model pretraining and adaptive contrastive decoding, so that standard sequential recommenders can use cross-domain data without complex fusion architectures.

2. Highlights

Proposes Taesar, a target-aligned sequential regeneration framework for multi-domain recommendation.
Uses tri-model pretraining and domain-specific adaptation to separate mixed-domain, source-domain, and target-domain knowledge.
Applies adaptive contrastive decoding to transform mixed sequences into target-aligned training data.
Generalizes across multiple sequential recommendation backbones and improves both data-centric and model-centric baselines.

3. Method At A Glance

Taesar first trains mixed, source, and target decoder views, then uses adaptive contrastive decoding to decide which source-domain items should be deleted or replaced with target-aligned items. The regenerated sequence data can then be used by standard sequential recommendation models.
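The delete-or-replace decision described above can be illustrated with a generic contrastive-decoding sketch. This is not the paper's formulation: the scoring rule (target log-probability minus a weighted source log-probability), the fixed `threshold` standing in for the adaptive part, and the names `contrastive_scores`, `regenerate`, and `logits_fn` are all assumptions made for illustration. See decoding.py for the actual Stage-II logic.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def contrastive_scores(target_logits, source_logits, alpha=1.0):
    """Assumed rule: reward candidates the target view prefers over the source view."""
    target_logp = log_softmax(target_logits)
    source_logp = log_softmax(source_logits)
    return [t - alpha * s for t, s in zip(target_logp, source_logp)]

def regenerate(sequence, source_positions, logits_fn, threshold=0.5):
    """Replace or delete the source-domain items of one sequence.

    logits_fn(prefix) is a hypothetical callback returning
    (target_logits, source_logits) over the target-item vocabulary.
    A fixed threshold is used here; the adaptive variant is not modeled.
    """
    output = []
    for pos, item in enumerate(sequence):
        if pos not in source_positions:
            output.append(item)      # target-domain items are kept as they are
            continue
        target_logits, source_logits = logits_fn(output)
        scores = contrastive_scores(target_logits, source_logits)
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            output.append(best)      # replace with the best target candidate
        # otherwise the source-domain item is dropped (deleted)
    return output
```

With a callback whose target view clearly prefers one candidate, the source item is replaced; with two indistinguishable views, no candidate clears the threshold and the item is deleted.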

4. Repository Structure

.
|-- config/                         # Hydra configuration files
|-- data/                           # Dataset and sequential data loaders
|-- dataset/                        # Raw data folder and preprocessing notebooks
|-- model/                          # Sequential model implementation
|-- cdsr_baseline/                  # Baseline implementations for comparison
|-- pretrain.py                     # Stage-I pretraining entry point
|-- decoding.py                     # Stage-II adaptive contrastive decoding
|-- finetune.py                     # Fine-tuning/evaluation entry point
|-- trainer.py                      # Training utilities
|-- run.sh                          # End-to-end example script
|-- environment.yml                 # Conda environment
|-- docs/assets/                    # README figures cropped from the paper
`-- README.md

5. Installation

conda env create --name Taesar --file environment.yml
conda activate Taesar

The environment includes PyTorch, Hydra, RecBole-related dependencies, W&B, and plotting utilities. Use a CUDA-enabled machine for full experiments.
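After activating the environment, a quick check like the following sketch can confirm that the main dependencies are importable and that a GPU is visible. The import names in `REQUIRED` are assumptions derived from the package list above (in particular `recbole` and `matplotlib`); adjust them to match environment.yml.

```python
import importlib.util

# Assumed import names; environment.yml is the authoritative list.
REQUIRED = ["torch", "hydra", "recbole", "wandb", "matplotlib"]

def missing_packages(names=REQUIRED):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

def report():
    """Print missing packages and, if PyTorch is present, CUDA availability."""
    missing = missing_packages()
    print("missing packages:", missing or "none")
    if "torch" not in missing:
        import torch
        print("CUDA available:", torch.cuda.is_available())
```

Call `report()` from a Python session inside the activated environment.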

6. Data

Download the raw Amazon datasets:

cd dataset/raw/
gdown 'https://drive.google.com/uc?id=1Y7bvGSeWZ7TjGx5qA-4n59a457IpvQLO'
gdown 'https://drive.google.com/uc?id=1ogT75lYJ4fd0vNyhP1a7Kq8SC1fYBa6Y'
gdown 'https://drive.google.com/uc?id=1VJ2qx8mHi2nhyEVkoEv-3YQ5ZraG7cNs'
gdown 'https://drive.google.com/uc?id=1JdqI7sosDmqU13ZXhz1Z0rRiIOm-5hT5'
unzip Amazon_Books.zip
unzip Amazon_Electronics.zip
unzip Amazon_Sports_and_Outdoors.zip
unzip Amazon_Tools_and_Home_Improvement.zip

Then preprocess the datasets with the notebooks under dataset/. Use dataset/to_taesar.ipynb to obtain the Taesar-format data.
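Before unzipping or running the notebooks, it can help to confirm that all four archives downloaded intact, since a failed download may leave a file that is not a valid archive. The sketch below only checks the archive names used in the unzip commands above; `check_archives` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path
import zipfile

# Archive names taken from the unzip commands above.
ARCHIVES = [
    "Amazon_Books.zip",
    "Amazon_Electronics.zip",
    "Amazon_Sports_and_Outdoors.zip",
    "Amazon_Tools_and_Home_Improvement.zip",
]

def check_archives(raw_dir="dataset/raw"):
    """Map each expected archive to 'ok', 'missing', or 'not a zip file'."""
    status = {}
    for name in ARCHIVES:
        path = Path(raw_dir) / name
        if not path.exists():
            status[name] = "missing"
        elif not zipfile.is_zipfile(path):
            status[name] = "not a zip file"
        else:
            status[name] = "ok"
    return status
```

Running `check_archives()` from the repository root returns one status per archive.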

7. Quick Start

Run the provided end-to-end script:

bash run.sh

The script executes pretraining, adaptive contrastive decoding for each target domain, and fine-tuning with new, sim, and full training modes.

8. Reproducing Paper Results

The same workflow can be run stage by stage:

python pretrain.py -m stage=run gpu_id=0 seed=2025
python decoding.py -m stage=dec gpu_id=0 seed=2025 target_dom=dom1 train_batch_size=32
python finetune.py -m stage=tun gpu_id=0 seed=2025 train_type=new target_dom=dom1
python finetune.py -m stage=tun gpu_id=0 seed=2025 train_type=sim target_dom=dom1
python finetune.py -m stage=tun gpu_id=0 seed=2025 train_type=full target_dom=dom1

The commands above use target_dom=dom1. Repeat the decoding and fine-tuning steps with target_dom set to dom2, dom3, and dom4.
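The per-domain repetition can be scripted. The sketch below rebuilds the commands shown above for all four domains and three training modes. The ordering (pretrain once, then decode and fine-tune domain by domain) is an assumption, and `build_commands` and `main` are hypothetical helpers; run.sh remains the reference workflow.

```python
import shlex
import subprocess

DOMAINS = ["dom1", "dom2", "dom3", "dom4"]
TRAIN_TYPES = ["new", "sim", "full"]

def build_commands(gpu_id=0, seed=2025):
    """Return the stage-by-stage commands as argument lists."""
    common = [f"gpu_id={gpu_id}", f"seed={seed}"]
    commands = [["python", "pretrain.py", "-m", "stage=run", *common]]
    for dom in DOMAINS:
        commands.append(["python", "decoding.py", "-m", "stage=dec", *common,
                         f"target_dom={dom}", "train_batch_size=32"])
        for train_type in TRAIN_TYPES:
            commands.append(["python", "finetune.py", "-m", "stage=tun", *common,
                             f"train_type={train_type}", f"target_dom={dom}"])
    return commands

def main(dry_run=True):
    """Print every command; execute them in order when dry_run is False."""
    for cmd in build_commands():
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing stage
```

`main()` prints the 17 commands without running them; `main(dry_run=False)` executes them from the repository root.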

9. Configuration Notes

Main settings are in config/overall.yaml:

stage: run (pretraining), dec (adaptive contrastive decoding), or tun (fine-tuning)
dataset: default BEST
target_dom: dom1, dom2, dom3, or dom4
train_type: new, sim, or full
valid_metric: default NDCG@10
topk: [5, 10, 20, 50, 100]

Model-specific settings live under config/model/.
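Because these settings are passed as key=value overrides on the command line, a typo in a value can go unnoticed until a run fails. The following sketch checks overrides against the values listed above before launching. `ALLOWED` covers only the three enumerated settings, and both helper functions are hypothetical, not part of the repository.

```python
# Allowed values copied from the settings listed above.
ALLOWED = {
    "stage": {"run", "dec", "tun"},
    "target_dom": {"dom1", "dom2", "dom3", "dom4"},
    "train_type": {"new", "sim", "full"},
}

def parse_overrides(args):
    """Split key=value strings into a dict (split on the first '=')."""
    return dict(arg.split("=", 1) for arg in args)

def invalid_overrides(args):
    """Return the overrides whose value is not among the allowed values."""
    overrides = parse_overrides(args)
    return {key: value for key, value in overrides.items()
            if key in ALLOWED and value not in ALLOWED[key]}
```

For example, `invalid_overrides(["stage=tun", "train_type=ful"])` flags the misspelled training mode, while keys outside `ALLOWED` (such as seed) are ignored.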

10. Experimental Highlights

Taesar improves multiple sequential backbones across the BEST domains, showing that target-aligned regenerated data can complement both standard single-domain models and multi-domain baselines. The paper's versatility study reports that regenerated data remains useful across models with different architectures, rather than only helping the decoder used to produce it.

Conclusion: data regeneration is presented as a model-agnostic improvement path for cross-domain sequential recommendation.

10.1 Component Ablation

The ablation study removes domain-specific adaptation, source-domain experts, global contrastive scoring, and local contrastive scoring. Each removal reduces performance compared with full Taesar, indicating that both the domain-specific adaptation stage and the two-level contrastive decoding stage are needed.

Conclusion: the reported gains are not from a single decoding trick; the full data-transformation pipeline matters.

The regenerated Books-domain data has a flatter item-frequency curve, reduced head dominance in the Top 10% popularity segment, increased Mid 40% and Tail 50% coverage, and longer, denser user sequences.

Conclusion: the empirical analysis ties Taesar's accuracy gains to better target-domain diversity and richer sequence context.
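A popularity-segment breakdown of this kind can be computed from any set of sequences. The sketch below assumes the segments are defined by ranking items by interaction count and splitting the ranking at 10% and 50% of the item set; the paper's exact definition may differ, and `segment_shares` is a hypothetical helper.

```python
from collections import Counter

def segment_shares(sequences):
    """Share of interactions on the Top 10%, Mid 40%, and Tail 50% of items.

    Items are ranked by interaction count (descending). Assumes at least
    one interaction is present.
    """
    counts = Counter(item for seq in sequences for item in seq)
    ranked = [count for _, count in counts.most_common()]
    n_items = len(ranked)
    top_end = max(1, round(0.10 * n_items))        # end of the head segment
    mid_end = max(top_end, round(0.50 * n_items))  # end of the middle segment
    total = sum(ranked)
    return {
        "top10": sum(ranked[:top_end]) / total,
        "mid40": sum(ranked[top_end:mid_end]) / total,
        "tail50": sum(ranked[mid_end:]) / total,
    }
```

Comparing the output for the original and regenerated target-domain sequences shows whether interaction mass has shifted from the head toward the middle and tail segments.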

11. Notes For Maintainers

Keep run.sh synchronized with the stage names and defaults in config/overall.yaml.
If you rename the environment file, update the installation command in this README at the same time.
Store README-ready paper figures under docs/assets/; keep raw experiment outputs in their run directories.

12. Citation

@inproceedings{zhang2026generative,
  title = {Generative Data Transformation: From Mixed to Unified Data},
  author = {Zhang, Jiaqing and Yin, Mingjia and Wang, Hao and Tian, Yuxin and Ye, Yuyang and Li, Yawen and Guo, Wei and Liu, Yong and Chen, Enhong},
  booktitle = {Proceedings of the ACM Web Conference 2026},
  series = {WWW '26},
  year = {2026},
  doi = {10.1145/3774904.3792124}
}

13. Contact

For paper questions, please contact:

First author: Jiaqing Zhang (jiaqing.zhang@mail.ustc.edu.cn)
Corresponding authors: Hao Wang (wanghao3@ustc.edu.cn) and Enhong Chen (cheneh@ustc.edu.cn)

For repository issues, please open a GitHub issue in this repository.
