Official implementation of "Generative Data Transformation: From Mixed to Unified Data".
Taesar is a data-centric framework for target-aligned sequential regeneration. It addresses the domain gap in multi-domain sequential recommendation by regenerating target-aligned sequences with adaptive contrastive decoding, allowing standard sequential models to benefit from cross-domain context without relying on increasingly complex model-centric fusion architectures.
Jiaqing Zhang, Mingjia Yin, Hao Wang, Yuxin Tian, Yuyang Ye, Yawen Li, Wei Guo, Yong Liu, and Enhong Chen. Generative Data Transformation: From Mixed to Unified Data. In Proceedings of the ACM Web Conference 2026 (WWW '26), April 13-17, 2026, Dubai, United Arab Emirates.
Taesar is a data-centric framework for target-aligned sequential regeneration. It transforms mixed-domain user behavior into target-aligned training sequences through tri-model pretraining and adaptive contrastive decoding, helping standard sequential recommenders use cross-domain data without complex fusion architectures.
- Proposes Taesar, a target-aligned sequential regeneration framework for multi-domain recommendation.
- Uses tri-model pretraining and domain-specific adaptation to separate mixed-domain, source-domain, and target-domain knowledge.
- Applies adaptive contrastive decoding to transform mixed sequences into target-aligned training data.
- Generalizes across multiple sequential recommendation backbones and improves both data-centric and model-centric baselines.
Taesar first trains mixed, source, and target decoder views, then uses adaptive contrastive decoding to decide which source-domain items should be deleted or replaced with target-aligned items. The regenerated sequence data can then be used by standard sequential recommendation models.
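The delete-or-replace decision described above can be illustrated with a small sketch. This is not the code in decoding.py: it assumes a common form of contrastive decoding in which a candidate's score is its log-probability under the target view minus a weighted log-probability under the source view. The function names, the fixed weight `alpha`, and `keep_threshold` are hypothetical, and the mixed view and the adaptive weighting are left out for brevity.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_scores(target_logits, source_logits, alpha=1.0):
    """Score each candidate as log p_target - alpha * log p_source.

    Assumption: both logit lists cover the same candidate items and
    `alpha` is a fixed weight (the real method adapts its weighting).
    """
    log_p_target = log_softmax(target_logits)
    log_p_source = log_softmax(source_logits)
    return [t - alpha * s for t, s in zip(log_p_target, log_p_source)]


def regenerate_step(item, is_target_item, scores, keep_threshold=0.0):
    """Decide what happens to one position of a mixed sequence.

    Returns the original item (kept), the index of a replacement
    candidate, or None (the position is deleted).
    """
    if is_target_item:
        return item  # target-domain items pass through unchanged here
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] > keep_threshold:
        return best  # replace the source item with the best candidate
    return None  # no candidate is target-aligned enough: delete


if __name__ == "__main__":
    scores = contrastive_scores([2.0, 0.0, 0.0], [0.0, 0.0, 2.0])
    print(regenerate_step(item=99, is_target_item=False, scores=scores))
```

In this toy setting a candidate is chosen only when the target view prefers it clearly more than the source view does; raising `keep_threshold` makes the sketch delete more source items instead of replacing them.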
.
|-- config/ # Hydra configuration files
|-- data/ # Dataset and sequential data loaders
|-- dataset/ # Raw data folder and preprocessing notebooks
|-- model/ # Sequential model implementation
|-- cdsr_baseline/ # Baseline implementations for comparison
|-- pretrain.py # Stage-I pretraining entry point
|-- decoding.py # Stage-II adaptive contrastive decoding
|-- finetune.py # Fine-tuning/evaluation entry point
|-- trainer.py # Training utilities
|-- run.sh # End-to-end example script
|-- environment.yml # Conda environment
|-- docs/assets/ # README figures cropped from the paper
`-- README.md
conda env create --name Taesar --file environment.yml
conda activate Taesar
The environment includes PyTorch, Hydra, RecBole-related dependencies, W&B, and plotting utilities. Use a CUDA-enabled machine for full experiments.
Download the raw Amazon datasets:
cd dataset/raw/
gdown 'https://drive.google.com/uc?id=1Y7bvGSeWZ7TjGx5qA-4n59a457IpvQLO'
gdown 'https://drive.google.com/uc?id=1ogT75lYJ4fd0vNyhP1a7Kq8SC1fYBa6Y'
gdown 'https://drive.google.com/uc?id=1VJ2qx8mHi2nhyEVkoEv-3YQ5ZraG7cNs'
gdown 'https://drive.google.com/uc?id=1JdqI7sosDmqU13ZXhz1Z0rRiIOm-5hT5'
unzip Amazon_Books.zip
unzip Amazon_Electronics.zip
unzip Amazon_Sports_and_Outdoors.zip
unzip Amazon_Tools_and_Home_Improvement.zip
Process datasets with the notebooks under dataset/, especially dataset/to_taesar.ipynb for Taesar-format data.
Run the provided end-to-end script:
bash run.sh
The script executes pretraining, adaptive contrastive decoding for each target domain, and fine-tuning with new, sim, and full training modes.
The same workflow can be run stage by stage:
python pretrain.py -m stage=run gpu_id=0 seed=2025
python decoding.py -m stage=dec gpu_id=0 seed=2025 target_dom=dom1 train_batch_size=32
python finetune.py -m stage=tun gpu_id=0 seed=2025 train_type=new target_dom=dom1
python finetune.py -m stage=tun gpu_id=0 seed=2025 train_type=sim target_dom=dom1
python finetune.py -m stage=tun gpu_id=0 seed=2025 train_type=full target_dom=dom1
Repeat decoding and fine-tuning for dom1, dom2, dom3, and dom4.
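To avoid typing the per-domain commands by hand, the stage-by-stage workflow can be expanded programmatically. The sketch below only builds and prints command strings from the flags shown above; it does not run them. The helper name `build_commands` is hypothetical, and the ordering (pretrain once, then decode and fine-tune per domain) is an assumption that may differ from run.sh.

```python
DOMAINS = ["dom1", "dom2", "dom3", "dom4"]
TRAIN_TYPES = ["new", "sim", "full"]


def build_commands(gpu_id=0, seed=2025, decode_batch_size=32):
    """Return the command list: one pretraining run, then per-domain stages."""
    common = f"gpu_id={gpu_id} seed={seed}"
    commands = [f"python pretrain.py -m stage=run {common}"]
    for dom in DOMAINS:
        # Stage II: adaptive contrastive decoding for this target domain.
        commands.append(
            f"python decoding.py -m stage=dec {common} "
            f"target_dom={dom} train_batch_size={decode_batch_size}"
        )
        # Fine-tuning once per training mode.
        for train_type in TRAIN_TYPES:
            commands.append(
                f"python finetune.py -m stage=tun {common} "
                f"train_type={train_type} target_dom={dom}"
            )
    return commands


if __name__ == "__main__":
    for command in build_commands():
        print(command)
```

The printed lines can be pasted into a shell or passed to a job scheduler; changing `seed` or `gpu_id` in one place keeps all stages consistent.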
Main settings are in config/overall.yaml:
- stage: run, dec, or tun
- dataset: default BEST
- target_dom: dom1, dom2, dom3, or dom4
- train_type: new, sim, or full
- valid_metric: default NDCG@10
- topk: [5, 10, 20, 50, 100]
Model-specific settings live under config/model/.
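For readers unfamiliar with the valid_metric and topk settings, the following sketch shows how NDCG@k and hit rate are commonly computed in next-item recommendation. It assumes a single held-out ground-truth item per user, which this README does not state explicitly, and it is not taken from the repository's evaluation code. The function names are hypothetical.

```python
import math

TOPK = [5, 10, 20, 50, 100]


def hit_and_ndcg_at_k(ranked_items, true_item, k):
    """Return (HR@k, NDCG@k) for one user with one relevant item.

    With a single relevant item the ideal DCG is 1, so NDCG reduces to
    1 / log2(rank + 1) for a 1-based rank inside the top k.
    """
    top_k = ranked_items[:k]
    if true_item not in top_k:
        return 0.0, 0.0
    rank = top_k.index(true_item)  # 0-based position in the ranking
    return 1.0, 1.0 / math.log2(rank + 2)


def evaluate(ranked_items, true_item, topk=TOPK):
    """Compute NDCG at every cutoff in `topk` for one user."""
    return {
        f"NDCG@{k}": hit_and_ndcg_at_k(ranked_items, true_item, k)[1]
        for k in topk
    }
```

Averaging these per-user values over all test users gives the dataset-level metric; with the default setting, the NDCG@10 entry would be the one used for model selection.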
Taesar improves multiple sequential backbones across the BEST domains, showing that target-aligned regenerated data can complement both standard single-domain models and multi-domain baselines. The paper's versatility study reports that regenerated data remains useful across models with different architectures, rather than only helping the decoder used to produce it.
Conclusion: data regeneration is presented as a model-agnostic improvement path for cross-domain sequential recommendation.
The ablation study removes, in turn, domain-specific adaptation, source-domain experts, global contrastive scoring, and local contrastive scoring. Each removal reduces performance relative to full Taesar, indicating that both the domain-specific adaptation stage and the two-level (global and local) contrastive decoding stage are needed.
Conclusion: the reported gains are not from a single decoding trick; the full data-transformation pipeline matters.
The regenerated Books-domain data has a flatter item-frequency curve, reduced head dominance in the Top 10% popularity segment, increased Mid 40% and Tail 50% coverage, and longer, denser user sequences.
Conclusion: the empirical analysis ties Taesar's accuracy gains to better target-domain diversity and richer sequence context.
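A popularity-segment analysis of this kind can be approximated on any set of sequences with the sketch below. It takes one plausible reading of "coverage", namely the share of interactions that fall on the Top 10%, Mid 40%, and Tail 50% of items ranked by frequency; the paper's exact definition may differ. The function names are hypothetical.

```python
from collections import Counter


def segment_shares(sequences, head=0.10, mid=0.40):
    """Share of interactions on head, mid, and tail items by popularity."""
    counts = Counter(item for seq in sequences for item in seq)
    ranked = [count for _, count in counts.most_common()]  # descending
    n_items = len(ranked)
    total = sum(ranked)
    head_end = max(1, round(n_items * head))
    mid_end = head_end + round(n_items * mid)
    return {
        "head": sum(ranked[:head_end]) / total,
        "mid": sum(ranked[head_end:mid_end]) / total,
        "tail": sum(ranked[mid_end:]) / total,
    }


def mean_sequence_length(sequences):
    """Average number of interactions per user sequence."""
    return sum(len(seq) for seq in sequences) / len(sequences)
```

Running both functions on the original and the regenerated target-domain sequences gives a quick before-and-after comparison: a lower head share and a higher mean length would be consistent with the trend reported above.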
- Keep run.sh synchronized with the stage names and defaults in config/overall.yaml.
- If you rename the environment file, update the installation command in this README at the same time.
- Store README-ready paper figures under docs/assets/; keep raw experiment outputs in their run directories.
@inproceedings{zhang2026generative,
title = {Generative Data Transformation: From Mixed to Unified Data},
author = {Zhang, Jiaqing and Yin, Mingjia and Wang, Hao and Tian, Yuxin and Ye, Yuyang and Li, Yawen and Guo, Wei and Liu, Yong and Chen, Enhong},
booktitle = {Proceedings of the ACM Web Conference 2026},
series = {WWW '26},
year = {2026},
doi = {10.1145/3774904.3792124}
}
For paper questions, please contact:
- First author: Jiaqing Zhang (jiaqing.zhang@mail.ustc.edu.cn)
- Corresponding authors: Hao Wang (wanghao3@ustc.edu.cn) and Enhong Chen (cheneh@ustc.edu.cn)
For repository issues, please open a GitHub issue in this repository.


