ERNIEKit Data Format Specification

ERNIEKit currently supports reading local datasets and downloading specified Hugging Face datasets in two formats: erniekit and alpaca.

Local Datasets

CLI: Modify the following fields in the YAML config file:
- Set train_dataset_path/eval_dataset_path to the absolute or relative path of your local dataset file
- Set train_dataset_type/eval_dataset_type to the dataset format (erniekit/alpaca)
- Set train_dataset_prob/eval_dataset_prob to the mixing probability of each source; for multiple sources, give comma-separated values in the same order as the paths and types

# single-source
train_dataset_type: "erniekit"
train_dataset_path: "./examples/data/sft-train.jsonl"
train_dataset_prob: "1.0"

# multi-source
train_dataset_type: "erniekit,erniekit"
train_dataset_path: "./examples/data/sft-train1.jsonl,./examples/data/sft-train2.jsonl"
train_dataset_prob: "0.8,0.2"
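The multi-source example pairs the comma-separated values by position: the first type, path, and probability describe the first source, and so on. The sketch below illustrates that pairing with a small consistency check. `parse_dataset_sources` is a hypothetical helper written for this illustration, not part of ERNIEKit, and the checks it performs are assumptions about what a sensible config looks like.

```python
def parse_dataset_sources(types: str, paths: str, probs: str) -> list[dict]:
    """Split the comma-separated config strings into one record per source.

    Hypothetical helper for illustration only; ERNIEKit's own parsing may differ.
    """
    type_list = [t.strip() for t in types.split(",")]
    path_list = [p.strip() for p in paths.split(",")]
    prob_list = [float(p) for p in probs.split(",")]

    # Assumption: every source needs exactly one type, one path and one probability.
    if not (len(type_list) == len(path_list) == len(prob_list)):
        raise ValueError("type, path and prob must list the same number of sources")

    # The two formats named in this document.
    for dataset_type in type_list:
        if dataset_type not in ("erniekit", "alpaca"):
            raise ValueError(f"unsupported dataset type: {dataset_type!r}")

    return [
        {"type": t, "path": p, "prob": w}
        for t, p, w in zip(type_list, path_list, prob_list)
    ]


sources = parse_dataset_sources(
    "erniekit,erniekit",
    "./examples/data/sft-train1.jsonl,./examples/data/sft-train2.jsonl",
    "0.8,0.2",
)
for source in sources:
    print(source)
```

Running a check like this before launching training can catch a config where one of the three strings lists fewer sources than the others.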

WebUI:
- Under Set Custom Dataset, input the local file path in Dataset Path
- Select the corresponding format (erniekit/alpaca) in Optional Data Type

Hugging Face Datasets

CLI: Modify the following fields in the YAML config file:
- Set train_dataset_path/eval_dataset_path to the Hugging Face repo ID
- Set train_dataset_type/eval_dataset_type to alpaca
- Set train_dataset_prob/eval_dataset_prob to the mixing probability of each source; for multiple sources, give comma-separated values in the same order as the repo IDs and types

# single-source
train_dataset_type: "alpaca"
train_dataset_path: "BelleGroup/train_2M_CN"
train_dataset_prob: "1.0"

# multi-source
train_dataset_type: "alpaca,alpaca"
train_dataset_path: "llamafactory/alpaca_gpt4_zh,BelleGroup/train_2M_CN"
train_dataset_prob: "0.8,0.2"

WebUI:
- Under Set Built-in Dataset, select the dataset name in Dataset Selection
- The system will automatically configure the path and type, then download and read from Hugging Face

Supported Hugging Face datasets are defined in ernie.dataset.hf.data_info.json:

Supported Hugging Face Datasets

Dataset Name	Type	Format	File	File Format
llamafactory/alpaca_en	sft	alpaca	alpaca_data_en_52k.json	json
llamafactory/alpaca_zh	sft	alpaca	alpaca_data_zh_51k.json	json
llamafactory/alpaca_gpt4_en	sft	alpaca	alpaca_gpt4_data_en.json	json
llamafactory/alpaca_gpt4_zh	sft	alpaca	alpaca_gpt4_data_zh.json	json
BelleGroup/train_2M_CN	sft	alpaca	train_2M_CN.json	jsonl
BelleGroup/train_1M_CN	sft	alpaca	Belle_open_source_1M.json	jsonl
BelleGroup/train_0.5M_CN	sft	alpaca	Belle_open_source_0.5M.json	jsonl
BelleGroup/generated_chat_0.4M	sft	alpaca	generated_chat_0.4M.json	jsonl
BelleGroup/school_math_0.25M	sft	alpaca	school_math_0.25M.json	jsonl
sahil2801/CodeAlpaca-20k	sft	alpaca	code_alpaca_20k.json	json
TIGER-Lab/MathInstruct	sft	alpaca	MathInstruct.json	json
YeungNLP/firefly-train-1.1M	sft	alpaca	firefly-train-1.1M.jsonl	jsonl
suolyer/webqa	sft	alpaca	train.json	jsonl
zxbsmk/webnovel_cn	sft	alpaca	novel_cn_token512_50k.json	json
AstraMindAI/SFT-Nectar	sft	alpaca	sft_data_structured.json	json
hfl/stem_zh_instruction	sft	alpaca	bio_50282.json	jsonl
llamafactory/OpenO1-SFT	sft	alpaca	OpenO1-SFT-Pro.jsonl	jsonl
Congliu/Chinese-DeepSeek-R1-Distill-data-110k-SFT	sft	alpaca	distill_r1_110k_sft.jsonl	jsonl
mayflowergmbh/oasst_de	sft	alpaca	oasst_de.json	json
mayflowergmbh/dolly-15k_de	sft	alpaca	dolly_de.json	json
mayflowergmbh/alpaca-gpt4_de	sft	alpaca	alpaca_gpt4_data_de.json	json
mayflowergmbh/openschnabeltier_de	sft	alpaca	openschnabeltier.json	json
mayflowergmbh/evol-instruct_de	sft	alpaca	evol_instruct_de.json	json
mayflowergmbh/dolphin_de	sft	alpaca	dolphin.json	json
mayflowergmbh/booksum_de	sft	alpaca	booksum.json	json
mayflowergmbh/airoboros-3.0_de	sft	alpaca	airoboros_3.json	json
mayflowergmbh/ultra-chat_de	sft	alpaca	ultra_chat_german.json	json
Intel/orca_dpo_pairs	dpo	alpaca	orca_rlhf.jsonl	jsonl

erniekit Data Format

SFT Dataset

We provide demo data for quick testing. You can either use these samples or train with your own data.

Fields for SFT:

system (optional): System prompt
src: User conversation content
tgt: Model response content
label (optional): Per-turn training flag (1=include in training, 0=exclude)

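Notes:

src and tgt are lists, so a single sample can hold a multi-turn conversation
Each training sample is one JSON object, and samples are separated by newlines. The example below is pretty-printed for readability; in the data file each sample occupies a single line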

{
    "system": "你是一个生活小助理",
    "src": [
        "我们如何在日常生活中减少用水？",
        "还有别的建议吗？"
    ],
    "tgt": [
        "1. 使用节水装置，如节水淋浴喷头和水龙头。",
        "2. 使用水箱或水桶收集家庭废水，例如洗碗和洗浴。 \n3. 在社区中提高节水意识。 \n4. 检查水管和灌溉系统的漏水情况，并及时修复它们。 \n5. 洗澡时间缩短，使用低流量淋浴头节约用水。 \n6. 收集雨水，用于园艺或其他非饮用目的。 \n7. 刷牙或擦手时关掉水龙头。 \n8. 减少浇水草坪的时间。 \n9. 尽可能多地重复使用灰水（来自洗衣机、浴室水槽和淋浴的水）。 \n10. 只购买能源效率高的洗碗机和洗衣机。"
    ],
    "label": [0, 1]
}
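A quick structural check on each sample can catch malformed lines before training. The sketch below assumes, based on the example above, that src, tgt, and label all have one entry per turn. `check_sft_sample` is a hypothetical helper, not an ERNIEKit function, and the sample text is a placeholder.

```python
import json


def check_sft_sample(sample: dict) -> None:
    """Raise ValueError if the sample does not match the layout described above.

    Hypothetical helper for illustration only.
    """
    src, tgt = sample.get("src"), sample.get("tgt")
    if not isinstance(src, list) or not isinstance(tgt, list):
        raise ValueError("src and tgt must both be lists")

    # Assumption: one response per user turn, as in the example sample.
    if len(src) != len(tgt):
        raise ValueError("src and tgt must have the same number of turns")

    label = sample.get("label")
    if label is not None:
        # Assumption: one flag per turn, each either 0 or 1.
        if len(label) != len(tgt):
            raise ValueError("label must have one flag per turn")
        if any(flag not in (0, 1) for flag in label):
            raise ValueError("label flags must be 0 or 1")


# One sample as it would appear on a single line of the data file.
line = json.dumps(
    {
        "system": "You are a helpful assistant",
        "src": ["First question", "Follow-up question"],
        "tgt": ["First answer", "Second answer"],
        "label": [0, 1],
    }
)
sample = json.loads(line)
check_sft_sample(sample)
print("sample is well-formed")
```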

DPO Dataset

We provide demo data for quick testing. You can either use these samples or train with your own data.

Fields for DPO:

system (optional): System prompt
src: User conversation content (first item=question1, second=question2, etc.)
tgt: Model response content (one fewer item than src, because the final question is answered by the response field)
response: Contains chosen/rejected responses (must contain odd number of strings)
sort: Differentiates chosen/rejected (lower value=rejected, higher=chosen)

Each training sample is one JSON object, and samples are separated by newlines.

{
    "system": "你是一个生活小助理",
    "src": [
        "你好。",
        "哪一个富含蛋白质，床还是墙？"
    ],
    "tgt": ["你好呀，我是你的生活小助理。"],
    "response": [
        [
            "床和墙都不是蛋白质的来源，因为它们都是无生命的物体。蛋白质通常存在于肉类、奶制品、豆类和坚果等食物中。"
        ],
        [
            "对不起，我无法回答那个问题。请提供更具体的信息，让我知道你需要什么帮助。"
        ]
    ],
    "sort": [
        1,
        0
    ]
}
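The relationship between response and sort can be sketched as follows: the entry of response whose sort value is higher is the chosen answer, and the other is the rejected one. `split_preference` is a hypothetical helper, and the check that a sample holds exactly one chosen/rejected pair is an assumption taken from the example above.

```python
def split_preference(sample: dict) -> tuple[list[str], list[str]]:
    """Return (chosen, rejected) from a DPO sample in the erniekit format.

    Hypothetical helper for illustration only.
    """
    src, tgt = sample["src"], sample["tgt"]
    # tgt holds the earlier responses only, so it is one item shorter than src.
    if len(tgt) != len(src) - 1:
        raise ValueError("tgt must have one fewer item than src")

    response, sort = sample["response"], sample["sort"]
    # Assumption: exactly two candidates, as in the example sample.
    if len(response) != 2 or len(sort) != 2:
        raise ValueError("expected two responses and two sort values")
    if sort[0] == sort[1]:
        raise ValueError("sort values must differ to rank the responses")

    # Higher sort value marks the chosen response, lower marks the rejected one.
    chosen_index = 0 if sort[0] > sort[1] else 1
    return response[chosen_index], response[1 - chosen_index]


sample = {
    "src": ["Hello.", "Which is rich in protein, a bed or a wall?"],
    "tgt": ["Hi, I am your assistant."],
    "response": [["Neither is a source of protein."], ["Sorry, I cannot answer."]],
    "sort": [1, 0],
}
chosen, rejected = split_preference(sample)
print("chosen:", chosen)
print("rejected:", rejected)
```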

alpaca Format

SFT Dataset

Supports json and jsonl file formats:

jsonl: Each line contains one JSON object:

{"instruction":"instructionA", "input":"inputA", "output":"outputA"}
{"instruction":"instructionB", "input":"inputB", "output":"outputB"}
{"instruction":"instructionC", "input":"inputC", "output":"outputC"}

json: All data in a single JSON array:

[
    {"instruction":"instructionA", "input":"inputA", "output":"outputA"},
    {"instruction":"instructionB", "input":"inputB", "output":"outputB"},
    {"instruction":"instructionC", "input":"inputC", "output":"outputC"}
]
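Both layouts shown above can be read with Python's standard json module. The sketch below decides between them by looking at the content rather than the file extension: text that starts with `[` is parsed as one array, anything else is parsed line by line. `load_records` is a hypothetical helper and does not reflect how ERNIEKit itself loads files.

```python
import json


def load_records(text: str) -> list[dict]:
    """Parse alpaca records from either a JSON array or one-object-per-line text.

    Hypothetical helper for illustration only.
    """
    stripped = text.lstrip()
    if stripped.startswith("["):
        # Whole file is a single JSON array of records.
        return json.loads(stripped)
    # Otherwise treat every non-empty line as its own JSON object.
    return [json.loads(line) for line in stripped.splitlines() if line.strip()]


array_text = """[
    {"instruction": "instructionA", "input": "inputA", "output": "outputA"},
    {"instruction": "instructionB", "input": "inputB", "output": "outputB"}
]"""
lines_text = (
    '{"instruction": "instructionA", "input": "inputA", "output": "outputA"}\n'
    '{"instruction": "instructionB", "input": "inputB", "output": "outputB"}\n'
)

from_array = load_records(array_text)
from_lines = load_records(lines_text)
print(len(from_array), len(from_lines))
```

Detecting the layout from the content is convenient here because several datasets in the table above ship line-delimited data in files named `.json`.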

Field Mapping Between alpaca and erniekit

alpaca	erniekit	Mapping
instruction input	src	src[-1] = instruction + input
output	tgt	tgt[-1] = output
history	src tgt	history = zip(src[:-1], tgt[:-1])
system	system	system=system
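The mapping table can be sketched as a small conversion function. It follows the table literally: instruction and input are concatenated into the last src entry, output becomes the last tgt entry, and history supplies the earlier turns. `alpaca_to_erniekit` is a hypothetical helper; the plain string concatenation without a separator and the representation of history as a list of [user, assistant] pairs are assumptions, and ERNIEKit's actual conversion may differ.

```python
def alpaca_to_erniekit(record: dict) -> dict:
    """Convert one alpaca record into the erniekit SFT layout.

    Hypothetical helper for illustration only.
    """
    # Assumption: history is a list of [user, assistant] pairs for earlier turns,
    # i.e. the inverse of zip(src[:-1], tgt[:-1]).
    history = record.get("history") or []
    src = [turn[0] for turn in history]
    tgt = [turn[1] for turn in history]

    # src[-1] = instruction + input (no separator is assumed here).
    src.append(record["instruction"] + record.get("input", ""))
    # tgt[-1] = output
    tgt.append(record["output"])

    sample = {"src": src, "tgt": tgt}
    if record.get("system"):
        sample["system"] = record["system"]
    return sample


record = {
    "system": "You are a helpful assistant",
    "instruction": "Translate to French: ",
    "input": "good morning",
    "output": "bonjour",
    "history": [["Hello", "Hi, how can I help?"]],
}
converted = alpaca_to_erniekit(record)
print(converted)
```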

DPO Dataset

(Coming soon)
