ERNIEKit currently supports reading local datasets and downloading specified Hugging Face datasets. Two dataset formats are supported: erniekit and alpaca.
Local datasets:
- CLI: Modify the following fields in the YAML config file:
  - Set `train_dataset_path`/`eval_dataset_path` to the absolute or relative path of your local dataset file
  - Set `train_dataset_type`/`eval_dataset_type` to the dataset format (erniekit/alpaca)
  - Set `train_dataset_prob`/`eval_dataset_prob` to the mixing probabilities used when combining multiple dataset sources

  ```yaml
  # single-source
  train_dataset_type: "erniekit"
  train_dataset_path: "./examples/data/sft-train.jsonl"
  train_dataset_prob: "1.0"
  # multi-source
  train_dataset_type: "erniekit,erniekit"
  train_dataset_path: "./examples/data/sft-train1.jsonl,./examples/data/sft-train2.jsonl"
  train_dataset_prob: "0.8,0.2"
  ```
- WebUI:
  - Under `Set Custom Dataset`, input the local file path in `Dataset Path`
  - Select the corresponding format (erniekit/alpaca) in `Optional Data Type`
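
The multi-source fields are comma-separated strings that line up position by position. A minimal Python sketch of how such a config could be sanity-checked before training follows. The helper name `split_sources` and the checks themselves (equal counts, known types, probabilities summing to 1) are assumptions made for illustration, not ERNIEKit's own validation logic.

```python
import math


def split_sources(dataset_type: str, dataset_path: str, dataset_prob: str):
    """Split comma-separated config values into one (type, path, prob) tuple per source."""
    types = [t.strip() for t in dataset_type.split(",")]
    paths = [p.strip() for p in dataset_path.split(",")]
    probs = [float(p) for p in dataset_prob.split(",")]

    # Assumption: every source needs exactly one type, one path and one probability.
    if not (len(types) == len(paths) == len(probs)):
        raise ValueError("type, path and prob must list the same number of sources")
    # Assumption: only the two formats named in this document are valid.
    for t in types:
        if t not in ("erniekit", "alpaca"):
            raise ValueError(f"unknown dataset type: {t}")
    # Assumption: mixing probabilities are expected to sum to 1.
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-6):
        raise ValueError("probabilities should sum to 1.0")
    return list(zip(types, paths, probs))


sources = split_sources(
    "erniekit,erniekit",
    "./examples/data/sft-train1.jsonl,./examples/data/sft-train2.jsonl",
    "0.8,0.2",
)
for source in sources:
    print(source)
```

Running a check like this up front surfaces a mismatched number of entries immediately, rather than partway through data loading.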
Hugging Face datasets:
- CLI: Modify the following fields in the YAML config file:
  - Set `train_dataset_path`/`eval_dataset_path` to the Hugging Face repo ID
  - Set `train_dataset_type`/`eval_dataset_type` to alpaca
  - Set `train_dataset_prob`/`eval_dataset_prob` to the mixing probabilities used when combining multiple dataset sources

  ```yaml
  # single-source
  train_dataset_type: "alpaca"
  train_dataset_path: "BelleGroup/train_2M_CN"
  train_dataset_prob: "1.0"
  # multi-source
  train_dataset_type: "alpaca,alpaca"
  train_dataset_path: "llamafactory/alpaca_gpt4_zh,BelleGroup/train_2M_CN"
  train_dataset_prob: "0.8,0.2"
  ```
- WebUI:
  - Under `Set Built-in Dataset`, select the dataset name in `Dataset Selection`
  - The system will automatically configure the path and type, then download and read the dataset from Hugging Face
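
To give an intuition for what the mixing probabilities mean, here is a small Python sketch that draws the source of each training sample according to `train_dataset_prob`. It uses only the standard library. The function name `pick_sources` and the per-sample sampling scheme are assumptions for illustration; ERNIEKit's actual mixing strategy may differ.

```python
import random
from collections import Counter


def pick_sources(paths, probs, num_samples, seed=0):
    """Draw the source dataset of each sample with the given mixing probabilities."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.choices(paths, weights=probs, k=num_samples)


paths = ["llamafactory/alpaca_gpt4_zh", "BelleGroup/train_2M_CN"]
probs = [0.8, 0.2]

draws = pick_sources(paths, probs, num_samples=10_000)
counts = Counter(draws)
for path in paths:
    # The observed share approaches the configured probability as num_samples grows.
    print(path, counts[path] / len(draws))
```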
Supported Hugging Face datasets are defined in ernie.dataset.hf.data_info.json:
| Dataset Name | Type | Format | File | File Format |
|---|---|---|---|---|
| llamafactory/alpaca_en | sft | alpaca | alpaca_data_en_52k.json | json |
| llamafactory/alpaca_zh | sft | alpaca | alpaca_data_zh_51k.json | json |
| llamafactory/alpaca_gpt4_en | sft | alpaca | alpaca_gpt4_data_en.json | json |
| llamafactory/alpaca_gpt4_zh | sft | alpaca | alpaca_gpt4_data_zh.json | json |
| BelleGroup/train_2M_CN | sft | alpaca | train_2M_CN.json | jsonl |
| BelleGroup/train_1M_CN | sft | alpaca | Belle_open_source_1M.json | jsonl |
| BelleGroup/train_0.5M_CN | sft | alpaca | Belle_open_source_0.5M.json | jsonl |
| BelleGroup/generated_chat_0.4M | sft | alpaca | generated_chat_0.4M.json | jsonl |
| BelleGroup/school_math_0.25M | sft | alpaca | school_math_0.25M.json | jsonl |
| sahil2801/CodeAlpaca-20k | sft | alpaca | code_alpaca_20k.json | json |
| TIGER-Lab/MathInstruct | sft | alpaca | MathInstruct.json | json |
| YeungNLP/firefly-train-1.1M | sft | alpaca | firefly-train-1.1M.jsonl | jsonl |
| suolyer/webqa | sft | alpaca | train.json | jsonl |
| zxbsmk/webnovel_cn | sft | alpaca | novel_cn_token512_50k.json | json |
| AstraMindAI/SFT-Nectar | sft | alpaca | sft_data_structured.json | json |
| hfl/stem_zh_instruction | sft | alpaca | bio_50282.json | jsonl |
| llamafactory/OpenO1-SFT | sft | alpaca | OpenO1-SFT-Pro.jsonl | jsonl |
| Congliu/Chinese-DeepSeek-R1-Distill-data-110k-SFT | sft | alpaca | distill_r1_110k_sft.jsonl | jsonl |
| mayflowergmbh/oasst_de | sft | alpaca | oasst_de.json | json |
| mayflowergmbh/dolly-15k_de | sft | alpaca | dolly_de.json | json |
| mayflowergmbh/alpaca-gpt4_de | sft | alpaca | alpaca_gpt4_data_de.json | json |
| mayflowergmbh/openschnabeltier_de | sft | alpaca | openschnabeltier.json | json |
| mayflowergmbh/evol-instruct_de | sft | alpaca | evol_instruct_de.json | json |
| mayflowergmbh/dolphin_de | sft | alpaca | dolphin.json | json |
| mayflowergmbh/booksum_de | sft | alpaca | booksum.json | json |
| mayflowergmbh/airoboros-3.0_de | sft | alpaca | airoboros_3.json | json |
| mayflowergmbh/ultra-chat_de | sft | alpaca | ultra_chat_german.json | json |
| Intel/orca_dpo_pairs | dpo | alpaca | orca_rlhf.jsonl | jsonl |
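
As a small illustration of how the table can be read, the Python sketch below copies three of its rows into a dictionary and looks up the file and format for a repo ID. The dictionary layout and the names `SUPPORTED_DATASETS` and `describe` are hypothetical; the real schema of data_info.json is not shown in this document. Note that the declared file format, not the file extension, indicates how a file is parsed: train_2M_CN.json is listed as jsonl.

```python
# Hand-copied from three rows of the table; NOT the real layout of data_info.json.
SUPPORTED_DATASETS = {
    "llamafactory/alpaca_gpt4_zh": {
        "type": "sft", "format": "alpaca",
        "file": "alpaca_gpt4_data_zh.json", "file_format": "json",
    },
    "BelleGroup/train_2M_CN": {
        "type": "sft", "format": "alpaca",
        "file": "train_2M_CN.json", "file_format": "jsonl",
    },
    "Intel/orca_dpo_pairs": {
        "type": "dpo", "format": "alpaca",
        "file": "orca_rlhf.jsonl", "file_format": "jsonl",
    },
}


def describe(repo_id: str) -> str:
    """Return a one-line summary for a repo ID from the copied subset."""
    info = SUPPORTED_DATASETS.get(repo_id)
    if info is None:
        raise KeyError(f"{repo_id} is not in the copied subset of supported datasets")
    return f"{repo_id}: {info['file']} ({info['file_format']}, {info['type']})"


print(describe("BelleGroup/train_2M_CN"))
```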
We provide demo data for quick testing. You can either use these samples or train with your own data.
Required fields for SFT (erniekit format):
- `system` (optional): System configuration
- `src`: User conversation content
- `tgt`: System response content
- `label` (optional): Training flag (1=include in training, 0=exclude)

Notes:
- `src` and `tgt` are List objects supporting multi-turn conversations
- Each training sample is in JSON format, with multiple samples separated by newlines
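
The field rules above can be turned into a quick sanity check. The following Python sketch validates one SFT sample. The function name `check_sft_sample` is hypothetical, the sample text is made up, and the requirement that `src`, `tgt` and `label` all have the same length is an assumption inferred from the demo sample format rather than a documented rule.

```python
import json


def check_sft_sample(sample: dict) -> None:
    """Raise ValueError if the sample breaks the SFT field rules."""
    src, tgt = sample.get("src"), sample.get("tgt")
    if not isinstance(src, list) or not isinstance(tgt, list):
        raise ValueError("src and tgt must both be lists")
    # Assumption: one response per user turn.
    if len(src) != len(tgt):
        raise ValueError("src and tgt must have the same number of turns")
    label = sample.get("label")
    if label is not None:
        # Assumption: one 0/1 training flag per turn.
        if len(label) != len(tgt) or any(flag not in (0, 1) for flag in label):
            raise ValueError("label must hold one 0/1 flag per turn")
    if "system" in sample and not isinstance(sample["system"], str):
        raise ValueError("system must be a string")


# A made-up sample stored on a single line, as in a newline-separated file.
line = (
    '{"system": "You are a helpful assistant", '
    '"src": ["Hi", "Any tips?"], "tgt": ["Hello!", "Drink water."], "label": [0, 1]}'
)
sample = json.loads(line)
check_sft_sample(sample)
print("valid sample with", len(sample["src"]), "turns")
```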
Example:

```json
{
    "system": "你是一个生活小助理",
    "src": [
        "我们如何在日常生活中减少用水?",
        "还有别的建议吗?"
    ],
    "tgt": [
        "1. 使用节水装置,如节水淋浴喷头和水龙头。",
        "2. 使用水箱或水桶收集家庭废水,例如洗碗和洗浴。 \n3. 在社区中提高节水意识。 \n4. 检查水管和灌溉系统的漏水情况,并及时修复它们。 \n5. 洗澡时间缩短,使用低流量淋浴头节约用水。 \n6. 收集雨水,用于园艺或其他非饮用目的。 \n7. 刷牙或擦手时关掉水龙头。 \n8. 减少浇水草坪的时间。 \n9. 尽可能多地重复使用灰水(来自洗衣机、浴室水槽和淋浴的水)。 \n10. 只购买能源效率高的洗碗机和洗衣机。"
    ],
    "label": [0, 1]
}
```

We provide demo data for quick testing. You can either use these samples or train with your own data.
Required fields for DPO (erniekit format):
- `system` (optional): System configuration
- `src`: User conversation content (first item=question1, second=question2, etc.)
- `tgt`: System response content (one fewer item than `src`)
- `response`: Contains chosen/rejected responses (must contain an odd number of strings)
- `sort`: Differentiates chosen/rejected (lower value=rejected, higher=chosen)
- Each training sample is in JSON format, with multiple samples separated by newlines
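
The constraints listed above can be sketched as a small Python helper that checks one DPO sample and uses `sort` to separate the chosen response from the rejected one. The function name `split_preference` is hypothetical, and two points are assumptions: that a sample carries exactly two candidate responses, and that the "odd number of strings" rule applies to each candidate list. The sample is an abridged English rendering of the demo data.

```python
def split_preference(sample: dict):
    """Return (chosen, rejected) response lists from one DPO sample."""
    src, tgt = sample["src"], sample["tgt"]
    if len(tgt) != len(src) - 1:
        raise ValueError("tgt must have exactly one fewer item than src")
    responses, sort = sample["response"], sample["sort"]
    # Assumption: exactly two candidate responses, each with its own sort value.
    if len(responses) != 2 or len(sort) != 2 or sort[0] == sort[1]:
        raise ValueError("expected two responses with two distinct sort values")
    # Assumption: the "odd number of strings" rule applies to each candidate list.
    for candidate in responses:
        if len(candidate) % 2 != 1:
            raise ValueError("each response must contain an odd number of strings")
    # The higher sort value marks the chosen response, the lower one the rejected response.
    chosen_idx = 0 if sort[0] > sort[1] else 1
    return responses[chosen_idx], responses[1 - chosen_idx]


sample = {
    "src": ["Hello.", "Which is rich in protein, a bed or a wall?"],
    "tgt": ["Hi, I am your assistant."],
    "response": [
        ["Neither; protein comes from food."],
        ["Sorry, I cannot answer that."],
    ],
    "sort": [1, 0],
}
chosen, rejected = split_preference(sample)
print("chosen:", chosen)
print("rejected:", rejected)
```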
Example:

```json
{
    "system": "你是一个生活小助理",
    "src": [
        "你好。",
        "哪一个富含蛋白质,床还是墙?"
    ],
    "tgt": ["你好呀,我是你的生活小助理。"],
    "response": [
        [
            "床和墙都不是蛋白质的来源,因为它们都是无生命的物体。蛋白质通常存在于肉类、奶制品、豆类和坚果等食物中。"
        ],
        [
            "对不起,我无法回答那个问题。请提供更具体的信息,让我知道你需要什么帮助。"
        ]
    ],
    "sort": [
        1,
        0
    ]
}
```

The alpaca format supports json and jsonl file formats:
- jsonl: Each line contains one JSON object:

  ```json
  {"instruction":"instructionA", "input":"inputA", "output":"outputA"}
  {"instruction":"instructionB", "input":"inputB", "output":"outputB"}
  {"instruction":"instructionC", "input":"inputC", "output":"outputC"}
  ```
- json: All data in a single JSON array:

  ```json
  [
      {"instruction":"instructionA", "input":"inputA", "output":"outputA"},
      {"instruction":"instructionB", "input":"inputB", "output":"outputB"},
      {"instruction":"instructionC", "input":"inputC", "output":"outputC"}
  ]
  ```

Field Mapping Between alpaca and erniekit
| alpaca | erniekit | Mapping |
|---|---|---|
| instruction, input | src | src[-1] = instruction + input |
| output | tgt | tgt[-1] = output |
| history | src, tgt | history = zip(src[:-1], tgt[:-1]) |
| system | system | system = system |
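
The mapping in the table can be sketched as a short Python conversion function. This is an illustration under stated assumptions, not ERNIEKit's converter: the name `alpaca_to_erniekit` is hypothetical, `history` is assumed to be a list of [user, assistant] pairs, and `instruction + input` is taken literally as plain string concatenation because the table does not name a separator.

```python
import json


def alpaca_to_erniekit(record: dict) -> dict:
    """Apply the alpaca-to-erniekit field mapping to one record."""
    # Assumption: history is a list of [user, assistant] pairs.
    history = record.get("history") or []
    src = [user for user, _ in history]        # earlier user turns -> src[:-1]
    tgt = [assistant for _, assistant in history]  # earlier responses -> tgt[:-1]
    # src[-1] = instruction + input (plain concatenation, assumed)
    src.append(record["instruction"] + record.get("input", ""))
    # tgt[-1] = output
    tgt.append(record["output"])
    sample = {"src": src, "tgt": tgt}
    # system = system, copied only when present
    if record.get("system"):
        sample["system"] = record["system"]
    return sample


records = [
    {"instruction": "instructionA", "input": "inputA", "output": "outputA"},
    {
        "instruction": "instructionB",
        "input": "inputB",
        "output": "outputB",
        "history": [["earlier question", "earlier answer"]],
        "system": "systemB",
    },
]
converted = [alpaca_to_erniekit(record) for record in records]
for sample in converted:
    print(json.dumps(sample))
```

The second record shows how a history pair becomes the leading entries of `src` and `tgt`, with the current instruction and output appended as the final turn.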
(Coming soon)