ODA-DataScorer - OpenDataArena Data Scoring Toolkit

Introduction

ODA-DataScorer is OpenDataArena's toolkit for multi-dimensional scoring of post-training datasets. It provides automated scoring and processing methods built on three approaches: model-based, LLM-as-a-judge, and heuristic.

Wiki Documentation

For more details on data scoring, see the OpenDataArena-Tool Data Scorer Documentation.

Core Modules

ODA-DataScorer combines several data processing and scoring techniques, organized into three core modules:

📊 Model-based Scorer: uses internal model signals to assess data. This framework integrates 40 model-based scorers across dimensions such as quality, complexity, and gradient analysis:
- Quality: SkyworkLlamaScorer, SkyworkQwenScorer, AtheneScorer, RMDeBERTaScorer, Gpt2HarmlessScorer, Gpt2HelpfulScorer, InfOrmScorer, DeitaQScorer, DebertaScorer, FinewebEduScorer, TextbookScorer, QuRateScorer, CleanlinessScorer, ProfessionalismScorer, ReadabilityScorer, ReasoningScorer, UniEvalD2tScorer, UniEvalDialogScorer, UniEvalFactScorer, UniEvalSumScorer
- Complexity: DeitaCScorer, IFDScorer, ThinkingProbScorer, PPLScorer, NormLossScorer, UPDScorer, ComplexityScorer
- Others: GraNdScorer, NuclearNormScorer, EffectiveRankScorer, Task2VecScorer, MIWVScorer, SelectitTokenScorer, SelectitSentenceScorer, SelectitModelScorer, HESScorer, EmbedSVDEntropyScorer, AskLlmScorer, FailRateScorer, InstagScorer
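As a rough illustration of what an "internal model signal" can look like, the sketch below computes perplexity from per-token log-probabilities. It assumes PPLScorer refers to perplexity in the standard sense; the function name and the input format are hypothetical, and the toolkit's own implementation may differ.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token.

    `token_logprobs` is assumed to hold natural-log probabilities that a
    language model assigned to each token of a sample (hypothetical input).
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)
```

A sample whose tokens each received probability 0.5 has a perplexity of about 2; higher values mean the model found the text harder to predict.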
⚖️ LLM-as-a-Judge Scorer: uses powerful LLMs as "judges" to approximate human judgment when scoring data.
In this framework, commonly used dimensions include Q, A, and QA:
- Q: Evaluates the "Question/Instruction" itself.
- A: Evaluates the "Answer/Generated Content" itself.
- QA: Evaluates the overall quality of the "Question-Answer Pair" (such as the relevance of the answer to the question).
Currently built-in metrics include:
- Difficulty (Q): The difficulty of the question
- Relevance (QA): The relevance of the answer to the question
- Clarity (Q & QA): Clarity of expression
- Coherence (Q & QA): Content coherence
- Completeness (Q & QA): Information completeness
- Complexity (Q & QA): Level of complexity
- Correctness (Q & QA): Content correctness
- Meaningfulness (Q & QA): Meaningfulness/Value
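The metric list above can be read as a mapping from each metric to the dimensions it is scored on. The sketch below encodes that mapping and expands it into individual scoring tasks. The names `JUDGE_METRICS`, `scoring_tasks`, and `fields_for` are hypothetical and are not part of the toolkit; the field names assume the instruction/output data format.

```python
from typing import Dict, List, Tuple

# Metric -> dimensions it is evaluated on, copied from the list above.
JUDGE_METRICS: Dict[str, Tuple[str, ...]] = {
    "Difficulty": ("Q",),
    "Relevance": ("QA",),
    "Clarity": ("Q", "QA"),
    "Coherence": ("Q", "QA"),
    "Completeness": ("Q", "QA"),
    "Complexity": ("Q", "QA"),
    "Correctness": ("Q", "QA"),
    "Meaningfulness": ("Q", "QA"),
}

# Which record fields each dimension looks at (assumed mapping:
# Q = instruction, A = output, QA = both).
DIMENSION_FIELDS: Dict[str, Tuple[str, ...]] = {
    "Q": ("instruction",),
    "A": ("output",),
    "QA": ("instruction", "output"),
}


def scoring_tasks(metrics: Dict[str, Tuple[str, ...]]) -> List[Tuple[str, str]]:
    """Expand the mapping into one (metric, dimension) pair per judge call."""
    return [(metric, dim) for metric, dims in metrics.items() for dim in dims]


def fields_for(dimension: str) -> Tuple[str, ...]:
    """Return the record fields a judge would need for a given dimension."""
    return DIMENSION_FIELDS[dimension]
```

With the built-in metrics listed above, this expansion yields 14 (metric, dimension) pairs per sample.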
🧠 Heuristic Scorer: using heuristic methods to score the data. This framework integrates 25 heuristic scorers, covering multiple dimensions including diversity, statistical features, content detection, and more:
- Diversity: VendiScorer, KNNScorer, ApsScorer, ApjsScorer, RadiusScorer, ClusterInertiaScorer, PartitionEntropyScorer, NovelSumScorer, FacilityLocationScorer, UniqueNgramScorer, UniqueNtokenScorer, MtldScorer, VocdDScorer, TokenEntropyScorer, GramEntropyScorer, HddScorer
- Statistical Features: TokenLengthScorer, StrLengthScorer, LogicalWordCountScorer, CompressRatioScorer, TreeInstructScorer, LogDetDistanceScorer
- Content Detection: ThinkOrNotScorer, PureThinkScorer, TsPythonScorer
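To give a feel for the kind of signal a heuristic scorer produces, here is a minimal sketch of two simple text statistics: a unique n-gram ratio and a compression ratio. These are generic illustrations only. The function names are hypothetical, whitespace tokenization and zlib are assumptions, and the code is not the implementation of UniqueNgramScorer or CompressRatioScorer.

```python
import zlib
from typing import List, Tuple


def unique_ngram_ratio(text: str, n: int = 2) -> float:
    """Fraction of distinct word n-grams among all n-grams in the text."""
    tokens: List[str] = text.split()  # naive whitespace tokenization (assumption)
    ngrams: List[Tuple[str, ...]] = [
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    ]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


def compress_ratio(text: str) -> float:
    """Compressed size divided by raw size in bytes; lower means more redundant."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw)) / len(raw)
```

Highly repetitive text scores low on both measures, which is why statistics of this kind are often used as cheap proxies for diversity and redundancy.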

Installation

conda create -n oda python=3.10 -y
conda activate oda
git clone https://github.com/OpenDataArena/ODA-DataScorer.git
cd ODA-DataScorer
pip install -r requirements.txt
pip install flash_attn==2.7.4.post1 --no-build-isolation
# Optional: to calculate fail rate, also run the following commands, which install the lighteval package
cd model_based/scorers/fail_rate
# the quotes keep shells such as zsh from treating [dev] as a glob pattern
pip install -e ".[dev]"

How to Use

To begin, ensure your input data adheres to the expected format.

Data Format

Your input data must be in JSONL format: each line is a valid JSON object whose two main keys are instruction and output.

Example (see also data_process/example_input.jsonl):

{"instruction": "What is the capital of France?", "output": "Paris"}
{"instruction": "Explain the concept of quantum entanglement.", "output": "Quantum entanglement is a phenomenon where two or more particles become linked in such a way that they share the same fate, regardless of the distance between them. Measuring the state of one entangled particle instantaneously influences the state of the other(s)."}
{"instruction": "List three benefits of regular exercise.", "output": "Regular exercise improves cardiovascular health, boosts mood and reduces stress, and strengthens muscles and bones."}
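Before running any scorer, it can help to check that a file really follows this format. The sketch below is one possible validator, not a tool shipped with ODA-DataScorer. The function name `iter_valid_records` is hypothetical, and skipping blank lines is a convenience assumption.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple

REQUIRED_KEYS = ("instruction", "output")


def iter_valid_records(lines: Iterable[str]) -> Iterator[Tuple[int, Dict[str, object]]]:
    """Yield (line_number, record) pairs, raising ValueError on malformed lines."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # assumption: tolerate blank lines such as a trailing newline
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {lineno}: invalid JSON ({exc.msg})") from exc
        if not isinstance(record, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {lineno}: missing keys {missing}")
        yield lineno, record
```

To check a file, open it in text mode and pass the file object in, for example `list(iter_valid_records(open("data_process/example_input.jsonl", encoding="utf-8")))`.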

Important Notes:

- If your original data contains an input key (common in formats like Alpaca), you must concatenate the input value with the instruction value, using \n as the separator.
- Some scorers require additional fields or a specific format. Consult the corresponding scorer's Wiki or README for the required fields and formats.
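The merge of an Alpaca-style input key into instruction can be sketched as follows. The function name `merge_alpaca_input` is hypothetical. Placing the instruction before the input is an assumption based on the usual Alpaca prompt layout, and keeping any other keys is a choice made here so that scorer-specific fields survive the conversion.

```python
from typing import Dict


def merge_alpaca_input(record: Dict[str, str]) -> Dict[str, str]:
    """Fold a non-empty `input` value into `instruction`, separated by a newline."""
    # Copy every key except `input`, so extra scorer-specific fields are kept.
    merged = {key: value for key, value in record.items() if key != "input"}
    extra = record.get("input", "")
    if extra:
        # Assumed order: instruction first, then the input value.
        merged["instruction"] = f"{record['instruction']}\n{extra}"
    return merged
```

Records with no input key, or with an empty one, pass through with their instruction unchanged.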

Running Data Scoring Scripts

ODA-DataScorer is modular: each core module lives in its own subdirectory (model_based, llm_as_judge, and heuristic). For detailed instructions on running specific scorers, refer to the README.md file in the corresponding subdirectory.

