[Feature]: Automate llm_script performance verification with an evaluation system that calls other LLMs #211

Description

@wnghdcjfe

Current Situation/Problem and Proposal

Currently, llm_script generates responses by invoking various prompts according to its own logic. However, there is no evaluation framework for systematically comparing and analyzing performance differences between models, so it is hard to pinpoint specific areas for improvement.

🎯 Objectives

Leverage external LLMs (GPT-4, Claude, Llama-2, etc.) as evaluators

Automate the process of having those evaluators review and score the responses produced by our LLM

Collect and visualize quality metrics (accuracy, consistency, fluency, etc.) to drive a concrete improvement roadmap

🔍 Proposal

API module extension

Add a separate call class for each evaluator LLM under script/llm

Common interface

Define evaluate(prompt, response) → score
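
As a rough illustration of the two points above (one call class per evaluator, all sharing `evaluate(prompt, response) → score`), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Evaluator` and `KeywordOverlapEvaluator` names, the `name` attribute, the `[0, 1]` score range, and the choice of Python itself are not taken from the existing `script/llm` code. The stand-in evaluator works offline; a real class would call an external LLM API inside `evaluate`.

```python
from typing import Protocol


class Evaluator(Protocol):
    """Hypothetical common interface: one implementation per evaluator LLM."""

    name: str

    def evaluate(self, prompt: str, response: str) -> float:
        """Return a score for `response` given `prompt` (assumed range: 0 to 1)."""
        ...


class KeywordOverlapEvaluator:
    """Offline stand-in used only to exercise the interface.

    A real evaluator class would send the prompt and response to an
    external LLM here and parse the score from its reply.
    """

    name = "keyword-overlap"

    def evaluate(self, prompt: str, response: str) -> float:
        prompt_words = set(prompt.lower().split())
        response_words = set(response.lower().split())
        if not prompt_words:
            return 0.0
        # Fraction of prompt words that reappear in the response.
        return len(prompt_words & response_words) / len(prompt_words)
```

Because the interface is structural, any class with a matching `evaluate` method can be dropped into the evaluation script without inheriting from a shared base class.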

Evaluation script

Implement a script that iterates through prompts, collects scores, and outputs results
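
The loop described above could look roughly like the following sketch. It assumes the responses have already been produced by llm_script and are available as `(prompt, response)` pairs, and that each evaluator is exposed as a callable returning a score; `run_evaluation`, `to_csv`, and `length_ratio` are hypothetical names, and CSV is just one possible output format.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical type: any callable taking (prompt, response) and returning a score.
ScoreFn = Callable[[str, str], float]


def run_evaluation(
    cases: Iterable[Tuple[str, str]],
    evaluators: Dict[str, ScoreFn],
) -> List[dict]:
    """Score every (prompt, response) pair with every evaluator."""
    rows = []
    for prompt, response in cases:
        for name, score_fn in evaluators.items():
            rows.append(
                {"prompt": prompt, "evaluator": name, "score": score_fn(prompt, response)}
            )
    return rows


def to_csv(rows: List[dict]) -> str:
    """Serialize the collected scores as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["prompt", "evaluator", "score"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def length_ratio(prompt: str, response: str) -> float:
    """Placeholder evaluator so the sketch runs without any API access."""
    return min(1.0, len(response) / max(1, len(prompt)))


if __name__ == "__main__":
    results = run_evaluation([("abcd", "ab")], {"length-ratio": length_ratio})
    print(to_csv(results))
```

Keeping one row per (prompt, evaluator) pair makes it straightforward to aggregate by evaluator or by prompt later.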

Metrics design

Quantitative: consistency, factuality, response latency

Reporting: periodically publish evaluation reports on the itdoc.kr blog.
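
Two of the quantitative metrics listed above can be computed without any model-specific logic. The sketch below is one possible definition, not a settled design: it measures response latency with a wall-clock timer and treats consistency as one minus the scaled spread of repeated scores for the same prompt, assuming scores lie in `[0, 1]`. The names `timed_call` and `consistency` are hypothetical. Factuality is left out because it would need reference answers or an evaluator LLM.

```python
import statistics
import time
from typing import Any, Callable, List, Tuple


def timed_call(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Call fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def consistency(scores: List[float]) -> float:
    """Return 1.0 when repeated scores agree, lower as they spread out.

    Assumes every score is in [0, 1].
    """
    if len(scores) < 2:
        return 1.0
    # The population standard deviation of values in [0, 1] is at most 0.5,
    # so doubling it keeps the result inside [0, 1].
    return 1.0 - 2.0 * statistics.pstdev(scores)
```

For example, `consistency` could be applied to the scores an evaluator gives to several responses generated from the same prompt, while `timed_call` wraps the call that produces each response.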

Metadata

Labels

enhancement: Work focused on refining or upgrading existing features