AI Eval inconsistency

## Summary
## GitHub Issue: AI Evaluation Inconsistencies

**Problem Statement:** Users experience inconsistencies in AI evaluation results, including out-of-order questions, missing or repeated scores, and failed result downloads.

### Expected
The AI evaluation feature should return results in the golden question order, score every question once per duplication, and allow the results to be downloaded once an aggregate score is shown.

### Actual Behavior
1. **Duplication 1:** All questions were scored, but the results were out of order.
2. **Duplication 2:** Only 5 out of 10 questions were scored, one question was not scored at all, and the results were out of order.
3. **Duplication 3:** The download failed with the error "unable to download results as scores not assigned" even though an aggregate score was visible on the dashboard (the evaluation also ran very quickly). The download succeeded 3 days later, but each question was scored 5-6 times instead of 3, and the results were out of order.
4. **Duplication 4:** Q5 was scored only 3 times instead of 4; all answers received scores, and the results were out of order.
5. **Duplication 5:** The same download error appeared even though an aggregate score was visible on the dashboard (the evaluation also ran very quickly). The results still cannot be downloaded.

### Error Messages
- "Unable to download results as scores not assigned"

### Reproduction Steps
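1. Run an AI evaluation using a Golden QnA dataset (the report used a sample of 5 questions).
2. Set a duplication factor and repeat the run (the report covers runs labeled Duplication 1 through 5).
3. Check the order of the questions and the number of scores per question in the results.
4. Attempt to download the results after the evaluation completes.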

**Additional Information Needed:** Evaluation Name, Golden QA dataset name, duplication factor, AI Assistant version, and approximate run time for each duplication run.
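
The bot reply in the thread notes that `question_id` in the results CSV maps to the original golden question order. Building on that, the following is a minimal sketch for checking a downloaded results file against the expected behavior. It is not part of Glific: the `score` column name, the assumption that question IDs are the integers 1 to N, and the `check_results` helper are all hypothetical and should be adjusted to the actual CSV layout.

```python
import csv
import io
from collections import Counter


def check_results(csv_text, num_questions, duplication_factor,
                  id_col="question_id", score_col="score"):
    """Sort result rows by question ID and report questions whose number of
    scored rows differs from the duplication factor.

    Assumptions (not confirmed by the source): the score column is named
    "score" and question IDs are the integers 1..num_questions.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # Restore the golden question order. sorted() is stable, so repeated
    # rows for the same question keep their order from the file.
    ordered = sorted(rows, key=lambda r: int(r[id_col]))

    # Count only rows that actually carry a score.
    scored = Counter(
        int(r[id_col]) for r in rows if (r.get(score_col) or "").strip()
    )

    # Map question ID -> observed count wherever it differs from expected.
    # This flags both missing scores and questions scored too many times.
    problems = {
        qid: scored[qid]
        for qid in range(1, num_questions + 1)
        if scored[qid] != duplication_factor
    }
    return ordered, problems


# Made-up sample data for illustration only (3 questions, duplication factor 2).
sample = """question_id,score
3,0.8
1,0.9
2,
1,0.7
3,0.6
2,0.5
"""

ordered, problems = check_results(sample, num_questions=3, duplication_factor=2)
print([row["question_id"] for row in ordered])  # ['1', '1', '2', '2', '3', '3']
print(problems)  # {2: 1}
```

Here question 2 is flagged because only one of its two rows has a score. An empty `problems` dictionary would mean every question was scored exactly as many times as the duplication factor.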

## Source
- **Discord thread:** https://discord.com/channels/717975833226248303/1513423979484942377
- **Reported by:** avanti0872_62582
- **Title:** AI Eval inconsistency

## Original message
Hi <@1132952719586492478> we were testing the AI eval feature and faced some challenges. We ran a sample golden QnA of 5 questions and the following happened - 
1. In Duplication 1 – In results the Qs came out of order, but all were scored 
2. In Duplication 2 – In results, it scored only 5 out of 10 Qs, one question was not scored at all, and they were out of order. 
3. In Duplication 3 – We got an error message saying unable to download results as scores not assigned, but aggregate score was visible on dashboard (it also ran the eval very quickly). Actually tried downloading this 3 days later and its come. But its scored each question 5-6 times instead of only 3, and results are out of order.
4. In Duplication 4 – It scored Q5 only 3 times instead of 4, but gave score to all answers, and out of order
5. In Duplication 5 – We got an error message saying unable to download results as scores not assigned, but aggregate score was visible on dashboard (it also ran the eval very quickly). This one we are still not able to download results.

<details>
<summary>Full thread transcript</summary>

**avanti0872_62582:** Hi <@1132952719586492478> we were testing the AI eval feature and faced some challenges. We ran a sample golden QnA of 5 questions and the following happened - 
1. In Duplication 1 – In results the Qs came out of order, but all were scored 
2. In Duplication 2 – In results, it scored only 5 out of 10 Qs, one question was not scored at all, and they were out of order. 
3. In Duplication 3 – We got an error message saying unable to download results as scores not assigned, but aggregate score was visible on dashboard (it also ran the eval very quickly). Actually tried downloading this 3 days later and its come. But its scored each question 5-6 times instead of only 3, and results are out of order.
4. In Duplication 4 – It scored Q5 only 3 times instead of 4, but gave score to all answers, and out of order
5. In Duplication 5 – We got an error message saying unable to download results as scores not assigned, but aggregate score was visible on dashboard (it also ran the eval very quickly). This one we are still not able to download results.

**Glific AI doc bot (bot):** Hey! Thanks for testing the AI Evals and sharing such clear notes on the 5 duplication runs.

A couple of quick checks: please confirm your Golden QA CSV follows the question, answer template and that the duplication factor was set as intended; duplication simply repeats the full set during the run. In the results CSV, question_id maps to the original golden question order, so even if rows display out of order you can sort by question_id to review coherently. The “unable to download results as scores not assigned” despite seeing an aggregate likely needs a deeper look on our side.

Could you share for each run: the Evaluation Name, Golden QA dataset name, duplication factor, AI Assistant + version, and the approximate run time? For reference, here are the docs: [AI Evaluations in Glific](https://glific.github.io/docs/docs/Integrations/AI%20Evaluations%20in%20Glific) and [About Golden QnA](https://glific.github.io/docs/docs/FAQ/About%20Golden%20QnA).

I do not have sufficient info to answer this question. <@&1383715800632004648>, can you please chime in here?

**Glific AI doc bot (bot):** Was this helpful in resolving your query? Or do you need further support?

**avanti0872_62582:** https://setu.glific.com/ai-evaluations

**fawas003:** <@1227120295400443936> checking this and will get back to you by the first half of today

**fawas003:** <@1118428955859505224> 
It seems there are some issues with the ordering of the questions, and scores were missing for some of the questions. Also, downloading the results was not successful even though the aggregate score was visible. The result sheet is downloadable now, though.
could you please take a look into this?

</details>

