AI Eval inconsistency #101

@akanshaaa19

Summary

GitHub Issue: AI Evaluation Inconsistencies

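Problem Statement: While testing the AI evaluation feature with a sample Golden QnA of 5 questions, users saw inconsistent results across five duplication runs: questions out of order, missing scores, questions scored more often than expected, and results that could not be downloaded even though an aggregate score was visible on the dashboard.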

Expected

The AI evaluation feature should return results in a consistent question order, score every question in every duplication, and allow the results to be downloaded once an aggregate score is shown.

Actual Behavior

  1. Duplication 1: All questions were scored, but the results were out of order.
  2. Duplication 2: Only 5 of 10 questions were scored, one question was not scored at all, and the results were out of order.
  3. Duplication 3: Downloading failed with the error "unable to download results as scores not assigned" even though an aggregate score was visible on the dashboard; the evaluation also ran very quickly. The download succeeded 3 days later, but each question was scored 5-6 times instead of 3 and the results were out of order.
  4. Duplication 4: Q5 was scored only 3 times instead of 4; all answers received scores, and the results were out of order.
  5. Duplication 5: The same download error appeared while an aggregate score was visible on the dashboard; the evaluation also ran very quickly. The results still cannot be downloaded.
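
As a reference point for the runs above, the expected counts can be worked out directly. This is a minimal sketch, assuming that "Duplication N" means a duplication factor of N and that each repetition of a question should receive exactly one score; `expected_counts` is a hypothetical helper, not part of Glific.

```python
def expected_counts(num_questions: int, duplication_factor: int) -> dict:
    """Expected totals if every golden question is scored once per repetition.

    Assumption: duplication repeats the full question set `duplication_factor` times.
    """
    return {
        "scores_per_question": duplication_factor,
        "total_scored_rows": num_questions * duplication_factor,
    }


# The report used a sample Golden QnA of 5 questions.
for factor in range(1, 6):
    print(factor, expected_counts(5, factor))
```

Under these assumptions, Duplication 2 should produce 10 scored rows (5 were reported), Duplication 3 should produce 3 scores per question (5-6 were reported), and Duplication 4 should produce 4 scores per question (Q5 had 3).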

Error Messages

  • "Unable to download results as scores not assigned"

Reproduction Steps

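  1. Run an AI evaluation using a Golden QnA dataset (the report used a sample of 5 questions).
  2. Set a duplication factor; the report covers five runs, Duplication 1 through Duplication 5.
  3. Check the results for question order and for whether each question has the expected number of scores.
  4. Attempt to download the results after the evaluation completes.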
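
The check in step 3 could be scripted against a downloaded results file along these lines. This is a sketch under stated assumptions: the `question_id` column is mentioned in the thread, but the `score` column name, integer question ids, and the `check_results` helper are hypothetical and may not match the real export.

```python
import csv
import io
from collections import Counter


def check_results(csv_text, duplication_factor,
                  id_col="question_id", score_col="score"):
    """Report ordering and per-question score counts for a results CSV.

    Assumptions: ids in `id_col` are integers; `score_col` is a guessed name.
    Only question ids that appear in the file are checked, so a question
    missing from the file entirely will not be flagged.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    ids = [int(r[id_col]) for r in rows]
    in_order = ids == sorted(ids)

    # Count only rows that actually carry a non-empty score.
    scored = Counter(
        int(r[id_col]) for r in rows if (r.get(score_col) or "").strip()
    )
    wrong = {
        qid: scored.get(qid, 0)
        for qid in sorted(set(ids))
        if scored.get(qid, 0) != duplication_factor
    }

    # Stable sort restores the golden question order for review.
    sorted_rows = sorted(rows, key=lambda r: int(r[id_col]))
    return in_order, wrong, sorted_rows


sample = """question_id,score
2,0.8
1,0.9
1,0.7
2,
"""
in_order, wrong, sorted_rows = check_results(sample, duplication_factor=2)
print(in_order)  # False: rows are not in question_id order
print(wrong)     # {2: 1}: question 2 has one score instead of two
```

The `wrong` mapping lists each question whose number of scores differs from the duplication factor, which covers both under-scored and over-scored questions.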

Additional Information Needed: For each duplication run, the Evaluation Name, Golden QA dataset name, duplication factor, AI Assistant and its version, and approximate run time.

Source

Full thread transcript

avanti0872_62582: Hi <@1132952719586492478> we were testing the AI eval feature and faced some challenges. We ran a sample golden QnA of 5 questions and the following happened -

  1. In Duplication 1 – In results the Qs came out of order, but all were scored
  2. In Duplication 2 – In results, it scored only 5 out of 10 Qs, one question was not scored at all, and they were out of order.
  3. In Duplication 3 – We got an error message saying unable to download results as scores not assigned, but aggregate score was visible on dashboard (it also ran the eval very quickly). Actually tried downloading this 3 days later and its come. But its scored each question 5-6 times instead of only 3, and results are out of order.
  4. In Duplication 4 – It scored Q5 only 3 times instead of 4, but gave score to all answers, and out of order
  5. In Duplication 5 – We got an error message saying unable to download results as scores not assigned, but aggregate score was visible on dashboard (it also ran the eval very quickly). This one we are still not able to download results.

Glific AI doc bot (bot): Hey! Thanks for testing the AI Evals and sharing such clear notes on the 5 duplication runs.

A couple of quick checks: please confirm your Golden QA CSV follows the question, answer template and that the duplication factor was set as intended; duplication simply repeats the full set during the run. In the results CSV, question_id maps to the original golden question order, so even if rows display out of order you can sort by question_id to review coherently. The “unable to download results as scores not assigned” despite seeing an aggregate likely needs a deeper look on our side.

Could you share for each run: the Evaluation Name, Golden QA dataset name, duplication factor, AI Assistant + version, and the approximate run time? For reference, here are the docs: AI Evaluations in Glific and About Golden QnA.

I do not have sufficient info to answer this question. <@&1383715800632004648>, can you please chime in here?

Glific AI doc bot (bot): Was this helpful in resolving your query? Or do you need further support?

avanti0872_62582: https://setu.glific.com/ai-evaluations

fawas003: <@1227120295400443936> checking this and will get back to you by the first half of today

fawas003: <@1118428955859505224>
it seems there are some issues with the ordering of the questions, and scores were missing for some of the questions. Also, downloading the results was not successful even though the aggregate score was visible. The result sheet is downloadable now, though.
could you please take a look into this?

Metadata

Assignees: No one assigned
Type: No type
Status: To Do
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests