Skip to content

Fix BLEU zero division errors for empty inputs#764

Open
zanvari wants to merge 1 commit into
huggingface:mainfrom
zanvari:fix-bleu-zero-division-empty-input
Open

Fix BLEU zero division errors for empty inputs#764
zanvari wants to merge 1 commit into
huggingface:mainfrom
zanvari:fix-bleu-zero-division-empty-input

Conversation

@zanvari

@zanvari zanvari commented Jun 15, 2026

Copy link
Copy Markdown

Summary

This PR fixes the issue reported in #601, where the BLEU metric raises a ZeroDivisionError when either the reference or prediction tokenizes to an empty sequence.

The issue can occur when evaluating inputs such as:

  • Non-printable characters that result in an empty tokenized reference (e.g. chr(12) / form-feed).
  • Empty prediction strings.

In these cases, BLEU computation may reach the brevity penalty calculation with either reference_length == 0 or translation_length == 0, leading to a division-by-zero error instead of returning a valid metric output.

Fixes #601.

Reproduction

The following examples previously raised ZeroDivisionError:

from evaluate import load

bleu = load("bleu")

bleu.compute(
    predictions=["hello"],
    references=[[chr(12)]]
)

bleu.compute(
    predictions=[""],
    references=[["test"]]
)

Errors were raised during the computation of the BLEU brevity penalty.

Changes

  • Added handling for cases where:

    • reference_length == 0
    • translation_length == 0
  • Return a valid BLEU result with a score of 0.0 instead of raising an exception.

While implementing this fix, I also vendored nmt_bleu.py into the BLEU metric directory and replaced the external import with a local import. This removes the external GitHub dependency and incidentally resolves the offline-loading issue discussed in #565.

Rationale

When either the reference length or translation length is zero, BLEU is not meaningfully defined. Returning a BLEU score of 0.0 is preferable to raising an exception because:

  • Evaluation can continue without crashing.
  • The result clearly indicates no overlap between prediction and reference.
  • The behavior is consistent with the expectation that degenerate inputs should yield the lowest possible score rather than terminate evaluation.

Validation

I verified the fix using the examples above.

Empty-tokenized reference

bleu.compute(
    predictions=["hello"],
    references=[[chr(12)]]
)

Result:

{
    "bleu": 0.0,
    ...
}

Empty prediction

bleu.compute(
    predictions=[""],
    references=[["test"]]
)

Result:

{
    "bleu": 0.0,
    ...
}

Normal BLEU computation

bleu.compute(
    predictions=["hello there general kenobi"],
    references=[["hello there general kenobi"]]
)

Result:

{
    "bleu": 1.0,
    ...
}

Testing

If maintainers prefer, I would be happy to split the offline-loading change related to #565 into a separate PR and keep this PR focused solely on the ZeroDivisionError fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Evaluation of form feed symbol with BLEU results in error

1 participant