Potential inconsistency in label distribution of FB Hateful Memes test_seen split #2

@MeiZhiyuan88666

Hi,

I encountered a potential inconsistency in the label distribution of the Facebook Hateful Memes dataset used in this project, and I would like to seek clarification.

According to the official description of the Hateful Memes Challenge dataset, the test set is expected to contain 1,000 samples with a balanced distribution (i.e., 500 hateful and 500 non-hateful memes).

However, after inspecting the test_seen split file in this project and counting the labels, I obtained the following statistics:

  • Hateful memes (label = 1): 490
  • Non-hateful memes (label = 0): 510

This deviates from the expected 1:1 distribution described in the official dataset documentation.
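A count like the one above can be reproduced with a short script. The following is only a minimal sketch, assuming the split is stored as a JSON Lines file in which each record carries an integer `label` field (1 = hateful, 0 = non-hateful). The function names and the `path` argument are hypothetical placeholders, not names taken from this project.

```python
import json
from collections import Counter
from typing import Iterable


def count_labels(lines: Iterable[str]) -> Counter:
    """Count the values of the "label" field across JSON Lines records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        # Assumption: every record has a "label" key; a KeyError here
        # would mean the file is unlabeled or uses a different schema.
        counts[record["label"]] += 1
    return counts


def count_labels_in_file(path: str) -> Counter:
    # `path` is a placeholder: point it at the split file being checked.
    with open(path, encoding="utf-8") as f:
        return count_labels(f)


# Tiny in-memory example with made-up records (not real dataset entries).
sample = [
    '{"id": 1, "label": 1}',
    '{"id": 2, "label": 0}',
    '{"id": 3, "label": 0}',
]
print(count_labels(sample))
```

Running `count_labels_in_file` on both the copy in this repository and a fresh copy from the official release would show whether the two files actually differ.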

This raises a few questions:

  1. Is this dataset directly downloaded from the official Facebook Hateful Memes Challenge release?
  2. If so, is this imbalance expected (e.g., due to version differences or preprocessing)?
  3. Or has any filtering, cleaning, or modification been applied to the original dataset?

Since evaluation metrics (especially accuracy) can be sensitive to label distribution, this discrepancy might affect the reproducibility and fairness of comparisons with results reported on the official split.
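To make the sensitivity concrete, here is a small illustrative calculation using the counts reported above. It shows the accuracy of a trivial classifier that always predicts the majority class; the function name is hypothetical and the example is only a sketch of the effect, not a claim about any model in this project.

```python
def majority_baseline_accuracy(n_positive: int, n_negative: int) -> float:
    """Accuracy of always predicting whichever class is more frequent."""
    total = n_positive + n_negative
    if total == 0:
        raise ValueError("at least one sample is required")
    return max(n_positive, n_negative) / total


# Counts observed in this issue: 490 hateful, 510 non-hateful.
print(majority_baseline_accuracy(490, 510))  # 0.51

# Counts expected for a perfectly balanced 1,000-sample split.
print(majority_baseline_accuracy(500, 500))  # 0.5
```

The trivial baseline moves by one percentage point between the two distributions, which is on the same order as the gaps sometimes reported between competing methods.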

Could you please clarify the source of the dataset and whether any preprocessing steps were applied?

Thanks in advance for your help!

Best regards
