Deduplication Configuration #376

Description

@yongkangzhao

I would like to know how to configure the minhash deduplication pipeline so that, when duplicate datapoints are detected, it drops ALL samples classified as duplicates instead of keeping one sample per group.

Is there a config option, or a place in the code I can modify, to achieve this?

Here's some context:

Say we have two datasets, A and B, and we need to detect duplicates between A and B but remove them only from A.
My thinking is that if all detected duplicates can be removed, I end up with a processed A and can keep using the original B, which achieves a similar effect.
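
To make the two behaviors concrete, here is a minimal sketch in plain Python. It is not the library's API: the function names, the `"A/0"`-style ids, and the assumption that the dedup step yields pairs of matching ids are all hypothetical and only illustrate the selection logic I have in mind.

```python
from collections import defaultdict


def find(parent, x):
    # Find the cluster root of x, compressing the path as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster_duplicates(pairs):
    """Group ids into clusters from pairwise duplicate matches (union-find)."""
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra
    clusters = defaultdict(set)
    for x in parent:
        clusters[find(parent, x)].add(x)
    return list(clusters.values())


def ids_to_drop_all(pairs):
    """Drop every member of every duplicate cluster, keeping none."""
    drop = set()
    for cluster in cluster_duplicates(pairs):
        drop |= cluster
    return drop


def ids_to_drop_from_a(pairs, b_ids):
    """Drop only the non-B members of clusters that contain a B sample."""
    drop = set()
    for cluster in cluster_duplicates(pairs):
        if cluster & b_ids:
            drop |= cluster - b_ids
    return drop


# Hypothetical matches: ids are "<dataset>/<index>" strings.
pairs = [("A/0", "B/3"), ("A/1", "A/2"), ("A/4", "B/3")]
print(sorted(ids_to_drop_all(pairs)))              # ['A/0', 'A/1', 'A/2', 'A/4', 'B/3']
print(sorted(ids_to_drop_from_a(pairs, {"B/3"})))  # ['A/0', 'A/4']
```

The first function is the "drop ALL" behavior asked about above; note that it would also remove both `A/1` and `A/2`, which only duplicate each other. The second is the exact effect I am after, where only the A samples that match something in B are removed.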

Thank you!
