
Question: could url_filterblacklistsv0_3_0.tar.gz be hosted externally? #477

Description

@juancorro77

Hi! First of all, thanks for this library — really impressive work.

While reading through the code, I noticed that src/datatrove/assets/url_filterblacklistsv0_3_0.tar.gz (~17 MB) lives in the source tree. I may be missing context about why it is bundled this way, but could it eventually be hosted on the Hugging Face Hub and downloaded on first use? url_filter.py already uses huggingface_hub.cached_assets_path for the extracted files, so it feels like a natural fit.
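
To make the idea concrete, here is a rough sketch of what "download on first use" could look like. It is not how datatrove works today. The repo id, the marker file, and the function names are all hypothetical, and the only Hub call it relies on is huggingface_hub.hf_hub_download.

```python
import tarfile
from pathlib import Path
from typing import Callable

# Hypothetical location: no such Hub repo is known to exist.
ASSET_REPO_ID = "example-org/datatrove-assets"
# Archive name taken from the current source tree.
ASSET_FILENAME = "url_filterblacklistsv0_3_0.tar.gz"


def _fetch_from_hub(repo_id: str, filename: str) -> str:
    """Download the archive from the Hub, or reuse the locally cached copy."""
    # Imported lazily so the module can be imported without huggingface_hub.
    from huggingface_hub import hf_hub_download

    # repo_type="dataset" is an assumption about where the file would live.
    return hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")


def ensure_blacklists(
    target_dir,
    fetch: Callable[[str, str], str] = _fetch_from_hub,
) -> Path:
    """Return a directory holding the extracted blacklists.

    The archive is fetched and unpacked only on the first call; a marker
    file (hypothetical name) records that extraction already completed.
    """
    target_dir = Path(target_dir)
    marker = target_dir / ".extracted"
    if marker.exists():
        return target_dir

    archive_path = fetch(ASSET_REPO_ID, ASSET_FILENAME)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(target_dir)
    marker.touch()
    return target_dir
```

The target directory could be the one url_filter.py already obtains from huggingface_hub.cached_assets_path, so the extracted files would end up in the same place as now. Passing the fetch function as a parameter is only there to make the sketch easy to test offline.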

No rush at all, just a thought. If it's intentional or there's a reason to keep it local I completely understand!
