Hi! First of all, thanks for this library — really impressive work.
While reading through the code I noticed that src/datatrove/assets/url_filterblacklistsv0_3_0.tar.gz (~17 MB) lives in the source tree. I might be missing context about why it's bundled this way, but I was wondering if it could eventually be hosted on HuggingFace Hub and downloaded on first use — since url_filter.py already uses huggingface_hub.cached_assets_path for the extracted files, it feels like a natural fit.
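In case it helps, here is a rough sketch of what I had in mind. It is only an illustration under assumptions: the Hub repo id is a made-up placeholder, `fetch_archive_from_hub` and `ensure_blocklists` are hypothetical helper names, and I have not checked how this would fit with the existing extraction logic in url_filter.py.

```python
import tarfile
from pathlib import Path
from typing import Callable

# Hypothetical placeholder: no such Hub repo is known to exist.
HUB_REPO_ID = "your-org/url-filter-blocklists"
# Archive name as it currently appears in the source tree.
ARCHIVE_NAME = "url_filterblacklistsv0_3_0.tar.gz"


def fetch_archive_from_hub() -> str:
    """Download the archive from the Hub (or reuse the cached copy) and return its local path."""
    # Imported lazily so the extraction helper below can be used and tested offline.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=HUB_REPO_ID, filename=ARCHIVE_NAME, repo_type="dataset")


def ensure_blocklists(
    extract_dir: Path,
    fetch_archive: Callable[[], str] = fetch_archive_from_hub,
) -> Path:
    """Extract the archive into extract_dir on first use; later calls do nothing."""
    marker = extract_dir / ".extracted"
    if not marker.exists():
        extract_dir.mkdir(parents=True, exist_ok=True)
        with tarfile.open(fetch_archive(), "r:gz") as tar:
            tar.extractall(extract_dir)
        # Written last, so an interrupted extraction is retried next time.
        marker.touch()
    return extract_dir
```

Here `extract_dir` would simply be the directory url_filter.py already obtains from huggingface_hub.cached_assets_path, so only the source of the archive would change, not where the extracted files end up.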
No rush at all, just a thought. If it's intentional or there's a reason to keep it local I completely understand!