Memory limitations for processing the DwC-A file #2

@LevanBokeria

In Step 2 of the data download, the script 02-fetch_gbif_moth_data.py uses a Darwin Core Archive (DwC-A) file to download the appropriate images for a list of species.

I have downloaded the DwC-A file for order = Lepidoptera, which is 30 GB zipped and contains large files such as occurrence.txt, which is around 110 GB on its own. The link for downloading the file is here.

However, because the file is so large, the computer runs out of memory when trying to read occurrence.txt for subsequent processing, and the process is automatically killed. The full terminal output is below (note that some of the messages, like "reading the occurrence.txt file...", were added by me inside the script).

I am able to download smaller DwC-A files, for example one covering a single family (family name = Erebidae), and I think these won't run into the same memory issues. However, since we'd like to train the model on a very large list of species, we would need to download images for all of them.

One solution would be to download images in chunks, separately for different "family" names, using separate DwC-A files for each family.

However, I was wondering if you might have a solution for downloading all the images in one go. Did you ever run into the same memory limitations when downloading images for your species, which I believe numbered in the thousands as well? If so, how did you work around it? Any advice would be greatly appreciated!
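
In case it helps the discussion, here is a minimal sketch of one more workaround I could imagine: streaming occurrence.txt row by row straight from the zip instead of loading it whole, and keeping only the rows for species in the checklist. This is not what 02-fetch_gbif_moth_data.py currently does. The function name `iter_matching_occurrences` is hypothetical, and the `speciesKey` column name and the tab-separated, unquoted layout are my assumptions about the archive, so please treat it as illustrative only.

```python
import csv
import io
import zipfile


def iter_matching_occurrences(dwca_zip_path, wanted_keys,
                              key_column="speciesKey",
                              member="occurrence.txt"):
    """Yield rows of `member` whose `key_column` value is in `wanted_keys`.

    Hypothetical helper: rows are read one at a time directly from the zip,
    so memory use does not grow with the size of occurrence.txt.
    """
    # Values read from the file are strings, so compare as strings.
    wanted = {str(key) for key in wanted_keys}
    with zipfile.ZipFile(dwca_zip_path) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            # Assumption: the file is tab-separated with no quoting.
            reader = csv.DictReader(text, delimiter="\t",
                                    quoting=csv.QUOTE_NONE)
            for row in reader:
                if row.get(key_column) in wanted:
                    yield row
```

The matching rows could then be collected into a much smaller table (or written to a new file) before the image download step, e.g. `rows = list(iter_matching_occurrences(dwca_file, checklist_keys))`, where `dwca_file` and `checklist_keys` stand in for the script's own inputs.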


Terminal output after attempting the download:

`(gbif-species-trainer-AMI-fork) lbokeria@610-MJ6THLXQ7R data_download % python 02-fetch_gbif_moth_data.py \
--write_directory /Users/lbokeria/Documents/projects/gbif-species-trainer-AMI-fork/data_download/output_data/gbif_data_uksi_macro_moths_small_try/ \
--dwca_file /Users/lbokeria/Downloads/0001402-230530130749713.zip \
--species_checklist /Users/lbokeria/Documents/projects/gbif-species-trainer-AMI-fork/uksi-macro-moths-small-try-keys.csv \
--max_images_per_species 2 \
--resume_session True

INFO:Note: NumExpr detected 10 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:NumExpr defaulting to 8 threads.
reading the multimedia.txt file...
/Users/lbokeria/miniforge3/envs/gbif-species-trainer-AMI-fork/lib/python3.9/site-packages/dwca/read.py:203: DtypeWarning: Columns (4,5,6,7,8,9,10,12) have mixed types. Specify dtype option on import or set low_memory=False.
  df = read_csv(self.absolute_temporary_path(relative_path), **kwargs)
finished
reading the occurrence.txt file...
zsh: killed python 02-fetch_gbif_moth_data.py --write_directory --dwca_file 2 True`
