Memory limitations for processing the DwC-A file

During Step 2 of the data download, the script `02-fetch_gbif_moth_data.py` uses a Darwin Core Archive (DwC-A) file to download the appropriate images for a list of species.

I have downloaded the DwC-A file for order = Lepidoptera, which is 30GB zipped and contains large files such as occurrence.txt, which is around 110GB by itself. The link for downloading the file is [here](https://www.gbif.org/occurrence/search?taxon_key=797).

However, given the large size of the file, the computer runs out of memory when trying to read in occurrence.txt for subsequent processing, and the process is automatically killed. The full terminal output is below (note that some of the printed text, like "reading the occurrence.txt file...", was added by me inside the script).

I am able to download smaller DwC-A files, for example one restricted to a single family (family = Erebidae), and I think these won't run into the same memory issues. However, since we'd like to train the model on a very large list of species, we would need to download images for all of them.

One solution would be to download images in chunks, separately for different "family" names, using separate DwC-A files for each family. 

However, I was wondering if you might have a solution for downloading all the images in one go. Did you ever run into the same memory limitations when downloading images for your species, which I believe numbered in the thousands as well? And if so, how did you work around it? Any advice would be greatly appreciated!
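For reference, here is a rough sketch of one workaround I could imagine trying: streaming the occurrence table in chunks with pandas and keeping only the rows for species in the checklist, so the full 110GB table never sits in memory at once. This is not how `02-fetch_gbif_moth_data.py` currently works. The function name `filter_occurrences`, the `speciesKey` and `gbifID` column names, and the assumption that occurrence.txt has already been extracted from the zip as a tab-separated file are all mine and would need checking against the real archive.

```python
import csv
import io

import pandas as pd


def filter_occurrences(source, wanted_keys, key_column="speciesKey",
                       usecols=None, chunksize=100_000):
    """Stream a tab-separated occurrence table, keeping rows for wanted taxa.

    Hypothetical helper: `key_column` and `usecols` are assumptions about
    the column layout of occurrence.txt, not something the script defines.
    """
    wanted = {str(key) for key in wanted_keys}
    reader = pd.read_csv(
        source,
        sep="\t",
        dtype=str,               # read everything as text, avoids mixed-type warnings
        usecols=usecols,         # optionally load only the columns that are needed
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
        chunksize=chunksize,     # yields DataFrames of at most this many rows
    )
    kept = []
    for chunk in reader:
        # Only the matching rows of each chunk are retained in memory.
        kept.append(chunk[chunk[key_column].isin(wanted)])
    return pd.concat(kept, ignore_index=True)


# Tiny in-memory stand-in for occurrence.txt (made-up rows, for illustration only).
sample = (
    "gbifID\tspeciesKey\tspecies\n"
    "1\t100\tA a\n"
    "2\t200\tB b\n"
    "3\t100\tA a\n"
)
result = filter_occurrences(io.StringIO(sample), {"100"}, chunksize=2)
print(result["gbifID"].tolist())
```

With the real file, `source` would be the path to the extracted occurrence.txt and `wanted_keys` the keys from the species checklist. Passing a short `usecols` list should reduce memory further, since most of the occurrence columns are probably not needed for fetching images.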

----------
Terminal output after attempting the download: 

`(gbif-species-trainer-AMI-fork) lbokeria@610-MJ6THLXQ7R data_download % python 02-fetch_gbif_moth_data.py \
--write_directory /Users/lbokeria/Documents/projects/gbif-species-trainer-AMI-fork/data_download/output_data/gbif_data_uksi_macro_moths_small_try/ \
--dwca_file /Users/lbokeria/Downloads/0001402-230530130749713.zip \
--species_checklist /Users/lbokeria/Documents/projects/gbif-species-trainer-AMI-fork/uksi-macro-moths-small-try-keys.csv \
--max_images_per_species 2 \
--resume_session True

INFO:Note: NumExpr detected 10 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:NumExpr defaulting to 8 threads.
reading the multimedia.txt file...
/Users/lbokeria/miniforge3/envs/gbif-species-trainer-AMI-fork/lib/python3.9/site-packages/dwca/read.py:203: DtypeWarning: Columns (4,5,6,7,8,9,10,12) have mixed types. Specify dtype option on import or set low_memory=False.
  df = read_csv(self.absolute_temporary_path(relative_path), **kwargs)
finished
reading the occurrence.txt file...
zsh: killed     python 02-fetch_gbif_moth_data.py --write_directory  --dwca_file     2  True`



