
Optimize demo data loading by reducing in-memory consumption #2895

@frances-h


Problem Description

A dataset from SDV's demo datasets can be downloaded with the download_demo function:

from sdv.datasets.demo import download_demo

data, metadata = download_demo('multi_table', 'fake_hotels')

Under the hood, the function creates three in-memory representations of the fake_hotels dataset:

  • data_io, which is the data.zip downloaded directly from S3
  • in_memory_directory, which is a dictionary of the data read as bytes
  • data, which is a dictionary of the data after loading it into pandas

This is inefficient and causes out-of-memory issues when the dataset is large.

Expected behavior

To optimize the code, we can update it to do the following:

  • maintain only one dictionary -- the pandas dictionary
    • open each CSV file and load it directly into pandas, without a separate in_memory_directory variable.
    • do this file by file so that multiple files are never open at once.
  • delete the data.zip from memory after finishing.

In the end, only one variable should remain in memory: data.
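
A minimal sketch of the proposed approach is below. It is not SDV's actual implementation: load_csvs_from_zip is a hypothetical helper, and the tiny archive built at the bottom is made-up stand-in data for the downloaded data.zip. It assumes the archive is already held in memory as bytes and that each table is a .csv member of the zip.

```python
import io
import zipfile

import pandas as pd


def load_csvs_from_zip(zip_bytes):
    """Hypothetical helper: read each CSV in a zip archive straight into pandas."""
    data = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith('.csv'):
                continue  # skip non-table members such as a metadata file

            # Open one member at a time; it is closed before the next one is opened.
            with archive.open(name) as csv_file:
                table_name = name.rsplit('/', 1)[-1][:-len('.csv')]
                data[table_name] = pd.read_csv(csv_file)

    return data


# Stand-in for the downloaded data.zip (illustrative contents only).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('hotels.csv', 'hotel_id,city\n1,Boston\n2,Austin\n')
    archive.writestr('guests.csv', 'guest_id,hotel_id\n10,1\n')
zip_bytes = buffer.getvalue()

data = load_csvs_from_zip(zip_bytes)

# Drop the references to the raw archive so only the pandas dictionary remains.
del buffer, zip_bytes
```

With this shape, no intermediate dictionary of raw bytes is built: each file is decompressed as pandas reads it, and the archive can be released once the loop finishes.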
