Problem Description
A dataset from SDV's demo datasets can be downloaded using the download_demo function:
from sdv.datasets.demo import download_demo
data, metadata = download_demo('multi_table', 'fake_hotels')
Under the hood, the function creates three in-memory representations of fake_hotels:
- data_io, which is the data.zip downloaded directly from S3
- in_memory_directory, which is a dictionary of the data read as bytes
- data, which is a dictionary of the data after loading it into pandas
This is inefficient and causes out-of-memory issues when the dataset is large.
Expected behavior
To optimize the code, we can update it to do the following:
- maintain only one dictionary -- the pandas dictionary
- open the csv file and load it directly into pandas without the need for a separate in_memory_directory variable.
- this should be done file by file such that we don't open multiple files at once.
- delete the data.zip from memory after finishing.
In the end, there should be only one variable in memory, which is data.
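
The steps above could look roughly like the sketch below. It is not SDV's actual implementation: load_tables_from_zip, the sample table names, and the sample CSV contents are hypothetical, and the zip archive is built in memory here only so that the example runs without a download.

```python
import io
import zipfile
from pathlib import PurePosixPath

import pandas as pd


def load_tables_from_zip(zip_bytes):
    """Hypothetical helper: read each CSV in a zip archive straight into pandas.

    Files are opened one at a time, and no intermediate dictionary of raw
    bytes is kept.
    """
    data = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith('.csv'):
                continue  # skip metadata and any other non-CSV entries

            table_name = PurePosixPath(name).stem
            # Only this one archive member is open while pandas parses it.
            with archive.open(name) as csv_file:
                data[table_name] = pd.read_csv(csv_file)

    return data


# Build a small stand-in for data.zip (made-up contents, for illustration only).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('hotels.csv', 'hotel_id,city\n1,Boston\n2,Austin\n')
    archive.writestr('guests.csv', 'guest_id,hotel_id\n10,1\n')
    archive.writestr('metadata.json', '{}')

zip_bytes = buffer.getvalue()
data = load_tables_from_zip(zip_bytes)

# Drop the remaining references to the zipped bytes so that only `data` is left.
del zip_bytes, buffer
```

With this shape, the only long-lived object is the dictionary of DataFrames. The zipped bytes can be garbage collected as soon as the caller releases its references to them.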