As a User I want to know patterns (and best practices) for structuring (large) datasets as data packages so that I can follow best practice and a common approach.
Example questions: suppose I have a 5GB time series dataset of daily rainfall observations spanning 30 years and 10k geographic locations (grouped by locality, then state, then country).
- How do you partition across data packages? Is this one data package or many (e.g. one for each year)?
- How do you partition across resources? Does all the data go in one big file, or do you partition by common values of key fields (e.g. one file per year)?
See also the existing support for chunking/partitioning resources in data packages: frictionlessdata/datapackage#228
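To make the two resource-level options concrete, here is a minimal sketch of what the descriptors might look like for the rainfall example. The package name, file paths, and the shortened year range are all hypothetical. Option B assumes that a resource `path` may be an array of chunk files, which is my reading of the chunking support referenced above; treat that as an assumption rather than a statement of the spec.

```python
import json

# Hypothetical, shortened range standing in for the 30 years of data.
YEARS = range(1990, 1993)


def resource_per_year(years):
    """Option A: one resource per year, each pointing at its own file."""
    return {
        "name": "rainfall",
        "resources": [
            {
                "name": f"rainfall-{year}",
                "path": f"data/rainfall-{year}.csv",
                "format": "csv",
            }
            for year in years
        ],
    }


def single_chunked_resource(years):
    """Option B: one logical resource whose path lists the chunk files.

    Assumes `path` may be an array of files that concatenate into one table.
    """
    return {
        "name": "rainfall",
        "resources": [
            {
                "name": "rainfall",
                "path": [f"data/rainfall-{year}.csv" for year in years],
                "format": "csv",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(resource_per_year(YEARS), indent=2))
    print(json.dumps(single_chunked_resource(YEARS), indent=2))
```

Both options produce the same files on disk; the difference is whether consumers see many small tables or one logical table split into chunks.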
Research & Reading
This idea of partitioning shares much in common with partitioning in databases (or, more accurately, database tables).
Essentially we are asking which partitioning criteria people should use to split their dataset into resources (or even to split a single resource into chunks).
See https://en.wikipedia.org/wiki/Partition_(database)
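As a rough illustration of what a partitioning criterion means here, the sketch below groups rows by a key function, much as a database would partition a table by a column value. The row layout (`date`, `country`, `locality`, `rainfall_mm`) and the helper names are hypothetical and only serve the rainfall example.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows for the rainfall example.
SAMPLE_ROWS = [
    {"date": "1990-01-01", "country": "GB", "locality": "Leeds", "rainfall_mm": 2.5},
    {"date": "1990-01-02", "country": "FR", "locality": "Lyon", "rainfall_mm": 0.0},
    {"date": "1991-06-15", "country": "GB", "locality": "Leeds", "rainfall_mm": 7.1},
]


def partition_by(rows, key):
    """Group rows into partitions keyed by the value of key(row)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[key(row)].append(row)
    return dict(partitions)


def by_year(row):
    """Partition criterion: the year of the observation date."""
    return date.fromisoformat(row["date"]).year


def by_country(row):
    """Partition criterion: the top level of the geographic grouping."""
    return row["country"]


if __name__ == "__main__":
    for label, key in (("year", by_year), ("country", by_country)):
        parts = partition_by(SAMPLE_ROWS, key)
        print(label, {k: len(v) for k, v in parts.items()})
```

Each partition could then become one resource (or one chunk of a resource); which key works best probably depends on how consumers usually query the data.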