
Pattern for organizing (large) datasets into data packages #546


@rufuspollock

As a user, I want to know patterns and best practices for structuring (large) datasets as data packages, so that I can follow a common, recommended approach.

Example questions: suppose I have a 5 GB time series dataset of rainfall observations, recorded daily over 30 years across 10k geographic locations (grouped by locality, then state, then country).

  • How do you partition across data packages? Is this one data package or many (e.g. one for each year)?
  • How do you partition across resources? Does all the data go in one big file, or do you partition by common values of key fields (e.g. one file per year)?
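To make the second question concrete, here is a minimal Python sketch of the "one resource per year" option. It is only an illustration, not a recommended answer: the function names `partition_by_year` and `build_descriptor`, the column names (`date`, `location`, `rainfall_mm`) and the `data/rainfall-<year>.csv` paths are all hypothetical. The descriptor uses only the `name`, `resources`, `path` and `format` properties.

```python
import csv
import io
from collections import defaultdict
from pathlib import PurePosixPath


def partition_by_year(csv_text, date_field="date"):
    """Split a CSV string into one CSV string per year (ISO dates assumed)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames
    groups = defaultdict(list)
    for row in reader:
        # The first four characters of an ISO date (YYYY-MM-DD) are the year.
        groups[row[date_field][:4]].append(row)

    parts = {}
    for year, rows in sorted(groups.items()):
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()  # every partition repeats the header row
        writer.writerows(rows)
        # Hypothetical file layout: data/rainfall-<year>.csv
        parts[f"data/rainfall-{year}.csv"] = buf.getvalue()
    return parts


def build_descriptor(paths, name="rainfall"):
    """Build a datapackage.json-style dict with one resource per file."""
    return {
        "name": name,
        "resources": [
            {"name": PurePosixPath(p).stem, "path": p, "format": "csv"}
            for p in paths
        ],
    }


# Tiny made-up sample, for illustration only.
SAMPLE = """date,location,rainfall_mm
1990-01-01,loc-1,3.2
1990-01-02,loc-1,0.0
1991-01-01,loc-1,5.1
"""

parts = partition_by_year(SAMPLE)
descriptor = build_descriptor(parts)
```

Partitioning into separate data packages instead would follow the same grouping step, but emit one descriptor per year rather than one descriptor listing every file.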

See also the existing support for chunking/partitioning resources in data packages: frictionlessdata/datapackage#228

Research & Reading

This idea of partitioning shares much in common with partitioning in databases (or, more accurately, database tables).

Essentially, we are asking which partitioning criteria people should use to split their dataset into resources (or even to split a single resource further).

See https://en.wikipedia.org/wiki/Partition_(database)
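
For reference, three criteria commonly described for database tables are range, list and hash partitioning. The Python sketch below shows how each might map a rainfall row to a partition label. The function names, the decade width, the country groupings and the partition count are all assumptions made for illustration, not part of any data package specification.

```python
import zlib


def range_partition(year, width=10):
    """Range partitioning: bucket years into fixed-width spans (e.g. decades)."""
    start = (year // width) * width
    return f"{start}-{start + width - 1}"


def list_partition(country, groups):
    """List partitioning: look the key up in explicit lists of values."""
    for name, members in groups.items():
        if country in members:
            return name
    return "other"  # fallback for keys not listed anywhere


def hash_partition(location_id, n_partitions=4):
    """Hash partitioning: assign keys to buckets using a stable hash."""
    # crc32 is used rather than built-in hash() because hash() of a str
    # varies between Python processes.
    return zlib.crc32(location_id.encode("utf-8")) % n_partitions


# Hypothetical groupings, for illustration only.
REGIONS = {"europe": {"FR", "DE"}, "americas": {"US", "BR"}}

decade = range_partition(1994)         # "1990-1999"
region = list_partition("FR", REGIONS) # "europe"
bucket = hash_partition("loc-1")       # an int in range(4)
```

Range and list criteria yield partitions that mean something to a reader (a decade, a region), which suits files people will browse by name. Hash partitioning gives up that readability in exchange for a fixed, predictable number of partitions.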
