Introduce Distributed Parquet Reader by yigitozkavci · Pull Request #26 · utdemir/distributed-dataset

yigitozkavci · 2019-12-11T22:34:30Z

This is a WIP PR. Currently https://github.com/yigitozkavci/parquet-hs should be pulled locally because parquet-hs hasn't been uploaded to Hackage yet.

Running the example using Nix:

# While inside distributed-dataset directory
$ git clone https://github.com/yigitozkavci/parquet-hs ../parquet-hs

# Start the nix shell
$ nix-shell

# This example reads parquet data from https://yigitozkavci-dd-test-bucket.s3.amazonaws.com/test.parquet. Data is being streamed remotely.
$ cabal new-run example-parquet

...<metadata here>
[Info] Stages: SInit @ParquetValue 1
[Info] Running: SInit @ParquetValue 1
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 0),("nested",ParquetNull),("some_str",ParquetString "zero")]))
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 1),("nested",ParquetObject (MkParquetObject (fromList [("another_levelll",ParquetNull)]))),("some_str",ParquetString "one")]))
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 2),("nested",ParquetObject (MkParquetObject (fromList [("another_levelll",ParquetObject (MkParquetObject (fromList [("j",ParquetInt 16)])))]))),("some_str",ParquetString "two")]))
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 3),("nested",ParquetObject (MkParquetObject (fromList [("another_levelll",ParquetObject (MkParquetObject (fromList [("j",ParquetInt 16)])))]))),("some_str",ParquetString "three")]))
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 4),("nested",ParquetObject (MkParquetObject (fromList [("another_levelll",ParquetObject (MkParquetObject (fromList [("j",ParquetInt 4)])))]))),("some_str",ParquetString "four")]))
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 5),("nested",ParquetObject (MkParquetObject (fromList [("another_levelll",ParquetObject (MkParquetObject (fromList [("j",ParquetInt 4)])))]))),("some_str",ParquetString "five")]))
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 6),("nested",ParquetObject (MkParquetObject (fromList [("another_levelll",ParquetObject (MkParquetObject (fromList [("j",ParquetInt 16)])))]))),("some_str",ParquetString "six")]))
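For readers unfamiliar with the printed shape above, here is a minimal sketch of a value type that would produce similar `Show` output, plus a helper for reading a nested field. The constructor names are copied from the output; everything else is an assumption for illustration (the use of `Data.Map` and `String`, the `Int64` payload, and the hypothetical helpers `obj` and `lookupPath`). The real parquet-hs definitions may well differ.

```haskell
module Main where

import Data.Int (Int64)
import qualified Data.Map.Strict as M

-- Hypothetical reconstruction from the printed output; not the actual
-- parquet-hs definitions. The container and field types are assumptions.
newtype ParquetObject = MkParquetObject (M.Map String ParquetValue)
  deriving (Show, Eq)

data ParquetValue
  = ParquetObject ParquetObject
  | ParquetInt Int64
  | ParquetString String
  | ParquetNull
  deriving (Show, Eq)

-- Hypothetical helper: build an object value from key/value pairs.
obj :: [(String, ParquetValue)] -> ParquetValue
obj = ParquetObject . MkParquetObject . M.fromList

-- Hypothetical helper: follow a path of field names through nested
-- objects. Returns Nothing if a key is missing or a non-object value
-- (for example ParquetNull) is reached before the path ends.
lookupPath :: [String] -> ParquetValue -> Maybe ParquetValue
lookupPath [] v = Just v
lookupPath (k : ks) (ParquetObject (MkParquetObject m)) =
  M.lookup k m >>= lookupPath ks
lookupPath _ _ = Nothing

-- The row with some_num = 2 from the output above.
row2 :: ParquetValue
row2 =
  obj
    [ ("some_num", ParquetInt 2),
      ("nested", obj [("another_levelll", obj [("j", ParquetInt 16)])]),
      ("some_str", ParquetString "two")
    ]

main :: IO ()
main = print (lookupPath ["nested", "another_levelll", "j"] row2)
-- prints: Just (ParquetInt 16)
```

In this sketch a null at any level (as in the first row, where `nested` is `ParquetNull`) simply yields `Nothing` instead of an error.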

utdemir · 2019-12-12T09:20:43Z

This looks great! I'll try to find a dataset online which uses Parquet, so we can have a use case and a good example.

I guess most parquet files in the wild will be compressed. How hard do you think it'd be to implement at least one compression algorithm? We can start with either snappy or gzip initially. I'm happy to give you a hand with the parquet-hs library if you don't want to work on it. By the way, I'm also happy to merge this without compression support.

There are some formatting changes we might want to make here (the main codebase uses Ormolu, so it'd be good to use it here), and we can download parquet-hs from GitHub via Nix instead of depending on a local clone. But after that I don't see a reason to keep it as a WIP PR; the whole project is a WIP right now anyway :).

yigitozkavci · 2019-12-12T09:30:03Z

> I'll try to find a dataset online which uses Parquet, so we can have a use case and a good example.

That would be really nice. I'm currently generating parquet files for testing using this script and I believe this might be fairly limited in terms of variety.

> I guess most parquet files in the wild will be compressed. How hard do you think it'd be to implement at least one compression algorithm?

This is one of the reasons I haven't uploaded parquet-hs to Hackage yet, really. It shouldn't be too hard, and it is my next priority when I get to work on parquet-hs.

> There's some formatting changes we might want to do here (main codebase uses Ormolu, it'd be good to use it here).

Sure! I will update the PR using Ormolu. I think I was using Brittany when formatting this.

utdemir · 2019-12-12T22:17:36Z

I found the Amazon Customer Reviews Dataset, which is provided as partitioned, Snappy-compressed parquet files on S3. It's around 50GB in total (compressed).

I created #27 to gather public datasets we can use.

yigitozkavci added 2 commits December 11, 2019 22:20

Implement Parquet reader using parquet-hs

26bcc82

Remove pinch dependency and improve error message in Parquet example

5e993de

utdemir mentioned this pull request Dec 12, 2019

Parquet Support #17

Open

yigitozkavci wants to merge 2 commits into utdemir:master from yigitozkavci:parquet-integration
