Skip to content

Introduce Distributed Parquet Reader#26

Open
yigitozkavci wants to merge 2 commits into
utdemir:masterfrom
yigitozkavci:parquet-integration
Open

Introduce Distributed Parquet Reader#26
yigitozkavci wants to merge 2 commits into
utdemir:masterfrom
yigitozkavci:parquet-integration

Conversation

@yigitozkavci

Copy link
Copy Markdown
Collaborator

This is a WIP PR. Currently https://github.com/yigitozkavci/parquet-hs should be pulled locally because parquet-hs hasn't been uploaded to Hackage yet.

Running the example using Nix:

# While inside distributed-dataset directory
$ git clone https://github.com/yigitozkavci/parquet-hs ../parquet-hs

# Start the nix shell
$ nix-shell

# This example reads parquet data from https://yigitozkavci-dd-test-bucket.s3.amazonaws.com/test.parquet. Data is being streamed remotely.
$ cabal new-run example-parquet

...<metadata here>
[Info] Stages: SInit @ParquetValue 1
[Info] Running: SInit @ParquetValue 1
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 0),("nested",ParquetNull),("some_str",ParquetString "zero")]))
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 1),("nested",ParquetObject (MkParquetObject (fromList [("another_levelll",ParquetNull)]))),("some_str",ParquetString "one")]))
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 2),("nested",ParquetObject (MkParquetObject (fromList [("another_levelll",ParquetObject (MkParquetObject (fromList [("j",ParquetInt 16)])))]))),("some_str",ParquetString "two")]))
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 3),("nested",ParquetObject (MkParquetObject (fromList [("another_levelll",ParquetObject (MkParquetObject (fromList [("j",ParquetInt 16)])))]))),("some_str",ParquetString "three")]))
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 4),("nested",ParquetObject (MkParquetObject (fromList [("another_levelll",ParquetObject (MkParquetObject (fromList [("j",ParquetInt 4)])))]))),("some_str",ParquetString "four")]))
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 5),("nested",ParquetObject (MkParquetObject (fromList [("another_levelll",ParquetObject (MkParquetObject (fromList [("j",ParquetInt 4)])))]))),("some_str",ParquetString "five")]))
ParquetObject (MkParquetObject (fromList [("some_num",ParquetInt 6),("nested",ParquetObject (MkParquetObject (fromList [("another_levelll",ParquetObject (MkParquetObject (fromList [("j",ParquetInt 16)])))]))),("some_str",ParquetString "six")]))

@utdemir

utdemir commented Dec 12, 2019

Copy link
Copy Markdown
Owner

This looks great! I'll try to find a dataset online which uses Parquet, so we can have a use case and a good example.

I guess most parquet files in the wild will be compressed. How hard do you think it'd be to implement at least one compression algorithm? We can start with either snappy or gzip initially. I'm happy to give you a hand on parquet-hs library if you don't want to work on it. By the way, I'm also happy to merge this without compression support.

There's some formatting changes we might want to do here (main codebase uses Ormolu, it'd be good to use it here), and we can download the parquet-hs from github via Nix instead of depending on a local clone. But after that I don't see a reason keep it in a WIP PR, whole project is a WIP right now anyway :).

@yigitozkavci

Copy link
Copy Markdown
Collaborator Author

I'll try to find a dataset online which uses Parquet, so we can have a use case and a good example.

That would be really nice. I'm currently generating parquet files for testing using this script and I believe this might be fairly limited in terms of variety.

I guess most parquet files in the wild will be compressed. How hard do you think it'd be to implement at least one compression algorithm?

This is one of the reasons of not uploading parquet-hs to Hackage yet, really. It shouldn't be too hard, this is my next priority when I get to work on parquet-hs.

There's some formatting changes we might want to do here (main codebase uses Ormolu, it'd be good to use it here).

Sure! I will update the PR using ormolu. I was using brittany, I think, while formatting this.

@utdemir

utdemir commented Dec 12, 2019

Copy link
Copy Markdown
Owner

I found Amazon Customer Reviews Dataset, which is provided as partitioned, snappy compressed parquet files on S3. It's around 50GB in total (compressed).

I created #27 to gather public datasets we can use.

@utdemir utdemir mentioned this pull request Dec 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants