Introduce Distributed Parquet Reader #26
This looks great! I'll try to find a dataset online that uses Parquet, so we have a use case and a good example. I guess most Parquet files in the wild will be compressed. How hard do you think it'd be to implement at least one compression algorithm? We can start with either snappy or gzip. I'm happy to give you a hand on
There are some formatting changes we might want to make here (the main codebase uses Ormolu, so it'd be good to use it here too), and we can fetch parquet-hs from GitHub via Nix instead of depending on a local clone. But after that I don't see a reason to keep it in a WIP PR; the whole project is a WIP right now anyway :). |
That would be really nice. I'm currently generating Parquet files for testing using this script, and I believe they might be fairly limited in terms of variety.
This is one of the reasons I haven't uploaded parquet-hs to Hackage yet, really. It shouldn't be too hard; it's my next priority when I get to work on parquet-hs.
Sure! I will reformat the PR with Ormolu. I think I was using Brittany when formatting this. |
I found the Amazon Customer Reviews Dataset, which is provided as partitioned, snappy-compressed Parquet files on S3. It's around 50 GB in total (compressed). I created #27 to gather public datasets we can use. |
This is a WIP PR. Currently, https://github.com/yigitozkavci/parquet-hs needs to be cloned locally because parquet-hs hasn't been uploaded to Hackage yet.
Running the example using Nix: