Posts | Sylvain Lesage (@severo@mastodon.social)

Sylvain Lesage

@severo@mastodon.social

Dataviz freelance developer. Part-time Hugging Face.

huggingface: https://huggingface.co/severo

openstreetmap: https://www.openstreetmap.org/user/Sylvain%20Lesage

https://rednegra.net

https://severo.github.io/

October 28, 2022



TIL* that you can embed an index of the Parquet row group pages in the file metadata. It gives a much finer granularity when fetching parts of the Parquet file, allowing for smaller requests and faster rendering on the frontend.
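To illustrate why a page-level index allows smaller requests, here is a minimal sketch in plain Python. It assumes the index has already been decoded into a sorted list of page locations for one column chunk. The `PageLocation` class mirrors the three fields of an offset index entry in the Parquet format, but `byte_range_for_rows` and the sample offsets are hypothetical names and values made up for this example, not part of any library.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class PageLocation:
    """One entry of a (hypothetical, already decoded) offset index."""
    offset: int                # byte offset of the page in the file
    compressed_page_size: int  # size of the page in bytes, header included
    first_row_index: int       # index of the page's first row in the row group


def byte_range_for_rows(pages, first_row, last_row):
    """Return the (start, end) byte range covering rows first_row..last_row.

    Assumes `pages` is sorted by first_row_index, starts at row 0, and that
    the pages are stored back to back, so one range covers all of them.
    """
    starts = [p.first_row_index for p in pages]
    lo = bisect_right(starts, first_row) - 1  # page holding first_row
    hi = bisect_right(starts, last_row) - 1   # page holding last_row
    start = pages[lo].offset
    end = pages[hi].offset + pages[hi].compressed_page_size
    return start, end


# Made-up layout: three pages of 100 rows and 1000 bytes each.
pages = [
    PageLocation(offset=4, compressed_page_size=1000, first_row_index=0),
    PageLocation(offset=1004, compressed_page_size=1000, first_row_index=100),
    PageLocation(offset=2004, compressed_page_size=1000, first_row_index=200),
]
print(byte_range_for_rows(pages, 120, 180))  # (1004, 2004)
```

Without the index, a client would have to request the whole column chunk (bytes 4 to 3004 in this made-up layout) to show rows 120 to 180; with it, a single page is enough.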

It adds some weight to the metadata, which (I guess) is why it's not enabled by default in PyArrow. Note also that the PyArrow reader itself cannot make sense of this index :) I'm not sure about the current support in other clients such as DuckDB or hyparquet.