A focused study comparing the speed of reading Parquet files with PyArrow against reading identical CSV files with Pandas


Why Parquet in lieu of CSV?

Because you may want to read large data files 50X faster than you can with the built-in functions of Pandas!

Comma-separated values (CSV) is a flat-file format used widely in data analytics. It is simple to work with and performs decently in small to medium data regimes. However, as you do data processing with bigger files (and also, perhaps, pay for the cloud-based storage of them), there are some excellent reasons to move towards file formats using the columnar data storage principle.

Apache Parquet is one of the most popular of these formats. The article below discusses some of its advantages over traditional row-based formats (e.g., a flat CSV file).

Apache Parquet: How to be a hero with the open-source columnar data format

In short,

  • Being column-oriented, Parquet brings all the efficient storage characteristics (e.g., blocks, row groups, column chunks) to the table.
  • Apache Parquet is built from the ground up using the record shredding and assembly algorithm described in Google's Dremel paper.
  • Parquet files were designed with complex nested data structures in mind.
  • Apache Parquet is built to support very efficient compression and encoding schemes. A Parquet file can be compressed using various clever methods such as (a) dictionary encoding, (b) bit packing, and (c) run-length encoding.
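
To make two of the encoding ideas above concrete, here is a deliberately simplified sketch in plain Python. It is not Parquet's actual implementation (the real format works on binary pages and combines run-length encoding with bit packing); the function names and the toy column are assumptions chosen only for illustration.

```python
def dictionary_encode(values):
    """Replace each value with a small integer index into a list of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    index_of = {}     # value -> position in `dictionary`
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def run_length_encode(values):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


# Toy column with few distinct values and long runs (hypothetical data).
column = ["red", "red", "red", "blue", "blue", "red"]
dictionary, indices = dictionary_encode(column)
runs = run_length_encode(indices)
print(dictionary, indices, runs)
```

Columns with few distinct values or long runs of repeats shrink the most under schemes like these, which is one reason columnar layouts compress well: values of the same kind sit next to each other.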

Parquet files are an excellent choice in the situation when you have to store and read large data files from disk or cloud storage. For data analysis with Python, we all use Pandas widely. In this article, we will show that using Parquet files with Apache Arrow gives you an impressive speed advantage compared to using CSV files with Pandas while reading the content of large files. In particular, we will talk about the impact of,

  • file size
  • number of columns that are being read
  • the sparsity of the file (missing values)

PyArrow for reading

PyArrow is a Python binding (API) for the Apache Arrow framework. As per their website: “Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.”

These features make Apache Arrow one of the fastest-growing frameworks for distributed in-memory data analytics. And to boot, it turns out to be an ideal in-memory transport layer for reading or writing data with Parquet files.

Using PyArrow with Parquet files can lead to an impressive speed advantage in terms of the reading speed of large data files, and, once read, the in-memory object can be transformed into a regular Pandas DataFrame easily.

To know more about the full features of PyArrow, please consult the Apache documentation.

How fast is the PyArrow/Parquet combination?

The code for this article is in my GitHub repo.

CSV and Parquet files of various sizes

First, we create various CSV files filled with randomly generated floating-point numbers. We also convert them into zipped (compressed) parquet files. All of the files have 100 columns but a varying number of rows to lend them different file sizes.

The directory may look like this after the process.


Pandas CSV vs. Arrow Parquet reading speed

Now, we can write two small chunks of code to read these files using Pandas read_csv and PyArrow’s read_table functions. We also monitor the time it takes to read the file and compare them in the form of a ratio. The result is shown below,

Although there are some ups and downs in the trend, it is clear that the PyArrow/Parquet combination shines for larger file sizes, i.e., as the file size grows, it becomes increasingly faster to store the data in Parquet format and read it with PyArrow.

The ratio exceeds 10 for a 100 MB file, i.e., reading is more than 10X faster! For GB-sized files, the advantage can be even higher.

Reading a small number of columns is much faster with Arrow

Next, we show something even cooler. Often, we may not need to read all the columns from a columnar storage file. For example, we may apply some filter on the data and choose only selected data for the actual in-memory processing.

With CSV files or regular SQL databases, this means choosing specific rows out of all the data. For a columnar format, however, it effectively means choosing specific columns.

Let us see if we have an added advantage in terms of reading speed when we are reading only a small fraction of columns from the Parquet file. Here is the result from our analysis,

When we read a very small fraction of columns, say fewer than 10 out of 100, the reading speed ratio exceeds 50, i.e., we get a more than 50X speedup compared to regular Pandas CSV file reading. The speedup tapers off for larger fractions of columns and settles down to a stable value.


PyArrow (Parquet) reading speed varies with sparsity in the file

Next, we look at the impact of sparsity on the reading speed of the Parquet file. In many situations, the data can have a lot of sparsity, i.e., no values are recorded. This is common, for example, in sensor-driven data analysis, where various sensors record data at different frequencies and intervals, and large portions of the data matrix are filled with NaN values.

In our analysis, we artificially injected NumPy NaN values into a fixed-size file, saved it in Parquet format, and read it using PyArrow. The result is shown below. Clearly, sparse files are read much faster by PyArrow than dense data files. This behavior can be used to our advantage depending on the type of data we may encounter.

Summary

In this article, we presented a focused comparison of reading speed between the Pandas/CSV and Apache Arrow/Parquet combinations. We showed that Apache Arrow has a significant advantage in reading speed over Pandas CSV and how this advantage varies with the size of the dataset. We also showed that reading a small fraction of columns is inherently faster for this type of column-oriented file format. Finally, we showed the impact of sparsity on the reading speed of Apache Arrow.

You can check the author’s GitHub repositories for code, ideas, and resources in machine learning and data science. If you are, like me, passionate about AI/machine learning/data science, please feel free to add me on LinkedIn or follow me on Twitter.

Tirthajyoti Sarkar - Data Science and Solutions Engineering Manager - Adapdix Corporation | LinkedIn


How fast is reading Parquet file (with Arrow) vs. CSV with Pandas? was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.