Efficiently Scaling Polars GPU Parquet Reader

When working with large datasets, the performance of your data processing tools becomes critical. Polars, an open-source library for data manipulation known for its speed and efficiency, offers a GPU-accelerated backend powered by cuDF that can significantly boost performance. 
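For example, a lazy Polars query can be executed on the GPU simply by requesting the GPU engine at collect time (the file and column names below are hypothetical, TPC-H-style placeholders):

    import polars as pl

    # Build a lazy query against a Parquet file; nothing executes yet.
    q = (
        pl.scan_parquet("lineitem.parquet")  # hypothetical file
        .group_by("l_returnflag")
        .agg(pl.col("l_extendedprice").sum())
    )

    # Execute on the GPU-accelerated engine; Polars transparently falls
    # back to the CPU engine if the query cannot run on the GPU.
    result = q.collect(engine="gpu")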

However, to fully leverage the Polars GPU backend, it’s essential to optimize the data loading process and effectively manage the memory the workflow requires. As development of the GPU backend has progressed, several additional techniques have become available for maintaining high performance with the GPU Parquet reader as dataset size increases. The previous Polars GPU Parquet reader (up to version 24.10) did not scale to larger datasets.

This post explores how a chunked Parquet Reader, combined with Unified Virtual Memory (UVM), can outperform both nonchunked readers and CPU-based approaches.

Challenges with scale factors and nonchunked readers

As the scale factor (SF) increases, a nonchunked GPU Polars Reader (24.10) often struggles. Beyond SF200, performance degrades significantly. In some cases, such as Query 9, the nonchunked GPU reader fails—even before reaching SF50. This limitation arises due to memory constraints when loading large Parquet files into the GPU’s memory. The missing data in the nonchunked Parquet Reader plot highlights the out-of-memory (OOM) errors encountered at higher scale factors.

Figure 1. Query 13 execution reliability, 24.10 versus 24.12 Parquet Reader: the 24.12 reader completes at every scale factor, while the 24.10 reader fails to complete from SF250 onward

Improving IO and peak memory with chunked Parquet reading

To overcome these memory limitations, a chunked Parquet Reader becomes essential. By reading the Parquet file in smaller chunks, the memory footprint is reduced, enabling the Polars GPU backend to process larger datasets. For any given query, a chunked Parquet Reader with a 16 GB pass_read_limit executes successfully at more scale factors than a nonchunked reader. For Query 9, chunked Parquet reading with a 16 GB or 32 GB limit is necessary both to execute at all and to achieve better throughput.
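In code, this configuration looks roughly like the following sketch, which assumes the parquet_options passthrough accepted by pl.GPUEngine in recent cudf-polars releases (option names may differ by version):

    import polars as pl

    # Assumed options: read the Parquet file in multiple passes, capping
    # each pass at 16 GB (pass_read_limit is specified in bytes).
    engine = pl.GPUEngine(
        parquet_options={
            "chunked": True,
            "pass_read_limit": 16 * 1024**3,  # 16 GB
        }
    )

    result = pl.scan_parquet("lineitem.parquet").collect(engine=engine)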

Figure 2. Throughput comparison across scale factors for Query 9 with varying chunk sizes (pass_read_limit): the unlimited and 32 GB settings ran out of memory at larger scale factors (missing dots), while 16 GB and below completed at all dataset sizes

Reading even larger datasets with UVM

While chunked reading improves memory management, the integration of UVM takes performance to the next level. UVM enables the GPU to access system memory directly, further alleviating memory constraints and improving data transfer efficiency.
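One way to opt into UVM from Python is to install an RMM managed-memory resource before running queries. The following is a minimal sketch, assuming cudf-polars uses the current RMM device resource (as noted below, 24.12 and later enable this behavior by default):

    import rmm
    import rmm.mr
    import polars as pl

    # Back GPU allocations with CUDA managed (unified) memory so that
    # oversubscribing GPU memory spills to host RAM instead of OOM-ing.
    rmm.mr.set_current_device_resource(rmm.mr.ManagedMemoryResource())

    result = pl.scan_parquet("lineitem.parquet").collect(engine="gpu")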

For comparison, the non-UVM chunked reader encounters an OOM error before reaching SF100. Chunked reading plus UVM enables queries to execute successfully at higher scale factors, though throughput is affected.

Figure 3 shows the clear advantage: with UVM enabled, the chunked Parquet Reader executes successfully at many more scale factors than the nonchunked Parquet Reader.

Figure 3. Throughput comparison, chunked plus UVM versus CPU versus non-UVM, for Query 13 (higher is better): the non-UVM reader has the highest throughput but stops at SF200, while chunked plus UVM continues to outperform CPU through SF400

Stability and throughput 

When selecting the optimal pass_read_limit, it’s crucial to balance stability against throughput. Figures 1-3 suggest that a 16 GB or 32 GB pass_read_limit offers the best combination of the two (a sketch for sweeping these values follows the list below).

  • 32 GB pass_read_limit: All queries succeeded except Query 9 and Query 19, which failed with OOM exceptions
  • 16 GB pass_read_limit: All queries succeeded
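To find the right setting for your own hardware and data, a small sweep along these lines can expose the stability-versus-throughput trade-off (the file path and limit values are illustrative):

    import time
    import polars as pl

    GiB = 1024**3

    # Illustrative sweep over pass_read_limit values.
    for limit in (32 * GiB, 16 * GiB, 8 * GiB):
        engine = pl.GPUEngine(
            parquet_options={"chunked": True, "pass_read_limit": limit}
        )
        start = time.perf_counter()
        try:
            pl.scan_parquet("lineitem.parquet").collect(engine=engine)
            print(f"{limit // GiB} GiB: {time.perf_counter() - start:.2f} s")
        except Exception as exc:  # e.g., an OOM error at larger limits
            print(f"{limit // GiB} GiB failed: {exc}")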

Chunked-GPU versus CPU

Even with chunking, the observed throughput of each query is generally still higher than CPU Polars, and chunking allows many queries to complete that otherwise could not. A 16 GB pass_read_limit, or possibly 32 GB, therefore seems reasonable: both result in successful execution at higher scale factors than the nonchunked Parquet reader.
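As an illustration, the following sketch times the same query on the CPU engine and on the chunked GPU engine (the file and query are placeholders, not the benchmark queries above):

    import time
    import polars as pl

    def timed(engine):
        # Time one illustrative aggregation end to end.
        start = time.perf_counter()
        (
            pl.scan_parquet("lineitem.parquet")
            .group_by("l_returnflag")
            .agg(pl.len())
            .collect(engine=engine)
        )
        return time.perf_counter() - start

    gpu = pl.GPUEngine(
        parquet_options={"chunked": True, "pass_read_limit": 16 * 1024**3}
    )
    print(f"GPU (chunked): {timed(gpu):.2f} s")
    print(f"CPU: {timed('cpu'):.2f} s")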

Conclusion

For Polars GPU, a chunked Parquet Reader with UVM often outperforms both Polars CPU and a nonchunked Parquet Reader, especially when dealing with large datasets and high scale factors. By optimizing the data loading process, you can unlock the full potential of Polars GPU and achieve significant performance gains. As of cudf-polars 24.12, the chunked Parquet Reader and UVM are the default approach for reading Parquet files, which produced the improvements presented above across all queries and scale factors.

To get started, install cuDF Polars (for example, pip install polars[gpu] --extra-index-url=https://pypi.nvidia.com).
