When working with large datasets, the performance of your data processing tools becomes critical. Polars, an open-source library for data manipulation known for its speed and efficiency, offers a GPU-accelerated backend powered by cuDF that can significantly boost performance.
However, to fully leverage the power of the Polars GPU backend, it's essential to optimize the data loading process and effectively manage the memory required by the workflow. As development of the GPU backend has progressed, several additional techniques have become available for maintaining high performance as dataset size increases when using the GPU Parquet reader. The original Polars GPU Parquet reader (up to version 24.10) did not scale to larger dataset sizes.
This post explores how a chunked Parquet reader, combined with Unified Virtual Memory (UVM), can outperform both nonchunked readers and CPU-based approaches.
Challenges with scale factors and nonchunked readers
As the scale factor (SF) increases, the nonchunked GPU Polars reader (24.10) often struggles. Beyond SF200, performance degrades significantly. In some cases, such as Query 9, the nonchunked GPU reader fails even before reaching SF50. This limitation arises from memory constraints when loading large Parquet files into GPU memory. The missing data in the nonchunked Parquet reader plot highlights the out-of-memory (OOM) errors encountered at higher scale factors.

Improving IO and peak memory with chunked Parquet reading
To overcome these memory limitations, a chunked Parquet reader becomes essential. Reading the Parquet file in smaller chunks reduces the memory footprint, enabling Polars GPU to process larger datasets. Using a chunked Parquet reader with a 16 GB `pass_read_limit` enables successful execution at more scale factors than a nonchunked reader for any given query. For Query 9, chunked Parquet reading with a 16 GB or 32 GB limit is necessary to execute at all and to achieve better throughput.

[Figure: throughput with varying chunk sizes (`pass_read_limit`) across scale factors for Query 9]

Reading even larger datasets with UVM
While chunked reading improves memory management, the integration of UVM takes performance to the next level. UVM enables the GPU to access system memory directly, further alleviating memory constraints and improving data transfer efficiency.
To provide a comparison, the chunked reader without UVM encounters an OOM error before reaching SF100. Chunked reading combined with UVM enables successful execution of queries at higher scale factors, though throughput is reduced.
Figure 3 shows the clear advantage: many more scale factors execute successfully with a chunked Parquet reader with UVM enabled than with a nonchunked Parquet reader.
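Under the hood, UVM corresponds to backing GPU allocations with CUDA managed memory. The following configuration sketch assumes an NVIDIA GPU with the `rmm` and cudf-polars packages installed; the class and function names come from RMM's documented Python interface, but verify them against your installed versions.

```python
import rmm
import rmm.mr
import polars as pl

# Managed (unified) memory lets GPU allocations oversubscribe device memory,
# with pages migrating between host and device on demand instead of failing
# with an OOM error.
mr = rmm.mr.ManagedMemoryResource()
rmm.mr.set_current_device_resource(mr)  # make it the default GPU allocator

# Point the Polars GPU engine at the managed memory resource.
engine = pl.GPUEngine(memory_resource=mr)

# Hypothetical large input; with UVM the query can exceed device memory:
# result = pl.scan_parquet("large.parquet").collect(engine=engine)
```

The trade-off described above applies: page migration between host and device keeps large queries running, but at reduced throughput compared to data that fits entirely in device memory.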

Stability and throughput
When selecting the optimal `pass_read_limit`, it's crucial to consider the balance between stability and throughput. Figures 1-3 suggest that a 16 GB or 32 GB `pass_read_limit` is the best combination of the two.
- 32 GB `pass_read_limit`: all queries succeeded except Query 9 and Query 19, which failed with OOM exceptions
- 16 GB `pass_read_limit`: all queries succeeded
Chunked-GPU versus CPU
The observed throughput of each query is generally still higher than with CPU Polars, and chunking allows many queries to complete that would not complete otherwise. A 16 GB or 32 GB `pass_read_limit` therefore seems reasonable: it results in successful execution at higher scale factors than the nonchunked Parquet reader.
Conclusion
For Polars GPU, a chunked Parquet reader with UVM often outperforms both Polars CPU and a nonchunked Parquet reader, especially when dealing with large datasets at high scale factors. By optimizing the data loading process, you can unlock the full potential of Polars GPU and achieve significant performance gains. As of cudf-polars 24.12 and above, chunked Parquet reading and UVM are the default approach for reading a Parquet file, which has resulted in the improvements presented above across all queries and scale factors.
To get started, install cuDF Polars.