refactor: Optimize DataFrame Reconstruction & Update Docs for Linux ARM64 Release #795

pangjunrong · 2025-04-13T19:47:09Z

This change optimizes how pandas DataFrames are reconstructed by building ExtensionArrays through their _from_sequence methods, which use bulk conversion routines built on vectorized NumPy operations and avoid the extra validation layers of the general constructor.

It also constructs the DataFrame from the BlockManager via the _from_mgr method, which assumes the BlockManager is already in a valid state and skips the validation and setup overhead of the regular pd.DataFrame(BlockManager) constructor.

This also resolves the warnings mentioned in #786.

In a simple, generic benchmark comparing regular class initialization against _from_sequence, the performance gain appears consistent and substantial across BooleanArray, DatetimeArray and IntegerArray. However, I would appreciate help verifying this, as I don't fully understand how pandas internals operate at the lower level.
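
For anyone who wants to verify the numbers, a comparison of this kind could be sketched roughly as follows. This is not the benchmark that was actually run for this PR: the array size, the repeat count, and the helper name bench are assumptions chosen for illustration, and DatetimeArray is left out to keep it short. Note that _from_sequence is a private pandas method, so its behaviour may differ between pandas versions, and that the two paths are not strictly equivalent because the direct constructor receives a ready-made mask while _from_sequence builds its own.

```python
import timeit

import numpy as np
from pandas.arrays import BooleanArray, IntegerArray

# Illustrative inputs (assumptions): 100k values with no missing entries.
n = 100_000
rng = np.random.default_rng(0)
bool_data = rng.integers(0, 2, size=n).astype(bool)
int_data = rng.integers(0, 1_000, size=n, dtype=np.int64)
no_nulls = np.zeros(n, dtype=bool)  # mask: True would mark a missing value


def bench(label, fn, repeat=20):
    """Return the mean wall-clock seconds per call of fn (hypothetical helper)."""
    seconds = timeit.timeit(fn, number=repeat) / repeat
    print(f"{label}: {seconds * 1e6:.1f} us per call")
    return seconds


results = {
    "BooleanArray(data, mask)": bench(
        "BooleanArray(data, mask)", lambda: BooleanArray(bool_data, no_nulls)
    ),
    "BooleanArray._from_sequence": bench(
        "BooleanArray._from_sequence", lambda: BooleanArray._from_sequence(bool_data)
    ),
    "IntegerArray(data, mask)": bench(
        "IntegerArray(data, mask)", lambda: IntegerArray(int_data, no_nulls)
    ),
    "IntegerArray._from_sequence": bench(
        "IntegerArray._from_sequence", lambda: IntegerArray._from_sequence(int_data)
    ),
}
```

Timings depend heavily on the pandas version, the input dtype, and the array size, so results from a sketch like this should be read as indicative only.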

Separately, the installation guides now tell users that connectorx==0.4.3 is generally available for Linux ARM64 distributions running glibc 2.35 or later, and that connectorx==0.2.3 should be used on older distributions not covered by our new build process.


wangxiaoying · 2025-04-14T00:13:02Z

connectorx-python/connectorx/__init__.py

                    placement=binfo.cids[0],
                )
            )
        elif binfo.dt == 2:  # BooleanArray
+            bool_array = pd.core.arrays.BooleanArray._from_sequence(block_data[0])


From what I understand from the pandas source code (_from_sequence, coerce_to_array), it seems we will have an extra mask array constructed by this _from_sequence step, which will then be discarded and replaced by our mask array like in this example:

It also seems to call the BooleanArray constructor directly anyway. I'm wondering why the _from_sequence approach is still faster than the old BooleanArray(data, mask) approach, since it appears only to add the overhead of constructing an extra mask. Am I missing something here?
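
The mask behaviour in question can be checked with a small sketch. It assumes a plain boolean NumPy array as input (the real contents of block_data are not shown in this thread), and the names data, mask, direct and inferred are illustrative only. Since _from_sequence is a private pandas method, what it does internally may vary across pandas versions.

```python
import numpy as np
from pandas.arrays import BooleanArray

# Illustrative inputs (assumptions, not taken from connectorx).
data = np.array([True, False, True])
mask = np.array([False, True, False])  # True marks a missing value

# Old approach: the caller passes values and mask together.
direct = BooleanArray(data, mask)

# New approach: pandas derives a mask from the values alone.
inferred = BooleanArray._from_sequence(data)

print(direct.isna())    # the mask supplied by the caller
print(inferred.isna())  # the mask built by _from_sequence
```

For a plain boolean ndarray the derived mask marks nothing as missing, so any nulls would have to be applied in a separate step afterwards.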

wangxiaoying · 2025-04-14T00:17:38Z

Hi @pangjunrong , thank you so much for the PR!

The documentation, DatetimeArray and DataFrame._from_mgr parts look great to me! I have a question about the IntegerArray and BooleanArray changes after checking the pandas code, which I left in the review.

pangjunrong added 3 commits April 13, 2025 19:46

refactor extensionarrays to use from_sequence, use df from_mgr to res…

8669518

…olve warnings

updated install readme for linux arm64 release

ba5dbac

update readme for linux arm64 release

a7f708b

pangjunrong added the documentation and enhancement labels Apr 13, 2025

pangjunrong requested a review from wangxiaoying April 13, 2025 19:47

pangjunrong self-assigned this Apr 13, 2025

wangxiaoying reviewed Apr 14, 2025
