refactor: Optimize DataFrame Reconstruction & Update Docs for Linux ARM64 Release #795

Open: wants to merge 3 commits into main

pangjunrong
Collaborator

@pangjunrong pangjunrong commented Apr 13, 2025

This change optimizes the reconstruction of Pandas DataFrames using the _from_sequence methods for ExtensionArrays, which run on bulk conversion routines using vectorized NumPy operations while avoiding the extra layers of validation in the general constructor.

It also uses the _from_mgr method for DataFrame construction using BlockManager, which assumes the BlockManager is already in a valid state and skips the validation and setup overhead that the regular pd.DataFrame(BlockManager) constructor performs.

This also resolves the warnings mentioned in #786.

In a simple, generic benchmark comparing the usual class constructor with _from_sequence, the performance gain appears consistent and substantial across BooleanArray, DatetimeArray and IntegerArray. However, I would appreciate help verifying this, as I don't fully understand how pandas internals operate at the lower level.

[Benchmark images: BooleanArray, DatetimeArray, IntegerArray]
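
A comparison of this kind could be sketched roughly as follows. This is not the benchmark used in this PR: the array size, the repeat count, and the helper names `build_with_constructor` and `build_with_from_sequence` are assumptions made for illustration, and `_from_sequence` is a private pandas method whose behavior may differ between pandas versions.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical input: a boolean NumPy block with no missing values.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=100_000).astype(bool)
mask = np.zeros(data.shape, dtype=bool)  # True would mark a missing value


def build_with_constructor(values, na_mask):
    # Public constructor: takes the values and an explicit mask.
    return pd.arrays.BooleanArray(values, na_mask)


def build_with_from_sequence(values):
    # Private pandas API; builds its own mask from the values.
    return pd.arrays.BooleanArray._from_sequence(values)


# Time both paths over the same input (total seconds for 20 calls each).
t_ctor = timeit.timeit(lambda: build_with_constructor(data, mask), number=20)
t_seq = timeit.timeit(lambda: build_with_from_sequence(data), number=20)

print(f"constructor:    {t_ctor:.6f} s")
print(f"_from_sequence: {t_seq:.6f} s")
```

Timings from a sketch like this depend heavily on the pandas version and on the input dtype, so they are best read as a sanity check rather than as evidence for either approach.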

Separately, the installation guides are updated to tell users that connectorx==0.4.3 is generally available for Linux ARM64 distributions running glibc 2.35 or later, and that connectorx==0.2.3 should be used on older distributions not covered by our new build process.

@pangjunrong pangjunrong added documentation Improvements or additions to documentation enhancement New feature or request labels Apr 13, 2025
@pangjunrong pangjunrong self-assigned this Apr 13, 2025
placement=binfo.cids[0],
)
)
elif binfo.dt == 2: # BooleanArray
bool_array = pd.core.arrays.BooleanArray._from_sequence(block_data[0])
Contributor

@wangxiaoying wangxiaoying Apr 14, 2025

From what I understand from the pandas source code (_from_sequence, coerce_to_array), it seems we will have an extra mask array constructed by this _from_sequence step, which will then be discarded and replaced by our mask array like in this example:
image

It also seems to call the BooleanArray constructor directly anyway. I'm wondering why the _from_sequence approach is still faster than the old BooleanArray(data, mask) approach, since it appears only to add the overhead of constructing an additional mask. Am I missing something here?
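
The mask behavior in question can be checked with a small sketch. The sample values and the variable names below are assumptions for illustration, not code from this PR, and `_from_sequence` is a private pandas method.

```python
import numpy as np
import pandas as pd

# Hypothetical block data and an externally supplied mask (True = missing).
data = np.array([True, False, True])
external_mask = np.array([False, True, False])

# Old approach: the constructor stores the mask that is passed in.
via_constructor = pd.arrays.BooleanArray(data, external_mask)

# New approach: _from_sequence derives a mask from the values themselves.
# A plain boolean ndarray has no missing entries, so that mask is all False.
via_from_sequence = pd.arrays.BooleanArray._from_sequence(data)

ctor_na = via_constructor.isna()
seq_na = via_from_sequence.isna()

print(ctor_na.tolist())  # [False, True, False]
print(seq_na.tolist())   # [False, False, False]
```

Under these assumptions, the array built by `_from_sequence` does not carry the external mask, so the mask has to be applied in a separate step, which is the extra work the question above refers to.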

Contributor

wangxiaoying commented Apr 14, 2025

Hi @pangjunrong , thank you so much for the PR!

The documentation, DatetimeArray and DataFrame._from_mgr parts look great to me! I have a question about the IntegerArray and BooleanArray changes, which came up when I checked the pandas code; I left it in the review.
