Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache GroupByHash raw values where appropriate #25294

Draft
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

pettyjamesm
Copy link
Member

Description

Adds logical to dynamically decide whether to cache hash values into
the hash table itself based on whether the GroupByHash instance is
spillable, container types are present in the grouping keys, there are 2 or more variable width types present, or there are 3 or more types in total.

Caching hash values in GroupByHash is a classic compute / memory
trade-off. Caching the value costs more memory (8 extra bytes per
record) but avoids relatively more expensive hash recalculation when
re-hashing the table or sorting the table contents by raw hash value for
spilling.

Caching the hash value can also make inserting new entries
cheaper by avoiding the need for relatively more expensive
valueIdentical checks when the hash value can prove that the values are
not identical.

Additional context and related issues

Builds on top of:

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Mar 12, 2025
Special cases the scenario where a single FlatHashStrategy chunk is used
in code generation to avoid allocating a MutableVariableWidthOffset
container as part of the hashFlat logic.

Also adds a specialized method signature for valueIdentical for flat vs
variable width types to avoid the same MutableVariableWidthOffset
allocations.
Adds logical to dynamically decide whether to cache hash values into
the hash table itself based on whether the GroupByHash instance is
spillable, whether container types are present in the grouping keys, the
number of variable width types, and the number of types overall.

Caching hash values in GroupByHash is a classic compute / memory
trade-off. Caching the value costs more memory (8 extra bytes per
record) but avoids relatively more expensive hash recalculation when
re-hashing the table or sorting the table contents by raw hash value for
spilling. Caching the hash value can also make inserting new entries
cheaper by avoiding the need for relatively more expensive
valueIdentical checks when the hash value can prove that the values are
not identical.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

1 participant