Hash Join

From PostgreSQL wiki
Revision as of 23:43, 31 October 2019 by Macdice (talk | contribs)

Here is a page to track ideas and ongoing work for hash joins. See also the Parallel_Hash page for parallelism-specific ideas.

Some active discussions with patches:

  • Hash joins can decide to use a huge number of partitions in order to fit into work_mem, but the per-partition book-keeping is unmetered, so it can itself consume far more than work_mem. Thread discussing that.
  • We could use Bloom filters: build a filter over the inner side's hash values during the build phase, then consult it on the probe side to cheaply reject tuples that cannot possibly match, before touching the hash table at all. Thread Thread

Observations without patches:

  • nodeHashjoin.c's switch case HJ_SCAN_BUCKET calls ExecScanHashBucket(), which contains ExecQualAndReset(hash clauses), and then after that returns it immediately does ExecQual(joinqual). So we pay the cost of evaluator setup twice, instead of doing both together. The reason we have to do this is that EXPLAIN needs to count the two kinds of filtering separately, and we don't have a mechanism to build a single expression that ANDs together two filters but maintains two separate counters. This might apply to other join types too. (Observation made by Andres Freund in private discussion with Thomas Munro, who added this note here.)
  • Tuples are copied while being loaded into the hash table, during conversion to minimal tuple format; we really just want to copy them directly into place in the hash table, having allocated the memory!
  • Memory is allocated in chunks that waste a ton of "invisible" memory on some OSes. Thread.
  • Andres: There are too many indirections for hash table lookups. A successful lookup needs the following pointer dereferences: 1) HashJoinState->hj_HashTable (and a bunch of related state), 2) HashJoinTable->unshared, 3) HashJoinTable->unshared[bucket] (likely uncached), 4) HashJoinTuple->hashvalue (likely uncached)
    • We should work on at least removing the indirection for ->hashvalue, by moving the hashvalue into HashJoinTable->unshared[bucket], rather than that being just a pointer. It might be worthwhile to also store the ->next pointer inline.
    • It could also be worthwhile to just embed the HashJoinTableData into the HashJoinState. As far as I can tell there's really no need for that to be a separate pointer indirection.
  • Andres: The whole notion of HashJoinTuple is suspect - needing to perform the offsetting math in various places costs cycles itself, but also makes it harder to avoid unnecessary allocations in other places.
  • Andres: The bucket scanning API (ExecScanHashBucket()) leads to unnecessary unpredictable branches, in a path that already suffers heavily from stalls.
    • In a query where every lookup is a match (very common), the branch that decides whether to continue lookups in a previous bucket, or perform a fresh lookup, reaches both paths at roughly 50% - the worst case for a branch predictor. Instead there should be a separate function for continuing a bucket search, which should be called either by branching earlier in the caller (will be very predictable), or even better by having a separate hj_JoinState value for continuing a bucket search.
    • Currently the same "if" determines whether there is a match in a fresh lookup (common), and whether there's further tuples in a bucket (uncommon). That leads to inefficient branch prediction. Should be split.
    • It might be worthwhile to move the check whether to search the skew hashtable into either a) the ExecHashJoinImpl() state-machine, by having a separate state for searching buckets in the skew hashtable, or b) just ExecHashJoinImpl() itself, which would at least allow reusing the result of previous checks.
  • Perhaps different hash join nodes could share a hash table, for the benefit of partition-wise joins. Thread.

Some work on the horizon:

  • Andres Freund is working on transforming execution plans into "linear programs" of opcodes (and eventually probably also machine code via LLVM), like SQLite and System R. This means we'll need to figure out how to break our hash join algorithm into steps that can be expressed that way.
  • Even before we get that far, there may be micro-optimisation opportunities where we JIT some of the hashing work? Andres: Indeed, the evaluation of the hash keys is fairly expensive, especially with multiple columns. We should add an expression step for mixing hash values, and use that to build an expression that does not just evaluate the key, but immediately hashes it.

Some highly speculative work:

  • It's possible to make hash joins go faster by peeking ahead at the next tuple to be probed, and prefetching the right memory cache line. Experimental hack thread with links to academic papers. To do this well might require executor changes so that we can get a batch of tuples at the same time, and process them without escaping the current node.