The hash join algorithm is a variation of the join algorithm .
The algorithm receives two tables and a join condition at the input. The result of his work is a table with the results of the connection.
The smaller of the two input tables is placed in a special data structure in memory: a hash table , which provides a very high search speed. Then, for each row from the larger table, a search is made for the values that match the join condition. The results are placed in the output table.
On pseudo code, the algorithm can be described like this:
[HashTable] = Build a HashTable ([SmallerTable], [Column Names of the SmallerTable by which the connection will be made]);
For each row [r] from [BigTable]
Print ([r], Search in the HeshTable ([HashTable], [Column Names of the BigTable by which the connection is made]));
Benefits
- A hash join is significantly faster than a nested loop join. With a relatively small size of a smaller table, this is the most efficient type of join.
Weaknesses
- A join condition can only be equality.
- A large memory requirement for constructing a hash table, which extremely limits the scalability of the algorithm when increasing the size of a smaller table.
- The hash table must be completely built before the first result is written to the resulting table, which makes this type of join unacceptable if you need to get the first row of the result as quickly as possible.
Real systems use more sophisticated hashing schemes than in the above example, mainly aimed at reducing the memory requirement for constructing a hash table. For example, the data of both tables is divided into parts, and a hash table is built for only one of these parts.