MINOR: [C++] Fix typos in key_map doc and comment #45104

zanmato1984 · 2024-12-24T02:31:05Z

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

zanmato1984 · 2024-12-24T02:31:26Z

cpp/src/arrow/acero/doc/key_map.md

@@ -65,7 +65,7 @@ Every slot can either be **empty** or store data related to a single inserted ke

 Status byte, as the name suggests, stores 8 bits. The highest bit indicates if the slot is empty (the highest bit is set) or corresponds to one of inserted keys (the highest bit is zero). The remaining 7 bits contain 7 bits of key hash that we call a **stamp**. The stamp is used to eliminate some false positives when searching for a matching key for a given input. Slot also stores **key id**, which is a non-negative integer smaller than the number of inserted keys, that is used as a reference to the actual inserted key. The last piece of information related to an inserted key is its **hash** value. We store hashes for all keys, so that they never need to be re-computed. That greatly simplifies some operations, like resizing of a hash table, that may not even need to look at the keys at all. For an empty slot, the status byte is 0x80, key id is zero and the hash is not used and can be set to any number.

-A single block contains 8 slots and can be viewed as a micro-stack of up to 8 inserted keys. When the first key is inserted into an empty block, it will occupy a slot with local id 0. The second inserted key will go into slot number 1 and so on. We use N highest bits of hash to get an index of a **start block**, when searching for a match or an empty slot to insert a previously not seen key when that is the case. If the start block contains any empty slots, then the search for either a match or place to insert a key will end at that block. We will call such a block an **open block**. A block that is not open is a full block. In the case of full block, the input key related search may continue in the next block module the number of blocks. If the key is not inserted into its start block, we will refer to it as an **overflow** entry, other entries being **non-overflow**. Overflow entries are slower to process, since they require visiting more than one block, so we want to keep their percentage low. This is done by choosing the right **load factor** (percentage of occupied slots in the hash table) at which the hash table gets resized and the number of blocks gets doubled. By tuning this value we can control the probability of encountering an overflow entry.
+A single block contains 8 slots and can be viewed as a micro-stack of up to 8 inserted keys. When the first key is inserted into an empty block, it will occupy a slot with local id 0. The second inserted key will go into slot number 1 and so on. We use N highest bits of hash to get an index of a **start block**, when searching for a match or an empty slot to insert a previously not seen key when that is the case. If the start block contains any empty slots, then the search for either a match or place to insert a key will end at that block. We will call such a block an **open block**. A block that is not open is a full block. In the case of full block, the input key related search may continue in the next block modulo the number of blocks. If the key is not inserted into its start block, we will refer to it as an **overflow** entry, other entries being **non-overflow**. Overflow entries are slower to process, since they require visiting more than one block, so we want to keep their percentage low. This is done by choosing the right **load factor** (percentage of occupied slots in the hash table) at which the hash table gets resized and the number of blocks gets doubled. By tuning this value we can control the probability of encountering an overflow entry.


module -> modulo

zanmato1984 · 2024-12-24T04:55:10Z

cc @kou @pitrou

kou · 2024-12-24T05:31:59Z

cpp/src/arrow/compute/key_map_internal.cc

@@ -281,7 +281,7 @@ void SwissTable::early_filter_imp(const int num_keys, const uint32_t* hashes,
 // When we reach this limit, we need to break processing of any further rows and resize.
 //
 uint64_t SwissTable::num_groups_for_resize() const {
-  // Resize small hash tables when 50% full (up to 12KB).
+  // Resize small hash tables when 50% full (up to 32KB).


Sorry, how to compute 32KB?

The log_blocks_small = 9 means that we consider 1 << 9 = 512 blocks as "small". A block is a swiss-table-specific concept explained in https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/doc/key_map.md , which is 64-byte. So the small hash table ends up with 512 * 64B = 32KB.

So the original 12KB is obviously a typo.

kou

+1

kou · 2024-12-24T06:11:27Z

cpp/src/arrow/compute/key_map_internal.cc

@@ -281,7 +281,7 @@ void SwissTable::early_filter_imp(const int num_keys, const uint32_t* hashes,
 // When we reach this limit, we need to break processing of any further rows and resize.
 //
 uint64_t SwissTable::num_groups_for_resize() const {
-  // Resize small hash tables when 50% full (up to 12KB).
+  // Resize small hash tables when 50% full (up to 32KB).


zanmato1984 · 2024-12-24T06:20:41Z

The CI failure is unrelated. I'm merging this.

Thank you @kou for looking!

### Rationale for this change In my last typo fix in #45104 I made a mistake: https://github.com/apache/arrow/pull/45104/files#diff-a1f24396d20cd510ae48838e684fbc5ba7b23475702ae849a2d13bf48e5251aaR284 I was too careless. The size of a block in swiss table is much more subtle than I thought. It turns out the original `12KB` was right. ### What changes are included in this PR? Removing my last wrong change and elaborate a bit about how the number `12KB` comes. ### Are these changes tested? ### Are there any user-facing changes? Authored-by: Rossi Sun <[email protected]> Signed-off-by: Rossi Sun <[email protected]>

Fix typos in key_map doc and comment

d905f48

zanmato1984 requested a review from westonpace as a code owner December 24, 2024 02:31

github-actions bot added Component: C++ Component: Documentation labels Dec 24, 2024

zanmato1984 commented Dec 24, 2024

View reviewed changes

github-actions bot added awaiting review Awaiting review awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Dec 24, 2024

kou reviewed Dec 24, 2024

View reviewed changes

github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Dec 24, 2024

kou approved these changes Dec 24, 2024

View reviewed changes

github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Dec 24, 2024

zanmato1984 merged commit e7471f8 into apache:main Dec 24, 2024
41 of 42 checks passed

zanmato1984 removed the awaiting merge Awaiting merge label Dec 24, 2024

zanmato1984 deleted the swiss-join-typo branch December 24, 2024 06:22

zanmato1984 mentioned this pull request Dec 26, 2024

MINOR: [C++] Fix the wrong comment in #45104 #45111

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MINOR: [C++] Fix typos in key_map doc and comment #45104

MINOR: [C++] Fix typos in key_map doc and comment #45104

zanmato1984 commented Dec 24, 2024

zanmato1984 Dec 24, 2024

zanmato1984 commented Dec 24, 2024

kou Dec 24, 2024

zanmato1984 Dec 24, 2024

kou Dec 24, 2024

kou left a comment

kou Dec 24, 2024

zanmato1984 commented Dec 24, 2024

MINOR: [C++] Fix typos in key_map doc and comment #45104

MINOR: [C++] Fix typos in key_map doc and comment #45104

Conversation

zanmato1984 commented Dec 24, 2024

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

zanmato1984 Dec 24, 2024

Choose a reason for hiding this comment

zanmato1984 commented Dec 24, 2024

kou Dec 24, 2024

Choose a reason for hiding this comment

zanmato1984 Dec 24, 2024

Choose a reason for hiding this comment

kou Dec 24, 2024

Choose a reason for hiding this comment

kou left a comment

Choose a reason for hiding this comment

kou Dec 24, 2024

Choose a reason for hiding this comment

zanmato1984 commented Dec 24, 2024