-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MINOR: [C++] Fix typos in key_map doc and comment #45104
Conversation
@@ -65,7 +65,7 @@ Every slot can either be **empty** or store data related to a single inserted ke | |||
|
|||
Status byte, as the name suggests, stores 8 bits. The highest bit indicates if the slot is empty (the highest bit is set) or corresponds to one of inserted keys (the highest bit is zero). The remaining 7 bits contain 7 bits of key hash that we call a **stamp**. The stamp is used to eliminate some false positives when searching for a matching key for a given input. Slot also stores **key id**, which is a non-negative integer smaller than the number of inserted keys, that is used as a reference to the actual inserted key. The last piece of information related to an inserted key is its **hash** value. We store hashes for all keys, so that they never need to be re-computed. That greatly simplifies some operations, like resizing of a hash table, that may not even need to look at the keys at all. For an empty slot, the status byte is 0x80, key id is zero and the hash is not used and can be set to any number. | |||
|
|||
A single block contains 8 slots and can be viewed as a micro-stack of up to 8 inserted keys. When the first key is inserted into an empty block, it will occupy a slot with local id 0. The second inserted key will go into slot number 1 and so on. We use N highest bits of hash to get an index of a **start block**, when searching for a match or an empty slot to insert a previously not seen key when that is the case. If the start block contains any empty slots, then the search for either a match or place to insert a key will end at that block. We will call such a block an **open block**. A block that is not open is a full block. In the case of full block, the input key related search may continue in the next block module the number of blocks. If the key is not inserted into its start block, we will refer to it as an **overflow** entry, other entries being **non-overflow**. Overflow entries are slower to process, since they require visiting more than one block, so we want to keep their percentage low. This is done by choosing the right **load factor** (percentage of occupied slots in the hash table) at which the hash table gets resized and the number of blocks gets doubled. By tuning this value we can control the probability of encountering an overflow entry. | |||
A single block contains 8 slots and can be viewed as a micro-stack of up to 8 inserted keys. When the first key is inserted into an empty block, it will occupy a slot with local id 0. The second inserted key will go into slot number 1 and so on. We use N highest bits of hash to get an index of a **start block**, when searching for a match or an empty slot to insert a previously not seen key when that is the case. If the start block contains any empty slots, then the search for either a match or place to insert a key will end at that block. We will call such a block an **open block**. A block that is not open is a full block. In the case of full block, the input key related search may continue in the next block modulo the number of blocks. If the key is not inserted into its start block, we will refer to it as an **overflow** entry, other entries being **non-overflow**. Overflow entries are slower to process, since they require visiting more than one block, so we want to keep their percentage low. This is done by choosing the right **load factor** (percentage of occupied slots in the hash table) at which the hash table gets resized and the number of blocks gets doubled. By tuning this value we can control the probability of encountering an overflow entry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
module -> modulo
@@ -281,7 +281,7 @@ void SwissTable::early_filter_imp(const int num_keys, const uint32_t* hashes, | |||
// When we reach this limit, we need to break processing of any further rows and resize. | |||
// | |||
uint64_t SwissTable::num_groups_for_resize() const { | |||
// Resize small hash tables when 50% full (up to 12KB). | |||
// Resize small hash tables when 50% full (up to 32KB). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, how to compute 32KB
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The log_blocks_small = 9
means that we consider 1 << 9 = 512
blocks as "small". A block is a swiss-table-specific concept explained in https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/doc/key_map.md , which is 64-byte. So the small hash table ends up with 512 * 64B = 32KB
.
So the original 12KB
is obviously a typo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@@ -281,7 +281,7 @@ void SwissTable::early_filter_imp(const int num_keys, const uint32_t* hashes, | |||
// When we reach this limit, we need to break processing of any further rows and resize. | |||
// | |||
uint64_t SwissTable::num_groups_for_resize() const { | |||
// Resize small hash tables when 50% full (up to 12KB). | |||
// Resize small hash tables when 50% full (up to 32KB). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
The CI failure is unrelated. I'm merging this. Thank you @kou for looking! |
### Rationale for this change In my last typo fix in #45104 I made a mistake: https://github.com/apache/arrow/pull/45104/files#diff-a1f24396d20cd510ae48838e684fbc5ba7b23475702ae849a2d13bf48e5251aaR284 I was too careless. The size of a block in swiss table is much more subtle than I thought. It turns out the original `12KB` was right. ### What changes are included in this PR? Removing my last wrong change and elaborate a bit about how the number `12KB` comes. ### Are these changes tested? ### Are there any user-facing changes? Authored-by: Rossi Sun <[email protected]> Signed-off-by: Rossi Sun <[email protected]>
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?