Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MINOR: [C++] Fix typos in key_map doc and comment #45104

Merged
merged 1 commit into from
Dec 24, 2024

Conversation

zanmato1984
Copy link
Contributor

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@@ -65,7 +65,7 @@ Every slot can either be **empty** or store data related to a single inserted ke

Status byte, as the name suggests, stores 8 bits. The highest bit indicates if the slot is empty (the highest bit is set) or corresponds to one of inserted keys (the highest bit is zero). The remaining 7 bits contain 7 bits of key hash that we call a **stamp**. The stamp is used to eliminate some false positives when searching for a matching key for a given input. Slot also stores **key id**, which is a non-negative integer smaller than the number of inserted keys, that is used as a reference to the actual inserted key. The last piece of information related to an inserted key is its **hash** value. We store hashes for all keys, so that they never need to be re-computed. That greatly simplifies some operations, like resizing of a hash table, that may not even need to look at the keys at all. For an empty slot, the status byte is 0x80, key id is zero and the hash is not used and can be set to any number.

A single block contains 8 slots and can be viewed as a micro-stack of up to 8 inserted keys. When the first key is inserted into an empty block, it will occupy a slot with local id 0. The second inserted key will go into slot number 1 and so on. We use N highest bits of hash to get an index of a **start block**, when searching for a match or an empty slot to insert a previously not seen key when that is the case. If the start block contains any empty slots, then the search for either a match or place to insert a key will end at that block. We will call such a block an **open block**. A block that is not open is a full block. In the case of full block, the input key related search may continue in the next block module the number of blocks. If the key is not inserted into its start block, we will refer to it as an **overflow** entry, other entries being **non-overflow**. Overflow entries are slower to process, since they require visiting more than one block, so we want to keep their percentage low. This is done by choosing the right **load factor** (percentage of occupied slots in the hash table) at which the hash table gets resized and the number of blocks gets doubled. By tuning this value we can control the probability of encountering an overflow entry.
A single block contains 8 slots and can be viewed as a micro-stack of up to 8 inserted keys. When the first key is inserted into an empty block, it will occupy a slot with local id 0. The second inserted key will go into slot number 1 and so on. We use N highest bits of hash to get an index of a **start block**, when searching for a match or an empty slot to insert a previously not seen key when that is the case. If the start block contains any empty slots, then the search for either a match or place to insert a key will end at that block. We will call such a block an **open block**. A block that is not open is a full block. In the case of full block, the input key related search may continue in the next block modulo the number of blocks. If the key is not inserted into its start block, we will refer to it as an **overflow** entry, other entries being **non-overflow**. Overflow entries are slower to process, since they require visiting more than one block, so we want to keep their percentage low. This is done by choosing the right **load factor** (percentage of occupied slots in the hash table) at which the hash table gets resized and the number of blocks gets doubled. By tuning this value we can control the probability of encountering an overflow entry.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

module -> modulo

@github-actions github-actions bot added awaiting review Awaiting review awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Dec 24, 2024
@zanmato1984
Copy link
Contributor Author

cc @kou @pitrou

@@ -281,7 +281,7 @@ void SwissTable::early_filter_imp(const int num_keys, const uint32_t* hashes,
// When we reach this limit, we need to break processing of any further rows and resize.
//
uint64_t SwissTable::num_groups_for_resize() const {
// Resize small hash tables when 50% full (up to 12KB).
// Resize small hash tables when 50% full (up to 32KB).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, how to compute 32KB?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log_blocks_small = 9 means that we consider 1 << 9 = 512 blocks as "small". A block is a swiss-table-specific concept explained in https://github.com/apache/arrow/blob/main/cpp/src/arrow/acero/doc/key_map.md , which is 64-byte. So the small hash table ends up with 512 * 64B = 32KB.

So the original 12KB is obviously a typo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Dec 24, 2024
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -281,7 +281,7 @@ void SwissTable::early_filter_imp(const int num_keys, const uint32_t* hashes,
// When we reach this limit, we need to break processing of any further rows and resize.
//
uint64_t SwissTable::num_groups_for_resize() const {
// Resize small hash tables when 50% full (up to 12KB).
// Resize small hash tables when 50% full (up to 32KB).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Dec 24, 2024
@zanmato1984
Copy link
Contributor Author

The CI failure is unrelated. I'm merging this.

Thank you @kou for looking!

@zanmato1984 zanmato1984 merged commit e7471f8 into apache:main Dec 24, 2024
41 of 42 checks passed
@zanmato1984 zanmato1984 removed the awaiting merge Awaiting merge label Dec 24, 2024
@zanmato1984 zanmato1984 deleted the swiss-join-typo branch December 24, 2024 06:22
zanmato1984 added a commit that referenced this pull request Dec 26, 2024
### Rationale for this change

In my last typo fix in #45104 I made a mistake: https://github.com/apache/arrow/pull/45104/files#diff-a1f24396d20cd510ae48838e684fbc5ba7b23475702ae849a2d13bf48e5251aaR284 I was too careless. The size of a block in swiss table is much more subtle than I thought. It turns out the original `12KB` was right.

### What changes are included in this PR?

Removing my last wrong change and elaborate a bit about how the number `12KB` comes.

### Are these changes tested?

### Are there any user-facing changes?

Authored-by: Rossi Sun <[email protected]>
Signed-off-by: Rossi Sun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants