Fix cached hash clash in `Fixpoint` #31140

ggevay · 2025-01-22T13:53:05Z

Fixes https://github.com/MaterializeInc/database-issues/issues/8906

TransformCtx gives Transforms the hash of the plan that is the input to the transform. The hash that TransformCtx gives out is the one stored in last_hash (the "cached hash"), which has to be updated at various moments. We forgot about one such moment: Fixpoint sometimes jumps back to the beginning as part of its cycle detection mechanism (which, somewhat confusingly, also involves hashing, but is not currently connected to the last_hash mechanism). Therefore, in this case, Transform::transform's soft_assert_eq_no_log! detects the wrong hash and reports a "cached hash clash". (And if we were to go through this soft_assert_eq_no_log! in prod, then the wrong hash would lead to wrong values for the transform_hits metric. In the future, we'd like to use these hashes for more mission-critical things than just metrics, so we'd like them to be correct.)

The first commit just fixes the bug, by updating last_hash to the hash of the plan that was at the beginning of the fixpoint loop.

The second commit factors out the updating of last_hash into a function, because it now happens in 3 places.

The third commit downgrades a soft_panic_or_log! to a tracing::error! to stop https://github.com/MaterializeInc/database-issues/issues/8197 from flaking nightly, because we have accumulated enough examples to debug this when we get to this issue. The tracing:error! would still show up in Sentry if this happens in prod, which would affect the prioritization of the issue. (So far, it hasn't happened in production at all in the 7 months since I added this soft_panic_or_log.)

The fourth commit adds a regression test. (This would fail with the above soft_panic_or_log!, so this is one more reason I wanted to downgrade this now to an error log.)

cc @mgree

Motivation

This PR fixes a recognized bug: https://github.com/MaterializeInc/database-issues/issues/8906

Tips for reviewer

I suggest reviewing commit by commit.

Checklist

This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

I'd like to stop MaterializeInc#8197 from flaking Nightly, because we have accumulated enough examples to debug this when we get to this issue. But at the same time, I would still be interested in if this happens in production, because that would affect the prioritization of the issue. (So far, it hasn't happened in production at all in the 7 months since I added this soft_panic_or_log.)

ggevay added 4 commits January 22, 2025 14:19

optimizer: update last_hash when fixpoint jumps back to the beginning

52812e0

optimizer: factor out the updating of last_hash into a fn

65a4ec6

Add test for MaterializeInc#8906

646220a

ggevay requested a review from a team as a code owner January 22, 2025 13:53

ggevay added A-optimization Area: query optimization and transformation A-CLUSTER Topics related to the CLUSTER layer labels Jan 22, 2025

ggevay mentioned this pull request Jan 22, 2025

Change HIR's and LIR's EXPLAIN CTE order #31132

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix cached hash clash in `Fixpoint` #31140

Fix cached hash clash in `Fixpoint` #31140

ggevay commented Jan 22, 2025 •

edited

Loading

Fix cached hash clash in Fixpoint #31140

Are you sure you want to change the base?

Fix cached hash clash in Fixpoint #31140

Conversation

ggevay commented Jan 22, 2025 • edited Loading

Motivation

Tips for reviewer

Checklist

Fix cached hash clash in `Fixpoint` #31140

Fix cached hash clash in `Fixpoint` #31140

ggevay commented Jan 22, 2025 •

edited

Loading