Fix cached hash clash in Fixpoint
#31140
Open
+229
−8
Fixes https://github.com/MaterializeInc/database-issues/issues/8906
`TransformCtx` gives `Transform`s the hash of the plan that is the input to the transform. The hash that `TransformCtx` gives out is the one stored in `last_hash` (the "cached hash"), which has to be updated at various moments. We forgot about one such moment: `Fixpoint` sometimes jumps back to the beginning as part of its cycle detection mechanism (which, somewhat confusingly, also involves hashing, but is not currently connected to the `last_hash` mechanism). Therefore, in this case, `Transform::transform`'s `soft_assert_eq_no_log!` detects the wrong hash and reports a "cached hash clash". (And if we were to go through this `soft_assert_eq_no_log!` in prod, then the wrong hash would lead to wrong values for the `transform_hits` metric. In the future, we'd like to use these hashes for more mission-critical things than just metrics, so we'd like them to be correct.)

The first commit just fixes the bug, by updating `last_hash` to the hash of the plan that was at the beginning of the fixpoint loop.

The second commit factors out the updating of `last_hash` into a function, because it now happens in 3 places.

The third commit downgrades a `soft_panic_or_log!` to a `tracing::error!` to stop https://github.com/MaterializeInc/database-issues/issues/8197 from flaking nightly, because we have accumulated enough examples to debug this when we get to this issue. The `tracing::error!` would still show up in Sentry if this happens in prod, which would affect the prioritization of the issue. (So far, it hasn't happened in production at all in the 7 months since I added this `soft_panic_or_log!`.)

The fourth commit adds a regression test. (This would fail with the above `soft_panic_or_log!`, so this is one more reason I wanted to downgrade this now to an error log.)

cc @mgree
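For readers who want to see the failure mode in isolation, here is a minimal, self-contained sketch of the cached-hash clash and of the fix. It is not the code in this PR: `Plan`, `Ctx`, `hash_plan`, `update_last_hash`, `check_cached_hash`, `run_transform`, and `fixpoint_with_reset` are all hypothetical stand-ins, and the clash is counted in a field rather than reported through `soft_assert_eq_no_log!`.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a query plan (the real plan type is far richer).
#[derive(Clone, Debug, Hash, PartialEq, Eq)]
struct Plan(Vec<u32>);

/// Hash a plan with the standard library's default hasher.
fn hash_plan(plan: &Plan) -> u64 {
    let mut hasher = DefaultHasher::new();
    plan.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical, simplified context that holds the cached hash.
struct Ctx {
    /// Hash of the plan as of the last update (the "cached hash").
    last_hash: Option<u64>,
    /// Number of times the cached hash disagreed with the actual input plan.
    clashes: usize,
}

impl Ctx {
    fn new() -> Self {
        Ctx { last_hash: None, clashes: 0 }
    }

    /// Single place where the cached hash is refreshed.
    fn update_last_hash(&mut self, plan: &Plan) {
        self.last_hash = Some(hash_plan(plan));
    }

    /// Compare the cached hash against the plan a transform is about to see.
    fn check_cached_hash(&mut self, plan: &Plan) {
        if let Some(cached) = self.last_hash {
            if cached != hash_plan(plan) {
                self.clashes += 1;
            }
        }
    }
}

/// Toy transform: checks the cache, rewrites the plan, refreshes the cache.
fn run_transform(ctx: &mut Ctx, plan: &mut Plan) {
    ctx.check_cached_hash(plan);
    let next = plan.0.len() as u32;
    plan.0.push(next);
    ctx.update_last_hash(plan);
}

/// Toy fixpoint loop body that jumps back to the plan from the loop start.
/// `refresh_on_reset` toggles the fix: re-caching the hash of the restored plan.
fn fixpoint_with_reset(ctx: &mut Ctx, plan: &mut Plan, refresh_on_reset: bool) {
    let original = plan.clone();
    ctx.update_last_hash(plan);
    run_transform(ctx, plan);

    // Jump back to the beginning, as a cycle-detection mechanism might.
    *plan = original;
    if refresh_on_reset {
        ctx.update_last_hash(plan);
    }
    run_transform(ctx, plan);
}
```

Calling `fixpoint_with_reset` with `refresh_on_reset` set to `false` leaves the cache pointing at the plan from before the jump, so the next transform sees a mismatch. With `true`, the cache is refreshed at the jump and no mismatch is recorded. Routing every refresh through one helper mirrors the motivation given for the second commit.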
Motivation
Tips for reviewer
I suggest reviewing commit by commit.
Checklist
`$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.