Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: ci3 post-merge fixes #10825

Merged
merged 178 commits into from
Dec 19, 2024
Merged

chore: ci3 post-merge fixes #10825

merged 178 commits into from
Dec 19, 2024

Conversation

ludamad
Copy link
Collaborator

@ludamad ludamad commented Dec 17, 2024

No description provided.

ludamad and others added 30 commits December 17, 2024 20:20
The test `can simulate public txs while building a block` would, in some
cases, build blocks with varying tx sizes. An example run built blocks
with 4, 12, and 12 txs. This totals 28 txs, so the remaining 2 txs never
get mined and the test timeout.

This fixes it by forcing the sequencer to build with just 4 txs
consistently, and making the total number of txs a multiple of it.
We now require the *average* of the increase in proven chain across a
namespace to be 0 for an *hour* to trigger a slack alert.
Resolves #10033
Resolves #10374

This PR does the following:
- Witgen handles out-of-gas errors for all opcodes 
- all halts (return/revert/exceptional) work as follows:
- charge gas for the problematic instruction as always, adding a row to
the gas trace
    - pop the parent/caller's latest gas from the stack
- call a helper function on the gas trace to mutate that most recent gas
row, returning to the parent's latest gas minus any consumed gas (all
gas consumed on exceptional halt)
- `GasTraceEntry` includes a field `is_halt_or_first_row_in_nested_call`
which lets us break gas rules on a halt or when starting a nested call
because in both cases gas will jump.
- `constrain_gas` returns a bool `out_of_gas` so that opcode
implementations can handle out of gas
- `write_to_memory` now has an option to skip the "jump back to correct
pc" which was problematic when halting because the `jump` wouldn't
result in a next row with the right pc

Explanation on how gas works for calls:
- Parent snapshots its gas right before a nested call in
`ctx.*_gas_left`
- Nested call is given a `ctx.start_*_gas_left` and the gas trace is
forced to that same value
- throughout the nested call, the gas trace operates normally, charging
per instruction
- when any halt is encountered, the instruction that halted must have
its gas charged normally, but then we call a helper function on the gas
trace to mutate the most recent row, flagging it to eventually become a
sort of "fake" row that skips some constraints
- the mutation of the halting row resets the gas to the parents last gas
before the call (minus however much gas was consumed by the nested
call... if exceptional halt, that is _all_ allocated gas)


Follow-up work
- properly constrain gas for nested calls, returns, reverts and
exceptional halts
- if `jump` exceptionally halts (i.e. out of gas), it should be okay
that the next row doesn't have the target pc
- Handle the edge case when an error is encountered on
return/revert/call, but after the stack has already been modified
Attempts two fixes at e2e epochs test. First, it increases the L1 block
time, to account for general CI slowness. Second, it adds more retries
to the L1 gas utils getTx, since the e2e epochs test works using the tx
delayer, which artificially introduces a delay between a tx being sent
and it being available in anvil, so it triggered a timeout in the utils.

**Update**: Increasing the retries caused the error to change, we were
getting a timeout in teardown. This was caused because the sequencer got
stuck in waiting for the tx to be mined for way too long (more than 5
minutes, the afterAll hook timeout), and the `node.stop()` waits for the
current loops to end before returning.

But what's interesting is _why_ the sequencer got stuck in waiting for
its tx to be mined. The tx was being delayed by the tx-delayer, which
intercepts txs, manually computes their tx hash to return it to the
caller immediately, and holds on the tx to submit it to anvil at a later
point in time. What came up is that the tx hash we were manually
computing over txs with blobs did not match the tx hash returned by
anvil. This has been logged as #10824. However, we can sidestep this
issue by just choosing a reasonable value for max attempts so teardown
doesn't timeout.

---------

Co-authored-by: ludamad <[email protected]>
Please read [contributing guidelines](CONTRIBUTING.md) and remove this
line.
@ludamad ludamad marked this pull request as draft December 19, 2024 02:55
@ludamad ludamad marked this pull request as ready for review December 19, 2024 03:10
@ludamad ludamad merged commit 0832045 into cl/ci3 Dec 19, 2024
34 of 35 checks passed
@ludamad ludamad deleted the cl/ci3-fixes branch December 19, 2024 03:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bench-all CI: Enables this CI job. e2e-all CI: Enables this CI job. network-all Run this CI job.
Projects
None yet
Development

Successfully merging this pull request may close these issues.