Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(workflow): Fix multiple context leaks in reuseV8Context executor #1519

Closed

Conversation

mjameswh
Copy link
Contributor

@mjameswh mjameswh commented Sep 12, 2024

What was changed

  • Fix multiple context leaks and bugs in the reuseV8Context executor:

@mjameswh mjameswh requested a review from a team as a code owner September 12, 2024 20:52
@mjameswh mjameswh force-pushed the fix-reusev8context-reassign-bug branch from 44f477c to 04810fb Compare September 16, 2024 21:22
@mjameswh mjameswh changed the title [WIP] fix(workflow): Fix multiple context leaks in reuseV8Context executor fix(workflow): Fix multiple context leaks in reuseV8Context executor Sep 16, 2024
@antlai-temporal
Copy link
Contributor

I'm a bit confused about what is the ultimate goal and why this is needed. If we want to eliminate all side channels this is not going to work, e.g., you could have timing side-channels, and the v8 VM is not designed to eliminate side-channels. If the goal is that other untrusted code is running in the same reused v8 context and attacking other users by modifying the context, I'm not sure we are protecting all the primordial classes (a String, Number,...) either, and in any case, that's a very hard problem...

Is this just to help prevent some mistakes, nothing to do with security?

@mjameswh
Copy link
Contributor Author

Is this just to help prevent some mistakes, nothing to do with security?

No, that's really not a security thing (that's simply impossible AFAIK with Node's vm). And we don't have a goal of preventing all possible escapes out of the VM.

The reuseV8Context is a performance optimization (roughly 66% reduction on RAM usage for cached workflows, and up to 50% on CPU usage). The trick is that instead of creating a new vm context for each workflow (which is slow and allocates a large number of objects), we use a single vm context, swapping global variables in and out before and after executing each activation.

However, there's been some reports of users doing uncommon but still legitimate and reasonable things in their workflow code, that would result in context from one workflow execution leaking into another workflow execution. Depending on use cases, that may result in non-deterministic behaviors, memory leaks, or some other oddities.

For example, one user recently reported that they were mocking functions in the console object inside their workflow code to redirect messages to the workflow log. They were doing that because their workflow was calling some outside library, not meant specifically for Temporal Workflows, which would print things its own logging through console.log. Agreed, that's not the best way of doing this, but still something I would have expected to be supported by our sandbox. Their code looked something like this:

async function myWorkflow() {
  const workflowId = workflowInfo().workflowId;
  globalThis.console = {
    ...console,
    log: (message) => {
      log(`${workflowId}: message`)
    }
  }
  // ...

The problem here is that previously, when leaving a workflow's execution context, the Reusable VM's would not touch any global variable that existed right after start. In this case, that means that it would leave the console object in place, even though it was no longer the original, pristine console object! That would result in them seeing log messages with an incorrect workflowId.

Once you know it, it's quite easy to fix their code to avoid this issue (they did), but it's also an opportunity for us to make our engine more reliable.

@mjameswh
Copy link
Contributor Author

There's also one test, "Shared global state is frozen", that existed before this PR that was actually not testing what we thought it was testing. That is, the test was asserting attempting to add a custom property to some global object from Node's default vm context would be rejected.

However, it was doing so by adding something to the setTimetout function, but it turns out that's insufficient as the globalThis.setTimeout is overridden by the SDK itself, resulting in it being present in the globals object exposed outside of the vm, where global variables that would not been modified inside the vm would not appear in the globals object (seems to be some kind of internal optimization done by Node/V8).

This PR fixes that bug; the test is now asserting the proper behavior against the Array global function instead.

packages/common/src/type-helpers.ts Outdated Show resolved Hide resolved
packages/common/src/type-helpers.ts Outdated Show resolved Hide resolved
packages/test/src/test-isolation.ts Outdated Show resolved Hide resolved
Comment on lines 107 to 131
(setTimeout as any).a = 1;
(Array as any).a = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you change this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That test was presumably testing that v8's built-in functions were not mutable, but the SDK redefines setTimeout, so that test was actually useless. Turns out in fact that built-in functions were unexpectedly mutable.

Comment on lines 160 to 161
env.client.workflow.execute(sharedGlobalReassignment, { taskQueue, workflowId: randomUUID() }),
env.client.workflow.execute(sharedGlobalReassignment, { taskQueue, workflowId: randomUUID() }),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be racy in CI and may not actually provide the test coverage you want. There's nothing guaranteeing that the workflows will run concurrently.

*/
readonly contextKeysToPreserve: Set<string>;
private _context?: vm.Context;
private pristine?: Map<string, PropertyDescriptor>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring.

const workflowModule: WorkflowModule = new Proxy(
{},
{
get(_: any, fn: string) {
return (...args: any[]) => {
Object.assign(context, bag);
for (const [pname, pdesc] of bag.entries()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would have liked to prevent constructing the entries array for every time we call into the VM.

// Looks like Node/V8 is not properly syncing deletion of keys on the outter context
// object to the inner globalThis object. Hence, we delete them from inside the context.
context.__TEMPORAL_ARGS__ = keysToCleanup;
vm.runInContext(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will add a bit of overhead for every time we enter the context, I would try to find an alternative way to do this.

@mjameswh mjameswh force-pushed the fix-reusev8context-reassign-bug branch from 04810fb to 0bd389b Compare November 22, 2024 19:26
@mjameswh
Copy link
Contributor Author

The code changed significantly since the first round of reviews.

I'm closing this PR and opening a new one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants