
Fix: Ensure executeRequest waits for standard output to complete when using comlink #144

Closed
wants to merge 3 commits

Conversation


@higoo-higoo higoo-higoo commented Nov 26, 2024

Description of Issue

When using comlink, the executeRequest function sometimes finishes before the standard output has been fully processed.
This behavior appears to be caused by a lack of synchronization between callback handling and the plain function-call path in comlink.


How to Reproduce the Issue

  1. Run npm run quickstart, jlpm build, and jlpm docs:lite with "exposeAppInBrowser": true, then jlpm serve.
  2. Access JupyterLite.
  3. Execute the following code in your browser's console:
window.jupyterapp.serviceManager.kernels._kernelConnections.values().next().value.iopubMessage.connect((e, t) => {
    const c = t.content;
    
    if (c?.execution_state === 'busy') {
        console.log('+++++++execution start+++++++');
    }

    console.log('!!', t.content);
    
    if (c?.execution_state === 'idle') {
        console.log('-------execution finish-------');
    }
});
  4. Execute any code in JupyterLite.
  • Expected behavior: The log -------execution finish------- should only appear after all standard output is processed.
  • Actual behavior: The log sometimes appears prematurely, before the standard output is fully handled.

Changes Made

I identified the issue as being caused by an inconsistency in the execution order between callback handling and the plain function-call path when using comlink.

To resolve this:

  • The worker now sends a message through a callback at the end of the process.
  • The kernel waits for this message before concluding the executeRequest function.
  • This ensures proper synchronization and prevents premature termination of executeRequest (a hedged sketch of the pattern follows below).
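
A minimal sketch of the waiting pattern, assuming PromiseDelegate from @lumino/coreutils and a worker message named execute_return as in this PR's diff; the class and field names are simplified assumptions, not the actual implementation:

```typescript
import { PromiseDelegate } from '@lumino/coreutils';

// Hypothetical shape of the comlink-wrapped worker kernel.
interface IRemoteKernel {
  execute(content: unknown, parent: unknown): Promise<{ execution_count?: number }>;
}

class ExecuteWaitSketch {
  constructor(private _remoteKernel: IRemoteKernel) {}

  // Resolved once the worker reports that all output for the current
  // execute request has been flushed.
  private _executed = new PromiseDelegate<void>();

  // Callback registered with the worker (wrapped with comlink's proxy());
  // it receives every message the worker forwards to the main thread.
  protected _processWorkerMessage(msg: { type: string }): void {
    if (msg.type === 'execute_return') {
      // The worker signals that output handling for this request is done.
      this._executed.resolve();
    }
    // ... stream, display_data, etc. are handled here as before
  }

  async executeRequest(content: unknown, parent: unknown): Promise<unknown> {
    const result = await this._remoteKernel.execute(content, parent);
    // Wait for the worker's completion message before returning, so the
    // kernel does not report 'idle' while output is still in flight.
    await this._executed.promise;
    this._executed = new PromiseDelegate<void>();
    return result;
  }
}
```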


@jtpio added the bug label on Dec 5, 2024
@@ -198,6 +198,9 @@ export class PyodideKernel extends BaseKernel implements IKernel {
);
break;
}
case 'execute_return': {
Member

Wondering if such a message should be named something else, or prefixed with an underscore or similar? Otherwise it looks like it may be an official message that is part of the Jupyter protocol, while it isn't.

Contributor

Sure, jpk: would be solid. It's not really private, and a downstream may well want to react to it. We'd pay a minimal tax to do the .split, but adding more fields at this time seems even worse and more breaking.

Most of these should be hanging off a const or namespace somewhere (but not an enum: those can break in round-tripped JSON settings).

Author

So, should we rename the message to jpk:execute_return after all?

@martenrichter
Contributor

Do you think this also solves the ipywidget problem, which is also related to the commit introducing comlink?
See the last comment in #143 for my analysis.
Or do you think it is a separate issue in the comlink communication?

@jtpio
Member

jtpio commented Dec 12, 2024

> Do you think this also solves the ipywidget problem, which is also related to the commit introducing comlink?

Not sure yet, but probably worth trying with this link (preview for the PR): https://jupyterlite-pyodide-kernel--144.org.readthedocs.build/en/144/_static/lab/index.html

@martenrichter
Contributor

Nope, it still persists. So it must be another race condition introduced by comlink (or something else in the comlink commit).

@@ -244,6 +247,8 @@ export class PyodideKernel extends BaseKernel implements IKernel {
await this.ready;
const result = await this._remoteKernel.execute(content, this.parent);
result.execution_count = this.executionCount;
await this._executed.promise;
this._executed = new PromiseDelegate<void>();
Contributor

I have a stupid question. There is always one PromiseDelegate, and the execute request starts before interacting with the promise delegate. If two execution requests are started at almost the same time, what identifies which promise should be resolved once the message arrives?

@martenrichter
Contributor

Another question: does it make sense to add this message mechanism? I think the diagnosis is that comlink sometimes does not accurately handle calls to async functions in the worker. But this means that potentially all async functions are affected by the same issue, so it may be better to find the root cause of why comlink is failing. Its protocol is not that simple and uses UUIDs for every message, but I am also not sure whether it handles async correctly.

@martenrichter
Contributor

The real problem is that comlink changes the ordering between the messages from this._sendWorkerMessage and the returns of the worker's async methods. So the _sendWorkerMessage messages arrive after the return from the async method that sets the parent header with the msg_id.

@martenrichter
Contributor

Ok, I have found the underlying problem.
This line:

remote.registerCallback(proxy(this._processWorkerMessage.bind(this)));

ends up in comlink:
https://github.com/GoogleChromeLabs/comlink/blob/fd4b52666b1ec62784f8b45cb1108c7e40bc481d/src/comlink.ts#L213
creating a separate MessageChannel for the callback used in _sendWorkerMessage.

This explains all the troubles: the calls to the worker go over the worker's normal message port, but _sendWorkerMessage uses a separate MessageChannel. User agents do not guarantee that messages on two separate channels are delivered in the same order, but ordering is crucial for the Jupyter protocol.
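
A hedged illustration of the two code paths involved, using comlink's public wrap/proxy API; the registerCallback and execute names mirror the kernel's pattern but are simplified assumptions here:

```typescript
import { wrap, proxy, Remote } from 'comlink';

// Hypothetical interface exposed by the worker.
interface IWorkerApi {
  registerCallback(cb: (msg: unknown) => void): void;
  execute(code: string): Promise<unknown>;
}

async function demo(): Promise<void> {
  const worker = new Worker('worker.js'); // hypothetical worker script
  const remote: Remote<IWorkerApi> = wrap<IWorkerApi>(worker);

  // The callback is transferred as a comlink proxy: internally comlink
  // creates a dedicated MessageChannel for it, so messages the worker
  // sends through this callback travel over a different port than the
  // return values of remote method calls.
  await remote.registerCallback(proxy((msg) => console.log('iopub', msg)));

  // The resolution of this promise arrives on the worker's own port.
  // Browsers do not guarantee ordering across the two ports, so this
  // return can overtake callback messages that were sent earlier.
  const result = await remote.execute('print("hello")');
  console.log('execute returned', result);
}
```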
I see the following options:
1.) Proceed forward in the spirit of this PR and add a completion message to every call to the worker, but with a separate promise for every call, perhaps using a Map keyed by some kind of unique id, maybe the message id (a hedged sketch of this follows below).
2.) Find a way to use comlink so that _sendWorkerMessage does not use a separate channel from the other mechanism.
3.) Go back to using postMessage alongside comlink, maybe with an added id jupyterMessage so that comlink knows this is not a message it should process.
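
A minimal sketch of option 1.), assuming every worker call carries a unique id (for example the Jupyter msg_id) and that the worker echoes that id back in its completion message; all names here are hypothetical:

```typescript
import { PromiseDelegate } from '@lumino/coreutils';

class PerCallCompletion {
  // One pending promise per in-flight worker call, keyed by a unique id.
  private _pending = new Map<string, PromiseDelegate<void>>();

  // Callback registered with the worker; the worker echoes the id of the
  // call whose output it has finished flushing.
  protected _onWorkerMessage(msg: { type: string; id: string }): void {
    if (msg.type === 'execute_return') {
      this._pending.get(msg.id)?.resolve();
    }
  }

  // Wrap a worker call so that it only completes once the matching
  // completion message has arrived.
  async callAndWait<T>(id: string, call: () => Promise<T>): Promise<T> {
    const done = new PromiseDelegate<void>();
    this._pending.set(id, done);
    try {
      const result = await call();
      // Even with overlapping requests, only the promise registered
      // under this id is awaited, so completions cannot be mixed up.
      await done.promise;
      return result;
    } finally {
      this._pending.delete(id);
    }
  }
}
```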

@jtpio
Member

jtpio commented Dec 16, 2024

Many thanks @martenrichter for the thorough investigation!

> 2.) Find a way to use comlink so that _sendWorkerMessage does not use a separate channel from the other mechanism.

Right, if there is a way to do that with comlink, sticking to the comlink API if possible, that would be great. Otherwise switching back to raw postMessage would likely be fine, since it's more important to fix the actual user issue 👍

@martenrichter
Contributor

> Many thanks @martenrichter for the thorough investigation!
>
> > 2.) Find a way to use comlink so that _sendWorkerMessage does not use a separate channel from the other mechanism.
>
> Right, if there is a way to do that with comlink, sticking to the comlink API if possible, that would be great. Otherwise switching back to raw postMessage would likely be fine, since it's more important to fix the actual user issue 👍

I have posted a PR for 3.). My follow-up investigation showed that using native comlink is only possible if I submit an upstream patch, which would be complex, and I do not know how welcome that would be on the comlink team's side.
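
For reference, a minimal sketch of the option 3.) approach on the main-thread side, assuming Jupyter protocol traffic is sent with plain postMessage and tagged with a marker field (jupyterMessage, as suggested in the options list above) so that comlink's own listener can ignore it; the function and field names are hypothetical:

```typescript
// Listen on the worker's single port so message ordering is preserved,
// handling only the tagged Jupyter messages and letting everything else
// fall through to comlink's own 'message' listener.
function listenForJupyterMessages(
  worker: Worker,
  onMessage: (msg: unknown) => void
): void {
  worker.addEventListener('message', (event: MessageEvent) => {
    const data = event.data as { jupyterMessage?: boolean } | undefined;
    if (data?.jupyterMessage) {
      onMessage(data);
    }
  });
}
```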

@jtpio closed this in #148 on Dec 17, 2024