
python-sdk(feat): Support async function #1017 #1046

Closed

Conversation


@Default2882 Default2882 commented Nov 20, 2024

Context

Currently, Indexify does not support async functions. This is important because users want to run functions that are I/O bound. Addresses #1017.

What

This change refactors the executor agent so that the downloader, function_worker and the task_reporter run in a ThreadPoolExecutor.
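
For illustration only, a minimal sketch of that arrangement (the helper names below are placeholders, not the actual indexify modules): blocking work such as downloads and task reporting is handed to a shared ThreadPoolExecutor via run_in_executor, so the agent's event loop stays responsive.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def download_graph(task_id: str) -> bytes:
    # Stand-in for the downloader's blocking HTTP/file work.
    return b"graph-bytes-for-" + task_id.encode()

def report_outcome(task_id: str, ok: bool) -> None:
    # Stand-in for the task_reporter's blocking reporting call.
    print(f"reported {task_id}: {'ok' if ok else 'failed'}")

async def handle_task(task_id: str, pool: ThreadPoolExecutor) -> None:
    loop = asyncio.get_running_loop()
    # Blocking work runs on the thread pool so the agent's event loop
    # stays free to schedule other tasks.
    graph = await loop.run_in_executor(pool, download_graph, task_id)
    await loop.run_in_executor(pool, report_outcome, task_id, bool(graph))

async def main() -> None:
    with ThreadPoolExecutor(max_workers=4) as pool:
        await asyncio.gather(*(handle_task(t, pool) for t in ("t1", "t2", "t3")))

if __name__ == "__main__":
    asyncio.run(main())
```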

Testing

Added a basic test; more tests will be added.

Contribution Checklist

  • If the python-sdk was changed, please run make fmt in python-sdk/.
  • If the server was changed, please run make fmt in server/.
  • Make sure all PR Checks are passing.

@Default2882 Default2882 changed the title python-sdk(feat): Support async function #1017 [WIP] python-sdk(feat): Support async function #1017 Nov 20, 2024
@Default2882 Default2882 force-pushed the suppot_async_function branch from 8bc3537 to 7534402 Compare November 21, 2024 05:21
@Default2882 Default2882 changed the title [WIP] python-sdk(feat): Support async function #1017 python-sdk(feat): Support async function #1017 Nov 21, 2024
@Default2882 Default2882 marked this pull request as ready for review November 21, 2024 12:41
@@ -272,6 +281,30 @@ def run_fn(
)
return output, None

async def run_fn_async(
Contributor Author

This and other functions can use some refactoring to avoid code duplication.
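
As a rough sketch of what an async-aware run function could look like (hypothetical names, not the actual run_fn/run_fn_async bodies): detect whether the user function is a coroutine function and await it directly, otherwise run it on an executor so it does not block the loop.

```python
import asyncio
import inspect
from typing import Any, Callable

async def run_user_fn(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Run a user function whether it is async or plain sync (illustrative only)."""
    if inspect.iscoroutinefunction(fn):
        # Async user function: await it on the current event loop.
        return await fn(*args, **kwargs)
    # Sync user function: run it on the default executor so the loop is not blocked.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, lambda: fn(*args, **kwargs))
```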

f"Failed to get runnable tasks: {async_task.exception()}",
style="red",
task_name = TaskEnum.from_value(async_task.get_name())
match task_name:
Contributor Author

This switch case can possibly use a strategy pattern

Contributor

Yep, this is a large switch; it would be better to just call a method for each case here.
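
For illustration, a sketch of the dispatch both comments point toward (the enum members and handler names are made up): a table mapping each TaskEnum value to a per-case method replaces the large match block.

```python
import enum
from typing import Awaitable, Callable, Dict

class TaskEnum(enum.Enum):
    GET_RUNNABLE_TASKS = "get_runnable_tasks"
    RUN_TASK = "run_task"
    REPORT_OUTCOME = "report_outcome"

class Agent:
    def __init__(self) -> None:
        # Strategy table: one handler method per task kind instead of a match statement.
        self._handlers: Dict[TaskEnum, Callable[[object], Awaitable[None]]] = {
            TaskEnum.GET_RUNNABLE_TASKS: self._handle_get_runnable_tasks,
            TaskEnum.RUN_TASK: self._handle_run_task,
            TaskEnum.REPORT_OUTCOME: self._handle_report_outcome,
        }

    async def dispatch(self, task_name: TaskEnum, async_task: object) -> None:
        await self._handlers[task_name](async_task)

    async def _handle_get_runnable_tasks(self, async_task: object) -> None: ...
    async def _handle_run_task(self, async_task: object) -> None: ...
    async def _handle_report_outcome(self, async_task: object) -> None: ...
```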

base_url: str,
executor_id: str,
task_store: TaskStore,
config_path: Optional[str] = None,
):
self._base_url = base_url
self._executor_id = executor_id
self._client = get_httpx_client(config_path)
Contributor Author

Make the task reporter use IndexifyClient, because the basic HTTP client does not set the API key.
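
A sketch of what that suggestion could look like (illustrative only; the real TaskReporter constructor may differ): inject an IndexifyClient, which already handles API-key authentication, instead of building a bare httpx client inside the reporter.

```python
from typing import Optional

class TaskReporter:
    def __init__(
        self,
        base_url: str,
        executor_id: str,
        indexify_client,  # an IndexifyClient instance, injected by the caller
        config_path: Optional[str] = None,
    ):
        self._base_url = base_url
        self._executor_id = executor_id
        # Reuse the SDK client instead of a raw httpx client so that requests
        # carry the API key, which get_httpx_client(config_path) does not set.
        self._client = indexify_client
```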

@@ -225,7 +225,23 @@ def remote_or_local_pipeline(pipeline, remote=True):
return pipeline


@indexify_function()
Contributor Author

Add more tests.
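
A sketch of the kind of extra test this refers to (the import path and function body are assumptions; only the async def under @indexify_function() is the point):

```python
import asyncio

# Import path assumed; adjust to the SDK's actual module layout.
from indexify import indexify_function

@indexify_function()
async def async_sleep_and_echo(x: int) -> int:
    # I/O-bound stand-in: yields to the event loop instead of blocking a worker.
    await asyncio.sleep(0.1)
    return x
```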

Contributor

@eabatalov eabatalov left a comment

wrt commit 11f50f5
There are a few comments to address but overall looks good. Thank you.
I'll check other commits.

python-sdk/indexify/executor/downloader.py (Outdated, resolved)
python-sdk/indexify/executor/agent.py (Outdated, resolved)
python-sdk/indexify/executor/agent.py (Outdated, resolved)
f"Failed to get runnable tasks: {async_task.exception()}",
style="red",
task_name = TaskEnum.from_value(async_task.get_name())
match task_name:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this is a large switch it's be better to just a call a method per each case here.

Contributor

@eabatalov eabatalov left a comment

Thank you for looking into this.

I started to review per commit but I see that there's no clear separation of commits in the PR. I'll then review full changes soon and let you know the feedback.

I also think that all the commits in this PR need to be squashed into a single commit because the commit titles are currently misleading and they'll just make the commit log and history confusing.

)


def _log(task_outcome):
Contributor

Nit. Why do we have underscores here? These functions are supposed to be visible outside of task_reporter_utils.py file. This is why I'm asking.

Contributor Author

These functions are supposed to be visible outside of task_reporter_utils.py file

No, they shouldn't be. The idea is to use a custom Console theme for different parts of the executor, although I can see that there are some misses related to this in other areas of the code.
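
To make the convention concrete, a small illustrative sketch (assuming the rich library, which the style="red" calls suggest; the theme keys and helper body are made up): the leading underscore marks the helper as private to the module, and the module owns a Console with its own theme.

```python
# task_reporter_utils.py (illustrative sketch, not the actual module contents)
from rich.console import Console
from rich.theme import Theme

# Module-level console with a theme specific to task reporter output.
_console = Console(theme=Theme({"success": "green", "failure": "bold red"}))

def _log(task_outcome: str) -> None:
    """Private helper: only task_reporter code is expected to call this."""
    style = "success" if task_outcome == "Success" else "failure"
    _console.print(f"Task outcome: {task_outcome}", style=style)
```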


class FunctionWorker:
def __init__(
self, workers: int = 1, indexify_client: IndexifyClient = None
self,
workers: int = get_optimal_process_count(),
Contributor

Here we add the extra responsibility of deciding how many worker processes to run. Do we have to add this responsibility here, given that it's already configurable by the caller?
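
For reference, a default like this is usually derived from the host's CPU count; the following is only a guess at what get_optimal_process_count might do, not its actual implementation.

```python
import os

def get_optimal_process_count() -> int:
    # Fall back to 1 when the CPU count cannot be determined.
    return os.cpu_count() or 1
```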

Contributor

@eabatalov eabatalov left a comment

Overall question about this PR.

Do we need to refactor whole Executor code base to be async to support async functions?

As it's stated in the issue #1017:

The function worker needs to detect if the user function is multiprocess and create an event loop, and call the function with asyncio.run or something of that nature. This might require our sdk to have async interfaces as well. That needs to be explored.

How does this PR help us get there?

fyi we plan to implement running function code in separate processes soon. In this case it's even less important if Executor runs fully in async event loop or not.

@@ -63,6 +64,9 @@ def __init__(
name_alias: Optional[str] = None,
image_version: Optional[int] = None,
):
event_loop = asyncio.get_event_loop()
self._thread_pool = ThreadPoolExecutor(max_workers=num_workers)
Contributor

Currently ProcessPoolExecutor is used. Why do we switch to ThreadPoolExecutor here? What are the implications of this switch? E.g. how do we make sure that there are no deadlocks between the threads now and in the future?

Contributor Author

@Default2882 Default2882 Nov 22, 2024

Even though we were importing and initialising ProcessPoolExecutor, it wasn't being used anywhere. In the existing implementation, the agent, downloader, function_worker, and task_reporter were all executing on a single event loop.

As per https://docs.python.org/3/library/concurrent.futures.html, ProcessPoolExecutor and ThreadPoolExecutor behave the same way, except that one uses subprocesses and the other uses threads.

E.g. how do we make sure that there are no deadlocks between the threads now and in the future?

Each thread will be assigned a single unit of work; I don't see how we can deadlock here.

Contributor

Yep, this is right. Indeed, ProcessPoolExecutor is not used. It seems unnecessary to explicitly create a ThreadPoolExecutor. See https://stackoverflow.com/questions/60204054/default-executor-asyncio. A thread pool is required for the Python event loop anyway, because some operations like file reads/writes are not supported in asyncio yet.
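
Sketching the point from the linked answer: passing None as the executor to run_in_executor uses the event loop's default ThreadPoolExecutor, so an explicit pool is only needed when its size or lifetime must be controlled.

```python
import asyncio

def blocking_read(path: str) -> str:
    # Plain blocking file I/O; asyncio has no native async file reads.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

async def read_file(path: str) -> str:
    loop = asyncio.get_running_loop()
    # None -> the event loop's default ThreadPoolExecutor; no explicit pool required.
    return await loop.run_in_executor(None, blocking_read, path)
```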

python-sdk/indexify/executor/agent.py (Outdated, resolved)
@Default2882
Contributor Author

Thank you for looking into this.

I started to review per commit but I see that there's no clear separation of commits in the PR. I'll then review full changes soon and let you know the feedback.

Oh yea, this was never supposed to be reviewed commit by commit.

I also think that all the commits in this PR need to be squashed into a single commit because the commit titles are currently misleading and they'll just make the commit log and history confusing.

Will probably do that and force push to have a better commit history.

@Default2882
Contributor Author

Overall question about this PR.

Do we need to refactor whole Executor code base to be async to support async functions?

As it's stated in the issue #1017:

The function worker needs to detect if the user function is multiprocess and create an event loop, and call the function with asyncio.run or something of that nature. This might require our sdk to have async interfaces as well. That needs to be explored.

How does this PR help us get there?

fyi we plan to implement running function code in separate processes soon. In this case it's even less important if Executor runs fully in async event loop or not.

I am aware of this and was trying to re-use the event loop to make the executor completely async (also took the liberty of re-factoring the code a bit), and give the consumers of python-sdk the option to run async indexify functions by using ThreadPoolExecutor with an event loop. However if we want to completely move away from event loop and use processes/threads, then the executor needs to be completely re-written due to lack of proper abstractions.

@Default2882 Default2882 force-pushed the suppot_async_function branch from 11f50f5 to 55154e8 Compare November 27, 2024 16:21
@eabatalov
Contributor

Overall question about this PR.
Do we need to refactor whole Executor code base to be async to support async functions?
As it's stated in the issue #1017:

The function worker needs to detect if the user function is multiprocess and create an event loop, and call the function with asyncio.run or something of that nature. This might require our sdk to have async interfaces as well. That needs to be explored.

How does this PR help us get there?
fyi we plan to implement running function code in separate processes soon. In this case it's even less important if Executor runs fully in async event loop or not.

I am aware of this and was trying to re-use the event loop to make the executor completely async (also took the liberty of re-factoring the code a bit), and give the consumers of python-sdk the option to run async indexify functions by using ThreadPoolExecutor with an event loop. However if we want to completely move away from event loop and use processes/threads, then the executor needs to be completely re-written due to lack of proper abstractions.

Yep, we'll discuss this internally and I'll come back to you in a few days with an answer about the plan for the code base.

@eabatalov
Contributor

I am aware of this and was trying to re-use the event loop to make the executor completely async (also took the liberty of re-factoring the code a bit), and give the consumers of python-sdk the option to run async indexify functions by using ThreadPoolExecutor with an event loop. However if we want to completely move away from event loop and use processes/threads, then the executor needs to be completely re-written due to lack of proper abstractions.

Yep, we'll discuss this internally and I'll come back to you in a few days with an answer about the plan for the code base.

We're going to implement function execution in separate processes in OSS version next week. It's going to be a sizable patch. This allows functions to free all GPU resources they consume once finished and provides some other benefits.
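
A rough sketch of that idea (not the actual patch): running each invocation in its own short-lived worker process means everything it allocated, including GPU memory, is released when the process exits.

```python
import concurrent.futures

def run_user_function(payload: bytes) -> bytes:
    # Placeholder for real function execution; heavy GPU library imports
    # would happen inside the child process.
    return payload[::-1]

def run_in_fresh_process(payload: bytes) -> bytes:
    # Single-use pool: the worker process exits when the "with" block ends,
    # so everything it allocated (CPU or GPU memory) is freed with it.
    with concurrent.futures.ProcessPoolExecutor(max_workers=1) as pool:
        return pool.submit(run_user_function, payload).result()

if __name__ == "__main__":
    print(run_in_fresh_process(b"hello"))
```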

@Default2882
Contributor Author

We're going to implement function execution in separate processes in OSS version next week. It's going to be a sizable patch. This allows functions to free all GPU resources they consume once finished and provides some other benefits.

Got it, should I close this PR? It won't be relevant anymore.

@diptanu
Collaborator

diptanu commented Dec 1, 2024

@Default2882 Let's close this since Eugene is rewriting the executor; we can revisit this feature after the new executor has landed.

@diptanu diptanu closed this Dec 1, 2024