-
Notifications
You must be signed in to change notification settings - Fork 6.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Fix][Core] Execute user requested actor exit in C++ side #49918
[Fix][Core] Execute user requested actor exit in C++ side #49918
Conversation
8eac3e4
to
ad9caea
Compare
@MortalHappiness let me know when this is ready for review |
At first glance, I think we need to use a |
@edoakes What do you think would be a better approach? Implementing it purely on the Python side using If implemented on the C++ side, I will export a function to Python and call this function when |
If it's possible to do it in C++ without significantly more code change I'd prefer to do it there. That is more future proof and allows us to reuse it across different language frontends in the future. |
81246c5
to
545ca38
Compare
545ca38
to
9f4b19c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MortalHappiness great start!
Let's please unify as many of the actor exit codepaths as possible to share the same logic. For example, can we also unify the single threaded and multi threaded actor exit logic to use the same flag?
Should I do this in this PR or in follow-up PRs? |
I'm ok to do it in follow-up PRs, but could you summarize for me all of the existing exit paths before this PR and what you see as the end goal state? |
Hi @edoakes I've tried to summarize exit codepaths in PR description. Let me know if I missed something. |
Looks good now -- my only concern is that we don't have any behavior change for the sync worker case. Let's make sure that still raises immediately and fails the current task w/ an It seems like the "raising immediately" behavior might not be covered by a current test case (if my understanding is correct). If so, please add a test case for that. For example: def method(self):
ray.exit_actor()
do_something() # This shouldn't be called and this method should raise `RayActorError` when `ray.get` is called. |
Hi @edoakes I've addressed all comments. Ready for another review. Thanks. |
python/ray/actor.py
Outdated
if not worker.core_worker.current_actor_is_asyncio(): | ||
raise_sys_exit_with_custom_error_message("exit_actor() is called.") | ||
worker.core_worker.set_current_actor_should_exit() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A little confused at why this is conditional because we should try to make the behavior as uniform as possible:
- For non-async actors, don't we still want to set the field in the core worker? Otherwise, if users accidentally or intentionally catch the
SystemExit
exception that is raised, the actor won't actually be exited. - For async actors, don't we still want to raise the
SystemExit
exception immediately?
raise AssertionError( | ||
"This line should not be reached because non-async actor" | ||
"should exit immediately after exit_actor is called." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this assertion would actually catch a regression because this line might execute and then the exit
method would still return a RayActorError
. For any condition like this, you should verify that if you intentionally break the behavior (in this case, make it so that sync actors don't raise immediately), the test fails. I don't think it will in this case.
Instead, what you will need to do is add an "out of bounds" check -- so for example catch the SystemExit
exception and then send a signal on an actor or write a tempfile like the other test cases. Then after the actor exits, assert that the exception was raised and therefore the out of band behavior happened.
assert ( | ||
# Exited when task execution returns | ||
"exit_actor()" in str(exc_info.value) | ||
# Exited during periodical check in worker | ||
or "User requested to exit the actor" in str(exc_info.value) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few things for this test:
- We should add the same check that we have for the sync case that
exit_actor()
immediately raises an exception). - I might be misguided, but I think we can have two test cases here instead of just checking for both errors. The first one can call
exit_actor
on the task directly executing the actor method and that should always return "exit_actor() was called...". The second one can callexit_actor
in a background asyncio task and then assert that `"User requested to exit the actor" is eventually raised.
ray.get(a.__ray_ready__.remote()) | ||
with pytest.raises(ray.exceptions.RayActorError) as exc_info: | ||
ray.get(a.start_exit_task.remote()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm this test doesn't look correct. the method starts a background task so there is no guarantee that runs before the method finishes successfully. I think what you need to do is start the background task and have it block on a SignalActor
before it actually calls exit_actor()
, check that start_exit_task
completes successfully, then unblock the background task so it calls exit_actor()
, then use wait_for_condition
on ray.get(a.noop_method.remote())
to check that eventually subsequent methods will return the actor error
@edoakes I've addressed all of your comments, but the actual implementation differs slightly. Let me describe the changes here:
|
@MortalHappiness why can't we set the
Am I missing something? |
@edoakes After some debug printing and some experiment, I found that CPython source code: https://github.com/python/cpython/blob/e53d105872fafa77507ea33b7ecf0faddd4c3b60/Lib/asyncio/tasks.py#L991-L1011 The following is a toy example. import time
import asyncio
import threading
async def f():
print("Inside coroutine f()")
# raise ValueError
# raise BaseException
raise SystemExit
def start_loop(loop):
asyncio.set_event_loop(loop)
loop.run_forever()
def main():
loop = asyncio.new_event_loop()
thread = threading.Thread(target=start_loop, args=(loop,))
thread.start()
future = asyncio.run_coroutine_threadsafe(f(), loop)
try:
time.sleep(2)
print("loop alive?", loop.is_running())
future.result()
except BaseException as e:
print("Caught exception:", repr(e))
print("about to stop loop")
loop.call_soon_threadsafe(loop.stop)
thread.join()
if __name__ == "__main__":
main() Explanation:
And in Ray, we use Line 4587 in b845581
This is why we cannot raise |
Closes: ray-project#49451 Signed-off-by: Chi-Sheng Liu <[email protected]>
Co-authored-by: Edward Oakes <[email protected]> Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#49451 Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#49451 Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#49451 Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#49451 Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#49451 Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#49451 Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#49451 Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#49451 Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#49451 Signed-off-by: Chi-Sheng Liu <[email protected]>
This reverts commit 516bf49. Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#49451 Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#49451 Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#49451 Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#49451 Signed-off-by: Chi-Sheng Liu <[email protected]>
5945cfc
to
04cf1ee
Compare
@MortalHappiness I see, thanks for the explanation. Is it possible to add a top-level exception handler for |
@edoakes Not easy to do that because the future object is returned by It's easier to raise a custom exception in this case. My idea is that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code changes LGTM, just a few nits
Your latest suggestion sounds better than what is in the code, but given that we already have the existing behavior with AsyncioActorExit
and it would be a very minor API breakage to change, better to leave it as is.
src/ray/core_worker/context.h
Outdated
void SetCurrentActorShouldExit() ABSL_LOCKS_EXCLUDED(mutex_); | ||
|
||
bool GetCurrentActorShouldExit() const ABSL_LOCKS_EXCLUDED(mutex_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add header comments for these indicating what they're used for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
Verify the basic case. | ||
""" | ||
|
||
def test_exit_actor_normal_actor_raise_immediately(shutdown_only, tmp_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a test case for calling ray.exit_actor()
within the constructor if there isn't one already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added 3 tests: test_exit_actor_normal_actor_in_constructor_should_exit
, test_exit_actor_async_actor_in_constructor_should_exit
, and test_exit_actor_async_actor_nested_task_in_constructor_should_exit
Closes: ray-project#49451 Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#49451 Signed-off-by: Chi-Sheng Liu <[email protected]>
Closes: ray-project#49451 Signed-off-by: Chi-Sheng Liu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚢
Closes: ray-project#49451 Signed-off-by: Chi-Sheng Liu <[email protected]>
@edoakes I found that you enabled auto-merged. But I found that I accidentally changed 1 line that should not be changed so I pushed it again and the auto-merge is disabled. Could you enable it again? Thanks. |
Why are these changes needed?
When an asyncio task creates another asyncio task, raising
AsyncioActorExit
cannot make the caller exit because they are not the same task. Therefore, this PR makesexit_actor
to request actor exit in core worker context, which will be checked regularly by core worker.This PR also removes
AsyncioActorExit
exception because it is no longer needed.Related issue number
Closes: #49451
Follow-up: Unify actor exit codepaths
Core worker exit codepaths:
Either ForceExit or Exit:
ray.cancel
python function):ray/src/ray/core_worker/core_worker.cc
Line 4348 in 9008eb6
ray/src/ray/core_worker/core_worker.cc
Line 4510 in 9008eb6
ray/src/ray/gcs/gcs_server/gcs_actor_manager.cc
Line 1730 in 67f9490
ExitRequest
RPC call:ray/src/ray/core_worker/core_worker.cc
Line 4737 in 9008eb6
task_execution_callback
application-language callback:ray/src/ray/core_worker/core_worker.cc
Line 3351 in 9008eb6
ray/src/ray/core_worker/core_worker.cc
Lines 3424 to 3443 in 9008eb6
check_signals
application-language callback:ray/src/ray/core_worker/core_worker.cc
Lines 3159 to 3170 in 9008eb6
Actor exit codepaths (in GCS):
Either DestroyActor or KillActor
ray/src/ray/gcs/gcs_server/gcs_actor_manager.cc
Line 377 in 67f9490
ray/src/ray/gcs/gcs_server/gcs_actor_manager.cc
Line 1017 in 67f9490
ray.kill
(CoreWorker::KillActor
->ActorInfoAccessor::AsyncKillActor
->KillActorViaGcs
RPC call ->GcsActorManager::HandleKillActorViaGcs
):ray/src/ray/gcs/gcs_server/gcs_actor_manager.cc
Lines 694 to 698 in 67f9490
ray/src/ray/gcs/gcs_server/gcs_actor_manager.cc
Line 1217 in 67f9490
ray/src/ray/gcs/gcs_server/gcs_actor_manager.cc
Line 1294 in 67f9490
ray/src/ray/gcs/gcs_server/gcs_actor_manager.cc
Line 1517 in 67f9490
Examples:
exit_actor
is called (this PR): core worker exits based on application-language callback.ray.kill
is called:DestroyActor
in GCS will be called, and then it'll notify core worker to exit.Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.