Tasks stop looping following a connection reset #9074
Comments
This issue seems to have a few layers involved, very few of which involve the task extension. The task extension already retries on

This brings me to your traceback, which points to a connection reset while the library is internally chunking and delaying the ready event. I'm not exactly sure how this would cause your task to die; I suspect instead it causes

I'm unsure what the proper course of action for this is at the moment, since a networking issue in the middle of this should cause the library to loop back and redo the flow, but the fact

Do you have any other logs? |
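For readers following along, here is a minimal sketch of the retry-related hooks the tasks extension exposes. It is my own illustration (the cog, loop, and exception names are made up), not the maintainer's code:

```python
import logging

from discord.ext import commands, tasks

logger = logging.getLogger(__name__)


class UpstreamAPIError(Exception):
    """Hypothetical error from an external service the task talks to."""


class RefreshCog(commands.Cog):
    def __init__(self, bot: commands.Bot) -> None:
        self.bot = bot
        # Treat this custom error as retryable too, instead of letting it end the loop.
        self.refresh.add_exception_type(UpstreamAPIError)
        self.refresh.start()

    # reconnect=True (the default) retries the iteration with exponential
    # back-off on the usual connection-related exceptions.
    @tasks.loop(minutes=5, reconnect=True)
    async def refresh(self) -> None:
        ...  # periodic work

    @refresh.error
    async def on_refresh_error(self, error: BaseException) -> None:
        # Called for exceptions the retry logic does not handle; without a
        # custom handler the default implementation logs and the loop stops.
        logger.exception("refresh loop raised", exc_info=error)
```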
Here are the full logs from application start to
|
Do you receive the |
I'm not sure, I don't override I could add an |
There isn't a default log for the ready event being triggered unfortunately so you'd have to add your own. Right now my working theory is that |
I have more logs now. Here's a full run from start up to the point where I noticed tasks were not running:
When I noticed the issue, I restarted the bot manually and got this:
This time we see the new log I added:

Edit: I'm just now looking at the uptime for the 2nd log; it looks like it was about 7ish minutes. So maybe it would have eventually become ready. 🤔 I thought the alert I get for tasks not running wasn't that tight. I'll adjust it so I don't react to a false alarm next time. |
Ok, better logs this time. Here we see the bot start loading cogs around 18:07:52, and then the client still isn't ready by 20:34:00, around 2.5 hours later.
I have a
So we can see that the bot never got the `client ready` log from:

```python
async def on_ready(self) -> None:
    logger.info("client ready")
```

Predictably, in the past 2.5 hours my tasks have not executed, which is what alerted me to the issue in the first place. What I usually do to fix this is manually kill the bot and restart it. Then 🤞🏻 and hope that it doesn't happen again when it starts back up. |
As a workaround, I'm going to try something like this:

```python
WAIT_UNTIL_READY_TIMEOUT = 900.0  # 15 minutes


async def wait_until_ready(bot: Client) -> None:
    while True:
        try:
            if bot.is_ready():
                logger.info("wait_until_ready: already ready")
                break
            ready = bot.wait_for("ready", timeout=WAIT_UNTIL_READY_TIMEOUT)
            resumed = bot.wait_for("resumed", timeout=WAIT_UNTIL_READY_TIMEOUT)
            await asyncio.wait([ready, resumed], return_when=asyncio.FIRST_COMPLETED)
            ready.close()
            resumed.close()
            logger.info("wait_until_ready: ready or resumed")
            break
        except TimeoutError:
            logger.warning("wait_until_ready: timeout waiting for ready or resumed")
            break
        except BaseException as e:  # Catch EVERYTHING so tasks don't die
            logger.exception("error: exception in task before loop: %s", e)
```
|
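One caveat worth noting about that sketch (my note, not something raised in the thread): `Client.wait_for(..., timeout=...)` signals its timeout with `asyncio.TimeoutError`, and on Python versions before 3.11 that is not the built-in `TimeoutError`, so an `except TimeoutError:` clause may never catch it there. Using the names from the snippet above, the version-safe spelling is:

```python
import asyncio

try:
    # Resolves when the gateway dispatches the event, or raises
    # asyncio.TimeoutError once the timeout elapses.
    await bot.wait_for("resumed", timeout=WAIT_UNTIL_READY_TIMEOUT)
except asyncio.TimeoutError:
    logger.warning("no RESUME within the timeout window")
```

On 3.11+ the two exception classes are the same, so either spelling works.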
Is there a reason why these logs do not have anything from |
Not sure, I don't see any logs from |
Your workaround doesn't actually do anything except change the event from |
Out of curiosity, does |
I'll add a hook for it and log it 👍🏻

Edit: I also changed the workaround code to this awful hack 😅

```python
ready_future = bot.loop.create_future()
resumed_future = bot.loop.create_future()
ready_listeners = bot._listeners.get("ready", [])
resumed_listeners = bot._listeners.get("resumed", [])
ready_listeners.append((ready_future, lambda: True))
resumed_listeners.append((resumed_future, lambda: True))
bot._listeners["ready"] = ready_listeners
bot._listeners["resumed"] = resumed_listeners
ready_coro = asyncio.wait_for(ready_future, timeout=WAIT_UNTIL_READY_TIMEOUT)
resumed_coro = asyncio.wait_for(resumed_future, timeout=WAIT_UNTIL_READY_TIMEOUT)
await asyncio.wait([ready_coro, resumed_coro], return_when=asyncio.FIRST_COMPLETED)
if not ready_future.done():
    ready_future.cancel()
if not resumed_future.done():
    resumed_future.cancel()
```

I just needed a reference to the futures so that I could cancel them.

Edit 2: Actually I can just do this instead, and it's much less hacky 😅

```python
_, unfinished = await asyncio.wait(
    [
        bot.wait_for("ready", timeout=WAIT_UNTIL_READY_TIMEOUT),
        bot.wait_for("resumed", timeout=WAIT_UNTIL_READY_TIMEOUT),
    ],
    return_when=asyncio.FIRST_COMPLETED,
)
for task in unfinished:
    task.cancel()
```
|
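A small caveat on that last snippet (my addition, not from the thread): passing bare coroutines to `asyncio.wait()` has been deprecated since Python 3.8 and is rejected on 3.11+, so on newer interpreters the waiters need to be wrapped explicitly. Same idea, using the names from the snippet above:

```python
import asyncio

# asyncio.wait() on 3.11+ only accepts tasks/futures, so wrap the
# Client.wait_for coroutines in tasks first; the behaviour is otherwise the same.
waiters = [
    asyncio.create_task(bot.wait_for("ready", timeout=WAIT_UNTIL_READY_TIMEOUT)),
    asyncio.create_task(bot.wait_for("resumed", timeout=WAIT_UNTIL_READY_TIMEOUT)),
]
_, unfinished = await asyncio.wait(waiters, return_when=asyncio.FIRST_COMPLETED)
for task in unfinished:
    task.cancel()
```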
I've always seen the "ratelimited" logs at bot startup time, even years ago when I was on just 1 shard. I assumed it was normal, expected behavior for any bot at startup. 🤔 |
It's not too weird if you're in a lot of guilds and doing the slow chunking (i.e. no presence intent). |
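For context, here is a rough sketch of the settings that influence startup chunking. This is my own illustration (placeholder prefix and intent choices), and whether these options are appropriate depends on the bot:

```python
import discord
from discord.ext import commands

intents = discord.Intents.default()
intents.members = True
intents.presences = True  # the presence intent avoids the slow chunking path mentioned above

bot = commands.AutoShardedBot(
    command_prefix="!",
    intents=intents,
    # Alternatively, skip chunking every guild at startup so READY is not
    # delayed by it; member data is then fetched on demand instead.
    chunk_guilds_at_startup=False,
)
```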
Since implementing the workaround I have yet to see any tasks get into the weird non-starting state. For reference, the full workaround that I've been using is:

```python
import asyncio
import logging

from discord import Client
from discord.ext import commands, tasks

logger = logging.getLogger(__name__)

WAIT_UNTIL_READY_TIMEOUT = 900.0  # 15 minutes


async def wait_until_ready(bot: Client) -> None:
    while True:
        try:
            if bot.is_ready():
                logger.info("wait_until_ready: already ready")
                break
            _, unfinished = await asyncio.wait(
                [
                    bot.wait_for("ready", timeout=WAIT_UNTIL_READY_TIMEOUT),
                    bot.wait_for("resumed", timeout=WAIT_UNTIL_READY_TIMEOUT),
                ],
                return_when=asyncio.FIRST_COMPLETED,
            )
            for task in unfinished:
                task.cancel()
            logger.info("wait_until_ready: ready or resumed")
            break
        except TimeoutError:
            logger.warning("wait_until_ready: timeout waiting for ready or resumed")
            break
        except BaseException as e:  # Catch EVERYTHING so tasks don't die
            logger.exception("error: exception in task before loop: %s", e)


class TasksCog(commands.Cog):
    def __init__(self, bot: Client) -> None:
        self.bot = bot
        self.some_task.start()

    @tasks.loop(minutes=10)
    async def some_task(self) -> None:
        try:
            ...  # do stuff
        except BaseException as e:  # Catch EVERYTHING so tasks don't die
            logger.exception("error: exception in task cog: %s", e)

    @some_task.before_loop
    async def before_some_task(self) -> None:
        await wait_until_ready(self.bot)


async def setup(bot: commands.AutoShardedBot) -> None:
    await bot.add_cog(TasksCog(bot))
```
|
Again, your workaround just changed it so it waits until RESUME is called. Catching

In order for this issue to be solved I need to actually know the underlying issue, and unfortunately I am unsure what you're triggering. There are a lot of timeouts in the READY handling precisely so it doesn't hang, and at this point I'm feeling like I'm chasing a non-existent lead and the issue might be elsewhere entirely.

I still have no idea whether the shard ready events are fired. |
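To make the concern about blanket exception handlers concrete, here is a small illustration of my own (not from this thread): asyncio delivers cancellation as an exception, so `except BaseException` swallows it unless it is re-raised, which can keep a task alive after the library or shutdown code has asked it to stop.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def do_work() -> None:
    try:
        ...  # periodic work
    except asyncio.CancelledError:
        # Cancellation arrives as an exception; always let it propagate so
        # the task can actually be cancelled.
        raise
    except Exception:
        # Catching Exception (not BaseException) still logs failures without
        # eating cancellation or interpreter-exit signals.
        logger.exception("periodic work failed")
```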
Generally true, and it's not something I would ever put into a library. But in my application it's fine because I see the logs immediately and go investigate whatever the issue is ASAP. I'd rather my application fail loudly and continue running than not 🤷🏻‍♀️
Right, sorry. I added a log for that. I do see shard ready logs. I added logs for them via:

```python
async def on_ready(self) -> None:
    logger.info("client ready")

async def on_shard_ready(self, shard_id: int) -> None:
    logger.info(f"shard {shard_id} ready")
```

in my I see logs for |
Technically it might not even wait for that. The timeout is at 15 minutes if neither "ready" nor "resume" happen. But if "ready" does, by chance, happen before I get a "resume", then the code will stop waiting exactly then and not continue to wait for a "resume". In practice I haven't seen it cause an issue so far 🤞🏻 — I understand that the actual solution at the library layer will of course need to be more robust/general. |
I've also experienced this bug on bot startup, and I'm now debugging it... |
Summary
After the connection reset stack trace appears, everything else keeps working fine, but tasks simply never loop again.
Reproduction Steps
Sadly I have been struggling to reproduce this, but I see it in the wild every now and then.
Minimal Reproducible Code
I have some tasks in my codebase that are set up like so:
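A minimal sketch of the kind of setup described; the `my_task` name and the roughly five-minute interval come from the Actual Results section below, while the cog name and the `wait_until_ready` gate are assumptions on my part (the real code is linked under Additional Context):

```python
import logging

from discord.ext import commands, tasks

logger = logging.getLogger(__name__)


class MyTaskCog(commands.Cog):
    def __init__(self, bot: commands.Bot) -> None:
        self.bot = bot
        self.my_task.start()

    @tasks.loop(minutes=5)
    async def my_task(self) -> None:
        ...  # periodic work

    @my_task.before_loop
    async def before_my_task(self) -> None:
        # Wait for the client to be ready before the first iteration.
        await self.bot.wait_until_ready()


async def setup(bot: commands.Bot) -> None:
    await bot.add_cog(MyTaskCog(bot))
```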
Expected Results
Connection resets happen; sometimes the network is flaky. When this happens the bot should attempt to reconnect and carry on.
Actual Results
This seems to work just fine most of the time. Every now and then, though, I get this stack trace. When I do, it means that my tasks have stopped looping. The bot continues to work just fine in all other respects, but the tasks completely stop happening. In my example code the `my_task()` code should execute every 5ish minutes. But when I see this stack trace it means that `my_task()` will never execute again. It just stops looping tasks completely.

Maybe relevant: just before this stack trace is a bunch of logs like:
This exception only ever seems to happen just after my bot has been restarted; if it doesn't appear right after startup, it doesn't seem to happen at all.
Intents
3276543
System Information
Checklist
Additional Context
Here's the actual tasks cog code from my bot: https://github.com/lexicalunit/spellbot/blob/main/src/spellbot/cogs/tasks_cog.py