
Do better retries when Mastodon fails (and maybe Twitter too?) #366

Open
sentry-io bot opened this issue Aug 23, 2023 · 5 comments

Comments

@sentry-io

sentry-io bot commented Aug 23, 2023

Sentry Issue: BIGCASES2-28

JobTimeoutException: Task exceeded maximum timeout value (180 seconds)
(7 additional frame(s) were not displayed)
...
  File "bc/subscription/tasks.py", line 379, in make_post_for_webhook_event
    api_post_id = api.add_status(message, image, files)
  File "bc/channel/utils/connectors/masto.py", line 94, in add_status
    media_id = self.upload_media(
  File "bc/channel/utils/connectors/masto.py", line 57, in upload_media
    media_dict = self.api.media_post(
@mlissner
Member

38 errors so far since this came up three days ago. I wonder:

  • Does this break Twitter too?
  • Which masto instance/account is affected?
  • Any ideas to fix it?

@ERosendo
Contributor

ERosendo commented Aug 23, 2023

Does this break Twitter too?

No. The bot schedules one independent task for each channel linked to a case, so if one of the tasks fails, it won't affect the other channels.

Which masto instance/account is affected?

I checked some events in Sentry, and it seems the only channel affected is [email protected] (the API is rate-limiting the bot).

Any ideas to fix it?

I think we're getting this exception because the wrapper we're using implements a sleep call inside a while loop to handle status code 429 (you can find the implementation of the API request method here). That sleep adds a delay that causes the job to hit the default timeout for queues. We could tweak one of the arguments of the Mastodon class so it throws an exception when the bot gets rate-limited, and then use retries instead of the sleep-and-while-loop to handle the status code.
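If I understand the suggestion, the argument in question is Mastodon.py's `ratelimit_method` constructor parameter (its default, `"wait"`, is what sleeps inside the request loop). A minimal sketch, assuming that parameter name and its `"throw"` value; the helper just packages the kwargs:

```python
def build_client_kwargs(api_base_url, access_token):
    """Constructor kwargs for mastodon.Mastodon so a 429 raises
    MastodonRatelimitError immediately instead of sleeping until
    the limit resets (which is what eats the 180 s job timeout)."""
    return {
        "api_base_url": api_base_url,
        "access_token": access_token,
        "ratelimit_method": "throw",  # raise instead of sleep-and-retry
    }

# Usage (requires Mastodon.py):
#     from mastodon import Mastodon
#     api = Mastodon(**build_client_kwargs(instance_url, token))
```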

@mlissner
Member

OK, great. So we killed the bankr cases bot on Twitter, perhaps it needs to die on Masto too. I'll go bug the [email protected] folks a second time...

@mlissner mlissner moved this to In Discussion / Later in @erosendo's backlog Sep 12, 2023
@mlissner mlissner changed the title Unable to post to Mastodon due to Timeout? Do better retries when Mastodon fails (and maybe Twitter too?) Sep 12, 2023
@mlissner mlissner moved this from In Discussion / Later to Bots Backlog in @erosendo's backlog Sep 12, 2023
@TheCleric
Contributor

So I have a few ideas on this one and would like some input before I just start applying my own assumptions. Looking at the code @ERosendo referenced, the Mastodon API setting to allow us to do our own rate limit handling isn't great for our purposes. This is the exception that you get when you tell it you want to handle rate limit errors yourself:

raise MastodonRatelimitError('Hit rate limit.')

In the background it gathers stuff from the response headers telling it how long it's rate limited and when it should try again. But then that's the very helpful message it gives us. Thanks Mastodon.py.

So this leaves us in a position where we can certainly detect the FACT that we've been rate limited, but we'd have no idea for how long.
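For what it's worth, the reset time is still recoverable from the raw response headers even though the exception message hides it. A sketch, assuming Mastodon's documented `X-RateLimit-Reset` header (an ISO 8601 timestamp):

```python
from datetime import datetime, timezone

def seconds_until_reset(x_ratelimit_reset, now=None):
    """Given an X-RateLimit-Reset header value (ISO 8601 timestamp),
    return how many seconds remain before the limit resets, clamped
    to zero so a stale header never yields a negative delay."""
    reset = datetime.fromisoformat(x_ratelimit_reset.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return max(0.0, (reset - now).total_seconds())
```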

So here's a few options:

Option 1

  • Have a special handler for adding mastodon messages to the queue that can detect rate limit errors
  • When it detects a rate limit error, use the queue's enqueue_at or enqueue_in function to resubmit the status in X time (how long? We don't know, so this would be a guess.)
  • When we enqueue it again, do we apply the rate limit protection in case we retried too early? If so, we would need our own retry counter in the function and would have to decrement it ourselves (not ideal)
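A sketch of that handler's bookkeeping, with a hypothetical guessed backoff schedule and our own retry counter (the `GUESSED_DELAYS` values and `next_retry` helper are made up for illustration; the real reset window is unknown):

```python
from datetime import timedelta

# Hypothetical guessed schedule; tune if we learn the real limits.
GUESSED_DELAYS = [60, 300, 900]  # seconds

def next_retry(retries_left):
    """Return (delay, remaining) for a manual re-enqueue, or None when
    the retry budget is spent. retries_left is our own counter, carried
    as a task argument and decremented on each rate-limit error."""
    if retries_left <= 0:
        return None
    attempt = len(GUESSED_DELAYS) - retries_left  # 0-based attempt index
    delay = GUESSED_DELAYS[min(attempt, len(GUESSED_DELAYS) - 1)]
    return timedelta(seconds=delay), retries_left - 1

# In the handler (requires rq):
#     result = next_retry(retries_left)
#     if result:
#         delay, remaining = result
#         queue.enqueue_in(delay, make_post_for_webhook_event, ..., remaining)
```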

Option 2

  • Space out our retries on either just Mastodon add_status calls, or on all add_status calls. We currently tell rq the number of retries and the interval between them (by default I think it retries again in 20 seconds), but instead of a single interval, we can give it a series of intervals. For example, we could tell it to retry three times with intervals of 20, 60, and 300 seconds: the first retry would be after 20 seconds, the second 60 seconds after that, and the last 300 seconds after that one.
  • The upside is this would be a relatively small change to the code (comparatively) and could do a lot of the same things as Option 1 via rq's own builtin functionality.
  • The downside is: these Sentry errors would go away, but we'd start seeing a bunch of MastodonRatelimitErrors in their place.
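Sketched with rq's `Retry(max=..., interval=[...])`; the helper below just shows how the per-retry intervals stack into absolute delays from the first failure:

```python
# Option 2 as rq would configure it (requires rq):
#     from rq import Retry
#     queue.enqueue(make_post_for_webhook_event, ...,
#                   retry=Retry(max=3, interval=[20, 60, 300]))

def cumulative_delays(intervals):
    """Seconds after the initial failure at which each retry fires,
    since each interval is measured from the previous attempt."""
    out, total = [], 0
    for step in intervals:
        total += step
        out.append(total)
    return out

# cumulative_delays([20, 60, 300]) -> retries at 20 s, 80 s, and 380 s
```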

Option 3

  • A combination of the first 2 options
  • We'll queue the initial message with rate limit protection as in Option 1, but subsequent retries would be queued without it and use the staggered retry intervals of Option 2
  • This would essentially give us 1 try to do it without throwing a sentry error, but any retries would log a sentry error

All of the options share the same weakness: guessing at what the rate limit actually is. This leaves us in a position where if we guess too low then we'll just error out until all of our retries are gone, but if we guess too high, it would severely delay the sending of Mastodon messages.

Technically there is an option to replace the Mastodon.py library with another one that supports better rate limit handling (or to write our own), but that's an even bigger unknown that I'd have to research. We could also try to convince the maintainers of Mastodon.py (with an issue and PR) to include the rate limit data in the exception, but I'm not sure what their appetite for that would be.

@mlissner
Member

mlissner commented May 7, 2024

Hm, my gut is that the simplest thing is the right answer, or at least the best place to begin to answer this. It's not a problem we get all the time, so maybe we can get away with the really simple thing and call it good enough.

I'm also, if I'm being honest, not super concerned if we miss a mastodon post due to this, because there just aren't many people there, and it's more or less fallen in popularity. I don't want to contribute to that, but also I don't want to bend over backwards if nobody is there.

My other thought is that it probably wouldn't be hard to tweak the mastodon exception to have a useful attribute, so maybe if just retrying using rq isn't enough, that could be the next step (it'd be nice to contribute to the masto community).
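A hypothetical shape for that upstream tweak: let the rate-limit exception carry the reset data the library already parses internally (the class name and attribute here are made up; Mastodon.py's actual class is `MastodonRatelimitError`):

```python
class RatelimitError(Exception):
    """Sketch of a rate-limit exception that exposes the reset window
    instead of only the message 'Hit rate limit.'."""

    def __init__(self, message, reset_in_seconds=None):
        super().__init__(message)
        self.reset_in_seconds = reset_in_seconds

# A caller could then retry precisely instead of guessing:
#     except RatelimitError as e:
#         queue.enqueue_in(timedelta(seconds=e.reset_in_seconds or 60), ...)
```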
