Post Mortem: 2024-11-13 - Worker downtime due to failed recovery #4463
birkjernstrom started this conversation in Core Development
Summary
Our worker did not process enqueued tasks for 8h 15m due to a failed recovery from Redis maintenance. This prevented important asynchronous tasks from being handled on time, e.g. order confirmations, benefit grants and webhooks. All enqueued tasks were stored and could be replayed successfully for the system to catch up.
Started: 2024-11-12 23.10 UTC
Discovered: 2024-11-13 06.30 UTC
Resolved: 2024-11-13 07.25 UTC
Status
Resolved. All systems operational. No data loss.
What happened?
Our cloud provider (Render) performed maintenance on our Redis instance, which stores our enqueued MQ tasks. Unfortunately, despite the maintenance completing quickly (<1 min), our worker ended up in a zombie state afterwards: its processes were still active (so no automated recovery, alerts or escalation kicked in), but its connections were dead and it therefore never received new events to process.
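This is the classic failure mode of a blocking consumer on a half-open connection: the process waits forever on a read that will never return. Below is a minimal sketch of the pattern using plain redis-py (illustrative only - our worker runs a task-queue library on top of Redis, and the queue key here is made up). redis-py's `socket_timeout` and `health_check_interval` options bound how long a dead connection can go unnoticed:

```python
import redis

# Without timeouts, BRPOP can block forever on a half-open connection
# (e.g. after server maintenance): the process stays alive -- the
# "zombie" state -- but never sees another task.
r = redis.Redis(
    host="localhost",
    port=6379,
    socket_timeout=30,          # fail reads that hang for more than 30s
    socket_connect_timeout=5,   # fail dead reconnect attempts quickly
    health_check_interval=30,   # PING before reusing an idle connection
)

while True:
    try:
        # Block up to 5s for a task; "polar:tasks" is a made-up queue key.
        item = r.brpop("polar:tasks", timeout=5)
    except redis.exceptions.ConnectionError:
        # Surface the failure instead of hanging silently: crash so the
        # platform restarts the process, or reconnect and alert here.
        raise
    if item is not None:
        _, payload = item
        print("processing", payload)
```

With these timeouts a dead connection raises within seconds instead of leaving the process blocked indefinitely.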
All events were stored & enqueued for processing. Otherwise, we throw `HTTP 500` to external parties (Stripe & GitHub), triggering their automatic retries (and, for Stripe, the ability to replay), and we would have received alerts for API downtime. So all checkouts & payments were still online, operational and stored for processing on our end. But order creation, confirmation emails, benefit grants & webhooks performed by enqueued events were not handled during this incident.

Given the zombie state, no alarms were triggered, so the incident was unfortunately discovered manually: a Polar user reported on Discord that they were not receiving webhooks. @birkjernstrom saw it at ~06.30 UTC (07.30 local time) and started investigating immediately. It was fully resolved & all systems restored & caught up by 07.25 UTC.
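That store-or-500 contract mentioned above is simple. A minimal FastAPI-style sketch (hypothetical handler and helper names - not our actual endpoint, which among other things also verifies webhook signatures):

```python
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()

async def store_and_enqueue(payload: bytes) -> None:
    # Hypothetical helper: persist the raw event, then push a task onto
    # the Redis-backed queue. Raises on any storage/enqueue failure.
    ...

@app.post("/webhooks/stripe")
async def stripe_webhook(request: Request):
    payload = await request.body()
    try:
        await store_and_enqueue(payload)
    except Exception:
        # A 5xx tells Stripe the delivery failed, so it retries with
        # backoff and would also trip our API downtime alerts.
        raise HTTPException(status_code=500, detail="enqueue failed")
    # 200 means "stored", not "processed": actual handling happens
    # asynchronously in the worker, which is why the API stayed green
    # while the worker was down.
    return {"status": "queued"}
```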
Why?
We have automated recovery & escalation in case the worker processes terminate. In this scenario, however, they never did: the processes stayed alive while their connection pool went stale and stopped receiving the enqueued tasks it listens for. The result was a zombie state of active processes that no longer received any events.
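One way to catch this class of failure is a heartbeat that proves the worker is actually consuming, not merely running. A minimal sketch under assumed names (the key name and TTL are illustrative; this is an idea, not the fix we shipped): the worker refreshes a short-TTL key from inside its consume loop, and a separate monitor alerts when the key expires.

```python
import time
import redis

r = redis.Redis()

HEARTBEAT_KEY = "polar:worker:heartbeat"  # hypothetical key name
HEARTBEAT_TTL = 60  # seconds; must comfortably outlive one loop iteration

def beat() -> None:
    # Called from inside the consume loop, so it only fires while the
    # worker is genuinely pulling tasks -- a zombie worker stops beating.
    r.set(HEARTBEAT_KEY, int(time.time()), ex=HEARTBEAT_TTL)

def worker_is_alive() -> bool:
    # Run from a separate monitor process or cron: if the key has
    # expired, the worker hasn't consumed anything for HEARTBEAT_TTL
    # seconds, even if its processes still look healthy.
    return r.exists(HEARTBEAT_KEY) == 1
```

A process-level health check would have reported OK throughout this incident; a consume-loop heartbeat would not have.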
What did we do to resolve it?
- 06.30: Reviewed the health of our server processes - all OK on a surface level (the root problem).
- 06.30 - 06.35: Tested to confirm the reported issue in our sandbox environment. Confirmed.
- 06.35 - 06.40: Acknowledged the incident on Discord & informed the team.
- 06.40: Checked the logs for the worker. Saw tasks (from the worker's own scheduler of cron-like events) executing, so at a quick glance it looked operational.
- 06.40 - 07.10: Believed (incorrectly) it was an issue with our API due to the above. Confirmed we received webhooks from Stripe (`HTTP 200`, i.e. stored) and started debugging & reviewing changes made to the API. Nothing.
- 07.10 - 07.24: Connected to Redis directly (CLI) to `MONITOR` incoming operations while testing the API (see the example after this list). Confirmed events were stored correctly - confirmation that the issue was isolated to processing (the worker).
- 07.24: Restarted the worker manually (took seconds) and saw the logs fill up as all the enqueued events were processed.
- 07.25: Manually reviewed the affected orders (4) to confirm all events, webhooks etc. had been caught up.

I (Birk) then shared the resolution & a short post mortem with our community on Discord. I'm proceeding to reach out 1:1 to the developers using Polar who were impacted by this to apologize, confirm they've received webhooks & get permission to reach out to the impacted customers (4) to apologize, explain & offer support.
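For reference, the 07.10 check looked roughly like this (the key and payload are illustrative; the exact commands depend on the queue library, e.g. `LPUSH` or `ZADD`):

```console
$ redis-cli MONITOR
OK
1731482400.123456 [0 10.0.0.5:53210] "LPUSH" "polar:tasks" "{...}"
1731482405.654321 [0 10.0.0.5:53210] "LPUSH" "polar:tasks" "{...}"
```

Enqueue operations from the API kept arriving, but no corresponding reads from the worker ever showed up - which pins the problem on the processing side.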
What should we learn & improve from it?
- Alerting for critical enqueued tasks, e.g. `invoice.paid`, that triggers unless the task has been successfully handled within X minutes (Ops: Health endpoint for delayed - critical - enqueued tasks #4465). A sketch of this idea follows after this list.

In terms of softer values:
- Use `MONITOR` on Redis directly to confirm which side the issue is on: enqueue (API) or processing (worker).
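Here is a minimal sketch of the health-endpoint idea behind #4465, under assumed names (the sorted-set bookkeeping, key name and threshold are all illustrative; the real design lives in the linked issue): record an enqueue timestamp per critical task, and report unhealthy if the oldest unhandled one exceeds the deadline.

```python
import time

import redis
from fastapi import FastAPI, Response

app = FastAPI()
r = redis.Redis()

# Hypothetical bookkeeping: when a critical task (e.g. invoice.paid) is
# enqueued, ZADD its id with the enqueue time; ZREM it once handled.
PENDING_KEY = "polar:critical:pending"
MAX_DELAY_SECONDS = 5 * 60  # the "X minutes" from the learning above

@app.get("/healthz/worker")
def worker_health(response: Response):
    # Oldest still-pending critical task, if any.
    oldest = r.zrange(PENDING_KEY, 0, 0, withscores=True)
    if oldest:
        _, enqueued_at = oldest[0]
        delay = time.time() - enqueued_at
        if delay > MAX_DELAY_SECONDS:
            # A zombie worker shows up here even though its processes
            # are alive -- exactly the gap in this incident.
            response.status_code = 503
            return {"status": "delayed", "oldest_task_seconds": int(delay)}
    return {"status": "ok"}
```

Unlike a process-level check, this measures the outcome we actually care about: critical tasks being handled on time.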