Post Mortem: 2024-11-13 - Worker downtime due to failed recovery #4463
birkjernstrom started this conversation in Core Development
Summary
Our worker did not process enqueued tasks for 8h 15m due to a failed recovery from Redis maintenance. This prevented important asynchronous tasks from being handled on time, e.g. order confirmations, benefit grants and webhooks. All enqueued tasks were stored and could be replayed successfully for the system to catch up.
Started: 2024-11-12 23.10 UTC
Discovered: 2024-11-13 06.30 UTC
Resolved: 2024-11-13 07.25 UTC
Status
Resolved. All systems operational. No data loss.
What happened?
Our cloud provider (Render) performed maintenance on our Redis instance, which stores our enqueued MQ tasks. Unfortunately, despite the maintenance completing quickly (<1 min), our worker ended up in a zombie state afterwards: its processes were still active (so no automated recovery, alerts or escalation kicked in), but its connections were dead and it therefore never received new events to process.
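This is the classic failure mode of a blocking consumer on a half-open connection: the process waits forever on a read that will never return. Below is a minimal sketch of the pattern using plain redis-py (illustrative only - our worker runs a task-queue library on top of Redis, and the queue key here is made up). redis-py's `socket_timeout` and `health_check_interval` options bound how long a dead connection can go unnoticed:

```python
import redis

# Without timeouts, BRPOP can block forever on a half-open connection
# (e.g. after server maintenance): the process stays alive -- the
# "zombie" state -- but never sees another task.
r = redis.Redis(
    host="localhost",
    port=6379,
    socket_timeout=30,          # fail reads that hang for more than 30s
    socket_connect_timeout=5,   # fail dead reconnect attempts quickly
    health_check_interval=30,   # PING before reusing an idle connection
)

while True:
    try:
        # Block up to 5s for a task; "polar:tasks" is a made-up queue key.
        item = r.brpop("polar:tasks", timeout=5)
    except redis.exceptions.ConnectionError:
        # Surface the failure instead of hanging silently: crash so the
        # platform restarts the process, or reconnect and alert here.
        raise
    if item is not None:
        _, payload = item
        print("processing", payload)
```

With these timeouts a dead connection raises within seconds instead of leaving the process blocked indefinitely.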
All events were stored & enqueued for processing. Otherwise, we throw `HTTP 500` to external parties (Stripe & GitHub), triggering their automatic retries (and, for Stripe, the ability to replay), and we would have received alerts for API downtime. So all checkouts & payments were still online, operational and stored for processing on our end. But order creation, confirmation emails, benefit grants & webhooks performed by enqueued events were not handled during this incident.

Given the zombie state, no alarms were triggered, so the incident was unfortunately discovered manually: a Polar user reported on Discord that they were not receiving webhooks. @birkjernstrom saw it at ~06.30 UTC (07.30 local time) and started investigating immediately. It was fully resolved & all systems restored & caught up by 07.25 UTC.
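That store-or-500 contract mentioned above is simple. A minimal FastAPI-style sketch (hypothetical handler and helper names - not our actual endpoint, which among other things also verifies webhook signatures):

```python
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()

async def store_and_enqueue(payload: bytes) -> None:
    # Hypothetical helper: persist the raw event, then push a task onto
    # the Redis-backed queue. Raises on any storage/enqueue failure.
    ...

@app.post("/webhooks/stripe")
async def stripe_webhook(request: Request):
    payload = await request.body()
    try:
        await store_and_enqueue(payload)
    except Exception:
        # A 5xx tells Stripe the delivery failed, so it retries with
        # backoff and would also trip our API downtime alerts.
        raise HTTPException(status_code=500, detail="enqueue failed")
    # 200 means "stored", not "processed": actual handling happens
    # asynchronously in the worker, which is why the API stayed green
    # while the worker was down.
    return {"status": "queued"}
```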
Why?
We have automated recovery & escalation in case the worker processes terminate. In this scenario, however, they never did: the processes stayed alive while their connection pool went stale and stopped receiving the enqueued tasks it listens for. The result was a zombie state of active processes that no longer received any events.
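One way to catch this class of failure is a heartbeat that proves the worker is actually consuming, not merely running. A minimal sketch under assumed names (the key name and TTL are illustrative; this is an idea, not the fix we shipped): the worker refreshes a short-TTL key from inside its consume loop, and a separate monitor alerts when the key expires.

```python
import time
import redis

r = redis.Redis()

HEARTBEAT_KEY = "polar:worker:heartbeat"  # hypothetical key name
HEARTBEAT_TTL = 60  # seconds; must comfortably outlive one loop iteration

def beat() -> None:
    # Called from inside the consume loop, so it only fires while the
    # worker is genuinely pulling tasks -- a zombie worker stops beating.
    r.set(HEARTBEAT_KEY, int(time.time()), ex=HEARTBEAT_TTL)

def worker_is_alive() -> bool:
    # Run from a separate monitor process or cron: if the key has
    # expired, the worker hasn't consumed anything for HEARTBEAT_TTL
    # seconds, even if its processes still look healthy.
    return r.exists(HEARTBEAT_KEY) == 1
```

A process-level health check would have reported OK throughout this incident; a consume-loop heartbeat would not have.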
What did we do to resolve it?
- 06.30: Reviewed the health of our server processes - all OK on a surface level (the root problem).
- 06.30 - 06.35: Tested to confirm the reported issue in our sandbox environment. Confirmed.
- 06.35 - 06.40: Acknowledged the incident on Discord & informed the team.
- 06.40: Checked the logs for the worker. Saw tasks (from the worker's own scheduler of cron-like events) executing, so at a quick glance it looked operational.
- 06.40 - 07.10: Believed (incorrectly) it was an issue with our API due to the above. Confirmed we received webhooks from Stripe (`HTTP 200`, i.e. stored) and started debugging & reviewing changes made to the API. Nothing.
- 07.10 - 07.24: Connected to Redis directly (CLI) to `MONITOR` incoming operations while testing the API (see the example after this list). Confirmed events were stored correctly - confirmation that the issue was isolated to processing (the worker).
- 07.24: Restarted the worker manually (took seconds) and saw the logs fill up as all the enqueued events were processed.
- 07.25: Manually reviewed the affected orders (4) to confirm all events, webhooks etc. had been caught up.

I (Birk) then shared the resolution & a short post mortem with our community on Discord. I'm proceeding to reach out 1:1 to the developers using Polar who were impacted by this to apologize, confirm they've received webhooks & get permission to reach out to the impacted customers (4) to apologize, explain & offer support.
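For reference, the 07.10 check looked roughly like this (the key and payload are illustrative; the exact commands depend on the queue library, e.g. `LPUSH` or `ZADD`):

```console
$ redis-cli MONITOR
OK
1731482400.123456 [0 10.0.0.5:53210] "LPUSH" "polar:tasks" "{...}"
1731482405.654321 [0 10.0.0.5:53210] "LPUSH" "polar:tasks" "{...}"
```

Enqueue operations from the API kept arriving, but no corresponding reads from the worker ever showed up - which pins the problem on the processing side.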
What should we learn & improve from it?
- Alerting for critical enqueued tasks, e.g. `invoice.paid`, that triggers unless the task has been successfully handled within X minutes (Ops: Health endpoint for delayed - critical - enqueued tasks #4465). A sketch of this idea follows after this list.

In terms of softer values:
- Use `MONITOR` on Redis directly to confirm which side the issue is on: enqueue (API) or processing (worker).
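Here is a minimal sketch of the health-endpoint idea behind #4465, under assumed names (the sorted-set bookkeeping, key name and threshold are all illustrative; the real design lives in the linked issue): record an enqueue timestamp per critical task, and report unhealthy if the oldest unhandled one exceeds the deadline.

```python
import time

import redis
from fastapi import FastAPI, Response

app = FastAPI()
r = redis.Redis()

# Hypothetical bookkeeping: when a critical task (e.g. invoice.paid) is
# enqueued, ZADD its id with the enqueue time; ZREM it once handled.
PENDING_KEY = "polar:critical:pending"
MAX_DELAY_SECONDS = 5 * 60  # the "X minutes" from the learning above

@app.get("/healthz/worker")
def worker_health(response: Response):
    # Oldest still-pending critical task, if any.
    oldest = r.zrange(PENDING_KEY, 0, 0, withscores=True)
    if oldest:
        _, enqueued_at = oldest[0]
        delay = time.time() - enqueued_at
        if delay > MAX_DELAY_SECONDS:
            # A zombie worker shows up here even though its processes
            # are alive -- exactly the gap in this incident.
            response.status_code = 503
            return {"status": "delayed", "oldest_task_seconds": int(delay)}
    return {"status": "ok"}
```

Unlike a process-level check, this measures the outcome we actually care about: critical tasks being handled on time.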