EDU-3861: HAN prototyping

temporalio · Jan 31, 2025 · 77dc527 · 77dc527
1 parent 58fdaf7
commit 77dc527
Show file tree

Hide file tree

Showing 8 changed files with 1,050 additions and 0 deletions.
diff --git a/docs/production-deployment/cloud/high-availability/enable.mdx b/docs/production-deployment/cloud/high-availability/enable.mdx
diff --git a/docs/production-deployment/cloud/high-availability/faq.mdx b/docs/production-deployment/cloud/high-availability/faq.mdx
@@ -0,0 +1,219 @@
+---
+id: faq
+title: Frequently Asked Questions
+sidebar_label: Frequently Asked Questions
+slug: /cloud/high-availability/faq
+description: Temporal Cloud's High-Availability Namespaces offer automated failover, synchronized data replication, and high availability for workloads requiring disaster-tolerant deployment and 99.99% uptime. Use Global Namespace for self-hosted.
+tags:
+  - Temporal Cloud
+  - Production
+  - High availability
+keywords:
+  - availability
+  - explanation
+  - failover
+  - high-availability
+  - multi-region
+  - multi-region namespace
+  - namespaces
+  - temporal-cloud
+  - term
+---
+import { RelatedReadContainer, RelatedReadItem } from '@site/src/components/related-read/RelatedRead';
+
+:::tip Support, stability, and dependency info
+
+High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud.
+
+:::
+
+<div style={{backgroundColor: '#ffff00',padding: '0px 15px',borderRadius: '5px',border: '1px solid #cccccc',display: 'inline-block'}}>**Repurposed material. No audits, updates, intros, re-org.  Converted to markdown automatically from Google Doc rich text, so there are many errors**</div>
+
+Failovers
+
+**Q: What is a failover******
+
+A failover shifts Workflow Execution processing from an active Temporal Namespace to a standby Temporal Namespace during outages or other incidents. Standby Namespaces use replication to duplicate data and prevent data loss during failover.
+
+**Q: What failover modes does Temporal use internally?******
+
+Users cannot configure failover modes. The following descriptions explain Temporal Cloud’s internal failover system:
+
+  * ******[Graceful failover**](https://docs.temporal.io/cloud/multi-region#graceful-failover): Replication tasks are fully processed and drained before transferring control to the standby region. Temporal Cloud pauses traffic to the active Namespace before the failover, minimizing the rewind of progress and avoiding data conflicts. The Namespace experiences a short period of unavailability, defaulting to 10 seconds at most. Under most circumstances, the actual time the Namespace is unavailable is much, much shorter than that.
+
+During this period, existing Workflows stop progress. Temporal Cloud returns a "Service unavailable error". State transitions will not happen and tasks are not dispatched. User requests like start/signal Workflow will be rejected while operations are paused during handover. This mode favors _consistency_ over availability.
+
+  * ******[Forced failover**](https://docs.temporal.io/cloud/multi-region#forced-failover): In this mode, a Namespace immediately activates in the standby region. Events not replicated due to replication lag will undergo conflict resolution upon reaching the new active region. This mode prioritizes _availability_ over consistency.
+  * ******[Hybrid failover**](https://docs.temporal.io/cloud/multi-region#hybrid-failover) (Default mode): While graceful failovers are consistent, they aren’t always practical in certain circumstances such as when cells experience outages and/or a critical database is unavailable. Temporal Cloud’s hybrid failover mode limits an initial Graceful failover attempt to 10 seconds or less. If the graceful approach doesn’t resolve the issue, Temporal Cloud automatically switches to a forced failover. This strategy balances consistency and availability requirements. This strategy balances _consistency_ and _availability_ requirements.
+
+**Q: What is the difference between a handover and a failover?******
+
+They are essentially the same thing. It is the process of transferring control from the active to the standby region during outages or other incidents.
+
+
+
+
+
+
+
+
+
+
+**Q: What situation triggers a graceful failover vs a forced one? Who or what triggers it? What are the differences in results?******
+
+Users can initiate a failover, but they can’t control or configure the failover mode. The three failover modes are internal operations on the Temporal side. They are explained for user education.
+
+**Q: Under what circumstances would I want to initiate a failover?******
+
+Normally, we don't expect users to failover when there’s a problem related to Temporal Cloud. Please contact Temporal support if you feel you have a pressing need. You might consider initiating a failover under two circumstances:
+
+  * The “break glass”  (fire alarm) scenario where Temporal hasn’t responded to an outage or is unaware of an outage in the customer environment.
+  * To test the failover functionality and ensure it works properly.
+
+You can still choose to initiate failover if you have issues sourced from your side or your dependencies.
+
+**Q: Under what circumstances does Temporal initiate failovers?******
+
+Temporal Cloud initiates failovers when there are incidents or outages in the cloud provider. This includes failures of databases, storage, etc. We trigger failovers any time we observe increased latencies or an increase in service errors that causes us to violate the SLA that is in our control.
+
+**Q: Are there any other types of failover not listed above?******
+
+No. There are only three types.
+
+**Q: Can we control the hybrid failover timeout?******
+
+No. The timeout is not configurable outside of Temporal.
+
+**Q: Can a failover get stuck? What is the maximum amount of time it can take?******
+
+There is typically no way for failovers to get “stuck”. We follow the hybrid failover method where we try to do a smooth handoff. If that does not take place within 10 seconds, we initiate a “forced” failover.
+
+**Q: Is the 10 seconds maximum unavailability window configurable?******
+
+No, it is not configurable by the user. Extending the wait time is unlikely to increase the chances of graceful failovers during extreme incidents such as when a source region is down.
+
+**Q: How does the client detect that the failover has occurred?******
+
+We do not send real-time failover notifications. Users are notified via email and audit logs.
+
+**Q: Can the customer determine the resolved failover region?******
+
+Users can determine the failover region from the Namespace endpoint’s CNAME (\<namespace>.tmprl.cloud). Whenever Temporal Cloud triggers a failover from the Temporal side, we update the CNAME to point to the new active region. The CNAME points to a Temporal Cloud regional endpoint. For example, a Namespace active in aws us-east-1 points to aws-**us-east-1**.region.tmprl.cloud.
+
+Replication Lag and Latency
+
+**Q: What affects replication latency? What can cause the replication latency to increase?******
+
+Slowdowns in the standby cell, such as capacity issues or outages, can increase replication latency. Otherwise, it is typically a matter of seconds or even less and can be monitored through  [external metrics](https://docs.temporal.io/cloud/multi-region#metrics-operations).
+
+**Q: Can Workflows execute events a second time in the standby Cluster due to replication lag?******
+
+Yes. This is explained in the [conflict resolution](https://docs.temporal.io/cloud/multi-region#conflict-resolution) section of our documentation.
+
+**Q: Is it possible to see whether a failover was graceful, forced, or hybrid?******
+
+No, customers cannot normally view the method used. File a support ticket if there’s a specific need to review a process.
+
+**Q: Is replication lag emitted as a metric?******
+
+Yes, replication lag is a [metric](https://docs.temporal.io/cloud/multi-region#metrics-operations) that we expose.
+
+**Q: Can we see replication information by Workflow type or ID?******
+
+_[Answer in progress]___
+
+**Q: Is the data replicated in order?******
+
+For a single Workflow, events are replicated  in order. There's no ordering guarantee for replication of events between different Workflows.
+
+**Q: What happens if both regions become active simultaneously?******
+
+This only happens when there's a network partition or delays in the Namespace replication queue. Normally, when cells can talk to each other, only one region will ever become active.
+
+If both regions have become active and both have active Workers, Workflows will run independently based on their local History. Workers fetch tasks from their assigned region. With global Worker setups, Workers fetch tasks from the ‘true’ active region as known by Temporal Cloud. Eventually, when the network partition heals, History is merged via conflict resolution and one side wins.
+
+**Q: What if DNS is still updating during a network partition between Clusters? ******
+
+In this situation, the now passive Cluster can’t forward requests to the new active Cluster. However, DNS normally points to the correct active Cluster without forwarding. Workers configured to point to the standby Cluster can be reconfigured to point to the active Cluster.
+
+Conflict Resolution
+
+See [this Notion Page](https://www.notion.so/temporalio/Conflict-Resolution-Example-83e9dec0f8f246ee8584995ae2e408f4) for an example Conflict Resolution.
+
+**Q: How are conflicts resolved?******
+
+Each cell has a version number, which is used in Event History metadata. Failover operations increase that number. Events with the highest number win during conflict resolution.
+
+**Q: What happens to Workflows if conflicts can’t be resolved?******
+
+This can only happen if there is a bug in the conflict resolution.  If there is a bug in conflict resolution, those events are placed in a dead letter queue to unblock replication. Temporal will resolve the issue and reapply the events.
+
+Customer impact is limited to the affected Workflows. The rest of the system continues as normal.
+
+**Q: How is History affected if conflicts can’t be resolved?******
+
+Same as above.
+
+**Q: How do customers detect unresolved conflicts?******
+
+Unresolved conflicts are not made visible to customers. Temporal directs unresolvable conflicts (conflicts that require Temporal on-call intervention) into a dead letter queue and makes sure those conflicts are resolved and their events re-applied.
+
+**Q: How do customers manually resolve conflicts?******
+
+No manual resolution by customers is needed unless Temporal cannot handle a specific scenario.
+
+**Q: Are non-selected event histories deleted during automatic conflict resolution? ******
+
+No. They are hidden but not deleted. We do not expose access to non-selected events to customers.
+
+Data Loss
+
+**Q: Under what circumstances would a Workflow Execution be unrecoverable if it was started but not replicated before failover?******
+
+The normal time difference between the two operations is typically measured in single-digit seconds.  So this scenario can only happen if the Cluster is healthy enough to accept the Workflow start request and fails to replicate this event. This is very unlikely. If it did happen, the started Workflow is recovered after the Cluster is itself recovered. The only possibility of data loss would require that Temporal lose contact with the previously active cell after permanently completing an operation.
+
+Metrics and Observability
+
+**Q: What information can be pulled from MRN metrics?******
+
+This is [documented](https://docs.temporal.io/cloud/multi-region#metrics-operations).
+
+Always check metric replication lag before initiating a failover test or emergency failover. A forced failover when there is a large replication lag has a higher likelihood of rolling back Workflow progress.
+
+**Q: What warning signs Signal that a failover may be arriving?******
+
+You should always be prepared for failover. One could happen at any point in time.
+
+We notify customers when a failover occurs. There is no time lapse between discovering failover prerequisites and the failover itself.
+
+Other
+
+**Q: Can Signals be sent twice since multi-region doesn't provide at-most-once delivery?******
+
+During conflict resolution, a Signal could be applied twice.
+
+**Q: What happens if the active region is unavailable for an extended period and the standby region does not have the most recent Signal?**** ******
+
+Workers cannot process the Signal as it won’t be present in any available region.
+
+**Q: If the active region remains unavailable for an extended period, does the active role switch to the standby region? ******
+
+If Temporal Cloud initiated the failover, it will “fail back” to the original active region once the incident is fully resolved. Otherwise, the active role remains with the newly active (formerly standby) region.
+
+**Q: ****[Is there a way to determine the region an event ran in via the UI?******](https://temporaltechnologies.slack.com/archives/C04V0LSU5S6/p1717092196761469)
+
+Not at the moment.
+
+**Q: ****[Can we show branching in the UI?******](https://temporaltechnologies.slack.com/archives/C04V0LSU5S6/p1717171982044329)
+
+Not at the moment.
+
+**Q: What should customers worry about in terms of Signals and events synchronization?******
+
+Signals are cherry-picked during conflict resolution if there is replication lag and conflict. Workflows can theoretically revert multiple steps.
+
+Customers should decide whether to add logic to handle this or manually fix affected Workflows if they believe the risk is low. Other known limitations have been [documented](https://docs.temporal.io/cloud/multi-region#architecture) around causality and so forth.
+
+**Q: How much time does it take to reconcile data after an incident is resolved? ******
+
+It depends on the distribution of Workflows. If evenly distributed, data can sync quickly. If concentrated in a single partition, it could take hours. Do not “fail back” your region (revert it to the original active region) until the data is fully reconciled and the other region has caught up.
diff --git a/docs/production-deployment/cloud/high-availability/guarantees.mdx b/docs/production-deployment/cloud/high-availability/guarantees.mdx
@@ -0,0 +1,61 @@
+---
+id: guarantees
+title: Guarantees and availability
+sidebar_label: Guarantees and availability
+slug: /cloud/high-availability/guarantees
+description: Temporal Cloud's High-Availability Namespaces offer automated failover, synchronized data replication, and high availability for workloads requiring disaster-tolerant deployment and 99.99% uptime. Use Global Namespace for self-hosted.
+tags:
+  - Temporal Cloud
+  - Production
+  - High availability
+keywords:
+  - availability
+  - explanation
+  - failover
+  - high-availability
+  - multi-region
+  - multi-region namespace
+  - namespaces
+  - temporal-cloud
+  - term
+---
+
+:::tip Support, stability, and dependency info
+
+High-availability Namespaces are in [Public Preview](/evaluate/development-production-features/release-stages#public-preview) for Temporal Cloud.
+
+:::
+
+<div style={{backgroundColor: '#ffff00',padding: '0px 15px',borderRadius: '5px',border: '1px solid #cccccc',display: 'inline-block'}}>**Maybe these could just be parts of the FAQ?**</div>
+
+
+## Multi-region Namespace SLA {#sla}
+
+**What guarantees does Temporal offer for multi-region Namespaces?**
+
+Multi-region Namespaces offer 99.99% availability, enforced by Temporal Cloud's [service error rates SLA](https://docs.temporal.io/cloud/sla).
+Our system is designed to limit data loss after recovery when the incident triggering the failover is resolved.
+
+Our recovery point objective ([RPO](https://en.wikipedia.org/wiki/Disaster_recovery#Recovery_Point_Objective)) is near-zero.
+There may be a short period of time during an incident or forced failover when some data is unavailable in the standby region.
+Some Workflow History data won't arrive until networks issue are fixed, enabling the History to finish replicating and the divergent History branches to reconcile.
+
+Temporal Cloud proactively responds to incidents by triggering failovers.
+Our recovery time objective ([RTO](https://en.wikipedia.org/wiki/Disaster_recovery#Recovery_Time_Objective)) is 20 minutes or less per incident.
+
+:::info
+
+During a disaster scenario in which the data on the hard drives in the active region cannot be recovered, the duration of data loss may be as high as the [replication lag](/cloud/multi-region#replication-lag) at the time of disaster.
+
+:::
+
+### Regional availability {#regional-availability}
+
+Multi-region Namespaces are available in all existing [Temporal Cloud regions](/cloud/service-availability#regions).
+
+:::tip
+
+Namespace pairing is currently limited to regions within the same continent.
+South America is excluded as only one region is available.
+
+:::