
terraform destroy never finishes #40

Open
mhite opened this issue Mar 7, 2023 · 4 comments
mhite (Contributor) commented Mar 7, 2023

Is there something about the Splunk Dataflow pipeline design that causes it to never be able to successfully drain?

I've gone through the full build + teardown (destroy) process at least a dozen times and have never seen it destroy successfully without intervention by manually canceling the dataflow job in the console.

google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h0m43s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h0m53s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h1m3s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h1m13s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h1m23s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h1m33s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h1m43s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h1m53s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h2m3s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h2m13s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h2m23s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h2m33s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h2m43s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h2m53s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h3m3s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h3m13s elapsed]
ilakhtenkov (Contributor) commented:

I also ran into this a while back.

The fix is to add the on_delete option. It defaults to drain, meaning Dataflow tries to shut the job down gracefully. Setting it to cancel could lose some logs that are in flight during teardown, but it resolves this issue.

I would definitely vote for it. What do you think @rarsan?
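A minimal sketch of the suggested change, assuming the module's google_dataflow_job resource (arguments other than on_delete are illustrative, not this module's actual configuration):

```hcl
# Sketch only: names and variables besides on_delete are placeholders.
resource "google_dataflow_job" "dataflow_job" {
  name              = "splunk-export-pipeline"
  template_gcs_path = var.template_gcs_path
  temp_gcs_location = var.temp_gcs_location

  # "drain" (the provider default) waits for in-flight records to flush,
  # which is what hangs here; "cancel" stops the job immediately but may
  # drop logs still buffered in the pipeline.
  on_delete = "cancel"
}
```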

rarsan (Contributor) commented Mar 21, 2023

I was actually looking at this today but wasn't able to reproduce it. I have encountered this before, but only sporadically.

Forcing cancel instead of drain is a reasonable option, with a proper warning about the potential for data loss. However, I suspect a clean teardown can be ensured by enforcing a particular deletion order: e.g. delete the log sink first, then the Dataflow job, so the sink stops and the job gets a chance to drain. Perhaps another prematurely deleted dependency, like the GCS bucket, is causing the Dataflow job teardown to hang?

@mhite, can you share the order in which resources are deleted in the case where it hangs? Specifically the log sink, Pub/Sub topic, Pub/Sub subscription, GCS bucket, and Dataflow job.

mhite (Contributor, Author) commented Mar 24, 2023

@rarsan -

Does this help?

Do you really want to destroy all resources?
  Terraform will destroy all your managed infrastructure, as shown above.
  There is no undo. Only 'yes' will be accepted to confirm.

  Enter a value: yes

google_pubsub_topic_iam_binding.input_sub_publisher: Destroying... [id=projects/<REDACTED>/topics/export-pipeline-input-topic/roles/pubsub.publisher]
google_secret_manager_secret_iam_member.dataflow_worker_secret_access[0]: Destroying... [id=projects/<REDACTED>/secrets/demo-hec-token/roles/secretmanager.secretAccessor/serviceAccount:export-pipeline-worker@<REDACTED>.iam.gserviceaccount.com]
google_pubsub_subscription_iam_binding.input_sub_subscriber: Destroying... [id=projects/<REDACTED>/subscriptions/export-pipeline-input-subscription/roles/pubsub.subscriber]
google_pubsub_topic_iam_binding.deadletter_topic_publisher: Destroying... [id=projects/<REDACTED>/topics/export-pipeline-deadletter-topic/roles/pubsub.publisher]
google_project_iam_binding.dataflow_worker_role[0]: Destroying... [id=<REDACTED>/roles/dataflow.worker]
google_pubsub_subscription.dataflow_deadletter_pubsub_sub: Destroying... [id=projects/<REDACTED>/subscriptions/export-pipeline-deadletter-subscription]
google_dns_policy.splunk_network_dns_policy[0]: Destroying... [id=projects/<REDACTED>/policies/dataflow-net-dns-policy]
google_pubsub_subscription_iam_binding.input_sub_viewer: Destroying... [id=projects/<REDACTED>/subscriptions/export-pipeline-input-subscription/roles/pubsub.viewer]
google_storage_bucket_iam_binding.dataflow_worker_bucket_access: Destroying... [id=b/<REDACTED>-export-pipeline-6563c6ff/roles/storage.objectAdmin]
google_dataflow_job.dataflow_job: Destroying... [id=2023-03-22_15_44_26-13366291121410881687]
google_dns_policy.splunk_network_dns_policy[0]: Destruction complete after 0s
google_monitoring_group.splunk-export-pipeline-group: Destroying... [id=projects/<REDACTED>/groups/358908245497544187]
google_monitoring_group.splunk-export-pipeline-group: Destruction complete after 1s
google_monitoring_dashboard.splunk-export-pipeline-dashboard: Destroying... [id=projects/<REDACTED>/dashboards/d532a668-79ee-4028-8f7b-374f6017ff91]
google_monitoring_dashboard.splunk-export-pipeline-dashboard: Destruction complete after 0s
google_service_account_iam_binding.terraform_caller_impersonate_dataflow_worker[0]: Destroying... [id=projects/<REDACTED>/serviceAccounts/export-pipeline-worker@<REDACTED>.iam.gserviceaccount.com/roles/iam.serviceAccountUser]
google_pubsub_subscription.dataflow_deadletter_pubsub_sub: Destruction complete after 1s
google_compute_firewall.connect_dataflow_workers[0]: Destroying... [id=projects/<REDACTED>/global/firewalls/dataflow-internal-ip-fwr]
google_secret_manager_secret_iam_member.dataflow_worker_secret_access[0]: Destruction complete after 4s
google_pubsub_topic_iam_binding.deadletter_topic_publisher: Destruction complete after 4s
google_pubsub_topic_iam_binding.input_sub_publisher: Destruction complete after 4s
google_compute_router_nat.dataflow_nat[0]: Destroying... [id=<REDACTED>/us-central1/dataflow-net-us-central1-router/dataflow-net-us-central1-router-nat]
google_logging_project_sink.project_log_sink: Destroying... [id=projects/<REDACTED>/sinks/export-pipeline-project-log-sink]
google_storage_bucket_iam_binding.dataflow_worker_bucket_access: Destruction complete after 5s
google_pubsub_subscription_iam_binding.input_sub_viewer: Destruction complete after 5s
google_service_account_iam_binding.terraform_caller_impersonate_dataflow_worker[0]: Destruction complete after 4s
google_logging_project_sink.project_log_sink: Destruction complete after 1s
google_project_iam_binding.dataflow_worker_role[0]: Destruction complete after 8s
google_pubsub_subscription_iam_binding.input_sub_subscriber: Destruction complete after 9s
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 10s elapsed]
google_compute_firewall.connect_dataflow_workers[0]: Still destroying... [id=projects/<REDACTED>/global/firewalls/dataflow-internal-ip-fwr, 10s elapsed]
google_compute_firewall.connect_dataflow_workers[0]: Destruction complete after 11s
google_compute_router_nat.dataflow_nat[0]: Still destroying... [id=<REDACTED>/us-central1/dataflow...er/dataflow-net-us-central1-router-nat, 10s elapsed]
google_compute_router_nat.dataflow_nat[0]: Destruction complete after 12s
google_compute_router.dataflow_to_splunk_router[0]: Destroying... [id=projects/<REDACTED>/regions/us-central1/routers/dataflow-net-us-central1-router]
google_compute_address.dataflow_nat_ip_address[0]: Destroying... [id=projects/<REDACTED>/regions/us-central1/addresses/dataflow-splunk-nat-ip-address]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 20s elapsed]
google_compute_address.dataflow_nat_ip_address[0]: Still destroying... [id=projects/<REDACTED>/regions/us-...dresses/dataflow-splunk-nat-ip-address, 10s elapsed]
google_compute_router.dataflow_to_splunk_router[0]: Still destroying... [id=projects/<REDACTED>/regions/us-...outers/dataflow-net-us-central1-router, 10s elapsed]
google_compute_router.dataflow_to_splunk_router[0]: Destruction complete after 10s
google_compute_address.dataflow_nat_ip_address[0]: Destruction complete after 11s
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 30s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 40s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 50s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 1m0s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 1m10s elapsed]
...continues forever...

... I go into the console and manually cancel.

google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 7h45m33s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 7h45m43s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 7h45m53s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 7h46m3s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 7h46m13s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 7h46m23s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 7h46m33s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 7h46m43s elapsed]
google_dataflow_job.dataflow_job: Destruction complete after 7h46m53s
random_id.dataflow_job_instance: Destroying... [id=_YQ]
random_id.dataflow_job_instance: Destruction complete after 0s
google_storage_bucket_object.dataflow_job_temp_object: Destroying... [id=<REDACTED>-export-pipeline-6563c6ff-tmp/]
google_compute_subnetwork.splunk_subnet[0]: Destroying... [id=projects/<REDACTED>/regions/us-central1/subnetworks/dataflow-net]
google_pubsub_subscription.dataflow_input_pubsub_subscription: Destroying... [id=projects/<REDACTED>/subscriptions/export-pipeline-input-subscription]
google_service_account.dataflow_worker_service_account[0]: Destroying... [id=projects/<REDACTED>/serviceAccounts/export-pipeline-worker@<REDACTED>.iam.gserviceaccount.com]
google_pubsub_topic.dataflow_deadletter_pubsub_topic: Destroying... [id=projects/<REDACTED>/topics/export-pipeline-deadletter-topic]
google_service_account.dataflow_worker_service_account[0]: Destruction complete after 0s
google_storage_bucket_object.dataflow_job_temp_object: Destruction complete after 0s
google_storage_bucket.dataflow_job_temp_bucket: Destroying... [id=<REDACTED>-export-pipeline-6563c6ff]
google_storage_bucket.dataflow_job_temp_bucket: Destruction complete after 1s
random_id.bucket_suffix: Destroying... [id=ZWPG_w]
random_id.bucket_suffix: Destruction complete after 0s
google_pubsub_subscription.dataflow_input_pubsub_subscription: Destruction complete after 1s
google_pubsub_topic.dataflow_input_pubsub_topic: Destroying... [id=projects/<REDACTED>/topics/export-pipeline-input-topic]
google_pubsub_topic.dataflow_deadletter_pubsub_topic: Destruction complete after 1s
google_pubsub_topic.dataflow_input_pubsub_topic: Destruction complete after 2s
google_compute_subnetwork.splunk_subnet[0]: Still destroying... [id=projects/<REDACTED>/regions/us-central1/subnetworks/dataflow-net, 10s elapsed]
google_compute_subnetwork.splunk_subnet[0]: Destruction complete after 11s
google_compute_network.splunk_export[0]: Destroying... [id=projects/<REDACTED>/global/networks/dataflow-net]
google_compute_network.splunk_export[0]: Still destroying... [id=projects/<REDACTED>/global/networks/dataflow-net, 10s elapsed]
google_compute_network.splunk_export[0]: Still destroying... [id=projects/<REDACTED>/global/networks/dataflow-net, 20s elapsed]
google_compute_network.splunk_export[0]: Still destroying... [id=projects/<REDACTED>/global/networks/dataflow-net, 30s elapsed]
google_compute_network.splunk_export[0]: Still destroying... [id=projects/<REDACTED>/global/networks/dataflow-net, 40s elapsed]
google_compute_network.splunk_export[0]: Still destroying... [id=projects/<REDACTED>/global/networks/dataflow-net, 50s elapsed]
google_compute_network.splunk_export[0]: Still destroying... [id=projects/<REDACTED>/global/networks/dataflow-net, 1m0s elapsed]
google_compute_network.splunk_export[0]: Destruction complete after 1m2s

Destroy complete! Resources: 28 destroyed.

rarsan (Contributor) commented Sep 11, 2023

I neglected to share my findings from analyzing your terraform destroy output: I couldn't trace this to an out-of-order resource deletion. The log sink is deleted before the Dataflow job, as expected, and the GCS bucket and object are deleted afterwards, so my hypothesis that a GCS resource was causing the Dataflow job teardown to hang is incorrect.

I'm OK adding the on_delete option defaulting to cancel for quick prototyping, with a proper warning that it should be changed to drain for production workloads. Should we expose that as a top-level parameter?
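One way to expose it as a top-level parameter; this is only a sketch, and the variable name dataflow_job_on_delete is hypothetical, not the module's actual interface:

```hcl
# Hypothetical top-level variable; the module may choose a different name.
variable "dataflow_job_on_delete" {
  description = "Destroy behavior for the Dataflow job: cancel (fast, may lose in-flight logs) or drain (graceful, recommended for production)."
  type        = string
  default     = "cancel"

  validation {
    condition     = contains(["cancel", "drain"], var.dataflow_job_on_delete)
    error_message = "dataflow_job_on_delete must be \"cancel\" or \"drain\"."
  }
}

resource "google_dataflow_job" "dataflow_job" {
  # ...existing arguments unchanged...
  on_delete = var.dataflow_job_on_delete
}
```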
