Timer for AWS credentials refresh gets detached #110

gwiesing · 2023-07-19T08:35:54Z

(check apply)

read the contribution guideline

Problem

After about 1 week of running smoothly after a new deployment to AWS Fargate we got - the first time ever - an error showing up that during the AWS credentials renewal the refresh timer gets detached. Other deployments have not shown this yet, even after a month or more running.
...

Steps to replicate

Error log

Jul 18, 2023 @ 15:13:11.981[os] Timer detached. title=:out_opensearch_expire_credentials error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/plugins/sign.rb:49:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/plugins/http_checksum.rb:19:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/plugins/endpoint_pattern.rb:30:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/plugins/request_compression.rb:94:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-sts/plugins/endpoints.rb:41:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/seahorse/client/plugins/raise_response_errors.rb:16:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/plugins/checksum_algorithm.rb:111:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/plugins/jsonvalue_converter.rb:16:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/assume_role_credentials.rb:65:in `refresh' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/fluent-plugin-opensearch-1.1.3/lib/fluent/plugin/out_opensearch.rb:238:in `aws_credentials' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/fluent-plugin-opensearch-1.1.3/lib/fluent/plugin/out_opensearch.rb:353:in `block (2 levels) in configure' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/fluent-plugin-opensearch-1.1.3/lib/fluent/plugin/out_opensearch.rb:351:in `block in configure' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/fluentd-1.16.1/lib/fluent/plugin_helper/thread.rb:78:in `block in thread_create' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/seahorse/client/net_http/handler.rb:128:in `session' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/seahorse/client/plugins/request_callback.rb:87:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/xml/error_handler.rb:10:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/plugins/transfer_encoding.rb:26:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/plugins/checksum_algorithm.rb:136:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/plugins/idempotency_token.rb:19:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/seahorse/client/plugins/response_target.rb:24:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-sts/client.rb:857:in `assume_role' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/refreshing_credentials.rb:30:in `initialize' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/fluent-plugin-opensearch-1.1.3/lib/fluent/plugin/out_opensearch.rb:351:in `synchronize' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/cool.io-1.7.1/lib/cool.io/loop.rb:88:in `run_once' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/fluentd-1.16.1/lib/fluent/plugin_helper/event_loop.rb:93:in `block in start' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/3.1.0/net/http.rb:1040:in `connect' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/3.1.0/net/http.rb:990:in `start' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/seahorse/client/net_http/connection_pool.rb:100:in `session_for' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/seahorse/client/net_http/handler.rb:76:in `transmit' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/seahorse/client/net_http/handler.rb:50:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/seahorse/client/plugins/content_length.rb:24:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/plugins/helpful_socket_errors.rb:12:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/plugins/retry_errors.rb:360:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/query/handler.rb:30:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/plugins/param_converter.rb:26:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/plugins/response_paging.rb:12:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/assume_role_credentials.rb:53:in `initialize' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/fluent-plugin-opensearch-1.1.3/lib/fluent/plugin/out_opensearch.rb:238:in `new' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/fluentd-1.16.1/lib/fluent/plugin_helper/timer.rb:80:in `on_timer' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/3.1.0/net/http.rb:995:in `do_start' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/seahorse/client/net_http/connection_pool.rb:307:in `start_session' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/plugins/user_agent.rb:37:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/plugins/recursion_detection.rb:18:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/seahorse/client/plugins/endpoint.rb:47:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/cool.io-1.7.1/lib/cool.io/loop.rb:88:in `run' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/3.1.0/net/http/response.rb:151:in `value' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/3.1.0/delegate.rb:87:in `method_missing' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/plugins/endpoint_discovery.rb:84:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/aws-sdk-core/plugins/param_validator.rb:26:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/seahorse/client/plugins/request_callback.rb:71:in `call' error
Jul 18, 2023 @ 15:13:11.980/usr/lib/ruby/gems/3.1.0/gems/aws-sdk-core-3.177.0/lib/seahorse/client/request.rb:72:in `send_request' error
Jul 18, 2023 @ 15:13:11.979/usr/lib/ruby/3.1.0/net/http/response.rb:142:in `error!' error
Jul 18, 2023 @ 15:13:11.979[os] Unexpected error raised. Stopping the timer. title=:out_opensearch_expire_credentials error_class=Net::HTTPClientException error="400 \"Bad Request\"" error
Jul 18, 2023 @ 15:13:11.410[os] Recreate the AWS credentials

Endpoint configuration

      <endpoint>
        url "#{ENV['OS_URL']}"
        region "#{ENV['AWS_REGION']}"
        assume_role_arn "#{ENV['OS_ROLE_ARN']}"
        assume_role_session_name fluentd
        sts_credentials_region "#{ENV['AWS_REGION']}"
        refresh_credentials_interval "#{ENV['AWS_REFRESH_CREDENTIALS_INTERVAL']}"
      </endpoint>

The credentials refresh interval is set to 20m. The component is acting as a logging aggregator, forwarding logs from other Fluentd and Application components to OS.

Expected Behavior or What you need to ask

We assume the unexpected error is actually coming from the aws-sdk (or a STS hickup?) and since this is the first time ever to be observed it is hard to determine the cause. Due to the detachment of the timer there is no recovery possible from this error as after the token expiration all requests against OS are failing. There is no further credentials refresh happening.

If asked for an expected behavior this might be either a later retry attempt for the token refresh (better) or maybe a Fluentd exit so that the AWS Fargate task can be restarted (worse).
...

Using Fluentd and OpenSearch plugin versions

AWS Fargate linux
fluentd-1.16.1, ruby="3.1.4"
gem 'fluent-plugin-opensearch' version '1.1.3'

The text was updated successfully, but these errors were encountered:

gwiesing · 2023-09-11T15:12:22Z

For the purpose of testing file buffer fill up and flushes I have blocked (any) outgoing traffic and the error log appears again when the plugin tries to refresh the credentials.
Failing to renew the token is expected behavior with blocked traffic.

Unexpected error raised. Stopping the timer. title=:out_opensearch_expire_credentials error_class=Seahorse::Client::NetworkingError error="Failed to open TCP connection to xxxx (execution expired)"

When re-enabling the out traffic, the OS output plugin is not able to communicate with OpenSearch due to token expiration and not renewing the token.
Consider it as a bug or a request, the credentials refresh should not stop with a detached timer which leaves fluentd sitting there with full buffers not being able to flush them and the need of manual interference.
I understand that blocked traffic is or should never happen and also token renewal service should always (high availability) be responsive.

ntopee · 2024-08-26T10:09:08Z

Is there any update/workaround on this?

We are still experiencing the same issue in EKS. We have multiple regions that have been working perfectly for over a year, but in 2 of our regions, sometimes (once in a few days) the sts call times out resulting in a networking error, the timer stops, and the token expires.

cosmo0920 · 2024-10-02T08:48:55Z

This should be also fixed in #142 and v1.1.5.
If #142 doesn't fix this issue, please feel free to comment this issue.

ntopee mentioned this issue Aug 26, 2024

If fluent-plugin-opensearch faied to refresh @_aws_credentials, it won't refresh @_aws_credentials anymore #129

Closed

2 tasks

cosmo0920 closed this as completed Oct 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Timer for AWS credentials refresh gets detached #110

Timer for AWS credentials refresh gets detached #110

gwiesing commented Jul 19, 2023 •

edited

Loading

gwiesing commented Sep 11, 2023

ntopee commented Aug 26, 2024

cosmo0920 commented Oct 2, 2024

Timer for AWS credentials refresh gets detached #110

Timer for AWS credentials refresh gets detached #110

Comments

gwiesing commented Jul 19, 2023 • edited Loading

Problem

Steps to replicate

Expected Behavior or What you need to ask

Using Fluentd and OpenSearch plugin versions

gwiesing commented Sep 11, 2023

ntopee commented Aug 26, 2024

cosmo0920 commented Oct 2, 2024

gwiesing commented Jul 19, 2023 •

edited

Loading