High WA% without workload #2270

Closed
fe-ax opened this issue Dec 7, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@fe-ax

fe-ax commented Dec 7, 2023

Describe the bug
A high WA% (Waiting for I/O) time while nothing is happening on the DB. CPU usage is nearly 0%.

To Reproduce
Steps to reproduce the behavior:

  1. Run with command `dragonfly --logtostderr --maxmemory=4gb --save_schedule=*:* --hz=5 --dbfilename dump.rdb --df_snapshot_format=false`

Expected behavior
Lower WA% when no workload is present.
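
For reference, the WA% that top reports is the share of CPU time the kernel accounts to the iowait field of /proc/stat. A minimal illustrative sketch (not part of the original report) that computes the same figure over a one-second interval:

```python
# Illustrative sketch: approximate the WA% that top shows by sampling the
# aggregate "cpu" line of /proc/stat twice and taking the iowait share.
# Field order after the "cpu" label: user nice system idle iowait irq softirq steal ...
import time

def cpu_fields():
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]  # drop the "cpu" label

def iowait_percent(interval=1.0):
    before = cpu_fields()
    time.sleep(interval)
    after = cpu_fields()
    delta = [a - b for a, b in zip(after, before)]
    total = sum(delta)
    return 100.0 * delta[4] / total if total else 0.0  # index 4 == iowait

if __name__ == "__main__":
    print(f"iowait over the last second: {iowait_percent():.1f}%")
```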

Screenshots
[screenshot attached in the original issue]

Environment (please complete the following information):

  • OS:
sh-4.2# cat /etc/os-release
      NAME="Amazon Linux"
      VERSION="2"
      ID="amzn"
      ID_LIKE="centos rhel fedora"
      VERSION_ID="2"
      PRETTY_NAME="Amazon Linux 2"
      CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
  • Kernel: Linux ip-10-117-39-51.eu-central-1.compute.internal 5.10.198-187.748.amzn2.x86_64 #1 SMP Tue Oct 24 19:49:54 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Containerized?: Kubernetes
  • Dragonfly Version:
    dragonfly v1.13.0-f39eac5bcaf7c8ffe5c433a0e8e15747391199d9
    build time: 2023-12-04 15:59:48

Reproducible Code Snippet
N/A

Additional context

  • We are using an EBS disk to write the dump file
  • The EBS disk is a 1 GB gp3 volume with 3000 IOPS available
  • Other workloads that use the persistent disk don't show this behaviour
  • AWS metrics show almost no sign of workload

[screenshot of the AWS metrics attached in the original issue]

fe-ax added the bug label on Dec 7, 2023
@chakaz
Collaborator

chakaz commented Dec 7, 2023

First, let me make sure I understand this issue correctly: you do not experience worse performance (like throughput / latency), but the process seems to be waiting for I/O more than in other deployments. Is that correct?

Is WA% always high in this deployment, or only during writes to disk? (I see that you're saving an RDB every minute.)

When you say "Other workloads that use the persistent disk don't show this behaviour" - what are the differences between this deployment and the others? Do they use different disks?

And finally, a few unrelated questions:

  • Why do you use `--hz=5`?
  • Similarly, why disable Dragonfly's snapshot format (via `--df_snapshot_format=false`)?
  • May I ask how you use Dragonfly? With what load, for which purpose, etc.?

Thanks!

@romange
Collaborator

romange commented Dec 7, 2023

Duplicate of #2181
@fe-ax it's a kernel change in how CPU time is attributed in the io_uring API. Unfortunately, there is not much we can do about it, but it does not affect anything. It's completely harmless: an idle CPU that is waiting for a networking packet is now attributed as IOWAIT. The io_uring kernel folks decided at some point that it's better to attribute a CPU blocked on any I/O (even networking) as IOWAIT.

I am surprised that it appeared in kernel 5.10, but 5.10 is an LTS kernel version, so maybe they backported this change there. AFAIK, it first appeared in 6.x kernel versions.

I googled the kernel discussions about this again and learned that they decided to revert the change because it has been confusing to many users.
See here: https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/commit/queue-6.4/io_uring-gate-iowait-schedule-on-having-pending-requests.patch?id=2b8c242ac869eae3d96b712fdb9940e9cd1e0d69

The MariaDB/MySQL folks are also complaining about this here: https://bugzilla.kernel.org/show_bug.cgi?id=217699
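
One way to verify that this iowait is only an accounting artifact is the kernel's pressure stall information for I/O, which only rises when tasks are actually stalled waiting on I/O. A small illustrative sketch, assuming a kernel built with PSI support (CONFIG_PSI, 4.20+); this is not something referenced in the thread:

```python
# Illustrative sketch: read /proc/pressure/io (PSI). Lines look like:
#   some avg10=0.00 avg60=0.00 avg300=0.00 total=12345
# If "some"/"full" stay near zero while top shows high WA%, nothing is actually
# stalled on I/O, consistent with the explanation that the accounting is harmless.
def io_pressure():
    with open("/proc/pressure/io") as f:
        for line in f:
            kind, *metrics = line.split()
            values = dict(m.split("=") for m in metrics)
            print(f"{kind}: avg10={values['avg10']}% avg60={values['avg60']}%")

if __name__ == "__main__":
    io_pressure()
```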

@romange
Collaborator

romange commented Dec 10, 2023

working as intended

romange closed this as not planned on Dec 10, 2023
@tvijverb

> First, let me make sure I understand this issue correctly: you do not experience worse performance (like throughput / latency), but the process seems to be waiting for I/O more than in other deployments. Is that correct?
>
> Is WA% always high in this deployment, or only during writes to disk? (I see that you're saving an RDB every minute.)
>
> When you say "Other workloads that use the persistent disk don't show this behaviour" - what are the differences between this deployment and the others? Do they use different disks?
>
> And finally, a few unrelated questions:
>
> * Why do you use `--hz=5`?
>
> * Similarly, why disable Dragonfly's snapshot format (via `--df_snapshot_format=false`)?
>
> * May I ask how you use Dragonfly? With what load, for which purpose, etc.?
>
> Thanks!

  1. The Dragonfly pod is running on AWS spot instances, so saving the DB state every minute is quite helpful for our purposes.
  2. `--hz=5` is used to reduce the CPU load; the default setting uses more than 10% CPU on our AWS instance.
  3. `--df_snapshot_format=false` was needed in previous Dragonfly versions to save the DB state to a Redis-compatible *.rdb file.
  4. The Dragonfly instance is used as a simple job queue for Python (Celery); see the illustrative sketch below.
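
For item 4, a minimal sketch of what pointing Celery at Dragonfly's Redis-compatible endpoint looks like; the hostname, port, and database numbers below are placeholders, not taken from this deployment:

```python
# Minimal Celery app using Dragonfly as a Redis-compatible broker and result backend.
# "dragonfly.example.svc" and the database numbers are placeholder values.
from celery import Celery

app = Celery(
    "jobs",
    broker="redis://dragonfly.example.svc:6379/0",
    backend="redis://dragonfly.example.svc:6379/1",
)

@app.task
def add(x, y):
    return x + y
```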

@sherif-fanous

sherif-fanous commented Dec 16, 2023

I ran across this issue the past few days on my home lab k8s cluster where I started getting nagging NodeCPUHighUsage alerts from Prometheus.

After hours of triage (because none of the other available Linux tools showed any high CPU usage), I was able to determine that the alert was reporting CPU time spent in iowait, and I narrowed it down to Dragonfly.

In my case, I'm running a super trivial workload on my home lab, so I temporarily forced Dragonfly to use epoll via `--force_epoll`. Let me be clear that this works in my case because, as I stated, the workload is trivial; at least I'm no longer getting the Prometheus alerts.

@fe-ax
Author

fe-ax commented Jan 8, 2024

@romange We're using kernel-5.10.198-187.748.amzn2.

If this is the patch intended to resolve the issue, it doesn't fix it for us.

Here we can see the patch is already implemented in the running version on our host.

@crishoj

crishoj commented Apr 22, 2024

Also observing ~100% IOWAIT on Linux 6.5.0.

@tvijverb

@crishoj Parent issue on liburing mentions it will be fixed in kernel 6.10. No idea if the patch will be backported. axboe/liburing#943
