High WA% without workload #2270

Closed
fe-ax opened this issue Dec 7, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@fe-ax

fe-ax commented Dec 7, 2023

Describe the bug
A high WA% (Waiting for I/O) time while nothing is happening on the DB. CPU usage is nearly 0%.

To Reproduce
Steps to reproduce the behavior:

  1. Run with command `dragonfly --logtostderr --maxmemory=4gb --save_schedule=*:* --hz=5 --dbfilename dump.rdb --df_snapshot_format=false`

Expected behavior
Lower WA% when no workload is present.
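
For reference, the WA% that top reports is the share of CPU time the kernel accounts to the iowait field of /proc/stat. A minimal illustrative sketch (not part of the original report) that computes the same figure over a one-second interval:

```python
# Illustrative sketch: approximate the WA% that top shows by sampling the
# aggregate "cpu" line of /proc/stat twice and taking the iowait share.
# Field order after the "cpu" label: user nice system idle iowait irq softirq steal ...
import time

def cpu_fields():
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]  # drop the "cpu" label

def iowait_percent(interval=1.0):
    before = cpu_fields()
    time.sleep(interval)
    after = cpu_fields()
    delta = [a - b for a, b in zip(after, before)]
    total = sum(delta)
    return 100.0 * delta[4] / total if total else 0.0  # index 4 == iowait

if __name__ == "__main__":
    print(f"iowait over the last second: {iowait_percent():.1f}%")
```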

Screenshots
[screenshot attached in the original issue]

Environment (please complete the following information):

  • OS:
sh-4.2# cat /etc/os-release
      NAME="Amazon Linux"
      VERSION="2"
      ID="amzn"
      ID_LIKE="centos rhel fedora"
      VERSION_ID="2"
      PRETTY_NAME="Amazon Linux 2"
      CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
  • Kernel: Linux ip-10-117-39-51.eu-central-1.compute.internal 5.10.198-187.748.amzn2.x86_64 #1 SMP Tue Oct 24 19:49:54 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Containerized?: Kubernetes
  • Dragonfly Version:
    dragonfly v1.13.0-f39eac5bcaf7c8ffe5c433a0e8e15747391199d9
    build time: 2023-12-04 15:59:48

Reproducible Code Snippet
N/A

Additional context

  • We are using an EBS disk to write the dump file
  • The EBS disk is a 1 GB gp3 volume with 3000 IOPS available
  • Other workloads that use the persistent disk don't show this behaviour
  • AWS metrics show almost no sign of workload

[screenshot of the AWS metrics attached in the original issue]

fe-ax added the bug label on Dec 7, 2023
@chakaz
Collaborator

chakaz commented Dec 7, 2023

First, let me make sure I understand this issue correctly: you do not experience worse performance (like throughput / latency), but the process seems to be waiting for I/O more than in other deployments. Is that correct?

Is WA% always high in this deployment, or only during writes to disk? (I see that you're saving an RDB every minute.)

When you say "Other workloads that use the persistent disk don't show this behaviour" - what are the differences between this deployment and the others? Do they use different disks?

And finally, a few unrelated questions:

  • Why do you use `--hz=5`?
  • Similarly, why disable Dragonfly's snapshot format (via `--df_snapshot_format=false`)?
  • May I ask how you use Dragonfly? With what load, for which purpose, etc.?

Thanks!

@romange
Collaborator

romange commented Dec 7, 2023

Duplicate of #2181
@fe-ax it's a kernel change in how CPU time is attributed in the io_uring API. Unfortunately, there is not much we can do about it, but it does not affect anything. It's completely harmless: an idle CPU that is waiting for a networking packet is now attributed as IOWAIT. The io_uring kernel folks decided at some point that it's better to attribute a CPU blocked on any I/O (even networking) as IOWAIT.

I am surprised that it appeared in kernel 5.10, but 5.10 is an LTS kernel version, so maybe they backported this change there. AFAIK, it first appeared in 6.x kernel versions.

I googled the kernel discussions about this again and learned that they decided to revert the change because it has been confusing to many users.
See here: https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/commit/queue-6.4/io_uring-gate-iowait-schedule-on-having-pending-requests.patch?id=2b8c242ac869eae3d96b712fdb9940e9cd1e0d69

The MariaDB/MySQL folks are also complaining about this here: https://bugzilla.kernel.org/show_bug.cgi?id=217699
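
One way to verify that this iowait is only an accounting artifact is the kernel's pressure stall information for I/O, which only rises when tasks are actually stalled waiting on I/O. A small illustrative sketch, assuming a kernel built with PSI support (CONFIG_PSI, 4.20+); this is not something referenced in the thread:

```python
# Illustrative sketch: read /proc/pressure/io (PSI). Lines look like:
#   some avg10=0.00 avg60=0.00 avg300=0.00 total=12345
# If "some"/"full" stay near zero while top shows high WA%, nothing is actually
# stalled on I/O, consistent with the explanation that the accounting is harmless.
def io_pressure():
    with open("/proc/pressure/io") as f:
        for line in f:
            kind, *metrics = line.split()
            values = dict(m.split("=") for m in metrics)
            print(f"{kind}: avg10={values['avg10']}% avg60={values['avg60']}%")

if __name__ == "__main__":
    io_pressure()
```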

@romange
Collaborator

romange commented Dec 10, 2023

working as intended

romange closed this as not planned on Dec 10, 2023
@tvijverb

> First, let me make sure I understand this issue correctly: you do not experience worse performance (like throughput / latency), but the process seems to be waiting for I/O more than in other deployments. Is that correct?
>
> Is WA% always high in this deployment, or only during writes to disk? (I see that you're saving an RDB every minute.)
>
> When you say "Other workloads that use the persistent disk don't show this behaviour" - what are the differences between this deployment and the others? Do they use different disks?
>
> And finally, a few unrelated questions:
>
> * Why do you use `--hz=5`?
>
> * Similarly, why disable Dragonfly's snapshot format (via `--df_snapshot_format=false`)?
>
> * May I ask how you use Dragonfly? With what load, for which purpose, etc.?
>
> Thanks!

  1. The Dragonfly pod is running on AWS spot instances, so saving the DB state every minute is quite helpful for our purposes.
  2. `--hz=5` is used to reduce the CPU load; the default setting uses more than 10% CPU on our AWS instance.
  3. `--df_snapshot_format=false` was needed in previous Dragonfly versions to save the DB state to a Redis-compatible *.rdb file.
  4. The Dragonfly instance is used as a simple job queue for Python (Celery); see the illustrative sketch below.
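
For item 4, a minimal sketch of what pointing Celery at Dragonfly's Redis-compatible endpoint looks like; the hostname, port, and database numbers below are placeholders, not taken from this deployment:

```python
# Minimal Celery app using Dragonfly as a Redis-compatible broker and result backend.
# "dragonfly.example.svc" and the database numbers are placeholder values.
from celery import Celery

app = Celery(
    "jobs",
    broker="redis://dragonfly.example.svc:6379/0",
    backend="redis://dragonfly.example.svc:6379/1",
)

@app.task
def add(x, y):
    return x + y
```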

@sherif-fanous

sherif-fanous commented Dec 16, 2023

I ran across this issue the past few days on my home lab k8s cluster where I started getting nagging NodeCPUHighUsage alerts from Prometheus.

After hours of triage (because none of the other available Linux tools showed any high CPU usage), I was able to determine that the alert was reporting CPU time spent in iowait, and I narrowed it down to Dragonfly.

In my case, I'm running a super trivial workload on my home lab, so I temporarily forced Dragonfly to use epoll via `--force_epoll`. Let me be clear that this works in my case because, as I stated, the workload is trivial; at least I'm no longer getting the Prometheus alerts.

@fe-ax
Author

fe-ax commented Jan 8, 2024

@romange We're using kernel-5.10.198-187.748.amzn2.

If this is the patch intended to resolve the issue, it doesn't fix it for us.

Here we can see the patch is already implemented in the running version on our host.

@crishoj

crishoj commented Apr 22, 2024

Also observing ~100% IOWAIT on Linux 6.5.0.

@tvijverb

@crishoj Parent issue on liburing mentions it will be fixed in kernel 6.10. No idea if the patch will be backported. axboe/liburing#943
