Describe the bug
I am using both the Ingest API and OTLP to ingest documents into a few indices in Quickwit. I'm running the default helm chart with about 10 indexer pods (2 vCPU each, 8G RAM, local attached EBS volume per pod with 250G). Peak throughput can go from ~250 to ~300MB/s, up to 35k docs/sec. Very sporadically (twice in 2 weeks), I see one of the indexer pods rate limiting all ingestion and returning a 4xx to clients. Logs will show the following error continuously:
INFO quickwit_ingest::ingest_api_service: ingestion rejected due to disk limit
INFO quickwit_ingest::ingest_api_service: ingestion rejected due to disk limit
I see no ERRORs or WARNINGs before this state, and to recover I have to clean up the queue directory on the local disk and recycle the pod. Recycling the pod alone is not enough because (my guess) the queue is already at max_queue_disk_usage (which is set to 32G). Just before the rate limiting kicked in, this indexer was doing 2.3k docs/s, ~25MB/s.
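For reference, the recovery procedure boils down to something like this (pod name is from my setup; this obviously throws away whatever is still buffered in the WAL):
# remove the write-ahead log files held in the ingest queue directory
kubectl exec quickwit-indexer-0 -- sh -c 'rm -rf /quickwit/qwdata/queues/*'
# recreate the pod so the ingest API starts with an empty queue
kubectl delete pod quickwit-indexer-0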
More info:
Looking into the local /quickwit/qwdata EBS volume, I can see that the queues directory has reached 33G (max_queue_disk_usage is set to 32G):
I have no name!@quickwit-indexer-0:/quickwit/qwdata$ du -h queues/
33G queues/
There are about 257 wal files, a few created every second:
I have no name!@quickwit-indexer-0:/quickwit/qwdata/queues$ ls -al
total 33554524
drwxrwsr-x 2 1005 1005 16384 Nov 13 22:39 .
drwxrwsr-x 7 root 1005 4096 Oct 29 18:04 ..
-rw-rw-r-- 1 1005 1005 43 Jun 21 16:01 partition_id
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:17 wal-00000000000001055576
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:17 wal-00000000000001055577
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:17 wal-00000000000001055578
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:17 wal-00000000000001055579
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:17 wal-00000000000001055580
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055581
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055582
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055583
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055584
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055585
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055586
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055587
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055588
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055589
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055590
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055591
[...]
The timestamp of the last wal file matches the first "ingestion rejected due to disk limit" log line.
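For completeness, this is roughly how I cross-checked it (the first two commands run inside the pod, the last one from outside):
# count the WAL segments sitting in the queue directory
ls /quickwit/qwdata/queues/wal-* | wc -l
# newest WAL file and its mtime
ls -lt /quickwit/qwdata/queues/wal-* | head -n 1
# first occurrence of the rejection message in the indexer logs
kubectl logs quickwit-indexer-0 | grep -m 1 'ingestion rejected due to disk limit'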
Expected behavior
I don't know what the expected behaviour is after setting max_queue_disk_usage to 32G. I understand it as a protection to keep the queue from growing unbounded, hence the rate limiting on client requests. There are 2 issues with this:
1. Once we get into this state, clients are rate limited forever, but since the pod's health check keeps passing, the pod is never marked unhealthy and cannot be removed from the ALB that sits in front of the indexers (see the probe check sketched below).
2. What happens when the queue disk gets full? The root cause seems to be that, for some reason, WAL files were not removed after successful ingestion (this is just a guess, since I saw no other errors before it happened and normally /quickwit/qwdata/queues on pods never grows beyond 1G). If rate limiting kicks in, I would expect it to last only until queue disk space becomes available again, but that never happened.
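About the first point: the liveness/readiness probes both keep answering OK while ingestion is rejected. A quick way to check (port and endpoint paths are what I understand Quickwit exposes on its REST port; adjust if your setup differs):
# forward the REST port of the affected indexer
kubectl port-forward pod/quickwit-indexer-0 7280:7280 &
# both probes still report healthy even though every ingest request gets a 4xx
curl -s localhost:7280/health/livez; echo
curl -s localhost:7280/health/readyz; echo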
Apart from these incidents, ingestion has been running with zero issues. I've tried to dig into the WAL metrics and opened the following bug: #5547.
Thanks for your help!
Any guidance on how max_queue_disk_usage should be set would also be greatly appreciated.
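For context, a rough back-of-the-envelope of what the 32G limit represents at the ~25MB/s this indexer was doing right before the incident (numbers taken from above):
# 32 GiB of queue at ~25 MB/s of incoming data is roughly 22 minutes of buffered ingest
echo "$(( 32 * 1024**3 / (25 * 10**6) / 60 )) minutes"
Under normal operation the queue never comes close to that, which again points at truncation not happening rather than the limit being too low.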
Thanks @fulmicoton. In the meantime, I've created a quick Prometheus exporter to give me detailed metrics about WAL directory usage. Over the last 24h, I can see we rarely go above ~1.5G out of the 32G maximum.
Could there be an edge case affecting truncation of the WAL?
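The exporter itself is nothing fancy; a minimal sketch of the idea (the metric name is my own, and the series names in Quickwit's /metrics output may differ across versions):
# Quickwit's own Prometheus endpoint, filtered for ingest/WAL related series
curl -s localhost:7280/metrics | grep -iE 'ingest|wal'
# gauge emitted by the exporter: bytes used by the ingest queue directory
# (in practice this line is written to a textfile-collector .prom file)
du -sb /quickwit/qwdata/queues | awk '{print "quickwit_wal_queues_dir_bytes " $1}'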
Configuration:
Version: v0.8.2
node.yaml