Ingestion stops after getting the error: "ingestion rejected due to disk limit" #5548

Open
fredsig opened this issue Nov 14, 2024 · 2 comments
Labels
bug Something isn't working

Comments

fredsig commented Nov 14, 2024

Describe the bug
I am using both the Ingest API and OTLP to ingest documents into a few indices in Quickwit. I'm running the default helm chart with about 10 indexer pods (2 vCPU each, 8G RAM, local attached EBS volume per pod with 250G). Peak throughput can go from ~250 to ~300MB/s, up to 35k docs/sec. Very sporadically (twice in 2 weeks), I see one of the indexer pods rate limiting all ingestion and returning a 4xx to clients. Logs will show the following error continuously:

INFO quickwit_ingest::ingest_api_service: ingestion rejected due to disk limit
INFO quickwit_ingest::ingest_api_service: ingestion rejected due to disk limit

I see no ERRORs or WARNINGs before this state, and to recover I have to clean up the queues directory on the local disk and recycle the pod. Recycling the pod alone is not enough, since (my guess) the queue is already at the max_queue_disk_usage limit of 32G. Just before the rate limiting kicked in, this indexer was handling 2.3k docs/s (~25 MB/s).
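
For reference, a rough sketch of the manual recovery I end up doing (the pod name and paths are from my deployment, so the exact commands may differ):

# Remove the stale WAL files from the ingest queue directory inside the pod.
kubectl exec quickwit-indexer-0 -- sh -c 'rm -f /quickwit/qwdata/queues/wal-*'
# Recycle the pod so the ingest API starts with an empty queue.
kubectl delete pod quickwit-indexer-0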

More info:
[Screenshot 2024-11-14 at 17 53 04]
[Screenshot 2024-11-14 at 17 54 04]

Looking at the local /quickwit/qwdata EBS volume, I can see that the queues directory has reached 33G (max_queue_disk_usage is set to 32GiB):

I have no name!@quickwit-indexer-0:/quickwit/qwdata$ du -h queues/
33G     queues/

There are about 257 WAL files, with a few new ones created every second:

I have no name!@quickwit-indexer-0:/quickwit/qwdata/queues$ ls -al
total 33554524
drwxrwsr-x 2 1005 1005     16384 Nov 13 22:39 .
drwxrwsr-x 7 root 1005      4096 Oct 29 18:04 ..
-rw-rw-r-- 1 1005 1005        43 Jun 21 16:01 partition_id
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:17 wal-00000000000001055576
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:17 wal-00000000000001055577
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:17 wal-00000000000001055578
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:17 wal-00000000000001055579
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:17 wal-00000000000001055580
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055581
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055582
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055583
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055584
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055585
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055586
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055587
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055588
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055589
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055590
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055591
[...]

The timestamp of the last WAL file matches the first "ingestion rejected due to disk limit" log entry.
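
(This is roughly how I checked that correlation; the log lookup is just a sketch and depends on how pod logs are collected in your setup:)

# Newest WAL file (run inside the pod)...
ls -lt /quickwit/qwdata/queues/wal-* | head -n 1
# ...vs. the first rejection message in the pod logs (run from outside the pod).
kubectl logs quickwit-indexer-0 | grep -m 1 'ingestion rejected due to disk limit'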

Expected behavior

I don't know what the expected behaviour is after setting max_queue_disk_usage to 32GiB. I can see it as a protection against the queue growing unbounded, hence the rate limiting on client requests. There are two issues with this:

  1. Once we get into this state, clients are permanently rate limited, but the pod health check still reports healthy, so the pod is never marked unhealthy and cannot be removed from the ALB that sits in front of the indexers.
  2. What happens when the queue disk gets full? The root cause seems to be that, for some reason, WAL files were not removed after successful ingestion (this is just a guess, since I saw no other errors before it happened and /quickwit/qwdata/queues on the pods normally never grows beyond 1G). When the rate limiting kicks in, I would expect it to last only until queue disk space becomes available again, but that never happened.

Apart from that, ingestion has been running with zero issues most of the time. I've tried to dig into the WAL metrics and opened bug #5547.

Thanks for your help!
Any guidance on how max_queue_disk_usage should be set would also be greatly appreciated.

Configuration:
Version: v0.8.2

node.yaml

data_dir: /quickwit/qwdata
default_index_root_uri: s3://prod-<redacted>-quickwit/indexes
gossip_listen_port: 7282
grpc:
  max_message_size: 80 MiB
indexer:
  enable_otlp_endpoint: true
ingest_api:
  max_queue_disk_usage: 32GiB
  max_queue_memory_usage: 4GiB
listen_address: 0.0.0.0
metastore:
  postgres:
    acquire_connection_timeout: 30s
    idle_connection_timeout: 1h
    max_connection_lifetime: 1d
    max_connections: 50
    min_connections: 10
storage:
  s3:
    region: us-east-1
version: 0.8


fredsig commented Nov 14, 2024

CPU/Mem usage for the pod during the rate limiting (between 22:40 and 23:30):
[Screenshot 2024-11-14 at 18 54 56]

Disk usage (%) for the EBS volume:
[Screenshot 2024-11-14 at 18 57 44]

Long-term view of storage (all pods show similar patterns); no problems with available disk space:
[Screenshot 2024-11-14 at 18 58 54]


fredsig commented Nov 18, 2024

Thanks @fulmicoton. In the meantime, I've created a quick Prometheus exporter to give me detailed metrics about WAL directory usage. I can see we rarely go above ~1.5G out of the 32G maximum (this is the last 24h):

[Screenshot 2024-11-18 at 09 12 30]

I suspect an edge case may be happening that prevents the truncation of the WAL?
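
In case it helps, the gist of my exporter is just measuring the queue directory size. A minimal sketch of the same idea, written here as a node_exporter textfile-collector script rather than my actual exporter (the metric name and textfile path are placeholders, and it assumes node_exporter runs with --collector.textfile.directory pointing at that path):

#!/bin/sh
# Write the size of the ingest WAL queue directory as a Prometheus gauge
# for node_exporter's textfile collector. Run periodically (e.g. from cron).
QUEUE_DIR=/quickwit/qwdata/queues
BYTES=$(du -sb "$QUEUE_DIR" | cut -f1)
cat > /var/lib/node_exporter/textfile/quickwit_wal.prom <<EOF
# HELP quickwit_wal_dir_bytes Size of the ingest WAL queue directory in bytes.
# TYPE quickwit_wal_dir_bytes gauge
quickwit_wal_dir_bytes $BYTES
EOF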
