Ingestion stops after getting the error: "ingestion rejected due to disk limit" #5548

Open
fredsig opened this issue Nov 14, 2024 · 2 comments
Labels
bug Something isn't working

Comments

fredsig commented Nov 14, 2024

Describe the bug
I am using both the Ingest API and OTLP to ingest documents into a few indices in Quickwit. I'm running the default helm chart with about 10 indexer pods (2 vCPU each, 8G RAM, local attached EBS volume per pod with 250G). Peak throughput can go from ~250 to ~300MB/s, up to 35k docs/sec. Very sporadically (twice in 2 weeks), I see one of the indexer pods rate limiting all ingestion and returning a 4xx to clients. Logs will show the following error continuously:

INFO quickwit_ingest::ingest_api_service: ingestion rejected due to disk limit
INFO quickwit_ingest::ingest_api_service: ingestion rejected due to disk limit

I see no ERRORs or WARNINGs before this state, and to recover I have to clean up the queues directory on the local disk and recycle the pod. Recycling the pod alone is not enough, since (my guess) the queue is already at the max_queue_disk_usage limit of 32G. Just before the rate limiting kicked in, this indexer was handling 2.3k docs/s (~25 MB/s).
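
For reference, a rough sketch of the manual recovery I end up doing (the pod name and paths are from my deployment, so the exact commands may differ):

# Remove the stale WAL files from the ingest queue directory inside the pod.
kubectl exec quickwit-indexer-0 -- sh -c 'rm -f /quickwit/qwdata/queues/wal-*'
# Recycle the pod so the ingest API starts with an empty queue.
kubectl delete pod quickwit-indexer-0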

More info:
[Screenshot 2024-11-14 at 17 53 04]
[Screenshot 2024-11-14 at 17 54 04]

Looking at the local /quickwit/qwdata EBS volume, I can see that the queues directory has reached 33G (max_queue_disk_usage is set to 32GiB):

I have no name!@quickwit-indexer-0:/quickwit/qwdata$ du -h queues/
33G     queues/

There are about 257 WAL files, with a few new ones created every second:

I have no name!@quickwit-indexer-0:/quickwit/qwdata/queues$ ls -al
total 33554524
drwxrwsr-x 2 1005 1005     16384 Nov 13 22:39 .
drwxrwsr-x 7 root 1005      4096 Oct 29 18:04 ..
-rw-rw-r-- 1 1005 1005        43 Jun 21 16:01 partition_id
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:17 wal-00000000000001055576
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:17 wal-00000000000001055577
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:17 wal-00000000000001055578
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:17 wal-00000000000001055579
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:17 wal-00000000000001055580
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055581
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055582
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055583
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055584
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055585
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055586
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055587
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055588
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055589
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055590
-rw-r--r-- 1 1005 1005 134217728 Nov 13 22:18 wal-00000000000001055591
[...]

The timestamp of the last WAL file matches the first "ingestion rejected due to disk limit" log entry.
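
(This is roughly how I checked that correlation; the log lookup is just a sketch and depends on how pod logs are collected in your setup:)

# Newest WAL file (run inside the pod)...
ls -lt /quickwit/qwdata/queues/wal-* | head -n 1
# ...vs. the first rejection message in the pod logs (run from outside the pod).
kubectl logs quickwit-indexer-0 | grep -m 1 'ingestion rejected due to disk limit'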

Expected behavior

I don't know what the expected behaviour is after setting max_queue_disk_usage to 32GiB. I can see it as a protection against the queue growing unbounded, hence the rate limiting on client requests. There are two issues with this:

  1. Once we get into this state, clients are permanently rate limited, but the pod health check still reports healthy, so the pod is never marked unhealthy and cannot be removed from the ALB that sits in front of the indexers.
  2. What happens when the queue disk gets full? The root cause seems to be that, for some reason, WAL files were not removed after successful ingestion (this is just a guess, since I saw no other errors before it happened and /quickwit/qwdata/queues on the pods normally never grows beyond 1G). When the rate limiting kicks in, I would expect it to last only until queue disk space becomes available again, but that never happened.

Apart from that, ingestion has been running with zero issues most of the time. I've tried to dig into the WAL metrics and opened bug #5547.

Thanks for your help!
Any guidance on how max_queue_disk_usage should be set would also be greatly appreciated.

Configuration:
Version: v0.8.2

node.yaml

data_dir: /quickwit/qwdata
default_index_root_uri: s3://prod-<redacted>-quickwit/indexes
gossip_listen_port: 7282
grpc:
  max_message_size: 80 MiB
indexer:
  enable_otlp_endpoint: true
ingest_api:
  max_queue_disk_usage: 32GiB
  max_queue_memory_usage: 4GiB
listen_address: 0.0.0.0
metastore:
  postgres:
    acquire_connection_timeout: 30s
    idle_connection_timeout: 1h
    max_connection_lifetime: 1d
    max_connections: 50
    min_connections: 10
storage:
  s3:
    region: us-east-1
version: 0.8


fredsig commented Nov 14, 2024

CPU/Mem usage for the pod during the rate limiting (between 22:40 and 23:30):
[Screenshot 2024-11-14 at 18 54 56]

Disk usage (%) for the EBS volume:
[Screenshot 2024-11-14 at 18 57 44]

Long-term view of storage (all pods show similar patterns); no problems with available disk space:
[Screenshot 2024-11-14 at 18 58 54]


fredsig commented Nov 18, 2024

Thanks @fulmicoton. In the meantime, I've created a quick Prometheus exporter to give me detailed metrics about WAL directory usage. I can see we rarely go above ~1.5G out of the 32G maximum (this is the last 24h):

[Screenshot 2024-11-18 at 09 12 30]

I suspect an edge case may be happening that prevents the truncation of the WAL?
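
In case it helps, the gist of my exporter is just measuring the queue directory size. A minimal sketch of the same idea, written here as a node_exporter textfile-collector script rather than my actual exporter (the metric name and textfile path are placeholders, and it assumes node_exporter runs with --collector.textfile.directory pointing at that path):

#!/bin/sh
# Write the size of the ingest WAL queue directory as a Prometheus gauge
# for node_exporter's textfile collector. Run periodically (e.g. from cron).
QUEUE_DIR=/quickwit/qwdata/queues
BYTES=$(du -sb "$QUEUE_DIR" | cut -f1)
cat > /var/lib/node_exporter/textfile/quickwit_wal.prom <<EOF
# HELP quickwit_wal_dir_bytes Size of the ingest WAL queue directory in bytes.
# TYPE quickwit_wal_dir_bytes gauge
quickwit_wal_dir_bytes $BYTES
EOF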
