
Missing Metrics #580

Open
gkramer opened this issue Jan 31, 2022 · 2 comments

Comments

gkramer commented Jan 31, 2022

Hey guys,

Wondering if someone could assist with an issue I'm having with BigGraphite (BG). It currently receives a large number of metrics but appears to drop a noticeable proportion at random. This was highlighted when looking at metrics from Apache Spark, which show frequent one-minute gaps every hour.

Infrastructure Setup:

  • Within EKS (1.20)
  • internal AWS NLB
  • Traffic Flow: NLB -> Carbon Container -> {elasticsearch + cassandra}
  • Carbon: Running inside an upstream Alpine container
  • PS:
    1 root 0:00 {entrypoint} /bin/sh /entrypoint
    49 root 0:00 runsvdir -P /etc/service
    51 root 0:00 runsv bg-carbon
    52 root 0:03 runsv brubeck
    53 root 0:00 runsv carbon
    54 root 0:00 runsv carbon-aggregator
    55 root 0:03 runsv carbon-relay
    56 root 0:03 runsv collectd
    57 root 0:00 runsv cron
    58 root 0:00 runsv go-carbon
    59 root 0:00 runsv graphite
    60 root 0:00 runsv nginx
    61 root 0:03 runsv redis
    62 root 0:00 runsv statsd
    63 root 0:00 tee -a /var/log/carbon.log
    65 root 0:00 tee -a /var/log/carbon-relay.log
    68 root 0:00 tee -a /var/log/statsd.log
    69 root 0:01 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0
    70 root 0:09 {node} statsd /opt/statsd/config/tcp.js
    71 root 0:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
    76 root 0:00 /usr/sbin/crond -f
    79 nginx 0:00 nginx: worker process
    80 nginx 0:00 nginx: worker process
    81 nginx 0:00 nginx: worker process
    82 nginx 0:00 nginx: worker process
    85 root 0:35 tee -a /var/log/bg-carbon.log
    86 root 45:27 /opt/graphite/bin/python3 /opt/graphite/bin/bg-carbon-cache start --nodaemon --debug
    88 root 0:00 tee -a /var/log/carbon-aggregator.log
    156 root 0:41 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0
    157 root 0:49 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0
    158 root 0:46 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0
    159 root 0:47 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0

I can see traffic coming into the interface (tcpdump/tcpflow), and can see entries in bg-carbon.log with references to 'cache query', but almost no datapoint logs for the Spark metrics.
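
As a sanity check on the ingestion path, something like the following can push a single test datapoint over the Graphite plaintext protocol and confirm whether it lands end to end. This is only a sketch: the host, port and metric name are placeholders for this setup (2003 is carbon's usual plaintext line receiver port).

    # Minimal sketch: send one datapoint over carbon's plaintext protocol
    # ("<metric> <value> <timestamp>\n"). Host/port/metric are placeholders.
    import socket
    import time

    CARBON_HOST = "carbon.internal.example"  # placeholder: NLB / carbon endpoint
    CARBON_PORT = 2003                       # assumption: default plaintext listener port

    def send_test_point(metric="debug.bg.ingest_test", value=1.0):
        ts = int(time.time())
        line = f"{metric} {value} {ts}\n"
        with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
            sock.sendall(line.encode("ascii"))
        return metric, ts

    if __name__ == "__main__":
        metric, ts = send_test_point()
        print(f"sent {metric} at {ts}; check /render?target={metric}&format=json")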

Any assistance in troubleshooting would be greatly appreciated!

geobeau (Contributor) commented Feb 1, 2022

If you look at the Cassandra side:

  • do you have errors?
  • do you see a drop in write ops when you notice the drops?

Inside your container, does carbon restart by itself?
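
One way to check both points is to graph carbon's own self-reported counters, assuming bg-carbon-cache publishes the usual carbon.agents.* instrumentation; a gap or counter reset in metricsReceived/committedPoints would also hint at the cache process restarting. A rough sketch, with GRAPHITE_URL as a placeholder:

    # Sketch: pull carbon's self-instrumentation for the last hour and look for
    # gaps or resets. Assumes graphite-web is reachable at GRAPHITE_URL and that
    # bg-carbon-cache reports the standard carbon.agents.* series.
    import json
    import urllib.request

    GRAPHITE_URL = "http://graphite.internal.example"  # placeholder

    def fetch(target, window="-1h"):
        url = f"{GRAPHITE_URL}/render?target={target}&from={window}&format=json"
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)

    for target in ("carbon.agents.*.metricsReceived",
                   "carbon.agents.*.committedPoints",
                   "carbon.agents.*.cache.size"):
        for series in fetch(target):
            values = [v for v, _ in series["datapoints"] if v is not None]
            print(series["target"], "points:", len(values),
                  "min:", min(values, default=None), "max:", max(values, default=None))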

gkramer (Author) commented Mar 16, 2022

Apologies for the delay in coming back to you!

I've rebuilt the cache container to run only carbon-cache. Previously it was running statsd + carbon + etc., all under supervisord or similar. The container now runs carbon exclusively.

At first, under low load, there were no metric drop-outs at all. We were shipping all metrics for Spark, and it was bulletproof. As soon as we started shipping more metrics from other services, we began to see drop-outs of 1-2 minutes across multiple metrics. Another interesting observation is that metrics appear to disappear at times; I'm not sure whether they are being overwritten by null values. What I can tell you is that metrics are being fed into what is now a dedicated carbon ingress and inspected from another graphite endpoint, so whisper data is not a factor.
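
To tell whether the vanishing points are stored as nulls or simply never written, one option is to pull an affected series as JSON over the gap window and count null vs. non-null datapoints. A rough sketch, with the endpoint and metric path as placeholders:

    # Sketch: count null vs non-null datapoints for one affected series over the
    # last two hours. GRAPHITE_URL and METRIC are placeholders for this setup.
    import json
    import urllib.request

    GRAPHITE_URL = "http://graphite.internal.example"  # placeholder
    METRIC = "spark.driver.jvm.heap.used"              # placeholder: an affected series

    url = f"{GRAPHITE_URL}/render?target={METRIC}&from=-2h&format=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)

    for series in data:
        nulls = sum(1 for v, _ in series["datapoints"] if v is None)
        total = len(series["datapoints"])
        print(f"{series['target']}: {total - nulls}/{total} non-null datapoints")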

I've made multiple tweaks to the configs, but I'm at a bit of a loss as to how to eradicate the intermittent data loss.

Any help would be GREATLY appreciated!

TIA!
