
Missing Metrics #580

Open
gkramer opened this issue Jan 31, 2022 · 2 comments

Comments

gkramer commented Jan 31, 2022

Hey guys,

Wondering if someone could assist with an issue I'm having with BigGraphite (BG). It currently receives a large number of metrics but appears to drop a noticeable proportion at random. This was highlighted when looking at metrics from Apache Spark, which show frequent one-minute gaps every hour.

Infrastructure Setup:

  • Within EKS (1.20)
  • internal AWS NLB
  • Traffic Flow: NLB -> Carbon Container -> {elasticsearch + cassandra}
  • Carbon: Running inside an upstream Alpine container
  • PS:
    1 root 0:00 {entrypoint} /bin/sh /entrypoint
    49 root 0:00 runsvdir -P /etc/service
    51 root 0:00 runsv bg-carbon
    52 root 0:03 runsv brubeck
    53 root 0:00 runsv carbon
    54 root 0:00 runsv carbon-aggregator
    55 root 0:03 runsv carbon-relay
    56 root 0:03 runsv collectd
    57 root 0:00 runsv cron
    58 root 0:00 runsv go-carbon
    59 root 0:00 runsv graphite
    60 root 0:00 runsv nginx
    61 root 0:03 runsv redis
    62 root 0:00 runsv statsd
    63 root 0:00 tee -a /var/log/carbon.log
    65 root 0:00 tee -a /var/log/carbon-relay.log
    68 root 0:00 tee -a /var/log/statsd.log
    69 root 0:01 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0
    70 root 0:09 {node} statsd /opt/statsd/config/tcp.js
    71 root 0:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
    76 root 0:00 /usr/sbin/crond -f
    79 nginx 0:00 nginx: worker process
    80 nginx 0:00 nginx: worker process
    81 nginx 0:00 nginx: worker process
    82 nginx 0:00 nginx: worker process
    85 root 0:35 tee -a /var/log/bg-carbon.log
    86 root 45:27 /opt/graphite/bin/python3 /opt/graphite/bin/bg-carbon-cache start --nodaemon --debug
    88 root 0:00 tee -a /var/log/carbon-aggregator.log
    156 root 0:41 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0
    157 root 0:49 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0
    158 root 0:46 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0
    159 root 0:47 {gunicorn} /opt/graphite/bin/python3 /opt/graphite/bin/gunicorn wsgi --pythonpath=/opt/graphite/webapp/graphite --preload --threads=1 --worker-class=sync --workers=4 --limit-request-line=0 --max-requests=1000 --timeout=65 --bind=0.0

I can see traffic coming into the interface (tcpdump/tcpflow), and can see entries in bg-carbon.log with references to 'cache query', but almost no datapoint logs for the Spark metrics.
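
As a sanity check on the ingestion path, something like the following can push a single test datapoint over the Graphite plaintext protocol and confirm whether it lands end to end. This is only a sketch: the host, port and metric name are placeholders for this setup (2003 is carbon's usual plaintext line receiver port).

    # Minimal sketch: send one datapoint over carbon's plaintext protocol
    # ("<metric> <value> <timestamp>\n"). Host/port/metric are placeholders.
    import socket
    import time

    CARBON_HOST = "carbon.internal.example"  # placeholder: NLB / carbon endpoint
    CARBON_PORT = 2003                       # assumption: default plaintext listener port

    def send_test_point(metric="debug.bg.ingest_test", value=1.0):
        ts = int(time.time())
        line = f"{metric} {value} {ts}\n"
        with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
            sock.sendall(line.encode("ascii"))
        return metric, ts

    if __name__ == "__main__":
        metric, ts = send_test_point()
        print(f"sent {metric} at {ts}; check /render?target={metric}&format=json")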

Any assistance in troubleshooting would be greatly appreciated!

geobeau (Contributor) commented Feb 1, 2022

If you look at the Cassandra side:

  • do you have errors?
  • do you see a drop in write ops when you notice the drops?

Inside your container, does carbon restart by itself?
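
One way to check both points is to graph carbon's own self-reported counters, assuming bg-carbon-cache publishes the usual carbon.agents.* instrumentation; a gap or counter reset in metricsReceived/committedPoints would also hint at the cache process restarting. A rough sketch, with GRAPHITE_URL as a placeholder:

    # Sketch: pull carbon's self-instrumentation for the last hour and look for
    # gaps or resets. Assumes graphite-web is reachable at GRAPHITE_URL and that
    # bg-carbon-cache reports the standard carbon.agents.* series.
    import json
    import urllib.request

    GRAPHITE_URL = "http://graphite.internal.example"  # placeholder

    def fetch(target, window="-1h"):
        url = f"{GRAPHITE_URL}/render?target={target}&from={window}&format=json"
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)

    for target in ("carbon.agents.*.metricsReceived",
                   "carbon.agents.*.committedPoints",
                   "carbon.agents.*.cache.size"):
        for series in fetch(target):
            values = [v for v, _ in series["datapoints"] if v is not None]
            print(series["target"], "points:", len(values),
                  "min:", min(values, default=None), "max:", max(values, default=None))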

gkramer (Author) commented Mar 16, 2022

Apologies for the delay in coming back to you!

I've rebuilt the cache container to run only carbon-cache. Previously it was running statsd + carbon + etc., all under supervisord or similar. The container now runs carbon exclusively.

At first, under low load, there were no metric drop-outs at all. We were shipping all metrics for Spark, and it was bulletproof. As soon as we started shipping more metrics from other services, we began to see drop-outs of 1-2 minutes across multiple metrics. Another interesting observation is that metrics appear to disappear at times; I'm not sure whether they are being overwritten by null values. What I can tell you is that metrics are being fed into what is now a dedicated carbon ingress and inspected from another graphite endpoint, so whisper data is not a factor.
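
To tell whether the vanishing points are stored as nulls or simply never written, one option is to pull an affected series as JSON over the gap window and count null vs. non-null datapoints. A rough sketch, with the endpoint and metric path as placeholders:

    # Sketch: count null vs non-null datapoints for one affected series over the
    # last two hours. GRAPHITE_URL and METRIC are placeholders for this setup.
    import json
    import urllib.request

    GRAPHITE_URL = "http://graphite.internal.example"  # placeholder
    METRIC = "spark.driver.jvm.heap.used"              # placeholder: an affected series

    url = f"{GRAPHITE_URL}/render?target={METRIC}&from=-2h&format=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        data = json.load(resp)

    for series in data:
        nulls = sum(1 for v, _ in series["datapoints"] if v is None)
        total = len(series["datapoints"])
        print(f"{series['target']}: {total - nulls}/{total} non-null datapoints")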

I've made multiple tweaks to the configs, but I'm at a bit of a loss as to how to eradicate the intermittent data loss.

Any help would be GREATLY appreciated!

TIA!
