/metrics of jmx_exporter does not respond in time #1006

Open
suavebajaj opened this issue Oct 9, 2024 · 14 comments

@suavebajaj

I'm facing an issue where the metrics endpoint (/metrics) that jmx_exporter exposes for Hazelcast does not return any data in one of my Google Kubernetes Engine (GKE) clusters, while it works correctly in the others. The only difference between them is the number of cluster members: the working cluster has 3 members, while the non-working one has 15.

Hazelcast version is 3.7.4

jmx_exporter version is 0.20.0

Allocated heap is 6 GB

Expected Behavior:
In my working clusters, I can retrieve metrics using the following command:

curl http://127.0.0.1:1099/metrics

This command returns the expected metrics data, such as:

# HELP jmx_config_reload_success_total Number of times configuration have successfully been reloaded.
# TYPE jmx_config_reload_success_total counter
jmx_config_reload_success_total 0.0
...

Observed Behavior:
In the non-working cluster, the same command hangs and eventually fails with a connection timeout:

root@hazelcast-0:/# curl -vvv -k http://127.0.0.1:1099/metrics
* Expire in 0 ms for 6 (transfer 0x56f0010850f0)
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x56f0010850f0)
* connect to 127.0.0.1 port 1099 failed: Connection timed out
* Failed to connect to 127.0.0.1 port 1099: Connection timed out
* Closing connection 0
curl: (7) Failed to connect to 127.0.0.1 port 1099: Connection timed out

Below are the exporter configuration, the Hazelcast configuration, and the JVM command line:

#see: https://github.com/prometheus/jmx_exporter#configuration
startDelaySeconds: 0
ssl: false
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  # see "MBean Naming for Hazelcast Data Structures" here: https://docs.hazelcast.org/docs/latest-dev/manual/html-single/index.html#monitoring-with-jmx
  # example input: "com.hazelcast<instance=_hzInstance_1_dev, name="hz:scheduled", type=HazelcastInstance.ManagedExecutorService><>completedTaskCount"
  - pattern: 'com\.hazelcast<instance=(.*), name=(.*), type=(.*)><>(.*):(.*)'
    labels:
      "hz_instance": "$1"
      "hz_name": "$2"
      "hz_type": "$3"
    name: "hazelcast_$4"
  # Fallback to the default pattern for anything not matching above
  - pattern: '.*'

cat /etc/manh/hazelcast_config.xml
<?xml version="1.0" encoding="UTF-8"?>
<hazelcast xsi:schemaLocation="http://www.hazelcast.com/schema/config hazelcast-config-3.6.xsd"
       xmlns="http://www.hazelcast.com/schema/config"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <management-center enabled="false">http://localhost:8080/mancenter</management-center>
  <properties>
        <property name="hazelcast.jmx">true</property>
        <property name="hazelcast.rest.enabled">true</property>
  </properties>
  <map name="authserver.user">
    <time-to-live-seconds>60</time-to-live-seconds>
  </map>
  <map name="zuulserver.userGrants">
    <time-to-live-seconds>60</time-to-live-seconds>
  </map>
  <map name="zuulserver.resources">
    <time-to-live-seconds>60</time-to-live-seconds>
  </map>

root@hazelcast-0:/# ps afx | grep java
    234 pts/0    S+     0:00  \_ grep java
      1 ?        Ssl    9:21 java -javaagent:/data/hazelcast/jmx_prometheus_javaagent-0.20.0.jar=1099:/etc/manh/hazelcast_exporter_config.yml -Xmx6144m -Xss1024k -Dlogging.level.com.manh.cp=DEBUG -Dlogging.level.com.netflix=WARN -Dlogging.level.com.hazelcast.nio.tcp=WARN -XX:+DoEscapeAnalysis -XX:+UseG1GC -XX:MaxGCPauseMillis=2000 -verbose:gc -Xloggc:/mnt/logs/hazelcastserver_G1-gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/mnt/logs/hazelcastserver_oom.hprof -XX:+DisableExplicitGC -Djavax.net.ssl.trustStore=/mnt/truststore.jks -Deureka.client.registerWithEureka=true -jar /main.jar
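
The first rule in the exporter configuration above can be tried in isolation. The sketch below is a minimal illustration, assuming the exporter matches each rule against a string of the form `domain<properties><keys>attrName: value`, as its README describes. The class name `RuleSketch`, the helper `metricName`, and the sample value `42` are hypothetical and not taken from jmx_exporter.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RuleSketch {
    // The first rule's pattern, copied from the configuration above.
    static final Pattern RULE = Pattern.compile(
            "com\\.hazelcast<instance=(.*), name=(.*), type=(.*)><>(.*):(.*)");

    // Returns the metric name the rule would produce (name: "hazelcast_$4",
    // lowercased because lowercaseOutputName is true), or null if no match.
    static String metricName(String beanAttribute) {
        Matcher m = RULE.matcher(beanAttribute);
        if (!m.find()) {
            return null;
        }
        return ("hazelcast_" + m.group(4)).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Example input from the config comment, plus a made-up ": 42" value.
        String input = "com.hazelcast<instance=_hzInstance_1_dev, name=\"hz:scheduled\", "
                + "type=HazelcastInstance.ManagedExecutorService><>completedTaskCount: 42";
        System.out.println(metricName(input)); // prints hazelcast_completedtaskcount
    }
}
```

Note that the rule only matches when a colon follows the `><>` part, so an attribute string without the trailing `: value` falls through to the `.*` fallback rule.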

Steps Taken:

  • Confirmed that the Hazelcast instance is running and accessible via its REST API on port 5701 (curl -vvv 0.0.0.0:5701/hazelcast/rest/cluster).
  • Compared configurations between the working and non-working clusters for discrepancies.
  • Verified the JVM options to ensure the JMX Prometheus agent is configured correctly to listen on port 1099, and checked that hazelcast.jmx is set to true.

What additional troubleshooting steps or best practices can help diagnose this issue further?

@dhoard
Collaborator

dhoard commented Oct 9, 2024

@suavebajaj The curl output...

* connect to 127.0.0.1 port 1099 failed: Connection timed out
* Failed to connect to 127.0.0.1 port 1099: Connection timed out

... indicates a connection issue. Curl isn't connecting to the exporter.

Some common debugging steps:

  • Use netstat to verify the listening IP address and port:
    netstat -tln
  • Use nslookup to verify that the hostname resolves to the correct IP address (and the reverse):
    nslookup 127.0.0.1
    nslookup localhost
  • Check firewall settings.

@suavebajaj
Author

@dhoard Thank you for sharing the troubleshooting steps. The port is open and listening, and the firewall is fine: we have the same firewall settings in all our GKE clusters.

Sharing the output below

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:5701            0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:1099            0.0.0.0:*               LISTEN

root@hazelcast-14:/# nslookup 127.0.0.1
1.0.0.127.in-addr.arpa	name = localhost.

root@hazelcast-14:/# nslookup localhost
Server:		10.16.0.10
Address:	10.16.0.10#53

Name:	localhost
Address: 127.0.0.1
Name:	localhost
Address: ::1 

@dhoard
Collaborator

dhoard commented Oct 24, 2024

@suavebajaj Have you tested using the IP address from the machine?

curl http://10.16.0.10:1099/metrics

@suavebajaj
Author

@dhoard Yes, I tested with this IP. Only the metrics endpoint does not return anything. I also tried fetching the cluster details using the pod IP, and that succeeded.

root@hazelcast-14:/# curl -vvv -k 10.16.0.10:1099/metrics
* Expire in 0 ms for 6 (transfer 0x5c32af2d80f0)
*   Trying 10.16.0.10...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x5c32af2d80f0)
* connect to 10.16.0.10 port 1099 failed: Connection timed out
* Failed to connect to 10.16.0.10 port 1099: Connection timed out
* Closing connection 0
curl: (7) Failed to connect to 10.16.0.10 port 1099: Connection timed out

root@hazelcast-14:/# curl -vvv -k 10.12.194.20:5701/hazelcast/rest/cluster
* Expire in 0 ms for 6 (transfer 0x5ab169f5b0f0)
*   Trying 10.12.194.20...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x5ab169f5b0f0)
* Connected to 10.12.194.20 (10.12.194.20) port 5701 (#0)
> GET /hazelcast/rest/cluster HTTP/1.1
> Host: 10.12.194.20:5701
> User-Agent: curl/7.64.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Length: 1668
<


Members [15] {
	Member [hazelcast-13.hazelcast-svc.default.svc.cluster.local]:5701 - 64a7611e-491c-49cc-ae83-6be9111be9e9
	Member [hazelcast-10.hazelcast-svc.default.svc.cluster.local]:5701 - 350b2809-f2ea-4ca6-a50e-ae9635e24d15
	Member [hazelcast-9.hazelcast-svc.default.svc.cluster.local]:5701 - 967195b8-198b-4da9-8043-6fdc32809730
	Member [hazelcast-8.hazelcast-svc.default.svc.cluster.local]:5701 - 4392d958-b7db-40dd-8ad2-bbce61884370
	Member [hazelcast-7.hazelcast-svc.default.svc.cluster.local]:5701 - 97cc8bcb-87c3-4654-85a6-f3f9ac7fc9aa
	Member [hazelcast-6.hazelcast-svc.default.svc.cluster.local]:5701 - d9a6a59a-42a5-4ab3-964a-5425b76862f5
	Member [hazelcast-5.hazelcast-svc.default.svc.cluster.local]:5701 - 137d7d4b-e9e1-44bc-8a98-a7d0e5302ea5
	Member [hazelcast-12.hazelcast-svc.default.svc.cluster.local]:5701 - 858900fc-ab24-4350-8cfd-6335b0de1e8f
	Member [hazelcast-14.hazelcast-svc.default.svc.cluster.local]:5701 - 991b22d8-bf74-402d-a429-11ade28c5c70 this
	Member [hazelcast-4.hazelcast-svc.default.svc.cluster.local]:5701 - 77d62147-dd05-4e42-9147-7453e0e56dc7
	Member [hazelcast-11.hazelcast-svc.default.svc.cluster.local]:5701 - 138ea843-3d4f-4168-b470-42c41b98a93f
	Member [hazelcast-1.hazelcast-svc.default.svc.cluster.local]:5701 - ecd0f499-93f4-4537-9bae-9059d6ba0450
	Member [hazelcast-0.hazelcast-svc.default.svc.cluster.local]:5701 - bea89516-7666-46a6-94bf-6e105caa4395
	Member [hazelcast-2.hazelcast-svc.default.svc.cluster.local]:5701 - 4f81e916-223d-4d0c-a6e5-80da924f7aa5
	Member [hazelcast-3.hazelcast-svc.default.svc.cluster.local]:5701 - ca5a1f98-c873-4a47-a838-432b69bc79a3
}

ConnectionCount: 123
AllConnectionCount: 138186
* Connection #0 to host 10.12.194.20 left intact

root@hazelcast-14:/# curl -vvv -k 10.12.194.20:1099/metrics
* Expire in 0 ms for 6 (transfer 0x56dcf37f20f0)
*   Trying 10.12.194.20...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x56dcf37f20f0)
* Connected to 10.12.194.20 (10.12.194.20) port 1099 (#0)
> GET /metrics HTTP/1.1
> Host: 10.12.194.20:1099
> User-Agent: curl/7.64.0
> Accept: */*
>
* connect to 10.12.194.20 port 1099 failed: Connection timed out
* Failed to connect to 10.12.194.20 port 1099: Connection timed out
* Closing connection 0

@dhoard
Collaborator

dhoard commented Oct 25, 2024

@suavebajaj can you provide the JVM options you are using to load the exporter?

@suavebajaj
Author

@dhoard I shared them in the initial post and am sharing them here again. Please let me know if anything else is needed on the configuration side.

Below are the exporter configuration, the Hazelcast configuration, and the JVM command line:

#see: https://github.com/prometheus/jmx_exporter#configuration
startDelaySeconds: 0
ssl: false
lowercaseOutputName: true
lowercaseOutputLabelNames: true
rules:
  # see "MBean Naming for Hazelcast Data Structures" here: https://docs.hazelcast.org/docs/latest-dev/manual/html-single/index.html#monitoring-with-jmx
  # example input: "com.hazelcast<instance=_hzInstance_1_dev, name="hz:scheduled", type=HazelcastInstance.ManagedExecutorService><>completedTaskCount"
  - pattern: 'com\.hazelcast<instance=(.*), name=(.*), type=(.*)><>(.*):(.*)'
    labels:
      "hz_instance": "$1"
      "hz_name": "$2"
      "hz_type": "$3"
    name: "hazelcast_$4"
  # Fallback to the default pattern for anything not matching above
  - pattern: '.*'

cat /etc/manh/hazelcast_config.xml
<?xml version="1.0" encoding="UTF-8"?>
<hazelcast xsi:schemaLocation="http://www.hazelcast.com/schema/config hazelcast-config-3.6.xsd"
     xmlns="http://www.hazelcast.com/schema/config"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<management-center enabled="false">http://localhost:8080/mancenter</management-center>
<properties>
      <property name="hazelcast.jmx">true</property>
      <property name="hazelcast.rest.enabled">true</property>
</properties>
<map name="authserver.user">
  <time-to-live-seconds>60</time-to-live-seconds>
</map>
<map name="zuulserver.userGrants">
  <time-to-live-seconds>60</time-to-live-seconds>
</map>
<map name="zuulserver.resources">
  <time-to-live-seconds>60</time-to-live-seconds>
</map>

root@hazelcast-0:/# ps afx | grep java
  234 pts/0    S+     0:00  \_ grep java
    1 ?        Ssl    9:21 java -javaagent:/data/hazelcast/jmx_prometheus_javaagent-0.20.0.jar=1099:/etc/manh/hazelcast_exporter_config.yml -Xmx6144m -Xss1024k -Dlogging.level.com.manh.cp=DEBUG -Dlogging.level.com.netflix=WARN -Dlogging.level.com.hazelcast.nio.tcp=WARN -XX:+DoEscapeAnalysis -XX:+UseG1GC -XX:MaxGCPauseMillis=2000 -verbose:gc -Xloggc:/mnt/logs/hazelcastserver_G1-gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/mnt/logs/hazelcastserver_oom.hprof -XX:+DisableExplicitGC -Djavax.net.ssl.trustStore=/mnt/truststore.jks -Deureka.client.registerWithEureka=true -jar /main.jar

@dhoard
Collaborator

dhoard commented Oct 28, 2024

@suavebajaj The previous output showed that you couldn't connect.

Can you clarify whether you can't connect at all, or whether it connects but you get no metrics (an empty response)?
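
The distinction asked about here, a failed TCP connect versus an established connection that never yields a response, can also be checked separately from curl. The following is a minimal sketch, assuming a plain HTTP/1.0 GET is enough to provoke a reply from the endpoint. The class `MetricsProbe`, the `probe` helper, and the timeout values are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class MetricsProbe {
    enum Result { CONNECT_FAILED, CONNECTED_NO_RESPONSE, RESPONDED }

    static Result probe(String host, int port, int connectTimeoutMs, int readTimeoutMs) {
        try (Socket socket = new Socket()) {
            try {
                // Phase 1: TCP connect only. A refusal or timeout here means
                // the request never reached the exporter's HTTP handler.
                socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            } catch (IOException e) {
                return Result.CONNECT_FAILED;
            }
            // Phase 2: send a request and wait for the first response byte.
            socket.setSoTimeout(readTimeoutMs);
            OutputStream out = socket.getOutputStream();
            String request = "GET /metrics HTTP/1.0\r\nHost: " + host + "\r\n\r\n";
            out.write(request.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            return socket.getInputStream().read() >= 0
                    ? Result.RESPONDED
                    : Result.CONNECTED_NO_RESPONSE;
        } catch (IOException e) {
            // Read timeout or reset after a successful connect.
            return Result.CONNECTED_NO_RESPONSE;
        }
    }

    public static void main(String[] args) {
        // Host and port as used in this thread; timeouts are arbitrary.
        System.out.println(probe("127.0.0.1", 1099, 2000, 5000));
    }
}
```

A `CONNECT_FAILED` result points at the network path or the listening socket, while `CONNECTED_NO_RESPONSE` points at the exporter accepting the connection but not answering.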

@suavebajaj
Author

@dhoard, the metrics endpoint is not responding. The connection fails with a timeout; we do not get an empty response.

@dhoard
Collaborator

dhoard commented Oct 28, 2024

@suavebajaj This is an interesting issue; it's not clear what's going on.

Looking at the 0.20.0 code, if no hostname/IP address is provided as an argument, the code uses "0.0.0.0". This should represent all network addresses on the machine.

socket = new InetSocketAddress("0.0.0.0", port);

Some things to try...

  1. Pass 127.0.0.1 as the host and see whether curl http://127.0.0.1:1099/metrics works
  2. Pass the machine IP as the host and see whether curl http://<MACHINE IP>:1099/metrics works
  3. Test with the 1.0.1 release. It uses InetAddress.getByName("0.0.0.0")
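
Both ways of naming the wildcard address mentioned above should yield the same bind address. The following sketch only compares the two expressions using standard `java.net` classes. It is not the exporter's code, and the class name `WildcardBind` and the helper `isWildcard` are hypothetical.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

public class WildcardBind {
    // True when the socket address is resolved and refers to the wildcard
    // ("any local") address, i.e. a bind on it listens on all interfaces.
    static boolean isWildcard(InetSocketAddress address) {
        return !address.isUnresolved() && address.getAddress().isAnyLocalAddress();
    }

    public static void main(String[] args) throws UnknownHostException {
        int port = 1099; // port used in this thread

        // Form quoted from the 0.20.0 code.
        InetSocketAddress fromString = new InetSocketAddress("0.0.0.0", port);

        // Form built from InetAddress.getByName, as mentioned for 1.0.1.
        InetSocketAddress fromInetAddress =
                new InetSocketAddress(InetAddress.getByName("0.0.0.0"), port);

        System.out.println(isWildcard(fromString));
        System.out.println(isWildcard(fromInetAddress));
    }
}
```

If both report the wildcard address, switching releases would not be expected to change which interfaces the agent listens on.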

@suavebajaj
Author

@dhoard I tried the first two already; sharing the output again. I will try updating to the 1.0.1 release.

root@hazelcast-14:/# curl -vvv -k http://127.0.0.1:1099/metrics
* Expire in 0 ms for 6 (transfer 0x56f0010850f0)
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x56f0010850f0)
* connect to 127.0.0.1 port 1099 failed: Connection timed out
* Failed to connect to 127.0.0.1 port 1099: Connection timed out
* Closing connection 0
curl: (7) Failed to connect to 127.0.0.1 port 1099: Connection timed out
root@hazelcast-14:/# curl -vvv -k 10.12.194.20:1099/metrics
* Expire in 0 ms for 6 (transfer 0x56dcf37f20f0)
*   Trying 10.12.194.20...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x56dcf37f20f0)
* Connected to 10.12.194.20 (10.12.194.20) port 1099 (#0)
> GET /metrics HTTP/1.1
> Host: 10.12.194.20:1099
> User-Agent: curl/7.64.0
> Accept: */*
>
* connect to 10.12.194.20 port 1099 failed: Connection timed out
* Failed to connect to 10.12.194.20 port 1099: Connection timed out
* Closing connection 0

@suavebajaj
Author

Hey @dhoard, I updated the exporter to 1.0.1, but the /metrics endpoint is still not responding.

root@hazelcast-14:/# ps afx | grep java
    144 pts/0    R+     0:00  \_ grep java
      1 ?        Ssl    1:22 java -javaagent:/data/hazelcast/jmx_prometheus_javaagent-1.0.1.jar=1099:/etc/manh/hazelcast_exporter_config.yml -Xmx3072m -Xss1024k -Dlogging.level.com.manh.cp=INFO -Dlogging.level.com.netflix=WARN -Dlogging.level.com.hazelcast.nio.tcp=WARN -XX:+DoEscapeAnalysis -XX:+UseG1GC -XX:MaxGCPauseMillis=2000 -verbose:gc -Xloggc:/mnt/logs/hazelcastserver_G1-gc.log -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/mnt/logs/hazelcastserver_oom.hprof -XX:+DisableExplicitGC -Djavax.net.ssl.trustStore=/mnt/truststore.jks -Deureka.client.registerWithEureka=true -jar /main.jar
root@hazelcast-14:/# curl -vvv -k 127.0.0.1:1099/metrics
* Expire in 0 ms for 6 (transfer 0x5aefc81b30f0)
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x5aefc81b30f0)
* connect to 127.0.0.1 port 1099 failed: Connection timed out
* Failed to connect to 127.0.0.1 port 1099: Connection timed out
* Closing connection 0
curl: (7) Failed to connect to 127.0.0.1 port 1099: Connection timed out
root@hazelcast-14:/# curl -vvv -k 10.12.9.16:1099/metrics
* Expire in 0 ms for 6 (transfer 0x5bd7f66820f0)
*   Trying 10.12.9.16...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x5bd7f66820f0)
* connect to 10.12.9.16 port 1099 failed: Connection timed out
* Failed to connect to 10.12.9.16 port 1099: Connection timed out
* Closing connection 0
curl: (7) Failed to connect to 10.12.9.16 port 1099: Connection timed out
root@hazelcast-14:/# curl -vvv -k 10.12.9.16:5701/hazelcast/rest/cluster
* Expire in 0 ms for 6 (transfer 0x5a67224b90f0)
*   Trying 10.12.9.16...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0x5a67224b90f0)
* Connected to 10.12.9.16 (10.12.9.16) port 5701 (#0)
> GET /hazelcast/rest/cluster HTTP/1.1
> Host: 10.12.9.16:5701
> User-Agent: curl/7.64.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Length: 1555
<


Members [14] {
	Member [hazelcast-11.hazelcast-svc.default.svc.cluster.local]:5701 - 600cfaa7-80f2-4cae-a8cb-97a21d5cd219
	Member [hazelcast-10.hazelcast-svc.default.svc.cluster.local]:5701 - b0d1c400-521b-457a-8647-a69dc0f3883c
	Member [hazelcast-9.hazelcast-svc.default.svc.cluster.local]:5701 - b45d554a-fc53-474d-872c-b7989b6574b4
	Member [hazelcast-8.hazelcast-svc.default.svc.cluster.local]:5701 - 37a2c650-f24c-44c3-a650-70f24de9d07f
	Member [hazelcast-7.hazelcast-svc.default.svc.cluster.local]:5701 - f3669d57-3d0a-43ad-8378-4ee775749d5a
	Member [hazelcast-6.hazelcast-svc.default.svc.cluster.local]:5701 - b3c50f2c-3134-4c6b-950f-a8ebf125b426
	Member [hazelcast-5.hazelcast-svc.default.svc.cluster.local]:5701 - 66f9306d-50e0-4293-b532-71635906eefe
	Member [hazelcast-4.hazelcast-svc.default.svc.cluster.local]:5701 - e882ddf7-5539-49ae-b44f-be4cd3f5cb1b
	Member [hazelcast-3.hazelcast-svc.default.svc.cluster.local]:5701 - 13895861-e2e5-4f92-8e82-8177ca5bf9cd
	Member [hazelcast-2.hazelcast-svc.default.svc.cluster.local]:5701 - 7c9f3f17-aa5f-4928-80a7-41aee2013287
	Member [hazelcast-0.hazelcast-svc.default.svc.cluster.local]:5701 - 110b4008-01ab-4065-bb52-35fda277e3e8
	Member [hazelcast-1.hazelcast-svc.default.svc.cluster.local]:5701 - 5e23e2e2-bdc2-4edb-b23d-ce1437bf03af
	Member [hazelcast-14.hazelcast-svc.default.svc.cluster.local]:5701 - a426cf2e-4d5a-4866-b90f-ae510ed4b4d0 this
	Member [hazelcast-13.hazelcast-svc.default.svc.cluster.local]:5701 - ac4a67fa-32b9-44fb-9c0d-6fc0bc7e5994
}

ConnectionCount: 6
AllConnectionCount: 19
* Connection #0 to host 10.12.9.16 left intact

@dhoard
Collaborator

dhoard commented Nov 4, 2024

@suavebajaj I have attached a zip containing the Java agent built from the latest code, with 4 debug prints that go to System.out. Can you test it?

DEBUG host [0.0.0.0]
DEBUG inetAddress [/0.0.0.0]
DEBUG port [8888]
DEBUG file [exporter.yaml]

debug.zip

suavebajaj changed the title from "/metrics Endpoint of Hazelcast Does Not Return Any Data" to "/metrics of jmx_exporter does not respond in time" on Nov 5, 2024

@SaylorZhu

We encountered the same issue. Shortly after program startup, jmx_exporter responds normally. However, after business requests have been running for a while, jmx_exporter stops responding, even though the port is still listening.

@dhoard
Collaborator

dhoard commented Dec 7, 2024

@suavebajaj @SaylorZhu can either of you provide a thread dump?
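
For a JVM that is already running, a thread dump is usually captured with JDK tooling such as `jstack <pid>` or `jcmd <pid> Thread.print`. As an illustration of what such a dump contains, the sketch below produces a simplified one from inside a JVM using the standard `ThreadMXBean` API. The class name `ThreadDumpSketch` and its output format are hypothetical and far less detailed than jstack output.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumpSketch {
    // Builds a simplified dump: one header line per live thread with its
    // state, followed by that thread's stack frames.
    static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Only request lock details when this JVM supports them.
        ThreadInfo[] infos = threads.dumpAllThreads(
                threads.isObjectMonitorUsageSupported(),
                threads.isSynchronizerUsageSupported());
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : infos) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

In a dump taken while the endpoint is unresponsive, the threads serving HTTP requests and their states (for example BLOCKED or WAITING) are typically the first thing to look at.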
