Heartbeat connection to wrong Cluster IP when "replicationFactor: 2" #4

Open
afshinpaydar opened this issue Apr 20, 2019 · 1 comment
Labels: help wanted, question

Comments

@afshinpaydar

afshinpaydar commented Apr 20, 2019

The second Aerospike node (pod as-cluster-0-1) does not come up because of a heartbeat connectivity issue:

cat docs/examples/20-aerospike-cluster.yml 
apiVersion: aerospike.travelaudience.com/v1alpha2
kind: AerospikeCluster
metadata:
  name: as-cluster-0
spec:
  version: "4.2.0.10"
  nodeCount: 3
  namespaces:
  - name: as-namespace-0
    replicationFactor: 3
    memorySize: 1G
    defaultTTL: 0s
    storage:
      type: file
      size: 1G
      storageClassName: glusterfs-storage
oc get pod -o wide
NAME                                  READY     STATUS    RESTARTS   AGE       IP             NODE                   NOMINATED NODE
aerospike-operator-55c4c4fc58-vqgks   1/1       Running   0          33s       10.130.1.156   node2.soshyant.local   <none>
as-cluster-0-0                        2/2       Running   0          21s       10.129.1.176   node1.soshyant.local   <none>
as-cluster-0-1                        0/2       Pending   0          3s        <none>         <none>                 <none>
oc logs as-cluster-0-0 -c aerospike-server

Apr 20 2019 05:25:55 GMT: INFO (as): (as.c:372) initializing services...
Apr 20 2019 05:25:55 GMT: INFO (tsvc): (thr_tsvc.c:136) 4 transaction queues: starting 4 threads per queue
Apr 20 2019 05:25:55 GMT: INFO (fabric): (fabric.c:800) updated fabric published address list to {10.129.1.176:3001}
Apr 20 2019 05:25:55 GMT: INFO (partition): (partition_balance.c:196) {as-namespace-0} 4096 partitions: found 4096 absent, 0 stored
Apr 20 2019 05:25:55 GMT: INFO (hb): (hb.c:5490) updated heartbeat published address list to {10.129.1.176:3002}
Apr 20 2019 05:25:55 GMT: INFO (batch): (batch.c:732) starting 4 batch-index-threads
Apr 20 2019 05:25:55 GMT: INFO (batch): (thr_batch.c:373) starting 4 batch-threads
Apr 20 2019 05:25:55 GMT: INFO (fabric): (fabric.c:452) starting 8 fabric send threads
Apr 20 2019 05:25:55 GMT: INFO (fabric): (fabric.c:469) starting 16 fabric rw channel recv threads
Apr 20 2019 05:25:55 GMT: INFO (fabric): (fabric.c:469) starting 4 fabric ctrl channel recv threads
Apr 20 2019 05:25:55 GMT: INFO (fabric): (fabric.c:469) starting 4 fabric bulk channel recv threads
Apr 20 2019 05:25:55 GMT: INFO (fabric): (fabric.c:469) starting 4 fabric meta channel recv threads
Apr 20 2019 05:25:55 GMT: INFO (fabric): (fabric.c:475) starting fabric accept thread
Apr 20 2019 05:25:55 GMT: INFO (hb): (hb.c:7003) initializing mesh heartbeat socket: 0.0.0.0:3002
Apr 20 2019 05:25:55 GMT: INFO (fabric): (socket.c:702) Started fabric endpoint 0.0.0.0:3001
Apr 20 2019 05:25:55 GMT: INFO (hb): (hb.c:7032) mtu of the network is 1450
Apr 20 2019 05:25:55 GMT: INFO (hb): (socket.c:702) Started mesh heartbeat endpoint 0.0.0.0:3002
Apr 20 2019 05:25:55 GMT: INFO (nsup): (thr_nsup.c:1096) starting namespace supervisor threads
Apr 20 2019 05:25:55 GMT: INFO (demarshal): (thr_demarshal.c:886) starting 4 demarshal threads
Apr 20 2019 05:25:55 GMT: INFO (demarshal): (socket.c:702) Started client endpoint 0.0.0.0:3000
Apr 20 2019 05:25:55 GMT: INFO (info-port): (thr_info_port.c:300) starting info port thread
Apr 20 2019 05:25:55 GMT: INFO (info-port): (socket.c:702) Started info endpoint 0.0.0.0:3003
Apr 20 2019 05:25:55 GMT: INFO (as): (as.c:416) service ready: soon there will be cake!


.
.
.


Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:171) NODE-ID ae90b0f2b5b13701 CLUSTER-SIZE 1
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:247)    cluster-clock: skew-ms 0
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:277)    system-memory: free-kbytes 3251196 free-pct 40 heap-kbytes (1093695,1094224,1124352) heap-efficiency-pct 97.3
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:291)    in-progress: tsvc-q 0 info-q 0 nsup-delete-q 0 rw-hash 0 proxy-hash 0 tree-gc-q 0
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:313)    fds: proto (0,87,87) heartbeat (0,0,0) fabric (0,0,0)
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:322)    heartbeat-received: self 0 foreign 0
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:353)    fabric-bytes-per-second: bulk (0,0) ctrl (0,0) meta (0,0) rw (0,0)
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:408) {as-namespace-0} objects: all 0 master 0 prole 0 non-replica 0
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:469) {as-namespace-0} migrations: complete
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:497) {as-namespace-0} memory-usage: total-bytes 0 index-bytes 0 sindex-bytes 0 used-pct 0.00
Apr 20 2019 05:32:46 GMT: INFO (info): (ticker.c:536) {as-namespace-0} device-usage: used-bytes 0 avail-pct 99 cache-read-pct 0.00
oc logs as-cluster-0-1 -c aerospike-server

Apr 20 2019 05:26:12 GMT: INFO (fabric): (fabric.c:469) starting 4 fabric ctrl channel recv threads
Apr 20 2019 05:26:12 GMT: INFO (fabric): (fabric.c:469) starting 4 fabric bulk channel recv threads
Apr 20 2019 05:26:12 GMT: INFO (fabric): (fabric.c:469) starting 4 fabric meta channel recv threads
Apr 20 2019 05:26:12 GMT: INFO (fabric): (fabric.c:475) starting fabric accept thread
Apr 20 2019 05:26:12 GMT: INFO (hb): (hb.c:7003) initializing mesh heartbeat socket: 0.0.0.0:3002
Apr 20 2019 05:26:12 GMT: INFO (fabric): (socket.c:702) Started fabric endpoint 0.0.0.0:3001
Apr 20 2019 05:26:12 GMT: INFO (hb): (hb.c:7032) mtu of the network is 1450
Apr 20 2019 05:26:12 GMT: INFO (hb): (socket.c:702) Started mesh heartbeat endpoint 0.0.0.0:3002
Apr 20 2019 05:26:12 GMT: INFO (nsup): (thr_nsup.c:1096) starting namespace supervisor threads
Apr 20 2019 05:26:12 GMT: INFO (demarshal): (thr_demarshal.c:886) starting 4 demarshal threads
Apr 20 2019 05:26:12 GMT: WARNING (socket): (socket.c:746) Timeout while connecting
Apr 20 2019 05:26:12 GMT: WARNING (socket): (socket.c:814) Error while connecting socket to 172.30.46.109:3002
Apr 20 2019 05:26:12 GMT: WARNING (hb): (hb.c:4845) could not create heartbeat connection to node {172.30.46.109:3002}
Apr 20 2019 05:26:12 GMT: INFO (demarshal): (socket.c:702) Started client endpoint 0.0.0.0:3000
Apr 20 2019 05:26:12 GMT: INFO (info-port): (thr_info_port.c:300) starting info port thread
Apr 20 2019 05:26:12 GMT: INFO (info-port): (socket.c:702) Started info endpoint 0.0.0.0:3003
Apr 20 2019 05:26:12 GMT: INFO (as): (as.c:416) service ready: soon there will be cake!
Apr 20 2019 05:26:14 GMT: INFO (clustering): (clustering.c:6345) principal node - forming new cluster with succession list: ad61545e60f70194
Apr 20 2019 05:26:14 GMT: INFO (clustering): (clustering.c:5784) applied new cluster key 96d3085cbe4d
Apr 20 2019 05:26:14 GMT: INFO (clustering): (clustering.c:5786) applied new succession list ad61545e60f70194
Apr 20 2019 05:26:14 GMT: INFO (clustering): (clustering.c:5788) applied cluster size 1
Apr 20 2019 05:26:14 GMT: INFO (exchange): (exchange.c:2159) data exchange started with cluster key 96d3085cbe4d
Apr 20 2019 05:26:14 GMT: INFO (exchange): (exchange.c:2989) received commit command from principal node ad61545e60f70194
Apr 20 2019 05:26:14 GMT: INFO (exchange): (exchange.c:2948) data exchange completed with cluster key 96d3085cbe4d
Apr 20 2019 05:26:14 GMT: INFO (partition): (partition_balance.c:915) {as-namespace-0} replication factor is 1
Apr 20 2019 05:26:14 GMT: INFO (partition): (partition_balance.c:887) {as-namespace-0} rebalanced: expected-migrations (0,0) expected-signals 0 fresh-partitions 4096
Apr 20 2019 05:26:15 GMT: WARNING (socket): (socket.c:755) Error while connecting: 113 (No route to host)
oc get svc -o wide
NAME                                                     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE       SELECTOR
aerospike-operator                                       ClusterIP   172.30.46.109   <none>        443/TCP                      55s       app=aerospike-operator
as-cluster-0                                             ClusterIP   None            <none>        3000/TCP,3002/TCP,9145/TCP   43s       app=aerospike,cluster=as-cluster-0
glusterfs-dynamic-bf1e525f-632c-11e9-95f6-005056afc8ad   ClusterIP   172.30.90.194   <none>        1/TCP                        37s       <none>
glusterfs-dynamic-c9be2905-632c-11e9-95f6-005056afc8ad   ClusterIP   172.30.169.99   <none>        1/TCP                        19s       <none>
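
One thing that can be read directly from the output above: the address the second node fails to reach, 172.30.46.109:3002, is the CLUSTER-IP of the aerospike-operator service, not the pod IP 10.129.1.176 that as-cluster-0-0 publishes for heartbeat. The following is a minimal sketch for cross-checking this kind of mismatch. It only parses text copied from the logs and the `oc get` tables in this issue; the helper names (`heartbeat_targets`, `service_cluster_ips`, `explain`) are hypothetical and not part of Aerospike or the operator, and the sketch does not explain why the address was chosen.

```python
import re

# Excerpts copied verbatim from the as-cluster-0-1 log and `oc get svc -o wide` above.
SERVER_LOG = """\
Apr 20 2019 05:26:12 GMT: WARNING (socket): (socket.c:814) Error while connecting socket to 172.30.46.109:3002
Apr 20 2019 05:26:12 GMT: WARNING (hb): (hb.c:4845) could not create heartbeat connection to node {172.30.46.109:3002}
"""

SVC_TABLE = """\
NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE   SELECTOR
aerospike-operator   ClusterIP   172.30.46.109   <none>        443/TCP                      55s   app=aerospike-operator
as-cluster-0         ClusterIP   None            <none>        3000/TCP,3002/TCP,9145/TCP   43s   app=aerospike,cluster=as-cluster-0
"""

# Pod IPs taken from `oc get pod -o wide` above.
POD_IPS = {
    "10.130.1.156": "aerospike-operator-55c4c4fc58-vqgks",
    "10.129.1.176": "as-cluster-0-0",
}

HB_FAILURE = re.compile(
    r"could not create heartbeat connection to node \{(\d+\.\d+\.\d+\.\d+):\d+\}"
)


def heartbeat_targets(log_text):
    """Return the set of IPs for which a heartbeat connection could not be created."""
    return set(HB_FAILURE.findall(log_text))


def service_cluster_ips(table_text):
    """Map CLUSTER-IP to service name, skipping headless services (CLUSTER-IP None)."""
    ips = {}
    for line in table_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 3 and fields[2] != "None":
            ips[fields[2]] = fields[0]
    return ips


def explain(log_text, table_text, pod_ips):
    """Say, for each failed heartbeat target, whether it is a pod IP or a service IP."""
    services = service_cluster_ips(table_text)
    report = {}
    for ip in heartbeat_targets(log_text):
        if ip in pod_ips:
            report[ip] = "pod " + pod_ips[ip]
        elif ip in services:
            report[ip] = "service " + services[ip]
        else:
            report[ip] = "unknown"
    return report


RESULT = explain(SERVER_LOG, SVC_TABLE, POD_IPS)
print(RESULT)  # {'172.30.46.109': 'service aerospike-operator'}
```

With the data from this issue, the only failed heartbeat target resolves to the operator's service, which listens on 443/TCP and not on 3002.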
@pires
Contributor

pires commented Jun 10, 2019

I honestly can't figure out what is wrong from the logs above. Unfortunately, I also don't have the OpenShift experience needed to help. I know other people in the community have had issues with it when trying to use this and other operators we maintain.

pires added the "help wanted" and "question" labels on Jul 2, 2019