checkup_maintenance_agents ignored the agents_layout #547

Open
jeroenmaelbrancke opened this issue Jan 15, 2018 · 2 comments

@jeroenmaelbrancke

Last Friday we detected that the maintenance agents on NY2 were deployed on ASD VMs.
This is not what we configured in the agents_layout config.
After manually running the checkup_maintenance_agents method, the services were removed from the ASD VMs and deployed on the correct nodes.

I can't find anything in the logs that explains why the maintenance agents were deployed on different nodes.
Maybe the update triggered something, or they were already deployed there from the beginning...

What I don't understand is why the cron job didn't detect this while a manual run did.

Agents layouts:

root@NY2SRV0011:~# ovs config get ovs/alba/backends/1ae3eb7e-a197-4021-bec1-888e167bba05/maintenance/agents_layout
["M9e2WY2yg13NlsEg7ssx7nmWmCRmqhsY", "ClniWKVepnpkVIXUHxtLoi6MJmr69wSb"]

root@NY2SRV0011:~# ovs config get ovs/alba/backends/3b408aa1-0407-4d9e-be3a-babce370ab13/maintenance/agents_layout
["RAAF6YiDaEWlmKvoS3Q9m3CRdUG9Dr8k", "iZ577tqejLcOesIg011uVZo2H475CIzN"]

root@NY2SRV0011:~# ovs config get ovs/alba/backends/460620d3-984b-4feb-a217-adf56fb14038/maintenance/agents_layout
["RAAF6YiDaEWlmKvoS3Q9m3CRdUG9Dr8k", "iZ577tqejLcOesIg011uVZo2H475CIzN"]

root@NY2SRV0011:~# ovs config get ovs/alba/backends/e4a9beee-2eff-466a-951d-257bf8395a0a/maintenance/agents_layout
["QBLmEzzfbnL6glKNVkGOR0VoKWiMDxNS", "OKf0P4IhuPdPcZlTFQm6AvXNOyIx8EaV"]

node_ids:

NY2SRV0001 | SUCCESS | rc=0 >>
{"node_id": "M9e2WY2yg13NlsEg7ssx7nmWmCRmqhsY"}

NY2SRV0002 | SUCCESS | rc=0 >>
{"node_id": "ClniWKVepnpkVIXUHxtLoi6MJmr69wSb"}

NY2SRV0003 | SUCCESS | rc=0 >>
{"node_id": "QBLmEzzfbnL6glKNVkGOR0VoKWiMDxNS"}

NY2SRV0004 | SUCCESS | rc=0 >>
{"node_id": "OKf0P4IhuPdPcZlTFQm6AvXNOyIx8EaV"}

NY2SRV0005 | SUCCESS | rc=0 >>
{"node_id": "iZ577tqejLcOesIg011uVZo2H475CIzN"}

NY2SRV0006 | SUCCESS | rc=0 >>
{"node_id": "RAAF6YiDaEWlmKvoS3Q9m3CRdUG9Dr8k"}

NY2SRV0007 | SUCCESS | rc=0 >>
{"node_id": "64U1SotoiZoxD4QFEfwVZMhOjhZ419wG"}
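
To make the comparison between the layouts and the node IDs easier to read, the values above can be resolved to hostnames. This is only a minimal sketch built from the output pasted above; the names `NODE_IDS`, `LAYOUTS` and `resolve_layouts` are made up for illustration and are not part of OpenvStorage.

```python
# Node IDs as reported by each host in the output above.
NODE_IDS = {
    "NY2SRV0001": "M9e2WY2yg13NlsEg7ssx7nmWmCRmqhsY",
    "NY2SRV0002": "ClniWKVepnpkVIXUHxtLoi6MJmr69wSb",
    "NY2SRV0003": "QBLmEzzfbnL6glKNVkGOR0VoKWiMDxNS",
    "NY2SRV0004": "OKf0P4IhuPdPcZlTFQm6AvXNOyIx8EaV",
    "NY2SRV0005": "iZ577tqejLcOesIg011uVZo2H475CIzN",
    "NY2SRV0006": "RAAF6YiDaEWlmKvoS3Q9m3CRdUG9Dr8k",
    "NY2SRV0007": "64U1SotoiZoxD4QFEfwVZMhOjhZ419wG",
}

# agents_layout per backend GUID, as returned by `ovs config get` above.
LAYOUTS = {
    "1ae3eb7e-a197-4021-bec1-888e167bba05": [
        "M9e2WY2yg13NlsEg7ssx7nmWmCRmqhsY", "ClniWKVepnpkVIXUHxtLoi6MJmr69wSb"],
    "3b408aa1-0407-4d9e-be3a-babce370ab13": [
        "RAAF6YiDaEWlmKvoS3Q9m3CRdUG9Dr8k", "iZ577tqejLcOesIg011uVZo2H475CIzN"],
    "460620d3-984b-4feb-a217-adf56fb14038": [
        "RAAF6YiDaEWlmKvoS3Q9m3CRdUG9Dr8k", "iZ577tqejLcOesIg011uVZo2H475CIzN"],
    "e4a9beee-2eff-466a-951d-257bf8395a0a": [
        "QBLmEzzfbnL6glKNVkGOR0VoKWiMDxNS", "OKf0P4IhuPdPcZlTFQm6AvXNOyIx8EaV"],
}


def resolve_layouts(layouts, node_ids):
    """Translate every node ID in each layout to its hostname (hypothetical helper)."""
    host_by_id = {node_id: host for host, node_id in node_ids.items()}
    return {
        guid: [host_by_id.get(node_id, "unknown:" + node_id) for node_id in layout]
        for guid, layout in layouts.items()
    }


if __name__ == "__main__":
    for guid, hosts in sorted(resolve_layouts(LAYOUTS, NODE_IDS).items()):
        print(guid, "->", ", ".join(hosts))
```

With these values, every layout entry resolves to one of NY2SRV0001 through NY2SRV0006.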

We discovered this when we added a second backend to globalbackend02.
At a certain point some ASD VMs (the ones with a running maintenance agent) stopped responding to CheckMK.

At that point we saw some errors in the celery log:

Jan 13 05:00:00 NY2SRV0014 celery[29157]: 2018-01-13 05:00:00 20700 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/ensure single - 58714 - INFO - Ensure single CHAINED mode - ID 1515837600_shv8q5CzvC - New task alba.checkup_maintenance_agents with default params scheduled for execution
Jan 13 05:00:00 NY2SRV0014 celery[29157]: 2018-01-13 05:00:00 20900 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58715 - INFO - Loading maintenance information
Jan 13 05:00:00 NY2SRV0014 celery[29157]: 2018-01-13 05:00:00 22900 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.strategy - 58722 - INFO - Received task: get asd statistics[f9d36c0b-9c1b-473d-b4a3-67b51be2adcf]
Jan 13 05:00:00 NY2SRV0014 celery[29157]: 2018-01-13 05:00:00 23000 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.autoscale - 58723 - INFO - Scaling up 1 processes.
Jan 13 05:00:00 NY2SRV0014 celery[29157]: 2018-01-13 05:00:00 27400 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.strategy - 58724 - INFO - Received task: get asd statistics[ca549abb-c0e1-4226-9e2e-27338cce3cae]
Jan 13 05:00:00 NY2SRV0014 celery[29157]: 2018-01-13 05:00:00 27500 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.autoscale - 58725 - INFO - Scaling up 1 processes.
Jan 13 05:00:00 NY2SRV0014 celery[29157]: 2018-01-13 05:00:00 32200 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.job - 58726 - INFO - Task statsmonkey.get_mds_loads[8d36a846-81dd-409e-a98d-ad84a6f8117b] succeeded in 0.139715231024s: []
Jan 13 05:00:00 NY2SRV0014 celery[29157]: 2018-01-13 05:00:00 52200 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.job - 58727 - INFO - Task get asd statistics[52511ca2-cda2-461d-ad1a-dcc47430aa0a] succeeded in 0.198631620035s: [{'fields': {'disk_usage': 403495647668.0, 'MultiGet2_low_max': 0.0325810909, 'GetDiskUsage_avg': 3.1140400000000004e-05,...
Jan 13 05:00:00 NY2SRV0014 celery[29157]: 2018-01-13 05:00:00 56100 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.job - 58728 - INFO - Task get asd statistics[f9d36c0b-9c1b-473d-b4a3-67b51be2adcf] succeeded in 0.236320186872s: [{'fields': {'disk_usage': 114842563930.0, 'MultiGet2_low_max': 45.2979691029, 'GetDiskUsage_avg': 2.9235600000000002e-05,...
Jan 13 05:00:00 NY2SRV0014 celery[29157]: 2018-01-13 05:00:00 58800 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.job - 58729 - INFO - Task get asd statistics[57adccab-20b1-4c61-9fdd-fda75e1c8c9e] succeeded in 0.264004799072s: [{'fields': {'PartialGets_histogram_1e+04': 1.0, 'disk_usage': 1911106014156.0, 'PartialGets_histogram_1': 1295.0,...
Jan 13 05:00:00 NY2SRV0014 celery[29157]: 2018-01-13 05:00:00 60500 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.job - 58730 - INFO - Task get asd statistics[ca549abb-c0e1-4226-9e2e-27338cce3cae] succeeded in 0.280572557356s: [{'fields': {'disk_usage': 1190283083963.0, 'MultiGet2_low_max': 32.6431541443, 'GetDiskUsage_avg': 2.55124e-05,...
Jan 13 05:00:01 NY2SRV0014 celery[29157]: 2018-01-13 05:00:01 02400 -0500 - NY2SRV0014 - 25832/139935744538368 - lib/ensure single - 58717 - INFO - Ensure single DEFAULT mode - ID 1515837600_K6VxAtZDNp - Task statsmonkey.get_nsm_stats finished successfully
Jan 13 05:00:01 NY2SRV0014 celery[29157]: 2018-01-13 05:00:01 02400 -0500 - NY2SRV0014 - 25832/139935744538368 - lib/ensure single - 58718 - INFO - Ensure single DEFAULT mode - ID 1515837600_K6VxAtZDNp - Deleting key ovs_ensure_single_statsmonkey.get_nsm_stats
Jan 13 05:00:01 NY2SRV0014 celery[29157]: 2018-01-13 05:00:01 06900 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.job - 58731 - INFO - Task statsmonkey.get_nsm_stats[21ed941f-b1dc-445a-afb1-44112e4d0788] succeeded in 0.886369946878s: [{'fields': {'CleanupForNamespace_min': 0.0010738372802734375, 'UpdateObject3_avg': 0.0024721622467041016,...
Jan 13 05:00:01 NY2SRV0014 celery[29157]: 2018-01-13 05:00:01 63400 -0500 - NY2SRV0014 - 24782/139935744538368 - lib/ensure single - 58702 - INFO - Ensure single DEFAULT mode - ID 1515837600_C6uyMekIFy - Task statsmonkey.get_disk_safety finished successfully
Jan 13 05:00:01 NY2SRV0014 celery[29157]: 2018-01-13 05:00:01 63400 -0500 - NY2SRV0014 - 24782/139935744538368 - lib/ensure single - 58703 - INFO - Ensure single DEFAULT mode - ID 1515837600_C6uyMekIFy - Deleting key ovs_ensure_single_statsmonkey.get_disk_safety
Jan 13 05:00:01 NY2SRV0014 celery[29157]: 2018-01-13 05:00:01 66200 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.job - 58732 - INFO - Task statsmonkey.get_disk_safety[7bf95072-1ea0-473a-aff2-205d888f80b4] succeeded in 1.59923929581s: [{'fields': {'total_objects': 5081976, 'objects': 5081976}, 'tags': {'environment': u'NY2', 'disk_lost': 0, 'backend_name':...
Jan 13 05:00:02 NY2SRV0014 celery[29157]: 2018-01-13 05:00:02 06600 -0500 - NY2SRV0014 - 25828/139935744538368 - celery/celery.redirected - 58718 - WARNING - 2018-01-13 05:00:02 06600 -0500 - NY2SRV0014 - 25828/139935744538368 - extensions/asdmanagerclient - 58717 - INFO - Request "list_maintenance_services" took 1.18 seconds (internal duration 1.18 seconds)
Jan 13 05:00:02 NY2SRV0014 celery[29157]: 2018-01-13 05:00:02 08400 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58719 - ERROR - * Cannot fetch maintenance information for 172.17.23.32
Jan 13 05:00:02 NY2SRV0014 celery[29157]: Traceback (most recent call last):
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/opt/OpenvStorage/ovs/lib/alba.py", line 1408, in checkup_maintenance_agents
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     service_names = node.client.list_maintenance_services()
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/opt/OpenvStorage/ovs/extensions/plugins/asdmanager.py", line 304, in list_maintenance_services
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     return self._call(requests.get, 'maintenance', clean=True)['services']
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/opt/OpenvStorage/ovs/extensions/plugins/asdmanager.py", line 95, in _call
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     response = method(**kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/api.py", line 67, in get
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     return request('get', url, params=params, **kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/api.py", line 53, in request
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     return session.request(method=method, url=url, **kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 468, in request
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     resp = self.send(prep, **send_kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 576, in send
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     r = adapter.send(request, **kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 437, in send
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     raise ConnectionError(e, request=request)
Jan 13 05:00:02 NY2SRV0014 celery[29157]: ConnectionError: HTTPSConnectionPool(host='172.17.23.32', port=8500): Max retries exceeded with url: /maintenance (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f453c4d1f50>: Failed to establish a new connection: [Errno 111] Connection refused',))
Jan 13 05:00:02 NY2SRV0014 celery[29157]: 2018-01-13 05:00:02 08700 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58720 - ERROR - * Cannot fetch maintenance information for 172.17.23.41
Jan 13 05:00:02 NY2SRV0014 celery[29157]: Traceback (most recent call last):
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/opt/OpenvStorage/ovs/lib/alba.py", line 1408, in checkup_maintenance_agents
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     service_names = node.client.list_maintenance_services()
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/opt/OpenvStorage/ovs/extensions/plugins/asdmanager.py", line 304, in list_maintenance_services
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     return self._call(requests.get, 'maintenance', clean=True)['services']
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/opt/OpenvStorage/ovs/extensions/plugins/asdmanager.py", line 95, in _call
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     response = method(**kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/api.py", line 67, in get
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     return request('get', url, params=params, **kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/api.py", line 53, in request
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     return session.request(method=method, url=url, **kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 468, in request
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     resp = self.send(prep, **send_kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 576, in send
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     r = adapter.send(request, **kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 437, in send
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     raise ConnectionError(e, request=request)
Jan 13 05:00:02 NY2SRV0014 celery[29157]: ConnectionError: HTTPSConnectionPool(host='172.17.23.41', port=8500): Max retries exceeded with url: /maintenance (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f453c48e050>: Failed to establish a new connection: [Errno 111] Connection refused',))
Jan 13 05:00:02 NY2SRV0014 celery[29157]: 2018-01-13 05:00:02 08900 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58721 - ERROR - * Cannot fetch maintenance information for 172.17.23.9
Jan 13 05:00:02 NY2SRV0014 celery[29157]: Traceback (most recent call last):
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/opt/OpenvStorage/ovs/lib/alba.py", line 1408, in checkup_maintenance_agents
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     service_names = node.client.list_maintenance_services()
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/opt/OpenvStorage/ovs/extensions/plugins/asdmanager.py", line 304, in list_maintenance_services
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     return self._call(requests.get, 'maintenance', clean=True)['services']
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/opt/OpenvStorage/ovs/extensions/plugins/asdmanager.py", line 95, in _call
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     response = method(**kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/api.py", line 67, in get
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     return request('get', url, params=params, **kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/api.py", line 53, in request
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     return session.request(method=method, url=url, **kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 468, in request
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     resp = self.send(prep, **send_kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 576, in send
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     r = adapter.send(request, **kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 437, in send
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     raise ConnectionError(e, request=request)
Jan 13 05:00:02 NY2SRV0014 celery[29157]: ConnectionError: HTTPSConnectionPool(host='172.17.23.9', port=8500): Max retries exceeded with url: /maintenance (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f453c48e190>: Failed to establish a new connection: [Errno 111] Connection refused',))
Jan 13 05:00:02 NY2SRV0014 celery[29157]: 2018-01-13 05:00:02 09200 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58722 - ERROR - * Cannot fetch maintenance information for 172.17.23.38
Jan 13 05:00:02 NY2SRV0014 celery[29157]: Traceback (most recent call last):
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/opt/OpenvStorage/ovs/lib/alba.py", line 1408, in checkup_maintenance_agents
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     service_names = node.client.list_maintenance_services()
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/opt/OpenvStorage/ovs/extensions/plugins/asdmanager.py", line 304, in list_maintenance_services
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     return self._call(requests.get, 'maintenance', clean=True)['services']
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/opt/OpenvStorage/ovs/extensions/plugins/asdmanager.py", line 95, in _call
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     response = method(**kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/api.py", line 67, in get
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     return request('get', url, params=params, **kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/api.py", line 53, in request
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     return session.request(method=method, url=url, **kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 468, in request
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     resp = self.send(prep, **send_kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 576, in send
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     r = adapter.send(request, **kwargs)
Jan 13 05:00:02 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 437, in send
Jan 13 05:00:02 NY2SRV0014 celery[29157]:     raise ConnectionError(e, request=request)
Jan 13 05:00:02 NY2SRV0014 celery[29157]: ConnectionError: HTTPSConnectionPool(host='172.17.23.38', port=8500): Max retries exceeded with url: /maintenance (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f453c48e2d0>: Failed to establish a new connection: [Errno 111] Connection refused',))
Jan 13 05:00:02 NY2SRV0014 celery[29157]: 2018-01-13 05:00:02 51900 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.strategy - 58733 - INFO - Received task: ovs.storagerouter.ping[f551dacd-6c61-4037-b7ca-87102661c6f3]
Jan 13 05:00:02 NY2SRV0014 celery[29157]: 2018-01-13 05:00:02 56100 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.job - 58734 - INFO - Task ovs.storagerouter.ping[f551dacd-6c61-4037-b7ca-87102661c6f3] succeeded in 0.0409898106009s: None
Jan 13 05:00:03 NY2SRV0014 celery[29157]: 2018-01-13 05:00:03 33100 -0500 - NY2SRV0014 - 25828/139935744538368 - celery/celery.redirected - 58724 - WARNING - 2018-01-13 05:00:03 33100 -0500 - NY2SRV0014 - 25828/139935744538368 - extensions/asdmanagerclient - 58723 - INFO - Request "list_maintenance_services" took 1.23 seconds (internal duration 0.07 seconds)
Jan 13 05:00:04 NY2SRV0014 celery[29157]: 2018-01-13 05:00:04 63000 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58726 - ERROR - * Cannot fetch maintenance information for 172.17.23.33
Jan 13 05:00:04 NY2SRV0014 celery[29157]: Traceback (most recent call last):
Jan 13 05:00:04 NY2SRV0014 celery[29157]:   File "/opt/OpenvStorage/ovs/lib/alba.py", line 1408, in checkup_maintenance_agents
Jan 13 05:00:04 NY2SRV0014 celery[29157]:     service_names = node.client.list_maintenance_services()
Jan 13 05:00:04 NY2SRV0014 celery[29157]:   File "/opt/OpenvStorage/ovs/extensions/plugins/asdmanager.py", line 304, in list_maintenance_services
Jan 13 05:00:04 NY2SRV0014 celery[29157]:     return self._call(requests.get, 'maintenance', clean=True)['services']
Jan 13 05:00:04 NY2SRV0014 celery[29157]:   File "/opt/OpenvStorage/ovs/extensions/plugins/asdmanager.py", line 95, in _call
Jan 13 05:00:04 NY2SRV0014 celery[29157]:     response = method(**kwargs)
Jan 13 05:00:04 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/api.py", line 67, in get
Jan 13 05:00:04 NY2SRV0014 celery[29157]:     return request('get', url, params=params, **kwargs)
Jan 13 05:00:04 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/api.py", line 53, in request
Jan 13 05:00:04 NY2SRV0014 celery[29157]:     return session.request(method=method, url=url, **kwargs)
Jan 13 05:00:04 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 468, in request
Jan 13 05:00:04 NY2SRV0014 celery[29157]:     resp = self.send(prep, **send_kwargs)
Jan 13 05:00:04 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/sessions.py", line 576, in send
Jan 13 05:00:04 NY2SRV0014 celery[29157]:     r = adapter.send(request, **kwargs)
Jan 13 05:00:04 NY2SRV0014 celery[29157]:   File "/usr/lib/python2.7/dist-packages/requests/adapters.py", line 437, in send
Jan 13 05:00:04 NY2SRV0014 celery[29157]:     raise ConnectionError(e, request=request)
Jan 13 05:00:04 NY2SRV0014 celery[29157]: ConnectionError: HTTPSConnectionPool(host='172.17.23.33', port=8500): Max retries exceeded with url: /maintenance (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f453c3987d0>: Failed to establish a new connection: [Errno 111] Connection refused',))
Jan 13 05:00:07 NY2SRV0014 celery[29157]: 2018-01-13 05:00:07 77800 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58731 - INFO - Generating service work log for ny2-ssdbackend01
Jan 13 05:00:07 NY2SRV0014 celery[29157]: 2018-01-13 05:00:07 78500 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58733 - INFO - Applying service work log for ny2-ssdbackend01
Jan 13 05:00:07 NY2SRV0014 celery[29157]: 2018-01-13 05:00:07 78500 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58734 - INFO - Finished service work log for ny2-ssdbackend01
Jan 13 05:00:07 NY2SRV0014 celery[29157]: 2018-01-13 05:00:07 78700 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58735 - INFO - Generating service work log for ny2-hddbackend03
Jan 13 05:00:07 NY2SRV0014 celery[29157]: 2018-01-13 05:00:07 79400 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58737 - WARNING - * Layout contains unknown node RAAF6YiDaEWlmKvoS3Q9m3CRdUG9Dr8k
Jan 13 05:00:07 NY2SRV0014 celery[29157]: 2018-01-13 05:00:07 79400 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58738 - INFO - Applying service work log for ny2-hddbackend03
Jan 13 05:00:07 NY2SRV0014 celery[29157]: 2018-01-13 05:00:07 79400 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58739 - INFO - Finished service work log for ny2-hddbackend03
Jan 13 05:00:07 NY2SRV0014 celery[29157]: 2018-01-13 05:00:07 79600 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58740 - INFO - Generating service work log for ny2-hddbackend02
Jan 13 05:00:07 NY2SRV0014 celery[29157]: 2018-01-13 05:00:07 80200 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58742 - WARNING - * Layout contains unknown node RAAF6YiDaEWlmKvoS3Q9m3CRdUG9Dr8k
Jan 13 05:00:07 NY2SRV0014 celery[29157]: 2018-01-13 05:00:07 80200 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58743 - INFO - Applying service work log for ny2-hddbackend02
Jan 13 05:00:07 NY2SRV0014 celery[29157]: 2018-01-13 05:00:07 80200 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58744 - INFO - Finished service work log for ny2-hddbackend02
Jan 13 05:00:07 NY2SRV0014 celery[29157]: 2018-01-13 05:00:07 80400 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58745 - INFO - Generating service work log for ny2-hddbackend01
Jan 13 05:00:07 NY2SRV0014 celery[29157]: 2018-01-13 05:00:07 81000 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58747 - INFO - Applying service work log for ny2-hddbackend01
Jan 13 05:00:07 NY2SRV0014 celery[29157]: 2018-01-13 05:00:07 81000 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/alba - 58748 - INFO - Finished service work log for ny2-hddbackend01
Jan 13 05:00:07 NY2SRV0014 celery[29157]: 2018-01-13 05:00:07 81100 -0500 - NY2SRV0014 - 25828/139935744538368 - lib/ensure single - 58749 - INFO - Ensure single CHAINED mode - ID 1515837600_shv8q5CzvC - Task alba.checkup_maintenance_agents finished successfully
Jan 13 05:00:07 NY2SRV0014 celery[29157]: 2018-01-13 05:00:07 87800 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.job - 58735 - INFO - Task alba.checkup_maintenance_agents[dd8e1c4d-54a4-46ed-90a2-71c144156679] succeeded in 7.69593919115s: None
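
The relevant lines in a run like this are easy to miss between the tracebacks. Below is a minimal sketch that pulls out the hosts that could not be reached and the node IDs reported as unknown. It relies only on the two message texts visible in the log above; the function name `summarize` is hypothetical, and the sample lines are shortened excerpts of the log, not complete entries.

```python
import re

# Message texts as they appear in the lib/alba log lines above.
UNREACHABLE = re.compile(r"Cannot fetch maintenance information for (\S+)")
UNKNOWN_NODE = re.compile(r"Layout contains unknown node (\S+)")


def summarize(lines):
    """Return (unreachable_hosts, unknown_node_ids), de-duplicated, in log order."""
    unreachable, unknown = [], []
    for line in lines:
        for pattern, bucket in ((UNREACHABLE, unreachable), (UNKNOWN_NODE, unknown)):
            match = pattern.search(line)
            if match and match.group(1) not in bucket:
                bucket.append(match.group(1))
    return unreachable, unknown


# Shortened excerpts of the 05:00 run above (prefixes trimmed for readability).
SAMPLE = [
    "lib/alba - 58719 - ERROR - * Cannot fetch maintenance information for 172.17.23.32",
    "lib/alba - 58720 - ERROR - * Cannot fetch maintenance information for 172.17.23.41",
    "lib/alba - 58737 - WARNING - * Layout contains unknown node RAAF6YiDaEWlmKvoS3Q9m3CRdUG9Dr8k",
    "lib/alba - 58742 - WARNING - * Layout contains unknown node RAAF6YiDaEWlmKvoS3Q9m3CRdUG9Dr8k",
]

if __name__ == "__main__":
    hosts, nodes = summarize(SAMPLE)
    print("unreachable:", hosts)
    print("unknown layout nodes:", nodes)
```

In practice the input would be the celery log itself, for example `summarize(open(path))` with whatever log path applies on the node.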

Before the errors appeared, the checkup runs looked normal, even though the agents_layout was not being respected.

Jan 13 03:00:00 NY2SRV0014 celery[29157]: 2018-01-13 03:00:00 18400 -0500 - NY2SRV0014 - 4009/139935744538368 - lib/ensure single - 56427 - INFO - Ensure single CHAINED mode - ID 1515830400_OYs6jfrM4h - New task alba.checkup_maintenance_agents with default params scheduled for execution
Jan 13 03:00:00 NY2SRV0014 celery[29157]: 2018-01-13 03:00:00 18600 -0500 - NY2SRV0014 - 4009/139935744538368 - lib/alba - 56428 - INFO - Loading maintenance information
Jan 13 03:00:00 NY2SRV0014 celery[29157]: 2018-01-13 03:00:00 19200 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.strategy - 56435 - INFO - Received task: get asd statistics[f65e7dfd-59ec-4746-b5f6-34ddf14c9134]
Jan 13 03:00:00 NY2SRV0014 celery[29157]: 2018-01-13 03:00:00 19300 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.autoscale - 56436 - INFO - Scaling up 1 processes.
Jan 13 03:00:00 NY2SRV0014 celery[29157]: 2018-01-13 03:00:00 20000 -0500 - NY2SRV0014 - 4010/139935744538368 - lib/ensure single - 56429 - INFO - Ensure single DEFAULT mode - ID 1515830400_193ERl1JF1 - Setting key ovs_ensure_single_statsmonkey.get_mds_loads
Jan 13 03:00:00 NY2SRV0014 celery[29157]: 2018-01-13 03:00:00 22400 -0500 - NY2SRV0014 - 4010/139935744538368 - lib/ensure single - 56430 - INFO - Ensure single DEFAULT mode - ID 1515830400_193ERl1JF1 - Task statsmonkey.get_mds_loads finished successfully
Jan 13 03:00:00 NY2SRV0014 celery[29157]: 2018-01-13 03:00:00 22500 -0500 - NY2SRV0014 - 4010/139935744538368 - lib/ensure single - 56431 - INFO - Ensure single DEFAULT mode - ID 1515830400_193ERl1JF1 - Deleting key ovs_ensure_single_statsmonkey.get_mds_loads
Jan 13 03:00:00 NY2SRV0014 celery[29157]: 2018-01-13 03:00:00 23000 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.strategy - 56437 - INFO - Received task: get asd statistics[d4df3516-95e3-4359-829c-ca3a2f94a426]
Jan 13 03:00:00 NY2SRV0014 celery[29157]: 2018-01-13 03:00:00 23100 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.autoscale - 56438 - INFO - Scaling up 1 processes.
Jan 13 03:00:00 NY2SRV0014 celery[29157]: 2018-01-13 03:00:00 26900 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.job - 56439 - INFO - Task statsmonkey.get_mds_loads[346efcc8-90ae-450f-b854-dfdf47310d7e] succeeded in 0.11239305418s: []
Jan 13 03:00:00 NY2SRV0014 celery[29157]: 2018-01-13 03:00:00 39900 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.job - 56440 - INFO - Task get asd statistics[e0946e8c-f83c-407f-a121-c5fa0db5e785] succeeded in 0.128930922132s: [{'fields': {'disk_usage': 403727929203.0, 'MultiGet2_low_max': 0.0325810909, 'GetDiskUsage_avg': 3.11421e-05,...
Jan 13 03:00:00 NY2SRV0014 celery[29157]: 2018-01-13 03:00:00 42000 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.job - 56441 - INFO - Task get asd statistics[d4df3516-95e3-4359-829c-ca3a2f94a426] succeeded in 0.14957818063s: [{'fields': {'disk_usage': 1181972018035.0, 'MultiGet2_low_max': 32.6431541443, 'GetDiskUsage_avg': 2.60286e-05,...
Jan 13 03:00:00 NY2SRV0014 celery[29157]: 2018-01-13 03:00:00 43300 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.job - 56442 - INFO - Task get asd statistics[f65e7dfd-59ec-4746-b5f6-34ddf14c9134] succeeded in 0.162970298901s: [{'fields': {'disk_usage': 108035437824.0, 'MultiGet2_low_max': 45.2979691029, 'GetDiskUsage_avg': 2.93947e-05,...
Jan 13 03:00:00 NY2SRV0014 celery[29157]: 2018-01-13 03:00:00 45800 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.job - 56443 - INFO - Task get asd statistics[838dc903-3db5-4dd7-bdf6-470930b82e92] succeeded in 0.186922571156s: [{'fields': {'PartialGets_histogram_1e+04': 1.0, 'disk_usage': 1906268525337.0, 'PartialGets_histogram_1': 1293.0,...
Jan 13 03:00:00 NY2SRV0014 celery[29157]: 2018-01-13 03:00:00 95200 -0500 - NY2SRV0014 - 2897/139935744538368 - lib/ensure single - 56416 - INFO - Ensure single DEFAULT mode - ID 1515830400_K6mCA1I1fO - Task statsmonkey.get_nsm_stats finished successfully
Jan 13 03:00:00 NY2SRV0014 celery[29157]: 2018-01-13 03:00:00 95300 -0500 - NY2SRV0014 - 2897/139935744538368 - lib/ensure single - 56417 - INFO - Ensure single DEFAULT mode - ID 1515830400_K6mCA1I1fO - Deleting key ovs_ensure_single_statsmonkey.get_nsm_stats
Jan 13 03:00:00 NY2SRV0014 celery[29157]: 2018-01-13 03:00:00 99000 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.job - 56444 - INFO - Task statsmonkey.get_nsm_stats[4b1789aa-69a7-4630-ae9d-075f6a1dcb2a] succeeded in 0.838836473878s: [{'fields': {'CleanupForNamespace_min': 0.0010738372802734375, 'UpdateObject3_avg': 0.0024721622467041016,...
Jan 13 03:00:01 NY2SRV0014 celery[29157]: 2018-01-13 03:00:01 74900 -0500 - NY2SRV0014 - 2895/139935744538368 - lib/ensure single - 56412 - INFO - Ensure single DEFAULT mode - ID 1515830400_TOOOT4azLJ - Task statsmonkey.get_asd_stats finished successfully
Jan 13 03:00:01 NY2SRV0014 celery[29157]: 2018-01-13 03:00:01 75000 -0500 - NY2SRV0014 - 2895/139935744538368 - lib/ensure single - 56413 - INFO - Ensure single DEFAULT mode - ID 1515830400_TOOOT4azLJ - Deleting key ovs_ensure_single_statsmonkey.get_asd_stats
Jan 13 03:00:01 NY2SRV0014 celery[29157]: 2018-01-13 03:00:01 83100 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.job - 56445 - INFO - Task statsmonkey.get_asd_stats[2ec82f45-d09a-44a8-a9d2-c790c88a1e4c] succeeded in 1.79584695678s: [{'fields': {'disk_usage': 403727929203.0, 'MultiGet2_low_max': 0.0325810909, 'GetDiskUsage_avg': 3.11421e-05,...
Jan 13 03:00:02 NY2SRV0014 celery[29157]: 2018-01-13 03:00:02 24500 -0500 - NY2SRV0014 - 4009/139935744538368 - celery/celery.redirected - 56431 - WARNING - 2018-01-13 03:00:02 24500 -0500 - NY2SRV0014 - 4009/139935744538368 - extensions/asdmanagerclient - 56430 - INFO - Request "list_maintenance_services" took 1.38 seconds (internal duration 1.37 seconds)
Jan 13 03:00:02 NY2SRV0014 celery[29157]: 2018-01-13 03:00:02 53000 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.strategy - 56446 - INFO - Received task: ovs.storagerouter.ping[0e834f7e-d7ef-4677-b58a-5e34bad68936]
Jan 13 03:00:02 NY2SRV0014 celery[29157]: 2018-01-13 03:00:02 55600 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.job - 56447 - INFO - Task ovs.storagerouter.ping[0e834f7e-d7ef-4677-b58a-5e34bad68936] succeeded in 0.0250106919557s: None
Jan 13 03:00:05 NY2SRV0014 celery[29157]: 2018-01-13 03:00:05 45000 -0500 - NY2SRV0014 - 4009/139935744538368 - lib/alba - 56439 - INFO - Generating service work log for ny2-ssdbackend01
Jan 13 03:00:05 NY2SRV0014 celery[29157]: 2018-01-13 03:00:05 45700 -0500 - NY2SRV0014 - 4009/139935744538368 - lib/alba - 56441 - INFO - Applying service work log for ny2-ssdbackend01
Jan 13 03:00:05 NY2SRV0014 celery[29157]: 2018-01-13 03:00:05 45700 -0500 - NY2SRV0014 - 4009/139935744538368 - lib/alba - 56442 - INFO - Finished service work log for ny2-ssdbackend01
Jan 13 03:00:05 NY2SRV0014 celery[29157]: 2018-01-13 03:00:05 45900 -0500 - NY2SRV0014 - 4009/139935744538368 - lib/alba - 56443 - INFO - Generating service work log for ny2-hddbackend03
Jan 13 03:00:05 NY2SRV0014 celery[29157]: 2018-01-13 03:00:05 46600 -0500 - NY2SRV0014 - 4009/139935744538368 - lib/alba - 56445 - INFO - Applying service work log for ny2-hddbackend03
Jan 13 03:00:05 NY2SRV0014 celery[29157]: 2018-01-13 03:00:05 46700 -0500 - NY2SRV0014 - 4009/139935744538368 - lib/alba - 56446 - INFO - Finished service work log for ny2-hddbackend03
Jan 13 03:00:05 NY2SRV0014 celery[29157]: 2018-01-13 03:00:05 46900 -0500 - NY2SRV0014 - 4009/139935744538368 - lib/alba - 56447 - INFO - Generating service work log for ny2-hddbackend02
Jan 13 03:00:05 NY2SRV0014 celery[29157]: 2018-01-13 03:00:05 47600 -0500 - NY2SRV0014 - 4009/139935744538368 - lib/alba - 56449 - INFO - Applying service work log for ny2-hddbackend02
Jan 13 03:00:05 NY2SRV0014 celery[29157]: 2018-01-13 03:00:05 47600 -0500 - NY2SRV0014 - 4009/139935744538368 - lib/alba - 56450 - INFO - Finished service work log for ny2-hddbackend02
Jan 13 03:00:05 NY2SRV0014 celery[29157]: 2018-01-13 03:00:05 47800 -0500 - NY2SRV0014 - 4009/139935744538368 - lib/alba - 56451 - INFO - Generating service work log for ny2-hddbackend01
Jan 13 03:00:05 NY2SRV0014 celery[29157]: 2018-01-13 03:00:05 48400 -0500 - NY2SRV0014 - 4009/139935744538368 - lib/alba - 56453 - INFO - Applying service work log for ny2-hddbackend01
Jan 13 03:00:05 NY2SRV0014 celery[29157]: 2018-01-13 03:00:05 48400 -0500 - NY2SRV0014 - 4009/139935744538368 - lib/alba - 56454 - INFO - Finished service work log for ny2-hddbackend01
Jan 13 03:00:05 NY2SRV0014 celery[29157]: 2018-01-13 03:00:05 48400 -0500 - NY2SRV0014 - 4009/139935744538368 - lib/ensure single - 56455 - INFO - Ensure single CHAINED mode - ID 1515830400_OYs6jfrM4h - Task alba.checkup_maintenance_agents finished successfully
Jan 13 03:00:05 NY2SRV0014 celery[29157]: 2018-01-13 03:00:05 50400 -0500 - NY2SRV0014 - 29157/139935744538368 - celery/celery.worker.job - 56448 - INFO - Task alba.checkup_maintenance_agents[e784a051-99c3-4a14-bcb0-a9141ee24aa3] succeeded in 5.35206198692s: None

In CheckMK we added an extra monitor so we can detect when one of the maintenance agents goes down.

@matthiasdeblock

matthiasdeblock commented Mar 20, 2018

This morning, while the maintenance node (NY1SRV1000) was down, the maintenance agents were deployed on ASD VMs, which brought those VMs down because they ran out of memory.

Only one node is configured as a candidate:

root@NY1SRV0019:~# ovs config get ovs/alba/backends/f9599945-f1f5-44d3-8497-0c462ede4ef9/maintenance/agents_layout
["BYkbxixfsebWk78YJtUXpYwEYIg4Teex"]

@wimpers wimpers added this to the N milestone Sep 4, 2018
@JeffreyDevloo
Contributor

It might be that all ASD nodes were unresponsive when the checkup was invoked.
When none of the ASD nodes return their maintenance services, the layout is ignored.
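
As an illustration only, here is a minimal sketch of how a fallback of this kind could end up ignoring the layout, next to a stricter variant. This is not the code in `ovs/lib/alba.py`: the function name `pick_agent_nodes`, its parameters, and the fallback rule itself are assumptions made for the example.

```python
def pick_agent_nodes(layout, responsive_nodes, strict=False):
    """Choose node IDs that may host maintenance agents (hypothetical logic).

    layout           -- node IDs configured in agents_layout
    responsive_nodes -- node IDs that answered the maintenance query
    strict           -- if True, never place agents outside the layout
    """
    # Keep only the layout entries that are currently reachable.
    usable = [node_id for node_id in layout if node_id in responsive_nodes]
    if usable:
        return usable
    if strict:
        # Nothing in the layout is reachable: deploy nowhere rather than
        # falling back to nodes the operator never selected.
        return []
    # Assumed fallback: the layout is effectively ignored and every
    # responsive node becomes a candidate.
    return sorted(responsive_nodes)


if __name__ == "__main__":
    layout = ["maintenance-node"]        # placeholder IDs, not real node IDs
    responsive = {"asd-vm-1", "asd-vm-2"}
    print(pick_agent_nodes(layout, responsive))               # falls back to the ASD VMs
    print(pick_agent_nodes(layout, responsive, strict=True))  # stays empty
```

Under these assumptions, a single-node layout whose node is down would behave like the March 20 report above, while the strict variant would leave the agents undeployed until the configured node returns.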

@JeffreyDevloo JeffreyDevloo self-assigned this Sep 13, 2018