sctool backup on Azure fails with: "Error: create backup target: location is not accessible" "Failed to access location from node" #9606

yarongilor opened this issue Dec 23, 2024 · 13 comments

@yarongilor

Packages

Scylla version: 6.3.0~dev-20241222.200f0bb21926 with build-id e73b8c9942be689611d19d0acce447b553fba2dc

Kernel Version: 6.8.0-1018-azure

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Describe your issue in detail and steps it took to produce it.
The disrupt_mgmt_backup_specific_keyspaces nemesis completed successfully (on Azure):

< t:2024-12-23 01:07:02,308 f:nemesis.py      l:1962 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: >>>>>>>>>>>>>Started random_disrupt_method mgmt_backup_specific_keyspaces
< t:2024-12-23 01:07:02,308 f:nemesis.py      l:415  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: Target node pool type: NEMESIS_TARGET_POOLS.data_nodes
< t:2024-12-23 01:07:02,308 f:nemesis.py      l:1195 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: Set current_disruption -> MgmtBackupSpecificKeyspaces
< t:2024-12-23 01:07:02,308 f:nemesis.py      l:458  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Current Target: Node longevity-10gb-3h-master-db-node-3aaf1b83-eastus-1 [None | 10.0.0.5] with running nemesis: MgmtBackupSpecificKeyspaces
< t:2024-12-23 01:07:17,193 f:nemesis.py      l:5423 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: Start disruption at `2024-12-23 01:07:17.193090`
< t:2024-12-23 01:07:44,951 f:remote_base.py  l:560  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.0.0.13>: Running command "sudo sctool backup -c 48da10c3-0a69-4422-a319-ff00e3f19aa9 --keyspace keyspace1  --location azure:manager-backup-tests-us-east-1 "...
< t:2024-12-23 01:07:46,956 f:base.py         l:143  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.0.0.13>: Command "sudo sctool backup -c 48da10c3-0a69-4422-a319-ff00e3f19aa9 --keyspace keyspace1  --location azure:manager-backup-tests-us-east-1 " finished with status 0
< t:2024-12-23 01:08:46,869 f:nemesis.py      l:3151 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Task: backup/20adc7a1-e780-48cd-aab3-f162074dca1a is done.
< t:2024-12-23 01:08:46,870 f:nemesis.py      l:5483 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: MgmtBackupSpecificKeyspaces duration -> 89 s
< t:2024-12-23 01:08:46,871 f:nemesis.py      l:361  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: Update nemesis info with: {'start': 1734916037, 'end': 1734916126, 'duration': 89, 'node': 'Node longevity-10gb-3h-master-db-node-3aaf1b83-eastus-1 [None | 10.0.0.5]', 'subtype': 'end'}
< t:2024-12-23 01:08:47,442 f:nemesis.py      l:5498 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: log_info: {'start': 1734916037, 'end': 1734916126, 'duration': 89, 'node': 'Node longevity-10gb-3h-master-db-node-3aaf1b83-eastus-1 [None | 10.0.0.5]', 'subtype': 'end'}
< t:2024-12-23 01:09:02,368 f:nemesis.py      l:1966 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: <<<<<<<<<<<<<Finished random_disrupt_method mgmt_backup_specific_keyspaces

Then a subsequent disrupt_mgmt_backup nemesis failed with:

< t:2024-12-23 04:05:06,231 f:nemesis.py      l:1962 c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: >>>>>>>>>>>>>Started random_disrupt_method mgmt_backup
< t:2024-12-23 04:05:06,232 f:nemesis.py      l:415  c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: Target node pool type: NEMESIS_TARGET_POOLS.data_nodes
< t:2024-12-23 04:05:06,232 f:nemesis.py      l:1195 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: Set current_disruption -> MgmtBackup
< t:2024-12-23 04:05:06,232 f:nemesis.py      l:458  c:sdcm.nemesis         p:INFO  > sdcm.nemesis.SisyphusMonkey: Current Target: Node longevity-10gb-3h-master-db-node-3aaf1b83-eastus-2 [None | 10.0.0.6] with running nemesis: MgmtBackup
< t:2024-12-23 04:05:19,894 f:nemesis.py      l:5423 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: Start disruption at `2024-12-23 04:05:19.894019`
< t:2024-12-23 04:06:06,807 f:remote_base.py  l:560  c:RemoteLibSSH2CmdRunner p:DEBUG > <10.0.0.13>: Running command "sudo sctool backup -c 48da10c3-0a69-4422-a319-ff00e3f19aa9 --location azure:manager-backup-tests-us-east-1 "...
< t:2024-12-23 04:07:08,303 f:base.py         l:147  c:RemoteLibSSH2CmdRunner p:ERROR > <10.0.0.13>: Error executing command: "sudo sctool backup -c 48da10c3-0a69-4422-a319-ff00e3f19aa9 --location azure:manager-backup-tests-us-east-1 "; Exit status: 1
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > sdcm.nemesis.SisyphusMonkey: Unhandled exception in method <bound method Nemesis.disrupt_mgmt_backup of <sdcm.nemesis.SisyphusMonkey object at 0x7483a2be17e0>>
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > Traceback (most recent call last):
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/mgmt/cli.py", line 1205, in run
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >     res = self.manager_node.remoter.sudo(f"sctool {cmd}")
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/base.py", line 123, in sudo
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >     return self.run(cmd=cmd,
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 653, in run
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >     result = _run()
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 72, in inner
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >     return func(*args, **kwargs)
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 644, in _run
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >     return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 577, in _run_execute
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >     result = connection.run(**command_kwargs)
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 625, in run
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >     return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 660, in _complete_run
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >     raise UnexpectedExit(result)
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > Command: 'sudo sctool backup -c 48da10c3-0a69-4422-a319-ff00e3f19aa9 --location azure:manager-backup-tests-us-east-1 '
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > Exit code: 1
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > Stdout:
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > Stderr:
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > Error: create backup target: location is not accessible
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >  10.0.0.14: giving up after 2 attempts: after 30s: context deadline exceeded
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > Trace ID: iABZBwukS8OYHhOqtT6i_w (grep in scylla-manager logs)
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > The above exception was the direct cause of the following exception:
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > Traceback (most recent call last):
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5447, in wrapper
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >     result = method(*args[1:], **kwargs)
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3000, in disrupt_mgmt_backup
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >     self._mgmt_backup(backup_specific_tables=False)
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/group_common_events.py", line 531, in wrapper
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >     return decorated_func(*args, **kwargs)
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/group_common_events.py", line 516, in inner_func
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >     return func(*args, **kwargs)
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3145, in _mgmt_backup
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >     mgr_task = mgr_cluster.create_backup_task(location_list=[location, ])
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/mgmt/cli.py", line 671, in create_backup_task
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >     res = self.sctool.run(cmd=cmd, parse_table_res=False)
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/mgmt/cli.py", line 1208, in run
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >     raise ScyllaManagerError(f"Encountered an error on sctool command: {cmd}: {ex}") from ex
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > sdcm.mgmt.common.ScyllaManagerError: Encountered an error on sctool command: backup -c 48da10c3-0a69-4422-a319-ff00e3f19aa9 --location azure:manager-backup-tests-us-east-1 : Encountered a bad command exit code!
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > Command: 'sudo sctool backup -c 48da10c3-0a69-4422-a319-ff00e3f19aa9 --location azure:manager-backup-tests-us-east-1 '
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > Exit code: 1
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > Stdout:
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > Stderr:
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > Error: create backup target: location is not accessible
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR >  10.0.0.14: giving up after 2 attempts: after 30s: context deadline exceeded
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > Trace ID: iABZBwukS8OYHhOqtT6i_w (grep in scylla-manager logs)
< t:2024-12-23 04:07:08,311 f:nemesis.py      l:5471 c:sdcm.nemesis         p:ERROR > 

The Manager log had:

Dec 23 04:06:06 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:06:06.836Z","N":"backup","M":"Generating backup target","cluster_id":"48da10c3-0a69-4422-a319-ff00e3f19aa9","_trace_id":"iABZBwukS8OYHhOqtT6i_w"}
Dec 23 04:06:06 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:06:06.846Z","N":"cluster.client","M":"Checking hosts connectivity","hosts":["10.0.0.10","10.0.0.14","10.0.0.5","10.0.0.6","10.0.0.7","10.0.0.8"],"_trace_id":"iABZBwukS8OYHhOqtT6i_w"}
Dec 23 04:06:06 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:06:06.847Z","N":"cluster.client","M":"Host check OK","host":"10.0.0.6","_trace_id":"iABZBwukS8OYHhOqtT6i_w"}
Dec 23 04:06:06 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:06:06.847Z","N":"cluster.client","M":"Host check OK","host":"10.0.0.10","_trace_id":"iABZBwukS8OYHhOqtT6i_w"}
Dec 23 04:06:06 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:06:06.847Z","N":"cluster.client","M":"Host check OK","host":"10.0.0.5","_trace_id":"iABZBwukS8OYHhOqtT6i_w"}
Dec 23 04:06:06 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:06:06.847Z","N":"cluster.client","M":"Host check OK","host":"10.0.0.8","_trace_id":"iABZBwukS8OYHhOqtT6i_w"}
Dec 23 04:06:06 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:06:06.847Z","N":"cluster.client","M":"Host check OK","host":"10.0.0.7","_trace_id":"iABZBwukS8OYHhOqtT6i_w"}
Dec 23 04:06:06 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:06:06.847Z","N":"cluster.client","M":"Host check OK","host":"10.0.0.14","_trace_id":"iABZBwukS8OYHhOqtT6i_w"}
Dec 23 04:06:06 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:06:06.847Z","N":"cluster.client","M":"Done checking hosts connectivity","_trace_id":"iABZBwukS8OYHhOqtT6i_w"}
Dec 23 04:06:06 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:06:06.847Z","N":"backup","M":"Checking accessibility of remote locations","_trace_id":"iABZBwukS8OYHhOqtT6i_w"}
Dec 23 04:06:07 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:06:07.979Z","N":"backup","M":"Location check OK","host":"10.0.0.8","location":"azure:manager-backup-tests-us-east-1","_trace_id":"iABZBwukS8OYHhOqtT6i_w"}
Dec 23 04:06:07 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:06:07.985Z","N":"backup","M":"Location check OK","host":"10.0.0.10","location":"azure:manager-backup-tests-us-east-1","_trace_id":"iABZBwukS8OYHhOqtT6i_w"}
Dec 23 04:06:08 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:06:08.001Z","N":"backup","M":"Location check OK","host":"10.0.0.6","location":"azure:manager-backup-tests-us-east-1","_trace_id":"iABZBwukS8OYHhOqtT6i_w"}
Dec 23 04:06:08 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:06:08.091Z","N":"backup","M":"Location check OK","host":"10.0.0.7","location":"azure:manager-backup-tests-us-east-1","_trace_id":"iABZBwukS8OYHhOqtT6i_w"}
Dec 23 04:06:08 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:06:08.109Z","N":"backup","M":"Location check OK","host":"10.0.0.5","location":"azure:manager-backup-tests-us-east-1","_trace_id":"iABZBwukS8OYHhOqtT6i_w"}
Dec 23 04:06:20 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:06:20.314Z","N":"cluster","M":"Creating new Scylla HTTP client","cluster_id":"48da10c3-0a69-4422-a319-ff00e3f19aa9","_trace_id":"7zCvUF2fQLaN_1MSJ4wcmQ"}
Dec 23 04:06:20 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:06:20.322Z","N":"cluster.client","M":"Measuring datacenter latencies","dcs":["eastus"],"_trace_id":"7zCvUF2fQLaN_1MSJ4wcmQ"}
Dec 23 04:06:36 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:06:36.848Z","N":"cluster.client","M":"HTTP retry backoff","operation":"OperationsCheckPermissions","wait":"1s","error":"after 30s: context deadline exceeded","_trace_id":"iABZBwukS8OYHhOqtT6i_w"}
Dec 23 04:07:07 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:07:07.849Z","N":"backup","M":"Location check FAILED","host":"10.0.0.14","location":"azure:manager-backup-tests-us-east-1","error":"giving up after 2 attempts: after 30s: context deadline exceeded","_trace_id":"iABZBwukS8OYHhOqtT6i_w"}
Dec 23 04:07:07 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"ERROR","T":"2024-12-23T04:07:07.849Z","N":"backup","M":"Failed to access location from node","node":"10.0.0.14","location":"azure:manager-backup-tests-us-east-1","error":"10.0.0.14: giving up after 2 attempts: after 30s: context deadline exceeded","_trace_id":"iABZBwukS8OYHhOqtT6i_w","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/[email protected]/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/[email protected]/logger.go:84\ngithub.com/scylladb/scylla-manager/v3/pkg/service/backup.(*Service).checkLocationsAvailableFromNodes.func2\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/backup/service.go:253\ngithub.com/scylladb/scylla-manager/v3/pkg/util/parallel.Run.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/[email protected]/parallel/parallel.go:79"}
Dec 23 04:07:07 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:07:07.849Z","N":"backup","M":"Done checking accessibility of remote locations","_trace_id":"iABZBwukS8OYHhOqtT6i_w"}
Dec 23 04:07:07 longevity-10gb-3h-master-monitor-node-3aaf1b83-eastus-1 scylla-manager[18714]: {"L":"INFO","T":"2024-12-23T04:07:07.849Z","N":"http","M":"POST /api/v1/cluster/48da10c3-0a69-4422-a319-ff00e3f19aa9/tasks","from":"127.0.0.1:35562","status":400,"bytes":189,"duration":"61018ms","error":"create backup target: location is not accessible: 10.0.0.14: giving up after 2 attempts: after 30s: context deadline exceeded","_trace_id":"iABZBwukS8OYHhOqtT6i_w"}
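
For reference, these Manager entries were located by the trace ID reported in the sctool error. A minimal sketch, assuming the Manager runs as the standard scylla-manager systemd unit on the monitor node:

sudo journalctl -u scylla-manager --no-pager | grep iABZBwukS8OYHhOqtT6i_w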

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 6 nodes (Standard_L8s_v3)

Scylla Nodes used in this run:

  • longevity-10gb-3h-master-db-node-3aaf1b83-eastus-9 (null | 10.0.0.7) (shards: 7)
  • longevity-10gb-3h-master-db-node-3aaf1b83-eastus-8 (null | 10.0.0.9) (shards: 7)
  • longevity-10gb-3h-master-db-node-3aaf1b83-eastus-7 (null | 10.0.0.5) (shards: 7)
  • longevity-10gb-3h-master-db-node-3aaf1b83-eastus-6 (null | 10.0.0.10) (shards: 7)
  • longevity-10gb-3h-master-db-node-3aaf1b83-eastus-5 (null | 10.0.0.9) (shards: 7)
  • longevity-10gb-3h-master-db-node-3aaf1b83-eastus-4 (null | 10.0.0.8) (shards: 7)
  • longevity-10gb-3h-master-db-node-3aaf1b83-eastus-3 (null | 10.0.0.7) (shards: 7)
  • longevity-10gb-3h-master-db-node-3aaf1b83-eastus-2 (null | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-master-db-node-3aaf1b83-eastus-11 (null | 10.0.0.14) (shards: 7)
  • longevity-10gb-3h-master-db-node-3aaf1b83-eastus-10 (null | 10.0.0.14) (shards: -1)
  • longevity-10gb-3h-master-db-node-3aaf1b83-eastus-1 (null | 10.0.0.5) (shards: 7)

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-6.3.0-dev-x86_64-2024-12-23T02-09-38 (azure: undefined_region)

Test: longevity-10gb-3h-azure-test
Test id: 3aaf1b83-ec80-44d0-8296-43c7e7ed1072
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 3aaf1b83-ec80-44d0-8296-43c7e7ed1072
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 3aaf1b83-ec80-44d0-8296-43c7e7ed1072

Logs:

Jenkins job URL
Argus

@yarongilor

@kreuzerkrieg, @karol-kokoszka, can you please advise? Is this a testing issue or a Manager issue?

@yarongilor

Reproduced:

2025-01-15 14:43:25.910: (DisruptionEvent Severity.ERROR) period_type=end event_id=593252fa-f585-4ca4-a4a8-b8675c98f289 duration=1m50s: nemesis_name=MgmtBackup target_node=Node longevity-10gb-3h-master-db-node-428c95d2-eastus-6 [None | 10.0.0.10] errors=Encountered an error on sctool command: backup -c e7815463-0527-4920-9c3c-ea16748fa6e2 --location azure:manager-backup-tests-us-east-1 : Encountered a bad command exit code!
Command: 'sudo sctool backup -c e7815463-0527-4920-9c3c-ea16748fa6e2 --location azure:manager-backup-tests-us-east-1 '
Exit code: 1
Stdout:
Stderr:
Error: create backup target: location is not accessible
10.0.0.14: giving up after 2 attempts: after 30s: context deadline exceeded
Trace ID: 1CGzpogiTEygeZ0H8yDhkg (grep in scylla-manager logs)
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/mgmt/cli.py", line 1205, in run
res = self.manager_node.remoter.sudo(f"sctool {cmd}")
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/base.py", line 123, in sudo
return self.run(cmd=cmd,
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 653, in run
result = _run()
File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/decorators.py", line 72, in inner
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 644, in _run
return self._run_execute(cmd, timeout, ignore_status, verbose, new_session, watchers)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/remote_base.py", line 577, in _run_execute
result = connection.run(**command_kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 625, in run
return self._complete_run(channel, exception, timeout_reached, timeout, result, warn, stdout, stderr)
File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 660, in _complete_run
raise UnexpectedExit(result)
sdcm.remote.libssh2_client.exceptions.UnexpectedExit: Encountered a bad command exit code!
Command: 'sudo sctool backup -c e7815463-0527-4920-9c3c-ea16748fa6e2 --location azure:manager-backup-tests-us-east-1 '
Exit code: 1
Stdout:
Stderr:
Error: create backup target: location is not accessible
10.0.0.14: giving up after 2 attempts: after 30s: context deadline exceeded
Trace ID: 1CGzpogiTEygeZ0H8yDhkg (grep in scylla-manager logs)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5497, in wrapper
result = method(*args[1:], **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3017, in disrupt_mgmt_backup
self._mgmt_backup(backup_specific_tables=False)
File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/group_common_events.py", line 534, in wrapper
return decorated_func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/sct_events/group_common_events.py", line 519, in inner_func
return func(*args, **kwargs)
File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 3174, in _mgmt_backup
mgr_task = mgr_cluster.create_backup_task(location_list=[location, ])
File "/home/ubuntu/scylla-cluster-tests/sdcm/mgmt/cli.py", line 671, in create_backup_task
res = self.sctool.run(cmd=cmd, parse_table_res=False)
File "/home/ubuntu/scylla-cluster-tests/sdcm/mgmt/cli.py", line 1208, in run
raise ScyllaManagerError(f"Encountered an error on sctool command: {cmd}: {ex}") from ex
sdcm.mgmt.common.ScyllaManagerError: Encountered an error on sctool command: backup -c e7815463-0527-4920-9c3c-ea16748fa6e2 --location azure:manager-backup-tests-us-east-1 : Encountered a bad command exit code!
Command: 'sudo sctool backup -c e7815463-0527-4920-9c3c-ea16748fa6e2 --location azure:manager-backup-tests-us-east-1 '
Exit code: 1
Stdout:
Stderr:
Error: create backup target: location is not accessible
10.0.0.14: giving up after 2 attempts: after 30s: context deadline exceeded
Trace ID: 1CGzpogiTEygeZ0H8yDhkg (grep in scylla-manager logs)

Packages

Scylla version: 6.3.0~dev-20250114.2eac7a2d616f with build-id dacec1bcd5c23786306f1752ad6bf0e78ee3162c

Kernel Version: 6.8.0-1020-azure

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Describe your issue in detail and steps it took to produce it.

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 6 nodes (Standard_L8s_v3)

Scylla Nodes used in this run:

  • longevity-10gb-3h-master-db-node-428c95d2-eastus-9 (null | 10.0.0.8) (shards: 7)
  • longevity-10gb-3h-master-db-node-428c95d2-eastus-8 (null | 10.0.0.8) (shards: 7)
  • longevity-10gb-3h-master-db-node-428c95d2-eastus-7 (null | 10.0.0.7) (shards: 7)
  • longevity-10gb-3h-master-db-node-428c95d2-eastus-6 (null | 10.0.0.10) (shards: 7)
  • longevity-10gb-3h-master-db-node-428c95d2-eastus-5 (null | 10.0.0.9) (shards: 7)
  • longevity-10gb-3h-master-db-node-428c95d2-eastus-4 (null | 10.0.0.8) (shards: 7)
  • longevity-10gb-3h-master-db-node-428c95d2-eastus-3 (null | 10.0.0.7) (shards: 7)
  • longevity-10gb-3h-master-db-node-428c95d2-eastus-2 (null | 10.0.0.6) (shards: 7)
  • longevity-10gb-3h-master-db-node-428c95d2-eastus-11 (null | 10.0.0.14) (shards: 7)
  • longevity-10gb-3h-master-db-node-428c95d2-eastus-10 (null | 10.0.0.14) (shards: -1)
  • longevity-10gb-3h-master-db-node-428c95d2-eastus-1 (null | 10.0.0.5) (shards: 7)

OS / Image: /subscriptions/6c268694-47ab-43ab-b306-3c5514bc4112/resourceGroups/scylla-images/providers/Microsoft.Compute/images/scylla-6.3.0-dev-x86_64-2025-01-15T12-38-28 (NO RUNNER: NO RUNNER)

Test: longevity-10gb-3h-azure-test
Test id: 428c95d2-bb7b-4ce6-b271-b1fb0a9c2673
Test name: scylla-master/longevity/longevity-10gb-3h-azure-test
Test method: longevity_test.LongevityTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 428c95d2-bb7b-4ce6-b271-b1fb0a9c2673
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 428c95d2-bb7b-4ce6-b271-b1fb0a9c2673

Logs:

Jenkins job URL
Argus

@yarongilor yarongilor removed their assignment Jan 15, 2025
@yarongilor

@fruch, please let me know if there are any updates or anything I can do here.

@fruch commented Jan 15, 2025

@fruch, please let me know if there are any updates or anything I can do here.

You can investigate it further, check the underlying error in Manager, and verify whether this bucket exists in Azure storage.

Or wait for @mikliapko to attend to that.

@fruch commented Jan 15, 2025

What is the context here? Are these daily triggered jobs?

@pehala commented Jan 16, 2025

What is the context here? Are these daily triggered jobs?

Yes, these are Azure daily sanity tests.

@mikliapko

Or wait for @mikliapko to attend to that.

I'll try to take a look in the coming days.

@mikliapko

Analyzing the issue reproduction from 15-01-2025 (Argus).

From scylla-manager logs:

Jan 15 14:42:22 longevity-10gb-3h-master-monitor-node-428c95d2-eastus-1 scylla-manager[16943]: {"L":"INFO","T":"2025-01-15T14:42:22.630Z","N":"backup","M":"Checking accessibility of remote locations","_trace_id":"1CGzpogiTEygeZ0H8yDhkg"}
Jan 15 14:42:23 longevity-10gb-3h-master-monitor-node-428c95d2-eastus-1 scylla-manager[16943]: {"L":"INFO","T":"2025-01-15T14:42:23.800Z","N":"backup","M":"Location check OK","host":"10.0.0.6","location":"azure:manager-backup-tests-us-east-1","_trace_id":"1CGzpogiTEygeZ0H8yDhkg"}
Jan 15 14:42:23 longevity-10gb-3h-master-monitor-node-428c95d2-eastus-1 scylla-manager[16943]: {"L":"INFO","T":"2025-01-15T14:42:23.801Z","N":"backup","M":"Location check OK","host":"10.0.0.5","location":"azure:manager-backup-tests-us-east-1","_trace_id":"1CGzpogiTEygeZ0H8yDhkg"}
Jan 15 14:42:23 longevity-10gb-3h-master-monitor-node-428c95d2-eastus-1 scylla-manager[16943]: {"L":"INFO","T":"2025-01-15T14:42:23.810Z","N":"backup","M":"Location check OK","host":"10.0.0.10","location":"azure:manager-backup-tests-us-east-1","_trace_id":"1CGzpogiTEygeZ0H8yDhkg"}
Jan 15 14:42:23 longevity-10gb-3h-master-monitor-node-428c95d2-eastus-1 scylla-manager[16943]: {"L":"INFO","T":"2025-01-15T14:42:23.881Z","N":"backup","M":"Location check OK","host":"10.0.0.8","location":"azure:manager-backup-tests-us-east-1","_trace_id":"1CGzpogiTEygeZ0H8yDhkg"}
Jan 15 14:42:23 longevity-10gb-3h-master-monitor-node-428c95d2-eastus-1 scylla-manager[16943]: {"L":"INFO","T":"2025-01-15T14:42:23.891Z","N":"backup","M":"Location check OK","host":"10.0.0.7","location":"azure:manager-backup-tests-us-east-1","_trace_id":"1CGzpogiTEygeZ0H8yDhkg"}
Jan 15 14:42:52 longevity-10gb-3h-master-monitor-node-428c95d2-eastus-1 scylla-manager[16943]: {"L":"INFO","T":"2025-01-15T14:42:52.632Z","N":"cluster.client","M":"HTTP retry backoff","operation":"OperationsCheckPermissions","wait":"1s","error":"after 30s: context deadline exceeded","_trace_id":"1CGzpogiTEygeZ0H8yDhkg"}
Jan 15 14:43:06 longevity-10gb-3h-master-monitor-node-428c95d2-eastus-1 scylla-manager[16943]: {"L":"INFO","T":"2025-01-15T14:43:06.939Z","N":"cluster","M":"Creating new Scylla HTTP client","cluster_id":"e7815463-0527-4920-9c3c-ea16748fa6e2","_trace_id":"rVe-U-6zTcalV3sR79dgLg"}
Jan 15 14:43:06 longevity-10gb-3h-master-monitor-node-428c95d2-eastus-1 scylla-manager[16943]: {"L":"INFO","T":"2025-01-15T14:43:06.954Z","N":"cluster.client","M":"Measuring datacenter latencies","dcs":["eastus"],"_trace_id":"rVe-U-6zTcalV3sR79dgLg"}
Jan 15 14:43:23 longevity-10gb-3h-master-monitor-node-428c95d2-eastus-1 scylla-manager[16943]: {"L":"INFO","T":"2025-01-15T14:43:23.634Z","N":"backup","M":"Location check FAILED","host":"10.0.0.14","location":"azure:manager-backup-tests-us-east-1","error":"giving up after 2 attempts: after 30s: context deadline exceeded","_trace_id":"1CGzpogiTEygeZ0H8yDhkg"}
Jan 15 14:43:23 longevity-10gb-3h-master-monitor-node-428c95d2-eastus-1 scylla-manager[16943]: {"L":"ERROR","T":"2025-01-15T14:43:23.634Z","N":"backup","M":"Failed to access location from node","node":"10.0.0.14","location":"azure:manager-backup-tests-us-east-1","error":"10.0.0.14: giving up after 2 attempts: after 30s: context deadline exceeded","_trace_id":"1CGzpogiTEygeZ0H8yDhkg","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/[email protected]/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/[email protected]/logger.go:84\ngithub.com/scylladb/scylla-manager/v3/pkg/service/backup.(*Service).checkLocationsAvailableFromNodes.func2\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/backup/service.go:253\ngithub.com/scylladb/scylla-manager/v3/pkg/util/parallel.Run.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/[email protected]/parallel/parallel.go:79"}
Jan 15 14:43:23 longevity-10gb-3h-master-monitor-node-428c95d2-eastus-1 scylla-manager[16943]: {"L":"INFO","T":"2025-01-15T14:43:23.634Z","N":"backup","M":"Done checking accessibility of remote locations","_trace_id":"1CGzpogiTEygeZ0H8yDhkg"}
Jan 15 14:43:23 longevity-10gb-3h-master-monitor-node-428c95d2-eastus-1 scylla-manager[16943]: {"L":"INFO","T":"2025-01-15T14:43:23.634Z","N":"http","M":"POST /api/v1/cluster/e7815463-0527-4920-9c3c-ea16748fa6e2/tasks","from":"127.0.0.1:60258","status":400,"bytes":189,"duration":"61027ms","error":"create backup target: location is not accessible: 10.0.0.14: giving up after 2 attempts: after 30s: context deadline exceeded","_trace_id":"1CGzpogiTEygeZ0H8yDhkg"}

The location check fails for node 10.0.0.14.

At first I thought the failure might be related to the node having been added during the test, with scylla-manager-agent possibly configured in a wrong way, but in one of the latest runs (Argus) I see the same location check failure for a node that has been in the cluster from the very beginning.

Also, the issue is flaky and doesn't reproduce every time. For example, here (Argus), both disrupt_mgmt_backup_specific_keyspaces and disrupt_mgmt_backup tests passed.

Frankly speaking, I see no issues on the test side. It is probably something that should be addressed on the Manager side(?)
@Michal-Leszczynski Do you have any idea what might be wrong here?
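
One way to narrow this down further (a sketch only, assuming the documented scylla-manager-agent check-location subcommand is available on the node) would be to run the agent's own location check directly on 10.0.0.14 and compare with a healthy node:

sudo scylla-manager-agent check-location --debug --location azure:manager-backup-tests-us-east-1

If this times out only on 10.0.0.14 while succeeding elsewhere, the problem is on the node/agent side rather than in sctool or the test code.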

@mikliapko

Here, on the Logs tab, I see no logs for the longevity-10gb-3h-master-db-node-428c95d2-eastus-11 node - the one that failed to access the backup location.
I'm interested in the scylla-manager-agent.yaml file, just to make sure the failing node has the proper Manager agent configuration.
@fruch Why don't we collect them?

@vponomaryov

Here, on the Logs tab, I see no logs for the longevity-10gb-3h-master-db-node-428c95d2-eastus-11 node - the one that failed to access the backup location. I'm interested in the scylla-manager-agent.yaml file, just to make sure the failing node has the proper Manager agent configuration. @fruch Why don't we collect them?

Node-11, like any other node that is "alive" at the point in time when the collection happens, became part of the group archive "db_cluster...".
Per-node DB logs are only gathered right before a node is terminated/deleted in the middle of the test.

So everything is there; everything is by design.

@mikliapko

Node-11, like any other node that is "alive" at the point in time when the collection happens, became part of the group archive "db_cluster...". Per-node DB logs are only gathered right before a node is terminated/deleted in the middle of the test.

So everything is there; everything is by design.

Ah, I missed it, thanks for the explanation.

@Michal-Leszczynski

@mikliapko I don't see anything wrong in the agent logs. Could we re-run this test with debug level logging?
It can be enabled by adding the following to the scylla-manager-agent.yaml:

logger:
  level: debug
  sampling: null
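
(Presumably the agent then needs to be restarted on each node for the new log level to take effect, e.g. sudo systemctl restart scylla-manager-agent, assuming the agent runs as its usual systemd unit.)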

@mikliapko

@mikliapko I don't see anything wrong in the agent logs. Could we re-run this test with debug level logging? It can be enabled by adding the following to the scylla-manager-agent.yaml:

logger:
  level: debug
  sampling: null

Yes, we can.
I will need to make some adjustments to enable the debug log level, and it may take several runs to reproduce the described issue.
