[BUG] rsync receive data from remote platform failed #434

Open
pxlxingliang opened this issue Jan 23, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@pxlxingliang

Bug summary

I use dpgen to submit a job that runs the fp step on the Sugon platform. The fp section of machine.json looks like this:

    "fp": [
        {
            "command": "OMP_NUM_THREADS=1 mpirun -np 4 $abacus | tee out.log",
            "machine": {
                "batch_type": "Slurm",
                "context_type": "SSHContext",
                "local_root": "./",
                "remote_root": "/public/home/abacus/tmp",
                "remote_profile": {
                    "key_filename": "sugon",
                    "hostname": "cancon.hpccube.com",
                    "username": "abacus",
                    "port": 65023
                }
            },
            "resources": {
                "batch_type": "Slurm",
                "number_node": 1,
                "cpu_per_node": 32,
                "group_size": 1,
                "queue_name": "kshdnormal",
                "custom_flags": [
                    "#SBATCH --gres=dcu:4"
                ],
                "source_list": [
                    "/public/home/abacus/run_dcu.sh"
                ]
            }
        }
    ]

The fp job is submitted to Sugon and ABACUS runs successfully, but dpgen reports the following error when it downloads the results:

2024-01-23 13:53:23,653 - ERROR : Failed to run ['rsync', '-az', '-e', 'ssh -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no -p 65023 -q -i sugon', '-q', 'abacus@cancon.hpccube.com:/public/home/abacus/tmp/695809f93a5474bde7743bddb46cbd857e2906c6/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz', '/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz']: b'rsync: chown "/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/.695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz.sKchjf" failed: Operation not permitted (1)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1677) [generator=3.1.3]\n'
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/submission.py", line 273, in try_download_result
    self.download_jobs()
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/submission.py", line 501, in download_jobs
    self.machine.context.download(self)
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 675, in download
    self._get_files(
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 905, in _get_files
    self.ssh_session.get(from_f, to_f)
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 376, in get
    return rsync(
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/utils.py", line 136, in rsync
    raise RuntimeError(f"Failed to run {cmd}: {err}")
RuntimeError: Failed to run ['rsync', '-az', '-e', 'ssh -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no -p 65023 -q -i sugon', '-q', 'abacus@cancon.hpccube.com:/public/home/abacus/tmp/695809f93a5474bde7743bddb46cbd857e2906c6/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz', '/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz']: b'rsync: chown "/personal/test/init_and_run1/Al.STRU.02x01x01/00.place_ele/.695809f93a5474bde7743bddb46cbd857e2906c6.tar.gz.sKchjf" failed: Operation not permitted (1)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1677) [generator=3.1.3]\n'
2024-01-23 13:53:23,655 - INFO : Retrying in 1 minute...

It seems that rsync tries to chown the downloaded file on the local side, and that call fails.

DP-GEN Version

0.11.1.dev51+gbea559b

Platform, Python Version, Remote Platform, etc

Platform: bohrium

Python: 3.8.8

Remote Platform: Sugon

Input Files, Running Commands, Error Log, etc.

dpgen.zip
An extra Sugon secret file named "sugon" is also needed.
command: dpgen init_bulk init.json machine.json

Steps to Reproduce

  1. Download the Sugon secret file and name it "sugon".
  2. Modify the fp section in machine.json.
  3. Submit the job: dpgen init_bulk init.json machine.json

Further Information, Files, and Links

No response

@pxlxingliang pxlxingliang added the bug Something isn't working label Jan 23, 2024
@njzjz
Member

njzjz commented Jan 24, 2024

This isn't related to the remote machine; it seems you don't have permission to chown on the local machine.

@njzjz
Member

njzjz commented Jan 24, 2024

Could you try adding the --no-perms flag to rsync?

@pxlxingliang pxlxingliang changed the title [BUG] [BUG] rsync receive data from remote platform failed Jan 25, 2024
@pxlxingliang
Author

> Could you try to add --no-perms flag to rsync?

I tried adding this flag, but it did not work:

^CTraceback (most recent call last):
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/submission.py", line 273, in try_download_result
    self.download_jobs()
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/submission.py", line 501, in download_jobs
    self.machine.context.download(self)
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 675, in download
    self._get_files(
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 905, in _get_files
    self.ssh_session.get(from_f, to_f)
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/ssh_context.py", line 376, in get
    return rsync(
  File "/root/anaconda3/lib/python3.8/site-packages/dpdispatcher/utils.py", line 137, in rsync
    raise RuntimeError(f"Failed to run {cmd}: {err}")
RuntimeError: Failed to run ['rsync', '-az', '--no-perms', '-e', 'ssh -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no -p 65023 -q -i sugon', '-q', 'abacus@cancon.hpccube.com:/public/home/abacus/tmp/013b6a211b33560666b55f011a60f9771da63b60/013b6a211b33560666b55f011a60f9771da63b60.tar.gz', '/personal/test/init_and_run2/Al.STRU.02x01x01/00.place_ele/013b6a211b33560666b55f011a60f9771da63b60.tar.gz']: b'rsync: chown "/personal/test/init_and_run2/Al.STRU.02x01x01/00.place_ele/.013b6a211b33560666b55f011a60f9771da63b60.tar.gz.JIoelN" failed: Operation not permitted (1)\nrsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1677) [generator=3.1.3]\n'

This issue may be related to the permissions of the Bohrium "/personal" directory. When I run the same test under other paths, it works.
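
One way to check that suspicion without going through dpgen is to probe the directory directly. The snippet below is only a rough sketch: `can_chown` is a hypothetical helper (not part of dpdispatcher), it is Unix-only, and by default it merely re-applies the current uid/gid, which some restricted filesystems may still reject. When running as root, passing a different `uid`/`gid` would come closer to what rsync attempts when it preserves the remote owner.

```python
import os
import tempfile
from typing import Optional


def can_chown(directory: str, uid: Optional[int] = None, gid: Optional[int] = None) -> bool:
    """Return True if a scratch file created in `directory` can be chown'ed.

    Hypothetical diagnostic helper; defaults to the current uid/gid.
    """
    if uid is None:
        uid = os.getuid()
    if gid is None:
        gid = os.getgid()
    # Create a throwaway file so no real data is touched.
    fd, path = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        os.chown(path, uid, gid)
        return True
    except OSError:
        # Includes PermissionError ("Operation not permitted").
        return False
    finally:
        os.remove(path)
```

Calling `can_chown("/personal/test")` and `can_chown("/tmp")` and comparing the results should show whether the two locations behave differently.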

@njzjz
Member

njzjz commented Jan 25, 2024

Try --no-o. I guess --no-g may also be required. The relevant rsync options are explained below.

    -r, --recursive             recurse into directories
    -l, --links                 copy symlinks as symlinks
    -p, --perms                 preserve permissions
    -t, --times                 preserve modification times
    -o, --owner                 preserve owner (super-user only)
    -g, --group                 preserve group
    -D                          same as --devices --specials
        --devices               preserve device files (super-user only)
        --specials              preserve special files

-a is equivalent to -rltpgoD
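
As a rough sketch of how those flags could be wired into the command line, here is a small Python helper. `build_rsync_cmd` and its parameters are made up for illustration and are not the actual `rsync` function in dpdispatcher's utils.py; the host and paths in the usage note are placeholders. The negating flags are placed after `-az` because rsync applies options in order, so they have to follow `-a` to switch off the owner and group preservation it implies.

```python
import shlex
from typing import List, Optional


def build_rsync_cmd(
    src: str,
    dst: str,
    port: int = 22,
    key_filename: Optional[str] = None,
    preserve_owner: bool = False,
    preserve_group: bool = False,
) -> List[str]:
    """Build an rsync argument list (hypothetical helper for illustration)."""
    ssh_parts = ["ssh", "-o", "BatchMode=yes", "-p", str(port), "-q"]
    if key_filename is not None:
        ssh_parts += ["-i", key_filename]

    cmd = ["rsync", "-az"]
    # These must come after -a so they turn off the -o / -g it implies.
    if not preserve_owner:
        cmd.append("--no-o")
    if not preserve_group:
        cmd.append("--no-g")
    cmd += ["-e", shlex.join(ssh_parts), "-q", src, dst]
    return cmd
```

For example, `build_rsync_cmd("user@host:/remote/result.tar.gz", "/local/result.tar.gz", port=65023, key_filename="sugon")` returns a list that can be passed to `subprocess.run`. With owner and group preservation off, rsync should no longer need to chown the received file, although that has not been verified on "/personal".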

@njzjz njzjz transferred this issue from deepmodeling/dpgen Jan 27, 2024
@njzjz
Member

njzjz commented Jan 27, 2024

I've transferred the issue to dpdispatcher, as it's more relevant there.
