
Dvc commit operation is too slow #10653

Open
jiangtann opened this issue Dec 17, 2024 · 10 comments
Labels
performance (improvement over resource / time consuming tasks), triage (Needs to be triaged)

Comments

@jiangtann

Bug Report

Description

I use git and dvc to manage my training datasets, which consist of thousands of jsonl files.

After I modify several jsonl files, I run dvc status && dvc commit. The dvc status operation completes quickly (I know dvc only re-hashes a file once it has been modified; here only a few jsonl files were modified, so dvc status costs little). However, dvc commit takes a long time.

While dvc commit is executing, I see lots of "Checking out xxx/xxx/xxx.jsonl" messages in the terminal, even though I believe those jsonl files have not been modified. Why does dvc need to check out files that are not modified?

Expected

Assuming two files a.jsonl and b.jsonl are modified, I think dvc commit should be equivalent to dvc add a.jsonl b.jsonl. However, it seems that dvc commit checks out all files tracked by dvc.

I expect the dvc commit operation to skip files that are not modified, so it can complete quickly.
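
The behavior I expect can be sketched like this (a toy illustration only, not DVC's actual code; the `.hash` sidecar files below stand in for the md5 values recorded in each `.dvc` file):

```shell
# Toy sketch: re-hash each tracked file and skip it when the hash still
# matches the recorded one; only changed files need to be "committed".
set -eu
work=$(mktemp -d)
printf 'sample-a\n' > "$work/a.jsonl"
printf 'sample-b\n' > "$work/b.jsonl"
# record hashes, as an earlier `dvc add` would have done
for f in "$work"/*.jsonl; do
  md5sum "$f" | awk '{print $1}' > "$f.hash"
done
# modify one file only
printf 'sample-b-edited\n' > "$work/b.jsonl"
# "commit": collect only the files whose hash no longer matches
changed=""
for f in "$work"/*.jsonl; do
  new=$(md5sum "$f" | awk '{print $1}')
  old=$(cat "$f.hash")
  if [ "$new" != "$old" ]; then
    changed="$changed $(basename "$f")"
  fi
done
changed=${changed# }
echo "changed: $changed"
```

With two files and one modification, only b.jsonl ends up in the changed set, which is the scope I would expect dvc commit to operate on.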

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.55.2 (pip)
-------------------------
Platform: Python 3.9.19 on Linux-5.14.0-284.25.1.el9_2.x86_64-x86_64-with-glibc2.31
Subprojects:
        dvc_data = 3.16.5
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.7
Supports:
        http (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2024.6.1, boto3 = 1.35.7)
Config:
        Global: /mnt/afs/jiangtan/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: fuse.quarkfs_client on quarkfs_client
Caches: local
Remotes: s3
Workspace directory: fuse.quarkfs_client on quarkfs_client
Repo: dvc, git
Repo.site_cache_dir: /path/to/repo/.dvc/site_cache_dir/repo/eddf3641719990517f0cfc808ea33376
@shcheklein
Member

@jiangtann what is the reason for running dvc commit in your case? Just to add a few more files, or something else? Were those jsonls added via dvc add initially?

Also, just to get a bit more context and potentially recommend other solutions - could you describe the workflow? Is it text inside those jsonls? I wonder if it makes sense to package them into tars, for example?

If they don't change, would it make sense to just keep them in the cloud in a directory and not move them into DVC at all?

@shcheklein added the performance, triage, and awaiting response labels on Dec 17, 2024
@jiangtann
Author

jiangtann commented Dec 17, 2024

workflow

We are an LLM team, and multiple colleagues continuously collect new training datasets together (jsonl files, most exceeding 10 MiB). We use git + dvc (similar to git-lfs) to manage our training dataset repository. In short, we need a version control system for large files, so each jsonl file is tracked by a .dvc file.

If we pushed jsonl files directly to s3 without any version control system, we would lose a lot of information. For example, we would not know who committed a specific dataset if we found errors in the jsonl file.

We often need to modify a jsonl file (format check, clean, filter, etc.). If a wrong modification occurs, we need to roll back.

In our workflow, dvc is an alternative to git-lfs. We chose dvc because we use s3 as the backend.

structure of our repository

Please refer to this:
#10648 (comment)

example of a dataset (jsonl file, each line is a training sample)

{"image": "batch_1/background_images/ee8bc699-e481-4738-80d2-aeade4776033.jpg", "conversations": [{"from": "human", "value": "<image>\nPlease write out the expression of the formula in the image using LaTeX format."}, {"from": "gpt", "value": "\\[\\lim_{g\\to9}\\frac{\\frac{d}{dg}\\left(e^{g}+-4\\tan{g}-9\\right)}{\\frac{d}{dg}\\left(g^{4}+2g^{7}+22g^{5}\\right)}\\]"}], "width": 1689, "height": 425}
{"image": "batch_1/background_images/436008a6-638f-4bd0-82cf-d696f383eebe.jpg", "conversations": [{"from": "human", "value": "<image>\nWrite the given equation in LaTeX code, ensuring no mistakes in notation or format."}, {"from": "gpt", "value": "\\[\\lim_{x\\to1^{+}}-71\\csc{x}\\]"}], "width": 941, "height": 265}
...

the reason for running dvc commit

In our training dataset repository, we have thousands of jsonl files that are already tracked by dvc. Sometimes we need to clean and filter specific samples across all datasets (jsonls) according to some regex pattern. But we do not know which dataset files were modified after filtering, unless we keep a log during the filtering step.

Previously, we used a shell script like the following to update the .dvc files after filtering (the actual script is more complex):

all_files=$(git ls-files --others | grep -E '^202[0-9]{3}/.*\.jsonl$')
echo "$all_files" | xargs dvc add
git add .
git commit -m "xxx"
git push
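
A variant of this script could derive commit targets from `dvc status` output instead of `git ls-files` (a sketch only: the status text below is canned, and the exact output format may vary across DVC versions):

```shell
# Sketch: extract the modified-file paths from (canned) `dvc status` output.
# In the real repo you would pipe `dvc status` itself through the same awk
# filter, e.g.:  dvc status | awk '/modified:/ {print $2}' | xargs dvc add
status_output='202408/ai2d/a.jsonl.dvc:
        changed outs:
                modified:           202408/ai2d/a.jsonl
202409/docvqa/b.jsonl.dvc:
        changed outs:
                modified:           202409/docvqa/b.jsonl'
targets=$(printf '%s\n' "$status_output" | awk '/modified:/ {print $2}' | tr '\n' ' ')
targets=${targets% }
echo "targets: $targets"
```

This keeps the per-file .dvc layout while limiting the expensive step to the files that actually changed.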

But recently I found that dvc commit can automatically record changes to files tracked by DVC, so I used dvc commit to replace the above shell script. However, I find that dvc commit does not skip the unmodified jsonl files and is as slow as the shell script. We don't want to change our current setup where one .dvc file tracks one jsonl file.

Can you explain the principle of dvc commit? Assuming two files a.jsonl and b.jsonl are modified, why is dvc commit not equivalent to dvc add a.jsonl b.jsonl (the former takes a lot of time while the latter finishes quickly)?

@shcheklein
Member

@jiangtann thanks a lot for the details. How many files are we talking about? Are you creating a separate .dvc for each of them?

Can you explain the principle of dvc commit. Assume two files a.jsonl and b.jsonl are modified, why dvc commit does not equal to dvc add a.jsonl b.jsonl (the former needs a lot of time but the latter finishes quickly)?

it should indeed be more or less the same (except it might still need to check if files changed when you run it without targets). I might be missing some details (need to check this). cc @skshetry any idea why it is doing a checkout?

@jiangtann
Author

I use git and dvc to manage my training datasets, which consist of thousands of jsonl files.

workflow

We are an LLM team, and multiple colleagues continuously collect new training datasets together (jsonl files, most exceeding 10 MiB). We use git + dvc (similar to git-lfs) to manage our training dataset repository. In short, we need a version control system for large files, so each jsonl file is tracked by a .dvc file.

In our repository, we have about 2000 jsonl files, and each jsonl file is tracked by a .dvc file. Due to the unavailability of reflink, we use copy as the cache.type.

I expect the dvc commit operation to skip files that are not modified, so it can complete quickly.

@skshetry
Member

The "Checking out" progress message is just top-level output; internally DVC may or may not actually check out a file if it has not changed.

dvc commit has to check all 2000 files, so you can also provide targets to the commit command to reduce the scope of the operation.

Can you please provide profiling data? "too slow" is not actionable.
See https://github.com/iterative/dvc/wiki/Debugging,-Profiling-and-Benchmarking-DVC#generating-cprofile-data.
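
A minimal sketch of generating and then inspecting cProfile data, along the lines the wiki page describes (a trivial Python script stands in here for the actual dvc command being profiled):

```shell
# Generate a .prof file with cProfile, then verify it loads with pstats.
# A real run would profile the dvc command instead of the toy work.py.
set -eu
d=$(mktemp -d)
printf 'sum(range(100000))\n' > "$d/work.py"
python3 -m cProfile -o "$d/demo.prof" "$d/work.py"
# load the dump with pstats and confirm it contains profiling entries
nonempty=$(python3 -c 'import pstats, sys; print(len(pstats.Stats(sys.argv[1]).stats) > 0)' "$d/demo.prof")
echo "profile has entries: $nonempty"
```

The resulting .prof file is what makes a "too slow" report actionable, since it shows where the time is actually spent.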

@shcheklein
Member

to check

I assume just a basic check of whether they changed or not?

(I agree with @skshetry - seeing profile would be great)

@jiangtann
Author

prof.zip

I would hope that when only 10 or 20 jsonl files have changed, the time of dvc commit equals the time of dvc status. But even though no files have changed, the time consumed by dvc commit is many times that of dvc status. Please refer to prof.zip. @skshetry

@skshetry
Member

skshetry commented Dec 18, 2024

@jiangtann, looking at the profiling, ~1300 files are being checked out.

This could happen in two cases:

  1. Either the files have changed.
  2. Or, the files in the workspace have not been properly linked to the cache.

If dvc status says unchanged, it likely means 2). Do you modify these files outside of dvc?

You might be able to check the link type with dvc-data check-link. If it says unknown, the file is not being linked properly.

$ pip install "dvc-data[cli]"
$ dvc-data check-link path/to/file

There is dvc commit --no-relink to avoid relinking.
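
For illustration only (this mirrors the idea, not dvc-data's actual implementation): link detection can be thought of as comparing inode numbers, and with cache.type=copy the workspace file has its own inode, so no link to the cache object is found:

```shell
# Sketch: a hardlink shares its inode with the cache object, a copy does not,
# which is why a copied workspace file can show up as "unknown" link state.
set -eu
d=$(mktemp -d)
printf 'data\n' > "$d/cache_obj"
ln "$d/cache_obj" "$d/linked.jsonl"   # hardlink: same inode as the cache file
cp "$d/cache_obj" "$d/copied.jsonl"   # copy: its own inode
ino() { stat -c %i "$1"; }
if [ "$(ino "$d/linked.jsonl")" = "$(ino "$d/cache_obj")" ]; then
  linked_state=hardlink
else
  linked_state=unknown
fi
if [ "$(ino "$d/copied.jsonl")" = "$(ino "$d/cache_obj")" ]; then
  copied_state=hardlink
else
  copied_state=unknown
fi
echo "linked=$linked_state copied=$copied_state"
```

(`stat -c %i` is GNU coreutils syntax, which matches the reporter's Linux environment.)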

@jiangtann
Author

jiangtann commented Dec 18, 2024

I also uploaded dvc_status.prof in prof.zip. You can see that dvc status says Data and pipelines are up to date. I ran dvc status && dvc commit, so I didn't modify these files between dvc status and dvc commit.

I used dvc-data check-link to check one of the jsonl files:

$ dvc-data check-link 202408/ai2d/xxx.jsonl
-rw-r--r-- xxx.jsonl unknown

$ dvc add 202408/ai2d/xxx.jsonl -v
2024-12-18 21:57:57,987 DEBUG: v3.55.2 (pip), CPython 3.9.19 on Linux-5.14.0-284.25.1.el9_2.x86_64-x86_64-with-glibc2.31
2024-12-18 21:57:57,988 DEBUG: command: /mnt/afs/jiangtan/software/miniconda3/bin/dvc add 202408/ai2d/xxx.jsonl -v
2024-12-18 21:58:08,813 DEBUG: Preparing to transfer data from 'memory://dvc-staging-md5/bce8dfab13f1e072dda203b8ccce7cf1f1f5b96f3fa7403ddae2e2c1bab7c5e5' to '/path/to/repo/.dvc/cache/files/md5'                                                                                                   
2024-12-18 21:58:08,813 DEBUG: Preparing to collect status from '/path/to/repo/.dvc/cache/files/md5'
2024-12-18 21:58:08,813 DEBUG: Collecting status from '/path/to/repo/.dvc/cache/files/md5'
2024-12-18 21:58:08,822 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2024-12-18 21:58:08,823 DEBUG: Removing '/path/to/repo/202408/ai2d/.4-h0o6x_wn9M860FpHWZzw.tmp'                                     
2024-12-18 21:58:08,836 DEBUG: Removing '/path/to/repo/202408/ai2d/.4-h0o6x_wn9M860FpHWZzw.tmp'
2024-12-18 21:58:08,837 DEBUG: Removing '/path/to/repo/.dvc/cache/files/md5/.EcqNcmTFSpu624mIaXVAUA.tmp'
2024-12-18 21:58:08,839 DEBUG: Removing '/path/to/repo/202408/ai2d/xxx.jsonl'
2024-12-18 21:58:08,927 DEBUG: Saving information to '202408/ai2d/xxx.jsonl.dvc'.                                                
100% Adding...|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|1/1 [00:00,  3.91file/s]
2024-12-18 21:58:08,938 DEBUG: Staging files: {'202408/ai2d/xxx.jsonl.dvc'}
2024-12-18 21:58:09,128 DEBUG: Analytics is enabled.
2024-12-18 21:58:09,615 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmpc6qa1h08', '-v']
2024-12-18 21:58:09,638 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmpc6qa1h08', '-v'] with pid 3980192

$ dvc-data check-link 202408/ai2d/xxx.jsonl
-rw-r--r-- xxx.jsonl unknown

As I claimed earlier:

In our repository, we have about 2000 jsonl files, and each jsonl file is tracked by a .dvc file. Due to the unavailability of reflink, we use copy as the cache.type.

Does using copy as cache.type cause dvc-data check-link to report unknown? @skshetry

@shcheklein removed the awaiting response label on Dec 22, 2024
@shcheklein
Member

@skshetry PTAL
