# `dvc commit` operation is too slow #10653
Comments
@jiangtann what is the reason for running `dvc commit`? Also, just to get a bit more context and potentially recommend other solutions: could you describe the workflow? Is it text inside those jsonl files? If they don't change, would it make sense to just keep them in the cloud in a directory and not move them into DVC at all?
> ... the reason for running

**workflow**

We are an LLM team, and multiple colleagues continuously collect new training datasets together (jsonl files; most exceed 10MiB). We use git + dvc (similar to git-lfs) to manage our training-dataset repository. In short, we need a version control system for large files.

If we pushed the jsonl files directly to s3 without any version control system, we would lose a lot of information. For example, we would not know who committed a specific dataset when we find errors in a jsonl file. We often need to modify a jsonl file (format check, clean, filter, etc.), and if a wrong modification occurs, we need to roll back.

In our workflow, dvc is an alternative to git-lfs. Because we use s3 as the backend, we chose dvc.

**structure of our repository**

Please refer to this:

**example of a dataset (jsonl file, each line is a training sample)**

```json
{"image": "batch_1/background_images/ee8bc699-e481-4738-80d2-aeade4776033.jpg", "conversations": [{"from": "human", "value": "<image>\nPlease write out the expression of the formula in the image using LaTeX format."}, {"from": "gpt", "value": "\\[\\lim_{g\\to9}\\frac{\\frac{d}{dg}\\left(e^{g}+-4\\tan{g}-9\\right)}{\\frac{d}{dg}\\left(g^{4}+2g^{7}+22g^{5}\\right)}\\]"}], "width": 1689, "height": 425}
{"image": "batch_1/background_images/436008a6-638f-4bd0-82cf-d696f383eebe.jpg", "conversations": [{"from": "human", "value": "<image>\nWrite the given equation in LaTeX code, ensuring no mistakes in notation or format."}, {"from": "gpt", "value": "\\[\\lim_{x\\to1^{+}}-71\\csc{x}\\]"}], "width": 941, "height": 265}
```
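The git + dvc workflow above boils down to content-addressed storage: each tracked file is identified by its md5 digest, and the cache stores the data under that digest while git versions only the small pointer files. A minimal sketch of the idea in plain shell (illustrative only; the file names and `cache/` layout are made up for the example, this is not DVC's actual implementation):

```shell
set -eu

# Illustration: store a file in a content-addressed cache keyed by its md5,
# the same basic idea behind DVC's .dvc/cache/files/md5 layout.
mkdir -p demo && cd demo
printf '{"image": "a.jpg", "width": 100}\n' > data.jsonl

hash=$(md5sum data.jsonl | awk '{print $1}')
# The first two hex chars become a subdirectory, the rest the file name.
prefix=$(printf '%s' "$hash" | cut -c1-2)
rest=$(printf '%s' "$hash" | cut -c3-)
mkdir -p "cache/$prefix"
cp data.jsonl "cache/$prefix/$rest"

echo "cached as: $prefix/$rest"
```

Because the cache key is the content hash, an unmodified file always maps to an entry that already exists, which is why a status check can be cheap: hash, look up, done.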
@jiangtann thanks a lot for the details. How many files are we talking about? Are you creating a separate `.dvc` file for each?
It should indeed be more or less the same (except it might still need to check if files changed when you run it without targets). I might be missing some details (need to check this). cc @skshetry any idea why it is doing checkout?
In our repository, we have about 2000 tracked jsonl files. I expect `dvc commit` to skip the files that are not modified.
Can you please provide profiling data? "Too slow" is not actionable.
I assume it is just a basic test of whether they changed or not? (I agree with @skshetry, seeing a profile would be great.)
@jiangtann, looking at the profiling, ~1300 files are being checked out. This could happen in two cases.

You might be able to check the link type with:

```
$ pip install "dvc-data[cli]"
$ dvc-data check-link path/to/file
```
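What a link-type check like this is getting at: when a workspace file is hardlinked to the cache, both paths share a single inode, while a plain copy gets its own inode (and its own data blocks). A small shell illustration of the inode comparison (GNU coreutils `stat` assumed; this mimics the idea, it is not `dvc-data` itself):

```shell
set -eu

tmp=$(mktemp -d)
cd "$tmp"

printf 'payload\n' > cache_file
ln cache_file linked_file    # hardlink: shares the inode, no data copied
cp cache_file copied_file    # plain copy: new inode, data duplicated

ino_cache=$(stat -c %i cache_file)
ino_link=$(stat -c %i linked_file)
ino_copy=$(stat -c %i copied_file)

if [ "$ino_cache" = "$ino_link" ]; then echo "linked_file: hardlink"; fi
if [ "$ino_cache" != "$ino_copy" ]; then echo "copied_file: copy"; fi
```

With copies, any operation that re-materializes workspace files has to move real data, which is one plausible reason a checkout over ~1300 files gets expensive.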
I used `dvc-data check-link`:

```
$ dvc-data check-link 202408/ai2d/xxx.jsonl
-rw-r--r-- xxx.jsonl unknown
```
```
$ dvc add 202408/ai2d/xxx.jsonl -v
2024-12-18 21:57:57,987 DEBUG: v3.55.2 (pip), CPython 3.9.19 on Linux-5.14.0-284.25.1.el9_2.x86_64-x86_64-with-glibc2.31
2024-12-18 21:57:57,988 DEBUG: command: /mnt/afs/jiangtan/software/miniconda3/bin/dvc add 202408/ai2d/xxx.jsonl -v
2024-12-18 21:58:08,813 DEBUG: Preparing to transfer data from 'memory://dvc-staging-md5/bce8dfab13f1e072dda203b8ccce7cf1f1f5b96f3fa7403ddae2e2c1bab7c5e5' to '/path/to/repo/.dvc/cache/files/md5'
2024-12-18 21:58:08,813 DEBUG: Preparing to collect status from '/path/to/repo/.dvc/cache/files/md5'
2024-12-18 21:58:08,813 DEBUG: Collecting status from '/path/to/repo/.dvc/cache/files/md5'
2024-12-18 21:58:08,822 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2024-12-18 21:58:08,823 DEBUG: Removing '/path/to/repo/202408/ai2d/.4-h0o6x_wn9M860FpHWZzw.tmp'
2024-12-18 21:58:08,836 DEBUG: Removing '/path/to/repo/202408/ai2d/.4-h0o6x_wn9M860FpHWZzw.tmp'
2024-12-18 21:58:08,837 DEBUG: Removing '/path/to/repo/.dvc/cache/files/md5/.EcqNcmTFSpu624mIaXVAUA.tmp'
2024-12-18 21:58:08,839 DEBUG: Removing '/path/to/repo/202408/ai2d/xxx.jsonl'
2024-12-18 21:58:08,927 DEBUG: Saving information to '202408/ai2d/xxx.jsonl.dvc'.
100% Adding...|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████|1/1 [00:00, 3.91file/s]
2024-12-18 21:58:08,938 DEBUG: Staging files: {'202408/ai2d/xxx.jsonl.dvc'}
2024-12-18 21:58:09,128 DEBUG: Analytics is enabled.
2024-12-18 21:58:09,615 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmpc6qa1h08', '-v']
2024-12-18 21:58:09,638 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmpc6qa1h08', '-v'] with pid 3980192

$ dvc-data check-link 202408/ai2d/xxx.jsonl
-rw-r--r-- xxx.jsonl unknown
```

As I claimed earlier, the link type is reported as `unknown`. Does using ...
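The debug log above shows `link type reflink is not available`, and `check-link` reports `unknown`, which suggests the workspace files are plain copies of the cache entries. One thing that may be worth trying (an assumption about this setup, not a confirmed fix for the issue) is to prefer faster link types in `.dvc/config`:

```ini
# .dvc/config -- try hardlink, then symlink, before falling back to a copy
[cache]
    type = hardlink,symlink
```

Note the trade-off: with hardlinks or symlinks, DVC protects workspace files (read-only), so an edit-heavy workflow like the one described above would need `dvc unprotect` or re-adding files after changes.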
@skshetry PTAL
## Bug Report

### Description

I use git and dvc to manage my training datasets, which consist of thousands of jsonl files.

After I modify several jsonl files, I run

```
dvc status && dvc commit
```

The `dvc status` operation completes quickly (I know dvc only hashes a file once, until it is modified; here only several jsonl files were modified, so `dvc status` costs little time). However, the `dvc commit` operation costs a lot of time.

While `dvc commit` is executing, I see lots of "Checking out xxx/xxx/xxx.jsonl" messages in the terminal, and I believe those jsonl files are not modified. Why does dvc need to check out files that are not modified?

### Expected

Assume two files `a.jsonl` and `b.jsonl` are modified; I think `dvc commit` should be equivalent to `dvc add a.jsonl b.jsonl`. However, it seems that `dvc commit` checks out all files tracked by dvc.

I expect the `dvc commit` operation to skip files that are not modified, so it can complete quickly.

### Environment information

Output of `dvc doctor`: