How to sort out extension showing huge amount of "pending changes" #5861

Open
dibus2 opened this issue Dec 10, 2024 · 27 comments
Labels
performance question Further information is requested

Comments

@dibus2

dibus2 commented Dec 10, 2024

When starting the VS Code extension, it shows 110K pending files.
The application is extremely laggy, to the point where it makes using the EC2 machine impossible.
The DVC section in the Git tab shows nothing (I suspect because there are too many files).
The repo is set up as follows:

  • project_1/pipeline_1/dvc.yaml
  • project_1/pipeline_2/dvc.yaml
  • project_1/pipeline_3/dvc.yaml
  • ...
  • project_1/pipeline_13/dvc.yaml
  • project_2/...

I also tried to restrict the DVC extension to a single pipeline, but it doesn't seem to help. This is what I have in my settings.json:
"dvc.focusedProjects": [
"project_1/pipeline_1"]
Note that pipeline_1 folder contains the dvc.yaml file.

Image

Any help on sorting this out is appreciated.

Thanks.

@mattseddon
Member

mattseddon commented Dec 10, 2024

Have you tried running dvc pull before opening the extension? Is there a folder that should be gitignored that isn't?

@dibus2
Author

dibus2 commented Dec 10, 2024

Hi Matt,

thanks for looking at this. To be fair, I had not done a pull just before starting it, so I did one and got the number down to 108K. My git status shows nothing, so these are all statuses added by DVC.

You raise a good point, though: some of these pipelines are maintained only by certain teams, and ideally other teams shouldn't have to pull the data from all the other pipelines if they don't need it.

Cheers,

@dibus2
Author

dibus2 commented Dec 10, 2024

Here is the breakdown of the DVC section in the Git tab, actually. It wasn't starting earlier, but this time it managed to start. Image

I honestly don't understand why all these files are in the Uncommitted section. When I look at them, they are files for which there is a .dvc file in the repo but which I have not pulled, because they are not part of any pipeline and as such are not needed in the current commit of the repo.

@dibus2
Author

dibus2 commented Dec 10, 2024

Also, shouldn't these all be ignored, since the folder I specified through dvc.focusedProjects does not contain any of these .dvc files?

@mattseddon
Member

I would guess that there are duplicates between uncommitted and not in cache, as there is an unknown uncommitted status that things fall into when they are not in the cache (I forget the history on this). You can search the DVC issues if you want to know more.

For focused projects, the setting is described as: "A subset of paths to the workspace's available DVC projects. Using this option will override project auto-discovery." I am guessing that you've set this to a pipeline instead of a project, so it failed validation and the option was bypassed. Try using the "Select Project(s) to Focus (set dvc.focusedProjects)" quick pick to select a project to focus. This should give a list of valid options.
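
To see which paths could count as projects rather than pipelines, here is a minimal sketch. It assumes that a DVC project root is any directory containing a `.dvc` directory, which is consistent with the distinction drawn above but is not taken from the extension's source. The helper name `find_dvc_project_roots` is hypothetical.

```python
from pathlib import Path


def find_dvc_project_roots(workspace):
    """Return workspace-relative paths of directories that contain a `.dvc` directory.

    Assumption: such a directory is a DVC project root. Files like
    `data_set_1.dvc` are not matched, because the pattern is the exact name `.dvc`.
    """
    workspace = Path(workspace)
    roots = []
    for candidate in workspace.rglob(".dvc"):
        if candidate.is_dir():
            roots.append(candidate.parent.relative_to(workspace).as_posix())
    return sorted(roots)
```

In a repo with a single `.dvc` directory at the top level, this returns `["."]`, which would mean that a path such as `project_1/pipeline_1` is a pipeline folder and not a project.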

@mattseddon mattseddon self-assigned this Dec 10, 2024
@mattseddon mattseddon added question Further information is requested performance labels Dec 10, 2024
@dmpetrov
Member

@dibus2 thank you for reporting this!

@shcheklein
Member

@dibus2 thanks for creating the issue! a few questions:

  • where do you have .dvc directories for those projects? (just to better understand how things are isolated from each other)
  • 50K+ files - is it a single directory? what is the path in the tree to it? are those parquet files? is it input for the pipeline?

@mattseddon mattseddon removed their assignment Dec 11, 2024
@dibus2
Author

dibus2 commented Dec 11, 2024

@shcheklein

  • so each one of the pipeline_x folders contains a dvc.yaml file.
  • the files are spread across the different projects, each of which has its own data/ folder.

For instance in project_1/data/

  • data_set_1.dvc
  • data_set_2.dvc
  • data_set_3.dvc
    ...
  • data_set_10.dvc

Now, most of these are usually not pulled because they are not used anymore, or not yet. Maybe keeping the .dvc files around is a problem?
Content-wise, these are mostly JSON files; there are a few parquet files, but the vast majority are JSON.

@dibus2
Author

dibus2 commented Dec 11, 2024

I would guess that there are duplicates between uncommitted and not in cache, as there is an unknown uncommitted status that things fall into when they are not in the cache (I forget the history on this). You can search the DVC issues if you want to know more.

For focused projects, the setting is described as: "A subset of paths to the workspace's available DVC projects. Using this option will override project auto-discovery." I am guessing that you've set this to a pipeline instead of a project, so it failed validation and the option was bypassed. Try using the "Select Project(s) to Focus (set dvc.focusedProjects)" quick pick to select a project to focus. This should give a list of valid options.

I actually cannot find the quick pick to select the project, @mattseddon. Image

@shcheklein
Member

@dibus2 sorry, just to clarify: I specifically mean .dvc directories (not files). Do you have a single one in the root, or multiple? (I'm referring to the --subdir option.)

Now, most of these are usually not pulled because they are not used anymore, or not yet. Maybe keeping the .dvc files around is a problem?

yes, if you don't need them - better to drop them I guess (you can always recover them from the Git history - that's the beauty of DVC).

@dibus2
Author

dibus2 commented Dec 11, 2024

ah sorry, yes I have only one in the top folder @shcheklein

@shcheklein
Member

so, if the projects are independent, can we consider making them subprojects? I think that might help DVC, the extension, and Studio a lot (each DVC command won't be analyzing all the existing pipelines; it can "focus" on a single one at a time).

@dibus2
Author

dibus2 commented Dec 11, 2024

actually let me explore that option. How do I do that?

@shcheklein
Member

I think you can just try running dvc init --subdir in one of the subdirectories. Then you might need to migrate some .dvc/config options to the /.dvc/config. That should be enough to start, I think. And let's see if we hit any issues.
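
As a rough sketch of that first step, the helper below builds the `dvc init --subdir` command and, only when asked, runs it inside the chosen subdirectory. The function name `init_subproject` and the `dry_run` parameter are hypothetical, and the sketch assumes `dvc` is on the PATH. It does not attempt the config migration mentioned above.

```python
import subprocess
from pathlib import Path


def init_subproject(subdir, dry_run=True):
    """Build (and optionally run) `dvc init --subdir` inside `subdir`.

    With dry_run=True nothing is executed; the argv list is just returned
    so it can be inspected first.
    """
    argv = ["dvc", "init", "--subdir"]
    if not dry_run:
        # check=True raises CalledProcessError if dvc exits with an error.
        subprocess.run(argv, cwd=Path(subdir), check=True)
    return argv
```

Calling `init_subproject("project_1")` only returns the command; pass `dry_run=False` to actually initialize the subproject.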

@dibus2
Author

dibus2 commented Dec 13, 2024

Hi @shcheklein,
I looked into that, and I can't really do this because we are actually sharing common utils files and datasets across these different projects.

However, I did clean up the repo, removing all the .dvc files that were not needed in the current commit, and I got it down to only 14 pending files, which I think can be ignored. Even so, I still can't do anything with the extension.

It's getting stuck on something, I'm not sure what. None of the commands are responding.

Image Image I did notice that dvc data status takes about 40 seconds and that it seems to be run quite a bit Image and I wonder if it's just spinning its wheels on this?

@shcheklein
Member

I looked into that, and I can't really do this because we are actually sharing common utils files and datasets across these different projects.

I guess util files are fine (Python files?)

Datasets - it depends a bit. Duplicating them is usually not a problem in DVC (if the cache is shared and you use symlinks / reflinks / hardlinks, there will be no impact on space or anything).
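
To illustrate why linked copies cost no extra space, here is a small filesystem-level sketch using a hardlink. It shows the general mechanism only, not DVC's own implementation, and the function name `link_into_workspace` is hypothetical.

```python
import os


def link_into_workspace(cache_path, workspace_path):
    """Hardlink a cached file into a workspace and report what the filesystem sees.

    Both names point at the same inode, so the data is stored once.
    """
    os.link(cache_path, workspace_path)
    cache_stat = os.stat(cache_path)
    workspace_stat = os.stat(workspace_path)
    return {
        "same_inode": cache_stat.st_ino == workspace_stat.st_ino,
        "link_count": cache_stat.st_nlink,
    }
```

Hardlinks require both paths to be on the same filesystem, which matches the setup reported later in the thread where cache and workspace are on the same ext4 device.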

I see that plots diff also takes quite a long time.

I suspect that even collecting a full DAG is probably an expensive operation - could you run dvc status in the CLI? how much time does it take?

I see also that even git log takes 2 seconds to run - weird. Is it the same in the CLI?
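
A minimal way to collect these timings outside the extension is sketched below. The helper name `time_command` is hypothetical; it simply measures the wall-clock time of whatever command it is given.

```python
import subprocess
import time


def time_command(argv, cwd=None):
    """Run a command and return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    proc = subprocess.run(argv, cwd=cwd, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    return elapsed, proc.returncode


# Example calls, assuming dvc and git are installed and cwd is the repo root:
#   time_command(["dvc", "status"])
#   time_command(["dvc", "data", "status"])
#   time_command(["git", "log", "-n", "1"])
```

Running each command two or three times helps separate a one-off cold start from a cost that recurs on every invocation.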

@dibus2
Author

dibus2 commented Dec 13, 2024

so @shcheklein, regarding the time to run dvc status, see the screenshot: Image

Regarding git log, I'm not sure where you see it in the log; it doesn't take any time in the CLI.

Regarding plots, I noticed that we are actually tracking output folders in a lot of the pipelines, and that leads to not being able to show them through Show Plots in the extension (at least I assume so, because it says there are no plots to show).
Image
This is even though we have 100s of plots being tracked. Here is one example:

Image

@mattseddon
Member

I suspect that processing the data for these plots is blocking the extension host thread. We could add something like a focus pipeline option but that would be at least a couple of days of work.

@shcheklein
Member

@mattseddon do we run plots diff before we open the plots view?

Since data status takes 55s to run even in the CLI, I'm not sure that plots are the (only) issue.

It seems it just takes time to collect all the DVC files.

How many dvc.yaml files do we have? How many .dvc files? How many files overall in the tree?
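
One way to gather these three numbers is sketched below. The helper name `count_tree` is hypothetical, and skipping `.git` and `.dvc` directories is an assumption about what "files in the tree" should cover.

```python
import os


def count_tree(root):
    """Count dvc.yaml files, *.dvc files, and all files under `root`.

    The .git and .dvc directories are skipped so that Git objects and the
    DVC cache do not inflate the totals.
    """
    counts = {"dvc_yaml": 0, "dot_dvc_files": 0, "all_files": 0}
    for _dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into these directories.
        dirnames[:] = [d for d in dirnames if d not in {".git", ".dvc"}]
        for name in filenames:
            counts["all_files"] += 1
            if name == "dvc.yaml":
                counts["dvc_yaml"] += 1
            elif name.endswith(".dvc"):
                counts["dot_dvc_files"] += 1
    return counts
```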

@dibus2
Author

dibus2 commented Dec 18, 2024

@shcheklein

  • we have 14 dvc.yaml files at the moment.
  • we have 24 *.dvc files
  • I'm not sure about the last one, are you talking about all the files in that dvc project basically or just the tracked files? We have 34K files in the repo. Not sure about the other one.

@shcheklein
Member

@dibus2 thanks! can you run it (I mean dvc status) with the profiler, please? https://github.com/iterative/dvc/wiki/Debugging,-Profiling-and-Benchmarking-DVC - let's see what exactly takes so much time.

@dibus2
Author

dibus2 commented Dec 19, 2024

Sorry for the late reply @shcheklein .

Note that I had to zip the file, as uploading a .prof file wasn't allowed.

dump.prof.zip

@skshetry
Member

skshetry commented Dec 20, 2024

Image

41s (out of 45s runtime) is being spent on hashing ~22k files. Are those recently pulled or new files that dvc has not seen before? What was the output of dvc status?
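
For anyone who wants to read such a dump without a viewer, here is a minimal sketch using Python's standard `pstats` module. It assumes the file is a regular cProfile dump; the helper name `top_cumulative` is hypothetical.

```python
import pstats


def top_cumulative(prof_path, limit=10):
    """Return the `limit` entries with the highest cumulative time.

    Each entry is ("file:line(function)", cumulative_seconds).
    """
    stats = pstats.Stats(prof_path)
    rows = []
    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    for (filename, lineno, name), (_cc, _nc, _tt, cumtime, _callers) in stats.stats.items():
        rows.append((f"{filename}:{lineno}({name})", cumtime))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]
```

Sorting by cumulative time is what surfaces a finding like the one above, where a single call path accounts for most of the runtime.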

@dibus2
Author

dibus2 commented Dec 20, 2024

actually no, these files have been there forever, and every time we run a command (status, repro) it goes through them again. Isn't that the expected behavior?

@shcheklein
Member

no, it's not expected (it should not be hashing them again).

could you share the output of the dvc status in this case?

also, can you try to run dvc version and also share the output?
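
To show what "not hashing them again" can look like in general, here is a simplified sketch of a hash cache keyed by file metadata. It is an illustration under stated assumptions, not DVC's actual code: the class name `HashCache` is hypothetical, and using inode, modification time, and size as the signature is an assumption made for this example.

```python
import hashlib
import os


class HashCache:
    """Remember file hashes and reuse them while the file's metadata is unchanged."""

    def __init__(self):
        self._entries = {}   # path -> (signature, hex digest)
        self.hash_calls = 0  # how many times file contents were actually read

    @staticmethod
    def _signature(path):
        st = os.stat(path)
        return (st.st_ino, st.st_mtime_ns, st.st_size)

    def md5(self, path):
        signature = self._signature(path)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == signature:
            return cached[1]  # unchanged file: skip re-hashing
        self.hash_calls += 1
        digest = hashlib.md5()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        hexdigest = digest.hexdigest()
        self._entries[path] = (signature, hexdigest)
        return hexdigest
```

With a cache like this, a second status check over unchanged files reads only metadata, so tens of seconds spent hashing the same files on every run would point to the cache being missed or not persisted.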

@dibus2
Author

dibus2 commented Dec 21, 2024

Here is the dvc version output:
DVC version: 3.57.0 (pip)

Platform: Python 3.11.10 on Linux-5.15.0-1028-aws-x86_64-with-glibc2.35
Subprojects:
    dvc_data = 3.16.7
    dvc_objects = 5.1.0
    dvc_render = 1.0.2
    dvc_task = 0.40.2
    scmrepo = 3.3.9
Supports:
    http (aiohttp = 3.11.7, aiohttp-retry = 2.9.1),
    https (aiohttp = 3.11.7, aiohttp-retry = 2.9.1),
    s3 (s3fs = 2024.10.0, boto3 = 1.35.36)
Config:
    Global: /home/ubuntu/.config/dvc
    System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p1
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/nvme0n1p1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/83b435f16c2c22f52df4c119dbc42bb5

Attached is the output of dvc status. Note that I had to do a few pulls and pushes today, so it's a bit different from this morning; I re-ran the profiler and attached that as well. I'm not sure how informative the output of dvc status is, because none of the hashing logging is captured in it. Maybe you meant something else for me to send? Let me know.

Thanks!

f. Archive.zip

@shcheklein
Member

@skshetry can you take a look please
