status: granular output for directories #2180

Closed
MikkelAntonsen opened this issue Jun 24, 2019 · 13 comments
Labels
feature request Requesting a new feature p1-important Important, aka current backlog of things to do

Comments

@MikkelAntonsen

Imagine dvc tracking a directory full of images and files containing labels.

If I delete/change/update a label or image, dvc will tell me that something in the directory has changed, but not exactly what. It would be nice if dvc status could be made more granular, returning something like

data.dvc:
    outputs:
        data/foo: new
        data/bar: deleted

instead of

data.dvc:
	changed outs:
		modified:           data

as it is today.

Thanks:)

@efiop
Contributor

efiop commented Jun 24, 2019

@efiop efiop added feature request Requesting a new feature p2-medium Medium priority, should be done, but less important labels Jun 24, 2019
@efiop efiop changed the title Improving the dvc status output after updating a label status: granular output for directories Jun 24, 2019
@efiop
Contributor

efiop commented Jun 24, 2019

Entry points for anyone who wants to start looking at this:

  1. The status is currently created here: https://github.com/iterative/dvc/blob/0.41.3/dvc/output/base.py#L178
  2. We need to either make Output.changed_checksum() more granular, so that it reports the particular files in the directory that have changed, or add a new method that returns a status dict for a directory.

@shcheklein
Member

We should be careful with this: imagine a directory with 1M new files. We probably don't want to show all of them. I would say we should show a summary by default, or at least above some threshold (on the number of changed files per directory).
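A minimal sketch of that idea, not DVC code: the function name `summarize_changes`, the `threshold` parameter, and the status strings are all assumptions made for illustration. Small change sets are listed file by file; larger ones collapse into per-status counts.

```python
from collections import Counter


def summarize_changes(changes, threshold=5):
    """Return printable lines for a {path: status} mapping.

    Hypothetical helper: at or below `threshold` changed files, list each
    file; above it, show only a count per status.
    """
    if len(changes) <= threshold:
        return [f"{path}: {status}" for path, status in sorted(changes.items())]
    counts = Counter(changes.values())
    return [f"{status}: {count} files" for status, count in sorted(counts.items())]


if __name__ == "__main__":
    small = {"data/foo": "new", "data/bar": "deleted"}
    large = {f"data/img_{i}.png": "new" for i in range(1_000_000)}
    print("\n".join(summarize_changes(small)))
    print("\n".join(summarize_changes(large)))
```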

@jorgeorpinel
Contributor

jorgeorpinel commented Jul 11, 2020

  • Once this is implemented (or in parallel), I recommend we deprecate the -R/--recursive option, as dvc status seems to always be recursive (to a certain limit?) already. Just being able to give any dir as target would suffice (full context: status: support outputs as targets [qa] #4191).

@pared
Contributor

pared commented Jul 13, 2020

I agree with @shcheklein. We could probably set some small default threshold of files to show (like 5, to not pollute the terminal) and add an option to show all changes.
For the default dir status, I wouldn't display it as file -> status mapping, but rather status -> list of files with that status, just to save some space. With an explicit show-all-changes option, a file -> status mapping makes more sense to me, as one probably does not want to scroll through the terminal to find the status of a particular file.
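The status -> list of files layout could look roughly like the sketch below. This is an illustration only: `group_by_status`, the `limit` parameter, and the "... and N more" wording are assumptions, not anything DVC is confirmed to do.

```python
from collections import defaultdict


def group_by_status(changes, limit=5):
    """Invert a {path: status} mapping into status -> files output lines.

    Hypothetical helper: at most `limit` files are shown per status, and
    the rest are folded into a single "... and N more" line.
    """
    grouped = defaultdict(list)
    for path, status in sorted(changes.items()):
        grouped[status].append(path)

    lines = []
    for status in sorted(grouped):
        paths = grouped[status]
        lines.append(f"{status}:")
        lines.extend(f"    {path}" for path in paths[:limit])
        hidden = len(paths) - limit
        if hidden > 0:
            lines.append(f"    ... and {hidden} more")
    return lines


if __name__ == "__main__":
    changes = {f"data/img_{i}.png": "new" for i in range(8)}
    changes["data/labels.csv"] = "deleted"
    print("\n".join(group_by_status(changes)))
```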

@jorgeorpinel
Contributor

We could probably set some small default threshold of files to show (like 5

With pagination?

I wouldn't display it as file -> status mapping, but rather status -> list of files

Agree. I think we do this in other commands already like metrics/params show/diff? And other output

@pared
Contributor

pared commented Jul 13, 2020

With pagination?

I would refrain from that in the default case. By default, I think it's better to "show the status of the dir, but if it's small enough, go down to particular files" rather than make the user interact with it.

Agree. I think we do this in other commands already like metrics/params show/diff? And other output

Yes, we do. I was just referring to @MikkelAntonsen's example.

@lefos99

lefos99 commented Sep 3, 2020

What is the progress of this feature request? 😃

@efiop
Contributor

efiop commented Sep 3, 2020

@lefos99 Not implemented yet :( To implement it, we'll need to change

if self.changed_checksum():
to compare self.dir_cache with self.get_checksum().dir_info (both are just lists of simple dicts) and provide a granular status instead of the current generic one.
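As a rough sketch of that comparison, assuming each entry in both lists is a dict with "relpath" and "md5" keys (an assumption about the entry shape, not something stated in this thread), the two lists can be diffed by path. The function name `diff_dir_entries` and the status strings are hypothetical.

```python
def diff_dir_entries(old_entries, new_entries):
    """Compare two lists of directory entries and return {relpath: status}.

    Assumes every entry looks like {"relpath": ..., "md5": ...}; this
    shape is an assumption made for illustration.
    """
    old = {entry["relpath"]: entry["md5"] for entry in old_entries}
    new = {entry["relpath"]: entry["md5"] for entry in new_entries}

    status = {}
    for path in new.keys() - old.keys():
        status[path] = "new"
    for path in old.keys() - new.keys():
        status[path] = "deleted"
    for path in old.keys() & new.keys():
        # Same path on both sides: a differing checksum means a modification.
        if old[path] != new[path]:
            status[path] = "modified"
    return status


if __name__ == "__main__":
    cached = [
        {"relpath": "bar", "md5": "aaa"},
        {"relpath": "baz", "md5": "bbb"},
    ]
    workspace = [
        {"relpath": "baz", "md5": "ccc"},
        {"relpath": "foo", "md5": "ddd"},
    ]
    for path, state in sorted(diff_dir_entries(cached, workspace).items()):
        print(f"data/{path}: {state}")
```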

@lefos99

lefos99 commented Sep 4, 2020

@efiop Thanks for the update! 😃

@efiop
Contributor

efiop commented Oct 8, 2021

Update: we've been revisiting our data management this year (see dvc/objects) and the approach suggested above (with dir_info) is no longer relevant. We will be migrating status to the new objects in the upcoming weeks and will likely implement this along the way.

@dberenbaum dberenbaum added p1-important Important, aka current backlog of things to do and removed p2-medium Medium priority, should be done, but less important labels Jun 16, 2022
@dberenbaum
Collaborator

Closing as completed in #7943. Thanks @skshetry!

@daavoo
Contributor

daavoo commented Sep 8, 2022

For reference: https://dvc.org/doc/command-reference/data/status#example-granular-output
