ref: data status #3812

Merged
merged 15 commits into from
Sep 2, 2022

Conversation

skshetry
Member

@skshetry skshetry commented Jul 26, 2022

I have tried to keep it simple.

I was not sure what to add in data/index.md, so I started introducing a few concepts around index/dvc-data. These will help data/status.md avoid repeating that we are comparing against file hashes in dvc.lock and .dvc files; it can just mention the term "index". I have not fully introduced that concept yet, though. :)

Closes #3732.

@skshetry skshetry self-assigned this Jul 26, 2022
@shcheklein shcheklein temporarily deployed to dvc-org-dvc-data-status-smqvo1 July 26, 2022 11:48 Inactive
@github-actions
Contributor

github-actions bot commented Jul 26, 2022

Link Check Report

There were no links to check!

@dberenbaum
Contributor

Looks good!

A couple high-level comments:

  1. Could some of the introductory text be more generic ("Show changes/status of data tracked by DVC") instead of immediately jumping into all of the "indexes" being compared? While that detail is important, to me it seems like it's too much of the focus here and is introduced too early and repeated too much. It makes it hard to get an overview of the command.
  2. Would it be useful to directly compare to git status? I think that a helpful way to frame this command and show how it might be useful over the existing dvc status is to make the analogy to git status and maybe even suggest using them together or show that in the examples as a way to get a holistic status of the repo.
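
As a rough illustration of the second point, the sketch below runs `git status` and `dvc data status` back to back to get a combined view of a repo. The helper names `status_commands` and `run_status_checks` are hypothetical; the DVC flags are the ones listed in the usage block under review in this PR, and `--short` is a standard `git status` option.

```python
import shutil
import subprocess


def status_commands(granular: bool = False, untracked: bool = False) -> list[list[str]]:
    """Build the two status commands (hypothetical helper).

    The DVC flags are taken from the usage block discussed in this PR.
    """
    dvc_cmd = ["dvc", "data", "status"]
    if granular:
        dvc_cmd.append("--granular")
    if untracked:
        dvc_cmd.append("--untracked-files")
    return [["git", "status", "--short"], dvc_cmd]


def run_status_checks(granular: bool = False, untracked: bool = False) -> None:
    """Run each command in turn, skipping tools that are not installed."""
    for cmd in status_commands(granular, untracked):
        if shutil.which(cmd[0]) is None:
            print(f"skipping {' '.join(cmd)}: {cmd[0]} not found on PATH")
            continue
        print(f"$ {' '.join(cmd)}")
        # check=False: still show the output if a command exits non-zero
        subprocess.run(cmd, check=False)
```

Calling `run_status_checks(granular=True)` from inside a repository would print the Git view first and the DVC view second.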

Member

@shcheklein shcheklein left a comment

Thanks, @skshetry, and thank you for keeping it simple first.

We'll need to do a few iterations here. I also have some product questions and would love your feedback on them.

@dberenbaum
Contributor

@skshetry When you are available, could you do another iteration on this?

@jorgeorpinel jorgeorpinel added A: docs Area: user documentation (gatsby-theme-iterative) C: ref Content of /doc/*-reference labels Aug 3, 2022
@shcheklein shcheklein temporarily deployed to dvc-org-dvc-data-status-smqvo1 September 1, 2022 11:59 Inactive
Member

@shcheklein shcheklein left a comment

Approving so that it's not blocked on me. There are still some minor discussions pending, such as removing the index page for now.

@skshetry
Member Author

skshetry commented Sep 1, 2022

I’ll remove index and push another version tomorrow. :)

@skshetry
Member Author

skshetry commented Sep 2, 2022

Looks like we do need the index.md file:

Gatsby.js development 404 page
There's not a page or function yet at /doc/command-reference/data

@shcheklein
Member

@skshetry pls check the Contributing section:

  {
    "label": "Contributing",
    "slug": "contributing",
    "source": false,
    "children": [
      {
        "label": "DVC Core Project",
        "slug": "core"
      },
      {
        "label": "Docs and Website",
        "slug": "docs"
      },
      {
        "label": "Writing Blog Posts",
        "slug": "blog"
      }
    ]
  },

@shcheklein shcheklein temporarily deployed to dvc-org-dvc-data-status-smqvo1 September 2, 2022 03:18 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-dvc-data-status-smqvo1 September 2, 2022 03:21 Inactive
@skshetry
Member Author

skshetry commented Sep 2, 2022

Thanks @shcheklein. I have deleted the index.

@shcheklein shcheklein temporarily deployed to dvc-org-dvc-data-status-smqvo1 September 2, 2022 03:25 Inactive
@skshetry
Member Author

skshetry commented Sep 2, 2022

All of the conversations are resolved, and the index file is now removed. I am merging this; if there’s anything, let me know. I am happy to work on top. And thank you for all the suggestions and help. 🙏🏼

@skshetry skshetry merged commit 62a5c97 into main Sep 2, 2022
@skshetry skshetry deleted the dvc-data-status branch September 2, 2022 07:57
@jorgeorpinel
Contributor

jorgeorpinel commented Sep 6, 2022

Hi. A couple post-merge questions (based on existing conversations).

should we just drop the index page for now to simplify this?

Since there's no index for this command's base, should we just list data status directly in the nav? No need for the parent data link (if we expect more subcommands we can fix it later)

On the other hand, not having an index.md breaks the existing pattern and is technically incomplete since you can dvc data (-h).

Would it be useful to directly compare to git status?
show how it might be useful over the existing dvc status

At least mentioning the difference with dvc status would be great, and link back and forth between both refs. And/or these things could be explained in the index.md page if we restore it, along with some other general motivation.

@shcheklein
Member

At least mentioning the difference with dvc status would be great, and link back and forth between both refs. And/or these things could be explained in the index.md page if we restore it, along with some other general motivation.

makes sense!

On the other hand, not having an index.md breaks the existing pattern and is technically incomplete since you can dvc data (-h).

let's keep it simple for now? that page is not useful at all ... and let's keep the hierarchy in place, also for simplicity and for consistency

Contributor

@jorgeorpinel jorgeorpinel left a comment

we should be able to merge something and follow up later if needed...
if there’s anything, let me know. I am happy to work on top.

Great work so far! Some other possible follow ups:

content/docs/command-reference/data/status.md
Comment on lines +8 to +11
usage: dvc data status [-h] [-q | -v]
                       [--granular] [--unchanged]
                       [--untracked-files [{no,all}]]
                       [--json]
Contributor

Maybe a better visual grouping (even if it doesn't match the help output)? cc @dberenbaum

Suggested change
- usage: dvc data status [-h] [-q | -v]
-                        [--granular] [--unchanged]
-                        [--untracked-files [{no,all}]]
-                        [--json]
+ usage: dvc data status [-h] [-q | -v] [--json] [--granular]
+                        [--unchanged] [--untracked-files [{no,all}]]
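
For readers unfamiliar with how such a usage string comes about, here is a minimal `argparse` sketch that declares the same options in the suggested order. It is not DVC's actual parser code: the `build_parser` name, the `count` actions, and the `const`/`default` values for `--untracked-files` are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the options from the usage block (illustrative only)."""
    parser = argparse.ArgumentParser(prog="dvc data status")

    # -q and -v cannot be combined, which argparse renders as [-q | -v]
    verbosity = parser.add_mutually_exclusive_group()
    verbosity.add_argument("-q", "--quiet", action="count", default=0)
    verbosity.add_argument("-v", "--verbose", action="count", default=0)

    parser.add_argument("--json", action="store_true")
    parser.add_argument("--granular", action="store_true")
    parser.add_argument("--unchanged", action="store_true")

    # nargs="?" makes the value optional, rendered as [{no,all}];
    # the const and default values here are assumed, not confirmed
    parser.add_argument(
        "--untracked-files",
        choices=["no", "all"],
        nargs="?",
        const="all",
        default="no",
    )
    return parser
```

`build_parser().format_usage()` returns the generated usage string. `argparse` lists options in declaration order and wraps lines on its own, so changing the grouping shown in the docs is an editorial choice rather than something the parser controls.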

Member

doesn't look important to me

Contributor

@jorgeorpinel jorgeorpinel Sep 6, 2022

It's probably secondary, yes. It's a quality-related best practice for the cmd ref/ usage blocks we started applying recently, see #3345.

Comment on lines +16 to +27
The `data status` command displays the state of the working directory and the
changes with respect to the last Git commit (`HEAD`). It shows you what new
changes have been committed to DVC, which haven't been committed, which files
aren't being tracked by DVC and Git, and what files are missing from the
<abbr>cache</abbr>.

The `dvc data status` command only outputs information, it won't modify or
change anything in your working directory. It's a good practice to check the
state of your repository before doing `dvc commit` or `git commit` so that you
don't accidentally commit something you don't mean to.

An example output might look something like follows:
Contributor

@jorgeorpinel jorgeorpinel Sep 6, 2022

Suggested change
- The `data status` command displays the state of the working directory and the
- changes with respect to the last Git commit (`HEAD`). It shows you what new
- changes have been committed to DVC, which haven't been committed, which files
- aren't being tracked by DVC and Git, and what files are missing from the
- <abbr>cache</abbr>.
- The `dvc data status` command only outputs information, it won't modify or
- change anything in your working directory. It's a good practice to check the
- state of your repository before doing `dvc commit` or `git commit` so that you
- don't accidentally commit something you don't mean to.
- An example output might look something like follows:
+ Displays the state of the <abbr>workspace</abbr> compared to the last Git commit
+ (`HEAD`). This includes committed and uncommitted additions, updates, and
+ deletions of DVC-tracked files. Checking the state of your tracked data is
+ useful to know what to `dvc add` (or `dvc commit`) and `git commit`. Example:

Member

wha -> what

it was fine before I think, don't see a reason to change this

Contributor

reason to change this

Minor but not cosmetic: Reduced from 3 paragraph Desc. intro to 1. (We want explanations in refs to stay short.) I also removed some sentences that aren't needed IMO. Applied some other existing patterns (e.g. don't mention the command name at the beginning so it's not repetitive later when you need it in other paragraphs).

Member

Applying existing patterns is fine, everything else feels very cosmetic still + changes the meaning / original intention (which is fine, but we don't have strong enough reason to spend time reviewing this to my mind in this case and debate one more time about intentions in the text, benefits, etc, etc). Please, unless it's super important let's not do this - it takes a lot of time to review it.

Contributor

Agreed. I'm making a point to spend time on these for now just to compile major existing practices (example).

content/docs/command-reference/data/status.md
DVC-tracked directories. By default, `dvc data status` does not show
individual changes for files inside the tracked directories.

- `--untracked-files` - show files that are not being tracked by DVC and Git.
Contributor

So if it's tracked by EITHER DVC or Git it will not be included here?
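
To make the question concrete, here is a small sketch of the reading it proposes: a file counts as untracked only if neither tool tracks it. The `untracked_files` helper and the sample paths are hypothetical, and this is not confirmed to be how DVC computes the list.

```python
def untracked_files(workspace, dvc_tracked, git_tracked):
    """Return paths tracked by neither DVC nor Git, sorted.

    Encodes one interpretation of the option text, not DVC's implementation.
    """
    return sorted(set(workspace) - set(dvc_tracked) - set(git_tracked))


# Hypothetical workspace: one DVC-tracked file, two Git-tracked, one in neither
files = ["data.csv", "data.csv.dvc", "train.py", "notes.txt"]
result = untracked_files(
    files,
    dvc_tracked=["data.csv"],
    git_tracked=["data.csv.dvc", "train.py"],
)
```

Under this reading, `result` contains only `notes.txt`; a file tracked by either tool is excluded.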

content/docs/command-reference/data/status.md
@shcheklein
Member

@jorgeorpinel could you please distill what is actually important here and doesn't require changes to the DVC repo and make a PR to review? a lot of changes, lots of them very cosmetic, and I'm not sure if there is a reason behind them ... reviewing them takes time

@jorgeorpinel
Contributor

keep it simple for now? that page is not useful at all

The cmd group index pages can be useful e.g. see https://dvc.org/doc/command-reference/params as a short version of the guide for people who just want to have enough context to use the feature. Not sure whether we need that for data though. Up to the DVC team!

jorgeorpinel added a commit that referenced this pull request Sep 6, 2022
jorgeorpinel added a commit that referenced this pull request Sep 6, 2022
jorgeorpinel added a commit that referenced this pull request Sep 6, 2022
@jorgeorpinel
Contributor

please distill what is actually important here

Started #3924 and resolved the things I included there.

doesn't require changes to the DVC repo

Nothing I added so far, but these string changes are typically pretty easy to propagate to the core repo in my experience.

jorgeorpinel added a commit that referenced this pull request Oct 18, 2022
* ref: update data status intro
per #3812 (comment)

* ref: update data status explanations
per #3812 (comment)

* ref: update data status --granular explanation
per #3812 (comment)

* ref: remove redundant data status example
per #3812 (comment)

* ref: roll back data status prefixes

* Restyled by prettier (#3925)

Co-authored-by: Restyled.io <[email protected]>

* ref: reinstate duplicated example in data status
per #3924 (comment)

* ref: roll back minor style change
per #3924 (review)

* ref: simplify term "tracked dirs"
per #3924 (review)

* ref: note unknown data status for --granular
rel #3924 (review)

* ref: avoid term "file records"
per #3924 (review)

* ref: restore whitespaces

Co-authored-by: restyled-io[bot] <32688539+restyled-io[bot]@users.noreply.github.com>
Co-authored-by: Restyled.io <[email protected]>
@jorgeorpinel jorgeorpinel changed the title cmd-ref: document data:status command ref: data status Oct 30, 2022
Labels
A: docs Area: user documentation (gatsby-theme-iterative) C: ref Content of /doc/*-reference
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

ref: dvc data status
6 participants