ref: data status #3812

Merged
merged 15 commits into from
Sep 2, 2022
24 changes: 24 additions & 0 deletions content/docs/command-reference/data/index.md
@@ -0,0 +1,24 @@
# data

Contains a command that shows changes in data tracked by DVC:
[status](/doc/command-reference/data/status).

## Synopsis

```usage
usage: dvc data [-h] [-q | -v] {status} ...

positional arguments:
  COMMAND
    status    Show changes in the data tracked by DVC in the workspace.
```

## Description

DVC discovers tracked data using the file paths and file hashes recorded in the
`.dvc` and `dvc.lock` files, and builds an index from them.

DVC uses this index, for example, to produce the `dvc data status` output by
comparing different versions of the index. It compares the recorded hashes, the
workspace, and the files actually present in the <abbr>cache</abbr> to see
whether anything has changed.
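
The index comparison described above can be pictured with a minimal Python sketch. This is only an illustration, not DVC's actual implementation: the `Index` type, the `diff_indexes` helper, and the short hash strings are all hypothetical stand-ins for the path and hash pairs recorded in `.dvc` and `dvc.lock` files.

```python
from typing import Dict, List

# Hypothetical index: maps a tracked path to the hash recorded for it.
Index = Dict[str, str]


def diff_indexes(old: Index, new: Index) -> Dict[str, List[str]]:
    """Compare two path -> hash mappings and group paths by kind of change."""
    changes: Dict[str, List[str]] = {"added": [], "modified": [], "deleted": []}
    for path in sorted(old.keys() | new.keys()):
        if path not in old:
            changes["added"].append(path)
        elif path not in new:
            changes["deleted"].append(path)
        elif old[path] != new[path]:
            changes["modified"].append(path)
    return changes


# Made-up hashes standing in for values recorded in `.dvc` / `dvc.lock` files.
head_index = {"data/data.xml": "a1", "model.pkl": "b2"}
workspace_index = {"data/data.xml": "a1", "data/features/foo": "c3"}

result = diff_indexes(head_index, workspace_index)
print(result)
```

Keeping each index as a plain mapping makes "comparing different versions of the index" a set operation over paths followed by a hash equality check.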
132 changes: 132 additions & 0 deletions content/docs/command-reference/data/status.md
@@ -0,0 +1,132 @@
# data status

Show changes in the data tracked by DVC in the workspace.

## Synopsis

```usage
usage: dvc data status [-h] [-q | -v]
                       [--granular] [--unchanged]
                       [--untracked-files [{no,all}]]
                       [--json]
```

Comment on lines +8 to +11

Contributor:
May be a better visual grouping (even if it doesn't match the help output) cc @dberenbaum

Suggested change

    - usage: dvc data status [-h] [-q | -v]
    -                        [--granular] [--unchanged]
    -                        [--untracked-files [{no,all}]]
    -                        [--json]
    + usage: dvc data status [-h] [-q | -v] [--json] [--granular]
    +                        [--unchanged] [--untracked-files [{no,all}]]

Member:
doesn't look important to me

Contributor (jorgeorpinel, Sep 6, 2022):
It's probably secondary, yes. It's a quality-related best practice for the cmd ref/ usage blocks we started applying recently, see #3345.

## Description

The `data status` command displays the state of the working directory and the
changes with respect to the last Git commit (`HEAD`). It shows which changes
have been committed to DVC, which have not been committed yet, which files are
not tracked by DVC and Git, and which files are missing from the
<abbr>cache</abbr>.

The `dvc data status` command only outputs information; it does not modify
anything in your working directory. It's good practice to check the state of
your repository before running `git commit` so that you don't accidentally
commit something you don't mean to.

Example output might look like this:

```dvc
$ dvc data status
Not in cache:
  (use "dvc pull <file>..." to download files)
        data/data.xml

DVC committed changes:
  (git commit the corresponding dvc files to update the repo)
        modified: data/features/

DVC uncommitted changes:
  (use "dvc commit <file>..." to track changes)
        deleted: model.pkl
(there are other changes not tracked by dvc, use "git status" to see)
```

As shown above, `dvc data status` displays changes in multiple categories:

- _Not in cache_ indicates that the hashes of files are recorded in `dvc.lock`
  and `.dvc` files but the corresponding cache files are missing.
- _DVC committed changes_ indicates that there are changes that have been
  `dvc commit`-ed and differ from the last Git commit. Each entry may carry a
  more detailed state describing how the file changed: _added_, _modified_,
  _deleted_ or _unknown_.
- _DVC uncommitted changes_ indicates that there are changes in the working
  directory that are not `dvc commit`-ed yet. As with _DVC committed changes_,
  each entry may carry a more detailed state describing how the file changed.
- _Untracked files_ shows the files that are not being tracked by DVC and Git.
  These are not shown unless [`--untracked-files`](#--untracked-files) is
  specified.
- _DVC Unchanged files_ shows the files that have not changed. These are not
  shown unless [`--unchanged`](#--unchanged) is specified.

By default, `dvc data status` does not show individual changes inside tracked
directories. Use the [`--granular`](#--granular) option to show them.
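
To make the categories above concrete, here is a minimal Python sketch of how paths could be sorted into them. It is an assumption-laden illustration, not DVC's code: the `classify` and `change` helpers, the three `path -> hash` mappings (`head`, `committed`, `workspace`), the `cache` set, and the hash strings are all hypothetical.

```python
def change(old, new, path):
    """Return 'added', 'deleted', 'modified', or None if the path is unchanged."""
    before, after = old.get(path), new.get(path)
    if before == after:
        return None
    if before is None:
        return "added"
    if after is None:
        return "deleted"
    return "modified"


def classify(head, committed, workspace, cache):
    """Sort paths into status categories.

    head:      hashes recorded at the last Git commit (hypothetical input)
    committed: hashes currently recorded in `.dvc` / `dvc.lock` files
    workspace: hashes of the files actually in the working directory
    cache:     set of hashes present in the cache
    """
    report = {"not_in_cache": [], "committed": {}, "uncommitted": {}, "unchanged": []}
    for path in sorted(head.keys() | committed.keys() | workspace.keys()):
        recorded = committed.get(path)
        if recorded is not None and recorded not in cache:
            report["not_in_cache"].append(path)
        committed_state = change(head, committed, path)
        uncommitted_state = change(committed, workspace, path)
        if committed_state:
            report["committed"][path] = committed_state
        if uncommitted_state:
            report["uncommitted"][path] = uncommitted_state
        if not committed_state and not uncommitted_state:
            report["unchanged"].append(path)
    return report


# Made-up data loosely modelled on the example output above.
head = {"data/data.xml": "h1", "data/features/": "h2", "model.pkl": "h3"}
committed = {"data/data.xml": "h1", "data/features/": "h2-new", "model.pkl": "h3"}
workspace = {"data/data.xml": "h1", "data/features/": "h2-new"}
cache = {"h2-new", "h3"}

report = classify(head, committed, workspace, cache)
print(report)
```

In this sketch a path can appear in more than one category (for example, unchanged yet missing from the cache); whether DVC reports it that way is not something the text above specifies.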

## Options

- `--granular` - show granular, file-level information about the changes in
  DVC-tracked directories. By default, `dvc data status` does not show
  individual changes for files inside tracked directories.

- `--untracked-files` - show files that are not being tracked by DVC and Git.

Review comment (Contributor): So if it's tracked by EITHER DVC or Git it will not be included here?


- `--unchanged` - show unchanged DVC-tracked files.

- `--json` - print the command's output in an easily parsable JSON format
  instead of human-readable output.

- `-h`, `--help` - print the usage/help message and exit.

- `-q`, `--quiet` - do not write anything to standard output.

- `-v`, `--verbose` - display detailed tracing information.

## Examples

```dvc
$ dvc data status
Not in cache:
  (use "dvc pull <file>..." to download files)
        data/data.xml

DVC committed changes:
  (git commit the corresponding dvc files to update the repo)
        modified: data/features/

DVC uncommitted changes:
  (use "dvc commit <file>..." to track changes)
        deleted: model.pkl
(there are other changes not tracked by dvc, use "git status" to see)
```

This shows that `data/data.xml` is missing from the cache; that
`data/features/`, a directory, has changes that are tracked by DVC but not yet
committed to Git; and that the file `model.pkl` has been deleted from the
workspace. The `data/features/` directory is reported as modified, but there
are no further details about what changed inside it. The `--granular` option
can provide more information on that.

## Example: Granular output

Following on from the above example, using `--granular` will show file-level
information for the changes:

```dvc
$ dvc data status --granular
Not in cache:
  (use "dvc pull <file>..." to download files)
        data/data.xml

DVC committed changes:
  (git commit the corresponding dvc files to update the repo)
        added: data/features/foo

DVC uncommitted changes:
  (use "dvc commit <file>..." to track changes)
        deleted: model.pkl
(there are other changes not tracked by dvc, use "git status" to see)
```

Now there's more information in _DVC committed changes_ regarding the changes in
`data/features`. The output shows that a new file was added to `data/features`:
`data/features/foo`.
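
The difference between the default and granular views can be sketched in Python as a roll-up of file-level changes into their tracked directory. This is a hypothetical illustration only: the `summarize` function and its inputs are made up for the example, it ignores the committed/uncommitted distinction, and it does not reflect how DVC computes its output.

```python
def summarize(changes, tracked_dirs, granular=False):
    """Collapse file-level changes into directory-level entries unless granular.

    changes:      dict mapping a file path to its state ('added', 'deleted', ...)
    tracked_dirs: directory paths tracked as a whole (hypothetical input)
    """
    if granular:
        return dict(changes)
    summary = {}
    for path, state in changes.items():
        # Find the tracked directory containing this file, if any.
        parent = next(
            (d for d in tracked_dirs if path.startswith(d.rstrip("/") + "/")),
            None,
        )
        if parent is None:
            summary[path] = state          # standalone file: keep its own state
        else:
            summary[parent] = "modified"   # directory: report only that it changed
    return summary


file_changes = {"data/features/foo": "added", "model.pkl": "deleted"}
tracked_dirs = ["data/features/"]

default_view = summarize(file_changes, tracked_dirs)
granular_view = summarize(file_changes, tracked_dirs, granular=True)
print(default_view)
print(granular_view)
```

The default view reports `data/features/` as modified, while the granular view keeps the `data/features/foo` entry, mirroring the two example outputs above.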
10 changes: 10 additions & 0 deletions content/docs/sidebar.json
@@ -218,6 +218,16 @@
"label": "dag",
"slug": "dag"
},
{
"label": "data",
"slug": "data",
"children": [
{
"label": "data status",
"slug": "status"
}
]
},
{
"label": "destroy",
"slug": "destroy"