Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

introduce new data:status command #7943

Merged
merged 4 commits into from
Jul 22, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion dvc/cli/command.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ def __init__(self, args):

os.chdir(args.cd)

self.repo = Repo(uninitialized=self.UNINITIALIZED)
self.repo: "Repo" = Repo(uninitialized=self.UNINITIALIZED)
self.config = self.repo.config
self.args = args

Expand Down
2 changes: 2 additions & 0 deletions dvc/cli/parser.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@
config,
daemon,
dag,
data,
data_sync,
destroy,
diff,
Expand Down Expand Up @@ -88,6 +89,7 @@
experiments,
check_ignore,
machine,
data,
]


Expand Down
178 changes: 178 additions & 0 deletions dvc/commands/data.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,178 @@
import argparse
import logging
from typing import TYPE_CHECKING

from funcy import compact, log_durations

from dvc.cli.command import CmdBase
from dvc.cli.utils import append_doc_link, fix_subparsers
from dvc.ui import ui

if TYPE_CHECKING:
from dvc.repo.data import Status as DataStatus


logger = logging.getLogger(__name__)


class CmdDataStatus(CmdBase):
COLORS = {
"not_in_cache": "red",
"committed": "green",
"uncommitted": "yellow",
"untracked": "cyan",
}
LABELS = {
"not_in_cache": "Not in cache",
"committed": "DVC committed changes",
"uncommitted": "DVC uncommitted changes",
"untracked": "Untracked files",
"unchanged": "DVC unchanged files",
skshetry marked this conversation as resolved.
Show resolved Hide resolved
}
HINTS = {
"not_in_cache": 'use "dvc pull <file>..." '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we used back ticks in these texts? E.g.

Suggested change
"not_in_cache": 'use "dvc pull <file>..." '
"not_in_cache": 'Use `dvc pull <file>` '

"to update your local storage",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • storage -> cache?
  • Missing sentence period .

Copy link
Member Author

@skshetry skshetry Sep 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have made changes to this in successive PRs. Now it says:

(use "dvc fetch <file>..." to download files)

It's a hint, so we don't need a period.

"committed": "git commit the corresponding dvc files "
"to update the repo",
Comment on lines +35 to +36
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"committed": "git commit the corresponding dvc files "
"to update the repo",
"committed": "`git commit` the corresponding DVC metafiles "
"to update the repo.",

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer quotes, it's easier on eyes. :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it's a convention already to code-inline this kind of things?

Copy link
Member Author

@skshetry skshetry Sep 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do use backticks on help messages, but I am not sure about the reasoning. The only other place within CLI (except help) is when we print commands in git add hints where we just indent those.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only rec. here is to aim for consistency if possible. Probably not major (up to you) but it's a product quality question.

"uncommitted": 'use "dvc commit <file>..." to track changes',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"uncommitted": 'use "dvc commit <file>..." to track changes',
"uncommitted": 'Use `dvc commit <file>` to track changes.',

"untracked": 'use "git add <file> ..." or '
'dvc add <file>..." to commit to git or to dvc',
Comment on lines +38 to +39
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"untracked": 'use "git add <file> ..." or '
'dvc add <file>..." to commit to git or to dvc',
"untracked": 'Use `git add <file>` or '
'`dvc add <file>` to track with Git or DVC.',

"git_dirty": "there are {}changes not tracked by dvc, "
'use "git status" to see',
Comment on lines +40 to +41
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"git_dirty": "there are {}changes not tracked by dvc, "
'use "git status" to see',
"git_dirty": "There are {}changes not tracked by DVC. "
'Use `git status` to see them.',

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a hint, no need to upper case it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why hints don't need capitalization or punctuation. Are they part of a more complete sentence?

}

@staticmethod
def _process_status(status: "DataStatus"):
"""Flatten stage status, and filter empty stage status contents."""
for stage, stage_status in status.items():
items = stage_status
if isinstance(stage_status, dict):
items = {
file: state
for state, files in stage_status.items()
for file in files
}
if not items:
continue
yield stage, items

@classmethod
def _show_status(cls, status: "DataStatus") -> int:
git_info = status.pop("git") # type: ignore[misc]
result = dict(cls._process_status(status))
if not result:
no_changes = "No changes"
if git_info.get("is_empty", False):
no_changes += " in an empty git repo"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
no_changes += " in an empty git repo"
no_changes += " in an empty Git repo"

ui.write(f"{no_changes}.")

for idx, (stage, stage_status) in enumerate(result.items()):
if idx:
ui.write()

label = cls.LABELS.get(stage, stage.capitalize() + " files")
header = f"{label}:"
color = cls.COLORS.get(stage, "normal")

ui.write(header)
if hint := cls.HINTS.get(stage):
ui.write(f" ({hint})")

if isinstance(stage_status, dict):
items = [
": ".join([state, file])
for file, state in stage_status.items()
]
else:
items = stage_status

for item in items:
ui.write(f"\t[{color}]{item}[/]".expandtabs(8), styled=True)

if (hint := cls.HINTS.get("git_dirty")) and git_info.get("is_dirty"):
message = hint.format("other " if result else "")
ui.write(f"[blue]({message})[/]", styled=True)
return 0

def run(self) -> int:
with log_durations(logger.trace, "in data_status"): # type: ignore
status = self.repo.data_status(
granular=self.args.granular,
untracked_files=self.args.untracked_files,
with_dirs=self.args.with_dirs,
)

if not self.args.unchanged:
status.pop("unchanged") # type: ignore[misc]
if self.args.untracked_files == "no":
status.pop("untracked")
if self.args.json:
status.pop("git") # type: ignore[misc]
ui.write_json(compact(status))
return 0
return self._show_status(status)


def add_parser(subparsers, parent_parser):
data_parser = subparsers.add_parser(
"data",
parents=[parent_parser],
formatter_class=argparse.RawDescriptionHelpFormatter,
)
data_subparsers = data_parser.add_subparsers(
dest="cmd",
help="Use `dvc data CMD --help` to display command-specific help.",
)
fix_subparsers(data_subparsers)

DATA_STATUS_HELP = (
"Show changes between the last git commit, "
"the dvcfiles and the workspace."
Comment on lines +128 to +130
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh and migrating discussion from iterative/dvc.org#3812 (review) cc @skshetry @shcheklein:

Suggested change
DATA_STATUS_HELP = (
"Show changes between the last git commit, "
"the dvcfiles and the workspace."
DATA_STATUS_HELP = (
"Show changes between the last Git commit, "
"DVC metafiles, and the workspace."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metafile is only docs-concept
this term has not been introduced in DVC (maybe for good reasons).

I think the only reason is that we don't have a system to ensure consistency in terminology between docs and help output but we probably should. It would improve the UX.

dvcfiles is also a weird term
we should... not introduce internal terms into help / docs.

And it sounds like .dvc files specifically. What about dvc.lock?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be further simplified though, to just:

Show changes in DVC-tracked data between the last Git commit and the workspace.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metafile is a confusing term. Tbh I don't even know what it is, I can guess that it's a file having metadata about something. Also, it's a logical concept, not a physical one like .dvc and dvc.yaml, so I find it to be an unnecessary redirection. Unlike docs, we can afford repetition which is not that many.

Here, I think we can just avoid mentioning the files at all.

Show changes in the data tracked by DVC in the workspace.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Show changes in DVC-tracked data between the last Git commit and the workspace.

yep, we can avoid metafiles, dvcfiles.


my 2cs - I find both terms suboptimal, but would probably prefer metafiles - since, yes they are technically metafiles - they contain a spec for data, yes metadata about data. At least it sounds reasonable to me. Plus we already use it in docs. May be we can avoid using this low lever terminology in help messages at all btw- that would the best option for end users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, I think we can just avoid mentioning the files at all.
we can avoid using this low lever terminology in help messages

➕➕

)
data_status_parser = data_subparsers.add_parser(
"status",
parents=[parent_parser],
description=append_doc_link(DATA_STATUS_HELP, "data/status"),
formatter_class=argparse.RawDescriptionHelpFormatter,
help=DATA_STATUS_HELP,
)
data_status_parser.add_argument(
"--json",
action="store_true",
default=False,
help="Show output in JSON format.",
)
data_status_parser.add_argument(
"--show-json",
action="store_true",
default=False,
dest="json",
help=argparse.SUPPRESS,
)
data_status_parser.add_argument(
"--granular",
action="store_true",
default=False,
help="Show granular file-level info for DVC-tracked directories.",
)
data_status_parser.add_argument(
"--unchanged",
action="store_true",
default=False,
help="Show unmodified DVC-tracked files.",
)
data_status_parser.add_argument(
"--untracked-files",
Comment on lines +158 to +165
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more...

Should the 2nd option be just --untracked? Unclear why -files is only in that name (plus it can be directories too).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, let's introduce this alternative now, and deprecate the previous option, drop it later

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Git uses --untracked-files, so that's a major reason why this is this way. Also, --untracked-files at the moment is recursive, so the files is more correct. There are some questions regarding it's behaviour: #8061, so I'd defer it until then.

Copy link
Contributor

@jorgeorpinel jorgeorpinel Sep 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK (up to you) but remember that Git's UI is not very good. We should not aim to copy it just because it's Git. Consistency is more important IMO. And "untracked" can refer to plural (files) so both are correct, I think.

choices=["no", "all"],
default="no",
const="all",
nargs="?",
help="Show untracked files.",
)
data_status_parser.add_argument(
"--with-dirs",
action="store_true",
default=False,
help=argparse.SUPPRESS,
)
data_status_parser.set_defaults(func=CmdDataStatus)
2 changes: 2 additions & 0 deletions dvc/repo/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,8 @@ class Repo:
from dvc.repo.status import status # type: ignore[misc]
from dvc.repo.update import update # type: ignore[misc]

from .data import status as data_status # type: ignore[misc]

ls = staticmethod(_ls)
get = staticmethod(_get)
get_url = staticmethod(_get_url)
Expand Down
Loading