
guide: external data updates #1735

Merged (9 commits) on Sep 1, 2020
23 changes: 11 additions & 12 deletions content/docs/command-reference/config.md
@@ -156,27 +156,26 @@ for more details.) This section contains the following options:
`dvc remote` for more information on "local remotes".) This will overwrite the
value provided to `dvc config cache.dir` or `dvc cache dir`.

- `cache.ssh` - name of an
[SSH remote to use as external cache](/doc/user-guide/managing-external-data#ssh).
Comment on lines 156 to -160
@jorgeorpinel (Contributor, Author) commented on Aug 31, 2020:

This file was just reordered to match the standard sorting of remote types (implemented in recent PRs) — but these cache.{type} options are directly related to external cache setup, and the docs link to each other.


> Avoid using the same remote location that you are using for `dvc push`,
> `dvc pull`, `dvc fetch` as external cache for your external outputs, because
> it may cause possible file hash overlaps: the hash of a data file in
> external storage could collide with a hash generated locally for another
> file with a different content.

- `cache.s3` - name of an
[Amazon S3 remote to use as external cache](/doc/user-guide/managing-external-data#amazon-s-3).

- `cache.azure` - name of a Microsoft Azure Blob Storage remote to use as
[external cache](/doc/user-guide/managing-external-data).

- `cache.gs` - name of a
[Google Cloud Storage remote to use as external cache](/doc/user-guide/managing-external-data#google-cloud-storage).

- `cache.ssh` - name of an SSH remote to use
[as external cache](/doc/user-guide/managing-external-data#ssh).

> Avoid using the same [DVC remote](/doc/command-reference/remote) (used for
> `dvc push`, `dvc pull`, etc.) as external cache, because it may cause file
> hash overlaps: the hash of an external <abbr>output</abbr> could collide
> with a hash generated locally for another file with different content.

- `cache.hdfs` - name of an
[HDFS remote to use as external cache](/doc/user-guide/managing-external-data#hdfs).

- `cache.azure` - name of a Microsoft Azure Blob Storage remote to use as
[external cache](/doc/user-guide/managing-external-data).
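
For instance, a minimal sketch of wiring up an S3 external cache (the `s3cache` remote name and `mybucket` bucket below are placeholders) could look like this:

```dvc
# Register an S3 location as a DVC remote (used only as cache here)
$ dvc remote add s3cache s3://mybucket/cache

# Point the S3 external cache option at that remote
$ dvc config cache.s3 s3cache
```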

### state

See
2 changes: 1 addition & 1 deletion content/docs/command-reference/run.md
@@ -99,7 +99,7 @@ Relevant notes:

- [external dependencies](/doc/user-guide/external-dependencies) and
[external outputs](/doc/user-guide/managing-external-data) (outside of the
<abbr>workspace</abbr>) are also supported.
<abbr>workspace</abbr>) are also supported (except metrics and plots).
@jorgeorpinel (Contributor, Author) commented:

This is part of an important note added to the external outputs doc.


- Outputs are deleted from the workspace before executing the command (including
at `dvc repro`) if their paths are found as existing files/directories. This
36 changes: 19 additions & 17 deletions content/docs/user-guide/external-dependencies.md
@@ -1,36 +1,38 @@
# External Dependencies

There are cases when data is so large, or its processing is organized in a way
that you would like to avoid moving it out of its external/remote location. For
example from a network attached storage (NAS) drive, processing data on HDFS,
such that you would like to avoid moving it out of its external/remote location.
For example from a network attached storage (NAS), processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or having a script that streams data
from S3 to process it. A mechanism for external dependencies and
[external outputs](/doc/user-guide/managing-external-data) provides a way for
DVC to control data externally.
from S3 to process it.

## Description
External <abbr>dependencies</abbr> and
[external outputs](/doc/user-guide/managing-external-data) provide ways to track
data outside of the <abbr>project</abbr>.

With DVC, you can specify external files as dependencies for your pipeline
## How it works

You can specify external files or directories as dependencies for your pipeline
stages. DVC will track changes in them and reflect this in the output of
`dvc status`.

Currently, the following types (protocols) of external dependencies are
supported:

- Local files and directories outside of your <abbr>workspace</abbr>
- SSH
- Amazon S3
- Microsoft Azure Blob Storage
- Google Cloud Storage
- SSH
- HDFS
- HTTP
- Local files and directories outside the <abbr>workspace</abbr>

> Note that these are a subset of the remote storage types supported by
> `dvc remote`.

In order to specify an external dependency for your stage, use the usual `-d`
option in `dvc run` with the external path or URL to your desired file or
directory.
In order to specify an external <abbr>dependency</abbr> for your stage, use the
usual `-d` option in `dvc run` with the external path or URL to your desired
file or directory.
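
For instance, a minimal sketch (the stage name, bucket, and command below are placeholders) might look like this:

```dvc
# Track changes in an S3 object as an external dependency of this stage
$ dvc run -n process_data \
          -d s3://mybucket/data.txt \
          -o data.txt \
          'aws s3 cp s3://mybucket/data.txt data.txt'
```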

## Examples

@@ -149,12 +151,12 @@ $ dvc import-url https://data.dvc.org/get-started/data.xml
Importing 'https://data.dvc.org/get-started/data.xml' -> 'data.xml'
```

The command above creates the <abbr>import stage</abbr> (DVC-file)
`data.xml.dvc`, that uses an external dependency (in this case an HTTPs URL).
The command above creates the import `.dvc` file `data.xml.dvc`, which contains
an external dependency (in this case an HTTPS URL).

<details>

### Expand to see resulting DVC-file
### Expand to see resulting `.dvc` file

```yaml
# ...
@@ -180,7 +182,7 @@ determine whether the source has changed and we need to download the file again.

`dvc import` can download a <abbr>data artifact</abbr> from any <abbr>DVC
project</abbr> or Git repository. It also creates an external dependency in its
<abbr>import stage</abbr> (DVC-file).
import `.dvc` file.

```dvc
$ dvc import git@github.com:iterative/example-get-started model.pkl
@@ -193,7 +195,7 @@ specified (with the `repo` field).

<details>

### Expand to see resulting DVC-file
### Expand to see resulting `.dvc` file

```yaml
# ...
122 changes: 66 additions & 56 deletions content/docs/user-guide/managing-external-data.md
@@ -1,83 +1,94 @@
# Managing External Data

There are cases when data is so large, or its processing is organized in a way
that you would like to avoid moving it out of its external/remote location. For
example from a network attached storage (NAS) drive, processing data on HDFS,
such that it's preferable to avoid moving it from its external/remote location.
For example, data on a network attached storage (NAS), processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or having a script that streams data
from S3 to process it. External outputs and
[external dependencies](/doc/user-guide/external-dependencies) provide a way for
DVC to control data outside of the <abbr>project</abbr> directory.
from S3 to process it.

## Description
External <abbr>outputs</abbr> and
[external dependencies](/doc/user-guide/external-dependencies) provide ways to
track data outside of the <abbr>project</abbr>.

DVC can track files on an external storage with `dvc add` or specify external
files as <abbr>outputs</abbr> for
[DVC-files](/doc/user-guide/dvc-files-and-directories) created by `dvc run`
(stage files). External outputs are considered part of the DVC project. DVC will
track changes in them and reflect this in the output of `dvc status`.
## How external outputs work

DVC can track existing files or directories on an external location with
`dvc add` (`out` field). It can also create external files or directories as
outputs for `dvc.yaml` files (only `outs` field, not metrics or plots).

External outputs are considered part of the (extended) DVC project: DVC will
track changes in them, and reflect this in `dvc status` reports, for example.

For cached external outputs (e.g. `dvc add`, `dvc run -o`), you will need to
[set up an external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
in the same external/remote file system first.

Currently, the following types (protocols) of external outputs (and
<abbr>cache</abbr>) are supported:

- Local files and directories outside of your <abbr>workspace</abbr>
- SSH
- Amazon S3
- Microsoft Azure Blob Storage
- Google Cloud Storage
- SSH
- HDFS
- Local files and directories outside the <abbr>workspace</abbr>

> Note that these are a subset of the remote storage types supported by
> `dvc remote`.

In order to specify an external output for a stage file, use the usual `-o` or
`-O` options of `dvc run`, but with the external path or URL to the file in
question. For <abbr>cached</abbr> external outputs (`-o`) you will need to
[setup an external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
in the same external/remote file system first.

> Avoid using the same location of the
> [remote storage](/doc/command-reference/remote) that you have for `dvc push`
> and `dvc pull` for external outputs or as external cache, because it may cause
> file hash overlaps: The hash value of a data file in external storage could
> collide with the one generated locally for another file.
> Avoid using the same [DVC remote](/doc/command-reference/remote) (used for
> `dvc push`, `dvc pull`, etc.) for external outputs, because it may cause file
> hash overlaps: the hash of an external output could collide with a hash
> generated locally for another file with different content.

## Examples

For the examples, let's take a look at a [stage](/doc/command-reference/run)
that simply moves local file to an external location, producing a `data.txt.dvc`
DVC-file.
For the examples, let's take a look at

1. Adding a `dvc remote` to use as cache for data in the external location, and
configuring it as external <abbr>cache</abbr> with `dvc config`.
2. Tracking existing data on an external location with `dvc add` (this doesn't
download it). This produces a `.dvc` file with an external output.
3. Creating a simple [stage](/doc/command-reference/run) that moves a local file
to the external location. This produces a stage with another external output
in `dvc.yaml`.

### Amazon S3

```dvc
# Add S3 remote to be used as cache location for S3 files
$ dvc remote add s3cache s3://mybucket/cache

# Tell DVC to use the 's3cache' remote as S3 cache location
$ dvc config cache.s3 s3cache

# Add data on S3 directly
$ dvc add --external s3://mybucket/mydata
$ dvc add --external s3://mybucket/existing-data

# Create the stage with an external S3 output
$ dvc run -d data.txt \
--external \
-o s3://mybucket/data.txt \
aws s3 cp data.txt s3://mybucket/data.txt
```

### Microsoft Azure Blob Storage

```dvc
$ dvc remote add azurecache azure://mycontainer/cache
$ dvc config cache.azure azurecache

$ dvc add --external azure://mycontainer/existing-data

$ dvc run -d data.txt \
--external \
-o azure://mycontainer/data.txt \
az storage blob upload -f data.txt -c mycontainer -n data.txt
```

### Google Cloud Storage

```dvc
# Add GS remote to be used as cache location for GS files
$ dvc remote add gscache gs://mybucket/cache

# Tell DVC to use the 'gscache' remote as GS cache location
$ dvc config cache.gs gscache

# Add data on GS directly
$ dvc add --external gs://mybucket/mydata
$ dvc add --external gs://mybucket/existing-data

# Create the stage with an external GS output
$ dvc run -d data.txt \
--external \
-o gs://mybucket/data.txt \
@@ -87,22 +98,22 @@ $ dvc run -d data.txt \
### SSH

```dvc
# Add SSH remote to be used as cache location for SSH files
$ dvc remote add sshcache ssh://user@example.com/cache

# Tell DVC to use the 'sshcache' remote as SSH cache location
$ dvc config cache.ssh sshcache

# Add data on SSH directly
$ dvc add --external ssh://user@example.com/mydata
$ dvc add --external ssh://user@example.com/existing-data

# Create the stage with an external SSH output
$ dvc run -d data.txt \
--external \
-o ssh://user@example.com/data.txt \
scp data.txt user@example.com:/data.txt
```

> Please note that to use password authentication, it's necessary to set the
> `password` or `ask_password` SSH remote options first (see
> `dvc remote modify`), and use a special `remote://` URL in step 2:
> `dvc add --external remote://sshcache/existing-data`.
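
A rough sketch of that setup (reusing the `sshcache` remote from the example above):

```dvc
# Prompt for the SSH password on each run (stored in local config only)
$ dvc remote modify --local sshcache ask_password true

# Track the existing data through the remote:// form of the URL
$ dvc add --external remote://sshcache/existing-data
```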

⚠️ DVC requires both SSH and SFTP access to work with remote SSH locations.
Please check that you are able to connect both ways with tools like `ssh` and
`sftp` (GNU/Linux).
@@ -112,16 +123,11 @@ Please check that you are able to connect both ways with tools like `ssh` and
### HDFS

```dvc
# Add HDFS remote to be used as cache location for HDFS files
A Member commented: should we put some summary of these comments above or after the code blocks?

@jorgeorpinel (Contributor, Author) replied: It's all under the ## Examples header now as a numbered list 🙂 Maybe we should make all these H3s into expandable details sections so that you don't have to scroll that much between the numbered list and the actual example?

$ dvc remote add hdfscache hdfs://user@example.com/cache

# Tell DVC to use the 'hdfscache' remote as HDFS cache location
$ dvc config cache.hdfs hdfscache

# Add data on HDFS directly
$ dvc add --external hdfs://user@example.com/mydata
$ dvc add --external hdfs://user@example.com/existing-data

# Create the stage with an external HDFS output
$ dvc run -d data.txt \
--external \
-o hdfs://user@example.com/data.txt \
@@ -135,14 +141,18 @@ it. So systems like Hadoop, Hive, and HBase are supported!

### Local file system path

The default cache location is `.dvc/cache`, so there is no need to move it for
local paths outside of your project.
The default <abbr>cache</abbr> is in `.dvc/cache`, so there is no need to set a
custom cache location for local paths outside of your project.

> Except for external data on different storage devices or partitions mounted on
> the same file system (e.g. `/mnt/raid/data`). In that case, please set up an
> external cache on that same drive to enable
> [file links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
> and avoid copying data.
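
A rough sketch of that case (assuming a hypothetical drive mounted at `/mnt/raid/`):

```dvc
# Add a local remote on the mounted drive and use it as the cache
$ dvc remote add raidcache /mnt/raid/cache
$ dvc config cache.local raidcache
```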

```dvc
# Add data on an external location directly
$ dvc add --external /home/shared/mydata
$ dvc add --external /home/shared/existing-data

# Create the stage with an external location output
$ dvc run -d data.txt \
--external \
-o /home/shared/data.txt \