diff --git a/content/docs/command-reference/config.md b/content/docs/command-reference/config.md
index 411be369fb..43388f70f8 100644
--- a/content/docs/command-reference/config.md
+++ b/content/docs/command-reference/config.md
@@ -156,27 +156,26 @@ for more details.) This section contains the following options:
   `dvc remote` for more information on "local remotes".) This will overwrite the
   value provided to `dvc config cache.dir` or `dvc cache dir`.
 
-- `cache.ssh` - name of an
-  [SSH remote to use as external cache](/doc/user-guide/managing-external-data#ssh).
-
-  > Avoid using the same remote location that you are using for `dvc push`,
-  > `dvc pull`, `dvc fetch` as external cache for your external outputs, because
-  > it may cause possible file hash overlaps: the hash of a data file in
-  > external storage could collide with a hash generated locally for another
-  > file with a different content.
-
 - `cache.s3` - name of an
   [Amazon S3 remote to use as external cache](/doc/user-guide/managing-external-data#amazon-s-3).
 
+- `cache.azure` - name of a Microsoft Azure Blob Storage remote to use as
+  [external cache](/doc/user-guide/managing-external-data).
+
 - `cache.gs` - name of a
   [Google Cloud Storage remote to use as external cache](/doc/user-guide/managing-external-data#google-cloud-storage).
 
+- `cache.ssh` - name of an SSH remote to use
+  [as external cache](/doc/user-guide/managing-external-data#ssh).
+
+  > Avoid using the same [DVC remote](/doc/command-reference/remote) (used for
+  > `dvc push`, `dvc pull`, etc.) as external cache, because it may cause file
+  > hash overlaps: the hash of an external output could collide
+  > with a hash generated locally for another file with different content.
+
 - `cache.hdfs` - name of an
   [HDFS remote to use as external cache](/doc/user-guide/managing-external-data#hdfs).
 
-- `cache.azure` - name of a Microsoft Azure Blob Storage remote to use as
-  [external cache](/doc/user-guide/managing-external-data).
-
 ### state
 
 See
diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md
index 7746a9cfdf..1c75275c68 100644
--- a/content/docs/command-reference/run.md
+++ b/content/docs/command-reference/run.md
@@ -99,7 +99,7 @@ Relevant notes:
 
 - [external dependencies](/doc/user-guide/external-dependencies) and
   [external outputs](/doc/user-guide/managing-external-data) (outside of the
-  workspace) are also supported.
+  workspace) are also supported (except metrics and plots).
 
 - Outputs are deleted from the workspace before executing the command (including
   at `dvc repro`) if their paths are found as existing files/directories. This
diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md
index 0e46718520..e3c25c3b06 100644
--- a/content/docs/user-guide/external-dependencies.md
+++ b/content/docs/user-guide/external-dependencies.md
@@ -1,36 +1,38 @@
 # External Dependencies
 
 There are cases when data is so large, or its processing is organized in a way
-that you would like to avoid moving it out of its external/remote location. For
-example from a network attached storage (NAS) drive, processing data on HDFS,
+that you would like to avoid moving it out of its external/remote location.
+For example from a network attached storage (NAS), processing data on HDFS,
 running [Dask](https://dask.org/) via SSH, or having a script that streams data
-from S3 to process it. A mechanism for external dependencies and
-[external outputs](/doc/user-guide/managing-external-data) provides a way for
-DVC to control data externally.
+from S3 to process it.
 
-## Description
+External dependencies and
+[external outputs](/doc/user-guide/managing-external-data) provide ways to track
+data outside of the project.
 
-With DVC, you can specify external files as dependencies for your pipeline
+## How it works
+
+You can specify external files or directories as dependencies for your pipeline
 stages. DVC will track changes in them and reflect this in the output of
 `dvc status`.
 
 Currently, the following types (protocols) of external dependencies are
 supported:
 
-- Local files and directories outside of your workspace
-- SSH
 - Amazon S3
 - Microsoft Azure Blob Storage
 - Google Cloud Storage
+- SSH
 - HDFS
 - HTTP
+- Local files and directories outside the workspace
 
 > Note that these are a subset of the remote storage types supported by
 > `dvc remote`.
 
-In order to specify an external dependency for your stage, use the usual `-d`
-option in `dvc run` with the external path or URL to your desired file or
-directory.
+In order to specify an external dependency for your stage, use the
+usual `-d` option in `dvc run` with the external path or URL to your desired
+file or directory.
 
 ## Examples
 
@@ -149,8 +151,8 @@ $ dvc import-url https://data.dvc.org/get-started/data.xml
 
 Importing 'https://data.dvc.org/get-started/data.xml' -> 'data.xml'
 ```
 
-The command above creates the import stage (DVC-file)
-`data.xml.dvc`, that uses an external dependency (in this case an HTTPs URL).
+The command above creates the import `.dvc` file `data.xml.dvc`, which contains
+an external dependency (in this case an HTTPS URL).
@@ -180,7 +182,7 @@ determine whether the source has changed and we need to download the file
 again.
 
 `dvc import` can download a data artifact from any DVC project or Git
 repository. It also creates an external dependency in its
-import stage (DVC-file).
+import `.dvc` file.
 
 ```dvc
 $ dvc import git@github.com:iterative/example-get-started model.pkl
diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md
index 2348ee977a..649d73b12a 100644
--- a/content/docs/user-guide/managing-external-data.md
+++ b/content/docs/user-guide/managing-external-data.md
@@ -1,83 +1,94 @@
 # Managing External Data
 
 There are cases when data is so large, or its processing is organized in a way
-that you would like to avoid moving it out of its external/remote location. For
-example from a network attached storage (NAS) drive, processing data on HDFS,
+that makes it preferable to avoid moving it from its external/remote location.
+For example, data on a network attached storage (NAS), processing data on HDFS,
 running [Dask](https://dask.org/) via SSH, or having a script that streams data
-from S3 to process it. External outputs and
-[external dependencies](/doc/user-guide/external-dependencies) provide a way for
-DVC to control data outside of the project directory.
+from S3 to process it.
 
-## Description
+External outputs and
+[external dependencies](/doc/user-guide/external-dependencies) provide ways to
+track data outside of the project.
 
-DVC can track files on an external storage with `dvc add` or specify external
-files as outputs for
-[DVC-files](/doc/user-guide/dvc-files-and-directories) created by `dvc run`
-(stage files). External outputs are considered part of the DVC project. DVC will
-track changes in them and reflect this in the output of `dvc status`.
+## How external outputs work
+
+DVC can track existing files or directories on an external location with
+`dvc add` (`out` field). It can also create external files or directories as
+outputs for `dvc.yaml` files (only `outs` field, not metrics or plots).
+
+External outputs are considered part of the (extended) DVC project: DVC will
+track changes in them, and reflect this in `dvc status` reports, for example.
+
+For cached external outputs (e.g. `dvc add`, `dvc run -o`), you will need to
+[set up an external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
+in the same external/remote file system first.
 
 Currently, the following types (protocols) of external outputs (and cache) are
 supported:
 
-- Local files and directories outside of your workspace
-- SSH
 - Amazon S3
+- Microsoft Azure Blob Storage
 - Google Cloud Storage
+- SSH
 - HDFS
+- Local files and directories outside the workspace
 
 > Note that these are a subset of the remote storage types supported by
 > `dvc remote`.
 
-In order to specify an external output for a stage file, use the usual `-o` or
-`-O` options of `dvc run`, but with the external path or URL to the file in
-question. For cached external outputs (`-o`) you will need to
-[setup an external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
-in the same external/remote file system first.
-
-> Avoid using the same location of the
-> [remote storage](/doc/command-reference/remote) that you have for `dvc push`
-> and `dvc pull` for external outputs or as external cache, because it may cause
-> file hash overlaps: The hash value of a data file in external storage could
-> collide with the one generated locally for another file.
+> Avoid using the same [DVC remote](/doc/command-reference/remote) (used for
+> `dvc push`, `dvc pull`, etc.) for external outputs, because it may cause file
+> hash overlaps: the hash of an external output could collide with a hash
+> generated locally for another file with different content.
 
 ## Examples
 
-For the examples, let's take a look at a [stage](/doc/command-reference/run)
-that simply moves local file to an external location, producing a `data.txt.dvc`
-DVC-file.
+For the examples, let's take a look at:
+
+1. Adding a `dvc remote` to use as cache for data in the external location, and
+   configuring it as external cache with `dvc config`.
+2. Tracking existing data on an external location with `dvc add` (this doesn't
+   download it). This produces a `.dvc` file with an external output.
+3. Creating a simple [stage](/doc/command-reference/run) that moves a local file
+   to the external location. This produces a stage with another external output
+   in `dvc.yaml`.
 
 ### Amazon S3
 
 ```dvc
-# Add S3 remote to be used as cache location for S3 files
 $ dvc remote add s3cache s3://mybucket/cache
-
-# Tell DVC to use the 's3cache' remote as S3 cache location
 $ dvc config cache.s3 s3cache
 
-# Add data on S3 directly
-$ dvc add --external s3://mybucket/mydata
+$ dvc add --external s3://mybucket/existing-data
 
-# Create the stage with an external S3 output
 $ dvc run -d data.txt \
           --external \
           -o s3://mybucket/data.txt \
           aws s3 cp data.txt s3://mybucket/data.txt
 ```
 
+### Microsoft Azure Blob Storage
+
+```dvc
+$ dvc remote add azurecache azure://mycontainer/cache
+$ dvc config cache.azure azurecache
+
+$ dvc add --external azure://mycontainer/existing-data
+
+$ dvc run -d data.txt \
+          --external \
+          -o azure://mycontainer/data.txt \
+          az storage blob upload -f data.txt -c mycontainer -n data.txt
+```
+
 ### Google Cloud Storage
 
 ```dvc
-# Add GS remote to be used as cache location for GS files
 $ dvc remote add gscache gs://mybucket/cache
-
-# Tell DVC to use the 'gscache' remote as GS cache location
 $ dvc config cache.gs gscache
 
-# Add data on GS directly
-$ dvc add --external gs://mybucket/mydata
+$ dvc add --external gs://mybucket/existing-data
 
-# Create the stage with an external GS output
 $ dvc run -d data.txt \
           --external \
           -o gs://mybucket/data.txt \
@@ -87,22 +98,22 @@ $ dvc run -d data.txt \
 ### SSH
 
 ```dvc
-# Add SSH remote to be used as cache location for SSH files
 $ dvc remote add sshcache ssh://user@example.com/cache
-
-# Tell DVC to use the 'sshcache' remote as SSH cache location
 $ dvc config cache.ssh sshcache
 
-# Add data on SSH directly
-$ dvc add --external ssh://user@example.com/mydata
+$ dvc add --external ssh://user@example.com/existing-data
 
-# Create the stage with an external SSH output
 $ dvc run -d data.txt \
          --external \
          -o ssh://user@example.com/data.txt \
         scp data.txt user@example.com:/data.txt
 ```
 
+> Please note that to use password authentication, it's necessary to set the
+> `password` or `ask_password` SSH remote options first (see
+> `dvc remote modify`), and use a special `remote://` URL in step 2:
+> `dvc add --external remote://sshcache/existing-data`.
+
 ⚠️ DVC requires both SSH and SFTP access to work with remote SSH locations.
 Please check that you are able to connect both ways with tools like `ssh` and
 `sftp` (GNU/Linux).
 
@@ -112,16 +123,11 @@
 
 ### HDFS
 
 ```dvc
-# Add HDFS remote to be used as cache location for HDFS files
 $ dvc remote add hdfscache hdfs://user@example.com/cache
-
-# Tell DVC to use the 'hdfscache' remote as HDFS cache location
 $ dvc config cache.hdfs hdfscache
 
-# Add data on HDFS directly
-$ dvc add --external hdfs://user@example.com/mydata
+$ dvc add --external hdfs://user@example.com/existing-data
 
-# Create the stage with an external HDFS output
 $ dvc run -d data.txt \
           --external \
          -o hdfs://user@example.com/data.txt \
@@ -135,14 +141,18 @@ it. So systems like Hadoop, Hive, and HBase are supported!
 
 ### Local file system path
 
-The default cache location is `.dvc/cache`, so there is no need to move it for
-local paths outside of your project.
+The default cache is in `.dvc/cache`, so there is no need to set a
+custom cache location for local paths outside of your project.
+
+> The one exception is external data on a different storage device or partition
+> mounted on the same file system (e.g. `/mnt/raid/data`). In that case, please
+> set up an external cache on that same drive to enable
+> [file links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
+> and avoid copying data.
 
 ```dvc
-# Add data on an external location directly
-$ dvc add --external /home/shared/mydata
+$ dvc add --external /home/shared/existing-data
 
-# Create the stage with an external location output
 $ dvc run -d data.txt \
           --external \
           -o /home/shared/data.txt \
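For the mounted-drive case mentioned in the note above, a minimal sketch of the setup could look like the commands below, mirroring the remote-as-cache pattern used for the other protocols (the `/mnt/raid` paths and the `raidcache` remote name are hypothetical; `cache.local` is the config option for a local remote used as external cache):

```dvc
$ dvc remote add raidcache /mnt/raid/cache
$ dvc config cache.local raidcache

$ dvc add --external /mnt/raid/existing-data
```

Keeping the cache on the same device as the data lets DVC use file links (reflinks or hardlinks where available) instead of copying it into the cache.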