From 0ed9a3fc5401fbdf8281d72a0d99a3b293b3c0e4 Mon Sep 17 00:00:00 2001 From: Thierry Onkelinx Date: Fri, 8 Nov 2019 16:24:28 +0100 Subject: [PATCH 1/9] improve README.md based on feedback from gramr::write_good_file --- README.md | 63 +++++++++++++++++++++++++++++++++++++++---------------- 1 file changed, 45 insertions(+), 18 deletions(-) diff --git a/README.md b/README.md index 243d035..20935be 100644 --- a/README.md +++ b/README.md @@ -22,28 +22,42 @@ ## Rationale -The `git2rdata` package is an R package for writing and reading dataframes as plain text files. Important information is stored in a metadata file. - -1. Storing metadata allows to maintain the classes of variables. By default, the data is optimized for file storage prior to writing. The optimization is most effective on data containing factors. The optimization makes the data less human readable and can be turned off. Details on the implementation are available in `vignette("plain_text", package = "git2rdata")`. -1. Storing metadata also allows to minimize row based [diffs](https://en.wikipedia.org/wiki/Diff) between two consecutive [commits](https://en.wikipedia.org/wiki/Commit_(version_control)). This is a useful feature when storing data as plain text files under version control. Details on this part of the implementation are available in `vignette("version_control", package = "git2rdata")`. Although `git2rdata` was envisioned with a [git](https://git-scm.com/) workflow in mind, it can also be used in combination with other version control systems like [subversion](https://subversion.apache.org/) or [mercurial](https://www.mercurial-scm.org/). -1. `git2rdata` is intended to facilitate a reproducible and traceable workflow. A toy example is given in `vignette("workflow", package = "git2rdata")`. -1. `vignette("efficiency", package = "git2rdata")` provides some insight into the efficiency in terms of file storage, git repository size and speed for writing and reading. +The `git2rdata` package is an R package for writing and reading dataframes as plain text files. +A metadata file stores important information. + +1. Storing metadata allows to maintain the classes of variables. +By default, `git2rdata` optimizes the data for file storage. +The optimization is most effective on data containing factors. +The optimization makes the data less human readable. +The user can turn this off when they prefer a human readable format over smaller files. +Details on the implementation are available in `vignette("plain_text", package = "git2rdata")`. +1. Storing metadata also allows smaller row based [diffs](https://en.wikipedia.org/wiki/Diff) between two consecutive [commits](https://en.wikipedia.org/wiki/Commit_(version_control)). +This is a useful feature when storing data as plain text files under version control. +Details on this part of the implementation are available in `vignette("version_control", package = "git2rdata")`. +Although we envisioned `git2rdata` with a [git](https://git-scm.com/) workflow in mind, you can use it in combination with other version control systems like [subversion](https://subversion.apache.org/) or [mercurial](https://www.mercurial-scm.org/). +1. `git2rdata` is a useful tool in a reproducible and traceable workflow. +`vignette("workflow", package = "git2rdata")` gives a toy example. +1. `vignette("efficiency", package = "git2rdata")` provides some insight into the efficiency of file storage, git repository size and speed for writing and reading. ## Why Use Git2rdata? 
- You can store dataframes as plain text files.
-- The dataframe you read has exactly the same information content as the one you wrote.
+- The dataframe you read has identical information content to the one you wrote.
  - No changes in data type.
  - Factors keep their original levels, including their order.
-  - Date and date-time are stored in an unambiguous format, documented in the metadata.
-- The data and the metadata are stored in a standard and open format, making it readable by other software.
-- Data and metadata are checked during the reading. The user is informed if there is tampering with the data or metadata.
+  - Date and date-time format are unambiguous, documented in the metadata.
+- The data and the metadata are in a standard and open format, making it readable by other software.
+- `git2rdata` checks the data and metadata during the reading.
+`read_vc()` informes the user if there is tampering with the data or metadata.
- Git2rdata integrates with the [`git2r`](https://cran.r-project.org/package=git2r) package for working with git repository from R.
- Another option is using git2rdata solely for writing to disk and handle the plain text files with your favourite version control system outside of R.
- The optimization reduces the required disk space by about 30% for both the working directory and the git history.
- Reading data from a HDD is 30% faster than `read.table()`, writing to a HDD takes about 70% more time than `write.table()`.
-- Git2rdata is useful as a tool in a reproducible and traceable workflow. See `vignette("workflow", package = "git2rdata")`.
-- You can detect when a file was last modified in the git history. Use this to check whether an existing analysis is obsolete due to new data. This allows to not rerun up to date analyses, saving resources.
+- Git2rdata is useful as a tool in a reproducible and traceable workflow.
+See `vignette("workflow", package = "git2rdata")`.
+- You can detect when a file was last modified in the git history.
+Use this to check whether an existing analysis is obsolete due to new data.
+This avoids rerunning analyses that are still up to date, saving resources.

## Talk About `git2rdata` at useR!2019 in Toulouse, France

@@ -74,9 +88,14 @@ remotes::install_github(
 remotes::install_github("ropensci/git2rdata"))
 ```

-## Usage in a Nutshell
+## Usage in Brief

-Dataframes are stored using `write_vc()` and retrieved with `read_vc()`. Both functions share the arguments `root` and `file`. `root` refers to a base location where the dataframe should be stored. It can either point to a local directory or a local git repository. `file` is the file name to use and can include a path relative to `root`. Make sure the relative path stays within `root`.
+The user stores dataframes with `write_vc()` and retrieves them with `read_vc()`.
+Both functions share the arguments `root` and `file`.
+`root` refers to the base location where the dataframe is stored.
+It can either point to a local directory or a local git repository.
+`file` is the file name to use and can include a path relative to `root`.
+Make sure the relative path stays within `root`.

```r
# using a local directory
@@ -104,9 +123,14 @@ Please read `vignette("version_control", package = "git2rdata")` for more detail

## What Data Sizes Can Git2rdata Handle?

-The recommendation for git repositories is to use files smaller than 100 MiB, an overall repository size less than 1 GiB and less than 25k files. The individual file size is the limiting factor.
Storing the airbag dataset ([`DAAG::nassCDS`](https://cran.r-project.org/package=DAAG)) with `write_vc()` requires on average 68 (optimized) or 97 (verbose) byte per record. The 100 MiB file limit for this data is reached after about 1.5 million (optimize) or 1 million (verbose) observations. +The recommendation for git repositories is to use files smaller than 100 MiB, a repository size less than 1 GiB and less than 25k files. +The individual file size is the limiting factor. +Storing the airbag dataset ([`DAAG::nassCDS`](https://cran.r-project.org/package=DAAG)) with `write_vc()` requires on average 68 (optimized) or 97 (verbose) byte per record. +The file reaches the 100 MiB limit for this data after about 1.5 million (optimized) or 1 million (verbose) observations. -Storing a 90% random subset of the airbag dataset requires 370 kiB (optimized) or 400 kiB (verbose) storage in the git history. Updating the dataset with other 90% random subsets requires on average 60 kiB (optimized) to 100 kiB (verbose) per commit. The git history limit of 1 GiB will be reached after 17k (optimized) to 10k (verbose) commits. +Storing a 90% random subset of the airbag dataset requires 370 kiB (optimized) or 400 kiB (verbose) storage in the git history. +Updating the dataset with other 90% random subsets requires on average 60 kiB (optimized) to 100 kiB (verbose) per commit. +The git history reaches the limit of 1 GiB after 17k (optimized) to 10k (verbose) commits. Your mileage might vary. @@ -122,7 +146,7 @@ Please use the output of `citation("git2rdata")` - `testthat`: R scripts with unit tests using the [testthat](http://testthat.r-lib.org/) framework - `vignettes`: source code for the vignettes describing the package - `man-roxygen`: templates for documentation in Roxygen format -- `pkgdown`: additional source files for the `git2rdata` [website](https://ropensci.github.io/git2rdata/) +- `pkgdown`: source files for the `git2rdata` [website](https://ropensci.github.io/git2rdata/) - `.github`: guidelines and templates for contributors ``` @@ -141,6 +165,9 @@ git2rdata ## Contributions -Contributions to `git2rdata` are welcome. Please read our [Contributing guidelines](https://github.com/ropensci/git2rdata/blob/master/.github/CONTRIBUTING.md) first. The `git2rdata` project is released with a [Contributor Code of Conduct](https://github.com/ropensci/git2rdata/blob/master/.github/CODE_OF_CONDUCT.md). By contributing to this project, you agree to abide by its terms. +`git2rdata` welcomes contributions. +Please read our [Contributing guidelines](https://github.com/ropensci/git2rdata/blob/master/.github/CONTRIBUTING.md) first. +The `git2rdata` project has a [Contributor Code of Conduct](https://github.com/ropensci/git2rdata/blob/master/.github/CODE_OF_CONDUCT.md). +By contributing to this project, you agree to abide by its terms. 
[![rOpenSci footer](http://ropensci.org/public_images/github_footer.png)](https://ropensci.org) From 46d57bb96be501e1cf764c79545057877b5d4471 Mon Sep 17 00:00:00 2001 From: Thierry Onkelinx Date: Fri, 8 Nov 2019 17:06:58 +0100 Subject: [PATCH 2/9] update NEWS.md --- NEWS.md | 41 ++++++++++++----------------------------- 1 file changed, 12 insertions(+), 29 deletions(-) diff --git a/NEWS.md b/NEWS.md index 3f4e14b..2c6bfe0 100644 --- a/NEWS.md +++ b/NEWS.md @@ -1,42 +1,24 @@ -git2rdata 0.1.0.9003 (2019-11-07) +git2rdata 0.2.0 (2019-11-08) ================================= ### BREAKING FEATURES - * reordering factor levels requires `strict = TRUE` - -git2rdata 0.1.0.9002 (2019-09-27) -================================= - -### BREAKING FEATURES - - * sorting is based on the "C" locale - * the data hash is based on the plain text file - -git2rdata 0.1.0.9001 (2019-09-09) -================================= - -### BREAKING FEATURES - - * Calculation of data hash has changed, due to which `read_vc()` will once warn that data are altered outside git2rdata when reading a previously written git2rdata object (#53). - * `read_vc()` only works with data stored with version >= 0.1.0.9001. Use `upgrade_data()` on data written with an earlier version. + * Calculation of data hash has changed (#53). + You must use `upgrade_data()` to read data stored by an older version. * `is_git2rdata()` and `upgrade_data()` do not test equality in data hashes anymore (but `read_vc()` still does). + * `write_vc()` and `read_vc()` fail when `file` is a location outside of `root` (#50). + * Reordering factor levels requires `strict = TRUE`. ### Bugfixes - * The same data hash is generated on Linux and Windows machines (#49). - -git2rdata 0.1.0.9000 (2019-08-13) -================================= - -### BREAKING FEATURES - - * `write_vc()` and `read_vc()` fail when `file` is a location outside of `root` (#50). + * Linux and Windows machines now generated the same data hash (#49). ### NEW FEATURES - * Only require `upgrade_data()` for data written with versions prior to 0.0.5 (#44). - * Improve warnings() and error(). + * Internal sorting uses the "C" locale, regardless of the current locale. + * `read_vc()` reads older stored in an older version (#44). + When the version is too old, it prompts to `upgrade_data()`. + * Improve `warnings()` and `error()` messages. * Use vector version of logo. git2rdata 0.1 (2019-06-04) @@ -64,7 +46,8 @@ git2rdata 0.0.4 (2019-05-16) * The meta data gains a data hash. A mismatch throws a warning when reading the object. This tolerates updating the data by other software, while informing the user that such change occurred. * `is_git2rmeta()` validates metadata. * `list_data()` lists files with valid metadata. - * `rm_data()` and `prune_meta()` remove files with valid metadata. Other files are untouched. + * `rm_data()` and `prune_meta()` remove files with valid metadata. + They don't touch `tsv` file without metadata or `yml` files not assosiated with `git2rdata`. * Files with invalid metadata yield a warning with `list_data()`, `rm_data()` and `prune_meta()`. 
### Bugfixes From 96f15e5b81ebe4e8e4ebc5d6fc3d84a91983ada2 Mon Sep 17 00:00:00 2001 From: Thierry Onkelinx Date: Wed, 13 Nov 2019 13:44:59 +0100 Subject: [PATCH 3/9] improve written on efficiency vignette --- vignettes/efficiency.Rmd | 43 ++++++++++++++++++++++++++++++---------- 1 file changed, 32 insertions(+), 11 deletions(-) diff --git a/vignettes/efficiency.Rmd b/vignettes/efficiency.Rmd index cdff72f..56380af 100644 --- a/vignettes/efficiency.Rmd +++ b/vignettes/efficiency.Rmd @@ -1,11 +1,11 @@ --- -title: "Efficiency in Terms of Storage and Time" +title: "Efficiency Relative to Storage and Time" author: "Thierry Onkelinx" output: rmarkdown::html_vignette: fig_caption: yes vignette: > - %\VignetteIndexEntry{Efficiency in Terms of Storage and Time} + %\VignetteIndexEntry{Efficiency Relative to Storage and Time} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignetteDepends{git2r} @@ -79,7 +79,8 @@ theme_inbo <- function(base_size = 12, base_family = "") { legend.direction = NULL, legend.justification = "center", legend.box = NULL, - legend.box.margin = margin(half_line, half_line, half_line, half_line), + legend.box.margin = margin(t = half_line, r = half_line, b = half_line, + l = half_line), legend.box.background = element_rect(colour = NA, fill = legend.bg), legend.box.spacing = unit(0.2, "cm"), panel.background = element_rect(fill = panel.bg, colour = NA), @@ -105,7 +106,8 @@ theme_inbo <- function(base_size = 12, base_family = "") { margin = margin(0, 0, half_line, 0)), plot.caption = element_text(size = rel(0.6), margin = margin(0, 0, half_line, 0)), - plot.margin = margin(half_line, half_line, half_line, half_line), + plot.margin = margin(t = half_line, r = half_line, b = half_line, + l = half_line), plot.tag = element_text(size = rel(1.2), hjust = 0.5, vjust = 0.5), plot.tag.position = "topleft", complete = TRUE @@ -121,7 +123,8 @@ update_geom_defaults("boxplot", list(colour = "#356196")) This vignette compares storage and retrieval of data by `git2rdata` with other standard R functionality. We consider `write.table()` and `read.table()` for data stored in a plain text format. `saveRDS()` and `readRDS()` use a compressed binary format. -In order to get some meaningful results, we will use the `nassCDS` dataset from the [DAAG](https://www.rdocumentation.org/packages/DAAG/versions/1.22/topics/nassCDS) package. We'll avoid the dependency on the package by directly downloading the data. +To get some meaningful results, we will use the `nassCDS` dataset from the [DAAG](https://www.rdocumentation.org/packages/DAAG/versions/1.22/topics/nassCDS) package. +We'll avoid the dependency on the package by directly downloading the data. ```{r download_data, eval = system.file("efficiency", "airbag.rds", package = "git2rdata") == ""} airbag <- read.csv( @@ -173,7 +176,12 @@ fn <- write_vc(airbag, "airbag_verbose", root, sorting = "X", optimize = FALSE) verbose_size <- sum(file.size(file.path(root, fn))) ``` -Since the data is highly compressible, `saveRDS()` yields the smallest file at the cost of having a binary file format. Both `write_vc()` formats yield smaller files than `write.table()`. Partly because `write_vc()` doesn't store row names and only uses quotes when needed. The difference between the optimized and verbose version of `write_vc()` is, in this case, solely due to the way factors are stored in the data (tsv) file. The optimized version stores the indices of the factor whereas the verbose version stores the levels. 
For example: `airbag$dvcat` has 5 levels with fairly short labels (on average 5 character), however storing the index requires only 1 character. Resulting in more compact files. +Since the data is highly compressible, `saveRDS()` yields the smallest file at the cost of having a binary file format. Both `write_vc()` formats yield smaller files than `write.table()`. +Partly because `write_vc()` doesn't store row names and doesn't use quotes unless needed. +The difference between the optimized and verbose version of `write_vc()` is, in this case, solely due to the way `write_vc()` stores factors in the data (`tsv`) file. +The optimized version stores the indices of the factor whereas the verbose version stores the levels. +For example: `airbag$dvcat` has 5 levels with short labels (on average 5 character), storing the index requires 1 character. +This results in more compact files. ```{r table_file_size, echo = FALSE} kable( @@ -188,7 +196,10 @@ kable( ) ``` -The reduction in file size when storing in factors depends on the length of the labels, the number of levels and the number of observations. The figure below illustrates the huge gain as soon as the level labels contain a few characters. The gain is less pronounced when the factor has a large number of levels. The optimization fails only in extreme cases with very short factor labels and a high number of levels. +The reduction in file size when storing in factors depends on the length of the labels, the number of levels and the number of observations. +The figure below illustrates the strong gain as soon as the level labels contain more than two characters. +The gain is less pronounced when the factor has a large number of levels. +The optimization fails in extreme cases with short factor labels and a high number of levels. ```{r factor_label_length, echo = FALSE, fig.cap = "Effect of the label length on the efficiency of storing factor optimized, assuming 1000 observations", warning = FALSE} ratio <- function(label_length = 1:20, n_levels = 9, n_obs = 1000) { @@ -272,9 +283,12 @@ ggplot(f_ratio, aes(x = observations, y = ratio, colour = levels)) + ### In Git Repositories -Here we will simulate how much space the data requires when the history is stored in a git repository. We will create a git repository for each method and store several subsets of the same data. Each commit contains a new version of the data. Each version is a random sample containing 90% of the observations of the `airbag` data. Two consecutive versions of the subset will have about 90% of the observations in common. 10% of the observations will be replaced by other observations. +Here we will simulate how much space the data requires to store the history in a git repository. +We will create a git repository for each method and store different subsets of the same data. +Each commit contains a new version of the data. Each version is a random sample containing 90% of the observations of the `airbag` data. +Two consecutive versions of the subset will have about 90% of the observations in common. -After writing each version, we commit the file, perform garbage collection (`git gc`) on the git repository to minimize its size and then calculate the size of the git history (`git count-objects -v`). +After writing each version, we commit the file, perform garbage collection (`git gc`) on the git repository and then calculate the size of the git history (`git count-objects -v`). 
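As a side note, the repository size can be measured with a small helper along the following lines. This is only an illustrative sketch, not code from the vignette: the function name is ours and it assumes the `git` command line tool is available on the system. The actual simulation code follows in the next chunk.

```{r git_history_size_sketch, eval = FALSE}
# Sketch: report the size of the packed git history (in KiB) by running
# garbage collection and then parsing the "size-pack" field of
# `git count-objects -v`. Assumes the git command line tool is installed.
git_history_size <- function(repo_path) {
  system2("git", c("-C", repo_path, "gc", "--quiet"))
  info <- system2("git", c("-C", repo_path, "count-objects", "-v"), stdout = TRUE)
  size_pack <- info[grepl("^size-pack:", info)]
  as.numeric(sub("^size-pack: ", "", size_pack))
}
```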
```{r git_size, eval = system.file("efficiency", "git_size.rds", package = "git2rdata") == ""} library(git2r) @@ -342,7 +356,12 @@ if (system.file("efficiency", "git_size.rds", package = "git2rdata") == "") { Each version of the data has on purpose a random order of observations and variables. This is what would happen in a worst case scenario as it would generate the largest possible diff. We also test `write.table()` with a stable ordering of the observations and variables. -The randomised `write.table()` yields the largest git repository, converging to about `r sprintf("%.1f", repo_size["write.table", 100] / repo_size["write.table.sorted", 100])` times the size of a git repository based on the sorted `write.table()`. `saveRDS()` yields a `r sprintf("%.0f%%", 100 - 100 * repo_size["saveRDS", 100] / repo_size["write.table", 100])` reduction in repository size compared to the randomised `write.table()`, but still is `r sprintf("%.1f", repo_size["saveRDS", 100] / repo_size["write.table.sorted", 100])` times larger than the sorted `write.table()`. Note that the gain of storing binary files in a git repository is much smaller than the gain in individual file size because the git repository will be compressed too. The optimized `write_vc()` starts at `r sprintf("%.0f%%", 100 * repo_size["write_vc.optimized", 1] / repo_size["write.table.sorted", 1])` and converges toward `r sprintf("%.0f%%", 100 * repo_size["write_vc.optimized", 100] / repo_size["write.table.sorted", 100])`, the verbose version starts at `r sprintf("%.0f%%", 100 * repo_size["write_vc.verbose", 1] / repo_size["write.table.sorted", 1])` and converges towards `r sprintf("%.0f%%", 100 * repo_size["write_vc.verbose", 100] / repo_size["write.table.sorted", 100])`. There is a clear gain when using `write_vc()` with optimization in terms of storage size and the availability of metadata. The verbose option of `write_vc()` lacks the gain in terms of storage size but still has the metadata advantage. +The randomised `write.table()` yields the largest git repository, converging to about `r sprintf("%.1f", repo_size["write.table", 100] / repo_size["write.table.sorted", 100])` times the size of a git repository based on the sorted `write.table()`. `saveRDS()` yields a `r sprintf("%.0f%%", 100 - 100 * repo_size["saveRDS", 100] / repo_size["write.table", 100])` reduction in repository size compared to the randomised `write.table()`, but still is `r sprintf("%.1f", repo_size["saveRDS", 100] / repo_size["write.table.sorted", 100])` times larger than the sorted `write.table()`. +Note that the gain of storing binary files in a git repository is much smaller than the gain in individual file size because git compresses its history. +The optimized `write_vc()` starts at `r sprintf("%.0f%%", 100 * repo_size["write_vc.optimized", 1] / repo_size["write.table.sorted", 1])` and converges toward `r sprintf("%.0f%%", 100 * repo_size["write_vc.optimized", 100] / repo_size["write.table.sorted", 100])`, the verbose version starts at `r sprintf("%.0f%%", 100 * repo_size["write_vc.verbose", 1] / repo_size["write.table.sorted", 1])` and converges towards `r sprintf("%.0f%%", 100 * repo_size["write_vc.verbose", 100] / repo_size["write.table.sorted", 100])`. +Storage size is a lot smaller when using `write_vc()` with optimization. +The verbose option of `write_vc()` has little the gain in storage size. +Another advantage is that `write_vc()` stores metadata. 
```{r plot_git_size, echo = FALSE, fig.cap = "Size of the git history using the different storage methods."} rs <- lapply( @@ -420,7 +439,9 @@ names(write_ratio) <- median_time$expr ``` -`write_vc()` takes `r paste(sprintf("%.0f%%", -100 + write_ratio[grep("write_vc", names(write_ratio))]), collapse = " to ")` more time than `write.table()` because it needs to prepare the metadata and sort the observations and variables. When overwriting existing data, the new data is checked against the existing metadata. `saveRDS()` requires only `r sprintf("%.0f%%", write_ratio["saveRDS"])` of the time that `write.table()` needs. +`write_vc()` takes `r paste(sprintf("%.0f%%", -100 + write_ratio[grep("write_vc", names(write_ratio))]), collapse = " to ")` more time than `write.table()` because it needs to prepare the metadata and sort the observations and variables. +When overwriting existing data, `write_vc()` checks the new data against the existing metadata. +`saveRDS()` requires `r sprintf("%.0f%%", write_ratio["saveRDS"])` of the time that `write.table()` needs. ```{r plot_file_timings, echo = FALSE, fig.cap = "Boxplot of the write timings for the different methods."} mb$expr <- reorder(mb$expr, mb$time, FUN = median) From 6b6ac2baa79a544cd74b361852593dfe5d2b0085 Mon Sep 17 00:00:00 2001 From: Thierry Onkelinx Date: Wed, 13 Nov 2019 14:14:24 +0100 Subject: [PATCH 4/9] improve writting on plain text vignette --- vignettes/plain_text.Rmd | 76 +++++++++++++++++++++++++++++----------- 1 file changed, 56 insertions(+), 20 deletions(-) diff --git a/vignettes/plain_text.Rmd b/vignettes/plain_text.Rmd index 880b727..e6f6b4c 100644 --- a/vignettes/plain_text.Rmd +++ b/vignettes/plain_text.Rmd @@ -22,26 +22,48 @@ options(width = 83) This vignette motivates why we wrote `git2rdata` and illustrates how you can use it to store dataframes as plain text files. -### Maintaining variable classes - -R has several options to store dataframes as plain text files from R. Base R has `write.table()` and its companions like `write.csv()`. Some other options are `data.table::fwrite()`, `readr::write_delim()`, `readr::write_csv()` and `readr::write_tsv()`. Each of them writes a dataframe as a plain text file by converting all variables into characters. After reading the file, the conversion is reversed. However, the distinction between `character` and `factor` is lost in translation. `read.table()` converts by default all strings to factors, `readr::read_csv()` keeps by default all strings as character. The factor levels are another thing which is lost. These functions determine factor levels based on the observed levels in the plain text file. Hence factor levels without observations will disappear. The order of the factor levels is also determined by the available levels in the plain text file, which can be different from the original order. +### Maintaining Variable Classes + +R has different options to store dataframes as plain text files from R. +Base R has `write.table()` and its companions like `write.csv()`. +Some other options are `data.table::fwrite()`, `readr::write_delim()`, `readr::write_csv()` and `readr::write_tsv()`. +Each of them writes a dataframe as a plain text file by converting all variables into characters. +After reading the file, they revert this conversion. +The distinction between `character` and `factor` gets lost in translation. +`read.table()` converts by default all strings to factors, `readr::read_csv()` keeps by default all strings as character. 
+These functions cannot recover the factor levels.
+They determine factor levels based on the observed levels in the plain text file.
+Hence factor levels without observations will disappear.
+The order of the factor levels is also determined by the available levels in the plain text file, which can be different from the original order.

The `write_vc()` and `read_vc()` functions from `git2rdata` keep track of the class of each variable and, in case of a factor, also of the factor levels and their order. Hence this function pair preserves the information content of the dataframe. The `vc` suffix stands for **v**ersion **c**ontrol as these functions use their full capacity in combination with a version control system.

-Efficiency in terms of storage and time
-### Optimizing file storage
-Plain text files require more disk space than binary files. This is the price we have to pay for a readable file format. The default option of `write_vc()` is to minimize file size as much as possible prior to writing. Since we use a tab delimited file format, we can omit quotes around character variables. This saves 2 bytes per row for each character variable. Quotes are added automatically in the exceptional cases when they are needed, e.g. to store a string that contains tab or newline characters. In such cases, quotes are only used in row-variable combinations where the exception occurs.
+
+## Efficiency Relative to Storage and Time
+
+### Optimizing File Storage
+
+Plain text files require more disk space than binary files.
+This is the price we have to pay for a readable file format.
+The default option of `write_vc()` is to create files that are as compact as possible.
+Since we use a tab delimited file format, we can omit quotes around character variables.
+This saves 2 bytes per row for each character variable.
+`write_vc()` adds quotes automatically in the exceptional cases where they are needed, e.g. to store a string that contains tab or newline characters.
+It doesn't add quotes to row-variable combinations that don't need them.

-Since we store the class of each variable, further file size reductions can be achieved by following rules:
+Since we store the class of each variable, we can further reduce the file size by applying the following rules:

-- `logical` is written as 0 (FALSE), 1 (TRUE) or NA to the data
-- `factor` is stored as its indices in the data. The index and labels of levels and their order are stored in the metadata.
-- `POSIXct` is written as a numeric to the data. The class and the origin are stored in the metadata. Timestamps are always stored and returned as UTC.
-- `Date` is written as an integer to the data. The class and the origin are stored in the metadata.
+- Store a `logical` as 0 (FALSE), 1 (TRUE) or NA in the data.
+- Store a `factor` as its indices in the data.
+Store the index, labels of levels and their order in the metadata.
+- Store a `POSIXct` as a numeric in the data.
+Store the class and the origin in the metadata.
+Store and return timestamps as UTC.
+- Store a `Date` as an integer in the data.
+Store the class and the origin in the metadata.

Storing the factors, POSIXct and Date as their index, makes them less user readable. The user can turn off this optimization when user readability is more important than file size.

-### Optimized for version control
+### Optimized for Version Control

Another main goal of `git2rdata` is to optimise the storage of the plain text files under version control.
`write_vc()` and `read_vc()` has methods for interacting with [git](https://git-scm.com/) repositories using the `git2r` framework. Users who want to use git without `git2r` or use a different version control system (e.g. [Subversion](https://subversion.apache.org/), [Mercurial](https://www.mercurial-scm.org/)), still can use `git2rdata` to write the files to disk and uses their preferred workflow on version control. @@ -80,14 +102,22 @@ str(x) ## Storing Optimized -Use `write_vc()` to store the dataframe. The `root` argument refers to the base directory where the data is stored. The `file` argument is used as the base name of the files. The data file gets a `.tsv` extension, the metadata file a `.yml` extension. `file` can include a relative path starting from `root`. +Use `write_vc()` to store the dataframe. +The `root` argument refers to the base directory where we store the data. +The `file` argument becomes the base name of the files. +The data file gets a `.tsv` extension, the metadata file a `.yml` extension. +`file` can include a relative path starting from `root`. ```{r first_write} library(git2rdata) write_vc(x = x, file = "first_test", root = path, strict = FALSE) ``` -`write_vc()` returns a vector of relative paths to the raw data and metadata files. The hashes of these files are used as names of the vector. We can have a look at both files. We'll only display the first 10 rows of the raw data. Notice that the YAML format of the metadata has the benefit of being both human and machine readable. +`write_vc()` returns a vector of relative paths to the raw data and metadata files. +The names of this vector contains the hashes of these files. +We can have a look at both files. +We'll display the first 10 rows of the raw data. +Notice that the YAML format of the metadata has the benefit of being both human and machine readable. ```{r manual_data} print_file <- function(file, root, n = -1) { @@ -113,13 +143,15 @@ print_file("verbose.tsv", path, 10) print_file("verbose.yml", path) ``` -## Efficiency in Terms of File Storage +## Efficiency Relative to File Storage -Storing dataframes optimized or verbose has an impact on the required file size. A comparison can be found in the [efficiency](efficiency.html#on-a-file-system) vignette. +Storing dataframes optimized or verbose has an impact on the required file size. +The [efficiency](efficiency.html#on-a-file-system) vignette give a comparison. ## Reading Data -The data can be retrieved with `read_vc()`. This function will reinstate the variables to their original state. +You retrieve the data with `read_vc()`. +This function will reinstate the variables to their original state. ```{r first_read} y <- read_vc(file = "first_test", root = path) @@ -128,11 +160,15 @@ y2 <- read_vc(file = "verbose", root = path) all.equal(x, y2, check.attributes = FALSE) ``` -As `read_vc()` requires the meta data, it can only read dataframes which were stored by `write_vc()`. +`read_vc()` requires the meta data. +It cannot handle dataframe not stored by `write_vc()`. ## Missing Values -`write_vc()` has an `na` argument which specifies the string which is used to indicate missing values. Because we avoid using quotes, this string must be different from any character value in the data. This includes factor labels when the data is stored verbose. This is checked and will always return an error, even with `strict = FALSE`. +`write_vc()` has an `na` argument which specifies the string which to use for missing values. 
+Because we avoid using quotes, this string must be different from any character value in the data. +This includes factor labels with verbose data storage. +`write_vc()` checks this and will always return an error, even with `strict = FALSE`. ```{r echo = FALSE, results = "hide"} stopifnot("X" %in% x$x, "b" %in% x$y) @@ -145,7 +181,7 @@ write_vc(x, "custom_na", path, strict = FALSE, na = "X") write_vc(x, "custom_na", path, strict = FALSE, na = "b") ``` -Please note that a single NA string is used for the entire dataset, thus for every variable. +Please note that `write_vc()` uses the same NA string for the entire dataset, thus for every variable. ```{r manual_na_data} print_file("custom_na.tsv", path, 10) From 12a369da415af77dcfd21f7e5d99eecfc5e4857d Mon Sep 17 00:00:00 2001 From: Thierry Onkelinx Date: Thu, 14 Nov 2019 15:46:18 +0100 Subject: [PATCH 5/9] reword workflow --- vignettes/workflow.Rmd | 84 +++++++++++++++++++++++++++++++++--------- 1 file changed, 67 insertions(+), 17 deletions(-) diff --git a/vignettes/workflow.Rmd b/vignettes/workflow.Rmd index 403686e..b6f0b90 100644 --- a/vignettes/workflow.Rmd +++ b/vignettes/workflow.Rmd @@ -27,13 +27,21 @@ This vignette describes a suggested workflow for storing a snapshot of dataframe In this vignette we use a `git2r::repository()` object as the root. This adds git functionality to `write_vc()` and `read_vc()`, provided by the [`git2r`](https://cran.r-project.org/package=git2r) package. This allows to pull, stage, commit and push from within R. -Each commit in the data git repository describes a complete snapshot of the data at the time of the commit. The difference between two commits can consist of changes in existing git2rdata object (updated observations, new observations, deleted observations or updated metadata). Besides updating the existing git2rdata objects, we can also add new git2rdata objects or remove existing ones. Such higher level addition and deletions need to be tracked as well. +Each commit in the data git repository describes a complete snapshot of the data at the time of the commit. +The difference between two commits can consist of changes in existing git2rdata object (updated observations, new observations, deleted observations or updated metadata). +Besides updating the existing git2rdata objects, we can also add new git2rdata objects or remove existing ones. +We need to track such higher level addition and deletions as well. We illustrate the workflow with a mock analysis on the `datasets::beaver1` and `datasets::beaver2` datasets. ## Setup -We start by initializing a git repository. `git2rdata` assumes that is already done. Therefore we'll use the `git2r` functions to do so. We start by creating a local bare repository. In practice we will use a remote on an external server (GitHub, Gitlab, Bitbucket, ...). The example below creates a local git repository with an upstream git repository. Any other workflow to create a similar structure is fine. +We start by initializing a git repository. `git2rdata` assumes that is already done. +We'll use the `git2r` functions to do so. +We start by creating a local bare repository. +In practice we will use a remote on an external server (GitHub, Gitlab, Bitbucket, ...). +The example below creates a local git repository with an upstream git repository. +Any other workflow to create a similar structure is fine. 
```{r initialize} # initialize a bare git repo to be used as remote @@ -59,17 +67,30 @@ rm(init_repo) ## Structuring Git2rdata Objects Within a Project -`git2rdata` imposes very little structure. Both the `.tsv` and the `.yml` file need to be in the same folder. That's it. For the sake of simplicity, in this vignette we dump all git2rdata objects at the root of the repository. +`git2rdata` imposes a minimal structure. +Both the `.tsv` and the `.yml` file need to be in the same folder. +That's it. +For the sake of simplicity, in this vignette we dump all git2rdata objects at the root of the repository. -However, this might not be good idea for real project. We recommend to use at least a different directory tree for each import script. This directory can go into the root of a data only repository. It goes in the `data` directory in case of a data and code repository. Or the `inst` directory in case of an R package. +This might not be good idea for real project. +We recommend to use at least a different directory tree for each import script. +This directory can go into the root of a data repository. +It goes in the `data` directory in case of a data and code repository. +Or the `inst` directory in case of an R package. -Your project might need a different directory structure. Feel free to implement the most relevant data structure for your project. +Your project might need a different directory structure. +Feel free to choose the most relevant data structure for your project. ## Storing Dataframes _ad Hoc_ into a Git Repository ### First Commit -In the first commit we use `datasets::beaver1`. We connect to the git repository using `repository()`. Note that this assumes that `path` is an existing git repository. Now we can write the dataset as a git2rdata object in the repository. If the `root` argument of `write_vc()` is a `git_repository`, it gains two additional arguments: `stage` and `force`. Setting `stage = TRUE`, will automatically stage the files written by `write_vc()`. +In the first commit we use `datasets::beaver1`. +We connect to the git repository using `repository()`. +Note that this assumes that `path` is an existing git repository. +Now we can write the dataset as a git2rdata object in the repository. +If the `root` argument of `write_vc()` is a `git_repository`, it gains two extra arguments: `stage` and `force`. +Setting `stage = TRUE`, will automatically stage the files written by `write_vc()`. ```{r store_data_1} library(git2rdata) @@ -77,7 +98,8 @@ repo <- repository(path) fn <- write_vc(beaver1, "beaver", repo, sorting = "time", stage = TRUE) ``` -We can use `status()` to check that the required files are written and staged. Then we `commit()` the changes. +We can use `status()` to check that `write_vc()` wrote and staged the required files. +Then we `commit()` the changes. ```{r avoid_subsecond_commit, echo = FALSE} Sys.sleep(1.2) @@ -99,7 +121,9 @@ fn <- write_vc(beaver2, "extra_beaver", repo, sorting = "time", stage = TRUE) status(repo) ``` -Notice that `extra_beaver` is not listed in the `status()`, although it was written to the repository. The reason is that we set a `.gitignore` which contains `"*extra*`, so any git2rdata object with a name containing "extra" is ignored. We can force it to be staged by setting `force = TRUE`. +Notice that `extra_beaver` is not listed in the `status()`, although `write_vc()` wrote it to the repository. 
+The reason is that we set a `.gitignore` which contains `"*extra*`, so git ignores any git2rdata object with a name containing "extra". +We force git to stage it by setting `force = TRUE`. ```{r avoid_subsecond_commit2, echo = FALSE} Sys.sleep(1.2) @@ -115,7 +139,12 @@ cm2 <- commit(repo, message = "Second commit") ### Third Commit -At this point in time we decide that a single git2rdata object containing the data of both beavers is more relevant. We add an ID variable for each of the animals. This requires updating the `sorting` to eliminate ties. And `strict = FALSE` to update the metadata. The "extra_beaver" git2rdata object is no longer needed so we remove it. We use `all = TRUE` to stage the removal of "extra_beaver" while committing the changes. +Now we decide that a single git2rdata object containing the data of both beavers is more relevant. +We add an ID variable for each of the animals. +This requires updating the `sorting` to avoid ties. +And `strict = FALSE` to update the metadata. +The "extra_beaver" git2rdata object is no longer needed so we remove it. +We use `all = TRUE` to stage the removal of "extra_beaver" while committing the changes. ```{r avoid_subsecond_commit3, echo = FALSE} Sys.sleep(1.2) @@ -139,7 +168,9 @@ We strongly recommend to add git2rdata object through an import script instead o Old versions of the import script and the associated git2rdata remain available through the version control history. Remove obsolete git2rdata objects from the import script. This keeps both the import script and the working directory tidy and minimal. -Basically, the import script should create all git2rdata objects within a given directory tree. This gives the advantage that we start the import script by clearing any existing git2rdata object in this directory. Any git2rdata object which no longer is created by the import script gets removed without the need to track what git2rdata objects existed in the previous version. +Basically, the import script should create all git2rdata objects within a given directory tree. +This gives the advantage that we start the import script by clearing any existing git2rdata object in this directory. +If the import script no longer creates a git2rdata object, it gets removed without the need to track what git2rdata objects existed in the previous version. The brute force method of removing all files or all `.tsv` / `.yml` pairs is not a good idea. This removes the existing metadata which we need for efficient storage (see `vignette("efficiency", package = "git2rdata")`). A better solution is to use `rm_data()` on the directory at the start of the import script. This removes all `.tsv` files which have valid metadata. The existing metadata remains untouched at this point. @@ -178,9 +209,21 @@ push(repo = repo) ## R Package Workflow for Storing Dataframes -We recommend a two repository set-up in case of recurring analyses. These are relative stable analyses which have to run with some frequency on updated data (e.g. once a month). Then it is worthwhile to convert the analyses into an R package. Long scripts can be converted into a set of shorter functions which are much easier to document and maintain. An R package offers lots of [functionality](http://r-pkgs.had.co.nz/check.html) out of the box to check the quality of your code. - -The example below converts the import script above into a function. We illustrate how you can use Roxygen2 (see `vignette("roxygen2", package = "roxygen2")`) tags to document the function and to list its dependencies. 
Note that we added `session = TRUE` to `commit()`. This will append the `sessionInfo()` at the time of the commit to the commit message. Thus documenting all loaded R packages and their version. This documents to code used to create the git2rdata object since your analysis code resides in a dedicated package with its own version number. We strongly recommend to run the import from a fresh R session. Then the `sessionInfo()` at commit time is limited to those packages with are strictly required for the import. Consider running the import from the command line. e.g. `Rscript -e 'mypackage::import_body_temp("path/to/root")'`. +We recommend a two repository set-up in case of recurring analyses. +These are relative stable analyses which have to run with some frequency on updated data (e.g. once a month). +That makes it worthwhile to convert the analyses into an R package. +Split long scripts into a set of shorter functions which are much easier to document and maintain. +An R package offers lots of [functionality](http://r-pkgs.had.co.nz/check.html) out of the box to check the quality of your code. + +The example below converts the import script above into a function. +We illustrate how you can use Roxygen2 (see `vignette("roxygen2", package = "roxygen2")`) tags to document the function and to list its dependencies. +Note that we added `session = TRUE` to `commit()`. +This will append the `sessionInfo()` at the time of the commit to the commit message. +Thus documenting all loaded R packages and their version. +This documents to code used to create the git2rdata object since your analysis code resides in a dedicated package with its own version number. +We strongly recommend to run the import from a fresh R session. +Then the `sessionInfo()` at commit time contains those packages with are strictly required for the import. +Consider running the import from the command line. e.g. `Rscript -e 'mypackage::import_body_temp("path/to/root")'`. ```{r eval = FALSE} #' Import the beaver body temperature data @@ -215,7 +258,8 @@ import_body_temp <- function(path) { ## Analysis Workflow with Reproducible Data -The example below is a small trivial example of a standardized analysis in which the source of the data is documented by describing the name of the data, the repository URL and the commit. We can use this information when reporting the results. This makes the data underlying the results traceable. +The example below is a small trivial example of a standardized analysis in which documents the source of the data by describing the name of the data, the repository URL and the commit. +We can use this information when reporting the results. This makes the data underlying the results traceable. ```{r standardized_analysis} analysis <- function(ds_name, repo) { @@ -246,7 +290,7 @@ result <- lapply(current, report) junk <- lapply(result, print) ``` -The example below does exactly the same thing for the first and second commit. +The example below does the same thing for the first and second commit. ```{r run_previous_analyses, results = "asis"} # checkout first commit @@ -265,8 +309,14 @@ result <- lapply(previous, report) junk <- lapply(result, print) ``` -If you inspect the reported results carefully you'll notice that all the output (coefficients and commit hash) for "beaver" object is identical for the first and second commit. This makes sense since the "beaver" object didn't change during the second commit. The output for the current (third) commit is different because the dataset changed. 
+If you inspect the reported results you'll notice that all the output (coefficients and commit hash) for "beaver" object is identical for the first and second commit. +This makes sense since the "beaver" object didn't change during the second commit. +The output for the current (third) commit is different because the dataset changed. ### Long running analysis -Imagine the case where an individual analysis takes quite a while to run. We store the most recent version of each analysis and add the information from `recent_commit()`. When preparing the analysis, you can run `recent_commit()` again on the dataset and compare the commit hash with that one of the currently available analysis. If the commit hashes match, then the data hasn't changed. So there is no need to rerun the analysis^[assuming the code for running the analysis didn't change.], saving valuable computing resources and time. +Imagine the case where an individual analysis takes a while to run. +We store the most recent version of each analysis and add the information from `recent_commit()`. +When preparing the analysis, you can run `recent_commit()` again on the dataset and compare the commit hash with that one of the available analysis. +If the commit hashes match, then the data hasn't changed. +Then there is no need to rerun the analysis^[assuming the code for running the analysis didn't change.], saving valuable computing resources and time. From a63b7d0761f7547862c44141c64889098d12a74e Mon Sep 17 00:00:00 2001 From: Thierry Onkelinx Date: Thu, 14 Nov 2019 15:46:28 +0100 Subject: [PATCH 6/9] fix typo's --- NEWS.md | 4 ++-- README.md | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/NEWS.md b/NEWS.md index 2c6bfe0..84b70bc 100644 --- a/NEWS.md +++ b/NEWS.md @@ -5,7 +5,7 @@ git2rdata 0.2.0 (2019-11-08) * Calculation of data hash has changed (#53). You must use `upgrade_data()` to read data stored by an older version. - * `is_git2rdata()` and `upgrade_data()` do not test equality in data hashes anymore (but `read_vc()` still does). + * `is_git2rdata()` and `upgrade_data()` do not test equality in data hashes any more (but `read_vc()` still does). * `write_vc()` and `read_vc()` fail when `file` is a location outside of `root` (#50). * Reordering factor levels requires `strict = TRUE`. @@ -47,7 +47,7 @@ git2rdata 0.0.4 (2019-05-16) * `is_git2rmeta()` validates metadata. * `list_data()` lists files with valid metadata. * `rm_data()` and `prune_meta()` remove files with valid metadata. - They don't touch `tsv` file without metadata or `yml` files not assosiated with `git2rdata`. + They don't touch `tsv` file without metadata or `yml` files not associated with `git2rdata`. * Files with invalid metadata yield a warning with `list_data()`, `rm_data()` and `prune_meta()`. ### Bugfixes diff --git a/README.md b/README.md index 20935be..019b1a9 100644 --- a/README.md +++ b/README.md @@ -48,7 +48,7 @@ Although we envisioned `git2rdata` with a [git](https://git-scm.com/) workflow i - Date and date-time format are unambiguous, documented in the metadata. - The data and the metadata are in a standard and open format, making it readable by other software. - `git2rdata` checks the data and metadata during the reading. -`read_vc()` informes the user if there is tampering with the data or metadata. +`read_vc()` informs the user if there is tampering with the data or metadata. - Git2rdata integrates with the [`git2r`](https://cran.r-project.org/package=git2r) package for working with git repository from R. 
- Another option is using git2rdata solely for writing to disk and handle the plain text files with your favourite version control system outside of R. - The optimization reduces the required disk space by about 30% for both the working directory and the git history. From 56d05402c6538117e54bc291d807ab01ce906378 Mon Sep 17 00:00:00 2001 From: Thierry Onkelinx Date: Mon, 18 Nov 2019 11:38:14 +0100 Subject: [PATCH 7/9] update writing on version control vignette --- vignettes/version_control.Rmd | 79 +++++++++++++++++++++++++++-------- 1 file changed, 62 insertions(+), 17 deletions(-) diff --git a/vignettes/version_control.Rmd b/vignettes/version_control.Rmd index 7128556..51b9b2e 100644 --- a/vignettes/version_control.Rmd +++ b/vignettes/version_control.Rmd @@ -19,9 +19,12 @@ options(width = 83) ## Introduction -This vignette focuses on what `git2rdata` does to make storing dataframes under version control more efficient and convenient. All details on the actual file format are described in `vignette("plain_text", package = "git2rdata")`. Hence we will not discuss the `optimize` and `na` arguments to the `write_vc()` function. +This vignette focuses on what `git2rdata` does to make storing dataframes under version control more efficient and convenient. +`vignette("plain_text", package = "git2rdata")` describes all details on the actual file format. +Hence we will not discuss the `optimize` and `na` arguments to the `write_vc()` function. -We will not illustrate the efficiency of `write_vc()` and `read_vc()` since that is covered in `vignette("efficiency", package = "git2rdata")`. +We will not illustrate the efficiency of `write_vc()` and `read_vc()`. +`vignette("efficiency", package = "git2rdata")` covers those topics. ## Setup @@ -52,13 +55,22 @@ str(x) ## Assumptions -A critical assumption made by `git2rdata` is that all information is contained within the dataframe itself. Each row is an observation, each column is a variable and only the variables are named. This implies that two observations switching place does not alter the information content. Nor does switching two variables. +A critical assumption made by `git2rdata` is that the dataframe itself contains all information. +Each row is an observation, each column is a variable. +The dataframe has `colnames` but no `rownames`. +This implies that two observations switching place does not alter the information content. +Nor does switching two variables. -Version control systems like [git](https://git-scm.com/), [subversion](https://subversion.apache.org/) or [mercurial](https://www.mercurial-scm.org/) focus on accurately keeping track of _any_ change in the files. Two observations switching place in a plain text file _is_ a change, although the information content^[_sensu_ `git2rdata`] doesn't change. Therefore `git2rdata` helps the user to prepare the plain text files in such a way that any change in the version history is an actual change in the information content. +Version control systems like [git](https://git-scm.com/), [subversion](https://subversion.apache.org/) or [mercurial](https://www.mercurial-scm.org/) focus on accurately keeping track of _any_ change in the files. +Two observations switching place in a plain text file _is_ a change, although the information content^[_sensu_ `git2rdata`] doesn't change. +`git2rdata` helps the user to prepare the plain text files in such a way that any change in the version history is an actual change in the information content. 
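A quick, illustrative check of this assumption (a sketch rather than part of the original vignette; it reuses the `x` and `root` objects from the setup above, and the file name `assumption` is arbitrary): writing a row- and column-permuted copy of the same dataframe yields a byte-identical data file, so the version control system sees nothing to commit. The following sections explain the machinery that makes this work.

```{r assumption_sketch, eval = FALSE}
# Sketch: permuting rows and columns does not change the information content,
# so write_vc() writes a byte-identical data file for both versions.
library(git2rdata)
write_vc(x, file = "assumption", root = root, sorting = c("x", "abc"))
hash_original <- tools::md5sum(file.path(root, "assumption.tsv"))
shuffled <- x[sample(nrow(x)), sample(ncol(x))]
write_vc(shuffled, file = "assumption", root = root)
hash_shuffled <- tools::md5sum(file.path(root, "assumption.tsv"))
unname(hash_original) == unname(hash_shuffled)
```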
 ## Sorting Observations
 
-Version control systems often track changes in plain text files based on row based differences. In layman's terms they only record which lines in a file are removed and which lines are inserted at what location. Changing an existing line implies removing the old version and inserting the new one. This is illustrated in the minimal example below.
+Version control systems often track changes in plain text files based on row based differences.
+In layman's terms, they record which lines are removed from the file and which lines are inserted at what location.
+Changing an existing line implies removing the old version and inserting the new one.
+The minimal example below illustrates this.
 
 Original version
 
@@ -69,7 +81,9 @@ A,B
 3,12
 ```
 
-Altered version. The row containing `1, 10` was moved to the last line. The row containing `3,12` was changed to `3,0`
+Altered version.
+The row containing `1,10` moves to the last line.
+The row containing `3,12` changes to `3,0`.
 
 ```
 A,B
@@ -108,7 +122,12 @@ A,B
 +3,0
 ```
 
-This is where the `sorting` argument comes into play. If this argument is not provided when a file is written for the first time, it will yield a warning about the lack of sorting. The observations will be written in their current order. New versions of the file will not apply any sorting either, leaving this burden to the user. This is illustrated by the changed hash for the data file in the example below, whereas the metadata is not changed (no change in hash).
+This is where the `sorting` argument comes into play.
+If this argument is not provided when writing a file for the first time, it will yield a warning about the lack of sorting.
+`write_vc()` then writes the observations in their current order.
+New versions of the file will not apply any sorting either, leaving this burden to the user.
+The changed hash for the data file illustrates this in the example below.
+The metadata hash remains the same.
 
 ```{r row_order}
 library(git2rdata)
@@ -116,9 +135,13 @@ write_vc(x, file = "row_order", root = root)
 write_vc(x[sample(nrow(x)), ], file = "row_order", root = root)
 ```
 
-`sorting` should contain a vector of variable names. The observations are automatically sorted along these variables prior to writing. However, we now get an error because the set of sorting variables has changed. The set of sorting variables is stored in the metadata. Changing the sorting can potentially lead to large diffs, which `git2rdata` tries to avoid as much as possible.
+`sorting` should contain a vector of variable names.
+The observations are automatically sorted along these variables.
+Now we get an error because the set of sorting variables has changed.
+The metadata stores the set of sorting variables.
+Changing the sorting can potentially lead to large diffs, which `git2rdata` tries to avoid as much as possible.
 
-From this moment on we will store the output of `write_vc()` in an object to minimize the output.
+From this moment on we will store the output of `write_vc()` in an object to reduce the output.
 
 ```{r apply_sorting, error = TRUE}
 fn <- write_vc(x, "row_order", root, sorting = "y")
@@ -131,7 +154,10 @@ fn <- write_vc(x, "row_order", root, sorting = "y", strict = FALSE)
 fn <- write_vc(x, "row_order", root, sorting = c("y", "x"), strict = FALSE)
 ```
 
-Once the sorting is defined we may omit the `sorting` argument when writing new versions. The sorting as defined in the existing metadata will be used to sort the observations. A check for potential ties will be performed and results in a warning when ties are found.
+Once we have defined the sorting, we may omit the `sorting` argument when writing new versions.
+`write_vc()` uses the sorting as defined in the existing metadata.
+It checks for potential ties.
+Ties result in a warning.
 
 ```{r update_sorted}
 print_file <- function(file, root, n = -1) {
@@ -156,7 +182,10 @@ B,A
 13,3
 ```
 
-The resulting diff is maximal because every single row was updated. Yet none of the information was changed. Hence, it is crucial to maintain column order when storing dataframes as plain text files under version control. This is illustrated on a more realistic data set in the `vignette("efficiency", package = "git2rdata")` vignette.
+The resulting diff is maximal because every single row changed.
+Yet none of the information changed.
+Hence, maintaining column order is crucial when storing dataframes as plain text files under version control.
+The `vignette("efficiency", package = "git2rdata")` vignette illustrates this on a more realistic data set.
 
 ```diff
 -A,B
@@ -169,7 +198,11 @@ The resulting diff is maximal because every single row was updated. Yet none of
 +13,3
 ```
 
-`git2rdata` tackles this problem by storing the order of the columns in the metadata. The order is defined by the order in the dataframe when it is written for the first time. From that moment on, the same order will be reused. The example below writes the same data set twice. The second version contains exactly the same information but randomizes the order of the observations and the columns. The sorting by the internals of `write_vc()` will undo this randomization, resulting in an unchanged file.
+When `write_vc()` writes a dataframe for the first time, it stores the original order of the columns in the metadata.
+From that moment on, `write_vc()` uses the order stored in the metadata.
+The example below writes the same data set twice.
+The second version contains identical information but randomizes the order of the observations and the columns.
+The sorting by the internals of `write_vc()` will undo this randomization, resulting in an unchanged file.
 
 ```{r variable_order}
 write_vc(x, "column_order", root, sorting = c("x", "abc"))
@@ -180,7 +213,8 @@ print_file("column_order.tsv", root, n = 5)
 
 ## Handling Factors Optimized
 
-`vignette("plain_text", package = "git2rdata")` and `vignette("efficiency", package = "git2rdata")` illustrate how a factor can be stored more efficiently when storing their index in the data file and the indices and labels in the metadata. We take this even a bit further: what happens if new data arrives and an extra factor level is required?
+`vignette("plain_text", package = "git2rdata")` and `vignette("efficiency", package = "git2rdata")` illustrate how we can store a factor more efficiently by storing its index in the data file and the indices and labels in the metadata.
+We take this even a bit further: what happens if new data arrives and we need an extra factor level?
 
 ```{r factor}
 old <- data.frame(color = c("red", "blue"))
@@ -204,7 +238,9 @@ fn <- write_vc(updated, "factor", root, strict = FALSE)
 print_file("factor.yml", root)
 ```
 
-The next example removes the `"blue"` level and switches the order of the remaining levels. Notice that again the existing indices are retained. The order of the labels and indices reflects their new ordering.
+The next example removes the `"blue"` level and switches the order of the remaining levels.
+Notice that the metadata retains the existing indices.
+The order of the labels and indices reflects their new ordering.
 
 ```{r factor_deleted}
 deleted <- data.frame(
@@ -224,7 +260,9 @@ print_file("factor.yml", root)
 
 ## Relabelling a Factor
 
-The example below will store a dataframe, relabel the factor levels and store it again using `write_vc()`. Notice that both the labels and the indices are updated. Hence creating a large diff, where just updating the labels would be sufficient.
+The example below will store a dataframe, relabel the factor levels and store it again using `write_vc()`.
+Notice the update of both the labels and the indices.
+Hence it creates a large diff, where updating only the labels would suffice.
 
 ```{r}
 write_vc(old, "write_vc", root, sorting = "color")
@@ -236,7 +274,12 @@ write_vc(relabeled, "write_vc", root, strict = FALSE)
 print_file("write_vc.yml", root)
 ```
 
-Therefore we created `relabel()`, which changes only the labels in the metadata. It takes three arguments: the name of the data file, the root and the change. `change` accepts two formats, a list or a dataframe. The name of the list must match with the variable name of a factor in the data. Each element of the list must be a named vector, the name being the existing label and the value the new label. The dataframe format requires a `factor`, `old` and `new` variable with one row for each change in label.
+We created `relabel()`, which changes the labels in the metadata while maintaining their indices.
+It takes three arguments: the name of the data file, the root and the change.
+`change` accepts two formats, a list or a dataframe.
+The name of the list must match with the variable name of a factor in the data.
+Each element of the list must be a named vector, the name being the existing label and the value the new label.
+The dataframe format requires a `factor`, `old` and `new` variable with one row for each change in label.
 
 ```{r}
 write_vc(old, "relabel", root, sorting = "color")
@@ -247,4 +290,6 @@ relabel("relabel", root,
 print_file("relabel.yml", root)
 ```
 
-A _caveat_: `relabel()` only makes sense when the data file uses optimized storage. The verbose mode stores the factor labels and not their indices, in which case relabelling a label will always yield a large diff. Therefore `relabel()` will only handle the optimized storage.
+A _caveat_: `relabel()` does not make sense when the data file uses verbose storage.
+The verbose mode stores the factor labels and not their indices, in which case relabelling a label will always yield a large diff.
+Hence, `relabel()` requires the optimized storage.
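
To make the two `change` formats above concrete, here is a minimal sketch. It assumes a `root` and a `"relabel"` data file as in the preceding example, with a factor variable `color` that still carries the labels `"red"` and `"blue"`; the new labels (`"crimson"`, `"navy"`, `"scarlet"`) are invented for illustration.

```r
library(git2rdata)

# list format: the list element is named after the factor variable;
# each vector name is an existing label, each value the new label
relabel(
  "relabel", root,
  change = list(color = c(red = "crimson", blue = "navy"))
)

# dataframe format: one row per label change, with `factor`, `old` and `new`
relabel(
  "relabel", root,
  change = data.frame(
    factor = "color",
    old = "crimson",
    new = "scarlet",
    stringsAsFactors = FALSE
  )
)
```
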
From 4c73227cc9638ef826e901ccd3348961c0e26f65 Mon Sep 17 00:00:00 2001
From: Thierry Onkelinx
Date: Mon, 18 Nov 2019 12:12:42 +0100
Subject: [PATCH 8/9] add reminder to devtools::release()

---
 DESCRIPTION | 1 +
 R/utils.R   | 6 ++++++
 2 files changed, 7 insertions(+)
 create mode 100644 R/utils.R

diff --git a/DESCRIPTION b/DESCRIPTION
index 2dd70bd..bf694ed 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -57,5 +57,6 @@ Collate:
     'reexport.R'
     'relabel.R'
     'upgrade_data.R'
+    'utils.R'
 VignetteBuilder: knitr
 Language: en-GB
diff --git a/R/utils.R b/R/utils.R
new file mode 100644
index 0000000..fb6a3ed
--- /dev/null
+++ b/R/utils.R
@@ -0,0 +1,6 @@
+# gramr is available from https://github.com/ropenscilabs/gramr
+release_questions <- function() {
+  c(
+    'Did you run `gramr::check_project(exclude_chunks = TRUE)`?'
+  )
+}

From ea5cf9e01e9b02af346e274ee69d1fe8beb2203b Mon Sep 17 00:00:00 2001
From: Thierry Onkelinx
Date: Mon, 18 Nov 2019 12:12:59 +0100
Subject: [PATCH 9/9] bump package version

---
 DESCRIPTION      | 2 +-
 codemeta.json    | 4 ++--
 cran-comments.md | 8 +++-----
 3 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/DESCRIPTION b/DESCRIPTION
index bf694ed..301a4c0 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: git2rdata
 Title: Store and Retrieve Data.frames in a Git Repository
-Version: 0.1.0.9003
+Version: 0.2.0
 Authors@R: c(
   person(
     "Thierry", "Onkelinx", role = c("aut", "cre"),
diff --git a/codemeta.json b/codemeta.json
index 28870e3..f4a37be 100644
--- a/codemeta.json
+++ b/codemeta.json
@@ -14,7 +14,7 @@
   ],
   "issueTracker": "https://github.com/ropensci/git2rdata/issues",
   "license": "https://spdx.org/licenses/GPL-3.0",
-  "version": "0.1.0.9002",
+  "version": "0.2.0",
   "programmingLanguage": {
     "@type": "ComputerLanguage",
     "name": "R",
@@ -203,7 +203,7 @@
   ],
   "releaseNotes": "https://github.com/ropensci/git2rdata/blob/master/NEWS.md",
   "readme": "https://github.com/ropensci/git2rdata/blob/master/README.md",
-  "fileSize": "362.855KB",
+  "fileSize": "341.663KB",
   "contIntegration": [
     "https://travis-ci.org/inbo/git2rdata",
     "https://ci.appveyor.com/project/ThierryO/git2rdata/branch/master",
diff --git a/cran-comments.md b/cran-comments.md
index 608e1ee..b942d48 100644
--- a/cran-comments.md
+++ b/cran-comments.md
@@ -1,12 +1,12 @@
 ## Test environments
 
 * local
-  * ubuntu 18.04, R 3.6.0
+  * ubuntu 18.04.3 LTS, R 3.6.1
 * travis-ci
   * trusty, oldrel
   * xenial, release and devel
   * osx, release
 * AppVeyor
-  * Windows Server 2012, R 3.6.0 Patched
+  * Windows Server 2012 R2 x64, R 3.6.1 Patched
 * r-hub
   * Windows Server 2008 R2 SP1, R-devel, 32/64 bit
   * Ubuntu Linux 16.04 LTS, R-release, GCC
@@ -14,6 +14,4 @@
 
 ## R CMD check results
 
-0 errors | 0 warnings | 1 note
-
-* This is a new release.
+0 errors | 0 warnings | 0 notes
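
As background for the `release_questions()` helper added in PATCH 8/9: `devtools::release()` picks up a non-exported function with that name from the package and asks the questions it returns before submitting to CRAN. A quick, illustrative way to preview the extra question from a development checkout (assuming `pkgload` is installed and the working directory is the package root):

```r
# load the development version, then call the unexported helper
pkgload::load_all()
git2rdata:::release_questions()
#> [1] "Did you run `gramr::check_project(exclude_chunks = TRUE)`?"
```
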