Release #56

Merged 9 commits on Nov 18, 2019
3 changes: 2 additions & 1 deletion DESCRIPTION
@@ -1,6 +1,6 @@
Package: git2rdata
Title: Store and Retrieve Data.frames in a Git Repository
Version: 0.1.0.9003
Version: 0.2.0
Authors@R: c(
person(
"Thierry", "Onkelinx", role = c("aut", "cre"),
@@ -57,5 +57,6 @@ Collate:
'reexport.R'
'relabel.R'
'upgrade_data.R'
'utils.R'
VignetteBuilder: knitr
Language: en-GB
43 changes: 13 additions & 30 deletions NEWS.md
@@ -1,42 +1,24 @@
git2rdata 0.1.0.9003 (2019-11-07)
git2rdata 0.2.0 (2019-11-08)
=================================

### BREAKING FEATURES

* reordering factor levels requires `strict = TRUE`

git2rdata 0.1.0.9002 (2019-09-27)
=================================

### BREAKING FEATURES

* sorting is based on the "C" locale
* the data hash is based on the plain text file

git2rdata 0.1.0.9001 (2019-09-09)
=================================

### BREAKING FEATURES

* Calculation of data hash has changed, due to which `read_vc()` will once warn that data are altered outside git2rdata when reading a previously written git2rdata object (#53).
* `read_vc()` only works with data stored with version >= 0.1.0.9001. Use `upgrade_data()` on data written with an earlier version.
* `is_git2rdata()` and `upgrade_data()` do not test equality in data hashes anymore (but `read_vc()` still does).
* Calculation of data hash has changed (#53).
You must use `upgrade_data()` to read data stored by an older version.
* `is_git2rdata()` and `upgrade_data()` do not test equality in data hashes any more (but `read_vc()` still does).
* `write_vc()` and `read_vc()` fail when `file` is a location outside of `root` (#50).
* Reordering factor levels requires `strict = TRUE`.

### Bugfixes

* The same data hash is generated on Linux and Windows machines (#49).

git2rdata 0.1.0.9000 (2019-08-13)
=================================

### BREAKING FEATURES

* `write_vc()` and `read_vc()` fail when `file` is a location outside of `root` (#50).
* Linux and Windows machines now generate the same data hash (#49).

### NEW FEATURES

* Only require `upgrade_data()` for data written with versions prior to 0.0.5 (#44).
* Improve warnings() and error().
* Internal sorting uses the "C" locale, regardless of the current locale.
* `read_vc()` reads data stored by an older version (#44).
When the version is too old, it prompts you to run `upgrade_data()`.
* Improve warning and error messages.
* Use vector version of logo.
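
The locale-independent ordering mentioned in the list above can be illustrated in base R. This is a minimal sketch and not `git2rdata`'s internal code: it assumes that `sort(method = "radix")`, which base R documents as ignoring the locale's collation rules, is an acceptable stand-in for sorting in the "C" locale. The input values are hypothetical.

```r
# Hypothetical example values; not taken from the package.
x <- c("b", "A", "a", "B")

# Radix sorting ignores the current locale and orders characters as in the
# "C" locale, so every upper-case letter comes before every lower-case letter.
c_sorted <- sort(x, method = "radix")
print(c_sorted)
```

Because the result does not depend on the locale of the machine, two collaborators with different system settings would write rows in the same order.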

git2rdata 0.1 (2019-06-04)
@@ -64,7 +46,8 @@ git2rdata 0.0.4 (2019-05-16)
* The meta data gains a data hash. A mismatch throws a warning when reading the object. This tolerates updating the data by other software, while informing the user that such change occurred.
* `is_git2rmeta()` validates metadata.
* `list_data()` lists files with valid metadata.
* `rm_data()` and `prune_meta()` remove files with valid metadata. Other files are untouched.
* `rm_data()` and `prune_meta()` remove files with valid metadata.
They don't touch `tsv` files without metadata or `yml` files not associated with `git2rdata`.
* Files with invalid metadata yield a warning with `list_data()`, `rm_data()` and `prune_meta()`.
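
The data hash check described in the list above can be sketched with base R alone. This is an illustration under stated assumptions, not the package's implementation: the MD5 algorithm via `tools::md5sum()`, the file contents, and the variable names are all chosen for the example.

```r
# Write a small tab-separated file and remember its hash,
# as a stand-in for the hash kept in the metadata (assumption: MD5).
path <- tempfile(fileext = ".tsv")
writeLines(c("id\tvalue", "1\t10"), path)
stored_hash <- unname(tools::md5sum(path))

# Another program appends a row without updating the stored hash.
cat("2\t20\n", file = path, append = TRUE)

# On reading, recompute the hash and compare it with the stored one.
current_hash <- unname(tools::md5sum(path))
altered <- !identical(stored_hash, current_hash)
if (altered) {
  warning("data file changed since the hash was stored")
}
```

A warning rather than an error keeps the data readable after an edit by other software, while still telling the user that the change happened.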

### Bugfixes
6 changes: 6 additions & 0 deletions R/utils.R
@@ -0,0 +1,6 @@
# gramr is available from https://github.com/ropenscilabs/gramr
release_questions <- function() {
  c(
    'Did you run `gramr::check_project(exclude_chunks = TRUE)`?'
  )
}
63 changes: 45 additions & 18 deletions README.md
@@ -22,28 +22,42 @@

## Rationale

The `git2rdata` package is an R package for writing and reading dataframes as plain text files. Important information is stored in a metadata file.

1. Storing metadata allows to maintain the classes of variables. By default, the data is optimized for file storage prior to writing. The optimization is most effective on data containing factors. The optimization makes the data less human readable and can be turned off. Details on the implementation are available in `vignette("plain_text", package = "git2rdata")`.
1. Storing metadata also allows to minimize row based [diffs](https://en.wikipedia.org/wiki/Diff) between two consecutive [commits](https://en.wikipedia.org/wiki/Commit_(version_control)). This is a useful feature when storing data as plain text files under version control. Details on this part of the implementation are available in `vignette("version_control", package = "git2rdata")`. Although `git2rdata` was envisioned with a [git](https://git-scm.com/) workflow in mind, it can also be used in combination with other version control systems like [subversion](https://subversion.apache.org/) or [mercurial](https://www.mercurial-scm.org/).
1. `git2rdata` is intended to facilitate a reproducible and traceable workflow. A toy example is given in `vignette("workflow", package = "git2rdata")`.
1. `vignette("efficiency", package = "git2rdata")` provides some insight into the efficiency in terms of file storage, git repository size and speed for writing and reading.
The `git2rdata` package is an R package for writing and reading dataframes as plain text files.
A metadata file stores important information.

1. Storing metadata makes it possible to maintain the classes of variables.
By default, `git2rdata` optimizes the data for file storage.
The optimization is most effective on data containing factors.
The optimization makes the data less human readable.
The user can turn this off when they prefer a human readable format over smaller files.
Details on the implementation are available in `vignette("plain_text", package = "git2rdata")`.
1. Storing metadata also allows smaller row based [diffs](https://en.wikipedia.org/wiki/Diff) between two consecutive [commits](https://en.wikipedia.org/wiki/Commit_(version_control)).
This is a useful feature when storing data as plain text files under version control.
Details on this part of the implementation are available in `vignette("version_control", package = "git2rdata")`.
Although we envisioned `git2rdata` with a [git](https://git-scm.com/) workflow in mind, you can use it in combination with other version control systems like [subversion](https://subversion.apache.org/) or [mercurial](https://www.mercurial-scm.org/).
1. `git2rdata` is a useful tool in a reproducible and traceable workflow.
`vignette("workflow", package = "git2rdata")` gives a toy example.
1. `vignette("efficiency", package = "git2rdata")` provides some insight into the efficiency of file storage, git repository size and speed for writing and reading.
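
The first point in the list above, keeping variable classes by storing metadata, can be sketched in base R for a factor. This is a simplified illustration, not the on-disk format of `git2rdata` (the `plain_text` vignette documents that): the data, the `metadata` list, and the variable names are hypothetical.

```r
# Hypothetical data; the level order deliberately differs from alphabetical.
species <- factor(
  c("setosa", "virginica", "setosa"),
  levels = c("virginica", "setosa")
)

# Compact form: integer codes go into the data,
# the level labels and their order go into the metadata.
codes <- as.integer(species)
metadata <- list(levels = levels(species))

# Reading back: combine the codes with the metadata to rebuild the factor.
restored <- factor(metadata$levels[codes], levels = metadata$levels)
stopifnot(identical(restored, species))
```

The integer codes are shorter than the labels, which is why such an encoding pays off most on data with factors, at the cost of a data file that is harder to read by eye.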

## Why Use Git2rdata?

- You can store dataframes as plain text files.
- The dataframe you read has exactly the same information content as the one you wrote.
- The dataframe you read has identical information content to the one you wrote.
- No changes in data type.
- Factors keep their original levels, including their order.
- Date and date-time are stored in an unambiguous format, documented in the metadata.
- The data and the metadata are stored in a standard and open format, making it readable by other software.
- Data and metadata are checked during the reading. The user is informed if there is tampering with the data or metadata.
- Date and date-time formats are unambiguous and documented in the metadata.
- The data and the metadata are in a standard and open format, making it readable by other software.
- `git2rdata` checks the data and metadata during the reading.
`read_vc()` informs the user if there is tampering with the data or metadata.
- Git2rdata integrates with the [`git2r`](https://cran.r-project.org/package=git2r) package for working with git repositories from R.
- Another option is using git2rdata solely for writing to disk and handling the plain text files with your favourite version control system outside of R.
- The optimization reduces the required disk space by about 30% for both the working directory and the git history.
- Reading data from a HDD is 30% faster than `read.table()`, writing to a HDD takes about 70% more time than `write.table()`.
- Git2rdata is useful as a tool in a reproducible and traceable workflow. See `vignette("workflow", package = "git2rdata")`.
- You can detect when a file was last modified in the git history. Use this to check whether an existing analysis is obsolete due to new data. This allows to not rerun up to date analyses, saving resources.
- Git2rdata is useful as a tool in a reproducible and traceable workflow.
See `vignette("workflow", package = "git2rdata")`.
- You can detect when a file was last modified in the git history.
Use this to check whether an existing analysis is obsolete due to new data.
This lets you skip rerunning up-to-date analyses, saving resources.

## Talk About `git2rdata` at useR!2019 in Toulouse, France

@@ -74,9 +88,14 @@ remotes::install_github(
remotes::install_github("ropensci/git2rdata")
```

## Usage in a Nutshell
## Usage in Brief

Dataframes are stored using `write_vc()` and retrieved with `read_vc()`. Both functions share the arguments `root` and `file`. `root` refers to a base location where the dataframe should be stored. It can either point to a local directory or a local git repository. `file` is the file name to use and can include a path relative to `root`. Make sure the relative path stays within `root`.
The user stores dataframes with `write_vc()` and retrieves them with `read_vc()`.
Both functions share the arguments `root` and `file`.
`root` refers to the base location in which to store the dataframe.
It can either point to a local directory or a local git repository.
`file` is the file name to use and can include a path relative to `root`.
Make sure the relative path stays within `root`.
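
The requirement that `file` stays within `root` can be checked before calling `write_vc()`. The helper below is a hypothetical sketch in base R, not the check that `git2rdata` performs internally: the name `stays_within_root()` and its rules for absolute paths are assumptions made for the example. It looks only at the path text and does not resolve symbolic links.

```r
# Hypothetical helper, not part of git2rdata.
stays_within_root <- function(file) {
  # Absolute paths (Unix "/..." or Windows "C:...") ignore `root` altogether.
  if (grepl("^([/\\\\]|[A-Za-z]:)", file)) {
    return(FALSE)
  }
  # Split on forward or backward slashes; drop empty and "." components.
  parts <- strsplit(file, "[/\\\\]")[[1]]
  parts <- parts[!parts %in% c("", ".")]
  # Each ".." climbs one level, every other component descends one level.
  depth <- cumsum(ifelse(parts == "..", -1L, 1L))
  # The path stays inside `root` when the depth never drops below zero.
  length(depth) > 0 && all(depth >= 0)
}

stays_within_root("data/iris")    # inside root
stays_within_root("../iris")      # escapes root
```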

```r
# using a local directory
@@ -104,9 +123,14 @@ Please read `vignette("version_control", package = "git2rdata")` for more detail

## What Data Sizes Can Git2rdata Handle?

The recommendation for git repositories is to use files smaller than 100 MiB, an overall repository size less than 1 GiB and less than 25k files. The individual file size is the limiting factor. Storing the airbag dataset ([`DAAG::nassCDS`](https://cran.r-project.org/package=DAAG)) with `write_vc()` requires on average 68 (optimized) or 97 (verbose) byte per record. The 100 MiB file limit for this data is reached after about 1.5 million (optimize) or 1 million (verbose) observations.
The recommendation for git repositories is to use files smaller than 100 MiB, a repository size less than 1 GiB and fewer than 25k files.
The individual file size is the limiting factor.
Storing the airbag dataset ([`DAAG::nassCDS`](https://cran.r-project.org/package=DAAG)) with `write_vc()` requires on average 68 (optimized) or 97 (verbose) bytes per record.
The file reaches the 100 MiB limit for this data after about 1.5 million (optimized) or 1 million (verbose) observations.
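
The observation counts above follow from dividing the file size limit by the average record size. The sketch below redoes that arithmetic in base R with the figures quoted in the text; the variable names are illustrative only.

```r
limit_bytes <- 100 * 1024^2   # 100 MiB expressed in bytes
bytes_per_record <- c(optimized = 68, verbose = 97)

# Largest number of whole records that fits below the limit.
max_records <- floor(limit_bytes / bytes_per_record)
print(max_records)
```

This gives roughly 1.54 million records for the optimized format and 1.08 million for the verbose format, which matches the rounded numbers in the text.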

Storing a 90% random subset of the airbag dataset requires 370 kiB (optimized) or 400 kiB (verbose) storage in the git history. Updating the dataset with other 90% random subsets requires on average 60 kiB (optimized) to 100 kiB (verbose) per commit. The git history limit of 1 GiB will be reached after 17k (optimized) to 10k (verbose) commits.
Storing a 90% random subset of the airbag dataset requires 370 kiB (optimized) or 400 kiB (verbose) storage in the git history.
Updating the dataset with other 90% random subsets requires on average 60 kiB (optimized) to 100 kiB (verbose) per commit.
The git history reaches the limit of 1 GiB after 17k (optimized) to 10k (verbose) commits.
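
The commit counts can be reproduced the same way: subtract the size of the first commit from the 1 GiB limit and divide the remainder by the average growth per commit. This is a sketch using only the figures quoted in the text, with illustrative variable names.

```r
limit_kib <- 1024^2   # 1 GiB expressed in kiB
initial_kib <- c(optimized = 370, verbose = 400)
per_commit_kib <- c(optimized = 60, verbose = 100)

# Number of update commits that fit after the initial commit.
max_commits <- floor((limit_kib - initial_kib) / per_commit_kib)
print(max_commits)
```

The result is about 17.5k commits for the optimized format and about 10.5k for the verbose format, in line with the 17k and 10k stated in the text.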

Your mileage might vary.

@@ -122,7 +146,7 @@ Please use the output of `citation("git2rdata")`
- `testthat`: R scripts with unit tests using the [testthat](http://testthat.r-lib.org/) framework
- `vignettes`: source code for the vignettes describing the package
- `man-roxygen`: templates for documentation in Roxygen format
- `pkgdown`: additional source files for the `git2rdata` [website](https://ropensci.github.io/git2rdata/)
- `pkgdown`: source files for the `git2rdata` [website](https://ropensci.github.io/git2rdata/)
- `.github`: guidelines and templates for contributors

```
@@ -141,6 +165,9 @@ git2rdata

## Contributions

Contributions to `git2rdata` are welcome. Please read our [Contributing guidelines](https://github.com/ropensci/git2rdata/blob/master/.github/CONTRIBUTING.md) first. The `git2rdata` project is released with a [Contributor Code of Conduct](https://github.com/ropensci/git2rdata/blob/master/.github/CODE_OF_CONDUCT.md). By contributing to this project, you agree to abide by its terms.
`git2rdata` welcomes contributions.
Please read our [Contributing guidelines](https://github.com/ropensci/git2rdata/blob/master/.github/CONTRIBUTING.md) first.
The `git2rdata` project has a [Contributor Code of Conduct](https://github.com/ropensci/git2rdata/blob/master/.github/CODE_OF_CONDUCT.md).
By contributing to this project, you agree to abide by its terms.

[![rOpenSci footer](http://ropensci.org/public_images/github_footer.png)](https://ropensci.org)
4 changes: 2 additions & 2 deletions codemeta.json
@@ -14,7 +14,7 @@
],
"issueTracker": "https://github.com/ropensci/git2rdata/issues",
"license": "https://spdx.org/licenses/GPL-3.0",
"version": "0.1.0.9002",
"version": "0.2.0",
"programmingLanguage": {
"@type": "ComputerLanguage",
"name": "R",
@@ -203,7 +203,7 @@
],
"releaseNotes": "https://github.com/ropensci/git2rdata/blob/master/NEWS.md",
"readme": "https://github.com/ropensci/git2rdata/blob/master/README.md",
"fileSize": "362.855KB",
"fileSize": "341.663KB",
"contIntegration": [
"https://travis-ci.org/inbo/git2rdata",
"https://ci.appveyor.com/project/ThierryO/git2rdata/branch/master",
8 changes: 3 additions & 5 deletions cran-comments.md
Original file line number Diff line number Diff line change
@@ -1,19 +1,17 @@
## Test environments
* local
* ubuntu 18.04, R 3.6.0
* ubuntu 18.04.3 LTS, R 3.6.1
* travis-ci
* trusty, oldrel
* xenial, release and devel
* osx, release
* AppVeyor
* Windows Server 2012, R 3.6.0 Patched
* Windows Server 2012 R2 x64, R 3.6.1 Patched
* r-hub
* Windows Server 2008 R2 SP1, R-devel, 32/64 bit
* Ubuntu Linux 16.04 LTS, R-release, GCC
* Fedora Linux, R-devel, clang, gfortran

## R CMD check results

0 errors | 0 warnings | 1 note

* This is a new release.
0 errors | 0 warnings | 0 notes