Merge pull request #64 from ropensci/0.3.1

0.3.1
ropensci · Jan 20, 2021 · c0fb058
2 parents 20762c5 + 827848a
commit c0fb058

Showing 8 changed files with 91 additions and 34 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: git2rdata
 Title: Store and Retrieve Data.frames in a Git Repository
-Version: 0.3.0
+Version: 0.3.1
 Authors@R: 
     c(person(given = "Thierry",
              family = "Onkelinx",
@@ -25,11 +25,29 @@ Authors@R:
       person(given = "Research Institute for Nature and Forest",
              role = c("cph", "fnd"),
              email = "[email protected]"))
-Description: Make versioning of data.frame easy and efficient using git
-    repositories.
+Description: The git2rdata package is an R package for writing and reading
+    dataframes as plain text files.  A metadata file stores important
+    information.  1) Storing metadata allows to maintain the classes of
+    variables.  By default, git2rdata optimizes the data for file storage.
+    The optimization is most effective on data containing factors.  The
+    optimization makes the data less human readable.  The user can turn
+    this off when they prefer a human readable format over smaller files.
+    Details on the implementation are available in vignette("plain_text",
+    package = "git2rdata").  2) Storing metadata also allows smaller row
+    based diffs between two consecutive commits.  This is a useful feature
+    when storing data as plain text files under version control.  Details
+    on this part of the implementation are available in
+    vignette("version_control", package = "git2rdata").  Although we
+    envisioned git2rdata with a git workflow in mind, you can use it in
+    combination with other version control systems like subversion or
+    mercurial.  3) git2rdata is a useful tool in a reproducible and
+    traceable workflow.  vignette("workflow", package = "git2rdata") gives
+    a toy example.  4) vignette("efficiency", package = "git2rdata")
+    provides some insight into the efficiency of file storage, git
+    repository size and speed for writing and reading.  Please cite using
+    <doi:10.5281/zenodo.1485309>.
 License: GPL-3
-URL: https://github.com/ropensci/git2rdata,
-    https://doi.org/10.5281/zenodo.1485309
+URL: https://ropensci.github.io/git2rdata/
 BugReports: https://github.com/ropensci/git2rdata/issues
 Depends: 
     R (>= 3.5.0)

diff --git a/NEWS.md b/NEWS.md
@@ -1,3 +1,7 @@
+# git2rdata 0.3.1
+
+* Use `icuSetCollate()` to define a standardised sorting.
+
 # git2rdata 0.3.0
 
 ## New features
@@ -14,7 +18,7 @@
 
 # git2rdata 0.2.2
 
-* Use the [checklist](https://inbo.github.io/checklist) package for CI.
+* Use the [checklist](https://packages.inbo.be/checklist/) package for CI.
 
 # git2rdata 0.2.1
 
@@ -32,8 +36,8 @@
 
 * Calculation of data hash has changed (#53). 
   You must use `upgrade_data()` to read data stored by an older version.
-* `is_git2rdata()` and `upgrade_data()` do not test equality in data hashes 
-  anymore (but `read_vc()` still does).
+* `is_git2rdata()` and `upgrade_data()` no longer test equality in data
+  hashes (but `read_vc()` still does).
 * `write_vc()` and `read_vc()` fail when `file` is a location outside of `root`
   (#50).
 * Reordering factor levels requires `strict = TRUE`.

diff --git a/R/datahash.R b/R/datahash.R
@@ -50,22 +50,18 @@ datahash <- function(file) {
 #' @noRd
 #' @return a named vector with the old locale
 set_c_locale <- function() {
-  old_ctype <- Sys.getlocale(category = "LC_CTYPE")
-  old_collate <- Sys.getlocale(category = "LC_COLLATE")
-  old_time <- Sys.getlocale(category = "LC_TIME")
-  Sys.setlocale(category = "LC_CTYPE", locale = "C")
-  Sys.setlocale(category = "LC_COLLATE", locale = "C")
-  Sys.setlocale(category = "LC_TIME", locale = "C")
-  return(c(ctype = old_ctype, collate = old_collate, time = old_time))
+  icuSetCollate(
+    locale = "en_GB", case_first = "lower", normalization = "on",
+    case_level = "on"
+  )
+  return(c())
 }
 
 #' Reset the old locale
 #' @param locale the output of `set_c_locale()`
 #' @return invisible `NULL`
 #' @noRd
 set_local_locale <- function(locale) {
-  Sys.setlocale(category = "LC_CTYPE", locale = locale["ctype"])
-  Sys.setlocale(category = "LC_COLLATE", locale = locale["collate"])
-  Sys.setlocale(category = "LC_TIME", locale = locale["time"])
+  icuSetCollate(locale = "default")
   return(invisible(NULL))
 }
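
The hunk above replaces the `Sys.setlocale()` calls with a fixed ICU collator, so that sorting does not depend on the locale of the machine. The sketch below shows one way to apply the same collation settings around a single `sort()` call. `sort_standardised()` is a hypothetical helper and not part of git2rdata, and the fallback to `method = "radix"` (C-locale order) when ICU is missing is an assumption made here for illustration. R only consults ICU when its collation locale is not C/POSIX, so the ICU result can differ between systems.

```r
# Hypothetical helper (not part of git2rdata): sort a character vector
# with the collation settings used in set_c_locale() above.
sort_standardised <- function(x) {
  if (!capabilities("ICU")) {
    # Assumed fallback: radix sort always uses C-locale ordering
    return(sort(x, method = "radix"))
  }
  # Reset the collator on exit, even when sort() throws an error
  on.exit(icuSetCollate(locale = "default"), add = TRUE)
  icuSetCollate(
    locale = "en_GB", case_first = "lower", normalization = "on",
    case_level = "on"
  )
  sort(x)
}

x <- c("b", "A", "a", "B")
sort_standardised(x)       # order follows the ICU rules when ICU is in use
sort(x, method = "radix")  # C-locale order: upper case before lower case
```

Resetting through `on.exit()` plays the role of `set_local_locale()` in the diff: the collator returns to its default whether or not the sort succeeds.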
diff --git a/README.md b/README.md
@@ -138,10 +138,10 @@ Please use the output of `citation("git2rdata")`
 
 ## Folder Structure
 
-- `R`: The source scripts of the [R](https://cran.r-project.org/) functions with documentation in [Roxygen](https://github.com/klutometis/roxygen) format
+- `R`: The source scripts of the [R](https://cran.r-project.org/) functions with documentation in [Roxygen](https://CRAN.R-project.org/package=roxygen2) format
 - `man`: The help files in [Rd](https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Rd-format) format
 - `inst/efficiency`: pre-calculated data to speed up `vignette("efficiency", package = "git2rdata")`
-- `testthat`: R scripts with unit tests using the [testthat](http://testthat.r-lib.org/) framework
+- `testthat`: R scripts with unit tests using the [testthat](https://CRAN.R-project.org/package=testthat) framework
 - `vignettes`: source code for the vignettes describing the package
 - `man-roxygen`: templates for documentation in Roxygen format
 - `pkgdown`: source files for the `git2rdata` [website](https://ropensci.github.io/git2rdata/)

diff --git a/cran-comments.md b/cran-comments.md
@@ -1,12 +1,12 @@
 ## Test environments
 * local
-    * ubuntu 18.04.3 LTS, R 3.6.1
-* travis-ci
-    * trusty, oldrel
-    * xenial, release and devel
-    * osx, release
-* AppVeyor 
-    * Windows Server 2012 R2 x64, R 3.6.1 Patched
+    * ubuntu 18.04.5 LTS, R 4.0.3
+* github actions
+    * macOS-latest, release
+    * windows-latest, release
+    * ubuntu 20.04, devel
+    * ubuntu 16.04, oldrel
+    * checklist package: ubuntu 20.04.1, R 4.0.3
 * r-hub
     * Windows Server 2008 R2 SP1, R-devel, 32/64 bit
     * Ubuntu Linux 16.04 LTS, R-release, GCC
@@ -15,3 +15,24 @@
 ## R CMD check results
 
 0 errors | 0 warnings | 0 note
+
+r-hub gave a few false positive notes
+
+* Windows Server 2008 R2 SP1, R-devel, 32/64 bit
+
+```
+Possibly mis-spelled words in DESCRIPTION:
+  rdata (28:22, 31:33, 36:20, 40:48, 41:20, 43:24, 44:62, 45:62)
+  workflow (41:37, 44:15, 44:36)
+```
+
+* Fedora Linux, R-devel, clang, gfortran
+
+```
+Possibly mis-spelled words in DESCRIPTION:
+  rdata (28:22, 31:33, 36:20, 40:48, 41:20, 43:24, 44:62, 45:62)
+```
+
+Ubuntu Linux 16.04 LTS, R-release, GCC failed on r-hub because ICU is not
+available on that build.
+
diff --git a/man/git2rdata-package.Rd b/man/git2rdata-package.Rd
diff --git a/tests/testthat/test_b_special.R b/tests/testthat/test_b_special.R
@@ -19,7 +19,7 @@ expect_is(
 )
 expect_equal(
   names(output)[1],
-  "9e5edf55ceadd2c148d6d715ea5d12cc8e1538d8"
+  "1d135a85dc9beff3223d6c79f0d8975b559afca7"
 )
 old_locale <- git2rdata:::set_c_locale()
 dso <- ds[order(ds$a), , drop = FALSE] # nolint
@@ -64,7 +64,7 @@ expect_equal(
 )
 expect_equal(
   names(output)[1],
-  "9e5edf55ceadd2c148d6d715ea5d12cc8e1538d8"
+  "1d135a85dc9beff3223d6c79f0d8975b559afca7"
 )
 expect_identical(
   names(output),

diff --git a/vignettes/split_by.Rmd b/vignettes/split_by.Rmd
@@ -136,7 +136,7 @@ We add an `index.tsv` containing the combinations of the `split_by` variables an
 This hash becomes the base name of the partial data files.
 
 Splitting the dataframe into smaller files makes them easier to handle in a version control system.
-The overall size depends on the amount of replication in the dataframe.
+The total size depends on the amount of replication in the dataframe.
 More on that in the next section.
 
 ## When to Split the Dataframe