Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix problem on data hashes (#49) #53

Merged
merged 47 commits into from
Nov 8, 2019
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
47 commits
Select commit Hold shift + click to select a range
bdded82
remove superfluous imported function
ElsLommelen Sep 4, 2019
a1eb8bd
write function datahash to calculate hash of dataframe
ElsLommelen Sep 4, 2019
356b336
calculate data_hash with datahash instead of hashfile
ElsLommelen Sep 4, 2019
bb0e2cd
rewrite documentation of is_git2rdata
ElsLommelen Sep 4, 2019
0d2bee2
remove unittest of is_git2rdata on data hash
ElsLommelen Sep 4, 2019
85ea2fd
correction
ElsLommelen Sep 4, 2019
6a33d9e
add special characters to test_data
ElsLommelen Sep 4, 2019
a52e4e3
test equality of hash in different OS
ElsLommelen Sep 4, 2019
b0b1851
correct error in attribute hashes of output read_vc
ElsLommelen Sep 4, 2019
66d4ef9
test to replace deleted unittest
ElsLommelen Sep 4, 2019
20119c8
simplify function datahash
ElsLommelen Sep 5, 2019
aa2e1a2
change data hash in read_vc if data have altered outside of git2rdata
ElsLommelen Sep 5, 2019
6e42072
change problem chars to unicode notation
ElsLommelen Sep 5, 2019
733f36f
change data hash value limits the problem to first read in windows
ElsLommelen Sep 9, 2019
3ad1b29
only convert characters in read_vc (not in write_vc
ElsLommelen Sep 9, 2019
5b33bdf
typo
ElsLommelen Sep 9, 2019
33276ad
add € to test
ElsLommelen Sep 9, 2019
49c28b2
Update R/datahash.R
ElsLommelen Sep 9, 2019
8e4ee17
Update R/datahash.R
ElsLommelen Sep 9, 2019
34ea931
Update R/datahash.R
ElsLommelen Sep 9, 2019
f2ace52
Update R/datahash.R
ElsLommelen Sep 9, 2019
9394950
update description and news files
ElsLommelen Sep 9, 2019
d1f4fd5
update the minimal package version in is_git2rmeta()
ThierryO Sep 10, 2019
85937c8
upgrade_data() updates the data hash in data written with version 0.0…
ThierryO Sep 10, 2019
9a2a447
update documentation
ThierryO Sep 10, 2019
256c204
test for equal hashes with tabs or newlines in dataset
ElsLommelen Sep 18, 2019
d0b25e5
use unchanged input data to calculate data hash in write_vc
ElsLommelen Sep 18, 2019
89ac8ef
try to fix datahash on linux
ThierryO Sep 25, 2019
51d8368
always use "C" locale when sorting
ThierryO Sep 27, 2019
38bcdad
ignore .httr-oauth file
ThierryO Oct 22, 2019
fa37c27
use iconv() to make sure that we use UTF-8 for all characters and fac…
ThierryO Oct 22, 2019
7ddd4f0
add from = "UTF-8" argument to iconv() calls
ThierryO Oct 22, 2019
c4a1913
fix linters
ThierryO Oct 24, 2019
bcc2f61
always use encoding with readLines()
ThierryO Oct 24, 2019
cfd1c4b
use enc2utf8() instead of iconv()
ThierryO Oct 24, 2019
9fcdf47
add more unit tests to investigate the problem
ThierryO Oct 24, 2019
dd6c256
extra unit tests
ThierryO Oct 24, 2019
65012fc
calculate datahash as a hash of hashes
ThierryO Oct 25, 2019
e607d14
trace problem
ThierryO Oct 25, 2019
0afc131
hash problem might be due to differences in factor levels
ThierryO Oct 25, 2019
cd96e85
set locale to C in unit tests when creating factors
ThierryO Oct 25, 2019
be8f2e3
reduce cyclomatic complexity of read_vc()
ThierryO Nov 7, 2019
d4bb2b0
ignore code coverage for datahash of very large files (> 1e8 lines)
ThierryO Nov 7, 2019
84ad022
reduce cyclomatic complexity in compare_meta()
ThierryO Nov 7, 2019
cf2e53e
reordering factor levels requires `strict = FALSE`
ThierryO Nov 7, 2019
41cab26
update data in efficiency vignette
ThierryO Nov 8, 2019
e02e802
add video of our talk at useR!2019
ThierryO Nov 8, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .Rbuildignore
Original file line number Diff line number Diff line change
Expand Up @@ -13,3 +13,4 @@
^.*\.Rproj$
^cran-comments\.md$
^CRAN-RELEASE$
^\.httr-oauth$
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -4,3 +4,4 @@
.Ruserdata
inst/doc
docs
.httr-oauth
7 changes: 6 additions & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
Package: git2rdata
Title: Store and Retrieve Data.frames in a Git Repository
Version: 0.1.0.9000
Version: 0.1.0.9003
Authors@R: c(
person(
"Thierry", "Onkelinx", role = c("aut", "cre"),
Expand All @@ -14,6 +14,10 @@ Authors@R: c(
"Peter", "Desmet", role = "ctb",
email = "[email protected]",
comment = c(ORCID = "0000-0002-8442-8025")),
person(
"Els", "Lommelen", role = "ctb",
email = "[email protected]",
comment = c(ORCID = "0000-0002-3481-5684")),
person(
"Research Institute for Nature and Forest",
role = c("cph", "fnd"), email = "[email protected]"))
Expand All @@ -40,6 +44,7 @@ URL: https://github.com/ropensci/git2rdata, https://doi.org/10.5281/zenodo.14853
BugReports: https://github.com/ropensci/git2rdata/issues
Collate:
'clean_data_path.R'
'datahash.R'
'git2rdata-package.R'
'write_vc.R'
'is_git2rdata.R'
Expand Down
28 changes: 28 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,31 @@
git2rdata 0.1.0.9003 (2019-11-07)
=================================

### BREAKING FEATURES

* reordering factor levels requires `strict = TRUE`

git2rdata 0.1.0.9002 (2019-09-27)
=================================

### BREAKING FEATURES

* sorting is based on the "C" locale
* the data hash is based on the plain text file

git2rdata 0.1.0.9001 (2019-09-09)
=================================

### BREAKING FEATURES

* Calculation of data hash has changed, due to which `read_vc()` will once warn that data are altered outside git2rdata when reading a previously written git2rdata object (#53).
* `read_vc()` only works with data stored with version >= 0.1.0.9001. Use `upgrade_data()` on data written with an earlier version.
* `is_git2rdata()` and `upgrade_data()` do not test equality in data hashes anymore (but `read_vc()` still does).

### Bugfixes

* The same data hash is generated on Linux and Windows machines (#49).

git2rdata 0.1.0.9000 (2019-08-13)
=================================

Expand Down
61 changes: 61 additions & 0 deletions R/datahash.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
#' Calculate Hash of Dataframe
#' Calculates a hash based on the given data that is compatible with different
#' operating systems.
#' @param file The file to calculate the hash.
#' @return A 40 hexadecimal character quasi-unique code representing the
#' dataframe.
#' @noRd
#' @family internal
#' @importFrom assertthat assert_that
#' @importFrom git2r hash
datahash <- function(file) {
chunk_size <- 1e4
hashes <- character(chunk_size + 1)
i <- 0
rawdata <- scan(
file = file, what = character(), nmax = -1, sep = "\n", quote = "",
skip = i * chunk_size, nlines = chunk_size, na.strings = "",
flush = FALSE, fill = FALSE, strip.white = FALSE, quiet = TRUE,
blank.lines.skip = FALSE, comment.char = "", allowEscapes = FALSE,
encoding = "UTF-8", skipNul = FALSE
)
while (length(rawdata)) {
hashes[1 + i %% chunk_size] <- hash(paste(hash(rawdata), collapse = "\n"))
i <- i + 1
if (i %% chunk_size == 0) {
hashes[chunk_size + 1] <- hash(paste(hashes, collapse = "")) # nocov
}
rawdata <- scan(
file = file, what = character(), nmax = -1, sep = "\n", quote = "",
skip = i * chunk_size, nlines = chunk_size, na.strings = "",
flush = FALSE, fill = FALSE, strip.white = FALSE, quiet = TRUE,
blank.lines.skip = FALSE, comment.char = "", allowEscapes = FALSE,
encoding = "UTF-8", skipNul = FALSE
)
}
hash(paste(hashes, collapse = ""))
}

#' Set the C locale for standardized sorting
#' @noRd
#' @return a named vector with the old locale
set_c_locale <- function() {
old_ctype <- Sys.getlocale(category = "LC_CTYPE")
old_collate <- Sys.getlocale(category = "LC_COLLATE")
old_time <- Sys.getlocale(category = "LC_TIME")
Sys.setlocale(category = "LC_CTYPE", locale = "C")
Sys.setlocale(category = "LC_COLLATE", locale = "C")
Sys.setlocale(category = "LC_TIME", locale = "C")
return(c(ctype = old_ctype, collate = old_collate, time = old_time))
}

#' Reset the old locale
#' @param locale the output of `set_c_locale()`
#' @return invisible `NULL`
#' @noRd
set_local_locale <- function(locale) {
Sys.setlocale(category = "LC_CTYPE", locale = locale["ctype"])
Sys.setlocale(category = "LC_COLLATE", locale = locale["collate"])
Sys.setlocale(category = "LC_TIME", locale = locale["time"])
return(invisible(NULL))
}
11 changes: 1 addition & 10 deletions R/is_git2rdata.R
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
#' Check Whether a Git2rdata Object is Valid.
#'
#' A valid git2rdata object has valid metadata. The data hash must match the
#' data hash stored in the metadata.
#' A valid git2rdata object has valid metadata.
#' @inheritParams write_vc
#' @inheritParams is_git2rmeta
#' @return A logical value. `TRUE` in case of a valid git2rdata object.
Expand All @@ -24,7 +23,6 @@ is_git2rdata.default <- function(file, root, message) {
#' @importFrom assertthat assert_that is.string
#' @importFrom yaml read_yaml as.yaml
#' @importFrom utils packageVersion
#' @importFrom git2r hash
is_git2rdata.character <- function(file, root = ".",
message = c("none", "warning", "error")) {
assert_that(is.string(file), is.string(root))
Expand Down Expand Up @@ -56,13 +54,6 @@ is_git2rdata.character <- function(file, root = ".",
return(FALSE)
}

if (meta_data[["..generic"]][["data_hash"]] != hashfile(file[["raw_file"]])) {
msg <- "Corrupt data, mismatching data hash."
switch(message, error = stop(msg, call. = FALSE),
warning = warning(msg, call. = FALSE))
return(FALSE)
}

return(TRUE)
}

Expand Down
2 changes: 1 addition & 1 deletion R/is_git2rmeta.R
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,7 @@ See `?upgrade_data()`."
return(FALSE)
}
if (package_version(meta_data[["..generic"]][["git2rdata"]]) <
package_version("0.0.5")) {
package_version("0.1.0.9001")) {
msg <- "Data stored using an older version of `git2rdata`.
See `?upgrade_data()`."
switch(message, error = stop(msg, call. = FALSE),
Expand Down
66 changes: 47 additions & 19 deletions R/meta.R
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
#' Optimize an Object for Storage as Plain Text and Add Metadata
#'
#' @description
#' Prepares a vector for storage. When relevant, `meta()`optimizes the object
#' Prepares a vector for storage. When relevant, `meta()` optimizes the object
#' for storage by changing the format to one which needs less characters. The
#' metadata stored in the `meta` attribute, contains all required information to
#' back-transform the optimized format into the original format.
Expand Down Expand Up @@ -38,6 +38,7 @@ meta <- function(x, ...) {
#' @importFrom assertthat assert_that is.string noNA
meta.character <- function(x, na = "NA", ...) {
assert_that(is.string(na), noNA(na), no_whitespace(na))
x <- enc2utf8(x)
if (na %in% x) {
stop("one of the strings matches the NA string ('", na, "')
Please use a different NA string or consider using a factor.", call. = FALSE)
Expand Down Expand Up @@ -70,21 +71,41 @@ meta.numeric <- function(x, ...) {

#' @export
#' @rdname meta
#' @param optimize If `TRUE`, recode the data to get smaller text files. If
#' `FALSE`, `meta()` converts the data to character. Defaults to `TRUE`.
#' @param index an optional named vector with existing factor indices. The names
#' must match the existing factor levels. Unmatched levels from `x` will get new
#' indices.
#' @param optimize If `TRUE`, recode the data to get smaller text files.
#' If `FALSE`, `meta()` converts the data to character.
#' Defaults to `TRUE`.
#' @param index An optional named vector with existing factor indices.
#' The names must match the existing factor levels.
#' Unmatched levels from `x` will get new indices.
#' @inheritParams utils::write.table
#' @importFrom assertthat assert_that is.flag noNA
meta.factor <- function(x, optimize = TRUE, na = "NA", index, ...) {
assert_that(is.flag(optimize), noNA(optimize))
#' @note The default order of factor levels depends on the current locale.
#' See \code{\link{Comparison}} for more details on that.
#' The same code on a different locale might result in a different sorting.
#' `meta()` ignores, with a warning, any change in the order of factor levels.
#' Add `strict = FALSE` to enforce the new order of factor levels.
meta.factor <- function(
x, optimize = TRUE, na = "NA", index, strict = TRUE, ...
) {
assert_that(is.flag(optimize), noNA(optimize), is.flag(strict), noNA(strict))
levels(x) <- enc2utf8(levels(x))
if (missing(index) || is.null(index)) {
index <- seq_along(levels(x))
names(index) <- levels(x)
} else {
assert_that(is.integer(index))
assert_that(anyDuplicated(index) == 0, msg = "duplicate indices")

if (
strict &&
all(names(index) %in% levels(x)) &&
all(levels(x) %in% names(index)) &&
any(levels(x) != names(index))
) {
warning("Same levels with a different order detected.
This change is ignored. Use `strict = FALSE` to reorder the factor.")
x <- factor(x, levels = names(index))
}
new_levels <- which(!levels(x) %in% names(index))
candidate_index <- seq_len(length(new_levels) + length(index))
candidate_index <- candidate_index[!candidate_index %in% index]
Expand All @@ -101,15 +122,16 @@ meta.factor <- function(x, optimize = TRUE, na = "NA", index, ...) {
z <- index[x]
} else {
assert_that(is.string(na), noNA(na), no_whitespace(na))
if (na %in% levels(x)) {
stop("one of the levels matches the NA string ('", na, "').
Please use a different NA string or use optimize = TRUE", call. = FALSE)
}
assert_that(
!na %in% levels(x),
msg = paste0("one of the levels matches the NA string ('", na, "').
Please use a different NA string or use optimize = TRUE")
)
z <- meta(as.character(x), optimize = optimize, na = na, ...)
}

m <- list(class = "factor", na_string = na, optimize = optimize,
labels = enc2utf8(names(index)), index = unname(index),
labels = names(index), index = unname(index),
ordered = is.ordered(x))
class(m) <- "meta_detail"
attr(z, "meta") <- m
Expand All @@ -119,7 +141,7 @@ Please use a different NA string or use optimize = TRUE", call. = FALSE)
#' @export
#' @rdname meta
#' @importFrom assertthat assert_that is.flag noNA
meta.logical <- function(x, optimize = TRUE, ...){
meta.logical <- function(x, optimize = TRUE, ...) {
assert_that(is.flag(optimize), noNA(optimize))
if (optimize) {
x <- as.integer(x)
Expand Down Expand Up @@ -160,7 +182,7 @@ meta.POSIXct <- function(x, optimize = TRUE, ...) {
#' @export
#' @rdname meta
#' @importFrom assertthat assert_that is.flag noNA
meta.Date <- function(x, optimize = TRUE, ...){
meta.Date <- function(x, optimize = TRUE, ...) {
assert_that(is.flag(optimize), noNA(optimize))
if (optimize) {
z <- as.integer(x)
Expand Down Expand Up @@ -188,7 +210,9 @@ meta.Date <- function(x, optimize = TRUE, ...){
#' argument intended for internal use.
#' @rdname meta
#' @inheritParams write_vc
meta.data.frame <- function(x, optimize = TRUE, na = "NA", sorting, ...) {
meta.data.frame <- function(# nolint
x, optimize = TRUE, na = "NA", sorting, strict = TRUE, ...
) {
assert_that(
!has_name(x, "..generic"),
msg = "'..generic' is a reserved name and not allowed as column name")
Expand All @@ -213,7 +237,9 @@ Sorting is strongly recommended in combination with version control.")
all(sorting %in% colnames(x)),
msg = "All sorting variables must be available in the data.frame")
if (nrow(x) > 1) {
old_locale <- set_c_locale()
x <- x[do.call(order, unname(x[sorting])), , drop = FALSE] # nolint
set_local_locale(old_locale)
if (any_duplicated(x[sorting])) {
sorted <- paste(sprintf("'%s'", sorting), collapse = ", ")
sorted <- sprintf("Sorting on %s results in ties.
Expand All @@ -229,14 +255,16 @@ Add extra sorting variables to ensure small diffs.", sorted)
if (length(common)) {
z_common <- lapply(
common,
function(id, optimize, na) {
function(id, optimize, na, strict) {
meta(
x[[id]], optimize = optimize, na = na,
index = setNames(old[[id]][["index"]], old[[id]][["labels"]])
index = setNames(old[[id]][["index"]], old[[id]][["labels"]]),
strict = strict
)
},
optimize = old[["..generic"]][["optimize"]],
na = old[["..generic"]][["NA string"]]
na = old[["..generic"]][["NA string"]],
strict = strict
)
names(z_common) <- common
} else {
Expand Down
16 changes: 8 additions & 8 deletions R/prune.R
Original file line number Diff line number Diff line change
Expand Up @@ -22,22 +22,22 @@
#' @template example-prune
rm_data <- function(
root = ".", path = NULL, recursive = TRUE, ...
){
) {
UseMethod("rm_data", root)
}

#' @export
rm_data.default <- function(
root, path = NULL, recursive = TRUE, ...
){
) {
stop("a 'root' of class ", class(root), " is not supported", call. = FALSE)
}

#' @export
#' @importFrom assertthat assert_that is.flag
rm_data.character <- function(
root = ".", path = NULL, recursive = TRUE, ...
){
) {
to_do <- list_data(root = root, path = path, recursive = recursive)
if (length(to_do) == 0) {
return(to_do)
Expand All @@ -63,7 +63,7 @@ rm_data.character <- function(
rm_data.git_repository <- function(
root, path = NULL, recursive = TRUE, ..., stage = FALSE,
type = c("unmodified", "modified", "ignored", "all")
){
) {
type <- match.arg(type)
to_do <- list_data(root = root, path = path, recursive = recursive)
if (length(to_do) == 0) {
Expand Down Expand Up @@ -116,22 +116,22 @@ rm_data.git_repository <- function(
#' @template example-prune
prune_meta <- function(
root = ".", path = NULL, recursive = TRUE, ...
){
) {
UseMethod("prune_meta", root)
}

#' @export
prune_meta.default <- function(
root, path = NULL, recursive = TRUE, ...
){
) {
stop("a 'root' of class ", class(root), " is not supported", call. = FALSE)
}

#' @export
#' @importFrom assertthat assert_that is.flag noNA
prune_meta.character <- function(
root = ".", path = NULL, recursive = TRUE, ...
){
) {
assert_that(is.string(root))
root <- normalizePath(root, winslash = "/", mustWork = TRUE)
assert_that(is.string(path))
Expand Down Expand Up @@ -171,7 +171,7 @@ prune_meta.character <- function(
#' @rdname prune_meta
prune_meta.git_repository <- function(
root, path = NULL, recursive = TRUE, ..., stage = FALSE
){
) {
root_wd <- normalizePath(workdir(root), winslash = "/")
assert_that(is.string(path))
path <- file.path(root_wd, path)
Expand Down
Loading