Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix problem on data hashes (#49) #53

Merged
merged 47 commits into from
Nov 8, 2019
Merged

Fix problem on data hashes (#49) #53

merged 47 commits into from
Nov 8, 2019

Conversation

ElsLommelen
Copy link
Contributor

Description

  • new function datahash replaces function hashfile
    • datahash generates the same hash in Windows and Linux
    • datahash is calculated based on the dataframe instead of the tsv-file
  • data hash is calculated in read_vc instead of is_git2rdata to avoid having to read the tsv-file twice -> a wrong data hash is not reported anymore as an error by is_git2rdata, and this whole error paragraph is replaced by a simple warning in read_vc (as this gave the same result now everything is replaced to read_vc)

Related Issue

fix #49

Example

included in unit tests

@ElsLommelen
Copy link
Contributor Author

  • I noticed the data hash did not change in the metadata and function output info when reading a file in which the hash had changed outside git2rdata, which seemed confusing, so I changed it
  • with regard to the difficulties in the hashes:
    • the problem seems to be caused by the symbols é, à, ç and µ
    • by using unicode notation in the test dataframe the tests pass for datahash and write_vc, but the problem remains with read_vc (and also pops up in this function in Windows)
    • it seems to be a matter of finding the right way to input the symbols, as there seems to be no problem if input is done directly on the machine
    • another possibility is to bypass this problem and develop tests for read_vc based on saved git2rdata-objects

@codecov-io
Copy link

codecov-io commented Sep 9, 2019

Codecov Report

Merging #53 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@          Coverage Diff          @@
##           master    #53   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files          11     12    +1     
  Lines         696    754   +58     
=====================================
+ Hits          696    754   +58
Impacted Files Coverage Δ
R/relabel.R 100% <ø> (ø) ⬆️
R/recent_commit.R 100% <ø> (ø) ⬆️
R/is_git2rdata.R 100% <ø> (ø) ⬆️
R/prune.R 100% <ø> (ø) ⬆️
R/upgrade_data.R 100% <100%> (ø) ⬆️
R/datahash.R 100% <100%> (ø)
R/is_git2rmeta.R 100% <100%> (ø) ⬆️
R/read_vc.R 100% <100%> (ø) ⬆️
R/write_vc.R 100% <100%> (ø) ⬆️
R/meta.R 100% <100%> (ø) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4c49807...e02e802. Read the comment docs.

@ElsLommelen
Copy link
Contributor Author

Problem solved!
Trial and error learned that in Windows:

  • keyboard imported characters are interpreted correctly without conversion to UTF-8 (but not anymore after conversion)
  • this behaviour could be imitated in unit tests by using unicode notation
  • text file imported chararcters do need conversion to UTF-8 to be interpreted correctly

In Linux, conversion does not influence the result, as characters are typically in UTF-8

@ElsLommelen ElsLommelen requested a review from ThierryO September 9, 2019 09:07
Copy link
Member

@ThierryO ThierryO left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. It looks good. Can apply these changes too?

  • bump package version to 0.1.0.9001 in DESCRIPTION
  • add yourself as contributor in DESCRIPTION
  • mention the changes in NEWS. This is a breaking change. You can merge the changes with the entry for version 0.1.0.9000

R/write_vc.R Outdated
@@ -106,12 +106,16 @@ write_vc.character <- function(
meta_data[["..generic"]][["git2rdata"]] <- as.character(
packageVersion("git2rdata")
)
meta_data[["..generic"]][["data_hash"]] <- hashfile(file["raw_file"])
meta_data[["..generic"]][["data_hash"]] <- datahash(raw_data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you need convert = TRUE here too?

Suggested change
meta_data[["..generic"]][["data_hash"]] <- datahash(raw_data)
meta_data[["..generic"]][["data_hash"]] <- datahash(raw_data, convert = TRUE)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that's the point I mentioned above: keyboard imported characters (and unicode characters written in code) need no conversion in Windows. If they would be converted, they would get a different hash code (and 3 tests would fail in AppVeyor). (Characters from text files, on the other hand, need to be converted.)

R/datahash.R Outdated Show resolved Hide resolved
R/datahash.R Outdated Show resolved Hide resolved
R/datahash.R Outdated Show resolved Hide resolved
R/datahash.R Outdated Show resolved Hide resolved
Copy link
Member

@ThierryO ThierryO left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. Thanks. I'm waiting on the feedback from the users in #49 before merging this.

@florisvdh
Copy link
Collaborator

The datahashes do match now (see #49); great!

I explored the behaviour of read_vc(), write_vc() and upgrade_data() with existing vc-data written by git2rdata 0.1.0.

  • write_vc() does not complain and just updates the hashes and git2rdata version in the yml file.
  • read_vc(), executed on the 0.1.0 data format, does also work but throws a warning which is actually incorrect:
df <- read_vc("df_vc")
# Warning message:
# Mismatching data hash. Data altered outside of git2rdata. 

Maybe the current case can be distinguished more clearly.

  • upgrade_data() does not change a thing, while one would expect the hashes to be updated. That, I guess, is still something to be fixed.

@ThierryO
Copy link
Member

Thanks @florisvdh for noticing the upgrade_data() issue. This should be solved in
646d60c

@florisvdh
Copy link
Collaborator

Great solutions! read_vc() and write_vc() now refuse to work with version 0.1.0 data. They clearly show the need to upgrade the data first.

> library(git2rdata)
> x <- seq(1:26)
> y <- letters
> df <- data.frame(x,y)
> df2 <- read_vc("df_vc") # stored earlier with 0.1.0
 Error: Data stored using an older version of `git2rdata`.
See `?upgrade_data()`. 
> write_vc(df, "df_vc", sorting = c("x"), strict =  FALSE)
 Error: Existing metadata file is invalid.
Data stored using an older version of `git2rdata`.
See `?upgrade_data()`. 
> upgrade_data("df_vc")
[...]/df_vc.yml updated
meta_file 
  "df_vc" 
> df2 <- read_vc("df_vc")
> all.equal(df, df2, check.attributes = FALSE)
[1] TRUE
# Now write_vc() works again (in this example, nothing changed):
> write_vc(df, "df_vc", sorting = c("x"), strict =  FALSE)
b2658819ed189ec4496b4b25c55404f7d0918b6a 3514e919bcca45b232268c650a04db36a18aa6b5 
                             "df_vc.tsv"                              "df_vc.yml"

@ElsLommelen
Copy link
Contributor Author

@ThierryO : Seems to work now, but could you please try your original code again on your Linux pc to exclude OS-based differences in reading googlesheets-files? I have difficulties using googlesheets4 on Docker...

@ThierryO ThierryO merged commit 40c7a02 into master Nov 8, 2019
@ThierryO ThierryO deleted the datahash branch November 8, 2019 11:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Data hashes seem to differ between Windows and Linux
4 participants