Data hashes seem to differ between Windows and Linux #49

Closed
florisvdh opened this issue Aug 7, 2019 · 6 comments · Fixed by #53

@florisvdh
Collaborator

This issue uses the reprex from issue #47.

I don't get those errors, but on Linux my output is always:

3e6fbe383532f4312bd0f5c9f30976f64d00e9cc e5e6ed33018f669308297f2f3d66512b3fa8c1b6 
                     "../data/df_vc.tsv"                      "../data/df_vc.yml" 

This data_hash (stored in the yml file) differs from the one generated on Windows.

Session Info

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.1

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] LC_CTYPE=nl_BE.UTF-8 LC_NUMERIC=C LC_TIME=nl_BE.UTF-8
[4] LC_COLLATE=nl_BE.UTF-8 LC_MONETARY=nl_BE.UTF-8 LC_MESSAGES=nl_BE.UTF-8
[7] LC_PAPER=nl_BE.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=nl_BE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] git2rdata_0.1

loaded via a namespace (and not attached):
[1] drat_0.1.5 compiler_3.6.1 assertthat_0.2.1 tools_3.6.1 yaml_2.2.0
[6] git2r_0.26.1 packrat_0.5.0 fortunes_1.5-4

ThierryO added the bug label Aug 12, 2019
ThierryO self-assigned this Aug 12, 2019
ThierryO added this to the Version 0.1.1 milestone Aug 12, 2019
@ThierryO
Member

ThierryO commented Aug 14, 2019

I can reproduce this. It seems that git2r::hashfile() yields different output on Linux and Windows.

filename <- tempfile("os-bug")
writeLines(
  c("x\ty", "1\t1", "2\t2", "3\t3", "4\t4", "5\t5", "6\t6", "7\t7", 
    "8\t8", "9\t9", "10\t10", "11\t11", "12\t12", "13\t13", "14\t14", 
    "15\t15", "16\t16", "17\t17", "18\t18", "19\t19", "20\t20", "21\t21", 
    "22\t22", "23\t23", "24\t24", "25\t25", "26\t26"),
  filename
)
git2r::hashfile(filename)

Output:

  • Windows: 1de50dce6d5139f98a8e69d4d45d26ae7d32c64f
  • Linux: 3e6fbe383532f4312bd0f5c9f30976f64d00e9cc

Session info on Windows

Session info ──────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.5.2 (2018-12-20)
 os       Windows >= 8 x64            
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  Dutch_Belgium.1252          
 ctype    Dutch_Belgium.1252          
 tz       Europe/Paris                
 date     2019-08-14                  

- Packages ---------------------------------------------------------------------------------------------
 package     * version date       lib source        
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.5.3)
 cli           1.1.0   2019-03-19 [1] CRAN (R 3.5.3)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.3)
 drat          0.1.4   2017-12-16 [1] CRAN (R 3.5.3)
 fortunes      1.5-4   2016-12-29 [1] CRAN (R 3.5.2)
 git2r       * 0.25.2  2019-03-19 [1] CRAN (R 3.5.3)
 rstudioapi    0.10    2019-03-19 [1] CRAN (R 3.5.3)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.3)
 withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.3)
 yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.2)

[1] C:/R/library
[2] C:/Program Files/R/R-3.5.2/library

Session info on Linux

Session info ──────────────────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.1 (2019-07-05)
 os       Ubuntu 18.04.3 LTS          
 system   x86_64, linux-gnu           
 ui       RStudio                     
 language nl:en                       
 collate  nl_NL.UTF-8                 
 ctype    nl_NL.UTF-8                 
 tz       Europe/Brussels             
 date     2019-08-14                  

Packages ──────────────────────────────────────────────────────────────────────────────────────
 package     * version date       lib source        
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 3.6.0)
 cli           1.1.0   2019-03-19 [2] CRAN (R 3.5.3)
 crayon        1.3.4   2017-09-16 [2] CRAN (R 3.5.3)
 drat          0.1.5   2019-03-28 [1] CRAN (R 3.6.0)
 fortunes      1.5-4   2016-12-29 [1] CRAN (R 3.6.0)
 git2r         0.26.1  2019-06-29 [1] CRAN (R 3.6.0)
 packrat       0.5.0   2018-11-14 [1] CRAN (R 3.6.0)
 rstudioapi    0.10    2019-03-19 [2] CRAN (R 3.5.3)
 sessioninfo   1.1.1   2018-11-05 [2] CRAN (R 3.5.3)
 withr         2.1.2   2018-03-15 [2] CRAN (R 3.5.3)

[1] /home/thierry_onkelinx/R/x86_64-pc-linux-gnu-library/3.5
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library

@ThierryO
Member

According to @stewid, the difference in hash is due to the difference in line endings on Linux and Windows (ropensci/git2r#397).

Below is a reprex using write.table() on Linux.

library(git2r)
x <- 1:26
y <- letters
df <- data.frame(x, y, stringsAsFactors = FALSE)
filename <- tempfile("os-bug")

# unix style line endings
write.table(
  x = df, file = filename, append = FALSE, quote = FALSE,
  sep = "\t", eol = "\n", na = "NA", dec = ".", row.names = FALSE,
  col.names = TRUE, fileEncoding = "UTF-8"
)
hashfile(filename) # "50aabdcd96bd742fdcc41edcc6b3efdf8e63f498"

# windows style line endings
write.table(
  x = df, file = filename, append = FALSE, quote = FALSE,
  sep = "\t", eol = "\r\n", na = "NA", dec = ".", row.names = FALSE,
  col.names = TRUE, fileEncoding = "UTF-8"
)
hashfile(filename) # "1783ed10fa5035a3963abf4202f42fe6ca88f046"
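
The byte-level difference behind the two hashes can be made visible with base R alone. The following is a minimal sketch and not part of the original thread. It writes the same three lines with both line endings through binary connections, so the operating system does no newline translation of its own, and then compares file size and checksum. `write_with_eol()` and `count_cr()` are hypothetical helpers, and `tools::md5sum()` stands in for `git2r::hashfile()` only to keep the example free of dependencies.

```r
# Hypothetical helper: write `lines` to a new temp file with the given
# end-of-line marker and return the file path.
write_with_eol <- function(lines, eol) {
  path <- tempfile("eol-demo")
  con <- file(path, open = "wb")  # binary mode: no newline translation on Windows
  on.exit(close(con))
  writeLines(lines, con, sep = eol)
  path
}

# Hypothetical helper: count carriage returns (0x0d) in the raw bytes of a file.
count_cr <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  sum(bytes == as.raw(0x0d))
}

lines <- c("x\ty", "1\ta", "2\tb")
unix_file <- write_with_eol(lines, "\n")
windows_file <- write_with_eol(lines, "\r\n")

# Every line gains one carriage return, so the sizes differ by length(lines).
size_diff <- file.size(windows_file) - file.size(unix_file)

# A hash of the raw bytes therefore differs as well.
same_checksum <- unname(tools::md5sum(unix_file)) ==
  unname(tools::md5sum(windows_file))
```

Here `size_diff` equals the number of lines written and `same_checksum` is `FALSE`, which mirrors the two different `hashfile()` results above.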

@ThierryO
Member

ThierryO commented Sep 9, 2019

@florisvdh and @w-jan, can you check whether PR #53 solves this issue? Use remotes::install_github("ropensci/git2rdata@datahash") to install it.

@florisvdh
Collaborator Author

I haven't checked Windows yet, but on Linux I now get a different hash than before. Is this expected?

library(git2rdata)
x <- 1:26
y <- letters
df <- data.frame(x, y)
write_vc(df, "df_vc", sorting = c("x"), strict = FALSE)
# b2658819ed189ec4496b4b25c55404f7d0918b6a 3514e919bcca45b232268c650a04db36a18aa6b5
#                              "df_vc.tsv"                              "df_vc.yml"

@ThierryO
Member

ThierryO commented Sep 9, 2019

Yes, this is possible. The hashes are now calculated from the content rather than from the file itself.
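
A content-based hash can be sketched as follows. This is an illustration under assumptions, not the implementation from PR #53: `content_md5()` is a hypothetical helper, and MD5 via `tools::md5sum()` replaces the SHA-1 that git2r reports. The idea is to read the lines (`readLines()` accepts LF, CRLF, or CR as the end-of-line marker), rewrite them with one fixed line ending, and hash that normalised copy.

```r
# Hypothetical helper: checksum of a text file's lines, independent of
# the line-ending convention used on disk.
content_md5 <- function(path) {
  lines <- readLines(path, warn = FALSE)   # strips LF, CRLF and CR alike
  normalised <- tempfile("normalised")
  on.exit(unlink(normalised))
  con <- file(normalised, open = "wb")     # binary mode: write "\n" verbatim
  writeLines(lines, con, sep = "\n", useBytes = TRUE)
  close(con)
  unname(tools::md5sum(normalised))
}

# Write the same (ASCII-only) lines with Unix and Windows line endings.
demo_lines <- c("x\ty", "1\ta", "2\tb")
paths <- vapply(c(unix = "\n", windows = "\r\n"), function(eol) {
  path <- tempfile("content-demo")
  con <- file(path, open = "wb")
  writeLines(demo_lines, con, sep = eol)
  close(con)
  path
}, character(1))

# The hashes of the raw files differ, the hashes of the content agree.
file_hashes_equal <- unname(tools::md5sum(paths[["unix"]])) ==
  unname(tools::md5sum(paths[["windows"]]))
content_hashes_equal <- content_md5(paths[["unix"]]) ==
  content_md5(paths[["windows"]])
```

Because the hash input changes from raw bytes to normalised content, a hash computed this way also differs from the earlier file-based hash on the same machine, which matches the changed Linux hash reported above.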

@florisvdh
Collaborator Author

I checked on Windows and the same data hash is produced. Good work! I think it's OK to close the issue.

See some further comments in PR #53.
