Split table #62

Merged
merged 24 commits into master from split_table
Jan 13, 2021


ThierryO
Member

  • write_vc() gains an optional split_by argument.
  • add vignette on the new split_by argument.
  • read_vc(), is_git2rdata() and is_git2rmeta() now yield a better message when both the data and metadata are missing.

The old implementation yielded a "missing metadata" error when reading a non-existent object.
The new implementation yields a "missing object" error.
Handle the case where a file stored without split_by is replaced with a version with split_by and vice versa.
Also check changes in split_by variables.
Collaborator

@florisvdh florisvdh left a comment

Thanks for adding this @ThierryO .

I've read the vignette (see upcoming PR for smaller fixes) and did some small tests.

  • In the vignette, best add something on the benefit of splitting. Why or when would a user consider this? What is the drawback of not splitting in a workflow? If the reasons all relate to file size, then best explain why larger TSV files hinder the workflow. You could add it in the intro, or under a specific header. Otherwise I think the need for it may not be convincing enough.
  • I got no errors when running the code, and from looking at resulting files, this seems fine 👍 . However the below test gave some unexpected results for the split case; you may want to have a look.
library(git2rdata)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:git2rdata':
#> 
#>     pull
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

set.seed(123456)

testdata <- 
  data.frame(group1 = as.Date("2020-01-01"):as.Date("2020-01-10"),
             group2 = 1:5,
             id_sub = paste0("test",1:1e3)
             ) %>% 
  mutate(group1 = as.Date(group1, origin = "1970-01-01")) %>% 
  as_tibble %>% 
  expand(group1, group2, id_sub) %>% 
  mutate(var1 = rnorm(5e4),
         var2 = runif(5e4))

testdata %>% write_vc("split/testdata", 
                      sorting = c("group1", "group2", "id_sub"),
                      split_by = c("group2"))
#> 4516a4b1c2eee604b874ca03cb0c30a32c084ae7 
#>                     "split/testdata.tsv" 
#> 76c88bc1396a725758d4e0b513fe714130a91036 
#>                     "split/testdata.yml"

read_vc("split/testdata") %>% 
  all.equal(testdata, check.attributes = FALSE)
#> [1] "Component \"group1\": Mean relative difference: 0.0001891192"
#> [2] "Component \"group2\": Mean relative difference: 0.6666667"   
#> [3] "Component \"var1\": Mean relative difference: 1.357854"      
#> [4] "Component \"var2\": Mean relative difference: 0.6434074"

testdata %>% write_vc("notsplit/testdata", 
                      sorting = c("group1", "group2", "id_sub"))
#> 63958eb7da7077b1700d1605efa3d81c55c4770b 
#>                  "notsplit/testdata.tsv" 
#> 1da5da00e927938e1b4cc45e1108c67c4d2e7f40 
#>                  "notsplit/testdata.yml"

read_vc("notsplit/testdata") %>% 
  all.equal(testdata, check.attributes = FALSE)
#> [1] TRUE

Created on 2020-09-23 by the reprex package (v0.3.0)

Session info
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.2 (2020-06-22)
#>  os       Linux Mint 20               
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language nl_BE:nl                    
#>  collate  nl_BE.UTF-8                 
#>  ctype    nl_BE.UTF-8                 
#>  tz       Europe/Brussels             
#>  date     2020-09-23                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date       lib source        
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.2)
#>  backports     1.1.10  2020-09-15 [1] CRAN (R 4.0.2)
#>  callr         3.4.4   2020-09-07 [1] CRAN (R 4.0.2)
#>  cli           2.0.2   2020-02-28 [1] CRAN (R 4.0.2)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.2)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.2)
#>  devtools      2.3.2   2020-09-18 [1] CRAN (R 4.0.2)
#>  digest        0.6.25  2020-02-23 [1] CRAN (R 4.0.2)
#>  dplyr       * 1.0.2   2020-08-18 [1] CRAN (R 4.0.2)
#>  ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.2)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.0.2)
#>  fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.2)
#>  fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.2)
#>  generics      0.0.2   2018-11-29 [1] CRAN (R 4.0.2)
#>  git2r         0.27.1  2020-05-03 [1] CRAN (R 4.0.2)
#>  git2rdata   * 0.3.0   2020-09-23 [1] local         
#>  glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
#>  highr         0.8     2019-03-20 [1] CRAN (R 4.0.2)
#>  htmltools     0.5.0   2020-06-16 [1] CRAN (R 4.0.2)
#>  knitr         1.30    2020-09-22 [1] CRAN (R 4.0.2)
#>  lifecycle     0.2.0   2020-03-06 [1] CRAN (R 4.0.2)
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 4.0.2)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.2)
#>  pillar        1.4.6   2020-07-10 [1] CRAN (R 4.0.2)
#>  pkgbuild      1.1.0   2020-07-13 [1] CRAN (R 4.0.2)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.2)
#>  pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.2)
#>  prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.2)
#>  processx      3.4.4   2020-09-03 [1] CRAN (R 4.0.2)
#>  ps            1.3.4   2020-08-11 [1] CRAN (R 4.0.2)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.2)
#>  R6            2.4.1   2019-11-12 [1] CRAN (R 4.0.2)
#>  remotes       2.2.0   2020-07-21 [1] CRAN (R 4.0.2)
#>  reprex        0.3.0   2019-05-16 [1] CRAN (R 4.0.2)
#>  rlang         0.4.7   2020-07-09 [1] CRAN (R 4.0.2)
#>  rmarkdown     2.3     2020-06-18 [1] CRAN (R 4.0.2)
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 4.0.2)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.2)
#>  stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.0.2)
#>  testthat      2.3.2   2020-03-02 [1] CRAN (R 4.0.2)
#>  tibble        3.0.3   2020-07-10 [1] CRAN (R 4.0.2)
#>  tidyr       * 1.1.2   2020-08-27 [1] CRAN (R 4.0.2)
#>  tidyselect    1.1.0   2020-05-11 [1] CRAN (R 4.0.2)
#>  usethis       1.6.3   2020-09-17 [1] CRAN (R 4.0.2)
#>  vctrs         0.3.4   2020-08-29 [1] CRAN (R 4.0.2)
#>  withr         2.3.0   2020-09-22 [1] CRAN (R 4.0.2)
#>  xfun          0.17    2020-09-09 [1] CRAN (R 4.0.2)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.0.2)
#> 
#> [1] /home/floris/lib/R/library
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library

@ThierryO
Member Author

Thanks for the comments @florisvdh. The differences you get arise because the split_by variables are internally prepended to the sorting variables. So in your example the actual sorting is c("group2", "group1", "id_sub"). I'll update the documentation to mention this.
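
The effect of prepending the split_by variable can be sketched with base R alone. This is only an illustration on a hypothetical toy data frame (`toy`, `requested`, `effective` and `resorted` are made-up names); it does not call git2rdata and makes no claim about how the package implements the sorting internally.

```r
# Hypothetical toy data: every combination of the three key columns.
toy <- expand.grid(
  group1 = as.Date(c("2020-01-01", "2020-01-02")),
  group2 = 1:2,
  id_sub = c("a", "b"),
  stringsAsFactors = FALSE
)
toy$var1 <- seq_len(nrow(toy))

# Row order as requested through `sorting`.
requested <- toy[order(toy$group1, toy$group2, toy$id_sub), ]

# Row order when the split_by variable (group2) comes first.
effective <- toy[order(toy$group2, toy$group1, toy$id_sub), ]

# Both objects hold the same rows, but a position-wise comparison differs.
same_rows_in_place <- isTRUE(
  all.equal(requested, effective, check.attributes = FALSE)
)

# Re-sorting by the requested keys makes the comparison succeed again.
resorted <- effective[
  order(effective$group1, effective$group2, effective$id_sub),
]
same_after_resort <- isTRUE(
  all.equal(requested, resorted, check.attributes = FALSE)
)
```

Comparing a split object with its source therefore needs an explicit re-sort on both sides, or a source that is already ordered with the split_by variables first.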

Smaller changes in the split_by vignette
@ThierryO
Member Author

closes #45

@florisvdh
Collaborator

Good idea @ThierryO 👍 , thanks for the explanation.

Confirmed:

library(git2rdata)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:git2rdata':
#> 
#>     pull
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

set.seed(123456)

testdata <- 
  data.frame(group1 = as.Date("2020-01-01"):as.Date("2020-01-10"),
             group2 = 1:5,
             id_sub = paste0("test",1:1e3)
             ) %>% 
  mutate(group1 = as.Date(group1, origin = "1970-01-01")) %>% 
  as_tibble %>% 
  expand(group1, group2, id_sub) %>% 
  mutate(var1 = rnorm(5e4),
         var2 = runif(5e4)) %>% 
  arrange(group1, group2, id_sub)

testdata %>% write_vc("split/testdata", 
                      sorting = c("group1", "group2", "id_sub"),
                      split_by = c("group2"))
#> 4516a4b1c2eee604b874ca03cb0c30a32c084ae7 
#>                     "split/testdata.tsv" 
#> 76c88bc1396a725758d4e0b513fe714130a91036 
#>                     "split/testdata.yml"

read_vc("split/testdata") %>% 
  arrange(group1, group2, id_sub) %>% 
  all.equal(testdata, check.attributes = FALSE)
#> [1] TRUE

Created on 2020-09-23 by the reprex package (v0.3.0)


@ThierryO ThierryO merged commit 20762c5 into master Jan 13, 2021
@ThierryO ThierryO deleted the split_table branch January 13, 2021 15:26