-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Split table #62
Split table #62
Conversation
ThierryO
commented
Sep 18, 2020
- write_vc() gains an optional split_by argument.
- add vignette on the new split_by argument.
- read_vc(), is_git2rdata() and is_git2rmeta() now yield a better message when both the data and metadata are missing.
The old implementation yielded a "missing metadata" error when reading a non existing object. The new implementation yields a "missing object" error.
Handle the case where a file stored without split_by is replaced with a version with split_by and vice versa. Also check changes in split_by variables.
This simplifies the logical in write_vc() and read_vc()
This removes the digest dependency
Required to get tests running under R-devel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this @ThierryO .
I've read the vignette (see upcoming PR for smaller fixes) and did some small tests.
- In the vignette, best add something on the benefit of splitting. Why/when would a user consider this? What is the drawback of not splitting, in a workflow? I.e. if these are all related to filesize, then best explain why larger tsv files hinder the workflow. You could add it in the intro, or under a specific header. Otherwise I think the need for it may not be convincing enough.
- I got no errors when running the code, and from looking at resulting files, this seems fine 👍 . However the below test gave some unexpected results for the split case; you may want to have a look.
library(git2rdata)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:git2rdata':
#>
#> pull
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
set.seed(123456)
testdata <-
data.frame(group1 = as.Date("2020-01-01"):as.Date("2020-01-10"),
group2 = 1:5,
id_sub = paste0("test",1:1e3)
) %>%
mutate(group1 = as.Date(group1, origin = "1970-01-01")) %>%
as_tibble %>%
expand(group1, group2, id_sub) %>%
mutate(var1 = rnorm(5e4),
var2 = runif(5e4))
testdata %>% write_vc("split/testdata",
sorting = c("group1", "group2", "id_sub"),
split_by = c("group2"))
#> 4516a4b1c2eee604b874ca03cb0c30a32c084ae7
#> "split/testdata.tsv"
#> 76c88bc1396a725758d4e0b513fe714130a91036
#> "split/testdata.yml"
read_vc("split/testdata") %>%
all.equal(testdata, check.attributes = FALSE)
#> [1] "Component \"group1\": Mean relative difference: 0.0001891192"
#> [2] "Component \"group2\": Mean relative difference: 0.6666667"
#> [3] "Component \"var1\": Mean relative difference: 1.357854"
#> [4] "Component \"var2\": Mean relative difference: 0.6434074"
testdata %>% write_vc("notsplit/testdata",
sorting = c("group1", "group2", "id_sub"))
#> 63958eb7da7077b1700d1605efa3d81c55c4770b
#> "notsplit/testdata.tsv"
#> 1da5da00e927938e1b4cc45e1108c67c4d2e7f40
#> "notsplit/testdata.yml"
read_vc("notsplit/testdata") %>%
all.equal(testdata, check.attributes = FALSE)
#> [1] TRUE
Created on 2020-09-23 by the reprex package (v0.3.0)
Session info
devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.2 (2020-06-22)
#> os Linux Mint 20
#> system x86_64, linux-gnu
#> ui X11
#> language nl_BE:nl
#> collate nl_BE.UTF-8
#> ctype nl_BE.UTF-8
#> tz Europe/Brussels
#> date 2020-09-23
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
#> backports 1.1.10 2020-09-15 [1] CRAN (R 4.0.2)
#> callr 3.4.4 2020-09-07 [1] CRAN (R 4.0.2)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.2)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.2)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.2)
#> devtools 2.3.2 2020-09-18 [1] CRAN (R 4.0.2)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.2)
#> dplyr * 1.0.2 2020-08-18 [1] CRAN (R 4.0.2)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.2)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 4.0.2)
#> git2r 0.27.1 2020-05-03 [1] CRAN (R 4.0.2)
#> git2rdata * 0.3.0 2020-09-23 [1] local
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2)
#> htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.2)
#> knitr 1.30 2020-09-22 [1] CRAN (R 4.0.2)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.2)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.2)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.2)
#> pillar 1.4.6 2020-07-10 [1] CRAN (R 4.0.2)
#> pkgbuild 1.1.0 2020-07-13 [1] CRAN (R 4.0.2)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
#> pkgload 1.1.0 2020-05-29 [1] CRAN (R 4.0.2)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.2)
#> processx 3.4.4 2020-09-03 [1] CRAN (R 4.0.2)
#> ps 1.3.4 2020-08-11 [1] CRAN (R 4.0.2)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.2)
#> remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.2)
#> reprex 0.3.0 2019-05-16 [1] CRAN (R 4.0.2)
#> rlang 0.4.7 2020-07-09 [1] CRAN (R 4.0.2)
#> rmarkdown 2.3 2020-06-18 [1] CRAN (R 4.0.2)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.2)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 4.0.2)
#> tibble 3.0.3 2020-07-10 [1] CRAN (R 4.0.2)
#> tidyr * 1.1.2 2020-08-27 [1] CRAN (R 4.0.2)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2)
#> usethis 1.6.3 2020-09-17 [1] CRAN (R 4.0.2)
#> vctrs 0.3.4 2020-08-29 [1] CRAN (R 4.0.2)
#> withr 2.3.0 2020-09-22 [1] CRAN (R 4.0.2)
#> xfun 0.17 2020-09-09 [1] CRAN (R 4.0.2)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /home/floris/lib/R/library
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library
Thanks for the comments @florisvdh. The differences you get is because internally the |
Smaller changes in the split_by vignette
closes #45 |
Good idea @ThierryO 👍 , thanks for the explanation. Confirmed: library(git2rdata)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:git2rdata':
#>
#> pull
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
set.seed(123456)
testdata <-
data.frame(group1 = as.Date("2020-01-01"):as.Date("2020-01-10"),
group2 = 1:5,
id_sub = paste0("test",1:1e3)
) %>%
mutate(group1 = as.Date(group1, origin = "1970-01-01")) %>%
as_tibble %>%
expand(group1, group2, id_sub) %>%
mutate(var1 = rnorm(5e4),
var2 = runif(5e4)) %>%
arrange(group1, group2, id_sub)
testdata %>% write_vc("split/testdata",
sorting = c("group1", "group2", "id_sub"),
split_by = c("group2"))
#> 4516a4b1c2eee604b874ca03cb0c30a32c084ae7
#> "split/testdata.tsv"
#> 76c88bc1396a725758d4e0b513fe714130a91036
#> "split/testdata.yml"
read_vc("split/testdata") %>%
arrange(group1, group2, id_sub) %>%
all.equal(testdata, check.attributes = FALSE)
#> [1] TRUE Created on 2020-09-23 by the reprex package (v0.3.0) Session infodevtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.2 (2020-06-22)
#> os Linux Mint 20
#> system x86_64, linux-gnu
#> ui X11
#> language nl_BE:nl
#> collate nl_BE.UTF-8
#> ctype nl_BE.UTF-8
#> tz Europe/Brussels
#> date 2020-09-23
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
#> backports 1.1.10 2020-09-15 [1] CRAN (R 4.0.2)
#> callr 3.4.4 2020-09-07 [1] CRAN (R 4.0.2)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.2)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.2)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.2)
#> devtools 2.3.2 2020-09-18 [1] CRAN (R 4.0.2)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.2)
#> dplyr * 1.0.2 2020-08-18 [1] CRAN (R 4.0.2)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.2)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 4.0.2)
#> git2r 0.27.1 2020-05-03 [1] CRAN (R 4.0.2)
#> git2rdata * 0.3.0 2020-09-23 [1] local
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2)
#> htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.2)
#> knitr 1.30 2020-09-22 [1] CRAN (R 4.0.2)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.2)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.2)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.2)
#> pillar 1.4.6 2020-07-10 [1] CRAN (R 4.0.2)
#> pkgbuild 1.1.0 2020-07-13 [1] CRAN (R 4.0.2)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
#> pkgload 1.1.0 2020-05-29 [1] CRAN (R 4.0.2)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.2)
#> processx 3.4.4 2020-09-03 [1] CRAN (R 4.0.2)
#> ps 1.3.4 2020-08-11 [1] CRAN (R 4.0.2)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.2)
#> remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.2)
#> reprex 0.3.0 2019-05-16 [1] CRAN (R 4.0.2)
#> rlang 0.4.7 2020-07-09 [1] CRAN (R 4.0.2)
#> rmarkdown 2.3 2020-06-18 [1] CRAN (R 4.0.2)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.2)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 4.0.2)
#> tibble 3.0.3 2020-07-10 [1] CRAN (R 4.0.2)
#> tidyr * 1.1.2 2020-08-27 [1] CRAN (R 4.0.2)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2)
#> usethis 1.6.3 2020-09-17 [1] CRAN (R 4.0.2)
#> vctrs 0.3.4 2020-08-29 [1] CRAN (R 4.0.2)
#> withr 2.3.0 2020-09-22 [1] CRAN (R 4.0.2)
#> xfun 0.17 2020-09-09 [1] CRAN (R 4.0.2)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /home/floris/lib/R/library
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library |