Commit

Merge pull request #63 from ropensci/split_by_fv
Smaller changes in the split_by vignette
ThierryO authored Sep 23, 2020
2 parents ede240d + f89d5d8 commit 6b28a55
Showing 2 changed files with 9 additions and 10 deletions.
2 changes: 1 addition & 1 deletion codemeta.json
@@ -206,7 +206,7 @@
"sameAs": "https://CRAN.R-project.org/package=yaml"
}
],
"fileSize": "632.073KB",
"fileSize": "762.31KB",
"releaseNotes": "https://github.com/ropensci/git2rdata/blob/master/NEWS.md",
"readme": "https://github.com/ropensci/git2rdata/blob/master/README.md",
"contIntegration": "https://codecov.io/gh/ropensci/git2rdata",
17 changes: 8 additions & 9 deletions vignettes/split_by.Rmd
@@ -129,10 +129,9 @@ In such a case we can use the `split_by` argument of `write_vc()`.
This will store the large dataframe over a set of tab-separated files: one file for every combination of the variables defined by `split_by`.
Every partial data file holds one combination of `split_by`.
We add an `index.tsv` containing the combinations of the `split_by` variables and a unique hash.
This hash becomes the base name of the partial data files.
The combination of the hash in the `index.tsv` and the base name of the partial data files makes the information of `split_by` in the partial data file redundant.
We remove the `split_by` variables from the partial data files, reducing their size.
We add an `index.tsv` containing the combinations of the `split_by` variables and a unique hash for each combination.
This hash becomes the base name of the partial data files.
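
The mechanism above can be sketched in a few lines of R. This is a minimal illustration, not code from the commit: the dataframe `beech`, its variables, and the `sorting` choice are all assumptions, and it presumes a git2rdata version that supports `split_by`.

```r
library(git2rdata)

# Hypothetical dataframe with a grouping variable `year`.
beech <- data.frame(
  year = rep(2018:2020, each = 4),
  count = rpois(12, 10)
)

# One tab-separated partial data file per year, plus an index.tsv
# linking each year to the hash used as the partial file's base name.
# The `year` column itself is not repeated inside the partial files.
write_vc(
  beech, file = "beech", root = tempdir(),
  sorting = c("year", "count"), split_by = "year"
)
```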

## When to Split the Dataframe

@@ -151,14 +150,14 @@ Let's set the following variables:
- $N_s$: the number of unique combinations of the `split_by` variables.

Storing the dataframe with `write_vc()` without `split_by` requires $h_s + h_r + 1$ bytes for the header and $s + r + 1$ bytes for every observation.
The total number of bytes is `T_0 = h_s + h_r + 1 + N (s + r + 1)`.
The `+ 1` originates from the tab character to separate the `split_by` variables from the remaining variables.
The total number of bytes is $T_0 = h_s + h_r + 1 + N (s + r + 1)$.
The $+ 1$ originates from the tab character to separate the `split_by` variables from the remaining variables.

Storing the dataframe with `write_vc()` with `split_by` requires an index file to store the combinations of the `split_by` variables.
`h_s` bytes for the header and `N_s s` for the data.
It will use $h_s$ bytes for the header and $N_s s$ for the data.
The headers of the partial data files require $N_s h_r$ bytes ($N_s$ files and $h_r$ byte per file).
The data in the partial data files require $N r$ bytes.
The total number of bytes is `T_s = h_s + N_s s + N_s h_r + N r`.
The total number of bytes is $T_s = h_s + N_s s + N_s h_r + N r$.

We can look at the ratio of $T_s$ over $T_0$.
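
A quick numeric sketch of that ratio, following the notation above; every byte count below is an assumed illustrative value, not measured data.

```r
# Illustrative byte counts (all values assumed).
h_s <- 10   # header bytes for the split_by variables
h_r <- 40   # header bytes for the remaining variables
s   <- 6    # bytes per observation for the split_by variables
r   <- 50   # bytes per observation for the remaining variables
N   <- 1e5  # number of observations
N_s <- 100  # unique combinations of the split_by variables

T_0 <- h_s + h_r + 1 + N * (s + r + 1)  # single-file storage
T_s <- h_s + N_s * s + N_s * h_r + N * r  # split_by storage
T_s / T_0  # a ratio below 1 means split_by saves disk space here
```

With few unique combinations relative to $N$, the $N r$ term dominates $T_s$ and the ratio stays below 1; many near-unique combinations inflate the $N_s h_r$ header overhead instead.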

@@ -197,13 +196,13 @@ ggplot(combinations, aes(x = b, y = ratio, colour = factor(a))) +
geom_line() +
facet_wrap(~ paste("r =", r)) +
scale_x_continuous(
"b = N_s / N",
expression(b~{"="}~N[s]~{"/"}~N),
labels = function(x) {
paste0(100 * x, "%")
}
) +
scale_y_continuous(
"Relative amount of disc space",
"Relative amount of disk space",
labels = function(x) {
paste0(100 * x, "%")
}
