xxHash64 implementation (clean) #5
Codecov Report

Attention: Patch coverage is …

```
@@            Coverage Diff             @@
##             main       #5      +/-   ##
==========================================
- Coverage   99.00%   97.91%   -1.09%
==========================================
  Files           4        5       +1
  Lines         301      432     +131
==========================================
+ Hits          298      423     +125
- Misses          3        9       +6
```

☔ View full report in Codecov by Sentry.
Thank you so much for putting this together! A quick test below shows that the hashes agree with `digest`:

```r
hash1 <- digest::digest(
  object = "ed978d615e301b8d",
  serialize = FALSE,
  algo = "xxhash64"
)
hash2 <- secretbase::xxh64(x = "ed978d615e301b8d")
hash1 == hash2
#> [1] TRUE

x <- crew::crew_controller_local()
hash1 <- digest::digest(
  object = x,
  algo = "xxhash64"
)
hash2 <- secretbase::xxh64(
  x = x,
  convert = TRUE
)
hash1 == hash2
#> [1] TRUE

x$start()
x <- crew::crew_controller_local()
hash1 <- digest::digest(
  object = x,
  algo = "xxhash64"
)
hash2 <- secretbase::xxh64(
  x = x,
  convert = TRUE
)
hash1 == hash2
#> [1] TRUE

temp <- tempfile()
saveRDS(x, temp)
hash1 <- digest::digest(
  object = temp,
  algo = "xxhash64",
  file = TRUE
)
hash2 <- secretbase::xxh64(
  file = temp,
  convert = TRUE
)
hash1 == hash2
#> [1] TRUE

x$terminate()
```
Speed is outstanding!

```r
library(digest)
library(secretbase)
library(microbenchmark)

x <- "ed978d615e301b8d"
microbenchmark(
  digest = digest(x, serialize = FALSE),
  secretbase = xxh64(x)
)
#> Unit: microseconds
#>        expr   min    lq     mean median    uq     max neval
#>      digest 9.014 9.312 11.69342 9.4245 9.613 227.089   100
#>  secretbase 1.401 1.459  2.34300 1.6080 1.676  76.999   100

x <- crew::crew_controller_local()
microbenchmark(
  digest = digest(x),
  secretbase = xxh64(x)
)
#> Unit: milliseconds
#>        expr      min       lq     mean   median       uq      max neval
#>      digest 1.268635 1.329473 1.398098 1.388626 1.442978 1.735737   100
#>  secretbase 1.132179 1.142675 1.191095 1.159126 1.182132 2.182591   100

x <- runif(5e7)
lobstr::obj_size(x)
#> 400.00 MB
system.time(digest(x))
#>    user  system elapsed
#>   1.292   0.191   1.484
system.time(xxh64(x))
#>    user  system elapsed
#>   0.357   0.000   0.357

temp <- tempfile()
saveRDS(x, temp, compress = FALSE)
file.size(temp)
#> [1] 4e+08
system(paste("du -h", temp))
system.time(digest(temp, file = TRUE))
#>    user  system elapsed
#>   0.813   0.064   0.930
system.time(xxh64(file = temp))
#>    user  system elapsed
#>   0.048   0.080   0.129
```

Created on 2024-02-21 with reprex v2.1.0
When this version of `secretbase` is released, I plan to use it in `targets`. One thing I could use advice on is a fast way to convert the 64-bit hash into an 8-character string as a replacement for the 32-bit hashes `targets` currently uses.
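One minimal sketch of such a conversion, assuming truncation of the hex digest is acceptable (the helper name `hash8()` is invented here, and this is not a method from this PR):

```r
# Hypothetical helper: keep the first 8 hex characters (32 bits) of the
# 16-character xxHash64 digest returned by secretbase.
hash8 <- function(x) {
  substr(secretbase::xxh64(x), 1L, 8L)
}
hash8("ed978d615e301b8d")
```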
The intention was really not to have to invalidate your targets again, so let me think about the differences you've raised.

In any case, I submitted a new version earlier with just SHA-256 added, to make incremental changes and make sure the framework checks correctly on all platforms before adding code from a new codebase.
Thanks!
I suppose that's the best course of action, though it is awkward to depend on two different hashing packages.
I just took a quick look at the {targets} code. It seems where you use `digest` on R objects, you first wrap the object in a `list()`.
Right, I forgot I did that. I wanted to use the vectorized version of `digest::getVDigest()`:

```r
> targets:::digest_obj64("x")
[1] "dcfd70a2ca84f10c"
> secretbase::xxh64(list("x"))
[1] "6188cdac5c40ad1b"
```

I think the reason is probably that I chose serialization version 3 to support ALTREP. If I use version 2 and wrap it in another `list()`, the hashes agree:

```r
vdigest <- digest::getVDigest(algo = "xxhash64")
vdigest(list(list("x")), serialize = TRUE)
#> [1] "6188cdac5c40ad1b"
```
So if I change …
```r
secretbase::sha256(list("x"))
#> [1] "efe4523872b45152456abede6d7b643fbca1710636294d1c8dc0e41dd942121a"
digest::digest(list("x"), algo = "sha256")
#> [1] "efe4523872b45152456abede6d7b643fbca1710636294d1c8dc0e41dd942121a"
secretbase::xxh64(list("x"))
#> [1] "6188cdac5c40ad1b"
digest::digest(list("x"), algo = "xxhash64")
#> [1] "6188cdac5c40ad1b"
targets:::digest_obj64("x")
#> [1] "dcfd70a2ca84f10c"
```

Created on 2024-02-22 with reprex v2.1.0

Yes, clearly there is a difference in the function `targets:::digest_obj64()`. Serialization is always v3 XDR in `secretbase`.
Ah, it's documented. I was not familiar with this argument.
Actually, you define …
Yeah.

My original motivation for vectorized digests was the low overhead, not the actual vectorization. A bit of a strange fit, but it helped performance. In fact, there are only a couple of places where I used the vectorization capabilities of the digest utilities in `targets`.
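For illustration, the vectorization in question looks roughly like this: `digest::getVDigest()` returns a closure that hashes a whole character vector in one call, which is where the low overhead comes from.

```r
# One vectorized digest closure applied to several strings at once;
# the result is a character vector of three hashes.
vdigest <- digest::getVDigest(algo = "xxhash64")
vdigest(c("a", "b", "c"), serialize = FALSE)
```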
Right, and that's a good thing. I think we'll be fine on character vectors; I'm just having trouble figuring out how to swap in `secretbase`.
Actually, after some more testing (see below), I think we're close to agreement. The targets themselves use file hashes, character strings agree, and because of the S3 dispatch in …

But it's not the worst thing in the world if the hashes change. Worst case, users' targets are invalidated when they upgrade the package, which just means they have to rerun their pipelines. I hope it doesn't come to that, but it might.

```r
vdigest64 <- digest::getVDigest(algo = "xxhash64")
vdigest64_file <- digest::getVDigest(algo = "xxhash64", errormode = "warn")

digest_obj64_serialize_version2 <- function(object, ...) {
  vdigest64(
    object = list(object),
    serialize = TRUE,
    serializeVersion = 2L,
    file = FALSE,
    seed = 0L,
    ...
  )
}

# targets uses this one (serialization version 3):
digest_obj64_serialize_version3 <- function(object, ...) {
  vdigest64(
    object = list(object),
    serialize = TRUE,
    serializeVersion = 3L, ###
    file = FALSE,
    seed = 0L,
    ...
  )
}

# and this one:
digest_chr64 <- function(object, ...) {
  vdigest64(object, serialize = FALSE, file = FALSE, seed = 0L, ...)
}

# and this one:
digest_file64 <- function(object, ...) {
  vapply(
    X = object,
    FUN = vdigest64_file,
    serialize = FALSE,
    file = TRUE,
    seed = 0L,
    ...,
    FUN.VALUE = character(1L),
    USE.NAMES = FALSE
  )
}

secretbase::xxh64(mtcars)
#> [1] "f4b89e63bc92af79"
digest_obj64_serialize_version2(mtcars)
#> [1] "f4b89e63bc92af79"
digest_obj64_serialize_version3(mtcars)
#> [1] "04304fa81fcd8794"

secretbase::xxh64(1)
#> [1] "853b1797f54b229c"
digest_obj64_serialize_version2(1)
#> [1] "853b1797f54b229c"
digest_obj64_serialize_version3(1)
#> [1] "a2a52dfe4e065e93"

secretbase::xxh64(1:2)
#> [1] "9ce0daa4458b3921"
digest_obj64_serialize_version2(1:2)
#> [1] "75ba6f9d214a6ef2"
digest_obj64_serialize_version3(1:2)
#> [1] "3bc638a2840836a8"

secretbase::xxh64("x")
#> [1] "5c80c09683041123"
digest_chr64("x")
#> [1] "5c80c09683041123"

temp <- tempfile()
saveRDS(mtcars, temp)
digest_file64(temp)
#> [1] "aeaec0e086ac5758"
secretbase::xxh64(file = temp)
#> [1] "aeaec0e086ac5758"
```

Created on 2024-02-22 with reprex v2.1.0
I think the hashing utility functions in {targets} are small and manageable enough that I might be able to make existing projects use the legacy hashing system while pipelines that run from scratch use the new system. Or something similar that does a better job of reproducibility.
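A rough sketch of that idea, with every name invented here for illustration: record which hashing system a project was created under, and pick the matching hasher at run time.

```r
# Hypothetical dispatch between the legacy digest-based hasher and a
# new secretbase-based one, keyed on a flag from project metadata.
choose_object_hasher <- function(legacy = TRUE) {
  if (legacy) {
    function(object) digest::digest(object, algo = "xxhash64")
  } else {
    function(object) secretbase::xxh64(object)
  }
}
hash <- choose_object_hasher(legacy = FALSE)
hash(mtcars)
```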
But there's no rush. So on reflection, my current plan is to only replace the file hashing at first. After that, it would be nice to remove …
But in the long-term future, (1) and (2) could actually become moot. At various points in the life cycle of `targets`, …

```r
vdigest64_file <- digest::getVDigest(algo = "xxhash64", errormode = "warn")
digest_file64 <- function(object, ...) {
  vapply(
    X = object,
    FUN = vdigest64_file,
    serialize = FALSE,
    file = TRUE,
    seed = 0L,
    ...,
    FUN.VALUE = character(1L),
    USE.NAMES = FALSE
  )
}

temp <- tempfile()
file_hashes_agree <- function(temp) {
  digest_file64(temp) == secretbase::xxh64(file = temp)
}

writeLines("x", temp)
file_hashes_agree(temp)
#> [1] TRUE
writeLines(letters, temp)
file_hashes_agree(temp)
#> [1] TRUE
saveRDS(mtcars, temp, compress = TRUE)
file_hashes_agree(temp)
#> [1] TRUE
saveRDS(mtcars, temp, compress = FALSE)
file_hashes_agree(temp)
#> [1] TRUE
qs::qsave(mtcars, temp)
file_hashes_agree(temp)
#> [1] TRUE

x <- crew::crew_controller_local()
x$start()
saveRDS(x, temp, compress = TRUE)
file_hashes_agree(temp)
#> [1] TRUE
saveRDS(x, temp, compress = FALSE)
file_hashes_agree(temp)
#> [1] TRUE
qs::qsave(x, temp, algorithm = "zstd")
file_hashes_agree(temp)
#> [1] TRUE
qs::qsave(x, temp, algorithm = "lz4", preset = "balanced")
file_hashes_agree(temp)
#> [1] TRUE
x$terminate()
```

Created on 2024-02-23 with reprex v2.1.0
The issue is your use of serialization version 3. I've checked the R source, and version 3 adds 2 default encoding headers after the R version headers. It seems amazing that this has never given you any problems in {targets} in the past.
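This is easy to observe from R, taking the interpretation of the extra bytes as encoding headers from the comment above:

```r
# Serialize the same object under versions 2 and 3 and compare lengths.
# The v3 stream should be longer because its header also records the
# native encoding (a 4-byte length plus the string, e.g. "UTF-8").
v2 <- serialize(1L, connection = NULL, version = 2)
v3 <- serialize(1L, connection = NULL, version = 3)
length(v3) - length(v2)  # expected 9 in a UTF-8 session (4 + nchar("UTF-8"))
```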
As far as I'm aware, you could pass `skip = 23` to skip the entire v3 header.
Wow, I never would have thought of that! You're right:

```r
secretbase::xxh64(1:2)
#> [1] "9ce0daa4458b3921"
targets:::digest_obj64(1:2, skip = 23)
#> [1] "9ce0daa4458b3921"
targets:::digest_obj64(1:2, skip = "auto")
#> [1] "3bc638a2840836a8"
```

I thought I could trust the default `skip = "auto"`.
What method does `secretbase` use to skip the serialization headers?
Do you think this warrants an issue in `digest`?
It may very well have. Occasionally I get questions from users who run a project on one machine, download it to another machine, and find that their targets are no longer up to date. Usually we have been able to narrow it down to something specific to their compute environment, but there may have been cases where this was the cause. I think portability is reason enough to change how `targets` hashes objects.

The last remaining choice is up to you: whether to implement a 32-bit hash in `secretbase`, or leave me to truncate the 64-bit one.
Odd, I thought file streams explained how much faster `secretbase` was on files.

I thought the results from my earlier benchmark were conclusive:

```r
x <- runif(5e7)
lobstr::obj_size(x)
#> 400.00 MB
temp <- tempfile()
saveRDS(x, temp, compress = FALSE)
file.size(temp)
#> [1] 4e+08
system(paste("du -h", temp))
system.time(digest(temp, file = TRUE))
#>    user  system elapsed
#>   0.813   0.064   0.930
system.time(xxh64(file = temp))
#>    user  system elapsed
#>   0.048   0.080   0.129
```

But benchmarking is trickier than that. A slightly better benchmark (though still not perfect) is below. (FYI: it consumes a lot of storage.)

```r
temp <- tempfile()
size <- numeric(0L)
digest <- numeric(0L)
secretbase <- numeric(0L)
for (exponent in seq_len(9L)) {
  x <- runif(n = 10 ^ exponent)
  saveRDS(x, temp, compress = FALSE)
  size <- c(size, file.size(temp))
  digest <- c(digest, system.time(targets:::digest_file64(object = temp))["elapsed"])
  secretbase <- c(secretbase, system.time(secretbase::xxh64(file = temp))["elapsed"])
}
unlink(temp)

results <- tibble::tibble(
  size = size,
  log_size = log(size),
  digest = digest,
  secretbase = secretbase
) |>
  tidyr::pivot_longer(
    cols = all_of(c("digest", "secretbase")),
    names_to = "package",
    values_to = "seconds"
  )

results_human <- tibble::tibble(
  size = targets:::units_bytes(size),
  digest = targets:::units_seconds(digest),
  secretbase = targets:::units_seconds(secretbase)
)
print(results_human)
#> # A tibble: 9 × 3
#>   size              digest        secretbase
#>   <chr>             <chr>         <chr>
#> 1 111 bytes         0.006 seconds 0 seconds
#> 2 831 bytes         0.007 seconds 0 seconds
#> 3 8.031 kilobytes   0.008 seconds 0 seconds
#> 4 80.031 kilobytes  0.009 seconds 0.001 seconds
#> 5 800.031 kilobytes 0.009 seconds 0.001 seconds
#> 6 8 megabytes       0.011 seconds 0.002 seconds
#> 7 80 megabytes      0.03 seconds  0.019 seconds
#> 8 800 megabytes     0.243 seconds 0.185 seconds
#> 9 8 gigabytes       3.326 seconds 2.921 seconds

library(ggplot2)
ggplot(results) +
  geom_line(aes(x = log_size, y = seconds, group = package, color = package)) +
  theme_gray(16)
```
Here are the same results but with the file saved with `compress = TRUE`:

```
# A tibble: 9 × 3
  size              digest        secretbase
  <chr>             <chr>         <chr>
1 97 bytes          0.004 seconds 0 seconds
2 543 bytes         0.005 seconds 0 seconds
3 4.695 kilobytes   0.008 seconds 0 seconds
4 45.4 kilobytes    0.008 seconds 0 seconds
5 426.2 kilobytes   0.01 seconds  0.001 seconds
6 4.187 megabytes   0.009 seconds 0.001 seconds
7 41.8 megabytes    0.017 seconds 0.01 seconds
8 417.915 megabytes 0.108 seconds 0.096 seconds
9 4.179 gigabytes   1.191 seconds 0.979 seconds
```
And for completeness, using another compression format:

```
# A tibble: 9 × 3
  size              digest        secretbase
  <chr>             <chr>         <chr>
1 99 bytes          0.005 seconds 0 seconds
2 540 bytes         0.005 seconds 0.001 seconds
3 4.691 kilobytes   0.009 seconds 0 seconds
4 45.409 kilobytes  0.01 seconds  0 seconds
5 426.078 kilobytes 0.009 seconds 0 seconds
6 4.187 megabytes   0.004 seconds 0.002 seconds
7 41.799 megabytes  0.011 seconds 0.01 seconds
8 417.915 megabytes 0.108 seconds 0.098 seconds
9 4.179 gigabytes   1.238 seconds 0.972 seconds
```
I was planning to file a bug report in `digest`.
I'm not following. The 'skip' branch generates the same hashes as `digest`.
Oh, I didn't see the 'skip' branch.
Not sure I follow. I was testing …
It relies on a reading of the R source to know that there are exactly 6 header writes before the data is serialized, so the first 6 are literally skipped. It's not at the level of an API guaranteed by R Core, but that doesn't exist, and the R source doesn't change often; if it did change here, it would probably necessitate a new serialization version. I think it's slightly more robust than knowing the exact number of bytes written.
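For the curious, those six header writes can be inspected from R. The byte offsets below assume the binary XDR layout and an encoding-string length that fits in one byte, so treat this as an illustrative sketch rather than a guaranteed API:

```r
v3 <- serialize(NULL, connection = NULL, version = 3)
rawToChar(v3[1:2])  # 1. format marker "X\n" (binary XDR)
v3[3:6]             # 2. serialization format version (3)
v3[7:10]            # 3. version of R that wrote the stream
v3[11:14]           # 4. minimal R version needed to read it
len <- as.integer(v3[18])         # 5. length of the native encoding string
rawToChar(v3[18 + seq_len(len)])  # 6. the encoding itself, e.g. "UTF-8"
```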
For software as established as …
Version 3.0 actually makes the package a real 'framework' at the C level. You can add a new hash a bit like a recipe. Well, the implementation part at least; it's cleaning up the original code that's painful. I'll leave this for you to consider.
Installing the 'skip' branch:

```r
> secretbase::xxh64(1:2)
[1] "3bc638a2840836a8"
> targets:::digest_obj64(1:2, skip = "auto")
[1] "3bc638a2840836a8"
```
If you use the 'skip' branch version, you should generate the same hashes as `digest`. Then on platforms with a different locale, the hash would differ. For me, starting R with `LANG=C`:

```r
> secretbase::xxh64(NULL)
[1] "e6edff991e9f8c97"
```

Normal R start:

```r
> secretbase::xxh64(NULL)
[1] "da7e5646cbdfced7"
```

With 'regular' `secretbase`, it is always:

```r
> secretbase::xxh64(NULL)
[1] "c85d88fc56f4e042"
```

Sorry, of course I'm doing this with my modified version of `secretbase`; you just want to do the same experiment using `digest` itself!
Ah, I see. I wasn't changing the locale in my tests, and I can now reproduce what you see. Indeed I can reproduce this with `digest` itself:

```
$ LANG="C" R -q -e 'digest::digest(NULL, serializeVersion = 3, skip = "auto")'
> digest::digest(NULL, serializeVersion = 3, skip = "auto")
[1] "bdef078af943dd2546be047d2044d8b5"

$ R -q -e 'digest::digest(NULL, serializeVersion = 3, skip = "auto")'
> digest::digest(NULL, serializeVersion = 3, skip = "auto")
[1] "a611bfa70eb5dcc0a248ed0369794237"
```

Setting either `serializeVersion = 2` or a fixed `skip` avoids this. I think we are aligned on that part now. Here is my train of thought about what this all implies: …

Does that make sense?
So as I understand it, one possible path forward is to: …
I just posted ropensci/targets#1244. I hope it's clear why the portability issue in `targets` needs to be fixed.
Taking a step back here: since I can't achieve back-compatibility anyway, I have more freedom to rethink the whole 32-bit hashing strategy in `targets`.
Yes, there might be a reason we're not aware of.

So that you have a more concrete idea of the issue, it is these 2 lines in the R sources: the encoding which is written into a v3 header. You can see that, apart from the headers written in the switch statement, if you use version 2 and skip those headers, and I use v3 and also skip all those headers, what is left is the same. That's why the hashes agree.
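A hedged way to see this from R, assuming the binary XDR header sizes discussed above (14 bytes for v2; 14 + 4 + nchar(encoding) for v3, i.e. 23 in a UTF-8 session):

```r
v2 <- serialize(list("x"), connection = NULL, version = 2)
v3 <- serialize(list("x"), connection = NULL, version = 3)
# Drop each format's header; the remaining payloads should coincide.
identical(v2[-seq_len(14L)], v3[-seq_len(23L)])
#> expected TRUE in a UTF-8 locale
```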
For this, I'd see no problem with truncating the 64-bit hash. If you can live with that (and I see no reason why not, as this is completely different to the RNG issues), then I'd prefer it over implementing another hash.
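If it helps, a sketch of the truncation idea (details invented here; `strtoi()` returns NA above 2^31 - 1, hence only 7 hex digits in this illustration):

```r
# Truncate the 64-bit hex digest and convert to a non-negative integer,
# e.g. as raw material for an RNG seed.
hash <- secretbase::xxh64("abc")
strtoi(substr(hash, 1L, 7L), base = 16L)
```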
At this point, I will definitely not ask you for another hash. (Thank you for …)
I'll merge this PR when it's clear all the checks are returning clean on the current 0.3.0 release. P.S. In the above, sometimes I referred to …
Actually, here the number of bytes written differs according to the locale, as the locale string is essentially written into the header; I had previously missed this fact. This means that the exact number of bytes to skip is not fixed across platforms.
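Under the same layout assumptions as earlier, the header size for the current session can be computed directly, which shows why no single skip count is portable:

```r
# Bytes before the payload in a v3 binary XDR stream: 14 fixed header
# bytes, a 4-byte length, then the native encoding string itself.
v3 <- serialize(NULL, connection = NULL, version = 3)
enc_len <- as.integer(v3[18])  # assumes the length fits in one byte
14L + 4L + enc_len             # 23 in a "UTF-8" session
```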
Clean PR, can be merged into main.