Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xxHash64 implementation (clean) #5

Merged
merged 2 commits into from
Mar 5, 2024
Merged

xxHash64 implementation (clean) #5

merged 2 commits into from
Mar 5, 2024

Conversation

shikokuchuo
Copy link
Owner

Clean PR, can be merged into main.

@shikokuchuo shikokuchuo added the enhancement New feature or request label Feb 19, 2024
@codecov-commenter
Copy link

codecov-commenter commented Feb 19, 2024

Codecov Report

Attention: Patch coverage is 95.41985% with 6 lines in your changes are missing coverage. Please review.

Project coverage is 97.91%. Comparing base (aa0de0b) to head (0020df9).

Files Patch % Lines
src/secret3.c 95.31% 6 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main       #5      +/-   ##
==========================================
- Coverage   99.00%   97.91%   -1.09%     
==========================================
  Files           4        5       +1     
  Lines         301      432     +131     
==========================================
+ Hits          298      423     +125     
- Misses          3        9       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@wlandau
Copy link

wlandau commented Feb 21, 2024

Thank you so much for putting this together!

A quick test below shows that digest and secretbase agree on the objects I tried. I think there will be some disagreement in practice in targets because it does not look like there is an equivalent of digest::digest(serialize = TRUE) for character vectors, but without an xxhash32 implementation, pipelines will be temporarily invalidated in other ways.

hash1 <- digest::digest(
  object = "ed978d615e301b8d",
  serialize = FALSE,
  algo = "xxhash64"
)
hash2 <- secretbase::xxh64(x = "ed978d615e301b8d")
hash1 == hash2
#> [1] TRUE
x <- crew::crew_controller_local()
hash1 <- digest::digest(
  object = x,
  algo = "xxhash64"
)
hash2 <- secretbase::xxh64(
  x = x,
  convert = TRUE
)
hash1 == hash2
#> [1] TRUE
x$start()
x <- crew::crew_controller_local()
hash1 <- digest::digest(
  object = x,
  algo = "xxhash64"
)
hash2 <- secretbase::xxh64(
  x = x,
  convert = TRUE
)
hash1 == hash2
#> [1] TRUE
temp <- tempfile()
saveRDS(x, temp)
hash1 <- digest::digest(
  object = temp,
  algo = "xxhash64",
  file = TRUE
)
hash2 <- secretbase::xxh64(
  file = temp,
  convert = TRUE
)
hash1 == hash2
#> [1] TRUE
x$terminate()

@wlandau
Copy link

wlandau commented Feb 21, 2024

Speed is outstanding!

library(digest)
library(secretbase)
library(microbenchmark)
x <- "ed978d615e301b8d"
microbenchmark(
  digest = digest(x, serialize = FALSE),
  secretbase = xxh64(x)
)
#> Unit: microseconds
#>        expr   min    lq     mean median    uq     max neval
#>      digest 9.014 9.312 11.69342 9.4245 9.613 227.089   100
#>  secretbase 1.401 1.459  2.34300 1.6080 1.676  76.999   100
x <- crew::crew_controller_local()
microbenchmark(
  digest = digest(x),
  secretbase = xxh64(x)
)
#> Unit: milliseconds
#>        expr      min       lq     mean   median       uq      max neval
#>      digest 1.268635 1.329473 1.398098 1.388626 1.442978 1.735737   100
#>  secretbase 1.132179 1.142675 1.191095 1.159126 1.182132 2.182591   100
x <- runif(5e7)
lobstr::obj_size(x)
#> 400.00 MB
system.time(digest(x))
#>    user  system elapsed 
#>   1.292   0.191   1.484
system.time(xxh64(x))
#>    user  system elapsed 
#>   0.357   0.000   0.357
temp <- tempfile()
saveRDS(x, temp, compress = FALSE)
file.size(temp)
#> [1] 4e+08
system(paste("du -h", temp))
system.time(digest(temp, file = TRUE))
#>    user  system elapsed 
#>   0.813   0.064   0.930
system.time(xxh64(file = temp))
#>    user  system elapsed 
#>   0.048   0.080   0.129

Created on 2024-02-21 with reprex v2.1.0

@wlandau
Copy link

wlandau commented Feb 21, 2024

When this version of secretbase reaches CRAN, I will be ready to migrate the xxhash64 algorithm in targets to use secretbase rather than digest.

One thing I could use advice on is a fast way to convert the 64-bit hash into an 8-character string as a replacement for xxhash32 (the latter of which is just used to create stable target names in some cases). Should I start with xxh64(convert = FALSE), do modular arithmetic on the raw vector, and then convert the result to a character string? Do you happen to know of a fast way to do this in R?

@shikokuchuo
Copy link
Owner Author

The intention was really not to have to invalidate your targets again - so let me think about the differences you've raised. I had originally thought xxHash64 would be where the most benefit would lie (in terms of performance and memory), and you could retain digest for the 32bit hash.

In any case, I submitted a new version earlier with just SHA-256 added to make incremental changes, make sure the framework checks correctly on all platforms before adding code from a new codebase.

@wlandau
Copy link

wlandau commented Feb 21, 2024

The intention was really not to have to invalidate your targets again - so let me think about the differences you've raised.

Thanks!

I had originally thought xxHash64 would be where the most benefit would lie (in terms of performance and memory), and you could retain digest for the 32bit hash.

I suppose that's the best course of action, though it is awkward to depend on two different hashing packages.

@shikokuchuo
Copy link
Owner Author

A quick test below shows that digest and secretbase agree on the objects I tried. I think there will be some disagreement in practice in targets because it does not look like there is an equivalent of digest::digest(serialize = TRUE) for character vectors, but without an xxhash32 implementation, pipelines will be temporarily invalidated in other ways.

I just took a quick look at {targets} code. It seems where you use digest_obj64(), you always wrap this in a list, in which case you'd get the same result from secretbase. For digest_chr64() you turn off serialization, and this would also not be serialized in secretbase. If you confirm this is the case, then it's just the use of xxhash32 remaining?

@wlandau
Copy link

wlandau commented Feb 22, 2024

Right, I forgot I did that. I wanted to use the vectorized version of digest for performance, and the list() workaround gave me the same reduction in overhead without having to vectorize. In theory digest_obj64() should agree with secretbase, but it appears there is actually a disagreement.

> targets:::digest_obj64("x")
[1] "dcfd70a2ca84f10c"
> secretbase::xxh64(list("x"))
[1] "6188cdac5c40ad1b"

I think the reason is probably that I chose serialization version 3 to support ALTREP. If I use version 2 and wrap it in another list(), then we agree.

vdigest <- digest::getVDigest(algo = "xxhash64")
vdigest(list(list("x")), serialize = TRUE)
#> [1] "6188cdac5c40ad1b"

@wlandau
Copy link

wlandau commented Feb 22, 2024

So if I change digest_obj64() to use serialization version 2, then digest_obj64(list("x")) agrees with secretbase::xxh64(list("x")). The trouble is that targets has already been doing the equivalent of digest_obj64("x").

@shikokuchuo
Copy link
Owner Author

secretbase::sha256(list("x"))
#> [1] "efe4523872b45152456abede6d7b643fbca1710636294d1c8dc0e41dd942121a"
digest::digest(list("x"), algo = "sha256")
#> [1] "efe4523872b45152456abede6d7b643fbca1710636294d1c8dc0e41dd942121a"

secretbase::xxh64(list("x"))
#> [1] "6188cdac5c40ad1b"
digest::digest(list("x"), algo = "xxhash64")
#> [1] "6188cdac5c40ad1b"
targets:::digest_obj64("x")
#> [1] "dcfd70a2ca84f10c"

Created on 2024-02-22 with reprex v2.1.0

Yes, clearly there is a difference in the function targets uses from digest.

Serialization is always v3 XDR in secretbase. If different serialization versions are being used then that is in digest.

@shikokuchuo
Copy link
Owner Author

Ah it's documented. I was not familiar with this getVdigest() business.

Note that since one hash summary will be returned for each element passed as input, care must be taken when determining whether or not to include the data structure as part of the object. For instance, to return the equivalent output of digest(list("a")) it would be necessary to wrap the list object itself getVDigest()(list(list("a")))

@shikokuchuo
Copy link
Owner Author

Actually you define hash_object() as a generic and your character method in terms of digest_chr64(), so you're not actually serializing character vectors at all, are you? Is what you need the inverse, which is a way for secretbase to hash an unserialized character vector of length > 1?

secretbase and digest agree on a character string as that's somewhat of a special case where the characters are hashed 'as is' and not as a nul-terminated c string.

@wlandau
Copy link

wlandau commented Feb 22, 2024

Actually you define hash_object() as a generic and your character method in terms of digest_chr64(), so you're not actually serializing character vectors at all, are you?

Yeah, targets tries to avoid serialization for character strings.

Is what you need the inverse, which is a way for secretbase to hash an unserialized character vector of length > 1?

My original motivation for vectorized digests was the low overhead, not the actual vectorization. A bit of a strange fit, but it helped performance. In fact, there are only a couple places where I used the vectorization capabilities of the digest utilities in utils_digest.R, and I removed them in ropensci/targets@41e7a03 and ropensci/targets@0625aa2. Now, it looks like everywhere in the code I collapse strings down to length 1 before hashing them, or explicitly map over the elements. I will need to test before I completely get away from built-in support for vectors of length > 1, but I'm positive it will be possible (and easy).

secretbase and digest agree on a character string as that's somewhat of a special case where the characters are hashed 'as is' and not as a nul-terminated c string.

Right, and that's a good thing. I think we'll be fine on character vectors, I just think I'm having trouble figuring out how to swap in secretbase::xxh64() as the backend for targets:::digest_obj64() without a breaking change.

@wlandau
Copy link

wlandau commented Feb 22, 2024

I just think I'm having trouble figuring out how to swap in secretbase::xxh64() as the backend for targets:::digest_obj64() without a breaking change.

Actually, after some more testing (see below), I think we're close to agreement. The targets themselves use file hashes, character strings agree, and because of the S3 dispatch in hash_object(), the imports that get sent to digest_obj64() are not character strings. The only differences I am seeing now have something to do with non-character vectors and serialization methods.

But it's not the worst thing in the world if the hashes change. Worst case is users' targets are invalidated when they upgrade the package upgrade, which just means they have to rerun their pipelines. I hope it doesn’t come to that, but It might.

vdigest64 <- digest::getVDigest(algo = "xxhash64")
vdigest64_file <- digest::getVDigest(algo = "xxhash64", errormode = "warn")

digest_obj64_serialize_version2 <- function(object, ...) {
  vdigest64(
    object = list(object),
    serialize = TRUE,
    serializeVersion = 2L,
    file = FALSE,
    seed = 0L,
    ...
  )
}

# targets uses this one (serialization version 3):
digest_obj64_serialize_version3 <- function(object, ...) {
  vdigest64(
    object = list(object),
    serialize = TRUE,
    serializeVersion = 3L, ###
    file = FALSE,
    seed = 0L,
    ...
  )
}

# and this one:
digest_chr64 <- function(object, ...) {
  vdigest64(object, serialize = FALSE, file = FALSE, seed = 0L, ...)
}

# and this one:
digest_file64 <- function(object, ...) {
  vapply(
    X = object,
    FUN = vdigest64_file,
    serialize = FALSE,
    file = TRUE,
    seed = 0L,
    ...,
    FUN.VALUE = character(1L),
    USE.NAMES = FALSE
  )
}

secretbase::xxh64(mtcars)
#> [1] "f4b89e63bc92af79"
digest_obj64_serialize_version2(mtcars)
#> [1] "f4b89e63bc92af79"
digest_obj64_serialize_version3(mtcars)
#> [1] "04304fa81fcd8794"

secretbase::xxh64(1)
#> [1] "853b1797f54b229c"
digest_obj64_serialize_version2(1)
#> [1] "853b1797f54b229c"
digest_obj64_serialize_version3(1)
#> [1] "a2a52dfe4e065e93"

secretbase::xxh64(1:2)
#> [1] "9ce0daa4458b3921"
digest_obj64_serialize_version2(1:2)
#> [1] "75ba6f9d214a6ef2"
digest_obj64_serialize_version3(1:2)
#> [1] "3bc638a2840836a8"

secretbase::xxh64("x")
#> [1] "5c80c09683041123"
digest_chr64("x")
#> [1] "5c80c09683041123"

temp <- tempfile()
saveRDS(mtcars, temp)
digest_file64(temp)
#> [1] "aeaec0e086ac5758"
secretbase::xxh64(file = temp)
#> [1] "aeaec0e086ac5758"

Created on 2024-02-22 with reprex v2.1.0

@wlandau
Copy link

wlandau commented Feb 23, 2024

I think the hashing utility functions in {targets} are small and manageable enough that I might be able to make existing projects use the legacy hashing system while pipelines that run from scratch use the new system. Or something similar which does a better job at reproducibility.

@wlandau
Copy link

wlandau commented Feb 23, 2024

But there's no rush. targets already depends on both secretbase and digest, and both packages are very light. I think almost all of the noticeable performance and memory improvements will come from hashing files, and as best I can tell, agreement there is perfect (see below).

So on reflection, my current plan is to only replace digest with secretbase in targets:::digest_file64(), which should take care of most of the performance gains and be fully back-compatible.

After that, it would be nice to remove digest as a dependency, but it's not urgent, and I can wait as many releases as it takes. To be clear, the current reasons to keep digest in targets are:

  1. There are disagreements between xxh64() and targets:::digest_obj64() (see above). Partly this is due to serialization V2 vs V3, but I don't understand what's happening with V2 and numeric vectors.
  2. xxhash32 would be needed to replace digest in targets:::digest_chr32().

But in the long-term future, (1) and (2) could actually become moot. At various points in the life cycle of targets, I needed to make an update that either changed the hashes or invalidated everyone's pipelines. The most recent episode was the parallel RNG seed issue, and that just get done. If something like that ever comes up again (which I hope it doesn't), then it won't matter if hashes are back-compatible for that release, and I can take the opportunity to make a complete switch to secretbase without putting an extra burden on users.

vdigest64_file <- digest::getVDigest(algo = "xxhash64", errormode = "warn")

digest_file64 <- function(object, ...) {
  vapply(
    X = object,
    FUN = vdigest64_file,
    serialize = FALSE,
    file = TRUE,
    seed = 0L,
    ...,
    FUN.VALUE = character(1L),
    USE.NAMES = FALSE
  )
}

temp <- tempfile()

file_hashes_agree <- function(temp) {
  digest_file64(temp) == secretbase::xxh64(file = temp)
}

writeLines("x", temp)
file_hashes_agree(temp)
#> [1] TRUE

writeLines(letters, temp)
file_hashes_agree(temp)
#> [1] TRUE

saveRDS(mtcars, temp, compress = TRUE)
file_hashes_agree(temp)
#> [1] TRUE

saveRDS(mtcars, temp, compress = FALSE)
file_hashes_agree(temp)
#> [1] TRUE

qs::qsave(mtcars, temp)
file_hashes_agree(temp)
#> [1] TRUE

x <- crew::crew_controller_local()
x$start()
saveRDS(x, temp, compress = TRUE)
file_hashes_agree(temp)
#> [1] TRUE

saveRDS(x, temp, compress = FALSE)
file_hashes_agree(temp)
#> [1] TRUE
  
qs::qsave(x, temp, algorithm = "zstd")
file_hashes_agree(temp)
#> [1] TRUE

qs::qsave(x, temp, algorithm = "lz4", preset = "balanced")
file_hashes_agree(temp)
#> [1] TRUE

x$terminate()

Created on 2024-02-23 with reprex v2.1.0

@shikokuchuo
Copy link
Owner Author

  1. There are disagreements between xxh64() and targets:::digest_obj64() (see above). Partly this is due to serialization V2 vs V3, but I don't understand what's happening with V2 and numeric vectors.

The issue is your use of digest with serialization version 3 (which is not the default for that package). The 'skip' parameter is still set to that for version 2. As it does not operate at a low level like secretbase, it actually just truncates a set number of bytes from the start of the serialized vector, not really skipping. The default (aligning with version 2) is 14 bytes, to remove the entire header for version 3 would require 23 bytes. If you add the parameter skip = 23, you will get the same result in digest as secretbase.

I've checked the R source and version 3 adds 2 default encoding headers after the R version headers. secretbase skips all of these to ensure portability across all systems. If you use digest, specify serialization version 3 and do not change the skip parameter, then this is not portable. To demonstrate, I have uploaded to the 'skip' branch a version that aligns with the behaviour of digest(serializeVersion=3). This throws a test error on the rhub debian-clang-devel machine as it has the configuration: Debian Linux, R-devel, clang, ISO-8859-15 locale.

It seems amazing if this has never given you any problems in {targets} in the past.

@shikokuchuo
Copy link
Owner Author

So on reflection, my current plan is to only replace digest with secretbase in targets:::digest_file64(), which should take care of most of the performance gains and be fully back-compatible.

As far as I'm aware, digest also uses a streaming hash algorithm for files, so it doesn't make sense to me that there'd be much of a performance gain here. I saw your benchmark above, but I'm not sure if I'm reading that correctly...

@wlandau
Copy link

wlandau commented Feb 23, 2024

The issue is your use of digest with serialization version 3 (which is not the default for that package). The 'skip' parameter is still set to that for version 2. As it does not operate at a low level like secretbase, it actually just truncates a set number of bytes from the start of the serialized vector, not really skipping. The default (aligning with version 2) is 14 bytes, to remove the entire header for version 3 would require 23 bytes. If you add the parameter skip = 23, you will get the same result in digest as secretbase.

Wow, I never would have thought of that! You're right:

secretbase::xxh64(1:2)
#> [1] "9ce0daa4458b3921"
targets:::digest_obj64(1:2, skip = 23)
#> [1] "9ce0daa4458b3921"
targets:::digest_obj64(1:2, skip = "auto")
#> [1] "3bc638a2840836a8"

I thought I could trust the default skip = "auto", but it looks like description in the help file makes no mention of serializeVersion.

I've checked the R source and version 3 adds 2 default encoding headers after the R version headers. secretbase skips all of these to ensure portability across all systems.

What method does secretbase use to robustly detect and skip these headers?

If you use digest, specify serialization version 3 and do not change the skip parameter, then this is not portable. To demonstrate, I have uploaded to the 'skip' branch a version that aligns with the behaviour of digest(serializeVersion=3). This throws a test error on the rhub debian-clang-devel machine as it has the configuration: Debian Linux, R-devel, clang, ISO-8859-15 locale.

Do you think this warrants an issue in digest?

It seems amazing if this has never given you any problems in {targets} in the past.

It may very well have. Ocassionally I get questions about users who run a project on one machine, download it to another machine, and find that their targets are no longer up to date. Usually we have been able to narrow it down to something specific to their compute environment, but there may have been cases where this was the cause.

I think portability is reason enough to change how targets does hashing, even if upgrading the package causes user pipelines to rerun. When the next version of secretbase with xxhash64 is on CRAN, I can start to make the switch.

The last remaining choice is up to you: whether to implement xxhash32 so all the hashing in targets can straightforwardly move to secretbase, or whether I just use xxhash64 and take the least significant 32 bits of the output.

@wlandau
Copy link

wlandau commented Feb 23, 2024

As far as I'm aware, digest also uses a streaming hash algorithm for files, so it doesn't make sense to me that there'd be much of a performance gain here. I saw your benchmark above, but I'm not sure if I'm reading that correctly...

Odd, I thought file streams explained how much faster secretbase is running on files.

I thought the results from digest took a lot longer than secretbase based on this result:

x <- runif(5e7)
lobstr::obj_size(x)
#> 400.00 MB
temp <- tempfile()
saveRDS(x, temp, compress = FALSE)
file.size(temp)
#> [1] 4e+08
system(paste("du -h", temp))
system.time(digest(temp, file = TRUE))
#>    user  system elapsed 
#>   0.813   0.064   0.930
system.time(xxh64(file = temp))
#>    user  system elapsed 
#>   0.048   0.080   0.129

But benchmarking is trickier than that. A slightly better benchmark (though still not perfect) is below. (FYI: it consumes a lot of storage.)

temp <- tempfile()
size <- numeric(0L)
digest <- numeric(0L)
secretbase <- numeric(0L)
for (exponent in seq_len(9L)) {
  x <- runif(n = 10 ^ exponent)
  saveRDS(x, temp, compress = FALSE)
  size <- c(size, file.size(temp))
  digest <- c(digest, system.time(targets:::digest_file64(object = temp))["elapsed"])
  secretbase <- c(secretbase, system.time(secretbase::xxh64(file = temp))["elapsed"])
}
unlink(temp)
results <- tibble::tibble(
  size = size,
  log_size = log(size),
  digest = digest,
  secretbase = secretbase
) |>
  tidyr::pivot_longer(
    cols = all_of(c("digest", "secretbase")),
    names_to = "package",
    values_to = "seconds"
  )
results_human <- tibble::tibble(
  size = targets:::units_bytes(size),
  digest = targets:::units_seconds(digest),
  secretbase = targets:::units_seconds(secretbase)
)

print(results_human)
#> # A tibble: 9 × 3
#>   size              digest        secretbase   
#>   <chr>             <chr>         <chr>        
#> 1 111 bytes         0.006 seconds 0 seconds    
#> 2 831 bytes         0.007 seconds 0 seconds    
#> 3 8.031 kilobytes   0.008 seconds 0 seconds    
#> 4 80.031 kilobytes  0.009 seconds 0.001 seconds
#> 5 800.031 kilobytes 0.009 seconds 0.001 seconds
#> 6 8 megabytes       0.011 seconds 0.002 seconds
#> 7 80 megabytes      0.03 seconds  0.019 seconds
#> 8 800 megabytes     0.243 seconds 0.185 seconds
#> 9 8 gigabytes       3.326 seconds 2.921 seconds

library(ggplot2)
ggplot(results) +
  geom_line(aes(x = log_size, y = seconds, group = package, color = package)) +
  theme_gray(16)

Screenshot 2024-02-23 at 10 06 57 AM

secretbase still wins for files in this one test, but not by as much as I first thought.

@wlandau
Copy link

wlandau commented Feb 23, 2024

Here are the same results but with the file saved with qs::qsave(x, temp) instead of saveRDS(x, temp, compress = FALSE). Same story:

# A tibble: 9 × 3
  size              digest        secretbase   
  <chr>             <chr>         <chr>        
1 97 bytes          0.004 seconds 0 seconds    
2 543 bytes         0.005 seconds 0 seconds    
3 4.695 kilobytes   0.008 seconds 0 seconds    
4 45.4 kilobytes    0.008 seconds 0 seconds    
5 426.2 kilobytes   0.01 seconds  0.001 seconds
6 4.187 megabytes   0.009 seconds 0.001 seconds
7 41.8 megabytes    0.017 seconds 0.01 seconds 
8 417.915 megabytes 0.108 seconds 0.096 seconds
9 4.179 gigabytes   1.191 seconds 0.979 seconds

Screenshot 2024-02-23 at 10 12 24 AM

@wlandau
Copy link

wlandau commented Feb 23, 2024

And for completeness, using digest::digest(object = temp, algo = "xxhash64", file = TRUE) directly instead of targets:::digest_file64():

# A tibble: 9 × 3
  size              digest        secretbase   
  <chr>             <chr>         <chr>        
1 99 bytes          0.005 seconds 0 seconds    
2 540 bytes         0.005 seconds 0.001 seconds
3 4.691 kilobytes   0.009 seconds 0 seconds    
4 45.409 kilobytes  0.01 seconds  0 seconds    
5 426.078 kilobytes 0.009 seconds 0 seconds    
6 4.187 megabytes   0.004 seconds 0.002 seconds
7 41.799 megabytes  0.011 seconds 0.01 seconds 
8 417.915 megabytes 0.108 seconds 0.098 seconds
9 4.179 gigabytes   1.238 seconds 0.972 seconds

Screenshot 2024-02-23 at 10 15 04 AM

@wlandau
Copy link

wlandau commented Feb 23, 2024

It seems amazing if this has never given you any problems in {targets} in the past.

I was planning to file a bug report in targets citing potential non-portability of hashes, but I am having trouble reproducing hash differences on my Mac and Linux machines. I can only reproduce differences with secretbase. Did you happen to come across any evidence of portability issues in digest::digest(serializeVersion = 3, skip = "auto")?

@shikokuchuo
Copy link
Owner Author

It seems amazing if this has never given you any problems in {targets} in the past.

I was planning to file a bug report in targets citing potential non-portability of hashes, but I am having trouble reproducing hash differences on my Mac and Linux machines. I can only reproduce differences with secretbase. Did you happen to come across any evidence of portability issues in digest::digest(serializeVersion = 3, skip = "auto")?

I'm not following. The 'skip' branch secretbase just mimics digest::digest(serializeVersion = 3, skip = "auto") - i.e. it only skips the first 4 headers rather than the full 6. If you're saying there are differences with one but not the other, then also the two would then not produce the same hashes on some platforms?

@wlandau
Copy link

wlandau commented Feb 23, 2024

The 'skip' branch secretbase just mimics digest::digest(serializeVersion = 3, skip = "auto") - i.e. it only skips the first 4 headers rather than the full 6.

Oh, I didn't see the skip branch of secretbase. That would certainly help agreement between secretbase and how targets is using digest. But would it be correct and portable? You mentioned that all 6 headers should be skipped for serialization V3, or else the hash might not be portable across machines.

If you're saying there are differences with one but not the other, then also the two would then not produce the same hashes on some platforms?

Not sure I follow. I was testing digest::digest(serializationVersion = 3, skip = "auto") as well as targets::digest_obj64() on the different platforms I can access, and the hashes across those platforms appeared to agree even though not enough headers are being skipped for serialization V3.

@shikokuchuo
Copy link
Owner Author

shikokuchuo commented Feb 23, 2024

What method does secretbase use to robustly detect and skip these headers?

It relies on a reading of the R source to know that there are exactly 6 header writes before the data is serialized, so the first 6 are literally skipped. It's not at the level of an API guaranteed by R Core, but that doesn't exist and the R source doesn't change often, and if it did here, it would probably necessitate a new serialization version. I think it's slightly more robust than knowing the exact number of bytes written.

If you use digest, specify serialization version 3 and do not change the skip parameter, then this is not portable. To demonstrate, I have uploaded to the 'skip' branch a version that aligns with the behaviour of digest(serializeVersion=3). This throws a test error on the rhub debian-clang-devel machine as it has the configuration: Debian Linux, R-devel, clang, ISO-8859-15 locale.

Do you think this warrants an issue in digest?

For software as established as digests, breaking changes would need to be well-motivated, so you would have a case if you continued using digests.

The last remaining choice is up to you: whether to implement xxhash32 so all the hashing in targets can straightforwardly move to secretbase, or whether I just use xxhash64 and take the least significant 32 bits of the output.

Version 3.0 actually makes the package a real 'framework' at the C level. You can add a new hash a bit like a recipe. Well the implementation part at least. It's cleaning up the original code that's painful. I'll leave this to consider.

@wlandau
Copy link

wlandau commented Feb 23, 2024

Installing the skip branch resolves the apparent disagreement between targets and secretbase in the test I cited:

> secretbase::xxh64(1:2)
[1] "3bc638a2840836a8"
> targets:::digest_obj64(1:2, skip = "auto")
[1] "3bc638a2840836a8"

@shikokuchuo
Copy link
Owner Author

shikokuchuo commented Feb 23, 2024

The 'skip' branch secretbase just mimics digest::digest(serializeVersion = 3, skip = "auto") - i.e. it only skips the first 4 headers rather than the full 6.

Oh, I didn't see the skip branch of secretbase. That would certainly help agreement between secretbase and how targets is using digest. But would it be correct and portable? You mentioned that all 6 headers should be skipped for serialization V3, or else the hash might not be portable across machines.

If you use the 'skip' branch version. You should generate the same hashes as digests. Then on platforms with a different locale the hash would differ.

For me, start R with LANG="C" R

> secretbase::xxh64(NULL)
[1] "e6edff991e9f8c97"

Normal R start:

> secretbase::xxh64(NULL)
[1] "da7e5646cbdfced7"

with 'regular' secretbase, it is always:

> secretbase::xxh64(NULL)
[1] "c85d88fc56f4e042"

Sorry of course I'm doing this with my modified version of secretbase, you just want to do the same experiment using digests itself!

@wlandau
Copy link

wlandau commented Feb 23, 2024

If you use the 'skip' branch version. You should generate the same hashes as digests. Then on platforms with a different locale the hash would differ.

For me, start R with LANG="C" R

secretbase::xxh64(NULL)
[1] "e6edff991e9f8c97"
Normal R start:

secretbase::xxh64(NULL)
[1] "da7e5646cbdfced7"
with 'regular' secretbase, it is always:

secretbase::xxh64(NULL)
[1] "c85d88fc56f4e042"

Ah, I see. I wasn't changing the locale in my tests, and I can now reproduce what you see. Indeed I can reproduce this with digest:

$ LANG="C" R -q -e 'digest::digest(NULL, serializeVersion = 3, skip = "auto")'
> digest::digest(NULL, serializeVersion = 3, skip = "auto")
[1] "bdef078af943dd2546be047d2044d8b5"

$ R -q -e 'digest::digest(NULL, serializeVersion = 3, skip = "auto")'
> digest::digest(NULL, serializeVersion = 3, skip = "auto")
[1] "a611bfa70eb5dcc0a248ed0369794237"

Setting either serializationVersion = 2 or skip = 23 makes the hashes agree.

I think we are aligned on that part now. Here is my train of thought about what this all implies:

  1. Using the hashing utility functions in targets, I see LANG="C" R -e 'targets:::digest_obj64(NULL)' returns "02f794ac27515874", whereas R -e 'targets:::digest_obj64(NULL)' returns "da7e5646cbdfced7".
  2. If a targets user runs a pipeline on one machine and transfers it to a different machine, the difference in locale could invalidate the original hashes and cause everything in the pipeline rerun. There may have already been cases of this.
  3. This portability issue could be a bug in digest, or it may be intentional behavior. Regardless, this is a serious problem for targets.
  4. The only way to fix this portability bug in targets is to either change the serialization version or change which headers are skipped. Either change will unavoidably invalidate everyone's pipelines and cause them to rerun.
  5. Because of (4), strict agreement between secretbase::xxh64() and targets::digest_obj64() is no longer necessary. Fixing the bug in (4) is already a (sort-of) breaking change, so further changes in the hashing strategy cannot cause any additional disruption. (That disruption will already happen due to fixing (4)).

Does that make sense?

@wlandau
Copy link

wlandau commented Feb 23, 2024

So as I understand it, one possible path forward is to:

  1. Raise a discussion in the digest package, at least to understand why the locale headers are not skipped in serialization version 3.
  2. File a bug report in targets to let users know.
  3. Incorporate the xxh64 branch of secretbase into targets, which will change the hashes but make them portable.
  4. Because (3) changes the hashes anyway, move to completely replace digest with secretbase in targets.
  5. For 32-bit hashes, I could either use the lowest 4 bytes of the xxhash64, or use xxhash32 if you plan to implement it. (see below)

@wlandau
Copy link

wlandau commented Feb 23, 2024

I just posted ropensci/targets#1244. I hope it's clear why the portability issue in targets cannot be fixed back-compatibly.

For 32-bit hashes, I could either use the lowest 4 bytes of the xxhash64, or use xxhash32 if you plan to implement it.

Taking a step back here: since I can't achieve back-compatibility anyway, I have more freedom to rethink the whole 32-bit hashing strategy in targets. I originally chose a 32-bit hash because it makes the names of dynamic branches more compact and nice-looking. But that's not a hard and fast requirement. At this point, I think it makes the most sense for targets to just use the 64-bit hash for everything it does. So unless I am missing something, I no longer have a need for xxhash32 in secretbase.

@shikokuchuo
Copy link
Owner Author

  1. Raise a discussion in the digest package, at least to understand why the locale headers are not skipped in serialization version 3.

Yes, there might be a reason we're not aware of. So you have more of a concrete idea of the issue, it is these 2 lines:

https://github.com/wch/r-source/blob/9683a5bb0164b8a8b3821b924df0b68bd85dda97/src/main/serialize.c#L1430-L1431

The encoding which is written into a v3 header.

You can see that apart from the headers written in the switch statement - if you use version 2 and skip those headers, and if I use v3 and I also skip all those headers, what is left is the same - that's why secretbase agrees with the default digest().

  1. For 32-bit hashes, I could either use the lowest 4 bytes of the xxhash64, or use xxhash32 if you plan to implement it.

For this, I'd see no problem with truncating the 64bit hash. If you can live with that (and I see no reason why not as this is completely different to the RNG issues), then I'd prefer this over implementing another hash.

@wlandau
Copy link

wlandau commented Feb 23, 2024

For this, I'd see no problem with truncating the 64bit hash. If you can live with that (and I see no reason why not as this is completely different to the RNG issues), then I'd prefer this over implementing another hash.

At this point, I will definitely not ask you for another hash. (Thank you for xxhash64!) This current PR in its current state meets all the needs for targets, and I will happily adopt it when it moves to the next release.

@shikokuchuo
Copy link
Owner Author

shikokuchuo commented Feb 23, 2024

I'll merge this PR when it's clear all the checks are returning clean on the current 0.3.0 release.

p.s. in the above, sometimes I referred to digests (a mash of digest and targets) when I meant digest :p

Repository owner deleted a comment from wlandau Feb 23, 2024
@shikokuchuo
Copy link
Owner Author

shikokuchuo commented Feb 24, 2024

What method does secretbase use to robustly detect and skip these headers?

It relies on a reading of the R source to know that there are exactly 6 header writes before the data is serialized, so the first 6 are literally skipped. It's not at the level of an API guaranteed by R Core, but that doesn't exist and the R source doesn't change often, and if it did here, it would probably necessitate a new serialization version. I think it's slightly more robust than knowing the exact number of bytes written.

Actually here the number of bytes written differs according to the locale, as the locale string is essentially written into the header - I had previously missed this fact. This means that the secretbase method is much more robust, and it is not possible for digest to skip all the headers using a fixed value. What I said previously about the header being 23 bytes is not correct for all locales:

The default (aligning with version 2) is 14 bytes, to remove the entire header for version 3 would require 23 bytes. If you add the parameter skip = 23, you will get the same result in digest as secretbase.

@shikokuchuo shikokuchuo merged commit f8c2c8f into main Mar 5, 2024
@shikokuchuo shikokuchuo deleted the xxh64 branch March 5, 2024 16:51
Repository owner locked and limited conversation to collaborators Mar 8, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants