Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor registry system: no direct dependencies; expose standard hash.Hash; be a data carrier. #136

Merged
merged 17 commits into from
Mar 11, 2021

Conversation

warpfork
Copy link
Contributor

@warpfork warpfork commented Mar 4, 2021

This diff revisits the registry system to use more standard interfaces for hashers, and puts it to work to reduce the transitive dependencies of go-multihash.


Dependencies:

Previously, depending on the go-mulithash package would create direct
dependencies to several other modules for their various hash function
implementations. This meant that instead of go-multihash being a
lightweight, easy-to-accept dependency itself, it became something
which would noticably increase the size of your go.mod file,
your package graph, your download sizes during development, and most
concerningly, your compile output size in final products.

Now, there is a registry system (see the Register function), and
the main go-multihash package only populates the registry with hashes
that are available from the golang standard library by default.
This means you gain no transitive dependencies on other libraries
by importing the go-multihash package, and your binaries will not
be bloated by hashers you don't use. (Your go.mod file may still
show more repos; but they don't end up in your builds unless you
actually refer to them).

There are now several new packages in go-multihash/register/*.
These can be imported to register the hashes in those packages.
If you want all the hashes that were previously available, just make
sure to import "go-mulithash/register/all" somewhere in your program.
(You can register hashes, too, without making PRs to this library;
these packages are just here for convenience and easy use.)

This is a breaking change if you used hashes not found in the
golang stdlib, such as blake2 and sha3. However, to update,
all you need to do is ensure the relevant go-multihash/register/*
package is imported anywhere in your program -- an easy change.
A go-multihash/register/all package can be imported to get a hasher
registered for all of the same multihash codes as before (but will
correspondingly add the dependency weight back too, of course).


Standard hash.Hash:

Previously, go-multihash had its own definition of a HashFunc
interface, and only exposed hashing through the multihash.Sum method.
The problem with this was these definitions did not support streaming:
one had to have an entire chunk of memory loaded at once,
in a single contiguous byte slice, in order to hash it.

(A second, admittedly much more minor, problem with this was that one
often had to write glue code to turn a hash.Hash into a
multihash.HashFunc, and since most of golang uses the standard lib
hash.Hash definition already, this was generally avoidable friction.)

Now, the Register function operates in terms of standard hash.Hash,
and there is a GetHasher function which can get you a hash.Hash.
(Okay, to be more precise, these functions take and return a factory
function for a hash.Hash. You get the idea.)

Since the standard hash.Hash interface can operate streamingly,
now it's easy to use go-multihash in a streaming way.

The multihash.Sum method works the same as always.


Be a data carrier:

Previously, go-multihash contained checks that any multihash indicator
codes being handled were required to have a hash function registered
for them. This made it very difficult to use go-multihash in a
"forward compatible" way (and it also made a lot of practical bumps
for this dependency-extraction refactor).

Now, go-multihash is willing to carry data, even if it doesn't know
what kind of hash function would be associated with an indicator code.
(Methods that you'd expect to parse things do still parse the varints,
making sure they're sanely formatted. They just don't inspect and
whitelist the actual integers anymore.)

I removed the ValidCode predicate entirely. It doesn't seem to serve
any good purpose anymore.


Other:

I have not touched the Codes and Names maps in this diff.
I think we should probably review (and probably remove) these, and
instead direct people to use the go-multicodec package instead,
which has the two advantages of decoupling registration of an
implementation versus simply having a description, and also being
automatically generated from the multiformats table.
However, I wanted to check on feelings about this before doing the work
(especially because they're somewhat entangled with a bunch of the
tests in this package, making their removal somewhat nontrivial).

Most of the test files are now package multihash_test.
This makes for some colorful diffs, but is not otherwise interesting.
The reason for this is because the dependency separation process now
requires the tests to import those register/* packages, and to
avoid a cycle, that means, well, package mulithash_test.

I think there's probably more work to be done in making this library
really shine. For example, in reviewing the Encode function,
I see some allocations that look very likely to be avoidable... if the
function was redesigned to be more aware of how it's likely to be used.
However, I took no action on this, in part because this diff is big
enough already, and in part because I think it might be reasonable to
re-examine the relationship of this code to go-cid at the same time.

I dropped TestSmallerLengthHashID. It appeared to be testing an API
that wasn't actually exported... and the nearest API that is exported
(Sum) has a general contract of truncating a hash upon short length,
so it was overall unclear what this test should be checking.
Review might be needed on this.

The situation for murmur3 is still in need of resolution.
It's commented out entirely for now. Questions are noted in the diff.

There's a 'register/miniosha256' package which sets the sha2-256
implementation to a non-stdlib one. If you don't import this package,
you still get a sha2-256; it's just the stdlib one.
I did not include this in the 'register/all' group.
(Maybe it's faster; maybe it's not; but it's definitely not required,
and I'm getting some reports it also shows weird on profiles, so I tend
to think maybe one should really have to explicitly ask for this one.)

There are a couple other review questions marked with "// REVIEW" in the diff.

@warpfork
Copy link
Contributor Author

warpfork commented Mar 4, 2021

(The test that's currently failing is for murmur3. Input needed on what we're going to do with that thing.)

@warpfork warpfork force-pushed the dependency-separation branch 2 times, most recently from 96aa53c to fdbce65 Compare March 4, 2021 17:52
Copy link
Member

@Stebalien Stebalien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly LGTM but we need to do something about those constants everywhere.

errata.go Outdated Show resolved Hide resolved
errata.go Outdated Show resolved Hide resolved
errata.go Outdated Show resolved Hide resolved
multihash.go Outdated Show resolved Hide resolved
multihash.go Outdated Show resolved Hide resolved
register/miniosha256/multihash_miniosha256.go Outdated Show resolved Hide resolved
register/murmur3/multihash_murmur3.go Outdated Show resolved Hide resolved
register/sha3/multihash_sha3.go Outdated Show resolved Hide resolved
register/sha3/multihash_sha3.go Outdated Show resolved Hide resolved
sum.go Outdated Show resolved Hide resolved
Copy link
Contributor

@aschmahmann aschmahmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it easy enough to fix the binary in the multihash folder? Seems like we'd just need it to import all the registry entries here (including miniosha256).

register/sha3/multihash_sha3.go Outdated Show resolved Hide resolved
registry.go Show resolved Hide resolved
registry.go Outdated Show resolved Hide resolved
registry.go Outdated Show resolved Hide resolved
registry.go Outdated
//
// Hashers which are available in the golang stdlib will be registered automatically.
// Others can be added using the Register function.
var registry = make(map[uint64]func() hash.Hash)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps a stupid question, but why are we preferring to use uint64, instead of depending on the Code type from go-multicodec which aliases uint64, throughout the registry code? Does it help us out with backwards compatibility/consistency things (e.g. the existing Names = map[string]uint64)?

I was thinking that in the future if go-multicodec added support for typed metadata (e.g. names) then you could do things like code.Name() and it would look it up for you behind the scenes which might be nice. Feel free to say that this is a terrible idea though 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for bringing up this whole range of questions. (They are good. So far, I made these diffs while trying to evade them! But that does not represent a stance so much as it was just short-term diff-scope-minimization.)

This came up in enough places throughout this PR that I opened a whole new issue to talk through it: #137

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... okay, my time & energy budget for recursing up through that concern tree has expired.

  • I'm going to keep uint64 as a map key (rather than any typedefs) because that's what this repo did before;
  • I'm not going to add the https://github.com/multiformats/go-multicodec repo for symbols right now because it's not in this repo right now;
  • I'm going to replace all the current use of bare constants with the named constants already in this package and leave them as they are.
  • I'm going to make a concerted effort to do nothing to the Names and Codes maps in the rest of this PR because the things we'd probably want to do with them would almost certainly involve bringing in the above, which... E_SCOPE_LIMIT.

Additional boat rocking about constants can come some other day and some other PR.

sum_test.go Show resolved Hide resolved
errata.go Outdated Show resolved Hide resolved
register/blake2/multihash_blake2.go Outdated Show resolved Hide resolved
register/murmur3/multihash_murmur3.go Outdated Show resolved Hide resolved
register/murmur3/multihash_murmur3.go Outdated Show resolved Hide resolved
@warpfork warpfork mentioned this pull request Mar 5, 2021
@warpfork warpfork force-pushed the dependency-separation branch from fdbce65 to abbba7f Compare March 5, 2021 15:13
warpfork added 7 commits March 6, 2021 13:28
…h.Hash; be a data carrier.

---

Dependencies:

Previously, depending on the go-mulithash package would create direct
dependencies to several other modules for their various hash function
implementations.  This meant that instead of go-multihash being a
lightweight, easy-to-accept dependency itself, it became something
which would noticably increase the size of your go.mod file,
your package graph, your download sizes during development, and most
concerningly, your compile output size in final products.

Now, there is a registry system (see the Register function), and
the main go-multihash package *only* populates the registry with hashes
that are available from the golang standard library by default.
This means you gain no transitive dependencies on other libraries
by importing the go-multihash package, and your binaries will not
be bloated by hashers you don't use.  (Your go.mod file may still
show more repos; but they don't end up in your builds unless you
actually refer to them).

There are now several new packages in `go-multihash/register/*`.
These can be imported to register the hashes in those packages.
If you want all the hashes that were previously available, just make
sure to import "go-mulithash/register/all" somewhere in your program.
(You can register hashes, too, without making PRs to this library;
these packages are just here for convenience and easy use.)

**This is a breaking change** if you used hashes not found in the
golang stdlib, such as blake2 and sha3.  However, to update,
all you need to do is ensure the relevant `go-multihash/register/*`
package is imported anywhere in your program -- an easy change.
A `go-multihash/register/all` package can be imported to get a hasher
registered for all of the same multihash codes as before (but will
correspondingly add the dependency weight back too, of course).

---

Standard hash.Hash:

Previously, go-multihash had its own definition of a `HashFunc`
interface, and only exposed hashing through the `multihash.Sum` method.
The problem with this was these definitions did not support streaming:
one had to have an entire chunk of memory loaded at once,
in a single contiguous byte slice, in order to hash it.

(A second, admittedly much more minor, problem with this was that one
often had to write glue code to turn a `hash.Hash` into a
`multihash.HashFunc`, and since most of golang uses the standard lib
`hash.Hash` definition already, this was generally avoidable friction.)

Now, the Register function operates in terms of standard `hash.Hash`,
and there is a `GetHasher` function which can get you a `hash.Hash`.
(Okay, to be more precise, these functions take and return a factory
function for a `hash.Hash`.  You get the idea.)

Since the standard `hash.Hash` interface can operate streamingly,
now it's easy to use go-multihash in a streaming way.

The `multihash.Sum` method works the same as always.

---

Be a data carrier:

Previously, go-multihash contained checks that any multihash indicator
codes being handled were required to have a hash function registered
for them.  This made it very difficult to use go-multihash in a
"forward compatible" way (and it also made a lot of practical bumps
for this dependency-extraction refactor).

Now, go-multihash is willing to carry data, even if it doesn't know
what kind of hash function would be associated with an indicator code.
(Methods that you'd expect to parse things do still parse the varints,
making sure they're sanely formatted.  They just don't inspect and
whitelist the actual integers anymore.)

I removed the `ValidCode` predicate entirely.  It doesn't seem to serve
any good purpose anymore.

---

Other:

I have not touched the `Codes` and `Names` maps in this diff.
I think we should probably review (and probably remove) these, and
instead direct people to use the go-multicodec package instead,
which has the two advantages of decoupling registration of an
implementation versus simply having a description, and also being
automatically generated from the multiformats table.
However, I wanted to check on feelings about this before doing the work
(especially because they're somewhat entangled with a bunch of the
tests in this package, making their removal somewhat nontrivial).

Most of the test files are now `package multihash_test`.
This makes for some colorful diffs, but is not otherwise interesting.
The reason for this is because the dependency separation process now
requires the tests to import those `register/*` packages, and to
avoid a cycle, that means, well, `package mulithash_test`.

I think there's probably more work to be done in making this library
really shine.  For example, in reviewing the `Encode` function,
I see some allocations that look very likely to be avoidable... if the
function was redesigned to be more aware of how it's likely to be used.
However, I took no action on this, in part because this diff is big
enough already, and in part because I think it might be reasonable to
re-examine the relationship of this code to go-cid at the same time.

I dropped `TestSmallerLengthHashID`.  It appeared to be testing an API
that wasn't actually exported... and the nearest API that *is* exported
(Sum) has a general contract of truncating a hash upon short length,
so it was overall unclear what this test should be checking.
Review might be needed on this.

The situation for murmur3 is still in need of resolution.
It's commented out entirely for now.  Questions are noted in the diff.

There's a 'register/miniosha256' package which sets the sha2-256
implementation to a non-stdlib one.  If you don't import this package,
you still get a sha2-256; it's just the stdlib one.
I did not include this in the 'register/all' group.
(Maybe it's faster; maybe it's not; but it's definitely not required,
and I'm getting some reports it also shows weird on profiles, so I tend
to think maybe one should really have to explicitly ask for this one.)
It seems likely that we should replace many of these with values
from https://github.com/multiformats/go-multicodec in the future,
however, we have a new issue to track that:
#137

And as noted in
#136 (comment)
it's time to reign in the amount of work going on in this PR.
Avoid allocations with more reuse the same slice for results.
Agreed upon in discussion at:
#136 (comment)

The "REVIEW" comments in the outgoing diff identify the concerns that
led to this choice.
@warpfork warpfork force-pushed the dependency-separation branch from abbba7f to 10afd22 Compare March 6, 2021 12:46
@willscott
Copy link
Contributor

is this ready for a round of review? are there any loose things that still need to happen?

warpfork added 4 commits March 9, 2021 08:02
Leaving the error return in, per
#136 (comment)

May also be interesting to note: I did check if ValidCode (or, roughly
the same thing but perhaps renamed to KnownCode) could be reintroduced,
and the answer is... actually no, not without broaching other issues.
A body of `_, ok := registry[code]` is enough to open a can of worms.
See
#136 (comment)
This creates a hasher long enough to ask it its properties.
This is arguably creating garbage; on the other hand, I don't think
this is a codepath ever likely to be used in a hot loop anywhere.
We can extract this to something that caches the properties later
if it proves necessary.

It quietly defaults to zero for unknown codes.  I don't know if this
makes sense, but it's what the old code would have done, so it's what
the new code will do, and I'm not looking deeper into it.
At this point I'm just trying to make surgically minimal alterations
and get this changeset as a whole wrapped up so things can move on.
Surely no one would try to do this, nor then be surprised if it
creates problems.  On the other hand: if someone *does* do this, and
the error doesn't appear until arbitrarily far away when the map is
read, it's a pain to diagnose... and checking it up front is cheap.
So, here we go: check it up front.
The number that's missing is now reported in the error.
This is done using the golang error wrap feature.

We're keeping a constant value for ErrSumNotSupported for
compatibility and change avoidance reasons.

Code that cares about this exact error will still have to change;
it is now necessary to use `errors.Is` to detect it.

The text in the ErrSumNotSupported value is also updated,
because the text no longer made sense given an open registry system.
As a target-of-opportunity fix, it also now follows golang normative
conventions for error messages (no capitalization, no punctuation).
warpfork added a commit that referenced this pull request Mar 10, 2021
…ultihash that rejects truncations.

See #136 (comment)
for discussion.

This change means Sum behaves slightly differently for identity
multihashes than it does for any other multihash.  I'm not keeping
score on the number of ways identity multihash is weird anymore,
just documenting it and keeping tests passing.

The error message is lifted from the old `sumID` function verbatim.
@warpfork
Copy link
Contributor Author

warpfork commented Mar 10, 2021

Okay! This is finally green.

I believe all comments are addressed.

Also note that I updated a bunch of the transitive dependencies. It was forced (see 877240b / https://travis-ci.com/github/multiformats/go-multihash/builds/219591152 ). But none of our fixtures reported effects, so I presume this to be safe and joyously uninteresting.

Last call for any more reviews?

@warpfork
Copy link
Contributor Author

warpfork commented Mar 10, 2021

I've attempted to push this downstream through go-cid, and here's what that might look like: ipfs/go-cid@26b387a#diff-88c2c4b0892fb0172322544e3937e937bb18beccecb1b24a1b5028758a43685bR346-R375

There's some questions worth a quick review there. Namely: it's entirely possible to make all the code work without a need for DefaultLengths, but there are some tests which conflict with it. It's not clear to me if those tests are testing something important and relied upon or if they happen to assert what they do incidentally.

Everything else looks fine -- although we do again end up seeing the go-multihash/register/all import now needing to appear in test files, which should probably not be a surprise.

@Stebalien
Copy link
Member

There's some questions worth a quick review there. Namely: it's entirely possible to make all the code work without a need for DefaultLengths, but there are some tests which conflict with it. It's not clear to me if those tests are testing something important and relied upon or if they happen to assert what they do incidentally.

We do need DefaultLengths, we use it in quite a few places. You can find these places by:

  1. Updating go-multihash in go-ipfs, lotus, etc.
  2. Seeing what breaks.

Let's make this PR minimally breaking so we can merge it ASAP and move on.

…ultihash that rejects truncations.

See #136 (comment)
for discussion.

This change means Sum behaves slightly differently for identity
multihashes than it does for any other multihash.  I'm not keeping
score on the number of ways identity multihash is weird anymore,
just documenting it and keeping tests passing.

The error message is lifted from the old `sumID` function verbatim.
We'll need something a little more recent than 1.11.x to be able to
use the fmt "%w" and errors.Is features.
I swear I did not mean to be doing this on a branch, but:
https://travis-ci.com/github/multiformats/go-multihash/builds/219591152

- it became necessary to update the go version in CI for %w features

- that apparently adds pointer alignment checking to the compiler...

- which flunks some stuff in x/crypto...

So, okay, we're updating libraries now!
While it was possible to remove all use of this from this repo, when
attempting propagate changes to downstreams consuming it, it became
apparently that other repos also rely on this symbol.

Whether or not those usages are important and intentional, whether
they're actually worth maintaining, and whether they'd be replacable
with other approaches... is not considered at this time.
(Probably we should be asking this!  The first occasions where this
cropped up are in other functions that have been marked "deprecated"
since... 2018!  But... chasing those things down and straightening
them out is becoming problematic.  Perhaps we'll be more ready to
revisit these things at a later date.)
@warpfork warpfork force-pushed the dependency-separation branch from 877240b to 1a96911 Compare March 10, 2021 22:04
@warpfork
Copy link
Contributor Author

DefaultLengths is back in.

Here's what propagating through go-cid looks like now, with all tests passing: ipfs/go-cid#118

@warpfork
Copy link
Contributor Author

I guess the final check here is: do we want to keep github.com/multiformats/go-multihash, the root package of this repo, still dragging in all transitive dependencies, to really really minimize change? E.g. effectively have it import register/all and register/miniosha256?

#131 did originally describe doing the registration rework in a new package, and keeping the go-multihash root unchanged.

As it stands now, the go-multihash symbols exported are pretty much unchanged, but the code does go all-in on making the dependencies optional, so the behavior does shift. If we're scared of this and the consequent need to add the registration imports in downstream applications, and we'd rather make the dependency minimization an opt-in action that has to be done separately from just updating the library by putting it in a deeper package, we can do that.

... heck, I'll just draft that now. If we introduce a new package, who wants to name it? go-multihash/core? or go-multihash/lite? or just put it in go-mulithash/register?

@Stebalien
Copy link
Member

Yeah, let's minimize breakage This library is a bit popular: https://pkg.go.dev/github.com/multiformats/go-multihash?tab=importedby. A "core" package would work (I'd call it that).

I guess the question is: what do you need. This appears to be blocking other work, so I want to make sure we merge that need ASAP.

If you don't really need to reduce dependencies (I'm pretty sure you don't, at least right now) then I'd:

  1. Import these deps from the root for now.
  2. Release this as a 1.0 release.
  3. Immediately start on a 2.0 release that doesn't "register" everything.
  4. Make the 1.0 release depend on the 2.0 release.

Basically, we need unblock this ASAP and we need to avoid causing a meltdown. Let's just go with the easiest approach and introduce a 2.0 (we'll get there eventually regardless).

…s again.

The transitive dependencies in the root package are still managed by
this whole registry system (which now resides in the 'core' package),
so we have parity and *mostly* just one suite of code to maintain.

You can now use this 'core' package when interested in dependency
minimization.  In exchange, the "register/*" imports may be required.

You can just also just yank updates to go-multihash, and if you were
already using it (and happen to be using one of these
now-optional(-but-only-if-you-use-core))... nothing should actually
change; the "register/*" imports won't be required because the
root package does them for you.

Some constants are replicated.  This was the minimum step necessary
to avoid import cycles.  I'm not spending time prettifying it because
we really probably ought to be refactoring this to use the package
of constants in go-multicodec that's automatically generated, and
yet, that is a scope limit we're trying not to cross during this
changeset.
@warpfork warpfork force-pushed the dependency-separation branch from 0db0169 to cebc9f8 Compare March 11, 2021 00:23
@warpfork
Copy link
Contributor Author

Alright. go-multihash/core is now a package, it does the thing. go-mulithash still brings on all the deps unconditionally.

I think I've run out of changes to avoid. go-cid can now be updated to this last commit with literally no code changes.

I'll also bow out on version tagging. Whatever numbers you like, fine by me. (I'm wary of major numbers and the Different Behavior they prompt from go mod, but... what could go wrong, really)

@Stebalien
Copy link
Member

I've tested with go-ipfs and the only broken test is the one that tries to use murmur (to make sure we don't allow it).

@Stebalien Stebalien merged commit c3ba253 into master Mar 11, 2021
@Stebalien Stebalien deleted the dependency-separation branch March 11, 2021 03:57
@Stebalien
Copy link
Member

@warpfork could you cut a minor release when ready?

@warpfork
Copy link
Contributor Author

This is now available within the v0.0.15 tag.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants