Refactor registry system: no direct dependencies; expose standard hash.Hash; be a data carrier.

---

Dependencies:

Previously, depending on the go-multihash package would create direct
dependencies on several other modules for their various hash function
implementations.  This meant that instead of go-multihash being a
lightweight, easy-to-accept dependency itself, it became something
which would noticeably increase the size of your go.mod file,
your package graph, your download sizes during development, and most
concerningly, your compile output size in final products.

Now, there is a registry system (see the Register function), and
the main go-multihash package *only* populates the registry with hashes
that are available from the golang standard library by default.
This means you gain no transitive dependencies on other libraries
by importing the go-multihash package, and your binaries will not
be bloated by hashers you don't use.  (Your go.mod file may still
show more repos; but they don't end up in your builds unless you
actually refer to them).

There are now several new packages in `go-multihash/register/*`.
These can be imported to register the hashes in those packages.
If you want all the hashes that were previously available, just make
sure to import "go-multihash/register/all" somewhere in your program.
(You can register hashes, too, without making PRs to this library;
these packages are just here for convenience and easy use.)

**This is a breaking change** if you used hashes not found in the
golang stdlib, such as blake2 and sha3.  However, to update,
all you need to do is ensure the relevant `go-multihash/register/*`
package is imported anywhere in your program -- an easy change.

---

Standard hash.Hash:

Previously, go-multihash had its own definition of a `HashFunc`
interface, and only exposed hashing through the `multihash.Sum` method.
The problem with this was these definitions did not support streaming:
one had to have an entire chunk of memory loaded at once,
in a single contiguous byte slice, in order to hash it.

(A second, admittedly much more minor, problem with this was that one
often had to write glue code to turn a `hash.Hash` into a
`multihash.HashFunc`, and since most of golang uses the standard lib
`hash.Hash` definition already, this was generally avoidable friction.)

Now, the Register function operates in terms of standard `hash.Hash`,
and there is a `GetHasher` function which can get you a `hash.Hash`.
(Okay, to be more precise, these functions take and return a factory
function for a `hash.Hash`.  You get the idea.)

Since the standard `hash.Hash` interface accepts input incrementally,
it is now easy to use go-multihash in a streaming way.

The `multihash.Sum` method works the same as always.

---

Be a data carrier:

Previously, go-multihash checked that every multihash indicator code
it handled had a hash function registered for it.  This made it very
difficult to use go-multihash in a "forward compatible" way (and it
also created a lot of practical bumps for this dependency-extraction
refactor).

Now, go-multihash is willing to carry data, even if it doesn't know
what kind of hash function would be associated with an indicator code.
(Methods that you'd expect to parse things do still parse the varints,
making sure they're sanely formatted.  They just don't inspect and
whitelist the actual integers anymore.)
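
A sketch of what "structure only" validation means, under some assumptions: `decode` and the `decoded` type are invented for this example, and `encoding/binary` stands in for the separate `varint` package the library uses (which may apply stricter rules about varint encoding).

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// decoded holds the pieces of a multihash: indicator code, declared length, digest.
type decoded struct {
	Code   uint64
	Length int
	Digest []byte
}

// decode checks structure only: two well-formed varints followed by exactly
// the declared number of digest bytes. It never looks the code up in a table.
func decode(buf []byte) (decoded, error) {
	code, n := binary.Uvarint(buf)
	if n <= 0 {
		return decoded{}, errors.New("malformed code varint")
	}
	length, m := binary.Uvarint(buf[n:])
	if m <= 0 {
		return decoded{}, errors.New("malformed length varint")
	}
	digest := buf[n+m:]
	if uint64(len(digest)) != length {
		return decoded{}, errors.New("digest length does not match header")
	}
	return decoded{Code: code, Length: int(length), Digest: digest}, nil
}

func main() {
	// 0x3e7 is an arbitrary code chosen for this example; its varint form is e7 07.
	unknown := []byte{0xe7, 0x07, 0x02, 0xaa, 0xbb}
	d, err := decode(unknown)
	fmt.Println(d.Code, d.Length, err)

	// A buffer whose digest is shorter than declared is still rejected.
	_, err = decode([]byte{0x12, 0x20, 0x01})
	fmt.Println(err)
}
```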

I removed the `ValidCode` predicate entirely.  It doesn't seem to serve
any good purpose anymore.

---

Other:

I have not touched the `Codes` and `Names` maps in this diff.
I think we should probably review (and probably remove) these, and
direct people to use the go-multicodec package instead,
which has the two advantages of decoupling registration of an
implementation from simply having a description, and of being
automatically generated from the multiformats table.
However, I wanted to check on feelings about this before doing the work
(especially because they're somewhat entangled with a bunch of the
tests in this package, making their removal somewhat nontrivial).

Most of the test files are now `package multihash_test`.
This makes for some colorful diffs, but is not otherwise interesting.
The reason is that the dependency separation process now
requires the tests to import those `register/*` packages, and to
avoid a cycle, that means, well, `package multihash_test`.

I think there's probably more work to be done in making this library
really shine.  For example, in reviewing the `Encode` function,
I see some allocations that look very likely to be avoidable... if the
function was redesigned to be more aware of how it's likely to be used.
However, I took no action on this, in part because this diff is big
enough already, and in part because I think it might be reasonable to
re-examine the relationship of this code to go-cid at the same time.
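
For discussion only, one possible direction is an append-style encoder. This is a sketch of the idea, not a proposal for the actual API: `appendMultihash` is a hypothetical name, `encoding/binary` stands in for the `varint` package, and the prefix bytes in `main` are illustrative.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// appendMultihash is a hypothetical helper, not part of go-multihash.
// It appends <varint code><varint length><digest> to dst and returns the
// extended slice, so a caller with a presized buffer needs no extra buffer.
func appendMultihash(dst []byte, code uint64, digest []byte) []byte {
	var tmp [binary.MaxVarintLen64]byte // scratch space for one varint
	n := binary.PutUvarint(tmp[:], code)
	dst = append(dst, tmp[:n]...)
	n = binary.PutUvarint(tmp[:], uint64(len(digest)))
	dst = append(dst, tmp[:n]...)
	return append(dst, digest...)
}

func main() {
	digest := []byte{0xde, 0xad, 0xbe, 0xef}

	// A buffer that already holds some prefix, as when building a larger structure.
	buf := make([]byte, 0, 64)
	buf = append(buf, 0x01, 0x55)
	buf = appendMultihash(buf, 0x12, digest)

	fmt.Printf("%x\n", buf) // prints "01551204deadbeef"
}
```

With this shape the caller decides where the bytes live, which matters when the multihash is immediately appended to something else.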

I dropped `TestSmallerLengthHashID`.  It appeared to be testing an API
that wasn't actually exported... and the nearest API that *is* exported
(Sum) has a general contract of truncating a hash upon short length,
so it was overall unclear what this test should be checking.
Review might be needed on this.

The situation for murmur3 is still in need of resolution.
It's commented out entirely for now.  Questions are noted in the diff.

There's a 'register/miniosha256' package which sets the sha2-256
implementation to a non-stdlib one.  If you don't import this package,
you still get a sha2-256; it's just the stdlib one.
I did not include this in the 'register/all' group.
(Maybe it's faster; maybe it's not; but it's definitely not required,
and I'm getting some reports it also shows weird on profiles, so I tend
to think maybe one should really have to explicitly ask for this one.)
warpfork committed Mar 4, 2021
1 parent 6f1ea18 commit 96aa53c
Showing 12 changed files with 408 additions and 345 deletions.
51 changes: 51 additions & 0 deletions errata.go
@@ -0,0 +1,51 @@
package multihash

import (
	"bytes"
	"crypto/sha256"
	"hash"
)

type identityMultihash struct {
	bytes.Buffer
}

func (identityMultihash) BlockSize() int {
	return 32 // A preferred block size is nonsense for the "identity" "hash". An arbitrary but unsurprising and positive nonzero number has been chosen to minimize the odds of fascinating bugs.
}

func (x identityMultihash) Size() int {
	return x.Len()
}

func (x identityMultihash) Sum(digest []byte) []byte {
	return x.Bytes()
}

type doubleSha256 struct {
	main hash.Hash
}

func (x doubleSha256) Write(body []byte) (int, error) {
	return x.main.Write(body)
}

func (doubleSha256) BlockSize() int {
	return sha256.BlockSize
}

func (doubleSha256) Size() int {
	return sha256.Size
}

func (x doubleSha256) Reset() {
	x.main.Reset()
}

func (x doubleSha256) Sum(digest []byte) []byte {
	intermediate := [sha256.Size]byte{}
	x.main.Sum(intermediate[0:0])
	h2 := sha256.New()
	h2.Write(intermediate[:])
	return h2.Sum(digest)
}
17 changes: 3 additions & 14 deletions multihash.go
@@ -231,15 +231,11 @@ func FromB58String(s string) (m Multihash, err error) {
 // Cast casts a buffer onto a multihash, and returns an error
 // if it does not work.
 func Cast(buf []byte) (Multihash, error) {
-	dm, err := Decode(buf)
+	_, err := Decode(buf)
 	if err != nil {
 		return Multihash{}, err
 	}

-	if !ValidCode(dm.Code) {
-		return Multihash{}, ErrUnknownCode
-	}
-
 	return Multihash(buf), nil
 }

@@ -267,9 +263,8 @@ func Decode(buf []byte) (*DecodedMultihash, error) {
 // Encode a hash digest along with the specified function code.
 // Note: the length is derived from the length of the digest itself.
 func Encode(buf []byte, code uint64) ([]byte, error) {
-	if !ValidCode(code) {
-		return nil, ErrUnknownCode
-	}
+	// REVIEW: if we remove the strict ValidCode check, this can no longer error. Change signature?
+	// REVIEW: this function always causes heap allocs... but when used, this value is almost always going to be appended to another buffer (either as part of CID creation, or etc) -- should this whole function be rethought and alternatives offered?

 	newBuf := make([]byte, varint.UvarintSize(code)+varint.UvarintSize(uint64(len(buf)))+len(buf))
 	n := varint.PutUvarint(newBuf, code)
@@ -285,12 +280,6 @@ func EncodeName(buf []byte, name string) ([]byte, error) {
 	return Encode(buf, Names[name])
 }

-// ValidCode checks whether a multihash code is valid.
-func ValidCode(code uint64) bool {
-	_, ok := Codes[code]
-	return ok
-}
-
 // readMultihashFromBuf reads a multihash from the given buffer, returning the
 // individual pieces of the multihash.
 // Note: the returned digest is a slice over the passed in data and should be
10 changes: 0 additions & 10 deletions multihash_test.go
@@ -224,16 +224,6 @@ func ExampleDecode() {
 	// obj: sha1 0x11 20 0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
 }

-func TestValidCode(t *testing.T) {
-	for i := uint64(0); i < 0xff; i++ {
-		_, ok := tCodes[i]
-
-		if ValidCode(i) != ok {
-			t.Error("ValidCode incorrect for: ", i)
-		}
-	}
-}
-
 func TestCast(t *testing.T) {
 	for _, tc := range testCases {
 		ob, err := hex.DecodeString(tc.hex)
23 changes: 23 additions & 0 deletions register/all/multihash_all.go
@@ -0,0 +1,23 @@
/*
	This package has no purpose except to perform registration of multihashes.

	It is meant to be used as a side-effecting import, e.g.

		import (
			_ "github.com/multiformats/go-multihash/register/all"
		)

	This package registers many multihashes at once.
	Importing it will increase the size of your dependency tree significantly.
	It's recommended that you import this package if you're building some
	kind of data broker application, which may need to handle many different kinds of hashes;
	if you're building an application which you know only handles a specific hash,
	importing this package may bloat your builds unnecessarily.
*/
package all

import (
	_ "github.com/multiformats/go-multihash/register/blake2"
	_ "github.com/multiformats/go-multihash/register/murmur3"
	_ "github.com/multiformats/go-multihash/register/sha3"
)
61 changes: 61 additions & 0 deletions register/blake2/multihash_blake2.go
@@ -0,0 +1,61 @@
/*
	This package has no purpose except to perform registration of multihashes.

	It is meant to be used as a side-effecting import, e.g.

		import (
			_ "github.com/multiformats/go-multihash/register/blake2"
		)

	This package registers several multihashes for the blake2 family
	(both the 's' and the 'b' variants, and in a variety of sizes).
*/
package blake2

import (
	"hash"

	"github.com/minio/blake2b-simd"
	"golang.org/x/crypto/blake2s"

	"github.com/multiformats/go-multihash"
)

const (
	BLAKE2B_MIN = 0xb201
	BLAKE2B_MAX = 0xb240
	BLAKE2S_MIN = 0xb241
	BLAKE2S_MAX = 0xb260
)

func init() {
	// BLAKE2S
	// This package only enables support for 32byte (256 bit) blake2s.
	multihash.Register(BLAKE2S_MIN+31, func() hash.Hash { h, _ := blake2s.New256(nil); return h })

	// BLAKE2B
	// There's a whole range of these.
	for c := uint64(BLAKE2B_MIN); c <= BLAKE2B_MAX; c++ {
		size := int(c - BLAKE2B_MIN + 1)

		// special case these lengths to avoid allocations.
		switch size {
		case 32:
			multihash.Register(c, blake2b.New256)
			continue
		case 64:
			multihash.Register(c, blake2b.New512)
			continue
		}

		// Ok, allocate away.
		// (The config object here being a pointer is unfortunate.)
		multihash.Register(c, func() hash.Hash {
			hasher, err := blake2b.New(&blake2b.Config{Size: uint8(size)})
			if err != nil {
				panic(err)
			}
			return hasher
		})
	}
}
23 changes: 23 additions & 0 deletions register/miniosha256/multihash_miniosha256.go
@@ -0,0 +1,23 @@
/*
	This package has no purpose except to perform registration of multihashes.

	It is meant to be used as a side-effecting import, e.g.

		import (
			_ "github.com/multiformats/go-multihash/register/miniosha256"
		)

	This package registers alternative implementations for sha2-256, using
	the github.com/minio/sha256-simd library.
*/
package miniosha256

import (
	"github.com/minio/sha256-simd"

	"github.com/multiformats/go-multihash"
)

func init() {
	multihash.Register(0x12, sha256.New)
}
38 changes: 38 additions & 0 deletions register/murmur3/multihash_murmur3.go
@@ -0,0 +1,38 @@
/*
	This package has no purpose except to perform registration of multihashes.

	It is meant to be used as a side-effecting import, e.g.

		import (
			_ "github.com/multiformats/go-multihash/register/murmur3"
		)

	This package registers multihashes for the murmur3 family.
*/
package murmur3

// import (
// 	"hash"
//
// 	"github.com/gxed/hashland/murmur3"
//
// 	"github.com/multiformats/go-multihash"
// )

func init() {
	// REVIEW: what go-multihash has done historically is New32, but this doesn't match what the multihash table says, which is 128! Resolution needed.
	// REVIEW: I have also heard that something in ipfs unixfsv1 uses a murmur hash, but that is yet different than this. Resolution needed.
	// REVIEW: these bit-twiddling things *are* in fact load-bearing somehow. If you just return murmur3.New32 without this wrapper type, it produces different results. Resolution needed.

	// multihash.Register(0x22, func() hash.Hash { return murmur3.New32() })

	// -or-, what previously existed:

	// number := murmur3.Sum32(data)
	// bytes := make([]byte, 4)
	// for i := range bytes {
	// 	bytes[i] = byte(number & 0xff)
	// 	number >>= 8
	// }
	// return bytes, nil
}
61 changes: 61 additions & 0 deletions register/sha3/multihash_sha3.go
@@ -0,0 +1,61 @@
/*
	This package has no purpose except to perform registration of multihashes.

	It is meant to be used as a side-effecting import, e.g.

		import (
			_ "github.com/multiformats/go-multihash/register/sha3"
		)

	This package registers several multihashes for the sha3 family.
	This also includes some functions known as "shake" and "keccak",
	since they share much of their implementation and come in the same repos.
*/
package sha3

import (
	"hash"

	"golang.org/x/crypto/sha3"

	"github.com/multiformats/go-multihash"
)

func init() {
	multihash.Register(0x14, sha3.New512)
	multihash.Register(0x15, sha3.New384)
	multihash.Register(0x16, sha3.New256)
	multihash.Register(0x17, sha3.New224)
	multihash.Register(0x18, func() hash.Hash { return shakeNormalizer{sha3.NewShake128(), 128 / 8 * 2} })
	multihash.Register(0x19, func() hash.Hash { return shakeNormalizer{sha3.NewShake256(), 256 / 8 * 2} })
	multihash.Register(0x1B, sha3.NewLegacyKeccak256)
	multihash.Register(0x1D, sha3.NewLegacyKeccak512)
}

// sha3.ShakeHash presents a somewhat odd interface, and requires a wrapper to normalize it to the usual hash.Hash interface.
//
// Some of the fiddly bits required by this normalization probably make it undesirable for use in the highest performance applications.
// There's at least one extra allocation in constructing it (sha3.ShakeHash is an interface, so that's one heap escape; and there's a second heap escape when this normalizer struct gets boxed into a hash.Hash interface),
// and there's at least one extra allocation in getting a sum out of it (because reading a shake hash is a mutation (!) and the API only provides cloning as a way to escape this).
// Fun.
type shakeNormalizer struct {
	sha3.ShakeHash
	size int
}

func (shakeNormalizer) BlockSize() int {
	return 32 // Shake doesn't have a preferred block size, apparently. An arbitrary but unsurprising and positive nonzero number has been chosen to minimize the odds of fascinating bugs.
}

func (x shakeNormalizer) Size() int {
	return x.size
}

func (x shakeNormalizer) Sum(digest []byte) []byte {
	if len(digest) != x.size {
		digest = make([]byte, x.size)
	}
	h2 := x.Clone() // clone it, because reading mutates this kind of hash (!) which is not the standard contract for a Hash.Sum method.
	h2.Read(digest) // not capable of underreading. See sha3.ShakeSum256 for similar usage.
	return digest
}
68 changes: 68 additions & 0 deletions registry.go
@@ -0,0 +1,68 @@
package multihash

import (
	"crypto/md5"
	"crypto/sha1"
	"crypto/sha256"
	"crypto/sha512"
	"hash"
)

// registry is a simple map which maps a multihash indicator number
// to a standard golang Hash interface.
//
// Multihash indicator numbers are reserved and described in
// https://github.com/multiformats/multicodec/blob/master/table.csv .
// The keys used in this map must match those reservations.
//
// Hashers which are available in the golang stdlib will be registered automatically.
// Others can be added using the Register function.
var registry = make(map[uint64]func() hash.Hash)

// Register adds a new hash to the set available from GetHasher and Sum.
//
// Register has a global effect and should only be used at package init time to avoid data races.
//
// The indicator code should be per the numbers reserved and described in
// https://github.com/multiformats/multicodec/blob/master/table.csv .
//
// If Register is called with the same indicator code more than once, the last call wins.
// In practice, this means that if an application has a strong opinion about what implementation to use for a certain hash
// (e.g., perhaps they want to override the sha256 implementation to use a special hand-rolled assembly variant
// rather than the stdlib one which is registered by default),
// then this can be done by making a Register call with that effect at init time in the application's main package.
// This should have the desired effect because the root of the import tree has its init time effect last.
func Register(indicator uint64, hasherFactory func() hash.Hash) {
	registry[indicator] = hasherFactory
}

// GetHasher returns a new hash.Hash according to the indicator code number provided.
//
// The indicator code should be per the numbers reserved and described in
// https://github.com/multiformats/multicodec/blob/master/table.csv .
//
// The actual hashers available are determined by what has been registered.
// The registry automatically contains those hashers which are available in the golang standard libraries
// (which includes md5, sha1, sha256, sha384, sha512, and the "identity" multihash, among others).
// Other hash implementations can be made available by using the Register function.
// The 'go-multihash/register/*' packages can also be imported to gain more common hash functions.
//
// If an error is returned, it will be ErrSumNotSupported.
func GetHasher(indicator uint64) (hash.Hash, error) {
	factory, exists := registry[indicator]
	if !exists {
		return nil, ErrSumNotSupported // REVIEW: it's unfortunate that this error doesn't say what code was missing. Also "NotSupported" is a bit of a misnomer now.
	}
	return factory(), nil
}

func init() {
	Register(0x00, func() hash.Hash { return &identityMultihash{} })
	Register(0xd5, md5.New)
	Register(0x11, sha1.New)
	Register(0x12, sha256.New)
	Register(0x13, sha512.New)
	Register(0x1f, sha256.New224)
	Register(0x20, sha512.New384)
	Register(0x56, func() hash.Hash { return &doubleSha256{sha256.New()} })
}