feat: incremental-hasher #261

Status: Open. Wants to merge 5 commits into base: master.
16 changes: 16 additions & 0 deletions src/block/interface.ts
@@ -35,6 +35,22 @@ export interface Phantom<T> {
[Marker]?: T
}

/**
 * A [multicodec code], usually used to tag a [multiformat]. It is simply an
 * integer that utilizes the `Phantom` type to capture the code name, which
 * TypeScript-aware tools will surface, providing info about the code without
 * having to look it up in the table.
 *
 * The type can also be used to convey that a value must be a multicodec code.
 *
 * [multiformat]:https://multiformats.io/
 * [multicodec code]:https://github.com/multiformats/multicodec/blob/master/table.csv
 */
export type MulticodecCode<
Code extends number = number,
Name extends string = string
> = Code & Phantom<Name>
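
For illustration, a minimal sketch of how this phantom-typed code might be used; the `Sha256Code` alias and the cast style below are hypothetical, though 0x12 is the sha2-256 code in the multicodec table:

import type { MulticodecCode } from './block/interface.js'

// The phantom `Name` parameter lets TypeScript-aware tools surface
// 'sha2-256' without a table lookup; at runtime this is just the
// number 0x12.
type Sha256Code = MulticodecCode<0x12, 'sha2-256'>
const SHA_256_CODE: Sha256Code = 0x12 as Sha256Code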


/**
* Represents an IPLD block (including its CID) that can be decoded to data of
* type `T`.
79 changes: 75 additions & 4 deletions src/hashes/interface.ts
@@ -1,4 +1,5 @@
// # Multihash
import type { MulticodecCode } from '../block/interface.js'

/**
* Represents a multihash digest which carries information about the
@@ -9,7 +10,7 @@
// a bunch of places that parse it to extract (code, digest, size). By creating
// this first class representation we avoid reparsing and things generally fit
// really nicely.
-export interface MultihashDigest<Code extends number = number> {
+export interface MultihashDigest<Code extends MulticodecCode = MulticodecCode, Size extends number = number> {
/**
* Code of the multihash
*/
@@ -23,7 +24,7 @@ export interface MultihashDigest<Code extends number = number> {
/**
* byte length of the `this.digest`
*/
-size: number
+size: Size

/**
* Binary representation of this multihash digest.
@@ -35,7 +36,7 @@
* Hasher represents a hashing algorithm implementation that produces a
* `MultihashDigest`.
*/
-export interface MultihashHasher<Code extends number = number> {
+export interface MultihashHasher<Code extends MulticodecCode = MulticodecCode> {
/**
* Takes binary `input` and returns it (multi) hash digest. Return value is
* either promise of a digest or a digest. This way general use can `await`
@@ -67,6 +68,76 @@
* `SyncMultihashHasher` is useful in certain APIs where async hashing would be
* impractical e.g. implementation of Hash Array Mapped Trie (HAMT).
*/
-export interface SyncMultihashHasher<Code extends number = number> extends MultihashHasher<Code> {
+export interface SyncMultihashHasher<Code extends MulticodecCode = MulticodecCode> extends MultihashHasher<Code> {
digest: (input: Uint8Array) => MultihashDigest<Code>
}
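
As an aside, a hedged illustration of what the added `Size` parameter buys (the `Sha256Digest` alias here is hypothetical):

import type { MulticodecCode } from './block/interface.js'
import type { MultihashDigest } from './hashes/interface.js'

// A sha2-256 digest can now be typed precisely: code 0x12 from the
// multicodec table and a fixed 32-byte digest size.
type Sha256Digest = MultihashDigest<MulticodecCode<0x12, 'sha2-256'>, 32>

declare const digest: Sha256Digest
// `digest.size` is now the literal type 32 rather than plain `number`.
const size: 32 = digest.size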

/**
 * Incremental variant of the `MultihashHasher` that can be used to compute
 * the digest of payloads that would be impractical or impossible to load
 * into memory all at once.
 */
export interface IncrementalMultihashHasher<
Code extends MulticodecCode,
Size extends number,
Digest = MultihashDigest<Code, Size>
> {
/**
* Size of the digest this hasher produces.
*/
size: Size

/**
* Code of the multihash
*/
code: Code

/**
* Name of the multihash
*/
name: string

/**
* Number of bytes that were consumed.
*/
count(): bigint
Member: I mean if someone is hashing >9PiB of data in JS then 👏👏👏.

Member: yeah ... is this overkill?

Contributor (author), suggesting a change on lines +100 to +103 that removes the `count()` method: Let's just drop this method, we can revisit if we find it really necessary.


/**
 * Returns the multihash digest of the bytes written so far. Should not have
 * side effects, meaning you should be able to write some more bytes and
 * call `digest` again to get the digest for all the bytes written since
 * creation (or since the last reset).
 */
digest(): Digest

/**
 * Encodes the multihash of the bytes written so far (since creation or
 * reset) into the provided `target` at the given `offset`. If `offset` is
 * not provided, it is implicitly `0`.
 *
 * @param [offset=0] - Byte offset in the `target`.
 */
readDigest(target: Uint8Array, offset?: number): this
Member: can you describe the use-case for this? it seems like this makes it an onerous API to have to implement

Member: Sorry, my mistake, this is the output function!

I think maybe the naming could be better here. We have ample precedent of digest() in JS-land, so we could have digest() and multihash() (or multihashDigest() if you want to be more explicit). In Go-land Sum() is the standard for this action, which has grown on me to make sense (though it's taken time!).

Member: Oh, I also see I'm discussing history here: read* being the new versions? I'm not a fan. I also wonder whether we could have nicer APIs that don't require you to pass in a target? I understand that's an important part of this, for efficiency, but casual use typically just wants that done for you. So could the APIs take a `target?` instead and always return a Uint8Array? So you can either choose to supply the bytes to write into (with an optional offset) or not supply one, but either way you get back some bytes.

Contributor (author):

> read* being the new versions? I'm not a fan.

I mean if you think of it as a transform stream, it makes sense to have write and read ops. I don't mind renaming it to something else, but please don't make me come up with a name that everyone will like.

> I also wonder whether we could have nicer APIs that don't require you to pass in a target?

I'm not completely opposed to returning the target, however I would caution against it as it mixes two very different modes into one and can also lead to mistakes (e.g. you may have passed an undefined reference, which will not throw but will happily give you back a Uint8Array).

The idea was that if you want to compute a digest you just call the digest method, and use this only in those rare cases when you need to work with slabs of memory.
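
To make the two modes concrete, a small hypothetical sketch against the proposed interface (the `example` function and its parameters are illustrative only):

import type { MulticodecCode } from './block/interface.js'
import type { IncrementalMultihashHasher } from './hashes/interface.js'

function example(
  hasher: IncrementalMultihashHasher<MulticodecCode, number>,
  chunk: Uint8Array
) {
  hasher.write(chunk)
  // Casual mode: allocate and return the multihash digest.
  const digest = hasher.digest()
  // Slab mode: encode the multihash directly into a pre-allocated
  // buffer at byte offset 512, avoiding an intermediate allocation.
  const slab = new Uint8Array(1024)
  hasher.readDigest(slab, 512)
  return { digest, slab }
}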


/**
 * Encodes the raw digest (without the multihash header) of the bytes
 * written so far (since creation or reset) into the provided `target` at
 * the given `offset`. If `offset` is not provided, it is implicitly `0`.
 *
 * @param [offset=0] - Byte offset in the `target`.
 */
read(target: Uint8Array, offset?: number): this

/**
* Writes bytes to be digested.
*/
write(bytes: Uint8Array): this
Member: Typically in streaming hashers this is called update.

Contributor (author): I'm fine with calling it update, although I do find that name confusing personally, as I think of update as overwrite as opposed to append.

Member: Right, but streaming hashers aren't appending to a buffer; they are updating their internal state with the new data you pass.


/**
 * Resets this hasher to its initial state. Can be used to recycle this
 * instance. It resets `count` and discards all the bytes that were
 * written prior.
 */
reset(): this
}
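
For context, a minimal end-to-end sketch of how an implementation might be consumed to hash a payload chunk by chunk; the `createSha256Incremental` factory is hypothetical and stands in for whatever constructor an implementation would export:

import type { MulticodecCode } from './block/interface.js'
import type { IncrementalMultihashHasher } from './hashes/interface.js'

// Hypothetical factory for an incremental sha2-256 hasher.
declare function createSha256Incremental(): IncrementalMultihashHasher<
  MulticodecCode<0x12, 'sha2-256'>,
  32
>

async function hashStream(stream: AsyncIterable<Uint8Array>) {
  const hasher = createSha256Incremental()
  // Feed the payload one chunk at a time instead of loading it whole.
  for await (const chunk of stream) {
    hasher.write(chunk)
  }
  // Digest of everything written since creation (or the last reset).
  return hasher.digest()
}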