Skip to content

Commit

Permalink
Merge pull request #70 from ipfs/rsrch
Browse files Browse the repository at this point in the history
cid implementation research
  • Loading branch information
warpfork authored Aug 27, 2018
2 parents afcde25 + 2cf56e3 commit 5ddbe21
Show file tree
Hide file tree
Showing 8 changed files with 700 additions and 0 deletions.
144 changes: 144 additions & 0 deletions _rsrch/cidiface/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,144 @@
What golang Kinds work best to implement CIDs?
==============================================

There are many possible ways to implement CIDs. This package explores them.

### criteria

There's a couple different criteria to consider:

- We want the best performance when operating on the type (getters, mostly);
- We want to minimize the number of memory allocations we need;
- We want types which can be used as map keys, because this is common.

The priority of these criteria is open to argument, but it's probably
mapkeys > minalloc > anythingelse.
(Mapkeys and minalloc are also quite entangled, since if we don't pick a
representation that can work natively as a map key, we'll end up needing
a `KeyRepr()` method which gives us something that does work as a map key,
an that will almost certainly involve a malloc itself.)

### options

There are quite a few different ways to go:

- Option A: CIDs as a struct; multihash as bytes.
- Option B: CIDs as a string.
- Option C: CIDs as an interface with multiple implementors.
- Option D: CIDs as a struct; multihash also as a struct or string.
- Option E: CIDs as a struct; content as strings plus offsets.

The current approach on the master branch is Option A.

Option D is distinctive from Option A because multihash as bytes transitively
causes the CID struct to be non-comparible and thus not suitable for map keys
as per https://golang.org/ref/spec#KeyType . (It's also a bit more work to
pursue Option D because it's just a bigger splash radius of change; but also,
something we might also want to do soon, because we *do* also have these same
map-key-usability concerns with multihash alone.)

Option E is distinctive from Option D because Option E would always maintain
the binary format of the cid internally, and so could yield it again without
malloc, while still potentially having faster access to components than
Option B since it wouldn't need to re-parse varints to access later fields.

Option C is the avoid-choices choice, but note that interfaces are not free;
since "minimize mallocs" is one of our major goals, we cannot use interfaces
whimsically.

Note there is no proposal for migrating to `type Cid []bytes`, because that
is generally considered to be strictly inferior to `type Cid string`.


Discoveries
-----------

### using interfaces as map keys forgoes a lot of safety checks

Using interfaces as map keys pushes a bunch of type checking to runtime.
E.g., it's totally valid at compile time to push a type which is non-comparable
into a map key; it will panic at *runtime* instead of failing at compile-time.

There's also no way to define equality checks between implementors of the
interface: golang will always use its innate concept of comparison for the
concrete types. This means its effectively *never safe* to use two different
concrete implementations of an interface in the same map; you may add elements
which are semantically "equal" in your mind, and end up very confused later
when both impls of the same "equal" object have been stored.

### sentinel values are possible in any impl, but some are clearer than others

When using `*Cid`, the nil value is a clear sentinel for 'invalid';
when using `type Cid string`, the zero value is a clear sentinel;
when using `type Cid struct` per Option A or D... the only valid check is
for a nil multihash field, since version=0 and codec=0 are both valid values.

### usability as a map key is important

We already covered this in the criteria section, but for clarity:

- Option A: ❌
- Option B: ✔
- Option C: ~ (caveats, and depends on concrete impl)
- Option D: ✔
- Option E: ✔

### living without offsets requires parsing

Since CID (and multihash!) are defined using varints, they require parsing;
we can't just jump into the string at a known offset in order to yield e.g.
the multicodec number.

In order to get to the 'meat' of the CID (the multihash content), we first
must parse:

- the CID version varint;
- the multicodec varint;
- the multihash type enum varint;
- and the multihash length varint.

Since there are many applications where we want to jump straight to the
multihash content (for example, when doing CAS sharding -- see the
[disclaimer](https://github.com/multiformats/multihash#disclaimers) about
bias in leading bytes), this overhead may be interesting.

How much this overhead is significant is hard to say from microbenchmarking;
it depends largely on usage patterns. If these traversals are a significant
timesink, it would be an argument for Option D/E.
If these traversals are *not* a significant timesink, we might be wiser
to keep to Option B, because keeping a struct full of offsets will add several
words of memory usage per CID, and we keep a *lot* of CIDs.

### interfaces cause boxing which is a significant performance cost

See `BenchmarkCidMap_CidStr` and friends.

Long story short: using interfaces *anywhere* will cause the compiler to
implicitly generate boxing and unboxing code (e.g. `runtime.convT2E`);
this is both another function call, and more concerningly, results in
large numbers of unbatchable memory allocations.

Numbers without context are dangerous, but if you need one: 33%.
It's a big deal.

This means attempts to "use interfaces, but switch to concrete impls when
performance is important" are a red herring: it doesn't work that way.

This is not a general inditement against using interfaces -- but
if a situation is at the scale where it's become important to mind whether
or not pointers are a performance impact, then that situation also
is one where you have to think twice before using interfaces.

### one way or another: let's get rid of that star

We should switch completely to handling `Cid` and remove `*Cid` completely.
Regardless of whether we do this by migrating to interface, or string
implementations, or simply structs with no pointers... once we get there,
refactoring to any of the *others* can become a no-op from the perspective
of any downstream code that uses CIDs.

(This means all access via functions, never references to fields -- even if
we were to use a struct implementation. *Pretend* there's a interface,
in other words.)

There are probably `gofix` incantations which can help us with this migration.
48 changes: 48 additions & 0 deletions _rsrch/cidiface/cid.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
package cid

import (
mh "github.com/multiformats/go-multihash"
)

// Cid represents a self-describing content adressed identifier.
//
// A CID is composed of:
//
// - a Version of the CID itself,
// - a Multicodec (indicates the encoding of the referenced content),
// - and a Multihash (which identifies the referenced content).
//
// (Note that the Multihash further contains its own version and hash type
// indicators.)
type Cid interface {
// n.b. 'yields' means "without copy", 'produces' means a malloc.

Version() uint64 // Yields the version prefix as a uint.
Multicodec() uint64 // Yields the multicodec as a uint.
Multihash() mh.Multihash // Yields the multihash segment.

String() string // Produces the CID formatted as b58 string.
Bytes() []byte // Produces the CID formatted as raw binary.

Prefix() Prefix // Produces a tuple of non-content metadata.

// some change notes:
// - `KeyString() CidString` is gone because we're natively a map key now, you're welcome.
// - `StringOfBase(mbase.Encoding) (string, error)` is skipped, maybe it can come back but maybe it should be a formatter's job.
// - `Equals(o Cid) bool` is gone because it's now `==`, you're welcome.

// TODO: make a multi-return method for {v,mc,mh} decomposition. CidStr will be able to implement this more efficiently than if one makes a series of the individual getter calls.
}

// Prefix represents all the metadata of a Cid,
// that is, the Version, the Codec, the Multihash type
// and the Multihash length. It does not contains
// any actual content information.
// NOTE: The use -1 in MhLength to mean default length is deprecated,
// use the V0Builder or V1Builder structures instead
type Prefix struct {
Version uint64
Codec uint64
MhType uint64
MhLength int
}
71 changes: 71 additions & 0 deletions _rsrch/cidiface/cidBoxingBench_test.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
package cid

import (
"testing"
)

// BenchmarkCidMap_CidStr estimates how fast it is to insert primitives into a map
// keyed by CidStr (concretely).
//
// We do 100 insertions per benchmark run to make sure the map initialization
// doesn't dominate the results.
//
// Sample results on linux amd64 go1.11beta:
//
// BenchmarkCidMap_CidStr-8 100000 16317 ns/op
// BenchmarkCidMap_CidIface-8 100000 20516 ns/op
//
// With benchmem on:
//
// BenchmarkCidMap_CidStr-8 100000 15579 ns/op 11223 B/op 207 allocs/op
// BenchmarkCidMap_CidIface-8 100000 19500 ns/op 12824 B/op 307 allocs/op
// BenchmarkCidMap_StrPlusHax-8 200000 10451 ns/op 7589 B/op 202 allocs/op
//
// We can see here that the impact of interface boxing is significant:
// it increases the time taken to do the inserts to 133%, largely because
// the implied `runtime.convT2E` calls cause another malloc each.
//
// There are also significant allocations in both cases because
// A) we cannot create a multihash without allocations since they are []byte;
// B) the map has to be grown several times;
// C) something I haven't quite put my finger on yet.
// Ideally we'd drive those down further as well.
//
// Pre-allocating the map reduces allocs by a very small percentage by *count*,
// but reduces the time taken by 66% overall (presumably because when a map
// re-arranges itself, it involves more or less an O(n) copy of the content
// in addition to the alloc itself). This isn't topical to the question of
// whether or not interfaces are a good idea; just for contextualizing.
//
func BenchmarkCidMap_CidStr(b *testing.B) {
for i := 0; i < b.N; i++ {
mp := map[CidStr]int{}
for x := 0; x < 100; x++ {
mp[NewCidStr(0, uint64(x), []byte{})] = x
}
}
}

// BenchmarkCidMap_CidIface is in the family of BenchmarkCidMap_CidStr:
// it is identical except the map key type is declared as an interface
// (which forces all insertions to be boxed, changing performance).
func BenchmarkCidMap_CidIface(b *testing.B) {
for i := 0; i < b.N; i++ {
mp := map[Cid]int{}
for x := 0; x < 100; x++ {
mp[NewCidStr(0, uint64(x), []byte{})] = x
}
}
}

// BenchmarkCidMap_CidStrAvoidMapGrowth is in the family of BenchmarkCidMap_CidStr:
// it is identical except the map is created with a size hint that removes
// some allocations (5, in practice, apparently).
func BenchmarkCidMap_CidStrAvoidMapGrowth(b *testing.B) {
for i := 0; i < b.N; i++ {
mp := make(map[CidStr]int, 100)
for x := 0; x < 100; x++ {
mp[NewCidStr(0, uint64(x), []byte{})] = x
}
}
}
Loading

0 comments on commit 5ddbe21

Please sign in to comment.