-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cid implementation variations++ #72
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -27,6 +27,7 @@ There are quite a few different ways to go: | |
- Option C: CIDs as an interface with multiple implementors. | ||
- Option D: CIDs as a struct; multihash also as a struct or string. | ||
- Option E: CIDs as a struct; content as strings plus offsets. | ||
- Option F: CIDs as a struct wrapping only a string. | ||
|
||
The current approach on the master branch is Option A. | ||
|
||
|
@@ -42,6 +43,10 @@ the binary format of the cid internally, and so could yield it again without | |
malloc, while still potentially having faster access to components than | ||
Option B since it wouldn't need to re-parse varints to access later fields. | ||
|
||
Option F is actually a varation of Option B; it's distinctive from the other | ||
struct options because it is proposing *literally* `struct{ x string }` as | ||
the type, with no additional fields for components nor offsets. | ||
|
||
Option C is the avoid-choices choice, but note that interfaces are not free; | ||
since "minimize mallocs" is one of our major goals, we cannot use interfaces | ||
whimsically. | ||
|
@@ -72,6 +77,7 @@ When using `*Cid`, the nil value is a clear sentinel for 'invalid'; | |
when using `type Cid string`, the zero value is a clear sentinel; | ||
when using `type Cid struct` per Option A or D... the only valid check is | ||
for a nil multihash field, since version=0 and codec=0 are both valid values. | ||
When using `type Cid struct{string}` per Option F, zero is a clear sentinel. | ||
|
||
### usability as a map key is important | ||
|
||
|
@@ -82,6 +88,7 @@ We already covered this in the criteria section, but for clarity: | |
- Option C: ~ (caveats, and depends on concrete impl) | ||
- Option D: ✔ | ||
- Option E: ✔ | ||
- Option F: ✔ | ||
|
||
### living without offsets requires parsing | ||
|
||
|
@@ -106,7 +113,7 @@ How much this overhead is significant is hard to say from microbenchmarking; | |
it depends largely on usage patterns. If these traversals are a significant | ||
timesink, it would be an argument for Option D/E. | ||
If these traversals are *not* a significant timesink, we might be wiser | ||
to keep to Option B, because keeping a struct full of offsets will add several | ||
to keep to Option B/F, because keeping a struct full of offsets will add several | ||
words of memory usage per CID, and we keep a *lot* of CIDs. | ||
|
||
### interfaces cause boxing which is a significant performance cost | ||
|
@@ -129,6 +136,23 @@ if a situation is at the scale where it's become important to mind whether | |
or not pointers are a performance impact, then that situation also | ||
is one where you have to think twice before using interfaces. | ||
|
||
### struct wrappers can be used in place of typedefs with zero overhead | ||
|
||
See `TestSizeOf`. | ||
|
||
Using the `unsafe.Sizeof` feature to inspect what the Go runtime thinks, | ||
we can see that `type Foo string` and `type Foo struct{x string}` consume | ||
precisely the same amount of memory. | ||
|
||
This is interesting because it means we can choose between either | ||
type definition with no significant overhead anywhere we use it: | ||
thus, we can choose freely between Option B and Option F based on which | ||
we feel is more pleasant to work with. | ||
|
||
Option F (a struct wrapper) means we can prevent casting into our Cid type. | ||
Option B (typedef string) can be declared a `const`. | ||
Are there any other concerns that would separate the two choices? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was going to speculate about if one would be easier/faster for refmt to munge than the other, because struct-containing-string seems like it going to require at least a few more reflection calls... But on second thought that probably doesn't matter at all; CIDs are going to be using transform funcs either way because the serial type is either bytes (which is a different Are there any other things that would make B/F distinct? 🤔 |
||
|
||
### one way or another: let's get rid of that star | ||
|
||
We should switch completely to handling `Cid` and remove `*Cid` completely. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The struct wrapper allows us to add additional metadata/offsets later if we want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's "Option D" or "Option E" though, respectively.
Tbh, I think basically all of these are equal in terms of how easy they are to switch to any of the others, later. Switching from typedef-string to struct later would be a "breaking change" in terms of how a linker things about the raw assembly... but it will not be a breaking change syntactically‡, and that's what matters to us as humans estimating refactor costs.
[‡] Presuming we make all accesses outside the cid package itself via public functions, and define a
cid.Zero
. Which would be the plan, I think. The only thing that makes this first big refactor a pain is the syntactic change of removing the*
and fixing all the== nil
occurances.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
People would still compare against "" and cast to a string/bytes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The former should be rejected in code review based on the existence of
cid.Zero
and introduced universally when we do the refactor; the latter would be a bug.If we're seriously worried about these, then we should 100% hands-down prefer Option F over B, because more than precisely zero instances of casting to Cid from outside the cid package would strike me as acceptable code quality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not an assumption we can make, generally. Users (not in our projects) will cast. It's really the primary reason to use a type alias over a struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the choice of the constant I am going with
Nil
. My second choice would beEmpty
, andZero
the third.Nil
to me is neutral and also implies an invalid value, of which an empty Cid{} struct is (it can't be used for anything).Empty
to me implies a valid but empty container, but I am okay with it ifNil
is causing that much of a problem.Zero
to me implies a numeric value, of which a Cid is not. I understand the notion of a zero-value in compiler terminology, butZero
still strikes me as an odd choice,ZeroValue
may be a bit better, but ifNil
is no good I still rather go withEmpty
.I didn't realize that using struct initializers in the middle of equality was so verbose. However, it was never my intent to use them in equality. Instead I provided an
IsNil
method as c.IsNil() is nicer than c == cid.Nil or c == (cid.Cid{}). However elected to still providevar Nil = Cid{}
so it can be used if desired.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to make speculations about whether comparison to a var is going to be a performance issue, then shouldn't proposing an equality method have at least as much burden to prove itself a non-issue before we consider that a confirmed design choice?
I disagree completely about naming a variable of non-pointer type "Nil". That is extremely unidiomatic Go because it conflicts with the language's own definitions of the term, and in discussion elsewhere I've already linked the part of the language spec which defines zero versus nil. But this PR is probably not the topical place to relitigate this naming, either, seeing as how the term is not used anywhere in the diff body we're discussing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm strongly in favor of
Nil
as a constant. WhileZero
is technically correct, it's confusing.However, as a method, I'd be fine with
IsEmpty
(orIsZero
but I'd rather not).As for the method, golang is pretty good about inlining one-line methods so it should be fine. After doing a bit of decompiling, it looks like we should go with:
c == (Cid{})
seems to cost us an extra xor zeroing something out.c == Nil
costs us a mov.See: https://go.godbolt.org/z/CHyhZl
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Stebalien the
IsNilString
one looks wrong. The compiler is getting really smart there and looks like its comparing the empty string to itself, and calling that the output.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(discussed on slack)
The assembly was correct. The generated instruction was
testq AX, AX
which testsAX & AX
to see if it's non-nil.