Do we need a separate spec for chunking? #2
Oh, I took a few notes on the matter over the last few months:

IPFS Profiling IPIP Notes

Understanding the problem

Part 2 can be separated out. I have started down this path with the new prior art section of the IETF multihash spec, which is a kind of "downpayment" on adding a "just the hash please" helper function that should be built into multihash libraries to convert an IPFS CID to a "normal hash" like Mark's

Part 1 is what this whole document is trying to address, in the form of defined profiles that can be "sensible defaults", strayed from at great risk. How profile identifiers can be baked INTO CIDv2s, passed out of band, catch-tried, ducktyped, etc., is out of scope for now.

Out of Scope / Orthogonal Problems
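As a rough illustration of the "just the hash please" helper mentioned under "Understanding the problem", here is a minimal Python sketch. The function names (`cid_to_plain_hash`, `make_raw_cid_v1`, `read_uvarint`) are hypothetical and not part of any multihash library; the sketch only handles base32 CIDv1 strings and assumes the standard layout of version, codec, hash code, digest length, digest.

```python
import base64
import hashlib
from typing import Tuple


def read_uvarint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode one unsigned LEB128 varint; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte < 0x80:
            return value, pos
        shift += 7


def cid_to_plain_hash(cid: str) -> Tuple[int, bytes]:
    """Return (multihash function code, raw digest) for a base32 CIDv1 string.

    Hypothetical helper: only the multibase 'b' (base32) prefix is handled.
    """
    if not cid.startswith("b"):
        raise ValueError("this sketch only handles base32 ('b') CIDs")
    body = cid[1:].upper()
    body += "=" * (-len(body) % 8)  # restore the padding multibase omits
    raw = base64.b32decode(body)
    version, pos = read_uvarint(raw, 0)
    if version != 1:
        raise ValueError("this sketch only handles CIDv1")
    _codec, pos = read_uvarint(raw, pos)
    hash_code, pos = read_uvarint(raw, pos)
    digest_len, pos = read_uvarint(raw, pos)
    digest = raw[pos:pos + digest_len]
    if len(digest) != digest_len:
        raise ValueError("truncated digest")
    return hash_code, digest


def make_raw_cid_v1(data: bytes) -> str:
    """Build a CIDv1 (raw codec 0x55, sha2-256 0x12) for a single block."""
    digest = hashlib.sha256(data).digest()
    raw = bytes([0x01, 0x55, 0x12, len(digest)]) + digest
    return "b" + base64.b32encode(raw).decode("ascii").rstrip("=").lower()


cid = make_raw_cid_v1(b"hello")
code, digest = cid_to_plain_hash(cid)
print(cid, hex(code), digest.hex())
```

Note that the extracted digest only equals the plain hash of the file when the content fits in a single raw block; for chunked content it is the hash of the root node, which is exactly the profile problem discussed in Part 1.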
Terminology

Technically, CID generation is only "ingress" in the IPFS mental model, where all the data ever CID'd exists in a self-indexing DHT; for now, let's just table the bikeshedding and use these terms within this document. We can update later, when all the work is done and this is all being rewritten as docs for external readers.

Ingress =
Parameterizing Ingress/Generation

Alan's Hierarchy of Parameters
becomes
becomes

mindmap
root((CID params))
ingress mode
wrapping node metadata
CAR file: true/false
??: type: data/files
onlyHash: true/false
concurrency
blockWriteConcurrency: uvarint _*10_
fileImportConcurrency: uvarint _*50_
CID properties
cidVersion: 0/1
DAG
DAG building strategy
strategy: flat/balanced/trickle
universal
maxChildrenPerNode: uvarint _*174_
reduceSingleLeafToSelf: true/false _*true_
rawleaves: true/false _*true_
leafType: string _*file_ -- ignored if rawleaves is true
balanced-only
trickle-specific
layerRepeat: uvarint _*4_
directory structure
shardSplitThreshold: uvarint
helper functions
hamtHashFn: async buffer
hamtBucketBits: uvarint _*8_
chunker: none/fixed/rabin
avgChunksize: int _*262144_
minChunksize: int _*0_
maxChunksize: int _*262144_
Mindmap of Alan's 2020 w3up Parameters

becomes (src)

Kubo State of Mind

Alan's 2020 attempts were based on

SYNOPSIS
ipfs add [--recursive | -r] [--dereference-args] [--stdin-name=<stdin-name>]
[--hidden | -H] [--ignore=<ignore>]...
[--ignore-rules-path=<ignore-rules-path>] [--quiet | -q]
[--quieter | -Q] [--silent] [--progress | -p] [--trickle | -t]
[--only-hash | -n] [--wrap-with-directory | -w]
[--chunker=<chunker> | -s] [--raw-leaves] [--nocopy] [--fscache]
[--cid-version=<cid-version>] [--hash=<hash>] [--inline]
[--inline-limit=<inline-limit>] [--pin=false] [--to-files=<to-files>]
[--preserve-mode] [--preserve-mtime] [--mode=<mode>]
[--mtime=<mtime>] [--mtime-nsecs=<mtime-nsecs>] [--] <path>...
ARGUMENTS
<path>... - The path to a file to be added to IPFS.
OPTIONS
-r, --recursive bool - Add directory paths recursively.
--dereference-args bool - Symlinks supplied in arguments are
dereferenced.
--stdin-name string - Assign a name if the file source is stdin.
-H, --hidden bool - Include files that are hidden. Only takes
effect on recursive add.
--ignore array - A rule (.gitignore-stype) defining which
file(s) should be ignored (variadic,
experimental).
--ignore-rules-path string - A path to a file with .gitignore-style
ignore rules (experimental).
-q, --quiet bool - Write minimal output.
-Q, --quieter bool - Write only final hash.
--silent bool - Write no output.
-p, --progress bool - Stream progress data.
-t, --trickle bool - Use trickle-dag format for dag generation.
-n, --only-hash bool - Only chunk and hash - do not write to
disk.
-w, --wrap-with-directory bool - Wrap files with a directory object.
-s, --chunker string - Chunking algorithm, size-[bytes],
rabin-[min]-[avg]-[max] or buzhash.
--raw-leaves bool - Use raw blocks for leaf nodes.
--nocopy bool - Add the file using filestore. Implies
raw-leaves. (experimental).
--fscache bool - Check the filestore for pre-existing
blocks. (experimental).
--cid-version int - CID version. Defaults to 0 unless an
option that depends on CIDv1 is passed.
Passing version 1 will cause the
raw-leaves option to default to true.
--hash string - Hash function to use. Implies CIDv1 if
not sha2-256. (experimental).
--inline bool - Inline small blocks into CIDs.
(experimental).
--inline-limit int - Maximum block size to inline.
(experimental). Default: 32.
--pin bool - Pin locally to protect added files from
garbage collection. Default: true.
--to-files string - Add reference to Files API (MFS) at the
provided path.
--preserve-mode bool - Apply existing POSIX permissions to
created UnixFS entries. Disables
raw-leaves. (experimental).
--mode uint - Custom POSIX file mode to store in
created UnixFS entries. Disables
raw-leaves. (experimental).
--mtime int64 - Custom POSIX modification time to store
in created UnixFS entries (seconds before
or after the Unix Epoch). Disables
raw-leaves. (experimental).
--mtime-nsecs uint - Custom POSIX modification time (optional
time fraction in nanoseconds).

becomes

- ingress mode
- wrapping node
- wrap with directory: true/false
- data or files
- data
- files
- dereference-symlinks: true/false
- stdin-name: string
- preserve POSIX perms for each file
- necessarily disables `raw-leaves` setting (automatic if any of the following are set as flags)
- preserve-mode: bool
- preserve-mtime: bool (sets modtime to now if false)
- mode: uint (overrides existing posix perms with provided)
- mtime: int64 (overrides modtime with provided)
- mtime-nsecs: uint (overrides modtime with provided in unix epoch nanoseconds)
- verbosity
- progress: true/false
- silent: true/false
- postprocessing
- onlyHash: true/false
- pin: true/false
- to-files: string (path to Files API (MFS))
- CID properties
- cidVersion: 0/1
- DAG
- DAG building strategy
- strategy
- rawleaves: true/false (true)
- trickle: true/false
- ??: is "balanced" the unmarked default if false?
- directory structure
- recursive: true/false
- wrap with directory: true/false
- inline: true/false
- inline-limit: uvarint (32)
- chunker: string
- rabin: `rabin-min-avg-max`
- avgChunksize: int (262144)
- minChunksize: int (0)
- maxChunksize: int (262144)
- static: `size-[number of bytes]`
- buzhash: `buzhash`

becomes

mindmap
root ((CID params))
ingress mode
wrap with directory: true/false
data or files
data
files
dereference-symlinks: true/false
stdin-name: string
preserve POSIX perms for each file
necessarily disables `raw-leaves` setting -- automatic if any of the following are set as flags
preserve-mode: bool
preserve-mtime: bool -- sets modtime to now if false
mode: uint -- overrides existing posix perms with provided
mtime: int64 -- overrides modtime with provided
mtime-nsecs: uint -- overrides modtime with provided in unix epoch nanoseconds
verbosity
progress: true/false
silent: true/false
postprocessing
onlyHash: true/false
pin: true/false
to-files: string -- path to Files API, i.e. MFS
CID properties
cidVersion: 0/1
DAG
DAG building strategy
strategy
rawleaves: true/false _*true_
trickle: true/false
??: is `balanced` the unmarked default if false?
directory structure
recursive: true/false
wrap with directory: true/false
inline: true/false
inline-limit: uvarint _*32_
chunker: string
rabin: rabin-+min+-+avg+-+max+
avg: int _*262144_
min: int _*0_
max: int _*262144_
static: size-+number of bytes+
buzhash: buzhash
Mindmap of Today's kubo Parameters

becomes (Src):

Riba's DAGger

Src: https://github.com/ribasushi/DAGger/

Usage: stream-dagger [-h] [--async-hashers integer]
  [--chunkers ch1_o1.1_o1.2_..._o1.N__ch2_o2.1_o2.2_..._o2.N__ch3_...]
  [--cid-multibase string] [--collectors co1_o1.1_o1.2_..._o1.N__co2_...]
  [--emit-stderr comma,sep,emitters] [--emit-stdout comma,sep,emitters]
  [--hash algname] [--hash-bits integer] [--help-all]
  [--inline-max-size bytes] [--ipfs-add-compatible-command cmdstring]
  [--multipart] [--node-encoder encname_opt1_opt2_..._optN]
  [--ring-buffer-min-sysread bytes] [--ring-buffer-size bytes]
  [--ring-buffer-sync-size bytes] [--skip-nul-inputs] [--stats-active uint]
--async-hashers=integer
Number of concurrent short-lived goroutines performing
hashing. Set to 0 (disable) for predictable benchmarking.
Default: [24]
--chunkers=ch1_o1.1_o1.2_..._o1.N__ch2_o2.1_o2.2_..._o2.N__ch3_...
Stream chunking algorithm chain. Each chunker is one of:
'buzhash', 'fixed-size', 'pad-finder', 'pigz', 'rabin'
--cid-multibase=string
Use this multibase when encoding CIDs for output. One of
'base32', 'base36'. Default: [base32]
--collectors=co1_o1.1_o1.2_..._o1.N__co2_...
Node-forming algorithm chain. Each collector is one of:
'fixed-cid-refs-size', 'fixed-outdegree', 'none',
'shrubber', 'trickle'
--emit-stderr=comma,sep,emitters
One or more emitters to activate on stdERR. Available
emitters are 'car-v0-fifos-xargs', 'car-v0-pinless-stream',
'chunks-jsonl', 'none', 'roots-jsonl', 'stats-jsonl',
'stats-text'. Default: [stats-text]
--emit-stdout=comma,sep,emitters
One or more emitters to activate on stdOUT. Available
emitters same as above. Default: [roots-jsonl]
--hash=algname
Hash function to use, one of: 'blake2b-256', 'murmur3-128',
'none', 'sha2-256', 'sha2-256-gocore', 'sha3-512'
--hash-bits=integer
Amount of bits taken from *start* of the hash output.
Default: [256]
-h, --help Display basic help
--help-all Display full help including options for every currently
supported chunker/collector/encoder
--inline-max-size=bytes
Use identity-CID to refer to blocks having on-wire size at
or below the specified value (36 is recommended), 0 disables
--ipfs-add-compatible-command=cmdstring
A complete go-ipfs/js-ipfs add command serving as a basis
config (any conflicting option will take precedence)
--multipart Expect multiple SInt64BE-size-prefixed streams on stdIN
--node-encoder=encname_opt1_opt2_..._optN
The IPLD-ish node encoder to use, one of: 'unixfsv1'
--ring-buffer-min-sysread=bytes
(EXPERT SETTING) Perform next read(2) only when the
specified amount of free space is available in the buffer.
Default: [262144]
--ring-buffer-size=bytes
The size of the quantized ring buffer used for ingestion.
Default: [25165824]
--ring-buffer-sync-size=bytes
(EXPERT SETTING) The size of each buffer synchronization
sector. Default: [65536]
--skip-nul-inputs
Instead of emitting an IPFS-compatible zero-length CID, skip
zero-length streams outright
--stats-active=uint
A bitfield representing activated stat aggregations:
bit0:BlockSizing, bit1:RingbufferTiming. Default: [1]
[C]ollector 'fixed-cid-refs-size'
Forms a DAG where the amount of bytes taken by CID references is limited
for every individual node. The IPLD-link referencing overhead, aside from
the CID length itself, is *NOT* considered.
------------
SubOptions
max-cid-refs-size=[160:]
Maximum cumulative byte-size of CID references within a node
[C]ollector 'fixed-outdegree'
Forms a DAG where every node has a fixed outdegree (amount of children).
The last (right-most) node in each DAG layer may have a lower outdegree.
------------
SubOptions
max-outdegree=[2:]
Maximum outdegree (children) for a node (IPFS default: 174)
[C]ollector 'none'
Does not form a DAG, nor emits a root CID. Simply redirects chunked data
to /dev/null. Takes no arguments.
[C]ollector 'shrubber'
This collector allows one to arrange, group and emit nodes as smaller
subtrees (shrubberies), before passing them to the next collector in the
chain. It combines several modes of operation, each benefitting from being
as close to the 'leaf node' layer as possible. Specifically:
-
------------
SubOptions
cid-subgroup-mask-bits=[4:16]
FIXME Amount of bits from the end of a cryptographic Cid to
compare of state to compare to target on every iteration.
For random input average chunk size is about 2**m
cid-subgroup-min-nodes=[0:]
FIXME The minimum amount of nodes clustered together before
employing CID-based subgrouping. 0 disables
cid-subgroup-target=[0:]
FIXME State value denoting a chunk boundary, check against
mask
max-payload=[0:MaxPayload]
FIXME Maximum payload size in each node. To skip
payload-based balancing, set this to 0.
static-pad-repeater-nodes=[1:]
FIXME LS
[C]ollector 'trickle'
Produces a "side-balanced" DAG optimized for streaming. Data blocks further
away from the stream start are arranged in nodes at increasing depth away
from the root. The rough "placement group" for a particular node LeafIndex
away from the stream start can be derived numerically via:
int( log( LeafIndex / MaxDirectLeaves ) / log( 1 + MaxSiblingSubgroups ) )
See the example program in trickle/init.go for more info.
------------
SubOptions
max-direct-leaves=[1:]
Maximum leaves per node (IPFS default: 174)
max-sibling-subgroups=[1:]
Maximum same-depth-groups per node (IPFS default: 4)
unixfs-nul-leaf-compat
Flag to force convergence with go-ipfs *specifically* when
encoding a 0-length stream (override encoder leaf-type)
[N]odeEncoder 'unixfsv1'
Implements UnixFSv1, the only encoding currently rendered by IPFS gateways.
By default generates go-ipfs-standard, inefficient, 'Tsize'-full linknodes.
------------
SubOptions
cidv0 Generate compat-mode CIDv0 links
merkledag-compat-protobuf
Output merkledag links/data in non-canonical protobuf order
for convergence with go-ipfs
non-standard-lean-links
Omit dag-size and offset information from all links. While
IPFS will likely render the result, ONE VOIDS ALL WARRANTIES
unixfs-leaf-decorator-type=value
Generate leaves as full UnixFS nodes with the given UnixFSv1
type (0 or 2). When unspecified (default) uses raw leaves
instead.
[C]hunker 'buzhash'
Chunker based on hashing by cyclic polynomial, similar to the one used
in 'attic-backup'. As source of "hashing" uses a predefined table of
values selectable via the hash-table option.
------------
SubOptions
hash-table=name
The hash table to use, one of: 'GoIPFSv0'
max-size=[1:MaxPayload]
Maximum data chunk size (IPFS default: 524288)
min-size=[0:MaxPayload]
Minimum data chunk size (IPFS default: 131072)
state-mask-bits=[5:22]
Amount of bits of state to compare to target on every
iteration. For random input average chunk size is about 2**m
(IPFS default: 17)
state-target=uint32
State value denoting a chunk boundary (IPFS default: 0)
[C]hunker 'fixed-size'
Splits buffer into equally sized chunks. Requires a single parameter: the
size of each chunk in bytes (IPFS default: 262144)
[C]hunker 'pad-finder'
PAD FIXME
------------
SubOptions
max-pad-run=value
unspecified
pad-freeform-re2=regex
FIXME
pad-static-hex=value
unspecified
static-pad-literal-max=value
unspecified
static-pad-min-repeats=value
unspecified
[C]hunker 'pigz'
FIXME
------------
SubOptions
max-size=[1:MaxPayload]
Maximum data chunk size
min-size=[0:MaxPayload]
Minimum data chunk size
state-mask-bits=[5:22]
Amount of bits of state to compare to target on every
iteration. For random input average chunk size is about 2**m
state-target=uint32
State value denoting a chunk boundary
[C]hunker 'rabin'
Chunker based on the venerable 'Rabin Fingerprint', similar to the one
used by `restic`, the LBFS, and others. The provided implementation is a
significantly slimmed-down adaptation of multiple "classic" versions.
------------
SubOptions
max-size=[1:MaxPayload]
Maximum data chunk size (IPFS default: 393216)
min-size=[0:MaxPayload]
Minimum data chunk size (IPFS default: 87381)
polynomial=uint64
(IPFS default: 17437180132763653)
state-mask-bits=[5:22]
Amount of bits of state to compare to target on every
iteration. For random input average chunk size is about 2**m
(IPFS default: 18)
state-target=uint64
State value denoting a chunk boundary (IPFS default: 0)
window-size=bytes
State value denoting a chunk boundary (IPFS default: 16)
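Two bits of arithmetic quoted in the DAGger help text above can be made concrete with a small Python sketch: the trickle collector's "placement group" formula and the "average chunk size is about 2**m" rule for `state-mask-bits`. The function names are hypothetical, and the clamp to group 0 for leaves that sit below `max-direct-leaves` is an assumption of this sketch (the raw formula is undefined at index 0 and negative just above it); the defaults 174 and 4 are the "IPFS default" values stated in the help text.

```python
import math


def trickle_placement_group(leaf_index: int,
                            max_direct_leaves: int = 174,
                            max_sibling_subgroups: int = 4) -> int:
    """Rough placement group for a leaf, per the formula in the help text.

    Assumption: leaves with an index below max_direct_leaves are treated as
    group 0, because log() is undefined at 0 and negative for ratios below 1.
    """
    if leaf_index < max_direct_leaves:
        return 0
    ratio = leaf_index / max_direct_leaves
    return int(math.log(ratio) / math.log(1 + max_sibling_subgroups))


def approx_avg_chunk_size(state_mask_bits: int) -> int:
    """Average chunk size for random input is about 2**m, per the help text."""
    return 1 << state_mask_bits


for index in (10, 500, 1000, 100000):
    print(index, trickle_placement_group(index))

print("buzhash, 17 bits:", approx_avg_chunk_size(17))
print("rabin, 18 bits:", approx_avg_chunk_size(18))
```

The point of the sketch is only that each of these knobs feeds directly into the DAG shape, so any two implementations that disagree on one of them will produce different root CIDs for the same input.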
Riba: "anelace was created specifically to rip out all the knobs/tunables

Riba: The flags that are missing is extra tunables on what to do when you have a dangling "chunk" you want to add to a tree of 10 leaves of, say, height 3

Lists of Profiles that exist

Kubo CIDv0

c.Import.CidVersion = *NewOptionalInteger(0)
c.Import.UnixFSRawLeaves = False
c.Import.UnixFSChunker = *NewOptionalString("size-262144")
c.Import.HashFunction = *NewOptionalString("sha2-256")

(Src: Lidel in kubo#4143 on kubo, May 2024)

Kubo CIDv1

c.Import.CidVersion = *NewOptionalInteger(1)
c.Import.UnixFSRawLeaves = True
c.Import.UnixFSChunker = *NewOptionalString("size-1048576")
c.Import.HashFunction = *NewOptionalString("sha2-256")

Lidel: "Feedback about what should go into [the] It feels like a good opportunity to ALSO enable --inline with --inline-limit=32, to have one release with "breaking change" instead of two.

(Src: Lidel in kubo#4143 on kubo, Sep 2023)

Kubo POSIX-aware mode

(worth including? still marked experimental in

Filecoin pieceCIDs

(are there other CID profiles used in other corners of FIL land?)
Web3.storage

LUCID

Iroh

open questions:

RIBS

basic premise:

WARC

See the WARC record file-chunking section of the Web Recorder spec for a distinct ingress profile.
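To make the idea of a "profile" concrete, the two Kubo profiles quoted above can be written down as plain data and compared. This is a minimal Python sketch, not Kubo's configuration API: the `ImportProfile` class, `diff_profiles`, and `parse_chunker` are hypothetical names, the field values are copied from the `c.Import.*` lines above, and the chunker parser only accepts the three forms shown in the `ipfs add --chunker` help text (`size-[bytes]`, `rabin-[min]-[avg]-[max]`, `buzhash`).

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ImportProfile:
    """Hypothetical container for the four settings quoted from Kubo."""
    cid_version: int
    raw_leaves: bool
    chunker: str
    hash_function: str


KUBO_CIDV0 = ImportProfile(0, False, "size-262144", "sha2-256")
KUBO_CIDV1 = ImportProfile(1, True, "size-1048576", "sha2-256")


def parse_chunker(spec: str) -> dict:
    """Parse the chunker string forms listed in the `ipfs add` help text."""
    name, _, rest = spec.partition("-")
    if name == "size":
        return {"kind": "fixed", "size": int(rest)}
    if name == "rabin":
        minimum, average, maximum = (int(part) for part in rest.split("-"))
        return {"kind": "rabin", "min": minimum, "avg": average, "max": maximum}
    if name == "buzhash" and not rest:
        return {"kind": "buzhash"}
    raise ValueError(f"unrecognised chunker spec: {spec!r}")


def diff_profiles(a: ImportProfile, b: ImportProfile) -> dict:
    """Return {field: (value in a, value in b)} for every differing field."""
    left, right = asdict(a), asdict(b)
    return {key: (left[key], right[key]) for key in left if left[key] != right[key]}


print(diff_profiles(KUBO_CIDV0, KUBO_CIDV1))
print(parse_chunker(KUBO_CIDV1.chunker))
```

Even between these two profiles from the same implementation, three of the four listed settings differ, and each difference is enough on its own to change the resulting CID.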
The root spec currently has no mention of "larger" data, i.e. files/blobs/directories. While I don't have the answers for all the open questions and debates, I think it's worthy of mention, irrespective of whether you choose to incorporate it.

There's BDASL right there in the list of specs on https://dasl.ing/! And https://dasl.ing/bdasl.html!

Right! That's useful, but there's no mention of files/directories/UnixFS, which means representing anything more than just a blob of data isn't supported. I think that's worth mentioning since so many of the CIDs in the ecosystem are UnixFS.
I think one of the questions I keep wanting to raise that @2color is pointing at is there's a conceptual difference between

They've been confused cause of the IPFS block limit, but they're not the same. A Blake3 like algorithm solves the first problem only, not the second. A directory for example is a simple structural piece of data with selective querying (I just want to follow a path) -- a directory stored as a single blob of bytes I have to download entirely is less useful than a directory I can path into for just what I want. Arguably there are ways you can collapse these problems together but they are complicated IMHO (see https://github.com/Gozala/merkle-reference).

My argument is once you add "this data will be broken up in interesting ways to enable querying" the "data always hashes to the same CID" proposition becomes somewhat impossible.
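The argument above can be shown with a toy example. The Python sketch below is deliberately not UnixFS, dag-pb, or a real CID: `toy_root` is a hypothetical two-level construction (hash each fixed-size chunk, then hash the concatenated leaf digests) that exists only to show how the choice of chunk size alone changes the root, while the plain hash of the bytes stays the same.

```python
import hashlib


def toy_root(data: bytes, chunk_size: int) -> str:
    """Hash fixed-size chunks, then hash the concatenated leaf digests.

    Toy construction for illustration only: NOT UnixFS, dag-pb, or a CID.
    A single-chunk input collapses to the plain hash of the data.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    leaves = [hashlib.sha256(chunk).digest() for chunk in chunks]
    if len(leaves) == 1:
        return leaves[0].hex()
    return hashlib.sha256(b"".join(leaves)).hexdigest()


payload = bytes(range(256)) * 4  # 1024 bytes of sample data

flat = hashlib.sha256(payload).hexdigest()
root_256 = toy_root(payload, 256)
root_512 = toy_root(payload, 512)

print("plain sha2-256:", flat)
print("chunk size 256:", root_256)
print("chunk size 512:", root_512)
```

Same bytes, same hash function, three different identifiers. The root is only reproducible when both sides agree on the chunking parameters, which is why those parameters need to live somewhere.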
Cross-linking to #16 (comment) which is relevant for this discussion
Chunking can be useful but should be treated separately from CIDs, with its own indirection layer that can encode the required parameters. Is that useful? Should it be part of CAR?