
Do we need a separate spec for chunking? #2

Open
darobin opened this issue Nov 25, 2024 · 7 comments

Comments

@darobin
Owner

darobin commented Nov 25, 2024

Chunking can be useful but should be treated separately from CIDs, with its own indirection layer that can encode the required parameters. Is that useful? Should it be part of CAR?

@bumblefudge
Collaborator

Oh, I took a few notes on the matter over the last few months:

IPFS Profiling IPIP Notes

Understanding the problem

Part 2 can be separated out. I have started down this path with the new prior-art section of the IETF multihash spec, which is a kind of down payment on a "just the hash, please" helper function that should be built into multihash libraries: converting an IPFS CID to a "normal hash" like Mark's sha256(big data), for any CID that points to locally available, complete data, whether it's been DAG'd or LUCID/iroh-style direct-hashed. Sidenote: I love Mark's term cidsum for the CID-to-checksum method; let's keep that one in our back pockets in case this actually gets written up as an IPIP and incorporated into kubo's feature roadmap, or packaged up as a distinct library/stand-alone tool. At least conceptually, part 2 is much easier to tackle after part 1 ships.
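To make the "cidsum" idea concrete, here is a hedged, stdlib-only Python sketch (all function names are invented; it handles only the narrow CIDv1 + raw codec + sha2-256 case, where the CID's digest really is a hash of the complete bytes — a DAG'd CID would not verify this way):

```python
import base64
import hashlib

def _varint_decode(buf, i):
    """Decode an unsigned LEB128 varint starting at index i; return (value, next_index)."""
    shift = value = 0
    while True:
        b = buf[i]; i += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, i
        shift += 7

def _varint_encode(n):
    """Encode an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | (0x80 if n else 0))
        if not n:
            return bytes(out)

def encode_cidv1_raw_sha256(data):
    """Build a CIDv1 (raw codec 0x55, sha2-256 0x12) for data: base32-lower, 'b' multibase prefix."""
    digest = hashlib.sha256(data).digest()
    raw = (_varint_encode(1) + _varint_encode(0x55) +
           _varint_encode(0x12) + _varint_encode(len(digest)) + digest)
    return "b" + base64.b32encode(raw).decode().lower().rstrip("=")

def cidsum(cid, data):
    """'Just the hash, please': check sha256(data) against the digest inside a CIDv1."""
    assert cid[0] == "b", "only base32-lower multibase handled in this sketch"
    b32 = cid[1:].upper()
    b32 += "=" * (-len(b32) % 8)  # restore stripped base32 padding
    raw = base64.b32decode(b32)
    i = 0
    version, i = _varint_decode(raw, i)
    codec, i = _varint_decode(raw, i)
    hash_code, i = _varint_decode(raw, i)
    length, i = _varint_decode(raw, i)
    digest = raw[i:i + length]
    assert version == 1 and codec == 0x55 and hash_code == 0x12, \
        "sketch handles CIDv1 + raw + sha2-256 only"
    return hashlib.sha256(data).digest() == digest
```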

Part 1 is what this whole document is trying to address, in the form of defined profiles that serve as "sensible defaults" to be strayed from at great risk. How profile identifiers get baked INTO CIDv2s, passed out of band, catch-tried, duck-typed, etc., is out of scope for now.

Out of Scope / Orthogonal Problems

  • Debatably, the "content-type"/HTTP-interop question comes up any time you start talking about multicodecs; it's worth noting that Lidel deliberately marks it as out of scope for kubo's current profiling efforts. He links to the WARC file-chunking section of the Webrecorder spec, which does custom chunking to mint a CID for each WARC record, letting the WARC function like a CAR file: a web archive segmented into individual files labeled in the manifest by encoding type. Future work here for sure; perhaps WARC-aware (or otherwise HTTP/encoding-type-aware) profiles are worth adding to the list?
  • How profiles figure in the Doc Sprint
    • Lidel: Would it be useful to have “CID Conventions” section at https://specs.ipfs.tech/ as a way of disseminating information about involved settings to implementers that care about 1:1 reproducible CIDs? (Src)
  • Big datasets that are appended to or incrementally changed over time require "branch-aware" DAG-building strategies and updating tools; this probably only works with some profiles, and these use cases should be steered away from the kubo default profiles and the web3.storage profiles alike, both of which target static files rather than evolving DAGs
  • Content credentials / CID equivalence manifests could just list all possible parameters defined in the Profiling IPIP...
    • note that hannah's example includes both a top-level SHA-256 and a blake3 "direct hash" of input, in addition to a UnixFS object containing DAG params... interesting aesthetic choice

Terminology

Technically, CID generation is only "ingress" in the IPFS mental model, where all data ever CID'd lives in a self-indexing DHT; for now, let's table the bikeshedding and use these terms within this document. We can update later, when all the work is done and this is being rewritten as docs for external readers.

Ingress =

  1. Understand input
  2. If DAG, generate DAG
  3. If UnixFS, structure folders/paths/directories
  4. If packaging, generate CAR file and/or alternate CIDs and/or content credentials
  5. [Optional] Post-processing of CID(s) and/or bytes: locally pin or not, index, advertise, hash-only, etc

Parameterizing Ingress/Generation

Alan's Hierarchy of Parameters

Src

  • wrap (boolean, defaults to false): if true, a wrapping node will be created
  • shardSplitThreshold (positive integer, defaults to 1000): the number of directory entries above which we decide to use a sharding directory builder (instead of the default flat one)
  • chunker (string, defaults to "fixed"): the chunking strategy. Supports:
    • fixed
    • rabin
  • avgChunkSize (positive integer, defaults to 262144): the average chunk size (rabin chunker only)
  • minChunkSize (positive integer): the minimum chunk size (rabin chunker only)
  • maxChunkSize (positive integer, defaults to 262144): the maximum chunk size
  • strategy (string, defaults to "balanced"): the DAG builder strategy name. Supports:
    • flat: flat list of chunks
    • balanced: builds a balanced tree
    • trickle: builds a trickle tree
  • maxChildrenPerNode (positive integer, defaults to 174): the maximum children per node for the balanced and trickle DAG builder strategies
  • layerRepeat (positive integer, defaults to 4): (only applicable to the trickle DAG builder strategy). The maximum repetition of parent nodes for each layer of the tree.
  • reduceSingleLeafToSelf (boolean, defaults to true): optimization for, when reducing a set of nodes with one node, reduce it to that node.
  • hamtHashFn (async function(string) Buffer): a function that hashes file names to create HAMT shards
  • hamtBucketBits (positive integer, defaults to 8): the number of bits at each bucket of the HAMT
  • progress (function): a function that will be called with the byte length of chunks as a file is added to ipfs.
  • onlyHash (boolean, defaults to false): Only chunk and hash - do not write to disk
  • hashAlg (string): multihash hashing algorithm to use
  • cidVersion (integer, default 0): the CID version to use when storing the data (storage keys are based on the CID, including its version)
  • rawLeaves (boolean, defaults to false): When a file would span multiple DAGNodes, if this is true the leaf nodes will not be wrapped in UnixFS protobufs and will instead contain the raw file bytes
  • leafType (string, defaults to 'file'): the type of UnixFS node leaves; can be 'file' or 'raw' (ignored when rawLeaves is true)
  • blockWriteConcurrency (positive integer, defaults to 10): how many blocks to hash and write to the block store concurrently. For small numbers of large files this should be high (e.g. 50).
  • fileImportConcurrency (number, defaults to 50): how many files to import concurrently. For large numbers of small files this should be high (e.g. 50).

becomes

  • ingress mode
    • wrapping node
      • CAR file: true/false
      • data or files
        • data
        • files
    • onlyHash: true/false
    • concurrency
      • blockWriteConcurrency: uvarint (10)
      • fileImportConcurrency: uvarint (50)
  • CID properties
    • cidVersion: 0/1
  • DAG
    • DAG building strategy
      • strategy: flat/balanced/trickle
        • universal
          • maxChildrenPerNode: uvarint (174)
          • reduceSingleLeafToSelf: true/false (true)
          • rawleaves: true/false (true)
          • leafType: string ("file") - ignored if rawleaves=true
        • balanced-only
        • trickle-specific
          • layerRepeat: uvarint (4)
      • directory structure
        • shardSplitThreshold: uvarint
      • helper functions
        • hamtHashFn: async buffer
          • hamtBucketBits: uvarint (8)
    • chunker: none/fixed/rabin
      • avgChunksize: int (262144)
      • minChunksize: int (0)
      • maxChunksize: int (262144)

becomes
(note: ignore the errors, hedgedoc v1 doesn't do mermaid.js mindmaps yet)

mindmap
  root((CID params))
    ingress mode
        wrapping node metadata
            CAR file: true/false
            ??: type: data/files
        onlyHash: true/false
        concurrency
            blockWriteConcurrency: uvarint _*10_
            fileImportConcurrency: uvarint _*50_
    CID properties
        cidVersion: 0/1 
    DAG
        DAG building strategy
            strategy: flat/balanced/trickle
                universal
                    maxChildrenPerNode: uvarint _*174_
                    reduceSingleLeafToSelf: true/false _*true_
                    rawleaves: true/false _*true_
                    leafType: string _*file_ -- ignored if rawleaves true
                balanced-only
                trickle-specific
                    layerRepeat: uvarint _*4_
            directory structure
                shardSplitThreshold: uvarint
            helper functions
                hamtHashFn: async buffer
                    hamtBucketBits: uvarint _*8_
        chunker: none/fixed/rabin
            avgChunksize: int _*262144_
            minChunksize: int _*0_
            maxChunksize: int _*262144_  

Mindmap of Alan's 2020 w3up Parameters

becomes (src)
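Alan's parameter hierarchy above could also be captured as a single profile object, which is the shape a "profile spec" would presumably serialize. A hedged Python sketch (field names snake_cased, defaults transcribed from the list; purely illustrative, not any real library's API):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IngressProfile:
    """One bundle of 'sensible defaults' for CID generation, per Alan's list."""
    wrap: bool = False
    shard_split_threshold: int = 1000
    chunker: str = "fixed"                 # "fixed" or "rabin"
    avg_chunk_size: int = 262144           # rabin chunker only
    min_chunk_size: Optional[int] = None   # rabin chunker only
    max_chunk_size: int = 262144
    strategy: str = "balanced"             # "flat", "balanced", or "trickle"
    max_children_per_node: int = 174       # balanced and trickle strategies
    layer_repeat: int = 4                  # trickle strategy only
    reduce_single_leaf_to_self: bool = True
    hamt_bucket_bits: int = 8
    only_hash: bool = False
    hash_alg: str = "sha2-256"
    cid_version: int = 0
    raw_leaves: bool = False
    leaf_type: str = "file"                # ignored when raw_leaves is True
    block_write_concurrency: int = 10
    file_import_concurrency: int = 50
```

Two profiles that produce different CIDs for the same bytes differ in at least one of these fields, which is what makes a compact profile identifier attractive.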

Kubo State of Mind

Alan's 2020 attempt was based on the ipfs add options of the time; today they are different. These are the parameters in the current ipfs add help message:

SYNOPSIS
  ipfs add [--recursive | -r] [--dereference-args] [--stdin-name=<stdin-name>]
           [--hidden | -H] [--ignore=<ignore>]...
           [--ignore-rules-path=<ignore-rules-path>] [--quiet | -q]
           [--quieter | -Q] [--silent] [--progress | -p] [--trickle | -t]
           [--only-hash | -n] [--wrap-with-directory | -w]
           [--chunker=<chunker> | -s] [--raw-leaves] [--nocopy] [--fscache]
           [--cid-version=<cid-version>] [--hash=<hash>] [--inline]
           [--inline-limit=<inline-limit>] [--pin=false] [--to-files=<to-files>]
           [--preserve-mode] [--preserve-mtime] [--mode=<mode>]
           [--mtime=<mtime>] [--mtime-nsecs=<mtime-nsecs>] [--] <path>...

ARGUMENTS

  <path>... - The path to a file to be added to IPFS.

OPTIONS

  -r, --recursive            bool   - Add directory paths recursively.
  --dereference-args         bool   - Symlinks supplied in arguments are
                                      dereferenced.
  --stdin-name               string - Assign a name if the file source is stdin.
  -H, --hidden               bool   - Include files that are hidden. Only takes
                                      effect on recursive add.
  --ignore                   array  - A rule (.gitignore-style) defining which
                                      file(s) should be ignored (variadic,
                                      experimental).
  --ignore-rules-path        string - A path to a file with .gitignore-style
                                      ignore rules (experimental).
  -q, --quiet                bool   - Write minimal output.
  -Q, --quieter              bool   - Write only final hash.
  --silent                   bool   - Write no output.
  -p, --progress             bool   - Stream progress data.
  -t, --trickle              bool   - Use trickle-dag format for dag generation.
  -n, --only-hash            bool   - Only chunk and hash - do not write to
                                      disk.
  -w, --wrap-with-directory  bool   - Wrap files with a directory object.
  -s, --chunker              string - Chunking algorithm, size-[bytes],
                                      rabin-[min]-[avg]-[max] or buzhash.
  --raw-leaves               bool   - Use raw blocks for leaf nodes.
  --nocopy                   bool   - Add the file using filestore. Implies
                                      raw-leaves. (experimental).
  --fscache                  bool   - Check the filestore for pre-existing
                                      blocks. (experimental).
  --cid-version              int    - CID version. Defaults to 0 unless an
                                      option that depends on CIDv1 is passed.
                                      Passing version 1 will cause the
                                      raw-leaves option to default to true.
  --hash                     string - Hash function to use. Implies CIDv1 if
                                      not sha2-256. (experimental).
  --inline                   bool   - Inline small blocks into CIDs.
                                      (experimental).
  --inline-limit             int    - Maximum block size to inline.
                                      (experimental). Default: 32.
  --pin                      bool   - Pin locally to protect added files from
                                      garbage collection. Default: true.
  --to-files                 string - Add reference to Files API (MFS) at the
                                      provided path.
  --preserve-mode            bool   - Apply existing POSIX permissions to
                                      created UnixFS entries. Disables
                                      raw-leaves. (experimental).
  --mode                     uint   - Custom POSIX file mode to store in
                                      created UnixFS entries. Disables
                                      raw-leaves. (experimental).
  --mtime                    int64  - Custom POSIX modification time to store
                                      in created UnixFS entries (seconds before
                                      or after the Unix Epoch). Disables
                                      raw-leaves. (experimental).
  --mtime-nsecs              uint   - Custom POSIX modification time (optional
                                      time fraction in nanoseconds).

becomes

- ingress mode
    - wrapping node
        - wrap with directory: true/false
        - data or files
            - data
            - files
                - dereference-symlinks: true/false
                - stdin-name: string
                - preserve POSIX perms for each file
                    - necessarily disables `raw-leaves` setting (automatic if any of the following are set as flags)
                    - preserve-mode: bool
                    - preserve-mtime: bool (sets modtime to now if false)
                    - mode: uint (overrides existing posix perms with provided)
                    - mtime: int64 (overrides modtime with provided)
                    - mtime-nsecs: uint (overrides modtime with provided in unix epoch nanoseconds)
    - verbosity
        - progress: true/false
        - silent: true/false
    - postprocessing
        - onlyHash: true/false
        - pin: true/false
        - to-files: string (path to Files API (MFS))
- CID properties
    - cidVersion: 0/1
- DAG
    - DAG building strategy
        - strategy
            - rawleaves: true/false (true)
            - trickle: true/false
                - ??: is "balanced" the unmarked default if false?
        - directory structure
            - recursive: true/false
            - wrap with directory: true/false
        - inline: true/false
            - inline-limit: uvarint (32)
    - chunker: string
        - rabin: `rabin-min-avg-max`
            - avgChunksize: int (262144)
            - minChunksize: int (0)
            - maxChunksize: int (262144)
        - static: `size-[number of bytes]`
        - buzhash: `buzhash`

becomes

mindmap
    root ((CID params))
        ingress mode
            wrap with directory: true/false
            data or files
                data
                files
                    dereference-symlinks: true/false
                    stdin-name: string
                    preserve POSIX perms for each file
                        necessarily disables `raw-leaves` setting -- automatic if any of the following are set as flags
                        preserve-mode: bool
                        preserve-mtime: bool -- sets modtime to now if false
                        mode: uint -- overrides existing posix perms with provided
                        mtime: int64 -- overrides modtime with provided
                        mtime-nsecs: uint -- overrides modtime with provided in unix epoch nanoseconds
            verbosity
                progress: true/false
                silent: true/false
            postprocessing
                onlyHash: true/false
                pin: true/false
                to-files: string -- path to Files API, i.e. MFS
        CID properties
            cidVersion: 0/1
        DAG
            DAG building strategy
                strategy
                    rawleaves: true/false _*true_
                    trickle: true/false
                        ??: is `balanced` the unmarked default if false?
                directory structure
                    recursive: true/false
                    wrap with directory: true/false
                inline: true/false
                    inline-limit: uvarint _*32_
            chunker: string
                rabin: rabin-+min+-+avg+-+max+
                    avg: int _*262144_
                    min: int _*0_
                    max: int _*262144_
                static: size-+number of bytes+
                buzhash: buzhash

Mindmap of Today's kubo Parameters

becomes (Src):
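The chunker strings in the kubo tree above (`size-[bytes]`, `rabin-[min]-[avg]-[max]`, `buzhash`) are compact enough to parse directly. A hedged sketch (`parse_chunker` is an invented name; the bare-`rabin` defaults are taken from the IPFS defaults quoted in the DAGger help further down, so treat them as an assumption):

```python
def parse_chunker(spec: str):
    """Parse a kubo-style --chunker string into (name, params). Sketch only."""
    if spec == "buzhash":
        return ("buzhash", {})
    if spec.startswith("size-"):
        return ("size", {"size": int(spec[len("size-"):])})
    if spec == "rabin":
        # Assumed defaults, matching the "IPFS default" values in DAGger's rabin help
        return ("rabin", {"min": 87381, "avg": 262144, "max": 393216})
    if spec.startswith("rabin-"):
        mn, avg, mx = (int(x) for x in spec[len("rabin-"):].split("-"))
        return ("rabin", {"min": mn, "avg": avg, "max": mx})
    raise ValueError(f"unknown chunker spec: {spec}")
```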

Riba's DAGger

Src: https://github.com/ribasushi/DAGger/

Usage: stream-dagger [-h] [--async-hashers integer] [--chunkers ch1_o1.1_o1.2_..._o1.N__ch2_o2.1_o2.2_..._o2.N__ch3_...] [--cid-multibase string] [--collectors co1_o1.1_o1.2_..._o1.N__co2_...] [--emit-stderr comma,sep,emitters] [--emit-stdout comma,sep,emitters] [--hash algname] [--hash-bits integer] [--help-all] [--inline-max-size bytes] [--ipfs-add-compatible-command cmdstring] [--multipart] [--node-encoder encname_opt1_opt2_..._optN] [--ring-buffer-min-sysread bytes] [--ring-buffer-size bytes] [--ring-buffer-sync-size bytes] [--skip-nul-inputs] [--stats-active uint]
     --async-hashers=integer
                  Number of concurrent short-lived goroutines performing
                  hashing. Set to 0 (disable) for predictable benchmarking.
                  Default: [24]
     --chunkers=ch1_o1.1_o1.2_..._o1.N__ch2_o2.1_o2.2_..._o2.N__ch3_...
                  Stream chunking algorithm chain. Each chunker is one of:
                  'buzhash', 'fixed-size', 'pad-finder', 'pigz', 'rabin'
     --cid-multibase=string
                  Use this multibase when encoding CIDs for output. One of
                  'base32', 'base36'. Default: [base32]
     --collectors=co1_o1.1_o1.2_..._o1.N__co2_...
                  Node-forming algorithm chain. Each collector is one of:
                  'fixed-cid-refs-size', 'fixed-outdegree', 'none',
                  'shrubber', 'trickle'
     --emit-stderr=comma,sep,emitters
                  One or more emitters to activate on stdERR. Available
                  emitters are 'car-v0-fifos-xargs', 'car-v0-pinless-stream',
                  'chunks-jsonl', 'none', 'roots-jsonl', 'stats-jsonl',
                  'stats-text'. Default: [stats-text]
     --emit-stdout=comma,sep,emitters
                  One or more emitters to activate on stdOUT. Available
                  emitters same as above. Default: [roots-jsonl]
     --hash=algname
                  Hash function to use, one of: 'blake2b-256', 'murmur3-128',
                  'none', 'sha2-256', 'sha2-256-gocore', 'sha3-512'
     --hash-bits=integer
                  Amount of bits taken from *start* of the hash output.
                  Default: [256]
 -h, --help       Display basic help
     --help-all   Display full help including options for every currently
                  supported chunker/collector/encoder
     --inline-max-size=bytes
                  Use identity-CID to refer to blocks having on-wire size at
                  or below the specified value (36 is recommended), 0 disables
     --ipfs-add-compatible-command=cmdstring
                  A complete go-ipfs/js-ipfs add command serving as a basis
                  config (any conflicting option will take precedence)
     --multipart  Expect multiple SInt64BE-size-prefixed streams on stdIN
     --node-encoder=encname_opt1_opt2_..._optN
                  The IPLD-ish node encoder to use, one of: 'unixfsv1'
     --ring-buffer-min-sysread=bytes
                  (EXPERT SETTING) Perform next read(2) only when the
                  specified amount of free space is available in the buffer.
                  Default: [262144]
     --ring-buffer-size=bytes
                  The size of the quantized ring buffer used for ingestion.
                  Default: [25165824]
     --ring-buffer-sync-size=bytes
                  (EXPERT SETTING) The size of each buffer synchronization
                  sector. Default: [65536]
     --skip-nul-inputs
                  Instead of emitting an IPFS-compatible zero-length CID, skip
                  zero-length streams outright
     --stats-active=uint
                  A bitfield representing activated stat aggregations:
                  bit0:BlockSizing, bit1:RingbufferTiming. Default: [1]

[C]ollector 'fixed-cid-refs-size'
  Forms a DAG where the amount of bytes taken by CID references is limited
  for every individual node. The IPLD-link referencing overhead, aside from
  the CID length itself, is *NOT* considered.
  ------------
   SubOptions
       max-cid-refs-size=[160:]
              Maximum cumulative byte-size of CID references within a node

[C]ollector 'fixed-outdegree'
  Forms a DAG where every node has a fixed outdegree (amount of children).
  The last (right-most) node in each DAG layer may have a lower outdegree.
  ------------
   SubOptions
       max-outdegree=[2:]
              Maximum outdegree (children) for a node (IPFS default: 174)

[C]ollector 'none'
  Does not form a DAG, nor emits a root CID. Simply redirects chunked data
  to /dev/null. Takes no arguments.

[C]ollector 'shrubber'
  This collector allows one to arrange, group and emit nodes as smaller
  subtrees (shrubberies), before passing them to the next collector in the
  chain. It combines several modes of operation, each benefitting from being
  as close to the 'leaf node' layer as possible. Specifically:
   -
  ------------
   SubOptions
       cid-subgroup-mask-bits=[4:16]
              FIXME Amount of bits from the end of a cryptographic Cid to
              compare of state to compare to target on every iteration.
              For random input average chunk size is about 2**m
       cid-subgroup-min-nodes=[0:]
              FIXME The minimum amount of nodes clustered together before
              employing CID-based subgrouping. 0 disables
       cid-subgroup-target=[0:]
              FIXME State value denoting a chunk boundary, check against
              mask
       max-payload=[0:MaxPayload]
              FIXME Maximum payload size in each node. To skip
              payload-based balancing, set this to 0.
       static-pad-repeater-nodes=[1:]
              FIXME LS

[C]ollector 'trickle'
  Produces a "side-balanced" DAG optimized for streaming. Data blocks further
  away from the stream start are arranged in nodes at increasing depth away
  from the root. The rough "placement group" for a particular node LeafIndex
  away from the stream start can be derived numerically via:
  int( log( LeafIndex / MaxDirectLeaves ) / log( 1 + MaxSiblingSubgroups ) )
  See the example program in trickle/init.go for more info.
  ------------
   SubOptions
       max-direct-leaves=[1:]
              Maximum leaves per node (IPFS default: 174)
       max-sibling-subgroups=[1:]
              Maximum same-depth-groups per node (IPFS default: 4)
       unixfs-nul-leaf-compat
              Flag to force convergence with go-ipfs *specifically* when
              encoding a 0-length stream (override encoder leaf-type)
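The trickle placement-group formula quoted above can be checked numerically. A small sketch (the guard for leaves earlier than `max-direct-leaves` is my addition, to avoid a negative logarithm for leaves that sit directly under the root):

```python
import math

def trickle_placement_group(leaf_index, max_direct_leaves=174, max_sibling_subgroups=4):
    """Rough "placement group" depth for a leaf, per DAGger's trickle formula:
    int( log( LeafIndex / MaxDirectLeaves ) / log( 1 + MaxSiblingSubgroups ) )"""
    if leaf_index < max_direct_leaves:
        return 0  # early leaves hang directly off the root
    return int(math.log(leaf_index / max_direct_leaves) /
               math.log(1 + max_sibling_subgroups))
```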


[N]odeEncoder 'unixfsv1'
  Implements UnixFSv1, the only encoding currently rendered by IPFS gateways.
  By default generates go-ipfs-standard, inefficient, 'Tsize'-full linknodes.
  ------------
   SubOptions
       cidv0  Generate compat-mode CIDv0 links
       merkledag-compat-protobuf
              Output merkledag links/data in non-canonical protobuf order
              for convergence with go-ipfs
       non-standard-lean-links
              Omit dag-size and offset information from all links. While
              IPFS will likely render the result, ONE VOIDS ALL WARRANTIES
       unixfs-leaf-decorator-type=value
              Generate leaves as full UnixFS nodes with the given UnixFSv1
              type (0 or 2). When unspecified (default) uses raw leaves
              instead.


[C]hunker 'buzhash'
  Chunker based on hashing by cyclic polynomial, similar to the one used
  in 'attic-backup'. As source of "hashing" uses a predefined table of
  values selectable via the hash-table option.
  ------------
   SubOptions
       hash-table=name
              The hash table to use, one of: 'GoIPFSv0'
       max-size=[1:MaxPayload]
              Maximum data chunk size (IPFS default: 524288)
       min-size=[0:MaxPayload]
              Minimum data chunk size (IPFS default: 131072)
       state-mask-bits=[5:22]
              Amount of bits of state to compare to target on every
              iteration. For random input average chunk size is about 2**m
              (IPFS default: 17)
       state-target=uint32
              State value denoting a chunk boundary (IPFS default: 0)

[C]hunker 'fixed-size'
  Splits buffer into equally sized chunks. Requires a single parameter: the
  size of each chunk in bytes (IPFS default: 262144)

[C]hunker 'pad-finder'
  PAD FIXME
  ------------
   SubOptions
       max-pad-run=value
              unspecified
       pad-freeform-re2=regex
              FIXME
       pad-static-hex=value
              unspecified
       static-pad-literal-max=value
              unspecified
       static-pad-min-repeats=value
              unspecified

[C]hunker 'pigz'
  FIXME
  ------------
   SubOptions
       max-size=[1:MaxPayload]
              Maximum data chunk size
       min-size=[0:MaxPayload]
              Minimum data chunk size
       state-mask-bits=[5:22]
              Amount of bits of state to compare to target on every
              iteration. For random input average chunk size is about 2**m
       state-target=uint32
              State value denoting a chunk boundary

[C]hunker 'rabin'
  Chunker based on the venerable 'Rabin Fingerprint', similar to the one
  used by `restic`, the LBFS, and others. The provided implementation is a
  significantly slimmed-down adaptation of multiple "classic" versions.
  ------------
   SubOptions
       max-size=[1:MaxPayload]
              Maximum data chunk size (IPFS default: 393216)
       min-size=[0:MaxPayload]
              Minimum data chunk size (IPFS default: 87381)
       polynomial=uint64
              (IPFS default: 17437180132763653)
       state-mask-bits=[5:22]
              Amount of bits of state to compare to target on every
              iteration. For random input average chunk size is about 2**m
              (IPFS default: 18)
       state-target=uint64
              State value denoting a chunk boundary (IPFS default: 0)
       window-size=bytes
               Size of the rolling hash window in bytes (IPFS default: 16)
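The buzhash and rabin chunkers above share one cut mechanism: roll a hash state over the bytes and cut wherever `state & mask == target`, clamped between the min and max sizes. A toy sketch of just that mechanism (the rolling state here is a crude stand-in, not either real algorithm, and the parameter defaults are invented for the demo):

```python
def cdc_chunks(data: bytes, min_size=2048, max_size=8192, mask_bits=11, target=0):
    """Toy content-defined chunker illustrating the state/mask/target cut rule.
    For random-ish input the average chunk size is roughly 2**mask_bits."""
    mask = (1 << mask_bits) - 1
    chunks, start, state = [], 0, 0
    for i, byte in enumerate(data):
        state = (state * 31 + byte) & 0xFFFFFFFF  # crude stand-in rolling state
        length = i - start + 1
        # cut at a content-defined boundary, or force a cut at max_size
        if length >= max_size or (length >= min_size and (state & mask) == target):
            chunks.append(data[start:i + 1])
            start, state = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk may be under min_size
    return chunks
```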

Riba: "anelace was created specifically to rip out all the knobs/tunables;
it is a small subset, rather than a superset"

Riba: The flags that are missing are extra tunables for what to do when you have a dangling "chunk" you want to add to a tree of, say, 10 leaves at height 3

Lists of Profiles that exist

Kubo CIDv0

 c.Import.CidVersion = *NewOptionalInteger(0) 
 c.Import.UnixFSRawLeaves = false 
 c.Import.UnixFSChunker = *NewOptionalString("size-262144") 
 c.Import.HashFunction = *NewOptionalString("sha2-256") 

(Src: Lidel in kubo#4143 on kubo, May 2024)

Kubo CIDv1

 c.Import.CidVersion = *NewOptionalInteger(1) 
 c.Import.UnixFSRawLeaves = true 
 c.Import.UnixFSChunker = *NewOptionalString("size-1048576") 
 c.Import.HashFunction = *NewOptionalString("sha2-256") 
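If I'm reading the Go defaults right, the CIDv1 profile would map onto a kubo config file roughly like this (key names inferred from the `c.Import.*` fields above; treat this as an assumption, not a confirmed kubo config schema):

```json
{
  "Import": {
    "CidVersion": 1,
    "UnixFSRawLeaves": true,
    "UnixFSChunker": "size-1048576",
    "HashFunction": "sha2-256"
  }
}
```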

Lidel: "Feedback about what should go into [the] test-cid-v1 [profile] is welcome."

It feels like a good opportunity to ALSO enable --inline with --inline-limit=32, so there's one release with a "breaking change" instead of two.

(Src: Lidel in kubo#4143 on kubo, Sep 2023)

Kubo POSIX-aware mode

(worth including? still marked experimental in ipfs add help)

Filecoin pieceCIDs (are there other CID profiles used in other corners of FIL land?)

struct {
	PayloadLength int64
	RootDigest    [32]byte
}

Web3.storage

LUCID

Iroh

open questions:

  • does Iroh also support LUCID's DAG-CBOR fallback for cases where blake3 is unavailable or inappropriate, or is it DAG-free at the CID layer?

RIBS

basic premise:

  • kubo fork for "less expensive" packaging of data for FIL deals
  • ingresses data into minimalist DAGs (an opinionated-profile blockstore) and produces more-efficient CAR files composed of ONLY RAW BLOCKS: "multihashes, not CIDs"
  • architecture diagram
  • repo here and long, thorough youtube explainer here
  • magik6k is using it to store bsky data in FIL deals?
    • screengrab from filecoin public slack

WARC

See WARC record file-chunking section of Web Recorder spec for a distinct ingress profile.

@2color

2color commented Dec 11, 2024

The root spec currently has no mention of "larger" data, i.e. files/blobs/directories. While I don't have the answers for all the open questions and debates, I think it's worthy of mention, irrespective of whether you choose to incorporate it.

@darobin
Owner Author

darobin commented Dec 11, 2024

There's BDASL right there in the list of specs on https://dasl.ing/! And https://dasl.ing/bdasl.html!

@2color

2color commented Dec 11, 2024

Right! That's useful, but there's no mention of files/directories/UnixFS, which means representing anything more than a blob of data isn't supported.

I think that's worth mentioning since so many of the CIDs in the ecosystem are UnixFS.

@hannahhoward

One of the questions I keep wanting to raise, which @2color is pointing at, is that there's a conceptual difference between

  • "I had a big chunk of bytes and I'm deciding whether to break it up into multiple blocks for transport"
    and
  • "I have data with a semantic structure that has meaning to users that I want to break into multiple blocks so users can query only parts of it"

They've been conflated because of the IPFS block limit, but they're not the same.

A Blake3-like algorithm solves only the first problem, not the second.

A directory, for example, is a simple structural piece of data with selective querying (I just want to follow a path): a directory stored as a single blob of bytes that I have to download entirely is less useful than a directory I can path into for just what I want.

Arguably there are ways you can collapse these problems together but they are complicated IMHO (see https://github.com/Gozala/merkle-reference).

@hannahhoward

My argument is that once you add "this data will be broken up in interesting ways to enable querying", the "data always hashes to the same CID" proposition becomes somewhat impossible.

@2color

2color commented Dec 23, 2024

"I have data with a semantic structure that has meaning to users that I want to break into multiple blocks so users can query only parts of it"

Cross-linking to #16 (comment) which is relevant for this discussion
