diff --git a/considerations.md b/considerations.md
new file mode 100644
index 0000000..d248a5b
--- /dev/null
+++ b/considerations.md
@@ -0,0 +1,136 @@
+# Extensibility
+
+Implementations that are reading/processing manifests MUST NOT generate an error if they encounter an unknown property.
+Instead they MUST ignore unknown properties.
+
+# Canonicalization
+
+* ORAS Artifacts are [content-addressable](https://en.wikipedia.org/wiki/Content-addressable_storage). See [descriptors](descriptor.md) for more.
+* One benefit of content-addressable storage is easy deduplication.
+* Many artifacts might depend on a particular blob, but there may be only one blob in the store.
+* With a different serialization, that same semantic blob would have a different hash, and if both versions of the blob are referenced there will be two blobs with the same semantic content.
+* To allow efficient storage, implementations serializing content for blobs SHOULD use a canonical serialization.
+* This increases the chance that different implementations can push the same semantic content to the store without creating redundant blobs.
+
+## JSON
+
+[JSON][] content SHOULD be serialized as [canonical JSON][canonical-json].
+
+Implementations:
+
+* [Go][]: [github.com/docker/go][], which claims to implement [canonical JSON][canonical-json] except for Unicode normalization.
+
+## EBNF
+
+For field formats described in this specification, we use a limited subset of [Extended Backus-Naur Form][ebnf], similar to that used by the [XML specification][xmlebnf].
+Grammars present in the OCI specification are regular and can be converted to a single regular expression.
+However, regular expressions are avoided to limit ambiguity between regular expression syntaxes.
+By defining a subset of EBNF used here, the possibility of variation, misunderstanding or ambiguities from linking to a larger specification can be avoided.
+
+Grammars are made up of rules in the following form:
+
+```
+symbol ::= expression
+```
+
+We can say we have the production identified by symbol if the input is matched by the expression.
+Whitespace is completely ignored in rule definitions.
+
+## Expressions
+
+The simplest expression is the literal, surrounded by quotes:
+
+```
+literal ::= "matchthis"
+```
+
+The above expression defines a symbol, "literal", that matches the exact input of "matchthis".
+Character classes are delineated by brackets (`[]`), describing either a set, range or multiple ranges of characters:
+
+```
+set ::= [abc]
+range ::= [A-Z]
+```
+
+The above symbol "set" would match one character of either "a", "b" or "c".
+The symbol "range" would match any character, "A" to "Z", inclusive.
+Currently, only matching for 7-bit ASCII literals and character classes is defined, as that is all that is required by this specification.
+Multiple character ranges and explicit characters can be specified in a single character class, as follows:
+
+```
+multipleranges ::= [a-zA-Z=-]
+```
+
+The above matches the characters in the range `A` to `Z`, `a` to `z` and the individual characters `-` and `=`.
+
+Expressions can be made up of one or more expressions, such that one must be followed by the other.
+This is known as an implicit concatenation operator.
+For example, both `A` and `B` must be matched to satisfy the following rule:
+
+```
+symbol ::= A B
+```
+
+Each expression must be matched once and only once, `A` followed by `B`.
+To support the description of repetition and optional match criteria, the postfix operators `*` and `+` are defined.
+`*` indicates that the preceding expression can be matched zero or more times.
+`+` indicates that the preceding expression must be matched one or more times.
+These appear in the following form:
+
+```
+zeroormore ::= expression*
+oneormore ::= expression+
+```
+
+Parentheses are used to group expressions into a larger expression:
+
+```
+group ::= (A B)
+```
+
+Like simpler expressions above, operators can be applied to groups, as well.
+To allow for alternates, we also define the infix operator `|`.
+
+```
+oneof ::= A | B
+```
+
+The above indicates that the expression should match one of the expressions, `A` or `B`.
+
+## Precedence
+
+The operator precedence is in the following order:
+
+- Terminals (literals and character classes)
+- Grouping `()`
+- Unary operators `+*`
+- Concatenation
+- Alternates `|`
+
+The precedence can be better described using grouping to show equivalents.
+Concatenation has higher precedence than alternates, such that `A B | C D` is equivalent to `(A B) | (C D)`.
+Unary operators have higher precedence than alternates and concatenation, such that `A+ | B+` is equivalent to `(A+) | (B+)`.
+
+## Examples
+
+The following combines the previous definitions to match a simple, relative path name, describing the individual components:
+
+```
+path ::= component ("/" component)*
+component ::= [a-z]+
+```
+
+The production "component" is one or more lowercase letters.
+A "path" is then at least one component, possibly followed by zero or more slash-component pairs.
+The above can be converted into the following regular expression:
+
+```
+[a-z]+(?:/[a-z]+)*
+```
+
+[ebnf]: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form
+[xmlebnf]: https://www.w3.org/TR/REC-xml/#sec-notation
+[canonical-json]: http://wiki.laptop.org/go/Canonical_JSON
+[github.com/docker/go]: https://github.com/docker/go/
+[Go]: https://golang.org/
+[JSON]: http://json.org/
diff --git a/descriptor.md b/descriptor.md
new file mode 100644
index 0000000..8f174f8
--- /dev/null
+++ b/descriptor.md
@@ -0,0 +1,159 @@
+# Descriptors
+
+* An artifact consists of several different components, arranged in a [Merkle Directed Acyclic Graph (DAG)](https://en.wikipedia.org/wiki/Merkle_tree).
+* References between components in the graph are expressed through _Content Descriptors_.
+* A Content Descriptor (or simply _Descriptor_) describes the disposition of the targeted content.
+* A Content Descriptor includes the type of the content, a content identifier (_digest_), and the byte-size of the raw content.
+* A Content Descriptor MAY differentiate the type through the `artifactType`.
+* Descriptors SHOULD be embedded in other formats to securely reference external content.
+* Other formats SHOULD use descriptors to securely reference external content.
+
+This section defines the `application/vnd.oras.artifact.descriptor.v1+json` media type.
+
+## Properties
+
+A descriptor consists of a set of properties encapsulated in key-value fields.
+
+The following fields contain the primary properties that constitute an Artifact Descriptor:
+
+- **`mediaType`** *string*
+
+  This REQUIRED property contains the media type of the referenced content.
+  Values MUST comply with [RFC 6838][rfc6838], including the [naming requirements in its section 4.2][rfc6838-s4.2].
+
+  Each artifact author MAY define their own unique `mediaTypes`, or utilize existing `mediaTypes` defined by other artifacts. To assure unique ownership, all `mediaTypes` MUST be registered with iana.org.
+
+- **`digest`** *string*
+
+  This REQUIRED property is the _digest_ of the targeted content, conforming to the requirements outlined in [Digests](#digests).
+  Retrieved content SHOULD be verified against this digest when consumed via untrusted sources.
+
+- **`size`** *int64*
+
+  This REQUIRED property specifies the size, in bytes, of the raw content.
+  This property exists so that a client will have an expected size for the content before processing.
+  If the length of the retrieved content does not match the specified length, the content SHOULD NOT be trusted.
+
+- **`artifactType`** *string*
+
+  This OPTIONAL property defines the type of Artifact, differentiating artifacts that use the `application/vnd.oras.manifest`. When the descriptor is used for blobs, this property MUST be empty.
+
+- **`annotations`** *string-string map*
+
+  This OPTIONAL property contains arbitrary metadata for this descriptor.
+  This OPTIONAL property MUST use the [annotation rules](annotations.md#rules).
+
+## Digests
+
+The _digest_ property of a Descriptor acts as a content identifier, enabling [content addressability](http://en.wikipedia.org/wiki/Content-addressable_storage).
+It uniquely identifies content by taking a [collision-resistant hash](https://en.wikipedia.org/wiki/Cryptographic_hash_function) of the bytes.
+If the _digest_ can be communicated in a secure manner, one can verify content from an insecure source by recalculating the digest independently, ensuring the content has not been modified.
+
+The value of the `digest` property is a string consisting of an _algorithm_ portion and an _encoded_ portion.
+The _algorithm_ specifies the cryptographic hash function and encoding used for the digest; the _encoded_ portion contains the encoded result of the hash function.
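
As a rough illustration of this two-part structure, the following Python sketch splits a digest string into its _algorithm_ and _encoded_ portions. It assumes the two portions are joined by a single `:` (as in `sha256:6c3c...`). The names `Digest` and `split_digest` are hypothetical and not defined by this specification, and the sketch performs no further validation of either portion.

```python
from typing import NamedTuple


class Digest(NamedTuple):
    """Hypothetical container for the two portions of a digest string."""
    algorithm: str  # names the hash function and encoding, e.g. "sha256"
    encoded: str    # the encoded result of the hash function


def split_digest(value: str) -> Digest:
    """Split a digest string at its first colon.

    Raises ValueError when the colon or either portion is missing.
    """
    algorithm, sep, encoded = value.partition(":")
    if not sep or not algorithm or not encoded:
        raise ValueError(f"not an algorithm:encoded digest string: {value!r}")
    return Digest(algorithm, encoded)
```

A caller would typically dispatch on `algorithm` to pick a hash implementation and then compare its output against `encoded`.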
+
+A digest string MUST match the following [grammar](considerations.md#ebnf):
+
+```
+digest ::= algorithm ":" encoded
+algorithm ::= algorithm-component (algorithm-separator algorithm-component)*
+algorithm-component ::= [a-z0-9]+
+algorithm-separator ::= [+._-]
+encoded ::= [a-zA-Z0-9=_-]+
+```
+
+Note that _algorithm_ MAY impose algorithm-specific restrictions on the grammar of the _encoded_ portion.
+See also [Registered Algorithms](#registered-algorithms).
+
+Some example digest strings include the following:
+
+digest | algorithm | Registered |
+--------------------------------------------------------------------------|---------------------|------------|
+`sha256:6c3c624b58dbbcd3c0dd82b4c53f04194d1247c6eebdaab7c610cf7d66709b3b` | [SHA-256](#sha-256) | Yes |
+`sha512:401b09eab3c013d4ca54922bb802bec8fd5318192b0a75f201d8b372742...` | [SHA-512](#sha-512) | Yes |
+`multihash+base58:QmRZxt2b1FVZPNqd8hsiykDL3TdBDeTSPX9Kv46HmX4Gx8` | Multihash | No |
+`sha256+b64u:LCa0a2j_xo_5m0U8HTBBNBNCLXBkg7-g-YpeiGJm564` | SHA-256 with urlsafe base64 | No |
+
+Please see [Registered Algorithms](#registered-algorithms) for a list of registered algorithms.
+
+Implementations SHOULD allow digests with unrecognized algorithms to pass validation if they comply with the above grammar.
+While `sha256` will only use hex-encoded digests, separators in _algorithm_ and alphanumerics in _encoded_ are included to allow for extensions.
+As an example, we can parameterize the encoding and algorithm as `multihash+base58:QmRZxt2b1FVZPNqd8hsiykDL3TdBDeTSPX9Kv46HmX4Gx8`, which would be considered valid but unregistered by this specification.
+
+### Verification
+
+Before consuming content targeted by a descriptor from untrusted sources, the byte content SHOULD be verified against the digest string.
+Before calculating the digest, the size of the content SHOULD be verified to reduce hash collision space.
+Heavy processing before calculating a hash SHOULD be avoided.
+Implementations MAY employ [canonicalization](considerations.md#canonicalization) of the underlying content to ensure stable content identifiers.
+
+### Digest calculations
+
+A _digest_ is calculated by the following pseudo-code, where `H` is the selected hash algorithm, identified by string `<alg>`:
+```
+let ID(C) = Descriptor.digest
+let C = <bytes>
+let D = '<alg>:' + Encode(H(C))
+let verified = ID(C) == D
+```
+Above, we define the content identifier as `ID(C)`, extracted from the `Descriptor.digest` field.
+Content `C` is a string of bytes.
+Function `H` returns the hash of `C` in bytes; the result is passed to function `Encode` and prefixed with the algorithm to obtain the digest.
+The result `verified` is true if `ID(C)` is equal to `D`, confirming that `C` is the content identified by `D`.
+After verification, the following is true:
+
+```
+D == ID(C) == '<alg>:' + Encode(H(C))
+```
+
+The _digest_ is confirmed as the content identifier by independently calculating the _digest_.
+
+### Registered algorithms
+
+While the _algorithm_ component of the digest string allows the use of a variety of cryptographic algorithms, compliant implementations SHOULD use [SHA-256](#sha-256).
+
+The following algorithm identifiers are currently defined by this specification:
+
+| algorithm identifier | algorithm |
+|----------------------|---------------------|
+| `sha256` | [SHA-256](#sha-256) |
+| `sha512` | [SHA-512](#sha-512) |
+
+If a useful algorithm is not included in the above table, it SHOULD be submitted to this specification for registration.
+
+#### SHA-256
+
+[SHA-256][rfc4634-s4.1] is a collision-resistant hash function, chosen for ubiquity, reasonable size and secure characteristics.
+Implementations MUST implement SHA-256 digest verification for use in descriptors.
+
+When the _algorithm identifier_ is `sha256`, the _encoded_ portion MUST match `/[a-f0-9]{64}/`.
+Note that `[A-F]` MUST NOT be used here.
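
The verification steps and the `sha256` rules can be sketched in Python as follows. This is a minimal illustration, not a reference implementation. `DIGEST_RE` is an assumed regular-expression translation of the digest grammar, `verify_sha256` is a hypothetical helper name, and `Encode` is taken to be lowercase hex encoding, which is what `hashlib` produces.

```python
import hashlib
import re

# Assumed regex translation of the digest grammar:
#   algorithm-component (algorithm-separator algorithm-component)* ":" encoded
DIGEST_RE = re.compile(r"[a-z0-9]+(?:[+._-][a-z0-9]+)*:[a-zA-Z0-9=_-]+")

# For sha256 the encoded portion must be 64 lowercase hex characters.
SHA256_ENCODED_RE = re.compile(r"[a-f0-9]{64}")


def verify_sha256(content: bytes, digest: str, size: int) -> bool:
    """Return True if `content` matches the descriptor's digest and size.

    Only the `sha256` algorithm is handled in this sketch.
    """
    if DIGEST_RE.fullmatch(digest) is None:
        raise ValueError(f"digest does not match the grammar: {digest!r}")
    algorithm, _, encoded = digest.partition(":")
    if algorithm != "sha256":
        raise ValueError(f"algorithm not handled by this sketch: {algorithm!r}")
    if SHA256_ENCODED_RE.fullmatch(encoded) is None:
        return False  # e.g. uppercase hex or wrong length
    # Check the size first, before spending time on the hash.
    if len(content) != size:
        return False
    # D = '<alg>:' + Encode(H(C)); compare the encoded portions.
    return hashlib.sha256(content).hexdigest() == encoded
```

Checking the size before hashing mirrors the recommendation in the Verification section and avoids hashing content that is already known to be wrong.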
+
+#### SHA-512
+
+[SHA-512][rfc4634-s4.2] is a collision-resistant hash function which [may be more performant][sha256-vs-sha512] than [SHA-256](#sha-256) on some CPUs.
+Implementations MAY implement SHA-512 digest verification for use in descriptors.
+
+When the _algorithm identifier_ is `sha512`, the _encoded_ portion MUST match `/[a-f0-9]{128}/`.
+Note that `[A-F]` MUST NOT be used here.
+
+## Examples
+
+The following example describes a manifest, representing a `cncf.notary.v2` signature, with a content identifier of "sha256:5b0bcabd1ed22e9fb1310cf6c2dec7cdef19f0ad69efa1f392e94a4333501270" and a size of 7682 bytes:
+
+```json,title=Content%20Descriptor&mediatype=application/vnd.oci.descriptor.v1%2Bjson
+{
+  "mediaType": "application/vnd.oci.artifact.manifest.v1+json",
+  "digest": "sha256:5b0bcabd1ed22e9fb1310cf6c2dec7cdef19f0ad69efa1f392e94a4333501270",
+  "size": 7682,
+  "artifactType": "cncf.notary.v2"
+}
+```
+
+[rfc3986]: https://tools.ietf.org/html/rfc3986
+[rfc4634-s4.1]: https://tools.ietf.org/html/rfc4634#section-4.1
+[rfc4634-s4.2]: https://tools.ietf.org/html/rfc4634#section-4.2
+[rfc6838]: https://tools.ietf.org/html/rfc6838
+[rfc6838-s4.2]: https://tools.ietf.org/html/rfc6838#section-4.2
+[rfc7230-s2.7]: https://tools.ietf.org/html/rfc7230#section-2.7
+[sha256-vs-sha512]: https://groups.google.com/a/opencontainers.org/forum/#!topic/dev/hsMw7cAwrZE
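
A descriptor shaped like the one in the Examples section could be produced from raw content roughly as follows. This is a sketch under stated assumptions: `make_descriptor` is a hypothetical helper, it always uses `sha256`, and it omits `artifactType` when none is given (one possible reading of "MUST be empty" for blobs).

```python
import hashlib
import json


def make_descriptor(content: bytes, media_type: str, artifact_type: str = "") -> dict:
    """Build a descriptor dict for `content` using a sha256 digest."""
    descriptor = {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(content).hexdigest(),
        "size": len(content),  # size in bytes of the raw content
    }
    if artifact_type:
        # Assumption: only emitted for manifests that declare an artifact type.
        descriptor["artifactType"] = artifact_type
    return descriptor


if __name__ == "__main__":
    manifest_bytes = b"{}"  # placeholder content for illustration only
    print(json.dumps(
        make_descriptor(
            manifest_bytes,
            "application/vnd.oci.artifact.manifest.v1+json",
            "cncf.notary.v2",
        ),
        indent=2,
    ))
```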