Add artifact descriptors and considerations to round out the surface area, decoupling from the image-spec

Signed-off-by: Steve Lasker <[email protected]>
oras-project · Jul 21, 2021 · 69febd1 · 69febd1
1 parent 434fe99
commit 69febd1

Showing 2 changed files with 295 additions and 0 deletions.
diff --git a/considerations.md b/considerations.md
@@ -0,0 +1,136 @@
+# Extensibility
+
+Implementations that are reading/processing manifests MUST NOT generate an error if they encounter an unknown property.
+Instead they MUST ignore unknown properties.
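A minimal sketch of this rule in Python, assuming a reader that only understands a fixed set of property names. The `KNOWN_PROPERTIES` set and the `read_known_properties` helper are hypothetical names used for illustration; they are not defined by this specification.

```python
import json

# Hypothetical: the property names this particular reader understands.
KNOWN_PROPERTIES = {"mediaType", "digest", "size", "artifactType", "annotations"}


def read_known_properties(raw: str) -> dict:
    """Parse a JSON object and keep only the properties this reader knows."""
    document = json.loads(raw)
    # Unknown properties are skipped silently; no error is raised for them.
    return {key: value for key, value in document.items() if key in KNOWN_PROPERTIES}


raw = '{"mediaType": "application/example", "size": 3, "futureField": {"x": 1}}'
parsed = read_known_properties(raw)
```

Here `futureField` is dropped and the remaining properties are processed as usual.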
+
+# Canonicalization
+
+* ORAS Artifacts are [content-addressable](https://en.wikipedia.org/wiki/Content-addressable_storage). See [descriptors](descriptor.md) for more.
+* One benefit of content-addressable storage is easy deduplication.
+* Many artifacts might depend on a particular blob, but there may be only one blob in the store.
+* With a different serialization, that same semantic blob would have a different hash, and if both versions of the blob are referenced there will be two blobs with the same semantic content.
+* To allow efficient storage, implementations serializing content for blobs SHOULD use a canonical serialization.
+* This increases the chance that different implementations can push the same semantic content to the store without creating redundant blobs.
+
+## JSON
+
+[JSON][] content SHOULD be serialized as [canonical JSON][canonical-json].
+
+Implementations:
+
+* [Go][]: [github.com/docker/go][], which claims to implement [canonical JSON][canonical-json] except for Unicode normalization.
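The effect of serialization on content addressing can be illustrated with a short Python sketch. The `canonicalize` helper below is an assumption for illustration only: sorting keys and removing insignificant whitespace approximates a canonical form, but it is not a complete implementation of the linked canonical JSON rules.

```python
import hashlib
import json

content = {"size": 7682, "mediaType": "application/example"}

# Two serializations of the same semantic content.
compact = json.dumps(content, separators=(",", ":")).encode("utf-8")
pretty = json.dumps(content, indent=2).encode("utf-8")


def sha256_hex(data: bytes) -> str:
    """Return the lowercase hex SHA-256 of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def canonicalize(obj) -> bytes:
    """Approximate a canonical form: sorted keys, no insignificant whitespace."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


# Different bytes hash differently, even though the content is semantically equal.
different_hashes = sha256_hex(compact) != sha256_hex(pretty)

# After canonicalization both inputs produce identical bytes, hence one blob.
same_blob = canonicalize(json.loads(compact)) == canonicalize(json.loads(pretty))
```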
+
+## EBNF
+
+For field formats described in this specification, we use a limited subset of [Extended Backus-Naur Form][ebnf], similar to that used by the [XML specification][xmlebnf].
+Grammars present in the OCI specification are regular and can be converted to a single regular expression.
+However, regular expressions are avoided to limit ambiguity between regular expression syntaxes.
+By defining the subset of EBNF used here, the possibility of variation, misunderstanding, or ambiguity from linking to a larger specification can be avoided.
+
+Grammars are made up of rules in the following form:
+
+```
+symbol ::= expression
+```
+
+We can say we have the production identified by symbol if the input is matched by the expression.
+Whitespace is completely ignored in rule definitions.
+
+## Expressions
+
+The simplest expression is the literal, surrounded by quotes:
+
+```
+literal ::= "matchthis"
+```
+
+The above expression defines a symbol, "literal", that matches the exact input of "matchthis".
+Character classes are delineated by brackets (`[]`), describing either a set, a range, or multiple ranges of characters:
+
+```
+set   ::= [abc]
+range ::= [A-Z]
+```
+
+The above symbol "set" would match one character of either "a", "b" or "c".
+The symbol "range" would match any character, "A" to "Z", inclusive.
+Currently, only matching for 7-bit ASCII literals and character classes is defined, as that is all that is required by this specification.
+Multiple character ranges and explicit characters can be specified in a single character class, as follows:
+
+```
+multipleranges ::= [a-zA-Z=-]
+```
+
+The above matches the characters in the range `A` to `Z`, `a` to `z` and the individual characters `-` and `=`.
+
+Expressions can be made up of one or more expressions, such that one must be followed by the other.
+This is known as an implicit concatenation operator.
+For example, to satisfy the following rule, both `A` and `B` must be matched:
+
+```
+symbol ::= A B
+```
+
+Each expression must be matched once and only once, `A` followed by `B`.
+To support the description of repetition and optional match criteria, the postfix operators `*` and `+` are defined.
+`*` indicates that the preceding expression can be matched zero or more times.
+`+` indicates that the preceding expression must be matched one or more times.
+These appear in the following form:
+
+```
+zeroormore ::= expression*
+oneormore ::= expression+
+```
+
+Parentheses are used to group expressions into a larger expression:
+
+```
+group ::= (A B)
+```
+
+As with the simpler expressions above, operators can also be applied to groups.
+To allow for alternates, we also define the infix operator `|`.
+
+```
+oneof ::= A | B
+```
+
+The above indicates that the expression should match one of the expressions, `A` or `B`.
+
+## Precedence
+
+The operator precedence is in the following order:
+
+- Terminals (literals and character classes)
+- Grouping `()`
+- Unary operators `+*`
+- Concatenation
+- Alternates `|`
+
+The precedence can be better described using grouping to show equivalents.
+Concatenation has higher precedence than alternates, such that `A B | C D` is equivalent to `(A B) | (C D)`.
+Unary operators have higher precedence than alternates and concatenation, such that `A+ | B+` is equivalent to `(A+) | (B+)`.
+
+## Examples
+
+The following combines the previous definitions to match a simple, relative path name, describing the individual components:
+
+```
+path      ::= component ("/" component)*
+component ::= [a-z]+
+```
+
+The production "component" is one or more lowercase letters.
+A "path" is then at least one component, possibly followed by zero or more slash-component pairs.
+The above can be converted into the following regular expression:
+
+```
+[a-z]+(?:/[a-z]+)*
+```
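The conversion above can be checked with a short Python sketch that builds the regular expression from the two productions. The `is_path` helper is a hypothetical name used for illustration.

```python
import re

# component ::= [a-z]+
COMPONENT = r"[a-z]+"
# path ::= component ("/" component)*
PATH = re.compile(rf"{COMPONENT}(?:/{COMPONENT})*")


def is_path(text: str) -> bool:
    """Return True if the whole input is matched by the `path` production."""
    return PATH.fullmatch(text) is not None
```

Using `fullmatch` matters here: the grammar describes the entire input, so a partial match such as the `usr` prefix of `usr/` must not count.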
+
+[ebnf]: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form
+[xmlebnf]: https://www.w3.org/TR/REC-xml/#sec-notation
+[canonical-json]: http://wiki.laptop.org/go/Canonical_JSON
+[github.com/docker/go]: https://github.com/docker/go/
+[Go]: https://golang.org/
+[JSON]: http://json.org/
diff --git a/descriptor.md b/descriptor.md
@@ -0,0 +1,159 @@
+# Descriptors
+
+* An artifact consists of several different components, arranged in a [Merkle Directed Acyclic Graph (DAG)](https://en.wikipedia.org/wiki/Merkle_tree).
+* References between components in the graph are expressed through _Content Descriptors_.
+* A Content Descriptor (or simply _Descriptor_) describes the disposition of the targeted content.
+* A Content Descriptor includes the type of the content, a content identifier (_digest_), and the byte-size of the raw content.
+* A Content Descriptor MAY differentiate the type through the `artifactType`.
+* Descriptors SHOULD be embedded in other formats to securely reference external content.
+* Other formats SHOULD use descriptors to securely reference external content.
+
+This section defines the `application/vnd.oras.artifact.descriptor.v1+json` media type.
+
+## Properties
+
+A descriptor consists of a set of properties encapsulated in key-value fields.
+
+The following fields contain the primary properties that constitute an Artifact Descriptor:
+
+- **`mediaType`** *string*
+
+  This REQUIRED property contains the media type of the referenced content.
+  Values MUST comply with [RFC 6838][rfc6838], including the [naming requirements in its section 4.2][rfc6838-s4.2].
+
+  Each artifact author MAY define their own unique `mediaTypes`, or use existing `mediaTypes` defined by other artifacts. To ensure unique ownership, all `mediaTypes` MUST be registered with iana.org.
+
+- **`digest`** *string*
+
+  This REQUIRED property is the _digest_ of the targeted content, conforming to the requirements outlined in [Digests](#digests).
+  Retrieved content SHOULD be verified against this digest when consumed via untrusted sources.
+
+- **`size`** *int64*
+
+  This REQUIRED property specifies the size, in bytes, of the raw content.
+  This property exists so that a client will have an expected size for the content before processing.
+  If the length of the retrieved content does not match the specified length, the content SHOULD NOT be trusted.
+
+- **`artifactType`** *string*
+
+  This OPTIONAL property defines the type of artifact, differentiating artifacts that use the `application/vnd.oras.manifest`. When the descriptor is used for blobs, this property MUST be empty.
+
+- **`annotations`** *string-string map*
+
+  This OPTIONAL property contains arbitrary metadata for this descriptor.
+  This OPTIONAL property MUST use the [annotation rules](annotations.md#rules).
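The properties above could be modeled as follows. This is a sketch under stated assumptions, not a reference implementation: the `Descriptor` class and its `from_json` constructor are hypothetical names, and only the presence of the REQUIRED properties is checked.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Descriptor:
    media_type: str                       # REQUIRED: mediaType
    digest: str                           # REQUIRED: digest
    size: int                             # REQUIRED: size in bytes
    artifact_type: Optional[str] = None   # OPTIONAL: artifactType
    annotations: Dict[str, str] = field(default_factory=dict)  # OPTIONAL

    @classmethod
    def from_json(cls, obj: dict) -> "Descriptor":
        """Build a Descriptor from a decoded JSON object."""
        for name in ("mediaType", "digest", "size"):
            if name not in obj:
                raise ValueError(f"missing REQUIRED property: {name}")
        # Any other keys in `obj` are ignored rather than rejected.
        return cls(
            media_type=obj["mediaType"],
            digest=obj["digest"],
            size=obj["size"],
            artifact_type=obj.get("artifactType"),
            annotations=dict(obj.get("annotations", {})),
        )
```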
+
+## Digests
+
+The _digest_ property of a Descriptor acts as a content identifier, enabling [content addressability](http://en.wikipedia.org/wiki/Content-addressable_storage).
+It uniquely identifies content by taking a [collision-resistant hash](https://en.wikipedia.org/wiki/Cryptographic_hash_function) of the bytes.
+If the _digest_ can be communicated in a secure manner, one can verify content from an insecure source by recalculating the digest independently, ensuring the content has not been modified.
+
+The value of the `digest` property is a string consisting of an _algorithm_ portion and an _encoded_ portion.
+The _algorithm_ specifies the cryptographic hash function and encoding used for the digest; the _encoded_ portion contains the encoded result of the hash function.
+
+A digest string MUST match the following [grammar](considerations.md#ebnf):
+
+```
+digest                ::= algorithm ":" encoded
+algorithm             ::= algorithm-component (algorithm-separator algorithm-component)*
+algorithm-component   ::= [a-z0-9]+
+algorithm-separator   ::= [+._-]
+encoded               ::= [a-zA-Z0-9=_-]+
+```
+
+Note that _algorithm_ MAY impose algorithm-specific restrictions on the grammar of the _encoded_ portion.
+See also [Registered Algorithms](#registered-algorithms).
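Because the grammar is regular, it can be translated production by production into a regular expression. The Python sketch below does so; `parse_digest` is a hypothetical helper name, and the sketch checks only the general grammar, not any algorithm-specific restrictions.

```python
import re

ALGORITHM_COMPONENT = r"[a-z0-9]+"
ALGORITHM_SEPARATOR = r"[+._-]"
ENCODED = r"[a-zA-Z0-9=_-]+"
# algorithm ::= algorithm-component (algorithm-separator algorithm-component)*
ALGORITHM = rf"{ALGORITHM_COMPONENT}(?:{ALGORITHM_SEPARATOR}{ALGORITHM_COMPONENT})*"
# digest ::= algorithm ":" encoded
DIGEST = re.compile(rf"(?P<algorithm>{ALGORITHM}):(?P<encoded>{ENCODED})")


def parse_digest(value: str):
    """Split a digest string into (algorithm, encoded) or raise ValueError."""
    match = DIGEST.fullmatch(value)
    if match is None:
        raise ValueError(f"invalid digest: {value!r}")
    return match.group("algorithm"), match.group("encoded")
```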
+
+Some example digest strings include the following:
+
+digest                                                                    | algorithm           | Registered |
+--------------------------------------------------------------------------|---------------------|------------|
+`sha256:6c3c624b58dbbcd3c0dd82b4c53f04194d1247c6eebdaab7c610cf7d66709b3b` | [SHA-256](#sha-256) | Yes        |
+`sha512:401b09eab3c013d4ca54922bb802bec8fd5318192b0a75f201d8b372742...`   | [SHA-512](#sha-512) | Yes        |
+`multihash+base58:QmRZxt2b1FVZPNqd8hsiykDL3TdBDeTSPX9Kv46HmX4Gx8`         | Multihash           | No         |
+`sha256+b64u:LCa0a2j_xo_5m0U8HTBBNBNCLXBkg7-g-YpeiGJm564`                 | SHA-256 with urlsafe base64 | No |
+
+Please see [Registered Algorithms](#registered-algorithms) for a list of registered algorithms.
+
+Implementations SHOULD allow digests with unrecognized algorithms to pass validation if they comply with the above grammar.
+While `sha256` will only use hex-encoded digests, separators in _algorithm_ and alphanumerics in _encoded_ are included to allow for extensions.
+As an example, we can parameterize the encoding and algorithm as `multihash+base58:QmRZxt2b1FVZPNqd8hsiykDL3TdBDeTSPX9Kv46HmX4Gx8`, which would be considered valid but unregistered by this specification.
+
+### Verification
+
+Before consuming content targeted by a descriptor from untrusted sources, the byte content SHOULD be verified against the digest string.
+Before calculating the digest, the size of the content SHOULD be verified to reduce hash collision space.
+Heavy processing before calculating a hash SHOULD be avoided.
+Implementations MAY employ [canonicalization](considerations.md#canonicalization) of the underlying content to ensure stable content identifiers.
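A minimal Python sketch of this ordering, assuming the content is already in memory and the descriptor uses `sha256`. The `verify_content` function is a hypothetical helper, not part of this specification.

```python
import hashlib


def verify_content(content: bytes, expected_size: int, expected_digest: str) -> bool:
    """Check the size first, then the digest, before trusting the content."""
    # Cheap check first: a size mismatch means the content is not trusted.
    if len(content) != expected_size:
        return False
    algorithm, _, encoded = expected_digest.partition(":")
    if algorithm != "sha256":
        # This sketch only handles sha256; other algorithms need their own hash.
        raise ValueError(f"unsupported algorithm in this sketch: {algorithm}")
    # Hash the raw bytes directly, with no heavy processing beforehand.
    return hashlib.sha256(content).hexdigest() == encoded
```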
+
+### Digest calculations
+
+A _digest_ is calculated by the following pseudo-code, where `H` is the selected hash algorithm, identified by string `<alg>`:
+
+```
+let ID(C) = Descriptor.digest
+let C = <bytes>
+let D = '<alg>:' + Encode(H(C))
+let verified = ID(C) == D
+```
+
+Above, we define the content identifier as `ID(C)`, extracted from the `Descriptor.digest` field.
+Content `C` is a string of bytes.
+Function `H` returns the hash of `C` in bytes and is passed to function `Encode` and prefixed with the algorithm to obtain the digest.
+The result `verified` is true if `ID(C)` is equal to `D`, confirming that `C` is the content identified by `D`.
+After verification, the following is true:
+
+```
+D == ID(C) == '<alg>:' + Encode(H(C))
+```
+
+The _digest_ is confirmed as the content identifier by independently calculating the _digest_.
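The pseudo-code maps onto Python's `hashlib` as sketched below, assuming `Encode` is lowercase hex and limiting `H` to the two registered algorithms. `compute_digest` and `is_verified` are hypothetical helper names.

```python
import hashlib

# Hash functions for the two identifiers registered by this specification.
HASHES = {"sha256": hashlib.sha256, "sha512": hashlib.sha512}


def compute_digest(content: bytes, alg: str = "sha256") -> str:
    """Return '<alg>:' + Encode(H(C)), using lowercase hex as Encode."""
    if alg not in HASHES:
        raise ValueError(f"unsupported algorithm: {alg}")
    return f"{alg}:{HASHES[alg](content).hexdigest()}"


def is_verified(descriptor_digest: str, content: bytes) -> bool:
    """Recompute D from C and compare it with ID(C) from the descriptor."""
    alg = descriptor_digest.partition(":")[0]
    return descriptor_digest == compute_digest(content, alg)
```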
+
+### Registered algorithms
+
+While the _algorithm_ component of the digest string allows the use of a variety of cryptographic algorithms, compliant implementations SHOULD use [SHA-256](#sha-256).
+
+The following algorithm identifiers are currently defined by this specification:
+
+| algorithm identifier | algorithm           |
+|----------------------|---------------------|
+| `sha256`             | [SHA-256](#sha-256) |
+| `sha512`             | [SHA-512](#sha-512) |
+
+If a useful algorithm is not included in the above table, it SHOULD be submitted to this specification for registration.
+
+#### SHA-256
+
+[SHA-256][rfc4634-s4.1] is a collision-resistant hash function, chosen for ubiquity, reasonable size and secure characteristics.
+Implementations MUST implement SHA-256 digest verification for use in descriptors.
+
+When the _algorithm identifier_ is `sha256`, the _encoded_ portion MUST match `/[a-f0-9]{64}/`.
+Note that `[A-F]` MUST NOT be used here.
+
+#### SHA-512
+
+[SHA-512][rfc4634-s4.2] is a collision-resistant hash function which [may be more performant][sha256-vs-sha512] than [SHA-256](#sha-256) on some CPUs.
+Implementations MAY implement SHA-512 digest verification for use in descriptors.
+
+When the _algorithm identifier_ is `sha512`, the _encoded_ portion MUST match `/[a-f0-9]{128}/`.
+Note that `[A-F]` MUST NOT be used here.
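The restrictions on the _encoded_ portion for the registered algorithms can be sketched in Python as follows. `check_encoded` is a hypothetical helper; returning `None` for unregistered algorithms is a choice made for this sketch, reflecting that only the general digest grammar applies to them.

```python
import re

# Algorithm-specific patterns for the encoded portion (lowercase hex only).
REGISTERED = {
    "sha256": re.compile(r"[a-f0-9]{64}"),
    "sha512": re.compile(r"[a-f0-9]{128}"),
}


def check_encoded(algorithm: str, encoded: str):
    """Return True/False for registered algorithms, None for unregistered ones."""
    pattern = REGISTERED.get(algorithm)
    if pattern is None:
        return None
    # fullmatch enforces the exact length and rejects uppercase hex digits.
    return pattern.fullmatch(encoded) is not None
```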
+
+## Examples
+
+The following example describes a manifest, representing a `cncf.notary.v2` signature, with a content identifier of "sha256:5b0bcabd1ed22e9fb1310cf6c2dec7cdef19f0ad69efa1f392e94a4333501270" and a size of 7682 bytes:
+
+```json,title=Content%20Descriptor&mediatype=application/vnd.oci.descriptor.v1%2Bjson
+{
+  "mediaType": "application/vnd.oci.artifact.manifest.v1+json",
+  "digest": "sha256:5b0bcabd1ed22e9fb1310cf6c2dec7cdef19f0ad69efa1f392e94a4333501270",
+  "size": 7682,
+  "artifactType": "cncf.notary.v2"
+}
+```
+
+[rfc3986]: https://tools.ietf.org/html/rfc3986
+[rfc4634-s4.1]: https://tools.ietf.org/html/rfc4634#section-4.1
+[rfc4634-s4.2]: https://tools.ietf.org/html/rfc4634#section-4.2
+[rfc6838]: https://tools.ietf.org/html/rfc6838
+[rfc6838-s4.2]: https://tools.ietf.org/html/rfc6838#section-4.2
+[rfc7230-s2.7]: https://tools.ietf.org/html/rfc7230#section-2.7
+[sha256-vs-sha512]: https://groups.google.com/a/opencontainers.org/forum/#!topic/dev/hsMw7cAwrZE