Skip to content
This repository has been archived by the owner on May 15, 2024. It is now read-only.

Commit

Permalink
Add artifact descriptors and considerations to round out the surface …
Browse files Browse the repository at this point in the history
…area, decoupling from the image-spec

Signed-off-by: Steve Lasker <[email protected]>
  • Loading branch information
SteveLasker committed Jul 21, 2021
1 parent 434fe99 commit 69febd1
Show file tree
Hide file tree
Showing 2 changed files with 295 additions and 0 deletions.
136 changes: 136 additions & 0 deletions considerations.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,136 @@
# Extensibility

Implementations that are reading/processing manifests MUST NOT generate an error if they encounter an unknown property.
Instead they MUST ignore unknown properties.

# Canonicalization

* ORAS Artifacts are [content-addressable](https://en.wikipedia.org/wiki/Content-addressable_storage). See [descriptors](descriptor.md) for more.
* One benefit of content-addressable storage is easy deduplication.
* Many artifacts might depend on a particular blob, but there may be only one blob in the store.
* With a different serialization, that same semantic blob would have a different hash, and if both versions of the blob are referenced there will be two blobs with the same semantic content.
* To allow efficient storage, implementations serializing content for blobs SHOULD use a canonical serialization.
* This increases the chance that different implementations can push the same semantic content to the store without creating redundant blobs.

## JSON

[JSON][] content SHOULD be serialized as [canonical JSON][canonical-json].

Implementations:

* [Go][]: [github.com/docker/go][], which claims to implement [canonical JSON][canonical-json] except for Unicode normalization.

## EBNF

For field formats described in this specification, we use a limited subset of [Extended Backus-Naur Form][ebnf], similar to that used by the [XML specification][xmlebnf].
Grammars present in the OCI specification are regular and can be converted to a single regular expressions.
However, regular expressions are avoided to limit ambiguity between regular expression syntax.
By defining a subset of EBNF used here, the possibility of variation, misunderstanding or ambiguities from linking to a larger specification can be avoided.

Grammars are made up of rules in the following form:

```
symbol ::= expression
```

We can say we have the production identified by symbol if the input is matched by the expression.
Whitespace is completely ignored in rule definitions.

## Expressions

The simplest expression is the literal, surrounded by quotes:

```
literal ::= "matchthis"
```

The above expression defines a symbol, "literal", that matches the exact input of "matchthis".
Character classes are delineated by brackets (`[]`), describing either a set, range or multiple range of characters:

```
set := [abc]
range := [A-Z]
```

The above symbol "set" would match one character of either "a", "b" or "c".
The symbol "range" would match any character, "A" to "Z", inclusive.
Currently, only matching for 7-bit ascii literals and character classes is defined, as that is all that is required by this specification.
Multiple character ranges and explicit characters can be specified in a single character classes, as follows:

```
multipleranges := [a-zA-Z=-]
```

The above matches the characters in the range `A` to `Z`, `a` to `z` and the individual characters `-` and `=`.

Expressions can be made up of one or more expressions, such that one must be followed by the other.
This is known as an implicit concatenation operator.
For example, to satisfy the following rule, both `A` and `B` must be matched to satisfy the rule:

```
symbol ::= A B
```

Each expression must be matched once and only once, `A` followed by `B`.
To support the description of repetition and optional match criteria, the postfix operators `*` and `+` are defined.
`*` indicates that the preceding expression can be matched zero or more times.
`+` indicates that the preceding expression must be matched one or more times.
These appear in the following form:

```
zeroormore ::= expression*
oneormore ::= expression+
```

Parentheses are used to group expressions into a larger expression:

```
group ::= (A B)
```

Like simpler expressions above, operators can be applied to groups, as well.
To allow for alternates, we also define the infix operator `|`.

```
oneof ::= A | B
```

The above indicates that the expression should match one of the expressions, `A` or `B`.

## Precedence

The operator precedence is in the following order:

- Terminals (literals and character classes)
- Grouping `()`
- Unary operators `+*`
- Concatenation
- Alternates `|`

The precedence can be better described using grouping to show equivalents.
Concatenation has higher precedence than alernates, such `A B | C D` is equivalent to `(A B) | (C D)`.
Unary operators have higher precedence than alternates and concatenation, such that `A+ | B+` is equivalent to `(A+) | (B+)`.

## Examples

The following combines the previous definitions to match a simple, relative path name, describing the individual components:

```
path ::= component ("/" component)*
component ::= [a-z]+
```

The production "component" is one or more lowercase letters.
A "path" is then at least one component, possibly followed by zero or more slash-component pairs.
The above can be converted into the following regular expression:

```
[a-z]+(?:/[a-z]+)*
```

[ebnf]: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form
[xmlebnf]: https://www.w3.org/TR/REC-xml/#sec-notation
[canonical-json]: http://wiki.laptop.org/go/Canonical_JSON
[github.com/docker/go]: https://github.com/docker/go/
[Go]: https://golang.org/
[JSON]: http://json.org/
159 changes: 159 additions & 0 deletions descriptor.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,159 @@
# Descriptors

* An artifact consists of several different components, arranged in a [Merkle Directed Acyclic Graph (DAG)](https://en.wikipedia.org/wiki/Merkle_tree).
* References between components in the graph are expressed through _Content Descriptors_.
* A Content Descriptor (or simply _Descriptor_) describes the disposition of the targeted content.
* A Content Descriptor includes the type of the content, a content identifier (_digest_), and the byte-size of the raw content.
* A Content Descriptor MAY differentiate the type through the `artifactType`.
* Descriptors SHOULD be embedded in other formats to securely reference external content.
* Other formats SHOULD use descriptors to securely reference external content.

This section defines the `application/vnd.oras.artifact.descriptor.v1+json` media type.

## Properties

A descriptor consists of a set of properties encapsulated in key-value fields.

The following fields contain the primary properties that constitute an Artifact Descriptor:

- **`mediaType`** *string*

This REQUIRED property contains the media type of the referenced content.
Values MUST comply with [RFC 6838][rfc6838], including the [naming requirements in its section 4.2][rfc6838-s4.2].

Each artifact author MAY define their own unique `mediaTypes`, or utilize existing `mediaTypes` defined by other artifacts. To assure unique ownership, all `mediaTypes` MUST be registered with iana.org.

- **`digest`** *string*

This REQUIRED property is the _digest_ of the targeted content, conforming to the requirements outlined in [Digests](#digests).
Retrieved content SHOULD be verified against this digest when consumed via untrusted sources.

- **`size`** *int64*

This REQUIRED property specifies the size, in bytes, of the raw content.
This property exists so that a client will have an expected size for the content before processing.
If the length of the retrieved content does not match the specified length, the content SHOULD NOT be trusted.

- **`artifactType`** *string*

This OPTIONAL property defines the type or Artifact, differentiating artifacts that use the `application/vnd.oras.manifest`. When the descriptor is used for blobs, this property MUST be empty.

- **`annotations`** *string-string map*

This OPTIONAL property contains arbitrary metadata for this descriptor.
This OPTIONAL property MUST use the [annotation rules](annotations.md#rules).

## Digests

The _digest_ property of a Descriptor acts as a content identifier, enabling [content addressability](http://en.wikipedia.org/wiki/Content-addressable_storage).
It uniquely identifies content by taking a [collision-resistant hash](https://en.wikipedia.org/wiki/Cryptographic_hash_function) of the bytes.
If the _digest_ can be communicated in a secure manner, one can verify content from an insecure source by recalculating the digest independently, ensuring the content has not been modified.

The value of the `digest` property is a string consisting of an _algorithm_ portion and an _encoded_ portion.
The _algorithm_ specifies the cryptographic hash function and encoding used for the digest; the _encoded_ portion contains the encoded result of the hash function.

A digest string MUST match the following [grammar](considerations.md#ebnf):

```
digest ::= algorithm ":" encoded
algorithm ::= algorithm-component (algorithm-separator algorithm-component)*
algorithm-component ::= [a-z0-9]+
algorithm-separator ::= [+._-]
encoded ::= [a-zA-Z0-9=_-]+
```

Note that _algorithm_ MAY impose algorithm-specific restriction on the grammar of the _encoded_ portion.
See also [Registered Algorithms](#registered-algorithms).

Some example digest strings include the following:

digest | algorithm | Registered |
--------------------------------------------------------------------------|---------------------|------------|
`sha256:6c3c624b58dbbcd3c0dd82b4c53f04194d1247c6eebdaab7c610cf7d66709b3b` | [SHA-256](#sha-256) | Yes |
`sha512:401b09eab3c013d4ca54922bb802bec8fd5318192b0a75f201d8b372742...` | [SHA-512](#sha-512) | Yes |
`multihash+base58:QmRZxt2b1FVZPNqd8hsiykDL3TdBDeTSPX9Kv46HmX4Gx8` | Multihash | No |
`sha256+b64u:LCa0a2j_xo_5m0U8HTBBNBNCLXBkg7-g-YpeiGJm564` | SHA-256 with urlsafe base64 | No |

Please see [Registered Algorithms](#registered-algorithms) for a list of registered algorithms.

Implementations SHOULD allow digests with unrecognized algorithms to pass validation if they comply with the above grammar.
While `sha256` will only use hex encoded digests, separators in _algorithm_ and alphanumerics in _encoded_ are included to allow for extensions.
As an example, we can parameterize the encoding and algorithm as `multihash+base58:QmRZxt2b1FVZPNqd8hsiykDL3TdBDeTSPX9Kv46HmX4Gx8`, which would be considered valid but unregistered by this specification.

### Verification

Before consuming content targeted by a descriptor from untrusted sources, the byte content SHOULD be verified against the digest string.
Before calculating the digest, the size of the content SHOULD be verified to reduce hash collision space.
Heavy processing before calculating a hash SHOULD be avoided.
Implementations MAY employ [canonicalization](considerations.md#canonicalization) of the underlying content to ensure stable content identifiers.

### Digest calculations

A _digest_ is calculated by the following pseudo-code, where `H` is the selected hash algorithm, identified by string `<alg>`:
```
let ID(C) = Descriptor.digest
let C = <bytes>
let D = '<alg>:' + Encode(H(C))
let verified = ID(C) == D
```
Above, we define the content identifier as `ID(C)`, extracted from the `Descriptor.digest` field.
Content `C` is a string of bytes.
Function `H` returns the hash of `C` in bytes and is passed to function `Encode` and prefixed with the algorithm to obtain the digest.
The result `verified` is true if `ID(C)` is equal to `D`, confirming that `C` is the content identified by `D`.
After verification, the following is true:

```
D == ID(C) == '<alg>:' + Encode(H(C))
```

The _digest_ is confirmed as the content identifier by independently calculating the _digest_.

### Registered algorithms

While the _algorithm_ component of the digest string allows the use of a variety of cryptographic algorithms, compliant implementations SHOULD use [SHA-256](#sha-256).

The following algorithm identifiers are currently defined by this specification:

| algorithm identifier | algorithm |
|----------------------|---------------------|
| `sha256` | [SHA-256](#sha-256) |
| `sha512` | [SHA-512](#sha-512) |

If a useful algorithm is not included in the above table, it SHOULD be submitted to this specification for registration.

#### SHA-256

[SHA-256][rfc4634-s4.1] is a collision-resistant hash function, chosen for ubiquity, reasonable size and secure characteristics.
Implementations MUST implement SHA-256 digest verification for use in descriptors.

When the _algorithm identifier_ is `sha256`, the _encoded_ portion MUST match `/[a-f0-9]{64}/`.
Note that `[A-F]` MUST NOT be used here.

#### SHA-512

[SHA-512][rfc4634-s4.2] is a collision-resistant hash function which [may be more perfomant][sha256-vs-sha512] than [SHA-256](#sha-256) on some CPUs.
Implementations MAY implement SHA-512 digest verification for use in descriptors.

When the _algorithm identifier_ is `sha512`, the _encoded_ portion MUST match `/[a-f0-9]{128}/`.
Note that `[A-F]` MUST NOT be used here.

## Examples

The following example describes a manifest, representing a `cncf.notary.v2` signature, with a content identifier of "sha256:5b0bcabd1ed22e9fb1310cf6c2dec7cdef19f0ad69efa1f392e94a4333501270" and a size of 7682 bytes:

```json,title=Content%20Descriptor&mediatype=application/vnd.oci.descriptor.v1%2Bjson
{
"mediaType": "application/vnd.oci.artifact.manifest.v1+json",
"digest": "sha256:5b0bcabd1ed22e9fb1310cf6c2dec7cdef19f0ad69efa1f392e94a4333501270",
"size": 7682,
"artifactType": "cncf.notary.v2"
}
```

[rfc3986]: https://tools.ietf.org/html/rfc3986
[rfc4634-s4.1]: https://tools.ietf.org/html/rfc4634#section-4.1
[rfc4634-s4.2]: https://tools.ietf.org/html/rfc4634#section-4.2
[rfc6838]: https://tools.ietf.org/html/rfc6838
[rfc6838-s4.2]: https://tools.ietf.org/html/rfc6838#section-4.2
[rfc7230-s2.7]: https://tools.ietf.org/html/rfc7230#section-2.7
[sha256-vs-sha512]: https://groups.google.com/a/opencontainers.org/forum/#!topic/dev/hsMw7cAwrZE

0 comments on commit 69febd1

Please sign in to comment.