Draft
Binary file added xml-schema/DMRPP-doc-ky.docx
Binary file not shown.
291 changes: 291 additions & 0 deletions xml-schema/README-dmrpp.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,291 @@

~~# DMR++ Documentation for the `dmrpp` Namespace

The DMR++ `dmrpp` XML namespace elements were added to provide a way to describe the organization of 'chunks' used by a
binary data format such as HDF5 to store the data values in an array. The DMR++ supports both HDF5 and HDF4 as of
January 2026. The elements in this `dmrpp` namespace can be added to a DAP4 DMR (Dataset Metadata Response) document
without affecting the XML parse of the elements in the DAP4 namespace.

There are three primary elements in the `dmrpp` namespace: `chunks`, `chunkDimensionSizes`, and `chunk`. While
not always true, in general, a `chunks` element encloses a set of `chunk` elements and a single `chunkDimensionSizes`
element. The `chunks` element provides information that can be applied to all the chunks that make up a variable.
The information in the `chunkDimensionSizes` element could have been encoded as an attribute of the `chunks` element.
The `chunk` elements hold information unique to each chunk that makes up the variable.
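
As a rough illustration of that layout, the sketch below parses a small, made-up variable fragment with Python's standard library. The namespace URIs, the variable name, and all attribute values are assumptions chosen for the example (the `dmrpp` URI is the placeholder mentioned later in this document), not values taken from a real file.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; substitute the ones your documents actually use.
DAP4_NS = "http://xml.opendap.org/ns/DAP/4.0#"
DMRPP_NS = "http://opendap.org/ns/dmrpp/1.0#"

# Hypothetical fragment: one variable with shared chunk info and two chunks.
FRAGMENT = f"""
<Float32 xmlns="{DAP4_NS}" xmlns:dmrpp="{DMRPP_NS}" name="temperature">
  <dmrpp:chunks compressionType="shuffle deflate" deflateLevel="6" byteOrder="LE">
    <dmrpp:chunkDimensionSizes>100 200</dmrpp:chunkDimensionSizes>
    <dmrpp:chunk offset="4096" nBytes="1200" chunkPositionInArray="[0,0]"/>
    <dmrpp:chunk offset="5296" nBytes="1100" chunkPositionInArray="[0,1]"/>
  </dmrpp:chunks>
</Float32>
"""


def describe_variable(xml_text):
    """Return (shared attributes, chunk shape, per-chunk info) for one variable."""
    ns = {"dmrpp": DMRPP_NS}
    var = ET.fromstring(xml_text)
    chunks = var.find("dmrpp:chunks", ns)
    # Attributes on dmrpp:chunks apply to every chunk of the variable.
    shared = dict(chunks.attrib)
    # chunkDimensionSizes holds one size per array dimension.
    sizes = chunks.find("dmrpp:chunkDimensionSizes", ns)
    chunk_shape = [int(s) for s in sizes.text.split()]
    # Each dmrpp:chunk carries the information unique to that chunk.
    per_chunk = [
        {
            "offset": int(c.get("offset")),
            "nBytes": int(c.get("nBytes")),
            "position": c.get("chunkPositionInArray"),
        }
        for c in chunks.findall("dmrpp:chunk", ns)
    ]
    return shared, chunk_shape, per_chunk


shared, chunk_shape, per_chunk = describe_variable(FRAGMENT)
```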

It is possible that a DMR++ document contains variables that have neither `chunks` nor `chunkDimensionSizes`
elements since some variables' data is stored in a single 'chunk' in the HDF5 file. If only the attributes defined
for `chunk` are needed, then that is the only element present. For example, HDF5 defines a storage class named
_CONTIGUOUS_ that can be represented as a single chunk.

## The dmrpp Namespace Elements

### dmrpp:chunks

The `dmrpp:chunks` element is always a child of a DAP/DMR variable element (e.g., `Float32`, `Int32`, etc.).
It describes how the variable’s data are stored in an HDF5-like chunked layout. The attributes and child elements
it can contain are described below.

The DMR++ parser uses the information in the `dmrpp:chunks` element to build internal _Chunk_ objects. This
can include _Chunk_ objects that are not present in the data file/object because they consist solely of fill
values. In this case, the parser must synthesize these chunks itself using the value of the `fillValue` attribute.

#### Attributes of `dmrpp:chunks`

All attributes of the `dmrpp:chunks` element are optional.

* `compressionType`: a space-separated list of filters, not limited to compression. Currently, DMR++ supports
_shuffle_, _deflate_, and _fletcher32_. The deflate filter uses the standard Internet deflate algorithm and
includes an associated compression level. The shuffle filter groups the high-order through low-order bytes of
multibyte numerical types together to improve the effectiveness of the deflate algorithm. The fletcher32 filter
provides a 32-bit hash of the data. *The order of the filters in the list is important.* The filters are listed in
the order in which they were applied during data encoding and therefore must be applied in reverse order during
decoding.
* `deflateLevel`: the numerical level of the deflate compression that was used when the data in the chunk were
compressed. The deflateLevel must be between 1 and 9. This is not needed to decompress (inflate) the chunk, but it is
necessary when other operations are applied.
* `byteOrder`: optional byte order information; one of `LE` or `BE` (little- or big-endian). Defaults to `BE`.
Although `dmrpp:chunk` also includes a _byteOrder_ attribute, all the chunks inside a _dmrpp:chunks_ element
must have the same byte order.
* `structOffset`: total size and offset information for a structure. In DMR++, only simple structures are supported;
nested structures are not supported. This attribute is a space-separated list of numbers that encode the offsets,
in bytes, from the start of the structure for all fields except the first, which must have an offset of zero
bytes. In addition to the field offsets, the final element of the list specifies the total size of the structure
in bytes.
* `fillValue`: the fill value used for chunks that have no data. In some cases, an array may contain regions with no
data. For example, this can occur with satellite swath data stored using a map projection. In such cases, a format
such as HDF5 may omit writing chunks that contain only fill values. Software that uses the DMR++ to read data must
fill in the gaps left by these “phantom” chunks. Each member of a structure may have its own fill value; in that
case, _fillValue_ is represented as a space-separated list of strings.
* `LBChunk`: a boolean value indicating whether this variable has linked blocks. Linked blocks are used by HDF4 when a
'chunk' is not atomic but is instead split into multiple regions within a single file. In this case, the 'linked blocks'
are concatenated and then treated as a single 'chunk'. See the `dmrpp:block` element below.
* `DIO`: a boolean that indicates the chunks can be used for a particular I/O optimization. Direct IO (DIO) is a
feature in the Hyrax software that improves performance by passing chunked data directly to the end user without
applying any filtering operations (for example, without decompression). By default, the Hyrax data server uses DIO
when writing NetCDF-4 files from HDF5 data described using DMR++, provided that certain conditions are met. This
feature can be disabled. _**FIXME**_ _What are those conditions_?
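
The ordering rule for `compressionType` can be sketched as follows. This is a minimal illustration, not the Hyrax implementation: the helper names are hypothetical, only _shuffle_ and _deflate_ are handled (_fletcher32_ is left out), and the buffer length is assumed to be an exact multiple of the element size.

```python
import struct
import zlib


def shuffle(data: bytes, elem_size: int) -> bytes:
    """Group byte 0 of every element, then byte 1, and so on (encoding side)."""
    n = len(data) // elem_size
    return bytes(data[i * elem_size + b] for b in range(elem_size) for i in range(n))


def unshuffle(data: bytes, elem_size: int) -> bytes:
    """Inverse of shuffle(): restore the original element byte layout."""
    n = len(data) // elem_size
    return bytes(data[b * n + i] for i in range(n) for b in range(elem_size))


def decode_chunk(raw: bytes, compression_type: str, elem_size: int) -> bytes:
    """Undo the listed filters in reverse of the order in which they were applied."""
    for name in reversed(compression_type.split()):
        if name == "deflate":
            raw = zlib.decompress(raw)
        elif name == "shuffle":
            raw = unshuffle(raw, elem_size)
        else:
            raise ValueError(f"filter not handled in this sketch: {name}")
    return raw


# Encode four little-endian 32-bit integers: shuffle first, then deflate.
original = struct.pack("<4i", 1, 2, 3, 4)
encoded = zlib.compress(shuffle(original, 4), 6)

# compressionType lists the filters in encoding order; decoding reverses it.
decoded = decode_chunk(encoded, "shuffle deflate", 4)
```

Applying the filters in the listed order during decoding would fail here, because the shuffled bytes only exist after the deflate step has been undone.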

#### Child elements of `dmrpp:chunks`
* Exactly one `dmrpp:chunkDimensionSizes` element, as defined below. This defines the logical organization
  of the chunks/blocks that make up the variable.
* One of:
  * a list of individual `dmrpp:chunk` elements (this is the typical case for an HDF5/NetCDF4 file),
  * a list of `dmrpp:block` elements (linked-block storage), or
  * a “multi linked-block chunk” arrangement where `dmrpp:chunk` elements refer to multiple underlying _blocks_
    (this case deals with formats where _chunks_ are not always atomic, such as HDF4).
* A `dmrpp:chunks` element can contain, as child elements, either one or more `dmrpp:chunk` or `dmrpp:block`
  element(s), but not both.

---

### dmrpp:chunkDimensionSizes

The `dmrpp:chunkDimensionSizes` is a child of `dmrpp:chunks`.
It contains a **whitespace-separated list of chunk sizes**, one per array dimension (e.g., `"100 200"`). It is used
together with the array’s declared dimensions to compute the **logical number of chunks** and their shapes. It is also
used in conjunction with the 0...N `dmrpp:chunk` elements (see below) to detect which logical chunks are not included
in the data file/object (i.e., they contain only fill values). For an array stored as a number of discrete chunks,
this element must be present to tell the DMR++ interpreter how to reassemble the information in the chunks into
the original array.
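
A minimal sketch of that calculation is shown below. The function names are hypothetical, and chunk positions are treated as chunk-space indices, following the description of `chunkPositionInArray` in this document; if your files record element offsets instead, divide each offset by the chunk size first.

```python
import math
from itertools import product


def logical_chunk_grid(array_shape, chunk_shape):
    """Number of chunks along each dimension; edge chunks may be partial."""
    return [math.ceil(a / c) for a, c in zip(array_shape, chunk_shape)]


def missing_chunks(array_shape, chunk_dimension_sizes, present_positions):
    """Return chunk-space positions that have no dmrpp:chunk element.

    chunk_dimension_sizes is the text of dmrpp:chunkDimensionSizes, e.g. "100 200".
    present_positions is an iterable of tuples taken from the dmrpp:chunk elements.
    """
    chunk_shape = [int(s) for s in chunk_dimension_sizes.split()]
    grid = logical_chunk_grid(array_shape, chunk_shape)
    every_position = set(product(*(range(n) for n in grid)))
    return sorted(every_position - set(present_positions))


# A 200 x 400 array with 100 x 200 chunks has a 2 x 2 grid of logical chunks.
grid = logical_chunk_grid([200, 400], [100, 200])

# Only two of the four chunks are listed, so the other two are fill-only.
absent = missing_chunks([200, 400], "100 200", [(0, 0), (1, 1)])
```

The positions returned by `missing_chunks` are the ones a reader would have to synthesize from the `fillValue` attribute.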

#### Attributes of `dmrpp:chunkDimensionSizes`

The `dmrpp:chunkDimensionSizes` element has no attributes.

#### Child elements of `dmrpp:chunkDimensionSizes`

The `dmrpp:chunkDimensionSizes` element has no child elements.

---

### dmrpp:chunk

Each `dmrpp:chunk` describes a single data chunk. The `dmrpp:chunk` element is usually a child of `dmrpp:chunks`, but is
sometimes a direct child of the variable element when all the data are held in a single chunk (e.g., HDF5 contiguous
storage).

The software uses the `dmrpp:chunk` element to determine **where within the file or object to read data** and how to
reconstruct the chunk’s data. Each `dmrpp:chunk` element must include the `offset` and `nBytes` attributes. For a variable
that contains more than one chunk, the `chunkPositionInArray` attribute must also be included.

The remaining attributes are optional. If they are not used by a given `dmrpp:chunk` element, then the value is either
the default value (e.g., `fm`, see below) or an inherited value from some enclosing XML element. In version XXX of
the DMR++, the only elements that provide inherited attributes are the `dap:Dataset` and the `dmrpp:chunks` elements.
Using inherited XML attributes complicates parsing but can reduce XML document size when the number of `dmrpp:chunk`
elements is large.

#### Attributes of `dmrpp:chunk`

* `offset` and `nBytes`: byte offset and length in the underlying data resource (HDF5 file, etc.).
* `chunkPositionInArray`: comma-separated integer indices of the chunk in chunk-space, enclosed in square brackets (e.g., `"[0,1,3]"`).
* `fm`: optional “filter mask” for per-chunk filter flags. This attribute applies only to HDF5. It is a 32-bit integer
bit mask that should normally be zero. A non-zero value indicates that a filter failed and HDF5 retained the original,
unfiltered data. When reading the data, this mask is used to determine that decompression should not be attempted for
the affected chunk. This condition occurs rarely. With Direct I/O, this attribute becomes important, but only when the
mask value is non-zero. The bit assignments are as follows: shuffle is bit 0, deflate is bit 1, and fletcher32 is bit 2.
The default value of fm is 0.
* `href` and `trust` / `dmrpp:trust`: The `trust` attribute applies to the value of the `href` attribute. In systems
such as NASA Earthdata Cloud (EDC), this allows authentication steps to be skipped by indicating to the DMR++ parser
that the referenced `href` does not require authentication. It can be trusted because access to the DMR++ itself was
already authenticated and authorized. When present, the values of `href` and `trust` override those specified in the
`dap4:Dataset` element.
* `LinkedBlockIndex`: When multi-block chunks are used, this attribute groups multiple linked blocks into a single
logical chunk.
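
The `fm` bit assignments can be illustrated with a short sketch. It assumes that a set bit means the corresponding filter was skipped for that chunk, which is how the description above reads; the function name is hypothetical.

```python
# Bit positions as listed above: shuffle is bit 0, deflate is bit 1, fletcher32 is bit 2.
FILTER_BITS = {"shuffle": 0, "deflate": 1, "fletcher32": 2}


def filters_to_undo(compression_type: str, fm: int = 0):
    """Return the filters a reader must undo for one chunk, in decoding order.

    compression_type is the space-separated list from dmrpp:chunks.
    fm is the chunk's filter mask (assumed: a set bit means that filter was skipped).
    """
    applied = [
        name
        for name in compression_type.split()
        if not (fm >> FILTER_BITS[name]) & 1
    ]
    # Filters are undone in the reverse of the order in which they were applied.
    return list(reversed(applied))


normal_chunk = filters_to_undo("shuffle deflate", fm=0)
deflate_skipped = filters_to_undo("shuffle deflate", fm=2)
```

With `fm=0` both filters are undone, deflate first; with `fm=2` only the shuffle step remains.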

#### Child elements of `dmrpp:chunk`

The `dmrpp:chunk` element has no child elements.

**_FIXME_** Maybe it can contain dmrpp:block elements?

---

### `dmrpp:block`

Child of `dmrpp:chunks` used for **linked-block storage**, non-contiguous pieces of a variable stored as blocks that
are assembled into a single chunk.

#### Attributes of `dmrpp:block`

* `offset`, `nBytes`: byte location and size of a block.
* `href` and `trust` / `dmrpp:trust`: The `trust` attribute applies to the value of the `href` attribute. In systems
such as NASA Earthdata Cloud (EDC), this allows authentication steps to be skipped by indicating to the DMR++ parser
that the referenced `href` does not require authentication. It can be trusted because access to the DMR++ itself was
already authenticated and authorized. When present, the values of `href` and `trust` override those specified in the
`dap4:Dataset` element.

**_FIXME_** Kent notes that the `href` and `trust` attributes might not be supported by the `dmrpp:block` element.

The DMR++ interpreter groups multiple blocks into a single buffer in memory that is then treated as a 'chunk.'
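
The grouping step might look like the sketch below. It is only an illustration: the function name is hypothetical, an in-memory byte stream stands in for the data file, and the blocks are assumed to be concatenated in the order they are listed.

```python
import io


def assemble_chunk(resource, blocks):
    """Concatenate (offset, nBytes) blocks read from a seekable resource."""
    buffer = bytearray()
    for offset, n_bytes in blocks:
        resource.seek(offset)
        buffer += resource.read(n_bytes)
    return bytes(buffer)


# Stand-in for a data file: three useful regions separated by unrelated bytes.
data_file = io.BytesIO(b"AAAA....BBBB..CC")

# Each tuple mirrors the offset and nBytes attributes of one dmrpp:block.
chunk = assemble_chunk(data_file, [(0, 4), (8, 4), (14, 2)])
```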

#### Child elements of `dmrpp:block`

The `dmrpp:block` element has no child elements.

---

### `dmrpp:FixedLengthStringArray`

Child element of a DMR array variable element when that array is actually an **array of fixed-length strings** stored
as raw bytes.

The parser treats this as a marker that the base type is string-like but should be interpreted as
**fixed-length strings**.

#### Attributes of `dmrpp:FixedLengthStringArray`

* `string_length`: the per-string length in bytes (e.g., `"8"`).
* `pad`: how padding bytes are encoded (e.g., `"null"`, `"space"`, `"zero"`).

The software then slices the byte buffer into equal-sized string segments and de-pads each one appropriately,
extracting an array of strings.
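
The slicing and de-padding step could be sketched like this. The meaning given to each `pad` value is an assumption made for the example (`"null"` cuts at the first NUL byte, `"zero"` strips trailing NUL bytes, `"space"` strips trailing spaces), and the function name is hypothetical.

```python
def split_fixed_length_strings(buf: bytes, string_length: int, pad: str):
    """Slice a raw byte buffer into equal-sized strings and remove padding."""
    if string_length <= 0 or len(buf) % string_length != 0:
        raise ValueError("buffer length must be a multiple of string_length")
    strings = []
    for start in range(0, len(buf), string_length):
        segment = buf[start:start + string_length]
        if pad == "null":
            # Assumed: null-terminated, so keep everything before the first NUL.
            segment = segment.split(b"\x00", 1)[0]
        elif pad == "zero":
            # Assumed: padded with NUL bytes at the end.
            segment = segment.rstrip(b"\x00")
        elif pad == "space":
            # Assumed: padded with spaces at the end.
            segment = segment.rstrip(b" ")
        else:
            raise ValueError(f"unknown pad value: {pad}")
        strings.append(segment.decode("ascii"))
    return strings


# Two 8-byte, space-padded strings stored back to back.
names = split_fixed_length_strings(b"abc     hello   ", 8, "space")
```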

#### Child elements of `dmrpp:FixedLengthStringArray`

The `dmrpp:FixedLengthStringArray` element has no child elements.

---~~

### `dmrpp:compact`

Child element of a DMR variable element indicating **HDF5 COMPACT storage** — the data are stored inline in the DMR++ document,
as **base64-encoded** values. This encoding provides a way to include binary data in an XML
document.

The interpreter:

* base64-decodes the contents,
* interprets them according to the variable’s DAP type (numeric, string, fixed-length string array, etc.),
* and populates the corresponding variable in memory without any external I/O.

This inline base64 encoding is only used for relatively small variables.
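
For a numeric variable, the decoding steps above can be sketched as follows. The example assumes a 32-bit integer variable and a caller-supplied byte order; the function name and the sample values are made up for illustration.

```python
import base64
import struct


def decode_compact_int32(text: str, byte_order: str = "LE"):
    """Base64-decode inline content and interpret it as 32-bit signed integers."""
    raw = base64.b64decode(text.strip())
    prefix = "<" if byte_order == "LE" else ">"
    count = len(raw) // 4
    return list(struct.unpack(f"{prefix}{count}i", raw[: count * 4]))


# Build sample element content from three little-endian integers.
element_text = base64.b64encode(struct.pack("<3i", 1, 2, 3)).decode("ascii")

# No file or network access is needed; the values come from the document itself.
values = decode_compact_int32(element_text, "LE")
```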

#### Attributes of `dmrpp:compact`

The `dmrpp:compact` element has no attributes.

#### Child elements of `dmrpp:compact`

The `dmrpp:compact` element has no child elements.

---

### `dmrpp:missingdata`

Child element of a DMR variable element containing **missing-data values** for an array (or a single unsigned byte scalar) as
base64-encoded bytes, optionally compressed.

The parser:

* base64-decodes the contents,
* inflates them with zlib if needed,
* and either:
  * copies the result directly into the variable (no projection), or
  * uses the variable’s projection (start/stop/stride) to create a subset buffer.

This is used as a special “all missing” data source (e.g., when some chunks are not present and are logically all
fill/missing).
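
The steps listed above can be sketched for a one-dimensional array. This is an illustration under stated assumptions: the function name and parameters are hypothetical, the caller says whether the content is compressed, and `stop` is treated as inclusive.

```python
import base64
import zlib


def read_missing_data(text, compressed, elem_size, start=0, stop=None, stride=1):
    """Decode inline missing-data bytes and apply a 1-D start/stop/stride projection."""
    raw = base64.b64decode(text.strip())
    if compressed:
        raw = zlib.decompress(raw)
    count = len(raw) // elem_size
    if stop is None:
        stop = count - 1
    subset = bytearray()
    # stop is inclusive in this sketch, so the range ends at stop + 1.
    for index in range(start, stop + 1, stride):
        subset += raw[index * elem_size:(index + 1) * elem_size]
    return bytes(subset)


# Sample content: ten unsigned bytes, zlib-compressed and then base64-encoded.
element_text = base64.b64encode(zlib.compress(bytes(range(10)))).decode("ascii")

everything = read_missing_data(element_text, True, elem_size=1)
every_third = read_missing_data(element_text, True, elem_size=1, start=2, stop=8, stride=3)
```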

---

### `dmrpp:specialstructuredata`

Child of a structure variable (or array of structures) that encodes the content of a **“special structure”** as base64.

The parser supports structures whose members are limited to:

* numeric scalars,
* numeric arrays,
* string scalars, and
* arrays of strings,

and decodes the flattened byte layout back into the structure (or array of structures), including embedded
base64-encoded strings separated by semicolons.

---

### `dmrpp` variable-length string array element (name from `DMRPP_VLSA_ELEMENT`)

There is also an element whose QName is whatever `DMRPP_VLSA_ELEMENT` expands to in `DmrppNames.h`. From the usage:

* It appears as a child of an **array of strings/URLs**.
* `DMZ::process_vlsa` passes that element to `vlsa::read(...)`, which fills a `std::vector<std::string>` and marks the
array as a **variable-length string array (VLSA)**.

In the XSD below I’ll call this element `dmrpp:vlenStringArray` as a placeholder; you should rename it to match the
actual name used in your `DmrppNames.h`.

---

## dmrpp attributes on the Dataset element

The parser also recognizes these **dmrpp attributes on the root `Dataset` element**:

* `dmrpp:href` – **required** in this implementation. A URL to the underlying data resource; used as the default data
URL for all chunks and blocks that don’t specify their own `href`.
* `dmrpp:trust` – optional boolean (`"true"` / `"false"`). Signals whether the `dmrpp:href` is trusted as-is (controls
how the URL is handled).
* `dmrpp:version` – optional string. Its presence switches off some legacy “emulate original filter order” behavior; if
present and the `DMR` is actually a `DMRpp`, it’s stored in the DMRpp object.
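
How these defaults combine with per-chunk values might be sketched as below. The function name is hypothetical, attributes are passed as plain dictionaries, each attribute is assumed to override independently, and `trust` is assumed to default to false when absent everywhere.

```python
def effective_source(dataset_attrs: dict, element_attrs: dict):
    """Pick the data URL and trust flag for one chunk or block.

    Values on the chunk/block win; otherwise the Dataset values are used.
    """
    href = element_attrs.get("href", dataset_attrs.get("href"))
    if href is None:
        raise ValueError("no href on the element or on the Dataset")
    # Assumed default: untrusted unless an attribute says "true".
    trust_text = element_attrs.get("trust", dataset_attrs.get("trust", "false"))
    return href, trust_text == "true"


# Hypothetical Dataset-level defaults.
dataset = {"href": "https://example.org/data/file.h5", "trust": "true"}

inherited = effective_source(dataset, {})
overridden = effective_source(
    dataset, {"href": "https://example.org/data/other.h5", "trust": "false"}
)
```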

I’ll expose these as an attributeGroup in the schema so the DMR schema can import/reuse them.

---

## Example dmrpp.xsd (XSD 1.1)

Below is a standalone XSD 1.1 file for the `dmrpp` namespace, focused on what your parser actually uses.

### Notes before you drop this into your repo

* **Namespace URI**: I’ve used a placeholder `http://opendap.org/ns/dmrpp/1.0#`. Swap this for your actual dmrpp
namespace URI.
* **VLSA element name**: I’ve chosen `vlenStringArray` for the variable-length string array element. Rename that element
to whatever `DMRPP_VLSA_ELEMENT` is in your code.
* Types and constraints are conservative: they won’t enforce all your semantic rules, but they match what the code
expects structurally.

> [!NOTE]
> The schema is in the file dmrpp.xsd

47 changes: 47 additions & 0 deletions xml-schema/README-gpt-prompt.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@

_*Me*_

I have a fairly large C++ class implementation that parses (using the pugixml header library) an XML document. This
document contains a mix of XML elements from our DAP4 namespace and a new namespace called 'dmrpp'. I'd like to document
the dmrpp namespace, first using a short paragraph for each of the XML elements the parser recognizes and then using XML
Schema (XSD 1.1). Can you do that? The source file is 2k LOC.

_**ChatGPT**_

Yes, I can absolutely help with that.

_**Me**_

Here is the C++ code that implements the XML DOM parser using the pugixml library:

...

And, my editorial comments are in notes.

_**ChatGPT**_

From this code, the parser knows about the following **dmrpp elements** (plus a few dmrpp attributes on `Dataset`). I’ll
first give you short, human-readable descriptions, then a concrete XSD 1.1 schema you can use as a starting point.

_**Me**_

[! NOTE]: It's actually a 1.0 schema.

_**ChatGPT**_

# DMR++ Documentation for the `dmrpp` Namespace

...

_**Me**_

What follows in the document is almost completely rewritten. jhrg 1/2/26

_**ChatGPT**_

---

If you’d like, in a follow-up I can:

* tweak element/attribute names to exactly match your `DmrppNames.h`, and/or
* show how to wire this into the DAP4 DMR schema using `xs:any` or explicit `dmrpp:*` element references.