diff --git a/xml-schema/DMRPP-doc-ky.docx b/xml-schema/DMRPP-doc-ky.docx
new file mode 100644
index 0000000..6ecc1f8
Binary files /dev/null and b/xml-schema/DMRPP-doc-ky.docx differ
diff --git a/xml-schema/README-dmrpp.md b/xml-schema/README-dmrpp.md
new file mode 100644
index 0000000..9e59701
--- /dev/null
+++ b/xml-schema/README-dmrpp.md
@@ -0,0 +1,291 @@
+
+~~# DMR++ Documentation for the `dmrpp` Namespace
+
+The DMR++ `dmrpp` XML namespace elements were added to provide a way to describe the organization of 'chunks' used by a
+binary data format such as HDF5 to store the data values in an array. The DMR++ supports both HDF5 and HDF4 as of
+January 2026. The elements in this `dmrpp` namespace can be added to a DAP4 DMR (Dataset Metadata Response) document
+without affecting the XML parse of the elements in the DAP4 namespace.
+
+There are three primary elements in the `dmrpp` namespace: `chunks`, `chunkDimensionSizes`, and `chunk`. While
+not always true, in general, a `chunks` element encloses a set of `chunk` elements and a single `chunkDimensionSizes`
+element. The `chunks` element provides information that can be applied to all the chunks that make up a variable.
+The information in the `chunkDimensionSizes` element could have been encoded as an attribute of the `chunks` element.
+The `chunk` elements hold information unique to each chunk that makes up the variable.
+
+It is possible that a DMR++ document contains variables that have neither `chunks` nor `chunkDimensionSizes`
+elements since some variables' data is stored in a single 'chunk' in the HDF5 file. If only the attributes defined
+for `chunk` are needed, then that is the only element present. For example, HDF5 defines a storage class named
+_CONTIGUOUS_ that can be represented as a single chunk.
+
+## The dmrpp Namespace Elements
+
+### dmrpp:chunks
+
+The `dmrpp:chunks` element is always a child of a DAP/DMR variable element (e.g., `Float32`, `Int32`, etc.).
+It describes how the variable’s data are stored in an HDF5-like chunked layout. The element can
+contain the attributes and child elements described below.
+
+The DMR++ parser uses the information in the `dmrpp:chunks` element to build internal _Chunk_ objects. This
+can include _Chunk_ objects for chunks that are not present in the data file/object because they consist solely of fill
+values. In this case, the parser must synthesize these chunks itself using the value of the `fillValue` attribute.
+
+#### Attributes of `dmrpp:chunks`
+
+All attributes of the `dmrpp:chunks` element are optional.
+
+* `compressionType`: a space-separated list of filters, not limited to compression. Currently, DMR++ supports
+  _shuffle_, _deflate_, and _fletcher32_. The deflate filter uses the standard Internet deflate algorithm and
+  includes an associated compression level. The shuffle filter groups the high-order through low-order bytes of
+  multibyte numerical types together to improve the effectiveness of the deflate algorithm. The fletcher32 filter
+  provides a 32-bit hash of the data. *The order of the filters in the list is important.* The filters are listed in
+  the order in which they were applied during data encoding and therefore must be applied in reverse order during
+  decoding.
+* `deflateLevel`: the numerical level of the deflate compression, used when the data in the chunk were
+  compressed. The deflateLevel must be between 1 and 9. This is not needed to inflate (decompress) the chunk, but it is
+  necessary when other operations are applied.
+* `byteOrder`: optional byte order information; one of `LE` or `BE` (little- or big-endian). Defaults to `BE`.
+  Although `dmrpp:chunk` also includes a _byteOrder_ attribute, all the chunks inside a _dmrpp:chunks_ element
+  must have the same byte order.
+* `structOffset`: total size and offset information for a structure. In DMR++, only simple structures are supported;
+  nested structures are not supported.
+  This attribute is a space-separated list of numbers that encode the offsets,
+  in bytes, from the start of the structure for all fields except the first, which must have an offset of zero
+  bytes. In addition to the field offsets, the final element of the list specifies the total size of the structure
+  in bytes.
+* `fillValue`: the fill value used for chunks that have no data. In some cases, an array may contain regions with no
+  data. For example, this can occur with satellite swath data stored using a map projection. In such cases, a format
+  such as HDF5 may omit writing chunks that contain only fill values. Software that uses the DMR++ to read data must
+  fill in the gaps left by these “phantom” chunks. Each member of a structure may have its own fill value; in that
+  case, _fillValue_ is represented as a space-separated list of strings.
+* `LBChunk`: a boolean value indicating whether this variable has linked blocks. Linked blocks are used by HDF4 when a
+  'chunk' is not atomic but is instead split into multiple regions within a single file. In this case, the 'linked blocks'
+  are concatenated and then treated as a single 'chunk'. See the `dmrpp:block` element below.
+* `DIO`: a boolean that indicates the chunks can be used for a particular I/O optimization. Direct IO (DIO) is a
+  feature in the Hyrax software that improves performance by passing chunked data directly to the end user without
+  applying any filtering operations (for example, without decompression). By default, the Hyrax data server uses DIO
+  when writing NetCDF-4 files from HDF5 data described using DMR++, provided that certain conditions are met. This
+  feature can be disabled. _**FIXME**_ _What are those conditions_?
+
+#### Child elements of `dmrpp:chunks`
+
+ * Exactly one `dmrpp:chunkDimensionSizes` element, as defined below. This defines the logical organization
+   of the chunks/blocks that make up the variable.
+ * and one of:
+   * a list of individual `dmrpp:chunk` elements (this is the typical case for an HDF5/NetCDF4 file),
+   * a list of `dmrpp:block` elements (linked-block storage), or
+   * a “multi linked-block chunk” arrangement where `dmrpp:chunk` elements refer to multiple underlying _blocks_
+     (this case deals with formats where _chunks_ are not always atomic such as HDF4).
+ * A `dmrpp:chunks` element can contain, as child elements, either one or more `dmrpp:chunk` or `dmrpp:block`
+   element(s), but not both.
+
+---
+
+### dmrpp:chunkDimensionSizes
+
+The `dmrpp:chunkDimensionSizes` element is a child of `dmrpp:chunks`.
+It contains a **whitespace-separated list of chunk sizes**, one per array dimension (e.g., `"100 200"`). It is used
+together with the array’s declared dimensions to compute the **logical number of chunks** and their shapes. It is also
+used in conjunction with the 0...N `dmrpp:chunk` elements (see below) to detect which logical chunks are not included
+in the data file/object (i.e., they contain only fill values). For an array stored as a number of discrete chunks,
+this element must be present so that the DMR++ interpreter knows how to reassemble the chunks into
+the original array.
+
+#### Attributes of `dmrpp:chunkDimensionSizes`
+
+The `dmrpp:chunkDimensionSizes` element has no attributes.
+
+#### Child elements of `dmrpp:chunkDimensionSizes`
+
+The `dmrpp:chunkDimensionSizes` element has no child elements.
+
+---
+
+### dmrpp:chunk
+
+Each `dmrpp:chunk` describes a single data chunk. The `dmrpp:chunk` element is usually a child of `dmrpp:chunks`, but is
+sometimes a direct child of the variable element when all the data are held in a single chunk (e.g., HDF5 contiguous
+storage).
+
+The software uses the `dmrpp:chunk` element to determine **where within the file or object to read data** and how to
+reconstruct the chunk’s data. Each `dmrpp:chunk` element must include the `offset` and `nBytes` attributes.
+For a variable that contains more than one chunk, the `chunkPositionInArray` attribute must also be included.
+
+The remaining attributes are optional. If a given `dmrpp:chunk` element omits one of them, its value is either
+the default value (e.g., for `fm`, see below) or a value inherited from an enclosing XML element. In version XXX of
+the DMR++, the only elements that provide inherited attributes are the `dap:Dataset` and the `dmrpp:chunks` elements.
+Using inherited XML attributes complicates parsing but can reduce XML document size when the number of `dmrpp:chunk`
+elements is large.
+
+#### Attributes of `dmrpp:chunk`
+
+* `offset` and `nBytes`: byte offset and length in the underlying data resource (HDF5 file, etc.).
+* `chunkPositionInArray`: a bracketed, comma-separated list of integer indices of the chunk in chunk-space
+  (e.g., `"[0,1,3]"`).
+* `fm`: optional “filter mask” for per-chunk filter flags. This attribute applies only to HDF5. It is a 32-bit integer
+  bit mask that should normally be zero. A non-zero value indicates that a filter failed and HDF5 retained the original,
+  unfiltered data. When reading the data, this mask is used to determine that decompression should not be attempted for
+  the affected chunk. This condition occurs rarely. With Direct I/O, this attribute becomes important, but only when the
+  mask value is non-zero. The bit assignments are as follows: shuffle is bit 0, deflate is bit 1, and fletcher32 is bit 2.
+  The default value of fm is 0.
+* `href` and `trust` / `dmrpp:trust`: The `trust` attribute applies to the value of the `href` attribute. In systems
+  such as NASA Earthdata Cloud (EDC), this allows authentication steps to be skipped by indicating to the DMR++ parser
+  that the referenced `href` does not require authentication. It can be trusted because access to the DMR++ itself was
+  already authenticated and authorized. When present, the values of `href` and `trust` override those specified in the
+  `dap4:Dataset` element.
+* `LinkedBlockIndex`: When multi-block chunks are used, this attribute groups multiple linked blocks into a single
+  logical chunk.
+
+#### Child elements of `dmrpp:chunk`
+
+The `dmrpp:chunk` element has no child elements.
+
+**_FIXME_** Maybe it can contain dmrpp:block elements?
+
+---
+
+### `dmrpp:block`
+
+Child of `dmrpp:chunks` used for **linked-block storage**: non-contiguous pieces of a variable stored as blocks that
+are assembled into a single chunk.
+
+#### Attributes of `dmrpp:block`
+
+* `offset`, `nBytes`: byte location and size of a block.
+* `href` and `trust` / `dmrpp:trust`: The `trust` attribute applies to the value of the `href` attribute. In systems
+  such as NASA Earthdata Cloud (EDC), this allows authentication steps to be skipped by indicating to the DMR++ parser
+  that the referenced `href` does not require authentication. It can be trusted because access to the DMR++ itself was
+  already authenticated and authorized. When present, the values of `href` and `trust` override those specified in the
+  `dap4:Dataset` element.
+
+**_FIXME_** Kent notes that the `href` and `trust` attributes might not be supported by the `dmrpp:block` element.
+
+The DMR++ interpreter groups multiple blocks into a single buffer in memory that is then treated as a 'chunk'.
+
+#### Child elements of `dmrpp:block`
+
+The `dmrpp:block` element has no child elements.
+
+---
+
+### `dmrpp:FixedLengthStringArray`
+
+Child element of a DMR array variable element when that array is actually an **array of fixed-length strings** stored
+as raw bytes.
+
+The parser treats this as a marker indicating that the base type is string-like but should be interpreted as
+**fixed-length strings**.
+
+#### Attributes of `dmrpp:FixedLengthStringArray`
+
+* `string_length`: the per-string length in bytes (e.g., `"8"`).
+* `pad`: how padding bytes are encoded (e.g., `"null"`, `"space"`, `"zero"`).
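As a rough illustration of how `string_length` and `pad` might be used together, the following Python sketch splits a raw byte buffer into fixed-length strings and strips trailing padding. It is not the Hyrax implementation: the function name `split_fixed_length_strings` is hypothetical, the mapping from `pad` values to padding bytes (with `"zero"` treated like `"null"`) is an assumption, and ASCII decoding is assumed for simplicity.

```python
def split_fixed_length_strings(buffer: bytes, string_length: int, pad: str = "null") -> list[str]:
    """Slice a raw byte buffer into fixed-length strings and strip trailing padding.

    Hypothetical helper for illustration only; not part of the DMR++ parser.
    """
    if string_length <= 0:
        raise ValueError("string_length must be positive")
    if len(buffer) % string_length != 0:
        raise ValueError("buffer size is not a multiple of string_length")

    # Assumed mapping from the `pad` attribute value to the padding byte.
    # Treating "zero" the same as "null" is a guess, not documented behavior.
    pad_bytes = {"null": b"\x00", "zero": b"\x00", "space": b" "}
    if pad not in pad_bytes:
        raise ValueError(f"unsupported pad value: {pad!r}")
    padding = pad_bytes[pad]

    strings = []
    for start in range(0, len(buffer), string_length):
        # Each string occupies exactly string_length bytes in the buffer.
        segment = buffer[start:start + string_length]
        # Remove trailing padding bytes, then decode (ASCII assumed here).
        strings.append(segment.rstrip(padding).decode("ascii", errors="replace"))
    return strings


# Three 8-byte strings: two null-padded, one that fills its slot completely.
raw = b"alpha\x00\x00\x00beta\x00\x00\x00\x00gamma123"
names = split_fixed_length_strings(raw, string_length=8, pad="null")
```

Stripping only trailing padding keeps any padding-like bytes that appear inside a string; whether that matches the real parser's handling of each `pad` value would need to be checked against the source.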
+
+The software then slices the byte buffer into equal-sized string segments and de-pads each one appropriately,
+extracting an array of strings.
+
+#### Child elements of `dmrpp:FixedLengthStringArray`
+
+The `dmrpp:FixedLengthStringArray` element has no child elements.
+
+---~~
+
+### `dmrpp:compact`
+
+Child element of a DMR variable element indicating **HDF5 COMPACT storage**: the data are stored inline in the DMR++ document,
+as **base64-encoded** values. This encoding provides a way to include binary data in an XML
+document.
+
+The interpreter:
+
+* base64-decodes the contents,
+* interprets them according to the variable’s DAP type (numeric, string, fixed-length string array, etc.),
+* and populates the corresponding variable in memory without any external I/O.
+
+This inline base64 encoding is only used for relatively small variables.
+
+#### Attributes of `dmrpp:compact`
+
+The `dmrpp:compact` element has no attributes.
+
+#### Child elements of `dmrpp:compact`
+
+The `dmrpp:compact` element has no child elements.
+
+---
+
+### `dmrpp:missingdata`
+
+Child element of a DMR variable element containing **missing-data values** for an array (or a single unsigned byte scalar) as
+base64-encoded bytes, optionally compressed.
+
+The parser:
+
+* base64-decodes the contents,
+* inflates them with zlib if needed,
+* and either:
+
+  * copies directly into the variable (no projection), or
+  * uses the variable’s projection (start/stop/stride) to create a subset buffer.
+
+This is used as a special “all missing” data source (e.g., when some chunks are not present and are logically all
+fill/missing).
+
+---
+
+### `dmrpp:specialstructuredata`
+
+Child of a structure variable (or array of structures) that encodes the content of a **“special structure”** as base64.
+
+The parser supports structures whose members are limited to:
+
+* numeric scalars,
+* numeric arrays,
+* string scalars, and
+* arrays of strings,
+
+and decodes the flattened byte layout back into the structure (or array of structures), including embedded
+base64-encoded strings separated by semicolons.
+
+---
+
+### `dmrpp` variable-length string array element (name from `DMRPP_VLSA_ELEMENT`)
+
+There is also an element whose QName is whatever `DMRPP_VLSA_ELEMENT` expands to in `DmrppNames.h`. From the usage:
+
+* It appears as a child of an **array of strings/URLs**.
+* `DMZ::process_vlsa` passes that element to `vlsa::read(...)`, which fills a `std::vector` and marks the
+  array as a **variable-length string array (VLSA)**.
+
+In the XSD below I’ll call this element `dmrpp:vlenStringArray` as a placeholder; you should rename it to match the
+actual name used in your `DmrppNames.h`.
+
+---
+
+## 2. dmrpp attributes on the Dataset element
+
+The parser also recognizes these **dmrpp attributes on the root `Dataset` element**:
+
+* `dmrpp:href` – **required** in this implementation. A URL to the underlying data resource; used as the default data
+  URL for all chunks and blocks that don’t specify their own `href`.
+* `dmrpp:trust` – optional boolean (`"true"` / `"false"`). Signals whether the `dmrpp:href` is trusted as-is (controls
+  how the URL is handled).
+* `dmrpp:version` – optional string. Its presence switches off some legacy “emulate original filter order” behavior; if
+  present and the `DMR` is actually a `DMRpp`, it’s stored in the DMRpp object.
+
+I’ll expose these as an attributeGroup in the schema so the DMR schema can import/reuse them.
+
+---
+
+## 3. Example dmrpp.xsd (XSD 1.1)
+
+Below is a standalone XSD 1.1 file for the `dmrpp` namespace, focused on what your parser actually uses.
+
+### Notes before you drop this into your repo
+
+* **Namespace URI**: I’ve used a placeholder `http://opendap.org/ns/dmrpp/1.0#`.
+  Swap this for your actual dmrpp namespace URI.
+* **VLSA element name**: I’ve chosen `vlenStringArray` for the variable-length string array element. Rename that element
+  to whatever `DMRPP_VLSA_ELEMENT` is in your code.
+* Types and constraints are conservative: they won’t enforce all your semantic rules, but they match what the code
+  expects structurally.
+
+> [!NOTE]
+> The schema is in the file dmrpp.xsd
+
diff --git a/xml-schema/README-gpt-prompt.md b/xml-schema/README-gpt-prompt.md
new file mode 100644
index 0000000..ae8c632
--- /dev/null
+++ b/xml-schema/README-gpt-prompt.md
@@ -0,0 +1,47 @@
+
+_*Me*_
+
+I have a fairly large C++ class implementation that parses (using the pugixml header library) an XML document. This
+document contains a mix of XML elements from our DAP4 namespace and a new namespace called 'dmrpp'. I'd like to document
+the dmrpp namespace, first using a short paragraph for each of the XML elements the parser recognizes and then using XML
+Schema (XSD 1.1). Can you do that? The source file is 2k LOC.
+
+_**ChatGPT**_
+
+Yes, I can absolutely help with that.
+
+_**Me**_
+
+Here is the C++ code that implements the XML DOM parser using the pugixml library:
+
+...
+
+And, my editorial comments are in notes.
+
+_**ChatGPT**_
+
+From this code, the parser knows about the following **dmrpp elements** (plus a few dmrpp attributes on `Dataset`). I’ll
+first give you short, human-readable descriptions, then a concrete XSD 1.1 schema you can use as a starting point.
+
+_**Me**_
+
+[! NOTE]: It's actually a 1.0 schema.
+
+_**ChatGPT**_
+
+# DMR++ Documentation for the `dmrpp` Namespace
+
+...
+
+_**Me**_
+
+What follows in the document is almost completely rewritten. jhrg 1/2/26
+
+_**ChatGPT**_
+
+---
+
+If you’d like, in a follow-up I can:
+
+* tweak element/attribute names to exactly match your `DmrppNames.h`, and/or
+* show how to wire this into the DAP4 DMR schema using `xs:any` or explicit `dmrpp:*` element references.
diff --git a/xml-schema/dmrpp.xsd b/xml-schema/dmrpp.xsd new file mode 100644 index 0000000..abfea1e --- /dev/null +++ b/xml-schema/dmrpp.xsd @@ -0,0 +1,205 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +