Configuration options

Bart Noordervliet edited this page Sep 10, 2024 · 23 revisions

This page documents all the configuration items available in the xml-to-postgres YAML format. It is expected that you'll have a separate config for each type of XML document that you'll be converting.

Path specifications

A path is a slash-separated series of nested tags that you traverse in the XML document to reach the required node. It works like the most basic form of XPath, so in a document like this:

<A>
  <B>
    <C>id1</C>
  </B>
  <B>
    <C>id2</C>
  </B>
</A>

you would use path /A/B for the row entries and then path /C for the id column; the column path is relative to the row element.

No other XPath features are available at the moment, but they may be added if a good use-case turns up.
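
As a rough illustration, a config for the sample document above might look like the sketch below. The table name items and the column name id are made up for this example; only the two paths come from the text above.

```yaml
# Hypothetical config for the sample <A>/<B>/<C> document
name: items        # illustrative table name, not prescribed by the tool
path: /A/B         # the repeating element that holds one row each
cols:
  - name: id       # illustrative column name
    path: /C       # relative to the row element /A/B
```

With this config, each <B> element should become one row, so the sample would be expected to yield two rows holding id1 and id2.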

Options

  • name <string> [required]
    Sets the name of the main SQL table for this dataset
  • path <string> [required]
    Sets the path to the repeating element that contains the row entries
  • file <string>
    Sets the output filename for the main table; if not present, the data is sent to stdout
  • emit <string>
    A comma-separated string of additional SQL statements to be included in the output
    These are convenience functions to make it easier to pipe output straight into psql
    Currently available are:
    • copy_from Adds a "COPY <table> FROM stdin" statement
      This allows the data to be preceded/followed by other SQL statements; as such all other emit options imply this option
    • create_table Adds a "CREATE TABLE IF NOT EXISTS" statement that defines a table based on the columns specified below
      The datatype of each column will be 'text' unless it has a specific type set
    • drop_table Adds a "DROP TABLE IF EXISTS" statement before the CREATE TABLE
    • truncate Adds a "TRUNCATE" statement before the COPY
    • start_trans Wraps all statements in an explicit transaction
  • skip <string>
    Defines a sub-path to skip entirely (purely for performance reasons)
  • hush <string>
    A comma-separated string of logging levels to silence
    Valid values are info, notice and warning
  • cols <array of objects> [required]
    Defines the columns for the main table
    • name <string> [required]
      Sets the name of this column
    • path <string> [required]
      Sets the path to the data for this column
    • seri <boolean>
      Fills this column with an auto-incrementing integer instead of reading data out of the XML
      This option can be useful if you want to use subtables but have no unique key available
      Ignores the 'path' and all other column options
    • type <string>
      Defines the datatype for this column
      Only used for the "emit: create_table" function; no type checking or conversion is performed
    • attr <string>
      Causes the data for this column to be taken from the named attribute rather than the text node of the element
    • find <string>
      Sets a text string to search for in the data for this column, to be replaced with the string given in repl
    • repl <string>
      Sets the replacement string to be used with find
    • trim <boolean>
      Collapses any linebreaks and surrounding whitespace in text nodes into a single space character
    • conv <string>
      Enables a conversion function to be run on the data for this column
      Currently available functions are:
      • xml-to-text Converts all child nodes to a single text column, including XML tags and attributes
      • gml-to-ewkb Interprets the data as GML and converts it into EWKB for fast importing into PostGIS (WARNING)
    • mult <boolean>
      Forces the use of multi-geometry types (e.g. MultiPolygon) in the EWKB output (only relevant with conv: gml-to-ewkb)
    • bbox <string>
      Limits the output to features within the requested bounding box (format: "minx,miny maxx,maxy", only relevant with conv: gml-to-ewkb)
    • aggr <string>
      Enables an aggregation function to deal with repeating elements
      Currently available functions are:
      • first Takes the first occurrence (in XML document order) and ignores any later ones
      • last Takes the last occurrence (in XML document order) and ignores any earlier ones
      • append Concatenates all occurrences into a single string separated by commas
    • incl <regex string>
      Filters the result set by a regex match on this column; a non-match will cause the entire row to be skipped
    • excl <regex string>
      Filters the result set (after incl above) by a regex match on this column; a match will cause the entire row to be skipped
    • hide <boolean>
      Omits the column from the output
      Enable this to filter on a column with incl/excl without including it in the output
    • norm <string>
      Sets the path of the output file into which a normalized (deduplicated) table for this column is written
      Regular columns are deduplicated on their unique values; subtable columns are deduplicated on their first configured column
    • cols <array of objects>
      Defines this path as the start of a 'subtable', a nested set of columns that will be saved into a separate table with a one-to-many relationship to the parent table
      The value of the first column of the parent table will be saved as the first column of the subtable, to be used as a foreign key
      This set of columns can use any option documented above
      The subtable column itself is required to have a file field (containing a path for the output file), because the subtable output cannot be mixed with the main table
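
To show how several of these options can combine, here is a hedged sketch of a config for a hypothetical document made up of <person> elements under a <persons> root. All element names, table names, column names and file names are invented for illustration. The sketch also assumes that column paths inside a subtable are relative to the subtable's own path, in the same way that column paths are relative to the row path.

```yaml
# Hypothetical config; every name and path below is illustrative
name: persons
path: /persons/person
emit: create_table,truncate   # both imply copy_from, per the emit description
hush: info                    # silence info-level logging
cols:
  - name: id                  # first column: reused as the key in subtables
    path: /id
    type: integer             # only affects the generated CREATE TABLE
  - name: lang
    path: /name
    attr: lang                # read the lang attribute instead of the text node
  - name: bio
    path: /bio
    trim: true                # collapse linebreaks and surrounding whitespace
  - name: email
    path: /email
    aggr: first               # keep only the first of any repeated <email>
  - name: status
    path: /status
    incl: "^active$"          # skip rows whose status does not match
    hide: true                # filter on this column but omit it from output
  - name: phones              # subtable column (one-to-many with persons)
    path: /phones/phone
    file: phones.dump         # required: subtable output goes to its own file
    cols:
      - name: number
        path: /number         # assumed relative to /phones/phone
```

Since no top-level file is set, the main table data would go to stdout, which fits the stated purpose of emit: piping the output straight into psql. The phones subtable would be written to its own file, with the parent's id value as its first column.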