diff --git a/README.md b/README.md
index ace82cd..13e4a03 100644
--- a/README.md
+++ b/README.md
@@ -1,24 +1,24 @@
# Flechette
-**Flechette** is a JavaScript library for reading the [Apache Arrow](https://arrow.apache.org/) columnar in-memory data format. It provides a faster, lighter, zero-dependency alternative to the [Arrow JS reference implementation](https://github.com/apache/arrow/tree/main/js).
+**Flechette** is a JavaScript library for reading and writing the [Apache Arrow](https://arrow.apache.org/) columnar in-memory data format. It provides a faster, lighter, zero-dependency alternative to the [Arrow JS reference implementation](https://github.com/apache/arrow/tree/main/js).
-Flechette performs fast extraction of data columns in the Arrow binary IPC format, supporting ingestion of Arrow data (from sources such as [DuckDB](https://duckdb.org/)) for downstream use in JavaScript data analysis tools like [Arquero](https://github.com/uwdata/arquero), [Mosaic](https://github.com/uwdata/mosaic), [Observable Plot](https://observablehq.com/plot/), and [Vega-Lite](https://vega.github.io/vega-lite/).
+Flechette performs fast extraction and encoding of data columns in the Arrow binary IPC format, supporting ingestion of Arrow data from sources such as [DuckDB](https://duckdb.org/) and use of Arrow data in JavaScript data analysis tools like [Arquero](https://github.com/uwdata/arquero), [Mosaic](https://github.com/uwdata/mosaic), [Observable Plot](https://observablehq.com/plot/), and [Vega-Lite](https://vega.github.io/vega-lite/).
## Why Flechette?
In the process of developing multiple data analysis packages that consume Arrow data (including Arquero, Mosaic, and Vega), we've had to develop workarounds for the performance and correctness of the Arrow JavaScript reference implementation. Instead of workarounds, Flechette addresses these issues head-on.
-* _Speed_. Flechette provides faster decoding. Across varied datasets, initial performance tests show 1.3-1.6x faster value iteration, 2-7x faster array extraction, and 5-9x faster row object extraction.
+* _Speed_. Flechette provides better performance. Performance tests show 1.3-1.6x faster value iteration, 2-7x faster array extraction, 5-9x faster row object extraction, and 1.5-3.5x faster building of Arrow columns.
-* _Size_. Flechette is ~17k minified (~6k gzip'd), versus 163k minified (~43k gzip'd) for Arrow JS.
+* _Size_. Flechette is smaller: ~42k minified (~13k gzip'd) versus 163k minified (~43k gzip'd) for Arrow JS. Flechette's encoders and decoders also tree-shake cleanly, so you only pay for what you need in your own bundles.
-* _Coverage_. Flechette supports data types unsupported by the reference implementation at the time of writing, including decimal-to-number conversion, month/day/nanosecond time intervals (as used by DuckDB, for example), list views, and run-end encoded data.
+* _Coverage_. Flechette supports data types unsupported by the reference implementation, including decimal-to-number conversion, month/day/nanosecond time intervals (as used by DuckDB, for example), run-end encoded data, binary views, and list views.
* _Flexibility_. Flechette includes options to control data value conversion, such as numerical timestamps vs. Date objects for temporal data, and numbers vs. bigint values for 64-bit integer data.
-* _Simplicity_. Our goal is to provide a smaller, simpler code base in the hope that it will make it easier for ourselves and others to improve the library. If you'd like to see support for additional Arrow data types or features, please [file an issue](https://github.com/uwdata/flechette/issues) or [open a pull request](https://github.com/uwdata/flechette/pulls).
+* _Simplicity_. Our goal is to provide a smaller, simpler code base in the hope that it will make it easier for ourselves and others to improve the library. If you'd like to see support for additional Arrow features, please [file an issue](https://github.com/uwdata/flechette/issues) or [open a pull request](https://github.com/uwdata/flechette/pulls).
-That said, no tool is without limitations or trade-offs. Flechette is *consumption oriented*: it does yet support encoding (though feel free to [upvote encoding support](https://github.com/uwdata/flechette/issues/1)!). Flechette also requires simpler inputs (byte buffers, no promises or streams), has less strict TypeScript typings, and at times has a slightly slower initial parse (as it decodes dictionary data upfront for faster downstream access).
+That said, no tool is without limitations or trade-offs. Flechette assumes simpler inputs (byte buffers, no promises or streams), has less strict TypeScript typings, and may have a slightly slower initial parse (as it decodes dictionary data upfront for faster downstream access).
## What's with the name?
@@ -70,18 +70,57 @@ const objects = table.toArray();
const subtable = table.select(['delay', 'time']);
```
+### Build and Encode Arrow Data
+
+```js
+import {
+ bool, dictionary, float32, int32, tableFromArrays, tableToIPC, utf8
+} from '@uwdata/flechette';
+
+// data defined using standard JS types
+// both arrays and typed arrays work well
+const arrays = {
+ ints: [1, 2, null, 4, 5],
+ floats: [1.1, 2.2, 3.3, 4.4, 5.5],
+ bools: [true, true, null, false, true],
+ strings: ['a', 'b', 'c', 'b', 'a']
+};
+
+// create table with automatically inferred types
+const tableInfer = tableFromArrays(arrays);
+
+// encode table to bytes in Arrow IPC stream format
+const ipcInfer = tableToIPC(tableInfer);
+
+// create table using explicit types
+const tableTyped = tableFromArrays(arrays, {
+ types: {
+ ints: int32(),
+ floats: float32(),
+ bools: bool(),
+ strings: dictionary(utf8())
+ }
+});
+
+// encode table to bytes in Arrow IPC file format
+const ipcTyped = tableToIPC(tableTyped, { format: 'file' });
+```
+
### Customize Data Extraction
Data extraction can be customized using options provided to the table generation method. By default, temporal data is returned as numeric timestamps, 64-bit integers are coerced to numbers, and map-typed data is returned as an array of [key, value] pairs. These defaults can be changed via conversion options that push (or remove) transformations to the underlying data batches.
```js
const table = tableFromIPC(ipc, {
- useDate: true, // map temporal data to Date objects
- useBigInt: true, // use BigInt, do not coerce to number
- useMap: true // create Map objects for [key, value] pair lists
+ useDate: true, // map dates and timestamps to Date objects
+ useDecimalBigInt: true, // use BigInt for decimals, do not coerce to number
+ useBigInt: true, // use BigInt for 64-bit ints, do not coerce to number
+ useMap: true // create Map objects for [key, value] pair lists
});
```
+The same extraction options can be passed to `tableFromArrays`.
+
## Build Instructions
To build and develop Flechette locally:
diff --git a/docs/api/column.md b/docs/api/column.md
new file mode 100644
index 0000000..7963e2c
--- /dev/null
+++ b/docs/api/column.md
@@ -0,0 +1,69 @@
+---
+title: Flechette API Reference
+---
+# Flechette API Reference
+
+[Top-Level](/flechette/api) | [Data Types](data-types) | [Table](table) | [**Column**](column)
+
+## Column Class
+
+A data column. A column provides a view over one or more value batches, each corresponding to part of an Arrow record batch. The Column class supports random access to column values by integer index using the [`at`](#at) method; however, extracting arrays using [`toArray`](#toArray) may provide a more efficient means of bulk access and scanning.
+
+* [constructor](#constructor)
+* [type](#type)
+* [length](#length)
+* [nullCount](#nullCount)
+* [data](#data)
+* [at](#at)
+* [get](#get)
+* [toArray](#toArray)
+* [Symbol.iterator](#iterator)
+
+<hr/><a id="constructor" href="#constructor">#</a>
+Column.constructor(data)
+
+Create a new column with the given data batches.
+
+* *data* (`Batch[]`): The column data batches.
+
+<hr/><a id="type" href="#type">#</a>
+Column.type
+
+The column [data type](data-types).
+
+<hr/><a id="length" href="#length">#</a>
+Column.length
+
+The column length (number of rows).
+
+<hr/><a id="nullCount" href="#nullCount">#</a>
+Column.nullCount
+
+The count of null values in the column.
+
+<hr/><a id="data" href="#data">#</a>
+Column.data
+
+An array of column data batches.
+
+<hr/><a id="at" href="#at">#</a>
+Column.at(index)
+
+Return the column value at the given *index*. If a column has multiple batches, this method performs binary search over the batch lengths to determine the batch from which to retrieve the value. The search makes lookup less efficient than a standard array access. If making multiple full scans of a column, consider extracting an array via `toArray()`.
+
+* *index* (`number`): The row index.
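+
+*Examples*
+
+A quick sketch contrasting random access with bulk extraction:
+
+```js
+import { columnFromArray } from '@uwdata/flechette';
+
+const col = columnFromArray([10, 20, 30, 40]);
+
+// random access by index, using binary search across batches
+const v = col.at(2); // 30
+
+// bulk extraction is preferable when scanning many values
+const values = col.toArray();
+```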
+
+<hr/><a id="get" href="#get">#</a>
+Column.get(index)
+
+Return the column value at the given *index*. This method is the same as [`at`](#at) and is provided for better compatibility with Apache Arrow JS.
+
+<hr/><a id="toArray" href="#toArray">#</a>
+Column.toArray()
+
+Extract column values into a single array instance. When possible, a zero-copy subarray of the input Arrow data is returned, using a typed array when the data type permits. If a column contains `null` values, a standard `Array` is created and populated.
+
+<hr/><a id="iterator" href="#iterator">#</a>
+Column.[Symbol.iterator]()
+
+Return an iterator over the values in this column.
diff --git a/docs/api/data-types.md b/docs/api/data-types.md
new file mode 100644
index 0000000..726b6b0
--- /dev/null
+++ b/docs/api/data-types.md
@@ -0,0 +1,470 @@
+---
+title: Flechette API Reference
+---
+# Flechette API Reference
+
+[Top-Level](/flechette/api) | [**Data Types**](data-types) | [Table](table) | [Column](column)
+
+## Data Type Overview
+
+The table below provides an overview of all data types supported by the Apache Arrow format and how Flechette maps them to JavaScript types. The table indicates if Flechette can read the type (via [`tableFromIPC`](/flechette/api/#tableFromIPC)), write the type (via [`tableToIPC`](/flechette/api/#tableToIPC)), and build the type from JavaScript values (via [`tableFromArrays`](/flechette/api/#tableFromArrays) or [`columnFromArray`](/flechette/api/#columnFromArray)).
+
+| Id | Data Type | Read? | Write? | Build? | JavaScript Type |
+| --: | ----------------------------------- | :---: | :----: | :----: | --------------- |
+| -1 | [Dictionary](#dictionary) | ✅ | ✅ | ✅ | depends on dictionary value type |
+| 1 | [Null](#null) | ✅ | ✅ | ✅ | `null` |
+| 2 | [Int](#int) | ✅ | ✅ | ✅ | `number`, or `bigint` for 64-bit values via the `useBigInt` flag |
+| 3 | [Float](#float) | ✅ | ✅ | ✅ | `number` |
+| 4 | [Binary](#binary) | ✅ | ✅ | ✅ | `Uint8Array` |
+| 5 | [Utf8](#utf8) | ✅ | ✅ | ✅ | `string` |
+| 6 | [Bool](#bool) | ✅ | ✅ | ✅ | `boolean` |
+| 7 | [Decimal](#decimal) | ✅ | ✅ | ✅ | `number`, or `bigint` via the `useDecimalBigInt` flag |
+| 8 | [Date](#date) | ✅ | ✅ | ✅ | `number`, or `Date` via the `useDate` flag |
+| 9 | [Time](#time) | ✅ | ✅ | ✅ | `number`, or `bigint` for 64-bit values via the `useBigInt` flag |
+| 10 | [Timestamp](#timestamp) | ✅ | ✅ | ✅ | `number`, or `Date` via the `useDate` flag |
+| 11 | [Interval](#interval) | ✅ | ✅ | ✅ | `Float64Array` (month/day/nano) or `Int32Array` (other units) |
+| 12 | [List](#list) | ✅ | ✅ | ✅ | `Array` or `TypedArray` of child type |
+| 13 | [Struct](#struct) | ✅ | ✅ | ✅ | `object`, properties depend on child types |
+| 14 | [Union](#union) | ✅ | ✅ | ✅ | depends on child types |
+| 15 | [FixedSizeBinary](#fixedSizeBinary) | ✅ | ✅ | ✅ | `Uint8Array` |
+| 16 | [FixedSizeList](#fixedSizeList) | ✅ | ✅ | ✅ | `Array` or `TypedArray` of child type |
+| 17 | [Map](#map) | ✅ | ✅ | ✅ | `[key, value][]`, or `Map` via the `useMap` flag |
+| 18 | [Duration](#duration) | ✅ | ✅ | ✅ | `number`, or `bigint` via the `useBigInt` flag |
+| 19 | [LargeBinary](#largeBinary) | ✅ | ✅ | ✅ | `Uint8Array` |
+| 20 | [LargeUtf8](#largeUtf8) | ✅ | ✅ | ✅ | `string` |
+| 21 | [LargeList](#largeList) | ✅ | ✅ | ✅ | `Array` or `TypedArray` of child type |
+| 22 | [RunEndEncoded](#runEndEncoded) | ✅ | ✅ | ✅ | depends on child type |
+| 23 | [BinaryView](#binaryView) | ✅ | ✅ | ❌ | `Uint8Array` |
+| 24 | [Utf8View](#utf8View) | ✅ | ✅ | ❌ | `string` |
+| 25 | [ListView](#listView) | ✅ | ✅ | ❌ | `Array` or `TypedArray` of child type |
+| 26 | [LargeListView](#largeListView) | ✅ | ✅ | ❌ | `Array` or `TypedArray` of child type |
+
+## Data Type Methods
+
+* [field](#field)
+* [dictionary](#dictionary)
+* [nullType](#nullType)
+* [int](#int), [int8](#int8), [int16](#int16), [int32](#int32), [int64](#int64), [uint8](#uint8), [uint16](#uint16), [uint32](#uint32), [uint64](#uint64)
+* [float](#float), [float16](#float16), [float32](#float32), [float64](#float64)
+* [binary](#binary)
+* [utf8](#utf8)
+* [bool](#bool)
+* [decimal](#decimal)
+* [date](#date), [dateDay](#dateDay), [dateMillisecond](#dateMillisecond)
+* [time](#time), [timeSecond](#timeSecond), [timeMillisecond](#timeMillisecond), [timeMicrosecond](#timeMicrosecond), [timeNanosecond](#timeNanosecond)
+* [timestamp](#timestamp)
+* [interval](#interval)
+* [list](#list)
+* [struct](#struct)
+* [union](#union)
+* [fixedSizeBinary](#fixedSizeBinary)
+* [fixedSizeList](#fixedSizeList)
+* [map](#map)
+* [duration](#duration)
+* [largeBinary](#largeBinary)
+* [largeUtf8](#largeUtf8)
+* [largeList](#largeList)
+* [runEndEncoded](#runEndEncoded)
+* [binaryView](#binaryView)
+* [utf8View](#utf8View)
+* [listView](#listView)
+* [largeListView](#largeListView)
+
+### Field
+
+<hr/><a id="field" href="#field">#</a>
+field(name, type[, nullable, metadata])
+
+Create a new field instance for use in a schema or type definition. A field represents a field name, data type, and additional metadata. Fields are used to represent child types within nested types like [List](#list), [Struct](#struct), and [Union](#union).
+
+* *name* (`string`): The field name.
+* *type* (`DataType`): The field data type.
+* *nullable* (`boolean`): Flag indicating if the field is nullable (default `true`).
+* *metadata* (`Map`): Custom field metadata annotations (default `null`).
+
+### Dictionary
+
+<hr/><a id="dictionary" href="#dictionary">#</a>
+dictionary(type[, indexType, id, ordered])
+
+Create a Dictionary data type instance. A dictionary type consists of a dictionary of values (which may be of any type) and corresponding integer indices that reference those values. If values are repeated, a dictionary encoding can provide substantial space savings. In the IPC format, dictionary indices reside alongside other columns in a record batch, while dictionary values are written to special dictionary batches, linked by a unique dictionary *id*. Internally Flechette extracts dictionary values upfront; while this incurs some initial overhead, it enables fast subsequent lookups.
+
+* *type* (`DataType`): The data type of dictionary values.
+* *indexType* (`DataType`): The data type of dictionary indices. Must be an integer type (default [`int32`](#int32)).
+* *id* (`number`): The dictionary id; it should be unique within a table. Defaults to `-1`, but is set to a proper id if the type is passed through [`tableFromArrays`](/flechette/api/#tableFromArrays).
+* *ordered* (`boolean`): Indicates if dictionary values are ordered (default `false`).
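+
+*Examples*
+
+A minimal sketch of building a dictionary-encoded column; the explicit `int16()` index type is an illustrative choice:
+
+```js
+import { columnFromArray, dictionary, int16, utf8 } from '@uwdata/flechette';
+
+// dictionary-encode repeated strings using 16-bit integer indices
+const type = dictionary(utf8(), int16());
+const col = columnFromArray(['a', 'b', 'a', 'b', 'a'], type);
+```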
+
+### Null
+
+<hr/><a id="nullType" href="#nullType">#</a>
+nullType()
+
+Create a Null data type instance. Null data requires no storage and all extracted values are `null`.
+
+### Int
+
+<hr/><a id="int" href="#int">#</a>
+int([bitWidth, signed])
+
+Create an Int data type instance. Integer values are stored within typed arrays and extracted to JavaScript `number` values by default.
+
+* *bitWidth* (`number`): The integer bit width, must be `8`, `16`, `32` (default), or `64`.
+* *signed* (`boolean`): Flag for signed or unsigned integers (default `true`).
+
+<hr/><a id="int8" href="#int8">#</a>
+int8()
+
+Create an Int data type instance for 8-bit signed integers. 8-bit signed integers are stored within an `Int8Array` and accessed directly.
+
+<hr/><a id="int16" href="#int16">#</a>
+int16()
+
+Create an Int data type instance for 16-bit signed integers. 16-bit signed integers are stored within an `Int16Array` and accessed directly.
+
+<hr/><a id="int32" href="#int32">#</a>
+int32()
+
+Create an Int data type instance for 32-bit signed integers. 32-bit signed integers are stored within an `Int32Array` and accessed directly.
+
+<hr/><a id="int64" href="#int64">#</a>
+int64()
+
+Create an Int data type instance for 64-bit signed integers. 64-bit signed integers are stored within a `BigInt64Array` and converted to JavaScript `number` values upon extraction. An error is raised if a value exceeds either `Number.MIN_SAFE_INTEGER` or `Number.MAX_SAFE_INTEGER`. Pass the `useBigInt` extraction option (e.g., to [`tableFromIPC`](/flechette/api/#tableFromIPC) or [`tableFromArrays`](/flechette/api/#tableFromArrays)) to instead extract 64-bit integers directly as `BigInt` values.
+
+<hr/><a id="uint8" href="#uint8">#</a>
+uint8()
+
+Create an Int data type instance for 8-bit unsigned integers. 8-bit unsigned integers are stored within an `Uint8Array` and accessed directly.
+
+<hr/><a id="uint16" href="#uint16">#</a>
+uint16()
+
+Create an Int data type instance for 16-bit unsigned integers. 16-bit unsigned integers are stored within an `Uint16Array` and accessed directly.
+
+<hr/><a id="uint32" href="#uint32">#</a>
+uint32()
+
+Create an Int data type instance for 32-bit unsigned integers. 32-bit unsigned integers are stored within an `Uint32Array` and accessed directly.
+
+<hr/><a id="uint64" href="#uint64">#</a>
+uint64()
+
+Create an Int data type instance for 64-bit unsigned integers. 64-bit unsigned integers are stored within a `BigUint64Array` and converted to JavaScript `number` values upon extraction. An error is raised if a value exceeds `Number.MAX_SAFE_INTEGER`. Pass the `useBigInt` extraction option (e.g., to [`tableFromIPC`](/flechette/api/#tableFromIPC) or [`tableFromArrays`](/flechette/api/#tableFromArrays)) to instead extract 64-bit integers directly as `BigInt` values.
+
+### Float
+
+<hr/><a id="float" href="#float">#</a>
+float([precision])
+
+Create a Float data type instance for floating point numbers. Floating point values are stored within typed arrays and extracted to JavaScript `number` values.
+
+* *precision* (`number`): The floating point precision, one of `Precision.HALF` (16-bit), `Precision.SINGLE` (32-bit) or `Precision.DOUBLE` (64-bit, default).
+
+<hr/><a id="float16" href="#float16">#</a>
+float16()
+
+Create a Float data type instance for 16-bit (half precision) floating point numbers. 16-bit floats are stored within a `Uint16Array` and converted to/from `number` values. We intend to use [`Float16Array`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Float16Array) once it is widespread among JavaScript engines.
+
+<hr/><a id="float32" href="#float32">#</a>
+float32()
+
+Create a Float data type instance for 32-bit (single precision) floating point numbers. 32-bit floats are stored within a `Float32Array` and accessed directly.
+
+<hr/><a id="float64" href="#float64">#</a>
+float64()
+
+Create a Float data type instance for 64-bit (double precision) floating point numbers. 64-bit floats are stored within a `Float64Array` and accessed directly.
+
+### Binary
+
+<hr/><a id="binary" href="#binary">#</a>
+binary()
+
+Create a Binary data type instance for variably-sized opaque binary data with 32-bit offsets. Binary values are stored in a `Uint8Array` using a 32-bit offset array and extracted to JavaScript `Uint8Array` subarray values.
+
+### Utf8
+
+<hr/><a id="utf8" href="#utf8">#</a>
+utf8()
+
+Create a Utf8 data type instance for Unicode string data of variable length with 32-bit offsets. [UTF-8](https://en.wikipedia.org/wiki/UTF-8) code points are stored as binary data and extracted to JavaScript `string` values using [`TextDecoder`](https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder). Due to decoding overhead, repeated access to string data can be costly. If making multiple passes over Utf8 data, we recommend converting the strings upfront (e.g., via [`Column.toArray`](column#toArray)) and accessing the result.
+
+### Bool
+
+<hr/><a id="bool" href="#bool">#</a>
+bool()
+
+Create a Bool data type instance for boolean data. Bool values are stored compactly in `Uint8Array` bitmaps with eight values per byte, and extracted to JavaScript `boolean` values.
+
+### Decimal
+
+<hr/><a id="decimal" href="#decimal">#</a>
+decimal(precision, scale[, bitWidth])
+
+Create a Decimal data type instance for exact decimal values, represented as a 128 or 256-bit integer value in two's complement. Decimals are fixed point numbers with a set *precision* (total number of decimal digits) and *scale* (number of fractional digits). For example, the number `35.42` can be represented as `3542` with *precision* ≥ 4 and *scale* = 2.
+
+By default, Flechette converts decimals to 64-bit floating point numbers upon extraction (e.g., mapping `3542` back to `35.42`). While useful for many downstream applications, this conversion may be lossy and introduce inaccuracies. Pass the `useDecimalBigInt` extraction option (e.g., to [`tableFromIPC`](/flechette/api/#tableFromIPC) or [`tableFromArrays`](/flechette/api/#tableFromArrays)) to instead extract decimal data as `BigInt` values.
+
+* *precision* (`number`): The total number of decimal digits that can be represented.
+* *scale* (`number`): The number of fractional digits, beyond the decimal point.
+* *bitWidth* (`number`): The decimal bit width, one of `128` (default) or `256`.
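+
+*Examples*
+
+A sketch of a decimal column built from JavaScript numbers (values are assumed to be scaled and stored internally as integers):
+
+```js
+import { columnFromArray, decimal } from '@uwdata/flechette';
+
+// precision 6, scale 2: 35.42 is stored as the scaled integer 3542
+const col = columnFromArray([35.42, 1.07, null], decimal(6, 2));
+
+// extraction returns floating point numbers by default; pass
+// { useDecimalBigInt: true } to extract scaled BigInt values instead
+const values = col.toArray();
+```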
+
+### Date
+
+<hr/><a id="date" href="#date">#</a>
+date(unit)
+
+Create a Date data type instance. Date values are 32-bit or 64-bit signed integers representing an elapsed time since the UNIX epoch (Jan 1, 1970 UTC), either in units of days (32 bits) or milliseconds (64 bits, with values evenly divisible by 86400000). Dates are stored in either an `Int32Array` (days) or `BigInt64Array` (milliseconds).
+
+By default, extracted date values are converted to JavaScript `number` values representing milliseconds since the UNIX epoch. Pass the `useDate` extraction option (e.g., to [`tableFromIPC`](/flechette/api/#tableFromIPC) or [`tableFromArrays`](/flechette/api/#tableFromArrays)) to instead extract date values as JavaScript `Date` objects.
+
+* *unit* (`number`): The date unit, one of `DateUnit.DAY` or `DateUnit.MILLISECOND`.
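+
+*Examples*
+
+A sketch of building day-unit dates from JavaScript `Date` objects and extracting them back as dates:
+
+```js
+import { columnFromArray, dateDay } from '@uwdata/flechette';
+
+// build a day-unit date column, extracting values as Date objects
+const col = columnFromArray(
+  [new Date(Date.UTC(2024, 0, 1)), null], dateDay(), { useDate: true }
+);
+```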
+
+<hr/><a id="dateDay" href="#dateDay">#</a>
+dateDay()
+
+Create a Date data type instance with units of `DateUnit.DAY`.
+
+<hr/><a id="dateMillisecond" href="#dateMillisecond">#</a>
+dateMillisecond()
+
+Create a Date data type instance with units of `DateUnit.MILLISECOND`.
+
+### Time
+
+<hr/><a id="time" href="#time">#</a>
+time([unit, bitWidth])
+
+Create a Time data type instance, stored in one of four *unit*s: seconds, milliseconds, microseconds, or nanoseconds. The integer *bitWidth* depends on the *unit* and must be 32 bits for seconds and milliseconds or 64 bits for microseconds and nanoseconds. The allowed values are between 0 (inclusive) and 86400 (=24*60*60) seconds (exclusive), adjusted for the time unit (for example, up to 86400000 exclusive for the `TimeUnit.MILLISECOND` unit).
+
+This definition doesn't allow for leap seconds. Time values from measurements with leap seconds will need to be corrected when ingesting into Arrow (for example by replacing the value 86400 with 86399).
+
+Time values are stored as integers in either an `Int32Array` (*bitWidth* = 32) or a `BigInt64Array` (*bitWidth* = 64). By default, time values are extracted as JavaScript `number` values; 64-bit values are converted from `BigInt` upon extraction, and an error is raised if a value exceeds either `Number.MIN_SAFE_INTEGER` or `Number.MAX_SAFE_INTEGER`. Pass the `useBigInt` extraction option (e.g., to [`tableFromIPC`](/flechette/api/#tableFromIPC) or [`tableFromArrays`](/flechette/api/#tableFromArrays)) to instead extract 64-bit time values directly as `BigInt` values.
+
+* *unit* (`number`): The time unit, one of `TimeUnit.SECOND`, `TimeUnit.MILLISECOND` (default), `TimeUnit.MICROSECOND`, or `TimeUnit.NANOSECOND`.
+* *bitWidth* (`number`): The time bit width, one of `32` (for seconds and milliseconds) or `64` (for microseconds and nanoseconds).
+
+<hr/><a id="timeSecond" href="#timeSecond">#</a>
+timeSecond()
+
+Create a Time data type instance with units of `TimeUnit.SECOND`.
+
+<hr/><a id="timeMillisecond" href="#timeMillisecond">#</a>
+timeMillisecond()
+
+Create a Time data type instance with units of `TimeUnit.MILLISECOND`.
+
+<hr/><a id="timeMicrosecond" href="#timeMicrosecond">#</a>
+timeMicrosecond()
+
+Create a Time data type instance with units of `TimeUnit.MICROSECOND`.
+
+<hr/><a id="timeNanosecond" href="#timeNanosecond">#</a>
+timeNanosecond()
+
+Create a Time data type instance with units of `TimeUnit.NANOSECOND`.
+
+### Timestamp
+
+<hr/><a id="timestamp" href="#timestamp">#</a>
+timestamp([unit, timezone])
+
+Create a Timestamp data type instance. Timestamp values are 64-bit signed integers representing an elapsed time since a fixed epoch, stored in one of four *unit*s: seconds, milliseconds, microseconds, or nanoseconds, and are optionally annotated with a *timezone*. Timestamp values do not include any leap seconds (in other words, all days are considered 86400 seconds long).
+
+Timestamp values are stored in a `BigInt64Array` and converted to millisecond-based JavaScript `number` values (potentially with fractional digits) upon extraction. An error is raised if a value exceeds either `Number.MIN_SAFE_INTEGER` or `Number.MAX_SAFE_INTEGER`. Pass the `useDate` extraction option (e.g., to [`tableFromIPC`](/flechette/api/#tableFromIPC) or [`tableFromArrays`](/flechette/api/#tableFromArrays)) to instead extract timestamp values as JavaScript `Date` objects.
+
+* *unit* (`number`): The time unit, one of `TimeUnit.SECOND`, `TimeUnit.MILLISECOND` (default), `TimeUnit.MICROSECOND`, or `TimeUnit.NANOSECOND`.
+* *timezone* (`string`): An optional string for the name of a timezone. If provided, the value should either be a string as used in the Olson timezone database (the "tz database" or "tzdata"), such as "America/New_York", or an absolute timezone offset of the form "+XX:XX" or "-XX:XX", such as "+07:30". The presence of a timezone string indicates that timestamps represent UTC-relative instants to be displayed in that timezone; its absence indicates timezone-naive "wall clock" values.
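+
+*Examples*
+
+A sketch of a millisecond-unit timestamp column, assuming the `TimeUnit` constants are exported alongside the type methods:
+
+```js
+import { TimeUnit, columnFromArray, timestamp } from '@uwdata/flechette';
+
+// millisecond-unit timestamps, extracted as Date objects via useDate
+const col = columnFromArray(
+  [new Date('2024-01-01T00:00:00Z'), null],
+  timestamp(TimeUnit.MILLISECOND),
+  { useDate: true }
+);
+```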
+
+### Interval
+
+<hr/><a id="interval" href="#interval">#</a>
+interval([unit])
+
+Create an Interval data type instance. Values represent calendar intervals stored using integers for each date part. The supported interval *unit*s are:
+
+* `IntervalUnit.YEAR_MONTH`: Indicates the number of elapsed whole months, stored as 4-byte signed integers.
+* `IntervalUnit.DAY_TIME`: Indicates the number of elapsed days and milliseconds (no leap seconds), stored as 2 contiguous 32-bit signed integers (8-bytes in total).
+* `IntervalUnit.MONTH_DAY_NANO`: A triple of the number of elapsed months, days, and nanoseconds. The values are stored contiguously in 16-byte blocks. Months and days are encoded as 32-bit signed integers and nanoseconds is encoded as a 64-bit signed integer. Nanoseconds does not allow for leap seconds. Each field is independent (e.g. there is no constraint that nanoseconds have the same sign as days or that the quantity of nanoseconds represents less than a day's worth of time).
+
+Flechette extracts interval values to two-element `Int32Array` instances (for `IntervalUnit.YEAR_MONTH` and `IntervalUnit.DAY_TIME`) or to three-element `Float64Array` instances (for `IntervalUnit.MONTH_DAY_NANO`).
+
+* *unit* (`number`): The interval unit. One of `IntervalUnit.YEAR_MONTH`, `IntervalUnit.DAY_TIME`, or `IntervalUnit.MONTH_DAY_NANO` (default).
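+
+*Examples*
+
+A sketch of creating interval types, assuming the `IntervalUnit` constants are exported alongside the type methods:
+
+```js
+import { IntervalUnit, interval } from '@uwdata/flechette';
+
+// month/day/nanosecond intervals (the default unit),
+// extracted as three-element Float64Array instances
+const mdn = interval(IntervalUnit.MONTH_DAY_NANO);
+
+// year/month intervals, extracted as two-element Int32Array instances
+const ym = interval(IntervalUnit.YEAR_MONTH);
+```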
+
+### List
+
+<hr/><a id="list" href="#list">#</a>
+list(child)
+
+Create a List type instance, representing variably-sized lists (arrays) with 32-bit offsets. A list has a single child data type for list entries. Lists are represented using integer offsets that indicate list extents within a single child array containing all list values. Lists are extracted to either `Array` or `TypedArray` instances, depending on the child type.
+
+* *child* (`DataType | Field`): The child (list item) field or data type.
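+
+*Examples*
+
+A sketch of building a list-typed column from nested arrays:
+
+```js
+import { columnFromArray, int32, list } from '@uwdata/flechette';
+
+// variable-length lists of 32-bit integers
+const col = columnFromArray([[1, 2], [3], null], list(int32()));
+```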
+
+### Struct
+
+<hr/><a id="struct" href="#struct">#</a>
+struct(children)
+
+Create a Struct type instance. A struct consists of multiple named child data types. Struct values are stored as parallel child batches, one per child type, and extracted to standard JavaScript objects.
+
+* *children* (`Field[] | object`): An array of property fields, or an object mapping property names to data types. If an object, the instantiated fields are assumed to be nullable and have no metadata.
+
+*Examples*
+
+```js
+import { bool, float32, int16, struct } from '@uwdata/flechette';
+// using an object with property names and types
+struct({ foo: int16(), bar: bool(), baz: float32() })
+```
+
+```js
+import { bool, field, float32, int16, struct } from '@uwdata/flechette';
+// using an array of Field instances
+struct([
+ field('foo', int16()),
+ field('bar', bool()),
+ field('baz', float32())
+])
+```
+
+### Union
+
+<hr/><a id="union" href="#union">#</a>
+union(mode, children[, typeIds, typeIdForValue])
+
+Create a Union type instance. A union is a complex type with parallel *children* data types. Union values are stored in either a sparse (`UnionMode.Sparse`) or dense (`UnionMode.Dense`) layout *mode*. In a sparse layout, child types are stored in parallel arrays with the same lengths, resulting in many unused, empty values. In a dense layout, child types have variable lengths and an offsets array is used to index the appropriate value.
+
+By default, ids in the type vector refer to the index in the children array. Optionally, *typeIds* provide an indirection between the child index and the type id. For each child, `typeIds[index]` is the id used in the type vector. The *typeIdForValue* argument provides a lookup function for mapping input data to the proper child type id, and is required if using builder methods.
+
+Extracted JavaScript values depend on the child types.
+
+* *mode* (`number`): The union mode. One of `UnionMode.Sparse` or `UnionMode.Dense`.
+* *children* (`(DataType | Field)[]`): The children fields or data types. Types are mapped to nullable fields with no metadata.
+* *typeIds* (`number[]`): Children type ids, in the same order as the children types. Type ids provide a level of indirection over children types. If not provided, the children indices are used as the type ids.
+* *typeIdForValue* (`(value: any, index: number) => number`): A function that takes an arbitrary value and a row index and returns a corresponding union type id. This function is required to build union-typed data with [`tableFromArrays`](/flechette/api/#tableFromArrays) or [`columnFromArray`](/flechette/api/#columnFromArray).
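+
+*Examples*
+
+A sketch of building a dense union of float and string values, assuming the `UnionMode` constants are exported alongside the type methods:
+
+```js
+import { UnionMode, columnFromArray, float64, union, utf8 } from '@uwdata/flechette';
+
+// typeIdForValue routes each input value to the proper child type id
+const type = union(
+  UnionMode.Dense,
+  [float64(), utf8()],
+  [0, 1],
+  v => typeof v === 'string' ? 1 : 0
+);
+const col = columnFromArray([1.5, 'a', 2.5, 'b'], type);
+```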
+
+### FixedSizeBinary
+
+<hr/><a id="fixedSizeBinary" href="#fixedSizeBinary">#</a>
+fixedSizeBinary(stride)
+
+Create a FixedSizeBinary data type instance for opaque binary data where each entry has the same fixed size. Fixed binary data are stored in a single `Uint8Array`, indexed using the known stride and extracted to JavaScript `Uint8Array` subarray values.
+
+* *stride* (`number`): The fixed size in bytes.
+
+### FixedSizeList
+
+<hr/><a id="fixedSizeList" href="#fixedSizeList">#</a>
+fixedSizeList(child, stride)
+
+Create a FixedSizeList type instance for list (array) data where every list has the same fixed size. A list has a single child data type for list entries. Fixed size lists are represented as a single child array containing all list values, indexed using the known stride. Lists are extracted to either `Array` or `TypedArray` instances, depending on the child type.
+
+* *child* (`DataType | Field`): The child (list item) field or data type.
+* *stride* (`number`): The fixed list size.
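+
+*Examples*
+
+A sketch of fixed-size lists, here for 3D points:
+
+```js
+import { columnFromArray, fixedSizeList, float32 } from '@uwdata/flechette';
+
+// every list has exactly three 32-bit float entries
+const col = columnFromArray(
+  [[1, 2, 3], [4, 5, 6]], fixedSizeList(float32(), 3)
+);
+```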
+
+### Map
+
+<hr/><a id="map" href="#map">#</a>
+map(keyField, valueField[, keysSorted])
+
+Create a Map type instance representing collections of key-value pairs. A Map is a logical nested type that is represented as a list of key-value structs. The key and value types are not constrained, so the application is responsible for ensuring that the keys are hashable and unique, and that keys are properly sorted if *keysSorted* is `true`.
+
+By default, map data is extracted to arrays of `[key, value]` pairs, in the style of `Object.entries`. Pass the `useMap` extraction option (e.g., to [`tableFromIPC`](/flechette/api/#tableFromIPC) or [`tableFromArrays`](/flechette/api/#tableFromArrays)) to instead extract JavaScript [`Map`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Map) instances.
+
+* *keyField* (`DataType | Field`): The map key field or data type.
+* *valueField* (`DataType | Field`): The map value field or data type.
+* *keysSorted* (`boolean`): Flag indicating if the map keys are sorted (default `false`).
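+
+*Examples*
+
+A sketch of building a map-typed column, assuming input values mirror the extracted `[key, value]` pair format:
+
+```js
+import { columnFromArray, float64, map, utf8 } from '@uwdata/flechette';
+
+// maps from string keys to float values
+const type = map(utf8(), float64());
+const col = columnFromArray([[['a', 1], ['b', 2]], null], type);
+```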
+
+### Duration
+
+<hr/><a id="duration" href="#duration">#</a>
+duration([unit])
+
+Create a Duration data type instance. Durations represent an absolute length of time unrelated to any calendar artifacts. The resolution defaults to millisecond, but can be any of the other `TimeUnit` values. This type is always represented as a 64-bit integer.
+
+Duration values are stored as integers in a `BigInt64Array`. By default, duration values are extracted as JavaScript `number` values. An error is raised if a value exceeds either `Number.MIN_SAFE_INTEGER` or `Number.MAX_SAFE_INTEGER`. Pass the `useBigInt` extraction option (e.g., to [`tableFromIPC`](/flechette/api/#tableFromIPC) or [`tableFromArrays`](/flechette/api/#tableFromArrays)) to instead extract duration values directly as `BigInt` values.
+
+* *unit* (`number`): The duration time unit, one of `TimeUnit.SECOND`, `TimeUnit.MILLISECOND` (default), `TimeUnit.MICROSECOND`, or `TimeUnit.NANOSECOND`.
+
+### LargeBinary
+
+<hr/><a id="largeBinary" href="#largeBinary">#</a>
+largeBinary()
+
+Create a LargeBinary data type instance for variably-sized opaque binary data with 64-bit offsets, allowing representation of extremely large data values. Large binary values are stored in a `Uint8Array`, indexed using a 64-bit offset array and extracted to JavaScript `Uint8Array` subarray values.
+
+### LargeUtf8
+
+<hr/><a id="largeUtf8" href="#largeUtf8">#</a>
+largeUtf8()
+
+Create a LargeUtf8 data type instance for Unicode string data of variable length with 64-bit offsets, allowing representation of extremely large data values. [UTF-8](https://en.wikipedia.org/wiki/UTF-8) code points are stored as binary data and extracted to JavaScript `string` values using [`TextDecoder`](https://developer.mozilla.org/en-US/docs/Web/API/TextDecoder). Due to decoding overhead, repeated access to string data can be costly. If making multiple passes over Utf8 data, we recommend converting the strings upfront (e.g., via [`Column.toArray`](column#toArray)) and accessing the result.
+
+### LargeList
+
+<hr/><a id="largeList" href="#largeList">#</a>
+largeList(child)
+
+Create a LargeList type instance, representing variably-sized lists (arrays) with 64-bit offsets, allowing representation of extremely large data values. A list has a single child data type for list entries. Lists are represented using integer offsets that indicate list extents within a single child array containing all list values. Lists are extracted to either `Array` or `TypedArray` instances, depending on the child type.
+
+* *child* (`DataType | Field`): The child (list item) field or data type.
+
+### RunEndEncoded
+
+<hr/><a id="runEndEncoded" href="#runEndEncoded">#</a>
+runEndEncoded(runsField, valuesField)
+
+Create a RunEndEncoded type instance, which compresses data by representing consecutive repeated values as a run. This data type uses two child arrays, `run_ends` and `values`. The `run_ends` child array must be a 16-, 32-, or 64-bit integer array; each entry encodes the index at which the run for the corresponding entry in the `values` child array ends. Like list and struct types, the `values` array can be of any type.
+
+To extract a value by index, binary search is performed over the `run_ends` to locate the correct value. The extracted value depends on the `values` data type.
+
+* *runsField* (`DataType | Field`): The run-ends field or data type.
+* *valuesField* (`DataType | Field`): The values field or data type.
+
+*Examples*
+
+```js
+import { int32, runEndEncoded, utf8 } from '@uwdata/flechette';
+// 32-bit integer run ends and utf8 string values
+const type = runEndEncoded(int32(), utf8());
+```
+
+### BinaryView
+
+<hr/><a id="binaryView" href="#binaryView">#</a>
+binaryView()
+
+Create a BinaryView type instance. BinaryView data is logically the same as the [Binary](#binary) type, but the internal representation uses a view struct that contains the string length and either the string's entire data inline (for small strings) or an inlined prefix, an index of another buffer, and an offset pointing to a slice in that buffer (for non-small strings). For more details, see the [Apache Arrow format documentation](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout).
+
+Flechette can encode and decode BinaryView data, extracting `Uint8Array` values. However, Flechette does not currently support building BinaryView columns from JavaScript values.
+
+### Utf8View
+
+<hr/><a id="utf8View" href="#utf8View">#</a>
+utf8View()
+
+Create a Utf8View type instance. Utf8View data is logically the same as the [Utf8](#utf8) type, but the internal representation uses a view struct that contains the string length and either the string's entire data inline (for small strings) or an inlined prefix, an index of another buffer, and an offset pointing to a slice in that buffer (for non-small strings). For more details, see the [Apache Arrow format documentation](https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout).
+
+Flechette can encode and decode Utf8View data, extracting `string` values. However, Flechette does not currently support building Utf8View columns from JavaScript values.
+
+### ListView
+
+<hr/><a id="listView" href="#listView">#</a>
+listView(child)
+
+Create a ListView type instance, representing variably-sized lists (arrays) with 32-bit offsets. ListView data represents the same logical types that [List](#list) can, but contains both offsets and sizes allowing for writes in any order and sharing of child values among list values. For more details, see the [Apache Arrow format documentation](https://arrow.apache.org/docs/format/Columnar.html#listview-layout).
+
+ListView data are extracted to either `Array` or `TypedArray` instances, depending on the child type. Flechette can encode and decode ListView data; however, Flechette does not currently support building ListView columns from JavaScript values.
+
+* *child* (`DataType | Field`): The child (list item) field or data type.
+
+### LargeListView
+
+<hr/><a id="largeListView" href="#largeListView">#</a>
+largeListView(child)
+
+Create a LargeListView type instance, representing variably-sized lists (arrays) with 64-bit offsets, allowing representation of extremely large data values. LargeListView data represents the same logical types that [LargeList](#largeList) can, but contains both offsets and sizes allowing for writes in any order and sharing of child values among list values. For more details, see the [Apache Arrow format documentation](https://arrow.apache.org/docs/format/Columnar.html#listview-layout).
+
+LargeListView data are extracted to either `Array` or `TypedArray` instances, depending on the child type. Flechette can encode and decode LargeListView data; however, Flechette does not currently support building LargeListView columns from JavaScript values.
+
+* *child* (`DataType | Field`): The child (list item) field or data type.
diff --git a/docs/api/index.md b/docs/api/index.md
new file mode 100644
index 0000000..aab82f4
--- /dev/null
+++ b/docs/api/index.md
@@ -0,0 +1,153 @@
+---
+title: Flechette API Reference
+---
+# Flechette API Reference
+
+[**Top-Level**](/flechette/api) | [Data Types](data-types) | [Table](table) | [Column](column)
+
+## Top-Level Decoding and Encoding
+
+* [tableFromIPC](#tableFromIPC)
+* [tableToIPC](#tableToIPC)
+* [tableFromArrays](#tableFromArrays)
+* [columnFromArray](#columnFromArray)
+* [tableFromColumns](#tableFromColumns)
+
+<hr/><a id="tableFromIPC" href="#tableFromIPC">#</a>
+tableFromIPC(data[, options])
+
+Decode [Apache Arrow IPC data](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc) and return a new [`Table`](table). The input binary data may be either an `ArrayBuffer` or `Uint8Array`. For Arrow data in the [IPC 'stream' format](https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format), an array of `Uint8Array` values is also supported.
+
+* *data* (`ArrayBuffer` | `Uint8Array` | `Uint8Array[]`): The source byte buffer, or an array of buffers. If an array, each byte array may contain one or more self-contained messages. Messages may NOT span multiple byte arrays.
+* *options* (`ExtractionOptions`): Options for controlling how values are transformed when extracted from an Arrow binary representation.
+ * *useDate* (`boolean`): If true, extract dates and timestamps as JavaScript `Date` objects. Otherwise, return numerical timestamp values (default).
+ * *useDecimalBigInt* (`boolean`): If true, extract decimal-type data as `BigInt` values, where fractional digits are scaled to integers. Otherwise, return converted floating-point numbers (default).
+ * *useBigInt* (`boolean`): If true, extract 64-bit integers as JavaScript `BigInt` values. Otherwise, coerce long integers to JavaScript number values (default).
+ * *useMap* (`boolean`): If true, extract Arrow 'Map' values as JavaScript `Map` instances. Otherwise, return an array of [key, value] pairs compatible with both `Map` and `Object.fromEntries` (default).
+
+*Examples*
+
+```js
+import { tableFromIPC } from '@uwdata/flechette';
+const url = 'https://vega.github.io/vega-datasets/data/flights-200k.arrow';
+const ipc = await fetch(url).then(r => r.arrayBuffer());
+const table = tableFromIPC(ipc);
+```
+
+<hr/><a id="tableToIPC" href="#tableToIPC">#</a>
+tableToIPC(table[, options])
+
+Encode an Arrow table into Arrow IPC binary format and return the result as a `Uint8Array`. Both the IPC ['stream'](https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format) and ['file'](https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format) formats are supported.
+
+* *table* (`Table`): The Arrow table to encode.
+* *options* (`object`): Encoding options object.
+ * *format* (`string`): Arrow `'stream'` (the default) or `'file'` format.
+
+*Examples*
+
+```js
+import { tableToIPC } from '@uwdata/flechette';
+const bytes = tableToIPC(table, { format: 'stream' });
+```
+
+<hr/><a id="tableFromArrays" href="#tableFromArrays">#</a>
+tableFromArrays(data[, options])
+
+Create a new table from a set of named arrays. Data types for the resulting Arrow columns can be automatically inferred or specified using the *types* option. If the *types* option provides data types for only a subset of columns, the rest are inferred. Each input array must have the same length.
+
+* *data* (`object | array`): The input data as a collection of named arrays. If object-valued, the object's keys are column names and the values are arrays or typed arrays. If array-valued, the data should consist of [name, array] pairs in the style of `Object.entries`.
+* *options* (`object`): Options for building new tables and controlling how values are transformed when extracted from an Arrow binary representation.
+ * *types* (`object`): An object mapping column names to [data types](data-types).
+ * *maxBatchRows* (`number`): The maximum number of rows to include in a single record batch. If the array lengths exceed this number, the resulting table will consist of multiple record batches.
+ * In addition, all [tableFromIPC](#tableFromIPC) extraction options are supported.
+
+*Examples*
+
+```js
+import { tableFromArrays } from '@uwdata/flechette';
+
+// create table with inferred types
+const table = tableFromArrays({
+ ints: [1, 2, null, 4, 5],
+ floats: [1.1, 2.2, 3.3, 4.4, 5.5],
+ bools: [true, true, null, false, true],
+ strings: ['a', 'b', 'c', 'b', 'a']
+});
+```
+
+```js
+import {
+ bool, dictionary, float32, int32, tableFromArrays, tableToIPC, utf8
+} from '@uwdata/flechette';
+
+// create table with specified types
+const table = tableFromArrays({
+ ints: [1, 2, null, 4, 5],
+ floats: [1.1, 2.2, 3.3, 4.4, 5.5],
+ bools: [true, true, null, false, true],
+ strings: ['a', 'b', 'c', 'b', 'a']
+}, {
+ types: {
+ ints: int32(),
+ floats: float32(),
+ bools: bool(),
+ strings: dictionary(utf8())
+ }
+});
+```
+
+<hr/><a id="columnFromArray" href="#columnFromArray">#</a>
+columnFromArray(data[, type, options])
+
+Create a new column from a provided data array. The data type for the column can be automatically inferred or specified using the *type* argument.
+
+* *data* (`Array | TypedArray`): The input data as an Array or TypedArray.
+* *type* (`DataType`): The [data type](data-types) for the column. If not specified, type inference is attempted.
+* *options* (`object`): Options for building new columns and controlling how values are transformed when extracted from an Arrow binary representation.
+ * *maxBatchRows* (`number`): The maximum number of rows to include in a single record batch. If the array length exceeds this number, the resulting column will consist of multiple batches.
+ * In addition, all [tableFromIPC](#tableFromIPC) extraction options are supported.
+
+*Examples*
+
+```js
+import { columnFromArray } from '@uwdata/flechette';
+
+// create column with inferred type (here, float64)
+const col = columnFromArray([1.1, 2.2, 3.3, 4.4, 5.5]);
+```
+
+```js
+import { columnFromArray, float32 } from '@uwdata/flechette';
+
+// create column with specified type
+const col = columnFromArray([1.1, 2.2, 3.3, 4.4, 5.5], float32());
+```
+
+```js
+import { columnFromArray, int64 } from '@uwdata/flechette';
+
+// create column with specified type and options
+const col = columnFromArray(
+ [1n, 32n, 2n << 34n], int64(),
+ { maxBatchRows: 1000, useBigInt: true }
+);
+```
+
+<hr/><a id="tableFromColumns" href="#tableFromColumns">#</a>
+tableFromColumns(columns)
+
+Create a new table from a collection of columns. This method is useful for creating new tables using one or more pre-existing column instances. Otherwise, [`tableFromArrays`](#tableFromArrays) should be preferred. Input columns are assumed to have the same record batch sizes and non-conflicting dictionary ids.
+
+* *columns* (`object | array`): The input columns as an object with name keys, or an array of [name, column] pairs.
+
+*Examples*
+
+```js
+import { columnFromArray, tableFromColumns } from '@uwdata/flechette';
+
+// create table from two columns built with inferred types
+const table = tableFromColumns({
+ bools: columnFromArray([true, true, null, false, true]),
+ floats: columnFromArray([1.1, 2.2, 3.3, 4.4, 5.5])
+});
+```
diff --git a/docs/api/table.md b/docs/api/table.md
new file mode 100644
index 0000000..6c5fb30
--- /dev/null
+++ b/docs/api/table.md
@@ -0,0 +1,98 @@
+---
+title: Flechette API Reference
+---
+# Flechette API Reference
+
+[Top-Level](/flechette/api) | [Data Types](data-types) | [**Table**](table) | [Column](column)
+
+## Table Class
+
+A table consisting of named columns (or 'children'). To extract table data directly to JavaScript values, use [`toColumns()`](#toColumns) to produce an object that maps column names to extracted value arrays, or [`toArray()`](#toArray) to extract an array of row objects. Tables are [iterable](#iterator), iterating over row objects. While `toArray()` and [table iterators](#iterator) enable convenient use by tools that expect row objects, column-oriented processing is more efficient and thus recommended. Use [`getChild`](#getChild) or [`getChildAt`](#getChildAt) to access a specific [`Column`](column).
+
+* [constructor](#constructor)
+* [numCols](#numCols)
+* [numRows](#numRows)
+* [getChildAt](#getChildAt)
+* [getChild](#getChild)
+* [selectAt](#selectAt)
+* [select](#select)
+* [at](#at)
+* [get](#get)
+* [toColumns](#toColumns)
+* [toArray](#toArray)
+* [Symbol.iterator](#iterator)
+
+<hr/><a id="constructor" href="#constructor">#</a>
+Table.constructor(schema, children)
+
+Create a new table with the given *schema* and *children* columns. The column types and order *must* be consistent with the given *schema*.
+
+* *schema* (`Schema`): The table schema.
+* *children* (`Column[]`): The table columns.
+
+<hr/><a id="numCols" href="#numCols">#</a>
+Table.numCols
+
+The number of columns in the table.
+
+<hr/><a id="numRows" href="#numRows">#</a>
+Table.numRows
+
+The number of rows in the table.
+
+<hr/><a id="getChildAt" href="#getChildAt">#</a>
+Table.getChildAt(index)
+
+Return the child [column](column) at the given *index* position.
+
+* *index* (`number`): The column index.
+
+<hr/><a id="getChild" href="#getChild">#</a>
+Table.getChild(name)
+
+Return the first child [column](column) with the given *name*.
+
+* *name* (`string`): The column name.
+
+<hr/><a id="selectAt" href="#selectAt">#</a>
+Table.selectAt(indices[, as])
+
+Construct a new table containing only columns at the specified *indices*. The order of columns in the new table matches the order of input *indices*.
+
+* *indices* (`number[]`): The indices of columns to keep.
+* *as* (`string[]`): Optional new names for the selected columns.
+
+<hr/><a id="select" href="#select">#</a>
+Table.select(names[, as])
+
+Construct a new table containing only columns with the specified *names*. If columns have duplicate names, the first (with lowest index) is used. The order of columns in the new table matches the order of input *names*.
+
+* *names* (`string[]`): The names of columns to keep.
+* *as* (`string[]`): Optional new names for selected columns.
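+
+*Examples*
+
+A quick sketch of selecting and renaming columns:
+
+```js
+import { tableFromArrays } from '@uwdata/flechette';
+
+const table = tableFromArrays({ delay: [1, 2], time: [3, 4], dist: [5, 6] });
+
+// keep two columns, renaming them in the process
+const sub = table.select(['delay', 'time'], ['Delay', 'Time']);
+```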
+
+<hr/><a id="at" href="#at">#</a>
+Table.at(index)
+
+Return a row object for the given *index*.
+
+* *index* (`number`): The row index.
+
+<hr/><a id="get" href="#get">#</a>
+Table.get(index)
+
+Return a row object for the given *index*. This method is the same as [`at`](#at) and is provided for better compatibility with Apache Arrow JS.
+
+<hr/><a id="toColumns" href="#toColumns">#</a>
+Table.toColumns()
+
+Return an object that maps column names to extracted value arrays.
+
+<hr/><a id="toArray" href="#toArray">#</a>
+Table.toArray()
+
+Return an array of objects representing the rows of this table.
+
+<hr/><a id="iterator" href="#iterator">#</a>
+Table.[Symbol.iterator]()
+
+Return an iterator over row objects representing the rows of this table.
diff --git a/docs/index.md b/docs/index.md
new file mode 100644
index 0000000..2908f61
--- /dev/null
+++ b/docs/index.md
@@ -0,0 +1,132 @@
+# Flechette
+
+**Flechette** is a JavaScript library for reading and writing the [Apache Arrow](https://arrow.apache.org/) columnar in-memory data format. It provides a faster, lighter, zero-dependency alternative to the [Arrow JS reference implementation](https://github.com/apache/arrow/tree/main/js).
+
+Flechette performs fast extraction and encoding of data columns in the Arrow binary IPC format, supporting ingestion of Arrow data from sources such as [DuckDB](https://duckdb.org/) and use of Arrow data in JavaScript data analysis tools like [Arquero](https://github.com/uwdata/arquero), [Mosaic](https://github.com/uwdata/mosaic), [Observable Plot](https://observablehq.com/plot/), and [Vega-Lite](https://vega.github.io/vega-lite/).
+
+[**API Reference**](api)
+
+## Why Flechette?
+
+In the process of developing multiple data analysis packages that consume Arrow data (including Arquero, Mosaic, and Vega), we've had to develop workarounds for the performance and correctness of the Arrow JavaScript reference implementation. Instead of workarounds, Flechette addresses these issues head-on.
+
+* _Speed_. Flechette provides better performance. Performance tests show 1.3-1.6x faster value iteration, 2-7x faster array extraction, 5-9x faster row object extraction, and 1.5-3.5x faster building of Arrow columns.
+
+* _Size_. Flechette is smaller: ~42k minified (~13k gzip'd) versus 163k minified (~43k gzip'd) for Arrow JS. Flechette's encoders and decoders also tree-shake cleanly, so you only pay for what you need in your own bundles.
+
+* _Coverage_. Flechette supports data types unsupported by the reference implementation, including decimal-to-number conversion, month/day/nanosecond time intervals (as used by DuckDB, for example), run-end encoded data, binary views, and list views.
+
+* _Flexibility_. Flechette includes options to control data value conversion, such as numerical timestamps vs. Date objects for temporal data, and numbers vs. bigint values for 64-bit integer data.
+
+* _Simplicity_. Our goal is to provide a smaller, simpler code base in the hope that it will make it easier for ourselves and others to improve the library. If you'd like to see support for additional Arrow features, please [file an issue](https://github.com/uwdata/flechette/issues) or [open a pull request](https://github.com/uwdata/flechette/pulls).
+
+That said, no tool is without limitations or trade-offs. Flechette assumes simpler inputs (byte buffers, no promises or streams), has less strict TypeScript typings, and may have a slightly slower initial parse (as it decodes dictionary data upfront for faster downstream access).
+
+## What's with the name?
+
+The project name stems from the French word [fléchette](https://en.wikipedia.org/wiki/Flechette), which means "little arrow" or "dart". 🎯
+
+## Examples
+
+### Load and Access Arrow Data
+
+```js
+import { tableFromIPC } from '@uwdata/flechette';
+
+const url = 'https://vega.github.io/vega-datasets/data/flights-200k.arrow';
+const ipc = await fetch(url).then(r => r.arrayBuffer());
+const table = tableFromIPC(ipc);
+
+// print table size: (231083 x 3)
+console.log(`${table.numRows} x ${table.numCols}`);
+
+// inspect schema for column names, data types, etc.
+// [
+// { name: "delay", type: { typeId: 2, bitWidth: 16, signed: true }, ...},
+// { name: "distance", type: { typeId: 2, bitWidth: 16, signed: true }, ...},
+// { name: "time", type: { typeId: 3, precision: 1 }, ...}
+// ]
+// typeId: 2 === Type.Int, typeId: 3 === Type.Float
+console.log(JSON.stringify(table.schema.fields, 0, 2));
+
+// convert a single Arrow column to a value array
+// when possible, zero-copy access to binary data is used
+const delay = table.getChild('delay').toArray();
+
+// data columns are iterable
+const time = [...table.getChild('time')];
+
+// data columns provide random access
+const time0 = table.getChild('time').at(0);
+
+// extract all columns into a { name: array, ... } object
+// { delay: Int16Array, distance: Int16Array, time: Float32Array }
+const columns = table.toColumns();
+
+// convert Arrow data to an array of standard JS objects
+// [ { delay: 14, distance: 405, time: 0.01666666753590107 }, ... ]
+const objects = table.toArray();
+
+// create a new table with a selected subset of columns
+// use this first to limit toColumns or toArray to fewer columns
+const subtable = table.select(['delay', 'time']);
+```
+
+### Build and Encode Arrow Data
+
+```js
+import {
+ bool, dictionary, float32, int32, tableFromArrays, tableToIPC, utf8
+} from '@uwdata/flechette';
+
+// data defined using standard JS types
+// both arrays and typed arrays work well
+const arrays = {
+ ints: [1, 2, null, 4, 5],
+ floats: [1.1, 2.2, 3.3, 4.4, 5.5],
+ bools: [true, true, null, false, true],
+ strings: ['a', 'b', 'c', 'b', 'a']
+};
+
+// create table with automatically inferred types
+const tableInfer = tableFromArrays(arrays);
+
+// encode table to bytes in Arrow IPC stream format
+const ipcInfer = tableToIPC(tableInfer);
+
+// create table using explicit types
+const tableTyped = tableFromArrays(arrays, {
+ types: {
+ ints: int32(),
+ floats: float32(),
+ bools: bool(),
+ strings: dictionary(utf8())
+ }
+});
+
+// encode table to bytes in Arrow IPC file format
+const ipcTyped = tableToIPC(tableTyped, { format: 'file' });
+```
+
+### Customize Data Extraction
+
+Data extraction can be customized using options provided to the table generation method. By default, temporal data is returned as numeric timestamps, 64-bit integers are coerced to numbers, and map-typed data is returned as an array of [key, value] pairs. These defaults can be changed via conversion options that push (or remove) transformations to the underlying data batches.
+
+```js
+const table = tableFromIPC(ipc, {
+ useDate: true, // map dates and timestamps to Date objects
+ useDecimalBigInt: true, // use BigInt for decimals, do not coerce to number
+ useBigInt: true, // use BigInt for 64-bit ints, do not coerce to number
+ useMap: true // create Map objects for [key, value] pair lists
+});
+```
+
+The same extraction options can be passed to `tableFromArrays`. For more, see the [**API Reference**](api).
+
+## Build Instructions
+
+To build and develop Flechette locally:
+
+- Clone https://github.com/uwdata/flechette.
+- Run `npm i` to install dependencies.
+- Run `npm test` to run test cases, `npm run perf` to run performance benchmarks, and `npm run build` to build output files.
diff --git a/package.json b/package.json
index 076221d..6eb8b5f 100644
--- a/package.json
+++ b/package.json
@@ -20,7 +20,10 @@
"url": "https://github.com/uwdata/flechette.git"
},
"scripts": {
- "perf": "node perf/perf-test.js",
+ "perf": "node perf/run-all.js",
+ "perf:build": "node perf/build-perf.js",
+ "perf:decode": "node perf/decode-perf.js",
+ "perf:encode": "node perf/encode-perf.js",
"prebuild": "rimraf dist && mkdir dist",
"build": "node esbuild.js flechette",
"types": "tsc --project tsconfig.json && npm run types:merge",
diff --git a/perf/build-perf.js b/perf/build-perf.js
new file mode 100644
index 0000000..44f3159
--- /dev/null
+++ b/perf/build-perf.js
@@ -0,0 +1,70 @@
+import { Bool, DateDay, Dictionary, Float64, Int32, Utf8, vectorFromArray } from 'apache-arrow';
+import { bool, columnFromArray, dateDay, dictionary, float64, int32, tableFromColumns, tableToIPC, utf8 } from '../src/index.js';
+import { bools, dates, floats, ints, sample, strings, uniqueStrings } from './data.js';
+import { benchmark } from './util.js';
+
+function fl(data, typeKey) {
+ const type = typeKey === 'bool' ? bool()
+ : typeKey === 'int' ? int32()
+ : typeKey === 'float' ? float64()
+ : typeKey === 'utf8' ? utf8()
+ : typeKey === 'date' ? dateDay()
+ : typeKey === 'dict' ? dictionary(utf8(), int32())
+ : null;
+ return columnFromArray(data, type);
+}
+
+function aa(data, typeKey) {
+ const type = typeKey === 'bool' ? new Bool()
+ : typeKey === 'int' ? new Int32()
+ : typeKey === 'float' ? new Float64()
+ : typeKey === 'utf8' ? new Utf8()
+ : typeKey === 'date' ? new DateDay()
+ : typeKey === 'dict' ? new Dictionary(new Utf8(), new Int32())
+ : null;
+ return vectorFromArray(data, type);
+}
+
+function js(data) {
+ return JSON.stringify(data);
+}
+
+function run(N, nulls, msg, iter = 5) {
+ const int = ints(N, -10000, 10000, nulls);
+ const float = floats(N, -10000, 10000, nulls);
+ const bool = bools(N, nulls);
+ const utf8 = strings(N, nulls);
+ const date = dates(N, nulls);
+ const dict = sample(N, uniqueStrings(100), nulls);
+
+ console.log(`\n** Build performance for ${msg} **\n`);
+ trial('Int Column', int, 'int', iter);
+ trial('Float Column', float, 'float', iter);
+ trial('Bool Column', bool, 'bool', iter);
+ trial('Date Column', date, 'date', iter);
+ trial('Utf8 Column', utf8, 'utf8', iter);
+ trial('Dict Utf8 Column', dict, 'dict', iter);
+}
+
+function trial(task, data, typeKey, iter) {
+ const jl = new TextEncoder().encode(JSON.stringify(data)).length;
+ const al = tableToIPC(tableFromColumns({ v: fl(data, typeKey) })).length;
+ const sz = `json ${(jl/1e6).toFixed(1)} MB, arrow ${(al/1e6).toFixed(1)} MB`;
+
+ console.log(`${task} (${iter} iteration${iter === 1 ? '' : 's'}, ${sz})`);
+  const j = benchmark(() => js(data), iter);
+ const a = benchmark(() => aa(data, typeKey), iter);
+ const f = benchmark(() => fl(data, typeKey), iter);
+ const p = Object.keys(a);
+
+ console.table(p.map(key => ({
+ measure: key,
+ json: +(j[key].toFixed(2)),
+ 'arrow-js': +(a[key].toFixed(2)),
+ flechette: +(f[key].toFixed(2)),
+ ratio: +((a[key] / f[key]).toFixed(2))
+ })));
+}
+
+run(1e6, 0, '1M values');
+run(1e6, 0.05, '1M values, 5% nulls');
diff --git a/perf/data.js b/perf/data.js
new file mode 100644
index 0000000..a972533
--- /dev/null
+++ b/perf/data.js
@@ -0,0 +1,95 @@
+export function rint(min, max) {
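+  // when called with a single argument, returns an integer in [0, min)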
+ let delta = min;
+ if (max === undefined) {
+ min = 0;
+ } else {
+ delta = max - min;
+ }
+ return (min + delta * Math.random()) | 0;
+}
+
+export function ints(n, min, max, nullf) {
+ const data = [];
+ for (let i = 0; i < n; ++i) {
+ const v = nullf && Math.random() < nullf ? null : rint(min, max);
+ data.push(v);
+ }
+ return data;
+}
+
+export function floats(n, min, max, nullf) {
+ const data = [];
+ const delta = max - min;
+ for (let i = 0; i < n; ++i) {
+ const v = nullf && Math.random() < nullf
+ ? null
+ : (min + delta * Math.random());
+ data.push(v);
+ }
+ return data;
+}
+
+export function dates(n, nullf) {
+ const data = [];
+ for (let i = 0; i < n; ++i) {
+ const v = nullf && Math.random() < nullf
+ ? null
+ : new Date(Date.UTC(1970 + rint(0, 41), 0, rint(1, 366)));
+ data.push(v);
+ }
+ return data;
+}
+
+export function uniqueStrings(n) {
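+  // build pronounceable strings by alternating consonants and vowels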
+ const c = 'bcdfghjlmpqrstvwxyz';
+ const v = 'aeiou';
+ const cn = c.length;
+ const vn = v.length;
+ const data = [];
+ const map = {};
+ while (data.length < n) {
+ const s = c[rint(cn)]
+ + v[rint(vn)] + c[rint(cn)] + c[rint(cn)]
+ + v[rint(vn)] + c[rint(cn)] + c[rint(cn)];
+ if (!map[s]) {
+ data.push(s);
+ map[s] = 1;
+ }
+ }
+ return data;
+}
+
+export function strings(n, nullf) {
+ const c = 'bcdfghjlmpqrstvwxyz';
+ const v = 'aeiou';
+ const cn = c.length;
+ const vn = v.length;
+ const data = [];
+ while (data.length < n) {
+ const s = nullf && Math.random() < nullf
+ ? null
+ : (c[rint(cn)] + v[rint(vn)] + c[rint(cn)] + c[rint(cn)]);
+ data.push(s);
+ }
+ return data;
+}
+
+export function bools(n, nullf) {
+ const data = [];
+ for (let i = 0; i < n; ++i) {
+ const v = nullf && Math.random() < nullf ? null : (Math.random() < 0.5);
+ data.push(v);
+ }
+ return data;
+}
+
+export function sample(n, values, nullf) {
+ const data = [];
+ for (let i = 0; i < n; ++i) {
+ const v = nullf && Math.random() < nullf
+ ? null
+ : values[~~(values.length * Math.random())];
+ data.push(v);
+ }
+ return data;
+}
diff --git a/perf/perf-test.js b/perf/decode-perf.js
similarity index 81%
rename from perf/perf-test.js
rename to perf/decode-perf.js
index e05b8ec..c6d90ed 100644
--- a/perf/perf-test.js
+++ b/perf/decode-perf.js
@@ -7,8 +7,8 @@ import { benchmark } from './util.js';
const fl = bytes => flTable(bytes, { useBigInt: true });
const aa = bytes => aaTable(bytes);
-// parse ipc data to columns
-function parseIPC(table) {
+// decode ipc data to columns
+function decodeIPC(table) {
return table.schema.fields.map((f, i) => table.getChildAt(i));
}
@@ -66,15 +66,14 @@ function trial(task, name, bytes, method, iter) {
})));
}
-async function run(file) {
- console.log(`** Performance tests using ${file} **\n`);
+async function run(file, iter = 5) {
+ console.log(`\n** Decoding performance using ${file} **\n`);
const bytes = new Uint8Array(await readFile(`test/data/${file}`));
- trial('Parse Table from IPC', file, bytes, parseIPC, 10);
- trial('Extract Arrays', file, bytes, extractArrays, 10);
- trial('Iterate Values', file, bytes, iterateValues, 10);
- trial('Random Access', file, bytes, randomAccess, 10);
- trial('Visit Row Objects', file, bytes, visitObjects, 5);
- console.log();
+ trial('Decode Table from IPC', file, bytes, decodeIPC, iter);
+ trial('Extract Arrays', file, bytes, extractArrays, iter);
+ trial('Iterate Values', file, bytes, iterateValues, iter);
+ trial('Random Access', file, bytes, randomAccess, iter);
+ trial('Visit Row Objects', file, bytes, visitObjects, iter);
}
await run('flights.arrows');
diff --git a/perf/encode-perf.js b/perf/encode-perf.js
new file mode 100644
index 0000000..e98a85d
--- /dev/null
+++ b/perf/encode-perf.js
@@ -0,0 +1,32 @@
+import { readFile } from 'node:fs/promises';
+import { tableFromIPC as aaTable, tableToIPC as aaToIPC } from 'apache-arrow';
+import { tableFromIPC as flTable, tableToIPC as flToIPC } from '../src/index.js';
+import { benchmark } from './util.js';
+
+// table encoding methods
+const fl = table => flToIPC(table);
+const aa = table => aaToIPC(table);
+
+function trial(task, name, bytes, iter) {
+ console.log(`${task} (${name}, ${iter} iteration${iter === 1 ? '' : 's'})`);
+ const aat = aaTable(bytes);
+ const flt = flTable(bytes, { useBigInt: true });
+ const a = benchmark(() => aa(aat), iter);
+ const f = benchmark(() => fl(flt), iter);
+ const p = Object.keys(a);
+ console.table(p.map(key => ({
+ measure: key,
+ 'arrow-js': +(a[key].toFixed(2)),
+ flechette: +(f[key].toFixed(2)),
+ ratio: +((a[key] / f[key]).toFixed(2))
+ })));
+}
+
+async function run(file, iter = 5) {
+ console.log(`\n** Encoding performance using ${file} **\n`);
+ const bytes = new Uint8Array(await readFile(`test/data/${file}`));
+ trial('Encode Table to IPC', file, bytes, iter);
+}
+
+await run('flights.arrows');
+await run('scrabble.arrows');
diff --git a/perf/run-all.js b/perf/run-all.js
new file mode 100755
index 0000000..43f5f32
--- /dev/null
+++ b/perf/run-all.js
@@ -0,0 +1,19 @@
+import { spawn } from 'node:child_process';
+
+async function node(cmdstr) {
+ return new Promise((resolve, reject) => {
+ const child = spawn('node', [cmdstr], {
+ cwd: process.cwd(),
+ detached: true,
+ stdio: 'inherit'
+ });
+ child.on('close', code => {
+      if (code === 0) resolve();
+      else reject(code);
+ });
+ });
+}
+
+await node('./perf/decode-perf.js');
+await node('./perf/encode-perf.js');
+await node('./perf/build-perf.js');
diff --git a/src/array-types.js b/src/array-types.js
deleted file mode 100644
index 236cb33..0000000
--- a/src/array-types.js
+++ /dev/null
@@ -1,24 +0,0 @@
-export const uint8 = Uint8Array;
-export const uint16 = Uint16Array;
-export const uint32 = Uint32Array;
-export const uint64 = BigUint64Array;
-export const int8 = Int8Array;
-export const int16 = Int16Array;
-export const int32 = Int32Array;
-export const int64 = BigInt64Array;
-export const float32 = Float32Array;
-export const float64 = Float64Array;
-
-/**
- * Return the appropriate typed array constructor for the given
- * integer type metadata.
- * @param {number} bitWidth The integer size in bits.
- * @param {boolean} signed Flag indicating if the integer is signed.
- * @returns {import('./types.js').IntArrayConstructor}
- */
-export function arrayTypeInt(bitWidth, signed) {
- const i = Math.log2(bitWidth) - 3;
- return (
- signed ? [int8, int16, int32, int64] : [uint8, uint16, uint32, uint64]
- )[i];
-}
diff --git a/src/batch-type.js b/src/batch-type.js
new file mode 100644
index 0000000..a2b6aaa
--- /dev/null
+++ b/src/batch-type.js
@@ -0,0 +1,66 @@
+import { BinaryBatch, BinaryViewBatch, BoolBatch, DateBatch, DateDayBatch, DateDayMillisecondBatch, DecimalBigIntBatch, DecimalNumberBatch, DenseUnionBatch, DictionaryBatch, DirectBatch, FixedBinaryBatch, FixedListBatch, Float16Batch, Int64Batch, IntervalDayTimeBatch, IntervalMonthDayNanoBatch, IntervalYearMonthBatch, LargeBinaryBatch, LargeListBatch, LargeListViewBatch, LargeUtf8Batch, ListBatch, ListViewBatch, MapBatch, MapEntryBatch, NullBatch, RunEndEncodedBatch, SparseUnionBatch, StructBatch, TimestampMicrosecondBatch, TimestampMillisecondBatch, TimestampNanosecondBatch, TimestampSecondBatch, Utf8Batch, Utf8ViewBatch } from './batch.js';
+import { DateUnit, IntervalUnit, TimeUnit, Type } from './constants.js';
+import { invalidDataType } from './data-types.js';
+
+export function batchType(type, options = {}) {
+ const { typeId, bitWidth, precision, unit } = type;
+ const { useBigInt, useDate, useDecimalBigInt, useMap } = options;
+
+ switch (typeId) {
+ case Type.Null: return NullBatch;
+ case Type.Bool: return BoolBatch;
+ case Type.Int:
+ case Type.Time:
+ case Type.Duration:
+ return useBigInt || bitWidth < 64 ? DirectBatch : Int64Batch;
+ case Type.Float:
+ return precision ? DirectBatch : Float16Batch;
+ case Type.Date:
+ return wrap(
+ unit === DateUnit.DAY ? DateDayBatch : DateDayMillisecondBatch,
+ useDate && DateBatch
+ );
+ case Type.Timestamp:
+ return wrap(
+ unit === TimeUnit.SECOND ? TimestampSecondBatch
+ : unit === TimeUnit.MILLISECOND ? TimestampMillisecondBatch
+ : unit === TimeUnit.MICROSECOND ? TimestampMicrosecondBatch
+ : TimestampNanosecondBatch,
+ useDate && DateBatch
+ );
+ case Type.Decimal:
+ return useDecimalBigInt ? DecimalBigIntBatch : DecimalNumberBatch;
+ case Type.Interval:
+ return unit === IntervalUnit.DAY_TIME ? IntervalDayTimeBatch
+ : unit === IntervalUnit.YEAR_MONTH ? IntervalYearMonthBatch
+ : IntervalMonthDayNanoBatch;
+ case Type.FixedSizeBinary: return FixedBinaryBatch;
+ case Type.Utf8: return Utf8Batch;
+ case Type.LargeUtf8: return LargeUtf8Batch;
+ case Type.Binary: return BinaryBatch;
+ case Type.LargeBinary: return LargeBinaryBatch;
+ case Type.BinaryView: return BinaryViewBatch;
+ case Type.Utf8View: return Utf8ViewBatch;
+ case Type.List: return ListBatch;
+ case Type.LargeList: return LargeListBatch;
+ case Type.Map: return useMap ? MapBatch : MapEntryBatch;
+ case Type.ListView: return ListViewBatch;
+ case Type.LargeListView: return LargeListViewBatch;
+ case Type.FixedSizeList: return FixedListBatch;
+ case Type.Struct: return StructBatch;
+ case Type.RunEndEncoded: return RunEndEncodedBatch;
+ case Type.Dictionary: return DictionaryBatch;
+ case Type.Union: return type.mode ? DenseUnionBatch : SparseUnionBatch;
+ }
+ throw new Error(invalidDataType(typeId));
+}
+
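+/**
+ * Wrap a base batch class in a converting wrapper class (such as
+ * DateBatch), so that values are transformed upon access.
+ */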
+function wrap(BaseClass, WrapperClass) {
+ return WrapperClass
+ ? class WrapBatch extends WrapperClass {
+ constructor(options) {
+ super(new BaseClass(options));
+ }
+ }
+ : BaseClass;
+}
diff --git a/src/batch.js b/src/batch.js
index 7548f66..3db8069 100644
--- a/src/batch.js
+++ b/src/batch.js
@@ -1,5 +1,7 @@
-import { float64 } from './array-types.js';
-import { bisect, decodeBit, decodeUtf8, divide, readInt32, readInt64AsNum, toNumber } from './util.js';
+import { bisect, float64Array } from './util/arrays.js';
+import { divide, fromDecimal128, fromDecimal256, toNumber } from './util/numbers.js';
+import { decodeBit, readInt32, readInt64 } from './util/read.js';
+import { decodeUtf8 } from './util/strings.js';
/**
* Check if the input is a batch that supports direct access to
@@ -30,6 +32,7 @@ export class Batch {
* @param {object} options
* @param {number} options.length The length of the batch
* @param {number} options.nullCount The null value count
+ * @param {import('./types.js').DataType} options.type The data type.
* @param {Uint8Array} [options.validity] Validity bitmap buffer
* @param {import('./types.js').TypedArray} [options.values] Values buffer
* @param {import('./types.js').OffsetArray} [options.offsets] Offsets buffer
@@ -39,6 +42,7 @@ export class Batch {
constructor({
length,
nullCount,
+ type,
validity,
values,
offsets,
@@ -47,6 +51,7 @@ export class Batch {
}) {
this.length = length;
this.nullCount = nullCount;
+ this.type = type;
this.validity = validity;
this.values = values;
this.offsets = offsets;
@@ -54,7 +59,9 @@ export class Batch {
this.children = children;
// optimize access if this batch has no null values
- if (!nullCount) {
+ // some types (like union) may have null values in
+ // child batches, but no top-level validity buffer
+ if (!nullCount || !this.validity) {
/** @type {(index: number) => T | null} */
this.at = index => this.value(index);
}
@@ -134,6 +141,7 @@ export class DirectBatch extends Batch {
* @param {object} options
* @param {number} options.length The length of the batch
* @param {number} options.nullCount The null value count
+ * @param {import('./types.js').DataType} options.type The data type.
* @param {Uint8Array} [options.validity] Validity bitmap buffer
* @param {import('./types.js').TypedArray} options.values Values buffer
*/
@@ -176,7 +184,7 @@ export class DirectBatch extends Batch {
* @extends {Batch}
*/
export class NumberBatch extends Batch {
- static ArrayType = float64;
+ static ArrayType = float64Array;
}
/**
@@ -249,56 +257,53 @@ export class BoolBatch extends ArrayBatch {
}
}
-// generate base values for big integers represented in a Uint32Array
-const BASE32 = Array.from(
- { length: 8 },
- (_, i) => Math.pow(2, i * 32)
-);
+/**
+ * An abstract class for a batch of 128- or 256-bit decimal numbers,
+ * accessed in strided BigUint64Arrays.
+ * @template T
+ * @extends {Batch}
+ */
+export class DecimalBatch extends Batch {
+ constructor(options) {
+ super(options);
+ const { bitWidth, scale } = /** @type {import('./types.js').DecimalType} */ (this.type);
+ this.decimal = bitWidth === 128 ? fromDecimal128 : fromDecimal256;
+ this.scale = 10n ** BigInt(scale);
+ }
+}
/**
- * A batch of 128- or 256-bit decimal numbers, accessed as unsigned
- * 32-bit ints and coerced to 64-bit numbers. The number coercion
- * may be lossy if the decimal precision can not be represented in
- * a 64-bit floating point format.
+ * A batch of 128- or 256-bit decimal numbers, returned as converted
+ * 64-bit numbers. The number coercion may be lossy if the decimal
+ * precision can not be represented in a 64-bit floating point format.
+ * @extends {DecimalBatch}
*/
-export class DecimalBatch extends NumberBatch {
+export class DecimalNumberBatch extends DecimalBatch {
+ static ArrayType = float64Array;
/**
- * Create a new decimal batch.
- * @param {object} options
- * @param {number} options.length The length of the batch
- * @param {number} options.nullCount The null value count
- * @param {Uint8Array} [options.validity] Validity bitmap buffer
- * @param {import('./types.js').TypedArray} options.values Values buffer
- * @param {number} options.bitWidth The decimal bit width
- * @param {number} options.scale The number of decimal digits
+ * @param {number} index The value index
*/
- constructor({ bitWidth, scale, ...rest }) {
- super(rest);
- this.stride = bitWidth >> 5, // 8 bits/byte and 4 bytes/uint32;
- this.scale = Math.pow(10, scale);
+ value(index) {
+ return divide(
+ this.decimal(/** @type {BigUint64Array} */ (this.values), index),
+ this.scale
+ );
}
+}
+/**
+ * A batch of 128- or 256-bit decimal numbers, returned as scaled
+ * bigint values, such that all fractional digits have been shifted
+ * to integer places by the decimal type scale factor.
+ * @extends {DecimalBatch}
+ */
+export class DecimalBigIntBatch extends DecimalBatch {
+ static ArrayType = Array;
/**
* @param {number} index The value index
*/
value(index) {
- // TODO: check magnitude, use slower but more accurate BigInt ops if needed?
- // Using numbers we can prep with integers up to MAX_SAFE_INTEGER (2^53 - 1)
- const v = /** @type {Uint32Array} */ (this.values);
- const n = this.stride;
- const off = index << 2;
- let x = 0;
- if ((v[n - 1] | 0) < 0) {
- for (let i = 0; i < n; ++i) {
- x += ~v[i + off] * BASE32[i];
- }
- x = -(x + 1);
- } else {
- for (let i = 0; i < n; ++i) {
- x += v[i + off] * BASE32[i];
- }
- }
- return x / this.scale;
+ return this.decimal(/** @type {BigUint64Array} */ (this.values), index);
}
}
@@ -430,11 +435,11 @@ export class IntervalMonthDayNanoBatch extends ArrayBatch {
*/
value(index) {
const values = /** @type {Uint8Array} */ (this.values);
- const base = index << 2;
+ const base = index << 4;
return Float64Array.of(
readInt32(values, base),
readInt32(values, base + 4),
- readInt64AsNum(values, base + 8)
+ readInt64(values, base + 8)
);
}
}
@@ -578,20 +583,11 @@ export class LargeListViewBatch extends ArrayBatch {
* @extends {ArrayBatch}
*/
class FixedBatch extends ArrayBatch {
- /**
- * Create a new column batch with fixed stride.
- * @param {object} options
- * @param {number} options.length The length of the batch
- * @param {number} options.nullCount The null value count
- * @param {Uint8Array} [options.validity] Validity bitmap buffer
- * @param {Uint8Array} [options.values] Values buffer
- * @param {Batch[]} [options.children] Children batches
- * @param {number} options.stride The fixed stride (size) of values.
- */
- constructor({ stride, ...rest }) {
- super(rest);
+ constructor(options) {
+ super(options);
/** @type {number} */
- this.stride = stride;
+ // @ts-ignore
+ this.stride = this.type.stride;
}
}
@@ -688,26 +684,28 @@ export class SparseUnionBatch extends ArrayBatch {
* @param {object} options
* @param {number} options.length The length of the batch
* @param {number} options.nullCount The null value count
+ * @param {import('./types.js').DataType} options.type The data type.
* @param {Uint8Array} [options.validity] Validity bitmap buffer
* @param {Int32Array} [options.offsets] Offsets buffer
* @param {Batch[]} options.children Children batches
* @param {Int8Array} options.typeIds Union type ids buffer
 * @param {Record<number, number>} options.map A typeId to children index map
*/
- constructor({ typeIds, map, ...rest }) {
- super(rest);
+ constructor({ typeIds, ...options }) {
+ super(options);
/** @type {Int8Array} */
this.typeIds = typeIds;
     /** @type {Record<number, number>} */
- this.map = map;
+ // @ts-ignore
+ this.typeMap = this.type.typeMap;
}
/**
* @param {number} index The value index.
*/
- value(index) {
- const { typeIds, children, map } = this;
- return children[map[typeIds[index]]].at(index);
+ value(index, offset = index) {
+ const { typeIds, children, typeMap } = this;
+ return children[typeMap[typeIds[index]]].at(offset);
}
}
@@ -722,7 +720,7 @@ export class DenseUnionBatch extends SparseUnionBatch {
* @param {number} index The value index.
*/
value(index) {
- return super.value(/** @type {number} */ (this.offsets[index]));
+ return super.value(index, /** @type {number} */ (this.offsets[index]));
}
}
@@ -732,19 +730,11 @@ export class DenseUnionBatch extends SparseUnionBatch {
 * @extends {ArrayBatch<Record<string, any>>}
*/
export class StructBatch extends ArrayBatch {
- /**
- * Create a new column batch.
- * @param {object} options
- * @param {number} options.length The length of the batch
- * @param {number} options.nullCount The null value count
- * @param {Uint8Array} [options.validity] Validity bitmap buffer
- * @param {Batch[]} options.children Children batches
- * @param {string[]} options.names Child batch names
- */
- constructor({ names, ...rest }) {
- super(rest);
+ constructor(options) {
+ super(options);
/** @type {string[]} */
- this.names = names;
+ // @ts-ignore
+ this.names = this.type.children.map(child => child.name);
}
/**
@@ -786,18 +776,16 @@ export class RunEndEncodedBatch extends ArrayBatch {
*/
export class DictionaryBatch extends ArrayBatch {
/**
- * Create a new dictionary batch.
- * @param {object} options Batch options.
- * @param {number} options.length The length of the batch
- * @param {number} options.nullCount The null value count
- * @param {Uint8Array} [options.validity] Validity bitmap buffer
- * @param {import('./types.js').IntegerArray} options.values Values buffer
- * @param {import('./column.js').Column} options.dictionary
- * The dictionary of column values.
+ * Register the backing dictionary. Dictionaries are added
+ * after batch creation as the complete dictionary may not
+ * be finished across multiple record batches.
+ * @param {import('./column.js').Column} dictionary
+ * The dictionary of column values.
*/
- constructor({ dictionary, ...rest }) {
- super(rest);
+ setDictionary(dictionary) {
+ this.dictionary = dictionary;
this.cache = dictionary.cache();
+ return this;
}
/**
@@ -826,12 +814,13 @@ class ViewBatch extends ArrayBatch {
* @param {object} options Batch options.
* @param {number} options.length The length of the batch
* @param {number} options.nullCount The null value count
+ * @param {import('./types.js').DataType} options.type The data type.
* @param {Uint8Array} [options.validity] Validity bitmap buffer
* @param {Uint8Array} options.values Values buffer
* @param {Uint8Array[]} options.data View data buffers
*/
- constructor({ data, ...rest }) {
- super(rest);
+ constructor({ data, ...options }) {
+ super(options);
this.data = data;
}
diff --git a/src/build/buffer.js b/src/build/buffer.js
new file mode 100644
index 0000000..d495be6
--- /dev/null
+++ b/src/build/buffer.js
@@ -0,0 +1,90 @@
+import { align, grow, uint8Array } from '../util/arrays.js';
+
+/**
+ * Create a new resizable buffer instance.
+ * @param {import('../types.js').TypedArrayConstructor} [arrayType]
+ * The array type.
+ * @returns {Buffer} The buffer.
+ */
+export function buffer(arrayType) {
+ return new Buffer(arrayType);
+}
+
+/**
+ * Resizable byte buffer.
+ */
+export class Buffer {
+ /**
+ * Create a new resizable buffer instance.
+ * @param {import('../types.js').TypedArrayConstructor} arrayType
+ */
+ constructor(arrayType = uint8Array) {
+ this.buf = new arrayType(512);
+ }
+ /**
+ * Return the underlying data as a 64-bit aligned array of minimum size.
+ * @param {number} size The desired minimum array size.
+ * @returns {import('../types.js').TypedArray} The 64-bit aligned array.
+ */
+ array(size) {
+ return align(this.buf, size);
+ }
+ /**
+ * Prepare for writes to the given index, resizing as necessary.
+ * @param {number} index The array index to prepare to write to.
+ */
+ prep(index) {
+ if (index >= this.buf.length) {
+ this.buf = grow(this.buf, index);
+ }
+ }
+ /**
+ * Return the value at the given index.
+ * @param {number} index The array index.
+ */
+ get(index) {
+ return this.buf[index];
+ }
+ /**
+ * Set a value at the given index.
+ * @param {number | bigint} value The value to set.
+ * @param {number} index The index to write to.
+ */
+ set(value, index) {
+ this.prep(index);
+ this.buf[index] = value;
+ }
+ /**
+ * Write a byte array at the given index. The method should be called
+ * only when the underlying buffer is of type Uint8Array.
+ * @param {Uint8Array} bytes The byte array.
+ * @param {number} index The starting index to write to.
+ */
+ write(bytes, index) {
+ this.prep(index + bytes.length);
+ /** @type {Uint8Array} */ (this.buf).set(bytes, index);
+ }
+}
+
+/**
+ * Create a new resizable bitmap instance.
+ * @returns {Bitmap} The bitmap buffer.
+ */
+export function bitmap() {
+ return new Bitmap();
+}
+
+/**
+ * Resizable bitmap buffer.
+ */
+export class Bitmap extends Buffer {
+ /**
+ * Set a bit to true at the given bitmap index.
+ * @param {number} index The index to write to.
+ */
+ set(index) {
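+    // locate the containing byte, then set the bit within it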
+ const i = index >> 3;
+ this.prep(i);
+ /** @type {Uint8Array} */ (this.buf)[i] |= (1 << (index % 8));
+ }
+}
diff --git a/src/build/builder.js b/src/build/builder.js
new file mode 100644
index 0000000..af71251
--- /dev/null
+++ b/src/build/builder.js
@@ -0,0 +1,124 @@
+import { batchType } from '../batch-type.js';
+import { IntervalUnit, Type } from '../constants.js';
+import { invalidDataType } from '../data-types.js';
+import { isInt64ArrayType } from '../util/arrays.js';
+import { toBigInt, toDateDay, toFloat16, toTimestamp, toYearMonth } from '../util/numbers.js';
+import { BinaryBuilder } from './builders/binary.js';
+import { BoolBuilder } from './builders/bool.js';
+import { DecimalBuilder } from './builders/decimal.js';
+import { DictionaryBuilder, dictionaryValues } from './builders/dictionary.js';
+import { FixedSizeBinaryBuilder } from './builders/fixed-size-binary.js';
+import { FixedSizeListBuilder } from './builders/fixed-size-list.js';
+import { IntervalDayTimeBuilder, IntervalMonthDayNanoBuilder } from './builders/interval.js';
+import { ListBuilder } from './builders/list.js';
+import { MapBuilder } from './builders/map.js';
+import { RunEndEncodedBuilder } from './builders/run-end-encoded.js';
+import { StructBuilder } from './builders/struct.js';
+import { DenseUnionBuilder, SparseUnionBuilder } from './builders/union.js';
+import { Utf8Builder } from './builders/utf8.js';
+import { DirectBuilder, Int64Builder, TransformBuilder } from './builders/values.js';
+
+/**
+ * Create a new context object for shared builder state.
+ * @param {import('../types.js').ExtractionOptions} [options]
+ * Batch extraction options.
+ * @param {Map<number, ReturnType<typeof dictionaryValues>>} [dictMap]
+ * A map of dictionary ids to value builder helpers.
+ */
+export function builderContext(options, dictMap = new Map) {
+ let dictId = 0;
+ return {
+ batchType(type) {
+ return batchType(type, options);
+ },
+ dictionary(type, id) {
+ let dict;
+ if (id != null) {
+ dict = dictMap.get(id);
+ } else {
+        // find the next unused dictionary id
+        while (dictMap.has(dictId)) ++dictId;
+        id = dictId;
+ }
+ if (!dict) {
+ dictMap.set(id, dict = dictionaryValues(id, type, this));
+ }
+ return dict;
+ },
+ finish() {
+ for (const dict of dictMap.values()) {
+ dict.finish(options);
+ }
+ }
+ };
+}
+
+/**
+ * Returns a batch builder for the given type and builder context.
+ * @param {import('../types.js').DataType} type A data type.
+ * @param {ReturnType<typeof builderContext>} [ctx] A builder context.
+ * @returns {import('./builders/batch.js').BatchBuilder}
+ */
+export function builder(type, ctx = builderContext()) {
+ const { typeId } = type;
+ switch (typeId) {
+ case Type.Int:
+ case Type.Time:
+ case Type.Duration:
+ return isInt64ArrayType(type.values)
+ ? new Int64Builder(type, ctx)
+ : new DirectBuilder(type, ctx);
+ case Type.Float:
+ return type.precision
+ ? new DirectBuilder(type, ctx)
+        : new TransformBuilder(type, ctx, toFloat16);
+ case Type.Binary:
+ case Type.LargeBinary:
+ return new BinaryBuilder(type, ctx);
+ case Type.Utf8:
+ case Type.LargeUtf8:
+ return new Utf8Builder(type, ctx);
+ case Type.Bool:
+ return new BoolBuilder(type, ctx);
+ case Type.Decimal:
+ return new DecimalBuilder(type, ctx);
+ case Type.Date:
+ return new TransformBuilder(type, ctx, type.unit ? toBigInt : toDateDay);
+ case Type.Timestamp:
+ return new TransformBuilder(type, ctx, toTimestamp(type.unit));
+ case Type.Interval:
+ switch (type.unit) {
+ case IntervalUnit.YEAR_MONTH:
+ return new TransformBuilder(type, ctx, toYearMonth);
+ case IntervalUnit.DAY_TIME:
+ return new IntervalDayTimeBuilder(type, ctx);
+ case IntervalUnit.MONTH_DAY_NANO:
+ return new IntervalMonthDayNanoBuilder(type, ctx);
+ }
+ break;
+ case Type.List:
+ case Type.LargeList:
+ return new ListBuilder(type, ctx);
+ case Type.Struct:
+ return new StructBuilder(type, ctx);
+ case Type.Union:
+ return type.mode
+ ? new DenseUnionBuilder(type, ctx)
+ : new SparseUnionBuilder(type, ctx);
+ case Type.FixedSizeBinary:
+ return new FixedSizeBinaryBuilder(type, ctx);
+ case Type.FixedSizeList:
+ return new FixedSizeListBuilder(type, ctx);
+ case Type.Map:
+ return new MapBuilder(type, ctx);
+ case Type.RunEndEncoded:
+ return new RunEndEncodedBuilder(type, ctx);
+ case Type.Dictionary:
+ return new DictionaryBuilder(type, ctx);
+ }
+ // case Type.BinaryView:
+ // case Type.Utf8View:
+ // case Type.ListView:
+ // case Type.LargeListView:
+ throw new Error(invalidDataType(typeId));
+}
diff --git a/src/build/builders/batch.js b/src/build/builders/batch.js
new file mode 100644
index 0000000..973d455
--- /dev/null
+++ b/src/build/builders/batch.js
@@ -0,0 +1,49 @@
+/**
+ * Abstract class for building a column data batch.
+ */
+export class BatchBuilder {
+ constructor(type, ctx) {
+ this.type = type;
+ this.ctx = ctx;
+ this.batchClass = ctx.batchType(type);
+ }
+
+ /**
+ * Initialize the builder state.
+ * @returns {this} This builder.
+ */
+ init() {
+ this.index = -1;
+ return this;
+ }
+
+ /**
+ * Write a value to the builder.
+ * @param {*} value
+ * @param {number} index
+ * @returns {boolean | void}
+ */
+ set(value, index) {
+ this.index = index;
+ return false;
+ }
+
+ /**
+ * Returns a batch constructor options object.
+ * Used internally to marshal batch data.
+   * @returns {Record<string, any>}
+ */
+ done() {
+ return null;
+ }
+
+ /**
+ * Returns a completed batch and reinitializes the builder state.
+ * @returns {import('../../batch.js').Batch}
+ */
+ batch() {
+ const b = new this.batchClass(this.done());
+ this.init();
+ return b;
+ }
+}
diff --git a/src/build/builders/binary.js b/src/build/builders/binary.js
new file mode 100644
index 0000000..42fa1be
--- /dev/null
+++ b/src/build/builders/binary.js
@@ -0,0 +1,37 @@
+import { toOffset } from '../../util/numbers.js';
+import { buffer } from '../buffer.js';
+import { ValidityBuilder } from './validity.js';
+
+/**
+ * Builder for batches of binary-typed data.
+ */
+export class BinaryBuilder extends ValidityBuilder {
+ constructor(type, ctx) {
+ super(type, ctx);
+ this.toOffset = toOffset(type.offsets);
+ }
+
+ init() {
+ this.offsets = buffer(this.type.offsets);
+ this.values = buffer();
+ this.pos = 0;
+ return super.init();
+ }
+
+ set(value, index) {
+ const { offsets, values, toOffset } = this;
+ if (super.set(value, index)) {
+ values.write(value, this.pos);
+ this.pos += value.length;
+ }
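+    // always write an offset so null slots keep zero-length extents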
+ offsets.set(toOffset(this.pos), index + 1);
+ }
+
+ done() {
+ return {
+ ...super.done(),
+ offsets: this.offsets.array(this.index + 2),
+ values: this.values.array(this.pos + 1)
+ };
+ }
+}
diff --git a/src/build/builders/bool.js b/src/build/builders/bool.js
new file mode 100644
index 0000000..721daa1
--- /dev/null
+++ b/src/build/builders/bool.js
@@ -0,0 +1,28 @@
+import { bitmap } from '../buffer.js';
+import { ValidityBuilder } from './validity.js';
+
+/**
+ * Builder for batches of bool-typed data.
+ */
+export class BoolBuilder extends ValidityBuilder {
+ constructor(type, ctx) {
+ super(type, ctx);
+ }
+
+ init() {
+ this.values = bitmap();
+ return super.init();
+ }
+
+ set(value, index) {
+ super.set(value, index);
+ if (value) this.values.set(index);
+ }
+
+ done() {
+ return {
+ ...super.done(),
+ values: this.values.array((this.index >> 3) + 1)
+    };
+ }
+}
diff --git a/src/build/builders/decimal.js b/src/build/builders/decimal.js
new file mode 100644
index 0000000..4af455e
--- /dev/null
+++ b/src/build/builders/decimal.js
@@ -0,0 +1,36 @@
+import { toDecimal } from '../../util/numbers.js';
+import { buffer } from '../buffer.js';
+import { ValidityBuilder } from './validity.js';
+
+/**
+ * Builder for batches of decimal-typed data.
+ */
+export class DecimalBuilder extends ValidityBuilder {
+ constructor(type, ctx) {
+ super(type, ctx);
+ this.scale = 10 ** type.scale;
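+    // the number of 64-bit words per decimal value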
+ this.stride = type.bitWidth >> 6;
+ }
+
+ init() {
+ this.values = buffer(this.type.values);
+ return super.init();
+ }
+
+ set(value, index) {
+ const { scale, stride, values } = this;
+ if (super.set(value, index)) {
+ values.prep((index + 1) * stride);
+ // @ts-ignore
+ toDecimal(value, values.buf, index * stride, stride, scale);
+ }
+ }
+
+ done() {
+ const { index, stride, values } = this;
+ return {
+ ...super.done(),
+ values: values.array((index + 1) * stride)
+ };
+ }
+}
diff --git a/src/build/builders/dictionary.js b/src/build/builders/dictionary.js
new file mode 100644
index 0000000..2720378
--- /dev/null
+++ b/src/build/builders/dictionary.js
@@ -0,0 +1,85 @@
+import { Column } from '../../column.js';
+import { keyString } from '../../util/strings.js';
+import { batchType } from '../../batch-type.js';
+import { buffer } from '../buffer.js';
+import { builder } from '../builder.js';
+import { ValidityBuilder } from './validity.js';
+
+/**
+ * Builder helper for creating dictionary values.
+ * @param {number} id The dictionary id.
+ * @param {import('../../types.js').DictionaryType} type
+ * The dictionary data type.
+ * @param {*} ctx The builder context.
+ * @returns {object} A dictionary values builder helper.
+ */
+export function dictionaryValues(id, type, ctx) {
+ const keys = Object.create(null);
+ const values = builder(type.dictionary, ctx);
+ const batches = [];
+
+ values.init();
+ let index = -1;
+ type.id = id;
+
+ return {
+ type,
+ values,
+
+ add(batch) {
+ batches.push(batch);
+ return batch;
+ },
+
+ key(value) {
+ const v = keyString(value);
+ let k = keys[v];
+ if (k === undefined) {
+ keys[v] = k = ++index;
+ values.set(value, k);
+ }
+ return k;
+ },
+
+ finish(options) {
+ const valueType = type.dictionary;
+ const batch = new (batchType(valueType, options))(values.done());
+ const dictionary = new Column([batch]);
+ batches.forEach(batch => batch.setDictionary(dictionary));
+ }
+ };
+}
+
+/**
+ * Builder for dictionary-typed data batches.
+ */
+export class DictionaryBuilder extends ValidityBuilder {
+ constructor(type, ctx) {
+ super(type, ctx);
+ this.dict = ctx.dictionary(type);
+ }
+
+ init() {
+ this.values = buffer(this.type.indices.values);
+ return super.init();
+ }
+
+ set(value, index) {
+ if (super.set(value, index)) {
+ this.values.set(this.dict.key(value), index);
+ }
+ }
+
+ done() {
+ return {
+ ...super.done(),
+ values: this.values.array(this.index + 1)
+ };
+ }
+
+ batch() {
+ // register batch with dictionary
+ // batch will be updated when the dictionary is finished
+ return this.dict.add(super.batch());
+ }
+}
diff --git a/src/build/builders/fixed-size-binary.js b/src/build/builders/fixed-size-binary.js
new file mode 100644
index 0000000..58f0b86
--- /dev/null
+++ b/src/build/builders/fixed-size-binary.js
@@ -0,0 +1,31 @@
+import { buffer } from '../buffer.js';
+import { ValidityBuilder } from './validity.js';
+
+/**
+ * Builder for fixed-size-binary-typed data batches.
+ */
+export class FixedSizeBinaryBuilder extends ValidityBuilder {
+ constructor(type, ctx) {
+ super(type, ctx);
+ this.stride = type.stride;
+ }
+
+ init() {
+ this.values = buffer();
+ return super.init();
+ }
+
+ set(value, index) {
+ if (super.set(value, index)) {
+ this.values.write(value, index * this.stride);
+ }
+ }
+
+ done() {
+ const { stride, values } = this;
+ return {
+ ...super.done(),
+ values: values.array(stride * (this.index + 1))
+ };
+ }
+}
diff --git a/src/build/builders/fixed-size-list.js b/src/build/builders/fixed-size-list.js
new file mode 100644
index 0000000..5508e12
--- /dev/null
+++ b/src/build/builders/fixed-size-list.js
@@ -0,0 +1,38 @@
+import { builder } from '../builder.js';
+import { ValidityBuilder } from './validity.js';
+
+/**
+ * Builder for fixed-size-list-typed data batches.
+ */
+export class FixedSizeListBuilder extends ValidityBuilder {
+ constructor(type, ctx) {
+ super(type, ctx);
+ this.child = builder(this.type.children[0].type, this.ctx);
+ this.stride = type.stride;
+ }
+
+ init() {
+ this.child.init();
+ return super.init();
+ }
+
+ set(value, index) {
+ const { child, stride } = this;
+ const base = index * stride;
+ if (super.set(value, index)) {
+ for (let i = 0; i < stride; ++i) {
+ child.set(value[i], base + i);
+ }
+ } else {
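+      // null entry: advance the child builder past the unwritten slots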
+ child.index = base + stride;
+ }
+ }
+
+ done() {
+ const { child } = this;
+ return {
+ ...super.done(),
+ children: [ child.batch() ]
+ };
+ }
+}
diff --git a/src/build/builders/interval.js b/src/build/builders/interval.js
new file mode 100644
index 0000000..101023d
--- /dev/null
+++ b/src/build/builders/interval.js
@@ -0,0 +1,51 @@
+import { toMonthDayNanoBytes } from '../../util/numbers.js';
+import { buffer } from '../buffer.js';
+import { ValidityBuilder } from './validity.js';
+
+/**
+ * Builder for day/time interval-typed data batches.
+ */
+export class IntervalDayTimeBuilder extends ValidityBuilder {
+ init() {
+ this.values = buffer(this.type.values);
+ return super.init();
+ }
+
+ set(value, index) {
+ if (super.set(value, index)) {
+ const i = index << 1;
+ this.values.set(value[0], i);
+ this.values.set(value[1], i + 1);
+ }
+ }
+
+ done() {
+ return {
+ ...super.done(),
+ values: this.values.array((this.index + 1) << 1)
+    };
+ }
+}
+
+/**
+ * Builder for month/day/nano interval-typed data batches.
+ */
+export class IntervalMonthDayNanoBuilder extends ValidityBuilder {
+ init() {
+ this.values = buffer();
+ return super.init();
+ }
+
+ set(value, index) {
+ if (super.set(value, index)) {
+ this.values.write(toMonthDayNanoBytes(value), index << 4);
+ }
+ }
+
+ done() {
+ return {
+ ...super.done(),
+ values: this.values.array((this.index + 1) << 4)
+    };
+ }
+}
diff --git a/src/build/builders/list.js b/src/build/builders/list.js
new file mode 100644
index 0000000..0ee0a0a
--- /dev/null
+++ b/src/build/builders/list.js
@@ -0,0 +1,48 @@
+import { toOffset } from '../../util/numbers.js';
+import { buffer } from '../buffer.js';
+import { builder } from '../builder.js';
+import { ValidityBuilder } from './validity.js';
+
+/**
+ * Abstract class for building list data batches.
+ */
+export class AbstractListBuilder extends ValidityBuilder {
+ constructor(type, ctx, child) {
+ super(type, ctx);
+ this.child = child;
+ }
+
+ init() {
+ this.child.init();
+ const offsetType = this.type.offsets;
+ this.offsets = buffer(offsetType);
+ this.toOffset = toOffset(offsetType);
+ this.pos = 0;
+ return super.init();
+ }
+
+ done() {
+ return {
+ ...super.done(),
+ offsets: this.offsets.array(this.index + 2),
+ children: [ this.child.batch() ]
+ };
+ }
+}
+
+/**
+ * Builder for list-typed data batches.
+ */
+export class ListBuilder extends AbstractListBuilder {
+ constructor(type, ctx) {
+ super(type, ctx, builder(type.children[0].type, ctx));
+ }
+
+ set(value, index) {
+ const { child, offsets, toOffset } = this;
+ if (super.set(value, index)) {
+ value.forEach(v => child.set(v, this.pos++));
+ }
+ offsets.set(toOffset(this.pos), index + 1);
+ }
+}
diff --git a/src/build/builders/map.js b/src/build/builders/map.js
new file mode 100644
index 0000000..e7e953b
--- /dev/null
+++ b/src/build/builders/map.js
@@ -0,0 +1,33 @@
+import { AbstractListBuilder } from './list.js';
+import { AbstractStructBuilder } from './struct.js';
+
+/**
+ * Builder for map-typed data batches.
+ */
+export class MapBuilder extends AbstractListBuilder {
+ constructor(type, ctx) {
+ super(type, ctx, new MapStructBuilder(type.children[0].type, ctx));
+ }
+
+ set(value, index) {
+ const { child, offsets, toOffset } = this;
+ if (super.set(value, index)) {
+ for (const keyValuePair of value) {
+ child.set(keyValuePair, this.pos++);
+ }
+ }
+ offsets.set(toOffset(this.pos), index + 1);
+ }
+}
+
+/**
+ * Builder for key-value struct batches within a map.
+ */
+class MapStructBuilder extends AbstractStructBuilder {
+ set(value, index) {
+ super.set(value, index);
+ const [key, val] = this.children;
+ key.set(value[0], index);
+ val.set(value[1], index);
+ }
+}
diff --git a/src/build/builders/run-end-encoded.js b/src/build/builders/run-end-encoded.js
new file mode 100644
index 0000000..9b88d82
--- /dev/null
+++ b/src/build/builders/run-end-encoded.js
@@ -0,0 +1,55 @@
+import { keyString } from '../../util/strings.js';
+import { BatchBuilder } from './batch.js';
+import { builder } from '../builder.js';
+
+const NO_VALUE = {}; // empty object that fails strict equality
+
+/**
+ * Builder for run-end-encoded-typed data batches.
+ */
+export class RunEndEncodedBuilder extends BatchBuilder {
+ constructor(type, ctx) {
+ super(type, ctx);
+ this.children = type.children.map(c => builder(c.type, ctx));
+ }
+
+ init() {
+ this.pos = 0;
+ this.key = null;
+ this.value = NO_VALUE;
+ this.children.forEach(c => c.init());
+ return super.init();
+ }
+
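+  // write the current run (its end index and value) to the child builders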
+ next() {
+ const [runs, vals] = this.children;
+ runs.set(this.index + 1, this.pos);
+ vals.set(this.value, this.pos++);
+ }
+
+ set(value, index) {
+ // perform fast strict equality test
+ if (value !== this.value) {
+ // if no match, fallback to key string test
+ const key = keyString(value);
+ if (key !== this.key) {
+ // if key doesn't match, write prior run and update
+ if (this.key) this.next();
+ this.key = key;
+ this.value = value;
+ }
+ }
+ this.index = index;
+ }
+
+ done() {
+ this.next();
+ const { children, index, type } = this;
+ return {
+ length: index + 1,
+ nullCount: 0,
+ type,
+ children: children.map(c => c.batch())
+ };
+ }
+}
diff --git a/src/build/builders/struct.js b/src/build/builders/struct.js
new file mode 100644
index 0000000..4998789
--- /dev/null
+++ b/src/build/builders/struct.js
@@ -0,0 +1,47 @@
+import { builder } from '../builder.js';
+import { ValidityBuilder } from './validity.js';
+
+/**
+ * Abstract class for building struct-typed data batches.
+ */
+export class AbstractStructBuilder extends ValidityBuilder {
+ constructor(type, ctx) {
+ super(type, ctx);
+ this.children = type.children.map(c => builder(c.type, ctx));
+ }
+
+ init() {
+ this.children.forEach(c => c.init());
+ return super.init();
+ }
+
+ done() {
+ const { children } = this;
+ children.forEach(c => c.index = this.index);
+ return {
+ ...super.done(),
+ children: children.map(c => c.batch())
+ };
+ }
+}
+
+/**
+ * Builder for struct-typed data batches.
+ */
+export class StructBuilder extends AbstractStructBuilder {
+ constructor(type, ctx) {
+ super(type, ctx);
+ this.setters = this.children.map((child, i) => {
+ const name = type.children[i].name;
+ return (value, index) => child.set(value?.[name], index);
+ });
+ }
+
+ set(value, index) {
+ super.set(value, index);
+ const setters = this.setters;
+ for (let i = 0; i < setters.length; ++i) {
+ setters[i](value, index);
+ }
+ }
+}
diff --git a/src/build/builders/union.js b/src/build/builders/union.js
new file mode 100644
index 0000000..8f86e5f
--- /dev/null
+++ b/src/build/builders/union.js
@@ -0,0 +1,81 @@
+import { int8Array } from '../../util/arrays.js';
+import { BatchBuilder } from './batch.js';
+import { buffer } from '../buffer.js';
+import { builder } from '../builder.js';
+
+/**
+ * Abstract class for building union-typed data batches.
+ */
+export class AbstractUnionBuilder extends BatchBuilder {
+ constructor(type, ctx) {
+ super(type, ctx);
+ this.children = type.children.map(c => builder(c.type, ctx));
+ this.typeMap = type.typeMap;
+ this.lookup = type.typeIdForValue;
+ }
+
+ init() {
+ this.nullCount = 0;
+ this.typeIds = buffer(int8Array);
+ this.children.forEach(c => c.init());
+ return super.init();
+ }
+
+ set(value, index) {
+ const { children, lookup, typeMap, typeIds } = this;
+ this.index = index;
+ const typeId = lookup(value, index);
+ const child = children[typeMap[typeId]];
+ typeIds.set(typeId, index);
+ if (value == null) ++this.nullCount;
+ // @ts-ignore
+ this.update(value, index, child);
+ }
+
+ done() {
+ const { children, nullCount, type, typeIds } = this;
+ const length = this.index + 1;
+ return {
+ length,
+ nullCount,
+ type,
+ typeIds: typeIds.array(length),
+ children: children.map(c => c.batch())
+ };
+ }
+}
+
+/**
+ * Builder for sparse union-typed data batches.
+ */
+export class SparseUnionBuilder extends AbstractUnionBuilder {
+ update(value, index, child) {
+ // update selected child with value
+ // then set all other children to null
+ child.set(value, index);
+ this.children.forEach(c => { if (c !== child) c.set(null, index) });
+ }
+}
+
+/**
+ * Builder for dense union-typed data batches.
+ */
+export class DenseUnionBuilder extends AbstractUnionBuilder {
+ init() {
+ this.offsets = buffer(this.type.offsets);
+ return super.init();
+ }
+
+ update(value, index, child) {
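+    // dense unions write each value at a child-relative offset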
+ const offset = child.index + 1;
+ child.set(value, offset);
+ this.offsets.set(offset, index);
+ }
+
+ done() {
+ return {
+ ...super.done(),
+ offsets: this.offsets.array(this.index + 1)
+ };
+ }
+}
diff --git a/src/build/builders/utf8.js b/src/build/builders/utf8.js
new file mode 100644
index 0000000..9b2b600
--- /dev/null
+++ b/src/build/builders/utf8.js
@@ -0,0 +1,11 @@
+import { encodeUtf8 } from '../../util/strings.js';
+import { BinaryBuilder } from './binary.js';
+
+/**
+ * Builder for utf8-typed data batches.
+ */
+export class Utf8Builder extends BinaryBuilder {
+ set(value, index) {
+ super.set(value && encodeUtf8(value), index);
+ }
+}
diff --git a/src/build/builders/validity.js b/src/build/builders/validity.js
new file mode 100644
index 0000000..e4ba872
--- /dev/null
+++ b/src/build/builders/validity.js
@@ -0,0 +1,46 @@
+import { uint8Array } from '../../util/arrays.js';
+import { bitmap } from '../buffer.js';
+import { BatchBuilder } from './batch.js';
+
+/**
+ * Builder for validity bitmaps within batches.
+ */
+export class ValidityBuilder extends BatchBuilder {
+ constructor(type, ctx) {
+ super(type, ctx);
+ }
+
+ init() {
+ this.nullCount = 0;
+ this.validity = bitmap();
+ return super.init();
+ }
+
+ /**
+ * @param {*} value
+ * @param {number} index
+ * @returns {boolean | void}
+ */
+ set(value, index) {
+ this.index = index;
+ const isValid = value != null;
+ if (isValid) {
+ this.validity.set(index);
+ } else {
+ this.nullCount++;
+ }
+ return isValid;
+ }
+
+ done() {
+ const { index, nullCount, type, validity } = this;
+ return {
+ length: index + 1,
+ nullCount,
+ type,
+ validity: nullCount
+ ? validity.array((index >> 3) + 1)
+ : new uint8Array(0)
+ };
+ }
+}
diff --git a/src/build/builders/values.js b/src/build/builders/values.js
new file mode 100644
index 0000000..e5f0ea1
--- /dev/null
+++ b/src/build/builders/values.js
@@ -0,0 +1,58 @@
+import { toBigInt } from '../../util/numbers.js';
+import { buffer } from '../buffer.js';
+import { ValidityBuilder } from './validity.js';
+
+/**
+ * Builder for data batches that can be accessed directly as typed arrays.
+ */
+export class DirectBuilder extends ValidityBuilder {
+ constructor(type, ctx) {
+ super(type, ctx);
+ this.values = buffer(type.values);
+ }
+
+ init() {
+ this.values = buffer(this.type.values);
+ return super.init();
+ }
+
+ /**
+ * @param {*} value
+ * @param {number} index
+ * @returns {boolean | void}
+ */
+ set(value, index) {
+ if (super.set(value, index)) {
+ this.values.set(value, index);
+ }
+ }
+ done() {
+ return {
+ ...super.done(),
+ values: this.values.array(this.index + 1)
+ };
+ }
+}
+
+/**
+ * Builder for int64/uint64 data batches written as bigints.
+ */
+export class Int64Builder extends DirectBuilder {
+ set(value, index) {
+ super.set(value == null ? value : toBigInt(value), index);
+ }
+}
+
+/**
+ * Builder for data batches whose values must pass through a transform
+ * function prior to being written to a backing buffer.
+ */
+export class TransformBuilder extends DirectBuilder {
+ constructor(type, ctx, transform) {
+ super(type, ctx);
+ this.transform = transform;
+ }
+ set(value, index) {
+ super.set(value == null ? value : this.transform(value), index);
+ }
+}
diff --git a/src/build/column-from-array.js b/src/build/column-from-array.js
new file mode 100644
index 0000000..be0e8a6
--- /dev/null
+++ b/src/build/column-from-array.js
@@ -0,0 +1,149 @@
+import { float32Array, float64Array, int16Array, int32Array, int64Array, int8Array, isInt64ArrayType, isTypedArray, uint16Array, uint32Array, uint64Array, uint8Array } from '../util/arrays.js';
+import { DirectBatch, Int64Batch, NullBatch } from '../batch.js';
+import { Column } from '../column.js';
+import { float32, float64, int16, int32, int64, int8, uint16, uint32, uint64, uint8 } from '../data-types.js';
+import { inferType } from './infer-type.js';
+import { builder, builderContext } from './builder.js';
+import { Type } from '../constants.js';
+
+/**
+ * Create a new column from a provided data array.
+ * @template T
+ * @param {Array<T> | import('../types.js').TypedArray} data The input data.
+ * @param {import('../types.js').DataType} [type] The data type.
+ * If not specified, type inference is attempted.
+ * @param {import('../types.js').ColumnBuilderOptions} [options]
+ * Builder options for the generated column.
+ * @param {ReturnType<typeof builderContext>} [ctx]
+ * Builder context object, for internal use only.
+ * @returns {Column<T>} The generated column.
+ */
+export function columnFromArray(data, type, options = {}, ctx) {
+ if (!type) {
+ if (isTypedArray(data)) {
+ return columnFromTypedArray(data, options);
+ } else {
+ type = inferType(data);
+ }
+ }
+ return columnFromValues(data, type, options, ctx);
+}
+
+/**
+ * Create a new column from a typed array input.
+ * @template T
+ * @param {import('../types.js').TypedArray} values
+ * @param {import('../types.js').ColumnBuilderOptions} options
+ * Builder options for the generated column.
+ * @returns {Column<T>} The generated column.
+ */
+function columnFromTypedArray(values, { maxBatchRows, useBigInt }) {
+ const arrayType = /** @type {import('../types.js').TypedArrayConstructor} */ (
+ values.constructor
+ );
+ const type = typeForTypedArray(arrayType);
+ const length = values.length;
+ const limit = Math.min(maxBatchRows || Infinity, length);
+ const numBatches = Math.floor(length / limit);
+
+ const batches = [];
+ const batchType = isInt64ArrayType(arrayType) && !useBigInt ? Int64Batch : DirectBatch;
+ const add = (start, end) => batches.push(new batchType({
+    length: end - start,
+ nullCount: 0,
+ type,
+ values: values.subarray(start, end)
+ }));
+
+ let idx = 0;
+ for (let i = 0; i < numBatches; ++i) add(idx, idx += limit);
+ if (idx < length) add(idx, length);
+
+ return new Column(batches);
+}
+
+/**
+ * Build a column by iterating over the provided values array.
+ * @template T
+ * @param {Array<T> | import('../types.js').TypedArray} values The input data.
+ * @param {import('../types.js').DataType} type The column data type.
+ * @param {import('../types.js').ColumnBuilderOptions} [options]
+ * Builder options for the generated column.
+ * @param {ReturnType<typeof builderContext>} [ctx]
+ * Builder context object, for internal use only.
+ * @returns {Column<T>} The generated column.
+ */
+function columnFromValues(values, type, options, ctx) {
+ const { maxBatchRows, ...opt } = options;
+ const length = values.length;
+ const limit = Math.min(maxBatchRows || Infinity, length);
+
+ // if null type, generate batches and exit early
+ if (type.typeId === Type.Null) {
+ return new Column(nullBatches(type, length, limit));
+ }
+
+ const data = [];
+ ctx ??= builderContext(opt);
+ const b = builder(type, ctx).init();
+ const next = b => data.push(b.batch());
+ const numBatches = Math.floor(length / limit);
+
+ let idx = 0;
+ let row = 0;
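+  // fill complete batches of `limit` rows, then any final partial batch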
+ for (let i = 0; i < numBatches; ++i) {
+ for (row = 0; row < limit; ++row) {
+ b.set(values[idx++], row);
+ }
+ next(b);
+ }
+ for (row = 0; idx < length; ++idx) {
+ b.set(values[idx], row++);
+ }
+ if (row) next(b);
+
+ // resolve dictionaries
+ ctx.finish();
+
+ return new Column(data);
+}
+
+/**
+ * Return an Arrow data type for a given typed array type.
+ * @param {import('../types.js').TypedArrayConstructor} arrayType
+ * The typed array type.
+ * @returns {import('../types.js').DataType} The data type.
+ */
+function typeForTypedArray(arrayType) {
+ switch (arrayType) {
+ case float32Array: return float32();
+ case float64Array: return float64();
+ case int8Array: return int8();
+ case int16Array: return int16();
+ case int32Array: return int32();
+ case int64Array: return int64();
+ case uint8Array: return uint8();
+ case uint16Array: return uint16();
+ case uint32Array: return uint32();
+ case uint64Array: return uint64();
+ }
+}
+
+/**
+ * Create null batches with the given batch size limit.
+ * @param {import('../types.js').NullType} type The null data type.
+ * @param {number} length The total column length.
+ * @param {number} limit The maximum batch size.
+ * @returns {import('../batch.js').NullBatch[]} The null batches.
+ */
+function nullBatches(type, length, limit) {
+ const data = [];
+ const batch = length => new NullBatch({ length, nullCount: length, type });
+ const numBatches = Math.floor(length / limit);
+ for (let i = 0; i < numBatches; ++i) {
+ data.push(batch(limit));
+ }
+ const rem = length % limit;
+ if (rem) data.push(batch(rem));
+ return data;
+}
diff --git a/src/build/infer-type.js b/src/build/infer-type.js
new file mode 100644
index 0000000..a66ee17
--- /dev/null
+++ b/src/build/infer-type.js
@@ -0,0 +1,152 @@
+import { bool, dateDay, dictionary, field, fixedSizeList, float64, int16, int32, int64, int8, list, nullType, struct, timestamp, utf8 } from '../data-types.js';
+import { isArray } from '../util/arrays.js';
+
+/**
+ * Infer the data type for a given input array.
+ * @param {import('../types.js').ValueArray} data The data array.
+ * @returns {import('../types.js').DataType} The data type.
+ */
+export function inferType(data) {
+ const profile = profiler();
+ for (let i = 0; i < data.length; ++i) {
+ profile.add(data[i]);
+ }
+ return profile.type();
+}
+
+function profiler() {
+ let length = 0;
+ let nullCount = 0;
+ let boolCount = 0;
+ let numberCount = 0;
+ let intCount = 0;
+ let bigintCount = 0;
+ let dateCount = 0;
+ let dayCount = 0;
+ let stringCount = 0;
+ let arrayCount = 0;
+ let structCount = 0;
+ let min = Infinity;
+ let max = -Infinity;
+  let minLength = Infinity;
+ let maxLength = 0;
+ let minBigInt;
+ let maxBigInt;
+ let arrayProfile;
+ let structProfiles = {};
+
+ return {
+ add(value) {
+ length++;
+ if (value == null) {
+ nullCount++;
+ return;
+ }
+ switch (typeof value) {
+ case 'string':
+ stringCount++;
+ break;
+ case 'number':
+ numberCount++;
+ if (value < min) min = value;
+ if (value > max) max = value;
+ if (Number.isInteger(value)) intCount++;
+ break;
+ case 'bigint':
+ bigintCount++;
+ if (minBigInt === undefined) {
+ minBigInt = maxBigInt = value;
+ } else {
+ if (value < minBigInt) minBigInt = value;
+ if (value > maxBigInt) maxBigInt = value;
+ }
+ break;
+ case 'boolean':
+ boolCount++;
+ break;
+ case 'object':
+ if (value instanceof Date) {
+ dateCount++;
+ // 1 day = 1000ms * 60s * 60min * 24hr = 86400000
+ if ((+value % 864e5) === 0) dayCount++;
+ } else if (isArray(value)) {
+ arrayCount++;
+ const len = value.length;
+ if (len < minLength) minLength = len;
+ if (len > maxLength) maxLength = len;
+ arrayProfile ??= profiler();
+ value.forEach(arrayProfile.add);
+ } else {
+ structCount++;
+ for (const key in value) {
+ const fieldProfiler = structProfiles[key]
+ ?? (structProfiles[key] = profiler());
+ fieldProfiler.add(value[key]);
+ }
+ }
+ }
+ },
+ type() {
+ const valid = length - nullCount;
+ return valid === 0 ? nullType()
+ : intCount === valid ? intType(min, max)
+ : numberCount === valid ? float64()
+ : bigintCount === valid ? bigintType(minBigInt, maxBigInt)
+ : boolCount === valid ? bool()
+ : dayCount === valid ? dateDay()
+ : dateCount === valid ? timestamp()
+ : stringCount === valid ? dictionary(utf8())
+ : arrayCount === valid ? arrayType(arrayProfile.type(), minLength, maxLength)
+ : structCount === valid ? struct(
+ Object.entries(structProfiles).map(_ => field(_[0], _[1].type()))
+ )
+ : unionType();
+ }
+ };
+}
+
+/**
+ * Return a list or fixed list type.
+ * @param {import('../types.js').DataType} type The child data type.
+ * @param {number} minLength The minimum list length.
+ * @param {number} maxLength The maximum list length.
+ * @returns {import('../types.js').DataType} The data type.
+ */
+function arrayType(type, minLength, maxLength) {
+ return (maxLength - minLength) === 0
+ ? fixedSizeList(type, minLength)
+ : list(type);
+}
+
+/**
+ * @param {number} min
+ * @param {number} max
+ * @returns {import('../types.js').DataType}
+ */
+function intType(min, max) {
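+  // negative minimums get one extra value of magnitude, since two's
+  // complement integer ranges are asymmetric (e.g., int8 spans -128 to 127)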
+ const v = Math.max(Math.abs(min) - 1, max);
+ return v < (1 << 7) ? int8()
+ : v < (1 << 15) ? int16()
+ : v < (2 ** 31) ? int32()
+ : float64();
+}
+
+/**
+ * @param {bigint} min
+ * @param {bigint} max
+ * @returns {import('../types.js').IntType}
+ */
+function bigintType(min, max) {
+ const v = -min > max ? -min - 1n : max;
+ if (v >= 2 ** 63) {
+ throw new Error(`BigInt exceeds 64 bits: ${v}`);
+ }
+ return int64();
+}
+
+/**
+ * @returns {import('../types.js').UnionType}
+ */
+function unionType() {
+ throw new Error('Mixed types detected, please define a union type.');
+}
diff --git a/src/build/table-from-arrays.js b/src/build/table-from-arrays.js
new file mode 100644
index 0000000..bcb063c
--- /dev/null
+++ b/src/build/table-from-arrays.js
@@ -0,0 +1,23 @@
+import { builderContext } from './builder.js';
+import { columnFromArray } from './column-from-array.js';
+import { tableFromColumns } from './table-from-columns.js';
+
+/**
+ * Create a new table from the provided arrays.
+ * @param {[string, Array | import('../types.js').TypedArray][]
+ *  | Record<string, Array | import('../types.js').TypedArray>} data
+ * The input data as a collection of named arrays.
+ * @param {import('../types.js').TableBuilderOptions} options
+ * Table builder options, including an optional type map.
+ * @returns {import('../table.js').Table} The new table.
+ */
+export function tableFromArrays(data, options = {}) {
+ const { types = {}, ...opt } = options;
+  const ctx = builderContext(opt);
+ const entries = Array.isArray(data) ? data : Object.entries(data);
+ const columns = entries.map(([name, array]) =>
+ /** @type {[string, import('../column.js').Column]} */ (
+ [ name, columnFromArray(array, types[name], opt, ctx)]
+ ));
+ return tableFromColumns(columns);
+}
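
A brief usage sketch for `tableFromArrays`: column types are inferred by the profiler unless overridden by name via the `types` option. The `float32` override below is illustrative.

```js
import { tableFromArrays } from './src/build/table-from-arrays.js'; // path assumed
import { float32 } from './src/data-types.js';

// 'value' is forced to 32-bit floats; 'id' falls back to type inference.
const table = tableFromArrays(
  { id: [1, 2, 3], value: [1.5, 2.5, 3.5] },
  { types: { value: float32() } }
);
```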
diff --git a/src/build/table-from-columns.js b/src/build/table-from-columns.js
new file mode 100644
index 0000000..a738762
--- /dev/null
+++ b/src/build/table-from-columns.js
@@ -0,0 +1,43 @@
+import { Endianness, Type, Version } from '../constants.js';
+import { field } from '../data-types.js';
+import { Table } from '../table.js';
+
+/**
+ * Create a new table from a collection of columns. Columns are assumed
+ * to have the same record batch sizes and consistent dictionary ids.
+ * @param {[string, import('../column.js').Column][]
+ * | Record<string, import('../column.js').Column>} data The columns,
+ * as an object with name keys, or an array of [name, column] pairs.
+ * @returns {Table} The new table.
+ */
+export function tableFromColumns(data) {
+ const fields = [];
+ const dictionaryTypes = new Map;
+ const entries = Array.isArray(data) ? data : Object.entries(data);
+ const length = entries[0]?.[1].length;
+ const columns = entries.map(([name, col]) => {
+ if (col.length !== length) {
+ throw new Error('All columns must have the same length.');
+ }
+ const type = col.type;
+ if (type.typeId === Type.Dictionary) {
+ const dict = dictionaryTypes.get(type.id);
+ if (dict && dict !== type.dictionary) {
+ throw new Error('Same id used across different dictionaries.');
+ }
+ dictionaryTypes.set(type.id, type.dictionary);
+ }
+ fields.push(field(name, col.type));
+ return col;
+ });
+
+ const schema = {
+ version: Version.V5,
+ endianness: Endianness.Little,
+ fields,
+ metadata: null,
+ dictionaryTypes
+ };
+
+ return new Table(schema, columns);
+}
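
A hedged sketch of `tableFromColumns` usage, per the invariants above: all columns must share the same length, and a dictionary id may not be reused across different dictionaries. Import paths are illustrative.

```js
import { columnFromArray } from './src/build/column-from-array.js'; // paths assumed
import { tableFromColumns } from './src/build/table-from-columns.js';

// Throws if the columns differ in length, or if a dictionary id is
// reused across distinct dictionaries.
const table = tableFromColumns({
  id: columnFromArray([1, 2, 3]),
  label: columnFromArray(['a', 'b', 'c']) // dictionary-encoded by default
});
```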
diff --git a/src/column.js b/src/column.js
index b7c20d3..e1fd798 100644
--- a/src/column.js
+++ b/src/column.js
@@ -1,15 +1,15 @@
+import { bisect } from './util/arrays.js';
import { isDirectBatch } from './batch.js';
-import { bisect } from './util.js';
/**
* Build up a column from batches.
*/
-export function columnBuilder(type) {
+export function columnBuilder() {
let data = [];
return {
add(batch) { data.push(batch); return this; },
clear: () => data = [],
- done: () => new Column(type, data)
+ done: () => new Column(data)
};
}
@@ -24,16 +24,15 @@ export function columnBuilder(type) {
export class Column {
/**
* Create a new column instance.
- * @param {import('./types.js').DataType} type The data type.
* @param {import('./batch.js').Batch[]} data The value batches.
*/
- constructor(type, data) {
+ constructor(data) {
/**
* The column data type.
* @type {import('./types.js').DataType}
* @readonly
*/
- this.type = type;
+ this.type = data[0].type;
/**
* The column length.
* @type {number}
diff --git a/src/constants.js b/src/constants.js
index 040724d..3c478a4 100644
--- a/src/constants.js
+++ b/src/constants.js
@@ -1,3 +1,6 @@
+/** Magic bytes 'ARROW1' indicating the Arrow 'file' format. */
+export const MAGIC = Uint8Array.of(65, 82, 82, 79, 87, 49);
+
/**
* Apache Arrow version.
*/
@@ -153,8 +156,22 @@ export const Type = /** @type {const} */ ({
* A "calendar" interval which models types that don't necessarily
* have a precise duration without the context of a base timestamp (e.g.
* days can differ in length during daylight saving time transitions).
- * All integers in the types below are stored in the endianness indicated
+ * All integers in the units below are stored in the endianness indicated
* by the schema.
+ *
+ * - YEAR_MONTH - Indicates the number of elapsed whole months, stored as
+ * 4-byte signed integers.
+ * - DAY_TIME - Indicates the number of elapsed days and milliseconds (no
+ * leap seconds), stored as 2 contiguous 32-bit signed integers (8-bytes
+ * in total). Support of this IntervalUnit is not required for full arrow
+ * compatibility.
+ * - MONTH_DAY_NANO - A triple of the number of elapsed months, days, and
+ * nanoseconds. The values are stored contiguously in 16-byte blocks.
+ * Months and days are encoded as 32-bit signed integers and nanoseconds
+ * is encoded as a 64-bit signed integer. Nanoseconds does not allow for
+ * leap seconds. Each field is independent (e.g. there is no constraint
+ * that nanoseconds have the same sign as days or that the quantity of
+ * nanoseconds represents less than a day's worth of time).
*/
Interval: 11,
/**
@@ -242,7 +259,7 @@ export const Type = /** @type {const} */ ({
/**
* Contains two child arrays, run_ends and values. The run_ends child array
* must be a 16/32/64-bit integer array which encodes the indices at which
- * the run with the value in each corresponding index in the values child
+ * the run with the value in each corresponding index in the values child
* array ends. Like list/struct types, the value array can be of any type.
*/
RunEndEncoded: 22,
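
For illustration, a minimal sketch of how the exported `MAGIC` bytes can be used to detect the Arrow 'file' format; `decode-ipc.js` below performs an equivalent check internally. The import path is illustrative.

```js
import { MAGIC } from './src/constants.js'; // path assumed

// A buffer is in the Arrow 'file' format if it starts with 'ARROW1'.
const isArrowFile = (buf) =>
  buf.length >= MAGIC.length && MAGIC.every((byte, i) => buf[i] === byte);
```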
diff --git a/src/data-types.js b/src/data-types.js
new file mode 100644
index 0000000..de61854
--- /dev/null
+++ b/src/data-types.js
@@ -0,0 +1,630 @@
+import { DateUnit, IntervalUnit, Precision, TimeUnit, Type, UnionMode } from './constants.js';
+import { intArrayType, float32Array, float64Array, int32Array, int64Array, uint16Array, uint64Array } from './util/arrays.js';
+import { check, checkOneOf, keyFor } from './util/objects.js';
+
+/**
+ * @typedef {import('./types.js').Field | import('./types.js').DataType} FieldInput
+ */
+
+export const invalidDataType = (typeId) =>
+ `Unsupported data type: "${keyFor(Type, typeId)}" (id ${typeId})`;
+
+/**
+ * Return a new field instance for use in a schema or type definition. A field
+ * represents a field name, data type, and additional metadata. Fields are used
+ * to represent child types within nested types like List, Struct, and Union.
+ * @param {string} name The field name.
+ * @param {import('./types.js').DataType} type The field data type.
+ * @param {boolean} [nullable=true] Flag indicating if the field is nullable
+ * (default `true`).
+ * @param {Map|null} [metadata=null] Custom field metadata
+ * annotations (default `null`).
+ * @returns {import('./types.js').Field} The field instance.
+ */
+export const field = (name, type, nullable = true, metadata = null) => ({
+ name,
+ nullable,
+ type,
+ metadata
+});
+
+/**
+ * Checks if a value is a field instance.
+ * @param {any} value
+ * @returns {value is import('./types.js').Field}
+ */
+function isField(value) {
+ return Object.hasOwn(value, 'name') && isDataType(value.type);
+}
+
+/**
+ * Checks if a value is a data type instance.
+ * @param {any} value
+ * @returns {value is import('./types.js').DataType}
+ */
+function isDataType(value) {
+ return typeof value?.typeId === 'number';
+}
+
+/**
+ * Return a field instance from a field or data type input.
+ * @param {FieldInput} value
+ * The value to map to a field.
+ * @param {string} [defaultName] The default field name.
+ * @param {boolean} [defaultNullable=true] The default nullable value.
+ * @returns {import('./types.js').Field} The field instance.
+ */
+function asField(value, defaultName = '', defaultNullable = true) {
+ return isField(value)
+ ? value
+ : field(
+ defaultName,
+ check(value, isDataType, () => `Data type expected.`),
+ defaultNullable
+ );
+}
+
+/////
+
+/**
+ * Return a basic type with only a type id.
+ * @template {typeof Type[keyof typeof Type]} T
+ * @param {T} typeId The type id.
+ */
+const basicType = (typeId) => ({ typeId });
+
+/**
+ * Return a Dictionary data type instance. A dictionary type consists of a
+ * dictionary of values (which may be of any type) and corresponding integer
+ * indices that reference those values. If values are repeated, a dictionary
+ * encoding can provide substantial space savings. In the IPC format,
+ * dictionary indices reside alongside other columns in a record batch, while
+ * dictionary values are written to special dictionary batches, linked by a
+ * unique dictionary *id*.
+ * @param {import('./types.js').DataType} type The data type of dictionary
+ * values.
+ * @param {import('./types.js').IntType} [indexType] The data type of
+ * dictionary indices. Must be an integer type (default `int32`).
+ * @param {number} [id=-1] The dictionary id, should be unique in a table.
+ * @param {boolean} [ordered=false] Indicates if dictionary values are
+ * ordered (default `false`).
+ * @returns {import('./types.js').DictionaryType}
+ */
+export const dictionary = (type, indexType, id = -1, ordered = false) => ({
+ typeId: Type.Dictionary,
+ dictionary: type,
+ indices: indexType || int32(),
+ id,
+ ordered
+});
+
+/**
+ * Return a Null data type instance. Null data requires no storage and all
+ * extracted values are `null`.
+ * @returns {import('./types.js').NullType} The null data type.
+ */
+export const nullType = () => basicType(Type.Null);
+
+/**
+ * Return an Int data type instance.
+ * @param {import('./types.js').IntBitWidth} [bitWidth=32] The integer bit width.
+ * One of `8`, `16`, `32` (default), or `64`.
+ * @param {boolean} [signed=true] Flag for signed or unsigned integers
+ * (default `true`).
+ * @returns {import('./types.js').IntType} The integer data type.
+ */
+export const int = (bitWidth = 32, signed = true) => ({
+ typeId: Type.Int,
+ bitWidth: checkOneOf(bitWidth, [8, 16, 32, 64]),
+ signed,
+ values: intArrayType(bitWidth, signed)
+});
+/**
+ * Return an Int data type instance for 8 bit signed integers.
+ * @returns {import('./types.js').IntType} The integer data type.
+ */
+export const int8 = () => int(8);
+/**
+ * Return an Int data type instance for 16 bit signed integers.
+ * @returns {import('./types.js').IntType} The integer data type.
+ */
+export const int16 = () => int(16);
+/**
+ * Return an Int data type instance for 32 bit signed integers.
+ * @returns {import('./types.js').IntType} The integer data type.
+ */
+export const int32 = () => int(32);
+/**
+ * Return an Int data type instance for 64 bit signed integers.
+ * @returns {import('./types.js').IntType} The integer data type.
+ */
+export const int64 = () => int(64);
+/**
+ * Return an Int data type instance for 8 bit unsigned integers.
+ * @returns {import('./types.js').IntType} The integer data type.
+ */
+export const uint8 = () => int(8, false);
+/**
+ * Return an Int data type instance for 16 bit unsigned integers.
+ * @returns {import('./types.js').IntType} The integer data type.
+ */
+export const uint16 = () => int(16, false);
+/**
+ * Return an Int data type instance for 32 bit unsigned integers.
+ * @returns {import('./types.js').IntType} The integer data type.
+ */
+export const uint32 = () => int(32, false);
+/**
+ * Return an Int data type instance for 64 bit unsigned integers.
+ * @returns {import('./types.js').IntType} The integer data type.
+ */
+export const uint64 = () => int(64, false);
+
+/**
+ * Return a Float data type instance for floating point numbers.
+ * @param {import('./types.js').Precision_} [precision=2] The floating point
+ * precision. One of `Precision.HALF` (16-bit), `Precision.SINGLE` (32-bit)
+ * or `Precision.DOUBLE` (64-bit, default).
+ * @returns {import('./types.js').FloatType} The floating point data type.
+ */
+export const float = (precision = 2) => ({
+ typeId: Type.Float,
+ precision: checkOneOf(precision, Precision),
+ values: [uint16Array, float32Array, float64Array][precision]
+});
+/**
+ * Return a Float data type instance for half-precision (16 bit) numbers.
+ * @returns {import('./types.js').FloatType} The floating point data type.
+ */
+export const float16 = () => float(Precision.HALF);
+/**
+ * Return a Float data type instance for single-precision (32 bit) numbers.
+ * @returns {import('./types.js').FloatType} The floating point data type.
+ */
+export const float32 = () => float(Precision.SINGLE);
+/**
+ * Return a Float data type instance for double-precision (64 bit) numbers.
+ * @returns {import('./types.js').FloatType} The floating point data type.
+ */
+export const float64 = () => float(Precision.DOUBLE);
+
+/**
+ * Return a Binary data type instance for variably-sized opaque binary data
+ * with 32-bit offsets.
+ * @returns {import('./types.js').BinaryType} The binary data type.
+ */
+export const binary = () => ({
+ typeId: Type.Binary,
+ offsets: int32Array
+});
+
+/**
+ * Return a Utf8 data type instance for Unicode string data.
+ * [UTF-8](https://en.wikipedia.org/wiki/UTF-8) code points are stored as
+ * binary data.
+ * @returns {import('./types.js').Utf8Type} The utf8 data type.
+ */
+export const utf8 = () => ({
+ typeId: Type.Utf8,
+ offsets: int32Array
+});
+
+/**
+ * Return a Bool data type instance. Bool values are stored compactly in
+ * bitmaps with eight values per byte.
+ * @returns {import('./types.js').BoolType} The bool data type.
+ */
+export const bool = () => basicType(Type.Bool);
+
+/**
+ * Return a Decimal data type instance. Decimal values are represented as 128
+ * or 256 bit integers in two's complement. Decimals are fixed point numbers
+ * with a set *precision* (total number of decimal digits) and *scale*
+ * (number of fractional digits). For example, the number `35.42` can be
+ * represented as `3542` with *precision* ≥ 4 and *scale* = 2.
+ * @param {number} precision The decimal precision: the total number of
+ * decimal digits that can be represented.
+ * @param {number} scale The number of fractional digits, beyond the
+ * decimal point.
+ * @param {128 | 256} [bitWidth] The decimal bit width.
+ * One of 128 (default) or 256.
+ * @returns {import('./types.js').DecimalType} The decimal data type.
+ */
+export const decimal = (precision, scale, bitWidth = 128) => ({
+ typeId: Type.Decimal,
+ precision,
+ scale,
+ bitWidth: checkOneOf(bitWidth, [128, 256]),
+ values: uint64Array
+});
+
+/**
+ * Return a Date data type instance. Date values are 32-bit or 64-bit signed
+ * integers representing elapsed time since the UNIX epoch (Jan 1, 1970 UTC),
+ * either in units of days (32 bits) or milliseconds (64 bits, with values
+ * evenly divisible by 86400000).
+ * @param {import('./types.js').DateUnit_} unit The date unit.
+ * One of `DateUnit.DAY` or `DateUnit.MILLISECOND`.
+ * @returns {import('./types.js').DateType} The date data type.
+ */
+export const date = (unit) => ({
+ typeId: Type.Date,
+ unit: checkOneOf(unit, DateUnit),
+ values: unit === DateUnit.DAY ? int32Array : int64Array
+});
+/**
+ * Return a Date data type instance with units of days.
+ * @returns {import('./types.js').DateType} The date data type.
+ */
+export const dateDay = () => date(DateUnit.DAY);
+/**
+ * Return a Date data type instance with units of milliseconds.
+ * @returns {import('./types.js').DateType} The date data type.
+ */
+export const dateMillisecond = () => date(DateUnit.MILLISECOND);
+
+/**
+ * Return a Time data type instance, stored in one of four *unit*s: seconds,
+ * milliseconds, microseconds or nanoseconds. The integer *bitWidth* depends
+ * on the *unit* and must be 32 bits for seconds and milliseconds or 64 bits
+ * for microseconds and nanoseconds. The allowed values are between 0
+ * (inclusive) and 86400 (=24*60*60) seconds (exclusive), adjusted for the
+ * time unit (for example, up to 86400000 exclusive for the
+ * `TimeUnit.MILLISECOND` unit).
+ *
+ * This definition doesn't allow for leap seconds. Time values from
+ * measurements with leap seconds will need to be corrected when ingesting
+ * into Arrow (for example by replacing the value 86400 with 86399).
+ * @param {import('./types.js').TimeUnit_} unit The time unit.
+ * One of `TimeUnit.SECOND`, `TimeUnit.MILLISECOND` (default),
+ * `TimeUnit.MICROSECOND`, or `TimeUnit.NANOSECOND`.
+ * @param {32 | 64} [bitWidth=32] The time bit width. One of `32` (for seconds
+ * and milliseconds) or `64` (for microseconds and nanoseconds).
+ * @returns {import('./types.js').TimeType} The time data type.
+ */
+export const time = (unit = TimeUnit.MILLISECOND, bitWidth = 32) => ({
+ typeId: Type.Time,
+ unit: checkOneOf(unit, TimeUnit),
+ bitWidth: checkOneOf(bitWidth, [32, 64]),
+ values: bitWidth === 32 ? int32Array : int64Array
+});
+/**
+ * Return a Time data type instance, represented as seconds.
+ * @returns {import('./types.js').TimeType} The time data type.
+ */
+export const timeSecond = () => time(TimeUnit.SECOND, 32);
+/**
+ * Return a Time data type instance, represented as milliseconds.
+ * @returns {import('./types.js').TimeType} The time data type.
+ */
+export const timeMillisecond = () => time(TimeUnit.MILLISECOND, 32);
+/**
+ * Return a Time data type instance, represented as microseconds.
+ * @returns {import('./types.js').TimeType} The time data type.
+ */
+export const timeMicrosecond = () => time(TimeUnit.MICROSECOND, 64);
+/**
+ * Return a Time data type instance, represented as nanoseconds.
+ * @returns {import('./types.js').TimeType} The time data type.
+ */
+export const timeNanosecond = () => time(TimeUnit.NANOSECOND, 64);
+
+/**
+ * Return a Timestamp data type instance. Timestamp values are 64-bit signed
+ * integers representing an elapsed time since a fixed epoch, stored in either
+ * of four units: seconds, milliseconds, microseconds or nanoseconds, and are
+ * optionally annotated with a timezone. Timestamp values do not include any
+ * leap seconds (in other words, all days are considered 86400 seconds long).
+ * @param {import('./types.js').TimeUnit_} [unit] The time unit.
+ * One of `TimeUnit.SECOND`, `TimeUnit.MILLISECOND` (default),
+ * `TimeUnit.MICROSECOND`, or `TimeUnit.NANOSECOND`.
+ * @param {string|null} [timezone=null] An optional string for the name of a
+ * timezone. If provided, the value should either be a string as used in the
+ * Olson timezone database (the "tz database" or "tzdata"), such as
+ * "America/New_York", or an absolute timezone offset of the form "+XX:XX" or
+ * "-XX:XX", such as "+07:30".Whether a timezone string is present indicates
+ * different semantics about the data.
+ * @returns {import('./types.js').TimestampType} The timestamp data type.
+ */
+export const timestamp = (unit = TimeUnit.MILLISECOND, timezone = null) => ({
+ typeId: Type.Timestamp,
+ unit: checkOneOf(unit, TimeUnit),
+ timezone,
+ values: int64Array
+});
+
+/**
+ * Return an Interval type instance. Values represent calendar intervals stored
+ * as integers for each date part. The supported interval *unit*s are:
+ *
+ * - `IntervalUnit.YEAR_MONTH`: Indicates the number of elapsed whole months,
+ * stored as 4-byte signed integers.
+ * - `IntervalUnit.DAY_TIME`: Indicates the number of elapsed days and
+ * milliseconds (no leap seconds), stored as 2 contiguous 32-bit signed
+ * integers (8-bytes in total).
+ * - `IntervalUnit.MONTH_DAY_NANO`: A triple of the number of elapsed months,
+ * days, and nanoseconds. The values are stored contiguously in 16-byte
+ * blocks. Months and days are encoded as 32-bit signed integers and
+ * nanoseconds is encoded as a 64-bit signed integer. Nanoseconds does not
+ * allow for leap seconds. Each field is independent (e.g. there is no
+ * constraint that nanoseconds have the same sign as days or that the
+ * quantity of nanoseconds represents less than a day's worth of time).
+ * @param {import('./types.js').IntervalUnit_} unit The interval unit.
+ * One of `IntervalUnit.YEAR_MONTH`, `IntervalUnit.DAY_TIME`, or
+ * `IntervalUnit.MONTH_DAY_NANO` (default).
+ * @returns {import('./types.js').IntervalType} The interval data type.
+ */
+export const interval = (unit = IntervalUnit.MONTH_DAY_NANO) => ({
+ typeId: Type.Interval,
+ unit: checkOneOf(unit, IntervalUnit),
+ values: unit === IntervalUnit.MONTH_DAY_NANO ? undefined : int32Array
+});
+
+/**
+ * Return a List data type instance, representing variably-sized lists
+ * (arrays) with 32-bit offsets. A list has a single child data type for
+ * list entries. Lists are represented using integer offsets that indicate
+ * list extents within a single child array containing all list values.
+ * @param {FieldInput} child The child (list item) field or data type.
+ * @returns {import('./types.js').ListType} The list data type.
+ */
+export const list = (child) => ({
+ typeId: Type.List,
+ children: [ asField(child) ],
+ offsets: int32Array
+});
+
+/**
+ * Return a Struct data type instance. A struct consists of multiple named
+ * child data types. Struct values are stored as parallel child batches, one
+ * per child type, and extracted to standard JavaScript objects.
+ * @param {import('./types.js').Field[] | Record<string, import('./types.js').DataType>} children
+ * An array of property fields, or an object mapping property names to data
+ * types. If an object, the instantiated fields are assumed to be nullable
+ * and have no metadata.
+ * @returns {import('./types.js').StructType} The struct data type.
+ */
+export const struct = (children) => ({
+ typeId: Type.Struct,
+ children: Array.isArray(children) && isField(children[0])
+ ? /** @type {import('./types.js').Field[]} */ (children)
+ : Object.entries(children).map(([name, type]) => field(name, type))
+});
+
+/**
+ * Return a Union type instance. A union is a complex type with parallel
+ * *children* data types. Union values are stored in either a sparse
+ * (`UnionMode.Sparse`) or dense (`UnionMode.Dense`) layout *mode*. In a
+ * sparse layout, child types are stored in parallel arrays with the same
+ * lengths, resulting in many unused, empty values. In a dense layout, child
+ * types have variable lengths and an offsets array is used to index the
+ * appropriate value.
+ *
+ * By default, ids in the type vector refer to the index in the children
+ * array. Optionally, *typeIds* provide an indirection between the child
+ * index and the type id. For each child, `typeIds[index]` is the id used
+ * in the type vector. The *typeIdForValue* argument provides a lookup
+ * function for mapping input data to the proper child type id, and is
+ * required if using builder methods.
+ * @param {import('./types.js').UnionMode_} mode The union mode.
+ * One of `UnionMode.Sparse` or `UnionMode.Dense`.
+ * @param {FieldInput[]} children The children fields or data types.
+ * Types are mapped to nullable fields with no metadata.
+ * @param {number[]} [typeIds] Children type ids, in the same order as the
+ * children types. Type ids provide a level of indirection over children
+ * types. If not provided, the children indices are used as the type ids.
+ * @param {(value: any, index: number) => number} [typeIdForValue]
+ * A function that takes an arbitrary value and a row index and returns a
+ * corresponding union type id. Required by builder methods.
+ * @returns {import('./types.js').UnionType} The union data type.
+ */
+export const union = (mode, children, typeIds, typeIdForValue) => {
+ typeIds ??= children.map((v, i) => i);
+ return {
+ typeId: Type.Union,
+ mode: checkOneOf(mode, UnionMode),
+ typeIds,
+ typeMap: typeIds.reduce((m, id, i) => ((m[id] = i), m), {}),
+ children: children.map((v, i) => asField(v, `_${i}`)),
+ typeIdForValue,
+ offsets: int32Array,
+ };
+};
+
+/**
+ * Create a FixedSizeBinary data type instance for opaque binary data where
+ * each entry has the same fixed size.
+ * @param {number} stride The fixed size in bytes.
+ * @returns {import('./types.js').FixedSizeBinaryType} The fixed size binary data type.
+ */
+export const fixedSizeBinary = (stride) => ({
+ typeId: Type.FixedSizeBinary,
+ stride
+});
+
+/**
+ * Return a FixedSizeList type instance for list (array) data where every list
+ * has the same fixed size. A list has a single child data type for list
+ * entries. Fixed size lists are represented as a single child array containing
+ * all list values, indexed using the known stride.
+ * @param {FieldInput} child The list item data type.
+ * @param {number} stride The fixed list size.
+ * @returns {import('./types.js').FixedSizeListType} The fixed size list data type.
+ */
+export const fixedSizeList = (child, stride) => ({
+ typeId: Type.FixedSizeList,
+ stride,
+ children: [ asField(child) ]
+});
+
+/**
+ * Internal method to create a Map type instance.
+ * @param {boolean} keysSorted Flag indicating if the map keys are sorted.
+ * @param {import('./types.js').Field} child The child (entries) field.
+ * @returns {import('./types.js').MapType} The map data type.
+ */
+export const mapType = (keysSorted, child) => ({
+ typeId: Type.Map,
+ keysSorted,
+ children: [child],
+ offsets: int32Array
+});
+
+/**
+ * Return a Map data type instance representing collections of key-value pairs.
+ * A Map is a logical nested type that is represented as a list of key-value
+ * structs. The key and value types are not constrained, so the application is
+ * responsible for ensuring that the keys are hashable and unique, and that
+ * keys are properly sorted if *keysSorted* is `true`.
+ * @param {FieldInput} keyField The map key field or data type.
+ * @param {FieldInput} valueField The map value field or data type.
+ * @param {boolean} [keysSorted=false] Flag indicating if the map keys are
+ * sorted (default `false`).
+ * @returns {import('./types.js').MapType} The map data type.
+ */
+export const map = (keyField, valueField, keysSorted = false) => mapType(
+ keysSorted,
+ field(
+ 'entries',
+ struct([ asField(keyField, 'key', false), asField(valueField, 'value') ]),
+ false
+ )
+);
+
+/**
+ * Return a Duration data type instance. Durations represent an absolute length
+ * of time unrelated to any calendar artifacts. The resolution defaults to
+ * millisecond, but can be any of the other `TimeUnit` values. This type is
+ * always represented as a 64-bit integer.
+ * @param {import('./types.js').TimeUnit_} unit
+ * @returns {import('./types.js').DurationType} The duration data type.
+ */
+export const duration = (unit = TimeUnit.MILLISECOND) => ({
+ typeId: Type.Duration,
+ unit: checkOneOf(unit, TimeUnit),
+ values: int64Array
+});
+
+/**
+ * Return a LargeBinary data type instance for variably-sized opaque binary
+ * data with 64-bit offsets, allowing representation of extremely large data
+ * values.
+ * @returns {import('./types.js').LargeBinaryType} The large binary data type.
+ */
+export const largeBinary = () => ({
+ typeId: Type.LargeBinary,
+ offsets: int64Array
+});
+
+/**
+ * Return a LargeUtf8 data type instance for Unicode string data of variable
+ * length with 64-bit offsets, allowing representation of extremely large data
+ * values. [UTF-8](https://en.wikipedia.org/wiki/UTF-8) code points are stored
+ * as binary data.
+ * @returns {import('./types.js').LargeUtf8Type} The large utf8 data type.
+ */
+export const largeUtf8 = () => ({
+ typeId: Type.LargeUtf8,
+ offsets: int64Array
+});
+
+/**
+ * Return a LargeList data type instance, representing variably-sized lists
+ * (arrays) with 64-bit offsets, allowing representation of extremely large
+ * data values. A list has a single child data type for list entries. Lists
+ * are represented using integer offsets that indicate list extents within a
+ * single child array containing all list values.
+ * @param {FieldInput} child The child (list item) field or data type.
+ * @returns {import('./types.js').LargeListType} The large list data type.
+ */
+export const largeList = (child) => ({
+ typeId: Type.LargeList,
+ children: [ asField(child) ],
+ offsets: int64Array
+});
+
+/**
+ * Return a RunEndEncoded data type instance, which compresses data by
+ * representing consecutive repeated values as a run. This data type uses two
+ * child arrays, `run_ends` and `values`. The `run_ends` child array must be
+ * a 16, 32, or 64 bit integer array which encodes the indices at which the
+ * run with the value in each corresponding index in the values child array
+ * ends. Like list and struct types, the `values` array can be of any type.
+ * @param {FieldInput} runsField The run-ends field or data type.
+ * @param {FieldInput} valuesField The values field or data type.
+ * @returns {import('./types.js').RunEndEncodedType} The run-end encoded data type.
+ */
+export const runEndEncoded = (runsField, valuesField) => ({
+ typeId: Type.RunEndEncoded,
+ children: [
+ check(
+ asField(runsField, 'run_ends'),
+ (field) => field.type.typeId === Type.Int,
+ () => 'Run-ends must have an integer type.'
+ ),
+ asField(valuesField, 'values')
+ ]
+});
+
+/**
+ * Return a BinaryView data type instance. BinaryView data is logically the
+ * same as the Binary type, but the internal representation uses a view struct
+ * that contains the string length and either the string's entire data inline
+ * (for small strings) or an inlined prefix, an index of another buffer, and an
+ * offset pointing to a slice in that buffer (for non-small strings).
+ *
+ * Flechette can encode and decode BinaryView data; however, Flechette does
+ * not currently support building BinaryView columns from JavaScript values.
+ * @returns {import('./types.js').BinaryViewType} The binary view data type.
+ */
+export const binaryView = () => /** @type{import('./types.js').BinaryViewType} */
+ (basicType(Type.BinaryView));
+
+/**
+ * Return a Utf8View data type instance. Utf8View data is logically the same as
+ * the Utf8 type, but the internal representation uses a view struct that
+ * contains the string length and either the string's entire data inline (for
+ * small strings) or an inlined prefix, an index of another buffer, and an
+ * offset pointing to a slice in that buffer (for non-small strings).
+ *
+ * Flechette can encode and decode Utf8View data; however, Flechette does
+ * not currently support building Utf8View columns from JavaScript values.
+ * @returns {import('./types.js').Utf8ViewType} The utf8 view data type.
+ */
+export const utf8View = () => /** @type{import('./types.js').Utf8ViewType} */
+ (basicType(Type.Utf8View));
+
+/**
+ * Return a ListView data type instance, representing variably-sized lists
+ * (arrays) with 32-bit offsets. ListView data represents the same logical
+ * types that List can, but contains both offsets and sizes allowing for
+ * writes in any order and sharing of child values among list values.
+ *
+ * Flechette can encode and decode ListView data; however, Flechette does not
+ * currently support building ListView columns from JavaScript values.
+ * @param {FieldInput} child The child (list item) field or data type.
+ * @returns {import('./types.js').ListViewType} The list view data type.
+ */
+export const listView = (child) => ({
+ typeId: Type.ListView,
+ children: [ asField(child, 'value') ],
+ offsets: int32Array
+});
+
+/**
+ * Return a LargeListView data type instance, representing variably-sized lists
+ * (arrays) with 64-bit offsets, allowing representation of extremely large
+ * data values. LargeListView data represents the same logical types that
+ * LargeList can, but contains both offsets and sizes allowing for writes
+ * in any order and sharing of child values among list values.
+ *
+ * Flechette can encode and decode LargeListView data; however, Flechette does
+ * not currently support building LargeListView columns from JavaScript values.
+ * @param {FieldInput} child The child (list item) field or data type.
+ * @returns {import('./types.js').LargeListViewType} The large list view data type.
+ */
+export const largeListView = (child) => ({
+ typeId: Type.LargeListView,
+ children: [ asField(child, 'value') ],
+ offsets: int64Array
+});
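
To summarize the constructor API added above, a short sketch of building nested types; every call shown maps directly to an export of `data-types.js`, though the import path is illustrative.

```js
import { dictionary, field, int32, list, struct, utf8 } from './src/data-types.js'; // path assumed

// Nested constructors accept bare data types or full field instances:
const point = struct({ x: int32(), y: int32() });  // object shorthand
const tags = list(utf8());                         // bare child type
const named = list(field('tag', utf8(), false));   // explicit, non-nullable
const coded = dictionary(utf8());                  // int32 indices by default
```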
diff --git a/src/decode/block.js b/src/decode/block.js
index 657ec2e..41890d3 100644
--- a/src/decode/block.js
+++ b/src/decode/block.js
@@ -1,19 +1,19 @@
-import { readInt32, readInt64AsNum, readVector } from '../util.js';
+import { readInt32, readInt64, readVector } from '../util/read.js';
/**
* Decode a block that points to messages within an Arrow 'file' format.
* @param {Uint8Array} buf A byte buffer of binary Arrow IPC data
* @param {number} index The starting index in the byte buffer
- * @returns The message block.
+ * @returns The file block.
*/
export function decodeBlock(buf, index) {
// 0: offset
// 8: metadataLength
// 16: bodyLength
return {
- offset: readInt64AsNum(buf, index),
+ offset: readInt64(buf, index),
metadataLength: readInt32(buf, index + 8),
- bodyLength: readInt64AsNum(buf, index + 16)
+ bodyLength: readInt64(buf, index + 16)
}
}
@@ -21,7 +21,7 @@ export function decodeBlock(buf, index) {
* Decode a vector of blocks.
* @param {Uint8Array} buf
* @param {number} index
- * @returns An array of message blocks.
+ * @returns An array of file blocks.
*/
export function decodeBlocks(buf, index) {
return readVector(buf, index, 24, decodeBlock);
diff --git a/src/decode/data-type.js b/src/decode/data-type.js
index a060cf9..74a0b9c 100644
--- a/src/decode/data-type.js
+++ b/src/decode/data-type.js
@@ -1,295 +1,96 @@
-import { arrayTypeInt, float32, float64, int32, int64, uint16, uint32 } from '../array-types.js';
import { DateUnit, IntervalUnit, Precision, TimeUnit, Type, UnionMode } from '../constants.js';
-import { keyFor, readBoolean, readInt16, readInt32, readOffset, readString, readVector, table } from '../util.js';
+import { binary, date, decimal, duration, fixedSizeBinary, fixedSizeList, float, int, interval, invalidDataType, largeBinary, largeList, largeListView, largeUtf8, list, listView, mapType, runEndEncoded, struct, time, timestamp, union, utf8 } from '../data-types.js';
+import { checkOneOf } from '../util/objects.js';
+import { readBoolean, readInt16, readInt32, readObject, readOffset, readString, readVector } from '../util/read.js';
/**
* Decode a data type definition for a field.
* @param {Uint8Array} buf A byte buffer of binary Arrow IPC data.
* @param {number} index The starting index in the byte buffer.
* @param {number} typeId The data type id.
- * @param {any[]} [children] A list of parsed child fields.
+ * @param {import('../types.js').Field[]} [children] A list of parsed child fields.
* @returns {import('../types.js').DataType} The data type.
*/
export function decodeDataType(buf, index, typeId, children) {
- switch (typeId) {
- case Type.NONE:
- case Type.Null:
- case Type.Bool:
- case Type.BinaryView:
- case Type.Utf8View:
- return { typeId };
- case Type.Binary:
- case Type.Utf8:
- return { typeId, offsets: int32 };
- case Type.LargeBinary:
- case Type.LargeUtf8:
- return { typeId, offsets: int64 };
- case Type.List:
- case Type.ListView:
- return { typeId, children: [children?.[0]], offsets: int32 };
- case Type.LargeList:
- case Type.LargeListView:
- return { typeId, children: [children?.[0]], offsets: int64 };
- case Type.Struct:
- case Type.RunEndEncoded:
- // @ts-ignore - suppress children length warning for run-end encoded
- return { typeId, children };
- case Type.Int:
- return decodeInt(buf, index);
- case Type.Float:
- return decodeFloat(buf, index);
- case Type.Decimal:
- return decodeDecimal(buf, index);
- case Type.Date:
- return decodeDate(buf, index);
- case Type.Time:
- return decodeTime(buf, index);
- case Type.Timestamp:
- return decodeTimestamp(buf, index);
- case Type.Interval:
- return decodeInterval(buf, index);
- case Type.Duration:
- return decodeDuration(buf, index);
- case Type.FixedSizeBinary:
- return decodeFixedSizeBinary(buf, index);
- case Type.FixedSizeList:
- return decodeFixedSizeList(buf, index, children);
- case Type.Map:
- return decodeMap(buf, index, children);
- case Type.Union:
- return decodeUnion(buf, index, children);
- }
- throw new Error(`Unrecognized type: "${keyFor(Type, typeId)}" (id ${typeId})`);
-}
-
-/**
- * Construct an integer data type.
- * @param {import('../types.js').IntBitWidth} bitWidth The integer bit width.
- * @param {boolean} signed Flag for signed or unsigned integers.
- * @returns {import('../types.js').IntType} The integer data type.
- */
-export function typeInt(bitWidth, signed) {
- return {
- typeId: Type.Int,
- bitWidth,
- signed,
- values: arrayTypeInt(bitWidth, signed)
- };
-}
-
-/**
- * Decode an integer data type from binary data.
- * @param {Uint8Array} buf A byte buffer of binary Arrow IPC data.
- * @param {number} index The starting index in the byte buffer.
- * @returns {import('../types.js').IntType} The integer data type.
- */
-export function decodeInt(buf, index) {
- // 4: bitWidth
- // 6: isSigned
- const get = table(buf, index);
- return typeInt(
- /** @type {import('../types.js').IntBitWidth} */
- (get(4, readInt32, 0)), // bitwidth
- get(6, readBoolean, false) // signed
- );
-}
-
-/**
- * Decode a float type.
- * @param {Uint8Array} buf A byte buffer of binary Arrow IPC data
- * @param {number} index The starting index in the byte buffer
- * @returns {import('../types.js').FloatType}
- */
-function decodeFloat(buf, index) {
- // 4: precision
- const get = table(buf, index);
- const precision = /** @type {typeof Precision[keyof Precision]} */
- (get(4, readInt16, Precision.HALF));
- return {
- typeId: Type.Float,
- precision,
- values: precision === Precision.HALF ? uint16
- : precision === Precision.SINGLE ? float32
- : float64
- };
-}
-
-/**
- * Decode a decimal type.
- * @param {Uint8Array} buf A byte buffer of binary Arrow IPC data
- * @param {number} index The starting index in the byte buffer
- * @returns {import('../types.js').DecimalType}
- */
-function decodeDecimal(buf, index) {
- // 4: precision
- // 6: scale
- // 8: bitWidth
- const get = table(buf, index);
- const bitWidth = /** @type {128 | 256 } */ (get(8, readInt32, 128));
- return {
- typeId: Type.Decimal,
- precision: get(4, readInt32, 0),
- scale: get(6, readInt32, 0),
- bitWidth,
- values: uint32
- };
-}
-
-/**
- * Decode a date type.
- * @param {Uint8Array} buf A byte buffer of binary Arrow IPC data
- * @param {number} index The starting index in the byte buffer
- * @returns {import('../types.js').DateType}
- */
-function decodeDate(buf, index) {
- // 4: unit
- const get = table(buf, index);
- const unit = /** @type {typeof DateUnit[keyof DateUnit]} */
- (get(4, readInt16, DateUnit.MILLISECOND));
- return {
- typeId: Type.Date,
- unit,
- values: unit === DateUnit.DAY ? int32 : int64
- };
-}
-
-/**
- * Decode a time type.
- * @param {Uint8Array} buf A byte buffer of binary Arrow IPC data
- * @param {number} index The starting index in the byte buffer
- * @returns {import('../types.js').TimeType}
- */
-function decodeTime(buf, index) {
- // 4: unit
- // 6: bitWidth
- const get = table(buf, index);
- const bitWidth = /** @type {32 | 64 } */ (get(6, readInt32, 32));
- return {
- typeId: Type.Time,
- unit: /** @type {typeof TimeUnit[keyof TimeUnit]} */
- (get(4, readInt16, TimeUnit.MILLISECOND)),
- bitWidth,
- values: bitWidth === 32 ? int32 : int64
- };
-}
-
-/**
- * Decode a timestamp type.
- * @param {Uint8Array} buf A byte buffer of binary Arrow IPC data
- * @param {number} index The starting index in the byte buffer
- * @returns {import('../types.js').TimestampType}
- */
-function decodeTimestamp(buf, index) {
- // 4: unit
- // 6: timezone
- const get = table(buf, index);
- return {
- typeId: Type.Timestamp,
- unit: /** @type {typeof TimeUnit[keyof TimeUnit]} */
- (get(4, readInt16, TimeUnit.SECOND)),
- timezone: get(6, readString),
- values: int64
- };
-}
-
-/**
- * Decode an interval type.
- * @param {Uint8Array} buf A byte buffer of binary Arrow IPC data
- * @param {number} index The starting index in the byte buffer
- * @returns {import('../types.js').IntervalType}
- */
-function decodeInterval(buf, index) {
- // 4: unit
- const get = table(buf, index);
- const unit = /** @type {typeof IntervalUnit[keyof IntervalUnit]} */
- (get(4, readInt16, IntervalUnit.YEAR_MONTH));
- return {
- typeId: Type.Interval,
- unit,
- values: unit === IntervalUnit.MONTH_DAY_NANO ? undefined : int32
- };
-}
-
-/**
- * Decode a duration type.
- * @param {Uint8Array} buf A byte buffer of binary Arrow IPC data
- * @param {number} index The starting index in the byte buffer
- * @returns {import('../types.js').DurationType}
- */
-function decodeDuration(buf, index) {
- // 4: unit
- const get = table(buf, index);
- return {
- typeId: Type.Duration,
- unit: /** @type {typeof TimeUnit[keyof TimeUnit]} */
- (get(4, readInt16, TimeUnit.MILLISECOND)),
- values: int64
- };
-}
+ checkOneOf(typeId, Type, invalidDataType);
+ const get = readObject(buf, index);
-/**
- * Decode a fixed size binary type.
- * @param {Uint8Array} buf A byte buffer of binary Arrow IPC data
- * @param {number} index The starting index in the byte buffer
- * @returns {import('../types.js').FixedSizeBinaryType}
- */
-function decodeFixedSizeBinary(buf, index) {
- // 4: size (byteWidth)
- const get = table(buf, index);
- return {
- typeId: Type.FixedSizeBinary,
- stride: get(4, readInt32, 0)
- };
-}
+ switch (typeId) {
+ // types without flatbuffer objects
+ case Type.Binary: return binary();
+ case Type.Utf8: return utf8();
+ case Type.LargeBinary: return largeBinary();
+ case Type.LargeUtf8: return largeUtf8();
+ case Type.List: return list(children[0]);
+ case Type.ListView: return listView(children[0]);
+ case Type.LargeList: return largeList(children[0]);
+ case Type.LargeListView: return largeListView(children[0]);
+ case Type.Struct: return struct(children);
+ case Type.RunEndEncoded: return runEndEncoded(children[0], children[1]);
-/**
- * Decode a fixed size list type.
- * @param {Uint8Array} buf A byte buffer of binary Arrow IPC data
- * @param {number} index The starting index in the byte buffer
- * @returns {import('../types.js').FixedSizeListType}
- */
-function decodeFixedSizeList(buf, index, children) {
- // 4: size (listSize)
- const get = table(buf, index);
- return {
- typeId: Type.FixedSizeList,
- stride: get(4, readInt32, 0),
- children: [children?.[0] ?? null]
- };
-}
+ // types with flatbuffer objects
+ case Type.Int: return int(
+ // @ts-ignore
+ get(4, readInt32, 0), // bitwidth
+ get(6, readBoolean, false) // signed
+ );
+ case Type.Float: return float(
+ // @ts-ignore
+ get(4, readInt16, Precision.HALF) // precision
+ );
+ case Type.Decimal: return decimal(
+ get(4, readInt32, 0), // precision
+ get(6, readInt32, 0), // scale
+ // @ts-ignore
+ get(8, readInt32, 128) // bitwidth
+ );
+ case Type.Date: return date(
+ // @ts-ignore
+ get(4, readInt16, DateUnit.MILLISECOND) // unit
+ );
+ case Type.Time: return time(
+ // @ts-ignore
+ get(4, readInt16, TimeUnit.MILLISECOND), // unit
+ get(6, readInt32, 32) // bitWidth
+ );
+ case Type.Timestamp: return timestamp(
+ // @ts-ignore
+ get(4, readInt16, TimeUnit.SECOND), // unit
+ get(6, readString) // timezone
+ );
+ case Type.Interval: return interval(
+ // @ts-ignore
+ get(4, readInt16, IntervalUnit.YEAR_MONTH) // unit
+ );
+ case Type.Duration: return duration(
+ // @ts-ignore
+ get(4, readInt16, TimeUnit.MILLISECOND) // unit
+ );
-/**
- * Decode a map type.
- * @param {Uint8Array} buf A byte buffer of binary Arrow IPC data
- * @param {number} index The starting index in the byte buffer
- * @returns {import('../types.js').MapType}
- */
-function decodeMap(buf, index, children) {
- // 4: keysSorted (bool)
- const get = table(buf, index);
- return {
- typeId: Type.Map,
- keysSorted: get(4, readBoolean, false),
- children,
- offsets: int32
- };
-}
+ case Type.FixedSizeBinary: return fixedSizeBinary(
+ get(4, readInt32, 0) // stride
+ );
+ case Type.FixedSizeList: return fixedSizeList(
+ children[0],
+ get(4, readInt32, 0), // stride
+ );
+ case Type.Map: return mapType(
+ get(4, readBoolean, false), // keysSorted
+ children[0]
+ );
-/**
- * Decode a union type.
- * @param {Uint8Array} buf A byte buffer of binary Arrow IPC data
- * @param {number} index The starting index in the byte buffer
- * @returns {import('../types.js').UnionType}
- */
-function decodeUnion(buf, index, children) {
- // 4: mode
- // 6: typeIds
- const get = table(buf, index);
- return {
- typeId: Type.Union,
- mode: /** @type {typeof UnionMode[keyof UnionMode]} */ (get(4, readInt16, UnionMode.Sparse)),
- typeIds: readVector(buf, get(6, readOffset), 4, readInt32),
- children: children ?? [],
- offsets: int32
- };
+ case Type.Union: return union(
+ // @ts-ignore
+ get(4, readInt16, UnionMode.Sparse), // mode
+ children,
+ readVector(buf, get(6, readOffset), 4, readInt32) // type ids
+ );
+ }
+ // case Type.NONE:
+ // case Type.Null:
+ // case Type.Bool:
+ // case Type.BinaryView:
+ // case Type.Utf8View:
+ // @ts-ignore
+ return { typeId };
}
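
Because the decoder now delegates to the constructors in `data-types.js`, decoded types are plain objects structurally identical to hand-built ones. A small sketch of that symmetry (property values assumed from the constructor definitions above; paths illustrative):

```js
import { Type } from './src/constants.js'; // paths assumed
import { int32 } from './src/data-types.js';

// A decoded 32-bit Int type has the same shape as int32():
const t = int32();
t.typeId === Type.Int; // true
t.bitWidth;            // 32
t.signed;              // true
t.values;              // typed-array constructor used to wrap value buffers
```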
diff --git a/src/parse-ipc.js b/src/decode/decode-ipc.js
similarity index 76%
rename from src/parse-ipc.js
rename to src/decode/decode-ipc.js
index cba93af..370b6c7 100644
--- a/src/parse-ipc.js
+++ b/src/decode/decode-ipc.js
@@ -1,9 +1,9 @@
-import { MessageHeader, Version } from './constants.js';
-import { readInt16, readInt32, table } from './util.js';
-import { decodeSchema } from './decode/schema.js';
-import { decodeMessage } from './decode/message.js';
-import { decodeMetadata } from './decode/metadata.js';
-import { decodeBlocks } from './decode/block.js';
+import { MAGIC, MessageHeader, Version } from '../constants.js';
+import { readInt16, readInt32, readObject } from '../util/read.js';
+import { decodeBlocks } from './block.js';
+import { decodeMessage } from './message.js';
+import { decodeMetadata } from './metadata.js';
+import { decodeSchema } from './schema.js';
/**
* Decode [Apache Arrow IPC data][1] and return parsed schema, record batch,
@@ -19,20 +19,17 @@ import { decodeBlocks } from './decode/block.js';
* The source byte buffer, or an array of buffers. If an array, each byte
* array may contain one or more self-contained messages. Messages may NOT
* span multiple byte arrays.
- * @returns {import('./types.js').ArrowData}
+ * @returns {import('../types.js').ArrowData}
*/
-export function parseIPC(data) {
+export function decodeIPC(data) {
const source = data instanceof ArrayBuffer
? new Uint8Array(data)
: data;
return !Array.isArray(source) && isArrowFileFormat(source)
- ? parseIPCFile(source)
- : parseIPCStream(source);
+ ? decodeIPCFile(source)
+ : decodeIPCStream(source);
}
-/** Magic bytes 'ARROW1' indicating the Arrow 'file' format. */
-const MAGIC = Uint8Array.of(65, 82, 82, 79, 87, 49);
-
/**
* @param {Uint8Array} buf
* @returns {boolean}
@@ -52,9 +49,9 @@ function isArrowFileFormat(buf) {
* @param {Uint8Array | Uint8Array[]} data The source byte buffer, or an
* array of buffers. If an array, each byte array may contain one or more
* self-contained messages. Messages may NOT span multiple byte arrays.
- * @returns {import('./types.js').ArrowData}
+ * @returns {import('../types.js').ArrowData}
*/
-export function parseIPCStream(data) {
+export function decodeIPCStream(data) {
const stream = [data].flat();
let schema;
@@ -86,7 +83,7 @@ export function parseIPCStream(data) {
}
}
- return /** @type {import('./types.js').ArrowData} */ (
+ return /** @type {import('../types.js').ArrowData} */ (
{ schema, dictionaries, records, metadata: null }
);
}
@@ -96,9 +93,9 @@ export function parseIPCStream(data) {
*
* [1]: https://arrow.apache.org/docs/format/Columnar.html#ipc-file-format
* @param {Uint8Array} data The source byte buffer.
- * @returns {import('./types.js').ArrowData}
+ * @returns {import('../types.js').ArrowData}
*/
-export function parseIPCFile(data) {
+export function decodeIPCFile(data) {
// find footer location
const offset = data.byteLength - (MAGIC.length + 4);
const length = readInt32(data, offset);
@@ -109,13 +106,13 @@ export function parseIPCFile(data) {
// 8: dictionaries (vector)
// 10: batches (vector)
// 12: metadata
- const get = table(data, offset - length);
- const version = /** @type {import('./types.js').Version_} */
+ const get = readObject(data, offset - length);
+ const version = /** @type {import('../types.js').Version_} */
(get(4, readInt16, Version.V1));
const dicts = get(8, decodeBlocks, []);
const recs = get(10, decodeBlocks, []);
- return /** @type {import('./types.js').ArrowData} */ ({
+ return /** @type {import('../types.js').ArrowData} */ ({
schema: get(6, (buf, index) => decodeSchema(buf, index, version)),
dictionaries: dicts.map(({ offset }) => decodeMessage(data, offset).content),
records: recs.map(({ offset }) => decodeMessage(data, offset).content),
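
An end-to-end usage sketch: `tableFromIPC` (added in `table-from-ipc.js` below) passes bytes through `decodeIPC`, which routes to the file or stream decoder based on the magic-byte check above. The URL is illustrative.

```js
import { tableFromIPC } from './src/decode/table-from-ipc.js'; // path assumed

// Handles both the IPC 'file' format (ARROW1 magic) and the 'stream' format.
const resp = await fetch('https://example.com/data.arrows'); // URL assumed
const table = tableFromIPC(new Uint8Array(await resp.arrayBuffer()));
```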
diff --git a/src/decode/dictionary-batch.js b/src/decode/dictionary-batch.js
index 3738b0a..8340496 100644
--- a/src/decode/dictionary-batch.js
+++ b/src/decode/dictionary-batch.js
@@ -1,4 +1,4 @@
-import { readBoolean, readInt64AsNum, table } from '../util.js';
+import { readBoolean, readInt64, readObject } from '../util/read.js';
import { decodeRecordBatch } from './record-batch.js';
/**
@@ -12,9 +12,9 @@ export function decodeDictionaryBatch(buf, index, version) {
// 4: id
// 6: data
// 8: isDelta
- const get = table(buf, index);
+ const get = readObject(buf, index);
return {
- id: get(4, readInt64AsNum, 0),
+ id: get(4, readInt64, 0),
data: get(6, (buf, off) => decodeRecordBatch(buf, off, version)),
/**
* If isDelta is true the values in the dictionary are to be appended to a
diff --git a/src/decode/message.js b/src/decode/message.js
index 6d87bd9..c01c59b 100644
--- a/src/decode/message.js
+++ b/src/decode/message.js
@@ -1,5 +1,6 @@
import { MessageHeader, Version } from '../constants.js';
-import { SIZEOF_INT, keyFor, readInt16, readInt32, readInt64AsNum, readOffset, readUint8, table } from '../util.js';
+import { keyFor } from '../util/objects.js';
+import { SIZEOF_INT, readInt16, readInt32, readInt64, readObject, readOffset, readUint8 } from '../util/read.js';
import { decodeDictionaryBatch } from './dictionary-batch.js';
import { decodeRecordBatch } from './record-batch.js';
import { decodeSchema } from './schema.js';
@@ -45,13 +46,13 @@ export function decodeMessage(buf, index) {
// 6: headerType
// 8: headerIndex
// 10: bodyLength
- const get = table(head, 0);
+ const get = readObject(head, 0);
const version = /** @type {import('../types.js').Version_} */
(get(4, readInt16, Version.V1));
const type = /** @type {import('../types.js').MessageHeader_} */
(get(6, readUint8, MessageHeader.NONE));
const offset = get(8, readOffset, 0);
- const bodyLength = get(10, readInt64AsNum, 0);
+ const bodyLength = get(10, readInt64, 0);
let content;
if (offset) {
diff --git a/src/decode/metadata.js b/src/decode/metadata.js
index d34b6c0..9327014 100644
--- a/src/decode/metadata.js
+++ b/src/decode/metadata.js
@@ -1,4 +1,4 @@
-import { readString, readVector, table } from '../util.js';
+import { readObject, readString, readVector } from '../util/read.js';
/**
* Decode custom metadata consisting of key-value string pairs.
@@ -8,7 +8,7 @@ import { readString, readVector, table } from '../util.js';
*/
export function decodeMetadata(buf, index) {
const entries = readVector(buf, index, 4, (buf, pos) => {
- const get = table(buf, pos);
+ const get = readObject(buf, pos);
return /** @type {[string, string]} */ ([
get(4, readString), // 4: key (string)
get(6, readString) // 6: value (string)
diff --git a/src/decode/record-batch.js b/src/decode/record-batch.js
index 8740fa2..d61e2b5 100644
--- a/src/decode/record-batch.js
+++ b/src/decode/record-batch.js
@@ -1,5 +1,5 @@
import { Version } from '../constants.js';
-import { readInt64AsNum, readOffset, readVector, table } from '../util.js';
+import { readInt64, readObject, readOffset, readVector } from '../util/read.js';
/**
* Decode a record batch.
@@ -14,7 +14,7 @@ export function decodeRecordBatch(buf, index, version) {
// 8: buffers
// 10: compression (not supported)
// 12: variadicBuffers (buffer counts for view-typed fields)
- const get = table(buf, index);
+ const get = readObject(buf, index);
if (get(10, readOffset, 0)) {
throw new Error('Record batch compression not implemented');
}
@@ -24,15 +24,15 @@ export function decodeRecordBatch(buf, index, version) {
const offset = version < Version.V4 ? 8 : 0;
return {
- length: get(4, readInt64AsNum, 0),
+ length: get(4, readInt64, 0),
nodes: readVector(buf, get(6, readOffset), 16, (buf, pos) => ({
- length: readInt64AsNum(buf, pos),
- nullCount: readInt64AsNum(buf, pos + 8)
+ length: readInt64(buf, pos),
+ nullCount: readInt64(buf, pos + 8)
})),
- buffers: readVector(buf, get(8, readOffset), 16 + offset, (buf, pos) => ({
- offset: readInt64AsNum(buf, pos + offset),
- length: readInt64AsNum(buf, pos + offset + 8)
+ regions: readVector(buf, get(8, readOffset), 16 + offset, (buf, pos) => ({
+ offset: readInt64(buf, pos + offset),
+ length: readInt64(buf, pos + offset + 8)
})),
- variadic: readVector(buf, get(12, readOffset), 8, readInt64AsNum)
+ variadic: readVector(buf, get(12, readOffset), 8, readInt64)
};
}
diff --git a/src/decode/schema.js b/src/decode/schema.js
index ab2e51b..1a8df94 100644
--- a/src/decode/schema.js
+++ b/src/decode/schema.js
@@ -1,6 +1,7 @@
import { Type } from '../constants.js';
-import { readBoolean, readInt16, readInt64AsNum, readOffset, readString, readUint8, readVector, table } from '../util.js';
-import { decodeDataType, decodeInt, typeInt } from './data-type.js';
+import { dictionary, int32 } from '../data-types.js';
+import { readBoolean, readInt16, readInt64, readObject, readOffset, readString, readUint8, readVector } from '../util/read.js';
+import { decodeDataType } from './data-type.js';
import { decodeMetadata } from './metadata.js';
/**
@@ -16,7 +17,7 @@ export function decodeSchema(buf, index, version) {
// 6: fields (vector)
// 8: metadata (vector)
// 10: features (int64[])
- const get = table(buf, index);
+ const get = readObject(buf, index);
return {
version,
endianness: /** @type {import('../types.js').Endianness_} */ (get(4, readInt16, 0)),
@@ -46,7 +47,7 @@ function decodeField(buf, index, dictionaryTypes) {
// 12: dictionary (table)
// 14: children (vector)
// 16: metadata (vector)
- const get = table(buf, index);
+ const get = readObject(buf, index);
const typeId = get(8, readUint8, Type.NONE);
const typeOffset = get(10, readOffset, 0);
const dict = get(12, decodeDictionary);
@@ -62,7 +63,7 @@ function decodeField(buf, index, dictionaryTypes) {
dictType = decodeDataType(buf, typeOffset, typeId, children);
dictionaryTypes.set(id, dictType);
}
- dict.type = dictType;
+ dict.dictionary = dictType;
type = dict;
} else {
type = decodeDataType(buf, typeOffset, typeId, children);
@@ -97,12 +98,23 @@ function decodeDictionary(buf, index) {
// 6: indexType (Int type)
// 8: isOrdered (boolean)
// 10: kind (int16) currently only dense array is supported
- const get = table(buf, index);
- return {
- type: null, // to be populated by caller
- typeId: Type.Dictionary,
- id: get(4, readInt64AsNum, 0),
- keys: get(6, decodeInt, typeInt(32, true)), // index defaults to int32
- ordered: get(8, readBoolean, false)
- };
+ const get = readObject(buf, index);
+ return dictionary(
+ null, // data type will be populated by caller
+ get(6, decodeInt, int32()), // index type
+ get(4, readInt64, 0), // id
+ get(8, readBoolean, false) // ordered
+ );
+}
+
+/**
+ * Decode an integer data type.
+ * @param {Uint8Array} buf A byte buffer of binary Arrow IPC data.
+ * @param {number} index The starting index in the byte buffer.
+ * @returns {import('../types.js').IntType}
+ */
+function decodeInt(buf, index) {
+ return /** @type {import('../types.js').IntType} */ (
+ decodeDataType(buf, index, Type.Int)
+ );
}
diff --git a/src/decode/table-from-ipc.js b/src/decode/table-from-ipc.js
new file mode 100644
index 0000000..7b90cdc
--- /dev/null
+++ b/src/decode/table-from-ipc.js
@@ -0,0 +1,228 @@
+import { batchType } from '../batch-type.js';
+import { columnBuilder } from '../column.js';
+import { Type, UnionMode, Version } from '../constants.js';
+import { invalidDataType } from '../data-types.js';
+import { Table } from '../table.js';
+import { int8Array } from '../util/arrays.js';
+import { decodeIPC } from './decode-ipc.js';
+
+/**
+ * Decode [Apache Arrow IPC data][1] and return a new Table. The input binary
+ * data may be either an `ArrayBuffer` or `Uint8Array`. For Arrow data in the
+ * [IPC 'stream' format][2], an array of `Uint8Array` values is also supported.
+ *
+ * [1]: https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc
+ * [2]: https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format
+ * @param {ArrayBuffer | Uint8Array | Uint8Array[]} data
+ * The source byte buffer, or an array of buffers. If an array, each byte
+ * array may contain one or more self-contained messages. Messages may NOT
+ * span multiple byte arrays.
+ * @param {import('../types.js').ExtractionOptions} [options]
+ * Options for controlling how values are transformed when extracted
+ * from an Arrow binary representation.
+ * @returns {Table} A Table instance.
+ */
+export function tableFromIPC(data, options) {
+ return createTable(decodeIPC(data), options);
+}
+
+/**
+ * Create a table from parsed IPC data.
+ * @param {import('../types.js').ArrowData} data
+ * The IPC data, as returned by decodeIPC.
+ * @param {import('../types.js').ExtractionOptions} [options]
+ * Options for controlling how values are transformed when extracted
+ * from an Arrow binary representation.
+ * @returns {Table} A Table instance.
+ */
+export function createTable(data, options = {}) {
+ const { schema = { fields: [] }, dictionaries, records } = data;
+ const { version, fields, dictionaryTypes } = schema;
+ const dictionaryMap = new Map;
+ const context = contextGenerator(options, version, dictionaryMap);
+
+ // decode dictionaries
+ const dicts = new Map;
+ for (const dict of dictionaries) {
+ const { id, data, isDelta, body } = dict;
+ const type = dictionaryTypes.get(id);
+ const batch = visit(type, context({ ...data, body }));
+ if (!dicts.has(id)) {
+ if (isDelta) {
+ throw new Error('Delta update cannot be the first dictionary batch.');
+ }
+ dicts.set(id, columnBuilder().add(batch));
+ } else {
+ const dict = dicts.get(id);
+ if (!isDelta) dict.clear();
+ dict.add(batch);
+ }
+ }
+ dicts.forEach((value, key) => dictionaryMap.set(key, value.done()));
+
+ // decode column fields
+ const cols = fields.map(() => columnBuilder());
+ for (const batch of records) {
+ const ctx = context(batch);
+ fields.forEach((f, i) => cols[i].add(visit(f.type, ctx)));
+ }
+
+ return new Table(schema, cols.map(c => c.done()));
+}
+
+/**
+ * Context object generator for field visitation and buffer definition.
+ */
+function contextGenerator(options, version, dictionaryMap) {
+ const base = {
+ version,
+ options,
+ dictionary: id => dictionaryMap.get(id),
+ };
+
+ /**
+ * Generate a decoding context for the given record batch.
+ * @param {import('../types.js').RecordBatch} batch
+ */
+ return batch => {
+ const { length, nodes, regions, variadic, body } = batch;
+ let nodeIndex = -1;
+ let bufferIndex = -1;
+ let variadicIndex = -1;
+ return {
+ ...base,
+ length,
+ node: () => nodes[++nodeIndex],
+ buffer: (ArrayType) => {
+ const { length, offset } = regions[++bufferIndex];
+ return ArrayType
+ ? new ArrayType(body.buffer, body.byteOffset + offset, length / ArrayType.BYTES_PER_ELEMENT)
+ : body.subarray(offset, offset + length)
+ },
+ variadic: () => variadic[++variadicIndex],
+ visit(children) { return children.map(f => visit(f.type, this)); }
+ };
+ };
+}
+
+/**
+ * Visit a field, instantiating views of buffer regions.
+ */
+function visit(type, ctx) {
+ const { typeId } = type;
+ const BatchType = batchType(type, ctx.options);
+
+ if (typeId === Type.Null) {
+ // no field node, no buffers
+ return new BatchType({ length: ctx.length, nullCount: ctx.length });
+ }
+
+ // extract the next { length, nullCount } field node
+ const node = { ...ctx.node(), type };
+
+ switch (typeId) {
+ // validity and data value buffers
+ case Type.Bool:
+ case Type.Int:
+ case Type.Time:
+ case Type.Duration:
+ case Type.Float:
+ case Type.Decimal:
+ case Type.Date:
+ case Type.Timestamp:
+ case Type.Interval:
+ case Type.FixedSizeBinary:
+ return new BatchType({
+ ...node,
+ validity: ctx.buffer(),
+ values: ctx.buffer(type.values)
+ });
+
+ // validity, offset, and value buffers
+ case Type.Utf8:
+ case Type.LargeUtf8:
+ case Type.Binary:
+ case Type.LargeBinary:
+ return new BatchType({
+ ...node,
+ validity: ctx.buffer(),
+ offsets: ctx.buffer(type.offsets),
+ values: ctx.buffer()
+ });
+
+ // views with variadic buffers
+ case Type.BinaryView:
+ case Type.Utf8View:
+ return new BatchType({
+ ...node,
+ validity: ctx.buffer(),
+ values: ctx.buffer(), // views buffer
+ data: Array.from({ length: ctx.variadic() }, () => ctx.buffer()) // data buffers
+ });
+
+ // validity, offset, and list child
+ case Type.List:
+ case Type.LargeList:
+ case Type.Map:
+ return new BatchType({
+ ...node,
+ validity: ctx.buffer(),
+ offsets: ctx.buffer(type.offsets),
+ children: ctx.visit(type.children)
+ });
+
+ // validity, offset, size, and list child
+ case Type.ListView:
+ case Type.LargeListView:
+ return new BatchType({
+ ...node,
+ validity: ctx.buffer(),
+ offsets: ctx.buffer(type.offsets),
+ sizes: ctx.buffer(type.offsets),
+ children: ctx.visit(type.children)
+ });
+
+ // validity and children
+ case Type.FixedSizeList:
+ case Type.Struct:
+ return new BatchType({
+ ...node,
+ validity: ctx.buffer(),
+ children: ctx.visit(type.children)
+ });
+
+ // children only
+ case Type.RunEndEncoded:
+ return new BatchType({
+ ...node,
+ children: ctx.visit(type.children)
+ });
+
+ // dictionary
+ case Type.Dictionary: {
+ const { id, indices } = type;
+ return new BatchType({
+ ...node,
+ validity: ctx.buffer(),
+ values: ctx.buffer(indices.values),
+ }).setDictionary(ctx.dictionary(id));
+ }
+
+ // union
+ case Type.Union: {
+ if (ctx.version < Version.V5) {
+ ctx.buffer(); // skip unused null bitmap
+ }
+ return new BatchType({
+ ...node,
+ typeIds: ctx.buffer(int8Array),
+ offsets: type.mode === UnionMode.Sparse ? null : ctx.buffer(type.offsets),
+ children: ctx.visit(type.children)
+ });
+ }
+
+ // unsupported type
+ default:
+ throw new Error(invalidDataType(typeId));
+ }
+}
diff --git a/src/encode/builder.js b/src/encode/builder.js
new file mode 100644
index 0000000..1343180
--- /dev/null
+++ b/src/encode/builder.js
@@ -0,0 +1,437 @@
+import { grow } from '../util/arrays.js';
+import { SIZEOF_INT, SIZEOF_SHORT, readInt16 } from '../util/read.js';
+import { encodeUtf8 } from '../util/strings.js';
+
+export function writeInt32(buf, index, value) {
+ buf[index] = value;
+ buf[index + 1] = value >> 8;
+ buf[index + 2] = value >> 16;
+ buf[index + 3] = value >> 24;
+}
+
+const INIT_SIZE = 1024;
+
+/** Flatbuffer binary builder. */
+export class Builder {
+ /**
+ * Create a new builder instance.
+ * @param {import('./sink.js').Sink} sink The byte consumer.
+ */
+ constructor(sink) {
+ /**
+ * Sink that consumes built byte buffers.
+ * @type {import('./sink.js').Sink}
+ */
+ this.sink = sink;
+ /**
+ * Minimum alignment encountered so far.
+ * @type {number}
+ */
+ this.minalign = 1;
+ /**
+ * Current byte buffer.
+ * @type {Uint8Array}
+ */
+ this.buf = new Uint8Array(INIT_SIZE);
+ /**
+ * Remaining space in the current buffer.
+ * @type {number}
+ */
+ this.space = INIT_SIZE;
+ /**
+ * List of offsets of all vtables. Used to find and
+ * reuse tables upon duplicated table field schemas.
+ * @type {number[]}
+ */
+ this.vtables = [];
+ /**
+ * Total bytes written to sink thus far.
+ */
+ this.outputBytes = 0;
+ }
+
+ /**
+ * Returns the flatbuffer offset, relative to the end of the current buffer.
+ * @returns {number} Offset relative to the end of the buffer.
+ */
+ offset() {
+ return this.buf.length - this.space;
+ }
+
+ /**
+ * Write a flatbuffer int8 value at the current buffer position
+ * and advance the internal cursor.
+ * @param {number} value
+ */
+ writeInt8(value) {
+ this.buf[this.space -= 1] = value;
+ }
+
+ /**
+ * Write a flatbuffer int16 value at the current buffer position
+ * and advance the internal cursor.
+ * @param {number} value
+ */
+ writeInt16(value) {
+ this.buf[this.space -= 2] = value;
+ this.buf[this.space + 1] = value >> 8;
+ }
+
+ /**
+ * Write a flatbuffer int32 value at the current buffer position
+ * and advance the internal cursor.
+ * @param {number} value
+ */
+ writeInt32(value) {
+ writeInt32(this.buf, this.space -= 4, value);
+ }
+
+ /**
+ * Write a flatbuffer int64 value at the current buffer position
+ * and advance the internal cursor.
+ * @param {number} value
+ */
+ writeInt64(value) {
+ const v = BigInt(value);
+ this.writeInt32(Number(BigInt.asIntN(32, v >> BigInt(32))));
+ this.writeInt32(Number(BigInt.asIntN(32, v)));
+ }
+
+ /**
+ * Add a flatbuffer int8 value, properly aligned.
+ * @param value The int8 value to add to the buffer.
+ */
+ addInt8(value) {
+ prep(this, 1, 0);
+ this.writeInt8(value);
+ }
+
+ /**
+ * Add a flatbuffer int16 value, properly aligned.
+ * @param value The int16 value to add to the buffer.
+ */
+ addInt16(value) {
+ prep(this, 2, 0);
+ this.writeInt16(value);
+ }
+
+ /**
+ * Add a flatbuffer int32 value, properly aligned.
+ * @param value The int32 value to add to the buffer.
+ */
+ addInt32(value) {
+ prep(this, 4, 0);
+ this.writeInt32(value);
+ }
+
+ /**
+ * Add a flatbuffer int64 value, properly aligned.
+ * @param value The int64 value to add to the buffer.
+ */
+ addInt64(value) {
+ prep(this, 8, 0);
+ this.writeInt64(value);
+ }
+
+ /**
+ * Add a flatbuffer offset, relative to where it will be written.
+ * @param {number} offset The offset to add.
+ */
+ addOffset(offset) {
+ prep(this, SIZEOF_INT, 0); // Ensure alignment is already done.
+ this.writeInt32(this.offset() - offset + SIZEOF_INT);
+ }
+
+ /**
+ * Add a flatbuffer object (vtable).
+ * @param {number} numFields The maximum number of fields
+ * this object may include.
+ * @param {(tableBuilder: ReturnType<typeof objectBuilder>) => void} [addFields]
+ * A callback function that writes all fields using an object builder.
+ * @returns {number} The object offset.
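+ * @example
+ * // a sketch mirroring the data type encoders: write a two-field object;
+ * // fields matching their default values are skipped in the vtable.
+ * // `unit` and `bitWidth` are assumed to be in scope
+ * const offset = builder.addObject(2, b => {
+ * b.addInt16(0, unit, TimeUnit.MILLISECOND);
+ * b.addInt32(1, bitWidth, 32);
+ * });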
+ */
+ addObject(numFields, addFields) {
+ const b = objectBuilder(this, numFields);
+ addFields?.(b);
+ return b.finish();
+ }
+
+ /**
+ * Add a flatbuffer vector (list).
+ * @template T
+ * @param {T[]} items An array of items to write.
+ * @param {number} itemSize The size in bytes of a serialized item.
+ * @param {number} alignment The desired byte alignment value.
+ * @param {(builder: this, item: T) => void} writeItem A callback
+ * function that writes a vector item to this builder.
+ * @returns {number} The vector offset.
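+ * @example
+ * // a sketch of writing int32 items, as in union type id encoding;
+ * // `typeIds` is assumed to be an array of numbers
+ * const offset = builder.addVector(typeIds, 4, 4, (b, v) => b.addInt32(v));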
+ */
+ addVector(items, itemSize, alignment, writeItem) {
+ const n = items?.length;
+ if (!n) return 0;
+ prep(this, SIZEOF_INT, itemSize * n);
+ prep(this, alignment, itemSize * n); // Just in case alignment > int.
+ for (let i = n; --i >= 0;) {
+ writeItem(this, items[i]);
+ }
+ this.writeInt32(n);
+ return this.offset();
+ }
+
+ /**
+ * Convenience method for writing a vector of byte buffer offsets.
+ * @param {number[]} offsets
+ * @returns {number} The vector offset.
+ */
+ addOffsetVector(offsets) {
+ return this.addVector(offsets, 4, 4, (b, off) => b.addOffset(off));
+ }
+
+ /**
+ * Add a flatbuffer UTF-8 string.
+ * @param {string} s The string to encode.
+ * @return {number} The string offset.
+ */
+ addString(s) {
+ if (s == null) return 0;
+ const utf8 = encodeUtf8(s);
+ const n = utf8.length;
+ this.addInt8(0); // string null terminator
+ prep(this, SIZEOF_INT, n);
+ this.buf.set(utf8, this.space -= n);
+ this.writeInt32(n);
+ return this.offset();
+ }
+
+ /**
+ * Finish the current flatbuffer by adding a root offset.
+ * @param {number} rootOffset The root offset.
+ */
+ finish(rootOffset) {
+ prep(this, this.minalign, SIZEOF_INT);
+ this.addOffset(rootOffset);
+ }
+
+ /**
+ * Flush the current flatbuffer byte buffer content to the sink,
+ * and reset the flatbuffer builder state.
+ */
+ flush() {
+ const { buf, sink } = this;
+ const bytes = buf.subarray(this.space, buf.length);
+ sink.write(bytes);
+ this.outputBytes += bytes.byteLength;
+ this.minalign = 1;
+ this.vtables = [];
+ this.buf = new Uint8Array(INIT_SIZE);
+ this.space = INIT_SIZE;
+ }
+
+ /**
+ * Add a byte buffer directly to the builder sink. This method bypasses
+ * any unflushed flatbuffer state and leaves it unchanged, writing the
+ * buffer to the sink *before* the flatbuffer.
+ * The buffer will be padded for 64-bit (8-byte) alignment as needed.
+ * @param {Uint8Array} buffer The buffer to add.
+ * @returns {number} The total byte count of the buffer and padding.
+ */
+ addBuffer(buffer) {
+ const size = buffer.byteLength;
+ if (!size) return 0;
+ this.sink.write(buffer);
+ this.outputBytes += size;
+ const pad = ((size + 7) & ~7) - size;
+ this.addPadding(pad);
+ return size + pad;
+ }
+
+ /**
+ * Write padding bytes directly to the builder sink. This method bypasses
+ * any unflushed flatbuffer state and leaves it unchanged, writing the
+ * padding bytes to the sink *before* the flatbuffer.
+ * @param {number} byteCount The number of padding bytes.
+ */
+ addPadding(byteCount) {
+ if (byteCount > 0) {
+ this.sink.write(new Uint8Array(byteCount));
+ this.outputBytes += byteCount;
+ }
+ }
+}
+
+/**
+ * Prepare to write an element of `size` after `additionalBytes` have been
+ * written, e.g. if we write a string, we need to align such that the int
+ * length field is aligned to 4 bytes, and the string data follows it
+ * directly. If all
+ * we need to do is alignment, `additionalBytes` will be 0.
+ * @param {Builder} builder The builder to prep.
+ * @param {number} size The size of the new element to write.
+ * @param {number} additionalBytes Additional padding size.
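+ * For example, with 6 bytes already used and `size` 4, the padding is
+ * (~6 + 1) & (4 - 1) = 2, so the next write starts at byte 8, a multiple
+ * of 4.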
+ */
+export function prep(builder, size, additionalBytes) {
+ let { buf, space, minalign } = builder;
+
+ // track the biggest thing we've ever aligned to
+ if (size > minalign) {
+ builder.minalign = size;
+ }
+
+ // find alignment needed so that `size` aligns after `additionalBytes`
+ const bufSize = buf.length;
+ const used = bufSize - space + additionalBytes;
+ const alignSize = (~used + 1) & (size - 1);
+
+ // reallocate the buffer if needed
+ buf = grow(buf, used + alignSize + size);
+ space += buf.length - bufSize;
+
+ // add padding
+ for (let i = 0; i < alignSize; ++i) {
+ buf[--space] = 0;
+ }
+
+ // update builder state
+ builder.buf = buf;
+ builder.space = space;
+}
+
+/**
+ * Returns a builder object for flatbuffer objects (vtables).
+ * @param {Builder} builder The underlying flatbuffer builder.
+ * @param {number} numFields The expected number of fields, not
+ * including the standard size fields.
+ */
+function objectBuilder(builder, numFields) {
+ /** @type {number[]} */
+ const vtable = Array(numFields).fill(0);
+ const startOffset = builder.offset();
+
+ function slot(index) {
+ vtable[index] = builder.offset();
+ }
+
+ return {
+ /**
+ * Add an int8-valued table field.
+ * @param {number} index
+ * @param {number} value
+ * @param {number} defaultValue
+ */
+ addInt8(index, value, defaultValue) {
+ if (value != defaultValue) {
+ builder.addInt8(value);
+ slot(index);
+ }
+ },
+
+ /**
+ * Add an int16-valued table field.
+ * @param {number} index
+ * @param {number} value
+ * @param {number} defaultValue
+ */
+ addInt16(index, value, defaultValue) {
+ if (value != defaultValue) {
+ builder.addInt16(value);
+ slot(index);
+ }
+ },
+
+ /**
+ * Add an int32-valued table field.
+ * @param {number} index
+ * @param {number} value
+ * @param {number} defaultValue
+ */
+ addInt32(index, value, defaultValue) {
+ if (value != defaultValue) {
+ builder.addInt32(value);
+ slot(index);
+ }
+ },
+
+ /**
+ * Add an int64-valued table field.
+ * @param {number} index
+ * @param {number} value
+ * @param {number} defaultValue
+ */
+ addInt64(index, value, defaultValue) {
+ if (value != defaultValue) {
+ builder.addInt64(value);
+ slot(index);
+ }
+ },
+
+ /**
+ * Add a buffer offset-valued table field.
+ * @param {number} index
+ * @param {number} value
+ * @param {number} defaultValue
+ */
+ addOffset(index, value, defaultValue) {
+ if (value != defaultValue) {
+ builder.addOffset(value);
+ slot(index);
+ }
+ },
+
+ /**
+ * Write the vtable to the buffer and return the table offset.
+ * @returns {number} The buffer offset to the vtable.
+ */
+ finish() {
+ // add offset entry, will overwrite later with actual offset
+ builder.addInt32(0);
+ const vtableOffset = builder.offset();
+
+ // trim zero-valued fields (indicating default value)
+ let i = numFields;
+ while (--i >= 0 && vtable[i] === 0) {} // eslint-disable-line no-empty
+ const size = i + 1;
+
+ // Write out the current vtable.
+ for (; i >= 0; --i) {
+ // Offset relative to the start of the table.
+ builder.addInt16(vtable[i] ? (vtableOffset - vtable[i]) : 0);
+ }
+
+ const standardFields = 2; // size fields
+ builder.addInt16(vtableOffset - startOffset);
+ const len = (size + standardFields) * SIZEOF_SHORT;
+ builder.addInt16(len);
+
+ // Search for an existing vtable that matches the current one.
+ let existingTable = 0;
+ const { buf, vtables, space: vt1 } = builder;
+ outer_loop:
+ for (i = 0; i < vtables.length; ++i) {
+ const vt2 = buf.length - vtables[i];
+ if (len == readInt16(buf, vt2)) {
+ for (let j = SIZEOF_SHORT; j < len; j += SIZEOF_SHORT) {
+ if (readInt16(buf, vt1 + j) != readInt16(buf, vt2 + j)) {
+ continue outer_loop;
+ }
+ }
+ existingTable = vtables[i];
+ break;
+ }
+ }
+
+ if (existingTable) {
+ // Found a match: remove the current vtable.
+ // Point table to existing vtable.
+ builder.space = buf.length - vtableOffset;
+ writeInt32(buf, builder.space, existingTable - vtableOffset);
+ } else {
+ // No match: add the location of the current vtable to the vtables list.
+ // Point table to current vtable.
+ const off = builder.offset();
+ vtables.push(off);
+ writeInt32(buf, buf.length - vtableOffset, off - vtableOffset);
+ }
+
+ return vtableOffset;
+ }
+ }
+}
diff --git a/src/encode/data-type.js b/src/encode/data-type.js
new file mode 100644
index 0000000..44a3338
--- /dev/null
+++ b/src/encode/data-type.js
@@ -0,0 +1,149 @@
+import { DateUnit, IntervalUnit, Precision, TimeUnit, Type, UnionMode } from '../constants.js';
+import { invalidDataType } from '../data-types.js';
+import { checkOneOf } from '../util/objects.js';
+
+/**
+ * Encode a data type into a flatbuffer.
+ * @param {import('./builder.js').Builder} builder
+ * @param {import('../types.js').DataType} type
+ * @returns {number} The offset at which the data type is written.
+ */
+export function encodeDataType(builder, type) {
+ const typeId = checkOneOf(type.typeId, Type, invalidDataType);
+
+ switch (typeId) {
+ case Type.Dictionary:
+ return encodeDictionary(builder, type);
+ case Type.Int:
+ return encodeInt(builder, type);
+ case Type.Float:
+ return encodeFloat(builder, type);
+ case Type.Decimal:
+ return encodeDecimal(builder, type);
+ case Type.Date:
+ return encodeDate(builder, type);
+ case Type.Time:
+ return encodeTime(builder, type);
+ case Type.Timestamp:
+ return encodeTimestamp(builder, type);
+ case Type.Interval:
+ return encodeInterval(builder, type);
+ case Type.Duration:
+ return encodeDuration(builder, type);
+ case Type.FixedSizeBinary:
+ case Type.FixedSizeList:
+ return encodeFixedSize(builder, type);
+ case Type.Map:
+ return encodeMap(builder, type);
+ case Type.Union:
+ return encodeUnion(builder, type);
+ }
+ // case Type.Null:
+ // case Type.Binary:
+ // case Type.LargeBinary:
+ // case Type.BinaryView:
+ // case Type.Bool:
+ // case Type.Utf8:
+ // case Type.Utf8View:
+ // case Type.LargeUtf8:
+ // case Type.List:
+ // case Type.ListView:
+ // case Type.LargeList:
+ // case Type.LargeListView:
+ // case Type.RunEndEncoded:
+ // case Type.Struct:
+ return builder.addObject(0);
+}
+
+function encodeDate(builder, type) {
+ return builder.addObject(1, b => {
+ b.addInt16(0, type.unit, DateUnit.MILLISECOND);
+ });
+}
+
+function encodeDecimal(builder, type) {
+ return builder.addObject(3, b => {
+ b.addInt32(0, type.precision, 0);
+ b.addInt32(1, type.scale, 0);
+ b.addInt32(2, type.bitWidth, 128);
+ });
+}
+
+function encodeDuration(builder, type) {
+ return builder.addObject(1, b => {
+ b.addInt16(0, type.unit, TimeUnit.MILLISECOND);
+ });
+}
+
+function encodeFixedSize(builder, type) {
+ return builder.addObject(1, b => {
+ b.addInt32(0, type.stride, 0);
+ });
+}
+
+function encodeFloat(builder, type) {
+ return builder.addObject(1, b => {
+ b.addInt16(0, type.precision, Precision.HALF);
+ });
+}
+
+function encodeInt(builder, type) {
+ return builder.addObject(2, b => {
+ b.addInt32(0, type.bitWidth, 0);
+ b.addInt8(1, +type.signed, 0);
+ });
+}
+
+function encodeInterval(builder, type) {
+ return builder.addObject(1, b => {
+ b.addInt16(0, type.unit, IntervalUnit.YEAR_MONTH);
+ });
+}
+
+function encodeMap(builder, type) {
+ return builder.addObject(1, b => {
+ b.addInt8(0, +type.keysSorted, 0);
+ });
+}
+
+function encodeTime(builder, type) {
+ return builder.addObject(2, b => {
+ b.addInt16(0, type.unit, TimeUnit.MILLISECOND);
+ b.addInt32(1, type.bitWidth, 32);
+ });
+}
+
+function encodeTimestamp(builder, type) {
+ const timezoneOffset = builder.addString(type.timezone);
+ return builder.addObject(2, b => {
+ b.addInt16(0, type.unit, TimeUnit.SECOND);
+ b.addOffset(1, timezoneOffset, 0);
+ });
+}
+
+function encodeUnion(builder, type) {
+ const typeIdsOffset = builder.addVector(
+ type.typeIds, 4, 4,
+ (builder, value) => builder.addInt32(value)
+ );
+ return builder.addObject(2, b => {
+ b.addInt16(0, type.mode, UnionMode.Sparse);
+ b.addOffset(1, typeIdsOffset, 0);
+ });
+}
+
+function encodeDictionary(builder, type) {
+ const keyTypeOffset = isInt32(type.indices)
+ ? 0
+ : encodeDataType(builder, type.indices);
+ return builder.addObject(4, b => {
+ b.addInt64(0, type.id, 0);
+ b.addOffset(1, keyTypeOffset, 0);
+ b.addInt8(2, +type.ordered, 0);
+ // NOT SUPPORTED: 3, dictionaryKind (defaults to dense array)
+ });
+}
+
+function isInt32(type) {
+ return type.typeId === Type.Int && type.bitWidth === 32 && type.signed;
+}
diff --git a/src/encode/dictionary-batch.js b/src/encode/dictionary-batch.js
new file mode 100644
index 0000000..34c9e86
--- /dev/null
+++ b/src/encode/dictionary-batch.js
@@ -0,0 +1,15 @@
+import { encodeRecordBatch } from './record-batch.js';
+
+/**
+ * @param {import('./builder.js').Builder} builder
+ * @param {import('../types.js').DictionaryBatch} dictionaryBatch
+ * @returns {number}
+ */
+export function encodeDictionaryBatch(builder, dictionaryBatch) {
+ const dataOffset = encodeRecordBatch(builder, dictionaryBatch.data);
+ return builder.addObject(3, b => {
+ b.addInt64(0, dictionaryBatch.id, 0);
+ b.addOffset(1, dataOffset, 0);
+ b.addInt8(2, +dictionaryBatch.isDelta, 0);
+ });
+}
diff --git a/src/encode/encode-ipc.js b/src/encode/encode-ipc.js
new file mode 100644
index 0000000..177d1e4
--- /dev/null
+++ b/src/encode/encode-ipc.js
@@ -0,0 +1,76 @@
+import { MAGIC, MessageHeader } from '../constants.js';
+import { Builder } from './builder.js';
+import { encodeDictionaryBatch } from './dictionary-batch.js';
+import { writeFooter } from './footer.js';
+import { encodeRecordBatch } from './record-batch.js';
+import { encodeSchema } from './schema.js';
+import { writeMessage } from './message.js';
+import { MemorySink } from './sink.js';
+
+/**
+ * Encode assembled data into Arrow IPC binary format.
+ * @param {any} data Assembled table data.
+ * @param {object} options Encoding options.
+ * @param {import('./sink.js').Sink} [options.sink] IPC byte consumer.
+ * @param {'stream' | 'file'} [options.format] Arrow stream or file format.
+ * @returns {import('./sink.js').Sink} The sink used: either the one
+ *  passed in, or a new MemorySink.
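+ * @example
+ * // a sketch: encode to the default in-memory sink and collect the bytes;
+ * // `data` is assumed to be assembled table data as in table-to-ipc.js
+ * const bytes = encodeIPC(data, { format: 'stream' }).finish();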
+ */
+export function encodeIPC(data, { sink, format } = {}) {
+ const { schema, dictionaries = [], records = [], metadata } = data;
+ const builder = new Builder(sink || new MemorySink());
+ const file = format === 'file';
+ const dictBlocks = [];
+ const recordBlocks = [];
+
+ if (file) {
+ builder.addBuffer(MAGIC);
+ } else if (schema) {
+ writeMessage(
+ builder,
+ MessageHeader.Schema,
+ encodeSchema(builder, schema),
+ 0
+ );
+ }
+
+ for (const dict of dictionaries) {
+ const { data } = dict;
+ writeMessage(
+ builder,
+ MessageHeader.DictionaryBatch,
+ encodeDictionaryBatch(builder, dict),
+ data.byteLength,
+ dictBlocks
+ );
+ writeBuffers(builder, data.buffers);
+ }
+
+ for (const batch of records) {
+ writeMessage(
+ builder,
+ MessageHeader.RecordBatch,
+ encodeRecordBatch(builder, batch),
+ batch.byteLength,
+ recordBlocks
+ );
+ writeBuffers(builder, batch.buffers);
+ }
+
+ if (file) {
+ writeFooter(builder, schema, dictBlocks, recordBlocks, metadata);
+ }
+
+ return builder.sink;
+}
+
+/**
+ * Write byte buffers to the builder sink.
+ * Buffers are aligned to 64 bits (8 bytes) as needed.
+ * @param {import('./builder.js').Builder} builder
+ * @param {Uint8Array[]} buffers
+ */
+function writeBuffers(builder, buffers) {
+ for (let i = 0; i < buffers.length; ++i) {
+ builder.addBuffer(buffers[i]); // handles alignment for us
+ }
+}
diff --git a/src/encode/footer.js b/src/encode/footer.js
new file mode 100644
index 0000000..4581d9a
--- /dev/null
+++ b/src/encode/footer.js
@@ -0,0 +1,54 @@
+import { MAGIC, Version } from '../constants.js';
+import { encodeMetadata } from './metadata.js';
+import { encodeSchema } from './schema.js';
+
+/**
+ * Write a file footer.
+ * @param {import('./builder.js').Builder} builder The binary builder.
+ * @param {import('../types.js').Schema} schema The table schema.
+ * @param {import('../types.js').Block[]} dictBlocks Dictionary batch file blocks.
+ * @param {import('../types.js').Block[]} recordBlocks Record batch file blocks.
+ * @param {Map<string, string> | null} metadata File-level metadata.
+ */
+export function writeFooter(builder, schema, dictBlocks, recordBlocks, metadata) {
+ // encode footer flatbuffer
+ const metadataOffset = encodeMetadata(builder, metadata);
+ const recsOffset = builder.addVector(recordBlocks, 24, 8, encodeBlock);
+ const dictsOffset = builder.addVector(dictBlocks, 24, 8, encodeBlock);
+ const schemaOffset = encodeSchema(builder, schema);
+ builder.finish(
+ builder.addObject(5, b => {
+ b.addInt16(0, Version.V5, Version.V1);
+ b.addOffset(1, schemaOffset, 0);
+ b.addOffset(2, dictsOffset, 0);
+ b.addOffset(3, recsOffset, 0);
+ b.addOffset(4, metadataOffset, 0);
+ })
+ );
+ const size = builder.offset();
+
+ // add eos with continuation indicator
+ builder.addInt32(0);
+ builder.addInt32(-1);
+
+ // write builder contents
+ builder.flush();
+
+ // write file tail
+ builder.sink.write(new Uint8Array(Int32Array.of(size).buffer));
+ builder.sink.write(MAGIC);
+}
+
+/**
+ * Encode a file pointer block.
+ * @param {import('./builder.js').Builder} builder
+ * @param {import('../types.js').Block} block
+ * @returns {number} The current block offset.
+ */
+function encodeBlock(builder, { offset, metadataLength, bodyLength }) {
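+ // fields are written in reverse order, since the builder writes
+ // back-to-front; the block struct reads as (offset, metadataLength,
+ // padding, bodyLength)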
+ builder.writeInt64(bodyLength);
+ builder.writeInt32(0);
+ builder.writeInt32(metadataLength);
+ builder.writeInt64(offset);
+ return builder.offset();
+}
diff --git a/src/encode/message.js b/src/encode/message.js
new file mode 100644
index 0000000..d3e970e
--- /dev/null
+++ b/src/encode/message.js
@@ -0,0 +1,44 @@
+import { MessageHeader, Version } from '../constants.js';
+
+/**
+ * Write an IPC message to the builder sink.
+ * @param {import('./builder.js').Builder} builder
+ * @param {import('../types.js').MessageHeader_} headerType
+ * @param {number} headerOffset
+ * @param {number} bodyLength
+ * @param {import('../types.js').Block[]} [blocks]
+ */
+export function writeMessage(builder, headerType, headerOffset, bodyLength, blocks) {
+ builder.finish(
+ builder.addObject(5, b => {
+ b.addInt16(0, Version.V5, Version.V1);
+ b.addInt8(1, headerType, MessageHeader.NONE);
+ b.addOffset(2, headerOffset, 0);
+ b.addInt64(3, bodyLength, 0);
+ // NOT SUPPORTED: 4, message-level metadata
+ })
+ );
+
+ const prefixSize = 8; // continuation indicator + message size
+ const messageSize = builder.offset();
+ const alignedSize = (messageSize + prefixSize + 7) & ~7;
+
+ // track blocks for file footer
+ blocks?.push({
+ offset: builder.outputBytes,
+ metadataLength: alignedSize,
+ bodyLength
+ });
+
+ // write size prefix (including padding)
+ builder.addInt32(alignedSize - prefixSize);
+
+ // write the stream continuation indicator
+ builder.addInt32(-1);
+
+ // flush the builder content
+ builder.flush();
+
+ // add alignment padding as needed
+ builder.addPadding(alignedSize - messageSize - prefixSize);
+}
diff --git a/src/encode/metadata.js b/src/encode/metadata.js
new file mode 100644
index 0000000..5867556
--- /dev/null
+++ b/src/encode/metadata.js
@@ -0,0 +1,17 @@
+/**
+ * @param {import('./builder.js').Builder} builder
+ * @param {Map<string, string>} metadata
+ * @returns {number}
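+ * @example
+ * // a sketch: metadata is encoded as a vector of key-value string pairs
+ * const offset = encodeMetadata(builder, new Map([['key', 'value']]));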
+ */
+export function encodeMetadata(builder, metadata) {
+ return metadata?.size > 0
+ ? builder.addOffsetVector(Array.from(metadata, ([k, v]) => {
+ const key = builder.addString(`${k}`);
+ const val = builder.addString(`${v}`);
+ return builder.addObject(2, b => {
+ b.addOffset(0, key, 0);
+ b.addOffset(1, val, 0);
+ });
+ }))
+ : 0;
+}
diff --git a/src/encode/record-batch.js b/src/encode/record-batch.js
new file mode 100644
index 0000000..7daf38f
--- /dev/null
+++ b/src/encode/record-batch.js
@@ -0,0 +1,32 @@
+/**
+ * @param {import('./builder.js').Builder} builder
+ * @param {import('../types.js').RecordBatch} batch
+ * @returns {number}
+ */
+export function encodeRecordBatch(builder, batch) {
+ const { nodes, regions, variadic } = batch;
+ const nodeVector = builder.addVector(nodes, 16, 8,
+ (builder, node) => {
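+ // write struct fields in reverse; the node reads as (length, nullCount)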
+ builder.writeInt64(node.nullCount);
+ builder.writeInt64(node.length);
+ return builder.offset();
+ }
+ );
+ const regionVector = builder.addVector(regions, 16, 8,
+ (builder, region) => {
+ builder.writeInt64(region.length);
+ builder.writeInt64(region.offset);
+ return builder.offset();
+ }
+ );
+ const variadicVector = builder.addVector(variadic, 8, 8,
+ (builder, count) => builder.addInt64(count)
+ );
+ return builder.addObject(5, b => {
+ b.addInt64(0, nodes[0].length, 0);
+ b.addOffset(1, nodeVector, 0);
+ b.addOffset(2, regionVector, 0);
+ // NOT SUPPORTED: 3, compression offset
+ b.addOffset(4, variadicVector, 0);
+ });
+}
diff --git a/src/encode/schema.js b/src/encode/schema.js
new file mode 100644
index 0000000..42d1d78
--- /dev/null
+++ b/src/encode/schema.js
@@ -0,0 +1,61 @@
+import { Type } from '../constants.js';
+import { encodeDataType } from './data-type.js';
+import { encodeMetadata } from './metadata.js';
+
+const isLittleEndian = new Uint16Array(new Uint8Array([1, 0]).buffer)[0] === 1;
+
+/**
+ * @param {import('./builder.js').Builder} builder
+ * @param {import('../types.js').Schema} schema
+ * @returns {number}
+ */
+export function encodeSchema(builder, schema) {
+ const { fields, metadata } = schema;
+ const fieldOffsets = fields.map(f => encodeField(builder, f));
+ const fieldsVectorOffset = builder.addOffsetVector(fieldOffsets);
+ const metadataOffset = encodeMetadata(builder, metadata);
+ return builder.addObject(4, b => {
+ b.addInt16(0, +(!isLittleEndian), 0);
+ b.addOffset(1, fieldsVectorOffset, 0);
+ b.addOffset(2, metadataOffset, 0);
+ // NOT SUPPORTED: 3, features
+ });
+}
+
+/**
+ * @param {import('./builder.js').Builder} builder
+ * @param {import('../types.js').Field} field
+ * @returns {number}
+ */
+function encodeField(builder, field) {
+ const { name, nullable, type, metadata } = field;
+ let { typeId } = type;
+
+ // encode field data type
+ let typeOffset = 0;
+ let dictionaryOffset = 0;
+ if (typeId !== Type.Dictionary) {
+ typeOffset = encodeDataType(builder, type);
+ } else {
+ const dict = /** @type {import('../types.js').DictionaryType} */ (type).dictionary;
+ typeId = dict.typeId;
+ dictionaryOffset = encodeDataType(builder, type);
+ typeOffset = encodeDataType(builder, dict);
+ }
+
+ // encode children, metadata, name, and field object
+ // @ts-ignore
+ const childOffsets = (type.children || []).map(f => encodeField(builder, f));
+ const childrenVectorOffset = builder.addOffsetVector(childOffsets);
+ const metadataOffset = encodeMetadata(builder, metadata);
+ const nameOffset = builder.addString(name);
+ return builder.addObject(7, b => {
+ b.addOffset(0, nameOffset, 0);
+ b.addInt8(1, +nullable, +false);
+ b.addInt8(2, typeId, Type.NONE);
+ b.addOffset(3, typeOffset, 0);
+ b.addOffset(4, dictionaryOffset, 0);
+ b.addOffset(5, childrenVectorOffset, 0);
+ b.addOffset(6, metadataOffset, 0);
+ });
+}
diff --git a/src/encode/sink.js b/src/encode/sink.js
new file mode 100644
index 0000000..a0b327b
--- /dev/null
+++ b/src/encode/sink.js
@@ -0,0 +1,55 @@
+export class Sink {
+ /**
+ * Write bytes to this sink.
+ * @param {Uint8Array} bytes The byte buffer to write.
+ */
+ write(bytes) { // eslint-disable-line no-unused-vars
+ }
+
+ /**
+ * Write padding bytes (zeroes) to this sink.
+ * @param {number} byteCount The number of padding bytes.
+ */
+ pad(byteCount) {
+ this.write(new Uint8Array(byteCount));
+ }
+
+ /**
+ * @returns {Uint8Array | null}
+ */
+ finish() {
+ return null;
+ }
+}
+
+export class MemorySink extends Sink {
+ /**
+ * A sink that collects bytes in memory.
+ */
+ constructor() {
+ super();
+ this.buffers = [];
+ }
+
+ /**
+ * Write bytes
+ * @param {Uint8Array} bytes
+ */
+ write(bytes) {
+ this.buffers.push(bytes);
+ }
+
+ /**
+ * @returns {Uint8Array}
+ */
+ finish() {
+ const bufs = this.buffers;
+ const size = bufs.reduce((sum, b) => sum + b.byteLength, 0);
+ const buf = new Uint8Array(size);
+ for (let i = 0, off = 0; i < bufs.length; ++i) {
+ buf.set(bufs[i], off);
+ off += bufs[i].byteLength;
+ }
+ return buf;
+ }
+}
diff --git a/src/encode/table-to-ipc.js b/src/encode/table-to-ipc.js
new file mode 100644
index 0000000..6f9b110
--- /dev/null
+++ b/src/encode/table-to-ipc.js
@@ -0,0 +1,225 @@
+import { Type, UnionMode } from '../constants.js';
+import { invalidDataType } from '../data-types.js';
+import { encodeIPC } from './encode-ipc.js';
+
+/**
+ * Encode an Arrow table into Arrow IPC binary format.
+ * @param {import('../table.js').Table} table The Arrow table to encode.
+ * @param {object | 'stream' | 'file'} [options] Encoding options, or a
+ *  format string for Arrow-JS API compatibility.
+ * @param {import('./sink.js').Sink} [options.sink] IPC byte consumer.
+ * @param {'stream' | 'file'} [options.format] Arrow stream or file format.
+ * @returns {Uint8Array | null} The generated bytes (for an in-memory sink)
+ * or null (if using a sink that writes bytes elsewhere).
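+ * @example
+ * // a sketch: serialize a table to Arrow IPC bytes in the file format
+ * const bytes = tableToIPC(table, { format: 'file' });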
+ */
+export function tableToIPC(table, options) {
+ // accept a format string option for Arrow-JS compatibility
+ if (typeof options === 'string') {
+ options = { format: options };
+ }
+ const schema = table.schema;
+ const columns = table.children;
+ const dictionaries = assembleDictionaryBatches(columns);
+ const records = assembleRecordBatches(columns);
+ const data = { schema, dictionaries, records };
+ return encodeIPC(data, options).finish();
+}
+
+function assembleContext() {
+ let byteLength = 0;
+ const nodes = [];
+ const regions = [];
+ const buffers = [];
+ const variadic = [];
+ return {
+ /**
+ * @param {number} length
+ * @param {number} nullCount
+ */
+ node(length, nullCount) {
+ nodes.push({ length, nullCount });
+ },
+ /**
+ * @param {import('../types.js').TypedArray} b
+ */
+ buffer(b) {
+ const size = b.byteLength;
+ const length = ((size + 7) & ~7);
+ regions.push({ offset: byteLength, length });
+ byteLength += length;
+ buffers.push(new Uint8Array(b.buffer, b.byteOffset, size));
+ },
+ /**
+ * @param {number} length
+ */
+ variadic(length) {
+ variadic.push(length);
+ },
+ /**
+ * @param {import('../types.js').DataType} type
+ * @param {import('../batch.js').Batch} batch
+ */
+ children(type, batch) {
+ // @ts-ignore
+ type.children.forEach((field, index) => {
+ visit(field.type, batch.children[index], this);
+ });
+ },
+ /**
+ * @returns {import('../types.js').RecordBatch}
+ */
+ done() {
+ return { byteLength, nodes, regions, variadic, buffers };
+ }
+ };
+}
+
+/**
+ * @param {import('../column.js').Column[]} columns
+ * @returns {import('../types.js').DictionaryBatch[]}
+ */
+function assembleDictionaryBatches(columns) {
+ const dictionaries = [];
+ const seen = new Set;
+
+ for (const col of columns) {
+ const { type } = col;
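+ // only dictionary-typed columns apply; the Dictionary typeId is -1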
+ if (type.typeId !== -1) continue;
+ if (seen.has(type.id)) continue;
+ seen.add(type.id);
+
+ // pass dictionary and deltas as-is
+ // @ts-ignore
+ const dict = col.data[0].dictionary;
+ for (let i = 0; i < dict.data.length; ++i) {
+ dictionaries.push({
+ id: type.id,
+ isDelta: i > 0,
+ data: assembleRecordBatch([dict], i)
+ });
+ }
+ }
+
+ return dictionaries;
+}
+
+/**
+ * @param {import('../column.js').Column[]} columns
+ * @returns {import('../types.js').RecordBatch[]}
+ */
+function assembleRecordBatches(columns) {
+ return (columns[0]?.data || [])
+ .map((_, index) => assembleRecordBatch(columns, index));
+}
+
+/**
+ * @param {import('../column.js').Column[]} columns
+ * @returns {import('../types.js').RecordBatch}
+ */
+function assembleRecordBatch(columns, batchIndex = 0) {
+ const ctx = assembleContext();
+ columns.forEach(column => {
+ visit(column.type, column.data[batchIndex], ctx);
+ });
+ return ctx.done();
+}
+
+/**
+ * Visit a column batch, assembling buffer information.
+ * @param {import('../types.js').DataType} type
+ * @param {import('../batch.js').Batch} batch
+ * @param {ReturnType<typeof assembleContext>} ctx
+ */
+function visit(type, batch, ctx) {
+ const { typeId } = type;
+
+ // no field node, no buffers
+ if (typeId === Type.Null) return;
+
+ // record field node info
+ ctx.node(batch.length, batch.nullCount);
+
+ switch (typeId) {
+ // validity and value buffers
+ // backing dictionaries handled elsewhere
+ case Type.Bool:
+ case Type.Int:
+ case Type.Time:
+ case Type.Duration:
+ case Type.Float:
+ case Type.Date:
+ case Type.Timestamp:
+ case Type.Decimal:
+ case Type.Interval:
+ case Type.FixedSizeBinary:
+ case Type.Dictionary: // dict key values
+ ctx.buffer(batch.validity);
+ ctx.buffer(batch.values);
+ return;
+
+ // validity, offset, and value buffers
+ case Type.Utf8:
+ case Type.LargeUtf8:
+ case Type.Binary:
+ case Type.LargeBinary:
+ ctx.buffer(batch.validity);
+ ctx.buffer(batch.offsets);
+ ctx.buffer(batch.values);
+ return;
+
+ // views with variadic buffers
+ case Type.BinaryView:
+ case Type.Utf8View:
+ ctx.buffer(batch.validity);
+ ctx.buffer(batch.values);
+ // @ts-ignore
+ ctx.variadic(batch.data.length);
+ // @ts-ignore
+ batch.data.forEach(b => ctx.buffer(b));
+ return;
+
+ // validity, offset, and list child
+ case Type.List:
+ case Type.LargeList:
+ case Type.Map:
+ ctx.buffer(batch.validity);
+ ctx.buffer(batch.offsets);
+ ctx.children(type, batch);
+ return;
+
+ // validity, offset, size, and list child
+ case Type.ListView:
+ case Type.LargeListView:
+ ctx.buffer(batch.validity);
+ ctx.buffer(batch.offsets);
+ ctx.buffer(batch.sizes);
+ ctx.children(type, batch);
+ return;
+
+ // validity and children
+ case Type.FixedSizeList:
+ case Type.Struct:
+ ctx.buffer(batch.validity);
+ ctx.children(type, batch);
+ return;
+
+ // children only
+ case Type.RunEndEncoded:
+ ctx.children(type, batch);
+ return;
+
+ // union
+ case Type.Union: {
+ // @ts-ignore
+ ctx.buffer(batch.typeIds);
+ if (type.mode === UnionMode.Dense) {
+ ctx.buffer(batch.offsets);
+ }
+ ctx.children(type, batch);
+ return;
+ }
+
+ // unsupported type
+ default:
+ throw new Error(invalidDataType(typeId));
+ }
+}
diff --git a/src/index.js b/src/index.js
index 86b6fa3..ed21313 100644
--- a/src/index.js
+++ b/src/index.js
@@ -1,2 +1,51 @@
-export { Version, Endianness, Type, Precision, DateUnit, TimeUnit, IntervalUnit, UnionMode } from './constants.js';
-export { tableFromIPC } from './table-from-ipc.js';
+export {
+ Version,
+ Endianness,
+ Type,
+ Precision,
+ DateUnit,
+ TimeUnit,
+ IntervalUnit,
+ UnionMode
+} from './constants.js';
+
+export {
+ field,
+ nullType,
+ int, int8, int16, int32, int64, uint8, uint16, uint32, uint64,
+ float, float16, float32, float64,
+ binary,
+ utf8,
+ bool,
+ decimal,
+ date, dateDay, dateMillisecond,
+ dictionary,
+ time, timeSecond, timeMillisecond, timeMicrosecond, timeNanosecond,
+ timestamp,
+ interval,
+ list,
+ struct,
+ union,
+ fixedSizeBinary,
+ fixedSizeList,
+ map,
+ duration,
+ largeBinary,
+ largeUtf8,
+ largeList,
+ runEndEncoded,
+ binaryView,
+ utf8View,
+ listView,
+ largeListView
+} from './data-types.js';
+
+export { Column } from './column.js';
+export { Table } from './table.js';
+export { Batch } from './batch.js';
+export { batchType } from './batch-type.js';
+export { tableFromIPC } from './decode/table-from-ipc.js';
+export { tableToIPC } from './encode/table-to-ipc.js';
+export { tableFromArrays } from './build/table-from-arrays.js';
+export { tableFromColumns } from './build/table-from-columns.js';
+export { columnFromArray } from './build/column-from-array.js';
diff --git a/src/table-from-ipc.js b/src/table-from-ipc.js
deleted file mode 100644
index 371a0c2..0000000
--- a/src/table-from-ipc.js
+++ /dev/null
@@ -1,288 +0,0 @@
-import { int8 } from './array-types.js';
-import {
- BinaryBatch,
- BinaryViewBatch,
- BoolBatch,
- DateBatch,
- DateDayBatch,
- DateDayMillisecondBatch,
- DecimalBatch,
- DenseUnionBatch,
- DictionaryBatch,
- DirectBatch,
- FixedBinaryBatch,
- FixedListBatch,
- Float16Batch,
- Int64Batch,
- IntervalDayTimeBatch,
- IntervalMonthDayNanoBatch,
- IntervalYearMonthBatch,
- LargeBinaryBatch,
- LargeListBatch,
- LargeListViewBatch,
- LargeUtf8Batch,
- ListBatch,
- ListViewBatch,
- MapBatch,
- MapEntryBatch,
- NullBatch,
- RunEndEncodedBatch,
- SparseUnionBatch,
- StructBatch,
- TimestampMicrosecondBatch,
- TimestampMillisecondBatch,
- TimestampNanosecondBatch,
- TimestampSecondBatch,
- Utf8Batch,
- Utf8ViewBatch
-} from './batch.js';
-import { columnBuilder } from './column.js';
-import {
- DateUnit, IntervalUnit, Precision, TimeUnit, Type, UnionMode, Version
-} from './constants.js';
-import { parseIPC } from './parse-ipc.js';
-import { Table } from './table.js';
-import { keyFor } from './util.js';
-
-/**
- * Decode [Apache Arrow IPC data][1] and return a new Table. The input binary
- * data may be either an `ArrayBuffer` or `Uint8Array`. For Arrow data in the
- * [IPC 'stream' format][2], an array of `Uint8Array` values is also supported.
- *
- * [1]: https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc
- * [2]: https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format
- * @param {ArrayBuffer | Uint8Array | Uint8Array[]} data
- * The source byte buffer, or an array of buffers. If an array, each byte
- * array may contain one or more self-contained messages. Messages may NOT
- * span multiple byte arrays.
- * @param {import('./types.js').ExtractionOptions} [options]
- * Options for controlling how values are transformed when extracted
- * from am Arrow binary representation.
- * @returns {Table} A Table instance.
- */
-export function tableFromIPC(data, options) {
- return createTable(parseIPC(data), options);
-}
-
-/**
- * Create a table from parsed IPC data.
- * @param {import('./types.js').ArrowData} data
- * The IPC data, as returned by parseIPC.
- * @param {import('./types.js').ExtractionOptions} [options]
- * Options for controlling how values are transformed when extracted
- * from am Arrow binary representation.
- * @returns {Table} A Table instance.
- */
-export function createTable(data, options = {}) {
- const { schema = { fields: [] }, dictionaries, records } = data;
- const { version, fields, dictionaryTypes } = schema;
- const dictionaryMap = new Map;
- const context = contextGenerator(options, version, dictionaryMap);
-
- // decode dictionaries
- const dicts = new Map;
- for (const dict of dictionaries) {
- const { id, data, isDelta, body } = dict;
- const type = dictionaryTypes.get(id);
- const batch = visit(type, context({ ...data, body }));
- if (!dicts.has(id)) {
- if (isDelta) {
- throw new Error('Delta update can not be first dictionary batch.');
- }
- dicts.set(id, columnBuilder(type).add(batch));
- } else {
- const dict = dicts.get(id);
- if (!isDelta) dict.clear();
- dict.add(batch);
- }
- }
- dicts.forEach((value, key) => dictionaryMap.set(key, value.done()));
-
- // decode column fields
- const cols = fields.map(f => columnBuilder(f.type));
- for (const batch of records) {
- const ctx = context(batch);
- fields.forEach((f, i) => cols[i].add(visit(f.type, ctx)));
- }
-
- return new Table(schema, cols.map(c => c.done()));
-}
-
-/**
- * Context object generator for field visitation and buffer definition.
- */
-function contextGenerator(options, version, dictionaryMap) {
- const base = {
- version,
- options,
- dictionary: id => dictionaryMap.get(id),
- };
-
- // return a context generator
- return batch => {
- const { length, nodes, buffers, variadic, body } = batch;
- let nodeIndex = -1;
- let bufferIndex = -1;
- let variadicIndex = -1;
- return {
- ...base,
- length,
- node: () => nodes[++nodeIndex],
- buffer: (ArrayType) => {
- const { length, offset } = buffers[++bufferIndex];
- return ArrayType
- ? new ArrayType(body.buffer, body.byteOffset + offset, length / ArrayType.BYTES_PER_ELEMENT)
- : body.subarray(offset, offset + length)
- },
- variadic: () => variadic[++variadicIndex],
- visitAll(list) { return list.map(x => visit(x.type, this)); }
- };
- };
-}
-
-/**
- * Visit a field, instantiating views of buffer regions.
- */
-function visit(type, ctx) {
- const { typeId, bitWidth, precision, scale, stride, unit } = type;
- const { useBigInt, useDate, useMap } = ctx.options;
-
- // no field node, no buffers
- if (typeId === Type.Null) {
- const { length } = ctx;
- return new NullBatch({ length, nullCount: length });
- }
-
- // extract the next { length, nullCount } field node
- const node = ctx.node();
-
- // batch constructors
- const value = (BatchType, opt) => new BatchType({
- ...node,
- ...opt,
- validity: ctx.buffer(),
- values: ctx.buffer(type.values)
- });
- const offset = (BatchType) => new BatchType({
- ...node,
- validity: ctx.buffer(),
- offsets: ctx.buffer(type.offsets),
- values: ctx.buffer()
- });
- const view = (BatchType) => new BatchType({
- ...node,
- validity: ctx.buffer(),
- values: ctx.buffer(), // views buffer
- data: Array.from({ length: ctx.variadic() }, () => ctx.buffer()) // data buffers
- });
- const list = (BatchType) => new BatchType({
- ...node,
- validity: ctx.buffer(),
- offsets: ctx.buffer(type.offsets),
- children: ctx.visitAll(type.children)
- });
- const listview = (BatchType) => new BatchType({
- ...node,
- validity: ctx.buffer(),
- offsets: ctx.buffer(type.offsets),
- sizes: ctx.buffer(type.offsets),
- children: ctx.visitAll(type.children)
- });
- const kids = (BatchType, opt) => new BatchType({
- ...node,
- ...opt,
- validity: ctx.buffer(),
- children: ctx.visitAll(type.children)
- });
- const date = useDate
- ? (BatchType) => new DateBatch(value(BatchType))
- : value;
-
- switch (typeId) {
- // validity and data value buffers
- case Type.Bool:
- return value(BoolBatch);
- case Type.Int:
- case Type.Time:
- case Type.Duration:
- return value(bitWidth === 64 && !useBigInt ? Int64Batch : DirectBatch);
- case Type.Float:
- return value(precision === Precision.HALF ? Float16Batch : DirectBatch);
- case Type.Date:
- return date(unit === DateUnit.DAY ? DateDayBatch : DateDayMillisecondBatch);
- case Type.Timestamp:
- return date(unit === TimeUnit.SECOND ? TimestampSecondBatch
- : unit === TimeUnit.MILLISECOND ? TimestampMillisecondBatch
- : unit === TimeUnit.MICROSECOND ? TimestampMicrosecondBatch
- : TimestampNanosecondBatch);
- case Type.Decimal:
- return value(DecimalBatch, { bitWidth, scale });
- case Type.Interval:
- return value(unit === IntervalUnit.DAY_TIME ? IntervalDayTimeBatch
- : unit === IntervalUnit.YEAR_MONTH ? IntervalYearMonthBatch
- : IntervalMonthDayNanoBatch);
- case Type.FixedSizeBinary:
- return value(FixedBinaryBatch, { stride });
-
- // validity, offset, and value buffers
- case Type.Utf8: return offset(Utf8Batch);
- case Type.LargeUtf8: return offset(LargeUtf8Batch);
- case Type.Binary: return offset(BinaryBatch);
- case Type.LargeBinary: return offset(LargeBinaryBatch);
-
- // views with variadic buffers
- case Type.BinaryView: return view(BinaryViewBatch);
- case Type.Utf8View: return view(Utf8ViewBatch);
-
- // validity, offset, and list child
- case Type.List: return list(ListBatch);
- case Type.LargeList: return list(LargeListBatch);
- case Type.Map: return list(useMap ? MapBatch : MapEntryBatch);
-
- // validity, offset, size, and list child
- case Type.ListView: return listview(ListViewBatch);
- case Type.LargeListView: return listview(LargeListViewBatch);
-
- // validity and children
- case Type.FixedSizeList: return kids(FixedListBatch, { stride });
- case Type.Struct: return kids(StructBatch, {
- names: type.children.map(child => child.name)
- });
-
- // children only
- case Type.RunEndEncoded: return new RunEndEncodedBatch({
- ...node,
- children: ctx.visitAll(type.children)
- });
-
- // dictionary
- case Type.Dictionary: {
- const { id, keys } = type;
- return new DictionaryBatch({
- ...node,
- validity: ctx.buffer(),
- values: ctx.buffer(keys.values),
- dictionary: ctx.dictionary(id)
- });
- }
-
- // union
- case Type.Union: {
- if (ctx.version < Version.V5) {
- ctx.buffer(); // skip unused null bitmap
- }
- const isSparse = type.mode === UnionMode.Sparse;
- return new (isSparse ? SparseUnionBatch : DenseUnionBatch)({
- ...node,
- map: type.typeIds.reduce((map, id, i) => ((map[id] = i), map), {}),
- typeIds: ctx.buffer(int8),
- offsets: isSparse ? null : ctx.buffer(type.offsets),
- children: ctx.visitAll(type.children)
- });
- }
-
- // unsupported type
- default:
- throw new Error(`Unsupported type: ${typeId} ${keyFor(Type, typeId)}`);
- }
-}
diff --git a/src/table.js b/src/table.js
index 1e8f2cc..5ec759e 100644
--- a/src/table.js
+++ b/src/table.js
@@ -1,8 +1,8 @@
-import { bisect } from './util.js';
+import { bisect } from './util/arrays.js';
/**
* A table consists of a collection of named columns (or 'children').
- * To work with table data directly in JavaScript, usse `toColumns()`
+ * To work with table data directly in JavaScript, use `toColumns()`
* to extract an object that maps column names to extracted value arrays,
* or `toArray()` to extract an array of row objects. For random access
* by row index, use `getChild()` to access data for a specific column.
diff --git a/src/types.ts b/src/types.ts
index 1ce706f..cf12927 100644
--- a/src/types.ts
+++ b/src/types.ts
@@ -9,11 +9,6 @@ import {
UnionMode
} from './constants.js';
-// additional jsdoc types to export
-export { Batch } from './batch.js';
-export { Column } from './column.js';
-export { Table } from './table.js';
-
/** A valid Arrow version number. */
export type Version_ = typeof Version[keyof typeof Version];
@@ -61,10 +56,13 @@ export type IntArrayConstructor =
| Uint8ArrayConstructor
| Uint16ArrayConstructor
| Uint32ArrayConstructor
- | BigUint64ArrayConstructor
| Int8ArrayConstructor
| Int16ArrayConstructor
| Int32ArrayConstructor
+ | Int64ArrayConstructor;
+
+export type Int64ArrayConstructor =
+ | BigUint64ArrayConstructor
| BigInt64ArrayConstructor;
export type FloatArrayConstructor =
@@ -121,7 +119,7 @@ export interface Field {
export type IntBitWidth = 8 | 16 | 32 | 64;
/** Dictionary-encoded data type. */
-export type DictionaryType = { typeId: -1, type: DataType, id: number, keys: IntType, ordered: boolean };
+export type DictionaryType = { typeId: -1, dictionary: DataType, id: number, indices: IntType, ordered: boolean };
/** None data type. */
export type NoneType = { typeId: 0 };
@@ -145,7 +143,7 @@ export type Utf8Type = { typeId: 5, offsets: Int32ArrayConstructor };
export type BoolType = { typeId: 6 };
/** Fixed decimal number data type. */
-export type DecimalType = { typeId: 7, precision: number, scale: number, bitWidth: 128 | 256, values: Uint32ArrayConstructor };
+export type DecimalType = { typeId: 7, precision: number, scale: number, bitWidth: 128 | 256, values: BigUint64ArrayConstructor };
/** Date data type. */
export type DateType = { typeId: 8, unit: DateUnit_, values: DateTimeArrayConstructor };
@@ -166,7 +164,7 @@ export type ListType = { typeId: 12, children: [Field], offsets: Int32ArrayConst
export type StructType = { typeId: 13, children: Field[] };
/** Union data type. */
-export type UnionType = { typeId: 14, mode: UnionMode_, typeIds: number[], children: Field[], offsets: Int32ArrayConstructor };
+export type UnionType = { typeId: 14, mode: UnionMode_, typeIds: number[], typeMap: Record<number, number>, children: Field[], typeIdForValue?: (value: any, index: number) => number, offsets: Int32ArrayConstructor };
/** Fixed-size opaque binary data type. */
export type FixedSizeBinaryType = { typeId: 15, stride: number };
@@ -175,7 +173,7 @@ export type FixedSizeBinaryType = { typeId: 15, stride: number };
export type FixedSizeListType = { typeId: 16, stride: number, children: Field[] };
/** Key-value map data type. */
-export type MapType = { typeId: 17, keysSorted: boolean, children: [Field, Field], offsets: Int32ArrayConstructor };
+export type MapType = { typeId: 17, keysSorted: boolean, children: [Field], offsets: Int32ArrayConstructor };
/** Duration data type. */
export type DurationType = { typeId: 18, unit: TimeUnit_, values: BigInt64ArrayConstructor };
@@ -241,11 +239,13 @@ export type DataType =
* Arrow IPC record batch message.
*/
export interface RecordBatch {
- length: number;
+ length?: number;
nodes: {length: number, nullCount: number}[];
- buffers: {offset: number, length: number}[];
+ regions: {offset: number, length: number}[];
variadic: number[];
body?: Uint8Array;
+ buffers?: Uint8Array[];
+ byteLength?: number;
}
/**
@@ -282,9 +282,21 @@ export interface Message {
content?: Schema | RecordBatch | DictionaryBatch;
}
+/**
+ * A pointer block in the Arrow IPC 'file' format.
+ */
+export interface Block {
+ /** The file byte offset to the message start. */
+ offset: number,
+ /** The size of the message header metadata. */
+ metadataLength: number,
+ /** The size of the message body. */
+ bodyLength: number
+}
+
/**
* Options for controlling how values are transformed when extracted
- * from am Arrow binary representation.
+ * from an Arrow binary representation.
*/
export interface ExtractionOptions {
/**
@@ -292,6 +304,12 @@ export interface ExtractionOptions {
* Otherwise, return numerical timestamp values (default).
*/
useDate?: boolean;
+ /**
+ * If true, extract decimal-type data as BigInt values, where fractional
+ * digits are scaled to integers. Otherwise, return converted floating-point
+ * numbers (default).
+ */
+ useDecimalBigInt?: boolean;
/**
* If true, extract 64-bit integers as JavaScript `BigInt` values.
* Otherwise, coerce long integers to JavaScript number values (default).
@@ -304,3 +322,25 @@ export interface ExtractionOptions {
*/
useMap?: boolean;
}
+
+/**
+ * Options for building new columns and controlling how values are
+ * transformed when extracted from an Arrow binary representation.
+ */
+export interface ColumnBuilderOptions extends ExtractionOptions {
+ /**
+ * The maximum number of rows to include in a single record batch.
+ */
+ maxBatchRows?: number;
+}
+
+/**
+ * Options for building new tables and controlling how values are
+ * transformed when extracted from an Arrow binary representation.
+ */
+export interface TableBuilderOptions extends ColumnBuilderOptions {
+ /**
+ * A map from column names to Arrow data types.
+ */
+ types?: Record<string, DataType>;
+}
diff --git a/src/util/arrays.js b/src/util/arrays.js
new file mode 100644
index 0000000..fab0bbb
--- /dev/null
+++ b/src/util/arrays.js
@@ -0,0 +1,143 @@
+export const uint8Array = Uint8Array;
+export const uint16Array = Uint16Array;
+export const uint32Array = Uint32Array;
+export const uint64Array = BigUint64Array;
+export const int8Array = Int8Array;
+export const int16Array = Int16Array;
+export const int32Array = Int32Array;
+export const int64Array = BigInt64Array;
+export const float32Array = Float32Array;
+export const float64Array = Float64Array;
+
+/**
+ * Return the appropriate typed array constructor for the given
+ * integer type metadata.
+ * @param {number} bitWidth The integer size in bits.
+ * @param {boolean} signed Flag indicating if the integer is signed.
+ * @returns {import('../types.js').IntArrayConstructor}
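+ * @example
+ * intArrayType(32, true) === Int32Array; // true
+ * intArrayType(64, false) === BigUint64Array; // true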
+ */
+export function intArrayType(bitWidth, signed) {
+ const i = Math.log2(bitWidth) - 3;
+ return (
+ signed
+ ? [int8Array, int16Array, int32Array, int64Array]
+ : [uint8Array, uint16Array, uint32Array, uint64Array]
+ )[i];
+}
+
+/** Shared prototype for typed arrays. */
+const TypedArray = Object.getPrototypeOf(Int8Array);
+
+/**
+ * Check if a value is a typed array.
+ * @param {*} value The value to check.
+ * @returns {value is import('../types.js').TypedArray}
+ * True if value is a typed array, false otherwise.
+ */
+export function isTypedArray(value) {
+ return value instanceof TypedArray;
+}
+
+/**
+ * Check if a value is either a standard array or typed array.
+ * @param {*} value The value to check.
+ * @returns {value is (Array | import('../types.js').TypedArray)}
+ * True if value is an array, false otherwise.
+ */
+export function isArray(value) {
+ return Array.isArray(value) || isTypedArray(value);
+}
+
+/**
+ * Check if a value is an array type (constructor) for 64-bit integers,
+ * one of BigInt64Array or BigUint64Array.
+ * @param {*} value The value to check.
+ * @returns {value is import('../types.js').Int64ArrayConstructor}
+ * True if value is a 64-bit array type, false otherwise.
+ */
+export function isInt64ArrayType(value) {
+ return value === int64Array || value === uint64Array;
+}
+
+/**
+ * Determine the correct index into an offset array for a given
+ * full column row index. Assumes offset indices can be manipulated
+ * as 32-bit signed integers.
+ * @param {import('../types.js').IntegerArray} offsets The offsets array.
+ * @param {number} index The full column row index.
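+ * @example
+ * // returns the first index whose offset exceeds the row index
+ * bisect([0, 3, 7], 5); // 2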
+ */
+export function bisect(offsets, index) {
+ let a = 0;
+ let b = offsets.length;
+ if (b <= 2147483648) { // 2 ** 31
+ // fast version, use unsigned bit shift
+ // array length fits within 32-bit signed integer
+ do {
+ const mid = (a + b) >>> 1;
+ if (offsets[mid] <= index) a = mid + 1;
+ else b = mid;
+ } while (a < b);
+ } else {
+ // slow version, use division and truncate
+ // array length exceeds 32-bit signed integer
+ do {
+ const mid = Math.trunc((a + b) / 2);
+ if (offsets[mid] <= index) a = mid + 1;
+ else b = mid;
+ } while (a < b);
+ }
+ return a;
+}
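+
+// For example, with list offsets [0, 3, 5, 8], full column row 4 belongs to
+// the list spanning offsets[1]..offsets[2], and bisect returns the
+// upper-bound position: bisect(Int32Array.of(0, 3, 5, 8), 4) === 2.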
+
+/**
+ * Compute a 64-bit aligned buffer size.
+ * @param {number} length The starting size.
+ * @param {number} bpe Bytes per element.
+ * @returns {number} The aligned size.
+ */
+function align64(length, bpe = 1) {
+ return (((length * bpe) + 7) & ~7) / bpe;
+}
+
+/**
+ * Return a 64-bit aligned version of the array.
+ * @template {import('../types.js').TypedArray} T
+ * @param {T} array The array.
+ * @param {number} length The current array length.
+ * @returns {T} The aligned array.
+ */
+export function align(array, length = array.length) {
+ const alignedLength = align64(length, array.BYTES_PER_ELEMENT);
+ return array.length > alignedLength ? /** @type {T} */ (array.subarray(0, alignedLength))
+ : array.length < alignedLength ? resize(array, alignedLength)
+ : array;
+}
+
+/**
+ * Resize a typed array to exactly the specified length.
+ * @template {import('../types.js').TypedArray} T
+ * @param {T} array The array.
+ * @param {number} newLength The new length.
+ * @returns {T} The resized array.
+ */
+export function resize(array, newLength) {
+ // @ts-ignore
+ const newArray = new array.constructor(newLength);
+  newArray.set(array); // copy existing content to the start of the new array
+ return newArray;
+}
+
+/**
+ * Grow a typed array to accommodate a minimum length. The array size is
+ * doubled until it meets or exceeds the minimum length.
+ * @template {import('../types.js').TypedArray} T
+ * @param {T} array The array.
+ * @param {number} minLength The minimum length.
+ * @returns {T} The resized array.
+ */
+export function grow(array, minLength) {
+ while (array.length < minLength) {
+ array = resize(array, array.length << 1);
+ }
+ return array;
+}
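+
+// For example, growing a 4-element array to hold at least 6 elements doubles
+// its length to 8, which is already 64-bit aligned for 1-byte elements:
+//   grow(Int8Array.of(1, 2, 3, 4), 6).length === 8
+//   align(Int8Array.of(1, 2, 3)).length === 8   // padded up from 3 bytes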
diff --git a/src/util/numbers.js b/src/util/numbers.js
new file mode 100644
index 0000000..2e1d51d
--- /dev/null
+++ b/src/util/numbers.js
@@ -0,0 +1,217 @@
+import { float64Array, int32Array, int64Array, isInt64ArrayType, uint32Array, uint8Array } from './arrays.js';
+import { TimeUnit } from '../constants.js';
+
+const f64 = new float64Array(2);
+const buf = f64.buffer;
+const i64 = new int64Array(buf);
+const u32 = new uint32Array(buf);
+const i32 = new int32Array(buf);
+const u8 = new uint8Array(buf);
+
+export function identity(value) {
+ return value;
+}
+
+export function toBigInt(value) {
+ return BigInt(value);
+}
+
+export function toOffset(type) {
+ return isInt64ArrayType(type.offsets) ? toBigInt : identity;
+}
+
+export function toDateDay(value) {
+ return (value / 864e5) | 0;
+}
+
+export function toTimestamp(unit) {
+ return unit === TimeUnit.SECOND ? value => toBigInt(value / 1e3)
+ : unit === TimeUnit.MILLISECOND ? toBigInt
+ : unit === TimeUnit.MICROSECOND ? value => toBigInt(value * 1e3)
+ : value => toBigInt(value * 1e6);
+}
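+
+// For example, millisecond inputs convert directly to bigint, while
+// microsecond timestamps are scaled up by 1000:
+//   toTimestamp(TimeUnit.MICROSECOND)(1000) === 1000000n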
+
+export function toYearMonth(value) {
+ return (value[0] * 12) + (value[1] % 12);
+}
+
+export function toMonthDayNanoBytes([m, d, n]) {
+ i32[0] = m;
+ i32[1] = d;
+ i64[1] = toBigInt(n);
+ return u8;
+}
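+
+// For example, toMonthDayNanoBytes([1, 2, 3]) packs month 1 and day 2 as
+// 32-bit integers into bytes 0-7 and 3n nanoseconds as a 64-bit integer
+// into bytes 8-15 of the shared 16-byte scratch buffer.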
+
+/**
+ * Coerce a bigint value to a number. Throws an error if the bigint value
+ * lies outside the range of what a number can precisely represent.
+ * @param {bigint} value The value to check and possibly convert.
+ * @returns {number} The converted number value.
+ */
+export function toNumber(value) {
+ if (value > Number.MAX_SAFE_INTEGER || value < Number.MIN_SAFE_INTEGER) {
+ throw Error(`BigInt exceeds integer number representation: ${value}`);
+ }
+ return Number(value);
+}
+
+/**
+ * Divide one BigInt value by another, and return the result as a number.
+ * The division may involve unsafe integers and a loss of precision.
+ * @param {bigint} num The numerator.
+ * @param {bigint} div The divisor.
+ * @returns {number} The result of the division as a floating point number.
+ */
+export function divide(num, div) {
+ return Number(num / div) + Number(num % div) / Number(div);
+}
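+
+// For example, divide(10n, 4n) === 2.5: the bigint quotient (2n) plus the
+// remainder (2n) divided out as a float (0.5).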
+
+/**
+ * Convert a floating point number or bigint to decimal bytes.
+ * @param {number|bigint} value The number to encode. If a bigint, we assume
+ * it already represents the decimal in integer form with the correct scale.
+ * Otherwise, we assume a float that requires scaled integer conversion.
+ * @param {BigUint64Array} buf The uint64 array to write to.
+ * @param {number} offset The starting index offset into the array.
+ * @param {number} stride The stride of an encoded decimal, in 64-bit steps.
+ * @param {number} scale The scale mapping fractional digits to integers.
+ */
+export function toDecimal(value, buf, offset, stride, scale) {
+ const v = typeof value === 'bigint'
+ ? value
+ : toBigInt(Math.trunc(value * scale));
+ // assignment into uint64array performs needed truncation for us
+ buf[offset] = v;
+ buf[offset + 1] = (v >> 64n);
+ if (stride > 2) {
+ buf[offset + 2] = (v >> 128n);
+ buf[offset + 3] = (v >> 192n);
+ }
+}
+
+const asUint64 = v => BigInt.asUintN(64, v);
+
+/**
+ * Convert a 128-bit decimal value to a bigint.
+ * @param {BigUint64Array} buf The uint64 array containing the decimal bytes.
+ * @param {number} offset The starting index offset into the array.
+ * @returns {bigint} The converted decimal as a bigint, such that all
+ * fractional digits are scaled up to integers (for example, 1.12 -> 112).
+ */
+export function fromDecimal128(buf, offset) {
+ const i = offset << 1;
+ let x;
+ if (BigInt.asIntN(64, buf[i + 1]) < 0) {
+ x = asUint64(~buf[i]) | (asUint64(~buf[i + 1]) << 64n);
+ x = -(x + 1n);
+ } else {
+ x = buf[i] | (buf[i + 1] << 64n);
+ }
+ return x;
+}
+
+/**
+ * Convert a 256-bit decimal value to a bigint.
+ * @param {BigUint64Array} buf The uint64 array containing the decimal bytes.
+ * @param {number} offset The starting index offset into the array.
+ * @returns {bigint} The converted decimal as a bigint, such that all
+ * fractional digits are scaled up to integers (for example, 1.12 -> 112).
+ */
+export function fromDecimal256(buf, offset) {
+ const i = offset << 2;
+ let x;
+ if (BigInt.asIntN(64, buf[i + 3]) < 0) {
+ x = asUint64(~buf[i])
+ | (asUint64(~buf[i + 1]) << 64n)
+ | (asUint64(~buf[i + 2]) << 128n)
+ | (asUint64(~buf[i + 3]) << 192n);
+ x = -(x + 1n);
+ } else {
+ x = buf[i]
+ | (buf[i + 1] << 64n)
+ | (buf[i + 2] << 128n)
+ | (buf[i + 3] << 192n);
+ }
+ return x;
+}
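+
+// For example, encoding 1.23 at scale 100 (two fractional digits) stores the
+// scaled integer 123, which fromDecimal128 then recovers as a bigint:
+//   const dec = new BigUint64Array(2);
+//   toDecimal(1.23, dec, 0, 2, 100);   // dec holds [123n, 0n]
+//   fromDecimal128(dec, 0) === 123n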
+
+/**
+ * Convert a 16-bit float from integer bytes to a number.
+ * Adapted from https://github.com/apache/arrow/blob/main/js/src/util/math.ts
+ * @param {number} value The float as a 16-bit integer.
+ * @returns {number} The converted 64-bit floating point number.
+ */
+export function fromFloat16(value) {
+ const expo = (value & 0x7C00) >> 10;
+ const sigf = (value & 0x03FF) / 1024;
+ const sign = (-1) ** ((value & 0x8000) >> 15);
+ switch (expo) {
+ case 0x1F: return sign * (sigf ? Number.NaN : 1 / 0);
+ case 0x00: return sign * (sigf ? 6.103515625e-5 * sigf : 0);
+ }
+ return sign * (2 ** (expo - 15)) * (1 + sigf);
+}
+
+/**
+ * Convert a number to a 16-bit float as integer bytes.
+ * Inspired by numpy's `npy_double_to_half`:
+ * https://github.com/numpy/numpy/blob/5a5987291dc95376bb098be8d8e5391e89e77a2c/numpy/core/src/npymath/halffloat.c#L43
+ * Adapted from https://github.com/apache/arrow/blob/main/js/src/util/math.ts
+ * @param {number} value The 64-bit floating point number to convert.
+ * @returns {number} The converted 16-bit integer.
+ */
+export function toFloat16(value) {
+ if (value !== value) return 0x7E00; // NaN
+ f64[0] = value;
+
+ // Magic numbers:
+ // 0x80000000 = 10000000 00000000 00000000 00000000 -- masks the 32nd bit
+ // 0x7ff00000 = 01111111 11110000 00000000 00000000 -- masks the 21st-31st bits
+ // 0x000fffff = 00000000 00001111 11111111 11111111 -- masks the 1st-20th bit
+ const sign = (u32[1] & 0x80000000) >> 16 & 0xFFFF;
+ let expo = (u32[1] & 0x7FF00000), sigf = 0x0000;
+
+ if (expo >= 0x40F00000) {
+ //
+ // If exponent overflowed, the float16 is either NaN or Infinity.
+ // Rules to propagate the sign bit: mantissa > 0 ? NaN : +/-Infinity
+ //
+ // Magic numbers:
+ // 0x40F00000 = 01000000 11110000 00000000 00000000 -- 6-bit exponent overflow
+ // 0x7C000000 = 01111100 00000000 00000000 00000000 -- masks the 27th-31st bits
+ //
+ // returns:
+ // qNaN, aka 32256 decimal, 0x7E00 hex, or 01111110 00000000 binary
+ // sNaN, aka 32000 decimal, 0x7D00 hex, or 01111101 00000000 binary
+ // +inf, aka 31744 decimal, 0x7C00 hex, or 01111100 00000000 binary
+ // -inf, aka 64512 decimal, 0xFC00 hex, or 11111100 00000000 binary
+ //
+ // If mantissa is greater than 23 bits, set to +Infinity like numpy
+ if (u32[0] > 0) {
+ expo = 0x7C00;
+ } else {
+ expo = (expo & 0x7C000000) >> 16;
+ sigf = (u32[1] & 0x000FFFFF) >> 10;
+ }
+ } else if (expo <= 0x3F000000) {
+ //
+ // If exponent underflowed, the float is either signed zero or subnormal.
+ //
+ // Magic numbers:
+ // 0x3F000000 = 00111111 00000000 00000000 00000000 -- 6-bit exponent underflow
+ //
+ sigf = 0x100000 + (u32[1] & 0x000FFFFF);
+ sigf = 0x100000 + (sigf << ((expo >> 20) - 998)) >> 21;
+ expo = 0;
+ } else {
+ //
+ // No overflow or underflow, rebase the exponent and round the mantissa
+ // Magic numbers:
+ // 0x200 = 00000010 00000000 -- masks off the 10th bit
+ //
+ // Ensure the first mantissa bit (the 10th one) is 1 and round
+ expo = (expo - 0x3F000000) >> 10;
+ sigf = ((u32[1] & 0x000FFFFF) + 0x200) >> 10;
+ }
+ return sign | expo | sigf & 0xFFFF;
+}
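+
+// For example, 1.5 is exactly representable in 16 bits and round-trips:
+//   toFloat16(1.5) === 0x3E00 && fromFloat16(0x3E00) === 1.5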
diff --git a/src/util/objects.js b/src/util/objects.js
new file mode 100644
index 0000000..95c69b3
--- /dev/null
+++ b/src/util/objects.js
@@ -0,0 +1,55 @@
+/**
+ * Check if a value is a Date instance.
+ * @param {*} value The value to check.
+ * @returns {value is Date} True if value is a Date, false otherwise.
+ */
+export function isDate(value) {
+ return value instanceof Date;
+}
+
+/**
+ * Return the input value if it passes a test.
+ * Otherwise throw an error using the given message generator.
+ * @template T
+ * @param {T} value The value to check.
+ * @param {(value: T) => boolean} test The test function.
+ * @param {(value: *) => string} message Message generator.
+ * @returns {T} The input value.
+ * @throws if the value does not pass the test
+ */
+export function check(value, test, message) {
+ if (test(value)) return value;
+ throw new Error(message(value));
+}
+
+/**
+ * Return the input value if it exists in the provided set.
+ * Otherwise throw an error using the given message generator.
+ * @template T
+ * @param {T} value The value to check.
+ * @param {T[] | Record<string, T>} set The set of valid values.
+ * @param {(value: *) => string} [message] Message generator.
+ * @returns {T} The input value.
+ * @throws if the value is not included in the set
+ */
+export function checkOneOf(value, set, message) {
+ set = Array.isArray(set) ? set : Object.values(set);
+ return check(
+ value,
+ (value) => set.includes(value),
+ message ?? (() => `${value} must be one of ${set}`)
+ );
+}
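+
+// For example:
+//   checkOneOf(2, [1, 2, 3]) === 2
+//   checkOneOf(4, [1, 2, 3])   // throws: "4 must be one of 1,2,3"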
+
+/**
+ * Return the first object key that pairs with the given value.
+ * @param {Record<string, any>} object The object to search.
+ * @param {any} value The value to lookup.
+ * @returns {string} The first matching key, or '' if not found.
+ */
+export function keyFor(object, value) {
+ for (const [key, val] of Object.entries(object)) {
+ if (val === value) return key;
+ }
+ return '';
+}
diff --git a/src/util.js b/src/util/read.js
similarity index 59%
rename from src/util.js
rename to src/util/read.js
index 713efb7..ba5d381 100644
--- a/src/util.js
+++ b/src/util/read.js
@@ -1,5 +1,12 @@
+import { toNumber } from './numbers.js';
+import { decodeUtf8 } from './strings.js';
+
+/** The size in bytes of a 32-bit integer. */
export const SIZEOF_INT = 4;
+/** The size in bytes of a 16-bit integer. */
+export const SIZEOF_SHORT = 2;
+
/**
* Return a boolean for a single bit in a bitmap.
* @param {Uint8Array} bitmap The bitmap.
@@ -10,91 +17,12 @@ export function decodeBit(bitmap, index) {
return (bitmap[index >> 3] & 1 << (index % 8)) !== 0;
}
-const textDecoder = new TextDecoder('utf-8');
-
-/**
- * Return a UTF-8 string decoded from a byte buffer.
- * @param {Uint8Array} buf a byte buffer
- * @returns {String} A decoded string.
- */
-export function decodeUtf8(buf) {
- return textDecoder.decode(buf);
-}
-
-/**
- * Return the first object key that pairs with the given value.
- * @param {Record} object The object to search.
- * @param {any} value The value to lookup.
- * @returns {string} The first matching key, or '' if not found.
- */
-export function keyFor(object, value) {
- for (const [key, val] of Object.entries(object)) {
- if (val === value) return key;
- }
- return '';
-}
-
/**
- * Coerce a bigint value to a number. Throws an error if the bigint value
- * lies outside the range of what a number can precisely represent.
- * @param {bigint} value The value to check and possibly convert.
- * @returns {number} The converted number value.
- */
-export function toNumber(value) {
- if (value > Number.MAX_SAFE_INTEGER || value < Number.MIN_SAFE_INTEGER) {
- throw Error(`BigInt exceeds integer number representation: ${value}`);
- }
- return Number(value);
-}
-
-/**
- * Divide one BigInt value by another, and return the result as a number.
- * @param {bigint} num The numerator
- * @param {bigint} div The divisor
- * @returns {number} The result of the division as a floating point number.
- */
-export function divide(num, div) {
- return toNumber(num / div) + toNumber(num % div) / toNumber(div);
-}
-
-/**
- * Determine the correct index into an offset array for a given
- * full column row index. Assumes offset indices can be manipulated
- * as 32-bit signed integers.
- * @param {import("./types.js").IntegerArray} offsets The offsets array.
- * @param {number} index The full column row index.
- */
-export function bisect(offsets, index) {
- let a = 0;
- let b = offsets.length;
- if (b <= 2147483648) { // 2 ** 31
- // fast version, use unsigned bit shift
- // array length fits within 32-bit signed integer
- do {
- const mid = (a + b) >>> 1;
- if (offsets[mid] <= index) a = mid + 1;
- else b = mid;
- } while (a < b);
- } else {
- // slow version, use division and truncate
- // array length exceeds 32-bit signed integer
- do {
- const mid = Math.trunc((a + b) / 2);
- if (offsets[mid] <= index) a = mid + 1;
- else b = mid;
- } while (a < b);
- }
- return a;
-}
-
-// -- flatbuffer utilities -----
-
-/**
- * Lookup helper for flatbuffer table entries.
+ * Lookup helper for flatbuffer object (table) entries.
* @param {Uint8Array} buf The byte buffer.
- * @param {number} index The base index of the table.
+ * @param {number} index The base index of the object.
*/
-export function table(buf, index) {
+export function readObject(buf, index) {
const pos = index + readInt32(buf, index);
const vtable = pos - readInt32(buf, pos);
const size = readInt16(buf, vtable);
@@ -198,19 +126,6 @@ export function readUint32(buf, offset) {
return readInt32(buf, offset) >>> 0;
}
-/**
- * Return a signed 64-bit BigInt value.
- * @param {Uint8Array} buf
- * @param {number} offset
- * @returns {bigint}
- */
-export function readInt64(buf, offset) {
- return BigInt.asIntN(
- 64,
- BigInt(readUint32(buf, offset)) + (BigInt(readUint32(buf, offset + SIZEOF_INT)) << BigInt(32))
- );
-}
-
/**
* Return a signed 64-bit integer value coerced to a JS number.
* Throws an error if the value exceeds what a JS number can represent.
@@ -218,8 +133,12 @@ export function readInt64(buf, offset) {
* @param {number} offset
* @returns {number}
*/
-export function readInt64AsNum(buf, offset) {
- return toNumber(readInt64(buf, offset));
+export function readInt64(buf, offset) {
+ return toNumber(BigInt.asIntN(
+ 64,
+ BigInt(readUint32(buf, offset)) +
+ (BigInt(readUint32(buf, offset + SIZEOF_INT)) << 32n)
+ ));
}
/**
diff --git a/src/util/strings.js b/src/util/strings.js
new file mode 100644
index 0000000..60827f6
--- /dev/null
+++ b/src/util/strings.js
@@ -0,0 +1,47 @@
+import { isArray } from './arrays.js';
+import { isDate } from './objects.js';
+
+const textDecoder = new TextDecoder('utf-8');
+const textEncoder = new TextEncoder();
+
+/**
+ * Return a UTF-8 string decoded from a byte buffer.
+ * @param {Uint8Array} buf The byte buffer.
+ * @returns {string} The decoded string.
+ */
+export function decodeUtf8(buf) {
+ return textDecoder.decode(buf);
+}
+
+/**
+ * Return a byte buffer encoded from a UTF-8 string.
+ * @param {string} str The string to encode.
+ * @returns {Uint8Array} The encoded byte buffer.
+ */
+export function encodeUtf8(str) {
+ return textEncoder.encode(str);
+}
+
+/**
+ * Return a string-coercible key value that uniquely identifies a value.
+ * @param {*} value The input value.
+ * @returns {string} The key string.
+ */
+export function keyString(value) {
+ const val = typeof value !== 'object' || !value ? (value ?? null)
+ : isDate(value) ? +value
+ // @ts-ignore
+ : isArray(value) ? `[${value.map(keyString)}]`
+ : objectKey(value);
+ return `${val}`;
+}
+
+function objectKey(value) {
+ let s = '';
+ let i = -1;
+ for (const k in value) {
+ if (++i > 0) s += ',';
+ s += `"${k}":${keyString(value[k])}`;
+ }
+ return `{${s}}`;
+}
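+
+// For example, structurally equal values map to the same key string:
+//   keyString([1, 2]) === '[1,2]'
+//   keyString({ a: 1, b: null }) === '{"a":1,"b":null}'
+//   keyString(new Date(0)) === '0'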
diff --git a/test/column-from-array-test.js b/test/column-from-array-test.js
new file mode 100644
index 0000000..1cad3d6
--- /dev/null
+++ b/test/column-from-array-test.js
@@ -0,0 +1,294 @@
+import assert from 'node:assert';
+import { IntervalUnit, UnionMode, binary, bool, columnFromArray, decimal, dictionary, field, fixedSizeBinary, fixedSizeList, float16, float32, float64, int16, int32, int64, int8, interval, list, map, runEndEncoded, struct, uint16, uint32, uint64, uint8, union, utf8 } from '../src/index.js';
+import { isTypedArray } from '../src/util/arrays.js';
+
+function test(values, type, options) {
+ const col = columnFromArray(values, type, options);
+ if (type) assert.deepEqual(col.type, type);
+ if (!type && isTypedArray(values)) {
+ // check that inferred data type maintains input array type
+ assert.strictEqual(col.data[0].values.constructor, values.constructor);
+ }
+ assert.strictEqual(col.length, values.length);
+ assert.deepStrictEqual(Array.from(col), Array.from(values));
+ return col;
+}
+
+describe('column', () => {
+ it('creates integer columns', () => {
+ // without nulls
+ test([1, 2, 3], uint8());
+ test([1, 2, 3], uint16());
+ test([1, 2, 3], uint32());
+ test([1, 2, 3], uint64());
+ test([1, 2, 3], int8());
+ test([1, 2, 3], int16());
+ test([1, 2, 3], int32());
+ test([1, 2, 3], int64());
+
+ // with nulls
+ test([1, 2, null, 3], uint8());
+ test([1, 2, null, 3], uint16());
+ test([1, 2, null, 3], uint32());
+ test([1, 2, null, 3], uint64());
+ test([1, 2, null, 3], int8());
+ test([1, 2, null, 3], int16());
+ test([1, 2, null, 3], int32());
+ test([1, 2, null, 3], int64());
+
+ // use big int
+ const opt = { useBigInt: true };
+ test([1n, 2n, 3n], uint64(), opt);
+ test([1n, 2n, 3n], int64(), opt);
+ test([1n, 2n, null, 3n], uint64(), opt);
+ test([1n, 2n, null, 3n], int64(), opt);
+ });
+
+ it('creates float columns', () => {
+ // without nulls
+ test([1, 2, 3], float16());
+ test([1, 2, 3], float32());
+ test([1, 2, 3], float64());
+
+ // with nulls
+ test([1, 2, null, 3], float16());
+ test([1, 2, null, 3], float32());
+ test([1, 2, null, 3], float64());
+ });
+
+ it('creates bool columns', () => {
+ test([true, false, true, false], bool());
+ test([true, false, null, true, false], bool());
+ });
+
+ it('creates utf8 columns', () => {
+ test(['foo', 'bar', 'baz'], utf8());
+ test(['foo', 'bar', null, 'baz'], utf8());
+ });
+
+ it('creates binary columns', () => {
+ test([
+ Uint8Array.of(255, 2, 3, 1),
+ null,
+ Uint8Array.of(5, 9, 128)
+ ], binary());
+ });
+
+ it('creates decimal columns', () => {
+ test([1.1, 2.3, 3.4], decimal(18, 1, 128));
+ test([-1.1, -2.3, -3.4], decimal(18, 1, 128));
+ test([0.12345678987, -0.12345678987], decimal(18, 11, 128));
+ test([10000.12345678987, -10000.12345678987], decimal(18, 11, 128));
+ test([0.1000012345678987, -0.1000012345678987], decimal(18, 16, 128));
+ test([0.12345678987654321, -0.12345678987654321], decimal(18, 18, 128));
+
+ test([1.1, 2.3, 3.4], decimal(40, 1, 256));
+ test([-1.1, -2.3, -3.4], decimal(40, 1, 256));
+ test([0.12345678987, -0.12345678987], decimal(40, 11, 256));
+ test([10000.12345678987, -10000.12345678987], decimal(40, 11, 256));
+ test([0.1000012345678987, -0.1000012345678987], decimal(40, 16, 256));
+ test([0.12345678987654321, -0.12345678987654321], decimal(40, 18, 256));
+
+ const opt = { useDecimalBigInt: true };
+
+ test([11n, 23n, 34n], decimal(18, 1, 128), opt);
+ test([-11n, -23n, -34n], decimal(18, 1, 128), opt);
+ test([12345678987n, -12345678987n], decimal(18, 11, 128), opt);
+ test([1000012345678987n, -1000012345678987n], decimal(18, 11, 128), opt);
+ test([1000012345678987n, -1000012345678987n], decimal(18, 16, 128), opt);
+ test([12345678987654321n, -12345678987654321n], decimal(18, 18, 128), opt);
+
+ test([11n, 23n, 34n], decimal(40, 1, 256), opt);
+ test([-11n, -23n, -34n], decimal(40, 1, 256), opt);
+ test([12345678987n, -12345678987n], decimal(40, 11, 256), opt);
+ test([1000012345678987n, -1000012345678987n], decimal(40, 11, 256), opt);
+ test([1000012345678987n, -1000012345678987n], decimal(40, 16, 256), opt);
+ test([12345678987654321n, -12345678987654321n], decimal(40, 18, 256), opt);
+ test([2n ** 156n, (-2n) ** 157n], decimal(18, 24, 256), opt);
+ });
+
+ it('creates month-day-nano interval columns', () => {
+ test(
+ [
+ Float64Array.of(1992, 3, 1e10),
+ Float64Array.of(2000, 6, 15)
+ ],
+ interval(IntervalUnit.MONTH_DAY_NANO)
+ );
+ });
+
+ it('creates fixed size binary columns', () => {
+ test([
+ Uint8Array.of(255, 2, 3),
+ null,
+ Uint8Array.of(5, 9, 128)
+ ], fixedSizeBinary(3));
+ });
+
+ it('creates list columns', () => {
+ test([
+ Int32Array.of(1, 2, 3),
+ Int32Array.of(4, 5),
+ Int32Array.of(6, 7, 8)
+ ], list(int32()));
+ test([
+ Int32Array.of(1, 2, 3),
+ Int32Array.of(4, 5),
+ null,
+ Int32Array.of(6, 7, 8)
+ ], list(int32()));
+ });
+
+ it('creates fixed size list columns', () => {
+ test([
+ Int32Array.of(1, 2, 3),
+ Int32Array.of(4, 5, 6),
+ null,
+ Int32Array.of(7, 8, 9)
+ ], fixedSizeList(int32(), 3));
+ });
+
+ it('creates struct columns', () => {
+ const data = [
+ { foo: 1, bar: 'a', baz: true },
+ { foo: 2, bar: 'b', baz: false },
+ { foo: 3, bar: 'd', baz: true },
+ null,
+ { foo: 4, bar: 'c', baz: null },
+ ];
+
+ test(data, struct({
+ foo: int32(),
+ bar: utf8(),
+ baz: bool()
+ }));
+
+ test(data, struct([
+ field('foo', int32()),
+ field('bar', utf8()),
+ field('baz', bool()),
+ ]));
+ });
+
+ it('creates union columns', () => {
+ const unionTypeId = value => {
+ const vtype = typeof value;
+ return vtype === 'number' ? 0 : vtype === 'boolean' ? 1 : 2;
+ };
+ const ids = Int8Array.of(0, 0, 2, 1, 2, 2, 0, 1);
+ const values = [1, 2, 'foo', true, null, 'baz', 3, false];
+ const childTypes = [int32(), bool(), utf8()];
+
+ const sparse = union(UnionMode.Sparse, childTypes, null, unionTypeId);
+ const sparseCol = test(values, sparse);
+ const sparseBatch = sparseCol.data[0];
+ assert.equal(sparseBatch.nullCount, 1);
+ assert.deepStrictEqual(sparseBatch.children.map(b => b.length), [8, 8, 8]);
+ assert.deepStrictEqual(sparseBatch.children.map(b => b.nullCount), [5, 6, 6]);
+ assert.deepStrictEqual(sparseBatch.typeIds, ids);
+
+ const dense = union(UnionMode.Dense, childTypes, null, unionTypeId);
+ const denseCol = test(values, dense);
+ const denseBatch = denseCol.data[0];
+ assert.equal(denseBatch.nullCount, 1);
+ assert.deepStrictEqual(denseBatch.children.map(b => b.length), [3, 2, 3]);
+ assert.deepStrictEqual(denseBatch.children.map(b => b.nullCount), [0, 0, 1]);
+ assert.deepStrictEqual(denseBatch.typeIds, ids);
+ });
+
+ it('creates map columns', () => {
+ const asMap = d => d.map(a => a && new Map(a));
+ const keyvals = [
+ [['foo', 1], ['bar', 2], ['baz', 3]],
+ [['foo', 4], ['bar', 5], ['baz', 6]],
+ null
+ ];
+ const reverse = keyvals.map(a => a && a.map(kv => kv.slice().reverse()));
+ test(keyvals, map(utf8(), int16()));
+ test(reverse, map(int16(), utf8()));
+ test(asMap(keyvals), map(utf8(), int16()), { useMap: true });
+ test(asMap(reverse), map(int16(), utf8()), { useMap: true });
+ });
+
+ it('creates dictionary columns', () => {
+ function check(values, type) {
+ const col = test(values, type);
+ // check array type of indices
+ assert.ok(col.data[0].values instanceof type.indices.values);
+ }
+
+ const strs = ['foo', 'foo', 'baz', 'bar', null, 'baz', 'bar'];
+ const ints = [12, 34, 12, 12, 12, 27, null, 34];
+ const arrs = [[1,2,3], [1,2,3], null, [3,5], [3,5]].map(x => x && Int32Array.from(x));
+
+ check(strs, dictionary(utf8()));
+ check(ints, dictionary(int32()));
+ check(arrs, dictionary(list(int32())));
+ check(strs, dictionary(utf8(), int16()));
+ check(ints, dictionary(int32(), int16()));
+ });
+
+ it('creates run-end encoded columns', () => {
+ function check(values, runs, type) {
+ const col = test(values, type);
+ // check run-ends
+ const colRuns = col.data[0].children[0];
+ assert.deepStrictEqual([...colRuns], runs);
+ assert.ok(colRuns.values instanceof type.children[0].type.values);
+ }
+
+ const strs = ['foo', 'foo', 'baz', 'bar', null, 'baz', 'baz'];
+ const srun = [2, 3, 4, 5, 7];
+ const ints = [12, 34, 12, 12, 12, 27, 27, null, 34, 34];
+ const irun = [1, 2, 5, 7, 8, 10];
+ const arrs = [[1,2,3], [1,2,3], null, [3,5], [3,5]].map(x => x && Int32Array.from(x));
+ const arun = [2, 3, 5];
+
+ // 32-bit runs
+ check(strs, srun, runEndEncoded(int32(), utf8()));
+ check(ints, irun, runEndEncoded(int32(), int32()));
+ check(arrs, arun, runEndEncoded(int32(), list(int32())));
+
+ // 64-bit runs
+ check(strs, srun, runEndEncoded(int64(), utf8()));
+ check(ints, irun, runEndEncoded(int64(), int32()));
+ check(arrs, arun, runEndEncoded(int64(), list(int32())));
+ });
+
+ it('creates columns with multiple record batches', () => {
+ const data = [
+ ...Array(10).fill(0),
+ ...Array(10).fill(null),
+ ...Array(10).fill(1),
+ 4, 5, 6
+ ];
+ const col = test(data, int16(), { maxBatchRows: 10 });
+ assert.strictEqual(col.nullCount, 10);
+ assert.strictEqual(col.data.length, 4);
+ assert.deepStrictEqual(col.data.map(d => d.length), [10, 10, 10, 3]);
+ });
+
+ it('creates columns from typed arrays', () => {
+ test(Int8Array.of(1, 2, 3));
+ test(Int16Array.of(1, 2, 3));
+ test(Int32Array.of(1, 2, 3));
+ test(Uint8Array.of(1, 2, 3));
+ test(Uint16Array.of(1, 2, 3));
+ test(Uint32Array.of(1, 2, 3));
+ test(Float32Array.of(1, 2, 3));
+ test(Float64Array.of(1, 2, 3));
+ test(BigInt64Array.of(1n, 2n, 3n), null, { useBigInt: true });
+ test(BigUint64Array.of(1n, 2n, 3n), null, { useBigInt: true });
+ });
+
+ it('creates columns from inferred types', () => {
+ test([1, 2, 3]);
+ test([1e3, 2e3, 3e3]);
+ test([1e6, 2e6, 3e6]);
+ test([1.1, 2.2, 3.3]);
+    test([1n, 2n, 3n], null, { useBigInt: true });
+ test([true, false, true]);
+ test([Int8Array.of(1,2), Int8Array.of(3,4), Int8Array.of(5,6)]);
+ });
+});
diff --git a/test/decode-ipc-test.js b/test/decode-ipc-test.js
new file mode 100644
index 0000000..2c3779a
--- /dev/null
+++ b/test/decode-ipc-test.js
@@ -0,0 +1,35 @@
+import assert from 'node:assert';
+import { readFile } from 'node:fs/promises';
+import { decodeIPC } from '../src/decode/decode-ipc.js';
+import { decimalDataDecoded } from './util/decimal.js';
+
+describe('decodeIPC', () => {
+ it('decodes arrow file format', async () => {
+ const buffer = await readFile(`test/data/decimal.arrow`);
+ const bytes = new Uint8Array(buffer);
+ const expect = decimalDataDecoded();
+ assert.deepEqual(decodeIPC(buffer), expect, 'Node Buffer');
+ assert.deepStrictEqual(decodeIPC(bytes), expect, 'Uint8Array');
+ assert.deepStrictEqual(decodeIPC(bytes.buffer), expect, 'ArrayBuffer');
+ });
+
+ it('decodes arrow stream format', async () => {
+ const buffer = await readFile(`test/data/decimal.arrows`);
+ const bytes = new Uint8Array(buffer);
+ const expect = decimalDataDecoded();
+ assert.deepEqual(decodeIPC(buffer), expect, 'Node Buffer');
+ assert.deepStrictEqual(decodeIPC(bytes), expect, 'Uint8Array');
+ assert.deepStrictEqual(decodeIPC(bytes.buffer), expect, 'ArrayBuffer');
+ });
+
+ it('decodes arrow stream format from multiple buffers', async () => {
+ // decimal.arrows, divided into separate messages
+ const array = [
+ Uint8Array.of(255,255,255,255,120,0,0,0,16,0,0,0,0,0,10,0,12,0,6,0,5,0,8,0,10,0,0,0,0,1,4,0,12,0,0,0,8,0,8,0,0,0,4,0,8,0,0,0,4,0,0,0,1,0,0,0,20,0,0,0,16,0,20,0,8,0,6,0,7,0,12,0,0,0,16,0,16,0,0,0,0,0,1,7,16,0,0,0,28,0,0,0,4,0,0,0,0,0,0,0,1,0,0,0,100,0,0,0,8,0,12,0,4,0,8,0,8,0,0,0,18,0,0,0,3,0,0,0),
+ Uint8Array.of(255,255,255,255,136,0,0,0,20,0,0,0,0,0,0,0,12,0,22,0,6,0,5,0,8,0,12,0,12,0,0,0,0,3,4,0,24,0,0,0,48,0,0,0,0,0,0,0,0,0,10,0,24,0,12,0,4,0,8,0,10,0,0,0,60,0,0,0,16,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,48,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,232,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,224,46,0,0,0,0,0,0,0,0,0,0,0,0,0,0,208,132,0,0,0,0,0,0,0,0,0,0,0,0,0,0),
+ Uint8Array.of(255,255,255,255,0,0,0,0)
+ ];
+ const expect = decimalDataDecoded();
+ assert.deepStrictEqual(decodeIPC(array), expect, 'Uint8Array');
+ });
+});
diff --git a/test/encode-ipc-test.js b/test/encode-ipc-test.js
new file mode 100644
index 0000000..0266601
--- /dev/null
+++ b/test/encode-ipc-test.js
@@ -0,0 +1,44 @@
+import assert from 'node:assert';
+import { tableFromIPC } from 'apache-arrow';
+import { decodeIPC } from '../src/decode/decode-ipc.js';
+import { encodeIPC } from '../src/encode/encode-ipc.js';
+import { MAGIC } from '../src/constants.js';
+import { decimalDataDecoded, decimalDataToEncode } from './util/decimal.js';
+
+function arrowJSCheck(input, bytes) {
+ // cross-check against arrow-js
+ const arrowJS = tableFromIPC(bytes);
+ assert.strictEqual(arrowJS.numRows, 3);
+ assert.strictEqual(arrowJS.numCols, 1);
+ const arrowCol = arrowJS.getChildAt(0);
+ const arrowBuf = arrowCol.data[0].values;
+ assert.strictEqual(arrowCol.type.typeId, 7);
+ assert.deepStrictEqual(
+ new Uint8Array(
+ arrowBuf.buffer,
+ arrowBuf.byteOffset,
+ arrowBuf.length * arrowBuf.BYTES_PER_ELEMENT
+ ),
+ input.records[0].buffers[0]
+ );
+}
+
+describe('encodeIPC', () => {
+ it('encodes arrow file format', () => {
+ const input = decimalDataToEncode();
+ const expect = decimalDataDecoded();
+ const bytes = encodeIPC(input, { format: 'file' }).finish();
+ assert.deepStrictEqual(bytes.subarray(0, 6), MAGIC, 'start ARROW1 magic string');
+ assert.deepStrictEqual(bytes.slice(-6), MAGIC, 'end ARROW1 magic string');
+ assert.deepStrictEqual(decodeIPC(bytes), expect, 'Uint8Array');
+ arrowJSCheck(input, bytes);
+ });
+
+ it('encodes arrow stream format', () => {
+ const input = decimalDataToEncode();
+ const expect = decimalDataDecoded();
+ const bytes = encodeIPC(input, { format: 'stream' }).finish();
+ assert.deepStrictEqual(decodeIPC(bytes), expect, 'Uint8Array');
+ arrowJSCheck(input, bytes);
+ });
+});
diff --git a/test/infer-type-test.js b/test/infer-type-test.js
new file mode 100644
index 0000000..1a3fc57
--- /dev/null
+++ b/test/infer-type-test.js
@@ -0,0 +1,95 @@
+import assert from 'node:assert';
+import { bool, dateDay, dictionary, float64, int16, int32, int64, int8, list, struct, timestamp, utf8 } from '../src/index.js';
+import { inferType } from '../src/build/infer-type.js';
+
+function matches(actual, expect) {
+ assert.deepStrictEqual(actual, expect);
+}
+
+describe('inferType', () => {
+ it('infers integer types', () => {
+ matches(inferType([1, 2, 3]), int8());
+ matches(inferType([1e3, 2e3, 3e3]), int16());
+ matches(inferType([1e6, 2e6, 3e6]), int32());
+ matches(inferType([1n, 2n, 3n]), int64());
+
+ matches(inferType([-1, 2, 3]), int8());
+ matches(inferType([-1e3, 2e3, 3e3]), int16());
+ matches(inferType([-1e6, 2e6, 3e6]), int32());
+ matches(inferType([-1n, 2n, 3n]), int64());
+
+ matches(inferType([1, 2, null, undefined, 3]), int8());
+ matches(inferType([1e3, 2e3, null, undefined, 3e3]), int16());
+ matches(inferType([1e6, 2e6, null, undefined, 3e6]), int32());
+ matches(inferType([1n, 2n, null, undefined, 3n]), int64());
+ });
+
+ it('infers float types', () => {
+ matches(inferType([1.1, 2.2, 3.3]), float64());
+ matches(inferType([-1.1, 2.2, 3.3]), float64());
+ matches(inferType([1, 2, 3.3]), float64());
+ matches(inferType([1, 2, NaN]), float64());
+ matches(inferType([NaN, null, undefined, NaN]), float64());
+ matches(inferType([Number.MIN_SAFE_INTEGER, Number.MAX_SAFE_INTEGER]), float64());
+ });
+
+ it('infers utf8 dictionary types', () => {
+ const type = dictionary(utf8(), int32());
+ matches(inferType(['foo', 'bar', 'baz']), type);
+ matches(inferType(['foo', 'bar', null, undefined, 'baz']), type);
+ });
+
+ it('infers bool types', () => {
+ matches(inferType([true, false, true]), bool());
+ matches(inferType([true, false, null, undefined, true]), bool());
+ });
+
+ it('infers date day types', () => {
+ matches(inferType([
+ new Date(Date.UTC(2000, 1, 2)),
+ new Date(Date.UTC(2006, 3, 20)),
+ null,
+ undefined
+ ]), dateDay());
+ });
+
+ it('infers timestamp types', () => {
+ matches(
+ inferType([
+ new Date(Date.UTC(2000, 1, 2)),
+ new Date(Date.UTC(2006, 3, 20)),
+ null,
+ undefined,
+ new Date(1990, 3, 12, 5, 37)
+ ]),
+ timestamp()
+ );
+ });
+
+ it('infers list types', () => {
+ matches(inferType([[1, 2], [3, 4]]), list(int8()));
+ matches(inferType([[true, null, false], null, undefined, [false, undefined, true]]), list(bool()));
+ matches(inferType([['foo', 'bar', null], null, ['bar', 'baz']]), list(dictionary(utf8(), int32())));
+ });
+
+ it('infers struct types', () => {
+ matches(
+ inferType([
+ { foo: 1, bar: [1.1, 2.2] },
+ { foo: null, bar: [2.2, null, 3.3] },
+ null,
+ undefined,
+ { foo: 2, bar: null },
+ ]),
+ struct({ foo: int8(), bar: list(float64()) })
+ );
+ });
+
+ it('throws on bigints that exceed 64 bits', () => {
+ assert.throws(() => inferType([(1n << 200n)]));
+ });
+
+ it('throws on mixed types', () => {
+ assert.throws(() => inferType([1, true, 'foo']));
+ });
+});
diff --git a/test/parse-ipc-test.js b/test/parse-ipc-test.js
deleted file mode 100644
index 2fe95c8..0000000
--- a/test/parse-ipc-test.js
+++ /dev/null
@@ -1,64 +0,0 @@
-import assert from 'node:assert';
-import { readFile } from 'node:fs/promises';
-import { Type, Version } from '../src/index.js';
-import { parseIPC } from '../src/parse-ipc.js';
-
-function decimalData() {
- return {
- schema: {
- version: Version.V5,
- endianness: 0,
- fields: [{
- name: 'd',
- type: { typeId: Type.Decimal, precision: 18, scale: 3, bitWidth: 128, values: Uint32Array },
- nullable: true,
- metadata: null
- }],
- metadata: null,
- dictionaryTypes: new Map
- },
- records: [{
- length: 3,
- nodes: [ { length: 3, nullCount: 0 } ],
- buffers: [
- { offset: 0, length: 0 },
- { offset: 0, length: 48 }
- ],
- variadic: [],
- body: Uint8Array.of(232,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,224,46,0,0,0,0,0,0,0,0,0,0,0,0,0,0,208,132,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
- }],
- dictionaries: [],
- metadata: null
- };
-}
-
-describe('parseIPC', () => {
- it('decodes arrow file format', async () => {
- const buffer = await readFile(`test/data/decimal.arrow`);
- const bytes = new Uint8Array(buffer);
- const expect = decimalData();
- assert.deepEqual(parseIPC(buffer), expect, 'Node Buffer');
- assert.deepStrictEqual(parseIPC(bytes), expect, 'Uint8Array');
- assert.deepStrictEqual(parseIPC(bytes.buffer), expect, 'ArrayBuffer');
- });
-
- it('decodes arrow stream format', async () => {
- const buffer = await readFile(`test/data/decimal.arrows`);
- const bytes = new Uint8Array(buffer);
- const expect = decimalData();
- assert.deepEqual(parseIPC(buffer), expect, 'Node Buffer');
- assert.deepStrictEqual(parseIPC(bytes), expect, 'Uint8Array');
- assert.deepStrictEqual(parseIPC(bytes.buffer), expect, 'ArrayBuffer');
- });
-
- it('decodes arrow stream format from multiple buffers', async () => {
- // decimal.arrows, divided into separate messages
- const array = [
- Uint8Array.of(255,255,255,255,120,0,0,0,16,0,0,0,0,0,10,0,12,0,6,0,5,0,8,0,10,0,0,0,0,1,4,0,12,0,0,0,8,0,8,0,0,0,4,0,8,0,0,0,4,0,0,0,1,0,0,0,20,0,0,0,16,0,20,0,8,0,6,0,7,0,12,0,0,0,16,0,16,0,0,0,0,0,1,7,16,0,0,0,28,0,0,0,4,0,0,0,0,0,0,0,1,0,0,0,100,0,0,0,8,0,12,0,4,0,8,0,8,0,0,0,18,0,0,0,3,0,0,0),
- Uint8Array.of(255,255,255,255,136,0,0,0,20,0,0,0,0,0,0,0,12,0,22,0,6,0,5,0,8,0,12,0,12,0,0,0,0,3,4,0,24,0,0,0,48,0,0,0,0,0,0,0,0,0,10,0,24,0,12,0,4,0,8,0,10,0,0,0,60,0,0,0,16,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,48,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,232,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,224,46,0,0,0,0,0,0,0,0,0,0,0,0,0,0,208,132,0,0,0,0,0,0,0,0,0,0,0,0,0,0),
- Uint8Array.of(255,255,255,255,0,0,0,0)
- ];
- const expect = decimalData();
- assert.deepStrictEqual(parseIPC(array), expect, 'Uint8Array');
- });
-});
diff --git a/test/table-from-arrays-test.js b/test/table-from-arrays-test.js
new file mode 100644
index 0000000..003607e
--- /dev/null
+++ b/test/table-from-arrays-test.js
@@ -0,0 +1,68 @@
+import assert from 'node:assert';
+import { float64, int8, int32, bool, dictionary, tableFromArrays, utf8, float32 } from '../src/index.js';
+
+describe('tableFromArrays', () => {
+ const values = {
+ foo: [1, 2, 3, 4, 5],
+ bar: [1.3, NaN, 1e27, Math.PI, Math.E].map(v => Math.fround(v)),
+ baz: [true, false, null, false, true],
+ bop: ['foo', 'bar', 'baz', 'bop', 'bip']
+ };
+
+ const types = {
+ foo: int8(),
+ bar: float64(),
+ baz: bool(),
+ bop: dictionary(utf8(), int32())
+ };
+
+ function check(table, colTypes = types) {
+ const { fields } = table.schema;
+ assert.strictEqual(table.numRows, 5);
+ assert.strictEqual(table.numCols, 4);
+ table.children.forEach((c, i) => {
+ const { name } = fields[i];
+ assert.deepStrictEqual(c.type, colTypes[name]);
+ assert.deepStrictEqual(fields[i].type, colTypes[name]);
+ assert.deepStrictEqual([...c], values[name]);
+ });
+ return table;
+ }
+
+ it('creates table from provided types', () => {
+ // with types that match type inference results
+ check(tableFromArrays(values, { types }));
+ check(tableFromArrays(Object.entries(values), { types }));
+
+    // with types that do not match type inference results
+ const opt = { types: { ...types, foo: int32(), bar: float32() } };
+ check(tableFromArrays(values, opt), opt.types);
+ check(tableFromArrays({
+ ...values,
+ foo: Int16Array.from(values.foo),
+ bar: Float64Array.from(values.bar)
+ }, opt), opt.types);
+ });
+
+ it('creates table from inferred types', () => {
+ check(tableFromArrays(values));
+ check(tableFromArrays(Object.entries(values)));
+
+ // infer from typed arrays
+ check(tableFromArrays({
+ ...values,
+ foo: Int8Array.from(values.foo),
+ bar: Float64Array.from(values.bar)
+ }));
+ });
+
+ it('creates empty table', () => {
+ const table = tableFromArrays({});
+ assert.strictEqual(table.numRows, 0);
+ assert.strictEqual(table.numCols, 0);
+ });
+
+  it('throws when array lengths differ', () => {
+ assert.throws(() => tableFromArrays({ foo: [1, 2, 3], bar: [1, 2] }));
+ });
+});
diff --git a/test/table-from-ipc-test.js b/test/table-from-ipc-test.js
index faf52da..26e216f 100644
--- a/test/table-from-ipc-test.js
+++ b/test/table-from-ipc-test.js
@@ -1,18 +1,26 @@
import assert from 'node:assert';
-import { readFile } from 'node:fs/promises';
-import { arrowFromDuckDB, arrowQuery } from './util/arrow-from-duckdb.js';
import { tableFromIPC } from '../src/index.js';
+import { arrowFromDuckDB } from './util/arrow-from-duckdb.js';
+import { binaryView, bool, dateDay, decimal, empty, fixedListInt32, fixedListUtf8, float32, float64, int16, int32, int64, int8, intervalMonthDayNano, largeListView, listInt32, listUtf8, listView, map, runEndEncoded32, runEndEncoded64, struct, timestampMicrosecond, timestampMillisecond, timestampNanosecond, timestampSecond, uint16, uint32, uint64, uint8, union, utf8, utf8View } from './util/data.js';
-const toDate = v => new Date(v);
const toBigInt = v => BigInt(v);
+const toDate = v => new Date(v);
+const toFloat32 = v => Math.fround(v);
+
+async function test(dataMethod, arrayType, opt, transform) {
+ const data = await dataMethod();
+ for (const { bytes, values, nullCount } of data) {
+ valueTest(bytes, values, nullCount ? Array : arrayType, opt, transform);
+ }
+}
-async function valueTest(values, dbType = null, arrayType, opt = undefined, transform = undefined) {
+function valueTest(bytes, values, arrayType, opt = undefined, transform = undefined, name = 'value') {
const array = transform
? values.map((v, i) => v == null ? v : transform(v, i))
: Array.from(values);
- const bytes = await arrowFromDuckDB(values, dbType);
- const column = tableFromIPC(bytes, opt).getChild('value');
+ const column = tableFromIPC(bytes, opt).getChild(name);
compare(column, array, arrayType);
+ return column;
}
function compare(column, array, arrayType = Array) {
@@ -41,273 +49,103 @@ describe('tableFromIPC', () => {
assert.throws(() => tableFromIPC(bytes).getChild('value').toArray());
// as bigints
- await valueTest(values, 'BIGINT', BigInt64Array, { useBigInt: true });
+ valueTest(bytes, values, BigInt64Array, { useBigInt: true });
});
- it('decodes boolean data', async () => {
- await valueTest([true, false, true], 'BOOLEAN', Array);
- await valueTest([true, false, null], 'BOOLEAN', Array);
- });
+ it('decodes boolean data', () => test(bool));
- it('decodes uint8 data', async () => {
- await valueTest([1, 2, 3], 'UTINYINT', Uint8Array);
- await valueTest([1, null, 3], 'UTINYINT', Array);
- });
+ it('decodes uint8 data', () => test(uint8, Uint8Array));
+ it('decodes uint16 data', () => test(uint16, Uint16Array));
+ it('decodes uint32 data', () => test(uint32, Uint32Array));
+ it('decodes uint64 data', () => test(uint64, Float64Array));
+ it('decodes uint64 data to bigint', () => test(uint64, BigUint64Array, { useBigInt: true }, toBigInt));
- it('decodes uint16 data', async () => {
- await valueTest([1, 2, 3], 'USMALLINT', Uint16Array);
- await valueTest([1, null, 3], 'USMALLINT', Array);
- });
+ it('decodes int8 data', () => test(int8, Int8Array));
+ it('decodes int16 data', () => test(int16, Int16Array));
+ it('decodes int32 data', () => test(int32, Int32Array));
+ it('decodes int64 data', () => test(int64, Float64Array));
+ it('decodes int64 data to bigint', () => test(int64, BigInt64Array, { useBigInt: true }, toBigInt));
- it('decodes uint32 data', async () => {
- await valueTest([1, 2, 3], 'UINTEGER', Uint32Array);
- await valueTest([1, null, 3], 'UINTEGER', Array);
- });
+ it('decodes float32 data', () => test(float32, Float32Array, {}, toFloat32));
+ it('decodes float64 data', () => test(float64, Float64Array));
+ it('decodes decimal data', () => test(decimal, Float64Array));
- it('decodes uint64 data', async () => {
- // coerced to numbers
- await valueTest([1, 2, 3], 'UBIGINT', Float64Array);
- await valueTest([1, null, 3], 'UBIGINT', Array);
- // as bigints
- await valueTest([1, 2, 3], 'UBIGINT', BigUint64Array, { useBigInt: true }, toBigInt);
- await valueTest([1, null, 3], 'UBIGINT', Array, { useBigInt: true }, toBigInt);
- });
+ it('decodes date day data', () => test(dateDay, Float64Array));
+ it('decodes date day data to dates', () => test(dateDay, Array, { useDate: true }, toDate));
- it('decodes int8 data', async () => {
- await valueTest([1, 2, 3], 'TINYINT', Int8Array);
- await valueTest([1, null, 3], 'TINYINT', Array);
- });
+ it('decodes timestamp nanosecond data', () => test(timestampNanosecond, Float64Array));
+ it('decodes timestamp microsecond data', () => test(timestampMicrosecond, Float64Array));
+ it('decodes timestamp millisecond data', () => test(timestampMillisecond, Float64Array));
+ it('decodes timestamp second data', () => test(timestampSecond, Float64Array));
+ it('decodes timestamp nanosecond data to dates', () => test(timestampNanosecond, Array, { useDate: true }, toDate));
+ it('decodes timestamp microsecond data to dates', () => test(timestampMicrosecond, Array, { useDate: true }, toDate));
+ it('decodes timestamp millisecond data to dates', () => test(timestampMillisecond, Array, { useDate: true }, toDate));
+ it('decodes timestamp second data to dates', () => test(timestampSecond, Array, { useDate: true }, toDate));
- it('decodes int16 data', async () => {
- await valueTest([1, 2, 3], 'SMALLINT', Int16Array);
- await valueTest([1, null, 3], 'SMALLINT', Array);
- });
+ it('decodes interval year/month/nano data', () => test(intervalMonthDayNano));
- it('decodes int32 data', async () => {
- await valueTest([1, 2, 3], 'INTEGER', Int32Array);
- await valueTest([1, null, 3], 'INTEGER', Array);
- });
-
- it('decodes int64 data', async () => {
- // coerced to numbers
- await valueTest([1, 2, 3], 'BIGINT', Float64Array);
- await valueTest([1, null, 3], 'BIGINT', Array);
- // as bigints
- await valueTest([1, 2, 3], 'BIGINT', BigInt64Array, { useBigInt: true }, toBigInt);
- await valueTest([1, null, 3], 'BIGINT', Array, { useBigInt: true }, toBigInt);
- });
-
- it('decodes float32 data', async () => {
- await valueTest([1.1, 2.2, 3.3], 'FLOAT', Float32Array, {}, v => Math.fround(v));
- await valueTest([1.1, null, 3.3], 'FLOAT', Array, {}, v => Math.fround(v));
- });
-
- it('decodes float64 data', async () => {
- await valueTest([1.1, 2.2, 3.3], 'DOUBLE', Float64Array);
- await valueTest([1.1, null, 3.3], 'DOUBLE', Array);
- });
+ it('decodes utf8 data', () => test(utf8));
- it('decodes decimal data', async () => {
- await valueTest([1.212, 3.443, 5.600], 'DECIMAL(18,3)', Float64Array);
- await valueTest([1.212, null, 5.600], 'DECIMAL(18,3)', Array);
- });
-
- it('decodes date day data', async () => {
- const values = ['2001-01-01', '2004-02-03', '2006-12-31'];
- const nulls = ['2001-01-01', null, '2006-12-31'];
- // as timestamps
- await valueTest(values, 'DATE', Float64Array, {}, v => +toDate(v));
- await valueTest(nulls, 'DATE', Array, {}, v => +toDate(v));
- // as Date objects
- await valueTest(values, 'DATE', Array, { useDate: true }, toDate);
- await valueTest(nulls, 'DATE', Array, { useDate: true }, toDate);
- });
+ it('decodes list int32 data', () => test(listInt32));
+ it('decodes list utf8 data', () => test(listUtf8));
- it('decodes timestamp data', async () => {
- const ns = ['1992-09-20T11:30:00.123456789Z', '2002-12-13T07:28:56.564738209Z'];
- const us = ['1992-09-20T11:30:00.123457Z', '2002-12-13T07:28:56.564738Z'];
- const ms = ['1992-09-20T11:30:00.123Z', '2002-12-13T07:28:56.565Z'];
- const sec = ['1992-09-20T11:30:00Z', '2002-12-13T07:28:57Z'];
+ it('decodes fixed list int32 data', () => test(fixedListInt32));
+  it('decodes fixed list utf8 data', () => test(fixedListUtf8));
- // From DuckDB docs: When defining timestamps as a TIMESTAMP_NS literal, the
- // decimal places beyond microseconds are ignored. The TIMESTAMP_NS type is
- // able to hold nanoseconds when created e.g., loading Parquet files.
- const _ns = [0.456, 0.738]; // DuckDB truncates here
- const _us = [0.457, 0.738]; // DuckDB rounds here
+ it('decodes list view data', () => test(listView));
+ it('decodes large list view data', () => test(largeListView));
- // as timestamps
- await valueTest(ns, 'TIMESTAMP_NS', Float64Array, {}, (v, i) => +toDate(v) + _ns[i]);
- await valueTest(us, 'TIMESTAMP', Float64Array, {}, (v, i) => +toDate(v) + _us[i]);
- await valueTest(ms, 'TIMESTAMP_MS', Float64Array, {}, v => +toDate(v));
- await valueTest(sec, 'TIMESTAMP_S', Float64Array, {}, v => +toDate(v));
+ it('decodes union data', () => test(union));
- // as timestamps with nulls
- await valueTest(ns.concat(null), 'TIMESTAMP_NS', Array, {}, (v, i) => +toDate(v) + _ns[i]);
- await valueTest(us.concat(null), 'TIMESTAMP', Array, {}, (v, i) => +toDate(v) + _us[i]);
- await valueTest(ms.concat(null), 'TIMESTAMP_MS', Array, {}, v => +toDate(v));
- await valueTest(sec.concat(null), 'TIMESTAMP_S', Array, {}, v => +toDate(v));
+ it('decodes map data', () => test(map, Array, {}, v => Array.from(v.entries())));
+ it('decodes map data to maps', () => test(map, Array, { useMap: true }));
- // as dates
- await valueTest(ns, 'TIMESTAMP_NS', Array, { useDate: true }, toDate);
- await valueTest(us, 'TIMESTAMP', Array, { useDate: true }, toDate);
- await valueTest(ms, 'TIMESTAMP_MS', Array, { useDate: true }, toDate);
- await valueTest(sec, 'TIMESTAMP_S', Array, { useDate: true }, toDate);
-
- // as dates with nulls
- await valueTest(ns.concat(null), 'TIMESTAMP_NS', Array, { useDate: true }, toDate);
- await valueTest(us.concat(null), 'TIMESTAMP', Array, { useDate: true }, toDate);
- await valueTest(ms.concat(null), 'TIMESTAMP_MS', Array, { useDate: true }, toDate);
- await valueTest(sec.concat(null), 'TIMESTAMP_S', Array, { useDate: true }, toDate);
- });
-
- it('decodes interval data', async () => {
- const values = ['2 years', null, '12 years 2 month 1 day 5 seconds', '1 microsecond'];
- const js = [
- Float64Array.of(24, 0, 0),
- null,
- Float64Array.of(146, 1, 5000000000),
- Float64Array.of(0, 0, 1000)
- ];
- await valueTest(values, 'INTERVAL', Array, {}, (v, i) => js[i]);
- });
-
- it('decodes utf8 data', async () => {
- await valueTest(['foo', 'bar', 'baz'], 'VARCHAR', Array);
- await valueTest(['foo', null, 'baz'], 'VARCHAR', Array);
- });
-
- it('decodes list data', async () => {
- const values = [[1, 2, 3, 4], [5, 6], [7, 8, 9]];
- const pnulls = [[1, 2, 3, 4], null, [7, 8, 9]];
- const cnulls = [[1, 2, null, 4], [5, null, 6], [7, null, 9]];
- await valueTest(values, 'INTEGER[]', Array, {}, v => Int32Array.from(v));
- await valueTest(pnulls, 'INTEGER[]', Array, {}, v => Int32Array.from(v));
- await valueTest(cnulls, 'INTEGER[]', Array);
-
- const svalues = [['a', 'b', 'c', 'd'], ['e', 'f'], ['g', 'h', 'i']];
- const spnulls = [['a', 'b', 'c', 'd'], null, ['g', 'h', 'i']];
- const scnulls = [['a', 'b', null, 'd'], ['e', null, 'f'], ['g', null, 'i']];
- await valueTest(svalues, 'VARCHAR[]');
- await valueTest(spnulls, 'VARCHAR[]');
- await valueTest(scnulls, 'VARCHAR[]');
- });
-
- it('decodes fixed list data', async () => {
- const values = [[1, 2, 3], [4, 5, 6], [7, 8, 9]];
- const pnulls = [[1, 2, 3], null, [7, 8, 9]];
- const cnulls = [[1, null, 3], [null, 5, 6], [7, 8, null]];
- await valueTest(values, 'INTEGER[3]', Array, {}, v => Int32Array.from(v));
- await valueTest(pnulls, 'INTEGER[3]', Array, {}, v => Int32Array.from(v));
- await valueTest(cnulls, 'INTEGER[3]', Array);
-
- const svalues = [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']];
- const spnulls = [['a', 'b', 'c'], null, ['g', 'h', 'i']];
- const scnulls = [['a', null, 'c'], [null, 'e', 'f'], ['g', 'h', null]];
- await valueTest(svalues, 'VARCHAR[]');
- await valueTest(spnulls, 'VARCHAR[]');
- await valueTest(scnulls, 'VARCHAR[]');
- });
-
- it('decodes list view data', async () => {
- const buf = await readFile(`test/data/listview.arrows`);
- const column = tableFromIPC(new Uint8Array(buf)).getChild('value');
- compare(column, [
- ['foo', 'bar', 'baz'],
- null,
- ['baz', null, 'foo'],
- ['foo']
- ]);
- });
-
- it('decodes large list view data', async () => {
- const buf = await readFile(`test/data/largelistview.arrows`);
- const column = tableFromIPC(new Uint8Array(buf)).getChild('value');
- compare(column, [
- ['foo', 'bar', 'baz'],
- null,
- ['baz', null, 'foo'],
- ['foo']
- ]);
- });
-
- it('decodes union data', async () => {
- const values = ['a', 2, 'c'];
- const nulls = ['a', null, 'c'];
- const type = 'UNION(i INTEGER, v VARCHAR)';
- await valueTest(values, type);
- await valueTest(nulls, type);
- });
-
- it('decodes map data', async () => {
- const data = [
- [ ['foo', 1], ['bar', 2] ],
- [ ['foo', null], ['baz', 3] ]
- ];
- const values = data.map(e => new Map(e));
- await valueTest(values, null, Array, {}, v => Array.from(v.entries()));
- await valueTest(values, null, Array, { useMap: true });
- });
-
- it('decodes struct data', async () => {
- await valueTest([ {a: 1, b: 'foo'}, {a: 2, b: 'baz'} ]);
- await valueTest([ {a: 1, b: 'foo'}, null, {a: 2, b: 'baz'} ]);
- await valueTest([ {a: null, b: 'foo'}, {a: 2, b: null} ]);
- await valueTest([ {a: ['a', 'b'], b: Math.E}, {a: ['c', 'd'], b: Math.PI} ]);
- });
+ it('decodes struct data', () => test(struct));
- it('decodes run-end-encoded data', async () => {
- const buf = await readFile(`test/data/runendencoded.arrows`);
- const table = tableFromIPC(new Uint8Array(buf));
- const column = table.getChild('value');
- const [{ children: [runs, vals] }] = column.data;
- assert.deepStrictEqual([...runs], [2, 3, 4, 6, 8, 9]);
- assert.deepStrictEqual([...vals], ['foo', null, 'bar', 'baz', null, 'foo']);
- compare(column, ['foo', 'foo', null, 'bar', 'baz', 'baz', null, null, 'foo']);
+ it('decodes run-end-encoded data with 32-bit run ends', async () => {
+ const data = await runEndEncoded32();
+ for (const { bytes, runs, values } of data) {
+ const column = valueTest(bytes, values);
+ const ree = column.data[0].children;
+ assert.deepStrictEqual([...ree[0]], runs.counts);
+ assert.deepStrictEqual([...ree[1]], runs.values);
+ }
});
it('decodes run-end-encoded data with 64-bit run ends', async () => {
- const buf = await readFile(`test/data/runendencoded64.arrows`);
- const table = tableFromIPC(new Uint8Array(buf), { useBigInt: true });
- const column = table.getChild('value');
- const [{ children: [runs, vals] }] = column.data;
- assert.deepStrictEqual([...runs], [2n, 3n, 4n, 6n, 8n, 9n]);
- assert.deepStrictEqual([...vals], ['foo', null, 'bar', 'baz', null, 'foo']);
- compare(column, ['foo', 'foo', null, 'bar', 'baz', 'baz', null, null, 'foo']);
+ const data = await runEndEncoded64();
+ for (const { bytes, runs, values } of data) {
+ const column = valueTest(bytes, values);
+ const ree = column.data[0].children;
+ assert.deepStrictEqual([...ree[0]], runs.counts);
+ assert.deepStrictEqual([...ree[1]], runs.values);
+ }
});
it('decodes binary view data', async () => {
- const f = ['foo', 'foo', null, 'bar', 'baz', 'baz', null, null, 'foo'];
- const s = ['foobazbarbipbopboodeedoozoo', 'foo', null, 'bar', 'baz', 'baz', null, null, 'foo'];
- const enc = new TextEncoder();
- const binary = v => v != null ? enc.encode(v) : null;
- const buf = await readFile(`test/data/binaryview.arrows`);
- const table = tableFromIPC(new Uint8Array(buf));
- const flat = table.getChild('flat'); // all strings under 12 bytes
- const spill = table.getChild('spill'); // some strings spill to data buffer
- compare(flat, f.map(binary));
- compare(spill, s.map(binary));
+ const data = await binaryView();
+ for (const { bytes, values: { flat, spill } } of data) {
+ valueTest(bytes, flat, Array, {}, null, 'flat');
+ valueTest(bytes, spill, Array, {}, null, 'spill');
+ }
});
it('decodes utf8 view data', async () => {
- const f = ['foo', 'foo', null, 'bar', 'baz', 'baz', null, null, 'foo'];
- const s = ['foobazbarbipbopboodeedoozoo', 'foo', null, 'bar', 'baz', 'baz', null, null, 'foo'];
- const buf = await readFile(`test/data/utf8view.arrows`);
- const table = tableFromIPC(new Uint8Array(buf));
- const flat = table.getChild('flat'); // all strings under 12 bytes
- const spill = table.getChild('spill'); // some strings spill to data buffer
- compare(flat, f);
- compare(spill, s);
+ const data = await utf8View();
+ for (const { bytes, values: { flat, spill } } of data) {
+ valueTest(bytes, flat, Array, {}, null, 'flat');
+ valueTest(bytes, spill, Array, {}, null, 'spill');
+ }
});
it('decodes empty data', async () => {
- // For empty result sets, DuckDB node only returns a zero byte
- // Other variants may include a schema message
- const sql = 'SELECT schema_name FROM information_schema.schemata WHERE false';
- const table = tableFromIPC(await arrowQuery(sql));
- assert.strictEqual(table.numRows, 0);
- assert.strictEqual(table.numCols, 0);
- assert.deepStrictEqual(table.toColumns(), {});
- assert.deepStrictEqual(table.toArray(), []);
- assert.deepStrictEqual([...table], []);
+ for (const { bytes } of empty()) {
+ const table = tableFromIPC(bytes);
+ assert.strictEqual(table.numRows, 0);
+ assert.strictEqual(table.numCols, 0);
+ assert.deepStrictEqual(table.toColumns(), {});
+ assert.deepStrictEqual(table.toArray(), []);
+ assert.deepStrictEqual([...table], []);
+ }
});
});
diff --git a/test/table-to-ipc-test.js b/test/table-to-ipc-test.js
new file mode 100644
index 0000000..8281a57
--- /dev/null
+++ b/test/table-to-ipc-test.js
@@ -0,0 +1,60 @@
+import assert from 'node:assert';
+import { readFile } from 'node:fs/promises';
+import { Version, tableFromIPC, tableToIPC } from '../src/index.js';
+import * as dataMethods from './util/data.js';
+
+const files = [
+ 'flights.arrows',
+ 'scrabble.arrows',
+ 'convert.arrows',
+ 'decimal.arrows'
+];
+
+describe('tableToIPC', () => {
+ for (const [name, method] of Object.entries(dataMethods)) {
+ it(`encodes ${name} data`, async () => {
+ const data = await method();
+ data.forEach(({ bytes }) => testEncode(bytes));
+ });
+ }
+
+ for (const file of files) {
+ it(`encodes ${file}`, async () => {
+ const bytes = new Uint8Array(await readFile(`test/data/${file}`));
+ testEncode(bytes);
+ });
+ }
+});
+
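+// Decode the given IPC bytes, re-encode the table with tableToIPC, decode
+// again, and assert that schema, shape, and extracted values all survive
+// the round trip.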
+function testEncode(bytes) {
+ // load table
+ const table = tableFromIPC(bytes);
+
+  // build the expected schema: fill in defaults, force version to V5 (as written by the encoder)
+ const schema = {
+ dictionaryTypes: new Map,
+ endianness: 0,
+ metadata: null,
+ ...table.schema,
+ version: Version.V5
+ };
+
+  // encode the table to IPC bytes
+  const ipc = tableToIPC(table);
+
+  // parse the IPC bytes to get a "round-trip" table
+ const round = tableFromIPC(ipc);
+
+ // check schema and shape equality
+ assert.deepStrictEqual(round.schema, schema);
+ assert.strictEqual(round.numRows, table.numRows);
+ assert.strictEqual(round.numCols, table.numCols);
+
+ // check extracted value equality
+ for (let i = 0; i < table.numCols; ++i) {
+ assert.deepStrictEqual(
+ round.getChildAt(i).toArray(),
+ table.getChildAt(i).toArray()
+ );
+ }
+}
diff --git a/test/util/data.js b/test/util/data.js
new file mode 100644
index 0000000..c3d729c
--- /dev/null
+++ b/test/util/data.js
@@ -0,0 +1,308 @@
+import { readFile } from 'node:fs/promises';
+import { arrowFromDuckDB } from './arrow-from-duckdb.js';
+
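+// convert a date string to an epoch-milliseconds timestamp, preserving
+// sub-millisecond digits via the optional offset; null passes through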
+const toTimestamp = (v, off = 0) => v == null ? null : (+new Date(v) + off);
+const toInt32s = v => v == null ? null : v.some(x => x == null) ? v : Int32Array.of(...v);
+
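+// Build one test case per input array: query DuckDB for the Arrow IPC
+// bytes, pair them with the expected JS values (jsValues overrides the
+// raw input when the decoded representation differs), and compute the
+// expected null count.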
+async function dataQuery(data, type, jsValues) {
+ return Promise.all(data.map(async (array, i) => {
+ const values = jsValues?.[i] ?? array;
+ return {
+ values,
+ bytes: await arrowFromDuckDB(array, type),
+ nullCount: values.reduce((nc, v) => v == null ? ++nc : nc, 0)
+ };
+ }));
+}
+
+export function bool() {
+ return dataQuery([
+ [true, false, true],
+ [true, false, null]
+ ], 'BOOLEAN');
+}
+
+export function uint8() {
+ return dataQuery([
+ [1, 2, 3],
+ [1, null, 3]
+ ], 'UTINYINT');
+}
+
+export function uint16() {
+ return dataQuery([
+ [1, 2, 3],
+ [1, null, 3]
+ ], 'USMALLINT');
+}
+
+export function uint32() {
+ return dataQuery([
+ [1, 2, 3],
+ [1, null, 3]
+ ], 'UINTEGER');
+}
+
+export function uint64() {
+ return dataQuery([
+ [1, 2, 3],
+ [1, null, 3]
+ ], 'UBIGINT');
+}
+
+export function int8() {
+ return dataQuery([
+ [1, 2, 3],
+ [1, null, 3]
+ ], 'TINYINT');
+}
+
+export function int16() {
+ return dataQuery([
+ [1, 2, 3],
+ [1, null, 3]
+ ], 'SMALLINT');
+}
+
+export function int32() {
+ return dataQuery([
+ [1, 2, 3],
+ [1, null, 3]
+ ], 'INTEGER');
+}
+
+export function int64() {
+ return dataQuery([
+ [1, 2, 3],
+ [1, null, 3]
+ ], 'BIGINT');
+}
+
+export function float32() {
+ return dataQuery([
+ [1.1, 2.2, 3.3],
+ [1.1, null, 3.3]
+ ], 'FLOAT');
+}
+
+export function float64() {
+ return dataQuery([
+ [1.1, 2.2, 3.3],
+ [1.1, null, 3.3]
+ ], 'DOUBLE');
+}
+
+export function decimal() {
+ return dataQuery([
+ [1.212, 3.443, 5.600],
+ [1.212, null, 5.600]
+ ], 'DECIMAL(18,3)');
+}
+
+export function dateDay() {
+ const data = [
+ ['2001-01-01', '2004-02-03', '2006-12-31'],
+ ['2001-01-01', null, '2006-12-31']
+ ];
+ const vals = data.map(v => v.map(d => toTimestamp(d)));
+ return dataQuery(data, 'DATE', vals);
+}
+
+export function timestampNanosecond() {
+ const ns = [0.456, 0.738]; // DuckDB truncates here
+ const ts = ['1992-09-20T11:30:00.123456789Z', '2002-12-13T07:28:56.564738209Z'];
+ const data = [ts, ts.concat(null)];
+ const vals = data.map(v => v.map((d, i) => toTimestamp(d, ns[i])));
+ return dataQuery(data, 'TIMESTAMP_NS', vals);
+}
+
+export function timestampMicrosecond() {
+ const us = [0.457, 0.738]; // DuckDB rounds here
+ const ts = ['1992-09-20T11:30:00.123457Z', '2002-12-13T07:28:56.564738Z'];
+ const data = [ts, ts.concat(null)];
+ const vals = data.map(v => v.map((d, i) => toTimestamp(d, us[i])));
+  // DuckDB's plain TIMESTAMP type stores microsecond precision
+  return dataQuery(data, 'TIMESTAMP', vals);
+}
+
+export function timestampMillisecond() {
+ const ts = ['1992-09-20T11:30:00.123Z', '2002-12-13T07:28:56.565Z'];
+ const data = [ts, ts.concat(null)];
+ const vals = data.map(v => v.map(d => toTimestamp(d)));
+ return dataQuery(data, 'TIMESTAMP_MS', vals);
+}
+
+export function timestampSecond() {
+ const ts = ['1992-09-20T11:30:00Z', '2002-12-13T07:28:57Z'];
+ const data = [ts, ts.concat(null)];
+ const vals = data.map(v => v.map(d => toTimestamp(d)));
+ return dataQuery(data, 'TIMESTAMP_S', vals);
+}
+
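+// expected values are [months, days, nanoseconds] triples: '2 years' is
+// 24 months, '1 microsecond' is 1000 nanoseconds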
+export function intervalMonthDayNano() {
+ return dataQuery([
+ ['2 years', null, '12 years 2 month 1 day 5 seconds', '1 microsecond']
+ ], 'INTERVAL', [[
+ Float64Array.of(24, 0, 0),
+ null,
+ Float64Array.of(146, 1, 5000000000),
+ Float64Array.of(0, 0, 1000)
+ ]]);
+}
+
+export function utf8() {
+ return dataQuery([
+ ['foo', 'bar', 'baz'],
+ ['foo', null, 'baz']
+ ], 'VARCHAR');
+}
+
+export function listInt32() {
+ const data = [
+ [[1, 2, 3, 4], [5, 6], [7, 8, 9]],
+ [[1, 2, 3, 4], null, [7, 8, 9]],
+ [[1, 2, null, 4], [5, null, 6], [7, null, 9]]
+ ];
+ const vals = data.map(v => v.map(toInt32s));
+ return dataQuery(data, 'INTEGER[]', vals);
+}
+
+export function listUtf8() {
+ return dataQuery([
+ [['a', 'b', 'c', 'd'], ['e', 'f'], ['g', 'h', 'i']],
+ [['a', 'b', 'c', 'd'], null, ['g', 'h', 'i']],
+ [['a', 'b', null, 'd'], ['e', null, 'f'], ['g', null, 'i']]
+ ], 'VARCHAR[]');
+}
+
+export function fixedListInt32() {
+ const data = [
+ [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
+ [[1, 2, 3], null, [7, 8, 9]],
+ [[1, null, 3], [null, 5, 6], [7, 8, null]]
+ ];
+ const vals = data.map(v => v.map(toInt32s));
+ return dataQuery(data, 'INTEGER[3]', vals);
+}
+
+export function fixedListUtf8() {
+ return dataQuery([
+ [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']],
+ [['a', 'b', 'c'], null, ['g', 'h', 'i']],
+ [['a', null, 'c'], [null, 'e', 'f'], ['g', 'h', null]]
+ ], 'VARCHAR[3]');
+}
+
+export function union() {
+ return dataQuery([
+ ['a', 2, 'c'],
+ ['a', null, 'c']
+ ], 'UNION(i INTEGER, v VARCHAR)');
+}
+
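+// no SQL type is passed below, so these cases rely on arrowFromDuckDB
+// inferring MAP and STRUCT types from the JavaScript values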
+export function map() {
+ return dataQuery([
+ [
+ new Map([ ['foo', 1], ['bar', 2] ]),
+ new Map([ ['foo', null], ['baz', 3] ])
+ ]
+ ]);
+}
+
+export function struct() {
+ return dataQuery([
+ [ {a: 1, b: 'foo'}, {a: 2, b: 'baz'} ],
+ [ {a: 1, b: 'foo'}, null, {a: 2, b: 'baz'} ],
+ [ {a: null, b: 'foo'}, {a: 2, b: null} ],
+ [ {a: ['a', 'b'], b: Math.E}, {a: ['c', 'd'], b: Math.PI} ]
+ ]);
+}
+
+export async function listView() {
+  const bytes = new Uint8Array(await readFile(`test/data/listview.arrows`));
+ return [{
+ values: [
+ ['foo', 'bar', 'baz'],
+ null,
+ ['baz', null, 'foo'],
+ ['foo']
+ ],
+ bytes,
+ nullCount: 1
+ }];
+}
+
+export async function largeListView() {
+  const bytes = new Uint8Array(await readFile(`test/data/largelistview.arrows`));
+ return [{
+ values: [
+ ['foo', 'bar', 'baz'],
+ null,
+ ['baz', null, 'foo'],
+ ['foo']
+ ],
+ bytes,
+ nullCount: 1
+ }];
+}
+
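+// run ends are cumulative end offsets: counts [2, 3, 4, 6, 8, 9] paired
+// with values ['foo', null, 'bar', 'baz', null, 'foo'] expand to the
+// nine-element values array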
+export async function runEndEncoded32() {
+ const bytes = new Uint8Array(await readFile(`test/data/runendencoded.arrows`));
+ return [{
+ runs: {
+ counts: [2, 3, 4, 6, 8, 9],
+ values: ['foo', null, 'bar', 'baz', null, 'foo']
+ },
+ values: ['foo', 'foo', null, 'bar', 'baz', 'baz', null, null, 'foo'],
+ bytes,
+ nullCount: 3
+ }];
+}
+
+export async function runEndEncoded64() {
+ const bytes = new Uint8Array(await readFile(`test/data/runendencoded64.arrows`));
+ return [{
+ runs: {
+ counts: [2, 3, 4, 6, 8, 9],
+ values: ['foo', null, 'bar', 'baz', null, 'foo']
+ },
+ values: ['foo', 'foo', null, 'bar', 'baz', 'baz', null, null, 'foo'],
+ bytes,
+ nullCount: 3
+ }];
+}
+
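+// view layouts keep strings of at most 12 bytes inline: every 'flat' value
+// fits, while the long first 'spill' value spills to a data buffer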
+export async function binaryView() {
+ const encoder = new TextEncoder();
+ const toBytes = v => v == null ? null : encoder.encode(v);
+ const bytes = new Uint8Array(await readFile(`test/data/binaryview.arrows`));
+ return [{
+ values: {
+ flat: ['foo', 'foo', null, 'bar', 'baz', 'baz', null, null, 'foo'].map(toBytes),
+ spill: ['foobazbarbipbopboodeedoozoo', 'foo', null, 'bar', 'baz', 'baz', null, null, 'foo'].map(toBytes)
+ },
+ bytes,
+ nullCount: 3
+ }];
+}
+
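+// as with binaryView: 'flat' strings stay inline, the long first 'spill'
+// string spills to a data buffer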
+export async function utf8View() {
+ const bytes = new Uint8Array(await readFile(`test/data/utf8view.arrows`));
+ return [{
+ values: {
+ flat: ['foo', 'foo', null, 'bar', 'baz', 'baz', null, null, 'foo'],
+ spill: ['foobazbarbipbopboodeedoozoo', 'foo', null, 'bar', 'baz', 'baz', null, null, 'foo']
+ },
+ bytes,
+ nullCount: 3
+ }];
+}
+
+// For empty result sets, DuckDB node returns only four zero bytes
+// (an end-of-stream marker); other variants may include a schema message
+export function empty() {
+ return [{
+ values: [],
+ bytes: Uint8Array.of(0, 0, 0, 0),
+ nullCount: 0
+ }];
+}
diff --git a/test/util/decimal.js b/test/util/decimal.js
new file mode 100644
index 0000000..080730d
--- /dev/null
+++ b/test/util/decimal.js
@@ -0,0 +1,42 @@
+import { Type, Version } from '../../src/index.js';
+
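+// Hand-built message data for a single nullable DECIMAL(18,3) column.
+// The 48-byte buffer holds three little-endian 128-bit integers (1000,
+// 12000, 34000), i.e. 1.000, 12.000, and 34.000 at scale 3.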
+export function decimalDataToEncode() {
+ return {
+ schema: {
+ version: Version.V5,
+ endianness: 0,
+ fields: [{
+ name: 'd',
+ type: { typeId: Type.Decimal, precision: 18, scale: 3, bitWidth: 128, values: BigUint64Array },
+ nullable: true,
+ metadata: null
+ }],
+ metadata: null,
+ dictionaryTypes: new Map
+ },
+ records: [{
+ length: 3,
+ nodes: [ { length: 3, nullCount: 0 } ],
+ regions: [
+ { offset: 0, length: 0 },
+ { offset: 0, length: 48 }
+ ],
+ variadic: [],
+ buffers: [
+ Uint8Array.of(232,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,224,46,0,0,0,0,0,0,0,0,0,0,0,0,0,0,208,132,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
+ ],
+ byteLength: 48
+ }],
+ dictionaries: [],
+ metadata: null
+ };
+}
+
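+// The decoded counterpart of the data above: after parsing, each record
+// carries its raw bytes as `body` rather than `buffers` and `byteLength`.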
+export function decimalDataDecoded() {
+ const data = decimalDataToEncode();
+ const record = data.records[0];
+ record.body = record.buffers[0];
+ delete record.byteLength;
+ delete record.buffers;
+ return data;
+}