Add encoding support. (#13)
* feat: Add encoding support.

* test: Add encode performance tests.

* refactor: Dictionary type, array utils, types, comments.

* docs: Update README.md.

* docs: Add documentation.

* fix: Use tableFromColumns method.

* feat: Export Batch and batchType.
jheer authored Sep 11, 2024
1 parent 789dcbd commit 4cad0e5
Showing 79 changed files with 6,467 additions and 1,190 deletions.
59 changes: 49 additions & 10 deletions README.md
@@ -1,24 +1,24 @@
# Flechette

- **Flechette** is a JavaScript library for reading the [Apache Arrow](https://arrow.apache.org/) columnar in-memory data format. It provides a faster, lighter, zero-dependency alternative to the [Arrow JS reference implementation](https://github.com/apache/arrow/tree/main/js).
+ **Flechette** is a JavaScript library for reading and writing the [Apache Arrow](https://arrow.apache.org/) columnar in-memory data format. It provides a faster, lighter, zero-dependency alternative to the [Arrow JS reference implementation](https://github.com/apache/arrow/tree/main/js).

- Flechette performs fast extraction of data columns in the Arrow binary IPC format, supporting ingestion of Arrow data (from sources such as [DuckDB](https://duckdb.org/)) for downstream use in JavaScript data analysis tools like [Arquero](https://github.com/uwdata/arquero), [Mosaic](https://github.com/uwdata/mosaic), [Observable Plot](https://observablehq.com/plot/), and [Vega-Lite](https://vega.github.io/vega-lite/).
+ Flechette performs fast extraction and encoding of data columns in the Arrow binary IPC format, supporting ingestion of Arrow data from sources such as [DuckDB](https://duckdb.org/) and Arrow use in JavaScript data analysis tools like [Arquero](https://github.com/uwdata/arquero), [Mosaic](https://github.com/uwdata/mosaic), [Observable Plot](https://observablehq.com/plot/), and [Vega-Lite](https://vega.github.io/vega-lite/).

## Why Flechette?

In the process of developing multiple data analysis packages that consume Arrow data (including Arquero, Mosaic, and Vega), we've had to develop workarounds for the performance and correctness of the Arrow JavaScript reference implementation. Instead of workarounds, Flechette addresses these issues head-on.

- * _Speed_. Flechette provides faster decoding. Across varied datasets, initial performance tests show 1.3-1.6x faster value iteration, 2-7x faster array extraction, and 5-9x faster row object extraction.
+ * _Speed_. Flechette provides better performance. Performance tests show 1.3-1.6x faster value iteration, 2-7x faster array extraction, 5-9x faster row object extraction, and 1.5-3.5x faster building of Arrow columns.

- * _Size_. Flechette is ~17k minified (~6k gzip'd), versus 163k minified (~43k gzip'd) for Arrow JS.
+ * _Size_. Flechette is smaller: ~42k minified (~13k gzip'd) versus 163k minified (~43k gzip'd) for Arrow JS. Flechette's encoders and decoders also tree-shake cleanly, so you only pay for what you need in your own bundles.

- * _Coverage_. Flechette supports data types unsupported by the reference implementation at the time of writing, including decimal-to-number conversion, month/day/nanosecond time intervals (as used by DuckDB, for example), list views, and run-end encoded data.
+ * _Coverage_. Flechette supports data types unsupported by the reference implementation, including decimal-to-number conversion, month/day/nanosecond time intervals (as used by DuckDB, for example), run-end encoded data, binary views, and list views.

* _Flexibility_. Flechette includes options to control data value conversion, such as numerical timestamps vs. Date objects for temporal data, and numbers vs. bigint values for 64-bit integer data.

- * _Simplicity_. Our goal is to provide a smaller, simpler code base in the hope that it will make it easier for ourselves and others to improve the library. If you'd like to see support for additional Arrow data types or features, please [file an issue](https://github.com/uwdata/flechette/issues) or [open a pull request](https://github.com/uwdata/flechette/pulls).
+ * _Simplicity_. Our goal is to provide a smaller, simpler code base in the hope that it will make it easier for ourselves and others to improve the library. If you'd like to see support for additional Arrow features, please [file an issue](https://github.com/uwdata/flechette/issues) or [open a pull request](https://github.com/uwdata/flechette/pulls).

- That said, no tool is without limitations or trade-offs. Flechette is *consumption oriented*: it does not yet support encoding (though feel free to [upvote encoding support](https://github.com/uwdata/flechette/issues/1)!). Flechette also requires simpler inputs (byte buffers, no promises or streams), has less strict TypeScript typings, and at times has a slightly slower initial parse (as it decodes dictionary data upfront for faster downstream access).
+ That said, no tool is without limitations or trade-offs. Flechette assumes simpler inputs (byte buffers, no promises or streams), has less strict TypeScript typings, and may have a slightly slower initial parse (as it decodes dictionary data upfront for faster downstream access).
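Run-end encoding, one of the layouts mentioned above, stores each run of repeated values once alongside the (exclusive) end index of each run; a value lookup is a binary search over the run ends. Here is a minimal, library-independent sketch of the idea — the function name is ours for illustration, not part of Flechette's API:

```javascript
// Look up a value in a run-end encoded column: runEnds[i] is the index
// *after* the last row covered by values[i].
function runEndValueAt(runEnds, values, index) {
  // binary search for the first run end greater than `index`
  let lo = 0, hi = runEnds.length - 1;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (runEnds[mid] > index) hi = mid;
    else lo = mid + 1;
  }
  return values[lo];
}

// runs: 'a' x 3, 'b' x 2, 'c' x 1 -> logical column a, a, a, b, b, c
const runEnds = [3, 5, 6];
const values = ['a', 'b', 'c'];
const decoded = Array.from({ length: 6 }, (_, i) => runEndValueAt(runEnds, values, i));
// decoded is ['a', 'a', 'a', 'b', 'b', 'c']
```

Only two short arrays are stored for six logical rows, which is why this layout pays off for low-cardinality columns.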

## What's with the name?

@@ -70,18 +70,57 @@ const objects = table.toArray();
const subtable = table.select(['delay', 'time']);
```

### Build and Encode Arrow Data

```js
import {
  bool, dictionary, float32, int32, tableFromArrays, tableToIPC, utf8
} from '@uwdata/flechette';

// data defined using standard JS types
// both arrays and typed arrays work well
const arrays = {
  ints: [1, 2, null, 4, 5],
  floats: [1.1, 2.2, 3.3, 4.4, 5.5],
  bools: [true, true, null, false, true],
  strings: ['a', 'b', 'c', 'b', 'a']
};

// create table with automatically inferred types
const tableInfer = tableFromArrays(arrays);

// encode table to bytes in Arrow IPC stream format
const ipcInfer = tableToIPC(tableInfer);

// create table using explicit types
const tableTyped = tableFromArrays(arrays, {
  types: {
    ints: int32(),
    floats: float32(),
    bools: bool(),
    strings: dictionary(utf8())
  }
});

// encode table to bytes in Arrow IPC file format
const ipcTyped = tableToIPC(tableTyped, { format: 'file' });
```

### Customize Data Extraction

Data extraction can be customized using options provided to the table generation method. By default, temporal data is returned as numeric timestamps, 64-bit integers are coerced to numbers, and map-typed data is returned as an array of [key, value] pairs. These defaults can be changed via conversion options that add (or remove) transformations applied to the underlying data batches.

```js
const table = tableFromIPC(ipc, {
-  useDate: true, // map temporal data to Date objects
-  useBigInt: true, // use BigInt, do not coerce to number
-  useMap: true // create Map objects for [key, value] pair lists
+  useDate: true, // map dates and timestamps to Date objects
+  useDecimalBigInt: true, // use BigInt for decimals, do not coerce to number
+  useBigInt: true, // use BigInt for 64-bit ints, do not coerce to number
+  useMap: true // create Map objects for [key, value] pair lists
});
```

The same extraction options can be passed to `tableFromArrays`.

## Build Instructions

To build and develop Flechette locally:
69 changes: 69 additions & 0 deletions docs/api/column.md
@@ -0,0 +1,69 @@
---
title: Flechette API Reference
---
# Flechette API Reference

[Top-Level](/flechette/api) | [Data Types](data-types) | [Table](table) | [**Column**](column)

## Column Class

A data column. A column provides a view over one or more value batches, each corresponding to part of an Arrow record batch. The Column class supports random access to column values by integer index using the [`at`](#at) method; however, extracting values using [`toArray`](#toArray) may provide a more efficient means of bulk access and scanning.

* [constructor](#constructor)
* [type](#type)
* [length](#length)
* [nullCount](#nullCount)
* [data](#data)
* [at](#at)
* [get](#get)
* [toArray](#toArray)
* [Symbol.iterator](#iterator)

<hr/><a id="constructor" href="#constructor">#</a>
Column.<b>constructor</b>(<i>data</i>)

Create a new column with the given data batches.

* *data* (`Batch[]`): The column data batches.

<hr/><a id="type" href="#type">#</a>
Column.<b>type</b>

The column [data type](data-types).

<hr/><a id="length" href="#length">#</a>
Column.<b>length</b>

The column length (number of rows).

<hr/><a id="nullCount" href="#nullCount">#</a>
Column.<b>nullCount</b>

The count of null values in the column.

<hr/><a id="data" href="#data">#</a>
Column.<b>data</b>

An array of column data batches.

<hr/><a id="at" href="#at">#</a>
Column.<b>at</b>(<i>index</i>)

Return the column value at the given *index*. If a column has multiple batches, this method performs binary search over the batch lengths to determine the batch from which to retrieve the value. The search makes lookup less efficient than a standard array access. If making multiple full scans of a column, consider extracting an array via `toArray()`.

* *index* (`number`): The row index.
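The batch lookup described above can be sketched as a binary search over cumulative batch offsets. This is an illustrative simplification, not Flechette's internal code, and the names here are hypothetical:

```javascript
// Locate the batch containing a row index via binary search over
// cumulative offsets, then index within that batch.
function columnAt(batches, offsets, index) {
  // offsets[i] is the total row count of batches 0..i (exclusive end)
  let lo = 0, hi = offsets.length - 1;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (offsets[mid] > index) hi = mid;
    else lo = mid + 1;
  }
  const start = lo === 0 ? 0 : offsets[lo - 1];
  return batches[lo][index - start];
}

// three batches of lengths 4, 3, 2 -> cumulative offsets [4, 7, 9]
const batches = [[10, 11, 12, 13], [20, 21, 22], [30, 31]];
const offsets = [4, 7, 9];
columnAt(batches, offsets, 5); // value from the second batch
```

Each random access costs O(log batches) rather than O(1), which is why repeated scans are better served by a one-time array extraction.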

<hr/><a id="get" href="#get">#</a>
Column.<b>get</b>(<i>index</i>)

Return the column value at the given *index*. This method is the same as [`at`](#at) and is provided for better compatibility with Apache Arrow JS.

<hr/><a id="toArray" href="#toArray">#</a>
Column.<b>toArray</b>()

Extract column values into a single array instance. When possible, a zero-copy subarray of the input Arrow data is returned as a typed array. If a column contains `null` values, a standard `Array` is created and populated instead.
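The strategy can be sketched in plain JavaScript. This is an illustrative simplification under assumed conditions (single all-valid typed-array batch for the zero-copy path), not Flechette's actual implementation:

```javascript
// Sketch: pick an extraction strategy based on batch count and null count.
function extractValues(batches, nullCount) {
  if (batches.length === 1 && nullCount === 0 && ArrayBuffer.isView(batches[0])) {
    // single all-valid typed-array batch: return it directly (zero copy)
    return batches[0];
  }
  // otherwise materialize into a standard Array (which can hold nulls)
  const out = [];
  for (const batch of batches) {
    for (const value of batch) out.push(value);
  }
  return out;
}

const single = new Float64Array([1.5, 2.5, 3.5]);
extractValues([single], 0);         // the same Float64Array, no copy
extractValues([[1, null], [3]], 1); // a standard Array: [1, null, 3]
```

The zero-copy path keeps bulk scans cheap, at the cost of the result aliasing the underlying Arrow buffer.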

<hr/><a id="iterator" href="#iterator">#</a>
Column.<b>[Symbol.iterator]</b>()

Return an iterator over the values in this column.
