Add struct/row proxy objects via useProxy option. #15

Merged
merged 1 commit into from
Sep 11, 2024
11 changes: 7 additions & 4 deletions README.md
@@ -4,13 +4,15 @@

Flechette performs fast extraction and encoding of data columns in the Arrow binary IPC format, supporting ingestion of Arrow data from sources such as [DuckDB](https://duckdb.org/) and Arrow use in JavaScript data analysis tools like [Arquero](https://github.com/uwdata/arquero), [Mosaic](https://github.com/uwdata/mosaic), [Observable Plot](https://observablehq.com/plot/), and [Vega-Lite](https://vega.github.io/vega-lite/).

For documentation, see the [**API Reference**](https://idl.uw.edu/flechette/api).

## Why Flechette?

In the process of developing multiple data analysis packages that consume Arrow data (including Arquero, Mosaic, and Vega), we've had to develop workarounds for the performance and correctness of the Arrow JavaScript reference implementation. Instead of workarounds, Flechette addresses these issues head-on.

* _Speed_. Flechette provides better performance. Performance tests show 1.3-1.6x faster value iteration, 2-7x faster array extraction, 5-9x faster row object extraction, and 1.5-3.5x faster building of Arrow columns.
* _Speed_. Flechette provides better performance. Performance tests show 1.3-1.6x faster value iteration, 2-7x faster array extraction, 7-11x faster row object extraction, and 1.5-3.5x faster building of Arrow columns.

* _Size_. Flechette is smaller: ~42k minified (~13k gzip'd) versus 163k minified (~43k gzip'd) for Arrow JS. Flechette's encoders and decoders also tree-shake cleanly, so you only pay for what you need in your own bundles.
* _Size_. Flechette is smaller: ~42k minified (~14k gzip'd) versus 163k minified (~43k gzip'd) for Arrow JS. Flechette's encoders and decoders also tree-shake cleanly, so you only pay for what you need in your own bundles.

* _Coverage_. Flechette supports data types unsupported by the reference implementation, including decimal-to-number conversion, month/day/nanosecond time intervals (as used by DuckDB, for example), run-end encoded data, binary views, and list views.

@@ -108,18 +110,19 @@ const ipcTyped = tableToIPC(tableTyped, { format: 'file' });

### Customize Data Extraction

Data extraction can be customized using options provided to the table generation method. By default, temporal data is returned as numeric timestamps, 64-bit integers are coerced to numbers, and map-typed data is returned as an array of [key, value] pairs. These defaults can be changed via conversion options that push (or remove) transformations to the underlying data batches.
Data extraction can be customized using options provided to table generation methods. By default, temporal data is returned as numeric timestamps, 64-bit integers are coerced to numbers, map-typed data is returned as an array of [key, value] pairs, and struct/row objects are returned as vanilla JS objects with extracted property values. These defaults can be changed via conversion options that push (or remove) transformations to the underlying data batches.

```js
const table = tableFromIPC(ipc, {
useDate: true, // map dates and timestamps to Date objects
useDecimalBigInt: true, // use BigInt for decimals, do not coerce to number
useBigInt: true, // use BigInt for 64-bit ints, do not coerce to number
useMap: true, // create Map objects for [key, value] pair lists
useProxy: true // use zero-copy proxies for struct and table row objects
});
```

The same extraction options can be passed to `tableFromArrays`.
The same extraction options can be passed to `tableFromArrays`. For more, see the [**API Reference**](https://idl.uw.edu/flechette/api).

## Build Instructions

4 changes: 2 additions & 2 deletions docs/api/data-types.md
@@ -7,7 +7,7 @@ title: Flechette API Reference

## Data Type Overview

The table below provides an overview of all data types supported by the Apache Arrow format and how Flechette maps them to JavaScript types. The table indicates if Flechette can read the type (via [`tableFromIPC`](/flechette/api/#tableFromIPC)), write the type (via [`tableToIPC`](/flechette/api/#tableToIPC)), and build the type from JavaScript values (via [`tableFromArrays`](/flechette/api/#tableFromArrays) or [`columnFromArray`](/flechette/api/#tableFromArray)).
The table below provides an overview of all data types supported by the Apache Arrow format and how Flechette maps them to JavaScript types. The table indicates if Flechette can read the type (via [`tableFromIPC`](/flechette/api/#tableFromIPC)), write the type (via [`tableToIPC`](/flechette/api/#tableToIPC)), and build the type from JavaScript values (via [`tableFromArrays`](/flechette/api/#tableFromArrays) or [`columnFromArray`](/flechette/api/#columnFromArray)).

| Id | Data Type | Read? | Write? | Build? | JavaScript Type |
| --: | ----------------------------------- | :---: | :----: | :----: | --------------- |
@@ -341,7 +341,7 @@ Extracted JavaScript values depend on the child types.
* *mode* (`number`): The union mode. One of `UnionMode.Sparse` or `UnionMode.Dense`.
* *children* (`(DataType[] | Field)[]`): The children fields or data types. Types are mapped to nullable fields with no metadata.
* *typeIds* (`number[]`): Children type ids, in the same order as the children types. Type ids provide a level of indirection over children types. If not provided, the children indices are used as the type ids.
* *typeIdForValue* (`(value: any, index: number) => number`): A function that takes an arbitrary value and a row index and returns a corresponding union type id. This function is required to build union-typed data with [`tableFromArrays`](/flechette/api/#tableFromArrays) or [`columnFromArray`](/flechette/api/#tableFromArray).
* *typeIdForValue* (`(value: any, index: number) => number`): A function that takes an arbitrary value and a row index and returns a corresponding union type id. This function is required to build union-typed data with [`tableFromArrays`](/flechette/api/#tableFromArrays) or [`columnFromArray`](/flechette/api/#columnFromArray).

### FixedSizeBinary

6 changes: 4 additions & 2 deletions docs/api/index.md
@@ -18,12 +18,13 @@ title: Flechette API Reference

Decode [Apache Arrow IPC data](https://arrow.apache.org/docs/format/Columnar.html#serialization-and-interprocess-communication-ipc) and return a new [`Table`](table). The input binary data may be either an `ArrayBuffer` or `Uint8Array`. For Arrow data in the [IPC 'stream' format](https://arrow.apache.org/docs/format/Columnar.html#ipc-streaming-format), an array of `Uint8Array` values is also supported.

* *data* (`ArrayBuffer` | `Uint8Array` | `Uint8Array[]`): The source byte buffer, or an array of buffers. If an array, each byte array may contain one or more self-contained messages. Messages may NOT span multiple byte arrays.
* *data* (`ArrayBuffer` \| `Uint8Array` \| `Uint8Array[]`): The source byte buffer, or an array of buffers. If an array, each byte array may contain one or more self-contained messages. Messages may NOT span multiple byte arrays.
* *options* (`ExtractionOptions`): Options for controlling how values are transformed when extracted from an Arrow binary representation.
* *useDate* (`boolean`): If true, extract dates and timestamps as JavaScript `Date` objects. Otherwise, return numerical timestamp values (default).
* *useDecimalBigInt* (`boolean`): If true, extract decimal-type data as BigInt values, where fractional digits are scaled to integers. Otherwise, return converted floating-point numbers (default).
* *useBigInt* (`boolean`): If true, extract 64-bit integers as JavaScript `BigInt` values. Otherwise, coerce long integers to JavaScript number values (default).
* *useMap* (`boolean`): If true, extract Arrow 'Map' values as JavaScript `Map` instances. Otherwise, return an array of [key, value] pairs compatible with both `Map` and `Object.fromEntries` (default).
* *useProxy* (`boolean`): If true, extract Arrow 'Struct' values and table row objects using zero-copy proxy objects that extract data from underlying Arrow batches. The proxy objects can improve performance and reduce memory usage, but do not support property enumeration (`Object.keys`, `Object.values`, `Object.entries`) or spreading (`{ ...object }`).
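
The value shapes that these options choose between can be illustrated with plain JavaScript stand-ins. The sketch below does not call Flechette: the sample `pairs` array, the `scale` of 2, and the `unscaled` value are made-up illustrations of a default map-typed value and of a decimal as it might look under `useDecimalBigInt`.

```js
// Stand-in for a map-typed value extracted with the default options:
// an array of [key, value] pairs (sample data, not Flechette output).
const pairs = [['a', 1], ['b', 2]];

// The same pair list can feed both Map and Object.fromEntries.
const asMap = new Map(pairs);
const asObject = Object.fromEntries(pairs);

// Stand-in for a decimal under useDecimalBigInt: the value 12.34 with an
// assumed scale of 2 is held as the scaled integer 1234n.
const scale = 2;
const unscaled = 1234n;

// Converting back to a number can lose precision for large values;
// it is shown here only to make the scaling visible.
const asNumber = Number(unscaled) / 10 ** scale;

console.log(asMap.get('b'), asObject.a, asNumber);
```

Keeping the pair list as the default means a caller can pick either container without the library committing to one up front.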

*Examples*

@@ -134,11 +135,12 @@ const col = columnFromArray(
```

<hr/><a id="tableFromColumns" href="#tableFromColumns">#</a>
<b>tableFromColumns</b>(<i>columns</i>[, <i>type</i>, <i>options</i>])
<b>tableFromColumns</b>(<i>columns</i>[, <i>useProxy</i>])

Create a new table from a collection of columns. This method is useful for creating new tables using one or more pre-existing column instances. Otherwise, [`tableFromArrays`](#tableFromArrays) should be preferred. Input columns are assumed to have the same record batch sizes and non-conflicting dictionary ids.

* *data* (`object | array`): The input columns as an object with name keys, or an array of [name, column] pairs.
* *useProxy* (`boolean`): Flag indicating if row proxy objects should be used to represent table rows (default `false`). Typically this should match the value of the `useProxy` extraction option used for column generation.

*Examples*

2 changes: 1 addition & 1 deletion docs/api/table.md
@@ -7,7 +7,7 @@ title: Flechette API Reference

## Table Class

A table consisting of named columns (or 'children'). To extract table data directly to JavaScript values, use [`toColumns()`](#toColumns) to produce an object that maps column names to extracted value arrays, or [`toArray()`](#toArray) to extract an array of row objects. Tables are [iterable](#iterator), iterating over row objects. While `toArray()` and [table iterators](#iterator) enable convenient use by tools that expect row objects, column-oriented processing is more efficient and thus recommended. Use [`getChild`](#getChild) or [`getChildAt`](#getChildAt) to access a specific [`Column`](column).
A table consists of named data [columns](#column) (or 'children'). To extract table data directly to JavaScript values, use [`toColumns()`](#toColumns) to produce an object that maps column names to extracted value arrays, or [`toArray()`](#toArray) to extract an array of row objects. Tables are [iterable](#iterator), iterating over row objects. While `toArray()` and [table iterators](#iterator) enable convenient use by tools that expect row objects, column-oriented processing is more efficient and thus recommended. Use [`getChild`](#getChild) or [`getChildAt`](#getChildAt) to access a specific [`Column`](column).

* [constructor](#constructor)
* [numCols](#numCols)
9 changes: 5 additions & 4 deletions docs/index.md
@@ -4,15 +4,15 @@

Flechette performs fast extraction and encoding of data columns in the Arrow binary IPC format, supporting ingestion of Arrow data from sources such as [DuckDB](https://duckdb.org/) and Arrow use in JavaScript data analysis tools like [Arquero](https://github.com/uwdata/arquero), [Mosaic](https://github.com/uwdata/mosaic), [Observable Plot](https://observablehq.com/plot/), and [Vega-Lite](https://vega.github.io/vega-lite/).

[**API Reference**](api)
For documentation, see the [**API Reference**](api).

## Why Flechette?

In the process of developing multiple data analysis packages that consume Arrow data (including Arquero, Mosaic, and Vega), we've had to develop workarounds for the performance and correctness of the Arrow JavaScript reference implementation. Instead of workarounds, Flechette addresses these issues head-on.

* _Speed_. Flechette provides better performance. Performance tests show 1.3-1.6x faster value iteration, 2-7x faster array extraction, 5-9x faster row object extraction, and 1.5-3.5x faster building of Arrow columns.
* _Speed_. Flechette provides better performance. Performance tests show 1.3-1.6x faster value iteration, 2-7x faster array extraction, 7-11x faster row object extraction, and 1.5-3.5x faster building of Arrow columns.

* _Size_. Flechette is smaller: ~42k minified (~13k gzip'd) versus 163k minified (~43k gzip'd) for Arrow JS. Flechette's encoders and decoders also tree-shake cleanly, so you only pay for what you need in your own bundles.
* _Size_. Flechette is smaller: ~42k minified (~14k gzip'd) versus 163k minified (~43k gzip'd) for Arrow JS. Flechette's encoders and decoders also tree-shake cleanly, so you only pay for what you need in your own bundles.

* _Coverage_. Flechette supports data types unsupported by the reference implementation, including decimal-to-number conversion, month/day/nanosecond time intervals (as used by DuckDB, for example), run-end encoded data, binary views, and list views.

@@ -110,14 +110,15 @@ const ipcTyped = tableToIPC(tableTyped, { format: 'file' });

### Customize Data Extraction

Data extraction can be customized using options provided to the table generation method. By default, temporal data is returned as numeric timestamps, 64-bit integers are coerced to numbers, and map-typed data is returned as an array of [key, value] pairs. These defaults can be changed via conversion options that push (or remove) transformations to the underlying data batches.
Data extraction can be customized using options provided to table generation methods. By default, temporal data is returned as numeric timestamps, 64-bit integers are coerced to numbers, map-typed data is returned as an array of [key, value] pairs, and struct/row objects are returned as vanilla JS objects with extracted property values. These defaults can be changed via conversion options that push (or remove) transformations to the underlying data batches.

```js
const table = tableFromIPC(ipc, {
useDate: true, // map dates and timestamps to Date objects
useDecimalBigInt: true, // use BigInt for decimals, do not coerce to number
useBigInt: true, // use BigInt for 64-bit ints, do not coerce to number
useMap: true, // create Map objects for [key, value] pair lists
useProxy: true // use zero-copy proxies for struct and table row objects
});
```

2 changes: 1 addition & 1 deletion perf/decode-perf.js
@@ -4,7 +4,7 @@ import { tableFromIPC as flTable } from '../src/index.js';
import { benchmark } from './util.js';

// table creation
const fl = bytes => flTable(bytes, { useBigInt: true });
const fl = bytes => flTable(bytes, { useBigInt: true, useProxy: true });
const aa = bytes => aaTable(bytes);

// decode ipc data to columns
6 changes: 3 additions & 3 deletions src/batch-type.js
@@ -1,10 +1,10 @@
import { BinaryBatch, BinaryViewBatch, BoolBatch, DateBatch, DateDayBatch, DateDayMillisecondBatch, DecimalBigIntBatch, DecimalNumberBatch, DenseUnionBatch, DictionaryBatch, DirectBatch, FixedBinaryBatch, FixedListBatch, Float16Batch, Int64Batch, IntervalDayTimeBatch, IntervalMonthDayNanoBatch, IntervalYearMonthBatch, LargeBinaryBatch, LargeListBatch, LargeListViewBatch, LargeUtf8Batch, ListBatch, ListViewBatch, MapBatch, MapEntryBatch, NullBatch, RunEndEncodedBatch, SparseUnionBatch, StructBatch, TimestampMicrosecondBatch, TimestampMillisecondBatch, TimestampNanosecondBatch, TimestampSecondBatch, Utf8Batch, Utf8ViewBatch } from './batch.js';
import { BinaryBatch, BinaryViewBatch, BoolBatch, DateBatch, DateDayBatch, DateDayMillisecondBatch, DecimalBigIntBatch, DecimalNumberBatch, DenseUnionBatch, DictionaryBatch, DirectBatch, FixedBinaryBatch, FixedListBatch, Float16Batch, Int64Batch, IntervalDayTimeBatch, IntervalMonthDayNanoBatch, IntervalYearMonthBatch, LargeBinaryBatch, LargeListBatch, LargeListViewBatch, LargeUtf8Batch, ListBatch, ListViewBatch, MapBatch, MapEntryBatch, NullBatch, RunEndEncodedBatch, SparseUnionBatch, StructBatch, StructProxyBatch, TimestampMicrosecondBatch, TimestampMillisecondBatch, TimestampNanosecondBatch, TimestampSecondBatch, Utf8Batch, Utf8ViewBatch } from './batch.js';
import { DateUnit, IntervalUnit, TimeUnit, Type } from './constants.js';
import { invalidDataType } from './data-types.js';

export function batchType(type, options = {}) {
const { typeId, bitWidth, precision, unit } = type;
const { useBigInt, useDate, useDecimalBigInt, useMap } = options;
const { useBigInt, useDate, useDecimalBigInt, useMap, useProxy } = options;

switch (typeId) {
case Type.Null: return NullBatch;
@@ -47,7 +47,7 @@ export function batchType(type, options = {}) {
case Type.ListView: return ListViewBatch;
case Type.LargeListView: return LargeListViewBatch;
case Type.FixedSizeList: return FixedListBatch;
case Type.Struct: return StructBatch;
case Type.Struct: return useProxy ? StructProxyBatch : StructBatch;
case Type.RunEndEncoded: return RunEndEncodedBatch;
case Type.Dictionary: return DictionaryBatch;
case Type.Union: return type.mode ? DenseUnionBatch : SparseUnionBatch;
24 changes: 16 additions & 8 deletions src/batch.js
@@ -2,6 +2,7 @@ import { bisect, float64Array } from './util/arrays.js';
import { divide, fromDecimal128, fromDecimal256, toNumber } from './util/numbers.js';
import { decodeBit, readInt32, readInt64 } from './util/read.js';
import { decodeUtf8 } from './util/strings.js';
import { objectFactory, proxyFactory } from './util/struct.js';

/**
* Check if the input is a batch that supports direct access to
@@ -730,25 +731,32 @@ export class DenseUnionBatch extends SparseUnionBatch {
* @extends {ArrayBatch<Record<string, any>>}
*/
export class StructBatch extends ArrayBatch {
constructor(options) {
constructor(options, factory = objectFactory) {
super(options);
/** @type {string[]} */
// @ts-ignore
this.names = this.type.children.map(child => child.name);
this.factory = factory(this.names, this.children);
}

/**
* @param {number} index The value index.
* @returns {Record<string, any>}
*/
value(index) {
const { children, names } = this;
const n = names.length;
const struct = {};
for (let i = 0; i < n; ++i) {
struct[names[i]] = children[i].at(index);
}
return struct;
return this.factory(index);
}
}

/**
* A batch of struct values, containing a set of named properties.
* Structs are returned as proxy objects that extract data directly
* from underlying Arrow batches.
* @extends {StructBatch}
*/
export class StructProxyBatch extends StructBatch {
constructor(options) {
super(options, proxyFactory);
}
}
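
This diff does not include `./util/struct.js`, so the following is only a sketch of what factories with the call shape `factory(names, children)` could look like. The bodies of `objectFactory` and `proxyFactory` below are assumptions, as are the `fakeBatch` helper, the `RowObject` class, the `rowIndex` property, and the sample data; the actual implementation may differ. The sketch also shows one way a getter-backed row object ends up without enumerable own properties.

```js
// Hypothetical stand-in for an Arrow batch: only at(index) is modeled.
const fakeBatch = values => ({ at: index => values[index] });

// Assumed shape: copy every child value into a fresh plain object.
function objectFactory(names, batches) {
  return index => {
    const struct = {};
    for (let i = 0; i < names.length; ++i) {
      struct[names[i]] = batches[i].at(index);
    }
    return struct;
  };
}

// Assumed shape: define one getter per field on a shared prototype, so each
// row object stores only its index and reads child values on access.
function proxyFactory(names, batches) {
  class RowObject {
    constructor(index) {
      // Non-enumerable, so the index does not show up in Object.keys.
      Object.defineProperty(this, 'rowIndex', { value: index });
    }
  }
  names.forEach((name, i) => {
    Object.defineProperty(RowObject.prototype, name, {
      get() { return batches[i].at(this.rowIndex); }
    });
  });
  return index => new RowObject(index);
}

const names = ['x', 'y'];
const children = [fakeBatch([1, 2, 3]), fakeBatch(['a', 'b', 'c'])];

const plain = objectFactory(names, children)(1);
const proxy = proxyFactory(names, children)(1);

console.log(plain.x, proxy.x);          // both read the same child value
console.log(Object.keys(plain).length); // own enumerable properties: x and y
console.log(Object.keys(proxy).length); // getters live on the prototype
console.log({ ...proxy });              // spread copies own enumerable properties only
```

Under these assumptions, the getter-based variant avoids copying values up front, at the cost of the enumeration and spreading limits noted in the `useProxy` documentation.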

2 changes: 1 addition & 1 deletion src/build/table-from-arrays.js
@@ -19,5 +19,5 @@ export function tableFromArrays(data, options = {}) {
/** @type {[string, import('../column.js').Column]} */ (
[ name, columnFromArray(array, types[name], opt, ctx)]
));
return tableFromColumns(columns);
return tableFromColumns(columns, options.useProxy);
}
14 changes: 8 additions & 6 deletions src/build/table-from-columns.js
@@ -6,11 +6,13 @@ import { Table } from '../table.js';
* Create a new table from a collection of columns. Columns are assumed
* to have the same record batch sizes and consistent dictionary ids.
* @param {[string, import('../column.js').Column][]
* | Record<string, import('../column.js').Column>} data The columns,
* as an object with name keys, or an array of [name, column] pairs.
* @returns {Table} The new table.
*/
export function tableFromColumns(data) {
* | Record<string, import('../column.js').Column>} data The columns,
* as an object with name keys, or an array of [name, column] pairs.
* @param {boolean} [useProxy] Flag indicating if row proxy
* objects should be used to represent table rows (default `false`).
* @returns {Table} The new table.
*/
export function tableFromColumns(data, useProxy) {
const fields = [];
const dictionaryTypes = new Map;
const entries = Array.isArray(data) ? data : Object.entries(data);
@@ -39,5 +41,5 @@ export function tableFromColumns(data) {
dictionaryTypes
};

return new Table(schema, columns);
return new Table(schema, columns, useProxy);
}
2 changes: 1 addition & 1 deletion src/decode/table-from-ipc.js
@@ -67,7 +67,7 @@ export function createTable(data, options = {}) {
fields.forEach((f, i) => cols[i].add(visit(f.type, ctx)));
}

return new Table(schema, cols.map(c => c.done()));
return new Table(schema, cols.map(c => c.done()), options.useProxy);
}

/**