Add struct/row proxy objects via useProxy option. #12

Closed
wants to merge 2 commits into from
15 changes: 8 additions & 7 deletions README.md
@@ -8,13 +8,13 @@ Flechette performs fast extraction of data columns in the Arrow binary IPC format

In the process of developing multiple data analysis packages that consume Arrow data (including Arquero, Mosaic, and Vega), we've had to develop workarounds for the performance and correctness of the Arrow JavaScript reference implementation. Instead of workarounds, Flechette addresses these issues head-on.

* _Speed_. Flechette provides faster decoding. Across varied datasets, initial performance tests show 1.3-1.6x faster value iteration, 2-7x faster array extraction, and 5-9x faster row object extraction.
* _Speed_. Flechette provides faster decoding. Initial performance tests show 1.3-1.6x faster value iteration, 2-2.5x faster random value access, 2-7x faster array extraction, and 7-11x faster row object extraction.

* _Size_. Flechette is ~17k minified (~6k gzip'd), versus 163k minified (~43k gzip'd) for Arrow JS.

* _Coverage_. Flechette supports data types unsupported by the reference implementation at the time of writing, including decimal-to-number conversion, month/day/nanosecond time intervals (as used by DuckDB, for example), list views, and run-end encoded data.

* _Flexibility_. Flechette includes options to control data value conversion, such as numerical timestamps vs. Date objects for temporal data, and numbers vs. bigint values for 64-bit integer data.
* _Flexibility_. Flechette includes options to control data value conversion, such as numerical timestamps vs. Date objects for temporal data, numbers vs. bigint values for 64-bit integer data, and vanilla JS objects vs. optimized proxy objects for structs.

* _Simplicity_. Our goal is to provide a smaller, simpler code base in the hope that it will make it easier for ourselves and others to improve the library. If you'd like to see support for additional Arrow data types or features, please [file an issue](https://github.com/uwdata/flechette/issues) or [open a pull request](https://github.com/uwdata/flechette/pulls).

@@ -61,7 +61,7 @@ const time0 = table.getChild('time').at(0);
// { delay: Int16Array, distance: Int16Array, time: Float32Array }
const columns = table.toColumns();

// convert Arrow data to an array of standard JS objects
// convert Arrow data to an array of JS objects
// [ { delay: 14, distance: 405, time: 0.01666666753590107 }, ... ]
const objects = table.toArray();

@@ -72,13 +72,14 @@ const subtable = table.select(['delay', 'time']);

### Customize Data Extraction

Data extraction can be customized using options provided to the table generation method. By default, temporal data is returned as numeric timestamps, 64-bit integers are coerced to numbers, and map-typed data is returned as an array of [key, value] pairs. These defaults can be changed via conversion options that push (or remove) transformations to the underlying data batches.
Data extraction can be customized using options provided to the table generation method. By default, temporal data is returned as numeric timestamps, 64-bit integers are coerced to numbers, map-typed data is returned as an array of [key, value] pairs, and struct/row objects are returned as vanilla JS objects with extracted property values. These defaults can be changed via conversion options that push (or remove) transformations to the underlying data batches.

```js
const table = tableFromIPC(ipc, {
useDate: true, // map temporal data to Date objects
useBigInt: true, // use BigInt, do not coerce to number
useMap: true // create Map objects for [key, value] pair lists
useDate: true, // use Date objects for temporal (Date, Timestamp) data
useBigInt: true, // use BigInt for large integers, do not coerce to number
useMap: true, // use Map objects for [key, value] pair lists
useProxy: true // use zero-copy proxies for struct and table row objects
});
```
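Rows produced under `useProxy` are not plain objects, but they still serialize cleanly because the proxies implement a `toJSON` hook, which `JSON.stringify` invokes automatically. A minimal, library-independent sketch of that pattern (the class and column names here are illustrative, not part of the Flechette API):

```javascript
// A row that stores only an index plus a reference to shared column data,
// materializing a plain object only when serialization requests it.
class SketchRow {
  constructor(index, columns) {
    this.index = index;      // per-row state: just the row position
    this.columns = columns;  // shared column arrays, not copied per row
  }
  // JSON.stringify calls toJSON() when it is present
  toJSON() {
    const obj = {};
    for (const [name, values] of Object.entries(this.columns)) {
      obj[name] = values[this.index];
    }
    return obj;
  }
}

const columns = { delay: [14, 22], distance: [405, 370] };
const row = new SketchRow(0, columns);
console.log(JSON.stringify(row)); // {"delay":14,"distance":405}
```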

2 changes: 1 addition & 1 deletion perf/perf-test.js
@@ -4,7 +4,7 @@ import { tableFromIPC as flTable } from '../src/index.js';
import { benchmark } from './util.js';

// table creation
const fl = bytes => flTable(bytes, { useBigInt: true });
const fl = bytes => flTable(bytes, { useProxy: true, useBigInt: true });
const aa = bytes => aaTable(bytes);

// parse ipc data to columns
18 changes: 7 additions & 11 deletions src/batch.js
@@ -733,32 +733,28 @@ export class DenseUnionBatch extends SparseUnionBatch {
*/
export class StructBatch extends ArrayBatch {
/**
* Create a new column batch.
* Create a new struct batch.
* @param {object} options
* @param {number} options.length The length of the batch
* @param {number} options.nullCount The null value count
* @param {Uint8Array} [options.validity] Validity bitmap buffer
* @param {Batch[]} options.children Children batches
* @param {string[]} options.names Child batch names
* @param {(names: string[], batches: Batch[]) =>
* (index: number) => Record<string, any>} options.factory
* Struct object factory creation method
*/
constructor({ names, ...rest }) {
constructor({ names, factory, ...rest }) {
super(rest);
/** @type {string[]} */
this.names = names;
this.factory = factory(names, this.children);
}

/**
* @param {number} index The value index.
* @returns {Record<string, any>}
*/
value(index) {
const { children, names } = this;
const n = names.length;
const struct = {};
for (let i = 0; i < n; ++i) {
struct[names[i]] = children[i].at(index);
}
return struct;
return this.factory(index);
}
}

72 changes: 72 additions & 0 deletions src/struct.js
@@ -0,0 +1,72 @@
export const RowIndex = Symbol('rowIndex');

/**
* Returns a row proxy object factory. The resulting method takes a
* batch-level row index as input and returns an object that proxies
* access to underlying batches.
* @param {string[]} names The column (property) names
* @param {import('./batch.js').Batch[]} batches The value batches.
* @returns {(index: number) => Record<string, any>}
*/
export function proxyFactory(names, batches) {
class RowObject {
/**
* Create a new proxy row object representing a struct or table row.
* @param {number} index The record batch row index.
*/
constructor(index) {
this[RowIndex] = index;
}

/**
* Return a JSON-compatible object representation.
*/
toJSON() {
return structObject(names, batches, this[RowIndex]);
}
};

// prototype for row proxy objects
const proto = RowObject.prototype;

for (let i = 0; i < names.length; ++i) {
// skip duplicated column names
if (Object.hasOwn(proto, names[i])) continue;

// add a getter method for the current batch
const batch = batches[i];
Object.defineProperty(proto, names[i], {
get() { return batch.at(this[RowIndex]); },
enumerable: true
});
}

return index => new RowObject(index);
}

/**
* Returns a row object factory. The resulting method takes a
* batch-level row index as input and returns an object whose property
* values have been extracted from the batches.
* @param {string[]} names The column (property) names
* @param {import('./batch.js').Batch[]} batches The value batches.
* @returns {(index: number) => Record<string, any>}
*/
export function objectFactory(names, batches) {
return index => structObject(names, batches, index);
}

/**
* Return a vanilla object representing a struct (row object) type.
* @param {string[]} names The column (property) names
* @param {import('./batch.js').Batch[]} batches The value batches.
* @param {number} index The record batch row index.
* @returns {Record<string, any>}
*/
export function structObject(names, batches, index) {
const obj = {};
for (let i = 0; i < names.length; ++i) {
obj[names[i]] = batches[i].at(index);
}
return obj;
}
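To make the two factories' contract concrete, here is a standalone re-creation of their behavior using stub batches that expose only the `at(index)` method the factories rely on (the stubs and helper names are illustrative, not the library's `Batch` class):

```javascript
// Stub batches: anything with an at(index) method suffices.
const names = ['a', 'b'];
const batches = [
  { at: i => [1, 2][i] },         // column 'a'
  { at: i => ['foo', 'baz'][i] }  // column 'b'
];

// Object-style rows: property values copied out eagerly.
const makeObject = index => {
  const obj = {};
  names.forEach((n, i) => { obj[n] = batches[i].at(index); });
  return obj;
};

// Proxy-style rows: only an index is stored; prototype getters
// read through to the batches on each property access.
const RowIndex = Symbol('rowIndex');
class Row {
  constructor(index) { this[RowIndex] = index; }
  toJSON() { return makeObject(this[RowIndex]); }
}
names.forEach((n, i) => Object.defineProperty(Row.prototype, n, {
  get() { return batches[i].at(this[RowIndex]); },
  enumerable: true
}));

// Property access agrees between the two row styles...
const proxy = new Row(1);
const plain = makeObject(1);
console.log(proxy.a === plain.a, proxy.b === plain.b); // true true
// ...and toJSON() recovers a vanilla object from a proxy row.
console.log(JSON.stringify(proxy.toJSON()) === JSON.stringify(plain)); // true
```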
12 changes: 9 additions & 3 deletions src/table-from-ipc.js
@@ -42,6 +42,7 @@ import {
} from './constants.js';
import { parseIPC } from './parse-ipc.js';
import { Table } from './table.js';
import { objectFactory, proxyFactory } from './struct.js';
import { keyFor } from './util.js';

/**
@@ -105,7 +106,11 @@ export function createTable(data, options = {}) {
fields.forEach((f, i) => cols[i].add(visit(f.type, ctx)));
}

return new Table(schema, cols.map(c => c.done()));
return new Table(
schema,
cols.map(c => c.done()),
options.useProxy ? proxyFactory : objectFactory
);
}

/**
@@ -145,7 +150,7 @@ function contextGenerator(options, version, dictionaryMap)
*/
function visit(type, ctx) {
const { typeId, bitWidth, precision, scale, stride, unit } = type;
const { useBigInt, useDate, useMap } = ctx.options;
const { useBigInt, useDate, useMap, useProxy } = ctx.options;

// no field node, no buffers
if (typeId === Type.Null) {
@@ -246,7 +251,8 @@
// validity and children
case Type.FixedSizeList: return kids(FixedListBatch, { stride });
case Type.Struct: return kids(StructBatch, {
names: type.children.map(child => child.name)
names: type.children.map(child => child.name),
factory: useProxy ? proxyFactory : objectFactory
});

// children only
56 changes: 37 additions & 19 deletions src/table.js
@@ -1,3 +1,4 @@
import { objectFactory } from './struct.js';
import { bisect } from './util.js';

/**
@@ -12,17 +13,39 @@ export class Table {
* Create a new table with the given schema and columns (children).
* @param {import('./types.js').Schema} schema The table schema.
* @param {import('./column.js').Column[]} children The table columns.
* @param {import('./types.js').StructFactory} [factory]
* Row object factory creation method. By default, vanilla JS objects
* are used, with property values extracted from Arrow data.
*/
constructor(schema, children) {
constructor(schema, children, factory = objectFactory) {
const names = schema.fields.map(f => f.name);

/** @readonly */
this.schema = schema;
/** @readonly */
this.names = schema.fields.map(f => f.name);
this.names = names;
/**
* @type {import('./column.js').Column[]}
* @readonly
*/
this.children = children;
/**
* @type {import('./types.js').StructFactory}
* @readonly
*/
this.factory = factory;

// lazily created row object generators
const gen = [];

/**
* Returns a row object generator for the given batch index.
* @private
* @readonly
* @param {number} b The batch index.
* @returns {(index: number) => Record<string,any>}
*/
this.getFactory = b => gen[b] ?? (gen[b] = factory(names, children.map(c => c.data[b])));
}

/**
@@ -75,14 +98,15 @@
* @returns {Table} A new table with columns at the specified indices.
*/
selectAt(indices, as = []) {
const { children, schema } = this;
const { children, factory, schema } = this;
const { fields } = schema;
return new Table(
{
...schema,
fields: indices.map((i, j) => renameField(fields[i], as[j]))
},
indices.map(i => children[i])
indices.map(i => children[i]),
factory
);
}

@@ -117,12 +141,13 @@
* @returns {Record<string, any>[]}
*/
toArray() {
const { children, numRows, names } = this;
const { children, getFactory, numRows } = this;
const data = children[0]?.data ?? [];
const output = Array(numRows);
for (let b = 0, row = -1; b < data.length; ++b) {
const f = getFactory(b);
for (let i = 0; i < data[b].length; ++i) {
output[++row] = rowObject(names, children, b, i);
output[++row] = f(i);
}
}
return output;
@@ -133,11 +158,12 @@
* @returns {Generator<Record<string, any>, any, null>}
*/
*[Symbol.iterator]() {
const { children, names } = this;
const { children, getFactory } = this;
const data = children[0]?.data ?? [];
for (let b = 0; b < data.length; ++b) {
const f = getFactory(b);
for (let i = 0; i < data[b].length; ++i) {
yield rowObject(names, children, b, i);
yield f(i);
}
}
}
@@ -148,11 +174,11 @@
* @returns {Record<string, any>} The row object.
*/
at(index) {
const { names, children, numRows } = this;
const { children, getFactory, numRows } = this;
if (index < 0 || index >= numRows) return null;
const [{ offsets }] = children;
const i = bisect(offsets, index) - 1;
return rowObject(names, children, i, index - offsets[i]);
const b = bisect(offsets, index) - 1;
return getFactory(b)(index - offsets[b]);
}

/**
@@ -171,11 +197,3 @@
? { ...field, name }
: field;
}

function rowObject(names, children, batch, index) {
const o = {};
for (let j = 0; j < names.length; ++j) {
o[names[j]] = children[j].data[batch].at(index);
}
return o;
}
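The `getFactory` helper introduced above caches one row generator per batch, so the factory (which, in the proxy case, builds a prototype full of getters) runs at most once per batch however many rows are read. A standalone sketch of that memoization pattern, with a call counter standing in for the real factory:

```javascript
// Lazy per-batch memoization: gen[b] is created on first use, then reused.
function makeGetFactory(factory) {
  const gen = [];  // cache of per-batch row generators
  return b => gen[b] ?? (gen[b] = factory(b));
}

let calls = 0;
const getFactory = makeGetFactory(b => {
  calls += 1;  // count how often the underlying factory actually runs
  return i => `batch ${b}, row ${i}`;
});

getFactory(0)(0);
getFactory(0)(1);  // cache hit: no new factory call for batch 0
getFactory(1)(0);
console.log(calls); // 2
```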
12 changes: 12 additions & 0 deletions src/types.ts
@@ -1,3 +1,4 @@
import { Batch } from './batch.js';
import {
Version,
Endianness,
@@ -93,6 +94,9 @@ export interface ValueArray<T> extends ArrayLike<T>, Iterable<T> {
slice(start?: number, end?: number): ValueArray<T>;
}

/** Struct/row object factory method. */
export type StructFactory = (names: string[], batches: Batch<any>[]) => (index: number) => Record<string, any>;

/** Custom metadata. */
export type Metadata = Map<string, string>;

@@ -303,4 +307,12 @@ export interface ExtractionOptions {
* both `Map` and `Object.fromEntries` (default).
*/
useMap?: boolean;
/**
* If true, extract Arrow 'Struct' values and table row objects using
* zero-copy proxy objects that extract data from underlying Arrow batches.
* The proxy objects can improve performance and reduce memory usage, but
* do not support property enumeration (`Object.keys`, `Object.values`,
* `Object.entries`) or spreading (`{ ...object }`).
*/
useProxy?: boolean;
}
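The enumeration caveat documented here follows from ordinary JavaScript semantics rather than anything Flechette-specific: `Object.keys` and object spread only consider *own* enumerable properties, while the proxy's column getters live on a shared prototype. A minimal demonstration (the property names are illustrative):

```javascript
// A getter defined on the prototype is reachable by property access but is
// not an "own" property, so own-property enumeration and spread skip it.
class Row {
  constructor(index) { this.index = index; }
}
Object.defineProperty(Row.prototype, 'delay', {
  get() { return [14, 22][this.index]; },
  enumerable: true
});

const row = new Row(0);
console.log(row.delay);        // 14 — resolved via the prototype getter
console.log(Object.keys(row)); // ['index'] — the getter is not included
console.log({ ...row }.delay); // undefined — spread copies own props only
```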
17 changes: 17 additions & 0 deletions test/table-from-ipc-test.js
@@ -2,6 +2,7 @@ import assert from 'node:assert';
import { readFile } from 'node:fs/promises';
import { arrowFromDuckDB, arrowQuery } from './util/arrow-from-duckdb.js';
import { tableFromIPC } from '../src/index.js';
import { RowIndex } from '../src/struct.js';

const toDate = v => new Date(v);
const toBigInt = v => BigInt(v);
@@ -255,6 +256,22 @@ describe('tableFromIPC', () => {
await valueTest([ {a: ['a', 'b'], b: Math.E}, {a: ['c', 'd'], b: Math.PI} ]);
});

it('decodes struct data with useProxy', async () => {
const data = [
[ {a: 1, b: 'foo'}, {a: 2, b: 'baz'} ],
[ {a: 1, b: 'foo'}, null, {a: 2, b: 'baz'} ],
[ {a: null, b: 'foo'}, {a: 2, b: null} ],
[ {a: ['a', 'b'], b: Math.E}, {a: ['c', 'd'], b: Math.PI} ]
];
for (const values of data) {
const bytes = await arrowFromDuckDB(values);
const column = tableFromIPC(bytes, { useProxy: true }).getChild('value');
const proxies = column.toArray();
assert.strictEqual(proxies.every(p => p === null || p[RowIndex] >= 0), true);
assert.deepStrictEqual(proxies.map(p => p ? p.toJSON() : null), values);
}
});

it('decodes run-end-encoded data', async () => {
const buf = await readFile(`test/data/runendencoded.arrows`);
const table = tableFromIPC(new Uint8Array(buf));