feat: Redesign dict id use, add columnFromValues.
jheer committed Sep 12, 2024
1 parent b59b6f6 commit 2350b68
Showing 18 changed files with 335 additions and 237 deletions.
6 changes: 3 additions & 3 deletions docs/api/data-types.md
@@ -85,14 +85,14 @@ Create a new field instance for use in a schema or type definition. A field repr
### Dictionary

<hr/><a id="dictionary" href="#dictionary">#</a>
<b>dictionary</b>(<i>type</i>[, <i>indexType</i>, <i>id</i>, <i>ordered</i>])
<b>dictionary</b>(<i>type</i>[, <i>indexType</i>, <i>ordered</i>, <i>id</i>])

Create a Dictionary data type instance. A dictionary type consists of a dictionary of values (which may be of any type) and corresponding integer indices that reference those values. If values are repeated, a dictionary encoding can provide substantial space savings. In the IPC format, dictionary indices reside alongside other columns in a record batch, while dictionary values are written to special dictionary batches, linked by a unique dictionary *id*. Internally Flechette extracts dictionary values upfront; while this incurs some initial overhead, it enables fast subsequent lookups.
Create a Dictionary data type instance. A dictionary type consists of a dictionary of values (which may be of any type) and corresponding integer indices that reference those values. If values are repeated, a dictionary encoding can provide substantial space savings. In the IPC format, dictionary indices reside alongside other columns in a record batch, while dictionary values are written to special dictionary batches, linked by a unique dictionary *id* assigned at encoding time. Internally Flechette extracts dictionary values immediately upon decoding; while this incurs some initial overhead, it enables fast subsequent lookups.

* *type* (`DataType`): The data type of dictionary values.
* *indexType* (`DataType`): The data type of dictionary indices. Must be an integer type (default [`int32`](#int32)).
* *id* (`number`): The dictionary id, should be unique in a table. Defaults to `-1`, but is set to a proper id if the type is passed through [`tableFromArrays`](/flechette/api/#tableFromArrays).
* *ordered* (`boolean`): Indicates if dictionary values are ordered (default `false`).
* *id* (`number`): Optional dictionary id. The default value (-1) indicates that the dictionary applies to a single column only. Provide an explicit id in order to reuse a dictionary across columns when building, in which case different dictionaries *must* have different unique ids. All dictionary ids are later resolved (possibly to new values) upon IPC encoding.
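
As an illustrative sketch of the updated signature and id-based dictionary reuse (assuming the package is imported as `@uwdata/flechette` and that `utf8` is among its exported type constructors):

```js
import { dictionary, int32, tableFromArrays, utf8 } from '@uwdata/flechette';

// one dictionary type with an explicit id, shared across two columns
const type = dictionary(utf8(), int32(), false, 1);
const table = tableFromArrays(
  { a: ['foo', 'bar', 'foo'], b: ['bar', 'baz', 'baz'] },
  { types: { a: type, b: type } }
);
```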

### Null

2 changes: 1 addition & 1 deletion docs/api/index.md
@@ -137,7 +137,7 @@ const col = columnFromArray(
<hr/><a id="tableFromColumns" href="#tableFromColumns">#</a>
<b>tableFromColumns</b>(<i>columns</i>[, <i>useProxy</i>])

Create a new table from a collection of columns. This method is useful for creating new tables using one or more pre-existing column instances. Otherwise, [`tableFromArrays`](#tableFromArrays) should be preferred. Input columns are assumed to have the same record batch sizes and non-conflicting dictionary ids.
Create a new table from a collection of columns. This method is useful for creating new tables using one or more pre-existing column instances. Otherwise, [`tableFromArrays`](#tableFromArrays) should be preferred. Input columns are assumed to have the same record batch sizes.

* *data* (`object | array`): The input columns as an object with name keys, or an array of [name, column] pairs.
* *useProxy* (`boolean`): Flag indicating if row proxy objects should be used to represent table rows (default `false`). Typically this should match the value of the `useProxy` extraction option used for column generation.
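
For example, a minimal sketch that assembles pre-built columns into a table (again assuming the package is imported as `@uwdata/flechette`):

```js
import { columnFromArray, tableFromColumns } from '@uwdata/flechette';

// object keys become column names; each short column spans a single record batch
const table = tableFromColumns({
  id: columnFromArray([1, 2, 3]),
  label: columnFromArray(['a', 'b', 'c'])
});
```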
38 changes: 11 additions & 27 deletions src/build/builder.js
@@ -6,7 +6,7 @@ import { toBigInt, toDateDay, toFloat16, toTimestamp } from '../util/numbers.js'
import { BinaryBuilder } from './builders/binary.js';
import { BoolBuilder } from './builders/bool.js';
import { DecimalBuilder } from './builders/decimal.js';
import { DictionaryBuilder, dictionaryValues } from './builders/dictionary.js';
import { DictionaryBuilder, dictionaryContext } from './builders/dictionary.js';
import { FixedSizeBinaryBuilder } from './builders/fixed-size-binary.js';
import { FixedSizeListBuilder } from './builders/fixed-size-list.js';
import { IntervalDayTimeBuilder, IntervalMonthDayNanoBuilder } from './builders/interval.js';
@@ -19,36 +19,20 @@ import { Utf8Builder } from './builders/utf8.js';
import { DirectBuilder, Int64Builder, TransformBuilder } from './builders/values.js';

/**
* Create a new context object for shared builder state.
* Create a context object for shared builder state.
* @param {import('../types.js').ExtractionOptions} [options]
* Batch extraction options.
* @param {Map<number, ReturnType<dictionaryValues>>} [dictMap]
* A map of dictionary ids to value builder helpers.
* @param {ReturnType<dictionaryContext>} [dictionaries]
* Context object for tracking dictionaries.
*/
export function builderContext(options, dictMap = new Map) {
let dictId = 0;
export function builderContext(
options = {},
dictionaries = dictionaryContext()
) {
return {
batchType(type) {
return batchType(type, options);
},
dictionary(type, id) {
let dict;
if (id != null) {
dict = dictMap.get(id);
} else {
while (dictMap.has(dictId + 1)) ++dictId;
id = dictId;
}
if (!dict) {
dictMap.set(id, dict = dictionaryValues(id, type, this));
}
return dict;
},
finish() {
for (const dict of dictMap.values()) {
dict.finish(options);
}
}
batchType: type => batchType(type, options),
dictionary(type) { return dictionaries.get(type, this); },
finish: () => dictionaries.finish(options)
};
}
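
A hypothetical internal usage sketch of the new shared-state flow (paths relative to `src/build/`; assumes `dictionary`, `int32`, and `utf8` constructors are exported by `../data-types.js`):

```js
import { builderContext } from './builder.js';
import { dictionaryContext } from './builders/dictionary.js';
import { dictionary, int32, utf8 } from '../data-types.js';

// a shared dictionary context can back multiple builder contexts and columns
const dicts = dictionaryContext();
const ctx = builderContext({}, dicts);

// dictionary types carrying the same non-negative id resolve to a single
// values builder; ctx.finish() later materializes the dictionary columns
const type = dictionary(utf8(), int32(), false, 1);
const d1 = ctx.dictionary(type);
const d2 = ctx.dictionary(type); // same helper instance as d1
```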

49 changes: 43 additions & 6 deletions src/build/builders/dictionary.js
@@ -6,21 +6,58 @@ import { builder } from '../builder.js';
import { ValidityBuilder } from './validity.js';

/**
* Builder helped for creating dictionary values.
* @param {number} id The dictionary id.
* Create a context object for managing dictionary builders.
*/
export function dictionaryContext() {
const idMap = new Map;
const dicts = new Set;
return {
/**
* Get a dictionary values builder for the given dictionary type.
* @param {import('../../types.js').DictionaryType} type
* The dictionary type.
* @param {*} ctx The builder context.
* @returns {ReturnType<dictionaryValues>}
*/
get(type, ctx) {
// if a dictionary has a non-negative id, assume it was set
// intentionally and track it for potential reuse across columns
// otherwise the dictionary is used for a single column only
const id = type.id;
if (id >= 0 && idMap.has(id)) {
return idMap.get(id);
} else {
const dict = dictionaryValues(type, ctx);
if (id >= 0) idMap.set(id, dict);
dicts.add(dict);
return dict;
}
},
/**
* Finish building dictionary values columns and assign them to
* their corresponding dictionary batches.
* @param {import('../../types.js').ExtractionOptions} options
*/
finish(options) {
dicts.forEach(dict => dict.finish(options));
}
};
}

/**
* Builder helper for creating dictionary values.
* @param {import('../../types.js').DictionaryType} type
* The dictionary data type.
* @param {*} ctx
* @returns
* @param {ReturnType<import('../builder.js').builderContext>} ctx
* The builder context.
*/
export function dictionaryValues(id, type, ctx) {
export function dictionaryValues(type, ctx) {
const keys = Object.create(null);
const values = builder(type.dictionary, ctx);
const batches = [];

values.init();
let index = -1;
type.id = id;

return {
type,
88 changes: 8 additions & 80 deletions src/build/column-from-array.js
@@ -1,10 +1,8 @@
import { float32Array, float64Array, int16Array, int32Array, int64Array, int8Array, isInt64ArrayType, isTypedArray, uint16Array, uint32Array, uint64Array, uint8Array } from '../util/arrays.js';
import { DirectBatch, Int64Batch, NullBatch } from '../batch.js';
import { DirectBatch, Int64Batch } from '../batch.js';
import { Column } from '../column.js';
import { float32, float64, int16, int32, int64, int8, uint16, uint32, uint64, uint8 } from '../data-types.js';
import { inferType } from './infer-type.js';
import { builder, builderContext } from './builder.js';
import { Type } from '../constants.js';
import { columnFromValues } from './column-from-values.js';

/**
* Create a new column from a provided data array.
@@ -14,25 +12,20 @@ import { Type } from '../constants.js';
* If not specified, type inference is attempted.
* @param {import('../types.js').ColumnBuilderOptions} [options]
* Builder options for the generated column.
* @param {ReturnType<import('./builder.js').builderContext>} [ctx]
* @param {ReturnType<import('./builders/dictionary.js').dictionaryContext>} [dicts]
* Builder context object, for internal use only.
* @returns {Column<T>} The generated column.
*/
export function columnFromArray(data, type, options = {}, ctx) {
if (!type) {
if (isTypedArray(data)) {
return columnFromTypedArray(data, options);
} else {
type = inferType(data);
}
}
return columnFromValues(data, type, options, ctx);
export function columnFromArray(data, type, options = {}, dicts) {
return !type && isTypedArray(data)
? columnFromTypedArray(data, options)
: columnFromValues(data.length, v => data.forEach(v), type, options, dicts);
}

/**
* Create a new column from a typed array input.
* @template T
* @param {import('../types.js').TypedArray} values
* @param {import('../types.js').TypedArray} values The input data.
* @param {import('../types.js').ColumnBuilderOptions} options
* Builder options for the generated column.
* @returns {Column<T>} The generated column.
@@ -62,52 +55,6 @@ function columnFromTypedArray(values, { maxBatchRows, useBigInt }) {
return new Column(batches);
}

/**
* Build a column by iterating over the provided values array.
* @template T
* @param {Array | import('../types.js').TypedArray} values The input data.
* @param {import('../types.js').DataType} type The column data type.
* @param {import('../types.js').ColumnBuilderOptions} [options]
* Builder options for the generated column.
* @param {ReturnType<import('./builder.js').builderContext>} [ctx]
* Builder context object, for internal use only.
* @returns {Column<T>} The generated column.
*/
function columnFromValues(values, type, options, ctx) {
const { maxBatchRows, ...opt } = options;
const length = values.length;
const limit = Math.min(maxBatchRows || Infinity, length);

// if null type, generate batches and exit early
if (type.typeId === Type.Null) {
return new Column(nullBatches(type, length, limit));
}

const data = [];
ctx ??= builderContext(opt);
const b = builder(type, ctx).init();
const next = b => data.push(b.batch());
const numBatches = Math.floor(length / limit);

let idx = 0;
let row = 0;
for (let i = 0; i < numBatches; ++i) {
for (row = 0; row < limit; ++row) {
b.set(values[idx++], row);
}
next(b);
}
for (row = 0; idx < length; ++idx) {
b.set(values[idx], row++);
}
if (row) next(b);

// resolve dictionaries
ctx.finish();

return new Column(data);
}

/**
* Return an Arrow data type for a given typed array type.
* @param {import('../types.js').TypedArrayConstructor} arrayType
Expand All @@ -128,22 +75,3 @@ function typeForTypedArray(arrayType) {
case uint64Array: return uint64();
}
}

/**
* Create null batches with the given batch size limit.
* @param {import('../types.js').NullType} type The null data type.
* @param {number} length The total column length.
* @param {number} limit The maximum batch size.
* @returns {import('../batch.js').NullBatch[]} The null batches.
*/
function nullBatches(type, length, limit) {
const data = [];
const batch = length => new NullBatch({ length, nullCount: length, type });
const numBatches = Math.floor(length / limit);
for (let i = 0; i < numBatches; ++i) {
data.push(batch(limit));
}
const rem = length % limit;
if (rem) data.push(batch(rem));
return data;
}
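
For the typed-array fast path that remains in this module, an illustrative sketch (assuming the package is imported as `@uwdata/flechette`):

```js
import { columnFromArray } from '@uwdata/flechette';

// typed-array input skips per-value iteration; the float64 type is
// derived from the Float64Array constructor
const col = columnFromArray(new Float64Array([1.1, 2.2, 3.3]));
```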
69 changes: 69 additions & 0 deletions src/build/column-from-values.js
@@ -0,0 +1,69 @@
import { NullBatch } from '../batch.js';
import { Column } from '../column.js';
import { inferType } from './infer-type.js';
import { builder, builderContext } from './builder.js';
import { Type } from '../constants.js';

/**
* Create a new column by iterating over provided values.
* @template T
* @param {number} length The input data length.
* @param {(visitor: (value: any) => void) => void} visit
* A function that applies a callback to successive data values.
* @param {import('../types.js').DataType} type The data type.
* @param {import('../types.js').ColumnBuilderOptions} [options]
* Builder options for the generated column.
* @param {ReturnType<
* import('./builders/dictionary.js').dictionaryContext
* >} [dicts] Builder context object, for internal use only.
* @returns {Column<T>} The generated column.
*/
export function columnFromValues(length, visit, type, options, dicts) {
type ??= inferType(visit);
const { maxBatchRows, ...opt } = options;
const limit = Math.min(maxBatchRows || Infinity, length);

// if null type, generate batches and exit early
if (type.typeId === Type.Null) {
return new Column(nullBatches(type, length, limit));
}

const ctx = builderContext(opt, dicts);
const b = builder(type, ctx).init();
const data = [];
const next = b => data.push(b.batch());

let row = 0;
visit(value => {
b.set(value, row++);
if (row >= limit) {
next(b);
row = 0;
}
});
if (row) next(b);

// resolve dictionaries
ctx.finish();

return new Column(data);
}

/**
* Create null batches with the given batch size limit.
* @param {import('../types.js').NullType} type The null data type.
* @param {number} length The total column length.
* @param {number} limit The maximum batch size.
* @returns {import('../batch.js').NullBatch[]} The null batches.
*/
function nullBatches(type, length, limit) {
const data = [];
const batch = length => new NullBatch({ length, nullCount: length, type });
const numBatches = Math.floor(length / limit);
for (let i = 0; i < numBatches; ++i) {
data.push(batch(limit));
}
const rem = length % limit;
if (rem) data.push(batch(rem));
return data;
}
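
A brief usage sketch of the new entry point (assuming `columnFromValues` is, or becomes, re-exported from the package root, which this diff does not show):

```js
import { columnFromValues } from '@uwdata/flechette';

// build a column from any visitable collection without first copying it to an array
const tags = new Set(['a', 'b', 'a', 'c']);
const col = columnFromValues(tags.size, visit => tags.forEach(visit));
// when no type is given, it is inferred from the visited values
```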
9 changes: 4 additions & 5 deletions src/build/infer-type.js
@@ -3,14 +3,13 @@ import { isArray } from '../util/arrays.js';

/**
* Infer the data type for a given input array.
* @param {import('../types.js').ValueArray} data The data array.
* @param {(visitor: (value: any) => void) => void} visit
* A function that applies a callback to successive data values.
* @returns {import('../types.js').DataType} The data type.
*/
export function inferType(data) {
export function inferType(visit) {
const profile = profiler();
for (let i = 0; i < data.length; ++i) {
profile.add(data[i]);
}
visit(value => profile.add(value));
return profile.type();
}
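
For instance, a sketch of the revised calling convention (internal module path shown for illustration only):

```js
import { inferType } from './infer-type.js';

// callers now pass a visitor function rather than a raw array
const data = [1, 2, null, 4];
const type = inferType(visit => data.forEach(visit));
```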

6 changes: 3 additions & 3 deletions src/build/table-from-arrays.js
@@ -1,4 +1,4 @@
import { builderContext } from './builder.js';
import { dictionaryContext } from './builders/dictionary.js';
import { columnFromArray } from './column-from-array.js';
import { tableFromColumns } from './table-from-columns.js';

@@ -13,11 +13,11 @@ import { tableFromColumns } from './table-from-columns.js';
*/
export function tableFromArrays(data, options = {}) {
const { types = {}, ...opt } = options;
const ctx = builderContext();
const dicts = dictionaryContext();
const entries = Array.isArray(data) ? data : Object.entries(data);
const columns = entries.map(([name, array]) =>
/** @type {[string, import('../column.js').Column]} */ (
[ name, columnFromArray(array, types[name], opt, ctx)]
[ name, columnFromArray(array, types[name], opt, dicts)]
));
return tableFromColumns(columns, options.useProxy);
}