Redesign dictionary id use, add columnFromValues. #16

Merged
merged 1 commit into from
Sep 12, 2024
6 changes: 3 additions & 3 deletions docs/api/data-types.md
@@ -85,14 +85,14 @@ Create a new field instance for use in a schema or type definition. A field repr
### Dictionary

<hr/><a id="dictionary" href="#dictionary">#</a>
<b>dictionary</b>(<i>type</i>[, <i>indexType</i>, <i>id</i>, <i>ordered</i>])
<b>dictionary</b>(<i>type</i>[, <i>indexType</i>, <i>ordered</i>, <i>id</i>])

Create a Dictionary data type instance. A dictionary type consists of a dictionary of values (which may be of any type) and corresponding integer indices that reference those values. If values are repeated, a dictionary encoding can provide substantial space savings. In the IPC format, dictionary indices reside alongside other columns in a record batch, while dictionary values are written to special dictionary batches, linked by a unique dictionary *id*. Internally Flechette extracts dictionary values upfront; while this incurs some initial overhead, it enables fast subsequent lookups.
Create a Dictionary data type instance. A dictionary type consists of a dictionary of values (which may be of any type) and corresponding integer indices that reference those values. If values are repeated, a dictionary encoding can provide substantial space savings. In the IPC format, dictionary indices reside alongside other columns in a record batch, while dictionary values are written to special dictionary batches, linked by a unique dictionary *id* assigned at encoding time. Internally Flechette extracts dictionary values immediately upon decoding; while this incurs some initial overhead, it enables fast subsequent lookups.

* *type* (`DataType`): The data type of dictionary values.
* *indexType* (`DataType`): The data type of dictionary indices. Must be an integer type (default [`int32`](#int32)).
* *id* (`number`): The dictionary id, should be unique in a table. Defaults to `-1`, but is set to a proper id if the type is passed through [`tableFromArrays`](/flechette/api/#tableFromArrays).
* *ordered* (`boolean`): Indicates if dictionary values are ordered (default `false`).
* *id* (`number`): Optional dictionary id. The default value (-1) indicates that the dictionary applies to a single column only. Provide an explicit id in order to reuse a dictionary across columns when building, in which case different dictionaries *must* have different unique ids. All dictionary ids are later resolved (possibly to new values) upon IPC encoding.
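
The space savings described above come from storing each distinct value once and referring to it by integer index. The following is a minimal, self-contained sketch of that idea; `dictionaryEncode` and `dictionaryDecode` are hypothetical helpers written for illustration and are not part of the Flechette API, and null handling is deliberately omitted.

```javascript
// Hypothetical helper (not Flechette code): dictionary-encode an array.
function dictionaryEncode(values) {
  const lookup = new Map();  // value -> index into the dictionary
  const dictionary = [];     // unique values, in first-seen order
  const indices = new Int32Array(values.length); // int32 mirrors the default indexType
  values.forEach((value, i) => {
    let index = lookup.get(value);
    if (index === undefined) {
      index = dictionary.length;
      lookup.set(value, index);
      dictionary.push(value);
    }
    indices[i] = index;
  });
  return { dictionary, indices };
}

// Hypothetical helper: recover the original values from the indices.
function dictionaryDecode({ dictionary, indices }) {
  return Array.from(indices, index => dictionary[index]);
}

const encoded = dictionaryEncode(['a', 'b', 'a', 'c', 'a']);
console.log(encoded.dictionary);          // three unique values: a, b, c
console.log(Array.from(encoded.indices)); // indices: 0, 1, 0, 2, 0
```

Five input values collapse to three dictionary entries plus five small integers; the saving grows with the amount of repetition and the size of each value.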

### Null

2 changes: 1 addition & 1 deletion docs/api/index.md
@@ -137,7 +137,7 @@ const col = columnFromArray(
<hr/><a id="tableFromColumns" href="#tableFromColumns">#</a>
<b>tableFromColumns</b>(<i>columns</i>[, <i>useProxy</i>])

Create a new table from a collection of columns. This method is useful for creating new tables using one or more pre-existing column instances. Otherwise, [`tableFromArrays`](#tableFromArrays) should be preferred. Input columns are assumed to have the same record batch sizes and non-conflicting dictionary ids.
Create a new table from a collection of columns. This method is useful for creating new tables using one or more pre-existing column instances. Otherwise, [`tableFromArrays`](#tableFromArrays) should be preferred. Input columns are assumed to have the same record batch sizes.

* *columns* (`object | array`): The input columns as an object with name keys, or an array of [name, column] pairs.
* *useProxy* (`boolean`): Flag indicating if row proxy objects should be used to represent table rows (default `false`). Typically this should match the value of the `useProxy` extraction option used for column generation.

38 changes: 11 additions & 27 deletions src/build/builder.js
@@ -6,7 +6,7 @@ import { toBigInt, toDateDay, toFloat16, toTimestamp } from '../util/numbers.js'
import { BinaryBuilder } from './builders/binary.js';
import { BoolBuilder } from './builders/bool.js';
import { DecimalBuilder } from './builders/decimal.js';
import { DictionaryBuilder, dictionaryValues } from './builders/dictionary.js';
import { DictionaryBuilder, dictionaryContext } from './builders/dictionary.js';
import { FixedSizeBinaryBuilder } from './builders/fixed-size-binary.js';
import { FixedSizeListBuilder } from './builders/fixed-size-list.js';
import { IntervalDayTimeBuilder, IntervalMonthDayNanoBuilder } from './builders/interval.js';
@@ -19,36 +19,20 @@ import { Utf8Builder } from './builders/utf8.js';
import { DirectBuilder, Int64Builder, TransformBuilder } from './builders/values.js';

/**
* Create a new context object for shared builder state.
* Create a context object for shared builder state.
* @param {import('../types.js').ExtractionOptions} [options]
* Batch extraction options.
* @param {Map<number, ReturnType<dictionaryValues>>} [dictMap]
* A map of dictionary ids to value builder helpers.
* @param {ReturnType<dictionaryContext>} [dictionaries]
* Context object for tracking dictionaries.
*/
export function builderContext(options, dictMap = new Map) {
let dictId = 0;
export function builderContext(
options = {},
dictionaries = dictionaryContext()
) {
return {
batchType(type) {
return batchType(type, options);
},
dictionary(type, id) {
let dict;
if (id != null) {
dict = dictMap.get(id);
} else {
while (dictMap.has(dictId + 1)) ++dictId;
id = dictId;
}
if (!dict) {
dictMap.set(id, dict = dictionaryValues(id, type, this));
}
return dict;
},
finish() {
for (const dict of dictMap.values()) {
dict.finish(options);
}
}
batchType: type => batchType(type, options),
dictionary(type) { return dictionaries.get(type, this); },
finish: () => dictionaries.finish(options)
};
}

49 changes: 43 additions & 6 deletions src/build/builders/dictionary.js
@@ -6,21 +6,58 @@ import { builder } from '../builder.js';
import { ValidityBuilder } from './validity.js';

/**
* Builder helped for creating dictionary values.
* @param {number} id The dictionary id.
* Create a context object for managing dictionary builders.
*/
export function dictionaryContext() {
const idMap = new Map;
const dicts = new Set;
return {
/**
* Get a dictionary values builder for the given dictionary type.
* @param {import('../../types.js').DictionaryType} type
* The dictionary type.
* @param {*} ctx The builder context.
* @returns {ReturnType<dictionaryValues>}
*/
get(type, ctx) {
// if a dictionary has a non-negative id, assume it was set
// intentionally and track it for potential reuse across columns
// otherwise the dictionary is used for a single column only
const id = type.id;
if (id >= 0 && idMap.has(id)) {
return idMap.get(id);
} else {
const dict = dictionaryValues(type, ctx);
if (id >= 0) idMap.set(id, dict);
dicts.add(dict);
return dict;
}
},
/**
* Finish building dictionary values columns and assign them to
* their corresponding dictionary batches.
* @param {import('../../types.js').ExtractionOptions} options
*/
finish(options) {
dicts.forEach(dict => dict.finish(options));
}
};
}

/**
* Builder helper for creating dictionary values.
* @param {import('../../types.js').DictionaryType} type
* The dictionary data type.
* @param {*} ctx
* @returns
* @param {ReturnType<import('../builder.js').builderContext>} ctx
* The builder context.
*/
export function dictionaryValues(id, type, ctx) {
export function dictionaryValues(type, ctx) {
const keys = Object.create(null);
const values = builder(type.dictionary, ctx);
const batches = [];

values.init();
let index = -1;
type.id = id;

return {
type,
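
The id handling in `dictionaryContext` above can be tried in isolation. The sketch below mirrors only the lookup rule from the diff (a non-negative id is shared, a negative id always yields a fresh dictionary). `makeDictionary`, `sketchDictionaryContext`, and `count` are hypothetical stand-ins introduced for illustration; the real helper is `dictionaryValues`, which builds Arrow batches rather than the plain key map used here.

```javascript
// Hypothetical stand-in for the real dictionaryValues helper: it only
// tracks the distinct values of one dictionary.
function makeDictionary(type) {
  const keys = new Map();
  return {
    type,
    // return the index for a value, adding the value on first sight
    key(value) {
      if (!keys.has(value)) keys.set(value, keys.size);
      return keys.get(value);
    },
    size: () => keys.size
  };
}

// Simplified sketch of the context logic shown in the diff.
function sketchDictionaryContext() {
  const idMap = new Map(); // explicit (non-negative) id -> shared dictionary
  const dicts = new Set(); // every dictionary created so far
  return {
    get(type) {
      const id = type.id;
      if (id >= 0 && idMap.has(id)) return idMap.get(id);
      const dict = makeDictionary(type);
      if (id >= 0) idMap.set(id, dict);
      dicts.add(dict);
      return dict;
    },
    count: () => dicts.size // hypothetical accessor, for inspection only
  };
}

const context = sketchDictionaryContext();
const shared1 = context.get({ id: 1 });   // explicit id: tracked for reuse
const shared2 = context.get({ id: 1 });   // same id: same dictionary
const privateA = context.get({ id: -1 }); // default id: single column only
const privateB = context.get({ id: -1 }); // default id again: a new dictionary

console.log(shared1 === shared2);   // true
console.log(privateA === privateB); // false
console.log(context.count());       // 3
```

Under this rule two columns share dictionary values only when the caller opts in with an explicit id, which matches the updated documentation for the *id* parameter.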

88 changes: 8 additions & 80 deletions src/build/column-from-array.js
@@ -1,10 +1,8 @@
import { float32Array, float64Array, int16Array, int32Array, int64Array, int8Array, isInt64ArrayType, isTypedArray, uint16Array, uint32Array, uint64Array, uint8Array } from '../util/arrays.js';
import { DirectBatch, Int64Batch, NullBatch } from '../batch.js';
import { DirectBatch, Int64Batch } from '../batch.js';
import { Column } from '../column.js';
import { float32, float64, int16, int32, int64, int8, uint16, uint32, uint64, uint8 } from '../data-types.js';
import { inferType } from './infer-type.js';
import { builder, builderContext } from './builder.js';
import { Type } from '../constants.js';
import { columnFromValues } from './column-from-values.js';

/**
* Create a new column from a provided data array.
@@ -14,25 +12,20 @@ import { Type } from '../constants.js';
* If not specified, type inference is attempted.
* @param {import('../types.js').ColumnBuilderOptions} [options]
* Builder options for the generated column.
* @param {ReturnType<import('./builder.js').builderContext>} [ctx]
* @param {ReturnType<import('./builders/dictionary.js').dictionaryContext>} [dicts]
* Builder context object, for internal use only.
* @returns {Column<T>} The generated column.
*/
export function columnFromArray(data, type, options = {}, ctx) {
if (!type) {
if (isTypedArray(data)) {
return columnFromTypedArray(data, options);
} else {
type = inferType(data);
}
}
return columnFromValues(data, type, options, ctx);
export function columnFromArray(data, type, options = {}, dicts) {
return !type && isTypedArray(data)
? columnFromTypedArray(data, options)
: columnFromValues(data.length, v => data.forEach(v), type, options, dicts);
}

/**
* Create a new column from a typed array input.
* @template T
* @param {import('../types.js').TypedArray} values
* @param {import('../types.js').TypedArray} values The input data.
* @param {import('../types.js').ColumnBuilderOptions} options
* Builder options for the generated column.
* @returns {Column<T>} The generated column.
@@ -62,52 +55,6 @@ function columnFromTypedArray(values, { maxBatchRows, useBigInt }) {
return new Column(batches);
}

/**
* Build a column by iterating over the provided values array.
* @template T
* @param {Array | import('../types.js').TypedArray} values The input data.
* @param {import('../types.js').DataType} type The column data type.
* @param {import('../types.js').ColumnBuilderOptions} [options]
* Builder options for the generated column.
* @param {ReturnType<import('./builder.js').builderContext>} [ctx]
* Builder context object, for internal use only.
* @returns {Column<T>} The generated column.
*/
function columnFromValues(values, type, options, ctx) {
const { maxBatchRows, ...opt } = options;
const length = values.length;
const limit = Math.min(maxBatchRows || Infinity, length);

// if null type, generate batches and exit early
if (type.typeId === Type.Null) {
return new Column(nullBatches(type, length, limit));
}

const data = [];
ctx ??= builderContext(opt);
const b = builder(type, ctx).init();
const next = b => data.push(b.batch());
const numBatches = Math.floor(length / limit);

let idx = 0;
let row = 0;
for (let i = 0; i < numBatches; ++i) {
for (row = 0; row < limit; ++row) {
b.set(values[idx++], row);
}
next(b);
}
for (row = 0; idx < length; ++idx) {
b.set(values[idx], row++);
}
if (row) next(b);

// resolve dictionaries
ctx.finish();

return new Column(data);
}

/**
* Return an Arrow data type for a given typed array type.
* @param {import('../types.js').TypedArrayConstructor} arrayType
@@ -128,22 +75,3 @@ function typeForTypedArray(arrayType) {
case uint64Array: return uint64();
}
}

/**
* Create null batches with the given batch size limit.
* @param {import('../types.js').NullType} type The null data type.
* @param {number} length The total column length.
* @param {number} limit The maximum batch size.
* @returns {import('../batch.js').NullBatch[]} The null batches.
*/
function nullBatches(type, length, limit) {
const data = [];
const batch = length => new NullBatch({ length, nullCount: length, type });
const numBatches = Math.floor(length / limit);
for (let i = 0; i < numBatches; ++i) {
data.push(batch(limit));
}
const rem = length % limit;
if (rem) data.push(batch(rem));
return data;
}
69 changes: 69 additions & 0 deletions src/build/column-from-values.js
@@ -0,0 +1,69 @@
import { NullBatch } from '../batch.js';
import { Column } from '../column.js';
import { inferType } from './infer-type.js';
import { builder, builderContext } from './builder.js';
import { Type } from '../constants.js';

/**
* Create a new column by iterating over provided values.
* @template T
* @param {number} length The input data length.
* @param {(visitor: (value: any) => void) => void} visit
* A function that applies a callback to successive data values.
* @param {import('../types.js').DataType} type The data type.
* @param {import('../types.js').ColumnBuilderOptions} [options]
* Builder options for the generated column.
* @param {ReturnType<
* import('./builders/dictionary.js').dictionaryContext
* >} [dicts] Dictionary context object, for internal use only.
* @returns {Column<T>} The generated column.
*/
export function columnFromValues(length, visit, type, options, dicts) {
type ??= inferType(visit);
const { maxBatchRows, ...opt } = options;
const limit = Math.min(maxBatchRows || Infinity, length);

// if null type, generate batches and exit early
if (type.typeId === Type.Null) {
return new Column(nullBatches(type, length, limit));
}

const ctx = builderContext(opt, dicts);
const b = builder(type, ctx).init();
const data = [];
const next = b => data.push(b.batch());

let row = 0;
visit(value => {
b.set(value, row++);
if (row >= limit) {
next(b);
row = 0;
}
});
if (row) next(b);

// resolve dictionaries
ctx.finish();

return new Column(data);
}

/**
* Create null batches with the given batch size limit.
* @param {import('../types.js').NullType} type The null data type.
* @param {number} length The total column length.
* @param {number} limit The maximum batch size.
* @returns {import('../batch.js').NullBatch[]} The null batches.
*/
function nullBatches(type, length, limit) {
const data = [];
const batch = length => new NullBatch({ length, nullCount: length, type });
const numBatches = Math.floor(length / limit);
for (let i = 0; i < numBatches; ++i) {
data.push(batch(limit));
}
const rem = length % limit;
if (rem) data.push(batch(rem));
return data;
}
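
The visitor-driven batching loop in `columnFromValues` can be illustrated without any Arrow machinery. The sketch below assumes a hypothetical `batchValues` function that collects plain arrays instead of calling a builder's `set` and `batch` methods; only the `length`/`visit`/`maxBatchRows` handling follows the code above.

```javascript
// Hypothetical stand-in for columnFromValues: it groups values into plain
// arrays rather than Arrow batches.
function batchValues(length, visit, maxBatchRows) {
  // same limit rule as the diff: no limit means one batch of `length` rows
  const limit = Math.min(maxBatchRows || Infinity, length);
  const batches = [];
  let current = [];
  visit(value => {
    current.push(value);
    if (current.length >= limit) { // batch is full: emit it and start over
      batches.push(current);
      current = [];
    }
  });
  if (current.length) batches.push(current); // emit the final partial batch
  return batches;
}

// Arrays supply values through forEach, as columnFromArray does.
const data = [1, 2, 3, 4, 5];
const fromArray = batchValues(data.length, v => data.forEach(x => v(x)), 2);
console.log(fromArray); // three batches: [1, 2], [3, 4], [5]

// Any other source works too, as long as it can report a length up front.
const set = new Set(['x', 'y', 'z']);
const fromSet = batchValues(set.size, v => set.forEach(x => v(x)));
console.log(fromSet); // one batch: ['x', 'y', 'z']
```

Taking a `visit` function rather than an indexable array is what lets the same loop serve arrays, sets, or other iterable sources.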
9 changes: 4 additions & 5 deletions src/build/infer-type.js
@@ -3,14 +3,13 @@ import { isArray } from '../util/arrays.js';

/**
* Infer the data type for a given input array.
* @param {import('../types.js').ValueArray} data The data array.
* @param {(visitor: (value: any) => void) => void} visit
* A function that applies a callback to successive data values.
* @returns {import('../types.js').DataType} The data type.
*/
export function inferType(data) {
export function inferType(visit) {
const profile = profiler();
for (let i = 0; i < data.length; ++i) {
profile.add(data[i]);
}
visit(value => profile.add(value));
return profile.type();
}
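
To show how the visitor form of `inferType` is driven, here is a rough sketch with a deliberately tiny profiler. `sketchProfiler` and `sketchInferType` are hypothetical names, and the returned strings are placeholders: the real `profiler()` is not shown in this diff and is assumed to distinguish many more cases and to return Arrow data types.

```javascript
// Hypothetical, much-simplified profiler (not the real implementation).
function sketchProfiler() {
  const counts = { number: 0, string: 0, boolean: 0, other: 0 };
  let total = 0;
  return {
    add(value) {
      if (value == null) return; // nulls do not influence the inferred type
      ++total;
      const kind = typeof value;
      if (kind in counts) ++counts[kind]; else ++counts.other;
    },
    type() {
      if (total === 0) return 'null'; // no non-null values were seen
      for (const kind of ['number', 'string', 'boolean']) {
        if (counts[kind] === total) return kind;
      }
      throw new Error('Mixed or unsupported value types.');
    }
  };
}

// Same shape as the updated inferType: the caller supplies the traversal.
function sketchInferType(visit) {
  const profile = sketchProfiler();
  visit(value => profile.add(value));
  return profile.type();
}

const nums = [1, null, 3];
console.log(sketchInferType(v => nums.forEach(x => v(x)))); // 'number'
```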

6 changes: 3 additions & 3 deletions src/build/table-from-arrays.js
@@ -1,4 +1,4 @@
import { builderContext } from './builder.js';
import { dictionaryContext } from './builders/dictionary.js';
import { columnFromArray } from './column-from-array.js';
import { tableFromColumns } from './table-from-columns.js';

@@ -13,11 +13,11 @@ import { tableFromColumns } from './table-from-columns.js';
*/
export function tableFromArrays(data, options = {}) {
const { types = {}, ...opt } = options;
const ctx = builderContext();
const dicts = dictionaryContext();
const entries = Array.isArray(data) ? data : Object.entries(data);
const columns = entries.map(([name, array]) =>
/** @type {[string, import('../column.js').Column]} */ (
[ name, columnFromArray(array, types[name], opt, ctx)]
[ name, columnFromArray(array, types[name], opt, dicts)]
));
return tableFromColumns(columns, options.useProxy);
}