# table.properties
##################
# Example values #
##################
## The following table properties relate to the definition of data inside a table.
# A unique name identifying this table.
sleeper.table.name=example-table
## The following table properties relate to partition splitting.
# Splits file which will be used to initialise the partitions for this table. Defaults to nothing and
# the table will be created with a single root partition.
sleeper.table.splits.file=example/full/splits.txt
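# As an illustration only (assuming a string row key), a splits file typically lists one split point
# per line; e.g. a file containing the lines "aaa" and "mmm" would initialise three leaf partitions.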
## The following table properties relate to the storage of data inside a table.
# Maximum number of bytes to write in a Parquet row group (defaults to the value set in the instance
# properties). This property is NOT used by the DataFusion data engine.
sleeper.table.parquet.rowgroup.size=8388608
# The size of the page in the Parquet files - defaults to the value in the instance properties.
sleeper.table.parquet.page.size=131072
# The compression codec to use for this table. Defaults to the value in the instance properties.
# Valid values are: [uncompressed, snappy, gzip, lzo, brotli, lz4, zstd]
sleeper.table.parquet.compression.codec=zstd
# A file will not be deleted until this number of minutes have passed after it has been marked as
# ready for garbage collection. The reason for not deleting files immediately after they have been
# marked as ready for garbage collection is that they may still be in use by queries. Defaults to the
# value set in the instance properties.
sleeper.table.gc.delay.minutes=15
## The following table properties relate to storing and retrieving metadata for tables.
# The name of the class used for the state store. The default is DynamoDBTransactionLogStateStore.
# Options are:
# DynamoDBTransactionLogStateStore
# DynamoDBTransactionLogStateStoreNoSnapshots
sleeper.table.statestore.classname=DynamoDBTransactionLogStateStore
####################
# Other properties #
####################
## The following table properties relate to the definition of data inside a table.
# A boolean flag representing whether this table is online or offline.
# An offline table will not have any partition splitting or compaction jobs run automatically.
# Note that taking a table offline will not stop any partitions that are being split or compaction
# jobs that are running. Additionally, you are still able to ingest data to offline tables and perform
# queries against them.
# (default value shown below, uncomment to set a value)
# sleeper.table.online=true
# Select which data engine to use for the table. Valid values are: [java, datafusion,
# datafusion_experimental]
# The options "datafusion" and "datafusion_experimental" currently have identical behaviour, as the
# DataFusion data engine no longer has any experimental components. We may remove the
# "datafusion_experimental" option in a future release, which will cause instances with that set to
# fail after an upgrade. Please use the "datafusion" option instead.
# (default value shown below, uncomment to set a value)
# sleeper.table.data.engine=DATAFUSION
# Fully qualified class of a custom iterator to apply to this table. Defaults to nothing. This will be
# applied both during queries and during compaction, and will apply the results to the underlying
# table data persistently. This forces use of the Java data engine for compaction. This is not
# recommended, as the Java implementation is much slower and much more expensive. Consider using the
# aggregation and filtering properties instead.
# (uncomment to set a value)
# sleeper.table.iterator.class.name=
# A configuration string to be passed to the iterator specified in
# `sleeper.table.iterator.class.name`. This will be read by the custom iterator object.
# (uncomment to set a value)
# sleeper.table.iterator.config=
# Sets how rows are filtered out and deleted from the table. This is applied every time the data is
# read, e.g. during compactions or queries. Defaults to retaining all rows.
# Currently this can only be `ageOff(field,age)`, to age off old data. The first parameter is the name
# of the timestamp field to check against, which must be of type long, in milliseconds since the
# epoch. The second parameter is the maximum age in milliseconds, e.g. 1209600000 for 2 weeks.
# (uncomment to set a value)
# sleeper.table.filters=
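# Illustrative example only, with a hypothetical long field "lastUpdateTime" and a two-week age:
# sleeper.table.filters=ageOff(lastUpdateTime,1209600000)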
# Sets how to combine rows that have the same values for all row and sort keys. This is applied every
# time the data is read, e.g. during compactions or queries. Defaults to leaving them as separate
# rows.
# This must be in the format `op(field),op(field)`. This must define an operation for every value
# field, passing the field name as the parameter. All value fields must be of a numeric or map type.
# The available operations are as follows:
# sum: adds the values together for equal rows
# max: takes the maximum value out of all equal rows
# min: takes the minimum value out of all equal rows
# map_sum, map_max, map_min: applies the given operation to every sub-field of a map
# (uncomment to set a value)
# sleeper.table.aggregations=
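# Illustrative example only, assuming hypothetical value fields: "count" of a numeric type and
# "countsByType" of a map type:
# sleeper.table.aggregations=sum(count),map_sum(countsByType)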
## The following table properties relate to partition splitting.
# Flag to set if you have base64 encoded the split points (only used for string key types and defaults
# to false).
# (default value shown below, uncomment to set a value)
# sleeper.table.splits.base64.encoded=false
# Partitions in this table with more than the following number of rows will be split.
# (default value shown below, uncomment to set a value)
# sleeper.table.partition.splitting.threshold=1000000000
# When expanding the partition tree explicitly, this many rows are required in the input data to be
# able to split a partition. This will be used when pre-splitting partitions.
# For example, during bulk import when there are too few leaf partitions, the partition tree will be
# extended based on the data in the bulk import job. The bulk import job must contain at least this
# much data per new split point.
# (default value shown below, uncomment to set a value)
# sleeper.table.partition.splitting.min.rows=1000
# When expanding the partition tree explicitly, this is the minimum percentage of the expected number
# of rows to split a partition assuming an even distribution of rows.
# For example, during bulk import when there are too few leaf partitions, the partition tree will be
# extended based on the data in the bulk import job. For each current leaf partition, we make a sketch
# of the data from the job that's in that partition. We divide the number of rows in the job's input
# data by the current number of leaf partitions, to get the expected rows per partition. If this
# property is set to 10, then any partition with less than 10% of the expected rows per partition will
# be ignored when extending the partition tree.
# (default value shown below, uncomment to set a value)
# sleeper.table.partition.splitting.min.distribution.percent=10
# If true, partition splits will be applied via asynchronous requests sent to the state store
# committer lambda. If false, the partition splitting lambda will apply splits synchronously.
# This is only applied if async commits are enabled for the table. The default value is set in an
# instance property.
# (default value shown below, uncomment to set a value)
# sleeper.table.partition.splitting.commit.async=true
## The following table properties relate to the storage of data inside a table.
# Whether dictionary encoding should be used for row key columns in the Parquet files.
# (default value shown below, uncomment to set a value)
# sleeper.table.parquet.dictionary.encoding.rowkey.fields=false
# Whether dictionary encoding should be used for sort key columns in the Parquet files.
# (default value shown below, uncomment to set a value)
# sleeper.table.parquet.dictionary.encoding.sortkey.fields=false
# Whether dictionary encoding should be used for value columns in the Parquet files.
# (default value shown below, uncomment to set a value)
# sleeper.table.parquet.dictionary.encoding.value.fields=false
# Used to set parquet.columnindex.truncate.length, see documentation here:
# https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md
# The length in bytes to truncate binary values in a column index.
# (default value shown below, uncomment to set a value)
# sleeper.table.parquet.columnindex.truncate.length=128
# Used to set parquet.statistics.truncate.length, see documentation here:
# https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md
# The length in bytes to truncate the min/max binary values in row groups.
# (default value shown below, uncomment to set a value)
# sleeper.table.parquet.statistics.truncate.length=2147483647
# Enables a cache of data when reading from S3 with the DataFusion data engine, to hold data in larger
# blocks than are requested by DataFusion.
# (default value shown below, uncomment to set a value)
# sleeper.table.datafusion.s3.readahead.enabled=true
# Used to set parquet.writer.version, see documentation here:
# https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md
# Can be either v1 or v2. The v2 pages store levels uncompressed while v1 pages compress levels with
# the data.
# (default value shown below, uncomment to set a value)
# sleeper.table.parquet.writer.version=v2
# Used during Sleeper queries to determine whether the column/offset indexes (also known as page
# indexes) are read from Parquet files. For some queries, e.g. single/few row lookups, this can improve
# performance by enabling more aggressive pruning. On range queries, especially on large tables this
# can harm performance, since readers will read the extra index data before returning results, but
# with little benefit from pruning.
# (default value shown below, uncomment to set a value)
# sleeper.table.query.parquet.column.index.enabled=false
# Maximum number of rows to write in a Parquet row group.
# (default value shown below, uncomment to set a value)
# sleeper.table.parquet.rowgroup.rows.max=100000
# The S3 readahead range - defaults to the row group size.
# (default value shown below, uncomment to set a value)
# sleeper.table.fs.s3a.readahead.range=8388608
# If true, deletion of files will be applied via asynchronous requests sent to the state store
# committer lambda. If false, the garbage collector lambda will apply deletions synchronously.
# This is only applied if async commits are enabled for the table. The default value is set in an
# instance property.
# (default value shown below, uncomment to set a value)
# sleeper.table.gc.commit.async=true
# This property is used when applying an instance configuration and a table has been removed.
# If this is true (default), removing the table from the configuration will just take the table
# offline.
# If this is false, it will delete all data associated with the table when the table is removed.
# Be aware that if a table is renamed in the configuration, the CDK will see it as a delete of the old
# table name and a create of the new table name. If this is set to false when that happens it will
# remove the table's data.
# This property isn't currently in use but will be in https://github.com/gchq/sleeper/issues/5870.
# (default value shown below, uncomment to set a value)
# sleeper.table.retain.after.removal=true
# This property is used when applying an instance configuration and a table has been added.
# By default, or if this property is false, when a table is added to an instance configuration it's
# created in the instance. If it already exists the update will fail.
# If this property is true, the existing table will be reused and imported as part of the instance
# configuration. If it doesn't exist the update will fail.
# (default value shown below, uncomment to set a value)
# sleeper.table.reuse.existing=false
## The following table properties relate to compactions.
# The name of the class that defines how compaction jobs should be created.
# This should implement sleeper.compaction.strategy.CompactionStrategy. Defaults to the strategy used
# by the whole instance (set in the instance properties).
# (default value shown below, uncomment to set a value)
# sleeper.table.compaction.strategy.class=sleeper.compaction.core.job.creation.strategy.impl.SizeRatioCompactionStrategy
# The maximum number of files to read in a compaction job. Note that the state store must support
# atomic updates for this many files.
# Also note that this many files may need to be open simultaneously. The value of
# 'sleeper.fs.s3a.max-connections' must be at least the value of this plus one. The extra one is for
# the output file.
# (default value shown below, uncomment to set a value)
# sleeper.table.compaction.files.batch.size=12
# The maximum number of compaction jobs that can be running at once. If this limit is exceeded when
# creating new jobs, the selection of jobs is randomised.
# (default value shown below, uncomment to set a value)
# sleeper.table.compaction.job.creation.limit=100000
# The number of compaction jobs to send in a single batch.
# When compaction jobs are created, there is no limit on how many jobs can be created at once. A batch
# is a group of compaction jobs that will have their creation updates applied at the same time. For
# each batch, we send all compaction jobs to the SQS queue, then update the state store to assign job
# IDs to the input files.
# (default value shown below, uncomment to set a value)
# sleeper.table.compaction.job.send.batch.size=1000
# The amount of time in seconds a batch of compaction jobs may be pending before it should not be
# retried. If the input files have not been successfully assigned to the jobs, and this much time has
# passed, then the batch will fail to send.
# Once a pending batch fails the input files will never be compacted again without other intervention,
# so it's important to ensure file assignment will be done within this time. That depends on the
# throughput of state store commits.
# It's also necessary to ensure file assignment will be done before the next invocation of compaction
# job creation, otherwise invalid jobs will be created for the same input files. The rate of these
# invocations is set in `sleeper.compaction.job.creation.period.minutes`.
# (default value shown below, uncomment to set a value)
# sleeper.table.compaction.job.send.timeout.seconds=90
# The amount of time in seconds to wait between attempts to send a batch of compaction jobs. The batch
# will be sent if all input files have been successfully assigned to the jobs, otherwise the batch
# will be retried after a delay.
# (default value shown below, uncomment to set a value)
# sleeper.table.compaction.job.send.retry.delay.seconds=30
# If true, compaction job ID assignment commit requests will be sent to the state store committer
# lambda to be performed asynchronously. If false, compaction job ID assignments will be committed
# synchronously by the compaction job creation lambda.
# This is only applied if async commits are enabled for the table. The default value is set in an
# instance property.
# (default value shown below, uncomment to set a value)
# sleeper.table.compaction.job.id.assignment.commit.async=true
# If true, compaction job commit requests will be sent to the state store committer lambda to be
# performed asynchronously. If false, compaction jobs will be committed synchronously by compaction
# tasks.
# This is only applied if async commits are enabled for the table. The default value is set in an
# instance property.
# (default value shown below, uncomment to set a value)
# sleeper.table.compaction.job.commit.async=true
# This property affects whether commits of compaction jobs are batched before being sent to the state
# store commit queue to be applied by the committer lambda. If this property is true and asynchronous
# commits are enabled then commits of compactions will be batched. If this property is false and
# asynchronous commits are enabled then commits of compactions will not be batched and will be sent
# directly to the committer lambda.
# (default value shown below, uncomment to set a value)
# sleeper.table.compaction.job.async.commit.batching=true
# Used by the SizeRatioCompactionStrategy to decide if a group of files should be compacted.
# If the file sizes are s_1, ..., s_n then the files are compacted if s_1 + ... + s_{n-1} >= ratio *
# s_n.
# (default value shown below, uncomment to set a value)
# sleeper.table.compaction.strategy.sizeratio.ratio=3
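# As a worked example (assuming the sizes are ordered so that s_n is the largest): with the default
# ratio of 3, four 1 GB files would be compacted (1 + 1 + 1 >= 3 * 1), but files of 1 GB, 1 GB and
# 10 GB would not (1 + 1 < 3 * 10).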
# Used by the SizeRatioCompactionStrategy to control the maximum number of jobs that can be running
# concurrently per partition.
# (default value shown below, uncomment to set a value)
# sleeper.table.compaction.strategy.sizeratio.max.concurrent.jobs.per.partition=2147483647
## The following table properties relate to storing and retrieving metadata for tables.
# Overrides whether or not to apply state store updates asynchronously via the state store committer.
# Usually this is decided based on the state store implementation used by the Sleeper table, but other
# default behaviour can be set for the Sleeper instance.
# This is separate from the properties that determine which state store updates will be done as
# asynchronous commits. Those properties will only be applied when asynchronous commits are enabled.
# (default value shown below, uncomment to set a value)
# sleeper.table.statestore.commit.async.enabled=true
# When using the transaction log state store, this sets whether to update from the transaction log
# before adding a transaction in the asynchronous state store committer.
# If asynchronous commits are used for all or almost all state store updates, this can be false to
# avoid the extra queries.
# If the state store is commonly updated directly outside of the asynchronous committer, this can be
# true to avoid conflicts and retries.
# (default value shown below, uncomment to set a value)
# sleeper.table.statestore.committer.update.every.commit=false
# When using the transaction log state store, this sets whether to update from the transaction log
# before adding a batch of transactions in the asynchronous state store committer.
# (default value shown below, uncomment to set a value)
# sleeper.table.statestore.committer.update.every.batch=true
# The number of attempts to make when applying a transaction to the state store.
# (default value shown below, uncomment to set a value)
# sleeper.table.statestore.transactionlog.add.transaction.max.attempts=10
# The maximum amount of time to wait before the first retry when applying a transaction to the state
# store. Full jitter will be applied so that the actual wait time will be a random period between 0
# and this value. This ceiling will increase exponentially on further retries. See the below article.
# https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
# (default value shown below, uncomment to set a value)
# sleeper.table.statestore.transactionlog.add.transaction.first.retry.wait.ceiling.ms=200
# The maximum amount of time to wait before any retry when applying a transaction to the state store.
# Full jitter will be applied so that the actual wait time will be a random period between 0 and this
# value. This restricts the exponential increase of the wait ceiling while retrying the transaction.
# See the below article.
# https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
# (default value shown below, uncomment to set a value)
# sleeper.table.statestore.transactionlog.add.transaction.max.retry.wait.ceiling.ms=30000
# The number of elements to include per Arrow row batch in a snapshot derived from the transaction
# log, of the state of files in a Sleeper table. Each file includes some number of references on
# different partitions. Each reference will count for one element in a row batch, but a file cannot
# currently be split between row batches. A row batch may contain more file references than this if a
# single file overflows the batch. A file with no references counts as one element.
# (default value shown below, uncomment to set a value)
# sleeper.table.statestore.transactionlog.files.snapshot.batch.size=1000
# The number of partitions to include per Arrow row batch in a snapshot derived from the transaction
# log, of the state of partitions in a Sleeper table.
# (default value shown below, uncomment to set a value)
# sleeper.table.statestore.transactionlog.partitions.snapshot.batch.size=1000
# The number of seconds to wait after we've loaded a snapshot before looking for a new snapshot. This
# should relate to the rate at which new snapshots are created, configured in the instance property
# `sleeper.statestore.transactionlog.snapshot.creation.lambda.period.seconds`.
# (default value shown below, uncomment to set a value)
# sleeper.table.statestore.transactionlog.time.between.snapshot.checks.seconds=60
# The number of milliseconds to wait after we've updated from the transaction log before checking for
# new transactions. The state visible to an instance of the state store can be out of date by this
# amount. This can avoid excessive queries by the same process, but can result in unwanted behaviour
# when using multiple state store objects. When adding a new transaction to update the state, this
# will be ignored and the state will be brought completely up to date.
# (default value shown below, uncomment to set a value)
# sleeper.table.statestore.transactionlog.time.between.transaction.checks.ms=0
# The minimum number of transactions that a snapshot must be ahead of the local state, before we load
# the snapshot instead of updating from the transaction log.
# (default value shown below, uncomment to set a value)
# sleeper.table.statestore.transactionlog.snapshot.load.min.transactions.ahead=10
# The number of days that transaction log snapshots remain in the snapshot store before being deleted.
# (default value shown below, uncomment to set a value)
# sleeper.table.statestore.transactionlog.snapshot.expiry.days=2
# The minimum age in minutes of a snapshot in order to allow deletion of transactions leading up to
# it. When deleting old transactions, there's a chance that processes may still read transactions
# starting from an older snapshot. We need to avoid deletion of any transactions associated with a
# snapshot that may still be used as the starting point for reading the log.
# (default value shown below, uncomment to set a value)
# sleeper.table.statestore.transactionlog.delete.behind.snapshot.min.age.minutes=2
# The minimum number of transactions that a transaction must be behind the latest snapshot before
# being deleted. This is the number of transactions that will be kept and protected from deletion,
# whenever old transactions are deleted. This includes the transaction that the latest snapshot was
# created against. Any transactions after the snapshot will never be deleted as they are still in
# active use.
# This should be configured in relation to the property which determines whether a process will load
# the latest snapshot or instead seek through the transaction log, since we need to preserve
# transactions that may still be read:
# sleeper.table.statestore.transactionlog.snapshot.load.min.transactions.ahead
# The snapshot that will be considered the latest snapshot is configured by a property to set the
# minimum age for it to count for this:
# sleeper.table.statestore.transactionlog.delete.behind.snapshot.min.age.minutes
# (default value shown below, uncomment to set a value)
# sleeper.table.statestore.transactionlog.delete.number.behind.latest.snapshot=200
## The following table properties relate to ingest.
# Specifies the strategy that ingest uses to create files and references in partitions.
# Valid values are: [one_file_per_leaf, one_reference_per_leaf]
# (default value shown below, uncomment to set a value)
# sleeper.table.ingest.file.writing.strategy=one_reference_per_leaf
# The way in which rows are held in memory before they are written to a local store.
# Valid values are 'arraylist' and 'arrow'.
# The arraylist method is simpler, but it is slower and requires careful tuning of the number of rows
# in each batch.
# (default value shown below, uncomment to set a value)
# sleeper.table.ingest.row.batch.type=arrow
# The way in which partition files are written to the main Sleeper store.
# Valid values are 'direct' (which writes using the s3a Hadoop file system) and 'async' (which writes
# locally and then copies the completed Parquet file asynchronously into S3).
# The direct method is simpler but the async method should provide better performance when the number
# of partitions is large.
# (default value shown below, uncomment to set a value)
# sleeper.table.ingest.partition.file.writer.type=async
# If true, ingest tasks will add files via requests sent to the state store committer lambda
# asynchronously. If false, ingest tasks will commit new files synchronously.
# This is only applied if async commits are enabled for the table. The default value is set in an
# instance property.
# (default value shown below, uncomment to set a value)
# sleeper.table.ingest.job.files.commit.async=true
## The following table properties relate to bulk import, i.e. ingesting data using Spark jobs running
## on EMR or EKS.
# (Non-persistent EMR mode only) Which architecture to use for EC2 instance types in the EMR
# cluster. Must be either "x86_64", "arm64" or "x86_64,arm64". For more information, see the Bulk
# import using EMR - Instance types section in docs/usage/bulk-import.md
# (default value shown below, uncomment to set a value)
# sleeper.table.bulk.import.emr.instance.architecture=arm64
# (Non-persistent EMR mode only) The EC2 x86_64 instance types and weights to be used for the master
# node of the EMR cluster.
# For more information, see the Bulk import using EMR - Instance types section in
# docs/usage/bulk-import.md
# (default value shown below, uncomment to set a value)
# sleeper.table.bulk.import.emr.master.x86.instance.types=m7i.xlarge
# (Non-persistent EMR mode only) The EC2 x86_64 instance types and weights to be used for the executor
# nodes of the EMR cluster.
# For more information, see the Bulk import using EMR - Instance types section in
# docs/usage/bulk-import.md
# (default value shown below, uncomment to set a value)
# sleeper.table.bulk.import.emr.executor.x86.instance.types=m7i.4xlarge
# (Non-persistent EMR mode only) The EC2 ARM64 instance types and weights to be used for the master
# node of the EMR cluster.
# For more information, see the Bulk import using EMR - Instance types section in
# docs/usage/bulk-import.md
# (default value shown below, uncomment to set a value)
# sleeper.table.bulk.import.emr.master.arm.instance.types=m7g.xlarge
# (Non-persistent EMR mode only) The EC2 ARM64 instance types and weights to be used for the executor
# nodes of the EMR cluster.
# For more information, see the Bulk import using EMR - Instance types section in
# docs/usage/bulk-import.md
# (default value shown below, uncomment to set a value)
# sleeper.table.bulk.import.emr.executor.arm.instance.types=m7g.4xlarge
# (Non-persistent EMR mode only) The purchasing option to be used for the executor nodes of the EMR
# cluster.
# Valid values are ON_DEMAND or SPOT.
# (default value shown below, uncomment to set a value)
# sleeper.table.bulk.import.emr.executor.market.type=SPOT
# (Non-persistent EMR mode only) The initial number of capacity units to provision as EC2 instances
# for executors in the EMR cluster.
# This is measured in instance fleet capacity units. These are declared alongside the requested
# instance types, as each type will count for a certain number of units. By default the units are the
# number of instances.
# This value overrides the default value in the instance properties. It can be overridden by a value
# in the bulk import job specification.
# (default value shown below, uncomment to set a value)
# sleeper.table.bulk.import.emr.executor.initial.capacity=2
# (Non-persistent EMR mode only) The maximum number of capacity units to provision as EC2 instances
# for executors in the EMR cluster.
# This is measured in instance fleet capacity units. These are declared alongside the requested
# instance types, as each type will count for a certain number of units. By default the units are the
# number of instances.
# This value overrides the default value in the instance properties. It can be overridden by a value
# in the bulk import job specification.
# (default value shown below, uncomment to set a value)
# sleeper.table.bulk.import.emr.executor.max.capacity=10
# (Non-persistent EMR mode only) The EMR release label to be used when creating an EMR cluster for
# bulk importing data using Spark running on EMR.
# This value overrides the default value in the instance properties. It can be overridden by a value
# in the bulk import job specification.
# (default value shown below, uncomment to set a value)
# sleeper.table.bulk.import.emr.release.label=emr-7.12.0
# Specifies the minimum number of leaf partitions that are needed to run a bulk import job. If this
# minimum has not been reached, bulk import jobs will refuse to start.
# (default value shown below, uncomment to set a value)
# sleeper.table.bulk.import.min.leaf.partitions=256
# Specifies the number of times bulk import tries to create leaf partitions to meet the minimum number
# of leaf partitions. This will be retried if another process splits the same partitions at the same
# time.
# (default value shown below, uncomment to set a value)
# sleeper.table.bulk.import.partition.splitting.attempts=3
# If true, bulk import will add files via requests sent to the state store committer lambda
# asynchronously. If false, bulk import will commit new files at the end of the job synchronously.
# This is only applied if async commits are enabled for the table. The default value is set in an
# instance property.
# (default value shown below, uncomment to set a value)
# sleeper.table.bulk.import.job.files.commit.async=true
## The following table properties relate to the ingest batcher.
# Specifies the minimum total file size required for an ingest job to be batched and sent. An ingest
# job will be created if the batcher runs while this much data is waiting, and the minimum number of
# files is also met.
# (default value shown below, uncomment to set a value)
# sleeper.table.ingest.batcher.job.min.size=1G
# Specifies the maximum total file size for a job in the ingest batcher. If more data is waiting than
# this, it will be split into multiple jobs. If a single file exceeds this, it will still be ingested
# in its own job. It's also possible some data may be left for a future run of the batcher if some
# recent files overflow the size of a job but aren't enough to create a job on their own.
# (default value shown below, uncomment to set a value)
# sleeper.table.ingest.batcher.job.max.size=5G
# Specifies the minimum number of files for a job in the ingest batcher. An ingest job will be created
# if the batcher runs while this many files are waiting, and the minimum size of files is also met.
# (default value shown below, uncomment to set a value)
# sleeper.table.ingest.batcher.job.min.files=1
# Specifies the maximum number of files for a job in the ingest batcher. If more files are waiting
# than this, they will be split into multiple jobs. It's possible some data may be left for a future
# run of the batcher if some recent files overflow the size of a job but aren't enough to create a job
# on their own.
# (default value shown below, uncomment to set a value)
# sleeper.table.ingest.batcher.job.max.files=100
# Specifies the maximum time in seconds that a file can be held in the batcher before it will be
# included in an ingest job. When any file has been waiting for longer than this, a job will be
# created with all the currently held files, even if other criteria for a batch are not met.
# (default value shown below, uncomment to set a value)
# sleeper.table.ingest.batcher.file.max.age.seconds=300
# Specifies the target ingest queue where batched jobs are sent.
# Valid values are: [standard_ingest, bulk_import_emr, bulk_import_persistent_emr, bulk_import_eks,
# bulk_import_emr_serverless]
# (default value shown below, uncomment to set a value)
# sleeper.table.ingest.batcher.ingest.queue=bulk_import_emr_serverless
# The time in minutes that the tracking information is retained for a file before the records of its
# ingest are deleted (e.g. which ingest job it was assigned to, the time this occurred, the size of the
# file).
# The expiry time is fixed when a file is saved to the store, so changing this will only affect new
# data.
# Defaults to 1 week.
# (default value shown below, uncomment to set a value)
# sleeper.table.ingest.batcher.file.tracking.ttl.minutes=10080
## The following table properties relate to query execution.
# The amount of time in seconds the query executor's cache of partition and file reference information
# is valid for. After this it will time out and need refreshing.
# If this is set too low, then queries will be slower. This is due to the state needing to be updated
# from the state store.
# If this is set too high, then queries may not have access to all the latest data.
# Future work will remove or reduce this trade-off.
# If you know the table is inactive, then set this to a higher value.
# (default value shown below, uncomment to set a value)
# sleeper.table.query.processor.cache.timeout.seconds=60