# Compacting Operations

This document describes the internal process of compacting operations: what happens behind the scenes when running the `compact` CLI command.

By default, each change in the source database becomes a new operation in the ever-growing bucket history. To prevent this history from growing indefinitely, we "compact" the buckets.

## Previous Workaround

A workaround is to deploy a sync rules change, which re-creates all buckets from scratch, containing only the latest version of each row. This reduces the size of buckets for new clients performing a sync from scratch, but requires existing clients to completely re-sync the buckets.

# Compacting Processes

Compacting can be split into three distinct processes:

1. Convert operations into MOVE operations.
2. Convert operations into CLEAR operations.
3. Defragment buckets.

## 1. MOVE operations

Any operation on a row may be converted into a MOVE operation if there is another PUT or REMOVE operation later in the bucket for the same row.

Two rows are considered the same if the combination of `(object_type, object_id, subkey)` is the same.

A MOVE operation may contain internal metadata of `{target_op: op_id}`. This indicates that the operation was "moved" to the target, and no checkpoint before that op_id will be valid. A previous protocol revision included this in the operation data, and let clients invalidate the checkpoint. Now, this is used purely server-side, and the server omits the `CheckpointComplete` message if the current checkpoint has been invalidated by such an operation. The same applies to CLEAR operations below.

When converting an operation to a MOVE operation, the bucket, op_id and checksum remain the same. The data, object_type, object_id and subkey fields must be cleared, reducing the size of the operation.
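
To make the shape of this conversion concrete, here is a minimal sketch using a simplified, hypothetical `OplogEntry` type (the actual storage format is not specified in this document); the same type is reused by the sketches further down.

```ts
// Hypothetical, simplified operation shape used for the sketches in this document.
interface OplogEntry {
  op_id: string;
  op: 'PUT' | 'REMOVE' | 'MOVE' | 'CLEAR';
  object_type?: string;
  object_id?: string;
  subkey?: string;
  data?: string | null;
  checksum: number;
}

// Key identifying "the same row" within a bucket.
function rowKey(entry: OplogEntry): string {
  return JSON.stringify([entry.object_type, entry.object_id, entry.subkey]);
}

// Convert an operation into a MOVE operation: op_id and checksum are preserved,
// while data, object_type, object_id and subkey are cleared to reduce its size.
function toMoveOp(entry: OplogEntry): OplogEntry {
  return { op_id: entry.op_id, op: 'MOVE', checksum: entry.checksum };
}
```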

By itself, converting operations into MOVE operations does not reduce the number of operations synced, but may reduce the total size of the operations synced. It has no effect on clients that are already up-to-date.

## 2. CLEAR operations

A CLEAR operation in a bucket indicates that all preceding operations in the bucket must be deleted. It is typically the first operation in a bucket, but the client may receive it at any later point.

If the client has active PUT operations before the CLEAR operation, those are effectively converted into REMOVE operations. This will remove the data unless there is another PUT operation for the relevant rows later in the bucket.
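
As a sketch of the effect on the client side (not the actual client implementation), applying a CLEAR amounts to dropping everything previously synced for the bucket; comparing op_ids via `BigInt` is an assumption made for illustration:

```ts
// Sketch only: a client applying a CLEAR drops all previously synced operations
// for the bucket. Rows that only had a PUT before the CLEAR effectively become
// REMOVEs, unless a later PUT in the bucket adds them back.
function applyClear(appliedOps: OplogEntry[], clearOp: OplogEntry): OplogEntry[] {
  return appliedOps.filter((op) => BigInt(op.op_id) > BigInt(clearOp.op_id));
}
```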

The compact process involves:

1. Find the largest sequence of REMOVE, MOVE and/or CLEAR operations at the start of the bucket.
2. Replace all of those with a single CLEAR operation.

The op_id of the CLEAR operation is the latest op_id of the operations being replaced, and the checksum is the combination of those operations' checksums.
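
A minimal sketch of this step, reusing the `OplogEntry` type from above and assuming (purely as an illustration) that checksums combine by unsigned 32-bit addition:

```ts
// Sketch of the CLEAR compact step: collapse the leading run of REMOVE / MOVE /
// CLEAR operations into a single CLEAR operation.
function compactToClear(bucketOps: OplogEntry[]): OplogEntry[] {
  let prefixLength = 0;
  while (prefixLength < bucketOps.length && bucketOps[prefixLength].op !== 'PUT') {
    prefixLength++;
  }
  if (prefixLength <= 1) {
    return bucketOps; // Nothing to gain from replacing zero or one operation.
  }

  const prefix = bucketOps.slice(0, prefixLength);
  const clearOp: OplogEntry = {
    // Latest op_id of the operations being replaced.
    op_id: prefix[prefix.length - 1].op_id,
    op: 'CLEAR',
    // Assumed combination rule: sum of checksums, truncated to 32 bits.
    checksum: prefix.reduce((sum, op) => (sum + op.checksum) >>> 0, 0)
  };
  return [clearOp, ...bucketOps.slice(prefixLength)];
}
```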

Compacting to CLEAR operations can reduce the number of operations in a bucket. However, it is not effective if there is a PUT operation near the start of the bucket. This compacting step has no effect on clients that are already up-to-date.

The MOVE compact step above should typically be run before the CLEAR compact step, to ensure maximum effectiveness.

## 3. Defragmentation

Even after doing the MOVE and CLEAR compact processes, there is still a possibility of a bucket being fragmented with many MOVE and REMOVE operations. In the worst case, a bucket may start with a single PUT operation, followed by thousands of MOVE and REMOVE operations. Only a single row (the PUT operation) still exists, but new clients must sync all the MOVE and REMOVE operations.

To handle these cases, we can "defragment" the data.

Defragmentation does not involve any new operations. Instead, it just moves PUT operations for active rows from the start of the bucket to the end of the bucket, to allow the above MOVE and CLEAR compact processes to efficiently compact the bucket.
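
To make the idea concrete, here is a minimal sketch under the same hypothetical `OplogEntry` type, assuming the bucket has already been MOVE-compacted so that each remaining PUT is the latest version of an active row; `allocateOpId` is an assumed helper for issuing new op_ids at the end of the bucket:

```ts
// Sketch only: move the PUT operations for active rows to the end of the bucket
// by re-assigning their op_ids. The positions they previously occupied can then
// be compacted away by the MOVE and CLEAR steps above.
function defragmentBucket(
  bucketOps: OplogEntry[],
  allocateOpId: () => string // assumed helper that returns the next op_id
): OplogEntry[] {
  const activePuts = bucketOps.filter((op) => op.op === 'PUT');
  const rest = bucketOps.filter((op) => op.op !== 'PUT');
  const moved = activePuts.map((op) => ({ ...op, op_id: allocateOpId() }));
  return [...rest, ...moved];
}
```

In practice, the initial workaround described under Implementation below achieves a similar effect from outside PowerSync by touching rows in the source database.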

The disadvantage here is that these rows will be re-synced by existing clients.

# Implementation

## MOVE + CLEAR

This is a process that compacts all buckets by iterating through all operations. This process can be run periodically, for example once a day, or after bulk data modifications.

The process iterates through all operations in reverse order. This effectively processes one bucket at a time, in reverse order of operations.

We track each row we've seen in a bucket, along with the last PUT/REMOVE operation we've seen for the row. Whenever we see the same row again, we replace that operation with a MOVE operation, using the PUT/REMOVE op_id as the target.

To avoid indefinite memory growth for this process, we place a limit on the memory usage for the set of rows we're tracking. Once we reach this limit, we stop tracking additional rows for the bucket. We should be able to effectively compact buckets on the order of 4M unique rows using 1GB of memory, and only lose some compacting gains for larger buckets.

The second part is compacting to CLEAR operations. For each bucket, we keep track of the last PUT operation we've seen (last meaning the smallest op_id, since we're iterating in reverse). We then replace all the operations before that with a single CLEAR operation.
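
A condensed sketch of the MOVE pass described above, again using the hypothetical `OplogEntry` type; `persistMove` and `maxTrackedRows` stand in for the real storage update and the memory-based limit:

```ts
// Sketch of the MOVE compact pass for a single bucket, scanning operations from
// newest to oldest. `persistMove` stands in for the real storage update.
function moveCompactPass(
  opsNewestFirst: Iterable<OplogEntry>,
  persistMove: (op: OplogEntry, targetOpId: string) => void,
  maxTrackedRows: number
): void {
  // Row key -> op_id of the newest PUT/REMOVE seen so far for that row.
  const seenRows = new Map<string, string>();

  for (const op of opsNewestFirst) {
    if (op.op !== 'PUT' && op.op !== 'REMOVE') continue;

    const key = rowKey(op);
    const target = seenRows.get(key);
    if (target !== undefined) {
      // A newer PUT/REMOVE exists for this row, so this older operation
      // can be converted into a MOVE targeting it.
      persistMove(op, target);
    } else if (seenRows.size < maxTrackedRows) {
      // Bounded memory: once the limit is reached, stop tracking new rows
      // and accept slightly less effective compacting for very large buckets.
      seenRows.set(key, op.op_id);
    }
  }
}
```

The CLEAR part then only needs the smallest PUT op_id encountered during the same scan, after which the leading run of operations can be collapsed as in the `compactToClear` sketch above.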

## Defragmentation

For an initial workaround, defragmenting can be performed outside PowerSync by touching all rows in a bucket:

```sql
update mytable set id = id
-- Repeat the above for other tables in the same bucket if relevant
```

After this, the normal MOVE + CLEAR compacting will compact the bucket to only have a single operation per active row.

This would cause existing clients to re-sync every row, while reducing the number of operations for new clients.

If an `updated_at` column or similar is present, we can use this to defragment more incrementally:

```sql
update mytable set id = id where updated_at < now() - interval '1 week'
```

This version avoids unnecessary defragmentation of rows modified recently.

In the future, we can implement defragmentation inside PowerSync, using heuristics around the spread of operations within a bucket.

# Future additions

Improvements may be implemented in the future:

1. Keeping track of buckets that need compacting would allow us to compact those as soon as needed, without the overhead of compacting buckets where it won't have an effect.
2. Together with the above, we can implement a lightweight compacting process inside the replication worker, to compact operations as soon as modifications come in. This can help to quickly compact in cases where multiple modifications are made to the same rows in a short time span.
3. Implement automatic defragmentation inside PowerSync as described above.