tags: []
---

# Low I/O piece storage

## Essentials

### Header

Date: January 6th, 2024

Owner:

Accountable:
-

Consulted:
-

Informed:
-

## Abstract

Storage nodes use plain file storage for pieces, which results in various file system overheads depending on the specific storage node setup. This slows down nodes and puts some popular setups at a disadvantage. By reimplementing a subset of file system features in a way that is optimized for storage node use, it should be possible to reduce the number of I/O operations performed during routine tasks like uploads and downloads 5- to 10-fold, making it possible to use fewer resources to operate larger nodes.

## Main objectives

1. Operating a performant storage node on a number of popular setups (btrfs, thin-allocated storage, parity RAIDs, low-memory setups especially with NTFS, SMR drives, Storage Spaces) becomes possible.
2. The reduction in I/O operations necessary to perform routine tasks leads to improvements in time-to-first-byte and latency metrics on setups currently considered performant.

## Background

The cheapest way to provide storage for Storj purposes is to use hard disks, which have some important performance characteristics:
* Reading and writing data sequentially, sector by sector, does not incur penalties and can use the full speed of a hard drive, often >100 MB/s.
* Reading and writing data from different parts of the drive incurs penalties each time the drive moves from one area to another. Penalties depend on various factors [1]:
  * if the actuator needs to move to a different track, it incurs a penalty formally called _settle time_,
  * the number of tracks to move determines a penalty formally named _seek time_,
  * the time it takes to rotate the platter so that the right part of the track is available is formally called _latency_.

  In total, penalties may reduce effective read/write speeds by as much as three orders of magnitude. Penalties are usually smaller when the latter operation has an only slightly higher sector number than the former (maybe within the same track, maybe on a neighbouring track). The total penalty on modern hard disk drives is between 5 and 10 ms. It is usually assumed modern drives can perform at most 250 random I/O operations per second, regardless of total drive capacity.

[1] https://web.archive.org/web/20161101210900/http://www.pcguide.com/ref/hdd/perf/perf/spec/pos_Access.htm

Some ways to avoid penalties include:
* Leveraging RAM for caches. This is limited by the available amount of RAM.
* Leveraging flash storage for read caches. This requires availability of flash storage, which induces additional costs and may not be possible if the computer a node is operating on does not have free controller ports.
* Delaying writes: by collecting a number of write operations, the operating system (through various I/O scheduler algorithms) and the hard disk drive (through the _Native Command Queuing_ feature) can reorder them to reduce penalties, or avoid performing a write operation altogether (if two write operations end up modifying the same sector). For example, the popular deadline scheduler sorts operations by sector number to leverage the case of two operations at a short distance [2].

[2] https://en.wikipedia.org/wiki/Deadline_scheduler

Additional performance constraints come from popular block device setups like thinly allocated storage, which might increase penalties, or parity RAID schemes and SMR drives, which result in significant I/O amplification for small writes. It is therefore advantageous to design data structures stored on hard disk drives to be as small as possible (for better cache utilization), to minimize the number of seeks, and to avoid forced writes, especially many small forced writes.

As of now, storage nodes store each piece as a separate file in the file system, requiring the allocation of various file system data structures. While these data structures differ between file systems, they usually include at least two: a directory entry (_direntry_ in ext4 parlance), and a file node (usually called _inode_). As file systems are general-purpose, they need to provide features for various use cases, not just operating storage nodes. Many of these features, though, negatively impact the performance of storage nodes. For example:

* Separation of direntries and inodes is necessary to support hard links. Yet this separation requires an additional seek for each file access, and storage nodes do not use hard links.
* Journals improve recovery of file system metadata after a failure (e.g. a system crash). Yet they require an additional seek and, depending on configuration, a forced write for each operation that modifies metadata, while storage nodes can easily recover from some types of lost metadata operations, e.g. if a file is not deleted from trash by one trash file walker cycle, it will be deleted by the next one.
* Many file system operations are required to be atomic, for example file creation. This requires forced writes. Yet storage nodes do not require atomicity for many of them, e.g. it is rather unlikely that two pieces with the same ID will be uploaded at the same time.
* Some file systems or storage solutions have copy-on-write semantics which allow cheap snapshots or deduplication, but additionally fragment metadata when faced with small writes.
* Inodes store metadata such as user permissions, modification dates, file names, etc. This makes inodes larger than necessary for storage node purposes, and bigger data structures end up taking more cache space. For example, with the current average segment size of 10.2 MB: ext4 with the default inode size of 256 bytes and a typical direntry size (for a storage node piece) of 62 bytes will require at least 904 MB per 1 TB of pieces; NTFS with an inode size of 1 kB will require at least 2.84 GB per 1 TB of pieces. If metadata does not fit in cache, it will incur additional seeks for each file access.

While some aspects of file system operation can be tuned by storage node operators (e.g. changing the journal mode), this is not always possible, e.g. because the file system is also used for other purposes.

This document describes a different approach to improving storage node operations. By reorganizing blob storage so that there is no one-to-one correspondence between files and pieces, the storage node will no longer be required to perform many file system operations. By implementing a custom way to store metadata, the amount of cache necessary is significantly reduced.

## Design

This is a rough draft of a design. Notes that mention known missing details or additional ideas are marked with TODO.
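
To make the following sections easier to follow, a possible on-disk layout per satellite is sketched below. All file and directory names here are illustrative only; the design does not prescribe them.

```
storage/
└── <satellite ID>/
    ├── journal          # append-only log of metadata modifications
    ├── piece-index      # hash table locating pieces (may live on a separate file system)
    └── packs/
        ├── 000000.pack  # pack files, identified by 24-bit unsigned integers
        ├── 000001.pack
        └── …
```
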
### Data structures

Three types of files are used: a journal file, pack files to store pieces, and a piece index file to quickly find a piece by its ID. They are roughly equivalent to the file system's journaling, data blocks, and direntries/inodes.

Data structure descriptions define some magic numbers, such as the maximum pack size. It may make sense to make them configurable, but for simplicity the document suggests certain values to focus attention on a specific solution.

#### The journal file

The journal file is an append-only file that stores a log of all modifications to metadata. There is one journal file per satellite. The structure of this file is a series of messages denoting performed write and compaction operations, written in a tight binary encoding (e.g. protobuf). It is only used for recovery purposes.

The file needs to be periodically synced, e.g. once a minute.

TODO: It is probably a good idea to have some sort of CRC inserted every so often, e.g. for each 8 KiB block of messages.

TODO: This file may actually grow pretty large, so it might be a good idea to store it in chunks of, let's say, 256 MB.

#### Pack files

Pack files store pieces and their original header information in a format that resembles append-only storage. Assume that we want to store three pieces:

| Item | Length (bytes)                               | Contents         |
|------|----------------------------------------------|------------------|
| 0    | 512                                          | Piece 0's header |
| 1    | piece 0's length padded to a multiple of 512 | Piece 0's data   |
| 2    | 512                                          | Piece 1's header |
| 3    | piece 1's length padded to a multiple of 512 | Piece 1's data   |
| 4    | 512                                          | Piece 2's header |
| 5    | piece 2's length padded to a multiple of 512 | Piece 2's data   |

This is essentially the current blob files concatenated and padded to 512 bytes.

We will limit the size of a pack file by requiring that no piece header can start at or after the offset of 256 MiB. With the current average segment size this means approximately 726 pieces per pack file. To store more pieces we will use multiple pack files. Each pack file is identified by a 24-bit unsigned integer.

256 MiB was chosen as a trade-off between an attempt to minimize the total number of pack files and the risk of losing a large number of pieces at once in case a pack file is lost. It is also a nice coincidence that with this size and counting blocks of 512 bytes, we need exactly 4 bytes total to denote both the offset and the length of a single piece: 19 bits for the offset, and 13 bits for the length, assuming a maximum piece size of 4 MiB. On the other hand, this limits the size of a piece, which may be undesirable.

We will be using hole punching (FALLOC_FL_PUNCH_HOLE) to free up disk space used by deleted pieces. We will be using file collapsing (FALLOC_FL_COLLAPSE_RANGE) to compact pack files and free up address space within a pack file. The former is available on all modern file systems, even NTFS. The latter is available only on Linux, but it might be possible to emulate it by rewriting pack files. See the _Operations_ section below.

Hole punching also means it may not be possible to jump from one header to the next by reading the piece size from the header, as a header may be hole-punched as well. However, as locations of pieces will be stored in the journal file, this is not necessary even for recovery purposes.

*Append offset* is the offset of the first byte after the piece's data with the highest offset (logically last). This will be the offset of the next stored piece. This design never stores pieces inside punched holes, though this could be investigated in the future as a replacement for file collapsing.
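
The sketch below illustrates this append path. It is not a proposed API: the names and signatures are hypothetical, and the piece header is assumed to be already serialized into at most 512 bytes. (All Go sketches in this document are assumed to live in one hypothetical package.)

```go
import (
	"fmt"
	"os"
)

// appendPiece appends a single piece to the active pack file at the current
// append offset: one 512-byte header block followed by the piece data padded
// to a multiple of 512 bytes, matching the table above. It returns the new
// append offset. The write is not forced, so the kernel is free to coalesce
// it with writes from other uploads landing in the same preallocated file.
func appendPiece(pack *os.File, appendOffset int64, header, data []byte) (int64, error) {
	const block = 512
	if len(header) > block {
		return 0, fmt.Errorf("piece header too large: %d bytes", len(header))
	}
	padded := (len(data) + block - 1) / block * block

	buf := make([]byte, block+padded) // zero padding comes for free
	copy(buf, header)
	copy(buf[block:], data)

	if _, err := pack.WriteAt(buf, appendOffset); err != nil {
		return 0, err
	}
	return appendOffset + int64(len(buf)), nil
}
```
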
An _active pack file_ is the file currently selected for new uploads. There is only one active pack file at a time: it is desirable to have as many consecutive uploads land in the same file as possible, as this allows write coalescing across many uploads. An active pack file is kept open regardless of whether uploads are in progress. At node start, or when the active pack file's append offset crosses 256 MiB, a new one is chosen:
1. If there are any pack files with an append offset smaller than 128 MiB, then the one with the smallest append offset is chosen.
2. Otherwise, a new pack file is created.

Each time a pack file is activated, the area starting from the append offset and ending at 256 MiB is preallocated to reduce initial fragmentation and potentially speed up future writes on some setups. The last piece will likely fall outside of this region, but for one piece out of almost a thousand, this is not a big concern.

Each time a pack file is activated, a message is written into the journal: *PackFileActivated(pack file ID, append offset)*

As we will reuse pack files after compaction, it should be possible to keep a significant majority of pack files at least half full. This means that for 20 TB worth of pieces, we should end up with around 75k to 150k pack files.

2^24 pack files, even at 128 MiB each, allow for 2.2 PB of stored data per node per satellite. As such, this should be enough for the foreseeable future.

#### Piece index

There is a single index file for each satellite. Its goal is to allow a quick search for pieces by piece ID in the pack files without having to scan the journal file. There are a few possible data structures here, but one that is especially enticing is a hash table with 2^k buckets storing 8 KiB of data each. Each bucket contains data about all stored pieces whose piece ID starts with a certain k-bit prefix. The bucket stores its CRC, an individual origin timestamp in days since 2020, and the following information for each piece:

| Item | Length   | Contents                                                                          |
|------|----------|-----------------------------------------------------------------------------------|
| 0    | 3 bytes  | Pack file identifier                                                              |
| 1    | 19 bits  | Piece's header offset in units of 512 bytes                                       |
| 2    | 13 bits  | Piece's data length in units of 512 bytes                                         |
| 3    | 32 bytes | Piece ID                                                                          |
| 4    | 15 bits  | Upload timestamp in number of days since the bucket's origin timestamp            |
| 5    | 15 bits  | Expiration/trash timestamp in number of days since the bucket's origin timestamp  |
| 6    | 1 bit    | Is the piece trashed? (the "trash bit")                                           |

The size of this file is the key to fast operation, and hence the size of the above structure needs to be made as small as possible. As proposed, this structure takes 43 bytes. 190 entries fit in a single bucket, with 22 bytes to spare. Those spare bytes are used to store a CRC for each bucket, the origin timestamp, and maybe some version number for the structures stored in the bucket.

The expiration/trash timestamp stores the expiration timestamp if the piece is not trashed yet (trash bit clear), or the trash timestamp if it is (trash bit set). It does not make much sense to expire an already trashed piece. In case of restoring a piece from trash, we can consult the journal file, or just assume we have lost the expiration timestamp.
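
One possible packing of this 43-byte entry is sketched below. The exact bit layout is not specified by the design; the layout chosen here (offset and length sharing one 32-bit word, the two 15-bit day counters and the trash bit sharing another) is an assumption for illustration, as are all names.

```go
import "encoding/binary"

const entrySize = 43

// indexEntry is an in-memory view of one piece index entry.
type indexEntry struct {
	packFile     uint32 // 24-bit pack file identifier
	offset512    uint32 // 19-bit header offset, in 512-byte units
	length512    uint32 // 13-bit data length, in 512-byte units
	pieceID      [32]byte
	uploadDays   uint16 // 15 bits, days since the bucket's origin timestamp
	expTrashDays uint16 // 15 bits, expiration or trash timestamp
	trashed      bool   // the trash bit
}

// marshal packs the entry into its 43-byte on-disk form.
func (e indexEntry) marshal() [entrySize]byte {
	var b [entrySize]byte
	b[0], b[1], b[2] = byte(e.packFile>>16), byte(e.packFile>>8), byte(e.packFile)
	// 19 bits of offset and 13 bits of length share one 32-bit word.
	binary.BigEndian.PutUint32(b[3:7], e.offset512<<13|e.length512&0x1FFF)
	copy(b[7:39], e.pieceID[:])
	// Two 15-bit day counters and the trash bit share another 32-bit word.
	ts := uint32(e.uploadDays&0x7FFF)<<17 | uint32(e.expTrashDays&0x7FFF)<<2
	if e.trashed {
		ts |= 1
	}
	binary.BigEndian.PutUint32(b[39:43], ts)
	return b
}
```
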
Upload and expiration/trash timestamps are stored in reference to the origin timestamp for the bucket. Each time a bucket is written, all upload and trash timestamps older than "2 weeks ago" are updated to "2 weeks ago", then a new origin timestamp is chosen as the earliest date among all timestamps. This is to make 15 bits enough of a sliding window: 2^15 days is 89 years, which is effectively "forever" for an expiration timestamp anyway.

Structures do not have to be sorted in any particular order, like by piece ID. After reading a bucket from storage, linearly scanning it for a given piece ID will be fast enough.

Any given entry can be declared unused by zeroing all fields. Identifying an empty entry can be done by checking the piece's length, as no piece can have a length of zero. Coincidentally, an initially preallocated index file consists of zeros.

The initial number of buckets should be 8192 (for k=13 and a total hash map size of 64 MiB), as this is the minimal size to contain information on 500 GB worth of data at the current average piece size. In case a bucket fills up, a new piece index file should be created with a k factor increased by one. Each bucket in the original file will sequentially be mapped to two consecutive buckets in the new file, making hash table growth a sequential read and write operation. There is probably no need to ever shrink the piece index, as even at 20 TB worth of pieces it would likely only grow to 4 GiB. Besides, htrees of direntries in ext4 are never shrunk either, which is a precedent here.

The file should be preallocated to reduce fragmentation and hence allow fast sequential scanning. A hash map of k=19 should be enough to store information on 20 TB worth of pieces and will require 4 GiB, making a scan at 100 MB/s take 40 seconds. This will become our equivalent of a file walker for garbage collection and trashing.

It probably makes sense to map this file to memory, as opposed to reading/writing buckets explicitly.

This file should be considered a database and allowed to be stored on a separate file system in case the node operator desires, in a similar way to other databases. This is the only file where small random reads and writes will be performed. It should be marked as no-CoW on file systems which use copy-on-write semantics by default. In case this file is lost or damaged, it can be recreated from the journal file.

In some node setups, it may even be viable to always store it on a RAM disk, as opposed to e.g. flash storage, as persistence of this file is not necessary (except for faster startups after a clean shutdown) and the file is already designed to fit in the RAM cache of most setups. Hence, it may be desirable to have a separate directory setting for this file. Alternatively, the design can be changed to keep the piece index as just an in-memory data structure without a disk representation, leveraging swap space in case of extremely low-memory nodes.

By storing the expiration time of pieces in the journal file and this file, we effectively no longer need a separate SQLite database for storing them.

TODO: this file also needs some sort of a clean shutdown check. It might be as simple as a single bit indicating whether the file is open, set on node start and cleared on a clean shutdown.

TODO: to prevent malicious satellites from filling up specific buckets, the k-bit prefix may actually come from a separate hash function on a salted piece ID.
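
A lookup sketch under the same assumptions as the entry layout above (the 22 spare bytes are assumed to sit at the start of each bucket; the salting mentioned in the TODO is left out). This is illustrative, not a proposed API.

```go
import (
	"bytes"
	"encoding/binary"
)

const bucketSize = 8192

// lookup finds a piece's raw 43-byte entry in the memory-mapped piece index.
// The bucket is selected by the top k bits of the piece ID, then scanned
// linearly; with at most 190 entries per 8 KiB bucket the scan is negligible
// next to a single disk seek.
func lookup(index []byte, k uint, pieceID [32]byte) (entry []byte, ok bool) {
	prefix := binary.BigEndian.Uint32(pieceID[:4]) >> (32 - k)
	bucket := index[int(prefix)*bucketSize : (int(prefix)+1)*bucketSize]

	// The first 22 bytes are assumed to hold the bucket CRC, origin timestamp
	// and version; the 190 fixed-size entries follow.
	for off := 22; off+entrySize <= bucketSize; off += entrySize {
		e := bucket[off : off+entrySize]
		// Bytes 7..38 hold the piece ID. A zeroed (unused) entry will not
		// match any real piece ID.
		if bytes.Equal(e[7:39], pieceID[:]) {
			return e, true
		}
	}
	return nil, false
}
```
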
### Operations

The procedures described in this section are simplified to show the design without writing down all the necessary details. For example, most operations should consider some sort of locking for many of their steps.

In the comparisons to the current approach, the following will be assumed:
1. The ratio of unused RAM to the amount of pieces stored is smaller than 1 GB / 1 TB. This covers many enthusiast and SOHO setups, but also allows for much more economical professional setups, even if those setups could use large amounts of memory. This assumption means we cannot count on the direntries and inodes of piece files in the current approach being cached in RAM, as the only setup discussed on the forum where this would happen is ext4 tuned specifically for Storj with small inodes.
2. The ratio of unused RAM to the amount of pieces stored is larger than 150 MB / 1 TB, allowing the piece index file, as well as the direntries and inodes of the pack files, to be kept in the RAM cache. The proposed approach would still be better in terms of the number of seeks below this ratio, but even cheap enthusiast setups are rarely below this threshold.
3. No SSD caching. For the purposes of a storage node, a properly set up SSD cache of a non-trivial size would bring the same benefits as more RAM, but it is not always possible to add an SSD to an existing setup. Besides, even with an SSD cache, the described approach reduces the number of writes performed, prolonging the SSD's lifespan and allowing cheaper consumer devices to perform well.

#### Start

Run the following steps for each satellite on node start:

1. Create an index file if it does not exist and preallocate it for k=13, or recover it from the journal file by replaying all journal messages in case of detected damage or an unclean shutdown.
2. Perform the used space file walker by reading the physical sizes of all pack files, the journal file, and the piece index file.
3. Scan the index file. For each pack file, identify its append offset by finding the max(piece's header offset + piece's data length).
4. Open the journal file for appending.
5. Identify the active pack file. Open it for appending.

Note: we are scanning the piece index file anyway, so we could sum up the piece sizes from the index instead of running the file walker. However, this will not account for misaligned hole punching (see the _Single immediate piece deletion_ section notes), will not account for metadata (though the current file walker doesn't do it either), and will not be accurate in case of an unclean shutdown. Still, summing up pack files will be two orders of magnitude faster than the current used space file walker, as the number of inodes will be much smaller.
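
Step 3 above can be implemented as a single pass over the piece index, reusing the entry layout assumed in the earlier sketches. One detail assumed here is that the 13-bit length field does not include the 512-byte header block.

```go
import "encoding/binary"

// appendOffsets derives the append offset of every pack file from one
// sequential scan of the piece index. Illustrative sketch only.
func appendOffsets(index []byte, k uint) map[uint32]int64 {
	offsets := make(map[uint32]int64)
	for b := 0; b < (1 << k); b++ {
		bucket := index[b*bucketSize : (b+1)*bucketSize]
		for off := 22; off+entrySize <= bucketSize; off += entrySize {
			e := bucket[off : off+entrySize]
			loc := binary.BigEndian.Uint32(e[3:7])
			headerOff, length := int64(loc>>13), int64(loc&0x1FFF)
			if length == 0 {
				continue // zeroed, unused entry
			}
			pack := uint32(e[0])<<16 | uint32(e[1])<<8 | uint32(e[2])
			// End of the piece: header block plus padded data, in bytes.
			if end := (headerOff + 1 + length) * 512; end > offsets[pack] {
				offsets[pack] = end
			}
		}
	}
	return offsets
}
```
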
#### Upload

We will assume that we are not writing a piece to disk until we have all of the piece's contents. This means no temporary files, but also the need to store partial piece data in memory (or swap). In that case, swap takes over the role of a temporary file, though likely at lower efficiency. It should be rare to have more than a few hundred pieces uploaded at a time, and even for one thousand concurrently uploaded pieces this means memory usage of less than 3 GB. We will consider this a fixed overhead, not dependent on the amount of data stored. It is advisable to set the limit of concurrent uploads accordingly, though.

1. Collect all data to store the piece.
2. Append the piece header and piece contents to the active pack file, padding to 512 bytes.
3. Add an entry to the piece index file. If the hash table's bucket is full, rebuild the hash table with a bigger k.
4. Add an entry to the journal: *NewPiece(upload timestamp, piece ID, piece data length, piece expiration time)*
5. Update the pack file's append offset.

The active pack file and the journal are already open, so there is no need to perform additional I/O to locate them. The former is also preallocated. None of the writes need to be forced. As such, in most cases, pack file and journal file writes will be coalesced across many uploads. The piece index modification will turn into a random write, but assuming that many such writes will be collected in a short period, these writes being close to each other (in a single preallocated file) still have a chance to be quite a bit faster than random writes across the whole drive, thanks to I/O schedulers.

This compares well to the expected 10-20 I/O operations, many of them forced writes, with the current approach.

In the event of a crash, as only the active pack file is modified, only pieces from the active pack file are likely at risk. Given that thousands of pack files are expected, this keeps the risk well below the assumed 2% of files lost.

#### Download

1. Look up the piece's location through the piece index.
2. Verify that its trash bit is not set and its expiration date is not in the past.
3. Read the contents of the piece from the pack file denoted in the piece index, using the offsets given in the piece index.

We perform one cached random read from the piece index, then cached direntry and inode reads for the pack file. At the end, we perform a sequential read of the requested contents. Note that even though the pack file may technically end up fragmented due to hole punching and compaction, fragmentation will likely occur only at piece boundaries, making it a non-factor for single-piece reads. However, this fragmentation may affect the complexity of the pack file's extent tree. Extent trees will likely stay cached for downloads of recently uploaded pieces (a common case), but not for older pieces.

This compares favorably to the current case, where the direntry, inode, and data of the piece itself need to be read from three different locations. Going by the assumption that the direntries and inodes of piece files in the current scheme are not cached, this means a potential improvement in the latency of piece data reads of 10-20 ms.

#### Single immediate piece deletion

1. Look up the piece's location through the piece index.
2. Punch a hole in the pack file denoted in the piece index, using the offsets given in the piece index. Punching a hole frees up data sectors without invalidating the offsets of valid pieces within the pack file.
3. Remove the piece's entry from the piece index.

We perform one cached random read from the piece index. Hole punching basically requires a change to the extent tree of a file and a modification of the file's inode, which is two reads and two writes. As the last step there is a write to the piece index, which does not have to be forced (worst case, there is a potential attempt to read from a hole). This is roughly comparable to a file deletion, which requires removal of a direntry and an inode (at least two reads and two writes). Both cases will also need to free up the allocated data sectors.
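
Step 2 maps to a single fallocate(2) call on Linux, sketched below with golang.org/x/sys/unix. Rounding the range to the file system's logical block size, discussed in the TODO below, is left out; names are illustrative.

```go
import (
	"os"

	"golang.org/x/sys/unix"
)

// punchPiece releases the data sectors of a deleted piece without changing
// the pack file's size or the offsets of the remaining pieces. offset512 and
// length512 come straight from the piece index entry; one extra block covers
// the piece header. Illustrative sketch only.
func punchPiece(pack *os.File, offset512, length512 uint32) error {
	off := int64(offset512) * 512
	length := int64(length512+1) * 512 // header block + padded data
	return unix.Fallocate(int(pack.Fd()),
		unix.FALLOC_FL_PUNCH_HOLE|unix.FALLOC_FL_KEEP_SIZE, off, length)
}
```
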
TODO: Punching a hole requires file ranges to fall at multiples of the file system's logical block size. This may mean that punching will not be effective for small pieces, and may require adjusting the ranges (instead of "X to Y", we may need to do "X+𝛿 to Y-𝜀"). It is not a big problem though: these leftovers will be at most two data blocks, and will be properly collected by a future compaction run anyway. Also, in the case of trash collection, described later, multiple hole punches next to each other can be merged into one, which may lessen the impact of this limitation.

It is not necessary to add a journal message. Worst case, the journal will recover a metadata entry for a piece whose contents are no longer stored. Any attempt to download such a piece will fail, and the metadata entry will be garbage-collected at the next opportunity.

#### Garbage collection

Scan the piece index, searching for all pieces whose piece ID is not present in the bloom filter and whose upload timestamp is early enough. For the pieces found, set their trash bit and overwrite the piece expiration/trash timestamp.

Note how this operation only touches the piece index file, and does so roughly sequentially. Even if there's just a bunch of buckets being modified all over the hash map, any decent I/O scheduler will order the writes favorably to the hard disk layout, as we modify buckets in order. As such, we may end up with GC runs taking tens of seconds per TB worth of pieces.

There's no point in listing the I/O necessary for the current approach.

It is not necessary to add a journal message. Worst case, recovery from the journal will restore a piece's metadata entry without the trash bit set, and the piece will simply be trashed again by a future garbage collection run.

#### Restore from trash

1. Look up the piece's location through the piece index.
2. Clear the trash bit in the piece index, and clear the expiration/trash timestamp.

As we do not store journal messages for trashing a piece, it is not necessary to store messages for restores either.

#### Checking for existence of a piece

Look up the piece in the piece index. Verify that the trash bit is clear and the piece is not expired.

#### Trash collection, expired pieces collection and pack file compaction

1. Scan the piece index, searching for all pieces whose trash bit is set and expiration/trash timestamp is not old enough, or whose trash bit is clear and expiration/trash timestamp is in the future (i.e., all pieces to be kept):
   1. Tally up their total size (in terms of the number of the file system's logical blocks, not in terms of bytes) for each pack file.
   2. Note down all piece IDs not matching (i.e., to be removed) together with their pack file identifier.
2. For each pack file identified as having at least one piece to be removed:
   1. If the pack file is active, just run the _Single immediate piece deletion_ procedure for each removed piece. Compaction cannot be run together with uploads.
   2. If the append offset of the pack file is already below 128 MiB, just run the _Single immediate piece deletion_ procedure for each removed piece. No point in compacting _again_.
   3. If the total size of pieces kept is still bigger than 128 MiB, just run the _Single immediate piece deletion_ procedure for each removed piece. No point in compacting _yet_.
   4. Otherwise, collect the pack file ID as a candidate for compaction.
3. Scan the piece index again, this time searching for all piece IDs stored in pack files that are candidates for compaction. While scanning, immediately erase pieces whose trash bit is set and expiration/trash timestamp is early enough, or whose trash bit is clear and expiration/trash timestamp is in the past.
4. For each pack file that is a candidate for compaction, run compaction with the piece IDs to be kept.

The total size can be computed exactly (e.g. by using a bit vector with one bit per file system logical block; even for 20 TB of data with the popular logical block size of 4 KiB this means a bitmask of 671 MB), or approximately, by overcounting blocks shared by multiple pieces.

Scanning a piece index that is already fully cached requires no I/O. Neither does scanning it twice.

The cases running the _Single immediate piece deletion_ procedure will be slightly cheaper than the current trash collection implementation due to the better locality of consecutive writes in the piece index file. Not much cheaper, but still a bit more efficient.

TODO: In this case, if we happen to punch holes next to each other during consecutive _Single immediate piece deletion_ steps, these holes can be merged into one. This may help leave less unaligned data behind.

The full compaction routine is only executed if the amount of data left in the pack file is so small that it will again make sense to use the pack file as an active file. Compaction is necessary, as otherwise we could end up in a corner case where each pack file could at some point be left with just a small number of pieces, making the proposed design again require an uncacheable number of inodes and direntries. However, compaction is not friendly to concurrent uploads and downloads, so we prefer to only run it when the expected results are good enough.

The following steps represent a simplified version of the procedure to compact a pack file. This is probably the most complex element of the design. It is also the most risky one, as a crash in the middle of the procedure may make the pack file under compaction unusable.

1. Allocate a new pack file identifier.
2. Sort the pieces kept by the piece's header offset.
3. Note down all unused pack file ranges between pieces. E.g. if the first piece to be kept starts at block 15 and has a data length of 5, and the second piece starts at block 31, then the unused ranges include 0-14 and 21-30.
4. Shift the offsets of all pieces as if there were no empty space between entries, taking into account any adjustments necessary (see the notes below).
5. Write an entry into the journal file: *Compaction(new pack file identifier)*.
6. Perform FALLOC_FL_COLLAPSE_RANGE on the noted-down unused pack file ranges.
7. Move the pack file to the new identifier.
8. Update piece index entries with the new pack file identifier and new offsets.
9. Write entries into the journal file for each piece: *CompactedPiece(piece ID, new offset)*.
10. Write an entry into the journal file: *PackFileRemoved(old pack file identifier)*.
11. Update the append offset for the pack file, so that it can be considered ready to become active in the future.

We take advantage of the fact that the piece index is cached: even though we will effectively perform reads and writes in the piece index for each existing piece (as opposed to doing operations only for removed pieces), these will all be cached reads and localized writes. As such, the only I/O operations done are the journal writes, FALLOC_FL_COLLAPSE_RANGE, and a single file move.
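
A sketch of step 6 follows. It assumes the ranges noted down in step 3 have already been adjusted to multiples of the file system's logical block size (see note 1 below), and collapses them back to front so that the offsets of not-yet-collapsed ranges remain valid. Names are illustrative.

```go
import (
	"os"
	"sort"

	"golang.org/x/sys/unix"
)

// fileRange is an unused byte range within a pack file, block-aligned.
type fileRange struct{ off, length int64 }

// collapseRanges removes the given unused ranges from the pack file, shifting
// the remaining data towards the beginning of the file. Ranges are processed
// from the highest offset down so that earlier offsets stay valid.
func collapseRanges(pack *os.File, unused []fileRange) error {
	sort.Slice(unused, func(i, j int) bool { return unused[i].off > unused[j].off })
	for _, r := range unused {
		err := unix.Fallocate(int(pack.Fd()), unix.FALLOC_FL_COLLAPSE_RANGE, r.off, r.length)
		if err != nil {
			return err
		}
	}
	return nil
}
```
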
Notes/TODO:
1. FALLOC_FL_COLLAPSE_RANGE requires file ranges to fall at multiples of the file system's logical block size. As such, we may need to detect the exact multiples allowed, and adjust the ranges (instead of "X to Y", we may need to do "X+𝛿 to Y-𝜀", with care taken that small ranges may collapse to an empty set).
2. Rounding up piece sizes to the file system's logical block size is necessary to make sure that after the adjustment above we will still end up with a pack file below 128 MiB.
3. Concurrent downloads that do not read from the currently compacted file are fine. Concurrent downloads that would read from the compacted file cannot operate concurrently with compaction steps 6-8. These steps will most probably take seconds, as FALLOC_FL_COLLAPSE_RANGE is a pretty fast operation (an update to the extent tree plus freeing up data sectors).
4. In the absence of the FALLOC_FL_COLLAPSE_RANGE operation it might actually be fine to just rewrite the pack file. This is a roughly sequential read and a sequential write of at most 128 MiB, which should finish in a few seconds. As the compaction procedure of a pack file is only run at most once per 128 MiB of uploads, it may still be a good trade-off.
5. The PackFileRemoved message is necessary to avoid a recovered, outdated entry of an already removed pack file overlapping the file range of a currently stored piece. Hole-punching based on such an outdated entry could prove disastrous.

#### Journal rewrite

A new periodic operation is performed: the journal is rewritten to get rid of outdated items, such as removed files or old compaction events. The procedure is performed only if the journal grows twice as big as the piece index file.

1. A new journal file is created.
2. The first message is: *PieceIndexSize(k)*, so that in case of recovery, the right size of the piece index is known already at the beginning of the recovery procedure.
3. All entries from the index file are written down into the new file as a sequence of messages: *PresentPiece(…)*, copying all data from the piece index except for the trash bit, and except for the expiration/trash timestamp if the trash bit was set.
4. fsync(), then rename over the old journal file.

This is a sequential write roughly the size of the piece index file.

To allow concurrent operations, for the period of time this procedure runs, new journal messages need to be enqueued for both the old and the new journal.

It might make sense to also perform this operation on a clean shutdown.

Trash status cannot be stored, as the journal does not store messages for the restore from trash procedure. This is not a concern though: the next garbage collection will deal with these pieces anyway.

TODO: not so sure about the last statement anymore. If journal rewrites turn out to happen more often than garbage collection, then we would never collect that garbage. Needs some thinking.

## Rationale

The suggested design is proposed as an alternative to the existing piece storage. The design is more complex and requires a careful implementation, especially regarding concurrency and recovery in case of problems. It depends on specific file system features like hole punching and file collapsing, which are not always available. It also puts bigger reliability requirements on the operating system's file system implementation. Yet for a significant majority of existing storage nodes it may bring a significant performance improvement and reduce wear.
It may also improve system-level metrics like time to first byte by a non-trivial amount if most nodes can adopt this design.

## Implementation

…

## Wrapup

…

## Open issues

Migration from the existing file-based structure.