| 1 | +# Dictionary coder implementation in DSLX |
| 2 | + |
| 3 | +A dictionary coder is a compression algorithm that compresses data by replacing |
| 4 | +sequences of symbols in the original data with pointers to sequences of symbols |
| 5 | +stored within a "dictionary" data structure that is known to the decoder. |
| 6 | +One example of such an algorithm is LZ4. |
| 7 | + |
| 8 | +This module provides encoder and decoder blocks that implement the LZ4 algorithm. |
| 9 | + |
| 10 | +## LZ77 algorithms |
| 11 | + |
| 12 | +LZ4 belongs to a broader class of LZ77-like encoders. LZ77 encoders usually |
| 13 | +have no preset dictionary; they build the dictionary as they compress data. Each |
| 14 | +input symbol is thus processed in the context of its own dictionary, and these |
| 15 | +dictionaries are in general different for any two symbols. The decoders perform |
| 16 | +a reverse process: they almost always start with an empty dictionary and build |
| 17 | +it as they are decoding the data, thus obtaining perfect reconstruction not |
| 18 | +only of the original raw data that was encoded, but also of the dictionary that |
| 19 | +should match the encoder's dictionary at every step. |
| 20 | + |
| 21 | +Within the class of LZ77 algorithms, the dictionary is a buffer that contains |
| 22 | +up to `N` past raw symbols that have already been consumed by the encoder or |
| 23 | +emitted by the decoder. It is usually implemented as a circular buffer and is |
| 24 | +called simply a *history buffer (HB)*. |
| 25 | + |
| 26 | +An LZ77 encoder emits two types of tokens (we call them _tokens_ to differentiate |
| 27 | +them from _symbols_ which always refer to the original uncompressed data): |
| 28 | +- *UNMATCHED_SYMBOL*, which contains the original symbol as-is and which the |
| 29 | + decoder simply copies to the output, resulting in no data size reduction. |
| 30 | +- *MATCH*, which tells the decoder to copy a string of symbols from the history |
| 31 | + buffer to the output. As these symbols are not present in the token itself, |
| 32 | + this may result in a significant data size reduction. A *MATCH* token consists |
| 33 | + of an *offset-count* pair: |
| 34 | + - *Offset* tells the decoder how far back in the history buffer the string |
| 35 | + begins (usually 0 means "start with the last character", 1 is "the one |
| 36 | + before the last", etc.). |
| 37 | + - *Count* tells the decoder how many symbols to copy from the history buffer |
| 38 | + to the output. Since a count of `0` would produce an empty string and is |
| 39 | + therefore never needed, a specific byte-level encoding of the token can |
| 40 | + store the `count - 1` value instead. |
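The token semantics above can be illustrated with a small software model. This is a minimal Python sketch, not the DSLX decoder from this module: the tuple representation of tokens, the function name `lz77_decode` and the `history_size` parameter are assumptions made for illustration only.

```python
from collections import deque


def lz77_decode(tokens, history_size=16):
    """Decode a token list into a list of symbols.

    Hypothetical token representation used by this sketch:
      ("UNMATCHED_SYMBOL", symbol)
      ("MATCH", offset, count)  # offset 0 = most recently emitted symbol
    """
    # The history buffer keeps the newest symbol at the right end and
    # silently drops the oldest one when it is full.
    history = deque(maxlen=history_size)
    output = []

    def emit(symbol):
        output.append(symbol)
        history.append(symbol)

    for token in tokens:
        if token[0] == "UNMATCHED_SYMBOL":
            emit(token[1])
        elif token[0] == "MATCH":
            _, offset, count = token
            for _ in range(count):
                # Every copied symbol is pushed to the history buffer, so the
                # source stays at the same distance from the newest symbol.
                emit(history[-1 - offset])
        else:
            raise ValueError(f"unknown token kind: {token[0]}")
    return output


example = [
    ("UNMATCHED_SYMBOL", "a"),
    ("UNMATCHED_SYMBOL", "b"),
    ("UNMATCHED_SYMBOL", "c"),
    ("MATCH", 2, 6),
]
print("".join(lz77_decode(example)))  # abcabcabc
```

Because each copied symbol is appended to the history buffer before the next one is read, the same loop also handles matches whose count exceeds `offset + 1`.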
| 41 | + |
| 42 | +## LZ4 algorithm |
| 43 | + |
| 44 | +What sets the LZ4 algorithm apart from other LZ77-compatible algorithms (such as ALDC) |
| 45 | +is how the encoder finds matches in its history buffer. Instead of performing an |
| 46 | +exhaustive search, which is inefficient for algorithms running on a CPU and |
| 47 | +complicated for ASIC implementations (e.g. silicon implementations of ALDC |
| 48 | +may use highly customized CAM memory cores), it performs a non-exhaustive search |
| 49 | +using a hash table. |
| 50 | + |
| 51 | +The LZ4 algorithm uses two random-addressable memory blocks: |
| 52 | +1. History buffer (also called *HB RAM*), which is the dictionary itself. HB |
| 53 | + stores raw data symbols, usually using a circular-buffer addressing scheme. |
| 54 | + - HB address bus is *MATCH_OFFSET_WIDTH* bits wide, thus it contains |
| 55 | + up to `(1 << MATCH_OFFSET_WIDTH)` symbols. |
| 56 | + - HB data word contains a single symbol, thus it is *SYMBOL_WIDTH* bits |
| 57 | + wide. |
| 58 | + - The size of the HB RAM is `(1 << MATCH_OFFSET_WIDTH) * SYMBOL_WIDTH` bits. |
| 59 | +2. Hash table (also called *HT*). It works as follows: |
| 60 | + - The HT address is *HASH_WIDTH* bits wide. This value can be configured to |
| 61 | + tune the footprint-vs-compression-ratio tradeoff. The |
| 62 | + address into the HT is usually created by hashing a short string that contains |
| 63 | + a small number of symbols (*HASH_SYMBOLS*) extracted in order from the |
| 64 | + incoming data. |
| 65 | + - HT data word contains a pointer (an address) into the HB RAM at which that |
| 66 | + string of symbols may be found. |
| 67 | + - The size of the HT is `(1 << HASH_WIDTH) * MATCH_OFFSET_WIDTH` bits, since each word holds an HB address. |
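As a worked example of the sizing arithmetic, the following Python sketch computes both RAM footprints. It assumes that each HT word holds exactly one HB address and is therefore *MATCH_OFFSET_WIDTH* bits wide; the function name and the parameter values are illustrative, not defaults taken from this module.

```python
def ram_sizes_bits(symbol_width, match_offset_width, hash_width):
    """Return the HB and HT RAM sizes in bits for the given parameters."""
    hb_words = 1 << match_offset_width  # one symbol per HB word
    ht_words = 1 << hash_width          # one HB address per HT word (assumed)
    return {
        "hb_bits": hb_words * symbol_width,
        "ht_bits": ht_words * match_offset_width,
    }


# Illustrative parameters: 8-bit symbols, 16-bit offsets, 13-bit hash.
sizes = ram_sizes_bits(symbol_width=8, match_offset_width=16, hash_width=13)
print(sizes["hb_bits"])  # 524288 bits (64 KiB)
print(sizes["ht_bits"])  # 131072 bits (16 KiB)
```

Each extra bit of *HASH_WIDTH* doubles the HT footprint, which is the footprint side of the tradeoff mentioned above.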
| 68 | + |
| 69 | +In addition, because LZ4 wants to look ahead into the input data to hash it, it |
| 70 | +needs a small FIFO with parallel output of all the bits to store *HASH_SYMBOLS* |
| 71 | +input symbols and feed them to the hash function. |
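The look-ahead FIFO and the hash computation can be modeled as follows. This is only a sketch: the text above does not specify the hash function, so the multiplicative hash used here (in the style commonly found in software LZ4 implementations) is an assumption, as are the constant values and the name `hash_fifo`.

```python
from collections import deque

# Illustrative parameter values, not taken from this module.
SYMBOL_WIDTH = 8
HASH_SYMBOLS = 4
HASH_WIDTH = 13


def hash_fifo(fifo):
    """Hash all symbols currently held in the FIFO into a HASH_WIDTH-bit value."""
    # Read every FIFO entry "in parallel" and pack them into a single word,
    # with the oldest symbol in the least significant position.
    word = 0
    for i, symbol in enumerate(fifo):
        word |= (symbol & ((1 << SYMBOL_WIDTH) - 1)) << (i * SYMBOL_WIDTH)
    # Assumed hash: multiply by a large odd constant, keep 32 bits,
    # then use the top HASH_WIDTH bits as the HT address.
    product = (word * 2654435761) & 0xFFFFFFFF
    return product >> (32 - HASH_WIDTH)


# A bounded deque behaves like the FIFO: pushing drops the oldest symbol.
fifo = deque([0x61, 0x62, 0x63, 0x64], maxlen=HASH_SYMBOLS)
address = hash_fifo(fifo)
fifo.append(0x65)  # the FIFO now holds 0x62, 0x63, 0x64, 0x65
```

Any function that spreads short symbol strings evenly over the HT address space would serve the same purpose.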
| 72 | + |
| 73 | +The operation of the LZ4 algorithm is depicted in the flowchart: |
| 74 | + |
| 75 | + |
| 76 | +A textual description of it, which gives a bit more context: |
| 77 | +* __(1)__ Consume one input symbol. |
| 78 | + * __(2)__ If it's an EOF, end processing. |
| 79 | + * Otherwise: |
| 80 | + * __(3)__ Push it to the HB (dropping the oldest symbol from the HB). |
| 81 | + * __(4)__ Push it to the FIFO (dropping the oldest symbol from the FIFO). |
| 82 | +* __(5)__ Designate the current oldest symbol in the FIFO as *current_symbol*, |
| 83 | + calculate *current_ptr* - the location of this symbol in the HB. |
| 84 | +* __(6)__ If we do not have an existing matching string that we're trying to |
| 85 | + grow, try to start a new matching string: |
| 86 | + * __(7)__ Compute hash of the data contained in the FIFO. |
| 87 | + * __(8)__ Load *candidate_ptr* pointer from the HT. |
| 88 | + * __(9)__ Store *current_ptr* to the HT. |
| 89 | + * __(10)__ Initiate a new matching string: calculate offset from a difference |
| 90 | + of *candidate_ptr* and *current_ptr*, set length to zero. |
| 91 | + * Here, *candidate_ptr* points to the string in history buffer, the |
| 92 | + beginning of which *potentially* matches the current input string which |
| 93 | + starts with *current_symbol* (the first symbols of the current input |
| 94 | + string are stored in the FIFO, and we've just hashed them and performed a |
| 95 | + HT lookup to find a similar sequence of symbols in the HB). |
| 96 | + * The match is not guaranteed since there can be a hash collision - the |
| 97 | + hash is the same, but actual symbols pointed by *candidate_ptr* differ, |
| 98 | + so we need to check each of them if it matches the one in the FIFO. |
| 99 | + In addition to that, we'd like to grow a matching string longer than |
| 100 | + what's contained in the FIFO. |
| 101 | + * As shown below, the same procedure is used to check matching of the |
| 102 | + beginning of the string, as well as of its continuation. |
| 103 | +* Here, *current_ptr* points to the *current_symbol* - the next input symbol |
| 104 | + for which no output token has been emitted, while *candidate_ptr* points to |
| 105 | + an old symbol in the HB which we'd like to compare with *current_symbol*. |
| 106 | + * __(11)__ Load the symbol from the HB pointed to by *candidate_ptr*; it |
| 107 | + becomes the *candidate_symbol*. |
| 108 | + * __(12)__ The *candidate_symbol* is compared with *current_symbol*. |
| 109 | + * __(13)__ If it's a match, continue this matching string: |
| 110 | + * __(14)__ Increment the match length by 1. |
| 111 | + * Go to step __(1)__ . |
| 112 | + * If it's not a match and we've been already growing a matching string: |
| 113 | + * __(15)__ Emit a *MATCH* token for the current matching string. |
| 114 | + * __(16)__ Terminate current matching string. |
| 115 | + * Go to step __(5)__ - this re-processes current input symbol one more time |
| 116 | + (this is done for two reasons: no output token has been emitted for |
| 117 | + that symbol yet, and it must be emitted at some point, and also |
| 118 | + this symbol may be able to start a new matching string on its own). |
| 119 | + * Otherwise: |
| 120 | + * __(17)__ Emit an *UNMATCHED_SYMBOL* token for the current symbol. |
| 121 | + * Go to step __(1)__. |
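The steps above can be condensed into a short software model. The following Python sketch is a simplification made under several assumptions: it works on a whole `bytes` object with random access instead of a streaming FIFO and a circular HB, it uses a dictionary as a sparse stand-in for the HT RAM, and its hash function, parameter values and names (`hash_window`, `lz4_like_encode`) are hypothetical. Step numbers in the comments refer to the list above.

```python
def hash_window(window, hash_width):
    """Assumed hash function: multiplicative hash of the packed symbols."""
    word = int.from_bytes(window, "little")
    return ((word * 2654435761) & 0xFFFFFFFF) >> (32 - hash_width)


def lz4_like_encode(data, hash_symbols=3, hash_width=10, history_size=256):
    """Encode bytes into UNMATCHED_SYMBOL / MATCH tokens (greedy, one candidate)."""
    hash_table = {}  # sparse stand-in for the HT RAM: hash -> position in data
    tokens = []
    pos = 0  # position of current_symbol, plays the role of current_ptr
    while pos < len(data):  # steps (1)-(5)
        candidate = None
        if pos + hash_symbols <= len(data):
            key = hash_window(data[pos:pos + hash_symbols], hash_width)  # (7)
            candidate = hash_table.get(key)                              # (8)
            hash_table[key] = pos                                        # (9)
        length = 0  # (10)
        # The candidate is usable only if it is still inside the history buffer.
        if candidate is not None and pos - candidate <= history_size:
            # (11)-(14): compare symbol by symbol and grow the match. This also
            # rejects hash collisions, which fail on the very first symbol.
            while (pos + length < len(data)
                   and data[candidate + length] == data[pos + length]):
                length += 1
        if length > 0:
            # (15)-(16): offset 0 designates the symbol just before the match.
            tokens.append(("MATCH", pos - 1 - candidate, length))
            pos += length
        else:
            tokens.append(("UNMATCHED_SYMBOL", data[pos]))  # (17)
            pos += 1
    return tokens


tokens = lz4_like_encode(b"abcabcabcabc")
```

As in the flow above, the hash table is updated only when a new match is started, so positions inside an already growing match are never inserted. The exact token sequence depends on the hash function and on collisions, which is why no expected output is listed here.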
| 122 | + |
| 123 | +## DSLX implementation |
| 124 | + |
| 125 | +The LZ4 encoder in DSLX is implemented as an FSM-like process that changes state at most |
| 126 | +once per "tick". It takes the form of a _proc_ module. |
| 127 | + |
| 128 | +Compared to the reference flowchart depicted above, this FSM is a bit more |
| 129 | +complicated as it has to take into account some corner cases: |
| 130 | +* Prefilling the FIFO with data before beginning the processing. |
| 131 | +* Correct handling of the current symbol when it runs into an EOF. |
| 132 | +* Draining the FIFO after EOF has been observed. |
| 133 | +* Allowing "warm" restart of the algorithm after EOF to allow compression of |
| 134 | + *dependent* blocks - that is, without resetting the HT and the HB between the |
| 135 | + blocks (this allows tokens from the current block to refer to HB symbols of the |
| 136 | + previous block). |
| 137 | +* System-level integration considerations to allow this block to be chained |
| 138 | + with other XLS-based data pre-/postprocessors: |
| 139 | + * *MATCHED_SYMBOL* tokens. Whenever the encoder adds a symbol to the matching |
| 140 | + string, it emits a *MATCHED_SYMBOL* token, which can be used by the |
| 141 | + postprocessing blocks to reconstruct symbols encoded by *MATCH* tokens |
| 142 | + without having to decode *MATCH*es (such decoding would require a full-size |
| 143 | + history buffer RAM). These tokens should not be stored into the final |
| 144 | + encoded block. |
| 145 | + * Support for a limited form of in-band control signaling via |
| 146 | + *marker tokens*: |
| 147 | + * *END* marker that signals end of block (EOF condition). |
| 148 | + * *ERROR_* family of markers that allow passing error codes between |
| 149 | + processing blocks. These tokens abort the encoding. |
| 150 | + * *RESET* marker that performs a *cold* restart of the encoder (clearing HT |
| 151 | + RAM) and other blocks in the chain, also clearing all error conditions. |
| 152 | + |
| 153 | +One example of a post-processor module can be a *block writer* proc that |
| 154 | +gathers tokens produced by the encoder and encodes them using a standardized |
| 155 | +byte-oriented *LZ4 Block Format*. Implementation of alternative encoding |
| 156 | +schemes may be of interest as well, as the header format used by the standard |
| 157 | +LZ4 requires one to know the number of unmatched symbols between two matches |
| 158 | +before those symbols are emitted, making bufferless stream processing |
| 159 | +difficult. |
| 160 | + |
| 161 | +### Data format |
| 162 | + |
| 163 | +#### PlainData |
| 164 | + |
| 165 | +The encoder consumes a stream of raw symbols, intermixed with control markers. |
| 166 | +This is represented using a parametrized `PlainData` DSLX structure: |
| 167 | +```rust |
| 168 | +pub struct PlainData<DATA_WIDTH: u32> { |
| 169 | + is_marker: bool, |
| 170 | + data: uN[DATA_WIDTH], |
| 171 | + mark: Mark, |
| 172 | +} |
| 173 | +``` |
| 174 | + |
| 175 | +* *is_marker* tells whether this object is a symbol or a marker. |
| 176 | +* *data* communicates a symbol whenever *is_marker* is not set. |
| 177 | +* *mark* communicates a control mark whenever *is_marker* is set. |
| 178 | + |
| 179 | +#### Token |
| 180 | + |
| 181 | +The encoder produces a stream of tokens. There are four types of them: |
| 182 | +* *MATCH* is an *offset-length* pair that represents a sequence of |
| 183 | + symbols that is the same as the specified sequence within HB. |
| 184 | +* *UNMATCHED_SYMBOL* represents a symbol for which no match was found in the |
| 185 | + HB. It will be encoded as a raw symbol in the final piece of encoded data. |
| 186 | +* *MATCHED_SYMBOL* is a symbol that is encoded within the next *MATCH* token. |
| 187 | + Its intended use is to allow easy postprocessing of a stream of tokens |
| 188 | + without a need for a full-fledged and heavy *MATCH* decoder. |
| 189 | +* *MARKER* contains a control mark code. |
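To illustrate why *MATCHED_SYMBOL* tokens are useful, here is a minimal Python sketch of two hypothetical post-processing helpers. The `(kind, value)` tuple representation, the helper names and the example stream are assumptions for illustration; they are not part of the DSLX code.

```python
def reconstruct_symbols(tokens):
    """Rebuild the original symbols without decoding any MATCH token."""
    return [value for kind, value in tokens
            if kind in ("UNMATCHED_SYMBOL", "MATCHED_SYMBOL")]


def drop_matched_symbols(tokens):
    """Remove MATCHED_SYMBOL tokens, which must not reach the encoded block."""
    return [(kind, value) for kind, value in tokens if kind != "MATCHED_SYMBOL"]


# Hypothetical token stream for the input "abab": the MATCHED_SYMBOL tokens
# precede the MATCH token that encodes them.
stream = [
    ("UNMATCHED_SYMBOL", "a"),
    ("UNMATCHED_SYMBOL", "b"),
    ("MATCHED_SYMBOL", "a"),
    ("MATCHED_SYMBOL", "b"),
    ("MATCH", (1, 1)),  # (match_offset, match_length), length stored minus one
    ("MARKER", "END"),
]

print("".join(reconstruct_symbols(stream)))  # abab
print(len(drop_matched_symbols(stream)))     # 4
```

Neither helper needs a history buffer: the first one ignores *MATCH* tokens entirely, and the second one only filters the stream.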
| 190 | + |
| 191 | +Tokens are represented using the following enum and structure in DSLX: |
| 192 | +```rust |
| 193 | +pub enum TokenKind : u2 { |
| 194 | + UNMATCHED_SYMBOL = 0, |
| 195 | + MATCHED_SYMBOL = 1, |
| 196 | + MATCH = 2, |
| 197 | + MARKER = 3, |
| 198 | +} |
| 199 | + |
| 200 | +pub struct Token< |
| 201 | + SYMBOL_WIDTH: u32, MATCH_OFFSET_WIDTH: u32, MATCH_LENGTH_WIDTH: u32 |
| 202 | +>{ |
| 203 | + kind: TokenKind, |
| 204 | + symbol: uN[SYMBOL_WIDTH], |
| 205 | + match_offset: uN[MATCH_OFFSET_WIDTH], |
| 206 | + match_length: uN[MATCH_LENGTH_WIDTH], |
| 207 | + mark: Mark |
| 208 | +} |
| 209 | +``` |
| 210 | + |
| 211 | +* *kind* specifies one of the four token kinds. |
| 212 | +* *symbol* contains symbol value for *UNMATCHED_SYMBOL* and *MATCHED_SYMBOL* |
| 213 | + tokens. |
| 214 | +* *match_offset* and *match_length* are valid only for a *MATCH* token: |
| 215 | + * Length is the length of the string that has to be copied from the history |
| 216 | + buffer, minus one. That is, *match_length=0* specifies a string of 1 |
| 217 | + symbol, *match_length=3* a string of 4 symbols, etc. |
| 218 | + * Offset points to the beginning of the string that has to be copied from |
| 219 | + HB. |
| 220 | + * An offset of 0 means that the first symbol to be copied is the last |
| 221 | + symbol written to the HB - the last symbol emitted when processing the |
| 222 | + previous token. |
| 223 | + * An offset of 1 means starting with the symbol preceding the one for |
| 224 | + offset 0, thus with the second newest symbol in the HB. |
| 225 | + * *MATCH* token may specify *length > offset* - in this case the decoder will |
| 226 | + have to copy not only old symbols, but also symbols produced when handling |
| 227 | + the current *MATCH* token, generating a repetitive sequence of characters, |
| 228 | + which resembles the behavior of a multisymbol *Run-Length Encoder*. |
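The field conventions above, including the overlapping *length > offset* case, can be checked with a small worked example. This Python sketch models the history buffer as a plain list with the newest symbol last; the function name `expand_match` is hypothetical.

```python
def expand_match(history, match_offset, match_length):
    """Return the symbols produced by one MATCH token.

    `history` lists the HB contents with the newest symbol last.
    `match_length` is stored minus one, so match_length + 1 symbols are copied.
    """
    buffer = list(history)
    for _ in range(match_length + 1):
        # Offset 0 is the newest symbol. Symbols produced by this very token
        # are appended first and can therefore be copied again later.
        buffer.append(buffer[-1 - match_offset])
    return buffer[len(history):]


# The HB holds "ab"; offset 1 starts at "a", and length field 5 copies 6 symbols.
print("".join(expand_match(["a", "b"], match_offset=1, match_length=5)))  # ababab

# Offset 0 repeats the newest symbol, like a run-length encoder.
print("".join(expand_match(["x"], match_offset=0, match_length=3)))  # xxxx
```

In the first call only two symbols exist in the history buffer when the token arrives, so four of the six output symbols are copies of symbols produced by the same token.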
| 229 | + |
| 230 | +### FSM |
| 231 | + |
| 232 | +A state diagram of the FSM is displayed below: |
| 233 | + |
| 234 | + |
| 235 | +* **RESET** - the initial state of the encoder. It initializes other state |
| 236 | + variables, resulting in a *cold* block start, and jumps to |
| 237 | + **HASH_TABLE_CLEAR**. |
| 238 | + * To facilitate testing in presence of |
| 239 | + [issue #1042](https://github.com/google/xls/issues/1042), there is a proc |
| 240 | + parameter that allows bypassing **HASH_TABLE_CLEAR** and jumping |
| 241 | + directly into the **RESTART** state. It should not be used in real |
| 242 | + implementations, as it makes the encoding of an independent block depend |
| 243 | + on the contents of the preceding data blocks, thus making encoding |
| 244 | + non-deterministic and potentially allowing data leaks across blocks. |
| 245 | +* **HASH_TABLE_CLEAR** - the encoder iterates here, clearing one word of HT RAM |
| 246 | + per tick. |
| 247 | +* **RESTART** - the encoder clears a small set of state variables, resulting in |
| 248 | + a *warm* start that preserves the necessary state (HB, HT) between |
| 249 | + dependent data blocks. |
| 250 | +* **FIFO_PREFILL** - the encoder fills the input FIFO with symbols. |
| 251 | + * If an *END* marker is observed, the FSM transitions into either |
| 252 | + **EMIT_END** or **FIFO_DRAIN**, depending on whether the FIFO already |
| 253 | + has multiple symbols in it and thus whether extra ticks are needed to drain |
| 254 | + all of them. |
| 255 | + * Handles input steps of the flowchart: __1, 4__. |
| 256 | +* **START_MATCH_0** - roughly corresponds to the *Starting a new match*, |
| 257 | + *Read potential matching symbol from HB*, *Check for match*, |
| 258 | + *Growing the match* parts of the flowchart. |
| 259 | + * Handles input steps: __1, 4__. |
| 260 | + * Handles match steps: __5, 7, 8, 10, 11, 12, 13__. |
| 261 | + * Step __14__ belongs here, but is skipped as an optimization, since step |
| 262 | + __10__ can initialize match variables properly from the start. |
| 263 | +* **START_MATCH_1** - "upper" counterpart of **START_MATCH_0**, necessary |
| 264 | + because two accesses to the same (single-port) RAM cannot be done in the |
| 265 | + same tick. |
| 266 | + * Handles step __9__, writing to the HT RAM. |
| 267 | + * Handles step __3__, writing to the HB RAM. |
| 268 | + * May emit *UNMATCHED_SYMBOL* token, step __17__. |
| 269 | +* **CONTINUE_MATCH_0** - mostly the same as **START_MATCH_0** except that it |
| 270 | + does not start a new match. |
| 271 | + * Handles input steps: __1, 4__. |
| 272 | + * Handles match steps __5, 11, 12, 16__. |
| 273 | +* **CONTINUE_MATCH_1** - "upper" counterpart of **CONTINUE_MATCH_0**. |
| 274 | + * Handles step __3__, writing to the HB RAM. |
| 275 | + * May emit *MATCH* token, step __15__. |
| 276 | +* **FIFO_DRAIN** - the encoder loops here, draining symbols from the FIFO |
| 277 | + and emitting *UNMATCHED_SYMBOL* tokens for them. |
| 278 | + * Symbols are also written to the HB to make them visible in case a new |
| 279 | + block is started after a *warm* restart. |
| 280 | +* **EMIT_END** - the encoder emits a single *END* token and transitions into |
| 281 | + the **RESTART** state. |
| 282 | +* **ERROR** - the state that is entered whenever an error condition is |
| 283 | + encountered. This can happen if, e.g., an *ERROR* marker is received from another |
| 284 | + block that precedes the encoder, or if an unknown (unsupported) marker is |
| 285 | + received. |
| 286 | + * The encoder receives and discards incoming symbols, with the exception of |
| 287 | + a *RESET* command marker, which is replicated on the output (so that other |
| 288 | + processing blocks can be reset) and makes the FSM transition to the **RESET** |
| 289 | + state. |