Commit 2f30860

modules/dbe: add documentation for LZ4 encoder
Add a Markdown readme file and a couple of diagrams that explain how the encoder works. Signed-off-by: Roman Dobrodii <[email protected]>
1 parent 988d05b commit 2f30860

3 files changed

+289
-0
lines changed

xls/modules/dbe/docs/README.md

Lines changed: 289 additions & 0 deletions
@@ -0,0 +1,289 @@
1+
# Dictionary coder implementation in DSLX
2+
3+
A dictionary coder is a compression algorithm that compresses data by replacing
4+
sequences of symbols in the original data with pointers to sequences of symbols
5+
stored within a "dictionary" data structure that is known by the decoder.
6+
One example of such algorithms is LZ4.
7+
8+
This module implements encoder and decoder blocks for the LZ4 algorithm.
9+
10+
## LZ77 algorithms
11+
12+
LZ4 belongs to a broader class of LZ77-like encoders. LZ77 encoders usually
13+
have no preset dictionary; they build the dictionary as they compress data. Each
14+
input symbol is thus processed in the context of its own dictionary, and these
15+
dictionaries are in general different for any two symbols. The decoders perform
16+
a reverse process: they almost always start with an empty dictionary and build
17+
it as they are decoding the data, thus obtaining perfect reconstruction not
18+
only of the original raw data that was encoded, but also of the dictionary that
19+
should match the encoder's dictionary at every step.
20+
21+
Within the class of LZ77 algorithms, the dictionary is the buffer which contains
22+
up to `N` past raw symbols that have been already consumed by the encoder, or
23+
emitted by the encoder. Usually it is implemented as a circular buffer, and is
24+
called simply a *history buffer (HB)*.
25+
26+
An LZ77 encoder emits two types of tokens (we call them _tokens_ to differentiate
27+
them from _symbols_ which always refer to the original uncompressed data):
28+
- *UNMATCHED_SYMBOL*, which contains the original symbol as-is and which the
29+
decoder simply copies to the output, resulting in no data size reduction.
30+
- *MATCH*, which tells the decoder to copy a string of symbols from the history
31+
buffer to the output. As these symbols are not present in the token itself,
32+
this may result in a significant data size reduction. *MATCH* token contains
33+
an *offset-count* pair:
34+
- *Offset* tells the decoder how far back in the history buffer the string
35+
begins (usually 0 means "start with the last character", 1 is "the one
36+
before the last", etc.).
37+
- *Count* tells the decoder how many symbols to copy from the history buffer
38+
to the output. A count of `0` would produce an empty string, so a
39+
specific byte-level encoding of the token can instead store the
40+
`count - 1` value.
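
The two token types can be illustrated with a minimal decoder sketch. It is written in Python as a software model, not DSLX. The tuple layout of the tokens is an assumption made for illustration, and the history here is simply the whole output so far rather than a bounded buffer of `N` symbols.

```python
def lz77_decode(tokens):
    """Decode LZ77-style tokens into a list of symbols.

    Hypothetical token shapes used by this sketch:
      ("UNMATCHED_SYMBOL", symbol)
      ("MATCH", offset, count)   # offset 0 = most recently emitted symbol
    """
    out = []
    for token in tokens:
        kind = token[0]
        if kind == "UNMATCHED_SYMBOL":
            # The symbol travels inside the token and is copied as-is.
            out.append(token[1])
        elif kind == "MATCH":
            _, offset, count = token
            if offset >= len(out):
                raise ValueError("offset points before the start of the history")
            for _ in range(count):
                # The index is re-evaluated after every append, so the
                # source window slides along with the output.
                out.append(out[-1 - offset])
        else:
            raise ValueError("unknown token kind: %r" % (kind,))
    return out
```

For example, three unmatched symbols `a`, `b`, `c` followed by a match with offset 2 and count 3 decode to `abcabc`.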
41+
42+
## LZ4 algorithm
43+
44+
What sets the LZ4 algorithm apart from other LZ77-like algorithms (such as ALDC)
45+
is how the encoder finds matches in its history buffer. Instead of performing
46+
exhaustive search which is inefficient for algorithms running on the CPU, and
47+
complicated for ASIC implementations (e.g. silicon implementations of ALDC
48+
may use highly-custom CAM memory cores), it performs non-exhaustive search
49+
using a hash table.
50+
51+
LZ4 algorithm uses two random-addressable memory blocks:
52+
1. History buffer (also called *HB RAM*), which is the dictionary itself. HB
53+
stores raw data symbols, usually using a circular-buffer addressing scheme.
54+
- HB address bus is *MATCH_OFFSET_WIDTH* bits wide, thus it contains
55+
up to `(1 << MATCH_OFFSET_WIDTH)` symbols.
56+
- HB data word contains a single symbol, thus it is *SYMBOL_WIDTH* bits
57+
wide.
58+
- The size of the HB RAM is `(1 << MATCH_OFFSET_WIDTH) * SYMBOL_WIDTH` bits.
59+
2. Hash table (also called *HT*). It works as follows:
60+
- HT address is *HASH_WIDTH* bits wide; this value can be configured to
61+
slightly tune a footprint-vs-compression ratio tradeoff. The
62+
address into HT is usually created by hashing a small string that contains
63+
a small number of symbols (*HASH_SYMBOLS*) extracted in-order from the
64+
incoming data.
65+
- HT data word contains a pointer (an address) into the HB RAM at which that
66+
string of symbols may be found.
67+
- The size of the HT is `(1 << HASH_WIDTH) * MATCH_OFFSET_WIDTH` bits, since each word holds one HB address.
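
As a worked example of the sizing above, the following Python sketch computes both RAM footprints. It assumes each HT word stores exactly one HB address, that is, *MATCH_OFFSET_WIDTH* bits. The parameter values in the usage note are illustrative and are not taken from the module.

```python
def ram_sizes_bits(symbol_width, match_offset_width, hash_width):
    """Return the HB and HT RAM sizes in bits for the given parameters."""
    hb_words = 1 << match_offset_width   # one symbol per HB word
    ht_words = 1 << hash_width           # one HB pointer per HT word
    return {
        "hb_bits": hb_words * symbol_width,
        # Assumption: an HT word is exactly as wide as an HB address.
        "ht_bits": ht_words * match_offset_width,
    }
```

With 8-bit symbols, a 16-bit offset and a 13-bit hash, this gives 524288 bits (64 KiB) for the HB and 131072 bits (16 KiB) for the HT.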
68+
69+
In addition, because LZ4 wants to look ahead into the input data to hash it, it
70+
needs a small FIFO with parallel output of all the bits to store *HASH_SYMBOLS*
71+
input symbols and feed them to the hash function.
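
The hash step can be sketched in Python as follows. The hash function used by the DSLX module is not described here, so this sketch substitutes a multiplicative hash with the constant 2654435761 (the one commonly seen in software LZ4 implementations). The packing order and the 32-bit working width are likewise assumptions.

```python
def hash_symbols(symbols, symbol_width=8, hash_width=13):
    """Hash the FIFO contents into a hash_width-bit HT address.

    Assumes len(symbols) * symbol_width <= 32 and 1 <= hash_width <= 32.
    """
    if len(symbols) * symbol_width > 32:
        raise ValueError("window does not fit into 32 bits")
    mask = (1 << symbol_width) - 1
    word = 0
    for i, symbol in enumerate(symbols):
        # Pack the symbols into one word, first symbol in the low bits.
        word |= (symbol & mask) << (i * symbol_width)
    # Multiply modulo 2**32 and keep the top hash_width bits.
    product = (word * 2654435761) & 0xFFFFFFFF
    return product >> (32 - hash_width)
```

Taking the top bits of the product, rather than the bottom ones, lets every input symbol influence the resulting address.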
72+
73+
The flow of operation of the LZ4 algorithm is depicted in the flowchart below:
74+
![LZ4 encoder flowchart](lz4_encoder_flowchart.png)
75+
76+
A textual description of the same flow, with a bit more context:
77+
* __(1)__ Consume one input symbol.
78+
* __(2)__ If it's an EOF, end processing.
79+
* Otherwise:
80+
* __(3)__ Push it to the HB (dropping the oldest symbol from the HB).
81+
* __(4)__ Push it to the FIFO (dropping the oldest symbol from the FIFO).
82+
* __(5)__ Designate the current oldest symbol in the FIFO as *current_symbol*,
83+
calculate *current_ptr* - the location of this symbol in the HB.
84+
* __(6)__ If we do not have an existing matching string that we're trying to
85+
grow, try to start a new matching string:
86+
* __(7)__ Compute hash of the data contained in the FIFO.
87+
* __(8)__ Load *candidate_ptr* pointer from the HT.
88+
* __(9)__ Store *current_ptr* to the HT.
89+
* __(10)__ Initiate a new matching string: calculate offset from a difference
90+
of *candidate_ptr* and *current_ptr*, set length to zero.
91+
* Here, *candidate_ptr* points to the string in history buffer, the
92+
beginning of which *potentially* matches the current input string which
93+
starts with *current_symbol* (the first symbols of the current input
94+
string are stored in the FIFO, and we've just hashed them and performed a
95+
HT lookup to find a similar sequence of symbols in the HB).
96+
* The match is not guaranteed since there can be a hash collision - the
97+
hash is the same, but actual symbols pointed by *candidate_ptr* differ,
98+
so we need to check each of them if it matches the one in the FIFO.
99+
In addition to that, we'd like to grow a matching string longer than
100+
what's contained in the FIFO.
101+
* As shown below, the same procedure is used to check matching of the
102+
beginning of the string, as well as of its continuation.
103+
* Here, *current_ptr* points to the *current_symbol* - the next input symbol
104+
for which no output token has been emitted, while *candidate_ptr* points to
105+
an old symbol in the HB which we'd like to compare with *current_symbol*.
106+
* __(11)__ We load a symbol from the HB pointed by *candidate_ptr*, it
107+
becomes a *candidate_symbol*.
108+
* __(12)__ The *candidate_symbol* is compared with *current_symbol*.
109+
* __(13)__ If it's a match, continue this matching string:
110+
* __(14)__ Increment the match length by 1.
111+
* Go to step __(1)__ .
112+
* If it's not a match and we've been already growing a matching string:
113+
* __(15)__ Emit a *MATCH* token for the current matching string.
114+
* __(16)__ Terminate current matching string.
115+
* Go to step __(5)__ - this re-processes current input symbol one more time
116+
(this is done for two reasons: no output token has been emitted for
117+
that symbol yet, and it must be emitted at some point, and also
118+
this symbol may be able to start a new matching string on its own).
119+
* Otherwise:
120+
* __(17)__ Emit an *UNMATCHED_SYMBOL* token for the current symbol.
121+
* Go to step __(1)__.
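
The flow above can be approximated by a short software model in Python. This is a simplified sketch rather than a cycle-accurate model of the encoder: it works on a whole input list instead of a symbol stream, keeps an unbounded history, models the HT as a dictionary, and uses a toy polynomial hash. All function names and the tuple token layout are hypothetical.

```python
def window_hash(window, hash_width):
    """Toy polynomial hash of a window of symbols (an assumption)."""
    value = 0
    for i, symbol in enumerate(window):
        value += symbol * (31 ** i)
    return value % (1 << hash_width)


def lz4_like_encode(data, hash_symbols=4, hash_width=13):
    """Greedy hash-table-based matcher emitting LZ77-style tokens."""
    tokens = []
    table = {}  # models the HT: hash -> position of an earlier window
    pos = 0
    while pos < len(data):
        candidate = None
        if pos + hash_symbols <= len(data):
            key = window_hash(data[pos:pos + hash_symbols], hash_width)  # step 7
            candidate = table.get(key)                                   # step 8
            table[key] = pos                                             # step 9
        length = 0
        if candidate is not None:
            # Steps 11-14: compare symbol by symbol. A hash collision simply
            # yields length 0. The match may run past `pos` (overlap).
            while (pos + length < len(data)
                   and data[candidate + length] == data[pos + length]):
                length += 1
        if length > 0:
            # Offset 0 refers to the symbol right before the current one.
            tokens.append(("MATCH", pos - candidate - 1, length))        # step 15
            pos += length
        else:
            tokens.append(("UNMATCHED_SYMBOL", data[pos]))               # step 17
            pos += 1
    return tokens


def decode(tokens):
    """Inverse of lz4_like_encode, used to check the round trip."""
    out = []
    for token in tokens:
        if token[0] == "UNMATCHED_SYMBOL":
            out.append(token[1])
        else:
            _, offset, length = token
            for _ in range(length):
                out.append(out[-1 - offset])
    return out
```

As in the flowchart, the HT is only consulted and updated when a new matching string is started, so positions inside an ongoing match are never entered into the table. This costs some compression ratio but keeps the number of HT accesses per symbol low.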
122+
123+
## DSLX implementation
124+
125+
The LZ4 encoder in DSLX is implemented as an FSM-like machine that changes state at most
126+
once per "tick". It takes the form of a _proc_.
127+
128+
Compared to the reference flowchart depicted above, this FSM is a bit more
129+
complicated as it has to take into account some corner cases:
130+
* Prefilling the FIFO with data before beginning the processing.
131+
* Correct handling of the current symbol when it runs into an EOF.
132+
* Draining the FIFO after EOF has been observed.
133+
* Allowing "warm" restart of the algorithm after EOF to allow compression of
134+
*dependent* blocks - that is, without resetting the HT and the HB between the
135+
blocks (this allows tokens from the current block to refer to HB symbols of the
136+
previous block).
137+
* System-level integration considerations to allow this block to be chained
138+
with other XLS-based data pre-/postprocessors:
139+
* *MATCHED_SYMBOL* tokens. Whenever the encoder adds a symbol to the matching
140+
string, it emits a *MATCHED_SYMBOL* token, which can be used by the
141+
postprocessing blocks to reconstruct symbols encoded by *MATCH* tokens
142+
without having to decode *MATCH*es (such decoding would require a full-size
143+
history buffer RAM). These tokens should not be stored into the final
144+
encoded block.
145+
* Support for a limited form of in-band control signaling via
146+
*marker tokens*:
147+
* *END* marker that signals end of block (EOF condition).
148+
* *ERROR_* family of markers that allow passing error codes between
149+
processing blocks. These tokens abort the encoding.
150+
* *RESET* marker that performs a *cold* restart of the encoder (clearing HT
151+
RAM) and other blocks in the chain, also clearing all error conditions.
152+
153+
One example of a post-processor module can be a *block writer* proc that
154+
gathers tokens produced by the encoder and encodes them using a standardized
155+
byte-oriented *LZ4 Block Format*. Implementation of alternative encoding
156+
schemes may be of interest as well, as the header format used by the standard
157+
LZ4 requires one to know the number of unmatched symbols between two matches
158+
before those symbols are emitted, making bufferless stream processing
159+
difficult.
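
The buffering problem can be seen in a small Python sketch of the first step a block writer would need: grouping tokens into "run of unmatched symbols, then one match" sequences. The function name and the tuple token layout are hypothetical, and byte-level LZ4 encoding is intentionally left out.

```python
def group_into_sequences(tokens):
    """Group tokens into (unmatched_symbols, match) sequences.

    `match` is an (offset, length) pair, or None for a trailing run
    of unmatched symbols that is not followed by a match.
    """
    sequences = []
    pending = []  # unmatched symbols must wait here until their count is known
    for token in tokens:
        kind = token[0]
        if kind == "UNMATCHED_SYMBOL":
            pending.append(token[1])
        elif kind == "MATCH":
            # Only now is the number of preceding unmatched symbols known,
            # so only now could a header for this sequence be written.
            sequences.append((pending, (token[1], token[2])))
            pending = []
        # MATCHED_SYMBOL and MARKER tokens are ignored in this sketch.
    if pending:
        sequences.append((pending, None))
    return sequences
```

The `pending` list is the buffer that a bufferless stream processor cannot have. Its worst-case size is bounded only by the block length.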
160+
161+
### Data format
162+
163+
#### PlainData
164+
165+
The encoder consumes a stream of raw symbols, intermixed with control markers.
166+
This is represented using a parametrized `PlainData` DSLX structure:
167+
```rust
168+
pub struct PlainData<DATA_WIDTH: u32> {
169+
    is_marker: bool,
170+
    data: uN[DATA_WIDTH],
171+
    mark: Mark,
172+
}
173+
```
174+
175+
* *is_marker* tells whether this object is a symbol or a marker.
176+
* *data* communicates a symbol whenever *is_marker* is not set.
177+
* *mark* communicates a control mark whenever *is_marker* is set.
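
To illustrate how symbols and markers share one stream, here is a Python mirror of the structure together with a helper that frames a block of symbols with a terminating *END* marker. The members of `Mark` shown here are placeholders; the actual `Mark` enum and its codes are not listed in this document.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Mark(Enum):
    # Hypothetical members for illustration only.
    NONE = auto()
    END = auto()
    RESET = auto()


@dataclass
class PlainData:
    is_marker: bool
    data: int = 0            # meaningful only when is_marker is False
    mark: Mark = Mark.NONE   # meaningful only when is_marker is True


def frame_block(symbols):
    """Turn raw symbols into a PlainData stream terminated by an END marker."""
    items = [PlainData(is_marker=False, data=s) for s in symbols]
    items.append(PlainData(is_marker=True, mark=Mark.END))
    return items
```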
178+
179+
#### Token
180+
181+
The encoder produces a stream of tokens. There are four types of them:
182+
* *MATCH* is an *offset-length* pair that represents a sequence of
183+
symbols that is the same as the specified sequence within HB.
184+
* *UNMATCHED_SYMBOL* represents a symbol for which no match was found in the
185+
HB. It will be encoded as a raw symbol in the final piece of encoded data.
186+
* *MATCHED_SYMBOL* is a symbol that is encoded within the next *MATCH* token.
187+
Its intended use is to allow easy postprocessing of a stream of tokens
188+
without a need for a full-fledged and heavy *MATCH* decoder.
189+
* *MARKER* contains a control mark code.
190+
191+
Tokens are represented using the following enum and structure in DSLX:
192+
```rust
193+
pub enum TokenKind : u2 {
194+
UNMATCHED_SYMBOL = 0,
195+
MATCHED_SYMBOL = 1,
196+
MATCH = 2,
197+
MARKER = 3,
198+
}
199+
200+
pub struct Token<
201+
    SYMBOL_WIDTH: u32, MATCH_OFFSET_WIDTH: u32, MATCH_LENGTH_WIDTH: u32
202+
> {
203+
    kind: TokenKind,
204+
    symbol: uN[SYMBOL_WIDTH],
205+
    match_offset: uN[MATCH_OFFSET_WIDTH],
206+
    match_length: uN[MATCH_LENGTH_WIDTH],
207+
    mark: Mark
208+
}
209+
```
210+
211+
* *kind* specifies one of the four token kinds.
212+
* *symbol* contains symbol value for *UNMATCHED_SYMBOL* and *MATCHED_SYMBOL*
213+
tokens.
214+
* *match_offset* and *match_length* are valid only for a *MATCH* token:
215+
* Length is the length of the string that has to be copied from the history
216+
buffer, minus one. That is, *match_length=0* specifies a string of 1
217+
symbol, *match_length=3* a string of 4 symbols, etc.
218+
* Offset points to the beginning of the string that has to be copied from
219+
HB.
220+
* An offset of 0 means that the first symbol to be copied is the last
221+
symbol written to the HB - the last symbol emitted when processing the
222+
previous token.
223+
* An offset of 1 means starting with the symbol preceding the one for
224+
offset 0, thus with the second newest symbol in the HB.
225+
* *MATCH* token may specify *length > offset* - in this case the decoder will
226+
have to copy not only old symbols, but also symbols produced when handling
227+
the current *MATCH* token, generating a repetitive sequence of characters,
228+
which resembles the behavior of a multisymbol *Run-Length Encoder*.
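
The following Python sketch mirrors the token structure and shows two ways of recovering the raw symbols: with a history buffer, by expanding *MATCH* tokens, and without one, by reading *UNMATCHED_SYMBOL* and *MATCHED_SYMBOL* tokens only. The class and function names are hypothetical, `mark` is reduced to a plain integer placeholder, and the history is unbounded.

```python
from dataclasses import dataclass
from enum import IntEnum


class TokenKind(IntEnum):
    UNMATCHED_SYMBOL = 0
    MATCHED_SYMBOL = 1
    MATCH = 2
    MARKER = 3


@dataclass
class Token:
    kind: TokenKind
    symbol: int = 0
    match_offset: int = 0
    match_length: int = 0  # stored as (number of symbols to copy) - 1
    mark: int = 0          # placeholder for the Mark enum


def decode_with_history(tokens):
    """Expand MATCH tokens using previously decoded output as the HB."""
    out = []
    for t in tokens:
        if t.kind == TokenKind.UNMATCHED_SYMBOL:
            out.append(t.symbol)
        elif t.kind == TokenKind.MATCH:
            for _ in range(t.match_length + 1):
                # When length > offset this re-reads symbols produced by
                # this very token, which yields a repeating pattern.
                out.append(out[-1 - t.match_offset])
        # MATCHED_SYMBOL and MARKER tokens carry no extra data here.
    return out


def decode_without_history(tokens):
    """Recover the same symbols without any history buffer."""
    kinds = (TokenKind.UNMATCHED_SYMBOL, TokenKind.MATCHED_SYMBOL)
    return [t.symbol for t in tokens if t.kind in kinds]
```

For the input `abababab`, a plausible token stream is two unmatched symbols, six *MATCHED_SYMBOL* tokens, and one *MATCH* with *match_offset=1* and *match_length=5*. Both functions then return the same eight symbols.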
229+
230+
### FSM
231+
232+
A state diagram of the FSM is displayed below:
233+
![LZ4 encoder states](lz4_encoder_states.png)
234+
235+
* **RESET** - the initial state of the encoder. It initializes other state
236+
variables, resulting in a *cold* block start and jumps to
237+
**HASH_TABLE_CLEAR**.
238+
* To facilitate testing in presence of
239+
[issue #1042](https://github.com/google/xls/issues/1042), there is a proc
240+
parameter that allows bypassing **HASH_TABLE_CLEAR** and jumping
241+
directly into the **RESTART** state. It should not be used in real
242+
implementations as it will make encoding of an independent block to depend
243+
on the contents of the preceding data blocks, thus making encoding
244+
non-deterministic and potentially allowing for data leaks across blocks.
245+
* **HASH_TABLE_CLEAR** - encoder iterates here, clearing one word of HT RAM
246+
per tick.
247+
* **RESTART** - the encoder clears a small set of state variables, resulting in
248+
a *warm* start that preserves the necessary state (HB, HT) between
249+
dependent data blocks.
250+
* **FIFO_PREFILL** - the encoder fills the input FIFO with symbols.
251+
* If an *END* marker is observed, the FSM will transition into either
252+
**EMIT_END** or **FIFO_DRAIN** - this depends on whether the FIFO already
253+
has multiple symbols in it and thus whether extra ticks are needed to drain
254+
all of them.
255+
* Handles input steps of the flowchart: __1, 4__.
256+
* **START_MATCH_0** - roughly corresponds to the *Starting a new match*,
257+
*Read potential matching symbol from HB*, *Check for match*,
258+
*Growing the match* parts of the flowchart.
259+
* Handles input steps: __1, 4__.
260+
* Handles match steps: __5, 7, 8, 10, 11, 12, 13__.
261+
* Step __14__ belongs here, but is skipped as an optimization, since step
262+
__10__ can initialize match variables properly from the start.
263+
* **START_MATCH_1** - "upper" counterpart of **START_MATCH_0**, necessary
264+
because two accesses to the same (single-port) RAM can not be done in the
265+
same tick.
266+
* Handles step __9__, writing to the HT RAM.
267+
* Handles step __3__, writing to the HB RAM.
268+
* May emit *UNMATCHED_SYMBOL* token, step __17__.
269+
* **CONTINUE_MATCH_0** - mostly the same as **START_MATCH_0** except that it
270+
does not start a new match.
271+
* Handles input steps: __1, 4__.
272+
* Handles match steps __5, 11, 12, 16__.
273+
* **CONTINUE_MATCH_1** - "upper" counterpart of **CONTINUE_MATCH_0**.
274+
* Handles step __3__, writing to the HB RAM.
275+
* May emit *MATCH* token, step __15__.
276+
* **FIFO_DRAIN** - the encoder loops here, draining symbols from the FIFO
277+
and emitting *UNMATCHED_SYMBOL* tokens for them.
278+
* Symbols are also written to the HB to make them visible in case a new
279+
block is started after a *warm* restart.
280+
* **EMIT_END** - the encoder emits a single *END* token and transitions into
281+
a **RESTART** state.
282+
* **ERROR** - state that is entered whenever an error condition is
283+
encountered. This can happen if, e.g., an *ERROR* marker is received from another
284+
block that precedes the encoder, or if an unknown (unsupported) marker is
285+
received.
286+
* The encoder receives and discards incoming symbols, with the exception of
287+
a *RESET* command marker that is replicated on the output (so that other
288+
processing blocks can be reset) and makes the FSM transition to the **RESET**
289+
state.