Localization Pattern
Although a visual pattern may consist of any set of symbols (such as micro-dots at various offsets), binary pixel blocks were the obvious choice for this particular application, primarily because they can be sampled from very low resolution images, which in turn significantly reduces real-time processing requirements.
For a pattern to be usable for positioning, the minimum requirement is that every local window of that pattern must be unique; otherwise the mapping from local pattern to position is not one-to-one, leading to ambiguities.
In the 1D case, a valid positioning sequence can be defined as any string where every substring of length n is unique. The longest such strings for a given code length n are the De Bruijn Sequences: cyclic sequences of length 2^n that contain every possible code as a substring exactly once. There exist various methods to construct De Bruijn Sequences that will not be detailed here. The significant takeaway is that 2^n is the theoretical maximum number of positions that can be decoded from an n-bit code.
A 2D generalization, the De Bruijn Torus, also exists. A torus for n x n windows contains 2^(n^2) cells and densely contains every n x n binary matrix exactly once. This would seem to be the ideal basis for a 2D positioning pattern, but in practice there are no known algorithms to decode them efficiently. Another shortcoming is that because the pattern is maximally dense, there is no slack to detect or correct decoding errors, since every single bit flip maps to another valid position.
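As an illustration of the window-uniqueness property, the following small sketch (not part of the implementation) verifies that every n-bit cyclic window of a sequence occurs exactly once, using the standard De Bruijn sequence B(2, 3):

```python
def has_unique_windows(bits, n):
    """Return True if every n-bit window of the cyclic string `bits`
    occurs exactly once."""
    ext = bits + bits[:n - 1]          # wrap around for cyclic windows
    seen = set()
    for i in range(len(bits)):
        w = ext[i:i + n]
        if w in seen:
            return False
        seen.add(w)
    return True

# B(2, 3) = 00010111: all eight 3-bit codes occur exactly once.
assert has_unique_windows("00010111", 3)
```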
The structure of the 2D pattern is primarily inspired by a method by Bruckstein et al, where the outer product of two 1D sequences is taken using XOR as the composition operator. For example:
|       | 1 | 0 | 0 | 1 |
|-------|---|---|---|---|
| **1** | 0 | 1 | 1 | 0 |
| **0** | 1 | 0 | 0 | 1 |
| **1** | 0 | 1 | 1 | 0 |
| **0** | 1 | 0 | 0 | 1 |
This mapping provides a simple and efficient way to convert between 1D sequences and 2D patterns while preserving the uniqueness property necessary for localization. Aside from the native hardware performance of XOR operations, a major benefit is that decoding a 2D pattern reduces to separate 1D problems for each axis, each of which is much simpler. With this scheme, 2^(2n) positions can be decoded from an (n x n)-bit block. This is far below the theoretical maximum density, but allows for high levels of error correction via redundancy across all rows and columns. However, since the XOR operator is only invertible up to a parity, the 2D to 1D conversion leaves an ambiguity as to whether the resulting 1D codes are bitwise inverted. This will later be addressed by extending the uniqueness property of the source sequence to bitwise inversions as well.
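As a minimal sketch of this construction (names are illustrative, not taken from the implementation), the block is simply the element-wise XOR of a column code against a row code:

```python
def xor_outer_product(col_code, row_code):
    """Build the 2D block where block[i][j] = col_code[i] XOR row_code[j]."""
    return [[c ^ r for r in row_code] for c in col_code]

# Reproduces the example table above: row code 1 0 0 1 across the top,
# column code 1 0 1 0 down the side.
block = xor_outer_product([1, 0, 1, 0], [1, 0, 0, 1])
# block == [[0, 1, 1, 0],
#           [1, 0, 0, 1],
#           [0, 1, 1, 0],
#           [1, 0, 0, 1]]
```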
To recover the pair of 1D codes from a 2D block generated as described above, it is sufficient to take any row and column directly as the solution (up to an unknown parity). In practice however, this provides no redundancy against bit errors, so it is desirable to account for all rows and columns and produce the most likely output.
One mathematically sound method for factorizing the 2D block is a rank-1 truncated Singular Value Decomposition. This can be performed on a binary matrix by homomorphically mapping {0, 1} under XOR to {+1, -1} under multiplication. It can be thought of as inverting the outer product while optimizing for least-squares error. Although this approach illustrates the desired flavor of method, it is not the one used, because it is hard to justify including a heavy linear algebra library in an implementation intended for micro-controllers.
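For illustration only (this is not the method used, and it assumes NumPy is available), the SVD-based factorization described above might look like the following sketch:

```python
import numpy as np

def svd_factor(block):
    """Rank-1 factorization of an XOR outer-product block.

    {0, 1} under XOR maps homomorphically to {+1, -1} under multiplication,
    so a noise-free block becomes a rank-1 +/-1 matrix. The signs of the
    dominant singular vectors recover both codes, up to a joint bitwise
    inversion (the parity ambiguity).
    """
    m = 1 - 2 * np.asarray(block)             # 0 -> +1, 1 -> -1
    u, _, vt = np.linalg.svd(m)
    col_code = (u[:, 0] < 0).astype(int)      # signs of dominant left vector
    row_code = (vt[0] < 0).astype(int)        # signs of dominant right vector
    return col_code, row_code
```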
The method used in the current implementation is instead based on binary vector clustering. Borrowing machine learning terminology, the dataset (ie the 2D matrix/block) consists of a set of samples (ie rows), each containing a set of binary features (ie columns). The goal is to assign each sample to one of two clusters such that the total assignment error is minimized. The assignment of each row to a binary class then maps directly to an individual bit of the output 1D column code. To obtain the row code, the process is simply repeated with the input transposed.
The clustering algorithm is based on a fast and simple method, K-Means. Instead of Euclidean distance, Hamming distance is used, since bitwise operations on binary vectors are more efficient. Another optimization is that the two cluster centroids are assumed to be bitwise complements of each other, which halves the number of distance computations and clusters to keep track of.
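A minimal sketch of this clustering step follows (rows are packed into integers; function and parameter names are illustrative, and the real implementation operates on fixed-width machine words):

```python
def factor_axis(rows, width, iters=8):
    """Assign each row to one of two complementary clusters; the per-row
    assignments are the bits of one 1D code, up to parity. Repeat on the
    transposed block to obtain the other code."""
    mask = (1 << width) - 1
    centroid = rows[0]                      # arbitrary initialization
    bits = [0] * len(rows)
    for _ in range(iters):
        # Assignment: a row closer to ~centroid than to centroid (Hamming
        # distance via popcount) belongs to the complementary cluster.
        bits = [1 if bin(r ^ centroid).count("1") * 2 > width else 0
                for r in rows]
        # Update: complement the rows of the opposite cluster, then take a
        # per-column majority vote to form the new centroid.
        votes = [0] * width
        for r, b in zip(rows, bits):
            v = r ^ mask if b else r
            for j in range(width):
                votes[j] += (v >> j) & 1
        centroid = sum((2 * votes[j] > len(rows)) << j for j in range(width))
    return bits
```

Running this on the rows of the 4 x 4 example block above (packed MSB-first as 0b0110, 0b1001, 0b0110, 0b1001) yields 0 1 0 1, which is the column code up to the expected parity.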
A 2D localization system is required to deduce not only the x and y positions, but also the orientation/heading. From purely image-based alignment of rows and columns, a rotation estimate of the camera frame with respect to the localization pattern can be obtained up to the nearest 90 degrees. What remains ambiguous is which of the four cardinal directions is which.
An even lower-level problem is that knowing which way is up/down/left/right is required to decode the binary pattern in the first place: a block rotated by 90 degrees will be decoded with the x and y axes swapped, and a block rotated by 180 degrees will be decoded with the codes reversed/flipped.
Other visual codes usually have markers external to the binary encoding that indicate the correct direction, for example the large position markers on three corners of a QR code. Marking three corners is obviously unsuited to a continuous sliding-window code, where there are no boundaries. Another possible solution is to scatter markers, akin to small arrows, across the whole pattern at a density that ensures every sliding window of a fixed size contains at least one marker. This has the downside of requiring higher image resolution and more processing to robustly recognize the marker, as well as obscuring the underlying binary code.
The more elegant solution implemented instead is to embed directionality into the underlying binary sequence itself, allowing the direction of the bit pattern to be decoded along with the position. This is done by extending the uniqueness property of the 1D sequence to bitwise reversals as well: for example, by ensuring that if the substring 001 exists, then 100 does not. Later, when a query for 100 fails, the bitwise reversal is attempted instead, returning the position of 001 along with the knowledge that the input sequence is reversed. It is then possible to infer the correct quadrant of rotation from the direction of the sequences corresponding to each axis.
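A sketch of how this lookup-with-fallback could be organized (the `lookup` function is a hypothetical query into the sequence index, and names are illustrative):

```python
def resolve(code, n, lookup):
    """Try the code and its transformed variants against the index.

    Returns (position, is_reversed, is_inverted), from which the 180-degree
    orientation and the XOR parity can be inferred, or None if no variant
    is present in the sequence.
    """
    mask = (1 << n) - 1
    rev = int(format(code, f"0{n}b")[::-1], 2)       # bitwise reversal
    for candidate, is_reversed, is_inverted in (
        (code,        False, False),
        (rev,         True,  False),
        (code ^ mask, False, True),
        (rev ^ mask,  True,  True),
    ):
        pos = lookup(candidate)
        if pos is not None:
            return pos, is_reversed, is_inverted
    return None
```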
It is established in the previous sections that the underlying 1D localization sequence must satisfy the following properties:
- Every substring of n bits is unique.
  - So that each code corresponds to a single position.
- If a substring of n bits exists, then the bitwise complement of that substring does not exist.
  - To resolve the parity ambiguity resulting from the XOR binary matrix factorization.
- If a substring of n bits exists, then the bitwise reversal of that substring does not exist.
  - To determine the direction the code is read from (ie whether the camera is rotated by 180 degrees).
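A small sketch of a checker for these properties over a candidate (non-cyclic) sequence is shown below; handling of self-symmetric windows such as palindromes is glossed over here:

```python
def satisfies_properties(bits, n):
    """Check that no n-bit window of `bits` (a 0/1 string) collides with a
    previously seen window, nor with its complement, reversal, or
    reversed complement."""
    seen = set()
    comp_table = str.maketrans("01", "10")
    for i in range(len(bits) - n + 1):
        w = bits[i:i + n]
        variants = {w, w.translate(comp_table), w[::-1],
                    w[::-1].translate(comp_table)}
        if variants & seen:
            return False
        seen |= variants
    return True
```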
Another potential property to consider is the uniqueness of substrings at different scales, in order to determine the distance between the camera and the floor pattern. However, this is highly complicated since substrings of all lengths must be considered. The method used for scale estimation in the actual implementation is instead based on statistically comparing the number of bit errors when decoding the input scaled to various lengths.
With the required properties defined, the next question is how to find such a sequence. There is no known algorithm in the literature that precisely addresses this, particularly the directional property as defined above (please contact the author if you know otherwise). So we are left to design one, from an engineering approach rather than a theoretical one, since the concern is practical application.
The method implemented takes inspiration from a related class of sequences called Maximum Length Sequences (MLS/m-sequences), which have a uniqueness property akin to a De Bruijn sequence: every code of length n occurs exactly once as a substring of the cyclic sequence (the all-zero string being excluded). They can be generated by Linear-Feedback Shift Registers (LFSR), a simple iterative mechanism where a shift register of n bits initially contains a seed state, a new bit is computed as a linear function of the current state, and that bit is fed back into the shift register to generate the next state. The correct choice of linear function ensures a "maximum length" sequence, where all 2^n - 1 nonzero states are traversed before the cycle repeats. LFSRs are widely used in digital hardware as an efficient means of generating long pseudo-random binary sequences: although the process is deterministic, the next output bit appears random over a long enough sequence.
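A textbook Fibonacci LFSR is sketched below with an illustrative 4-bit register (taps 0 and 1 correspond to the maximal-length polynomial x^4 + x^3 + 1; the register width and taps of the actual implementation are not implied here):

```python
def lfsr_bits(seed, taps, n):
    """Yield the output bits of an n-bit Fibonacci LFSR for one full period.

    The feedback bit is the XOR of the state bits listed in `taps`; with a
    maximal-length tap choice the register visits all 2**n - 1 nonzero
    states before repeating.
    """
    state = seed
    for _ in range(2 ** n - 1):
        yield state & 1                       # output the bit shifted out
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        state = (state >> 1) | (feedback << (n - 1))

# 15-bit m-sequence from a 4-bit register.
sequence = list(lfsr_bits(seed=0b0001, taps=(0, 1), n=4))
```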
The idea is to modify the LFSR to generate binary sequences with all of the required properties, namely uniqueness of each fixed-length substring along with its bitwise complement and reversal. This is done by formulating a function that generates the next bit of the sequence while actively filtering out outputs that would violate the uniqueness properties. A valid sequence is thereby extended one bit at a time from an initial base sequence. Since the bit-length of codes for this application is relatively short, the uniqueness checks can be performed efficiently using a lookup table. In order to find the longest sequences, it would be desirable to avoid binary chains that have been explored before, but in practice the search space is too large to remember all previously traversed solutions, and shared data makes the search harder to parallelize. Instead, a randomized approach is used: the initial seed sequence is generated randomly, and a guiding heuristic selects each next bit with probability inversely proportional to the number of times that transition state has been visited. A minor implementation detail is that the sequence can be extended in both directions, yielding longer solutions, but the idea remains the same.
This procedure claims no guarantee of finding the longest valid sequence, although the longest found will approach it given enough exploration and sampling. For a given code length, the longest sequences found so far are about 1/10 the length of the corresponding De Bruijn sequence, which is the theoretical maximum when only the first uniqueness property is required. This translates to an additional cost of roughly 3.5 bits per code window to satisfy the bit-complement and bit-reversal uniqueness. One redeeming aspect is that codes which are less dense (higher average Hamming distance) have better bit error detection properties, which is equally important in application.
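A simplified sketch of the filtered extension is given below (random tie-breaking stands in for the visit-count heuristic, and the bidirectional growth and restarts of the actual search are omitted):

```python
import random

def variants(code, n):
    """The code together with its complement, reversal, and reversed
    complement; all four must be treated as one equivalence class."""
    mask = (1 << n) - 1
    rev = int(format(code, f"0{n}b")[::-1], 2)
    return {code, code ^ mask, rev, rev ^ mask}

def grow_sequence(n, max_len=100000):
    """Greedily extend a random n-bit seed one bit at a time, rejecting any
    bit whose new window collides with a previously used window (or any of
    its variants). Stops at the first dead end."""
    bits = [random.randint(0, 1) for _ in range(n)]
    used = variants(int("".join(map(str, bits)), 2), n)
    while len(bits) < max_len:
        tail = int("".join(map(str, bits[-(n - 1):])), 2)
        options = [b for b in random.sample([0, 1], 2)
                   if ((tail << 1) | b) not in used]
        if not options:
            break                              # dead end; a restart would follow
        bits.append(options[0])
        used |= variants((tail << 1) | options[0], n)
    return bits
```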
Decoding in this section refers to querying the position at which a given input code uniquely occurs, or whether it exists at all in the sequence. An efficient procedure is important since this is the basis of the overall localization routine. Unfortunately, as per the previous section, generating sequences that satisfy the required properties is already hard enough. Moreover, the nature of LFSR-based generation methods is to produce pseudo-random sequences, which are by definition intractable to reverse. This rules out any hope of cleverly deriving efficient analytical methods, or even automated systems for doing so (ie one can hardly train a regression model or neural network to extract patterns out of randomness).
However, the fallback of building and storing a reverse lookup index turns out to be quite feasible given a few optimizations. Say we design for a localization coverage area of 50m by 50m with every bit resolved at 1mm (more than sufficient for the intended use case). That requires 50k millimetres per axis, and accordingly a sequence of 50k bits. The shortest code length discovered that produces a valid sequence longer than 50k bits is 20 bits. Each index entry should therefore store a 20-bit key (the code) and a 16-bit value (the position index, ~log2(50k) bits). The naive approach of a full lookup table would require 2^20 entries of 6 bytes each (uint32 key + uint16 value), resulting in ~6.3MB. 6MB of storage is too much to ask of low-cost micro-controllers, but serves as a baseline for comparison.
A possible space optimization is to use a hash table instead of a full lookup table. This reduces the number of required entries to be proportional to the number of unique codes (ie 50k). Assuming contiguous memory allocation and a standard load factor of 0.7, the storage required would be 50k entries times 6 bytes per entry divided by 0.7, resulting in ~430KB. This is a drastic improvement, but still in a range where a premium is paid for embedded storage.
The method used in the current implementation is instead based on binary search. This requires exactly as many entries as there are unique codes, with the entries sorted by key/code beforehand. At first glance the storage requirement is similar to a hash table, only without the load factor overhead. The key insight is that, given the original sequence in the form of a bit-vector, it is efficient to read off the code at any position index. This removes the need to store the key/code with each entry, since it can be obtained from the value/position index with a single lookup. As a result, each entry requires only the 2 bytes to store the position, so 50k entries of 2 bytes each plus the 50k bits of the original sequence vector totals ~100kB. Staying under 128kB of flash is much more tolerable on a low-end micro-controller. Also note that many applications do not require coverage of 50m by 50m at millimetre resolution and can use shorter codes accordingly, with the added benefit of more robustness from pixel redundancy.
The argument for not storing keys can be made for hash tables as well, but such a modification would require a custom implementation that is considerably more complex than the binary search equivalent, and there is no empirical evidence yet indicating better performance in the intended environment and at the intended data sizes.
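A sketch of this indexing scheme (names are illustrative; the sequence is shown as a list of 0/1 values rather than a packed bit-vector):

```python
def code_at(seq, pos, n):
    """Read the n-bit code starting at `pos` from the sequence, MSB first."""
    c = 0
    for b in seq[pos:pos + n]:
        c = (c << 1) | b
    return c

def build_index(seq, n):
    """Offline: positions sorted by the code that starts at each position."""
    return sorted(range(len(seq) - n + 1), key=lambda p: code_at(seq, p, n))

def find_position(seq, index, n, query):
    """Binary search over the sorted positions; the key is never stored but
    re-read from the sequence at every probe."""
    lo, hi = 0, len(index)
    while lo < hi:
        mid = (lo + hi) // 2
        c = code_at(seq, index[mid], n)
        if c < query:
            lo = mid + 1
        elif c > query:
            hi = mid
        else:
            return index[mid]
    return None                                # code not present
```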