Inefficient phrase searches in modern versions #131

d-maurer · 2022-04-05T18:59:57Z

Phrase searches use "WidCode"s (i.e. "WordInDex" codes). A "WidCode" is a string which represents a sequence of integers. The representation is particularly efficient if the integers are small. Large integers may require up to 3 times the space of small integers.

In former versions, Lexicon tried hard to assign small integers as word indices. In modern versions, the word index is chosen randomly -- avoiding the values for which the "WidCode" is particularly efficient.

The source comment

Products.ZCatalog/src/Products/ZCTextIndex/Lexicon.py

Lines 145 to 148 in e033d4c

    
    # WidCode requires us to use at least 0x4000 as a base number.
    # The algorithm in versions before 2.13 used the length as a base
    # number. So we don't even try to generate numbers below the
    # length as they are likely all taken

may indicate the reason:
apparently, the author believed that word indices below 0x4000 must be avoided. However,

Products.ZCatalog/src/Products/ZCTextIndex/WidCode.py

Lines 68 to 72 in e033d4c

    
        n = len(wid2enc)
        return "".join([w < n and wid2enc[w] or _encode(w) for w in wids])

    _encoding = [None] * 0x4000  # Filled later, and converted to a tuple

shows that values below 0x4000 are precomputed and are therefore particularly efficient to encode (no computation, just a table lookup).

d-maurer added the enhancement label Apr 5, 2022
