Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inefficient phrase searches in modern versions #131

Open
d-maurer opened this issue Apr 5, 2022 · 0 comments
Open

Inefficient phrase searches in modern versions #131

d-maurer opened this issue Apr 5, 2022 · 0 comments

Comments

@d-maurer
Copy link
Contributor

d-maurer commented Apr 5, 2022

Phrase searches use "WidCode"s (i.e. "WordInDex" codes). A "WidCode" is a string which represents a sequence of integers. The representation is particular efficient if the integers are small. Large integers may require up to 3 times the space of small integers.

In former versions, Lexicon tried hard to assign small integers as word indices. In modern versions, the word index is chosen randomly -- avoiding the values for which the "WidCode" is particularly efficient.

The source comment

# WidCode requires us to use at least 0x4000 as a base number.
# The algorithm in versions before 2.13 used the length as a base
# number. So we don't even try to generate numbers below the
# length as they are likely all taken
may indicate a reason:
apparently, the author thought, he must avoid values below 0x4000. However,
n = len(wid2enc)
return "".join([w < n and wid2enc[w] or _encode(w) for w in wids])
_encoding = [None] * 0x4000 # Filled later, and converted to a tuple
shows that value below 0x4000 are precomputed and therefore particularly (computation) efficient.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant