-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Search refactoring to better utilize index and improve capabilities #906
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
msm-cert
reviewed
Feb 20, 2024
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've managed to review ~25% of the changes for now.
Repumba
reviewed
Feb 20, 2024
Repumba
reviewed
Feb 20, 2024
Repumba
reviewed
Feb 20, 2024
… and UnsupportedPatternValue
Co-authored-by: Tomek Chytry-Trzeciak <[email protected]>
Repumba
approved these changes
Feb 22, 2024
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Your checklist for this pull request
addedfixed automated tests for my change (if applicable, optional)What is the current behaviour?
Current search engine was modified incrementally and few things, especially related with escaping are horrible mess. In addition I can't easily optimize things in array/jsonb columns as I need to use specific operators in query to utilize a GIN index: https://www.postgresql.org/docs/current/gin-builtin-opclasses.html#GIN-BUILTIN-OPCLASSES-TABLE
What is the new behaviour?
I have heavily reworked search engine and here is a list of functional changes:
@?
operator, so they can utilize GIN index making them really fast:(https://www.postgresql.org/docs/current/functions-json.html)
This type of operator is able to utilize GIN index, but we need to build our predicate using jsonpath grammar.
For example query
cfg.cncs*.host:"example.com"
is converted to the following SQL query:Unfortunately queries with wildcards against JSON columns can't be optimized easily so work still in progress.
alt_names column is queried using
@>
operator, so it can utilize GIN index. After adding collection of alternative upload names (#482), bothfile_name
column andalt_names
array are checked by the query making it unexpectedly slow.#661 changed inheritance model from join-based to single-table-based, so we no longer need to join on all types of objects while making queries involving multiple types. This might be useful for
parent/child
queries withOR
operator, but main reason was to remove code that was checking that.That one was really annoying for me as a user, so I just treat exclusive ranges as inclusive. It doesn't make any huge difference if we query for
upload_time:<5d
orupload_time:<=5d
Fixed another annoying thing, especially in dates:
upload_time:[1d TO 5d]
will return you nothing because it meansFROM NOW-1 day TO NOW-5 day
and the left value is greater than the right side...I feel we're finally doing it right...
tokenize_string
which is a heart of a new parser.mwdb.core.search.parse_helpers
module. Most important methods are:transform_for_eq_statement
: trivia, just unescaping characters for__eq__
operatortransform_for_like_statement
converting unescaped Lucene wildcards to SQL wildcards and then escaping all backslashes and SQL wildcard characters.transform_for_quoted_like_statement
made for LIKE statement against JSON typecasted to String. String inside JSON objects are quoted and additionally escaped which needs to be considered while making a patterntransform_for_config_*
which is additionally transforming value usingencode("unicode-escape")
. PostgreSQL is not accepting null-bytes in strings, so if we have high probability of lazily-encoded binary data as a string, we're using additional encoding which needs to be included in patternluqum
and I'm using new TreeVisitor. I decided to use visitor only for building condition. Values are parsed depending on what is expected by specific type of a search field.Breaking changes
Some of them needs discussion because they can be avoided with some additional code
size:">=5kB"
no longer represents a range. The correct forms are:That's because latest luqum added support for OpenRange operators, so now
>=
,>
,<
,<=
is a thing that is no longer a part of Term: Add support for unbounded ranges jurismarches/luqum#91. Of course we can fix that and leave our parsing of>=
in place, but do we really need to?I tried to unify exception messages, so some of them been changed (https://github.com/CERT-Polska/mwdb-core/pull/906/files#diff-03ca869f84201fd8bfc70b98b0e4fc0cb4c0bb1f91c375f8dfbf0af31fa8782c). I don't think anyone relied on that.
jsonb @? jsonpath
is querying values a bit differently than our previous code:Test plan
Already included tests are really good at testing corner cases