Search refactoring to better utilize index and improve capabilities #906

Merged: 30 commits, Feb 22, 2024

@psrok1 psrok1 commented Feb 1, 2024

Your checklist for this pull request

  • I've read the contributing guideline.
  • I've tested my changes by building and running the project, and testing changed functionality (if applicable)
  • I've added fixed automated tests for my change (if applicable, optional)
  • I've updated documentation to reflect my change (if applicable)

What is the current behaviour?

The current search engine was modified incrementally, and a few things, especially those related to escaping, are a horrible mess. In addition, I can't easily optimize queries against array/jsonb columns, because I need to use specific operators in the query to utilize a GIN index: https://www.postgresql.org/docs/current/gin-builtin-opclasses.html#GIN-BUILTIN-OPCLASSES-TABLE

What is the new behaviour?

I have heavily reworked the search engine. Here is a list of functional changes:

  • Exact queries against JSON columns (without wildcards) now use the @? operator, so they can utilize a GIN index, which makes them really fast:

(https://www.postgresql.org/docs/current/functions-json.html)

This type of operator is able to utilize a GIN index, but we need to build our predicate using the jsonpath grammar.

For example, the query cfg.cncs*.host:"example.com" is converted to the following SQL query:

SELECT * FROM object WHERE (
    cfg @? '$.cfg.cncs[*] ? (@ == "example.com")'
)

Unfortunately, queries with wildcards against JSON columns can't be optimized easily, so that part is still a work in progress.
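As a rough illustration of building such a jsonpath predicate from a Lucene-style field path, here is a minimal sketch. The function name build_jsonpath_eq and the way the path is split are assumptions made for this example and are not taken from the mwdb code base; the exact path that mwdb emits may differ.

```python
import json


def build_jsonpath_eq(field_path: str, value: str) -> str:
    """Build a jsonpath filter for an exact match (hypothetical helper).

    A trailing '*' on a path segment is treated as "any array element",
    e.g. 'cncs*.host' walks every element of 'cncs' and reads 'host'.
    """
    path = "$"
    for segment in field_path.split("."):
        if segment.endswith("*"):
            # Quote the key and append the jsonpath array wildcard accessor
            path += "." + json.dumps(segment[:-1]) + "[*]"
        else:
            path += "." + json.dumps(segment)
    # json.dumps quotes and escapes the value as a double-quoted string literal
    return f"{path} ? (@ == {json.dumps(value)})"


predicate = build_jsonpath_eq("cncs*.host", "example.com")
print(predicate)
```

The resulting string would then be passed to the database as a bound parameter on the right-hand side of @?, rather than being pasted into the SQL text.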

  • Exact queries against file names are also faster:

The alt_names column is now queried using the @> operator, so it can utilize a GIN index. After the collection of alternative upload names was added (#482), both the file_name column and the alt_names array were checked by the query, making it unexpectedly slow.

  • Types of objects can be mixed within the same query

#661 changed the inheritance model from join-based to single-table-based, so we no longer need to join on all object types when a query involves multiple types. This might be useful for parent/child queries with the OR operator, but the main reason was to remove the code that was checking for that.

  • Both inclusive and exclusive ranges are allowed for date-time columns

That one was really annoying for me as a user, so I just treat exclusive ranges as inclusive. It doesn't make any huge difference whether we query for upload_time:<5d or upload_time:<=5d.

  • Range boundaries are automatically sorted

Fixed another annoying thing, especially with dates:

upload_time:[1d TO 5d] used to return nothing, because it means FROM NOW-1 day TO NOW-5 days, and the left value is greater than the right one...
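The boundary sorting can be pictured with a small sketch. The helper names and the supported unit suffixes (d, h, m, s) are assumptions for illustration only, not the parser that mwdb actually uses.

```python
import re
from datetime import datetime, timedelta
from typing import Tuple

# Hypothetical unit table: suffix -> timedelta keyword argument
_UNITS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}


def resolve_relative(value: str, now: datetime) -> datetime:
    """Turn a relative value such as '5d' into the timestamp 'now - 5 days'."""
    match = re.fullmatch(r"(\d+)([dhms])", value)
    if match is None:
        raise ValueError(f"Unsupported relative time: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return now - timedelta(**{_UNITS[unit]: amount})


def sorted_range(low: str, high: str, now: datetime) -> Tuple[datetime, datetime]:
    """Resolve both boundaries and return them in ascending order."""
    a, b = resolve_relative(low, now), resolve_relative(high, now)
    return (min(a, b), max(a, b))


now = datetime(2024, 2, 22)
start, end = sorted_range("1d", "5d", now)
print(start, end)
```

With the boundaries sorted, [1d TO 5d] and [5d TO 1d] describe the same interval, so the user no longer has to think about which relative value is the earlier timestamp.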

  • More corner cases of escaping are fixed

I feel we're finally doing it right...

  • Query values are tokenized using tokenize_string, which is the heart of the new parser.
  • All string operations that transform a value from Lucene pattern syntax to SQL LIKE-specific or jsonpath-specific patterns are contained in the mwdb.core.search.parse_helpers module. The most important methods are:
    • transform_for_eq_statement: trivial, it just unescapes characters for the __eq__ operator.
    • transform_for_like_statement: converts unescaped Lucene wildcards to SQL wildcards and then escapes all backslashes and SQL wildcard characters.
    • transform_for_quoted_like_statement: made for LIKE statements against JSON typecast to String. Strings inside JSON objects are quoted and additionally escaped, which needs to be considered when building a pattern.
    • transform_for_config_*: additionally transforms the value using encode("unicode-escape"). PostgreSQL does not accept null bytes in strings, so where there is a high probability of lazily-encoded binary data stored as a string, we use an additional encoding, which needs to be reflected in the pattern.
  • I bumped luqum and I'm using the new TreeVisitor. I decided to use the visitor only for building the condition. Values are parsed depending on what is expected by the specific type of search field.
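To make the LIKE transformation described above more concrete, here is a minimal sketch of the idea. It is not the code from mwdb.core.search.parse_helpers; the names lucene_to_like and encode_config_value are hypothetical, and the sketch assumes backslash is the LIKE escape character.

```python
def _escape_literal(ch: str) -> str:
    # Backslash, % and _ are special in SQL LIKE patterns, so prefix them
    return "\\" + ch if ch in "\\%_" else ch


def lucene_to_like(pattern: str) -> str:
    """Convert a Lucene-style wildcard pattern to a SQL LIKE pattern."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Escaped character in Lucene syntax: take the next one literally
            out.append(_escape_literal(pattern[i + 1]))
            i += 2
        elif ch == "*":
            out.append("%")  # unescaped Lucene '*' -> SQL '%'
            i += 1
        elif ch == "?":
            out.append("_")  # unescaped Lucene '?' -> SQL '_'
            i += 1
        else:
            out.append(_escape_literal(ch))
            i += 1
    return "".join(out)


def encode_config_value(value: str) -> str:
    """Apply the 'unicode-escape' codec so null bytes become visible text."""
    return value.encode("unicode-escape").decode("ascii")


print(lucene_to_like("foo*"))
print(lucene_to_like(r"100%\*"))
print(encode_config_value("a\x00b"))
```

The important point is the order of operations: wildcards are recognized while the Lucene escapes are still visible, and only the characters that end up as literals get escaped for SQL.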

Breaking changes

Some of them need discussion because they can be avoided with some additional code.

Test plan

The tests already included are really good at covering corner cases.

@psrok1 psrok1 marked this pull request as ready for review February 6, 2024 18:31
@psrok1 psrok1 requested review from Repumba and msm-cert February 6, 2024 18:39
@msm-cert msm-cert left a comment

I've managed to review ~25% of the changes for now.

@psrok1 psrok1 requested a review from Repumba February 22, 2024 15:19
@psrok1 psrok1 merged commit 2c1cf9c into master Feb 22, 2024
12 checks passed
@psrok1 psrok1 deleted the refactor/search-revisited branch February 22, 2024 16:21