Search refactoring to better utilize index and improve capabilities #906

psrok1 · 2024-02-01T15:12:07Z

Your checklist for this pull request

I've read the contributing guideline.
I've tested my changes by building and running the project, and testing changed functionality (if applicable)
I've ~~added~~ fixed automated tests for my change (if applicable, optional)
I've updated documentation to reflect my change (if applicable)

What is the current behaviour?

The current search engine was modified incrementally, and a few things, especially those related to escaping, are a horrible mess. In addition, I can't easily optimize queries against array/jsonb columns, because the query has to use specific operators to utilize a GIN index: https://www.postgresql.org/docs/current/gin-builtin-opclasses.html#GIN-BUILTIN-OPCLASSES-TABLE

What is the new behaviour?

I have heavily reworked the search engine. Here is a list of functional changes:

Exact queries against JSON columns (without wildcards) now use the @? operator, which makes them really fast:

(https://www.postgresql.org/docs/current/functions-json.html)

This operator is able to utilize a GIN index, but we need to build our predicate using the jsonpath grammar.

For example, the query cfg.cncs*.host:"example.com" is converted to the following SQL query:

SELECT * FROM object WHERE (
    cfg @? '$.cfg.cncs[*] ? (@ == "example.com")'
)
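To illustrate how such a predicate can be assembled, here is a minimal sketch. The helper names `jsonpath_key` and `build_exact_jsonpath` are hypothetical and are not taken from the mwdb code base; the sketch assumes that a trailing `*` on a path element marks an array and that `json.dumps` quoting is acceptable for jsonpath string literals.

```python
import json
import re

# Keys that look like plain identifiers can be written unquoted in jsonpath.
_SIMPLE_KEY = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def jsonpath_key(key: str) -> str:
    """Render one object key, quoting it unless it is a plain identifier."""
    return key if _SIMPLE_KEY.fullmatch(key) else json.dumps(key)


def build_exact_jsonpath(path: list[str], value: str) -> str:
    """Build a jsonpath predicate that matches `value` exactly.

    Assumption: a trailing `*` on a path element marks an array,
    which is rendered as `[*]`.
    """
    parts = []
    for element in path:
        is_array = element.endswith("*")
        key = element[:-1] if is_array else element
        parts.append(jsonpath_key(key) + ("[*]" if is_array else ""))
    # json.dumps produces a double-quoted, backslash-escaped string literal.
    return "$." + ".".join(parts) + " ? (@ == " + json.dumps(value) + ")"


print(build_exact_jsonpath(["cfg", "cncs*"], "example.com"))
```

The resulting string would be passed as the right-hand side of `@?`, ideally as a bound query parameter rather than being interpolated into the SQL text.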

Unfortunately, queries with wildcards against JSON columns can't be optimized as easily, so that part is still a work in progress.

Exact queries against file names are also faster:

The alt_names column is now queried using the @> operator, so the query can utilize a GIN index. Since the collection of alternative upload names was added (#482), the query has checked both the file_name column and the alt_names array, which made it unexpectedly slow.

Types of objects can be mixed within the same query

#661 changed the inheritance model from join-based to single-table-based, so we no longer need to join on all object types when a query involves multiple types. This might be useful for parent/child queries with the OR operator, but the main reason was to remove the code that checked for mixed types.

Both inclusive and exclusive ranges are allowed for date-time columns

That one was really annoying for me as a user, so I just treat exclusive ranges as inclusive. It doesn't make a huge difference whether we query for upload_time:<5d or upload_time:<=5d.

Range boundaries are automatically sorted

Fixed another annoying thing, especially with dates:

upload_time:[1d TO 5d] used to return nothing, because it means FROM (NOW - 1 day) TO (NOW - 5 days), and the left value is greater than the right one.
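A minimal sketch of the boundary sorting, assuming relative values consist of an integer followed by a one-letter unit. The helper names and the unit table are hypothetical and do not reflect the exact set of units that mwdb accepts.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical unit table, for illustration only.
_UNITS = {"h": "hours", "d": "days", "w": "weeks"}


def relative_to_datetime(value: str, now: datetime) -> datetime:
    """Turn a relative value like '5d' into `now` minus five days."""
    amount, unit = int(value[:-1]), value[-1]
    return now - timedelta(**{_UNITS[unit]: amount})


def sorted_range(low: str, high: str, now: datetime) -> tuple[datetime, datetime]:
    """Resolve both boundaries and return them in ascending order."""
    start, end = sorted(
        (relative_to_datetime(low, now), relative_to_datetime(high, now))
    )
    return start, end


now = datetime(2024, 2, 1, tzinfo=timezone.utc)
# Both spellings resolve to the same (older, newer) pair.
print(sorted_range("1d", "5d", now) == sorted_range("5d", "1d", now))
```

Sorting after resolving the values to absolute timestamps means the user does not have to remember that a larger relative offset is an earlier point in time.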

More fixed corner cases of escaping

I feel we're finally doing it right...

Query values are tokenized using tokenize_string, which is the heart of the new parser.
All string operations that transform a value from Lucene pattern syntax to SQL LIKE-specific or jsonpath-specific patterns are contained in the mwdb.core.search.parse_helpers module. The most important methods are:
- transform_for_eq_statement: trivial, just unescapes characters for the __eq__ operator
- transform_for_like_statement: converts unescaped Lucene wildcards to SQL wildcards and then escapes all backslashes and SQL wildcard characters
- transform_for_quoted_like_statement: made for LIKE statements against JSON typecast to String. Strings inside JSON objects are quoted and additionally escaped, which needs to be considered while making a pattern
- transform_for_config_*: additionally transforms the value using encode("unicode-escape"). PostgreSQL does not accept null bytes in strings, so where there is a high probability of lazily-encoded binary data stored as a string, we use an additional encoding, which needs to be reflected in the pattern
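The kind of transformation described in the list above can be sketched as follows. This is not the code from mwdb.core.search.parse_helpers: the function names are hypothetical, the sketch works on a raw string in a single pass instead of on tokens, and it assumes backslash is the LIKE escape character.

```python
def escape_like_literal(ch: str) -> str:
    """Escape a character that has a special meaning in a LIKE pattern."""
    return "\\" + ch if ch in "\\%_" else ch


def lucene_to_like_pattern(pattern: str) -> str:
    """Convert Lucene wildcards (* and ?) to SQL LIKE wildcards (% and _).

    A backslash in the input escapes the following character, which is
    then treated as a literal.
    """
    out = []
    chars = iter(pattern)
    for ch in chars:
        if ch == "\\":
            # Escaped character: take the next one literally.
            # A trailing lone backslash is kept as a literal backslash.
            out.append(escape_like_literal(next(chars, "\\")))
        elif ch == "*":
            out.append("%")
        elif ch == "?":
            out.append("_")
        else:
            out.append(escape_like_literal(ch))
    return "".join(out)


def encode_config_value(value: str) -> str:
    """Apply the unicode-escape encoding, so a null byte becomes the text \\x00."""
    return value.encode("unicode-escape").decode("ascii")


print(lucene_to_like_pattern("100%_*"))
print(encode_config_value("a\x00b"))
```

Note that the unicode-escape encoding also doubles every backslash, which is why a pattern built for such a column has to account for the extra encoding.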
I bumped luqum and I'm using the new TreeVisitor. I decided to use the visitor only for building the condition. Values are parsed depending on what the specific type of search field expects.

Breaking changes

Some of them need discussion, because they can be avoided with some additional code.

size:">=5kB" no longer represents a range. The correct forms are:
```
size:>="5kB"
size:"5kB"
```
That's because the latest luqum added support for OpenRange operators, so >=, >, <, <= are no longer a part of Term (see "Add support for unbounded ranges", jurismarches/luqum#91). Of course we can fix that and leave our parsing of >= in place, but do we really need to?
I tried to unify exception messages, so some of them have been changed (https://github.com/CERT-Polska/mwdb-core/pull/906/files#diff-03ca869f84201fd8bfc70b98b0e4fc0cb4c0bb1f91c375f8dfbf0af31fa8782c). I don't think anyone relied on them.
jsonb @? jsonpath queries values a bit differently than our previous code:
- We're not casting exact queries to strings, so if the value looks like a number or boolean, we query it both in quoted and unquoted form: https://github.com/CERT-Polska/mwdb-core/pull/906/files#diff-defa34ff47121dc90fa86b14135c87d7e621187fe5222d1bf6ae33d4be7e5c9cR327
- Arrays nested in arrays are also considered while querying an array for a value... which was somehow checked in one of the tests. I fixed that test because I don't believe anyone is relying on that behavior.
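The quoted/unquoted matching from the first point can be sketched like this. The helper name and the number regex are assumptions made for illustration; the linked diff is the authoritative version.

```python
import json
import re

# Assumption: "looks like a number" means a JSON-style integer or decimal.
_NUMBER = re.compile(r"-?(0|[1-9]\d*)(\.\d+)?")


def jsonpath_value_condition(value: str) -> str:
    """Build the filter part of a jsonpath predicate for an exact value.

    Values that look like numbers or booleans are matched both as a
    JSON string and as the bare JSON literal.
    """
    quoted = "@ == " + json.dumps(value)
    if value in ("true", "false") or _NUMBER.fullmatch(value):
        # `||` is the jsonpath OR operator.
        return f"({quoted} || @ == {value})"
    return f"({quoted})"


print(jsonpath_value_condition("443"))
print(jsonpath_value_condition("example.com"))
```

With this approach a search for 443 finds the value regardless of whether the configuration stored the port as a JSON number or as a string.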

Test plan

The already included tests are really good at testing corner cases.

msm-cert

I've managed to review ~25% of the changes for now.

psrok1 added 17 commits January 17, 2024 18:05

Initial changes

1d0b298

Value types

3d53b69

Next part of value types

1662f0d

wip

01857ea

I'm going through changes

79803d5

Next steps

3d63ff1

Fixes

cde2b07

Next tries

1e5af6a

Introduce proper value parser

855a689

Ok, let's go

f718460

Fix one test: arrays in arrays are searched a bit differently right now

4ad0710

Next steps

1c22961

Index that is actually working

d0d463c

Multi fields

77ee230

Finally...

fe84106

Fix docs

0aae2f2

Merge branch 'master' into refactor/search-revisited

339965e

psrok1 marked this pull request as ready for review February 6, 2024 18:31

psrok1 requested review from Repumba and msm-cert February 6, 2024 18:39

Merge branch 'master' into refactor/search-revisited

d41d7a5

msm-cert reviewed Feb 20, 2024

Repumba reviewed Feb 20, 2024

mwdb/core/search/fields.py (outdated, resolved)

Repumba reviewed Feb 20, 2024

mwdb/core/search/parse_helpers.py (outdated, resolved)

psrok1 and others added 5 commits February 20, 2024 16:41

Apply suggestions from some comments

4f6d3ae

Renamed has_wildcard and UnsupportedLikeStatement to is_pattern_value and UnsupportedPatternValue

04544ea

Added missing type to parse_size arg

3eaa201

Update mwdb/core/search/fields.py

ab17af7

Co-authored-by: Tomek Chytry-Trzeciak <[email protected]>

anchored match => fullmatch

f80386e

psrok1 added 7 commits February 20, 2024 16:53

Correct split_tokenized_string typing

15b0957

Set naming for _get_condition arg

be6ba58

Fixed select statement

99a2351

Apply suggestions from comments

5e1cc00

Applied other suggestions

16552b5

Merge branch 'master' into refactor/search-revisited

c6e1502

Applied other suggestions

e749429

psrok1 requested a review from Repumba February 22, 2024 15:19

Repumba approved these changes Feb 22, 2024

psrok1 merged commit 2c1cf9c into master Feb 22, 2024
12 checks passed

psrok1 deleted the refactor/search-revisited branch February 22, 2024 16:21
