-
-
Notifications
You must be signed in to change notification settings - Fork 418
perf: Optimize searching tags with DB indexes #1129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
01bb83b
to
30d9403
Compare
if limit > 0 and not name: | ||
query = query.limit(limit).order_by(func.lower(Tag.name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This causes the sorting to happen after truncating the results, which differs from the behaviour when searching, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this was not sorting by len(text)
before truncating. The previous implementation only sorted priority results by len so I've updated sort_key
to do that. Which will make this order_by
statement and sort_key
produce the same results when no query is provided.
tags.sort(key=lambda t: sort_key(t[1])) | ||
seen_ids = set() | ||
tag_ids = [] | ||
for row in tags: | ||
id = row[0] | ||
if id in seen_ids: | ||
continue | ||
tag_ids.append(id) | ||
seen_ids.add(id) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't this be written as the following?
tags = dict(tags)
tag_ids = sorted(tags.keys(), key=lambda t: sort_key(tags[t]))
del tags # not sure if this is makes a diff, but `tags` could become quite large and triggering gc on it sooner can't hurt
this is both simpler code wise and should be faster by only sorting the deduplicated list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so it will use the order from Tag.name
or TagAlias.name
depending on which comes first for each tag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in that case the following should work since dict.keys()
maintains insertion order and dict deduplicates by key:
tags.sort(key=lambda t: sort_key(t[1]))
tag_ids = dict(tags).keys() # get the deduplicated list of ids
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other than that I think this is good to merge
src/tagstudio/qt/mixed/tag_search.py
Outdated
direct_tags, ancestor_tags = self.lib.search_tags(name=query, limit=tag_limit) | ||
|
||
if query and query.strip(): | ||
for tag in raw_results: | ||
if tag.name.lower().startswith(query_lower): | ||
priority_results.add(tag) | ||
all_results = [t for t in direct_tags if t.id not in self.exclude] | ||
for tag in ancestor_tags: | ||
if tag.id not in self.exclude: | ||
all_results.append(tag) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code previously handled self.exclude
being None, is there a reason you removed that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its type is list[int]
and I couldn't find any code that could cause it to be None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good reason ^^, I missed that
src/tagstudio/qt/mixed/tag_search.py
Outdated
all_results = [t for t in direct_tags if t.id not in self.exclude] | ||
for tag in ancestor_tags: | ||
if tag.id not in self.exclude: | ||
all_results.append(tag) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all_results = [t for t in direct_tags if t.id not in self.exclude] | |
for tag in ancestor_tags: | |
if tag.id not in self.exclude: | |
all_results.append(tag) | |
all_results = [t for t in direct_tags if t.id not in self.exclude] | |
all_results += [t for t in ancestor_tags if t.id not in self.exclude] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ended up doing this to avoid creating extra lists.
all_results.extend(t for t in ancestor_tags if t.id not in self.exclude)
Summary
Create indexes on columns commonly used in queries.
Tag.name, Tag.shorthand
finding tags by nameTagParent.child_id
fetching tag hierarchiesTagEntry.entry_id
fetching entries with their tagsChange query in
Library.search_tags
so it will use above indexes.Sort results before applying limit to prevent truncating tags that should be prioritized.
Tasks Completed