Index to search #391
base: main
584dc5e to 64573c6
3aad56e to 8527a84
lunika
left a comment
Just for your information, the library used in Drive to manage the item tree is not the same as the one in Docs. We use django-ltree, which relies on the PostgreSQL ltree extension. This is a significant difference between the two projects: it changes how queries are written and how the path parameter is handled.
8b3b277 to a1a9b14
cdaa5d5 to eac0ffb
eac0ffb to d97444e
Search in Drive relies on an external project like "La Suite Find". We need to declare a common external network in order to connect to the search app and index our documents. Signed-off-by: Fabre Florian <[email protected]>
Add a new Django app 'demo' that contains the command 'create_demo'. Generate dummy users and files with existing factories. Signed-off-by: Fabre Florian <[email protected]>
Add a SearchIndexer service that handles indexation & search API calls to Find. Add SEARCH_INDEXER_* settings to configure it. Signed-off-by: Fabre Florian <[email protected]>
Add a Celery task that sends an item's changes to the Find API. A simple flag is set in the cache for a given amount of time; it blocks the creation of any other task and so acts as a throttle. The SEARCH_INDEXER_COUNTDOWN setting gives the number of seconds between tasks. Signed-off-by: Fabre Florian <[email protected]>
When the file indexer is enabled (SEARCH_INDEXER_* settings are set) use it in place of the title filtering. Signed-off-by: Fabre Florian <[email protected]>
Reduce the number of Find API calls by grouping all the latest changes for indexation: send all the items updated or deleted since the task was triggered. Signed-off-by: Fabre Florian <[email protected]>
Set SEARCH_INDEXER_CLASS=None as default configuration for dev. Add documentation for Find service setup. Signed-off-by: Fabre Florian <[email protected]>
Keep the ordering by score from the Find API on search/ results when the fulltext search is enabled. Refactor pagination to work with a list instead of a queryset. Fix Changelog. Signed-off-by: Fabre Florian <[email protected]>
Use nb_results instead of page/page_size argument for /search API. Add --batch-size argument to the index command. Fix an issue in SearchIndexer.has_text when item.mimetype is empty. Signed-off-by: Fabre Florian <[email protected]>
In Drive, search should still find files in the trash bin for a limited amount of time, so a soft delete cannot disable an indexed entry. Be more strict on mimetype patterns. Signed-off-by: Fabre Florian <[email protected]>
Use SEARCH_INDEXER_CONTENT_MAX_SIZE as the limit (in bytes) for the file content. Fix the default configuration of OIDC_STORE_ACCESS_TOKEN. Signed-off-by: Fabre Florian <[email protected]>
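The throttling and batching described in the commits above can be sketched as follows. This is a minimal, framework-free illustration and not the code from the PR: `TTLCache` is a stand-in for a cache backend (its `add` mimics set-if-absent semantics), `schedule_indexing`, `run_indexing_task`, `THROTTLE_KEY` and the `pending` set are hypothetical names, and `COUNTDOWN` is only assumed to play the role of SEARCH_INDEXER_COUNTDOWN.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for a cache backend (assumption, for illustration)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expiry = {}

    def add(self, key, timeout):
        """Set the key only if it is absent or expired; return True if it was set."""
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False
        self._expiry[key] = now + timeout
        return True


COUNTDOWN = 30  # seconds; assumed to play the role of SEARCH_INDEXER_COUNTDOWN
THROTTLE_KEY = "search-indexer-throttle"  # hypothetical cache key


def schedule_indexing(cache, pending, scheduled, item_id):
    """Record a changed item and schedule at most one task per countdown window."""
    pending.add(item_id)
    if cache.add(THROTTLE_KEY, COUNTDOWN):
        # A real setup would queue a Celery task with a countdown here;
        # this sketch only records that one task was scheduled.
        scheduled.append("index-task")
        return True
    return False


def run_indexing_task(pending, push):
    """Send every item changed since the task was triggered in a single batch."""
    batch = sorted(pending)
    pending.clear()
    if batch:
        push(batch)
    return batch
```

While the flag is alive, further changes only accumulate in `pending`; the single scheduled task then pushes them all in one call, which is what keeps the number of Find API calls low.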
d97444e to e72f76b
src/backend/core/api/viewsets.py
Outdated
        return drf.response.Response(serializer.data, status=drf.status.HTTP_200_OK)

    # pylint: disable-next=too-many-arguments,too-many-positional-arguments
    def _fulltext_search(self, queryset, indexer, request, text):
Something more relevant? _external_search? This is not really related to full-text search; it is about using an external service that manages search.
            data={
                "q": text,
                "visited": visited,
                "services": ["drive"],
drive should be a constant
            token=token,
        )

        return [d["_id"] for d in response]
IMO this should not be part of this method. It is the class implementing BaseItemIndexer that knows what the response contains. If we create another backend whose query response does not have this _id property, it will not work. The search method should probably be abstract, and every backend should implement its own logic.
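One way to read this suggestion, sketched under assumptions: the base class keeps `search` abstract, and each backend turns its own response format into a list of item ids. `BaseItemIndexer` is the class named in the comment; `FindItemIndexer`, the injected `send_query` callable and `fake_find_api` are hypothetical and only illustrate the shape, not the PR's actual code or the real Find API.

```python
from abc import ABC, abstractmethod


class BaseItemIndexer(ABC):
    """Backend-agnostic contract: search returns a list of item ids."""

    @abstractmethod
    def search(self, text, token):
        """Return the ids of matching items, in the order the backend ranks them."""


class FindItemIndexer(BaseItemIndexer):
    """Hypothetical backend that knows its hits carry an `_id` key."""

    def __init__(self, send_query):
        # `send_query` is an injected callable standing in for the HTTP call.
        self._send_query = send_query

    def search(self, text, token):
        response = self._send_query({"q": text}, token)
        # Only this backend depends on the `_id` field of the response.
        return [hit["_id"] for hit in response]


def fake_find_api(data, token):
    """Canned response used for the example; not the real Find API."""
    return [{"_id": "item-2"}, {"_id": "item-1"}]
```

With this split, a different backend can parse a completely different response shape without touching the base class or the view that calls `search`.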
When the indexer service is not configured, the search view should work even with OIDC_STORE_ACCESS_TOKEN disabled. Disable token storage for the unit tests. Add bin/fernetkey, which generates a key for the OIDC_STORE_REFRESH_TOKEN_KEY setting. Signed-off-by: Fabre Florian <[email protected]>
b8b2b58 to 69e5346
Replace "drive" by a SERVICE_NAME constant. Merge search() & search_query() methods. Signed-off-by: Fabre Florian <[email protected]>
66d0331 to 4bd9cc6
## Create an index service for Drive

Configure a **Service** for Docs application with these settings
- Configure a **Service** for Docs application with these settings
+ Configure a **Service** for Drive application with these settings
## Configure settings of Drive

Add those Django settings the Docs application to enable the feature.
- Add those Django settings the Docs application to enable the feature.
+ Add those Django settings the Drive application to enable the feature.
    @method_decorator(refresh_oidc_access_token)
    def _indexed_search(self, request, queryset, indexer, text):
        """
        Returns a queryset from the results the fulltext search of Find
It doesn't return a queryset but a response.
Purpose
We want to add fulltext (and semantic in a second phase) search to Drive.
The goal is to enable efficient and scalable search across document content by pushing relevant data to a dedicated search backend, such as OpenSearch. The backend should be pluggable.
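As a rough illustration of what "pluggable" can mean here, a backend class can be resolved from a dotted path held in a setting, with an empty value meaning that indexing is disabled. This is a minimal sketch under assumptions: `load_indexer_class` is a hypothetical helper, and how the project actually resolves SEARCH_INDEXER_CLASS is not shown on this page.

```python
from importlib import import_module


def load_indexer_class(dotted_path):
    """Resolve a dotted path to a class, or return None when search is disabled."""
    if not dotted_path:
        # Mirrors a setting left at None: no indexer is configured.
        return None
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"{dotted_path!r} is not a dotted path to a class")
    return getattr(import_module(module_path), class_name)
```

Calling `load_indexer_class(None)` returns `None`, so callers can fall back to the existing title filtering; any importable dotted path, for example a standard-library class such as `collections.OrderedDict` used purely as a stand-in, resolves to the class itself.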
Proposal
This is a backport of this PR on docs