@joehybird joehybird commented Nov 6, 2025

Purpose

We want to add full-text search to Drive, with semantic search to follow in a second phase.

The goal is to enable efficient and scalable search across document content by pushing relevant data to a dedicated search backend, such as OpenSearch. The backend should be pluggable.

Proposal

  • Add indexing logic in a search indexer that can be declared as a backend
  • Implement indexing for the Find backend. See corresponding PR in Find
  • Implement search views as a proxy
  • Implement triggers to update search index when a document or its accesses change.

This is a backport of this PR on docs

@joehybird joehybird requested review from NathanVss and lunika November 6, 2025 09:54
@joehybird joehybird self-assigned this Nov 6, 2025
@joehybird joehybird added the enhancement and Backend labels November 6, 2025
@joehybird joehybird changed the title from "🔧(compose) configure external network for communication with search" to "Index to search" November 7, 2025
@joehybird joehybird force-pushed the index-to-search branch 2 times, most recently from 3aad56e to 8527a84 Compare November 7, 2025 14:36
@joehybird joehybird requested a review from qbey November 7, 2025 14:36
@lunika lunika left a comment

Just for your information, the library used in Drive to manage the item tree is not the same as the one in Docs. We use django-ltree, which relies on the PostgreSQL ltree extension. This is a significant difference between the two projects: it changes how to write queries and how to work with the path parameter.

@joehybird joehybird force-pushed the index-to-search branch 2 times, most recently from 8b3b277 to a1a9b14 Compare November 13, 2025 06:18
@joehybird joehybird force-pushed the index-to-search branch 4 times, most recently from cdaa5d5 to eac0ffb Compare November 17, 2025 14:26
@joehybird joehybird requested a review from lunika November 17, 2025 14:26
Search in Drive relies on an external project like "La Suite Find".
We need to declare a common external network in order to connect to
the search app and index our documents.

Signed-off-by: Fabre Florian <[email protected]>
Add a new Django app 'demo' that contains the 'create_demo' command.
The command generates dummy users and files with the existing factories.

Signed-off-by: Fabre Florian <[email protected]>
Add SearchIndexer service that handles indexation & search API calls to Find
Add SEARCH_INDEXER_* settings to configure it.

Signed-off-by: Fabre Florian <[email protected]>
Add a Celery task that sends an item's changes to the Find API.
A simple flag is set in the cache for a limited time; while it is set, no other
task is created, which throttles the calls.
The SEARCH_INDEXER_COUNTDOWN setting gives the number of seconds between tasks.

Signed-off-by: Fabre Florian <[email protected]>
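
The cache-flag throttle described in this commit can be sketched as follows. This is a minimal illustration and not the code from the PR: `TTLCache`, `trigger_indexing`, the `schedule` callable and the per-item key are all hypothetical names and assumptions, and the in-memory cache only stands in for a real shared cache.

```python
import time


class TTLCache:
    """Tiny in-memory stand-in for a shared cache (hypothetical, for illustration)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._expiry = {}

    def add(self, key, ttl):
        """Set the flag only if it is absent or expired; return True when it was set."""
        now = self._clock()
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False
        self._expiry[key] = now + ttl
        return True


def trigger_indexing(cache, schedule, item_id, countdown):
    """Schedule at most one indexing task per item and per throttle window."""
    # Assumption: one flag per item; the real key granularity may differ.
    if not cache.add(f"search-indexer-throttle-{item_id}", countdown):
        return False  # a task is already pending, so this trigger is dropped
    schedule(item_id, countdown)
    return True
```

Here `countdown` plays the role of SEARCH_INDEXER_COUNTDOWN: the flag lives for that many seconds, and triggers arriving in the meantime do not create a new task.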
When the file indexer is enabled (SEARCH_INDEXER_* settings are set) use it
in place of the title filtering.

Signed-off-by: Fabre Florian <[email protected]>
Reduce the number of Find API calls by grouping all the latest changes
for indexing: send all the items updated or deleted since the task
was triggered.

Signed-off-by: Fabre Florian <[email protected]>
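
As a rough sketch of the grouping idea, assuming each item carries `updated_at` and `deleted_at` timestamps (the `Item` dataclass and the `changed_since` helper are hypothetical and only illustrate the selection, not the actual query used in the PR):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Item:
    """Hypothetical, simplified item with the timestamps the selection needs."""

    id: str
    updated_at: datetime
    deleted_at: Optional[datetime] = None


def changed_since(items, triggered_at):
    """Return the items updated or deleted at or after the trigger time."""
    return [
        item
        for item in items
        if item.updated_at >= triggered_at
        or (item.deleted_at is not None and item.deleted_at >= triggered_at)
    ]
```

A single task run can then send the whole batch to the search backend in one call instead of issuing one call per change.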
Set SEARCH_INDEXER_CLASS=None as default configuration for dev.
Add documentation for Find service setup.

Signed-off-by: Fabre Florian <[email protected]>
Keep ordering by score from Find API on search/ results when the
fulltext search is enabled.
Refactor pagination to work with a list instead of a queryset
Fix Changelog

Signed-off-by: Fabre Florian <[email protected]>
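
One way to keep the score ordering and paginate over a list is sketched below. It assumes the search backend returns item ids already sorted by score and that the matching items have been loaded into a dict keyed by id; `order_by_score` and `paginate` are hypothetical helpers, not functions from the PR.

```python
def order_by_score(ranked_ids, items_by_id):
    """Return the items in the order of ranked_ids, skipping ids that are unknown."""
    return [items_by_id[item_id] for item_id in ranked_ids if item_id in items_by_id]


def paginate(results, page, page_size):
    """Return the 1-indexed page of a plain list."""
    start = (page - 1) * page_size
    return results[start:start + page_size]
```

Ids that the backend returns but that the database no longer exposes to the user are silently dropped, so the list can be shorter than the number of hits.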
Use nb_results instead of page/page_size argument for /search API.
Add --batch-size argument to the index command.
Fix an issue in SearchIndexer.has_text when item.mimetype is empty.

Signed-off-by: Fabre Florian <[email protected]>
In Drive, search should still find files in the trash bin for a limited amount
of time, so a soft delete cannot disable an indexed entry.
Be stricter on mimetype patterns.

Signed-off-by: Fabre Florian <[email protected]>
Use SEARCH_INDEXER_CONTENT_MAX_SIZE as limit (in bytes) for the file content.
Fix default configuration of OIDC_STORE_ACCESS_TOKEN

Signed-off-by: Fabre Florian <[email protected]>
return drf.response.Response(serializer.data, status=drf.status.HTTP_200_OK)

# pylint: disable-next=too-many-arguments,too-many-positional-arguments
def _fulltext_search(self, queryset, indexer, request, text):

Could this have a more relevant name, such as _external_search? It is not really about full-text search; it uses an external service that manages search.

data={
"q": text,
"visited": visited,
"services": ["drive"],

"drive" should be a constant.

token=token,
)

return [d["_id"] for d in response]

IMO this should not be part of this method. It is the class implementing BaseItemIndexer that knows what the response contains. If we create another backend whose query response does not have this _id property, it will not work. The search method should probably be abstract, with every backend implementing its own logic.
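
A minimal sketch of that suggestion, assuming an abstract `search` on the base class. Only `BaseItemIndexer` and the `_id` key come from the diff; `FindItemIndexer`, the injected `query` callable and the method signature are hypothetical.

```python
from abc import ABC, abstractmethod


class BaseItemIndexer(ABC):
    """Backend-agnostic contract: each backend decides how to read its own response."""

    @abstractmethod
    def search(self, text, token):
        """Return the ids of the matching items."""


class FindItemIndexer(BaseItemIndexer):
    """Hypothetical backend whose hits expose their id under the "_id" key."""

    def __init__(self, query):
        # Assumption: `query` is a callable that performs the API call
        # and returns a list of hit dictionaries.
        self._query = query

    def search(self, text, token):
        response = self._query({"q": text}, token)
        return [hit["_id"] for hit in response]
```

With this split, a backend whose hits use a different key only overrides `search`, and the base class can no longer be instantiated without one.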

When the indexer service is not configured, the search view should work
even with OIDC_STORE_ACCESS_TOKEN disabled.
Disable token storage for the unit tests.
Add bin/fernetkey that generates a key for the OIDC_STORE_REFRESH_TOKEN_KEY
setting.

Signed-off-by: Fabre Florian <[email protected]>
Replace "drive" with a SERVICE_NAME constant.
Merge search() & search_query() methods.

Signed-off-by: Fabre Florian <[email protected]>

## Create an index service for Drive

Configure a **Service** for Docs application with these settings

Suggested change
- Configure a **Service** for Docs application with these settings
+ Configure a **Service** for Drive application with these settings


## Configure settings of Drive

Add those Django settings the Docs application to enable the feature.

Suggested change
- Add those Django settings the Docs application to enable the feature.
+ Add those Django settings the Drive application to enable the feature.

@method_decorator(refresh_oidc_access_token)
def _indexed_search(self, request, queryset, indexer, text):
"""
Returns a queryset from the results the fulltext search of Find

It doesn't return a queryset but a response.
