Skip to content

Commit

Permalink
feat: add SearchAfterMixin for ES search_after capability
Browse files Browse the repository at this point in the history
  • Loading branch information
Ali-D-Akbar committed Jan 9, 2025
1 parent d63d119 commit a70ebb7
Show file tree
Hide file tree
Showing 2 changed files with 82 additions and 18 deletions.
61 changes: 61 additions & 0 deletions course_discovery/apps/course_metadata/models.py
Original file line number Diff line number Diff line change
Expand Up @@ -1149,6 +1149,67 @@ def search(cls, query, queryset=None):
return filtered_queryset


class SearchAfterMixin:
"""
Represents objects to query Elasticsearch with `search_after` pagination and load by primary key.
"""

@classmethod
def search(cls, query, queryset=None):
"""
Queries the Elasticsearch index with optional pagination using `search_after`.
Args:
query (str) -- Elasticsearch querystring (e.g. `title:intro*`)
queryset (models.QuerySet) -- base queryset to search, defaults to objects.all()
Returns:
QuerySet
"""
query = clean_query(query)
page_size = 10000
queryset = queryset or cls.objects.all()

Check warning on line 1171 in course_discovery/apps/course_metadata/models.py

View check run for this annotation

Codecov / codecov/patch

course_discovery/apps/course_metadata/models.py#L1169-L1171

Added lines #L1169 - L1171 were not covered by tests

if query == '(*)':
# Early-exit optimization. Wildcard searching is very expensive in elasticsearch. And since we just
# want everything, we don't need to actually query elasticsearch at all.
return queryset

Check warning on line 1176 in course_discovery/apps/course_metadata/models.py

View check run for this annotation

Codecov / codecov/patch

course_discovery/apps/course_metadata/models.py#L1176

Added line #L1176 was not covered by tests

logger.info(f"Attempting Elasticsearch document search against query: {query}")
es_document, *_ = registry.get_documents(models=(cls,))
dsl_query = ESDSLQ('query_string', query=query, analyze_wildcard=True)

Check warning on line 1180 in course_discovery/apps/course_metadata/models.py

View check run for this annotation

Codecov / codecov/patch

course_discovery/apps/course_metadata/models.py#L1178-L1180

Added lines #L1178 - L1180 were not covered by tests

all_ids = set()
search_after = None

Check warning on line 1183 in course_discovery/apps/course_metadata/models.py

View check run for this annotation

Codecov / codecov/patch

course_discovery/apps/course_metadata/models.py#L1182-L1183

Added lines #L1182 - L1183 were not covered by tests

while True:
search = (

Check warning on line 1186 in course_discovery/apps/course_metadata/models.py

View check run for this annotation

Codecov / codecov/patch

course_discovery/apps/course_metadata/models.py#L1185-L1186

Added lines #L1185 - L1186 were not covered by tests
es_document.search()
.query(dsl_query)
.sort('id')
.extra(size=page_size, search_after=search_after)
)

try:
results = search.execute()
except RequestError as e:
logger.warning(f"Elasticsearch request failed: {e}")
break

Check warning on line 1197 in course_discovery/apps/course_metadata/models.py

View check run for this annotation

Codecov / codecov/patch

course_discovery/apps/course_metadata/models.py#L1193-L1197

Added lines #L1193 - L1197 were not covered by tests

ids = {result.pk for result in results}

Check warning on line 1199 in course_discovery/apps/course_metadata/models.py

View check run for this annotation

Codecov / codecov/patch

course_discovery/apps/course_metadata/models.py#L1199

Added line #L1199 was not covered by tests
if not ids:
logger.info("No more results found.")
break

Check warning on line 1202 in course_discovery/apps/course_metadata/models.py

View check run for this annotation

Codecov / codecov/patch

course_discovery/apps/course_metadata/models.py#L1201-L1202

Added lines #L1201 - L1202 were not covered by tests

all_ids.update(ids)
search_after = results[-1].meta.sort
logger.info(f"Fetched {len(ids)} records; total so far: {len(all_ids)}")

Check warning on line 1206 in course_discovery/apps/course_metadata/models.py

View check run for this annotation

Codecov / codecov/patch

course_discovery/apps/course_metadata/models.py#L1204-L1206

Added lines #L1204 - L1206 were not covered by tests

filtered_queryset = queryset.filter(pk__in=all_ids)
logger.info(f"Filtered queryset of size {len(filtered_queryset)} for query: {query}")
return filtered_queryset

Check warning on line 1210 in course_discovery/apps/course_metadata/models.py

View check run for this annotation

Codecov / codecov/patch

course_discovery/apps/course_metadata/models.py#L1208-L1210

Added lines #L1208 - L1210 were not covered by tests


class Collaborator(TimeStampedModel):
"""
Collaborator model, defining any collaborators who helped write course content.
Expand Down
39 changes: 21 additions & 18 deletions docs/decisions/0030-use-elasticsearch-search-after.rst
Original file line number Diff line number Diff line change
Expand Up @@ -7,49 +7,52 @@ Accepted (December 2024)

Context
---------
Elasticsearch enforces a strict limit on the number of records that can be retrieved in a single query,
controlled by the `MAX_RESULT_WINDOW` setting, which defaults to 10,000.
When a query attempts to retrieve more results than this limit, Elasticsearch does not simply truncate the results—instead,
Elasticsearch enforces a strict limit on the number of records that can be retrieved in a single query,
controlled by the `MAX_RESULT_WINDOW` setting, which defaults to 10,000.
When a query attempts to retrieve more results than this limit, Elasticsearch does not simply truncate the results—instead,
it can lead to partial or incomplete data retrieval across search endpoints, potentially causing significant data loss or incomplete query responses.

Increasing this limit is not a viable solution, as it can lead to significant performance issues,
Increasing this limit is not a viable solution, as it can lead to significant performance issues,
including increased memory usage and query latency, which can degrade the cluster's overall stability.

To address this issue, we need a more efficient way to paginate large query results.
The solution must allow for seamless and reliable pagination without imposing excessive resource demands on the system.
To address this issue, we need a more efficient way to paginate large query results.
The solution must allow for seamless and reliable pagination without imposing excessive resource demands on the system.
Furthermore, it should ensure that the existing search functionality and search responses remain unaffected in the current version of the endpoint.

Decision
----------
A new version (v2) of the `search/all/` endpoint will be introduced to enhance functionality while ensuring that the existing v1 functionality remains unaffected.
A new version (v2) of the `search/all/` endpoint will be introduced to enhance functionality while ensuring that the existing v1 functionality remains unaffected.
This version will utilize ElasticSearch's search_after pagination mechanism, specifically designed to handle large query results efficiently.

*How search_after Works:**
`search_after` is a pagination mechanism that allows retrieving results beyond the standard window limit by using the sort values of the last document from the previous page.
**How search_after Works:**
`search_after` is a pagination mechanism that allows retrieving results beyond the standard window limit by using the sort values of the last document from the previous page.
Instead of using traditional offset-based pagination, it uses the actual sort values of the last retrieved document to fetch the next set of results, ensuring efficient and accurate pagination for large datasets.

In the v2 implementation, response documents will include a `sort` field that can be used as the `search_after` query parameter in subsequent queries.
This approach enables scalable retrieval of large datasets by bypassing the `MAX_RESULT_WINDOW` limitations.
In the v2 implementation, response documents will include a `sort` field that can be used as the `search_after` query parameter in subsequent queries.
This approach enables scalable retrieval of large datasets by bypassing the `MAX_RESULT_WINDOW` limitations.
To support this, a new `SearchAfterPagination` class will be introduced, which will parse the `search_after` query parameter to facilitate efficient pagination of ElasticSearch results.

Additionally, new serializers will be integrated for the v2 implementation.
Additionally, new serializers will be integrated for the v2 implementation.
Specifically, the AggregateSearchListSerializerV2 will extend the existing AggregateSearchListSerializer,
supporting the `search_after` pagination mechanism and incorporating newer serializer versions for the same document types.

New versions of the serializers
New versions of the serializers
- CourseRunSearchDocumentSerializer
- CourseSearchDocumentSerializer
- ProgramSearchDocumentSerializer
- LearnerPathwaySearchDocumentSerializer
- PersonSearchDocumentSerializer

will be introduced to include additional search index fields, specifically `sort`` and `aggregate_uuid` in their responses.
will be introduced to include additional search index fields, specifically `sort`` and `aggregation_uuid` in their responses.

Consumers will interact with the new v2 search endpoint by making an initial request to `/api/v2/search/all/`,
which returns search results along with a `next` field representing the `sort` value of the last document.
Consumers will interact with the new v2 search endpoint by making an initial request to `/api/v2/search/all/`,
which returns search results along with a `next` field representing the `sort` value of the last document.
For subsequent pages, they simply include this `next` value as the `search_after` parameter in their next request.

A new mixin, SearchAfterMixin, will be created and added to enable the search_after functionality in catalog query endpoints, such as `/api/v1/catalog/query_contains`.

Example Usage:

```
# First request
response1 = requests.get('/api/v2/search/all/')
Expand All @@ -58,15 +61,15 @@ next_page_search_after = results1['next'] # This is the sort value of the last

# Next request using the 'next' value
response2 = requests.get(f'/api/v2/search/all/?search_after={json.dumps(next_page_search_after)}')

```

Consequences
--------------
- The v2 search endpoint will introduce two new fields, `aggregate_uuid` and `sort`, in response to support the search_after pagination mechanism.
- The v2 search endpoint will introduce two new fields, `aggregation_uuid` and `sort`, in response to support the search_after pagination mechanism.

Next Steps
-------------------------
- Create SearchAfter Mixin: Develop a mixin to enable search_after functionality in the Django shell.
- Extend SearchAfter Functionality: Implement `search_after` for other Elasticsearch search endpoints.
- Notify Users: Inform consumers about the changes and provide support during the transition.
- Monitor Performance: Track the performance of the new endpoint post-deployment.

0 comments on commit a70ebb7

Please sign in to comment.