feat: add SearchAfterMixin for ES search_after capability #4536

Open
wants to merge 3 commits into
base: master

Ali-D-Akbar
Contributor

@Ali-D-Akbar Ali-D-Akbar commented Jan 8, 2025

PROD-4233
Adds a new SearchAfterMixin, intended to replace PkSearchableMixin, that uses the Elasticsearch search_after capability. With this mixin, results are no longer capped by the default search limit of 10k: when an index holds more than 10k records, the mixin makes multiple calls to ES to fetch them all.

Previously, we hit this search limit, which resulted in fewer records being returned. We increased MAX_RESULT_WINDOW at the time, but using search_after is a better and more flexible way to handle it.
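
For readers unfamiliar with search_after, here is a minimal sketch of the paging loop described above. It is not the mixin's actual code: `make_fake_index`, `search_page` and `fetch_all` are hypothetical names, and the fake index stands in for a real ES request so the snippet runs without a cluster. The idea is that each request is sorted on a stable key and the sort value of the last hit is passed as the cursor for the next request.

```python
from typing import Callable, Dict, List, Optional

Hit = Dict[str, int]
SearchPage = Callable[[int, Optional[List[int]]], List[Hit]]


def make_fake_index(total: int) -> SearchPage:
    """Hypothetical stand-in for one ES request over `total` docs sorted by `id`."""
    docs = [{"id": i} for i in range(1, total + 1)]

    def search_page(size: int, search_after: Optional[List[int]] = None) -> List[Hit]:
        # Return up to `size` docs whose sort key is strictly after the cursor.
        cursor = search_after[0] if search_after else 0
        return [doc for doc in docs if doc["id"] > cursor][:size]

    return search_page


def fetch_all(search_page: SearchPage, page_size: int) -> List[Hit]:
    """Collect every hit by repeatedly passing the last sort value as the cursor."""
    results: List[Hit] = []
    search_after: Optional[List[int]] = None
    while True:
        page = search_page(page_size, search_after)
        if not page:
            break  # an empty page means the index is exhausted
        results.extend(page)
        # The sort values of the last hit become the cursor for the next request.
        search_after = [page[-1]["id"]]
    return results
```

Because the cursor is a sort value rather than an offset, no single request ever needs a window deeper than `page_size`, which is why the 10k limit stops being a constraint.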

This PR also adds a v2 CatalogQueryContainsViewSet that fixes the querying mechanism by filtering the items when the query is executed on ES. In v1, CatalogQueryContainsViewSet first fetched all the records and only then filtered them.
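
As a rough illustration of filtering at query time, the sketch below builds a raw Elasticsearch request body in which the identifier and partner constraints sit next to the user query. The `partner` and `uuid` field names mirror the diff snippet in this PR; the function name `build_contains_query` and the use of `query_string` for the user query are assumptions, not the viewset's actual code.

```python
from typing import Any, Dict, List


def build_contains_query(query: str, partner: str, course_uuids: List[str]) -> Dict[str, Any]:
    """Build an ES body that applies the identifier filter inside the query itself."""
    return {
        "query": {
            "bool": {
                # The user-supplied catalog query, e.g. "org:edX" (assumed query_string).
                "must": [{"query_string": {"query": query}}],
                # Filters run in ES, so only matching documents are ever returned.
                "filter": [
                    {"term": {"partner": partner}},
                    {"terms": {"uuid": course_uuids}},
                ],
            }
        }
    }
```

With this shape ES returns at most as many hits as there are requested identifiers, instead of the whole result set of the catalog query.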

Testing Instructions:

  1. Run update_index locally in Discovery shell.
  2. Visit the /api/v2/catalog/query_contains/ endpoint with a sample query like this: http://localhost:18381/api/v2/catalog/query_contains/?course_uuids=2de67490-f748-4efd-8532-b445f7ecc6f9,f9f1e100-668a-4fd5-a966-a127de1f69de&query=org:edX

You can set ELASTICSEARCH_DSL_QUERYSET_PAGINATION to a value of your choice to test the behavior. The end result will be the same, but the value affects the number of times the search_after mechanism is called.
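
To get a feel for how the page size changes the number of calls, here is a small back-of-the-envelope helper. `min_requests` is a hypothetical name, and the figure is a lower bound: a loop that only stops when it receives an empty page makes one extra request.

```python
import math


def min_requests(total_hits: int, page_size: int) -> int:
    """Fewest ES requests needed to page through `total_hits` at `page_size` per request."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return math.ceil(total_hits / page_size)
```

For example, paging through 25,000 records needs at least 3 requests at a page size of 10,000 and at least 5 at a page size of 5,000.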

@Ali-D-Akbar Ali-D-Akbar marked this pull request as ready for review January 10, 2025 09:59
@Ali-D-Akbar Ali-D-Akbar force-pushed the aakbar/PROD-4233 branch 4 times, most recently from 2b92b48 to d93f032 Compare January 13, 2025 20:30
@Ali-D-Akbar Ali-D-Akbar force-pushed the aakbar/PROD-4233 branch 2 times, most recently from de8f6a3 to e2052c4 Compare January 16, 2025 17:19
partner=ESDSLQ('term', partner=partner.short_code),
identifiers=ESDSLQ('terms', **{'uuid': course_uuids}),
document=CourseDocument
).values_list('uuid', flat=True)
Contributor

This uses values_list while course_run_ids uses a comprehension. We should make it consistent.

Contributor Author

It's rather consistent now.

Contributor

Is it? The code is still the same.

Contributor Author

Ah, I misunderstood the earlier comment. Lemme use values_list for both.

@Ali-D-Akbar Ali-D-Akbar force-pushed the aakbar/PROD-4233 branch 3 times, most recently from 66e7fae to c3c3901 Compare January 20, 2025 07:57
@@ -4192,3 +4195,23 @@ def test_basic(self):
self.assertEqual(course_run.restricted_run, restricted_course_run)
self.assertEqual(restricted_course_run.restriction_type, 'custom-b2b-enterprise')
self.assertEqual(str(restricted_course_run), "course-v1:SC+BreadX+3T2015: <custom-b2b-enterprise>")


class TestSearchAfterMixin(ElasticsearchTestMixin, TestCase):
Contributor

@DawoudSheraz DawoudSheraz Jan 22, 2025

Shouldn't this test live in test_mixins? It seems a bit weird to have it in test_models.

Contributor Author

Moving to a new file test_mixins.py

Comment on lines 4205 to 4206
for _ in range(self.total_courses):
    CourseFactory()
Contributor

You can use CourseFactory.create_batch(count) instead of this loop.

CourseFactory()

@patch("course_discovery.apps.course_metadata.models.registry.get_documents")
def test_fetch_all_courses(self, mock_get_documents):
Contributor

Apart from courses, try indexing a combination of different products (CourseRun, Program, etc.) to get some variety, and then verify that everything works as expected.

Contributor Author

These functionalities are well tested in course_discovery/apps/api/v2/tests/test_views/test_catalog_queries.py. We've already added a proxy model to verify the behavior that the search_after functionality is working as expected.

Furthermore, it should ensure that the existing search functionality and search responses remain unaffected in the current version of the endpoint.

Decision
----------
A new version (v2) of the `search/all/` endpoint will be introduced to enhance functionality while ensuring that the existing v1 functionality remains unaffected.
Contributor

We can tweak the decision section to highlight the addition of SearchAfterMixin and SearchAfterPagination instead of only mentioning that a new endpoint was added. That would better reflect the new capabilities. We can then build on it and show how the new endpoints were added.


search = search.extra(search_after=search_after) if search_after else search

results = search.execute()
Contributor

Curious: should we not add error handling here, in case any of the subsequent requests fails?

Contributor Author

The error is already handled in this custom dispatch

exception = InvalidQuery(f'Failed to make Elasticsearch request. Got exception: {exc}')
and will be shown on the API response as a result.

{
    "detail": "Failed to make Elasticsearch request. Got exception: RequestError(400, 'search_phase_execution_exception', 'Failed to parse query [(org:)]')"
}
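
A minimal sketch of the pattern described in this reply, assuming the failure of any page request propagates up to one central dispatch that converts it into an API error. `BackendError`, `InvalidQuery` (as defined here), `dispatch` and the 400 status code are hypothetical stand-ins, not the project's actual classes or behavior; only the message format is taken from the line quoted above.

```python
from typing import Any, Callable, Dict, Tuple


class BackendError(Exception):
    """Hypothetical stand-in for an exception raised by the ES client."""


class InvalidQuery(Exception):
    """Hypothetical stand-in for the project's InvalidQuery exception."""


def dispatch(handler: Callable[[], Any]) -> Tuple[int, Dict[str, Any]]:
    """Run a view handler and turn any backend failure into an error payload."""
    try:
        return 200, {"results": handler()}
    except BackendError as exc:
        # A failure on the first or any later page request surfaces here.
        exception = InvalidQuery(f'Failed to make Elasticsearch request. Got exception: {exc}')
        return 400, {"detail": str(exception)}  # status code is an assumption
```

Under this arrangement the paging loop itself stays free of try/except blocks, since a failure in any of its requests is reported the same way.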

4 participants