-
Notifications
You must be signed in to change notification settings - Fork 193
feat: adding the operation delete_all_documents
to the OpenSearchDocumentStore
#2321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: adding the operation delete_all_documents
to the OpenSearchDocumentStore
#2321
Conversation
delete_all_documents
to the OpenSearchDocumentStore
Just a quick thought... I am not sure deleting and recreating the index is a good idea. As a user, I'd expect A bulk delete might be a safer option. Maybe it's also worth checking how this was done in Haystack 1.x. |
I was investigating into it and for both Elastic and Open, specially for large volumes of data, this is the most efficient way to do it. In both cases the indexes are recreated with the same mappings/settings. The only issue is if a user changes the index after creation/population. |
What about something like: def delete_all_documents(self, recreate_index=False) By default we apply a usual delete operation, and if |
I would prefer this idea! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks quite good to me already. My main suggestion is that if we have a try except in delete_all_documents_async, it makes sense to add the same to the sync implementation of delete_all_documents to ensures consistent behavior.
Other than that, I have a suggestion for the tests to make sure they only test one thing. You could create a separate test, for example test_index_functionality_after_delete_all_documents
for testing writing and retrieving of documents after deleting all documents. Something like:
def test_index_functionality_after_delete_all_documents(self, document_store):
"""Test that documents can be written and retrieved after delete_all_documents"""
document_store.write_documents([Document(id="1", content="Test")]) # or add more documents
document_store.delete_all_documents()
new_doc = Document(id="2", content="New test")
document_store.write_documents([new_doc]) # or add more documents
assert document_store.count_documents() == 1
results = document_store.filter_documents()
assert len(results) == 1
assert results[0].content == "New test"
A fast way to clear all documents from the document store while preserving any index settings and mappings. | ||
""" | ||
self._ensure_initialized() | ||
assert self._client is not None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's consistent with the other methods to have this assert here. Therefore, I agree it's good to have it here.
However, I believe the only case it covers is if self._client was set at some point and then for some reason is set to None again. We have the same assert in _ensure_index_exists
, which get's called as part of _ensure_initialized
when it is run for the first time.
self._client.indices.delete(index=self._index) | ||
|
||
# Recreate with mappings and settings | ||
body = {"mappings": self._mappings, "settings": self._settings} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you talked to Tobi about this: do we always want to recreate the index when we delete all documents? I agree that the default should be to recreate the index.
I could imagine a non-breaking extension of delete_all_documents where we add a parameter bool: recreate_index=True
and then users could choose to not recreate the index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the default should be to recreate the index.
I think the default should rather be to delete all the documents, and the optional is to recreate the index with the same settings/mappings as an efficient way to delete all the documents.
# Recreate with mappings and settings | ||
body = {"mappings": self._mappings, "settings": self._settings} | ||
await self._async_client.indices.create(index=self._index, body=body) | ||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have a try except here with this error message, I'd say it makes sense to add the same to the sync implementation of delete_all_documents. It's a few extra lines of code but ensures we have more similar behavior in both implementations, delete_all_documents and delete_all_documents_async
# delete all documents | ||
document_store.delete_all_documents() | ||
assert document_store.count_documents() == 0 | ||
|
||
# verify index still exists and can accept new documents and retrieve |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could extend the test by checking that the actual index configuration remains unchanged.
While I see the advantages of a test that checks that writing data works after deletion just as before, there is also a downside. If in future there is a bug in how we write or retrieve documents, the test delete_all_documents
will also fail. Even if there is no bug at all in how we delete documents. I prefer limiting the unit tests as much as possible only to the functionality that we want to test.
# delete all documents | |
document_store.delete_all_documents() | |
assert document_store.count_documents() == 0 | |
# verify index still exists and can accept new documents and retrieve | |
# Capture index structure before deletion | |
assert document_store._client is not None | |
index_info_before = document_store._client.indices.get(index=document_store._index) | |
mappings_before = index_info_before[document_store._index]["mappings"] | |
settings_before = index_info_before[document_store._index]["settings"] | |
# delete all documents | |
document_store.delete_all_documents() | |
assert document_store.count_documents() == 0 | |
# Verify index structure is preserved | |
index_info_after = document_store._client.indices.get(index=document_store._index) | |
mappings_after = index_info_after[document_store._index]["mappings"] | |
settings_after = index_info_after[document_store._index]["settings"] | |
assert mappings_after == mappings_before, "delete_all_documents shpuld preserve index mappings" | |
assert settings_after == settings_before, "delete_all_documents should preserve index settings" | |
# verify index can accept new documents and retrieve |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good idea 👍🏽
# delete all documents | ||
await document_store.delete_all_documents_async() | ||
assert await document_store.count_documents_async() == 0 | ||
|
||
# verify index still exists and can accept new documents and retrieve |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# delete all documents | |
await document_store.delete_all_documents_async() | |
assert await document_store.count_documents_async() == 0 | |
# verify index still exists and can accept new documents and retrieve | |
# Capture index structure before deletion | |
assert document_store._async_client is not None | |
index_info_before = await document_store._async_client.indices.get(index=document_store._index) | |
mappings_before = index_info_before[document_store._index]["mappings"] | |
settings_before = index_info_before[document_store._index]["settings"] | |
# delete all documents | |
await document_store.delete_all_documents_async() | |
assert await document_store.count_documents_async() == 0 | |
# Verify index structure is preserved | |
index_info_after = await document_store._async_client.indices.get(index=document_store._index) | |
mappings_after = index_info_after[document_store._index]["mappings"] | |
settings_after = index_info_after[document_store._index]["settings"] | |
assert mappings_after == mappings_before, "delete_all_documents_async should preserve index mappings" | |
assert settings_after == settings_before, "delete_all_documents_async should preserve index settings" | |
# verify index can accept new documents and retrieve |
similar suggestion as in test_delete_all_documents
Related Issues
Proposed Changes:
delete_all_documents()
method to theOpenSearchDocumentStore
class to both Synchronous and Asynchronous versions_deserialize_search_hits()
→@staticmethod
_process_bulk_write_errors()
→@staticmethod
_deserialize_document()
→@staticmethod
How did you test it?
test_document_store.py
- Addedtest_delete_all_documents()
test_document_store_async.py
- Addedtest_delete_all_documents_async()
Checklist
fix:
,feat:
,build:
,chore:
,ci:
,docs:
,style:
,refactor:
,perf:
,test:
.