Skip to content

Conversation

davidsbatista
Copy link
Contributor

@davidsbatista davidsbatista commented Sep 25, 2025

Related Issues

Proposed Changes:

  • adds a delete_all_documents() method to the OpenSearchDocumentStore class to both Synchronous and Asynchronous versions
  • strategy is to delete the entire index, and recreate the index with the same mappings and settings
  • other small improvments:
    • _deserialize_search_hits()@staticmethod
    • _process_bulk_write_errors()@staticmethod
    • _deserialize_document()@staticmethod
  • minor typo corrections in docstrings
  • added proper SPDX license headers to test files

How did you test it?

  • unit tests, integration tests + CI tests
  • test_document_store.py - Added test_delete_all_documents()
  • test_document_store_async.py - Added test_delete_all_documents_async()

Checklist

@github-actions github-actions bot added integration:opensearch type:documentation Improvements or additions to documentation labels Sep 25, 2025
@davidsbatista davidsbatista changed the title Feat/adding delete all docs to open search document store feat: adding the operation delete_all_documents to the OpenSearchDocumentStore Sep 25, 2025
@davidsbatista davidsbatista marked this pull request as ready for review September 25, 2025 14:15
@davidsbatista davidsbatista requested a review from a team as a code owner September 25, 2025 14:15
@davidsbatista davidsbatista requested review from julian-risch and removed request for a team September 25, 2025 14:15
@anakin87
Copy link
Member

Just a quick thought... I am not sure deleting and recreating the index is a good idea.

As a user, I'd expect delete_all_documents to only remove the data, not touch the index itself. And even though we recommend managing the DB via Haystack, some users might still have changed index attributes manually. In that case, recreating the index could cause surprises.

A bulk delete might be a safer option. Maybe it's also worth checking how this was done in Haystack 1.x.

@davidsbatista
Copy link
Contributor Author

davidsbatista commented Sep 26, 2025

I was investigating into it and for both Elastic and Open, specially for large volumes of data, this is the most efficient way to do it. In both cases the indexes are recreated with the same mappings/settings.

The only issue is if a user changes the index after creation/population.

@davidsbatista
Copy link
Contributor Author

What about something like:

def delete_all_documents(self, recreate_index=False)

By default we apply a usual delete operation, and if recreate_index=True, we just recreate the index using the mappings/settings defined by the user or the default ones.

@anakin87
Copy link
Member

What about something like:

def delete_all_documents(self, recreate_index=False)

I would prefer this idea!

Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks quite good to me already. My main suggestion is that if we have a try except in delete_all_documents_async, it makes sense to add the same to the sync implementation of delete_all_documents to ensures consistent behavior.
Other than that, I have a suggestion for the tests to make sure they only test one thing. You could create a separate test, for example test_index_functionality_after_delete_all_documents for testing writing and retrieving of documents after deleting all documents. Something like:

def test_index_functionality_after_delete_all_documents(self, document_store):
    """Test that documents can be written and retrieved after delete_all_documents"""
    document_store.write_documents([Document(id="1", content="Test")])  # or add more documents
    document_store.delete_all_documents()
    
    new_doc = Document(id="2", content="New test")
    document_store.write_documents([new_doc])  # or add more documents
    assert document_store.count_documents() == 1
    
    results = document_store.filter_documents()
    assert len(results) == 1
    assert results[0].content == "New test"

A fast way to clear all documents from the document store while preserving any index settings and mappings.
"""
self._ensure_initialized()
assert self._client is not None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's consistent with the other methods to have this assert here. Therefore, I agree it's good to have it here.

However, I believe the only case it covers is if self._client was set at some point and then for some reason is set to None again. We have the same assert in _ensure_index_exists, which get's called as part of _ensure_initialized when it is run for the first time.

self._client.indices.delete(index=self._index)

# Recreate with mappings and settings
body = {"mappings": self._mappings, "settings": self._settings}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you talked to Tobi about this: do we always want to recreate the index when we delete all documents? I agree that the default should be to recreate the index.
I could imagine a non-breaking extension of delete_all_documents where we add a parameter bool: recreate_index=True and then users could choose to not recreate the index.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the default should be to recreate the index.

I think the default should rather be to delete all the documents, and the optional is to recreate the index with the same settings/mappings as an efficient way to delete all the documents.

# Recreate with mappings and settings
body = {"mappings": self._mappings, "settings": self._settings}
await self._async_client.indices.create(index=self._index, body=body)
except Exception as e:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have a try except here with this error message, I'd say it makes sense to add the same to the sync implementation of delete_all_documents. It's a few extra lines of code but ensures we have more similar behavior in both implementations, delete_all_documents and delete_all_documents_async

Comment on lines +483 to +487
# delete all documents
document_store.delete_all_documents()
assert document_store.count_documents() == 0

# verify index still exists and can accept new documents and retrieve
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could extend the test by checking that the actual index configuration remains unchanged.

While I see the advantages of a test that checks that writing data works after deletion just as before, there is also a downside. If in future there is a bug in how we write or retrieve documents, the test delete_all_documents will also fail. Even if there is no bug at all in how we delete documents. I prefer limiting the unit tests as much as possible only to the functionality that we want to test.

Suggested change
# delete all documents
document_store.delete_all_documents()
assert document_store.count_documents() == 0
# verify index still exists and can accept new documents and retrieve
# Capture index structure before deletion
assert document_store._client is not None
index_info_before = document_store._client.indices.get(index=document_store._index)
mappings_before = index_info_before[document_store._index]["mappings"]
settings_before = index_info_before[document_store._index]["settings"]
# delete all documents
document_store.delete_all_documents()
assert document_store.count_documents() == 0
# Verify index structure is preserved
index_info_after = document_store._client.indices.get(index=document_store._index)
mappings_after = index_info_after[document_store._index]["mappings"]
settings_after = index_info_after[document_store._index]["settings"]
assert mappings_after == mappings_before, "delete_all_documents shpuld preserve index mappings"
assert settings_after == settings_before, "delete_all_documents should preserve index settings"
# verify index can accept new documents and retrieve

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea 👍🏽

Comment on lines +260 to +264
# delete all documents
await document_store.delete_all_documents_async()
assert await document_store.count_documents_async() == 0

# verify index still exists and can accept new documents and retrieve
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# delete all documents
await document_store.delete_all_documents_async()
assert await document_store.count_documents_async() == 0
# verify index still exists and can accept new documents and retrieve
# Capture index structure before deletion
assert document_store._async_client is not None
index_info_before = await document_store._async_client.indices.get(index=document_store._index)
mappings_before = index_info_before[document_store._index]["mappings"]
settings_before = index_info_before[document_store._index]["settings"]
# delete all documents
await document_store.delete_all_documents_async()
assert await document_store.count_documents_async() == 0
# Verify index structure is preserved
index_info_after = await document_store._async_client.indices.get(index=document_store._index)
mappings_after = index_info_after[document_store._index]["mappings"]
settings_after = index_info_after[document_store._index]["settings"]
assert mappings_after == mappings_before, "delete_all_documents_async should preserve index mappings"
assert settings_after == settings_before, "delete_all_documents_async should preserve index settings"
# verify index can accept new documents and retrieve

similar suggestion as in test_delete_all_documents

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
integration:opensearch type:documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants