Implement lakeFS document loader #1

Merged
merged 6 commits into from
Mar 26, 2025

Conversation

guy-har
Contributor

@guy-har guy-har commented Mar 17, 2025

This PR takes the lakeFS implementation from langchain_community and implements it here. The implementation and tests are taken as is, with the following changes:

  • Instead of implementing the API calls manually, we use the high-level Python SDK.
  • Removed the test that used the HTTP mock.

@nopcoder nopcoder left a comment

  • The lakeFS client uses the default endpoint and credentials from the environment, not the ones we pass in the code.
  • UnstructuredLakeFSLoader doesn't hold a client to perform API calls.
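
A minimal sketch of one way to address both points, under stated assumptions: `FakeLakeFSClient` and `SketchLoader` are hypothetical stand-ins, not the lakeFS SDK and not the code in this PR. The client is built from an explicitly passed endpoint and credentials, and the loader keeps a reference to it so that every call goes through that configured client rather than environment defaults.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FakeLakeFSClient:
    """Hypothetical stand-in for an SDK client; not the real lakeFS API."""

    endpoint: str
    access_key: str
    secret_key: str

    def list_objects(self, repo: str, ref: str, prefix: str) -> List[str]:
        # A real client would call the server at self.endpoint here.
        return [f"{self.endpoint}/{repo}/{ref}/{prefix}example.txt"]


class SketchLoader:
    """Hypothetical loader that holds the client it was given."""

    def __init__(
        self, client: FakeLakeFSClient, repo: str, ref: str, path: str = ""
    ) -> None:
        self.client = client  # kept so later calls reuse the same settings
        self.repo = repo
        self.ref = ref
        self.path = path

    def list_paths(self) -> List[str]:
        # Every API call goes through the injected client.
        return self.client.list_objects(self.repo, self.ref, self.path)


client = FakeLakeFSClient(
    endpoint="http://localhost:8000", access_key="key", secret_key="secret"
)
loader = SketchLoader(client, repo="my-repo", ref="main", path="docs/")
paths = loader.list_paths()
```

Passing the client in, instead of constructing it inside the loader, also makes it straightforward to substitute a test double.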

@guy-har guy-har requested a review from nopcoder March 20, 2025 07:55
@nopcoder nopcoder left a comment

Added some suggestions and questions.

        return partition(filename=local_path)
    else:
        with tempfile.TemporaryDirectory() as temp_dir:
            file_path = f"{temp_dir}/{self.path.split('/')[-1]}"

An alternative way to build the path under the temporary directory:

Suggested change
- file_path = f"{temp_dir}/{self.path.split('/')[-1]}"
+ file_path = os.path.join(temp_dir, os.path.basename(self.path))
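
A small sketch of the difference between the two forms, using a hypothetical object path (`lakefs_path` is an assumption for illustration). On POSIX systems both produce the same string; `os.path.join` uses the platform's separator, so the suggested form avoids hard-coding `/`.

```python
import os
import tempfile

lakefs_path = "docs/reports/summary.txt"  # hypothetical object path

with tempfile.TemporaryDirectory() as temp_dir:
    # Original form: manual split and string formatting with a fixed "/".
    by_split = f"{temp_dir}/{lakefs_path.split('/')[-1]}"

    # Suggested form: let os.path pick the file name and the separator.
    by_join = os.path.join(temp_dir, os.path.basename(lakefs_path))

    file_name = os.path.basename(by_join)
    inside_temp_dir = os.path.dirname(by_join) == temp_dir
```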

import requests_mock
from requests_mock.mocker import Mocker

from langchain_community.document_loaders.lakefs import LakeFSLoader

The import from langchain_community.document_loaders.lakefs does not match what I saw running in the sample code.
Based on that code, the import should be from langchain_lakefs.document_loaders.

Comment on lines 62 to 63
# endpoint: str = "endpoint"
endpoint: str = "http://localhost:8000"

Pick one and remove the commented-out line.

@guy-har guy-har requested a review from nopcoder March 25, 2025 09:54
@nopcoder nopcoder left a comment

lgtm - minor comments



  @pytest.fixture
  def mock_unstructured_local() -> Any:
      with patch(
-         "langchain_community.document_loaders.lakefs.UnstructuredLakeFSLoader"
+         "langchain_lakefs.document_loaders.UnstructuredLakeFSLoader"

Fix the indentation of this line.
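
For context on why the patch target string changed along with the package: `unittest.mock.patch` replaces the attribute at the dotted path it is given, so the string has to name the module where the class now lives. The sketch below illustrates this with a hypothetical in-memory module (`demo_loaders` and `RealLoader` are assumptions, not names from this PR).

```python
import sys
import types
from unittest.mock import patch


class RealLoader:
    def load(self):
        return ["real document"]


# Hypothetical module, registered so patch() can resolve it by dotted name.
demo_loaders = types.ModuleType("demo_loaders")
demo_loaders.RealLoader = RealLoader
sys.modules["demo_loaders"] = demo_loaders


def load_documents():
    # Looks the class up on the module at call time, so the patch is visible.
    return demo_loaders.RealLoader().load()


with patch("demo_loaders.RealLoader") as mock_loader_cls:
    # Calling the mocked class returns a mock instance whose load() we control.
    mock_loader_cls.return_value.load.return_value = ["mocked document"]
    patched_result = load_documents()

# Outside the with-block the original class is restored.
restored_result = load_documents()
```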

@@ -140,3 +96,4 @@ def test_load(self, mocker: Mocker) -> None:
         loader.set_path(self.path)
         documents = loader.load()
         self.assertEqual(len(documents), 2)
+        self.assertEqual(len(documents[0].metadata),5)

Suggested change
- self.assertEqual(len(documents[0].metadata),5)
+ self.assertEqual(len(documents[0].metadata), 5)

@guy-har guy-har merged commit 0bbf9df into main Mar 26, 2025

2 participants