Skip to content

GH-46414: [C++] Fix GCS filesystem getFileInfo method #46416

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

HyunWooZZ
Copy link

@HyunWooZZ HyunWooZZ commented May 13, 2025

Rationale for this change

Result GetFileInfo(const GcsPath& path) method is changed.
Before this change, It return directory. However it should be return not found info.
So i added code that iterator is valid

What changes are included in this PR?

AS-IS
if (list_result.begin() != list_result.end())

TO-BE
if (list_result.begin() != list_result.end() && *list_result.begin())

Are these changes tested?

Yes I did locally.

Are there any user-facing changes?

No

Copy link

⚠️ GitHub issue #46414 has been automatically assigned in GitHub to PR creator.

@HyunWooZZ HyunWooZZ marked this pull request as draft May 13, 2025 01:39
@HyunWooZZ HyunWooZZ marked this pull request as ready for review May 13, 2025 01:39
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test for this case?

if (list_result.begin() != list_result.end()) {

// Check that the result is valid before determining whether the list is empty.
if (list_result.begin() != list_result.end() && *list_result.begin()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our empty result usage is wrong, right?
It seems that this is not the correct Google Cloud Platform C++ Client Libraries usage.

@coryan Could you tell us the correct usage of this? We want to detect whether ListObjects() returns an empty result or not here.

Copy link
Author

@HyunWooZZ HyunWooZZ May 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our empty result usage is wrong, right?

Yes you are right.
Basically without any permission of GCS, this part is true. list_result.begin() != list_result.end()

Could you add a test for this case?

I will add some testcases. :)

@HyunWooZZ
Copy link
Author

HyunWooZZ commented May 13, 2025

The testbench performs far fewer error checks, and no permission checks (ACL/IAM). testbench_storage_github

It seems the testbench skips permission checks.
Do you have any thoughts on how to test that?

@HyunWooZZ HyunWooZZ force-pushed the FIX/gcs-getfileinfo-directory-bug branch from 614d277 to fbd6645 Compare May 13, 2025 06:21
@HyunWooZZ HyunWooZZ requested a review from kou May 13, 2025 12:15
Comment on lines 357 to 359
// Check that the result is valid before determining whether the list is empty.
auto it = list_result.begin();
if (it != list_result.end() && *it) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether this is a correct approach to check an empty list or not.
Could you share some URLs you referred when you use this approach?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou Sure:)
ListObjects gcs c++
StatusOr

I believe the type of list_result is ListObjectsReader, based on the first link.
I tried to understand what ListObjectsReader contains, and I found this: list_objects_reader

when using *it, we are actually dereferencing a StatusOr<ObjectMetadata> object.
If the object is not ok(), dereferencing it will lead to undefined behavior.
This is why checking *it or it->ok() before accessing the value is necessary.

If there are any other approaches to handle this more effectively, please let me know.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou
Also can you tell me more how to test locally?
It is my first time contribute, so i'm little bit confused.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the correct approach...

In your case, does list_result has one error object or is list_result an empty list?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou Thank ouy reply!!

I think I need to clearly define the problem first.
The issue I'm currently experiencing is that even though I don't have the appropriate permissions for GCS, instead of throwing an error, it returns the file as if it were a directory.

The function logic is as follows:

  1. It checks whether the file exists in the bucket and stores that information in info.
  • In GetFileInfoObject, it seems to return FileInfo(path.full_path, FileType::NotFound).
  1. Therefore, the info.ok() is true and FileType is NotFound, it does not enter the following branch:
if (!info.ok() || info->type() != FileType::NotFound)
  1. As a result, it proceeds to check if there are any objects under full path/.

  2. In step 3, it should first verify if the iterator object is valid before comparing begin and end. However, it doesn't perform this check, leading to the issue.

My test code.

#include "google/cloud/storage/client.h"
#include "google/cloud/status.h"
#include <iostream>
#include <stdexcept>

namespace gcs = ::google::cloud::storage;

int main(int argc, char* argv[]) {
    if (argc != 2) {
        std::cerr << "Usage: " << argv[0] << " <bucket-name>\n";
        return 1;
    }

    std::string bucket_name = argv[1];

    auto client = gcs::Client();

    std::string object_name = "test-object";

    auto meta = client.GetObjectMetadata(bucket_name, object_name);

    if (!meta) {
	std::cout << "Error Code: " << static_cast<int>(meta.status().code()) << std::endl;
	std::cout << "Status string: " << google::cloud::StatusCodeToString(meta.status().code()) << std::endl;
	std::cerr << "object_meta: "  << meta.status().message() << "\n";
        return 1;
    }

    return 0;
}

My test result.

Error Code: 5
Status string: NOT_FOUND
object_meta: Permanent error, with a last message of Could not create a OAuth2 access token to authenticate the request. The request was not sent, as such an access token is required to complete the request successfully. Learn more about Google Cloud authentication at https://cloud.google.com/docs/authentication. The underlying error message was: {"error":"invalid_request","error_description":"Service account not enabled on this instance"}

Copy link
Author

@HyunWooZZ HyunWooZZ May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for i did't post whole code.
I just changed your original code.
If the iterator has just one valid, We can think that is directory.
However if iterate all and there is nothing valid, I think we should return the origin meta status.

diff --git a/cpp/src/arrow/filesystem/gcsfs.cc b/cpp/src/arrow/filesystem/gcsfs.cc
index 9869687a8b..9d607711a2 100644
--- a/cpp/src/arrow/filesystem/gcsfs.cc
+++ b/cpp/src/arrow/filesystem/gcsfs.cc
@@ -353,12 +353,16 @@ class GcsFileSystem::Impl {
     // matches the prefix we assume it is a directory.
     std::string canonical = internal::EnsureTrailingSlash(path.object);
     auto list_result = client_.ListObjects(path.bucket, gcs::Prefix(canonical));
-    if (list_result.begin() != list_result.end()) {
-      // If there is at least one result it indicates this is a directory (at
-      // least one object exists that starts with "path/")
+    for (auto&& object_metadata : list_result) {
+      if (!object_metadata) {
+        continue;
+      }
+      // If there is at least one valid result, it indicates this is a
+      // directory (at least one object exists that starts with
+      // "path/")
       return FileInfo(path.full_path, FileType::Directory);
     }
-    // Return the original not-found info if there was no match.
+    // Return the original not-found info if there was no valid result.
     return internal::ToArrowStatus(meta.status());
   }

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the original status of meta not changed by GetFileInfoObject.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? In the case, internal::ToArrowStatus(meta.status()) is arrow::Status::IOError. Should we return an error when there is no objects under path/?

Copy link
Author

@HyunWooZZ HyunWooZZ May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou, forget above proposal.
If we make that change, we should consider it in more detail.

That return another error.

Status ToArrowStatus(const google::cloud::Status& s) {
  std::ostringstream os;
  os << "google::cloud::Status(" << s << ")";
  Status st;
  switch (s.code()) {
    case GcsCode::kCancelled:
      st = Status::Cancelled(os.str());
      break;
    case GcsCode::kUnknown:
      st = Status::UnknownError(os.str());
      break;
    case GcsCode::kInvalidArgument:
      st = Status::Invalid(os.str());
      break;
    case GcsCode::kDeadlineExceeded:
    case GcsCode::kNotFound:
      st = Status::IOError(os.str());
      break;

I like your first proposal code.
Can you push that code?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed code that you suggested.
After that i tested, all passed !!!!

image

AssertFileInfo(fs.get(), PreexistingBucketPath() + "dir/foo/", FileType::Directory);
}

TEST_F(GcsIntegrationTest, GetFileInfo_WithoutPermission) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And How we test the permission eeror?
I have no idea to test.

Copy link
Member

@kou kou May 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this happen with any other error too? Does permission error is only required for this case?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, there is not any error except below notFound error.

Error Code: 5
Status string: NOT_FOUND
object_meta: Permanent error, with a last message of Could not create a OAuth2 access token to authenticate the request. The request was not sent, as such an access token is required to complete the request successfully. Learn more about Google Cloud authentication at https://cloud.google.com/docs/authentication. The underlying error message was: {"error":"invalid_request","error_description":"Service account not enabled on this instance"}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, sorry. I wanted to say that "can we use another error (not permission error) to reproduce this case?".

Copy link
Author

@HyunWooZZ HyunWooZZ May 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we return the original meta info like this?
return internal::ToArrowStatus(meta.status());

I will propose above thread.
Plz check there!

@HyunWooZZ HyunWooZZ force-pushed the FIX/gcs-getfileinfo-directory-bug branch from fbd6645 to aef990a Compare May 15, 2025 08:52
@HyunWooZZ HyunWooZZ requested a review from kou May 15, 2025 11:20
@HyunWooZZ
Copy link
Author

Hooray!! LGTM:)

Comment on lines 632 to 651
TEST_F(GcsIntegrationTest, GetFileInfo) {
ASSERT_OK_AND_ASSIGN(auto fs, GcsFileSystem::Make(TestGcsOptions()));
constexpr auto kTextFileName = "dir/foo/bar.txt";
ASSERT_OK_AND_ASSIGN(
auto output,
fs->OpenOutputStream(PreexistingBucketPath() + kTextFileName, /*metadata=*/{}));
const auto data = std::string(kLoremIpsum);
ASSERT_OK(output->Write(data.data(), data.size()));
ASSERT_OK(output->Close());

// check this is the File.
AssertFileInfo(fs.get(), PreexistingBucketPath() + kTextFileName, FileType::File);

// check parent directories are recognized as directories.
AssertFileInfo(fs.get(), PreexistingBucketPath() + "dir/", FileType::Directory);

// check not exists.
AssertFileInfo(fs.get(), PreexistingBucketPath() + "dir/foo/unexpected_dir/",
FileType::NotFound);
AssertFileInfo(fs.get(), PreexistingBucketPath() + "dir/foo/not_bar.txt",
FileType::NotFound);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the important part in this test?
(What is the uncovered case in other GetFileInfo* tests?)

Copy link
Author

@HyunWooZZ HyunWooZZ May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your comment is fair.
Below, GetFileInfoObjectWithNestedStructure is tested in the same way.
Sorry about the confusion.

I should test it like this:

Test with an invalid project without permission.
Verify that if there is no valid permission to read GCS, the FileType is NotFound.
To recap, the main difference is that this test code uses a real GCS with no valid permission.

TEST_F(GcsIntegrationTest, GetFileInfoWithoutPermission) {
  auto options = GcsOptions::Anonymous();
  options.retry_limit_seconds = 15;
  options.project_id = "test-only-invalid-project-id";

  ASSERT_OK_AND_ASSIGN(auto fs, GcsFileSystem::Make(options));
  // check with out permission.
  AssertFileInfo(fs.get(), PreexistingBucketPath() + "dir/foo/unexpected_dir/",
                 FileType::NotFound);
  AssertFileInfo(fs.get(), PreexistingBucketPath() + "dir/foo/not_bar.txt",
                 FileType::NotFound);
}

image
image

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kou Really sorry.
After your confirm i will push the test code.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Origin code can't pass the test.
image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the invalid project ID approach work with storage-testbench?
(Can we reproduce the one error element list case with storage-testbench?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will edit the comment before the commit.
plz just check the logic.

At test bench we can't check the permission error.
So i made dummy objects.

I have question.

  • Is there another good approach?
  • Do we have to get rid of GetFileInfoWithoutPermission test?

Copy link
Author

@HyunWooZZ HyunWooZZ May 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the invalid project ID approach work with storage-testbench?

GetFileInfoWithoutPermission test is not work with storage-testbench.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testcase is passed.
image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Let's abandon writing a test for this case.
Could you add a comment why we don't have a test for the implementation instead of adding a test?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok :) I tried to create a mock object MockListObjectsReader, but it was quite difficult for me.
Sorry about that.

Could you add a comment explaining why we don't have a test for the implementation instead of adding a test?

I will add the comment.

To recap our conversation:

  1. I will add a comment in the code explaining why we didn’t test an invalid project ID using the storage testbench.
  1. I will keep the following test case:
TEST_F(GcsIntegrationTest, GetFileInfoWithoutPermission) {
  auto options = GcsOptions::Anonymous();
  options.retry_limit_seconds = 15;
  options.project_id = "test-only-invalid-project-id";
  // Make the real GcsFileSystem.
  ASSERT_OK_AND_ASSIGN(auto fs, GcsFileSystem::Make(options));
  // Check FileInfo without permission.
  AssertFileInfo(fs.get(), PreexistingBucketPath() + "dir/foo/dir1/",
                 FileType::NotFound);
  AssertFileInfo(fs.get(), PreexistingBucketPath() + "dir/foo/bar.txt",
                 FileType::NotFound);
}

Thank you for your response even night :)
I hope you have a good weekend!!!

@HyunWooZZ HyunWooZZ force-pushed the FIX/gcs-getfileinfo-directory-bug branch from 24f7092 to bfd60c1 Compare May 16, 2025 14:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants