-
Notifications
You must be signed in to change notification settings - Fork 3.7k
GH-46414: [C++] Fix GCS filesystem getFileInfo method #46416
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
GH-46414: [C++] Fix GCS filesystem getFileInfo method #46416
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test for this case?
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
if (list_result.begin() != list_result.end()) { | ||
|
||
// Check that the result is valid before determining whether the list is empty. | ||
if (list_result.begin() != list_result.end() && *list_result.begin()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our empty result usage is wrong, right?
It seems that this is not the correct Google Cloud Platform C++ Client Libraries usage.
@coryan Could you tell us the correct usage of this? We want to detect whether ListObjects()
returns an empty result or not here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our empty result usage is wrong, right?
Yes you are right.
Basically without any permission of GCS, this part is true. list_result.begin() != list_result.end()
Could you add a test for this case?
I will add some testcases. :)
It seems the testbench skips permission checks. |
614d277
to
fbd6645
Compare
cpp/src/arrow/filesystem/gcsfs.cc
Outdated
// Check that the result is valid before determining whether the list is empty. | ||
auto it = list_result.begin(); | ||
if (it != list_result.end() && *it) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure whether this is a correct approach to check an empty list or not.
Could you share some URLs you referred when you use this approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kou Sure:)
ListObjects gcs c++
StatusOr
I believe the type of list_result is ListObjectsReader, based on the first link.
I tried to understand what ListObjectsReader contains, and I found this: list_objects_reader
when using *it
, we are actually dereferencing a StatusOr<ObjectMetadata>
object.
If the object is not ok()
, dereferencing it will lead to undefined behavior.
This is why checking *it
or it->ok()
before accessing the value is necessary.
If there are any other approaches to handle this more effectively, please let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kou
Also can you tell me more how to test locally?
It is my first time contribute, so i'm little bit confused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know the correct approach...
In your case, does list_result
has one error object or is list_result
an empty list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kou Thank ouy reply!!
I think I need to clearly define the problem first.
The issue I'm currently experiencing is that even though I don't have the appropriate permissions for GCS, instead of throwing an error, it returns the file as if it were a directory.
The function logic is as follows:
- It checks whether the file exists in the bucket and stores that information in info.
- In GetFileInfoObject, it seems to return FileInfo(path.full_path, FileType::NotFound).
- Therefore, the info.ok() is true and FileType is NotFound, it does not enter the following branch:
if (!info.ok() || info->type() != FileType::NotFound)
-
As a result, it proceeds to check if there are any objects under full path/.
-
In step 3, it should first verify if the iterator object is valid before comparing begin and end. However, it doesn't perform this check, leading to the issue.
My test code.
#include "google/cloud/storage/client.h"
#include "google/cloud/status.h"
#include <iostream>
#include <stdexcept>
namespace gcs = ::google::cloud::storage;
int main(int argc, char* argv[]) {
if (argc != 2) {
std::cerr << "Usage: " << argv[0] << " <bucket-name>\n";
return 1;
}
std::string bucket_name = argv[1];
auto client = gcs::Client();
std::string object_name = "test-object";
auto meta = client.GetObjectMetadata(bucket_name, object_name);
if (!meta) {
std::cout << "Error Code: " << static_cast<int>(meta.status().code()) << std::endl;
std::cout << "Status string: " << google::cloud::StatusCodeToString(meta.status().code()) << std::endl;
std::cerr << "object_meta: " << meta.status().message() << "\n";
return 1;
}
return 0;
}
My test result.
Error Code: 5
Status string: NOT_FOUND
object_meta: Permanent error, with a last message of Could not create a OAuth2 access token to authenticate the request. The request was not sent, as such an access token is required to complete the request successfully. Learn more about Google Cloud authentication at https://cloud.google.com/docs/authentication. The underlying error message was: {"error":"invalid_request","error_description":"Service account not enabled on this instance"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for i did't post whole code.
I just changed your original code.
If the iterator has just one valid, We can think that is directory.
However if iterate all and there is nothing valid, I think we should return the origin meta status.
diff --git a/cpp/src/arrow/filesystem/gcsfs.cc b/cpp/src/arrow/filesystem/gcsfs.cc
index 9869687a8b..9d607711a2 100644
--- a/cpp/src/arrow/filesystem/gcsfs.cc
+++ b/cpp/src/arrow/filesystem/gcsfs.cc
@@ -353,12 +353,16 @@ class GcsFileSystem::Impl {
// matches the prefix we assume it is a directory.
std::string canonical = internal::EnsureTrailingSlash(path.object);
auto list_result = client_.ListObjects(path.bucket, gcs::Prefix(canonical));
- if (list_result.begin() != list_result.end()) {
- // If there is at least one result it indicates this is a directory (at
- // least one object exists that starts with "path/")
+ for (auto&& object_metadata : list_result) {
+ if (!object_metadata) {
+ continue;
+ }
+ // If there is at least one valid result, it indicates this is a
+ // directory (at least one object exists that starts with
+ // "path/")
return FileInfo(path.full_path, FileType::Directory);
}
- // Return the original not-found info if there was no match.
+ // Return the original not-found info if there was no valid result.
return internal::ToArrowStatus(meta.status());
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the original status of meta not changed by GetFileInfoObject.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why? In the case, internal::ToArrowStatus(meta.status())
is arrow::Status::IOError
. Should we return an error when there is no objects under path/
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kou, forget above proposal.
If we make that change, we should consider it in more detail.
That return another error.
Status ToArrowStatus(const google::cloud::Status& s) {
std::ostringstream os;
os << "google::cloud::Status(" << s << ")";
Status st;
switch (s.code()) {
case GcsCode::kCancelled:
st = Status::Cancelled(os.str());
break;
case GcsCode::kUnknown:
st = Status::UnknownError(os.str());
break;
case GcsCode::kInvalidArgument:
st = Status::Invalid(os.str());
break;
case GcsCode::kDeadlineExceeded:
case GcsCode::kNotFound:
st = Status::IOError(os.str());
break;
I like your first proposal code.
Can you push that code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AssertFileInfo(fs.get(), PreexistingBucketPath() + "dir/foo/", FileType::Directory); | ||
} | ||
|
||
TEST_F(GcsIntegrationTest, GetFileInfo_WithoutPermission) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And How we test the permission eeror?
I have no idea to test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this happen with any other error too? Does permission error is only required for this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, there is not any error except below notFound error.
Error Code: 5
Status string: NOT_FOUND
object_meta: Permanent error, with a last message of Could not create a OAuth2 access token to authenticate the request. The request was not sent, as such an access token is required to complete the request successfully. Learn more about Google Cloud authentication at https://cloud.google.com/docs/authentication. The underlying error message was: {"error":"invalid_request","error_description":"Service account not enabled on this instance"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, sorry. I wanted to say that "can we use another error (not permission error) to reproduce this case?".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we return the original meta info like this?
return internal::ToArrowStatus(meta.status());
I will propose above thread.
Plz check there!
fbd6645
to
aef990a
Compare
Hooray!! LGTM:) |
TEST_F(GcsIntegrationTest, GetFileInfo) { | ||
ASSERT_OK_AND_ASSIGN(auto fs, GcsFileSystem::Make(TestGcsOptions())); | ||
constexpr auto kTextFileName = "dir/foo/bar.txt"; | ||
ASSERT_OK_AND_ASSIGN( | ||
auto output, | ||
fs->OpenOutputStream(PreexistingBucketPath() + kTextFileName, /*metadata=*/{})); | ||
const auto data = std::string(kLoremIpsum); | ||
ASSERT_OK(output->Write(data.data(), data.size())); | ||
ASSERT_OK(output->Close()); | ||
|
||
// check this is the File. | ||
AssertFileInfo(fs.get(), PreexistingBucketPath() + kTextFileName, FileType::File); | ||
|
||
// check parent directories are recognized as directories. | ||
AssertFileInfo(fs.get(), PreexistingBucketPath() + "dir/", FileType::Directory); | ||
|
||
// check not exists. | ||
AssertFileInfo(fs.get(), PreexistingBucketPath() + "dir/foo/unexpected_dir/", | ||
FileType::NotFound); | ||
AssertFileInfo(fs.get(), PreexistingBucketPath() + "dir/foo/not_bar.txt", | ||
FileType::NotFound); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the important part in this test?
(What is the uncovered case in other GetFileInfo*
tests?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your comment is fair.
Below, GetFileInfoObjectWithNestedStructure is tested in the same way.
Sorry about the confusion.
I should test it like this:
Test with an invalid project without permission.
Verify that if there is no valid permission to read GCS, the FileType is NotFound.
To recap, the main difference is that this test code uses a real GCS with no valid permission.
TEST_F(GcsIntegrationTest, GetFileInfoWithoutPermission) {
auto options = GcsOptions::Anonymous();
options.retry_limit_seconds = 15;
options.project_id = "test-only-invalid-project-id";
ASSERT_OK_AND_ASSIGN(auto fs, GcsFileSystem::Make(options));
// check with out permission.
AssertFileInfo(fs.get(), PreexistingBucketPath() + "dir/foo/unexpected_dir/",
FileType::NotFound);
AssertFileInfo(fs.get(), PreexistingBucketPath() + "dir/foo/not_bar.txt",
FileType::NotFound);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kou Really sorry.
After your confirm i will push the test code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the invalid project ID approach work with storage-testbench?
(Can we reproduce the one error element list case with storage-testbench?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will edit the comment before the commit.
plz just check the logic.
At test bench we can't check the permission error.
So i made dummy objects.
I have question.
- Is there another good approach?
- Do we have to get rid of GetFileInfoWithoutPermission test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the invalid project ID approach work with storage-testbench?
GetFileInfoWithoutPermission test is not work with storage-testbench.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Let's abandon writing a test for this case.
Could you add a comment why we don't have a test for the implementation instead of adding a test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok :) I tried to create a mock object MockListObjectsReader, but it was quite difficult for me.
Sorry about that.
Could you add a comment explaining why we don't have a test for the implementation instead of adding a test?
I will add the comment.
To recap our conversation:
- I will add a comment in the code explaining why we didn’t test an invalid project ID using the storage testbench.
- The testbench does not enforce any permission checks (ACL/IAM):
https://github.com/googleapis/storage-testbench?tab=readme-ov-file#what-is-this-testbench
- I will keep the following test case:
TEST_F(GcsIntegrationTest, GetFileInfoWithoutPermission) {
auto options = GcsOptions::Anonymous();
options.retry_limit_seconds = 15;
options.project_id = "test-only-invalid-project-id";
// Make the real GcsFileSystem.
ASSERT_OK_AND_ASSIGN(auto fs, GcsFileSystem::Make(options));
// Check FileInfo without permission.
AssertFileInfo(fs.get(), PreexistingBucketPath() + "dir/foo/dir1/",
FileType::NotFound);
AssertFileInfo(fs.get(), PreexistingBucketPath() + "dir/foo/bar.txt",
FileType::NotFound);
}
Thank you for your response even night :)
I hope you have a good weekend!!!
24f7092
to
bfd60c1
Compare
Rationale for this change
Result GetFileInfo(const GcsPath& path) method is changed.
Before this change, It return directory. However it should be return not found info.
So i added code that iterator is valid
What changes are included in this PR?
AS-IS
if (list_result.begin() != list_result.end())
TO-BE
if (list_result.begin() != list_result.end() && *list_result.begin())
Are these changes tested?
Yes I did locally.
Are there any user-facing changes?
No