-
Notifications
You must be signed in to change notification settings - Fork 500
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(services/azdls): list start from and sas token support #5242
base: main
Are you sure you want to change the base?
feat(services/azdls): list start from and sas token support #5242
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thank you @alexwilcoxson-rel for working on this. I've been considering porting logic from Hadoop to OpenDAL for a while but haven't taken action yet.
This feature is undocumented, so it's best not to enable it by default. How about adding a new flag called enable_list_with_start_after
, where we set the capability list_with_start_after
to true
?
@@ -133,6 +133,15 @@ impl AzdlsBuilder { | |||
self | |||
} | |||
|
|||
/// Set sas_token of this backend. | |||
pub fn sas_token(mut self, sas_token: &str) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reqsign has sas_token
support; we can use it here: https://docs.rs/reqsign/0.16.0/reqsign/struct.AzureStorageConfig.html#structfield.sas_token
Maybe it's worth a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting we use reqsign::AzureStorageConfig
on AzdlsBuilder
or AzdlsConfig
or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can construct like s3
does:
opendal/core/src/services/s3/backend.rs
Lines 745 to 783 in 3fe8caa
// This is our current config. | |
let mut cfg = AwsConfig::default(); | |
if !self.config.disable_config_load { | |
#[cfg(not(target_arch = "wasm32"))] | |
{ | |
cfg = cfg.from_profile(); | |
cfg = cfg.from_env(); | |
} | |
} | |
if let Some(v) = self.config.region.clone() { | |
cfg.region = Some(v); | |
} | |
if cfg.region.is_none() { | |
return Err(Error::new( | |
ErrorKind::ConfigInvalid, | |
"region is missing. Please find it by S3::detect_region() or set them in env.", | |
) | |
.with_operation("Builder::build") | |
.with_context("service", Scheme::S3)); | |
} | |
let region = cfg.region.to_owned().unwrap(); | |
debug!("backend use region: {region}"); | |
// Building endpoint. | |
let endpoint = self.build_endpoint(®ion); | |
debug!("backend use endpoint: {endpoint}"); | |
// Setting all value from user input if available. | |
if let Some(v) = self.config.access_key_id { | |
cfg.access_key_id = Some(v) | |
} | |
if let Some(v) = self.config.secret_access_key { | |
cfg.secret_access_key = Some(v) | |
} | |
if let Some(v) = self.config.session_token { | |
cfg.session_token = Some(v) | |
} |
hey @Xuanwo as I was refactoring the code I wrote a test against one of our actual storage accounts. I then executed the list call with recursive = true as the object_store integration does. However when I do that list result from azure contains everything. I'm trying to figure out how recursive could be impacting the azdls list but I don't see it used by azdls service. EDIT: ah I see the CompleteLayer, looking into this. |
Ok so I found the FlatLister does not retain the OpList args. Adding modifying FlatLister to take in the original args and pass them to the inner lister seems to work for this case, but I'm not sure how recursive listing and start_after should interact for all edge cases. The underlying Azure REST API does have a recursive parameter, but when actually using that:
|
Thank you @alexwilcoxson-rel for catching this.
It should be fine as long as we handle the capability correctly.
That seems to be a problem. Would you like to raise a separate issue for this? We can track it and switch to azdls's own |
@@ -92,7 +94,7 @@ where | |||
async fn next(&mut self) -> Result<Option<oio::Entry>> { | |||
loop { | |||
if let Some(de) = self.next_dir.take() { | |||
let (_, l) = self.acc.list(de.path(), OpList::new()).await?; | |||
let (_, l) = self.acc.list(de.path(), self.args.clone()).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a bit tricky since FlatLister
is used to simulate recursive listing on unsupported services. Therefore, we can't forward OpList
directly. Instead, let's introduce a start_after
here, and FlatLister
should ensure the OpList
passed to undering are constructed correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meaning OpList::new()
with start_after
set to a self.start_after
value? or am I misinterpreting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
meaning
OpList::new()
withstart_after
set to aself.start_after
value?
Yep.
@@ -133,6 +133,15 @@ impl AzdlsBuilder { | |||
self | |||
} | |||
|
|||
/// Set sas_token of this backend. | |||
pub fn sas_token(mut self, sas_token: &str) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can construct like s3
does:
opendal/core/src/services/s3/backend.rs
Lines 745 to 783 in 3fe8caa
// This is our current config. | |
let mut cfg = AwsConfig::default(); | |
if !self.config.disable_config_load { | |
#[cfg(not(target_arch = "wasm32"))] | |
{ | |
cfg = cfg.from_profile(); | |
cfg = cfg.from_env(); | |
} | |
} | |
if let Some(v) = self.config.region.clone() { | |
cfg.region = Some(v); | |
} | |
if cfg.region.is_none() { | |
return Err(Error::new( | |
ErrorKind::ConfigInvalid, | |
"region is missing. Please find it by S3::detect_region() or set them in env.", | |
) | |
.with_operation("Builder::build") | |
.with_context("service", Scheme::S3)); | |
} | |
let region = cfg.region.to_owned().unwrap(); | |
debug!("backend use region: {region}"); | |
// Building endpoint. | |
let endpoint = self.build_endpoint(®ion); | |
debug!("backend use endpoint: {endpoint}"); | |
// Setting all value from user input if available. | |
if let Some(v) = self.config.access_key_id { | |
cfg.access_key_id = Some(v) | |
} | |
if let Some(v) = self.config.secret_access_key { | |
cfg.secret_access_key = Some(v) | |
} | |
if let Some(v) = self.config.session_token { | |
cfg.session_token = Some(v) | |
} |
|
|
Hey @Xuanwo just letting you know I probably won't get back to this until later in the week. I will share though that I am using this in our fork of version ~0.48 I think. We're using it along size the base object_store azure store (since those have all the put support for delta), and it has gone well! Looking forward to getting this in along with the put if-not-match changes that are in progress as well! |
Which issue does this PR close?
Closes #.
Rationale for this change
This changes proposes implementation of the start_after workaround used by Hadoop to improve listing in ADLS from a particular point.
This is important for table formats like Delta Lake that make use of list_with_offset in object store to get the latest version of a table.
In my testing listing directories with > 10k records similar to what Delta Lake would do improves times from seconds to milliseconds (~3seconds to 60ms).
Also this adds SAS token support which could be split into its own PR but needed it for testing against my company's ADLS accounts.
What changes are included in this PR?
Changes to azdls module to include list with start_after functionality. This includes a module crc64 which computes crc for use in the continuation token generated for start after
Are there any user-facing changes?
Yes, start_after will function for azdls, and SAS token support.
@Xuanwo I'm opening this as a draft to get initial feedback. This is not directly supported by Microsoft, or I should say documented. In pursuit of them adding this capability to the azblob endpoints, they were the ones that directed me to Hadoop example, so I have confidence in it but could see if its something they would officially support/document.
One caveat as well is I believe (per the hadoop code) the change is will only work if hierarchical namespace (xns) is enabled which isn't something I believe we can tell when the list request is being made. It could be the user has to opt in and inform if xns is enabled. Hadoop code does another request to figure that out dynamically which I'd personally want to avoid.