|
| 1 | +- Proposal Name: `read_returns_metadata` |
| 2 | +- Start Date: 2025-03-24 |
| 3 | +- RFC PR: [apache/opendal#5871](https://github.com/apache/opendal/pull/5871) |
| 4 | +- Tracking Issue: [apache/opendal#5872](https://github.com/apache/opendal/issues/5872) |
| 5 | + |
| 6 | +# Summary |
| 7 | + |
| 8 | +Enhance read operations by returning metadata along with the data. |
| 9 | + |
| 10 | +# Motivation |
| 11 | + |
| 12 | +Currently, read operations (`read`, `read_with`, `reader`, `reader_with`) only return the data content. Users who need metadata |
| 13 | +during reads (like `Content-Type`, `ETag`, `version_id`, etc.) must make an additional `stat()` call. This is inefficient and |
| 14 | +can lead to race conditions if the file is modified between the read and stat operations. |
| 15 | + |
| 16 | +Many storage services (like S3, GCS, Azure Blob) return metadata in their read responses. For example, S3's GetObject API returns |
| 17 | +important metadata like `ContentType`, `ETag`, `VersionId`, `LastModified`, etc. We should expose this information to users |
| 18 | +directly during read operations. |
| 19 | + |
| 20 | +# Guide-level explanation |
| 21 | + |
| 22 | +For the `reader` API, we will introduce a new method `metadata()` that returns the object's metadata: |
| 23 | + |
| 24 | +```rust |
| 25 | +// Before |
| 26 | +let data = op.reader("path/to/file").await?.read(..).await?; |
| 27 | +let meta = op.stat("path/to/file").await?; |
| 28 | +if let Some(etag) = meta.etag() { |
| 29 | +    println!("ETag: {}", etag); |
| 30 | +} |
| 31 | + |
| 32 | +// After |
| 33 | +let reader = op.reader("path/to/file").await?; |
| 34 | +let meta = reader.metadata(); |
| 35 | +if let Some(etag) = meta.etag() { |
| 35 | +    println!("ETag: {}", etag); |
| 37 | +} |
| 38 | +let data = reader.read(..).await?; |
| 39 | +``` |
| 40 | +The new API will be provided alongside existing functionality, allowing users to continue using current `reader` methods without modification. |
| 41 | + |
| 42 | +For backward compatibility and to minimize migration costs, we won't change the existing `read` API. Anyone who wants |
| 43 | +to obtain metadata during reading can use the new reader operations instead. |
| 44 | + |
| 45 | +# Reference-level explanation |
| 46 | + |
| 47 | +## Changes to `Reader` API |
| 48 | + |
| 49 | +The `Reader` implementation will gain a new method, `metadata()`, that returns the metadata. |
| 50 | + |
| 51 | +```rust |
| 52 | +impl Reader { |
| 53 | +    // Existing methods... |
| 54 | + |
| 55 | +    pub fn metadata(&self) -> &Metadata { todo!() } // signature sketch; body intentionally left out |
| 56 | +} |
| 57 | +``` |
| 58 | + |
| 59 | +## Changes to struct `raw::RpRead` |
| 60 | + |
| 61 | +The `raw::RpRead` struct will be modified to include a new field `metadata` that stores the metadata returned by the read operation. |
| 62 | +Existing fields will be evaluated and potentially removed if they become redundant. |
| 63 | + |
| 64 | +```rust |
| 65 | +pub struct RpRead { |
| 66 | +    // New field to store metadata |
| 67 | +    metadata: Metadata, |
| 68 | +} |
| 69 | +``` |
| 70 | + |
| 71 | + |
| 72 | +## Implementation Details |
| 73 | + |
| 74 | +For services that return metadata in their read responses: |
| 75 | +- The metadata will be captured from the service response. |
| 76 | +- All available fields (`content_type`, `etag`, `version_id`, `last_modified`, etc.) will be populated. |
| 77 | + |
| 78 | +For services that don't return metadata in read responses: |
| 79 | +- We'll make an additional `stat` call to fetch the metadata and populate the `metadata` field in `raw::RpRead`. |
| 80 | + |
| 81 | +Special considerations: |
| 82 | +- We should always return the total object size in the metadata, even if it's not part of the read response. |
| 83 | +- For range reads, the metadata should reflect the full object's properties (like total size) rather than those of the requested range. |
| 84 | +- For versioned objects, the metadata should include version information if available. |
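
The rules above can be sketched roughly as follows. This is a minimal illustration, not the planned implementation: the `Metadata` struct here is a simplified stand-in for OpenDAL's own type, and `resolve_metadata` with its `stat` callback is a hypothetical helper introduced only for this example. It also assumes that a missing total size is filled in via `stat`, which the RFC does not spell out.

```rust
/// Hypothetical, simplified stand-in for OpenDAL's `Metadata`.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct Metadata {
    /// Total size of the object, not the size of the requested range.
    pub content_length: Option<u64>,
    pub etag: Option<String>,
    pub version_id: Option<String>,
}

/// Picks the metadata to store in the read reply (hypothetical helper).
///
/// `from_response` is whatever the service returned with the read, if anything.
/// `stat` is only invoked when the response alone cannot supply the total size.
pub fn resolve_metadata<F>(from_response: Option<Metadata>, stat: F) -> Metadata
where
    F: FnOnce() -> Metadata,
{
    match from_response {
        // The read response already carries the total size: no extra call.
        Some(meta) if meta.content_length.is_some() => meta,
        // Partial metadata: keep what the read returned, fill the size from `stat`.
        Some(mut meta) => {
            meta.content_length = stat().content_length;
            meta
        }
        // The service returns no metadata on read: fall back to `stat`.
        None => stat(),
    }
}
```

Passing the `stat` step as a closure keeps the extra request lazy, so services that already return complete metadata never pay for it.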
| 85 | + |
| 86 | +# Drawbacks |
| 87 | + |
| 88 | +- Additional memory overhead for storing metadata during reads |
| 89 | +- Potential complexity in handling metadata for range reads |
| 90 | + |
| 91 | +# Rationale and alternatives |
| 92 | + |
| 93 | +- Maintains full backward compatibility with existing read operations |
| 94 | +- Improves performance by avoiding additional stat calls |
| 95 | +- Aligns with common storage service APIs (S3, GCS, Azure) |
| 96 | + |
| 97 | +# Prior art |
| 98 | + |
| 99 | +Similar patterns exist in other storage SDKs: |
| 100 | + |
| 101 | +- The `object_store` crate returns metadata in `GetResult` after calling `get_opts` |
| 102 | +- AWS S3 SDK returns comprehensive metadata in `GetObjectOutput` |
| 103 | +- Azure Blob SDK returns properties and metadata in `DownloadResponse` |
| 104 | + |
| 105 | +# Unresolved questions |
| 106 | + |
| 107 | +None |
| 108 | + |
| 109 | +# Future possibilities |
| 110 | + |
| 111 | +- Once we return metadata during reader initialization, we can optimize `ReadContext::parse_into_range` by using the |
| 112 | +`content_length` from `metadata` directly, eliminating the need for an additional `stat` call. |
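
To illustrate the idea, here is a rough sketch of resolving a range once the total size is already known. `resolve_range` is a hypothetical helper written for this example and does not mirror the actual `ReadContext::parse_into_range` signature; it only shows that an open-ended range needs the object size, which could come from the metadata instead of a separate `stat` call.

```rust
use std::ops::{Bound, Range, RangeBounds};

/// Hypothetical helper: turns any range expression into a concrete `start..end`,
/// using an already-known total object size for open-ended ranges.
pub fn resolve_range(range: impl RangeBounds<u64>, content_length: u64) -> Range<u64> {
    let start = match range.start_bound() {
        Bound::Included(&n) => n,
        Bound::Excluded(&n) => n.saturating_add(1),
        Bound::Unbounded => 0,
    };
    let end = match range.end_bound() {
        Bound::Included(&n) => n.saturating_add(1),
        Bound::Excluded(&n) => n,
        // Open-ended range: this is the only case that needs the total size.
        Bound::Unbounded => content_length,
    };
    start..end
}
```

Only the `Bound::Unbounded` end case depends on `content_length`, so bounded ranges would behave the same with or without the metadata.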