Commit d16ba00

GH-48260: [C++][Python][R] Move S3 bucket references to new bucket as Voltron Data ones will be removed soon (#48261)
### Rationale for this change

No more VD, no more VD S3 bucket!

### What changes are included in this PR?

Move references to the old S3 bucket to the new Arrow one, and update a few references to regions and related settings.

### Are these changes tested?

Yes, for the most part.

### Are there any user-facing changes?

No.

* GitHub Issue: #48260

Authored-by: Nic Crane <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
1 parent ab4a096 commit d16ba00
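
A minimal sketch of how the migrated data can be checked locally with pyarrow, mirroring the assertions updated in this PR (anonymous access and the us-east-1 region are taken from the updated tests; this sketch is not part of the diff):

# Sketch: anonymously list the nyc-taxi dataset in the new bucket, as the
# updated test_s3_real_aws() below does.
from pyarrow import fs

s3 = fs.S3FileSystem(anonymous=True, region="us-east-1")
entries = s3.get_file_info(fs.FileSelector("arrow-datasets/nyc-taxi"))
print(len(entries))  # expected to be non-empty once the migration is complete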

File tree: 12 files changed, +47 -47 lines

cpp/src/arrow/filesystem/s3fs_test.cc

Lines changed: 1 addition & 1 deletion
@@ -420,7 +420,7 @@ TEST_F(S3OptionsTest, FromAssumeRole) {
 class S3RegionResolutionTest : public AwsTestMixin {};
 
 TEST_F(S3RegionResolutionTest, PublicBucket) {
-  ASSERT_OK_AND_EQ("us-east-2", ResolveS3BucketRegion("voltrondata-labs-datasets"));
+  ASSERT_OK_AND_EQ("us-east-1", ResolveS3BucketRegion("arrow-datasets"));
 
   // Taken from a registry of open S3-hosted datasets
   // at https://github.com/awslabs/open-data-registry

docs/source/python/dataset.rst

Lines changed: 4 additions & 4 deletions
@@ -350,7 +350,7 @@ specifying a S3 path:
 
 .. code-block:: python
 
-    dataset = ds.dataset("s3://voltrondata-labs-datasets/nyc-taxi/")
+    dataset = ds.dataset("s3://arrow-datasets/nyc-taxi/")
 
 Typically, you will want to customize the connection parameters, and then
 a file system object can be created and passed to the ``filesystem`` keyword:
@@ -359,8 +359,8 @@ a file system object can be created and passed to the ``filesystem`` keyword:
 
     from pyarrow import fs
 
-    s3 = fs.S3FileSystem(region="us-east-2")
-    dataset = ds.dataset("voltrondata-labs-datasets/nyc-taxi/", filesystem=s3)
+    s3 = fs.S3FileSystem(region="us-east-1")
+    dataset = ds.dataset("arrow-datasets/nyc-taxi/", filesystem=s3)
 
 The currently available classes are :class:`~pyarrow.fs.S3FileSystem` and
 :class:`~pyarrow.fs.HadoopFileSystem`. See the :ref:`filesystem` docs for more
@@ -381,7 +381,7 @@ useful for testing or benchmarking.
 
     # By default, MinIO will listen for unencrypted HTTP traffic.
    minio = fs.S3FileSystem(scheme="http", endpoint_override="localhost:9000")
-    dataset = ds.dataset("voltrondata-labs-datasets/nyc-taxi/", filesystem=minio)
+    dataset = ds.dataset("arrow-datasets/nyc-taxi/", filesystem=minio)
 
 
 Working with Parquet Datasets
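
Taken together, the dataset.rst snippets above amount to the following flow; this is a sketch based on the updated docs (the anonymous flag is an assumption added here for public, credential-free reads and is not part of the documented example):

# Build an S3 filesystem explicitly and pass it to ds.dataset(), using the
# bucket path and region from the updated documentation above.
import pyarrow.dataset as ds
from pyarrow import fs

s3 = fs.S3FileSystem(region="us-east-1", anonymous=True)  # anonymous=True is an assumption
dataset = ds.dataset("arrow-datasets/nyc-taxi/", filesystem=s3)
print(dataset.schema)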

python/pyarrow/_s3fs.pyx

Lines changed: 2 additions & 2 deletions
@@ -91,8 +91,8 @@ def resolve_s3_region(bucket):
 
     Examples
     --------
-    >>> fs.resolve_s3_region('voltrondata-labs-datasets')
-    'us-east-2'
+    >>> fs.resolve_s3_region('arrow-datasets')
+    'us-east-1'
     """
     cdef:
         c_string c_bucket
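
For reference, a sketch of how the updated docstring example is exercised interactively; the expected value is the one asserted in this PR's tests, not independently verified here:

# Resolve the region of the new public bucket, matching the updated
# resolve_s3_region() docstring example.
from pyarrow import fs

print(fs.resolve_s3_region("arrow-datasets"))  # 'us-east-1' per this PR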

python/pyarrow/tests/test_fs.py

Lines changed: 11 additions & 11 deletions
@@ -1461,20 +1461,20 @@ def test_s3fs_wrong_region():
     # anonymous=True incase CI/etc has invalid credentials
     fs = S3FileSystem(region='eu-north-1', anonymous=True)
 
-    msg = ("When getting information for bucket 'voltrondata-labs-datasets': "
+    msg = ("When getting information for bucket 'arrow-datasets': "
            r"AWS Error UNKNOWN \(HTTP status 301\) during HeadBucket "
            "operation: No response body. Looks like the configured region is "
-           "'eu-north-1' while the bucket is located in 'us-east-2'."
+           "'eu-north-1' while the bucket is located in 'us-east-1'."
            "|NETWORK_CONNECTION")
     with pytest.raises(OSError, match=msg) as exc:
-        fs.get_file_info("voltrondata-labs-datasets")
+        fs.get_file_info("arrow-datasets")
 
     # Sometimes fails on unrelated network error, so next call would also fail.
     if 'NETWORK_CONNECTION' in str(exc.value):
         return
 
-    fs = S3FileSystem(region='us-east-2', anonymous=True)
-    fs.get_file_info("voltrondata-labs-datasets")
+    fs = S3FileSystem(region='us-east-1', anonymous=True)
+    fs.get_file_info("arrow-datasets")
 
 
 @pytest.mark.azure
@@ -1912,15 +1912,15 @@ def test_s3_real_aws():
     fs = S3FileSystem(anonymous=True)
     assert fs.region == default_region
 
-    fs = S3FileSystem(anonymous=True, region='us-east-2')
+    fs = S3FileSystem(anonymous=True, region='us-east-1')
     entries = fs.get_file_info(FileSelector(
-        'voltrondata-labs-datasets/nyc-taxi'))
+        'arrow-datasets/nyc-taxi'))
     assert len(entries) > 0
-    key = 'voltrondata-labs-datasets/nyc-taxi/year=2019/month=6/part-0.parquet'
+    key = 'arrow-datasets/nyc-taxi/year=2019/month=6/part-0.parquet'
     with fs.open_input_stream(key) as f:
         md = f.metadata()
         assert 'Content-Type' in md
-        assert md['Last-Modified'] == b'2022-07-12T23:32:00Z'
+        assert md['Last-Modified'] == b'2025-11-26T10:28:55Z'
         # For some reason, the header value is quoted
         # (both with AWS and Minio)
         assert md['ETag'] == b'"4c6a76826a695c6ac61592bc30cda3df-16"'
@@ -1963,7 +1963,7 @@ def test_s3_real_aws_region_selection():
 @pytest.mark.s3
 def test_resolve_s3_region():
     from pyarrow.fs import resolve_s3_region
-    assert resolve_s3_region('voltrondata-labs-datasets') == 'us-east-2'
+    assert resolve_s3_region('arrow-datasets') == 'us-east-1'
     assert resolve_s3_region('mf-nwp-models') == 'eu-west-1'
 
     with pytest.raises(ValueError, match="Not a valid bucket name"):
@@ -2120,7 +2120,7 @@ def test_s3_finalize_region_resolver():
     with pytest.raises(ValueError, match="S3 .* finalized"):
         resolve_s3_region('mf-nwp-models')
     with pytest.raises(ValueError, match="S3 .* finalized"):
-        resolve_s3_region('voltrondata-labs-datasets')
+        resolve_s3_region('arrow-datasets')
     """
     subprocess.check_call([sys.executable, "-c", code])
 
r/R/filesystem.R

Lines changed: 3 additions & 3 deletions
@@ -499,13 +499,13 @@ default_s3_options <- list(
 #' relative path. Note that this function's success does not guarantee that you
 #' are authorized to access the bucket's contents.
 #' @examplesIf FALSE
-#' bucket <- s3_bucket("voltrondata-labs-datasets")
+#' bucket <- s3_bucket("arrow-datasets")
 #'
 #' @examplesIf FALSE
 #' # Turn on debug logging. The following line of code should be run in a fresh
 #' # R session prior to any calls to `s3_bucket()` (or other S3 functions)
 #' Sys.setenv("ARROW_S3_LOG_LEVEL" = "DEBUG")
-#' bucket <- s3_bucket("voltrondata-labs-datasets")
+#' bucket <- s3_bucket("arrow-datasets")
 #'
 #' @export
 s3_bucket <- function(bucket, ...) {
@@ -541,7 +541,7 @@ s3_bucket <- function(bucket, ...) {
 #' relative path. Note that this function's success does not guarantee that you
 #' are authorized to access the bucket's contents.
 #' @examplesIf FALSE
-#' bucket <- gs_bucket("voltrondata-labs-datasets")
+#' bucket <- gs_bucket("arrow-datasets")
 #' @export
 gs_bucket <- function(bucket, ...) {
   assert_that(is.string(bucket))

r/man/gs_bucket.Rd

Lines changed: 1 addition & 1 deletion
Generated file; diff not rendered.

r/man/s3_bucket.Rd

Lines changed: 2 additions & 2 deletions
Generated file; diff not rendered.

r/tests/testthat/test-filesystem.R

Lines changed: 7 additions & 7 deletions
@@ -146,20 +146,20 @@ test_that("FileSystem$from_uri", {
   skip_on_cran()
   skip_if_not_available("s3")
   skip_if_offline()
-  fs_and_path <- FileSystem$from_uri("s3://voltrondata-labs-datasets")
+  fs_and_path <- FileSystem$from_uri("s3://arrow-datasets")
   expect_r6_class(fs_and_path$fs, "S3FileSystem")
-  expect_identical(fs_and_path$fs$region, "us-east-2")
+  expect_identical(fs_and_path$fs$region, "us-east-1")
 })
 
 test_that("SubTreeFileSystem$create() with URI", {
   skip_on_cran()
   skip_if_not_available("s3")
   skip_if_offline()
-  fs <- SubTreeFileSystem$create("s3://voltrondata-labs-datasets")
+  fs <- SubTreeFileSystem$create("s3://arrow-datasets")
   expect_r6_class(fs, "SubTreeFileSystem")
   expect_identical(
     capture.output(print(fs)),
-    "SubTreeFileSystem: s3://voltrondata-labs-datasets/"
+    "SubTreeFileSystem: s3://arrow-datasets/"
   )
 })
 
@@ -193,12 +193,12 @@ test_that("gs_bucket", {
   skip_on_cran()
   skip_if_not_available("gcs")
   skip_if_offline()
-  bucket <- gs_bucket("voltrondata-labs-datasets")
+  bucket <- gs_bucket("arrow-datasets")
   expect_r6_class(bucket, "SubTreeFileSystem")
   expect_r6_class(bucket$base_fs, "GcsFileSystem")
   expect_identical(
     capture.output(print(bucket)),
-    "SubTreeFileSystem: gs://voltrondata-labs-datasets/"
+    "SubTreeFileSystem: gs://arrow-datasets/"
   )
-  expect_identical(bucket$base_path, "voltrondata-labs-datasets/")
+  expect_identical(bucket$base_path, "arrow-datasets/")
 })

r/vignettes/arrow.Rmd

Lines changed: 1 addition & 1 deletion
@@ -178,7 +178,7 @@ To learn more about analyzing Arrow data, see the [data wrangling article](./dat
 Another use for the arrow R package is to read, write, and analyze data sets stored remotely on cloud services. The package currently supports both Amazon Simple Storage Service (S3) and Google Cloud Storage (GCS). The example below illustrates how you can use `s3_bucket()` to refer to a an S3 bucket, and use `open_dataset()` to connect to the data set stored there:
 
 ```{r, eval=FALSE}
-bucket <- s3_bucket("voltrondata-labs-datasets/nyc-taxi")
+bucket <- s3_bucket("arrow-datasets/nyc-taxi")
 nyc_taxi <- open_dataset(bucket)
 ```
 

r/vignettes/dataset.Rmd

Lines changed: 2 additions & 2 deletions
@@ -22,13 +22,13 @@ This multi-file data set is comprised of 158 distinct Parquet files, each corres
 If you have Amazon S3 support enabled in arrow (true for most users; see links at the end of this article if you need to troubleshoot this), you can connect to a copy of the "tiny taxi data" stored on S3 with this command:
 
 ```r
-bucket <- s3_bucket("voltrondata-labs-datasets/nyc-taxi-tiny")
+bucket <- s3_bucket("arrow-datasets/nyc-taxi-tiny")
 ```
 
 Alternatively you could connect to a copy of the data on Google Cloud Storage (GCS) using the following command:
 
 ```r
-bucket <- gs_bucket("voltrondata-labs-datasets/nyc-taxi-tiny", anonymous = TRUE)
+bucket <- gs_bucket("arrow-datasets/nyc-taxi-tiny", anonymous = TRUE)
 ```
 
 If you want to use the full data set, replace `nyc-taxi-tiny` with `nyc-taxi` in the code above. Apart from size -- and with it the cost in time, bandwidth usage, and CPU cycles -- there is no difference in the two versions of the data: you can test your code using the tiny taxi data and then check how it scales using the full data set.
