From 77ea803396a2fb8aa14723a340ba7092ad901d98 Mon Sep 17 00:00:00 2001
From: SpeekeR
Date: Wed, 22 Jan 2025 23:10:30 +0100
Subject: [PATCH 1/5] S3 clients table and boto3 description added

---
 docs/data/storage-department.md | 92 +++++++++++++++++++++++++++++----
 1 file changed, 82 insertions(+), 10 deletions(-)

diff --git a/docs/data/storage-department.md b/docs/data/storage-department.md
index d054a7c4..546bb6de 100644
--- a/docs/data/storage-department.md
+++ b/docs/data/storage-department.md
@@ -1,26 +1,95 @@
 # Storage Department services
 
-The CESNET Storage Department provides various types of data services. It is available to all users with **MetaCentrum login and password**.
+The CESNET Storage Department provides various types of data services.
+It is available to all users with a **MetaCentrum login and password**.
 
-Storage Department data policies will be described to a certain level at this page. For more detailed information, users should however navigate the [Storage Department documentation pages](https://docs.du.cesnet.cz).
+Storage Department data policies are described on this page only to a certain level of detail.
+For more detailed information, users should consult the [Storage Department documentation pages](https://docs.du.cesnet.cz).
 
 !!! warning "Data storage technology in the Data Storage Department changed in May 2024"
     For a long time the data were stored on hierarchical storage machines ("HSM" for short) with a directory structure accessible from `/storage/du-cesnet`.
    Due to a technological innovation of the operated systems, the HSM storages were disconnected and decommissioned. User data have been transferred to [machines with Object storage technology](https://docs.du.cesnet.cz/en/object-storage-s3/s3-service).
    Object storage is the successor of HSM, with a slightly different set of commands, i.e. it **does not** work in the same way.
 
 ## Object storage
 
-S3 storage is available for all Metacentrum users. You can generate your credetials via [Gatekeeper service](https://access.du.cesnet.cz/#/). Where you will select your Metacentrum account and you should obtain your `access_key` and `secret_key`.
+S3 storage is available to all MetaCentrum users.
+You can generate your credentials via the [Gatekeeper service](https://access.du.cesnet.cz/#/).
+There you select your MetaCentrum account and obtain your `access_key` and `secret_key`.
 
 ### Simple storage - use when you simply need to store your data
 
-You can use the S3 storage as simple storage to store your data. You can use your credentials to configure some of the supported S3 clients like s3cmd, s5cmd (large datasets) and rclone. The detailed tutorial for S3 client configuration can be found in the [official Data Storage Department tutorials](https://docs.du.cesnet.cz/en/object-storage-s3/s3-clients)
+You can use the S3 storage as simple storage for your data.
+You can use your credentials to configure one of the supported S3 clients, such as s3cmd, s5cmd (large datasets) or rclone.
+A detailed tutorial on S3 client configuration can be found in the [official Data Storage Department tutorials](https://docs.du.cesnet.cz/en/object-storage-s3/s3-clients).
 
 ### Direct usage in the job file
 
-You can add s5cmd and rclone commands directly into your job file.
+You can add `s5cmd` and `rclone` commands directly into your job file.
+
 !!! warning "Bucket creation"
     Do not forget that the bucket being used for staging MUST exist on the remote S3 data storage. If you plan to stage-out your data into a non-existing bucket the job will fail. You need to prepare the bucket for stage-out in advance. You can use the [official Data Storage Department tutorials](https://docs.du.cesnet.cz/en/object-storage-s3/s3-clients) for particular S3 client.
 
+### S3 service clients
+
+| Binary          | Source code language | Library         | Console usage | Python usage | Fit for Big Data transfers |
+|-----------------|----------------------|-----------------|---------------|--------------|----------------------------|
+| aws cli         | Python               | aws cli         | Yes           | Yes          | No                         |
+| s3cmd           | Python               | s3cmd           | Yes           | Yes          | No                         |
+| s4cmd           | Python               | [boto3](#boto3) | No            | Yes          | Yes                        |
+| [s5cmd](#s5cmd) | Go                   | --- ? ---       | Yes           | No           | Yes                        |
+
+#### boto3
+
+`boto3` is a **Python** library that allows you to interact with the S3 storage.
+You have to use it from your **Python** scripts - it is not a standalone tool like `s3cmd` or `s5cmd`.
+
+##### Installation
+
+In order to use `boto3` library, you need to install it first.
+You can do it by running the following command inside your Python environment:
+
+```bash
+pip install boto3
+```
+
+##### Usage
+
+First, you need to create an instance of the `s3 client` object with your credentials and the endpoint URL:
+
+```python
+import boto3
+
+access_key = "********************"
+secret_key = "****************************************"
+endpoint_url = "https://s3.cl4.du.cesnet.cz"
+
+s3 = boto3.client("s3", aws_access_key_id=access_key, aws_secret_access_key=secret_key, endpoint_url=endpoint_url)
+```
+
+(*Side note*: You can use the `~/.aws/credentials` file to store your credentials and use them in your scripts.
+For more information, see the [official boto3 documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html).)
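+
+For illustration, a minimal sketch of that approach (assuming a `[profile-name]` section in `~/.aws/credentials`, analogous to the `s5cmd` credentials file shown further below):
+
+```python
+import boto3
+
+# Assumes a [profile-name] profile in ~/.aws/credentials with
+# aws_access_key_id and aws_secret_access_key filled in.
+session = boto3.Session(profile_name="profile-name")
+s3 = session.client("s3", endpoint_url="https://s3.cl4.du.cesnet.cz")
+```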
+ +Then you can use the `s3` object to interact with the S3 storage. +For example, you can **list all the buckets**: + +```python +response = s3.list_buckets() +for bucket in response["Buckets"]: + print(f"{bucket['Name']}") +``` + +Or you can **upload** a file to the S3 storage: + +```python +s3.upload_file("/local/path/to/file", "bucket-name", "remote/path/to/object") +``` + +Or alternatively, you can **download** an object from the S3 storage (be aware of the parameters order!): + +```python +s3.download_file("bucket-name", "remote/path/to/object", "/local/path/to/file") +``` + #### s5cmd -To use s5cmd tool (preferred) you need to create a credentials file (copy the content below) in your home dir, e.g. `/storage/brno2/home//.aws/credentials`. + +To use `s5cmd` tool you need to create a credentials file (copy the content below) in your home dir, e.g. `/storage/brno2/home//.aws/credentials`. ``` [profile-name] @@ -32,7 +101,8 @@ multipart_threshold = 128MB multipart_chunksize = 32MB ``` -Then you can continue to use `s5cmd` via commands described in [Data Storage guide](https://docs.du.cesnet.cz/en/object-storage-s3/s5cmd). Alternatively, you can directly add the following lines into your job file. +Then you can continue to use `s5cmd` via commands described in [Data Storage guide](https://docs.du.cesnet.cz/en/object-storage-s3/s5cmd). +Alternatively, you can directly add the following lines into your job file. ``` #define CREDDIR, where you stored your S3 credentials for, default is your home directory @@ -47,7 +117,9 @@ s5cmd --credentials-file "${S3CRED}" --profile profile-name --endpoint-url=https #### rclone -Alternatively, you can use rclone tool, which is less handy for large data sets. In case of large data sets (tens of terabytes) please use `s5cmd` above. For rclone you need to create a credentials file (copy the content below) in your home dir, e.g. `/storage/brno2/home//.config/rclone/rclone.conf`. +Alternatively, you can use `rclone` tool, which is less handy for large data sets. +In case of large data sets (tens of terabytes) please use `s5cmd` or `boto3`, mentioned above. +For rclone you need to create a credentials file (copy the content below) in your home dir, e.g. `/storage/brno2/home//.config/rclone/rclone.conf`. ``` [profile-name] @@ -59,7 +131,8 @@ endpoint = s3.cl4.du.cesnet.cz acl = private ``` -Then you can continue to use `rclone` via commands described in [Data Storage guide](https://docs.du.cesnet.cz/en/object-storage-s3/rclone). Or you can directly add following lines into your job file. +Then you can continue to use `rclone` via commands described in [Data Storage guide](https://docs.du.cesnet.cz/en/object-storage-s3/rclone). +Or you can directly add following lines into your job file. 
 ```
 #define CREDDIR, where you stored your S3 credentials for, default is your home directory
@@ -71,4 +144,3 @@ rclone sync --progress --fast-list --config ${S3CRED} profile-name:my-bucket/h2o
 #stage out command for rclone
 rclone sync --progress --fast-list --config ${S3CRED} ${DATADIR}/h2o.out profile-name:my-bucket/
 ```
-

From 37a138ba6d49551fd72ca9fa1c92a635df32e001 Mon Sep 17 00:00:00 2001
From: SpeekeR
Date: Tue, 4 Feb 2025 16:24:42 +0100
Subject: [PATCH 2/5] Big data transfers tips and tricks added

---
 docs/data/big-data-tips-and-tricks.md | 46 +++++++++++++++++++++++++++
 docs/data/storage-department.md       |  7 ++++
 2 files changed, 53 insertions(+)
 create mode 100644 docs/data/big-data-tips-and-tricks.md

diff --git a/docs/data/big-data-tips-and-tricks.md b/docs/data/big-data-tips-and-tricks.md
new file mode 100644
index 00000000..320b7589
--- /dev/null
+++ b/docs/data/big-data-tips-and-tricks.md
@@ -0,0 +1,46 @@
+# Big Data Tips and Tricks
+
+For **Big Data** sets, we recommend using the `boto3` library or the `s5cmd` tool.
+
+## Fewer big files are better than many small files
+
+When transferring **Big Data**, it is better to use fewer big files instead of many small files.
+
+Be aware that when you are transferring a lot of small files, the overhead of the transfer process can be significant.
+You can save time and resources by packing the small files into a single big file and transferring it as one object.
+
+## Chunk size matters
+
+When transferring big files, the upload (or download) process is divided into chunks - so-called `multipart uploads` (or downloads).
+The size of these chunks can have a significant impact on the transfer speed.
+
+The optimal chunk size depends on the size of the files you are transferring and the network conditions.
+
+There is no one-size-fits-all solution, so you should experiment with different chunk sizes to find the optimal one for your use case.
+We recommend starting with a chunk size of `file_size / 1000` (where `file_size` is the size of the file you are transferring).
+You can then adjust the chunk size based on the results of your experiments.
+
+## Cluster choice matters
+
+Some clusters offer a better `network interface` than others.
+
+When transferring big files, it is important to choose a cluster with a good network interface.
+One example is the `halmir` cluster, whose machines offer a `10 Gbps` network interface.
+
+You can check the available clusters and their network interfaces on the [official website](https://metavo.metacentrum.cz/pbsmon2/nodes/physical) of MetaCentrum.
+
+## Hard disk speed does not matter
+
+Our research has shown that the speed of the hard disk does not have a significant impact on the transfer speed.
+
+When transferring big files, the network interface is the bottleneck, not the hard disk speed.
+
+Therefore, you do not need to worry about the usage of `tmpfs` or `ramdisk` when transferring big files.
+
+## Use the right tool for the job
+
+When transferring big files, it is important to use the right tool for the job.
+
+If you are unsure which tool to use, we recommend checking the [Storage Department](storage-department.md) page with a table of S3 service clients.
+
+In short, we recommend using the `boto3` library or the `s5cmd` tool for **Big Data** transfers.
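+
+As an illustration of the first two tips above (packing many small files into one object and tuning the multipart chunk size), here is a minimal `boto3` sketch; the bucket name, profile name, paths and the chunk-size heuristic are placeholders and starting points, not fixed recommendations:
+
+```python
+import os
+import tarfile
+
+import boto3
+from boto3.s3.transfer import TransferConfig
+
+archive = "dataset.tar"
+
+# Pack many small files into a single big file and transfer it as one object.
+with tarfile.open(archive, "w") as tar:
+    tar.add("/path/to/many/small/files")
+
+# Start with a chunk size of roughly file_size / 1000 (but at least 8 MB)
+# and adjust it based on your own experiments.
+chunk_size = max(8 * 1024 * 1024, os.path.getsize(archive) // 1000)
+config = TransferConfig(multipart_threshold=chunk_size, multipart_chunksize=chunk_size)
+
+s3 = boto3.Session(profile_name="profile-name").client("s3", endpoint_url="https://s3.cl4.du.cesnet.cz")
+s3.upload_file(archive, "my-bucket", archive, Config=config)
+```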
diff --git a/docs/data/storage-department.md b/docs/data/storage-department.md index 546bb6de..8bdd8695 100644 --- a/docs/data/storage-department.md +++ b/docs/data/storage-department.md @@ -26,6 +26,11 @@ You can add `s5cmd` and `rclone` commands directly into your job file. !!! warning "Bucket creation" Do not forget that the bucket being used for staging MUST exist on the remote S3 data storage. If you plan to stage-out your data into a non-existing bucket the job will fail. You need to prepare the bucket for stage-out in advance. You can use the [official Data Storage Department tutorials](https://docs.du.cesnet.cz/en/object-storage-s3/s3-clients) for particular S3 client. +### Big Data transfers + +For **Big Data** sets, we recommend using the `boto3` library or `s5cmd` tool. + +For general tips and tricks regarding **Big Data** and **CESNET S3 storage**, please visit the [Big Data Tips and Tricks](big-data-tips-and-tricks.md) page. ### S3 service clients | Binary | Source code language | Library | Console usage | Python usage | Fit for Big Data transfers | @@ -35,6 +40,8 @@ You can add `s5cmd` and `rclone` commands directly into your job file. | s4cmd | Python | [boto3](#boto3) | No | Yes | Yes | | [s5cmd](#s5cmd) | Go | --- ? --- | Yes | No | Yes | +For further details and more information about all the possible S3 clients, please refer to the [official Data Storage Department tutorials](https://du.cesnet.cz/en/navody/object_storage/cesnet_s3/start). + #### boto3 `boto3` is a **Python** library that allows you to interact with the S3 storage. From f36eeae8678557559050e25b54dab608358d15ec Mon Sep 17 00:00:00 2001 From: SpeekeR Date: Tue, 4 Feb 2025 20:51:35 +0100 Subject: [PATCH 3/5] boto3 recent upload issues as a warning --- docs/data/storage-department.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/docs/data/storage-department.md b/docs/data/storage-department.md index 8bdd8695..7f203cd5 100644 --- a/docs/data/storage-department.md +++ b/docs/data/storage-department.md @@ -47,6 +47,11 @@ For further details and more information about all the possible S3 clients, plea `boto3` is a **Python** library that allows you to interact with the S3 storage. You have to use it from your **Python** scripts - it is not a standalone tool like `s3cmd` or `s5cmd`. +[//]: # (TODO: Remove this warning when the issue is resolved.) + +!!! warning "Issues with uploads within the latest boto3 versions" + There have been some issues lately with the latest releases of the `boto3` library and third-party S3 storage providers (like the CESNET S3 storage).
As of now, the issues still persist with versions `1.36.x` (`1.36.12` to be exact). For this reason, we recommend using the `boto3` library with version `1.35.99` or lower.
For further information, see this [GitHub issue](https://github.com/boto/boto3/issues/4400). + ##### Installation In order to use `boto3` library, you need to install it first. @@ -56,6 +61,14 @@ You can do it by running the following command inside your Python environment: pip install boto3 ``` +[//]: # (TODO: Remove this installation instruction when the issue is resolved.) + +Or you can use the specific version of the library: + +```bash +pip install boto3==1.35.99 +``` + ##### Usage First, you need to create an instance of the `s3 client` object with your credentials and the endpoint URL: From fa913f08e7a14d5d11a1fbd582369ecd9cf48872 Mon Sep 17 00:00:00 2001 From: SpeekeR Date: Wed, 5 Feb 2025 00:26:09 +0100 Subject: [PATCH 4/5] compression tip and information added --- docs/data/big-data-tips-and-tricks.md | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/docs/data/big-data-tips-and-tricks.md b/docs/data/big-data-tips-and-tricks.md index 320b7589..be702366 100644 --- a/docs/data/big-data-tips-and-tricks.md +++ b/docs/data/big-data-tips-and-tricks.md @@ -37,6 +37,18 @@ When transferring big files, the network interface is the bottleneck, not the ha Therefore, you do not need to worry about the usage of `tmpfs` or `ramdisk` when transferring big files. +## Utilize compression + +When transferring big files, it is a good idea to utilize compression. + +You can compress the files before transferring them, effectively reducing the time and resources needed for the transfer. + +Choice of the compression algorithm depends on the type of the files you are transferring, there is no one-size-fits-all solution. +We recommend using the `zstandard` algorithm, as it offers a good balance between compression ratio and decompression speed. +Depending on the type of your files, you can also consider using the `gzip`, `bzip2`, or `xz` algorithms. + +For more information about the compression algorithms, please check this [comparison](https://quixdb.github.io/squash-benchmark/). + ## Use the right tool for the job When transferring big files, it is important to use the right tool for the job. From 3f2a89dff1e142de4a52716317d0ff031d09b5f1 Mon Sep 17 00:00:00 2001 From: SpeekeR Date: Wed, 19 Feb 2025 13:11:42 +0100 Subject: [PATCH 5/5] Changes based on PR review --- docs/data/storage-department.md | 62 ++------------------------------- 1 file changed, 2 insertions(+), 60 deletions(-) diff --git a/docs/data/storage-department.md b/docs/data/storage-department.md index 7f203cd5..24162038 100644 --- a/docs/data/storage-department.md +++ b/docs/data/storage-department.md @@ -40,72 +40,14 @@ For general tips and tricks regarding **Big Data** and **CESNET S3 storage**, pl | s4cmd | Python | [boto3](#boto3) | No | Yes | Yes | | [s5cmd](#s5cmd) | Go | --- ? --- | Yes | No | Yes | -For further details and more information about all the possible S3 clients, please refer to the [official Data Storage Department tutorials](https://du.cesnet.cz/en/navody/object_storage/cesnet_s3/start). +For further details and more information about all the possible S3 clients, please refer to the [official Data Storage Department tutorials](https://docs.du.cesnet.cz/en/docs/object-storage-s3/s3-service). #### boto3 `boto3` is a **Python** library that allows you to interact with the S3 storage. You have to use it from your **Python** scripts - it is not a standalone tool like `s3cmd` or `s5cmd`. -[//]: # (TODO: Remove this warning when the issue is resolved.) - -!!! 
warning "Issues with uploads within the latest boto3 versions"
-    There have been some issues lately with the latest releases of the `boto3` library and third-party S3 storage providers (like the CESNET S3 storage).
-    As of now, the issues still persist with versions `1.36.x` (`1.36.12` to be exact). For this reason, we recommend using the `boto3` library with the version `1.35.99` or lower.
For further information, see this [GitHub issue](https://github.com/boto/boto3/issues/4400). - -##### Installation - -In order to use `boto3` library, you need to install it first. -You can do it by running the following command inside your Python environment: - -```bash -pip install boto3 -``` - -[//]: # (TODO: Remove this installation instruction when the issue is resolved.) - -Or you can use the specific version of the library: - -```bash -pip install boto3==1.35.99 -``` - -##### Usage - -First, you need to create an instance of the `s3 client` object with your credentials and the endpoint URL: - -```python -import boto3 - -access_key = "********************" -secret_key = "****************************************" -endpoint_url = "https://s3.cl4.du.cesnet.cz" - -s3 = boto3.client("s3", aws_access_key_id=access_key, aws_secret_access_key=secret_key, endpoint_url=endpoint_url) -``` - -(*Side note*: You can use the `~/.aws/credentials` file to store your credentials and use them in your scripts. -For more information, see the [official boto3 documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html).) - -Then you can use the `s3` object to interact with the S3 storage. -For example, you can **list all the buckets**: - -```python -response = s3.list_buckets() -for bucket in response["Buckets"]: - print(f"{bucket['Name']}") -``` - -Or you can **upload** a file to the S3 storage: - -```python -s3.upload_file("/local/path/to/file", "bucket-name", "remote/path/to/object") -``` - -Or alternatively, you can **download** an object from the S3 storage (be aware of the parameters order!): - -```python -s3.download_file("bucket-name", "remote/path/to/object", "/local/path/to/file") -``` +For more details and information about `boto3`, please check the [Data Storage guide](https://docs.du.cesnet.cz/en/docs/object-storage-s3/boto3). #### s5cmd