
Updating remote access notebook #291

Open · wants to merge 12 commits into main
2 changes: 1 addition & 1 deletion .github/workflows/qaqc.yaml
@@ -36,7 +36,7 @@ jobs:
check_filenames: true
check_hidden: true
skip: ".git,*.js,qaqc.yml"
ignore_words_list: hist,nd
ignore_words_list: hist,nd,fo

# borrowed from https://github.com/ProjectPythia/pythia-foundations/blob/main/.github/workflows/link-checker.yaml
- name: Disable Notebook Execution Before Linkcheck
116 changes: 93 additions & 23 deletions intermediate/remote_data/remote-data.ipynb

Going to leave some review comments on the whole notebook top to bottom rather than just the new changes since I didn't go over it carefully the first round!

  1. Consider using Jupyter Book admonitions for your notes. Because `jupyterlab-myst` is in the environment, these are rendered similarly on both the website and in JupyterLab:

> It is important to note that there are...

```{note}
there are...
```


For this first note you say "use of a file handler and a cache", but I didn't see anything about the cache.


Supported file formats by backend: It's not clear what `BufferedIOBase` and `AbstractDataStore` are and where they come from. Consider defining these in a bit more detail, or introduce them later on. What is a "buffer" vs a file?
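For what it's worth, a rough sketch of the distinction (the local path is the sample file the notebook already uses; the `.load()` is only there because the buffer closes when the `with` block ends):

```python
import xarray as xr

# A str / os.PathLike is resolved by xarray's backend machinery itself:
ds_from_path = xr.open_dataset("../../data/sst.mnmean.nc")

# A BufferedIOBase-style object is an already-open, file-like buffer; backends
# such as h5netcdf can read directly from it, which is what makes fsspec's
# remote file-like objects usable with xarray:
with open("../../data/sst.mnmean.nc", "rb") as f:  # io.BufferedReader
    ds_from_buffer = xr.open_dataset(f, engine="h5netcdf").load()
```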


"it’s really an anti pattern when we work with scientific data formats. Benchmarks show that any of the caching schemas will perform better than using the default read-ahead." -> "It's not ideal with common multidimensional binary data formats." Can you link to benchmarking results?


file = fsspec.open_local(f"simplecache::{uri}", filecache={'cache_storage': '/tmp/fsspec_cache'})

The keyword argument and URI should match here (`filecache::{uri}`). If I remember correctly, filecache exposes a bit more control and the cache persists if you, say, close a notebook and come back to it, so I recommend that! I also like using `same_names=True` so that if you're working with multiple files you can do other things with them (like open them in QGIS or other software) if you want.
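Roughly what that could look like (the cache directory is just an example path; `same_names=True` keeps the original filename in the cache so other tools can open it):

```python
import fsspec
import xarray as xr

uri = "https://its-live-data.s3-us-west-2.amazonaws.com/test-space/sample-data/sst.mnmean.nc"

# The protocol name in the chained URL and in the keyword argument now match ("filecache"),
# the cached copy persists in cache_storage between sessions, and same_names=True keeps
# the original filename instead of a hashed one.
local_path = fsspec.open_local(
    f"filecache::{uri}",
    filecache={"cache_storage": "/tmp/fsspec_cache", "same_names": True},
)
ds = xr.open_dataset(local_path, engine="h5netcdf")
```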


This new ability to keep track of cache activity is super handy! I'm confused, though: I think it's important to note that 'total requested bytes' != 'total transferred bytes'? Also, what is the cause of the cache hits and cache misses?


        <BytesCache:
            block size  :   5242880
            block count :   0
            file size   :   4024972
            cache hits  :   175
            cache misses:   2
            total requested bytes: 4024972>
         
        <BlockCache:
            block size  :   8388608
            block count :   1
            file size   :   4024972
            cache hits  :   45
            cache misses:   1
            total requested bytes: 8388608>


"to cloud storage, but using the default caching." --> remove ", but using the default caching"? since you use blockcache in the code below

@scottyhq · Jul 30, 2024

Remote data access and chunking: How about adding a tiny bit more here? For example, I think it's often a great starting point for people wanting to work with remote data to consider: 1. the total file size (`ds.nbytes`), 2. whether the file is chunked (`ds.sst.encoding`), and 3. what you want to do with it (in particular, do you need to compute the mean of all pixels or just read a single pixel?).

Maybe a specific example? If I understand correctly, let's say you want the value of a single pixel from `s3://sst.mnmean.nc` (`ds.isel(lon=0, lat=0, time=0)`). Using defaults, Xarray dispatches to h5netcdf/h5py, which tries to read one 'chunk' (1, 89, 180) containing your pixel of interest. So this request is translated by fsspec to read ~128 kB via an HTTP range request to S3, but with the default caching an additional 5 MB is read. The entire file size in this case is just 4 MB, so for efficiency, rather than fiddling with cache settings, it might be best to just use `filecache::` so that all your Xarray computations read from the local file rather than possibly reading bits and pieces over the network.
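A tiny sketch of that checklist against the sample file (the chunk shape in the comment comes from the discussion above, not from verified output):

```python
import fsspec
import xarray as xr

uri = "https://its-live-data.s3-us-west-2.amazonaws.com/test-space/sample-data/sst.mnmean.nc"
fs = fsspec.filesystem("http")

ds = xr.open_dataset(fs.open(uri), engine="h5netcdf")

print(ds.nbytes)        # 1. total in-memory size of the dataset, in bytes
print(ds.sst.encoding)  # 2. on-disk chunking, e.g. 'chunksizes': (1, 89, 180)
pixel = ds.sst.isel(time=0, lat=0, lon=0).values  # 3. the single-pixel read discussed above
```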

@@ -7,11 +7,11 @@
"source": [
"# Access Patterns to Remote Data with *fsspec*\n",
"\n",
"Accessing remote data with xarray usually means working with cloud-optimized formats like Zarr or COGs, the [CMIP6 tutorial](remote-data.ipynb) shows this pattern in detail. These formats were designed to be efficiently accessed over the internet, however in many cases we might need to access data that is not available in such formats.\n",
"Accessing remote data with xarray usually means working with cloud-optimized formats like Zarr or COGs, the [CMIP6 tutorial](https://tutorial.xarray.dev/intermediate/remote_data/cmip6-cloud.html) shows this pattern in detail. These formats were designed to be efficiently accessed over the internet, however in many cases we might need to access data that is not available in such formats.\n",
"\n",
"This notebook will explore how we can leverage xarray's backends to access remote files. For this we will make use of [`fsspec`](https://github.com/fsspec/filesystem_spec), a powerful Python library that abstracts the internal implementation of remote storage systems into a uniform API that can be used by many file-format specific libraries.\n",
"This notebook will explore how we can leverage xarray's backends to access remote files. For this we will make use of [`fsspec`](https://github.com/fsspec/filesystem_spec), a powerful Python library that abstracts the internal implementation of remote storage systems into a uniform API that can be used by many format specific libraries.\n",
"\n",
"Before starting with remote data, it may be helpful to understand how xarray handles local files and how xarray backends work. The following diagram shows the different components involved in accessing data either locally or remote using the `h5netcdf` backend which uses a format specific library to access HDF5 files.\n",
"Before starting with remote data, it may be helpful to understand how xarray handles local files and how xarray backends work. The following diagram shows the different components involved in accessing data either locally or remote using the `h5netcdf` backend with fsspec which uses a format specific library to access HDF5 files.\n",
"\n",
"![xarray-access(3)](https://gist.github.com/assets/717735/3c3c6801-11ed-43a4-98ea-636b7dd612d8)\n",
"\n",
@@ -59,7 +59,7 @@
"\n",
"\n",
"tracing_output = []\n",
"_match_pattern = \"xarray\"\n",
"_match_pattern = \"xarray/\"\n",
"\n",
"\n",
"def trace_calls(frame, event, arg):\n",
@@ -132,7 +132,7 @@
"The `open_dataset()` method is our entry point to n-dimensional data with xarray, the first argument we pass indicates what we want to open and is used by xarray to get the right backend and in turn is used by the backend to open the file locally or remote. The accepted types by xarray are:\n",
"\n",
"\n",
"* **str**: \"my-file.nc\" or \"s3:://my-zarr-store/data.zarr\"\n",
"* **str**: \"../../data/sst.mnmean.nc\" or \"s3:://my-zarr-store/data.zarr\"\n",
"* **os.PathLike**: Posix compatible path, most of the times is a Pathlib cross-OS compatible path.\n",
"* **BufferedIOBase**: some xarray backends can read data from a buffer, this is key for remote access.\n",
"* **AbstractDataStore**: This one is the generic store and backends should subclass it, if we do we can pass a \"store\" to xarray like in the case of Opendap/Pydap\n",
@@ -178,7 +178,7 @@
"id": "11",
"metadata": {},
"source": [
"xarray iterated through the registered backends and netcdf4 returned a `\"yes, I can open that extension\"` see: [netCDF4_.py#L618 ](https://github.com/pydata/xarray/blob/6c2d8c3389afe049ccbfd1393e9a81dd5c759f78/xarray/backends/netCDF4_.py#L618). However, **the backend doesn't know how to \"talk\" to a remote store** and thus it fails to open our file.\n",
"xarray iterated through the registered backends and netcdf4 returned a `\"yes, I can open that extension\"` see: [netCDF4_.py Line #L618 ](https://github.com/pydata/xarray/blob/6c2d8c3389afe049ccbfd1393e9a81dd5c759f78/xarray/backends/netCDF4_.py). However, **the backend doesn't know how to \"talk\" to a remote store** and thus it fails to open our file.\n",
"\n"
]
},
@@ -228,7 +228,9 @@
"source": [
"## Remote Access and File Caching\n",
"\n",
"When we use fsspec to abstract a remote file we are in essence translating byte requests to HTTP range requests over the internet. An HTTP request is a costly I/O operation compared to accessing a local file. Because of this, it's common that libraries that handle over the network data transfers implement a cache to avoid requesting the same data over and over. In the case of fsspec there are different ways to ask the library to handle this **caching and this is one of the most relevant performance considerations** when we work with xarray and remote data.\n",
"When we use fsspec to abstract a remote file we are in essence translating byte requests to HTTP range requests over the internet. An HTTP request is a costly I/O operation compared to accessing a local file. Because of this, it's common that libraries that handle over the network data transfers implement a cache to avoid requesting the same data over and over. In the case of fsspec there are different ways to ask the library to handle this. \n",
"\n",
"> **NOTE**: Caching is one of the most relevant performance considerations when we work with xarray and remote data.\n",
"\n",
"fsspec default cache is called `read-ahead` and as its name suggests it will read ahead of our request a fixed amount of bytes, this is good when we are working with text or tabular data but it's really an anti pattern when we work with scientific data formats. Benchmarks show that any of the caching schemas will perform better than using the default `read-ahead`.\n",
"\n",
@@ -261,11 +263,9 @@
"id": "15",
"metadata": {},
"source": [
"#### block cache + `open()`\n",
"\n",
"If our backend support reading from a buffer we can cache only the parts of the file that we are reading, this is useful but tricky. As we mentioned before fsspec default cache will request an overhead of 5MB ahead of the byte offset we request, and if we are reading small chunks from our file it will be really slow and incur in unnecessary transfers.\n",
"### `open()` remotely + caching strategies\n",
"\n",
"Let's open the same file but using the `h5netcdf` engine and we'll use a block cache strategy that stores predefined block sizes from our remote file.\n"
"If our backend support reading from a buffer we can cache only the parts of the file that we are reading, this is useful but tricky. As we mentioned before fsspec default cache will request an overhead of 5MB ahead of the byte offset we request, and if we are reading small chunks from our file it will be really slow and incur in unnecessary transfers."
]
},
{
@@ -280,36 +280,91 @@
"\n",
"fs = fsspec.filesystem('http')\n",
"\n",
"fsspec_caching = {\n",
"# Note that if we use a context, we'll close the file after the block so operations on xarray may fail if we don't load our data arrays.\n",
"with fs.open(uri) as file:\n",
" ds = xr.open_dataset(file, engine=\"h5netcdf\")\n",
" mean = ds.sst.mean()\n",
" default_cache_info = file.cache\n",
"ds"
]
},
{
"cell_type": "markdown",
"id": "17",
"metadata": {},
"source": [
"### Using one of the many fsspec caching implementations.\n",
"\n",
"The file we are working with is small and we don't really see the performance implications of using the default caching vs better caching strategies. Now we are going to open the same file using a `block cache` strategy that stores predefined block sizes from our remote file.\n",
"\n",
"> **Note**: For a list of caching implementations see [fsspec API docs](https://filesystem-spec.readthedocs.io/en/latest/api.html#read-buffering)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "18",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"uri = \"https://its-live-data.s3-us-west-2.amazonaws.com/test-space/sample-data/sst.mnmean.nc\"\n",
"\n",
"fs = fsspec.filesystem('http')\n",
"\n",
"fsspec_better_caching = {\n",
" \"cache_type\": \"blockcache\", # block cache stores blocks of fixed size and uses eviction using a LRU strategy.\n",
" \"block_size\": 8\n",
" * 1024\n",
" * 1024, # size in bytes per block, adjust depends on the file size but the recommended size is in the MB\n",
" * 1024, # size in bytes per block, adjust depends on the file size but the recommended size range is in the MB\n",
"}\n",
"\n",
"# Note that if we use a context, we'll close the file after the block so operations on xarray may fail if we don't load our data arrays.\n",
"with fs.open(uri, **fsspec_caching) as file:\n",
" ds = xr.open_dataset(file, engine=\"h5netcdf\")\n",
" mean = ds.sst.mean()\n",
"# we are not using a context, we can use ds until we manually close it.\n",
"fo = fs.open(uri, **fsspec_better_caching)\n",
"ds = xr.open_dataset(fo, engine=\"h5netcdf\")\n",
"better_cache_info = fo.cache\n",
"ds"
]
},
{
"cell_type": "markdown",
"id": "17",
"id": "19",
"metadata": {},
"source": [
"#### Comparing performance\n",
"\n",
"Since `v2024.05` we can inspect fsspec file-like objects' cache to measure their I/O performance. We'll notice that using a different implementation (this is not read-ahead) we managed to reduce the total requested bytes and since we can control the page buffer size the total request also decreases. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20",
"metadata": {},
"outputs": [],
"source": [
"print(default_cache_info, better_cache_info)"
]
},
{
"cell_type": "markdown",
"id": "21",
"metadata": {},
"source": [
"### Reading data from cloud storage\n",
"\n",
"So far we have only used HTTP to access a remote file, however the commercial cloud has their own implementations with specific features. fsspec allows us to talk to different cloud storage implementations hiding these details from us and the libraries we use. Now we are going to access the same file using the S3 protocol. \n",
"\n",
"> Note: S3, Azure blob, etc all have their names and prefixes but under the hood they still work with the HTTP protocol.\n"
"> Note: S3, Azure blob, etc all have their names and prefixes but under the hood they still work with the HTTP protocol.\n",
"\n",
"\n",
"Let's now open our file with a backend that understands NetCDF and fsspec to abstract remote I/O to cloud storage, but using the default caching.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "18",
"id": "22",
"metadata": {},
"outputs": [],
"source": [
@@ -319,21 +374,36 @@
"# If we need to pass credentials to our remote storage we can do it here, in this case this is a public bucket\n",
"fs = fsspec.filesystem('s3', anon=True)\n",
"\n",
"fsspec_caching = {\n",
"fsspec_better_caching = {\n",
" \"cache_type\": \"blockcache\", # block cache stores blocks of fixed size and uses eviction using a LRU strategy.\n",
" \"block_size\": 8\n",
" * 1024\n",
" * 1024, # size in bytes per block, adjust depends on the file size but the recommended size is in the MB\n",
"}\n",
"\n",
"# we are not using a context, we can use ds until we manually close it.\n",
"ds = xr.open_dataset(fs.open(uri, **fsspec_caching), engine=\"h5netcdf\")\n",
"fo = fs.open(uri, **fsspec_better_caching)\n",
"ds = xr.open_dataset(fo, engine=\"h5netcdf\")\n",
"ds"
]
},
{
"cell_type": "markdown",
"id": "19",
"id": "23",
"metadata": {},
"source": [
"#### Remote data access and chunking\n",
"\n",
"One last but important consideration when we access remote data is that we should be aware of the chunking. Internal chunking of data affects how fast xarray engines can get data out of our files. If the chunk size is too small there is a considerable performance penalty, a deep dive into this topic is out of scope for this notebook but here is a list of good resources to better understand it.\n",
"\n",
"* Unidata's [Chunking Data: Why it Matters](https://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_why_it_matters) article.\n",
"* [Chunking in HDF5](https://davis.lbl.gov/Manuals/HDF5-1.8.7/Advanced/Chunking/index.html)\n",
"* [HDF5 chunking tutorial](https://www.youtube.com/watch?v=0HbL-0cqkPo) "
]
},
{
"cell_type": "markdown",
"id": "24",
"metadata": {},
"source": [
"## Key Takeaways\n",