
Updating remote access notebook #291

Open · wants to merge 12 commits into main
2 changes: 1 addition & 1 deletion .github/workflows/qaqc.yaml
@@ -36,7 +36,7 @@ jobs:
check_filenames: true
check_hidden: true
skip: ".git,*.js,qaqc.yml"
ignore_words_list: hist,nd
ignore_words_list: hist,nd,fo

# borrowed from https://github.com/ProjectPythia/pythia-foundations/blob/main/.github/workflows/link-checker.yaml
- name: Disable Notebook Execution Before Linkcheck
116 changes: 93 additions & 23 deletions intermediate/remote_data/remote-data.ipynb

Going to leave some review comments on the whole notebook top to bottom rather than just the new changes since I didn't go over it carefully the first round!

  1. Consider using Jupyter Book admonitions for your notes. Because `jupyterlab-myst` is in the environment, these are rendered similarly on both the website and in JupyterLab:

> It is important to note that there are...

```{note}
there are...
```


For this first note you say "use of a file handler and a cache", but I didn't see anything about the cache.


Supported file formats by backend: It's not clear what `BufferedIOBase` and `AbstractDataStore` are and where they come from. Consider defining these in a bit more detail, or introduce them later on. What is a "buffer" vs a file?
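For what it's worth, a rough sketch of the distinction (the local path is the sample file the notebook already uses; the `.load()` is only there because the buffer closes when the `with` block ends):

```python
import xarray as xr

# A str / os.PathLike is resolved by xarray's backend machinery itself:
ds_from_path = xr.open_dataset("../../data/sst.mnmean.nc")

# A BufferedIOBase-style object is an already-open, file-like buffer; backends
# such as h5netcdf can read directly from it, which is what makes fsspec's
# remote file-like objects usable with xarray:
with open("../../data/sst.mnmean.nc", "rb") as f:  # io.BufferedReader
    ds_from_buffer = xr.open_dataset(f, engine="h5netcdf").load()
```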


"it’s really an anti pattern when we work with scientific data formats. Benchmarks show that any of the caching schemas will perform better than using the default read-ahead." -> "It's not ideal with common multidimensional binary data formats." Can you link to benchmarking results?


file = fsspec.open_local(f"simplecache::{uri}", filecache={'cache_storage': '/tmp/fsspec_cache'})

The keyword argument and URI should match here (`filecache::{uri}`). If I remember correctly, filecache exposes a bit more control and the cache persists if you, say, close a notebook and come back to it, so I recommend that! I also like using `same_names=True` so that if you're working with multiple files you can do other things with them (like open them in QGIS or other software) if you want.
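Roughly what that could look like (the cache directory is just an example path; `same_names=True` keeps the original filename in the cache so other tools can open it):

```python
import fsspec
import xarray as xr

uri = "https://its-live-data.s3-us-west-2.amazonaws.com/test-space/sample-data/sst.mnmean.nc"

# The protocol name in the chained URL and in the keyword argument now match ("filecache"),
# the cached copy persists in cache_storage between sessions, and same_names=True keeps
# the original filename instead of a hashed one.
local_path = fsspec.open_local(
    f"filecache::{uri}",
    filecache={"cache_storage": "/tmp/fsspec_cache", "same_names": True},
)
ds = xr.open_dataset(local_path, engine="h5netcdf")
```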


This new ability to keep track of cache activity is super handy! I'm confused, though: I think it's important to note that 'total requested bytes' != 'total transferred bytes'? Also, what is the cause of the cache hits and cache misses?


        <BytesCache:
            block size  :   5242880
            block count :   0
            file size   :   4024972
            cache hits  :   175
            cache misses:   2
            total requested bytes: 4024972>
         
        <BlockCache:
            block size  :   8388608
            block count :   1
            file size   :   4024972
            cache hits  :   45
            cache misses:   1
            total requested bytes: 8388608>


"to cloud storage, but using the default caching." --> remove ", but using the default caching"? since you use blockcache in the code below

@scottyhq · Jul 30, 2024

Remote data access and chunking: How about adding a tiny bit more here? For example, I think it's often a great starting point for people wanting to work with remote data to consider: 1. the total file size (`ds.nbytes`), 2. whether the file is chunked (`ds.sst.encoding`), and 3. what you want to do with it (in particular, do you need to compute the mean of all pixels or just read a single pixel?).

Maybe a specific example? If I understand correctly, let's say you want the value of a single pixel from `s3://sst.mnmean.nc` (`ds.isel(lon=0, lat=0, time=0)`). Using defaults, Xarray dispatches to h5netcdf/h5py, which tries to read one 'chunk' (1, 89, 180) containing your pixel of interest. So this request is translated by fsspec to read ~128 kB via an HTTP range request to S3, but with the default caching an additional 5 MB is read. The entire file size in this case is just 4 MB, so for efficiency, rather than fiddling with cache settings, it might be best to just use `filecache::` so that all your Xarray computations read from the local file rather than possibly reading bits and pieces over the network.
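A tiny sketch of that checklist against the sample file (the chunk shape in the comment comes from the discussion above, not from verified output):

```python
import fsspec
import xarray as xr

uri = "https://its-live-data.s3-us-west-2.amazonaws.com/test-space/sample-data/sst.mnmean.nc"
fs = fsspec.filesystem("http")

ds = xr.open_dataset(fs.open(uri), engine="h5netcdf")

print(ds.nbytes)        # 1. total in-memory size of the dataset, in bytes
print(ds.sst.encoding)  # 2. on-disk chunking, e.g. 'chunksizes': (1, 89, 180)
pixel = ds.sst.isel(time=0, lat=0, lon=0).values  # 3. the single-pixel read discussed above
```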

@@ -7,11 +7,11 @@
"source": [
"# Access Patterns to Remote Data with *fsspec*\n",
"\n",
"Accessing remote data with xarray usually means working with cloud-optimized formats like Zarr or COGs, the [CMIP6 tutorial](remote-data.ipynb) shows this pattern in detail. These formats were designed to be efficiently accessed over the internet, however in many cases we might need to access data that is not available in such formats.\n",
"Accessing remote data with xarray usually means working with cloud-optimized formats like Zarr or COGs, the [CMIP6 tutorial](https://tutorial.xarray.dev/intermediate/remote_data/cmip6-cloud.html) shows this pattern in detail. These formats were designed to be efficiently accessed over the internet, however in many cases we might need to access data that is not available in such formats.\n",
"\n",
"This notebook will explore how we can leverage xarray's backends to access remote files. For this we will make use of [`fsspec`](https://github.com/fsspec/filesystem_spec), a powerful Python library that abstracts the internal implementation of remote storage systems into a uniform API that can be used by many file-format specific libraries.\n",
"This notebook will explore how we can leverage xarray's backends to access remote files. For this we will make use of [`fsspec`](https://github.com/fsspec/filesystem_spec), a powerful Python library that abstracts the internal implementation of remote storage systems into a uniform API that can be used by many format specific libraries.\n",
"\n",
"Before starting with remote data, it may be helpful to understand how xarray handles local files and how xarray backends work. The following diagram shows the different components involved in accessing data either locally or remote using the `h5netcdf` backend which uses a format specific library to access HDF5 files.\n",
"Before starting with remote data, it may be helpful to understand how xarray handles local files and how xarray backends work. The following diagram shows the different components involved in accessing data either locally or remote using the `h5netcdf` backend with fsspec which uses a format specific library to access HDF5 files.\n",
"\n",
"![xarray-access(3)](https://gist.github.com/assets/717735/3c3c6801-11ed-43a4-98ea-636b7dd612d8)\n",
"\n",
@@ -59,7 +59,7 @@
"\n",
"\n",
"tracing_output = []\n",
"_match_pattern = \"xarray\"\n",
"_match_pattern = \"xarray/\"\n",
"\n",
"\n",
"def trace_calls(frame, event, arg):\n",
@@ -132,7 +132,7 @@
"The `open_dataset()` method is our entry point to n-dimensional data with xarray, the first argument we pass indicates what we want to open and is used by xarray to get the right backend and in turn is used by the backend to open the file locally or remote. The accepted types by xarray are:\n",
"\n",
"\n",
"* **str**: \"my-file.nc\" or \"s3:://my-zarr-store/data.zarr\"\n",
"* **str**: \"../../data/sst.mnmean.nc\" or \"s3:://my-zarr-store/data.zarr\"\n",
"* **os.PathLike**: Posix compatible path, most of the times is a Pathlib cross-OS compatible path.\n",
"* **BufferedIOBase**: some xarray backends can read data from a buffer, this is key for remote access.\n",
"* **AbstractDataStore**: This one is the generic store and backends should subclass it, if we do we can pass a \"store\" to xarray like in the case of Opendap/Pydap\n",
@@ -178,7 +178,7 @@
"id": "11",
"metadata": {},
"source": [
"xarray iterated through the registered backends and netcdf4 returned a `\"yes, I can open that extension\"` see: [netCDF4_.py#L618 ](https://github.com/pydata/xarray/blob/6c2d8c3389afe049ccbfd1393e9a81dd5c759f78/xarray/backends/netCDF4_.py#L618). However, **the backend doesn't know how to \"talk\" to a remote store** and thus it fails to open our file.\n",
"xarray iterated through the registered backends and netcdf4 returned a `\"yes, I can open that extension\"` see: [netCDF4_.py Line #L618 ](https://github.com/pydata/xarray/blob/6c2d8c3389afe049ccbfd1393e9a81dd5c759f78/xarray/backends/netCDF4_.py). However, **the backend doesn't know how to \"talk\" to a remote store** and thus it fails to open our file.\n",
"\n"
]
},
@@ -228,7 +228,9 @@
"source": [
"## Remote Access and File Caching\n",
"\n",
"When we use fsspec to abstract a remote file we are in essence translating byte requests to HTTP range requests over the internet. An HTTP request is a costly I/O operation compared to accessing a local file. Because of this, it's common that libraries that handle over the network data transfers implement a cache to avoid requesting the same data over and over. In the case of fsspec there are different ways to ask the library to handle this **caching and this is one of the most relevant performance considerations** when we work with xarray and remote data.\n",
"When we use fsspec to abstract a remote file we are in essence translating byte requests to HTTP range requests over the internet. An HTTP request is a costly I/O operation compared to accessing a local file. Because of this, it's common that libraries that handle over the network data transfers implement a cache to avoid requesting the same data over and over. In the case of fsspec there are different ways to ask the library to handle this. \n",
"\n",
"> **NOTE**: Caching is one of the most relevant performance considerations when we work with xarray and remote data.\n",
"\n",
"fsspec default cache is called `read-ahead` and as its name suggests it will read ahead of our request a fixed amount of bytes, this is good when we are working with text or tabular data but it's really an anti pattern when we work with scientific data formats. Benchmarks show that any of the caching schemas will perform better than using the default `read-ahead`.\n",
"\n",
@@ -261,11 +263,9 @@
"id": "15",
"metadata": {},
"source": [
"#### block cache + `open()`\n",
"\n",
"If our backend support reading from a buffer we can cache only the parts of the file that we are reading, this is useful but tricky. As we mentioned before fsspec default cache will request an overhead of 5MB ahead of the byte offset we request, and if we are reading small chunks from our file it will be really slow and incur in unnecessary transfers.\n",
"### `open()` remotely + caching strategies\n",
"\n",
"Let's open the same file but using the `h5netcdf` engine and we'll use a block cache strategy that stores predefined block sizes from our remote file.\n"
"If our backend support reading from a buffer we can cache only the parts of the file that we are reading, this is useful but tricky. As we mentioned before fsspec default cache will request an overhead of 5MB ahead of the byte offset we request, and if we are reading small chunks from our file it will be really slow and incur in unnecessary transfers."
]
},
{
@@ -280,36 +280,91 @@
"\n",
"fs = fsspec.filesystem('http')\n",
"\n",
"fsspec_caching = {\n",
"# Note that if we use a context, we'll close the file after the block so operations on xarray may fail if we don't load our data arrays.\n",
"with fs.open(uri) as file:\n",
" ds = xr.open_dataset(file, engine=\"h5netcdf\")\n",
" mean = ds.sst.mean()\n",
" default_cache_info = file.cache\n",
"ds"
]
},
{
"cell_type": "markdown",
"id": "17",
"metadata": {},
"source": [
"### Using one of the many fsspec caching implementations.\n",
"\n",
"The file we are working with is small and we don't really see the performance implications of using the default caching vs better caching strategies. Now we are going to open the same file using a `block cache` strategy that stores predefined block sizes from our remote file.\n",
"\n",
"> **Note**: For a list of caching implementations see [fsspec API docs](https://filesystem-spec.readthedocs.io/en/latest/api.html#read-buffering)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "18",
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"uri = \"https://its-live-data.s3-us-west-2.amazonaws.com/test-space/sample-data/sst.mnmean.nc\"\n",
"\n",
"fs = fsspec.filesystem('http')\n",
"\n",
"fsspec_better_caching = {\n",
" \"cache_type\": \"blockcache\", # block cache stores blocks of fixed size and uses eviction using a LRU strategy.\n",
" \"block_size\": 8\n",
" * 1024\n",
" * 1024, # size in bytes per block, adjust depends on the file size but the recommended size is in the MB\n",
" * 1024, # size in bytes per block, adjust depends on the file size but the recommended size range is in the MB\n",
"}\n",
"\n",
"# Note that if we use a context, we'll close the file after the block so operations on xarray may fail if we don't load our data arrays.\n",
"with fs.open(uri, **fsspec_caching) as file:\n",
" ds = xr.open_dataset(file, engine=\"h5netcdf\")\n",
" mean = ds.sst.mean()\n",
"# we are not using a context, we can use ds until we manually close it.\n",
"fo = fs.open(uri, **fsspec_better_caching)\n",
"ds = xr.open_dataset(fo, engine=\"h5netcdf\")\n",
"better_cache_info = fo.cache\n",
"ds"
]
},
{
"cell_type": "markdown",
"id": "17",
"id": "19",
"metadata": {},
"source": [
"#### Comparing performance\n",
"\n",
"Since `v2024.05` we can inspect fsspec file-like objects' cache to measure their I/O performance. We'll notice that using a different implementation (this is not read-ahead) we managed to reduce the total requested bytes and since we can control the page buffer size the total request also decreases. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20",
"metadata": {},
"outputs": [],
"source": [
"print(default_cache_info, better_cache_info)"
]
},
{
"cell_type": "markdown",
"id": "21",
"metadata": {},
"source": [
"### Reading data from cloud storage\n",
"\n",
"So far we have only used HTTP to access a remote file, however the commercial cloud has their own implementations with specific features. fsspec allows us to talk to different cloud storage implementations hiding these details from us and the libraries we use. Now we are going to access the same file using the S3 protocol. \n",
"\n",
"> Note: S3, Azure blob, etc all have their names and prefixes but under the hood they still work with the HTTP protocol.\n"
"> Note: S3, Azure blob, etc all have their names and prefixes but under the hood they still work with the HTTP protocol.\n",
"\n",
"\n",
"Let's now open our file with a backend that understands NetCDF and fsspec to abstract remote I/O to cloud storage, but using the default caching.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "18",
"id": "22",
"metadata": {},
"outputs": [],
"source": [
@@ -319,21 +374,36 @@
"# If we need to pass credentials to our remote storage we can do it here, in this case this is a public bucket\n",
"fs = fsspec.filesystem('s3', anon=True)\n",
"\n",
"fsspec_caching = {\n",
"fsspec_better_caching = {\n",
" \"cache_type\": \"blockcache\", # block cache stores blocks of fixed size and uses eviction using a LRU strategy.\n",
" \"block_size\": 8\n",
" * 1024\n",
" * 1024, # size in bytes per block, adjust depends on the file size but the recommended size is in the MB\n",
"}\n",
"\n",
"# we are not using a context, we can use ds until we manually close it.\n",
"ds = xr.open_dataset(fs.open(uri, **fsspec_caching), engine=\"h5netcdf\")\n",
"fo = fs.open(uri, **fsspec_better_caching)\n",
"ds = xr.open_dataset(fo, engine=\"h5netcdf\")\n",
"ds"
]
},
{
"cell_type": "markdown",
"id": "19",
"id": "23",
"metadata": {},
"source": [
"#### Remote data access and chunking\n",
"\n",
"One last but important consideration when we access remote data is that we should be aware of the chunking. Internal chunking of data affects how fast xarray engines can get data out of our files. If the chunk size is too small there is a considerable performance penalty, a deep dive into this topic is out of scope for this notebook but here is a list of good resources to better understand it.\n",
"\n",
"* Unidata's [Chunking Data: Why it Matters](https://www.unidata.ucar.edu/blogs/developer/entry/chunking_data_why_it_matters) article.\n",
"* [Chunking in HDF5](https://davis.lbl.gov/Manuals/HDF5-1.8.7/Advanced/Chunking/index.html)\n",
"* [HDF5 chunking tutorial](https://www.youtube.com/watch?v=0HbL-0cqkPo) "
]
},
{
"cell_type": "markdown",
"id": "24",
"metadata": {},
"source": [
"## Key Takeaways\n",