diff --git a/8 Model Hub: Importing models via H2O Drive.ipynb b/8 Model Hub: Importing models via H2O Drive.ipynb new file mode 100644 index 0000000..3c66a8a --- /dev/null +++ b/8 Model Hub: Importing models via H2O Drive.ipynb @@ -0,0 +1,346 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "9b2ba7f8-608f-4db4-96bd-a7bf245b9a09", + "metadata": {}, + "source": [ + "# Model Hub - Importing models via H2O Drive\n", + "\n", + "This notebook uses the H2O Drive Python Client (v4) to import a model downloaded from Hugging Face into H2O AI Cloud. Models written to H2O Model Hub can be:\n", + "- Used across the H2O AI platform\n", + "- Shared with other users and services\n", + "- Operated on via some Hugging Face libraries" + ] + }, + { + "cell_type": "markdown", + "id": "8e0d9494-8362-4bd4-9409-2f6b77e1a77b", + "metadata": {}, + "source": [ + "## Required permissions" + ] + }, + { + "cell_type": "markdown", + "id": "86358b4e-66a6-4a0a-94ba-cfd6debe0313", + "metadata": {}, + "source": [ + "As this notebook will guide us through uploading data to H2O AI Cloud, we must have the appropriate access permissions to do so.\n", + "\n", + "Unless modified, this notebook will upload a model to the \"global\" H2O Model Hub registry, backed by the H2O Drive bucket for the \"global\" H2O workspace.\n", + "\n", + "**Thus, please ensure you have the correct level of access to write to the \"global\" workspace.** Contact your H2O AI Cloud administrator with any questions." + ] + }, + { + "cell_type": "markdown", + "id": "15655546-e40f-4811-989a-60576fb54be8", + "metadata": {}, + "source": [ + "## Helpers\n", + "\n", + "In this section, we install packages and define helpers used in the rest of the notebook. Feel free to skim this section and return to it as needed.\n",
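+ "\n",
+ "For orientation: the `upload_folder()` helper defined below writes each file into the workspace's Drive bucket under the `.modelhub/data/` prefix (see `_MODELHUB_BUCKET_PREFIX`), which is the location Model Hub reads from. As a minimal sketch (assuming `with_prefix()` simply prepends that prefix to each object key), the full key for a single file looks like this:\n",
+ "\n",
+ "```python\n",
+ "# Illustrative only: the Drive key upload_folder() would produce for one file,\n",
+ "# assuming with_prefix() prepends the Model Hub prefix to each key.\n",
+ "repo_id = \"albert/albert-base-v2\"\n",
+ "revision = \"main\"\n",
+ "relative_filepath = \"config.json\"\n",
+ "\n",
+ "key = f\".modelhub/data/{repo_id}/{revision}/{relative_filepath}\"\n",
+ "print(key)  # .modelhub/data/albert/albert-base-v2/main/config.json\n",
+ "```"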
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b2ddc972-c26b-4d86-954e-c5a9734f554f", + "metadata": {}, + "outputs": [], + "source": [ + "import sys\n", + "\n", + "!{sys.executable} -m pip install -q \"h2o-drive>=4.0.0\"\n", + "!{sys.executable} -m pip install -q huggingface_hub" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fe8e4060-7cd5-4760-a9f5-e586609b2de4", + "metadata": {}, + "outputs": [], + "source": [ + "import fnmatch\n", + "import os\n", + "from typing import Sequence\n", + "\n", + "import h2o_drive\n", + "\n", + "_MODELHUB_BUCKET_PREFIX = \".modelhub/data/\"\n", + "\n", + "async def upload_folder(\n", + " bucket: h2o_drive.Bucket,\n", + " repo_id: str,\n", + " folder_path: str,\n", + " *,\n", + " revision: str = \"main\",\n", + " ignore_patterns: Sequence[str] = (),\n", + ") -> None:\n", + " # We expect the specified bucket to not be prefixed.\n", + " # For convenience, we rebase to the prefix which Model Hub reads from.\n", + " modelhub_bucket = bucket.with_prefix(_MODELHUB_BUCKET_PREFIX)\n", + "\n", + " for root, dirs, files in os.walk(folder_path):\n", + " for file in files:\n", + " # Compute file paths.\n", + " full_filepath = os.path.join(root, file)\n", + " relative_filepath = os.path.relpath(full_filepath, folder_path)\n", + "\n", + " # Skip file if it matches an ignored pattern.\n", + " if any(fnmatch.fnmatch(relative_filepath, p) for p in ignore_patterns):\n", + " continue\n", + "\n", + " # Upload to the Drive bucket under an appropriate key.\n", + " key = f\"{repo_id}/{revision}/{relative_filepath}\"\n", + " await modelhub_bucket.upload_file(full_filepath, key)\n", + "\n", + " # Log the upload.\n", + " print(f\"{relative_filepath} uploaded to Model Hub repo {repo_id}\")" + ] + }, + { + "cell_type": "markdown", + "id": "ada7b106-40d3-4838-83b8-df1920359acb", + "metadata": { + "tags": [] + }, + "source": [ + "## Download a model from Hugging Face\n", + "\n", + "In this section, we download the `albert/albert-base-v2` model from Hugging Face in preparation for uploading it to H2O AI Cloud." + ] + }, + { + "cell_type": "markdown", + "id": "3a853900-9b09-4b33-89bc-1a394ee51230", + "metadata": {}, + "source": [ + "> 💡 Tip\n", + ">\n", + "> This is just one example of how to source a Hugging Face repository. You may instead use any method of retrieving model files.\n", + ">\n", + "> See Hugging Face's how-to guide for information on other ways to download Hugging Face repository files:\n", + "> https://huggingface.co/docs/huggingface_hub/guides/download" + ] + }, + { + "cell_type": "markdown", + "id": "69f1db3f-ccdb-4fe8-b5c4-fdd46e13ad91", + "metadata": {}, + "source": [ + "Let's decide where to temporarily download the model. Change this directory as needed for your environment." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1a7ac129-148a-49ce-b50c-3cc05a36187d", + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "current_directory = os.getcwd()\n", + "\n", + "download_dir = os.path.join(current_directory, \"downloaded_model\")" + ] + }, + { + "cell_type": "markdown", + "id": "48066a24-9ace-4313-8856-d5ad57d3e72b", + "metadata": {}, + "source": [ + "We'll now use Hugging Face's `snapshot_download()` function to download the desired model repository files. 
Since we don't need every model format, we'll also declare some file patterns to ignore.\n", + "\n", + "For more information about `snapshot_download()` and its available options, see the [relevant Hugging Face docs](https://huggingface.co/docs/huggingface_hub/main/en/package_reference/file_download#huggingface_hub.snapshot_download)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f737df7f-9785-4ddd-a4cf-e533adb9504a", + "metadata": {}, + "outputs": [], + "source": [ + "import huggingface_hub as hf\n", + "\n", + "hf.snapshot_download(\n", + " repo_id=\"albert/albert-base-v2\",\n", + " local_dir=download_dir,\n", + " ignore_patterns=[\"*.msgpack\", \"*.h5\", \"*.ot\"],\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "c2b2243e-92be-4d9d-8890-916953f98bec", + "metadata": {}, + "source": [ + "## Connect to H2O Drive\n", + "\n", + "In this section, we connect to H2O Drive in preparation for uploading our model." + ] + }, + { + "cell_type": "markdown", + "id": "b31f3f71-7378-4ca7-ae83-44cb65256a79", + "metadata": {}, + "source": [ + "> 📢 Important\n", + ">\n", + "> This section assumes that an H2O AI Cloud environment can be discovered from the environment this notebook runs in.\n", + "> On local environments, this means having the H2O CLI installed and configured.\n", + ">\n", + "> For information on connecting to Drive from different environments, see the notebook tutorial titled _\"Drive - Connecting from different environments\"_.\n", + "\n", + "H2O Drive provides object storage for H2O AI Cloud. Objects in Drive can be used across the H2O AI platform and shared with other users and services.\n", + "\n", + "In order to upload our model to Drive, we first need to connect to it." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f727298c-cb91-43bd-afab-58745c90dd60", + "metadata": {}, + "outputs": [], + "source": [ + "import h2o_drive\n", + "\n", + "drive = h2o_drive.connect()" + ] + }, + { + "cell_type": "markdown", + "id": "1868b6d2-5ea5-4cd4-8939-5cb44ff33bf4", + "metadata": {}, + "source": [ + "To add a model to the \"global\" H2O Model Hub registry, we upload it to the Drive bucket for the \"global\" H2O workspace.\n", + "\n", + "Let's open that bucket now." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "853df9d2-ca2d-4199-94a0-858fabcbeefb", + "metadata": {}, + "outputs": [], + "source": [ + "bucket = drive.workspace_bucket(\"global\")" + ] + }, + { + "cell_type": "markdown", + "id": "ca3c2dc8-ba7c-46e8-8d6d-c92d48860cd4", + "metadata": {}, + "source": [ + "## Upload model\n", + "\n", + "With the model files downloaded and a connection to H2O Drive open, we're ready to upload the model to H2O AI Cloud." + ] + }, + { + "cell_type": "markdown", + "id": "7c07f3ec-68f8-4e5b-856b-346d4d8448c3", + "metadata": {}, + "source": [ + "Using the `upload_folder()` helper function defined at the top of this notebook, we will:\n", + "- Upload the model files from the local `download_dir` directory where we saved them.\n", + "- Upload the model files with the same repository ID, `albert/albert-base-v2`, as the original.\n", + "- Skip uploading files matching certain patterns (i.e., any caches that may have been created).\n",
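+ "\n",
+ "Before running the upload cell below, you can optionally preview which files will be uploaded and under which keys. The minimal sketch below mirrors the helper's logic locally without uploading anything; it reuses the `download_dir` variable defined earlier, and the `.modelhub/data/` prefix matches `_MODELHUB_BUCKET_PREFIX` from the helper.\n",
+ "\n",
+ "```python\n",
+ "import fnmatch\n",
+ "import os\n",
+ "\n",
+ "# Dry run: print the Drive keys upload_folder() would write, without uploading.\n",
+ "repo_id = \"albert/albert-base-v2\"\n",
+ "revision = \"main\"\n",
+ "ignore_patterns = [\".cache*\"]\n",
+ "\n",
+ "for root, dirs, files in os.walk(download_dir):\n",
+ "    for file in files:\n",
+ "        relative_filepath = os.path.relpath(os.path.join(root, file), download_dir)\n",
+ "        if any(fnmatch.fnmatch(relative_filepath, p) for p in ignore_patterns):\n",
+ "            print(f\"(skipped) {relative_filepath}\")\n",
+ "            continue\n",
+ "        print(f\".modelhub/data/{repo_id}/{revision}/{relative_filepath}\")\n",
+ "```"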
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "476ac002-0505-4d48-a824-6bf03a814a57", + "metadata": {}, + "outputs": [], + "source": [ + "repo_id = \"albert/albert-base-v2\"\n", + "ignore_patterns = [\".cache*\"]\n", + "\n", + "await upload_folder(\n", + " bucket=bucket,\n", + " repo_id=repo_id,\n", + " folder_path=download_dir,\n", + " ignore_patterns=ignore_patterns,\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "49d7ef24-51d3-4f31-b346-62bfbcf00af1", + "metadata": {}, + "source": [ + "🎉 That's it! The model is now uploaded to H2O Drive, where it can be used across the H2O AI platform and shared with users and services.\n", + "\n", + "As a result, the model can now also be retrieved, and operated on, via some Hugging Face libraries while being stored in H2O AI Cloud. For examples, see the notebook tutorial titled _\"Model Hub - Using Hugging Face libraries\"_." + ] + }, + { + "cell_type": "markdown", + "id": "0a10c072-5c13-4d4f-985e-ae613f7380fe", + "metadata": {}, + "source": [ + "## Clean up\n", + "\n", + "Let's clean up the temporary model files we downloaded." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1ea582be-8c9d-44a5-9e0c-fb64761e5ca2", + "metadata": {}, + "outputs": [], + "source": [ + "import shutil\n", + "\n", + "shutil.rmtree(download_dir)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.7" + }, + "toc": { + "base_numbering": 1, + "nav_menu": {}, + "number_sections": true, + "sideBar": true, + "skip_h1_title": true, + "title_cell": "Table of Contents", + "title_sidebar": "Contents", + "toc_cell": false, + "toc_position": {}, + "toc_section_display": true, + "toc_window_display": true + }, + "toc-autonumbering": false, + "toc-showcode": false, + "toc-showmarkdowntxt": false, + "toc-showtags": false + }, + "nbformat": 4, + "nbformat_minor": 5 +}