
Commit 164b2d4

FSDP, PEFT and Readme adjustments
- adding FSDP checkpoint load example
- adjust the lora model merge example
- change the 7b model to use PEFT fine-tuning
- readme spelling fixes
1 parent f9ba5dd commit 164b2d4

File tree

3 files changed: +209 -43 lines changed


distributed_training/llama2/README.md

+14 -14
@@ -11,7 +11,7 @@ You can select your preferred Llama2 model size in the setup configuration betwe
 
 ## Prerequisite
 
-The key prerequisites that you would need to set tup before you can proceed to run the distributed fine-tuning process on Oracle Cloud Infrastructure Data Science Service.
+The key prerequisites that you would need to set up before you can proceed to run the distributed fine-tuning process on Oracle Cloud Infrastructure Data Science Service.
 
 * [Configure custom subnet](https://github.com/oracle-samples/oci-data-science-ai-samples/tree/main/distributed_training#1-networking) - with security list to allow ingress into any port from the IPs originating within the CIDR block of the subnet. This is to ensure that the hosts on the subnet can connect to each other during distributed training.
 * [Create an object storage bucket](https://github.com/oracle-samples/oci-data-science-ai-samples/tree/main/distributed_training#2-object-storage) - to save the fine tuned weights
@@ -90,8 +90,6 @@ spec:
 appdirs==1.4.4
 loralib==0.1.2
 bitsandbytes==0.39.1
-black==23.9.1
-'black[jupyter]'
 datasets==2.12.0
 fire==0.5.0
 'git+https://github.com/huggingface/peft.git@15a013af5ff5660b9377af24d3eee358213d72d4'
@@ -128,7 +126,7 @@ spec:
 infrastructure:
 kind: infrastructure
 spec:
-blockStorageSize: 512
+blockStorageSize: 256
 logGroupId: ocid1.loggroup.<>
 logId: ocid1.log.<>
 subnetId: ocid1.subnet.<>
@@ -148,7 +146,7 @@ spec:
 --peft_method lora
 --pure_bf16
 --mixed_precision
---batch_size_training 4
+--batch_size_training 1
 --model_name $MODEL_NAME
 --output_dir /home/datascience/outputs
 --num_epochs 1
@@ -164,8 +162,6 @@ spec:
 appdirs==1.4.4
 loralib==0.1.2
 bitsandbytes==0.39.1
-black==23.9.1
-'black[jupyter]'
 datasets==2.12.0
 fire==0.5.0
 'git+https://github.com/huggingface/peft.git@15a013af5ff5660b9377af24d3eee358213d72d4'
@@ -176,7 +172,7 @@ spec:
 scipy==1.10.0
 optimum==1.13.1
 outputDir: /home/datascience/outputs
-outputUri: oci://<bucket-for-finetuned-model>@<namespace>/$JOB_OCID
+outputUri: oci://llama2@bigdatadatasciencelarge/outputs/lvp-7b/$JOB_OCID
 env:
 - name: MODEL_NAME
 value: meta-llama/Llama-2-7b-hf
@@ -214,7 +210,7 @@ ads opctl watch <job run ocid of job-run-ocid>
 
 ### ADS Python API
 
-As we mention you could also run the distributed fine-tuning process directly via the ADS Python API. Here the examples for fine-tuning full parameters of the [7B model](https://huggingface.co/meta-llama/Llama-2-7b-hf) using [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/).
+As mentioned, you can also run the distributed fine-tuning process directly via the ADS Python API. Here is an example for fine-tuning the full parameters of the [7B model](https://huggingface.co/meta-llama/Llama-2-7b-hf) using [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/). Notice that in the following example we use `--dist_checkpoint_root_folder` and `--dist_checkpoint_folder`, as those are required when only the FSDP fine-tuning process is executed.
 
 ```python
 from ads.jobs import Job, DataScienceJob, PyTorchDistributedRuntime
@@ -245,8 +241,6 @@ job = (
 "appdirs==1.4.4",
 "loralib==0.1.2",
 "bitsandbytes==0.39.1",
-"black==23.9.1",
-"black[jupyter]",
 "datasets==2.12.0",
 "fire==0.5.0",
 "git+https://github.com/huggingface/peft.git@15a013af5ff5660b9377af24d3eee358213d72d4",
@@ -264,7 +258,6 @@ job = (
 "--enable_fsdp",
 "--pure_bf16",
 "--batch_size_training 1",
-"--micro_batch_size 1",
 "--model_name $MODEL_NAME",
 "--dist_checkpoint_root_folder /home/datascience/outputs",
 "--dist_checkpoint_folder fine-tuned"
@@ -274,7 +267,7 @@ job = (
 MODEL_NAME="meta-llama/Llama-2-7b-hf",
 HUGGING_FACE_HUB_TOKEN="<access_token>",
 LD_LIBRARY_PATH="/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/opt/conda/lib",
-OCI__METRICS_NAMESPACE="finetune_llama2_7b_hf_peft_lora"
+OCI__METRICS_NAMESPACE="finetune_llama2_7b_hf_fsdp"
 )
 )
 )
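The hunks above touch only fragments of the ADS Python API job definition. As a point of reference, here is a minimal, hedged sketch of how a job assembled that way is typically created, submitted, and monitored with oracle-ads; `job` stands for the `Job` object built in the README's full example and is not defined in this commit.

```python
# Minimal sketch, assuming `job` is the ads.jobs.Job object assembled in the README example
# and that OCI credentials/policies are already configured for the environment.
job.create()     # register the job definition with OCI Data Science
run = job.run()  # start a job run; for the distributed runtime this launches all replicas
run.watch()      # stream the job run logs until the run finishes
```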
@@ -341,11 +334,18 @@ Additionally under the OCI Monitoring Service, if you enabled the `OCI__METRICS_
 
 After the fine-tuning process is complete, to test the new model, we have to merge the weights to the base model and upload to the OCI Data Science Model Catalog.
 
+### PEFT Weights Merging
+
 1. Create a notebook session with VM.GPU.A10.2 shape or higher. Specify the object storage location where the fine-tuned weights are saved in the mount path while creating the notebook session.
 2. Upload `lora-model-merge.ipynb` notebook to the notebook session
-3. Run the notebook for verifying the fine tuned weights.
+3. Run the notebook for verifying the fine-tuned weights.
 4. The notebook also has code to upload the fine tuned model to model catalog.
 
+### FSDP Weights Merging
+
+1. Create a notebook session with VM.GPU.A10.2 shape or higher. Specify the object storage location where the fine-tuned weights are saved in the mount path while creating the notebook session.
+2. Upload `load-back-FSDP-checkpoints` notebook to the notebook session and follow the instructions.
+
 ## Deployment
 
 We recommend to use vLLM based inference container for serving the fine-tuned model. vLLM offers various optimizations for efficient usage of GPU and offers good throughput out of the box. For the deployment, use the model that was saved to the model catalog after fine tuning job.
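For orientation only, the PEFT weights-merging step described above (and adjusted in `lora-model-merge.ipynb` by this commit) boils down to attaching the LoRA adapter to the base model and folding the weights in. The sketch below uses the public `peft`/`transformers` APIs with hypothetical placeholder paths and may differ from the notebook's exact code.

```python
# Hedged sketch of a LoRA-adapter merge; the adapter and output paths are hypothetical placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_name = "meta-llama/Llama-2-7b-hf"           # base model used for fine-tuning
adapter_path = "/mnt/llama2/outputs/<job-ocid>"        # mounted bucket folder with the saved LoRA weights
merged_output = "/home/datascience/merged-llama2-7b"   # where to write the merged model

base = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, adapter_path)  # attach the LoRA adapter
model = model.merge_and_unload()                       # fold the adapter weights into the base model

model.save_pretrained(merged_output)
AutoTokenizer.from_pretrained(base_model_name).save_pretrained(merged_output)
```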
New file (+113 lines) — the `load-back-FSDP-checkpoints` notebook referenced above:

@@ -0,0 +1,113 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "58901f38",
+   "metadata": {},
+   "source": [
+    "# Loading back FSDP checkpoints\n",
+    "\n",
+    "For more information: https://github.com/facebookresearch/llama-recipes/blob/main/docs/inference.md#loading-back-fsdp-checkpoints"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "98b57d30",
+   "metadata": {},
+   "source": [
+    "## All of the code in this notebook should be run in the OCI Data Science Notebook Terminal!"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "05a59132",
+   "metadata": {},
+   "source": [
+    "Before you start, make sure that you've installed the `pytorch20_p39_gpu_v2` Conda environment and activated it in the `Terminal`\n",
+    "\n",
+    "```bash\n",
+    "odsc conda install -s pytorch20_p39_gpu_v2\n",
+    "```\n",
+    "\n",
+    "... then activate it\n",
+    "\n",
+    "```bash\n",
+    "conda activate /home/datascience/conda/pytorch20_p39_gpu_v2\n",
+    "```\n",
+    "\n",
+    "Then install all of the required dependencies\n",
+    "\n",
+    "```bash\n",
+    "pip install tokenizers==0.13.3 -U && pip install transformers -U && pip install llama-recipes==0.0.1\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6e2aecea",
+   "metadata": {},
+   "source": [
+    "The following commands also work best when you execute them in the `terminal`!\n",
+    "\n",
+    "First you have to log in to get access to the Llama2 model\n",
+    "```bash\n",
+    "huggingface-cli login\n",
+    "```\n",
+    "\n",
+    "Then run the checkpoint converter; it looks like the following\n",
+    "\n",
+    "```bash\n",
+    "python -m llama_recipes.inference.checkpoint_converter_fsdp_hf --fsdp_checkpoint_path /mnt/llama2/outputs/lvp-7b/ocid1.datasciencejob.oc1.eu-frankfurt-1.amaaaaaan/fine-tuned-meta-llama/Llama-2-7b-hf --consolidated_model_path /mnt/llama2/fsdp_consolidated_checkpoints --HF_model_path_or_name \"meta-llama/Llama-2-13b-hf\"\n",
+    "```\n",
+    "\n",
+    "Replace the `--fsdp_checkpoint_path` with the folder you specified via `--dist_checkpoint_root_folder`, which will be the location in your object storage bucket, as per the example above. Notice that we ran this in OCI Data Science Notebooks and mounted the object storage bucket used to store the FSDP checkpoints under `/mnt/llama2`. The `--consolidated_model_path` is the path where the consolidated weights will be stored. The `--HF_model_path_or_name` is the name of the model used for the fine-tuning or, if you downloaded the model locally, the location of the downloaded model.\n",
+    "\n",
+    "If the merging process was successful, you should see in your `--consolidated_model_path` folder something like this:\n",
+    "\n",
+    "```bash\n",
+    "   0 drwxr-xr-x. 1 datascience users    0 Oct 18 15:48 .\n",
+    "   0 drwxr-xr-x. 1 datascience users    0 Oct 18 14:38 ..\n",
+    " 512 -rw-r--r--. 1 datascience users   42 Oct 18 16:35 added_tokens.json\n",
+    "1.0K -rw-r--r--. 1 datascience users  656 Oct 18 16:35 config.json\n",
+    " 512 -rw-r--r--. 1 datascience users  111 Oct 18 16:35 generation_config.json\n",
+    "9.2G -rw-r--r--. 1 datascience users 9.2G Oct 18 16:35 pytorch_model-00001-of-00003.bin\n",
+    "9.3G -rw-r--r--. 1 datascience users 9.3G Oct 18 16:36 pytorch_model-00002-of-00003.bin\n",
+    "6.7G -rw-r--r--. 1 datascience users 6.7G Oct 18 16:36 pytorch_model-00003-of-00003.bin\n",
+    " 24K -rw-r--r--. 1 datascience users  24K Oct 18 16:36 pytorch_model.bin.index.json\n",
+    " 512 -rw-r--r--. 1 datascience users   72 Oct 18 16:35 special_tokens_map.json\n",
+    "1.5K -rw-r--r--. 1 datascience users 1.2K Oct 18 16:35 tokenizer_config.json\n",
+    "489K -rw-r--r--. 1 datascience users 489K Oct 18 16:35 tokenizer.model\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2407ae40",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python [conda env:pytorch20_p39_gpu_v2]",
+   "language": "python",
+   "name": "conda-env-pytorch20_p39_gpu_v2-py"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.16"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
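Once the converter has written the consolidated HF-format weights, a quick sanity check (not part of the committed notebook, shown here only as a hedged sketch) is to load the output folder with `transformers` and generate a few tokens; the path reuses the `--consolidated_model_path` from the example above, and the prompt is arbitrary.

```python
# Hedged sketch: load the consolidated HF-format checkpoint and generate a few tokens.
# Requires the accelerate package for device_map="auto".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

consolidated_path = "/mnt/llama2/fsdp_consolidated_checkpoints"  # --consolidated_model_path from above

tokenizer = AutoTokenizer.from_pretrained(consolidated_path)
model = AutoModelForCausalLM.from_pretrained(
    consolidated_path, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Oracle Cloud Infrastructure is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```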
