distributed_training/llama2/README.md
+14-14
@@ -11,7 +11,7 @@ You can select your preferred Llama2 model size in the setup configuration betwe
## Prerequisite
-The key prerequisites that you would need to set tup before you can proceed to run the distributed fine-tuning process on Oracle Cloud Infrastructure Data Science Service.
+The key prerequisites that you need to set up before you can proceed to run the distributed fine-tuning process on Oracle Cloud Infrastructure Data Science Service.
* [Configure custom subnet](https://github.com/oracle-samples/oci-data-science-ai-samples/tree/main/distributed_training#1-networking) - with a security list that allows ingress into any port from the IPs originating within the CIDR block of the subnet. This ensures that the hosts on the subnet can connect to each other during distributed training.
* [Create an object storage bucket](https://github.com/oracle-samples/oci-data-science-ai-samples/tree/main/distributed_training#2-object-storage) - to save the fine-tuned weights.
@@ -214,7 +210,7 @@ ads opctl watch <job run ocid of job-run-ocid>
### ADS Python API
-As we mention you could also run the distributed fine-tuning process directly via the ADS Python API. Here the examples for fine-tuning full parameters of the [7B model](https://huggingface.co/meta-llama/Llama-2-7b-hf) using [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/).
+As we mentioned, you can also run the distributed fine-tuning process directly via the ADS Python API. Here is an example of fine-tuning the full parameters of the [7B model](https://huggingface.co/meta-llama/Llama-2-7b-hf) using [FSDP](https://engineering.fb.com/2021/07/15/open-source/fsdp/). Notice that in the following example we used `--dist_checkpoint_root_folder` and `--dist_checkpoint_folder`, as those are required when only the FSDP fine-tuning process is executed.
```python
from ads.jobs import Job, DataScienceJob, PyTorchDistributedRuntime
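# --- Illustrative sketch (not part of the original example) ---
# A rough idea of how such a job definition might be assembled with the ADS
# PyTorchDistributedRuntime builder API. Every OCID, shape, bucket URI, and
# launch argument below is a placeholder, and the exact options used by this
# sample's full example may differ.

job = (
    Job(name="llama2-7b-fsdp-finetune")
    .with_infrastructure(
        DataScienceJob()
        .with_compartment_id("<compartment_ocid>")
        .with_project_id("<project_ocid>")
        .with_subnet_id("<subnet_ocid>")        # custom subnet from the prerequisites
        .with_shape_name("VM.GPU.A10.2")        # placeholder GPU shape
        .with_block_storage_size(256)
        .with_log_group_id("<log_group_ocid>")
        .with_log_id("<log_ocid>")
    )
    .with_runtime(
        PyTorchDistributedRuntime()
        .with_replica(2)                        # number of nodes participating in training
        .with_git(url="https://github.com/facebookresearch/llama-recipes.git")
        .with_dependency(pip_req="requirements.txt")
        .with_command(
            "torchrun llama_finetuning.py"
            " --enable_fsdp"
            " --model_name meta-llama/Llama-2-7b-hf"
            " --dist_checkpoint_root_folder /home/datascience/outputs"
            " --dist_checkpoint_folder fine-tuned"
        )
        .with_output("/home/datascience/outputs", "oci://<bucket>@<namespace>/outputs")
    )
)

job.create()
job_run = job.run()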
@@ -341,11 +334,18 @@ Additionally under the OCI Monitoring Service, if you enabled the `OCI__METRICS_
After the fine-tuning process is complete, to test the new model we have to merge the fine-tuned weights into the base model and upload the result to the OCI Data Science Model Catalog.
+### PEFT Weights Merging
+
1. Create a notebook session with VM.GPU.A10.2 shape or higher. Specify the object storage location where the fine-tuned weights are saved in the mount path while creating the notebook session.
2. Upload the `lora-model-merge.ipynb` notebook to the notebook session.
-3. Run the notebook for verifying the finetuned weights.
+3. Run the notebook to verify the fine-tuned weights (a conceptual sketch of the merge step follows this list).
4. The notebook also has code to upload the fine-tuned model to the model catalog.
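Conceptually, the merge that the notebook performs follows the standard PEFT pattern: load the base model, attach the fine-tuned LoRA adapter, fold the adapter weights into the base weights, and save the result. The sketch below is only an illustration with assumed placeholder paths (`/mnt/llama2/...`); the actual `lora-model-merge.ipynb` notebook may structure this differently.

```python
# Illustrative LoRA merge sketch; adjust the model name and paths to your setup.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach the fine-tuned LoRA adapter saved by the training job (placeholder path).
model = PeftModel.from_pretrained(base_model, "/mnt/llama2/peft-adapter")

# Fold the adapter weights into the base model and save the merged model.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("/mnt/llama2/merged-model")
tokenizer.save_pretrained("/mnt/llama2/merged-model")
```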
+
+### FSDP Weights Merging
+
+1. Create a notebook session with VM.GPU.A10.2 shape or higher. Specify the object storage location where the fine-tuned weights are saved in the mount path while creating the notebook session.
+2. Upload the `load-back-FSDP-checkpoints` notebook to the notebook session and follow the instructions.
+
## Deployment
We recommend using a vLLM-based inference container for serving the fine-tuned model. vLLM offers various optimizations for efficient GPU usage and provides good throughput out of the box. For the deployment, use the model that was saved to the model catalog after the fine-tuning job.
"Replace the `--fsdp_checkpoint_path` with the folder you specified by the `--dist_checkpoint_root_folder` which will be the location at your object storage bucket, as per the example above. Notice that we ran this in OCI Data Science Notebooks and mounted the object storage bucket used to store the FSDP checkpoints under `/mnt/llama2`. The `--consolidated_model_path` is the path where the consolidated weights will be stored back. The `--HF_model_path_or_name` is the name of the model used for the fine-tuning, or if you downloaded the model locally, the location of the downloaded model.\n",
64
+
"\n",
+If the merging process was successful, you should see in your `--consolidated_model_path` folder something like this: