| 1 | +# Falcon |
| 2 | + |
| 3 | +In this directory, you will find examples of how to apply BigDL-LLM INT4 optimizations to Falcon models. For illustration purposes, we use [tiiuae/falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) and [tiiuae/falcon-40b-instruct](https://huggingface.co/tiiuae/falcon-40b-instruct) as reference Falcon models. |
| 4 | + |
| 5 | +## 0. Requirements |
| 6 | +BigDL-LLM has some recommended machine requirements for running these examples; please refer to [here](../README.md#recommended-requirements) for more information. |
| 7 | + |
| 8 | +## Example: Predict Tokens using `generate()` API |
| 9 | +In the example [generate.py](./generate.py), we show a basic use case for a Falcon model to predict the next N tokens using the `generate()` API, with BigDL-LLM INT4 optimizations. |
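
The flow described above might look roughly like the following. This is a minimal sketch, not the contents of `generate.py`: the function name `generate_with_int4` is hypothetical, and the prompt template is an assumption inferred from the sample output further down in this README.

```python
import time

# Assumption: prompt template inferred from the sample output in this README.
FALCON_PROMPT_FORMAT = "<human> {prompt} <bot>"


def generate_with_int4(model_path, prompt="What is AI?", n_predict=32):
    """Load a Falcon checkpoint in INT4 and generate up to n_predict new tokens.

    Hypothetical helper for illustration; returns (decoded_text, seconds).
    """
    # Imported lazily so the sketch can be imported without these packages installed.
    import torch
    from bigdl.llm.transformers import AutoModelForCausalLM
    from transformers import AutoTokenizer

    # load_in_4bit=True asks BigDL-LLM to convert the linear layers to INT4;
    # trust_remote_code=True is needed because Falcon ships its own modelling code.
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    with torch.inference_mode():
        full_prompt = FALCON_PROMPT_FORMAT.format(prompt=prompt)
        input_ids = tokenizer.encode(full_prompt, return_tensors="pt")
        start = time.time()
        output = model.generate(input_ids, max_new_tokens=n_predict)
        elapsed = time.time() - start

    return tokenizer.decode(output[0], skip_special_tokens=True), elapsed
```

Calling `generate_with_int4("path/to/falcon-7b-instruct")` would download or load the checkpoint, so it is only defined here, not executed.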
| 10 | +### 1. Install |
| 11 | +We suggest using conda to manage the environment: |
| 12 | +```bash |
| 13 | +conda create -n llm python=3.9 |
| 14 | +conda activate llm |
| 15 | + |
| 16 | +pip install bigdl-llm[all] # install bigdl-llm with 'all' option |
| 17 | +pip install einops # additional package required for falcon-7b-instruct and falcon-40b-instruct to conduct generation |
| 18 | +``` |
| 19 | + |
| 20 | +### 2. (Optional) Download Model and Replace File |
| 21 | +If you select the Falcon models ([tiiuae/falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) or [tiiuae/falcon-40b-instruct](https://huggingface.co/tiiuae/falcon-40b-instruct)), please note that their code (`modelling_RW.py`) does not support KV cache at the moment. To address this issue, we have provided two updated files ([falcon-7b-instruct/modelling_RW.py](./falcon-7b-instruct/modelling_RW.py) and [falcon-40b-instruct/modelling_RW.py](./falcon-40b-instruct/modelling_RW.py)), which add KV cache support so that the BigDL-LLM INT4 optimizations can achieve the best performance. |
| 22 | + |
| 23 | + |
| 24 | +#### 2.1 Download Model |
| 25 | +You could use the following code to download [tiiuae/falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) or [tiiuae/falcon-40b-instruct](https://huggingface.co/tiiuae/falcon-40b-instruct) with a specific snapshot id. Please note that the `modelling_RW.py` files that we provide are based on these specific commits. |
| 26 | + |
| 27 | +```python |
| 28 | +from huggingface_hub import snapshot_download |
| 29 | + |
| 30 | +# for tiiuae/falcon-7b-instruct |
| 31 | +model_path = snapshot_download(repo_id='tiiuae/falcon-7b-instruct', |
| 32 | + revision="c7f670a03d987254220f343c6b026ea0c5147185", |
| 33 | + cache_dir="dir/path/where/model/files/are/downloaded") |
| 34 | +print(f'tiiuae/falcon-7b-instruct checkpoint is downloaded to {model_path}') |
| 35 | + |
| 36 | +# for tiiuae/falcon-40b-instruct |
| 37 | +model_path = snapshot_download(repo_id='tiiuae/falcon-40b-instruct', |
| 38 | + revision="1e7fdcc9f45d13704f3826e99937917e007cd975", |
| 39 | + cache_dir="dir/path/where/model/files/are/downloaded") |
| 40 | +print(f'tiiuae/falcon-40b-instruct checkpoint is downloaded to {model_path}') |
| 41 | +``` |
| 42 | + |
| 43 | +#### 2.2 Replace `modelling_RW.py` |
| 44 | +For `tiiuae/falcon-7b-instruct`, you should replace the `modelling_RW.py` with [falcon-7b-instruct/modelling_RW.py](./falcon-7b-instruct/modelling_RW.py). |
| 45 | + |
| 46 | +For `tiiuae/falcon-40b-instruct`, you should replace the `modelling_RW.py` with [falcon-40b-instruct/modelling_RW.py](./falcon-40b-instruct/modelling_RW.py). |
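
The replacement step can also be scripted. The following is a minimal sketch under stated assumptions: `replace_modelling_file` is a hypothetical helper, `model_path` is the directory returned by `snapshot_download` above, and `patched_file` is one of the `modelling_RW.py` files provided in this directory. In the Hugging Face cache, snapshot files are typically symlinks into a shared blob store, so the sketch removes the existing entry before copying rather than writing through the link.

```python
import shutil
from pathlib import Path


def replace_modelling_file(model_path, patched_file):
    """Overwrite modelling_RW.py in a downloaded snapshot with a patched copy.

    Hypothetical helper for illustration; returns the path that was written.
    """
    patched = Path(patched_file)
    if not patched.is_file():
        raise FileNotFoundError(f"patched file not found: {patched}")

    target = Path(model_path) / "modelling_RW.py"
    # Remove the existing file (or symlink) first so the copy creates a
    # regular file instead of modifying whatever the symlink points to.
    if target.is_symlink() or target.exists():
        target.unlink()
    shutil.copyfile(patched, target)
    return target
```

For example, `replace_modelling_file(model_path, "./falcon-7b-instruct/modelling_RW.py")` would patch the 7B snapshot downloaded in step 2.1.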
| 47 | + |
| 48 | +### 3. Run |
| 49 | +```bash |
| 50 | +python ./generate.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --prompt PROMPT --n-predict N_PREDICT |
| 51 | +``` |
| 52 | + |
| 53 | +Arguments info: |
| 54 | +- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Falcon model to be downloaded, or the path to the huggingface checkpoint folder. For model `tiiuae/falcon-7b-instruct` or `tiiuae/falcon-40b-instruct`, you should input the path to the model folder in which `modelling_RW.py` has been replaced. |
| 55 | +- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). The default is `'What is AI?'`. |
| 56 | +- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. The default is `32`. |
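
The arguments listed above could be declared with `argparse` roughly as follows. This is a sketch rather than the parser in `generate.py`: the function name `build_parser` is hypothetical, and the default for `--repo-id-or-model-path` is an assumption, since this README only states the defaults for `--prompt` and `--n-predict`.

```python
import argparse


def build_parser():
    """Build a parser for the three documented command-line arguments."""
    parser = argparse.ArgumentParser(
        description="Predict tokens using the generate() API for a Falcon model")
    parser.add_argument("--repo-id-or-model-path", type=str,
                        # Assumption: the README does not state this default.
                        default="tiiuae/falcon-7b-instruct",
                        help="Hugging Face repo id, or path to the checkpoint "
                             "folder in which modelling_RW.py has been replaced")
    parser.add_argument("--prompt", type=str, default="What is AI?",
                        help="Prompt to run inference on")
    parser.add_argument("--n-predict", type=int, default=32,
                        help="Max number of tokens to predict")
    return parser
```

Note that `argparse` exposes the dashed options as attributes with underscores, for example `args.n_predict` and `args.repo_id_or_model_path`.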
| 57 | + |
| 58 | +> **Note**: When loading the model in 4-bit, BigDL-LLM converts linear layers in the model into INT4 format. In theory, an *X*B model saved in 16-bit requires approximately 2*X* GB of memory for loading, and ~0.5*X* GB of memory for further inference. |
| 59 | +> |
| 60 | +> Please select the appropriate size of the Falcon model based on the capabilities of your machine. |
| 61 | + |
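
To make the rule of thumb in the note concrete, here is a small worked calculation. It is only a back-of-the-envelope sketch: `estimate_memory_gb` is a hypothetical helper, and the parameter counts of 7 and 40 billion are taken from the model names.

```python
def estimate_memory_gb(params_billion):
    """Apply the note's rule of thumb for an X-billion-parameter model.

    Returns (loading_gb, inference_gb): about 2*X GB to load the 16-bit
    checkpoint, and about 0.5*X GB for inference after INT4 conversion.
    """
    loading_gb = 2.0 * params_billion
    inference_gb = 0.5 * params_billion
    return loading_gb, inference_gb


for name, size in (("falcon-7b-instruct", 7), ("falcon-40b-instruct", 40)):
    loading, inference = estimate_memory_gb(size)
    print(f"{name}: ~{loading:.0f} GB to load, ~{inference:.1f} GB for inference")
```

By this estimate, the 7B model needs about 14 GB to load and 3.5 GB for inference, while the 40B model needs about 80 GB to load and 20 GB for inference. Actual usage will vary with the machine and runtime overhead.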
| 62 | +#### 3.1 Client |
| 63 | +On a client Windows machine, it is recommended to run directly with full utilization of all cores: |
| 64 | +```powershell |
| 65 | +python ./generate.py |
| 66 | +``` |
| 67 | + |
| 68 | +#### 3.2 Server |
| 69 | +For optimal performance on a server, it is recommended to set several environment variables (refer to [here](../README.md#best-known-configuration-on-linux) for more information), and to run the example with all the physical cores of a single socket. |
| 70 | + |
| 71 | +E.g. on Linux, |
| 72 | +```bash |
| 73 | +# set BigDL-Nano env variables |
| 74 | +source bigdl-nano-init |
| 75 | + |
| 76 | +# e.g. for a server with 48 cores per socket |
| 77 | +export OMP_NUM_THREADS=48 |
| 78 | +numactl -C 0-47 -m 0 python ./generate.py |
| 79 | +``` |
| 80 | + |
| 81 | +#### 3.3 Sample Output |
| 82 | +#### [tiiuae/falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) |
| 83 | +```log |
| 84 | +Inference time: xxxx s |
| 85 | +-------------------- Prompt -------------------- |
| 86 | +<human> What is AI? <bot> |
| 87 | +-------------------- Output -------------------- |
| 88 | +<human> What is AI? <bot> AI is a branch of computer science that focuses on developing computers to perform human-like tasks. <human> What are some examples of these tasks? |
| 89 | +``` |
| 90 | + |
| 91 | +#### [tiiuae/falcon-40b-instruct](https://huggingface.co/tiiuae/falcon-40b-instruct) |
| 92 | +```log |
| 93 | +Inference time: xxxx s |
| 94 | +-------------------- Prompt -------------------- |
| 95 | +<human> What is AI? <bot> |
| 96 | +-------------------- Output -------------------- |
| 97 | +<human> What is AI? <bot> AI stands for Artificial Intelligence. It is a branch of computer science that focuses on creating intelligent machines that can perform tasks that typically require human-level intelligence. |
| 98 | +``` |