When using the Azure backend, the model should be specified as `model = "azure::<model_deployment_name>"`. For example, if you created a _gpt-3.5_ deployment under the name _my-model_, you should use `model = "azure::my-model"`.
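For illustration, a minimal sketch of pointing an estimator at such a deployment (the `ZeroShotGPTClassifier` import path is an assumption here; any scikit-llm estimator that accepts a `model` argument can be used the same way, and `my-model` is the hypothetical deployment name from above):

```python
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier

# "azure::my-model" routes requests to the Azure deployment named "my-model".
X = ["The product arrived broken.", "Absolutely love it!"]
y = ["negative", "positive"]

clf = ZeroShotGPTClassifier(model="azure::my-model")
clf.fit(X, y)
print(clf.predict(["Works exactly as expected."]))
```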
### GGUF
GGUF is an open-source binary file format designed for storing quantized model weights as well as high-level model configurations. GGUF is primarily used with the [Llama CPP](https://github.com/ggerganov/llama.cpp) project, but can also be loaded by some other runtimes.
To use GGUF models with scikit-llm, llama.cpp and its Python bindings have to be installed first.
The installation command slightly varies depending on your hardware.
For all of the models, the quantized version is used. The precision is indicated by a suffix in the name (e.g. `q4` stands for 4-bit quantization). By default, we choose models with 4-bit quantization, but might include models with lower or higher precision as well (for models with a higher or lower number of parameters, respectively). When picking a model for your use case, the following rule of thumb can be applied:
- q < 4 : Substantial performance loss, low size
- q = 4 : Optimal trade-off between the loss and the size
- q = 8 : Almost no performance loss, very large size

In addition, there exist several quantization schemes of the same precision (e.g. for q4 these can be Q4_0, Q4_K_S, Q4_K_M, etc.). To keep it simpler for users, we omit this information from the model name and select a single sub-type which we consider to be the most optimal.
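As a rough back-of-the-envelope illustration of what these precision levels mean for file size (the 7B parameter count is just an example, and the estimate ignores per-block scales and other GGUF overhead):

```python
# Approximate weight storage for a 7B-parameter model at different precisions
# (16 bit corresponds to the unquantized half-precision weights).
params = 7_000_000_000

for bits in (2, 4, 8, 16):
    size_gb = params * bits / 8 / 1e9
    print(f"{bits}-bit: ~{size_gb:.1f} GB")
# 2-bit: ~1.8 GB, 4-bit: ~3.5 GB, 8-bit: ~7.0 GB, 16-bit: ~14.0 GB
```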
#### GPU acceleration
GGUF models can be offloaded to a GPU (either fully or partially).
The following command specifies the maximum number of GPU layers:
```python
from skllm.config import SKLLMConfig

SKLLMConfig.set_gguf_max_gpu_layers(-1)
```
- 0 : all layers on the CPU
- -1 : all layers on the GPU
- n>0 : n layers on the GPU, the remaining ones on the CPU (see the example below)
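For example, to keep only part of the network on the GPU (a sketch; 20 is an arbitrary layer count, and the useful value depends on the model size and the available VRAM):

```python
from skllm.config import SKLLMConfig

# Place 20 layers on the GPU and keep the remaining layers on the CPU.
SKLLMConfig.set_gguf_max_gpu_layers(20)
```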
Note that changing the configuration does not reload the model automatically (even if a new estimator is created afterwards). The models can be unloaded from memory as follows:
```python
from skllm.llm.gpt.clients.llama_cpp.handler import ModelCache

ModelCache.clear()
```
This command can also be handy when experimenting with different models in an interactive environment like a Jupyter notebook, as the models remain in memory until the process terminates.
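Putting the two snippets together, one possible workflow for switching the GPU configuration in an interactive session (a sketch only, under the assumption that a cleared model is loaded again the next time an estimator needs it):

```python
from skllm.config import SKLLMConfig
from skllm.llm.gpt.clients.llama_cpp.handler import ModelCache

# Drop any model that is already cached, since the new setting is not
# applied to models loaded before the configuration change.
ModelCache.clear()

# Offload all layers to the GPU (-1 = all layers, 0 = CPU only).
SKLLMConfig.set_gguf_max_gpu_layers(-1)

# Estimators created or used from this point on will load the model
# with the updated number of GPU layers.
```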
### Custom URL
Additionally, for tuning LLMs in Vertex, it is required to have 64 TPU v3 pod cores, which can be requested as follows:
1. Go to [Quotas](https://cloud.google.com/docs/quota/view-manage#requesting_higher_quota) and filter them for “Restricted image training TPU V3 pod cores per region”.
2. Select “europe-west4” region (currently this is the only supported region).
3. Click on “Edit Quotas”, set the limit to 64 and submit the request.

The request should be approved within a few hours, but it might take up to several days.
## Third party integrations
- [scikit-ollama](https://github.com/AndreasKarasenko/scikit-ollama): Scikit-Ollama provides scikit-llm estimators that allow using self-hosted LLMs through [Ollama](https://ollama.com/).