
Commit 33a4b51

New format of Documentation (#240)

New format of Documentation for inference and finetuning.

Signed-off-by: Amit Raj <[email protected]>
Signed-off-by: Amit Raj <[email protected]>
Signed-off-by: Abukhoyer Shaik <[email protected]>
Co-authored-by: Abukhoyer Shaik <[email protected]>

1 parent 34a2e1a commit 33a4b51

File tree: 7 files changed, +193 -119 lines changed


docs/index.md
Lines changed: 2 additions & 15 deletions
````diff
@@ -36,27 +36,14 @@ source/upgrade
 ```
 
 ```{toctree}
-:caption: 'Quick start'
+:caption: 'Inference on Cloud AI 100'
 :maxdepth: 4
 
 source/quick_start
-```
-
-```{toctree}
-:caption: 'Command Line Interface Use (CLI)'
-:maxdepth: 2
 source/cli_api
+source/python_api
 ```
 
-
-```{toctree}
-:caption: 'Python API'
-:maxdepth: 2
-
-source/hl_api
-source/ll_api
-
-```
 
 ```{toctree}
 :caption: 'QAIC Finetune'
````

docs/source/cli_api.md
Lines changed: 7 additions & 5 deletions
````diff
@@ -1,30 +1,32 @@
 
+# Command Line Interface Use (CLI)
+
 ```{NOTE}
 Use ``bash terminal``, else if using ``ZSH terminal`` then ``device_group`` should be in single quotes, e.g. ``'--device_group [0]'``
 ```
 
 (infer_api)=
-# `QEfficient.cloud.infer`
+## `QEfficient.cloud.infer`
 ```{eval-rst}
 .. automodule:: QEfficient.cloud.infer.main
 ```
-# `QEfficient.cloud.execute`
+## `QEfficient.cloud.execute`
 ```{eval-rst}
 .. automodule:: QEfficient.cloud.execute.main
 ```
-# `QEfficient.cloud.compile`
+## `QEfficient.cloud.compile`
 ```{eval-rst}
 .. automodule:: QEfficient.compile.compile_helper.compile
 .. code-block:: bash
 
    python -m QEfficient.cloud.compile OPTIONS
 ```
-# `QEfficient.cloud.export`
+## `QEfficient.cloud.export`
 ```{eval-rst}
 .. automodule:: QEfficient.cloud.export.main
 
 ```
-# `QEfficient.cloud.finetune`
+## `QEfficient.cloud.finetune`
 ```{eval-rst}
 .. automodule:: QEfficient.cloud.finetune.main
````
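For orientation, the `infer` entry point documented above is normally run from a shell. Below is a minimal sketch of driving it programmatically; the flags mirror the repository's README example (`--model_name`, `--num_cores`, `--device_group`, `--prompt`) and are illustrative, since the authoritative option list lives in `QEfficient.cloud.infer.main` and may differ by release.

```python
import subprocess

# Illustrative invocation of the documented CLI entry point. The flag set
# follows the repository's README example and may vary between releases.
subprocess.run(
    [
        "python", "-m", "QEfficient.cloud.infer",
        "--model_name", "gpt2",
        "--num_cores", "14",
        "--device_group", "[0]",  # quote as '--device_group [0]' when typing this in zsh
        "--prompt", "My name is",
    ],
    check=True,
)
```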

docs/source/introduction.md
Lines changed: 5 additions & 0 deletions
````diff
@@ -23,6 +23,9 @@ For other models, there is comprehensive documentation to inspire upon the chang
 ***Latest news*** : <br>
 
 - [coming soon] Support for more popular [models](models_coming_soon)<br>
+- [01/2025] [FP8 models support](https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127) Added support for inference of FP8 models.
+
+- [01/2025] Added support for [Ibm-Granite](https://huggingface.co/ibm-granite/granite-3.1-8b-instruct)
 - [11/2024] [finite adapters support](https://github.com/quic/efficient-transformers/pull/153) allows mixed adapter usage for peft models.
 - [11/2024] [Speculative decoding TLM](https://github.com/quic/efficient-transformers/pull/119) QEFFAutoModelForCausalLM model can be compiled for returning more than 1 logits during decode for TLM.
 - [11/2024] Added support for [Meta-Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), [Meta-Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) and [Meta-Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B)
@@ -31,6 +34,8 @@ For other models, there is comprehensive documentation to inspire upon the chang
 <details>
 <summary>More</summary>
 
+- [01/2025] Added support for [Ibm-Granite](https://huggingface.co/ibm-granite/granite-3.1-8b-instruct)
+- [01/2025] Added support for [Ibm-Granite-Guardian](https://huggingface.co/ibm-granite/granite-guardian-3.1-8b)
 - [09/2024] Added support for [Gemma-2-Family](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315)<br>
 - [09/2024] Added support for [CodeGemma-Family](https://huggingface.co/collections/google/codegemma-release-66152ac7b683e2667abdee11)
 - [09/2024] Added support for [Gemma-Family](https://huggingface.co/collections/google/gemma-release-65d5efbccdbb8c4202ec078b)
````

docs/source/ll_api.md
Lines changed: 0 additions & 38 deletions
This file was deleted.

docs/source/python_api.md
Lines changed: 83 additions & 8 deletions
````diff
@@ -1,34 +1,64 @@
+# Python API
+
 **This page gives you an overview of all the APIs that you might need to integrate `QEfficient` into your Python applications.**
 
-# High Level API
+## High Level API
+
+### `QEFFAutoModelForCausalLM`
 
-## `QEFFAutoModelForCausalLM`
 ```{eval-rst}
 .. autoclass:: QEfficient.transformers.models.modeling_auto.QEFFAutoModelForCausalLM
    :member-order: bysource
   :members:
 ```
-## `QEFFAutoModel`
+
+(QEFFAutoModel)=
+### `QEFFAutoModel`
+
 ```{eval-rst}
 .. autoclass:: QEfficient.transformers.models.modeling_auto.QEFFAutoModel
    :member-order: bysource
   :members:
 ```
-## `QEffAutoPeftModelForCausalLM`
+
+(QEffAutoPeftModelForCausalLM)=
+### `QEffAutoPeftModelForCausalLM`
+
 ```{eval-rst}
 .. autoclass:: QEfficient.peft.auto.QEffAutoPeftModelForCausalLM
    :member-order: bysource
   :members:
 ```
 
-## `QEffAutoLoraModelForCausalLM`
+(QEffAutoLoraModelForCausalLM)=
+### `QEffAutoLoraModelForCausalLM`
+
 ```{eval-rst}
 .. autoclass:: QEfficient.peft.lora.auto.QEffAutoLoraModelForCausalLM
    :member-order: bysource
   :members:
 ```
 
-## `export`
+(QEFFAutoModelForImageTextToText)=
+### `QEFFAutoModelForImageTextToText`
+
+```{eval-rst}
+.. autoclass:: QEfficient.transformers.models.modeling_auto.QEFFAutoModelForImageTextToText
+   :member-order: bysource
+   :members:
+```
+
+(QEFFAutoModelForSpeechSeq2Seq)=
+### `QEFFAutoModelForSpeechSeq2Seq`
+
+```{eval-rst}
+.. autoclass:: QEfficient.transformers.models.modeling_auto.QEFFAutoModelForSpeechSeq2Seq
+   :member-order: bysource
+   :members:
+```
+
+### `export`
+
 ```{eval-rst}
 .. automodule:: QEfficient.exporter.export_hf_to_cloud_ai_100
    :members:
````
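As a companion to the class listing above, here is a minimal sketch of the high-level flow these classes expose (load, compile for Cloud AI 100, generate). It assumes the `from_pretrained`/`compile`/`generate` pattern shown in the repository's examples; the model card and `num_cores` value are illustrative.

```python
from transformers import AutoTokenizer

from QEfficient import QEFFAutoModelForCausalLM

# Wrap a HuggingFace checkpoint (illustrative model card) in the QEfficient auto class.
model = QEFFAutoModelForCausalLM.from_pretrained("gpt2")

# Export to ONNX and compile a QPC binary targeting Cloud AI 100.
model.compile(num_cores=14)

# Generate on device using the stock HuggingFace tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.generate(prompts=["Hello, my name is"], tokenizer=tokenizer)
```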
````diff
@@ -37,7 +67,9 @@
 .. deprecated::
    This function will be deprecated in version 1.19, please use QEFFAutoModelForCausalLM.export instead
 ```
-## `compile`
+
+### `compile`
+
 ```{eval-rst}
 .. automodule:: QEfficient.compile.compile_helper
    :members:
````
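The deprecation notes above steer users from the module-level `export`/`compile` helpers to the methods on the auto classes. A sketch of that migration, assuming `export()` and `compile()` return the artifact paths as their docstrings suggest (the checkpoint is illustrative):

```python
from QEfficient import QEFFAutoModelForCausalLM

model = QEFFAutoModelForCausalLM.from_pretrained("gpt2")  # illustrative checkpoint

# Preferred replacements for the deprecated module-level helpers:
onnx_path = model.export()              # instead of QEfficient.exporter.export_hf_to_cloud_ai_100
qpc_path = model.compile(num_cores=14)  # instead of QEfficient.compile.compile_helper.compile
print(onnx_path, qpc_path)
```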
````diff
@@ -50,10 +82,53 @@
 .. deprecated::
    This function will be deprecated in version 1.19, please use QEFFAutoModelForCausalLM.compile instead
 ```
-## `Execute`
+
+### `Execute`
+
 ```{eval-rst}
 .. automodule:: QEfficient.generation.text_generation_inference
    :members:
   :show-inheritance:
   :exclude-members: latency_stats_bertstyle,cloud_ai_100_exec_kv_helper
 ```
+## Low Level API
+
+### `convert_to_cloud_kvstyle`
+
+```{eval-rst}
+.. automodule:: QEfficient.exporter.export_hf_to_cloud_ai_100
+   :members:
+   :show-inheritance:
+   :exclude-members: qualcomm_efficient_converter, convert_to_cloud_bertstyle
+```
+
+### `convert_to_cloud_bertstyle`
+
+```{eval-rst}
+.. automodule:: QEfficient.exporter.export_hf_to_cloud_ai_100
+   :members:
+   :show-inheritance:
+   :exclude-members: qualcomm_efficient_converter, convert_to_cloud_kvstyle
+```
+
+### `utils`
+
+```{eval-rst}
+.. automodule:: QEfficient.utils.device_utils
+   :members:
+   :show-inheritance:
+```
+
+```{eval-rst}
+.. automodule:: QEfficient.utils.generate_inputs
+   :members:
+   :undoc-members:
+   :show-inheritance:
+```
+
+```{eval-rst}
+.. automodule:: QEfficient.utils.run_utils
+   :members:
+   :undoc-members:
+   :show-inheritance:
+```
````
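The newly anchored `QEFFAutoModel` (linked from the quick-start feature table) covers embedding-style models. A sketch of the expected usage, assuming the same `from_pretrained`/`compile` flow and a `generate` call that accepts tokenized inputs, as the repository's embedding examples suggest:

```python
from transformers import AutoTokenizer

from QEfficient import QEFFAutoModel

# Illustrative sentence-embedding checkpoint.
card = "sentence-transformers/all-MiniLM-L6-v2"
model = QEFFAutoModel.from_pretrained(card)
model.compile(num_cores=14)  # compile for Cloud AI 100

# Tokenize a query and run it through the compiled model to get embeddings.
tokenizer = AutoTokenizer.from_pretrained(card)
inputs = tokenizer("My name is", return_tensors="pt")
model.generate(inputs)
```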

docs/source/quick_start.md
Lines changed: 25 additions & 0 deletions
````diff
@@ -1,3 +1,4 @@
+# Quick Start
 
 QEfficient Library was designed with one goal:
 
@@ -8,6 +9,30 @@ To achieve this, we have 2 levels of APIs, with different levels of abstraction.
 
 2. Python high level APIs offer more granular control, ideal for when customization is necessary.
 
+## Supported Features
+
+| Feature | Impact |
+| --- | --- |
+| Context Length Specializations (upcoming) | Increases the maximum context length that models can handle, allowing for better performance on tasks requiring long sequences of text. |
+| Swift KV (upcoming) | Reduces computational overhead during inference by optimizing key-value pair processing, leading to improved throughput. |
+| Block Attention (in progress) | Reduces inference latency and computational cost by dividing context into blocks and reusing key-value states, particularly useful in RAG. |
+| [Vision Language Model](QEFFAutoModelForImageTextToText) | Provides support for the AutoModelForImageTextToText class from the transformers library, enabling advanced vision-language tasks. Refer [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/image_text_to_text_inference.py) for more **details**. |
+| [Speech Sequence to Sequence Model](QEFFAutoModelForSpeechSeq2Seq) | Provides support for the QEFFAutoModelForSpeechSeq2Seq class, facilitating speech-to-text sequence models. Refer [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/speech_to_text/run_whisper_speech_to_text.py) for more **details**. |
+| Support for FP8 Execution | Enables execution with FP8 precision, significantly improving performance and reducing memory usage for computational tasks. |
+| Prefill caching | Enhances inference speed by caching key-value pairs for shared prefixes, reducing redundant computations and improving efficiency. |
+| Prompt-Lookup Decoding | Speeds up text generation by using overlapping parts of the input prompt and the generated text, making the process faster without losing quality. Refer [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/pld_spd_inference.py) for more **details**. |
+| [PEFT LoRA support](QEffAutoPeftModelForCausalLM) | Enables parameter-efficient fine-tuning using low-rank adaptation techniques, reducing the computational and memory requirements for fine-tuning large models. Refer [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/peft_models.py) for more **details**. |
+| [QNN support](#qnn-compilation) | Enables compilation using QNN SDK, making Qeff adaptable for various backends in the future. |
+| [Embedding model support](QEFFAutoModel) | Facilitates the generation of vector embeddings for retrieval tasks. |
+| [Speculative Decoding](#draft-based-speculative-decoding) | Accelerates text generation by using a draft model to generate preliminary predictions, which are then verified by the target model, reducing latency and improving efficiency. Refer [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/draft_spd_inference.py) for more **details**. |
+| [Finite lorax](QEffAutoLoraModelForCausalLM) | Users can activate multiple LoRA adapters and compile them with the base model. At runtime, they can specify which prompt should use which adapter, enabling mixed adapter usage within the same batch. Refer [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/lora_models.py) for more **details**. |
+| Python and CPP Inferencing API support | Provides flexibility while running inference with Qeff, enabling integration with various applications and improving accessibility for developers. Refer [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/cpp_execution/text_inference_using_cpp.py) for more **details**. |
+| [Continuous batching](#continuous-batching) | Optimizes throughput and latency by dynamically batching requests, ensuring efficient use of computational resources. |
+| AWQ and GPTQ support | Supports advanced quantization techniques, improving model efficiency and performance on AI 100. |
+| Support serving successive requests in same session | An API that yields tokens as they are generated, facilitating seamless integration with various applications and enhancing accessibility for developers. |
+| Perplexity calculation | A script for computing the perplexity of a model, allowing for the evaluation of model performance and comparison across different models and datasets. Refer [sample script](https://github.com/quic/efficient-transformers/blob/main/scripts/perplexity_computation/calculate_perplexity.py) for more **details**. |
+| KV Heads Replication Script | A sample script for replicating key-value (KV) heads for the Llama-3-8B-Instruct model, running inference with the original model, replicating KV heads, validating changes, and exporting the modified model to ONNX format. Refer [sample script](https://github.com/quic/efficient-transformers/blob/main/scripts/replicate_kv_head/replicate_kv_heads.py) for more **details**. |
+
 ## Transformed models and QPC storage
 
 By default, the library's exported models and Qaic Program Container (QPC) files, which are compiled, inference-ready model binaries generated by the compiler, are stored in `~/.cache/qeff_cache`. You can customize this storage path using the following environment variables:
````
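Of the features tabled above, continuous batching changes the API surface the most. A minimal sketch, assuming (as the repository's examples suggest) that continuous batching is requested at load time and the decode batch capacity is set at compile time via `full_batch_size`; the checkpoint and values are illustrative:

```python
from transformers import AutoTokenizer

from QEfficient import QEFFAutoModelForCausalLM

# Request continuous batching up front so the export carries the right specializations.
model = QEFFAutoModelForCausalLM.from_pretrained("gpt2", continuous_batching=True)

# full_batch_size caps how many sequences the decode batch holds concurrently.
model.compile(num_cores=14, full_batch_size=4)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.generate(
    prompts=["Hi there!", "How are you?", "What is AI?", "Tell me a joke."],
    tokenizer=tokenizer,
)
```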
