**docs/source/introduction.md** (5 additions, 0 deletions)

@@ -23,6 +23,9 @@ For other models, there is comprehensive documentation to inspire upon the chang
***Latest news*** : <br>

- [coming soon] Support for more popular [models](models_coming_soon)<br>
- [01/2025] [FP8 models support](https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127): added support for inference of FP8 models.
- [01/2025] Added support for [Ibm-Granite](https://huggingface.co/ibm-granite/granite-3.1-8b-instruct)
- [11/2024] [Finite adapters support](https://github.com/quic/efficient-transformers/pull/153) allows mixed adapter usage for PEFT models.
- [11/2024] [Speculative decoding TLM](https://github.com/quic/efficient-transformers/pull/119): QEFFAutoModelForCausalLM models can be compiled to return more than one logit per decode step for the target language model (TLM).
- [11/2024] Added support for [Meta-Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), [Meta-Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) and [Meta-Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B)
@@ -31,6 +34,8 @@ For other models, there is comprehensive documentation to inspire upon the chang
<details>
<summary>More</summary>

- [01/2025] Added support for [Ibm-Granite](https://huggingface.co/ibm-granite/granite-3.1-8b-instruct)
- [01/2025] Added support for [Ibm-Granite-Guardian](https://huggingface.co/ibm-granite/granite-guardian-3.1-8b)
- [09/2024] Added support for [Gemma-2-Family](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315)<br>
- [09/2024] Added support for [CodeGemma-Family](https://huggingface.co/collections/google/codegemma-release-66152ac7b683e2667abdee11)
- [09/2024] Added support for [Gemma-Family](https://huggingface.co/collections/google/gemma-release-65d5efbccdbb8c4202ec078b)

**docs/source/quick_start.md** (25 additions, 0 deletions)

@@ -1,3 +1,4 @@
# Quick Start

QEfficient Library was designed with one goal:
@@ -8,6 +9,30 @@ To achieve this, we have 2 levels of APIs, with different levels of abstraction.
2. Python high-level APIs offer more granular control, ideal when customization is necessary.

## Supported Features

| Feature | Impact |
| --- | --- |
| Context Length Specializations (upcoming) | Increases the maximum context length that models can handle, allowing for better performance on tasks requiring long sequences of text. |
| Swift KV (upcoming) | Reduces computational overhead during inference by optimizing key-value pair processing, leading to improved throughput. |
| Block Attention (in progress) | Reduces inference latency and computational cost by dividing the context into blocks and reusing key-value states, particularly useful in RAG. |
| [Vision Language Model](QEFFAutoModelForImageTextToText) | Provides support for the AutoModelForImageTextToText class from the transformers library, enabling advanced vision-language tasks. Refer to the [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/image_text_to_text_inference.py) for more details. |
| [Speech Sequence to Sequence Model](QEFFAutoModelForSpeechSeq2Seq) | Provides support for the QEFFAutoModelForSpeechSeq2Seq class, facilitating speech-to-text sequence models. Refer to the [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/speech_to_text/run_whisper_speech_to_text.py) for more details. |
| Support for FP8 Execution | Enables execution with FP8 precision, significantly improving performance and reducing memory usage for computational tasks. |
| Prefill caching | Enhances inference speed by caching key-value pairs for shared prefixes, reducing redundant computations and improving efficiency. |
| Prompt-Lookup Decoding | Speeds up text generation by reusing overlapping parts of the input prompt and the generated text, making the process faster without losing quality. Refer to the [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/pld_spd_inference.py) for more details. |
| [PEFT LoRA support](QEffAutoPeftModelForCausalLM) | Enables parameter-efficient fine-tuning using low-rank adaptation techniques, reducing the computational and memory requirements for fine-tuning large models. Refer to the [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/peft_models.py) for more details. |
| [QNN support](#qnn-compilation) | Enables compilation using the QNN SDK, making QEfficient adaptable to additional backends in the future. |
| [Embedding model support](QEFFAutoModel) | Facilitates the generation of vector embeddings for retrieval tasks. |
| [Speculative Decoding](#draft-based-speculative-decoding) | Accelerates text generation by using a draft model to generate preliminary predictions, which are then verified by the target model, reducing latency and improving efficiency. Refer to the [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/draft_spd_inference.py) for more details. |
| [Finite lorax](QEffAutoLoraModelForCausalLM) | Users can activate multiple LoRA adapters and compile them with the base model. At runtime, they can specify which prompt should use which adapter, enabling mixed adapter usage within the same batch. Refer to the [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/lora_models.py) for more details. |
| Python and C++ Inferencing API support | Provides flexibility when running inference with QEfficient, enabling integration with various applications and improving accessibility for developers. Refer to the [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/cpp_execution/text_inference_using_cpp.py) for more details. |
| [Continuous batching](#continuous-batching) | Optimizes throughput and latency by dynamically batching requests, ensuring efficient use of computational resources. |
| AWQ and GPTQ support | Supports advanced quantization techniques, improving model efficiency and performance on AI 100. |
| Support for serving successive requests in the same session | An API that yields tokens as they are generated, facilitating seamless integration with various applications and enhancing accessibility for developers. |
| Perplexity calculation | A script for computing the perplexity of a model, allowing evaluation of model performance and comparison across different models and datasets. Refer to the [sample script](https://github.com/quic/efficient-transformers/blob/main/scripts/perplexity_computation/calculate_perplexity.py) for more details. |
| KV Heads Replication Script | A sample script for replicating key-value (KV) heads for the Llama-3-8B-Instruct model, running inference with the original model, replicating KV heads, validating changes, and exporting the modified model to ONNX format. Refer to the [sample script](https://github.com/quic/efficient-transformers/blob/main/scripts/replicate_kv_head/replicate_kv_heads.py) for more details. |
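Many of the features in this table are driven through the high-level Python classes referenced above. As a rough, non-authoritative sketch of that flow, the snippet below walks through the usual load, compile, and generate sequence with `QEFFAutoModelForCausalLM`; the checkpoint name and the exact argument names (for example `num_cores`, `prompts`, or whether `generate` takes a tokenizer) vary between releases, so treat this as an illustration and consult the linked sample scripts for the current interface.

```python
# Sketch only: typical high-level flow (load -> compile -> generate).
# Argument names such as num_cores and the tokenizer parameter are assumptions
# here and may differ between releases; see the sample scripts linked above.
from transformers import AutoTokenizer

from QEfficient import QEFFAutoModelForCausalLM

model_name = "gpt2"  # illustrative checkpoint; any supported causal LM works

# Download the Hugging Face checkpoint and apply QEfficient's model transforms.
qeff_model = QEFFAutoModelForCausalLM.from_pretrained(model_name)

# Export to ONNX and compile an inference-ready QPC binary.
qeff_model.compile(num_cores=16)

# Run text generation on the compiled model.
tokenizer = AutoTokenizer.from_pretrained(model_name)
qeff_model.generate(prompts=["Hi there!!"], tokenizer=tokenizer)
```
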
## Transformed models and QPC storage
By default, exported models and Qaic Program Container (QPC) files, the compiled and inference-ready model binaries generated by the compiler, are stored in `~/.cache/qeff_cache`. You can customize this storage path using the following environment variables:
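Purely as an illustration of overriding that default location, the sketch below sets a cache-root environment variable before importing the library. The variable name `QEFF_HOME` is an assumption in this example; substitute whichever variable the list that follows documents for your installed version.

```python
# Hypothetical illustration only: QEFF_HOME is assumed to be one of the
# documented cache-path variables; confirm the exact name for your version.
import os

# Point the export/compile cache at a larger or faster disk before importing QEfficient.
os.environ["QEFF_HOME"] = "/mnt/workspace/qeff_cache"

import QEfficient  # noqa: E402  # import after the environment variable is set
```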