**docs/source/introduction.md** (5 additions, 0 deletions)

@@ -23,6 +23,9 @@ For other models, there is comprehensive documentation to inspire upon the chang
***Latest news*** : <br>

- [coming soon] Support for more popular [models](models_coming_soon)<br>
- [01/2025] [FP8 models support](https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127): added support for inference of FP8 models.
- [01/2025] Added support for [Ibm-Granite](https://huggingface.co/ibm-granite/granite-3.1-8b-instruct)
- [11/2024] [Finite adapters support](https://github.com/quic/efficient-transformers/pull/153) allows mixed adapter usage for PEFT models.
- [11/2024] [Speculative decoding TLM](https://github.com/quic/efficient-transformers/pull/119): QEFFAutoModelForCausalLM models can be compiled to return more than one logit per decode step for the target language model (TLM).
- [11/2024] Added support for [Meta-Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct), [Meta-Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) and [Meta-Llama-3.2-3B](https://huggingface.co/meta-llama/Llama-3.2-3B)
@@ -31,6 +34,8 @@ For other models, there is comprehensive documentation to inspire upon the chang
<details>
<summary>More</summary>

- [01/2025] Added support for [Ibm-Granite](https://huggingface.co/ibm-granite/granite-3.1-8b-instruct)
- [01/2025] Added support for [Ibm-Granite-Guardian](https://huggingface.co/ibm-granite/granite-guardian-3.1-8b)
- [09/2024] Added support for [Gemma-2-Family](https://huggingface.co/collections/google/gemma-2-release-667d6600fd5220e7b967f315)<br>
- [09/2024] Added support for [CodeGemma-Family](https://huggingface.co/collections/google/codegemma-release-66152ac7b683e2667abdee11)
- [09/2024] Added support for [Gemma-Family](https://huggingface.co/collections/google/gemma-release-65d5efbccdbb8c4202ec078b)

**docs/source/quick_start.md** (25 additions, 0 deletions)

@@ -1,3 +1,4 @@
# Quick Start

QEfficient Library was designed with one goal:
@@ -8,6 +9,30 @@ To achieve this, we have 2 levels of APIs, with different levels of abstraction.
2. Python high-level APIs offer more granular control, ideal when customization is necessary.

## Supported Features

| Feature | Impact |
| --- | --- |
| Context Length Specializations (upcoming) | Increases the maximum context length that models can handle, allowing for better performance on tasks requiring long sequences of text. |
| Swift KV (upcoming) | Reduces computational overhead during inference by optimizing key-value pair processing, leading to improved throughput. |
| Block Attention (in progress) | Reduces inference latency and computational cost by dividing the context into blocks and reusing key-value states, particularly useful in RAG. |
| [Vision Language Model](QEFFAutoModelForImageTextToText) | Provides support for the AutoModelForImageTextToText class from the transformers library, enabling advanced vision-language tasks. Refer to the [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/image_text_to_text_inference.py) for more details. |
| [Speech Sequence to Sequence Model](QEFFAutoModelForSpeechSeq2Seq) | Provides support for the QEFFAutoModelForSpeechSeq2Seq class, facilitating speech-to-text sequence models. Refer to the [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/speech_to_text/run_whisper_speech_to_text.py) for more details. |
| Support for FP8 Execution | Enables execution with FP8 precision, significantly improving performance and reducing memory usage for computational tasks. |
| Prefill caching | Enhances inference speed by caching key-value pairs for shared prefixes, reducing redundant computations and improving efficiency. |
| Prompt-Lookup Decoding | Speeds up text generation by reusing overlapping parts of the input prompt and the generated text, making the process faster without losing quality. Refer to the [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/pld_spd_inference.py) for more details. |
| [PEFT LoRA support](QEffAutoPeftModelForCausalLM) | Enables parameter-efficient fine-tuning using low-rank adaptation techniques, reducing the computational and memory requirements for fine-tuning large models. Refer to the [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/peft_models.py) for more details. |
| [QNN support](#qnn-compilation) | Enables compilation using the QNN SDK, making QEfficient adaptable to additional backends in the future. |
| [Embedding model support](QEFFAutoModel) | Facilitates the generation of vector embeddings for retrieval tasks. |
| [Speculative Decoding](#draft-based-speculative-decoding) | Accelerates text generation by using a draft model to generate preliminary predictions, which are then verified by the target model, reducing latency and improving efficiency. Refer to the [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/draft_spd_inference.py) for more details. |
| [Finite lorax](QEffAutoLoraModelForCausalLM) | Users can activate multiple LoRA adapters and compile them with the base model. At runtime, they can specify which prompt should use which adapter, enabling mixed adapter usage within the same batch. Refer to the [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/lora_models.py) for more details. |
| Python and C++ Inferencing API support | Provides flexibility when running inference with QEfficient, enabling integration with various applications and improving accessibility for developers. Refer to the [sample script](https://github.com/quic/efficient-transformers/blob/main/examples/cpp_execution/text_inference_using_cpp.py) for more details. |
| [Continuous batching](#continuous-batching) | Optimizes throughput and latency by dynamically batching requests, ensuring efficient use of computational resources. |
| AWQ and GPTQ support | Supports advanced quantization techniques, improving model efficiency and performance on AI 100. |
| Support for serving successive requests in the same session | An API that yields tokens as they are generated, facilitating seamless integration with various applications and enhancing accessibility for developers. |
| Perplexity calculation | A script for computing the perplexity of a model, allowing evaluation of model performance and comparison across different models and datasets. Refer to the [sample script](https://github.com/quic/efficient-transformers/blob/main/scripts/perplexity_computation/calculate_perplexity.py) for more details. |
| KV Heads Replication Script | A sample script for replicating key-value (KV) heads for the Llama-3-8B-Instruct model, running inference with the original model, replicating KV heads, validating changes, and exporting the modified model to ONNX format. Refer to the [sample script](https://github.com/quic/efficient-transformers/blob/main/scripts/replicate_kv_head/replicate_kv_heads.py) for more details. |
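Many of the features in this table are driven through the high-level Python classes referenced above. As a rough, non-authoritative sketch of that flow, the snippet below walks through the usual load, compile, and generate sequence with `QEFFAutoModelForCausalLM`; the checkpoint name and the exact argument names (for example `num_cores`, `prompts`, or whether `generate` takes a tokenizer) vary between releases, so treat this as an illustration and consult the linked sample scripts for the current interface.

```python
# Sketch only: typical high-level flow (load -> compile -> generate).
# Argument names such as num_cores and the tokenizer parameter are assumptions
# here and may differ between releases; see the sample scripts linked above.
from transformers import AutoTokenizer

from QEfficient import QEFFAutoModelForCausalLM

model_name = "gpt2"  # illustrative checkpoint; any supported causal LM works

# Download the Hugging Face checkpoint and apply QEfficient's model transforms.
qeff_model = QEFFAutoModelForCausalLM.from_pretrained(model_name)

# Export to ONNX and compile an inference-ready QPC binary.
qeff_model.compile(num_cores=16)

# Run text generation on the compiled model.
tokenizer = AutoTokenizer.from_pretrained(model_name)
qeff_model.generate(prompts=["Hi there!!"], tokenizer=tokenizer)
```
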
## Transformed models and QPC storage
By default, exported models and Qaic Program Container (QPC) files, the compiled and inference-ready model binaries generated by the compiler, are stored in `~/.cache/qeff_cache`. You can customize this storage path using the following environment variables:
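Purely as an illustration of overriding that default location, the sketch below sets a cache-root environment variable before importing the library. The variable name `QEFF_HOME` is an assumption in this example; substitute whichever variable the list that follows documents for your installed version.

```python
# Hypothetical illustration only: QEFF_HOME is assumed to be one of the
# documented cache-path variables; confirm the exact name for your version.
import os

# Point the export/compile cache at a larger or faster disk before importing QEfficient.
os.environ["QEFF_HOME"] = "/mnt/workspace/qeff_cache"

import QEfficient  # noqa: E402  # import after the environment variable is set
```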