Push changes for 0.2.0 release
- Changes are listed in the CHANGELOG.md file at the root
shubhadeepd committed Dec 15, 2023
1 parent 60fca22 commit c871c49
Showing 104 changed files with 3,812 additions and 5,668 deletions.
29 changes: 29 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,29 @@
# Changelog
All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).


## [0.2.0] - 2023-12-15

### Added

- Support for using [Nvidia AI Foundational LLM models](./docs/rag/aiplayground.md#using-nvdia-cloud-based-llms)
- Support for using [Nvidia AI Foundational embedding models](./docs/rag/aiplayground.md#using-nvidia-cloud-based-embedding-models)
- Support for [deploying and using quantized LLM models](./docs/rag/llm_inference_server.md#quantized-llama2-model-deployment)
- Support for [evaluating RAG pipeline](./evaluation/README.md)

### Changed

- Repository restructuring to allow better open source contributions
- [Upgraded dependencies](./RetrievalAugmentedGeneration/Dockerfile) for chain server container
- [Upgraded NeMo Inference Framework container version](./RetrievalAugmentedGeneration/llm-inference-server/Dockerfile); no separate sign-up is needed now for access.
- Main [README](./README.md) now provides more details.
- Documentation improvements.
- Better error handling and reporting mechanism for corner cases.
- Renamed `triton-inference-server` container and service to `llm-inference-server`

### Fixed

- [Fixed issue #13](https://github.com/NVIDIA/GenerativeAIExamples/issues/13) where the pipeline was not able to answer questions unrelated to the knowledge base
- [Fixed issue #12](https://github.com/NVIDIA/GenerativeAIExamples/issues/12) related to type checking while uploading PDF files
2 changes: 1 addition & 1 deletion LICENSE.md
@@ -198,4 +198,4 @@
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
limitations under the License.
4 changes: 2 additions & 2 deletions README.md
@@ -6,7 +6,7 @@ State-of-the-art Generative AI examples that are easy to deploy, test, and exten
## NVIDIA NGC
Generative AI Examples uses resources from the [NVIDIA NGC AI Development Catalog](https://ngc.nvidia.com).

Sign up for a [free NGC developer account](https://ngc.nvidia.com/signin) to access:
Sign up for a [free NGC developer account](https://ngc.nvidia.com/signin) to access:

- The GPU-optimized NVIDIA containers, models, scripts, and tools used in these examples
- The latest NVIDIA upstream contributions to the respective programming frameworks
@@ -16,7 +16,7 @@ Sign up for a [free NGC developer account](https://ngc.nvidia.com/signin) to acc

## Retrieval Augmented Generation (RAG)

A RAG pipeline embeds multimodal data -- such as documents, images, and video -- into a database connected to a Large Language Model. RAG lets users use an LLM to chat with their own data.
A RAG pipeline embeds multimodal data -- such as documents, images, and video -- into a database connected to a Large Language Model. RAG lets users use an LLM to chat with their own data.

| Name | Description | LLM | Framework | Multi-GPU | Multi-node | Embedding | TRT-LLM | Triton | VectorDB | K8s |
|---------------|-----------------------|------------|-------------------------|-----------|------------|-------------|---------|--------|----------|-----|
9 changes: 6 additions & 3 deletions RetrievalAugmentedGeneration/Dockerfile
@@ -3,9 +3,12 @@ ARG BASE_IMAGE_TAG=23.08-py3


FROM ${BASE_IMAGE_URL}:${BASE_IMAGE_TAG}
COPY chain_server /opt/chain_server
RUN --mount=type=bind,source=requirements.txt,target=/opt/requirements.txt \
COPY RetrievalAugmentedGeneration/__init__.py /opt/RetrievalAugmentedGeneration/
COPY RetrievalAugmentedGeneration/common /opt/RetrievalAugmentedGeneration/common
COPY RetrievalAugmentedGeneration/examples /opt/RetrievalAugmentedGeneration/examples
COPY integrations /opt/integrations
RUN --mount=type=bind,source=RetrievalAugmentedGeneration/requirements.txt,target=/opt/requirements.txt \
python3 -m pip install --no-cache-dir -r /opt/requirements.txt

WORKDIR /opt
ENTRYPOINT ["uvicorn", "chain_server.server:app"]
ENTRYPOINT ["uvicorn", "RetrievalAugmentedGeneration.common.server:app"]
119 changes: 82 additions & 37 deletions RetrievalAugmentedGeneration/README.md
@@ -4,16 +4,16 @@
**Project Goal**: A reference Retrieval Augmented Generation (RAG) workflow for a chatbot that answers questions about public press releases & tech blogs. It performs document ingestion and provides a Q&A interface using open source models deployed on any cloud or customer datacenter, and leverages GPU-accelerated Milvus for efficient vector storage and retrieval, along with TRT-LLM and a custom LangChain LLM wrapper, to achieve lightning-fast inference speeds.

## Components
- **LLM**: [Llama2](https://ai.meta.com/llama/) - 7b, 13b, and 70b all supported. 13b and 70b generate good responses.
- **LLM**: [Llama2](https://ai.meta.com/llama/) - 7b-chat, 13b-chat, and 70b-chat all supported. 13b-chat and 70b-chat generate good responses.
- **LLM Backend**: Nemo framework inference container with Triton inference server & TRT-LLM backend for speed.
- **Vector DB**: Milvus because it's GPU accelerated.
- **Embedding Model**: [e5-large-v2](https://huggingface.co/intfloat/e5-large-v2) since it is one of the best embedding models available at the moment (a short standalone embedding sketch follows this list).
- **Framework(s)**: LangChain and LlamaIndex.
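For a quick, standalone feel of how the e5-large-v2 embedder behaves, the sketch below computes embeddings with the `sentence-transformers` package and scores a query against a passage. Note that e5 models expect `query:` / `passage:` prefixes on their inputs; this snippet is an illustration only and is not part of the workflow's chain server code.

```
# Illustrative only: embed a query and a passage with e5-large-v2 and
# compare them with cosine similarity. Requires `pip install sentence-transformers`.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-large-v2")

# e5 models expect "query: " / "passage: " prefixes on the input text.
query = "query: How many cores are on the Nvidia Grace superchip?"
passage = "passage: The NVIDIA Grace CPU Superchip has 144 Arm Neoverse V2 cores."

q_emb, p_emb = model.encode([query, passage], normalize_embeddings=True)
print(float(util.cos_sim(q_emb, p_emb)))  # higher score = more relevant passage
```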

This reference workflow uses a variety of components and services to customize and deploy the RAG based chatbot. The following diagram illustrates how they work together. Refer to the [detailed architecture guide](./docs/architecture.md) to understand more about these components and how they are tied together.
This reference workflow uses a variety of components and services to customize and deploy the RAG based chatbot. The following diagram illustrates how they work together. Refer to the [detailed architecture guide](../docs/rag/architecture.md) to understand more about these components and how they are tied together.


![Diagram](./../RetrievalAugmentedGeneration/images/image3.jpg)
![Diagram](../docs/rag/images/image3.jpg)

*Note:*
We've used [Llama2](https://ai.meta.com/llama/) and [e5-large-v2](https://huggingface.co/intfloat/e5-large-v2) models as example defaults in this workflow; you should ensure that both the LLM and embedding model are appropriate for your use case, and validate that they are secure and have not been tampered with prior to use.
@@ -30,8 +30,6 @@ Before proceeding with this guide, make sure you meet the following prerequisite
- If you are running multiple GPUs, they must all be set to the same mode (i.e., Compute vs. Display). You can check the compute mode for each GPU using
``nvidia-smi -q -d compute``

- You should have access to [NeMo Framework](https://developer.nvidia.com/nemo-framework) to download the container used for deploying the Large Language Model. To access nemo-framework inference container please register at https://developer.nvidia.com/nemo-framework. After submitting a form you will be automatically accepted.

### Setup the following

- Docker and Docker-Compose are essential. Please follow the [installation instructions](https://docs.docker.com/engine/install/ubuntu/).
@@ -53,28 +51,36 @@ Before proceeding with this guide, make sure you meet the following prerequisite
docker login nvcr.io
```
- You can download Llama2 Chat Model Weights from [Meta](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) or [HuggingFace](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf/).
- git-lfs
- Make sure you have [git-lfs](https://git-lfs.github.com) installed.
- You can download Llama2 Chat Model Weights from [Meta](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) or [HuggingFace](https://huggingface.co/meta-llama/Llama-2-13b-chat-hf/). You can skip this step [if you are interested in using cloud-based LLMs from Nvidia AI Playground](#using-nvdia-cloud-based-llm).
**Note for checkpoints downloaded from Meta**:
When downloading model weights from Meta, you can follow the instructions up to the point of downloading the models using ``download.sh``. There is no need to deploy the model using the steps mentioned in the repository. We will use Triton to deploy the model.
- When downloading model weights from Meta, you can follow the instructions up to the point of downloading the models using ``download.sh``. There is no need to deploy the model using the steps mentioned in the repository. We will use Triton to deploy the model.
Meta will download two additional files, namely tokenizer.model and tokenizer_checklist.chk, outside of the model checkpoint directory. Ensure that you copy these files into the same directory as the model checkpoint directory.
- Meta will download two additional files, namely `tokenizer.model` and `tokenizer_checklist.chk`, outside of the model checkpoint directory. Ensure that you copy these files into the same directory as the model checkpoint directory.
**Using Cloud based Nvidia AI Foundational models**:
**Note**:
- If you would like to use LLM models deployed from NVIDIA AI Playground instead of deploying the models on-prem, follow the instructions [here](../docs/rag/aiplayground.md).
**Using Quantized models**:
In this workflow, we will be leveraging a Llama2 (13B parameters) chat model, which requires 50 GB of GPU memory. If you prefer to leverage 7B parameter model, this will require 38GB memory. The 70B parameter model initially requires 240GB memory.
IMPORTANT: For this initial version of the workflow, A100 and H100 GPUs are supported.
- In this workflow, we will be leveraging a Llama2 (7B parameters) chat model, which requires 38 GB of GPU memory. <br>
IMPORTANT: For this initial version of the workflow, only the 7B chat model is supported on A100 and H100 GPUs.
- We also support quantization of the Llama2 model using AWQ, which changes the model precision to INT4, thereby reducing memory usage. Check out the steps [here](../docs/rag/llm_inference_server.md) to enable quantization.
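As a rough back-of-the-envelope check on these memory figures, model weights alone take roughly `parameters × bytes-per-parameter`; the totals quoted above are larger because they also cover the KV cache, activations, and runtime overhead. The sketch below is only an approximation to help with GPU sizing, not an exact measurement of this workflow.

```
# Rough weight-memory estimate; real usage is higher (KV cache, activations, runtime).
def weight_gib(params_billion: float, bytes_per_param: float) -> float:
    return params_billion * 1e9 * bytes_per_param / 2**30

for name, params in [("Llama2-7B", 7), ("Llama2-13B", 13), ("Llama2-70B", 70)]:
    fp16 = weight_gib(params, 2.0)   # FP16/BF16 weights
    int4 = weight_gib(params, 0.5)   # AWQ INT4 weights (~4 bits per parameter)
    print(f"{name}: ~{fp16:.0f} GiB fp16 weights, ~{int4:.0f} GiB int4 weights")
```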
## Install Guide
### Step 1: Move to deploy directory
cd deploy
### Step 2: Set Environment Variables
NVIDIA TensorRT LLM provides state-of-the-art performance for running LLM inference. Follow the steps below from the root of this project to set up the RAG example with TensorRT LLM and Triton deployed locally.
### Step 1: Set Environment Variables
Modify ``compose.env`` in the ``deploy`` directory to set your environment variables. The following variables are required.
Modify ``compose.env`` in the ``deploy/compose`` directory to set your environment variables. The following variables are required, as shown below, for using a Llama-based model.
# full path to the local copy of the model weights
export MODEL_DIRECTORY="$HOME/src/Llama-2-13b-chat-hf"
@@ -89,46 +95,64 @@ Modify ``compose.env`` in the ``deploy`` directory to set your environment varia
APP_CONFIG_FILE=/dev/null
### Step 3: Build and Start Containers
### Step 2: Build and Start Containers
- Pull LFS files. This will pull large files from the repository.
```
git lfs pull
```
- Run the following command to build containers.
```
source compose.env; docker compose build
source deploy/compose/compose.env; docker compose -f deploy/compose/docker-compose.yaml build
```
- Run the following command to start containers.
```
source compose.env; docker compose up -d
source deploy/compose/compose.env; docker compose -f deploy/compose/docker-compose.yaml up -d
```
> ⚠️ **NOTE**: It will take a few minutes for the containers to come up and may take up to 5 minutes for the Triton server to be ready. Adding the `-d` flag will have the services run in the background. ⚠️
- Run ``docker ps -a``. When the containers are ready the output should look similar to the image below.
![Docker Output](./images/docker-output.png "Docker Output Image")
![Docker Output](../docs/rag/images/docker-output.png "Docker Output Image")
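If you prefer to script the wait instead of re-running ``docker ps`` by hand, Triton exposes the standard KServe v2 readiness endpoint ``/v2/health/ready``. The sketch below polls it from the host; the host name and port (8000) are assumptions about how the compose file publishes the inference server and may differ in your deployment.

```
# Poll Triton's readiness endpoint until the model server is up.
# Assumes the inference server's HTTP port is published on localhost:8000.
import time
import requests

def wait_for_triton(url: str = "http://localhost:8000/v2/health/ready", timeout_s: int = 600) -> bool:
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not accepting connections yet
        time.sleep(10)
    return False

print("Triton ready" if wait_for_triton() else "Timed out waiting for Triton")
```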
**Note**:
- Default prompts are optimized for the Llama chat model; if you're using a completion model, the prompts need to be fine-tuned accordingly.
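For context, Llama2 *chat* checkpoints are trained on a specific instruction template, which is why prompts tuned for the chat model do not transfer directly to a plain completion model. The sketch below shows the commonly documented Llama2 chat format; treat it as an illustration of the template rather than the exact prompt this workflow uses by default.

```
# Illustration of the Llama2 chat prompt template (single turn).
# The workflow's actual default prompts live in its configuration, not here.
def llama2_chat_prompt(system: str, user: str) -> str:
    return f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"

print(llama2_chat_prompt(
    "You are a helpful assistant that answers from the provided context.",
    "How many cores are on the Nvidia Grace superchip?",
))
```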
#### Multi GPU deployment
By default the LLM model will be deployed using all available GPUs of the system. To use specific GPU(s), provide the GPU ID(s) in the [docker compose file](../deploy/compose/docker-compose.yaml) under the `llm` service's `deploy` section:
```
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          # count: ${INFERENCE_GPU_COUNT:-all} # Comment this out
          device_ids: ["0"] # Provide the device id of GPU. It can be found using `nvidia-smi` command
          capabilities: [gpu]
```
### Step 4: Experiment with RAG in JupyterLab
### Step 3: Experiment with RAG in JupyterLab
This AI Workflow includes Jupyter notebooks which allow you to experiment with RAG.
- Using a web browser, type in the following URL to open Jupyter
``http://host-ip:8888``
- Locate the [LLM Streaming Client notebook](notebooks/01-llm-streaming-client.ipynb) which demonstrates how to stream responses from the LLM.
- Locate the [LLM Streaming Client notebook](../notebooks/01-llm-streaming-client.ipynb) which demonstrates how to stream responses from the LLM.
- Proceed with the next 4 notebooks:
- [Document Question-Answering with LangChain](notebooks/02_langchain_simple.ipynb)
- [Document Question-Answering with LangChain](../notebooks/02_langchain_simple.ipynb)
- [Document Question-Answering with LlamaIndex](notebooks/03_llama_index_simple.ipynb)
- [Document Question-Answering with LlamaIndex](../notebooks/03_llama_index_simple.ipynb)
- [Advanced Document Question-Answering with LlamaIndex](notebooks/04_llamaindex_hier_node_parser.ipynb)
- [Advanced Document Question-Answering with LlamaIndex](../notebooks/04_llamaindex_hier_node_parser.ipynb)
- [Interact with REST FastAPI Server](notebooks/05_dataloader.ipynb)
- [Interact with REST FastAPI Server](../notebooks/05_dataloader.ipynb)
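If you would rather call the chain server's REST API directly than run the dataloader notebook, the pattern looks roughly like the sketch below. The port and the `/uploadDocument` and `/generate` routes are placeholders; confirm the actual endpoint names and payloads against the server's interactive FastAPI docs page or the notebook above.

```
# Hypothetical sketch of talking to the chain server's REST API with `requests`.
# Endpoint names, port, and payload fields are placeholders -- confirm them
# against the server's FastAPI docs or the dataloader notebook.
import requests

BASE_URL = "http://localhost:8081"  # assumed chain server address

# Upload a document into the knowledge base.
with open("dataset/press_release.pdf", "rb") as f:
    requests.post(f"{BASE_URL}/uploadDocument", files={"file": f}, timeout=120)

# Ask a question that uses the knowledge base.
resp = requests.post(
    f"{BASE_URL}/generate",
    json={"question": "How many cores are on the Nvidia Grace superchip?", "use_knowledge_base": True},
    timeout=120,
)
print(resp.text)
```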
### Step 5: Run the Sample Web Application
### Step 4: Run the Sample Web Application
A sample chatbot web application is provided in the workflow. Requests to the chat system are wrapped in FastAPI calls.
- Open the web application at ``http://host-ip:8090``.
@@ -139,22 +163,43 @@ A sample chatbot web application is provided in the workflow. Requests to the ch
- To use a knowledge base:
- Click the **Knowledge Base** tab and upload the file [dataset.zip](./RetrievalAugmentedGeneration/notebook/dataset.zip).
- Click the **Knowledge Base** tab and upload the file [dataset.zip](../notebooks/dataset.zip).
- Return to **Converse** tab and check **[X] Use knowledge base**.
- Retype the question: "How many cores are on the Nvidia Grace superchip?"
# RAG Evaluation
## Prerequisites
Make sure the corps comm dataset is loaded into the vector database using the [Dataloader](../notebooks/05_dataloader.ipynb) notebook as part of Step 3 of the setup.
This workflow includes Jupyter notebooks which allow you to perform evaluation of your RAG application on the sample dataset; they can be extended to other datasets as well.
Set up the workflow by building and starting the containers, following the steps [outlined here using docker compose](#step-2-build-and-start-containers).
After setting up the workflow follow these steps:
- Using a web browser, type in the following URL to open Jupyter Labs
``http://host-ip:8889``
- Locate the [synthetic data generation](../evaluation/01_synthetic_data_generation.ipynb) notebook, which demonstrates how to generate synthetic question-answer pairs for evaluation
- Proceed with the next 3 notebooks:
- [Filling generated answers](../evaluation/02_filling_RAG_outputs_for_Evaluation.ipynb)
- [Ragas evaluation with NVIDIA AI playground](../evaluation/03_eval_ragas.ipynb)
- [LLM as a Judge evaluation with NVIDIA AI playground](../evaluation/04_Human_Like_RAG_Evaluation-AIP.ipynb)
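For orientation, these notebooks follow a common pattern: build a table of (question, retrieved contexts, generated answer, ground-truth answer) records and score it with metrics such as faithfulness and answer relevancy. The sketch below shows that shape using the open source `ragas` package; the exact metric set, column names, and judge-LLM wiring used by the notebooks (which route through NVIDIA AI Playground) may differ, so treat this purely as a structural illustration.

```
# Structural illustration of a RAG evaluation table; the notebooks' actual
# metric configuration and judge LLM (NVIDIA AI Playground) may differ.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

records = {
    "question": ["How many cores are on the Nvidia Grace superchip?"],
    "contexts": [["The NVIDIA Grace CPU Superchip has 144 Arm Neoverse V2 cores."]],
    "answer": ["The Grace CPU Superchip has 144 cores."],
    "ground_truths": [["144 cores."]],
}

result = evaluate(Dataset.from_dict(records), metrics=[faithfulness, answer_relevancy])
print(result)
```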
# Learn More
1. [Architecture Guide](./docs/architecture.md): Detailed explanation of different components and how they are tried up together.
1. [Architecture Guide](../docs/rag/architecture.md): Detailed explanation of the different components and how they are tied together.
2. Component Guides: Component-specific features are listed in these sections.
1. [Chain Server](./docs/chat_server.md)
2. [NeMo Framework Inference Server](./docs/llm_inference_server.md)
3. [Jupyter Server](./docs/jupyter_server.md)
4. [Sample frontend](./docs/frontend.md)
3. [Configuration Guide](./docs/configuration.md): This guide covers different configurations available for this workflow.
4. [Support Matrix](./docs/support_matrix.md): This covers GPU, CPU, Memory and Storage requirements for deploying this workflow.
# Known Issues
- Uploading a file with size more than 10 MB may fail due to preset timeouts during the ingestion process.
1. [Chain Server](../docs/rag/chat_server.md)
2. [NeMo Framework Inference Server](../docs/rag/llm_inference_server.md)
3. [Jupyter Server](../docs/rag/jupyter_server.md)
4. [Sample frontend](../docs/rag/frontend.md)
3. [Configuration Guide](../docs/rag/configuration.md): This guide covers different configurations available for this workflow.
4. [Support Matrix](../docs/rag/support_matrix.md): This covers GPU, CPU, Memory and Storage requirements for deploying this workflow.
14 changes: 14 additions & 0 deletions RetrievalAugmentedGeneration/__init__.py
@@ -0,0 +1,14 @@
# SPDX-FileCopyrightText: Copyright (c) 2023 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.