Commit 3f82fb6
Merge branch 'llmware-ai:main' into main
chair300 authored Feb 28, 2024
2 parents 53a67e2 + 3151e11 commit 3f82fb6
Showing 42 changed files with 5,466 additions and 665 deletions.
1 change: 1 addition & 0 deletions .devcontainer
12 changes: 12 additions & 0 deletions Dockerfile
@@ -1,4 +1,11 @@
FROM python:3.11-slim-bookworm

ARG USERNAME=llmware
ARG USER_UID=1000
ARG USER_GID=$USER_UID
ENV PYTHONPATH=/llmware


RUN apt-get update \
&& apt-get install -y --no-install-recommends git bash \
&& apt-get purge -y --auto-remove
@@ -7,6 +14,11 @@ RUN git clone https://github.com/llmware-ai/llmware.git
RUN /llmware/scripts/dev/load_native_libraries.sh
RUN cd llmware/llmware && pip install -r requirements.txt


# Create the user
RUN groupadd --gid $USER_GID $USERNAME \
&& useradd --uid $USER_UID --gid $USER_GID -m $USERNAME \
&& chown -R $USERNAME:$USER_GID /llmware
ENV PYTHONPATH=/llmware
WORKDIR /llmware

111 changes: 80 additions & 31 deletions README.md

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions devcontainer/README
@@ -0,0 +1,4 @@
If you wish to use devcontainers for development in VS Code, you will need to rename this directory (or create a symlink to it) as .devcontainer and reload your window. This will trigger an option to reopen the code in a container. Once the code is open in a container, you can contribute as normal without having to install all the dependencies on your local system. The development container also provides access to your local home directory via the /code directory.

quick how-to:
run this command on Linux from the llmware root directory: ln -s devcontainer .devcontainer
34 changes: 34 additions & 0 deletions devcontainer/devcontainer.json
@@ -0,0 +1,34 @@
{
"name": "LLMWARE Dev",
//"build": { "dockerfile": "../Dockerfile" },
"image": "provocoai/llmware:dev-01", //
"RemoteUser": "${localEnv:USER}",


"runArgs": [
"--name",
"${localWorkspaceFolderBasename}", // Container name
"-it",
"-l",
"com.docker.compose.project=devcontainers" // Container group name
],
// you can set up your local directory in the devcontainer here. The mount line below is an example that mounts your home directory into the /code directory
"mounts" : [
//"source=${localEnv:HOME},target=/code,type=bind,consistency=cached"
],
"features": {
"ghcr.io/devcontainers/features/docker-outside-of-docker:1": {
"dockerDashComposeVersion": "v2"
},
"ghcr.io/devcontainers/features/github-cli:1": {}
},
"customizations": {
"vscode": {
"extensions": [
"esbenp.prettier-vscode", // prettify the code extension
"ms-python.python", //python code extensions
"ms-python.vscode-pylance" / vscode python extension
]
}
}
}
115 changes: 115 additions & 0 deletions examples/Embedding/using_chromadb.py
@@ -0,0 +1,115 @@

"""This example shows how to use ChromaDB as a vector embedding database with llmware"""

""" (A) Python Dependencies -
As a first step, you should pip install ChromaDB, which is not included in the llmware package:
1. pip3 install chromadb
(B) Using ChromaDB -
Installing ChromaDB via pip installs everything you need.
However, if you need help, there are many great online sources and communities, e.g.:
-- ChromaDB documentation - https://docs.trychroma.com/
-- Docker - https://hub.docker.com/u/chromadb
-- please also see the docker-compose-chromadb.yaml script provided in the llmware script repository
(C) Configurations -
You can configure ChromaDB with environment variables. Here is the list of variable names we currently
support - for more information see ChromaDBConfig.
-- CHROMADB_COLLECTION
-- CHROMADB_PERSISTENT_PATH
-- CHROMADB_HOST
-- CHROMADB_PORT
-- CHROMADB_SSL
-- CHROMADB_HEADERS
-- CHROMADB_SERVER_AUTH_PROVIDER
    -- CHROMADB_SERVER_AUTH_CREDENTIALS_PROVIDER
-- CHROMADB_PASSWORD
-- CHROMADB_SERVER_AUTH_CREDENTIALS_FILE
-- CHROMADB_SERVER_AUTH_CREDENTIALS
-- CHROMADB_SERVER_AUTH_TOKEN_TRANSPORT_HEADER
"""


import os

from llmware.setup import Setup
from llmware.library import Library
from llmware.retrieval import Query

# example using ChromaDB as an in-memory database
os.environ["CHROMADB_COLLECTION"] = "llmware"

# note: in default mode, Chroma will persist in memory only - to persist to disk, uncomment the following line and add a local folder path:
# os.environ["CHROMADB_PERSISTENT_PATH"] = "/local/folder/path/to/save/chromadb/"


def build_lib(library_name, folder="Agreements"):

    # Step 1 - create a library, which is the main 'organizing construct' in llmware
    print("\nupdate: Step 1 - Creating library: {}".format(library_name))

    library = Library().create_new_library(library_name)

    # Step 2 - pull down the sample files from S3 through the .load_sample_files() command
    # --note: if you need to refresh the sample files, set 'over_write=True'
    print("update: Step 2 - Downloading Sample Files")

    sample_files_path = Setup().load_sample_files(over_write=False)

    # Step 3 - point the .add_files method at the folder of documents that was just downloaded
    # this method parses the documents, chunks the text, and captures the chunks in MongoDB
    print("update: Step 3 - Parsing and Text Indexing Files")

    # options: Agreements | UN-Resolutions-500
    library.add_files(input_folder_path=os.path.join(sample_files_path, folder))

    return library


# start script

print("update: Step 1- starting here- building library- parsing PDFs into text chunks")

lib = build_lib("chromadb_lib_0")

# optional - check the status of the library card and embedding
lib_card = lib.get_library_card()
print("update: -- before embedding process - check library card - ", lib_card)

print("update: Step 2 - starting to install embeddings")

# alt embedding models - "mini-lm-sbert" | industry-bert-contracts | text-embedding-ada-002
# note: if you want to use text-embedding-ada-002, you will need an OpenAI key set in an os.environ variable
# e.g., os.environ["USER_MANAGED_OPENAI_API_KEY"] = "<insert your key>"

# batch sizes from 100-500 usually give good performance and work on most environments
lib.install_new_embedding(embedding_model_name="industry-bert-contracts", vector_db="chromadb", batch_size=300)

# optional - check the status of the library card and embedding
lib_card = lib.get_library_card()
print("update: -- after embedding process - check updated library card - ", lib_card)

# run a query
# note: embedding_model_name is optional, but useful if you create multiple embeddings on the same library
# --see other example scripts for multiple embeddings

# create query object
query_chromadb = Query(lib, embedding_model_name="industry-bert-contracts")
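
# note: since embedding_model_name is optional, a library with a single embedding could
# be queried without it - a hypothetical one-liner (see the multiple-embeddings examples):
# query_chromadb = Query(lib)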

# run multiple queries using query_chromadb
my_search_results = query_chromadb.semantic_query("What is the sale bonus?", result_count=24)

for i, qr in enumerate(my_search_results):
print("update: semantic query results: ", i, qr)

# if you want to delete the embedding - uncomment the line below
# lib.delete_installed_embedding("industry-bert-contracts", "chromadb")

# optional - check the embeddings on the library
emb_record = lib.get_embedding_status()
for j, entries in enumerate(emb_record):
print("update: embeddings on library: ", j, entries)
58 changes: 19 additions & 39 deletions examples/README.md
@@ -1,44 +1,24 @@
# Getting started with `llmware`
# 🔥 Top New Examples 🔥

New to LLMWare - [**Fast Start tutorial series**](https://github.com/llmware-ai/llmware/tree/main/fast_start)
SLIM Examples - [**SLIM Models**](SLIM-Agents/)

| Example | Detail |
|-------------|--------------|
| 1. Getting Started ([code](Getting_Started/getting_started_with_rag.py) / [video](https://www.youtube.com/watch?v=0naqpH93eEU)) | End-to-end Basic RAG Recipe illustrating key LLMWare classes. |
| 2. Prompts ([code](Prompts/llm_prompts.py)) | Prompt LLMs with various sources, explore the out-of-the-box Prompt Catalog, and use different prompt styles.|
| 3. Retrieval ([code](Retrieval/semantic_retrieval.py)) | Explore the breadth of retrieval capabilities and persisting, loading and saving retrieval history.|
| 4. Embedding ([code](Embedding/embeddings_fast_start.py)) | Simple access to multiple embedding models and vector DBs (“mix and match”). |
| 5. Parsing ([code](Parsing/parse_documents.py)) | Ingest at scale into library and ‘at runtime' into any Prompt. |
| 6. Prompts With Sources ([code](Prompts/prompt_with_sources.py)) | Attach wide range of knowledge sources directly into Prompts. |
| 7. BLING models ([code](Models/bling_fast_start.py) / [video](https://www.youtube.com/watch?v=JjgqOZ2v5oU)) | Explore `llmware`'s BLING model series ("Best Little Instruction-following No-GPU-required"). See how they perform in common RAG scenarios - question-answering, key-value extraction, and basic summarization. |
| 8. RAG with BLING ([code](RAG/contract_analysis_on_laptop_with_bling_models.py) / [video](https://www.youtube.com/watch?v=8aV5p3tErP0)) | Using contract analysis as an example, experiment with RAG for complex document analysis and text extraction using `llmware`'s BLING ~1B parameter GPT model running on your laptop. |
| 9. DRAGON RAG benchmark testing with huggingface ([code](Models/dragon_rag_benchmark_tests_huggingface.py)) | Run RAG instruct benchmark tests against the `llmware` DRAGON models to find the best one for your RAG workflow. This example uses basic Transformer APIs. |
| 10. DRAGON RAG benchmark testing with llmware ([code](Models/dragon_rag_benchmark_tests_llmware.py)) | Run RAG instruct benchmark tests against the `llmware` DRAGON models to find the best one for your RAG workflow. This example uses the llmware Prompt API which provides additional capabilities such as evidence/fact checking |
| 11. Fact Checking ([code](Prompts/fact_checking.py)) | Explore the full set of evidence methods in this example script that analyzes a set of contracts. |
| 12. Working with Prompts ([code](Getting_Started/working_with_prompts.py)) | Inspection of Prompt history which is useful in AI Audit scenarios.|
| 13. Hugging Face Integration ([code](Models/huggingface_integration.py)) | How to bring your favorite HF model into llmware seamlessly. Customize a generative model with weights from a custom fine-tuned model. |
| 14. Working with Datasets ([code](Datasets/working_with_datasets.py)) | Dataset generation streamlined for fine-tuning generative and embedding models and formats such as Alpaca, ChatGPT, Human-Bot. |
| 15. Working without Databases ([code](Getting_Started/working_without_a_database.py) / [video](https://www.youtube.com/watch?v=tAGz6yR14lw))| Parse, Prompt and generate Datasets from Prompt history without installing MongoDB or a vector database.|
| 16. Working without Databases with a minimal Web UI ([code](Getting_Started/ui_without_a_database.py)) | Upload PDFs and run inference on llmware BLING models without installing MongoDB or a vector database. |


# Using `llmware` without a database
You can do some interesting things using `llmware` without a database or vector embeddings. Parsing can be done in memory and output to text or json. Prompts can be crafted with sources from files, Wikipedia or the Yahoo Finance API. The **Working without Databases** ([code](Getting_Started/working_without_a_database.py) / [video](https://www.youtube.com/watch?v=tAGz6yR14lw)), [LLM Prompts](Getting_Started/working_with_prompts.py), and [Parsing](Parsing/parse_documents.py) examples show scenarios that can be accomplished this way, and throughout the examples there are specific methods that do not require MongoDB or embeddings. A short sketch of the pattern follows.
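
The following is a minimal sketch of prompting with a Wikipedia source and no database. The model name and the exact `Prompt` method signatures (`load_model`, `add_source_wikipedia`, `prompt_with_source`) follow the pattern of the linked examples and should be treated as assumptions - check those scripts for the authoritative usage.

```python
# minimal sketch - prompting with a Wikipedia source, no database required
# (model name and method signatures are assumptions - see the linked examples)
from llmware.prompts import Prompt

prompter = Prompt().load_model("llmware/bling-1b-0.1")       # assumed model name
prompter.add_source_wikipedia("Boeing", article_count=2)     # pull articles in as a source
responses = prompter.prompt_with_source("What business is the company in?")

for r in responses:
    print(r["llm_response"])                                 # assumed response key
```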

# `llmware` Open Source Models
The `llmware` public model repository has 3 model collections:
- **Industry BERT models:** out-of-the-box custom trained sentence transformer embedding models fine-tuned for the following industries: Insurance, Contracts, Asset Management, SEC.
- **BLING model series:** Small CPU-based RAG-optimized, instruct-following 1B-3B parameter models.
- **DRAGON model series:** Production-grade RAG-optimized 6-7B parameter models - "Delivering RAG on ..." the leading foundation base models.

These model collections are available at [`llmware` on Hugging Face](https://huggingface.co/llmware). Explore their use in the [Embedding](Embedding/embeddings_fast_start.py), [Hugging Face Integration](Models/huggingface_integration.py), [`llmware` BLING model](Models/bling_fast_start.py), [RAG with BLING](RAG/contract_analysis_on_laptop_with_bling_models.py), and [RAG benchmark testing](Models/dragon_rag_benchmark_tests_llmware.py) examples.

# Additional `llmware` capabilities
- Create knowledge graphs with a high-powered and fast C-based co-occurrence table matrix builder, the output of which can feed NLP statistics as well as potentially graph databases. Explore the [Knowledge Graph](Datasets/knowledge_graph.py) example.

- Generate datasets for fine-tuning both generative and embedding models. `llmware` uses sophisticated data-crafting strategies, leveraging the data captured throughout the system. Explore the [Datasets](Datasets/working_with_datasets.py) example.

- Library is the simple, flexible, unifying construct in `llmware` to assemble and normalize parsed text chunks, and is linked to both a text search index and an open platform of embedding models and vector databases. Explore the [Working with Libraries](Getting_Started/working_with_libraries.py) example, and see the short sketch after this list.

- The `llmware` parsers follow a consistent 27-key metadata dictionary, so that you can extract the same information from a PDF as from a PowerPoint or text file. The parsers generally extract images, tables, and all available document metadata. There is a complete set of text chunking tools to parse a batch of documents (across multiple formats) and chunk and store them in a consistent format in a document store. Explore the [Parsing](Parsing/parse_documents.py) example.
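
As a quick orientation to the Library flow referenced above, here is a minimal sketch mirroring the ChromaDB embedding example earlier on this page. The folder path is a placeholder and `text_query` is an assumption - see the [Retrieval](Retrieval/semantic_retrieval.py) examples for the authoritative usage.

```python
# minimal sketch of the Library flow (path is a placeholder; text_query is an assumption)
from llmware.library import Library
from llmware.retrieval import Query

library = Library().create_new_library("my_library")
library.add_files(input_folder_path="/path/to/documents")    # parse + chunk + index
results = Query(library).text_query("indemnification", result_count=10)

for r in results:
    print(r)
```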
| 1. BLING models fast start ([code](Models/bling_fast_start.py) / [video](https://www.youtube.com/watch?v=JjgqOZ2v5oU)) | Get started with fast, accurate, CPU-based models - question-answering, key-value extraction, and basic summarization. |
| 2. Parse and Embed 500 PDF Documents ([code](Embedding/docs2vecs_with_milvus-un_resolutions.py)) | End-to-end example for Parsing, Embedding and Querying UN Resolution documents with Milvus |
| 3. Hybrid Retrieval - Semantic + Text ([code](Retrieval/dual_pass_with_custom_filter.py)) | Using 'dual pass' retrieval to combine the best of semantic and text search |
| 4. Multiple Embeddings with PG Vector ([code](Embedding/using_multiple_embeddings.py) / [video](https://www.youtube.com/watch?v=Bncvggy6m5Q)) | Comparing Multiple Embedding Models using Postgres / PG Vector |
| 5. DRAGON GGUF Models ([code](Models/dragon_gguf_fast_start.py) / [video](https://www.youtube.com/watch?v=BI1RlaIJcsc&t=130s)) | State-of-the-Art 7B RAG GGUF Models. |
| 6. RAG with BLING ([code](RAG/contract_analysis_on_laptop_with_bling_models.py) / [video](https://www.youtube.com/watch?v=8aV5p3tErP0)) | Using contract analysis as an example, experiment with RAG for complex document analysis and text extraction using `llmware`'s BLING ~1B parameter GPT model running on your laptop. |
| 7. Master Service Agreement Analysis with DRAGON ([code](RAG/msa_processing.py) / [video](https://www.youtube.com/watch?v=Cf-07GBZT68&t=2s)) | Analyzing MSAs using DRAGON YI 6B Model. |
| 8. Streamlit Example ([code](Getting_Started/ui_without_a_database.py)) | Upload PDFs and run inference on llmware BLING models. |
| 9. Integrating LM Studio ([code](Models/using-open-chat-models.py) / [video](https://www.youtube.com/watch?v=h2FDjUyvsKE&t=101s)) | Integrating LM Studio Models with LLMWare |
| 10. Prompts With Sources ([code](Prompts/prompt_with_sources.py)) | Attach wide range of knowledge sources directly into Prompts. |
| 11. Fact Checking ([code](Prompts/fact_checking.py)) | Explore the full set of evidence methods in this example script that analyzes a set of contracts. |
| 12. Using 7B GGUF Chat Models ([code](Models/chat_models_gguf_fast_start.py)) | Using 4 state-of-the-art 7B chat models running locally in minutes |


Check back from time to time, as we are always updating these examples - especially with new use cases and contributions from the llmware community!

- All data artifacts are published in standard formats (json, txt files, pytorch_model.bin files) and are fully portable and exportable to any platform.
