
feat: Integrate Milvus, Ollama, PDF ingestion, and Streamlit UI #439

Open · wants to merge 1 commit into `master`
37 changes: 37 additions & 0 deletions .env.example
@@ -0,0 +1,37 @@
# LangSmith Tracing (Optional, but recommended)
# LANGCHAIN_TRACING_V2="true"
# LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
# LANGCHAIN_API_KEY="YOUR_LANGSMITH_API_KEY" # Replace with your LangSmith API key
# LANGCHAIN_PROJECT="YOUR_LANGCHAIN_PROJECT_NAME" # Replace with your LangSmith project name

# API Base URL for the frontend (if applicable)
# NEXT_PUBLIC_API_BASE_URL="http://localhost:8000/api" # Default for local LangServe backend

# --- Core LLM and Embedding Configuration ---

# Select the provider for embeddings and LLMs
# Options: "openai", "ollama"
EMBEDDING_PROVIDER="openai" # Default
LLM_PROVIDER="openai" # Default

# OpenAI Configuration (Required if EMBEDDING_PROVIDER or LLM_PROVIDER is "openai")
OPENAI_API_KEY="YOUR_OPENAI_API_KEY" # Replace with your OpenAI API key

# Ollama Configuration (Required if EMBEDDING_PROVIDER or LLM_PROVIDER is "ollama")
OLLAMA_BASE_URL="http://localhost:11434" # Default
OLLAMA_EMBEDDING_MODEL="nomic-embed-text" # Default model for Ollama embeddings
OLLAMA_LLM_MODEL="llama2" # Default model for Ollama LLM

# Milvus Vector Store Configuration (Required for vector storage)
MILVUS_HOST="localhost" # Default
MILVUS_PORT="19530" # Default
MILVUS_COLLECTION_NAME="knowledge_base" # Default

# --- Other Provider API Keys (Examples) ---
# ANTHROPIC_API_KEY="YOUR_ANTHROPIC_API_KEY"
# COHERE_API_KEY="YOUR_COHERE_API_KEY"
# FIREWORKS_API_KEY="YOUR_FIREWORKS_API_KEY"
# GOOGLE_API_KEY="YOUR_GOOGLE_API_KEY" # Or GOOGLE_APPLICATION_CREDENTIALS for service accounts

# --- Optional: Database for Record Manager (for langchain ingest state) ---
# RECORD_MANAGER_DB_URL="postgresql+psycopg2://user:pass@localhost:5432/mydatabase" # Example for PostgreSQL
204 changes: 159 additions & 45 deletions README.md
@@ -1,45 +1,159 @@
# 🦜️🔗 Chat LangChain

This repo is an implementation of a chatbot specifically focused on question answering over the [LangChain documentation](https://python.langchain.com/).
Built with [LangChain](https://github.com/langchain-ai/langchain/), [LangGraph](https://github.com/langchain-ai/langgraph/), and [Next.js](https://nextjs.org).

Deployed version: [chat.langchain.com](https://chat.langchain.com)

> Looking for the JS version? Click [here](https://github.com/langchain-ai/chat-langchainjs).

The app leverages LangChain and LangGraph's streaming support and async API to update the page in real time for multiple users.

## Running locally

This project is now deployed using [LangGraph Cloud](https://langchain-ai.github.io/langgraph/cloud/), which means you won't be able to run it locally (or without a LangGraph Cloud account). If you want to run it WITHOUT LangGraph Cloud, please use the code and documentation from this [branch](https://github.com/langchain-ai/chat-langchain/tree/langserve).

> [!NOTE]
> This [branch](https://github.com/langchain-ai/chat-langchain/tree/langserve) **does not** have the same set of features.

## 📚 Technical description

There are two components: ingestion and question-answering.

Ingestion has the following steps:

1. Pull HTML from the documentation site as well as the GitHub codebase
2. Load html with LangChain's [RecursiveURLLoader](https://python.langchain.com/docs/integrations/document_loaders/recursive_url_loader) and [SitemapLoader](https://python.langchain.com/docs/integrations/document_loaders/sitemap)
3. Split documents with LangChain's [RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html)
4. Create a vectorstore of embeddings, using LangChain's [Weaviate vectorstore wrapper](https://python.langchain.com/docs/integrations/vectorstores/weaviate) (with OpenAI's embeddings).

Question-Answering has the following steps:

1. Given the chat history and new user input, determine what a standalone question would be using an LLM.
2. Given that standalone question, look up relevant documents from the vectorstore.
3. Pass the standalone question and relevant documents to the model to generate and stream the final answer.
4. Generate a trace URL for the current chat session, as well as the endpoint to collect feedback.

## Documentation

Looking to use or modify this Use Case Accelerant for your own needs? We've added a few docs to aid with this:

- **[Concepts](./CONCEPTS.md)**: A conceptual overview of the different components of Chat LangChain. Goes over features like ingestion, vector stores, query analysis, etc.
- **[Modify](./MODIFY.md)**: A guide on how to modify Chat LangChain for your own needs. Covers the frontend, backend and everything in between.
- **[LangSmith](./LANGSMITH.md)**: A guide on adding robustness to your application using LangSmith. Covers observability, evaluations, and feedback.
- **[Production](./PRODUCTION.md)**: Documentation on preparing your application for production usage. Explains different security considerations, and more.
- **[Deployment](./DEPLOYMENT.md)**: How to deploy your application to production. Covers setting up production databases, deploying the frontend, and more.
# 🦜️🔗 Chat LangChain - Custom Support Bot

This project is a customer support chatbot built using LangChain, LangServe, Milvus, and Ollama, with a Streamlit frontend. It is based on the original `chat-langchain` repository but adapted for PDF document ingestion and local model support.

## Features

* Ingests PDF documents from a local folder (`knowledge_docs`).
* Extracts text from PDFs for question answering.
* **Placeholder logic for extracting embedded source URLs from PDFs for citations (requires user modification).**
* Uses Milvus as the vector store for efficient document retrieval.
* Supports OpenAI and Ollama for both embeddings and language models, configurable via environment variables.
* Provides an interactive Streamlit chat interface for user interaction.
* Includes source citations in responses (dependent on successful URL extraction).

## Setup & Installation

### Prerequisites

* Python 3.9+
* Poetry for dependency management
* A running Milvus instance (version 2.3.x or compatible).
* Ollama server running (optional, if you intend to use Ollama models). Ensure you have pulled the necessary models (e.g., `ollama pull llama2`, `ollama pull nomic-embed-text`).
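
Before installing, you can sanity-check that Milvus is reachable. A minimal sketch using `pymilvus` (pulled in with the Milvus integration); the host and port are the defaults from `.env.example`:

```python
# Quick connectivity check against a local Milvus instance.
from pymilvus import connections, utility

connections.connect(alias="default", host="localhost", port="19530")
print(utility.get_server_version())  # prints e.g. "v2.3.x" if the server is up
```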

### Installation Steps

1. **Clone the repository:**
```bash
git clone <repository_url> # Replace <repository_url> with the actual URL
cd <repository_name> # Replace <repository_name> with the cloned directory name
```

2. **Install dependencies:**
This project uses Poetry to manage dependencies.
```bash
poetry install --no-root
```
This command installs all necessary Python packages defined in `pyproject.toml`.

3. **Set up Environment Variables:**
Copy the example environment file to a new `.env` file:
```bash
cp .env.example .env
```
Then, edit the `.env` file and provide the necessary values. Below are the key variables:

* `LANGCHAIN_TRACING_V2`: (Optional) Set to `"true"` to enable LangSmith tracing.
* `LANGCHAIN_ENDPOINT`: (Optional) LangSmith API endpoint, if tracing.
* `LANGCHAIN_API_KEY`: (Optional) Your LangSmith API key, if tracing.
* `LANGCHAIN_PROJECT`: (Optional) Your LangSmith project name, if tracing.

* `EMBEDDING_PROVIDER`: Specifies the embedding model provider.
* Set to `"openai"` to use OpenAI embeddings.
* Set to `"ollama"` to use a local Ollama embedding model.
* *Default: "openai"*
* `LLM_PROVIDER`: Specifies the language model provider for the chat.
* Set to `"openai"` to use an OpenAI model.
* Set to `"ollama"` to use a local Ollama model.
* *Default: "openai"*

* `OPENAI_API_KEY`: **Required if `EMBEDDING_PROVIDER` or `LLM_PROVIDER` is "openai".** Your API key for OpenAI.

* `OLLAMA_BASE_URL`: **Required if `EMBEDDING_PROVIDER` or `LLM_PROVIDER` is "ollama".** The base URL for your Ollama server.
* *Default: "http://localhost:11434"*
* `OLLAMA_EMBEDDING_MODEL`: The name of the embedding model to use with Ollama (e.g., "nomic-embed-text").
* *Default: "nomic-embed-text"*
* `OLLAMA_LLM_MODEL`: The name of the language model to use with Ollama (e.g., "llama2").
* *Default: "llama2"*

* `MILVUS_HOST`: The hostname or IP address of your Milvus instance.
* *Default: "localhost"*
* `MILVUS_PORT`: The port number for your Milvus instance.
* *Default: "19530"*
* `MILVUS_COLLECTION_NAME`: The name of the collection to be used in Milvus for storing document embeddings.
* *Default: "knowledge_base"*

* `RECORD_MANAGER_DB_URL`: The database URL for the record manager, which keeps track of ingested documents. SQLite is used by default.
* *Example: "sqlite:///./chat_langchain_ingestion.db"*
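
For reference, here is a minimal sketch of how a backend can dispatch on `EMBEDDING_PROVIDER`. It mirrors the variables above, but the actual wiring in `backend/graph.py` may differ:

```python
import os

def get_embeddings():
    """Pick an embeddings client based on EMBEDDING_PROVIDER."""
    if os.environ.get("EMBEDDING_PROVIDER", "openai") == "ollama":
        from langchain_community.embeddings import OllamaEmbeddings
        return OllamaEmbeddings(
            base_url=os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434"),
            model=os.environ.get("OLLAMA_EMBEDDING_MODEL", "nomic-embed-text"),
        )
    from langchain_openai import OpenAIEmbeddings
    return OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment
```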

## Data Preparation (IMPORTANT)

1. **Place PDF Files:**
    Place all the PDF documents you want the chatbot to use into the `knowledge_docs` folder at the root of this repository. If this folder does not exist, create it.

2. **Implement Source URL Extraction (User Action Required):**
The current system for extracting source URLs from PDFs (to provide accurate citations) uses a **placeholder**. You **must** modify the PDF ingestion logic to correctly extract the source URL for each document.

* **File to Modify:** `backend/ingest.py`
* **Function to Modify:** `load_knowledge_base_pdfs`
* **Area to Modify:** Look for the variable `extracted_url`. Currently, it is hardcoded:
```python
extracted_url = "placeholder_url_needs_implementation"
```
* **Your Task:** Implement logic to find and assign the correct source URL for each PDF. This could involve:
* Parsing the PDF content for a specific line or pattern (e.g., a line starting with "Source: http://..." or "Canonical URL: https://...").
* Using a sidecar metadata file for each PDF.
* Deriving the URL from the PDF's filename if it follows a consistent pattern.
* Querying an external system based on PDF content or filename.

**Example of what you might look for in a PDF's text content:**
```
Source: https://my-company.com/original-document.pdf
```
Or it might be embedded in the PDF's metadata if the generating system includes it. The `PyPDFLoader` loads document pages; you might need to inspect `page.page_content` or `page.metadata` for clues.

**Failure to implement this step will result in all sources being cited with the placeholder URL.**
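
As a starting point, here is a minimal sketch of the first approach above (scanning page text for a `Source:` or `Canonical URL:` line). The regex and the fallback value are illustrative assumptions, not the repository's logic:

```python
import re
from typing import List

from langchain_core.documents import Document

# Matches lines like "Source: https://..." or "Canonical URL: https://..."
SOURCE_RE = re.compile(r"(?:Source|Canonical URL):\s*(https?://\S+)")

def extract_source_url(pages: List[Document]) -> str:
    """Return the first source URL found in a PDF's loaded pages."""
    for page in pages:
        match = SOURCE_RE.search(page.page_content)
        if match:
            return match.group(1)
    return "placeholder_url_needs_implementation"  # last-resort fallback
```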

## Running the Application

Ensure your Milvus instance (and Ollama server, if using it) is running before starting the application.

1. **Step 1: Run Ingestion (One-time per data update)**
This script processes the PDFs in `knowledge_docs`, creates embeddings, and stores them in Milvus. Run it once initially, and then again whenever you add, remove, or change PDF files.
```bash
poetry run python backend/ingest.py
```

2. **Step 2: Run Backend Server (LangServe)**
This starts the FastAPI server that serves your chat graph using LangServe.
```bash
poetry run uvicorn backend.server:app --reload --host 0.0.0.0 --port 8000
```
The backend will be accessible at `http://localhost:8000`.

3. **Step 3: Run Streamlit Frontend**
In a new terminal, start the Streamlit application.
```bash
poetry run streamlit run streamlit_app.py
```
Streamlit will typically open in your web browser automatically, or provide a URL (usually `http://localhost:8501`).
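
For context, `backend/server.py` exposes the graph through LangServe roughly like the sketch below; the import path of the graph object is an assumption:

```python
# Sketch of the LangServe wiring; module and attribute names are assumed.
from fastapi import FastAPI
from langserve import add_routes

from backend.graph import graph  # the runnable defined in backend/graph.py

app = FastAPI()
add_routes(app, graph, path="/chat_langchain")
```

LangServe also serves an interactive playground at `http://localhost:8000/chat_langchain/playground/` and OpenAPI docs at `http://localhost:8000/docs`, both useful for debugging.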

## Usage

1. Open the Streamlit application URL in your web browser (e.g., `http://localhost:8501`).
2. In the sidebar, you can:
    * Verify or change the LangServe URL (defaults to `http://localhost:8000/chat_langchain/`). This endpoint can also be called directly; see the sketch after this list.
* Select the LLM model you wish to use from the dropdown menu. The available models depend on your backend configuration and environment variables (e.g., OpenAI keys, Ollama setup).
3. Type your question related to the content of your ingested PDF documents into the chat input box at the bottom of the page and press Enter.
4. The chatbot will process your question, retrieve relevant information from the documents in Milvus, generate an answer, and display it along with source citations.
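
The endpoint the Streamlit app calls can also be exercised directly, which is handy when debugging the backend in isolation. A sketch using LangServe's client; the `question` input key is an assumption, so confirm the actual schema via the playground or `/docs` first:

```python
from langserve import RemoteRunnable

chain = RemoteRunnable("http://localhost:8000/chat_langchain/")
# The expected input shape depends on the graph; a "question" key is assumed here.
print(chain.invoke({"question": "What do the ingested documents say about pricing?"}))
```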

---

This README provides a comprehensive guide to setting up and running the custom chatbot. Remember to implement the PDF source URL extraction logic for accurate citations.

## Project Structure and Key Files

Here's a brief overview of the key files and directories in this project:

* **`streamlit_app.py`**: The main Streamlit application file that provides the user interface for the chatbot. You run this file to start the chat application.
* **`backend/`**: This directory contains all the Python backend logic.
* **`backend/server.py`**: The FastAPI server that exposes the LangServe chat graph as an API. This is what the Streamlit frontend communicates with.
* **`backend/graph.py`**: Defines the core chat logic using LangGraph. It constructs the conversational chain, including document retrieval, history management, LLM calls, and response synthesis.
    * **`backend/ingest.py`**: Handles the data ingestion pipeline. It reads PDFs from the `knowledge_docs` folder, extracts text (and is designed for you to add logic to extract source URLs), generates embeddings, and stores them in the Milvus vector database. A compressed sketch of this pipeline follows this list.
* **`backend/constants.py`**: Originally for Weaviate constants; now mostly cleaned up. Can be used for any backend-specific global constants if needed.
* **`knowledge_docs/`**: This is the directory where you must place your PDF documents for ingestion. The `ingest.py` script processes files from here.
* **`.env.example`**: A template file for environment variables. You should copy this to a `.env` file and fill in your specific configurations (API keys, Milvus/Ollama URLs, etc.).
* **`.env`**: (User-created) This file stores your actual environment variable configurations. It's listed in `.gitignore` and should not be committed to version control.
* **`pyproject.toml`**: Defines project dependencies and metadata for Poetry.
* **`poetry.lock`**: The lock file generated by Poetry, ensuring consistent dependency versions.
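
To make the `backend/ingest.py` bullet above concrete, here is a compressed sketch of the pipeline it implements. The loader, splitter, and vectorstore calls are standard LangChain APIs; the chunk sizes and the use of OpenAI embeddings are assumptions:

```python
import os
from pathlib import Path

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Milvus
from langchain_openai import OpenAIEmbeddings  # or OllamaEmbeddings, per EMBEDDING_PROVIDER

def ingest(docs_dir: str = "knowledge_docs") -> None:
    # Load every PDF page as a LangChain Document.
    pages = []
    for pdf in sorted(Path(docs_dir).glob("*.pdf")):
        pages.extend(PyPDFLoader(str(pdf)).load())
    # Split pages into overlapping chunks (sizes are assumed, not the repo's).
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=200
    ).split_documents(pages)
    # Embed and store in Milvus using the connection settings from .env.
    Milvus.from_documents(
        chunks,
        embedding=OpenAIEmbeddings(),
        collection_name=os.environ.get("MILVUS_COLLECTION_NAME", "knowledge_base"),
        connection_args={
            "host": os.environ.get("MILVUS_HOST", "localhost"),
            "port": os.environ.get("MILVUS_PORT", "19530"),
        },
    )

if __name__ == "__main__":
    ingest()
```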
2 changes: 1 addition & 1 deletion backend/constants.py
@@ -1 +1 @@
WEAVIATE_DOCS_INDEX_NAME = "LangChain_Combined_Docs_OpenAI_text_embedding_3_small"
# WEAVIATE_DOCS_INDEX_NAME = "LangChain_Combined_Docs_OpenAI_text_embedding_3_small"