
feat: Integrate Milvus, Ollama, PDF ingestion, and Streamlit UI #439

Open · wants to merge 1 commit into `master`
37 changes: 37 additions & 0 deletions .env.example
@@ -0,0 +1,37 @@
# LangSmith Tracing (Optional, but recommended)
# LANGCHAIN_TRACING_V2="true"
# LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
# LANGCHAIN_API_KEY="YOUR_LANGSMITH_API_KEY" # Replace with your LangSmith API key
# LANGCHAIN_PROJECT="YOUR_LANGCHAIN_PROJECT_NAME" # Replace with your LangSmith project name

# API Base URL for the frontend (if applicable)
# NEXT_PUBLIC_API_BASE_URL="http://localhost:8000/api" # Default for local LangServe backend

# --- Core LLM and Embedding Configuration ---

# Select the provider for embeddings and LLMs
# Options: "openai", "ollama"
EMBEDDING_PROVIDER="openai" # Default
LLM_PROVIDER="openai" # Default

# OpenAI Configuration (Required if EMBEDDING_PROVIDER or LLM_PROVIDER is "openai")
OPENAI_API_KEY="YOUR_OPENAI_API_KEY" # Replace with your OpenAI API key

# Ollama Configuration (Required if EMBEDDING_PROVIDER or LLM_PROVIDER is "ollama")
OLLAMA_BASE_URL="http://localhost:11434" # Default
OLLAMA_EMBEDDING_MODEL="nomic-embed-text" # Default model for Ollama embeddings
OLLAMA_LLM_MODEL="llama2" # Default model for Ollama LLM

# Milvus Vector Store Configuration (Required for vector storage)
MILVUS_HOST="localhost" # Default
MILVUS_PORT="19530" # Default
MILVUS_COLLECTION_NAME="knowledge_base" # Default

# --- Other Provider API Keys (Examples) ---
# ANTHROPIC_API_KEY="YOUR_ANTHROPIC_API_KEY"
# COHERE_API_KEY="YOUR_COHERE_API_KEY"
# FIREWORKS_API_KEY="YOUR_FIREWORKS_API_KEY"
# GOOGLE_API_KEY="YOUR_GOOGLE_API_KEY" # Or GOOGLE_APPLICATION_CREDENTIALS for service accounts

# --- Optional: Database for Record Manager (for langchain ingest state) ---
# RECORD_MANAGER_DB_URL="postgresql+psycopg2://user:pass@localhost:5432/mydatabase" # Example for PostgreSQL
204 changes: 159 additions & 45 deletions README.md
@@ -1,45 +1,159 @@
# 🦜️🔗 Chat LangChain

This repo is an implementation of a chatbot specifically focused on question answering over the [LangChain documentation](https://python.langchain.com/).
Built with [LangChain](https://github.com/langchain-ai/langchain/), [LangGraph](https://github.com/langchain-ai/langgraph/), and [Next.js](https://nextjs.org).

Deployed version: [chat.langchain.com](https://chat.langchain.com)

> Looking for the JS version? Click [here](https://github.com/langchain-ai/chat-langchainjs).

The app leverages LangChain and LangGraph's streaming support and async API to update the page in real time for multiple users.

## Running locally

This project is now deployed using [LangGraph Cloud](https://langchain-ai.github.io/langgraph/cloud/), which means you won't be able to run it locally (or without a LangGraph Cloud account). If you want to run it WITHOUT LangGraph Cloud, please use the code and documentation from this [branch](https://github.com/langchain-ai/chat-langchain/tree/langserve).

> [!NOTE]
> This [branch](https://github.com/langchain-ai/chat-langchain/tree/langserve) **does not** have the same set of features.

## 📚 Technical description

There are two components: ingestion and question-answering.

Ingestion has the following steps:

1. Pull HTML from the documentation site as well as the GitHub codebase
2. Load html with LangChain's [RecursiveURLLoader](https://python.langchain.com/docs/integrations/document_loaders/recursive_url_loader) and [SitemapLoader](https://python.langchain.com/docs/integrations/document_loaders/sitemap)
3. Split documents with LangChain's [RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html)
4. Create a vectorstore of embeddings, using LangChain's [Weaviate vectorstore wrapper](https://python.langchain.com/docs/integrations/vectorstores/weaviate) (with OpenAI's embeddings).

Question-Answering has the following steps:

1. Given the chat history and new user input, determine what a standalone question would be using an LLM.
2. Given that standalone question, look up relevant documents from the vectorstore.
3. Pass the standalone question and relevant documents to the model to generate and stream the final answer.
4. Generate a trace URL for the current chat session, as well as the endpoint to collect feedback.

## Documentation

Looking to use or modify this Use Case Accelerant for your own needs? We've added a few docs to aid with this:

- **[Concepts](./CONCEPTS.md)**: A conceptual overview of the different components of Chat LangChain. Goes over features like ingestion, vector stores, query analysis, etc.
- **[Modify](./MODIFY.md)**: A guide on how to modify Chat LangChain for your own needs. Covers the frontend, backend and everything in between.
- **[LangSmith](./LANGSMITH.md)**: A guide on adding robustness to your application using LangSmith. Covers observability, evaluations, and feedback.
- **[Production](./PRODUCTION.md)**: Documentation on preparing your application for production usage. Explains different security considerations, and more.
- **[Deployment](./DEPLOYMENT.md)**: How to deploy your application to production. Covers setting up production databases, deploying the frontend, and more.
# 🦜️🔗 Chat LangChain - Custom Support Bot

This project is a customer support chatbot built using LangChain, LangServe, Milvus, and Ollama, with a Streamlit frontend. It is based on the original `chat-langchain` repository but adapted for PDF document ingestion and local model support.

## Features

* Ingests PDF documents from a local folder (`knowledge_docs`).
* Extracts text from PDFs for question answering.
* **Placeholder logic for extracting embedded source URLs from PDFs for citations (requires user modification).**
* Uses Milvus as the vector store for efficient document retrieval.
* Supports OpenAI and Ollama for both embeddings and language models, configurable via environment variables.
* Provides an interactive Streamlit chat interface for user interaction.
* Includes source citations in responses (dependent on successful URL extraction).

## Setup & Installation

### Prerequisites

* Python 3.9+
* Poetry for dependency management
* A running Milvus instance (version 2.3.x or compatible).
* Ollama server running (optional, if you intend to use Ollama models). Ensure you have pulled the necessary models (e.g., `ollama pull llama2`, `ollama pull nomic-embed-text`).
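
Before installing, you can sanity-check that Milvus is reachable. A minimal sketch using `pymilvus` (pulled in with the Milvus integration); the host and port are the defaults from `.env.example`:

```python
# Quick connectivity check against a local Milvus instance.
from pymilvus import connections, utility

connections.connect(alias="default", host="localhost", port="19530")
print(utility.get_server_version())  # prints e.g. "v2.3.x" if the server is up
```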

### Installation Steps

1. **Clone the repository:**
```bash
git clone <repository_url> # Replace <repository_url> with the actual URL
cd <repository_name> # Replace <repository_name> with the cloned directory name
```

2. **Install dependencies:**
This project uses Poetry to manage dependencies.
```bash
poetry install --no-root
```
This command installs all necessary Python packages defined in `pyproject.toml`.

3. **Set up Environment Variables:**
Copy the example environment file to a new `.env` file:
```bash
cp .env.example .env
```
Then, edit the `.env` file and provide the necessary values. Below are the key variables:

* `LANGCHAIN_TRACING_V2`: (Optional) Set to `"true"` to enable LangSmith tracing.
* `LANGCHAIN_ENDPOINT`: (Optional) LangSmith API endpoint, if tracing.
* `LANGCHAIN_API_KEY`: (Optional) Your LangSmith API key, if tracing.
* `LANGCHAIN_PROJECT`: (Optional) Your LangSmith project name, if tracing.

* `EMBEDDING_PROVIDER`: Specifies the embedding model provider.
* Set to `"openai"` to use OpenAI embeddings.
* Set to `"ollama"` to use a local Ollama embedding model.
* *Default: "openai"*
* `LLM_PROVIDER`: Specifies the language model provider for the chat.
* Set to `"openai"` to use an OpenAI model.
* Set to `"ollama"` to use a local Ollama model.
* *Default: "openai"*

* `OPENAI_API_KEY`: **Required if `EMBEDDING_PROVIDER` or `LLM_PROVIDER` is "openai".** Your API key for OpenAI.

* `OLLAMA_BASE_URL`: **Required if `EMBEDDING_PROVIDER` or `LLM_PROVIDER` is "ollama".** The base URL for your Ollama server.
* *Default: "http://localhost:11434"*
* `OLLAMA_EMBEDDING_MODEL`: The name of the embedding model to use with Ollama (e.g., "nomic-embed-text").
* *Default: "nomic-embed-text"*
* `OLLAMA_LLM_MODEL`: The name of the language model to use with Ollama (e.g., "llama2").
* *Default: "llama2"*

* `MILVUS_HOST`: The hostname or IP address of your Milvus instance.
* *Default: "localhost"*
* `MILVUS_PORT`: The port number for your Milvus instance.
* *Default: "19530"*
* `MILVUS_COLLECTION_NAME`: The name of the collection to be used in Milvus for storing document embeddings.
* *Default: "knowledge_base"*

* `RECORD_MANAGER_DB_URL`: The database URL for the record manager, which keeps track of ingested documents. SQLite is used by default.
* *Example: "sqlite:///./chat_langchain_ingestion.db"*
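
For reference, here is a minimal sketch of how a backend can dispatch on `EMBEDDING_PROVIDER`. It mirrors the variables above, but the actual wiring in `backend/graph.py` may differ:

```python
import os

def get_embeddings():
    """Pick an embeddings client based on EMBEDDING_PROVIDER."""
    if os.environ.get("EMBEDDING_PROVIDER", "openai") == "ollama":
        from langchain_community.embeddings import OllamaEmbeddings
        return OllamaEmbeddings(
            base_url=os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434"),
            model=os.environ.get("OLLAMA_EMBEDDING_MODEL", "nomic-embed-text"),
        )
    from langchain_openai import OpenAIEmbeddings
    return OpenAIEmbeddings()  # reads OPENAI_API_KEY from the environment
```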

## Data Preparation (IMPORTANT)

1. **Place PDF Files:**
    Place all the PDF documents you want the chatbot to use into the `knowledge_docs` folder at the root of this repository. If this folder does not exist, create it.

2. **Implement Source URL Extraction (User Action Required):**
The current system for extracting source URLs from PDFs (to provide accurate citations) uses a **placeholder**. You **must** modify the PDF ingestion logic to correctly extract the source URL for each document.

* **File to Modify:** `backend/ingest.py`
* **Function to Modify:** `load_knowledge_base_pdfs`
* **Area to Modify:** Look for the variable `extracted_url`. Currently, it is hardcoded:
```python
extracted_url = "placeholder_url_needs_implementation"
```
* **Your Task:** Implement logic to find and assign the correct source URL for each PDF. This could involve:
* Parsing the PDF content for a specific line or pattern (e.g., a line starting with "Source: http://..." or "Canonical URL: https://...").
* Using a sidecar metadata file for each PDF.
* Deriving the URL from the PDF's filename if it follows a consistent pattern.
* Querying an external system based on PDF content or filename.

**Example of what you might look for in a PDF's text content:**
```
Source: https://my-company.com/original-document.pdf
```
Or it might be embedded in the PDF's metadata if the generating system includes it. The `PyPDFLoader` loads document pages; you might need to inspect `page.page_content` or `page.metadata` for clues.

**Failure to implement this step will result in all sources being cited with the placeholder URL.**
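
As a starting point, here is a minimal sketch of the first approach above (scanning page text for a `Source:` or `Canonical URL:` line). The regex and the fallback value are illustrative assumptions, not the repository's logic:

```python
import re
from typing import List

from langchain_core.documents import Document

# Matches lines like "Source: https://..." or "Canonical URL: https://..."
SOURCE_RE = re.compile(r"(?:Source|Canonical URL):\s*(https?://\S+)")

def extract_source_url(pages: List[Document]) -> str:
    """Return the first source URL found in a PDF's loaded pages."""
    for page in pages:
        match = SOURCE_RE.search(page.page_content)
        if match:
            return match.group(1)
    return "placeholder_url_needs_implementation"  # last-resort fallback
```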

## Running the Application

Ensure your Milvus instance (and Ollama server, if using it) is running before starting the application.

1. **Step 1: Run Ingestion (One-time per data update)**
This script processes the PDFs in `knowledge_docs`, creates embeddings, and stores them in Milvus. Run it once initially, and then again whenever you add, remove, or change PDF files.
```bash
poetry run python backend/ingest.py
```

2. **Step 2: Run Backend Server (LangServe)**
This starts the FastAPI server that serves your chat graph using LangServe.
```bash
poetry run uvicorn backend.server:app --reload --host 0.0.0.0 --port 8000
```
The backend will be accessible at `http://localhost:8000`.

3. **Step 3: Run Streamlit Frontend**
In a new terminal, start the Streamlit application.
```bash
poetry run streamlit run streamlit_app.py
```
Streamlit will typically open in your web browser automatically, or provide a URL (usually `http://localhost:8501`).
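
For context, `backend/server.py` exposes the graph through LangServe roughly like the sketch below; the import path of the graph object is an assumption:

```python
# Sketch of the LangServe wiring; module and attribute names are assumed.
from fastapi import FastAPI
from langserve import add_routes

from backend.graph import graph  # the runnable defined in backend/graph.py

app = FastAPI()
add_routes(app, graph, path="/chat_langchain")
```

LangServe also serves an interactive playground at `http://localhost:8000/chat_langchain/playground/` and OpenAPI docs at `http://localhost:8000/docs`, both useful for debugging.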

## Usage

1. Open the Streamlit application URL in your web browser (e.g., `http://localhost:8501`).
2. In the sidebar, you can:
    * Verify or change the LangServe URL (defaults to `http://localhost:8000/chat_langchain/`). This endpoint can also be called directly; see the sketch after this list.
* Select the LLM model you wish to use from the dropdown menu. The available models depend on your backend configuration and environment variables (e.g., OpenAI keys, Ollama setup).
3. Type your question related to the content of your ingested PDF documents into the chat input box at the bottom of the page and press Enter.
4. The chatbot will process your question, retrieve relevant information from the documents in Milvus, generate an answer, and display it along with source citations.
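
The endpoint the Streamlit app calls can also be exercised directly, which is handy when debugging the backend in isolation. A sketch using LangServe's client; the `question` input key is an assumption, so confirm the actual schema via the playground or `/docs` first:

```python
from langserve import RemoteRunnable

chain = RemoteRunnable("http://localhost:8000/chat_langchain/")
# The expected input shape depends on the graph; a "question" key is assumed here.
print(chain.invoke({"question": "What do the ingested documents say about pricing?"}))
```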

---

This README provides a comprehensive guide to setting up and running the custom chatbot. Remember to implement the PDF source URL extraction logic for accurate citations.

## Project Structure and Key Files

Here's a brief overview of the key files and directories in this project:

* **`streamlit_app.py`**: The main Streamlit application file that provides the user interface for the chatbot. You run this file to start the chat application.
* **`backend/`**: This directory contains all the Python backend logic.
* **`backend/server.py`**: The FastAPI server that exposes the LangServe chat graph as an API. This is what the Streamlit frontend communicates with.
* **`backend/graph.py`**: Defines the core chat logic using LangGraph. It constructs the conversational chain, including document retrieval, history management, LLM calls, and response synthesis.
    * **`backend/ingest.py`**: Handles the data ingestion pipeline. It reads PDFs from the `knowledge_docs` folder, extracts text (and is designed for you to add logic to extract source URLs), generates embeddings, and stores them in the Milvus vector database. A compressed sketch of this pipeline follows this list.
* **`backend/constants.py`**: Originally for Weaviate constants; now mostly cleaned up. Can be used for any backend-specific global constants if needed.
* **`knowledge_docs/`**: This is the directory where you must place your PDF documents for ingestion. The `ingest.py` script processes files from here.
* **`.env.example`**: A template file for environment variables. You should copy this to a `.env` file and fill in your specific configurations (API keys, Milvus/Ollama URLs, etc.).
* **`.env`**: (User-created) This file stores your actual environment variable configurations. It's listed in `.gitignore` and should not be committed to version control.
* **`pyproject.toml`**: Defines project dependencies and metadata for Poetry.
* **`poetry.lock`**: The lock file generated by Poetry, ensuring consistent dependency versions.
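
To make the `backend/ingest.py` bullet above concrete, here is a compressed sketch of the pipeline it implements. The loader, splitter, and vectorstore calls are standard LangChain APIs; the chunk sizes and the use of OpenAI embeddings are assumptions:

```python
import os
from pathlib import Path

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Milvus
from langchain_openai import OpenAIEmbeddings  # or OllamaEmbeddings, per EMBEDDING_PROVIDER

def ingest(docs_dir: str = "knowledge_docs") -> None:
    # Load every PDF page as a LangChain Document.
    pages = []
    for pdf in sorted(Path(docs_dir).glob("*.pdf")):
        pages.extend(PyPDFLoader(str(pdf)).load())
    # Split pages into overlapping chunks (sizes are assumed, not the repo's).
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=200
    ).split_documents(pages)
    # Embed and store in Milvus using the connection settings from .env.
    Milvus.from_documents(
        chunks,
        embedding=OpenAIEmbeddings(),
        collection_name=os.environ.get("MILVUS_COLLECTION_NAME", "knowledge_base"),
        connection_args={
            "host": os.environ.get("MILVUS_HOST", "localhost"),
            "port": os.environ.get("MILVUS_PORT", "19530"),
        },
    )

if __name__ == "__main__":
    ingest()
```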
2 changes: 1 addition & 1 deletion backend/constants.py
@@ -1 +1 @@
WEAVIATE_DOCS_INDEX_NAME = "LangChain_Combined_Docs_OpenAI_text_embedding_3_small"
# WEAVIATE_DOCS_INDEX_NAME = "LangChain_Combined_Docs_OpenAI_text_embedding_3_small"