diff --git a/.gitignore b/.gitignore index d5dde41..2b8ce4c 100644 --- a/.gitignore +++ b/.gitignore @@ -5,6 +5,9 @@ changelogs/ +# Generated embeddings file +embeddings.json + # dotenv files .env diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md new file mode 100644 index 0000000..98fa0c5 --- /dev/null +++ b/ARCHITECTURE.md @@ -0,0 +1,247 @@ +# Vector Store Persistence Architecture + +## Workflow Overview + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ BUILD TIME │ +├─────────────────────────────────────────────────────────────────────┤ +│ │ +│ 1. build.sh script runs │ +│ │ │ +│ ├─> Builds csla-embeddings-generator CLI tool │ +│ │ │ +│ ├─> Runs CLI tool: │ +│ │ ├─> Scans csla-examples/ directory │ +│ │ ├─> Connects to Azure OpenAI │ +│ │ ├─> Generates embeddings for each file │ +│ │ └─> Saves to embeddings.json │ +│ │ │ +│ └─> Builds Docker container: │ +│ ├─> Copies csla-examples/ → /csla-examples │ +│ ├─> Copies embeddings.json → /app/embeddings.json │ +│ └─> Copies application → /app │ +│ │ +└─────────────────────────────────────────────────────────────────────┘ + +┌─────────────────────────────────────────────────────────────────────┐ +│ RUNTIME (Container) │ +├─────────────────────────────────────────────────────────────────────┤ +│ │ +│ 1. Server starts │ +│ │ │ +│ ├─> Initializes VectorStoreService │ +│ │ ├─> Requires Azure OpenAI credentials │ +│ │ └─> Used for user query embeddings │ +│ │ │ +│ └─> Loads embeddings: │ +│ │ │ +│ ├─> Checks for /app/embeddings.json │ +│ │ │ +│ ├─> If found: │ +│ │ ├─> Loads pre-generated embeddings (FAST!) │ +│ │ ├─> Populates in-memory vector store │ +│ │ └─> Ready in seconds │ +│ │ │ +│ └─> If not found: │ +│ ├─> Displays warning message │ +│ ├─> Semantic search disabled │ +│ └─> Keyword search still available │ +│ │ +│ 2. 
Server handles user requests │ +│ │ │ +│ └─> Search requests: │ +│ ├─> Uses pre-loaded file embeddings (if available) │ +│ ├─> Generates embedding for user query (Azure OpenAI) │ +│ └─> Returns semantic search results │ +│ │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +## Component Interaction + +``` +┌──────────────────────┐ +│ csla-embeddings- │ +│ generator │ +│ (CLI Tool) │ +└──────────┬───────────┘ + │ generates + ↓ + ┌──────────────┐ + │ embeddings. │ + │ json │ + └──────┬───────┘ + │ included in + ↓ + ┌──────────────┐ ┌─────────────────┐ + │ Docker │────────→│ csla-mcp- │ + │ Container │ contains │ server │ + └──────────────┘ └────────┬────────┘ + │ loads on startup + ↓ + ┌─────────────────┐ + │ VectorStore │ + │ Service │ + │ (in-memory) │ + └────────┬────────┘ + │ serves + ↓ + ┌─────────────────┐ + │ Search API │ + │ Requests │ + └─────────────────┘ +``` + +## Data Flow + +### Build Time Data Flow +``` +csla-examples/*.{cs,md} + │ + │ (read) + ↓ +csla-embeddings-generator + │ + │ (Azure OpenAI API) + ↓ +embeddings.json (5-20 MB) + │ + │ (COPY in Dockerfile) + ↓ +Docker Container: /app/embeddings.json +``` + +### Runtime Data Flow +``` +/app/embeddings.json + │ + │ (LoadEmbeddingsFromJsonAsync) + ↓ +VectorStoreService._vectorStore + │ (in-memory Dictionary) + │ + │ (semantic search) + ↓ +Search Results +``` + +## Key Design Decisions + +### 1. JSON Format Choice +- **Decision**: Use JSON for embeddings storage +- **Rationale**: + - Simple to implement + - Human-readable for debugging + - Built-in .NET serialization support + - No external database dependencies + - Sufficient for expected file count (<1000 files) + +### 2. Build-Time Generation +- **Decision**: Generate embeddings during Docker build +- **Rationale**: + - Separates concerns (build vs. runtime) + - Reduces runtime dependencies + - Faster container startup + - Embeddings immutable per build + +### 3. 
No Runtime Generation +- **Decision**: Server does not generate embeddings for example files at runtime +- **Rationale**: + - Separates concerns (build-time generation vs. runtime loading) + - Reduces runtime dependencies on Azure OpenAI for file indexing + - Faster startup guaranteed (no waiting for generation) + - Clear separation of build-time and runtime operations + - Azure OpenAI only needed at runtime for user queries + +### 4. In-Memory Storage +- **Decision**: Continue using in-memory Dictionary +- **Rationale**: + - Fast lookups (O(1)) + - Simple implementation + - No serialization overhead during queries + - Matches existing architecture + +## File Formats + +### embeddings.json Structure +```json +[ + { + "FileName": "relative/path/to/file.cs", + "Content": "full file content...", + "Embedding": [0.1, 0.2, ..., 0.n], // 1536 floats for text-embedding-3-small + "Version": 10 // or null for version-agnostic + } +] +``` + +### File Size Estimates +- Per file: ~50-200 KB (depending on content size) +- Typical total: 5-20 MB for full example set +- Embedding vector: ~6 KB (1536 floats × 4 bytes) + +## Performance Characteristics + +### Build Time +- Embeddings generation: ~30-60 seconds (depends on file count) +- Docker build: +2-5 seconds (copy embeddings.json) + +### Runtime (with embeddings.json) +- Startup time: 2-5 seconds +- Memory footprint: +5-20 MB +- Search latency: Unchanged (still requires user query embedding) + +### Runtime (without embeddings.json) +- Startup time: 2-5 seconds (same as with embeddings) +- Semantic search: Disabled (keyword search still available) +- Memory footprint: Minimal (no embeddings loaded) + +## Security Considerations + +### Build Time +- Azure OpenAI credentials required +- Credentials should be in CI/CD environment variables +- embeddings.json contains no secrets (only vectors and content) + +### Runtime +- Azure OpenAI credentials still required for user queries +- embeddings.json is read-only +- No external 
database credentials needed + +## Scalability + +### Current Implementation +- Suitable for: 10-1000 files +- Memory: ~5-20 MB +- Load time: ~2-5 seconds + +### If Scale Increases +Consider: +- Compression (gzip embeddings.json) +- Lazy loading (load embeddings on demand) +- External vector database (Pinecone, Weaviate, etc.) +- Pagination/chunking + +## Maintenance + +### When to Regenerate Embeddings +- When code samples change +- When CSLA version files are added/modified +- When switching embedding models +- As part of CI/CD pipeline + +### Monitoring +- Check embeddings.json file size +- Monitor server startup time +- Track API costs (should decrease significantly) + +## Future Enhancements + +Potential improvements: +1. **Incremental Updates**: Only regenerate changed files +2. **Versioning**: Track embeddings format version +3. **Compression**: Compress embeddings.json +4. **Checksums**: Validate embeddings file integrity +5. **Metadata**: Store generation timestamp, model version +6. **Caching**: Add HTTP caching headers for embeddings diff --git a/IMPLEMENTATION.md b/IMPLEMENTATION.md new file mode 100644 index 0000000..ac7bb18 --- /dev/null +++ b/IMPLEMENTATION.md @@ -0,0 +1,181 @@ +# Vector Store Persistence Implementation + +## Overview + +This document describes the implementation of pre-generated vector embeddings for the CSLA MCP server, addressing issue #XX which requested storing vector results to avoid regenerating embeddings on every server startup. + +## Problem Statement + +Previously, the csla-mcp-server would: +1. Start up +2. Connect to Azure OpenAI +3. Generate embeddings for all code samples in the csla-examples directory +4. 
Store embeddings in memory
+
+This approach had several drawbacks:
+- Slow startup time (waiting for all embeddings to be generated)
+- High Azure OpenAI API costs (regenerating embeddings on every restart)
+- Unnecessary computation for unchanged files
+
+## Solution
+
+The solution implements a build-time embedding generation process:
+
+### 1. CLI Tool (csla-embeddings-generator)
+
+A new console application that:
+- Scans the csla-examples directory for .cs and .md files
+- Connects to Azure OpenAI to generate embeddings
+- Saves all embeddings to a JSON file (`embeddings.json`)
+- Can be run independently or as part of the build process
+
+**Location**: `csla-embeddings-generator/`
+
+**Key Classes**:
+- `Program.cs`: Entry point with command-line argument parsing
+- `EmbeddingsGenerator.cs`: Core logic for generating embeddings
+- `DocumentEmbedding.cs`: Data model for document embeddings
+
+### 2. Enhanced VectorStoreService
+
+The VectorStoreService now supports:
+- Loading embeddings from a JSON file via `LoadEmbeddingsFromJsonAsync()`
+- Exporting embeddings to a JSON file via `ExportEmbeddingsToJsonAsync()`
+- Maintaining the same in-memory structure as before
+
+**Changes**: `csla-mcp-server/Services/VectorStoreService.cs`
+
+### 3. Updated Startup Logic
+
+The server now:
+1. Checks for `embeddings.json` in the application directory
+2. If found, loads pre-generated embeddings (fast startup)
+3. If not found, logs a warning and runs with semantic search disabled (keyword search remains available)
+4. Still requires Azure OpenAI credentials at runtime for user query embeddings
+
+**Changes**: `csla-mcp-server/Program.cs`
+
+### 4. Build Process Integration
+
+The `build.sh` script now:
+1. Builds the embeddings generator CLI tool
+2. Runs the CLI tool to generate `embeddings.json`
+3. Creates an empty JSON array if generation fails (allows Docker build to succeed)
+4. Builds the Docker container with embeddings included
+
+**Changes**: `build.sh`
+
+### 5. 
Docker Container + +The Dockerfile now: +- Copies the `embeddings.json` file into the container at `/app/embeddings.json` +- Server loads embeddings from this location on startup + +**Changes**: `csla-mcp-server/Dockerfile` + +## Benefits + +1. **Faster Startup**: Server starts immediately, loading pre-generated embeddings from disk +2. **Reduced Costs**: Embeddings generated once during build, not on every server restart +3. **Simpler Architecture**: Server only loads embeddings, doesn't generate them for example files +4. **Build-time Validation**: Embedding generation happens during build, catching issues early + +## Usage + +### Building with Embeddings + +Use the provided build script: + +```bash +./build.sh +``` + +This will generate embeddings and build the Docker container. + +### Manual Embedding Generation + +Generate embeddings independently: + +```bash +dotnet run --project csla-embeddings-generator -- --examples-path ./csla-examples --output ./embeddings.json +``` + +### Running the Server + +The server automatically loads embeddings if available: + +```bash +docker run --rm -p 8080:80 \ + -e AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/" \ + -e AZURE_OPENAI_API_KEY="your-api-key" \ + csla-mcp-server:latest +``` + +## File Structure + +``` +csla-mcp/ +├── csla-embeddings-generator/ # New CLI tool +│ ├── Program.cs +│ ├── EmbeddingsGenerator.cs +│ ├── DocumentEmbedding.cs +│ ├── csla-embeddings-generator.csproj +│ └── README.md +├── csla-mcp-server/ +│ ├── Services/ +│ │ └── VectorStoreService.cs # Enhanced with JSON I/O +│ ├── Program.cs # Updated startup logic +│ └── Dockerfile # Updated to copy embeddings.json +├── build.sh # Updated build script +├── embeddings.json # Generated file (gitignored) +└── readme.md # Updated documentation + +``` + +## Configuration + +### Environment Variables (Build Time) + +Required for embedding generation: +- `AZURE_OPENAI_ENDPOINT`: Azure OpenAI service endpoint +- `AZURE_OPENAI_API_KEY`: Azure 
OpenAI API key
+- `AZURE_OPENAI_EMBEDDING_MODEL`: Embedding model name (default: text-embedding-3-small)
+
+### Environment Variables (Runtime)
+
+Still required for user query embeddings:
+- `AZURE_OPENAI_ENDPOINT`: Azure OpenAI service endpoint
+- `AZURE_OPENAI_API_KEY`: Azure OpenAI API key
+
+## Testing
+
+Full end-to-end testing requires:
+
+1. **Build Environment**: .NET 10.0 SDK
+2. **Azure Credentials**: Valid Azure OpenAI endpoint and API key
+3. **Code Samples**: The csla-examples directory with sample files
+
+### Manual Testing Steps
+
+1. Set Azure OpenAI environment variables
+2. Run `./build.sh` to generate embeddings and build container
+3. Verify `embeddings.json` is created and contains embeddings
+4. Run the Docker container
+5. Check startup logs for "Loaded X pre-generated embeddings"
+6. Test search functionality to ensure semantic search works
+
+## Backward Compatibility
+
+The implementation maintains API compatibility:
+- If `embeddings.json` is missing, the server will start, but semantic search will be disabled
+- Keyword search continues to work without embeddings
+- Users are notified to generate embeddings using the CLI tool
+- No breaking changes to the API or user experience
+
+## Future Enhancements
+
+Potential improvements:
+1. Incremental updates: Only regenerate embeddings for changed files
+2. Compression: Compress embeddings.json to reduce size
+3. Versioning: Track embeddings file version for compatibility
+4. Caching: Add timestamps to detect stale embeddings
diff --git a/TESTING.md b/TESTING.md
new file mode 100644
index 0000000..1ca55c6
--- /dev/null
+++ b/TESTING.md
@@ -0,0 +1,225 @@
+# Testing Guide for Vector Store Persistence
+
+## Overview
+
+This guide explains how to test the new vector store persistence feature, which stores pre-generated vector results so they do not need to be regenerated on every server startup.
+
+## Prerequisites for Testing
+
+1. 
**.NET 10.0 SDK** - The solution targets .NET 10.0 +2. **Azure OpenAI Account** with: + - Valid endpoint URL + - API key + - Deployed embedding model (e.g., `text-embedding-3-small`) +3. **Docker** (for container testing) + +## Setting Up Environment Variables + +Before testing, set the required environment variables: + +### PowerShell (Windows) +```powershell +$env:AZURE_OPENAI_ENDPOINT = "https://your-resource.openai.azure.com/" +$env:AZURE_OPENAI_API_KEY = "your-api-key-here" +$env:AZURE_OPENAI_EMBEDDING_MODEL = "text-embedding-3-small" +``` + +### Bash (Linux/macOS) +```bash +export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/" +export AZURE_OPENAI_API_KEY="your-api-key-here" +export AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-3-small" +``` + +## Test 1: CLI Tool - Generate Embeddings + +### Run the embeddings generator +```bash +cd /path/to/csla-mcp +dotnet run --project csla-embeddings-generator -- --examples-path ./csla-examples --output ./embeddings.json +``` + +### Expected Output +``` +[EmbeddingsGenerator] Starting CSLA Embeddings Generator +[EmbeddingsGenerator] Examples path: ./csla-examples +[EmbeddingsGenerator] Output path: ./embeddings.json +[EmbeddingsGenerator] Using Azure OpenAI endpoint: https://... +[EmbeddingsGenerator] Using embedding model: text-embedding-3-small +[EmbeddingsGenerator] Found XX files to process +[EmbeddingsGenerator] Processing: DataPortalOperationCreate.md (common) +[EmbeddingsGenerator] Processed 5/XX files... +[EmbeddingsGenerator] Successfully processed XX files +[EmbeddingsGenerator] Generated XX embeddings +[EmbeddingsGenerator] Embeddings saved to ./embeddings.json +``` + +### Verification +1. Check that `embeddings.json` was created +2. Verify file size (should be several MB for typical example set) +3. Open file and verify JSON structure: +```json +[ + { + "FileName": "DataPortalOperationCreate.md", + "Content": "...", + "Embedding": [0.123, 0.456, ...], + "Version": null + }, + ... 
+] +``` + +## Test 2: Server - Load Pre-generated Embeddings + +### Copy embeddings to server directory +```bash +cp embeddings.json csla-mcp-server/bin/Debug/net10.0/ +``` + +### Run the server +```bash +cd csla-mcp-server +dotnet run +``` + +### Expected Output +Look for these key log messages: +``` +[Startup] Using Azure OpenAI endpoint: https://... +[Startup] Using embedding model deployment: text-embedding-3-small +[Startup] Vector store initialized successfully - semantic search enabled. +[Startup] Loaded XX pre-generated embeddings from /path/to/embeddings.json +``` + +### Verification +- Server should start quickly (within seconds) +- No embedding generation messages for individual files +- Search functionality should work normally + +## Test 3: Server - Missing Embeddings + +### Remove embeddings.json +```bash +rm csla-mcp-server/bin/Debug/net10.0/embeddings.json +``` + +### Run the server again +```bash +cd csla-mcp-server +dotnet run +``` + +### Expected Output +``` +[Startup] Using Azure OpenAI endpoint: https://... +[Startup] Using embedding model deployment: text-embedding-3-small +[Startup] Vector store initialized successfully - semantic search enabled. +[Startup] Warning: No pre-generated embeddings found. Semantic search will not be available. +[Startup] To enable semantic search, generate embeddings using: dotnet run --project csla-embeddings-generator +``` + +### Verification +- Server starts quickly +- Warning message indicates semantic search is not available +- Keyword search continues to work +- Server does NOT attempt to generate embeddings at runtime + +## Test 4: Build Script + +### Run the full build script +```bash +./build.sh +``` + +### Expected Output +``` +Building embeddings generator... +Build succeeded. +Generating embeddings... +[EmbeddingsGenerator] Starting CSLA Embeddings Generator +... +[EmbeddingsGenerator] Embeddings saved to ./embeddings.json +Building Docker container... +[+] Building XX.Xs (XX/XX finished) +... 
+``` + +### Verification +1. `embeddings.json` created in repository root +2. Docker image built successfully: `docker images | grep csla-mcp-server` +3. Image includes embeddings file + +## Test 5: Docker Container + +### Run the container +```bash +docker run --rm -p 8080:80 \ + -e AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com/" \ + -e AZURE_OPENAI_API_KEY="your-api-key" \ + csla-mcp-server:latest +``` + +### Expected Output (in container logs) +``` +[Startup] Vector store initialized successfully - semantic search enabled. +[Startup] Loaded XX pre-generated embeddings from /app/embeddings.json +[Startup] Skipping embedding generation - using pre-generated embeddings +``` + +### Verification +- Container starts quickly +- Pre-generated embeddings are used +- Search API responds correctly + +## Test 6: Semantic Search Functionality + +### Test the search functionality +Once the server is running, test the search endpoint to ensure semantic search works with pre-generated embeddings. + +The search should return relevant results based on semantic similarity. 
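Semantic search over the pre-generated embeddings reduces to ranking stored file vectors by cosine similarity against the query embedding. A minimal illustrative sketch (Python, not the server's actual C# implementation; the toy two-dimensional vectors stand in for real 1536-dimensional embeddings):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    mag = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / mag if mag else 0.0

def search(query_embedding, store, top_k=3):
    # Rank every stored document by similarity to the query embedding.
    scored = [(doc["FileName"], cosine_similarity(query_embedding, doc["Embedding"]))
              for doc in store]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]

# Toy store mirroring the embeddings.json document shape.
store = [
    {"FileName": "a.md", "Embedding": [1.0, 0.0]},
    {"FileName": "b.md", "Embedding": [0.0, 1.0]},
]
print(search([0.9, 0.1], store, top_k=1))
```

With pre-generated embeddings loaded, the only per-request Azure OpenAI call is the one that embeds the user's query; the ranking itself runs entirely in memory.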
+ +## Troubleshooting + +### Issue: CLI tool fails with "AZURE_OPENAI_ENDPOINT not set" +**Solution**: Ensure environment variables are set in the current shell session + +### Issue: Docker build fails with "embeddings.json not found" +**Solution**: Run `build.sh` which generates embeddings before building Docker image + +### Issue: Server loads 0 embeddings +**Solution**: +- Check that embeddings.json exists in the application directory +- Verify JSON file is valid and not empty +- Check file permissions + +### Issue: Embeddings file is too large +**Solution**: This is expected - embeddings contain float arrays with 1536+ dimensions per file + +## Performance Comparison + +### Before (Runtime Generation) +- Startup time: ~30-60 seconds for typical example set +- Azure OpenAI API calls: One per file on every startup + +### After (Pre-generated Embeddings) +- Startup time: ~2-5 seconds +- Azure OpenAI API calls: Zero for file embeddings (only for user queries) + +## Success Criteria + +The implementation is successful if: +1. ✅ CLI tool generates valid embeddings.json +2. ✅ Server loads pre-generated embeddings on startup +3. ✅ Server startup is significantly faster with pre-generated embeddings +4. ✅ Server provides clear warnings if embeddings.json is missing +5. ✅ Semantic search functionality works correctly with pre-generated embeddings +6. ✅ Docker container includes and uses pre-generated embeddings +7. 
✅ Build script generates embeddings before building container + +## Notes + +- The embeddings.json file is gitignored as it's a build artifact +- File size will vary based on number of examples (typically 5-20 MB) +- Each embedding is a float array (1536 dimensions for text-embedding-3-small) +- Pre-generated embeddings are immutable - regenerate when examples change diff --git a/build.sh b/build.sh index 296f2ec..75b5be5 100644 --- a/build.sh +++ b/build.sh @@ -1,3 +1,20 @@ #!/usr/bin/env bash +# Build the embeddings generator CLI tool +echo "Building embeddings generator..." +dotnet build csla-embeddings-generator/csla-embeddings-generator.csproj -c Release + +# Run the embeddings generator to create embeddings.json +echo "Generating embeddings..." +dotnet run --project csla-embeddings-generator/csla-embeddings-generator.csproj --configuration Release -- --examples-path ./csla-examples --output ./embeddings.json + +# If embeddings.json doesn't exist (e.g., missing Azure credentials), create an empty array JSON file +# This allows the Docker build to succeed, but semantic search will be disabled at runtime +if [ ! -f ./embeddings.json ]; then + echo "Warning: embeddings.json not created, creating empty file for Docker build" + echo "[]" > ./embeddings.json +fi + +# Build the Docker container with the embeddings.json file +echo "Building Docker container..." docker build -t csla-mcp-server:latest -f csla-mcp-server/Dockerfile . 
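Because the build script writes an empty `[]` file when generation fails, a quick sanity check of the generated artifact can catch a silently disabled semantic search before the container ships. An illustrative check (Python sketch, not part of this change; field names follow the documented embeddings.json structure):

```python
import json

def validate_embeddings(docs):
    # Return (document count, embedding dimensions) for a parsed
    # embeddings.json payload. An empty list corresponds to the build
    # script's "[]" fallback, i.e. semantic search will be disabled.
    if not docs:
        return 0, 0
    dims = {len(d["Embedding"]) for d in docs}
    if len(dims) != 1:
        raise ValueError(f"inconsistent embedding dimensions: {dims}")
    return len(docs), dims.pop()

# Example payload mirroring the documented structure.
payload = json.loads(
    '[{"FileName": "a.md", "Content": "...", '
    '"Embedding": [0.1, 0.2, 0.3], "Version": null}]'
)
print(validate_embeddings(payload))  # (1, 3)
```

A real build could run such a check right after the generator step and fail fast (or at least warn loudly) when the count is zero.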
diff --git a/csla-embeddings-generator/DocumentEmbedding.cs b/csla-embeddings-generator/DocumentEmbedding.cs
new file mode 100644
index 0000000..1654cfc
--- /dev/null
+++ b/csla-embeddings-generator/DocumentEmbedding.cs
@@ -0,0 +1,12 @@
+namespace CslaEmbeddingsGenerator;
+
+/// <summary>
+/// Represents a document with its embedding vector
+/// </summary>
+public class DocumentEmbedding
+{
+    public string FileName { get; set; } = string.Empty;
+    public string Content { get; set; } = string.Empty;
+    public float[] Embedding { get; set; } = Array.Empty<float>();
+    public int? Version { get; set; } = null;
+}
diff --git a/csla-embeddings-generator/EmbeddingsGenerator.cs b/csla-embeddings-generator/EmbeddingsGenerator.cs
new file mode 100644
index 0000000..3d421dd
--- /dev/null
+++ b/csla-embeddings-generator/EmbeddingsGenerator.cs
@@ -0,0 +1,117 @@
+using Azure.AI.OpenAI;
+using Azure;
+using OpenAI.Embeddings;
+
+namespace CslaEmbeddingsGenerator;
+
+/// <summary>
+/// Generates vector embeddings for CSLA code samples
+/// </summary>
+public class EmbeddingsGenerator
+{
+    private readonly AzureOpenAIClient _openAIClient;
+    private readonly string _embeddingModelName;
+
+    public EmbeddingsGenerator(string azureOpenAIEndpoint, string azureOpenAIApiKey, string embeddingModelName = "text-embedding-3-small")
+    {
+        var clientOptions = new AzureOpenAIClientOptions();
+        _openAIClient = new AzureOpenAIClient(new Uri(azureOpenAIEndpoint), new AzureKeyCredential(azureOpenAIApiKey), clientOptions);
+        _embeddingModelName = embeddingModelName;
+    }
+
+    /// <summary>
+    /// Generates embeddings for all files in the examples directory
+    /// </summary>
+    public async Task<List<DocumentEmbedding>> GenerateEmbeddingsAsync(string examplesPath)
+    {
+        var embeddings = new List<DocumentEmbedding>();
+
+        // Find all .cs and .md files
+        var csFiles = Directory.GetFiles(examplesPath, "*.cs", SearchOption.AllDirectories);
+        var mdFiles = Directory.GetFiles(examplesPath, "*.md", SearchOption.AllDirectories);
+        var allFiles = csFiles.Concat(mdFiles).ToArray();
+
+        Console.WriteLine($"[EmbeddingsGenerator] 
Found {allFiles.Length} files to process");
+
+        int processedCount = 0;
+        foreach (var file in allFiles)
+        {
+            try
+            {
+                var content = await File.ReadAllTextAsync(file);
+
+                // Get relative path from examples directory
+                var relativePath = Path.GetRelativePath(examplesPath, file);
+
+                // Detect version from path
+                int? version = null;
+                var pathParts = relativePath.Split(Path.DirectorySeparatorChar);
+                if (pathParts.Length > 1 && pathParts[0].StartsWith("v") && int.TryParse(pathParts[0].Substring(1), out var versionNumber))
+                {
+                    version = versionNumber;
+                }
+
+                // Normalize path separators to forward slash for consistency
+                var normalizedPath = relativePath.Replace("\\", "/");
+
+                var versionInfo = version.HasValue ? $" (v{version})" : " (common)";
+                Console.WriteLine($"[EmbeddingsGenerator] Processing: {normalizedPath}{versionInfo}");
+
+                var embedding = await GenerateEmbeddingAsync(content);
+
+                if (embedding != null && embedding.Length > 0)
+                {
+                    embeddings.Add(new DocumentEmbedding
+                    {
+                        FileName = normalizedPath,
+                        Content = content,
+                        Embedding = embedding,
+                        Version = version
+                    });
+
+                    processedCount++;
+                    if (processedCount % 5 == 0)
+                    {
+                        Console.WriteLine($"[EmbeddingsGenerator] Processed {processedCount}/{allFiles.Length} files...");
+                    }
+                }
+                else
+                {
+                    Console.WriteLine($"[EmbeddingsGenerator] Warning: Failed to generate embedding for {normalizedPath}");
+                }
+            }
+            catch (Exception ex)
+            {
+                Console.WriteLine($"[EmbeddingsGenerator] Error processing file {file}: {ex.Message}");
+            }
+        }
+
+        Console.WriteLine($"[EmbeddingsGenerator] Successfully processed {processedCount} files");
+        return embeddings;
+    }
+
+    /// <summary>
+    /// Generates an embedding vector for the given text
+    /// </summary>
+    private async Task<float[]?> GenerateEmbeddingAsync(string text)
+    {
+        try
+        {
+            var embeddingClient = _openAIClient.GetEmbeddingClient(_embeddingModelName);
+            var response = await embeddingClient.GenerateEmbeddingAsync(text);
+
+            if (response?.Value != null)
+            {
+                var embedding = 
response.Value.ToFloats().ToArray();
+                return embedding;
+            }
+
+            return null;
+        }
+        catch (Exception ex)
+        {
+            Console.WriteLine($"[EmbeddingsGenerator] Error generating embedding: {ex.Message}");
+            throw;
+        }
+    }
+}
diff --git a/csla-embeddings-generator/Program.cs b/csla-embeddings-generator/Program.cs
new file mode 100644
index 0000000..f4f03a3
--- /dev/null
+++ b/csla-embeddings-generator/Program.cs
@@ -0,0 +1,94 @@
+using System.Text.Json;
+using Azure.AI.OpenAI;
+using Azure;
+using OpenAI.Embeddings;
+
+namespace CslaEmbeddingsGenerator;
+
+class Program
+{
+    static async Task<int> Main(string[] args)
+    {
+        Console.WriteLine("[EmbeddingsGenerator] Starting CSLA Embeddings Generator");
+
+        // Parse command line arguments
+        string? examplesPath = null;
+        string? outputPath = null;
+
+        for (int i = 0; i < args.Length; i++)
+        {
+            if (args[i] == "--examples-path" && i + 1 < args.Length)
+            {
+                examplesPath = args[i + 1];
+                i++;
+            }
+            else if (args[i] == "--output" && i + 1 < args.Length)
+            {
+                outputPath = args[i + 1];
+                i++;
+            }
+        }
+
+        // Default values
+        examplesPath ??= Path.Combine(Directory.GetCurrentDirectory(), "csla-examples");
+        outputPath ??= Path.Combine(Directory.GetCurrentDirectory(), "embeddings.json");
+
+        Console.WriteLine($"[EmbeddingsGenerator] Examples path: {examplesPath}");
+        Console.WriteLine($"[EmbeddingsGenerator] Output path: {outputPath}");
+
+        // Validate examples path
+        if (!Directory.Exists(examplesPath))
+        {
+            Console.Error.WriteLine($"[EmbeddingsGenerator] Error: Examples directory not found at {examplesPath}");
+            return 1;
+        }
+
+        // Get Azure OpenAI configuration
+        var azureOpenAIEndpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT");
+        var azureOpenAIApiKey = Environment.GetEnvironmentVariable("AZURE_OPENAI_API_KEY");
+        var embeddingModel = Environment.GetEnvironmentVariable("AZURE_OPENAI_EMBEDDING_MODEL") ?? 
"text-embedding-3-small"; + + if (string.IsNullOrWhiteSpace(azureOpenAIEndpoint)) + { + Console.Error.WriteLine("[EmbeddingsGenerator] Error: AZURE_OPENAI_ENDPOINT environment variable is not set"); + return 2; + } + + if (string.IsNullOrWhiteSpace(azureOpenAIApiKey)) + { + Console.Error.WriteLine("[EmbeddingsGenerator] Error: AZURE_OPENAI_API_KEY environment variable is not set"); + return 3; + } + + Console.WriteLine($"[EmbeddingsGenerator] Using Azure OpenAI endpoint: {azureOpenAIEndpoint}"); + Console.WriteLine($"[EmbeddingsGenerator] Using embedding model: {embeddingModel}"); + + try + { + var generator = new EmbeddingsGenerator(azureOpenAIEndpoint, azureOpenAIApiKey, embeddingModel); + var embeddings = await generator.GenerateEmbeddingsAsync(examplesPath); + + Console.WriteLine($"[EmbeddingsGenerator] Generated {embeddings.Count} embeddings"); + + // Save to JSON file + var json = JsonSerializer.Serialize(embeddings, new JsonSerializerOptions + { + WriteIndented = true + }); + + await File.WriteAllTextAsync(outputPath, json); + Console.WriteLine($"[EmbeddingsGenerator] Embeddings saved to {outputPath}"); + + return 0; + } + catch (Exception ex) + { + Console.Error.WriteLine($"[EmbeddingsGenerator] Error: {ex.Message}"); + if (ex.InnerException != null) + { + Console.Error.WriteLine($"[EmbeddingsGenerator] Inner exception: {ex.InnerException.Message}"); + } + return 4; + } + } +} diff --git a/csla-embeddings-generator/README.md b/csla-embeddings-generator/README.md new file mode 100644 index 0000000..b50c1ef --- /dev/null +++ b/csla-embeddings-generator/README.md @@ -0,0 +1,81 @@ +# CSLA Embeddings Generator + +A command-line tool that generates vector embeddings for CSLA .NET code samples using Azure OpenAI. + +## Purpose + +This tool pre-generates vector embeddings for all code samples in the `csla-examples` directory and saves them to a JSON file. 
This eliminates the need to regenerate embeddings every time the MCP server starts, significantly reducing startup time and Azure OpenAI API costs. + +## Prerequisites + +- .NET 10.0 SDK +- Azure OpenAI service with a deployed embedding model (e.g., `text-embedding-3-small`) +- Required environment variables: + - `AZURE_OPENAI_ENDPOINT`: Your Azure OpenAI service endpoint + - `AZURE_OPENAI_API_KEY`: Your Azure OpenAI API key + - `AZURE_OPENAI_EMBEDDING_MODEL` (optional): The embedding model deployment name (default: `text-embedding-3-small`) + +## Usage + +### Basic Usage + +```bash +dotnet run --project csla-embeddings-generator +``` + +This will: +- Look for the `csla-examples` directory in the current working directory +- Generate embeddings for all `.cs` and `.md` files +- Save the results to `embeddings.json` in the current directory + +### Custom Paths + +You can specify custom paths for the examples directory and output file: + +```bash +dotnet run --project csla-embeddings-generator -- --examples-path /path/to/examples --output /path/to/embeddings.json +``` + +### Command-Line Options + +- `--examples-path `: Path to the directory containing code samples (default: `./csla-examples`) +- `--output `: Path where the embeddings JSON file will be saved (default: `./embeddings.json`) + +## Build Process Integration + +This tool is integrated into the Docker build process via the `build.sh` script: + +1. The tool is built and run to generate `embeddings.json` +2. The `embeddings.json` file is copied into the Docker container during build +3. The MCP server loads the pre-generated embeddings on startup instead of regenerating them + +## Output Format + +The tool generates a JSON file containing an array of document embeddings: + +```json +[ + { + "FileName": "DataPortalOperationCreate.md", + "Content": "...", + "Embedding": [0.123, 0.456, ...], + "Version": null + }, + ... 
+]
+```
+
+Each embedding includes:
+- `FileName`: Relative path to the file from the examples directory
+- `Content`: Full text content of the file
+- `Embedding`: Array of floating-point numbers representing the embedding vector
+- `Version`: CSLA version number (extracted from path) or null for common files
+
+## Error Handling
+
+The tool will exit with specific error codes:
+- `0`: Success
+- `1`: Examples directory not found
+- `2`: `AZURE_OPENAI_ENDPOINT` environment variable not set
+- `3`: `AZURE_OPENAI_API_KEY` environment variable not set
+- `4`: Error during embedding generation or file writing
diff --git a/csla-embeddings-generator/csla-embeddings-generator.csproj b/csla-embeddings-generator/csla-embeddings-generator.csproj
new file mode 100644
index 0000000..fa5976b
--- /dev/null
+++ b/csla-embeddings-generator/csla-embeddings-generator.csproj
@@ -0,0 +1,16 @@
+<Project Sdk="Microsoft.NET.Sdk">
+
+  <PropertyGroup>
+    <OutputType>Exe</OutputType>
+    <TargetFramework>net10.0</TargetFramework>
+    <RootNamespace>CslaEmbeddingsGenerator</RootNamespace>
+    <ImplicitUsings>enable</ImplicitUsings>
+    <Nullable>enable</Nullable>
+  </PropertyGroup>
+
+  <ItemGroup>
+    <PackageReference Include="Azure.AI.OpenAI" />
+  </ItemGroup>
+
+</Project>
diff --git a/csla-mcp-server/Dockerfile b/csla-mcp-server/Dockerfile
index 511a3b4..f2a7b0f 100644
--- a/csla-mcp-server/Dockerfile
+++ b/csla-mcp-server/Dockerfile
@@ -29,4 +29,6 @@ WORKDIR /
 COPY --from=build /src/csla-examples /csla-examples
 WORKDIR /app
 COPY --from=publish /app/publish . 
+# Copy pre-generated embeddings (created by build.sh script) +COPY embeddings.json ./embeddings.json ENTRYPOINT ["dotnet", "csla-mcp-server.dll"] \ No newline at end of file diff --git a/csla-mcp-server/Program.cs b/csla-mcp-server/Program.cs index 3804e30..0b90316 100644 --- a/csla-mcp-server/Program.cs +++ b/csla-mcp-server/Program.cs @@ -142,83 +142,48 @@ public override int Execute([NotNull] CommandContext context, [NotNull] AppSetti CslaCodeTool.VectorStore = vectorStore; - // Index all files asynchronously + // Load pre-generated embeddings asynchronously var indexingTask = Task.Run(async () => { try { - if (Directory.Exists(CslaCodeTool.CodeSamplesPath)) + if (vectorStore != null) { - // Test connectivity first if vector store is available - bool canIndex = vectorStore == null || await vectorStore.TestConnectivityAsync(); + // Load pre-generated embeddings from JSON file + var embeddingsPath = Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "embeddings.json"); + var loadedCount = await vectorStore.LoadEmbeddingsFromJsonAsync(embeddingsPath); - if (!canIndex) + if (loadedCount > 0) { - Console.WriteLine("[Startup] Skipping vector indexing due to Azure OpenAI connectivity issues."); - Console.WriteLine("[Startup] Server will run with keyword search only."); - return; + Console.WriteLine($"[Startup] Loaded {loadedCount} pre-generated embeddings from {embeddingsPath}"); } - - var csFiles = Directory.GetFiles(CslaCodeTool.CodeSamplesPath, "*.cs", SearchOption.AllDirectories); - var mdFiles = Directory.GetFiles(CslaCodeTool.CodeSamplesPath, "*.md", SearchOption.AllDirectories); - var allFiles = csFiles.Concat(mdFiles).ToArray(); - - if (vectorStore != null) + else { - Console.WriteLine($"[Startup] Starting to index {allFiles.Length} files for semantic search..."); - - var indexedCount = 0; - foreach (var file in allFiles) - { - try - { - var content = File.ReadAllText(file); - - // Get relative path from CodeSamplesPath - var relativePath = 
Path.GetRelativePath(CslaCodeTool.CodeSamplesPath, file); - - // Detect version from path - int? version = null; - var pathParts = relativePath.Split(Path.DirectorySeparatorChar); - if (pathParts.Length > 1 && pathParts[0].StartsWith("v") && int.TryParse(pathParts[0].Substring(1), out var versionNumber)) - { - version = versionNumber; - } - - // Normalize path separators to forward slash for consistency - var normalizedPath = relativePath.Replace("\\", "/"); - - await vectorStore.IndexDocumentAsync(normalizedPath, content, version); - indexedCount++; - - if (indexedCount % 5 == 0) - { - Console.WriteLine($"[Startup] Indexed {indexedCount}/{allFiles.Length} files..."); - } - } - catch (Exception ex) - { - Console.WriteLine($"[Startup] Failed to index file {file}: {ex.Message}"); - } - } - - Console.WriteLine($"[Startup] Completed indexing {indexedCount} files"); + Console.WriteLine("[Startup] Warning: No pre-generated embeddings found. Semantic search will not be available."); + Console.WriteLine("[Startup] To enable semantic search, generate embeddings using: dotnet run --project csla-embeddings-generator"); } - else + } + else + { + Console.WriteLine("[Startup] Vector store not initialized - semantic search disabled."); + + if (Directory.Exists(CslaCodeTool.CodeSamplesPath)) { - Console.WriteLine("[Startup] Vector store not available - skipping semantic indexing."); + var csFiles = Directory.GetFiles(CslaCodeTool.CodeSamplesPath, "*.cs", SearchOption.AllDirectories); + var mdFiles = Directory.GetFiles(CslaCodeTool.CodeSamplesPath, "*.md", SearchOption.AllDirectories); + var allFiles = csFiles.Concat(mdFiles).ToArray(); Console.WriteLine($"[Startup] Found {allFiles.Length} files available for keyword search."); } } } catch (Exception ex) { - Console.WriteLine($"[Startup] Error during file indexing: {ex.Message}"); + Console.WriteLine($"[Startup] Error loading embeddings: {ex.Message}"); } }); - // Don't wait for indexing to complete before starting the server - // 
The server will start immediately and semantic search will become available as indexing progresses + // Don't wait for loading embeddings to complete before starting the server + // The server will start immediately and semantic search will become available once embeddings are loaded var builder = WebApplication.CreateBuilder(); builder.Services.AddMcpServer() diff --git a/csla-mcp-server/Services/VectorStoreService.cs b/csla-mcp-server/Services/VectorStoreService.cs index 1be1d16..65a66ce 100644 --- a/csla-mcp-server/Services/VectorStoreService.cs +++ b/csla-mcp-server/Services/VectorStoreService.cs @@ -280,5 +280,71 @@ public bool IsHealthy() { return _isHealthy; } + + /// + /// Loads embeddings from a JSON file into the vector store + /// + public async Task LoadEmbeddingsFromJsonAsync(string jsonFilePath) + { + try + { + Console.WriteLine($"[VectorStore] Loading embeddings from {jsonFilePath}"); + + if (!File.Exists(jsonFilePath)) + { + Console.WriteLine($"[VectorStore] Embeddings file not found at {jsonFilePath}"); + return 0; + } + + var json = await File.ReadAllTextAsync(jsonFilePath); + var embeddings = JsonSerializer.Deserialize>(json); + + if (embeddings == null || embeddings.Count == 0) + { + Console.WriteLine("[VectorStore] No embeddings found in JSON file"); + return 0; + } + + foreach (var embedding in embeddings) + { + _vectorStore[embedding.FileName] = embedding; + } + + Console.WriteLine($"[VectorStore] Successfully loaded {embeddings.Count} embeddings from JSON"); + return embeddings.Count; + } + catch (Exception ex) + { + Console.WriteLine($"[VectorStore] Error loading embeddings from JSON: {ex.Message}"); + return 0; + } + } + + /// + /// Exports all embeddings to a JSON file + /// + public async Task ExportEmbeddingsToJsonAsync(string jsonFilePath) + { + try + { + Console.WriteLine($"[VectorStore] Exporting embeddings to {jsonFilePath}"); + + var embeddings = _vectorStore.Values.ToList(); + var json = JsonSerializer.Serialize(embeddings, new 
JsonSerializerOptions + { + WriteIndented = true + }); + + await File.WriteAllTextAsync(jsonFilePath, json); + Console.WriteLine($"[VectorStore] Successfully exported {embeddings.Count} embeddings to JSON"); + + return true; + } + catch (Exception ex) + { + Console.WriteLine($"[VectorStore] Error exporting embeddings to JSON: {ex.Message}"); + return false; + } + } } } diff --git a/csla-mcp.sln b/csla-mcp.sln index a5abfcf..9a26fde 100644 --- a/csla-mcp.sln +++ b/csla-mcp.sln @@ -1,6 +1,7 @@ + Microsoft Visual Studio Solution File, Format Version 12.00 # Visual Studio Version 18 -VisualStudioVersion = 18.0.11012.119 d18.0 +VisualStudioVersion = 18.0.11012.119 MinimumVisualStudioVersion = 10.0.40219.1 Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "csla-mcp-server", "csla-mcp-server\csla-mcp-server.csproj", "{91F8FEB4-628C-D193-FA56-A582E87B0A24}" EndProject @@ -9,16 +10,42 @@ Project("{2150E333-8FDC-42A3-9474-1A3956D46DE8}") = "Solution Items", "Solution readme.md = readme.md EndProjectSection EndProject +Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "csla-embeddings-generator", "csla-embeddings-generator\csla-embeddings-generator.csproj", "{136CA094-1043-4AC2-A552-DDBC2EC1C1B7}" +EndProject Global GlobalSection(SolutionConfigurationPlatforms) = preSolution Debug|Any CPU = Debug|Any CPU + Debug|x64 = Debug|x64 + Debug|x86 = Debug|x86 Release|Any CPU = Release|Any CPU + Release|x64 = Release|x64 + Release|x86 = Release|x86 EndGlobalSection GlobalSection(ProjectConfigurationPlatforms) = postSolution {91F8FEB4-628C-D193-FA56-A582E87B0A24}.Debug|Any CPU.ActiveCfg = Debug|Any CPU {91F8FEB4-628C-D193-FA56-A582E87B0A24}.Debug|Any CPU.Build.0 = Debug|Any CPU + {91F8FEB4-628C-D193-FA56-A582E87B0A24}.Debug|x64.ActiveCfg = Debug|Any CPU + {91F8FEB4-628C-D193-FA56-A582E87B0A24}.Debug|x64.Build.0 = Debug|Any CPU + {91F8FEB4-628C-D193-FA56-A582E87B0A24}.Debug|x86.ActiveCfg = Debug|Any CPU + {91F8FEB4-628C-D193-FA56-A582E87B0A24}.Debug|x86.Build.0 = Debug|Any CPU 
{91F8FEB4-628C-D193-FA56-A582E87B0A24}.Release|Any CPU.ActiveCfg = Release|Any CPU {91F8FEB4-628C-D193-FA56-A582E87B0A24}.Release|Any CPU.Build.0 = Release|Any CPU + {91F8FEB4-628C-D193-FA56-A582E87B0A24}.Release|x64.ActiveCfg = Release|Any CPU + {91F8FEB4-628C-D193-FA56-A582E87B0A24}.Release|x64.Build.0 = Release|Any CPU + {91F8FEB4-628C-D193-FA56-A582E87B0A24}.Release|x86.ActiveCfg = Release|Any CPU + {91F8FEB4-628C-D193-FA56-A582E87B0A24}.Release|x86.Build.0 = Release|Any CPU + {136CA094-1043-4AC2-A552-DDBC2EC1C1B7}.Debug|Any CPU.ActiveCfg = Debug|Any CPU + {136CA094-1043-4AC2-A552-DDBC2EC1C1B7}.Debug|Any CPU.Build.0 = Debug|Any CPU + {136CA094-1043-4AC2-A552-DDBC2EC1C1B7}.Debug|x64.ActiveCfg = Debug|Any CPU + {136CA094-1043-4AC2-A552-DDBC2EC1C1B7}.Debug|x64.Build.0 = Debug|Any CPU + {136CA094-1043-4AC2-A552-DDBC2EC1C1B7}.Debug|x86.ActiveCfg = Debug|Any CPU + {136CA094-1043-4AC2-A552-DDBC2EC1C1B7}.Debug|x86.Build.0 = Debug|Any CPU + {136CA094-1043-4AC2-A552-DDBC2EC1C1B7}.Release|Any CPU.ActiveCfg = Release|Any CPU + {136CA094-1043-4AC2-A552-DDBC2EC1C1B7}.Release|Any CPU.Build.0 = Release|Any CPU + {136CA094-1043-4AC2-A552-DDBC2EC1C1B7}.Release|x64.ActiveCfg = Release|Any CPU + {136CA094-1043-4AC2-A552-DDBC2EC1C1B7}.Release|x64.Build.0 = Release|Any CPU + {136CA094-1043-4AC2-A552-DDBC2EC1C1B7}.Release|x86.ActiveCfg = Release|Any CPU + {136CA094-1043-4AC2-A552-DDBC2EC1C1B7}.Release|x86.Build.0 = Release|Any CPU EndGlobalSection GlobalSection(SolutionProperties) = preSolution HideSolutionNode = FALSE diff --git a/readme.md b/readme.md index 3be7d3a..88a2fe7 100644 --- a/readme.md +++ b/readme.md @@ -60,6 +60,37 @@ export AZURE_OPENAI_API_VERSION="2024-02-01" # Optional, API version For more detailed configuration information, see [azure-openai-config.md](azure-openai-config.md). +## Vector Embeddings + +The server uses pre-generated vector embeddings for semantic search functionality. This significantly reduces startup time and Azure OpenAI API costs. 
+ +### How It Works + +1. **Build Time**: When building the Docker container, the `build.sh` script runs the `csla-embeddings-generator` CLI tool to generate embeddings for all code samples +2. **Container Build**: The generated `embeddings.json` file is copied into the Docker container +3. **Runtime**: When the server starts, it loads the pre-generated embeddings from `embeddings.json` instead of regenerating them +4. **User Queries**: The server still needs Azure OpenAI credentials at runtime to generate embeddings for user search queries + +### Generating Embeddings Manually + +You can manually generate embeddings using the CLI tool: + +```bash +# Generate embeddings for the default csla-examples directory +dotnet run --project csla-embeddings-generator + +# Or specify custom paths +dotnet run --project csla-embeddings-generator -- --examples-path ./csla-examples --output ./embeddings.json +``` + +See [csla-embeddings-generator/README.md](csla-embeddings-generator/README.md) for more details. + +### Benefits + +- **Faster Startup**: Server starts immediately without waiting for embedding generation +- **Reduced Costs**: Embeddings are only generated once during build time, not on every server restart +- **Offline Development**: Container includes pre-generated embeddings, reducing dependency on Azure OpenAI during startup + ## MCP Tools The server currently exposes two MCP tools implemented in `CslaMcpServer.Tools.CslaCodeTool`: @@ -188,11 +219,24 @@ For questions about CSLA .NET, visit: This project includes a multi-stage `Dockerfile` for the `csla-mcp-server` located at `csla-mcp-server/Dockerfile` that builds and publishes the app, then produces a small runtime image. -Below are PowerShell-friendly (Windows) commands to build and run the container locally. Run these from the repository root (`s:\src\rdl\csla-mcp`) or adjust paths if running from elsewhere. 
+**Note**: Use the `build.sh` script to build the Docker image, as it first generates the vector embeddings and then builds the container with the embeddings included. -1) Build the Docker image (tags the image as `csla-mcp-server:latest`): +Below are commands to build and run the container locally. Run these from the repository root or adjust paths if running from elsewhere. -```powershell +1) Build the Docker image using the build script (recommended): + +```bash +./build.sh +``` + +This script will: +- Build the embeddings generator CLI tool +- Generate embeddings for all code samples +- Build the Docker image with embeddings included + +Alternatively, you can build manually (but this requires embeddings.json to exist): + +```bash docker build -f csla-mcp-server/Dockerfile -t csla-mcp-server:latest . ```
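The build flow above produces `embeddings.json`, which the server loads into its in-memory vector store and then searches by comparing the stored vectors against an embedding of the user's query. Here is a minimal, language-agnostic sketch of that search step in Python, using the JSON schema documented for the generator (`FileName`, `Content`, `Embedding`, `Version`). The toy 3-dimensional vectors and the rule that `Version: null` (common) files match any requested version are illustrative assumptions; in the real server the vectors come from Azure OpenAI and have far more dimensions, and the scoring is implemented in C# inside `VectorStoreService`.

```python
import json
import math

def cosine_similarity(a, b):
    # Standard cosine similarity; guards against zero-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(embeddings, query_vector, top_k=3, version=None):
    # Score every stored document against the query embedding and return
    # the best matches. Version filter: None in the document means a
    # "common" file that matches any version (illustrative assumption).
    candidates = [e for e in embeddings
                  if version is None or e["Version"] in (None, version)]
    scored = [(cosine_similarity(e["Embedding"], query_vector), e["FileName"])
              for e in candidates]
    scored.sort(reverse=True)
    return scored[:top_k]

if __name__ == "__main__":
    # Toy stand-in for the generator's output; in the real workflow:
    # embeddings = json.load(open("embeddings.json"))
    embeddings = [
        {"FileName": "v9/DataPortalOperationCreate.md", "Content": "...",
         "Embedding": [0.9, 0.1, 0.0], "Version": 9},
        {"FileName": "common/Intro.md", "Content": "...",
         "Embedding": [0.1, 0.9, 0.0], "Version": None},
    ]
    # The query vector would come from an Azure OpenAI embeddings call.
    query = [1.0, 0.0, 0.0]
    for score, name in search(embeddings, query, top_k=2):
        print(f"{score:.3f}  {name}")
```

Because the store is a plain dictionary scanned linearly, this approach is only practical for the small corpus the design decisions call out (<1000 files); a larger corpus would warrant an approximate-nearest-neighbor index.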