Embedding Service Implementation #17
Ricardo1232 wants to merge 2 commits into CodeandoGuadalajara:main
Conversation
Pull request overview
This PR implements a real embedding service using the Qwen/Qwen3-Embedding-0.6B model to replace mock embeddings in the RAG pipeline. It introduces a thread-safe singleton service with lazy initialization and integrates it into the existing RAG workflow.
Key changes:
- New `EmbeddingService` class with thread-safe singleton pattern and lazy model loading
- Integration of the embedding service into `RAGService`, replacing the previous mock `embed_query` method
- Addition of a `force_mock_mode` configuration flag for testing without model loading
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| service/embedding_service.py | New service implementing text embedding with Qwen model, thread-safe singleton pattern, and mock fallback |
| service/__init__.py | Module initialization exporting the EmbeddingService class |
| rag_service.py | Integration of embedding service, removal of mock embed_query method, delegation to EmbeddingService |
| config.py | Addition of force_mock_mode flag for testing configuration |
```python
# Simplified singleton pattern - GIL provides basic atomicity
if cls._instance is None:
    with cls._lock:
        if cls._instance is None:
            cls._instance = super().__new__(cls)
return cls._instance

def __init__(self):
    # Only initialize once
    if not hasattr(self, '_initialized'):
        self._initialized = False
        self._model = None
        self._init_lock = threading.Lock()  # Lock for initialization only
```
Race condition in singleton initialization. The `__init__` method sets `self._initialized = False` if the attribute does not exist, but this can race with the `initialize()` method: if two threads call `get_embedding_service()` concurrently and both enter `__init__` before either enters `initialize()`, one thread can reset state the other is using. Consider setting `_initialized` in `__new__`, or using a more robust initialization pattern, to ensure thread safety during the initial setup phase.
Suggested change:

```python
# Simplified singleton pattern - GIL provides basic atomicity
if cls._instance is None:
    with cls._lock:
        if cls._instance is None:
            cls._instance = super().__new__(cls)
return cls._instance

def __init__(self):
    # Only initialize once
    if not hasattr(self, '_initialized'):
        self._initialized = False
        self._model = None
        self._init_lock = threading.Lock()  # Lock for initialization only
```

becomes:

```python
# Thread-safe singleton pattern using double-checked locking
if cls._instance is None:
    with cls._lock:
        if cls._instance is None:
            instance = super().__new__(cls)
            # Initialize instance attributes once, under lock, to avoid races
            instance._initialized = False
            instance._model = None
            instance._init_lock = threading.Lock()  # Lock for initialization only
            cls._instance = instance
return cls._instance

def __init__(self):
    # Initialization of singleton state is handled in __new__ to ensure thread safety.
    # __init__ is intentionally left as a no-op to avoid racing re-initialization.
    pass
```
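This pattern is easy to sanity-check with a small concurrency exercise. The following is a standalone sketch, not the PR's actual file: the class body mirrors the suggestion above, and the thread race at the bottom is illustrative.

```python
import threading

class EmbeddingService:
    """Sketch of the suggested pattern: all singleton state is assigned
    under the lock in __new__, and __init__ is a deliberate no-op."""
    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Thread-safe singleton using double-checked locking
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    instance = super().__new__(cls)
                    instance._initialized = False
                    instance._model = None
                    instance._init_lock = threading.Lock()
                    cls._instance = instance
        return cls._instance

    def __init__(self):
        # State lives in __new__; nothing to (re)initialize here.
        pass

# Race many constructors: every thread must observe the same instance.
results = []
threads = [threading.Thread(target=lambda: results.append(EmbeddingService()))
           for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the attributes are assigned before `cls._instance` is published, no thread can ever see a half-built singleton.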
```python
    logger.info(f"Embedding service ready ({settings.device}, max_seq: {settings.model_max_seq_length})")

except Exception as e:
    logger.error(f"Failed to initialize embedding service: {e}")
```
Missing stack trace in error logging. The error log at line 90 records the exception message but not the traceback, which makes debugging difficult. Consider using `logger.error(f"Failed to initialize embedding service: {e}", exc_info=True)` to include the full traceback, consistent with how errors are logged in rag_service.py at line 233.
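As a rough sketch of the suggested change (the `load_model` callable and logger name here are stand-ins for illustration, not the PR's actual code):

```python
import logging

logger = logging.getLogger("embedding_service")

def initialize(load_model):
    """Sketch: log the full traceback on failure via exc_info=True."""
    try:
        model = load_model()
        logger.info("Embedding service ready")
        return model
    except Exception as e:
        # exc_info=True attaches the active exception's traceback to the log record
        logger.error(f"Failed to initialize embedding service: {e}", exc_info=True)
        return None
```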
```python
def get_embedding_service() -> EmbeddingService:
    """Retrieve the initialized EmbeddingService singleton."""
    if not embedding_service._initialized:
```
Direct access to the private attribute `_initialized` breaks encapsulation. Consider adding a public method such as `is_initialized()`, or checking whether the service is ready through a public interface, instead of reaching into `_initialized` directly.
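One possible shape for such a public interface, sketched here with an assumed property name (`is_initialized`) and an assumed error on premature access; none of this is taken from the PR itself:

```python
import threading

class EmbeddingService:
    """Sketch: expose readiness through a read-only public property."""
    def __init__(self):
        self._initialized = False
        self._model = None
        self._init_lock = threading.Lock()

    @property
    def is_initialized(self) -> bool:
        # Public, read-only view of the private flag
        return self._initialized

embedding_service = EmbeddingService()

def get_embedding_service() -> EmbeddingService:
    """Retrieve the initialized EmbeddingService singleton."""
    if not embedding_service.is_initialized:
        # Assumed behavior for the sketch; the real service may instead
        # trigger initialization here.
        raise RuntimeError("EmbeddingService has not been initialized")
    return embedding_service
```

A property keeps the call sites as terse as attribute access while leaving the class free to change how readiness is tracked internally.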
```python
def get_embedding_service() -> EmbeddingService:
    """Retrieve the initialized EmbeddingService singleton."""
```
The module exports the helper function `get_embedding_service`, but `service/__init__.py` lists only `EmbeddingService` (the class) in `__all__`, not `get_embedding_service`. This creates an inconsistency: the function exists but is not part of the public API. Consider either adding `get_embedding_service` to `__all__` in `service/__init__.py`, or making it clear that the function is for internal use only.
Suggested change:

```python
"""Retrieve the initialized EmbeddingService singleton."""
```

becomes:

```python
"""Retrieve the initialized EmbeddingService singleton.

Note:
    This is an internal convenience helper and is not part of the
    public ``service`` package API (it is not exported via
    ``service.__all__``). Prefer accessing :class:`EmbeddingService`
    through the package's documented public interfaces.
"""
```
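If the helper is instead meant to be public, the fix lives on the package side. The fragment below assumes the file layout described in the review and is a sketch, not the PR's actual `__init__.py`:

```python
# service/__init__.py (sketch): export the helper alongside the class
from .embedding_service import EmbeddingService, get_embedding_service

__all__ = ["EmbeddingService", "get_embedding_service"]
```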
```python
# Global singleton instance
embedding_service = EmbeddingService()

def get_embedding_service() -> EmbeddingService:
    """Retrieve the initialized EmbeddingService singleton."""
```
The global `embedding_service` instance at line 181 is created at module import time, which can cause unexpected side effects. If singleton creation is meant to be deferred until it is actually needed, use lazy instantiation, or document that importing the module creates the instance. The current behavior is inconsistent with the lazy initialization pattern described in the PR description.
Suggested change:

```python
# Global singleton instance
embedding_service = EmbeddingService()

def get_embedding_service() -> EmbeddingService:
    """Retrieve the initialized EmbeddingService singleton."""
```

becomes:

```python
# Global singleton instance (lazily instantiated)
embedding_service = None

def get_embedding_service() -> EmbeddingService:
    """Retrieve the initialized EmbeddingService singleton."""
    global embedding_service
    if embedding_service is None:
        embedding_service = EmbeddingService()
```
```python
_lock = threading.Lock()

def __new__(cls):
    # Simplified singleton pattern - GIL provides basic atomicity
```
The comment "Simplified singleton pattern - GIL provides basic atomicity" is misleading: the implementation actually uses proper double-checked locking, which is more robust than relying on the GIL alone. Update the comment to describe what the code actually does.
Suggested change:

```python
# Simplified singleton pattern - GIL provides basic atomicity
```

becomes:

```python
# Thread-safe singleton using double-checked locking with a class-level lock
```
```python
# Optimize model for inference
self._model.to(settings.device)
self._model.eval()
torch.set_grad_enabled(False)
```
Setting `torch.set_grad_enabled(False)` globally affects the entire process, not just this model. That side effect can interfere with other parts of the application or other models that need gradients. Use the context manager `torch.no_grad()` or `torch.inference_mode()` around the actual inference calls instead of disabling gradients globally during initialization; the global setting could break training or fine-tuning workflows if they are added later.
Suggested change: remove the line

```python
torch.set_grad_enabled(False)
```
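A scoped alternative could look like the sketch below, which assumes a SentenceTransformer-like `model.encode` API; the `embed` wrapper is a stand-in, not the PR's method:

```python
import torch

def embed(model, formatted_text: str):
    """Sketch: disable gradient tracking only for the duration of the call,
    leaving the process-wide grad mode untouched afterwards."""
    with torch.inference_mode():
        return model.encode(formatted_text)
```

Unlike `torch.set_grad_enabled(False)`, the context manager restores the previous grad mode on exit, so any later training or fine-tuning code in the same process is unaffected.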
```python
formatted_text = f"query: {text}"
if settings.task_description:
    formatted_text = f"{settings.task_description}: {text}"
```
The task-formatting logic is harder to follow than it needs to be. When `settings.task_description` is set, it replaces the `"query:"` prefix entirely (line 120); when it is empty or `None`, the code falls back to `"query:"` (line 118). Consider making the logic explicit: compute the prefix first, then format once. The current implementation works, but the intended behavior should be clearer.
Suggested change:

```python
formatted_text = f"query: {text}"
if settings.task_description:
    formatted_text = f"{settings.task_description}: {text}"
```

becomes:

```python
prefix = settings.task_description if settings.task_description else "query"
formatted_text = f"{prefix}: {text}"
```
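The suggested logic, pulled into a standalone helper for illustration (the function name and the empty-string default are assumptions, not the PR's code):

```python
def format_query(text: str, task_description: str = "") -> str:
    """Sketch of the suggested prefix logic: one format call, explicit fallback."""
    prefix = task_description if task_description else "query"
    return f"{prefix}: {text}"
```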
```python
    Returns:
        List[float]: Randomized vector with fixed seed.
    """
    import random
```
Importing `random` inside a function is inefficient and non-idiomatic. Move the import statement to the top of the file with the other imports. Local imports are typically reserved for avoiding circular dependencies or lazily loading expensive modules, and neither applies to `random`.
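With the import moved to module level, a deterministic mock embedding might look like the sketch below. The dimension, the hash-based seeding scheme, and the function name are all assumptions for illustration, not the PR's implementation:

```python
import hashlib
import random  # module-level, alongside the other imports
from typing import List

EMBEDDING_DIM = 1024  # assumed dimension for this sketch

def mock_embedding(text: str, dim: int = EMBEDDING_DIM) -> List[float]:
    """Randomized vector with a fixed per-text seed: the same input text
    always yields the same vector, without touching the global RNG state."""
    seed = int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16) % (2 ** 32)
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]
```

Using a private `random.Random` instance instead of seeding the module-level RNG also avoids perturbing any other code that relies on `random`.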
```python
self._model[0].max_position_embeddings = settings.model_max_seq_length

# Optimize model for inference
self._model.to(settings.device)
```
Redundant device placement. The model is already placed on the device via the `device_map` parameter in the `SentenceTransformer` initialization (line 70), then moved again with `.to(settings.device)` at line 82. The second operation is unnecessary and potentially confusing. Remove the redundant `.to()` call, since device placement is already handled by `device_map`.
Suggested change: remove the line

```python
self._model.to(settings.device)
```
Related Issue
Resolves #6
Implementation of `EmbeddingService` with the Qwen/Qwen3-Embedding-0.6B model and integration into `RAGService`. Replaces mocks with real inference optimized for legal documents.

Key Changes
- `service/embedding_service.py`: new embedding service using `torch.inference_mode()` and a production-optimized configuration.
- `rag_service.py`: delegates embedding to `EmbeddingService`.