feat(backend): implement document parser system for structured block output #802

base: main

Conversation
- Add PublicRetrieverService for managing system-level retrievers (user_id=0)
- Add admin API endpoints for public retriever CRUD operations
- Update retriever_kinds_service to include public retrievers with fallback
- Add PublicRetrieverList frontend component for admin management
- Add public retriever tab to admin navigation
- Update RetrievalSettingsSection to display source type labels (Personal/Public/Group)
- Add i18n translations for public retriever features
…ptions
- Add complete index strategy options (per_user, per_dataset, fixed, rolling)
- Add retrieval methods configuration with checkboxes (vector, keyword, hybrid)
- Add connection test functionality
- Add storage type specific auth fields (username/password for ES, API key for Qdrant)
- Add password visibility toggles
- Add dynamic retrieval methods loading from API
- Match personal retriever configuration page layout and functionality
Add logging to help diagnose why public retrievers are not being returned in the list API. Logs the count of found public retrievers and each added retriever name.
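A minimal sketch of what such diagnostic logging might look like; the function and the `query_public_retrievers` helper are illustrative assumptions, not the actual implementation:

```python
import logging

logger = logging.getLogger(__name__)

def list_retrievers_with_public(db, user_id: int) -> list:
    """Hypothetical listing helper that logs public retriever resolution."""
    public_retrievers = query_public_retrievers(db)  # assumed helper, not from this PR
    logger.info("Found %d public retrievers", len(public_retrievers))

    results = []
    for retriever in public_retrievers:
        logger.info("Adding public retriever: %s", retriever.name)
        results.append(retriever)
    return results
```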
…-retriever-support
- Add RetrieverFormFields component for reusable form fields
- Update PublicRetrieverList with create/edit/delete functionality
- Add API endpoints for public retriever CRUD operations
- Update storage base to support public retriever configuration
- Update DocumentList and RetrievalSettingsSection for retriever selection
- Add public retriever support in RetrievalSettingsSection
- Update useEmbeddingModels and useRetrievers hooks to fetch public models
- Fix storage base to handle retriever configuration properly
- Add i18n translations for public model labels
…-retriever-support
…-retriever-support
…output

Add document parsing module that converts uploaded documents into structured Block format for preview, editing, and RAG/vectorization support.

Key features:
- DocumentBlock model and database migration for storing parsed blocks
- Parser factory with auto-selection based on content type or file extension
- Parsers for Markdown, PDF, DOCX, and images
- Storage service interface with local filesystem implementation
- OCR service interface with mock implementation
- API endpoints for triggering parsing and managing blocks
- Celery task template for async parsing
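A minimal sketch of how the factory auto-selection described above could work. Only `ParserFactory`, `can_handle`, and the parser class names are taken from this PR; the registry layout and signatures here are assumptions:

```python
# Hypothetical sketch of parser auto-selection by content type or extension.
from typing import Optional, Type


class BaseParser:
    def can_handle(self, content_type: Optional[str] = None,
                   filename: Optional[str] = None) -> bool:
        raise NotImplementedError


class MarkdownParser(BaseParser):
    def can_handle(self, content_type=None, filename=None):
        if content_type == "text/markdown":
            return True
        return bool(filename and filename.lower().endswith((".md", ".markdown")))


class ParserFactory:
    # Registered parser classes; the real PR also registers PDF, DOCX,
    # image, and fallback parsers.
    _parsers: list[Type[BaseParser]] = [MarkdownParser]

    @classmethod
    def get_parser(cls, content_type: Optional[str] = None,
                   filename: Optional[str] = None) -> BaseParser:
        # Ask each registered parser whether it can handle the input
        for parser_cls in cls._parsers:
            parser = parser_cls()
            if parser.can_handle(content_type=content_type, filename=filename):
                return parser
        raise ValueError("No parser available for the given input")
```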
📝 Walkthrough

This PR introduces a comprehensive document parsing subsystem with database storage, multiple parser implementations, OCR/storage service abstractions, and API endpoints to parse documents into structured blocks, with support for Markdown, PDF, DOCX, and image formats.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant API
    participant Factory
    participant Parser
    participant Storage
    participant OCR
    participant DB

    Client->>API: POST /documents/{id}/parse-sync<br/>(with file)
    API->>Factory: get_parser(content_type, filename)
    Factory->>Parser: instantiate with storage/OCR
    API->>Parser: parse(binary_data, document_id, filename)
    par Parallel Processing
        Parser->>Storage: save_image(optional)
        Parser->>OCR: extract_text(optional)
        Parser->>OCR: describe_image(optional)
    end
    Parser-->>API: List[DocumentBlockData]
    API->>DB: delete existing blocks for document
    API->>DB: insert new blocks<br/>(id, document_id, type, content, etc.)
    DB-->>API: committed
    API-->>Client: ParseDocumentResponse<br/>(status, total_blocks, parse_time_ms)
```
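For orientation, a hedged sketch of exercising the synchronous parse endpoint shown in the diagram. The URL shape and response keys come from the diagram; the host, document id, and upload field name are assumptions:

```python
# Illustrative client call against the parse-sync endpoint from the diagram.
# BASE_URL, document_id, and the "file" form field are placeholders.
import requests

BASE_URL = "http://localhost:8000"  # assumed host
document_id = "doc-123"             # hypothetical document id

with open("report.pdf", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/documents/{document_id}/parse-sync",
        files={"file": ("report.pdf", f, "application/pdf")},
    )

response.raise_for_status()
result = response.json()
# Fields shown in the diagram's ParseDocumentResponse
print(result["status"], result["total_blocks"], result["parse_time_ms"])
```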
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Add source_type field to track document origin type (markdown, pdf, docx, image, git, ai) for better content categorization and RAG support.

Changes:
- Add source_type column to DocumentBlock model and migration
- Add SourceType enum to data models and Pydantic schemas
- Update all parsers to set appropriate source_type
- Update API endpoints and Celery task to persist source_type
Actionable comments posted: 16
Note
Due to the large number of review comments, comments with Critical or Major severity were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
backend/app/services/rag/embedding/factory.py (1)
17-18: Add a return type hint to the function signature.

The function signature is missing a return type annotation. Per coding guidelines, Python code MUST include type hints for all functions.

🔎 Proposed fix

Based on the function's return statements (`CustomEmbedding` or `OpenAIEmbedding`), add a return type:

```diff
+from typing import Any, Dict, Union
+from llama_index.core.embeddings import BaseEmbedding
+
 def create_embedding_model_from_crd(
     db: Session,
     user_id: int,
     model_name: str,
     model_namespace: str = "default"
-):
+) -> BaseEmbedding:
```

Note: Verify that both `CustomEmbedding` and `OpenAIEmbedding` inherit from `BaseEmbedding`, or adjust the type accordingly.

As per coding guidelines, Python code MUST include type hints.
executor/agents/claude_code/progress_state_manager.py (3)
160-222: Refactor the oversized `_build_workbench_structure` method.

This method spans 62 lines and performs multiple responsibilities: extracting user info, determining git type, and constructing the workbench dictionary. Split it into smaller helper methods to improve clarity and testability, aligning with the 50-line function guideline.

🔎 Suggested refactoring approach

Extract helper methods:

- `_determine_git_type(git_domain: str) -> str`
- `_determine_summary(result_value: str | None) -> str`

```python
def _determine_git_type(self, git_domain: str) -> str:
    """Determine git type from domain."""
    return "github" if "github.com" in git_domain.lower() else "gitlab"

def _determine_summary(self, result_value: str | None) -> str:
    """Determine summary from multiple sources."""
    if result_value:
        return result_value
    if self.task_data.get("subtask_title"):
        return self.task_data.get("subtask_title")
    if self.task_data.get("prompt"):
        return self.task_data.get("prompt")
    return ""

def _build_workbench_structure(
    self, status: str = "running", result_value: str | None = None
) -> Dict[str, Any]:
    """Build workbench data structure."""
    current_time = datetime.now().isoformat()
    git_domain = self.task_data["git_domain"]
    workbench = {
        "taskTitle": self.task_data.get("task_title", ""),
        "taskNumber": str(self.task_data.get("task_id", "")),
        "status": status,
        "completedTime": current_time if status == "completed" else "",
        "repository": self.task_data.get("git_repo", ""),
        "branch": self.task_data.get("branch_name", ""),
        "sessions": 1,
        "premiumRequests": 0,
        "lastUpdated": current_time,
        "summary": self._determine_summary(result_value),
        "changes": [],
        "originalPrompt": self.task_data.get("prompt", ""),
        "file_changes": [],
        "git_info": {
            "initial_commit_id": "",
            "initial_commit_message": "",
            "task_commits": [],
            "source_branch": "",
            "target_branch": "",
        },
        "git_domain": git_domain,
        "git_type": self._determine_git_type(git_domain),
    }
    return workbench
```

261-351: Refactor the oversized `_update_task_commits` method.

This method spans 90 lines, handling commit iteration, commit info collection, markdown formatting, and workbench updates. Break it into focused helper methods to improve maintainability and testability per the 50-line guideline.

🔎 Suggested refactoring approach

Extract helper methods:

- `_build_commit_info(commit) -> Dict[str, Any]`
- `_format_commits_as_markdown(task_commits: List[Dict]) -> str`
- `_get_commit_range() -> List[Dict[str, Any]]` (handles repo iteration)

```python
def _build_commit_info(self, commit) -> Dict[str, Any]:
    """Build commit information dictionary."""
    return {
        "commit_id": commit.hexsha,
        "short_id": commit.hexsha[:8],
        "message": commit.message.strip(),
        "author": commit.author.name,
        "author_email": commit.author.email,
        "committed_date": datetime.fromtimestamp(
            commit.committed_date
        ).isoformat(),
        "stats": {
            "files_changed": len(commit.stats.files),
            "insertions": commit.stats.total["insertions"],
            "deletions": commit.stats.total["deletions"],
        },
    }

def _format_commits_as_markdown(self, task_commits: List[Dict[str, Any]]) -> str:
    """Format commits as markdown."""
    if not task_commits:
        return ""
    commit_lines = [
        f"### commit id: {c['commit_id']}\n{c['message']}\n---\n"
        for c in task_commits
    ]
    return "\n".join(commit_lines)
```

Then simplify `_update_task_commits()` to call these helpers.

353-492: Refactor the oversized `_get_git_file_changes` method.

This method spans 139 lines and combines repository access, diff computation, and line-counting logic. Decompose it into focused helpers (e.g., `_collect_all_diffs()`, `_parse_diff_lines()`, `_build_file_change_record()`) to improve readability and testability per the 50-line guideline.

🔎 Suggested refactoring approach

Extract key helpers:

- `_collect_all_diffs(repo: Repo) -> List[DiffIndex]`: consolidates committed, staged, and unstaged diffs
- `_parse_diff_line_counts(diff_text: str) -> tuple[int, int]`: counts added/removed lines
- `_build_file_change_record(diff) -> Dict[str, Any]`: creates a file change dict

```python
def _parse_diff_line_counts(self, diff_text: str) -> tuple[int, int]:
    """Parse diff text to count added and removed lines."""
    added = removed = 0
    for line in diff_text.split("\n"):
        if line.startswith("+") and not line.startswith("+++"):
            added += 1
        elif line.startswith("-") and not line.startswith("---"):
            removed += 1
    return added, removed

def _build_file_change_record(self, diff) -> Dict[str, Any]:
    """Build file change record from diff."""
    old_path = diff.a_path if diff.a_path else diff.b_path
    new_path = diff.b_path if diff.b_path else diff.a_path

    added_lines = removed_lines = 0
    if diff.diff:
        diff_text = diff.diff.decode("utf-8", errors="ignore")
        added_lines, removed_lines = self._parse_diff_line_counts(diff_text)

    return {
        "old_path": old_path,
        "new_path": new_path,
        "new_file": diff.new_file,
        "renamed_file": diff.renamed_file,
        "deleted_file": diff.deleted_file,
        "added_lines": added_lines,
        "removed_lines": removed_lines,
        "diff_title": os.path.basename(new_path),
    }
```

Then simplify the main method to iterate diffs and call `_build_file_change_record()`.
🟡 Minor comments (7)
frontend/src/features/knowledge/document/components/DocumentList.tsx (1)

114-114: Handle the edge case for files without extensions.

The current implementation using `file.name.split('.').pop()` will return the entire filename if there is no extension (no dot in the filename). This could lead to unexpected behavior or display issues.

🔎 Proposed fix to handle files without extensions

```diff
-  const extension = file.name.split('.').pop() || '';
+  const parts = file.name.split('.');
+  const extension = parts.length > 1 ? parts.pop() || '' : '';
```

This ensures that files without extensions get an empty string instead of the full filename.
frontend/src/features/knowledge/document/components/RetrievalSettingsSection.tsx (1)

137-155: Inconsistent namespace check for embedding model existence.

The retriever existence check (line 119) verifies both `name` and `namespace`, but the embedding model check (line 141) only verifies `name`. If two embedding models share the same name but have different namespaces, the wrong model might be considered as "existing."

🔎 Proposed fix

```diff
 const currentModelExists =
   config.embedding_config?.model_name &&
-  embeddingModels.some(m => m.name === config.embedding_config?.model_name);
+  embeddingModels.some(
+    m =>
+      m.name === config.embedding_config?.model_name &&
+      (m.namespace || 'default') === config.embedding_config?.model_namespace
+  );
```

frontend/src/features/settings/components/RetrieverList.tsx (1)
402-402: Missing namespace prefix in translation key.

Same issue here: `t('retrievers.public')` should include the `common:` namespace prefix.

🔎 Proposed fix

```diff
-  publicLabel={t('retrievers.public')}
+  publicLabel={t('common:retrievers.public')}
```

frontend/src/features/settings/components/RetrieverList.tsx (1)
388-388: Missing namespace prefix in translation key.

The translation key `t('retrievers.public_retrievers')` is missing the `common:` namespace prefix. Other translation calls in this file consistently use the `t('common:retrievers...')` format. This will cause a missing translation at runtime.

🔎 Proposed fix

```diff
- {t('retrievers.public_retrievers')} ({publicRetrievers.length})
+ {t('common:retrievers.public_retrievers')} ({publicRetrievers.length})
```

backend/alembic/versions/o5p6q7r8s9t0_add_document_blocks_table.py (1)
29-35: Duplicate index creation for `document_id`.

The column definition at line 33 includes `index=True`, and there is also an explicit `create_index` call at lines 95-100. This may cause an error or create duplicate indexes depending on how the migration runs. Remove one of them.

🔎 Proposed fix: remove the redundant index=True from the column

```diff
 sa.Column(
     "document_id",
     sa.String(36),
     nullable=False,
-    index=True,
     comment="Reference to uploaded document",
 ),
```

Alternatively, remove the explicit `create_index` call if you prefer the inline approach. The explicit approach gives you control over the index name.

Also applies to: 94-100

executor/agents/claude_code/progress_state_manager.py (1)
177-179: Remove the unused variable `bot_info`.

Line 177 assigns to `bot_info`, but it is never referenced. The assignment can be simplified to access `self.task_data.get("bot", [{}])[0]` directly where needed, or removed if not required.

🔎 Proposed fix

```diff
 # Extract task information from task_data
 user_info = self.task_data.get("user", {})
-bot_info = (
-    self.task_data.get("bot", [{}])[0] if self.task_data.get("bot") else {}
-)
```

If `bot_info` is intended for future use, document that intent or store it as an instance variable.

executor/agents/claude_code/progress_state_manager.py (1)
46-46: Update parameter type hints to use modern union syntax.

Line 46 uses `Optional[str]`. Replace it with `str | None` (PEP 604, fully supported in Python 3.10+). The project requires Python 3.10+, so update all `Optional[T]` instances to `T | None` throughout the file:

- Line 46: `project_path: Optional[str]` → `str | None`
- Line 61: `workbench_data: Optional[Dict[str, Any]]` → `Dict[str, Any] | None`
- Line 62: `initial_commit_id: Optional[str]` → `str | None`
- Line 63: `_monitor_timer: Optional[threading.Timer]` → `threading.Timer | None`
- Line 113: `extra_result: Optional[Dict[str, Any]]` → `Dict[str, Any] | None`

Also remove `Optional` from the imports on line 15.
🧹 Nitpick comments (49)
frontend/src/features/settings/components/prompt-fine-tune/index.ts (1)
5-7: Remove trailing semicolons to align with TypeScript guidelines.

Lines 5-7 end with semicolons, which violates the coding guideline stating that TypeScript files MUST NOT use semicolons. Please remove them for consistency with the Prettier and ESLint configuration.

🔎 Proposed fix

```diff
-export { default as PromptFineTuneDialog } from './PromptFineTuneDialog';
-export { default as PromptTestPanel } from './PromptTestPanel';
-export { default as PromptComparePanel } from './PromptComparePanel';
+export { default as PromptFineTuneDialog } from './PromptFineTuneDialog'
+export { default as PromptTestPanel } from './PromptTestPanel'
+export { default as PromptComparePanel } from './PromptComparePanel'
```

As per coding guidelines, TypeScript/React code must follow Prettier formatting with no semicolons.
frontend/src/features/knowledge/document/hooks/useEmbeddingModels.ts (1)
30-44: Verify priority handling for the 'all' scope and avoid mutating API response data.

Two concerns with the sorting implementation:

1. Incomplete priority map for the 'all' scope: When `scope` is `'all'`, the API response may include models with `type` of `'group'`, `'user'`, and `'public'`. However, the priority map for non-group scopes (lines 34-35) only defines `'user'` and `'public'`. Models with `type: 'group'` will fall back to priority `1` via the nullish coalescing on line 37, placing them at the same priority as public models. Please verify this is the intended behavior when viewing all models across scopes.

2. Array mutation: Line 36 calls `data.sort()`, which mutates the original array from the API response. If `response?.data` is not null, the original array reference is being modified. Consider creating a shallow copy before sorting to avoid potential side effects if the data is cached or reused elsewhere.

🔎 Suggested refactor to address both concerns

```diff
   const data = response?.data || [];
-  // Sort by type priority based on scope, then by name
-  // - Personal scope: user > public
-  // - Group scope: group > public
-  const typePriority: Record<string, number> =
-    scope === 'group' ? { group: 0, public: 1 } : { user: 0, public: 1 };
-  data.sort((a, b) => {
+  // Sort by type priority based on scope, then by name
+  // - Personal scope: user > public
+  // - Group scope: group > public
+  // - All scope: user > group > public
+  const typePriority: Record<string, number> =
+    scope === 'group'
+      ? { group: 0, public: 1 }
+      : scope === 'all'
+        ? { user: 0, group: 1, public: 2 }
+        : { user: 0, public: 1 };
+  const sortedData = [...data].sort((a, b) => {
-    const priorityA = typePriority[a.type] ?? 1;
-    const priorityB = typePriority[b.type] ?? 1;
+    const priorityA = typePriority[a.type] ?? 999;
+    const priorityB = typePriority[b.type] ?? 999;
     if (priorityA !== priorityB) {
       return priorityA - priorityB;
     }
     return (a.name || '').localeCompare(b.name || '');
   });
-  setModels(data);
+  setModels(sortedData);
```

Note: The fallback priority is changed from `1` to `999` to ensure any unexpected types appear last in the sorted list.

backend/app/services/rag/embedding/factory.py (2)
17-164: Consider breaking this function down into smaller components.

The function is 147 lines long, which significantly exceeds the recommended 50-line guideline. It handles multiple responsibilities: database querying, CRD parsing, validation, and model construction.

💡 Suggested refactoring approach

Consider extracting helper functions:

```python
def _query_model_kind(
    db: Session, user_id: int, model_name: str, model_namespace: str
) -> Kind:
    """Query Model CRD from database."""
    # Lines 40-72
    ...

def _parse_and_validate_model_config(
    model_data: Dict[str, Any], model_name: str
) -> tuple[str, Dict[str, Any]]:
    """Parse CRD and validate it's an embedding model. Returns (protocol, config)."""
    # Lines 75-103
    ...

def _create_embedding_instance(protocol: str, model_name: str, config: Dict[str, Any]):
    """Create the appropriate embedding model instance."""
    # Lines 111-164
    ...
```

This would improve readability, testability, and maintainability.

As per coding guidelines, function length SHOULD NOT exceed 50 lines.

88-88: Consider extracting the default model type to a named constant.

The magic string `"llm"` is used as a default value. Extracting it to a module-level constant would improve maintainability and make the codebase more searchable.

💡 Suggested constant extraction

At the top of the file:

```diff
 from app.services.rag.embedding.custom import CustomEmbedding
+
+DEFAULT_MODEL_TYPE = "llm"
```

Then use it in the fallback:

```diff
-    model_type = model_config.get("modelType", "llm")
+    model_type = model_config.get("modelType", DEFAULT_MODEL_TYPE)
```

As per coding guidelines, Python MUST extract magic numbers to named constants.
shared/models/task.py (3)
66-66: Extract the magic string to a constant or enum.

The default value `"continue"` for `next_action` is a magic string. Consider defining valid actions as an enum or as named constants for better type safety and discoverability.

🔎 Proposed refactor using Enum

```diff
+from enum import Enum
+
+class NextAction(str, Enum):
+    """Valid next actions for thinking steps"""
+    CONTINUE = "continue"
+    STOP = "stop"
+    RETRY = "retry"
+
 class ThinkingStep(BaseModel):
     """Thinking step model for recording agent reasoning process"""

     title: str = Field(..., description="Title of thinking step")
-    next_action: str = Field(default="continue", description="Next action to take")
+    next_action: NextAction = Field(default=NextAction.CONTINUE, description="Next action to take")
     details: Optional[Dict[str, Any]] = Field(
         default=None, description="Detailed structured data for this step"
     )
```

82-86: Extract the duplicated `.dict()` override to a base class or mixin.

The `dict()` method override with `exclude_none=True` is duplicated between `ThinkingStep` and `ExecutionResult`. Consider extracting this common pattern to reduce code duplication.

🔎 Proposed refactor using a mixin

```diff
+class ExcludeNoneMixin:
+    """Mixin that excludes None values when serializing to dict"""
+
+    def dict(self, **kwargs) -> Dict[str, Any]:
+        """Override dict method to exclude None values"""
+        kwargs.setdefault("exclude_none", True)
+        return super().dict(**kwargs)
+
+
-class ThinkingStep(BaseModel):
+class ThinkingStep(ExcludeNoneMixin, BaseModel):
     """Thinking step model for recording agent reasoning process"""

     title: str = Field(..., description="Title of thinking step")
     next_action: str = Field(default="continue", description="Next action to take")
     details: Optional[Dict[str, Any]] = Field(
         default=None, description="Detailed structured data for this step"
     )
-
-    def dict(self, **kwargs) -> Dict[str, Any]:
-        """Override dict method to exclude None values"""
-        # Exclude None values by default
-        kwargs.setdefault("exclude_none", True)
-        return super().dict(**kwargs)

-class ExecutionResult(BaseModel):
+class ExecutionResult(ExcludeNoneMixin, BaseModel):
     value: Optional[str] = None
     thinking: List[ThinkingStep] = []
-
-    def dict(self, **kwargs) -> Dict[str, Any]:
-        """Override dict method to exclude None values"""
-        # Exclude None values by default
-        kwargs.setdefault("exclude_none", True)
-        return super().dict(**kwargs)
```

89-90: Add a docstring to the public class.

As per coding guidelines, Python classes should have docstrings. The `TasksRequest` model should include a docstring explaining its purpose and usage context.

🔎 Proposed addition

```diff
 class TasksRequest(BaseModel):
+    """Request model containing a list of tasks for batch operations."""
+
     tasks: List[Task]
```

frontend/src/features/knowledge/document/components/RetrievalSettingsSection.tsx (2)
107-132: Potential re-render cascade with `config` in effect dependencies.

Including `config` in the dependency array while calling `onChange({...config, ...})` inside the effect can cause extra re-renders. The condition guards against infinite loops, but if `onChange` is not memoized or if the parent recreates `config` on each render, this could trigger unnecessary effect executions.

Consider either:

- Memoizing `config` in the parent, or
- Using a ref to track whether auto-selection has already occurred

222-226: Type assertion without fallback.

The `type as 'user' | 'public' | 'group'` cast assumes the input will always be one of these values. If an unexpected type is passed, the translation key won't exist and may display a raw key or an empty string.

🔎 Consider adding a fallback

```diff
 const getSourceTypeLabel = (type: string) => {
-  const typeKey = type as 'user' | 'public' | 'group';
-  return t(`document.retrieval.sourceType.${typeKey}`);
+  const validTypes = ['user', 'public', 'group'] as const;
+  const typeKey = validTypes.includes(type as typeof validTypes[number])
+    ? type
+    : 'user';
+  return t(`document.retrieval.sourceType.${typeKey}`);
 };
```

frontend/src/features/knowledge/document/components/CreateKnowledgeBaseDialog.tsx (1)
52-60: Consider extracting the initial config to avoid duplication.

The initial `retrievalConfig` object is duplicated in three places: initial state (lines 52-60), after submit (lines 101-109), and on close (lines 119-127). This could lead to inconsistencies if one is updated but the others are forgotten.

🔎 Proposed refactor

```typescript
const INITIAL_RETRIEVAL_CONFIG: Partial<RetrievalConfig> = {
  retrieval_mode: 'vector',
  top_k: 5,
  score_threshold: 0.7,
  hybrid_weights: {
    vector_weight: 0.7,
    keyword_weight: 0.3,
  },
}

// Then use:
setRetrievalConfig(INITIAL_RETRIEVAL_CONFIG)
```

Also applies to: 99-109, 119-127
backend/app/services/adapters/retriever_kinds.py (1)
125-125: Use truthy checks instead of explicit `== True` comparisons.

Lines 125, 193, and 222 use explicit `== True` comparisons with boolean columns, which is not idiomatic Python. Use direct truthy checks instead.

🔎 Proposed fix

```diff
 public_retrievers = (
     db.query(Kind)
     .filter(
         Kind.user_id == 0,
         Kind.kind == "Retriever",
         Kind.namespace == "default",
-        Kind.is_active == True,
+        Kind.is_active,
     )
     .order_by(Kind.created_at.desc())
     .all()
 )
```

Apply similar changes to lines 193 and 222.

Also applies to: 193-193, 222-222
backend/app/services/document_parser/parsers/__init__.py (1)
1-23: Consider sorting `__all__` for consistency.

The package initialization is well-structured and correctly exports all parser implementations. As a minor style improvement, consider sorting the `__all__` list alphabetically to align with the Ruff linting rule (RUF022).

🔎 Optional refactor to sort `__all__`

```diff
 __all__ = [
-    "MarkdownParser",
-    "PDFParser",
     "DocxParser",
+    "FallbackParser",
     "ImageParser",
-    "FallbackParser",
+    "MarkdownParser",
+    "PDFParser",
 ]
```

backend/app/services/document_parser/parsers/fallback_parser.py (2)
16-16: Unused import.

`ParserFactory` is imported but not used in this module. The `FallbackParser` is consumed by the factory, not the other way around.

🔎 Proposed fix

```diff
 from app.services.document_parser.base import BaseParser
-from app.services.document_parser.factory import ParserFactory
 from app.services.document_parser.models.block import BlockType, DocumentBlockData, SourceType
```
38-45: Use explicit `Optional` type hints per PEP 484.

The parameters use implicit `Optional` syntax (`str = None`), which is prohibited by PEP 484. Use `Optional[str]` or `str | None` for clarity.

🔎 Proposed fix

```diff
+from typing import List, Optional
+
-    def can_handle(self, content_type: str = None, filename: str = None) -> bool:
+    def can_handle(self, content_type: Optional[str] = None, filename: Optional[str] = None) -> bool:
```

backend/app/api/endpoints/rag.py (2)
553-558: The unused variable should be prefixed with an underscore.

The `retriever` variable from unpacking is not used. Prefix it with an underscore to indicate an intentional discard.

🔎 Proposed fix

```diff
 # Get retriever and create backend
-retriever, storage_backend = get_retriever_and_backend(
+_retriever, storage_backend = get_retriever_and_backend(
     retriever_name=retriever_name,
     retriever_namespace=retriever_namespace,
     user_id=current_user.id,
     db=db,
 )
```

574-577: Use exception chaining with `raise ... from e`.

When re-raising as `HTTPException`, use `raise ... from e` to preserve the exception chain for better debugging.

🔎 Proposed fix

```diff
 except HTTPException:
     raise
 except Exception as e:
-    raise HTTPException(status_code=500, detail=str(e))
+    raise HTTPException(status_code=500, detail=str(e)) from e
```

backend/app/services/rag/storage/base.py (1)
128-133: Use `raise ... from` for proper exception chaining.

The static analysis hint B904 is valid. When re-raising a different exception inside an `except` block, chain it to preserve the original traceback for debugging.

🔎 Proposed fix

```diff
 try:
     kb_id_num = int(knowledge_id)
-except ValueError:
-    raise ValueError(
+except ValueError as err:
+    raise ValueError(
         f"knowledge_id must be a numeric value for 'rolling' mode, got: {knowledge_id}"
-    )
+    ) from err
```

backend/app/models/document_block.py (1)
93-105: `datetime.utcnow` is deprecated in Python 3.12+.

`datetime.utcnow()` is deprecated and returns a naive datetime. For timezone-aware timestamps, use `datetime.now(timezone.utc)`.

🔎 Proposed fix

```diff
-from datetime import datetime
+from datetime import datetime, timezone

 # In the Column definitions:
 created_at = Column(
     DateTime,
     nullable=False,
-    default=datetime.utcnow,
+    default=lambda: datetime.now(timezone.utc),
     comment="Creation time",
 )
 updated_at = Column(
     DateTime,
     nullable=False,
-    default=datetime.utcnow,
-    onupdate=datetime.utcnow,
+    default=lambda: datetime.now(timezone.utc),
+    onupdate=lambda: datetime.now(timezone.utc),
     comment="Update time",
 )
```

backend/app/services/document_parser/ocr/base.py (1)
49-75: Sync wrapper code is duplicated and uses the deprecated `get_event_loop()`.

The sync wrapper logic is duplicated here and in `BaseStorageService`. Additionally, `asyncio.get_event_loop()` is deprecated in Python 3.10+ when no running loop exists. Consider extracting a utility function and using the modern pattern.

🔎 Proposed fix: utility function approach

Create a shared utility:

```python
# In a utils module
import asyncio
from typing import Coroutine, TypeVar

T = TypeVar("T")

def run_async_in_sync(coro: Coroutine[None, None, T]) -> T:
    """Run an async coroutine from a sync context."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No running loop, safe to use asyncio.run
        return asyncio.run(coro)

    # Loop is running, use a thread pool
    import concurrent.futures
    with concurrent.futures.ThreadPoolExecutor() as executor:
        future = executor.submit(asyncio.run, coro)
        return future.result()
```

Then simplify the sync methods:

```diff
 def extract_text_sync(self, image_data: bytes) -> str:
-    import asyncio
-
-    try:
-        loop = asyncio.get_event_loop()
-        if loop.is_running():
-            import concurrent.futures
-
-            with concurrent.futures.ThreadPoolExecutor() as executor:
-                future = executor.submit(
-                    asyncio.run,
-                    self.extract_text(image_data),
-                )
-                return future.result()
-        else:
-            return loop.run_until_complete(self.extract_text(image_data))
-    except RuntimeError:
-        return asyncio.run(self.extract_text(image_data))
+    from app.utils.async_helpers import run_async_in_sync
+    return run_async_in_sync(self.extract_text(image_data))
```

Also applies to: 77-103

backend/app/services/document_parser/storage/local_storage.py (1)

88-90: Use `logger.exception` for automatic traceback logging.

Per Ruff TRY400, `logger.exception` automatically includes the traceback, which is more useful for debugging than `logger.error`.

🔎 Proposed fix

```diff
 except Exception as e:
-    logger.error(f"Failed to save image: {e}")
+    logger.exception(f"Failed to save image: {e}")
     raise
```

Apply the same pattern to lines 116 and 145.

backend/app/services/document_parser/tasks.py (3)

78-86: Add a `synchronize_session` parameter to the bulk delete.

When using `query().delete()` in SQLAlchemy, it's best practice to specify `synchronize_session` to control how the session handles deleted objects. Without it, the default behavior may vary by SQLAlchemy version.

🔎 Proposed fix

```diff
 # Delete any existing blocks for this document
 existing_count = (
     db.query(DocumentBlock)
     .filter(DocumentBlock.document_id == document_id)
-    .delete()
+    .delete(synchronize_session="fetch")
 )
```

89-89: `datetime.utcnow()` is deprecated.

Same issue as in the model file. Use `datetime.now(timezone.utc)` instead.

🔎 Proposed fix

```diff
+from datetime import datetime, timezone

-    now = datetime.utcnow()
+    now = datetime.now(timezone.utc)
```

75-114: Consider using a context manager for the database session.

The try/finally pattern is correct, but a context manager provides cleaner resource management.

🔎 Proposed fix using contextlib

```diff
+from contextlib import closing

-    db = SessionLocal()
-    try:
+    with closing(SessionLocal()) as db:
         # Delete any existing blocks...
         # ... rest of the code ...
         db.commit()
         logger.info(f"Saved {len(blocks)} blocks for document {document_id}")
-
-    finally:
-        db.close()
```

backend/alembic/versions/o5p6q7r8s9t0_add_document_blocks_table.py (1)

36-42: Use `server_default` for database-level defaults.

The `default=` parameter in `op.create_table` sets a Python-side default for migrations, not a database-level default. Use `server_default` to ensure defaults apply to all inserts, including those outside the ORM.

🔎 Proposed fix

```diff
 sa.Column(
     "source_type",
     sa.String(20),
     nullable=False,
-    default="markdown",
+    server_default="markdown",
     comment="Source type: markdown, pdf, docx, image, git, ai",
 ),
 ...
 sa.Column(
     "editable",
     sa.Boolean,
-    default=False,
+    server_default=sa.false(),
     comment="Whether user can edit this block",
 ),
```

Also applies to: 55-60

backend/app/services/document_parser/storage/base.py (1)

71-112: Same sync wrapper pattern duplicated here.

This sync wrapper implementation duplicates the pattern in `BaseOCRService`. Consider extracting a shared utility function to avoid code duplication and centralize the async-to-sync bridging logic. See the earlier comment on `backend/app/services/document_parser/ocr/base.py` for the suggested utility approach.

backend/app/services/document_parser/parsers/image_parser.py (2)

31-41: Annotate the mutable class attribute with `ClassVar`.

Per Ruff RUF012, mutable class attributes should be annotated with `typing.ClassVar` to make the intent explicit.

🔎 Proposed fix

```diff
+from typing import ClassVar, Dict, List, Optional
-from typing import List, Optional

 class ImageParser(BaseParser):
     # Content type to extension mapping
-    CONTENT_TYPE_MAP = {
+    CONTENT_TYPE_MAP: ClassVar[Dict[str, str]] = {
         "image/png": ".png",
```

121-121: Remove the extraneous f-string prefix.

This f-string has no placeholders, making the `f` prefix unnecessary.

🔎 Proposed fix

```diff
-    logger.info(f"Parsed image file into 1 block")
+    logger.info("Parsed image file into 1 block")
```

frontend/src/features/admin/components/PublicRetrieverList.tsx (1)
79-80: Consider removing unused state variables.

`_total` and `_setPage` are declared but never used meaningfully. If pagination is planned for the future, consider adding a TODO comment; otherwise, remove them to reduce dead code.

backend/app/services/document_parser/factory.py (3)
32-32: Annotate the mutable class attribute with `ClassVar`.

Per Ruff RUF012, mutable class attributes should be annotated with `typing.ClassVar`.

🔎 Proposed fix

```diff
-from typing import Dict, List, Optional, Type
+from typing import ClassVar, Dict, List, Optional, Type

 class ParserFactory:
-    _parsers: List[Type[BaseParser]] = []
+    _parsers: ClassVar[List[Type[BaseParser]]] = []
```

71-81: Remove the unused `noqa` directive.

The `# noqa: F401` comment is flagged as unused by Ruff RUF100. Since F401 (unused import) is not enabled in your linter configuration, this directive has no effect and can be removed.

🔎 Proposed fix

```diff
-    from app.services.document_parser.parsers import (  # noqa: F401
+    from app.services.document_parser.parsers import (
         docx_parser,
```

83-123: Consider caching parser instances to avoid repeated instantiation.

`get_parser` instantiates every registered parser class on each call to check `can_handle`. For better efficiency, you could maintain a cache of instantiated parsers keyed by (storage_service, ocr_service). This is a minor optimization since the registry is small; a rough sketch follows below.
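A rough sketch of such a cache, assuming `ParserFactory` keeps a class-level registry as described in this PR; the cache key, constructor signature, and structure here are illustrative only:

```python
# Hypothetical parser-instance cache keyed by the injected services.
# ParserFactory, _parsers, and can_handle come from this PR; the caching
# layer itself is only a sketch.
from typing import Dict, List, Optional, Tuple, Type


class ParserFactory:
    _parsers: List[Type["BaseParser"]] = []
    _instance_cache: Dict[Tuple[int, int], List["BaseParser"]] = {}

    @classmethod
    def get_parser(
        cls,
        content_type: Optional[str] = None,
        filename: Optional[str] = None,
        storage_service=None,
        ocr_service=None,
    ) -> "BaseParser":
        # Key the cache on the identity of the injected services
        key = (id(storage_service), id(ocr_service))
        if key not in cls._instance_cache:
            cls._instance_cache[key] = [
                parser_cls(storage_service, ocr_service)  # assumed constructor
                for parser_cls in cls._parsers
            ]
        for parser in cls._instance_cache[key]:
            if parser.can_handle(content_type=content_type, filename=filename):
                return parser
        raise ValueError("No parser available for the given input")
```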
backend/app/services/document_parser/models/block.py (2)

12-15: Remove the unused `datetime` import.

The `datetime` module is imported but never used in this file.

🔎 Proposed fix

```diff
 from dataclasses import dataclass, field
-from datetime import datetime
 from enum import Enum
 from typing import Any, Dict, Optional
```

85-102: Consider adding a type hint for the `blocks` field.

The `blocks` field uses `list` without a type parameter. Adding a type hint improves clarity and enables better static analysis.

🔎 Proposed fix

```diff
 @dataclass
 class ParseResult:
     document_id: str
-    blocks: list = field(default_factory=list)
+    blocks: List[DocumentBlockData] = field(default_factory=list)
     success: bool = True
```

backend/app/services/document_parser/parsers/docx_parser.py (3)
32-42: Annotate the mutable class attribute with `ClassVar`.

Per Ruff RUF012, mutable class attributes should be annotated with `typing.ClassVar`.

🔎 Proposed fix

```diff
-from typing import List, Optional
+from typing import ClassVar, Dict, List, Optional

 class DocxParser(BaseParser):
     # Mapping of Word heading styles to levels
-    HEADING_STYLES = {
+    HEADING_STYLES: ClassVar[Dict[str, int]] = {
         "Heading 1": 1,
```

76-77: Use `logging.exception` to include the traceback.

When logging an error in an exception handler, `logging.exception` automatically includes the traceback, which aids debugging.

🔎 Proposed fix

```diff
 except ImportError:
-    logger.error("python-docx not installed. Install with: pip install python-docx")
+    logger.exception("python-docx not installed. Install with: pip install python-docx")
```

225-237: Use `logging.exception` for errors with a traceback.

Replace `logger.error` with `logger.exception` to automatically capture the full traceback for debugging.

🔎 Proposed fix

```diff
 except Exception as e:
-    logger.error(f"Error parsing DOCX: {e}")
+    logger.exception("Error parsing DOCX")
     blocks.append(
```

backend/app/services/document_parser/parsers/pdf_parser.py (3)
217-244: Unused `is_bold` variable.

The `is_bold` variable is computed but never used in the classification logic. Either incorporate it into the heading detection heuristics or remove it.

🔎 Proposed fix

```diff
 def _classify_text_block(
     self,
     block: dict,
     text: str,
 ) -> tuple:
     # Get average font size from spans
     font_sizes = []
-    is_bold = False
     for line in block.get("lines", []):
         for span in line.get("spans", []):
             font_sizes.append(span.get("size", 12))
-            flags = span.get("flags", 0)
-            if flags & 16:  # Bold flag
-                is_bold = True

     avg_font_size = sum(font_sizes) / len(font_sizes) if font_sizes else 12
```

Alternatively, if bold detection should influence heading classification:

```diff
     # Heuristics for heading detection
     # Large font or bold with short text likely a heading
     is_short = len(text) < 100 and "\n" not in text.strip()
-    if avg_font_size >= 14 and is_short:
+    if (avg_font_size >= 14 or is_bold) and is_short:
```

62-62: Consider using `logging.exception` to preserve the stack trace.

When handling the `ImportError`, using `logging.exception` would capture the full stack trace, which aids debugging.

🔎 Proposed fix

```diff
 except ImportError:
-    logger.error("PyMuPDF not installed. Install with: pip install PyMuPDF")
+    logger.exception("PyMuPDF not installed. Install with: pip install PyMuPDF")
```

164-176: Use `logging.exception` and chain the exception for better debugging.

When catching parsing errors, `logging.exception` preserves the stack trace. Additionally, the error message exposes raw exception details, which could leak internal information in some contexts.

🔎 Proposed fix

```diff
 except Exception as e:
-    logger.error(f"Error parsing PDF: {e}")
+    logger.exception(f"Error parsing PDF: {e}")
     blocks.append(
         DocumentBlockData(
             id=str(uuid.uuid4()),
             document_id=document_id,
             source_type=SourceType.PDF,
             block_type=BlockType.UNSUPPORTED,
-            content=f"Error parsing PDF: {str(e)}",
+            content="Error parsing PDF. Please check the file format.",
             editable=False,
             order_index=0,
         )
     )
```

backend/app/api/endpoints/document_parser.py (2)
191-197: Chain the exception and use `logging.exception` for the stack trace.

When re-raising as `HTTPException`, chain from the original exception to preserve context. Also, using `logging.exception` captures the full stack trace.

🔎 Proposed fix

```diff
 except Exception as e:
-    logger.error(f"Error parsing document {document_id}: {e}")
+    logger.exception(f"Error parsing document {document_id}: {e}")
     db.rollback()
     raise HTTPException(
         status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-        detail=f"Failed to parse document: {str(e)}",
-    )
+        detail="Failed to parse document. Please check the file format.",
+    ) from e
```

161-161: Update `datetime.utcnow()` to use a timezone-aware UTC datetime.

`datetime.utcnow()` is deprecated in Python 3.12+. Replace it with `datetime.now(timezone.utc)` for forward compatibility.

🔎 Proposed fix

```diff
-from datetime import datetime
+from datetime import datetime, timezone
 ...
-    now = datetime.utcnow()
+    now = datetime.now(timezone.utc)
```

backend/app/services/adapters/public_retriever.py (2)
96-100: Use HTTP 409 Conflict for a duplicate resource.

When a retriever with the same name already exists, HTTP 409 (Conflict) is more semantically correct than 400 (Bad Request).

🔎 Proposed fix

```diff
 if existing:
     raise HTTPException(
-        status_code=400,
+        status_code=409,
         detail=f"Public retriever '{retriever.metadata.name}' already exists in namespace '{namespace}'",
     )
```

Apply the same change at lines 253-256 for the update conflict check.

37-57: Narrow the exception type in the adapter.

The broad `Exception` catch could mask unexpected errors. Consider catching `ValidationError` from Pydantic specifically.

🔎 Proposed fix

```diff
+from pydantic import ValidationError
 ...
 if isinstance(kind.json, dict):
     try:
         retriever = Retriever.model_validate(kind.json)
         display_name = retriever.metadata.displayName
         storage_type = retriever.spec.storageConfig.type
         description = retriever.spec.description
-    except Exception as e:
+    except (ValidationError, AttributeError, KeyError) as e:
         logger.warning(f"Failed to parse retriever {kind.name}: {e}")
```

frontend/src/features/admin/components/RetrieverFormFields.tsx (1)
215-227: Add an `aria-label` to the password visibility toggle for accessibility.

The toggle button lacks an accessible label, which impacts screen reader users.

🔎 Proposed fix

```diff
 <Button
   type="button"
   variant="ghost"
   size="icon"
   className="absolute right-2 top-1/2 -translate-y-1/2 h-7 w-7"
   onClick={() => setShowPassword(!showPassword)}
+  aria-label={showPassword ? t('common:hide_password') : t('common:show_password')}
 >
```

Apply similar changes to the API key toggle button at lines 248-256.
backend/app/services/document_parser/parsers/markdown_parser.py (1)
33-38: Table regex may not match all valid Markdown tables.

The `TABLE_PATTERN` requires a separator line `|[-:| ]+|`, but the regex expects at least one header row before it. Some edge cases (e.g., tables with only separator and data rows) may not match. Additionally, the pattern doesn't handle tables without trailing newlines at EOF.

Consider testing with edge cases like the following (a small test sketch appears after this list):

- Tables at end of file without trailing newline
- Tables with varying column counts
- Tables with escaped pipes in content
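A small test harness for these cases; the pattern below is a stand-in assumption, since the actual `TABLE_PATTERN` lives in the parser module:

```python
# Stand-in pattern for illustration only; substitute the parser's real
# TABLE_PATTERN when running these checks.
import re

TABLE_PATTERN = re.compile(
    r"^\|.+\|\s*\n\|[-:| ]+\|\s*\n(?:\|.+\|\s*\n?)*",
    re.MULTILINE,
)

edge_cases = {
    "no_trailing_newline": "| a | b |\n|---|---|\n| 1 | 2 |",
    "varying_columns": "| a | b |\n|---|---|\n| 1 | 2 | 3 |\n",
    "escaped_pipes": "| a | b |\n|---|---|\n| x \\| y | 2 |\n",
}

# Report which edge cases the pattern recognizes as tables
for name, text in edge_cases.items():
    matched = bool(TABLE_PATTERN.search(text))
    print(f"{name}: {'matched' if matched else 'NOT matched'}")
```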
backend/app/schemas/document_block.py (2)
18-39: Consider reusing enums from `models/block.py` to avoid duplication.

`BlockTypeEnum` and `SourceTypeEnum` duplicate the definitions in `backend/app/services/document_parser/models/block.py`. This duplication could lead to drift if one is updated without the other.

🔎 Proposed fix

```diff
-from enum import Enum
 from typing import Any, Dict, List, Optional

 from pydantic import BaseModel, Field

+from app.services.document_parser.models.block import BlockType, SourceType
-
-class BlockTypeEnum(str, Enum):
-    """Enumeration of supported block types."""
-
-    HEADING = "heading"
-    PARAGRAPH = "paragraph"
-    LIST = "list"
-    CODE = "code"
-    TABLE = "table"
-    IMAGE = "image"
-    AI_SUMMARY = "ai_summary"
-    UNSUPPORTED = "unsupported"
-
-
-class SourceTypeEnum(str, Enum):
-    """Enumeration of document source types."""
-
-    MARKDOWN = "markdown"
-    PDF = "pdf"
-    DOCX = "docx"
-    IMAGE = "image"
-    GIT = "git"
-    AI = "ai"
+# Re-export for backward compatibility
+BlockTypeEnum = BlockType
+SourceTypeEnum = SourceType
```

92-98: `document_id` in `ParseDocumentRequest` may be redundant.

The `trigger_parse` endpoint already receives `document_id` as a path parameter. Having it in the request body creates potential for mismatch. Consider removing it from the schema.

🔎 Proposed fix

```diff
 class ParseDocumentRequest(BaseModel):
     """Schema for triggering document parsing."""

-    document_id: str = Field(..., description="ID of the document to parse")
     force: bool = Field(
         False, description="Force re-parsing even if blocks already exist"
     )
```

executor/agents/claude_code/progress_state_manager.py (2)
523-523: Extract the magic number `2.0` to a named constant.

The hardcoded interval of 2.0 seconds for git monitoring should be extracted as a module-level constant for clarity and maintainability.

🔎 Proposed fix

Add near the top of the file after imports:

```python
# Monitoring interval in seconds
GIT_MONITORING_INTERVAL = 2.0
```

Then update line 523:

```diff
-        self._monitor_timer = threading.Timer(2.0, self._check_git_changes)
+        self._monitor_timer = threading.Timer(
+            GIT_MONITORING_INTERVAL, self._check_git_changes
+        )
```
15-15: Update type hints to use Python 3.10+ native syntax.

The project requires Python 3.10+, so use native union syntax (`|`) and built-in generic types instead of typing imports:

- Replace `Optional[X]` with `X | None` (5 occurrences)
- Replace `Dict[K, V]` with `dict[K, V]`
- Replace `List[X]` with `list[X]`

Remove the now-unused imports `Optional`, `Dict`, and `List` from the typing import on line 15 after updating all usages.
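For illustration, a minimal before/after sketch of this modernization; the function name is hypothetical:

```python
# Before: typing-module generics and Optional
from typing import Any, Dict, List, Optional

def load_result_old(path: Optional[str] = None) -> Optional[Dict[str, Any]]:
    ...

# After: PEP 604 unions and built-in generics (Python 3.10+)
from typing import Any

def load_result(path: str | None = None) -> dict[str, Any] | None:
    ...
```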
```python
class PublicRetrieverResponse(BaseModel):
    """Public retriever response model"""

    id: int
    name: str
    namespace: str
    displayName: Optional[str] = None
    storageType: str
    description: Optional[str] = None
    retriever_json: dict = Field(..., alias="json", serialization_alias="json")
    is_active: bool
    created_at: datetime
    updated_at: datetime

    class Config:
        from_attributes = True
        populate_by_name = True
```
Fix inconsistent field naming conventions.

Lines 232 and 233 use camelCase (`displayName`, `storageType`), which is inconsistent with Python naming conventions and the existing `PublicModelResponse` schema (line 90 uses `display_name`). This creates API inconsistency.

🔎 Proposed fix

```diff
 class PublicRetrieverResponse(BaseModel):
     """Public retriever response model"""

     id: int
     name: str
     namespace: str
-    displayName: Optional[str] = None
-    storageType: str
+    display_name: Optional[str] = None
+    storage_type: str
     description: Optional[str] = None
     retriever_json: dict = Field(..., alias="json", serialization_alias="json")
     is_active: bool
     created_at: datetime
     updated_at: datetime
```
🤖 Prompt for AI Agents
In backend/app/schemas/admin.py around lines 226-242, the fields use camelCase
(displayName, storageType) which is inconsistent with Python and the existing
PublicModelResponse; rename them to snake_case (display_name: Optional[str] =
None, storage_type: str) and, if the API must preserve camelCase payloads, add
Field(..., alias="displayName") and Field(..., alias="storageType") respectively
so populate_by_name remains true; update any code that constructs or reads these
attributes to use the new snake_case names.
```python
def delete_retriever(
    self, db: Session, *, retriever_id: int, current_user: User
) -> None:
    """
    Delete public retriever from kinds table (hard delete)

    Args:
        db: Database session
        retriever_id: Retriever ID
        current_user: Current admin user

    Raises:
        HTTPException: If retriever not found
    """
    kind = (
        db.query(Kind)
        .filter(
            Kind.user_id == 0,
            Kind.kind == "Retriever",
            Kind.id == retriever_id,
            Kind.is_active == True,
        )
        .first()
    )
    if not kind:
        raise HTTPException(status_code=404, detail="Public retriever not found")

    db.delete(kind)
    db.commit()
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# First, explore the repository structure to understand the layout
find . -type f -name "*.py" | grep -E "(retriever|delete)" | head -20
```

Repository: wecode-ai/Wegent
Length of output: 301

🏁 Script executed:

```bash
#!/bin/bash
# Search for retriever_kinds.py file
fd "retriever_kinds" --type f
```

Repository: wecode-ai/Wegent
Length of output: 108

🏁 Script executed:

```bash
#!/bin/bash
# Search for all delete_retriever methods in Python files
rg "def delete_retriever" --type py -B 2 -A 15
```

Repository: wecode-ai/Wegent
Length of output: 4280

🏁 Script executed:

```bash
#!/bin/bash
# Get the full delete_retriever method from retriever_kinds.py
rg "def delete_retriever" backend/app/services/adapters/retriever_kinds.py -A 30
```

Repository: wecode-ai/Wegent
Length of output: 1022

🏁 Script executed:

```bash
#!/bin/bash
# Check for any comments or documentation about hard vs soft delete patterns
rg "hard.delete|soft.delete" --type py -i -B 2 -A 2
```

Repository: wecode-ai/Wegent
Length of output: 15547

🏁 Script executed:

```bash
#!/bin/bash
# Check if there are other places where public retrievers are deleted
rg "delete.*public.*retriever" --type py -i -B 3 -A 3
```

Repository: wecode-ai/Wegent
Length of output: 1528

🏁 Script executed:

```bash
#!/bin/bash
# Get complete implementation of delete_retriever in retriever_kinds.py
rg "def delete_retriever" backend/app/services/adapters/retriever_kinds.py -A 50 | head -60
```

Repository: wecode-ai/Wegent
Length of output: 1728

🏁 Script executed:

```bash
#!/bin/bash
# Get the soft delete line from retriever_kinds.py
rg "def delete_retriever" backend/app/services/adapters/retriever_kinds.py -A 70 | tail -20
```

Repository: wecode-ai/Wegent
Length of output: 587
Clarify the deletion strategy for public retrievers.

The `delete_retriever` method in this file performs a hard delete, while `retriever_kinds.py` uses a soft delete (`is_active=False`) for user-managed retrievers. This inconsistency impacts data recovery and audit capabilities.

If hard delete is intentional for admin-only public retrievers, add a comment explaining the rationale. Otherwise, align with the soft delete pattern used elsewhere in the codebase; a sketch of the soft-delete variant follows below.
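A minimal sketch of the soft-delete variant, assuming the `Kind` model, `Session`, `User`, and `HTTPException` context shown in the method above; the exact audit behavior is a design choice, not part of this PR:

```python
# Assumes the imports/context of public_retriever.py shown above
# (Session, Kind, User, HTTPException).
def delete_retriever(
    self, db: Session, *, retriever_id: int, current_user: User
) -> None:
    """Soft-delete a public retriever by flagging it inactive."""
    kind = (
        db.query(Kind)
        .filter(
            Kind.user_id == 0,
            Kind.kind == "Retriever",
            Kind.id == retriever_id,
            Kind.is_active,
        )
        .first()
    )
    if not kind:
        raise HTTPException(status_code=404, detail="Public retriever not found")

    # Flag inactive instead of removing the row, matching retriever_kinds.py
    kind.is_active = False
    db.add(kind)
    db.commit()
```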
🧰 Tools
🪛 Ruff (0.14.10)

267-267: Unused method argument: `current_user` (ARG002)

286-286: Avoid equality comparisons to `True`; use `if Kind.is_active:` for truth checks. Replace with `Kind.is_active` (E712)
🤖 Prompt for AI Agents
In backend/app/services/adapters/public_retriever.py around lines 266 to 293,
the method currently performs a hard delete of a public retriever which is
inconsistent with the soft-delete pattern (is_active=False) used elsewhere;
either document the intentional rationale for hard delete with a clear comment
above the delete block, or change the implementation to follow the soft-delete
pattern: set kind.is_active = False (and optionally set a deleted_at/deleted_by
audit field if available), add db.add(kind) or ensure the ORM tracks the change,
and commit the session instead of db.delete(kind). Ensure the code path and
tests reflect the chosen strategy.
```python
# Sort blocks by their order_index (they may be out of order due to special element processing)
# Actually, we should sort by source position for proper document order
blocks.sort(key=lambda b: b.source_ref.get("line", 0) if b.source_ref else 0)

# Re-assign order indices
for idx, block in enumerate(blocks):
    block.order_index = idx
```
Sorting logic bug: inconsistent source_ref keys.

Code blocks and tables use `{"start": ..., "end": ...}` while headings, paragraphs, and lists use `{"line": ...}`. The sort key `source_ref.get("line", 0)` will default code/table blocks to position 0, causing incorrect document order.

🔎 Proposed fix

```diff
 # Sort blocks by their source position for proper document order
-blocks.sort(key=lambda b: b.source_ref.get("line", 0) if b.source_ref else 0)
+def get_sort_key(block: DocumentBlockData) -> int:
+    if not block.source_ref:
+        return 0
+    # Handle both "line" and "start" keys for different block types
+    return block.source_ref.get("line", block.source_ref.get("start", 0))
+
+blocks.sort(key=get_sort_key)
```
```python
try:
    # Extract path from URL
    relative_path = image_url.replace(self.url_prefix, "").lstrip("/")
    file_path = self.base_path / relative_path

    if file_path.exists():
        file_path.unlink()
        logger.debug(f"Deleted image: {file_path}")
        return True
    else:
        logger.warning(f"Image not found: {file_path}")
        return False

except Exception as e:
    logger.error(f"Failed to delete image: {e}")
    return False
```
Potential path traversal vulnerability in delete_image.

The `image_url` is user-influenced, and the path extraction doesn't validate that the resulting path stays within `base_path`. A malicious URL like `/api/document-images/../../../etc/passwd` could escape the intended directory.

🔎 Proposed fix with path validation

```diff
 async def delete_image(self, image_url: str) -> bool:
     try:
         # Extract path from URL
         relative_path = image_url.replace(self.url_prefix, "").lstrip("/")
         file_path = self.base_path / relative_path

+        # Validate path is within base_path to prevent path traversal
+        try:
+            file_path = file_path.resolve()
+            base_resolved = self.base_path.resolve()
+            if not str(file_path).startswith(str(base_resolved)):
+                logger.warning(f"Path traversal attempt detected: {image_url}")
+                return False
+        except (OSError, ValueError):
+            return False
+
         if file_path.exists():
             file_path.unlink()
             logger.debug(f"Deleted image: {file_path}")
             return True
         else:
             logger.warning(f"Image not found: {file_path}")
             return False

     except Exception as e:
-        logger.error(f"Failed to delete image: {e}")
+        logger.exception(f"Failed to delete image: {e}")
         return False
```

🧰 Tools
🪛 Ruff (0.14.10)

115-115: Do not catch blind exception: `Exception` (BLE001)

116-116: Use `logging.exception` instead of `logging.error`. Replace with `exception` (TRY400)
🤖 Prompt for AI Agents
In backend/app/services/document_parser/storage/local_storage.py around lines
102 to 117, the delete_image implementation builds a Path from a user-controlled
image_url without validating it, allowing path traversal; change it to derive
the relative path, construct the candidate path with self.base_path /
relative_path, then resolve the candidate path (use
candidate_path.resolve(strict=False)) and verify it is inside self.base_path
(use Path.is_relative_to if available, or compare resolved paths/commonpath)
before unlinking; if the resolved path is outside base_path or resolution fails,
log a warning and return False, and handle symlinks and exceptions safely.
doc_dir = self.base_path / document_id

if not doc_dir.exists():
    return 0

try:
    # Count files before deletion
    file_count = sum(1 for _ in doc_dir.iterdir() if _.is_file())

    # Remove entire directory
    shutil.rmtree(doc_dir)
Validate document_id to prevent path traversal.
Similar to delete_image, if document_id contains path components like ../, it could escape the base directory. Consider validating that document_id is a valid UUID or sanitizing it.
🔎 Proposed fix
async def delete_document_images(self, document_id: str) -> int:
+   # Validate document_id format to prevent path traversal
+   import re
+   if not re.match(r'^[a-zA-Z0-9_-]+$', document_id):
+       logger.warning(f"Invalid document_id format: {document_id}")
+       return 0
+
    doc_dir = self.base_path / document_id
+   # Additional safety: ensure the resolved path stays within base_path
+   # (is_relative_to avoids the prefix-collision pitfall of a raw startswith check)
+   try:
+       doc_dir = doc_dir.resolve()
+       if not doc_dir.is_relative_to(self.base_path.resolve()):
+           logger.warning(f"Path traversal attempt in document_id: {document_id}")
+           return 0
+   except (OSError, ValueError):
+       return 0
+
    if not doc_dir.exists():
        return 0

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In backend/app/services/document_parser/storage/local_storage.py around lines
129 to 139, the code builds doc_dir = self.base_path / document_id without
validating document_id which allows path traversal (e.g. ../) to escape
base_path; update the function to first validate document_id (preferably by
attempting to parse it as a UUID and rejecting non-UUIDs) or at minimum reject
any path separators or parent components, then build the path and call doc_dir =
(self.base_path / document_id).resolve(); after resolving, verify that
str(doc_dir).startswith(str(self.base_path.resolve())) (or use path comparison
like self.base_path.resolve() in doc_dir.parents or doc_dir ==
self.base_path.resolve() / document_id) and if the check fails raise or return
an error (e.g., ValueError or return 0) before counting files and calling
shutil.rmtree; this prevents directory escapes while preserving the existing
file-count-and-delete flow.
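As a concrete illustration of the UUID route the prompt suggests, a minimal sketch (the helper name is hypothetical) that accepts only canonical UUID strings:

import uuid

def is_valid_document_id(document_id: str) -> bool:
    # Round-trip through uuid.UUID so variants like "{...}" or "urn:uuid:..." are rejected
    try:
        return str(uuid.UUID(document_id)) == document_id.lower()
    except ValueError:
        return False

print(is_valid_document_id("123e4567-e89b-12d3-a456-426614174000"))  # True
print(is_valid_document_id("../../etc/passwd"))                      # False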
  handleStorageTypeChange,
  handleRetrievalMethodToggle,
}) => {
  const { t } = useTranslation(['admin', 'common', 'wizard']);
🛠️ Refactor suggestion | 🟠 Major
Avoid array format in useTranslation per coding guidelines.
The coding guidelines specify: "MUST use single namespace with useTranslation() - never use array format like useTranslation(['common', 'groups'])". Use cross-namespace key format t('namespace:key') instead.
🔎 Proposed fix
- const { t } = useTranslation(['admin', 'common', 'wizard']);
+ const { t } = useTranslation();

The existing translation calls already use the correct cross-namespace format (e.g., t('admin:public_retrievers.form.name'), t('common:retrievers.retrieval_method_vector')), so this change should work without further modifications.
Based on coding guidelines, the array format is prohibited.
🤖 Prompt for AI Agents
frontend/src/features/admin/components/RetrieverFormFields.tsx lines 114-114:
the code uses array namespace format useTranslation(['admin', 'common',
'wizard']) which is prohibited; change it to a single-namespace call
useTranslation() and ensure all translation keys use cross-namespace notation
like t('admin:public_retrievers.form.name') or
t('common:retrievers.retrieval_method_vector'); if any t(...) calls omit the
namespace, prepend the correct namespace to those keys so all translations
resolve correctly.
const handleUpdate = async (data: Parameters<typeof personalKb.update>[1]) => {
  if (!editingKb) return;
  await personalKb.update(editingKb.id, data);
  // Refresh the appropriate list based on whether it's a group or personal knowledge base
  if (editingKb.namespace !== 'default') {
    // Group knowledge base - trigger refresh via refreshKey
    setGroupRefreshKey(prev => prev + 1);
  }
  setEditingKb(null);
};
Missing personal knowledge base refresh after update.
The group knowledge base refresh logic is correctly implemented, but the personal knowledge base refresh is missing. When updating a personal KB (namespace === 'default'), personalKb.refresh() should be called to update the UI.
Compare with handleCreate (lines 140-144) which explicitly calls personalKb.refresh() for personal KBs.
🔎 Proposed fix
const handleUpdate = async (data: Parameters<typeof personalKb.update>[1]) => {
  if (!editingKb) return;
  await personalKb.update(editingKb.id, data);
  // Refresh the appropriate list based on whether it's a group or personal knowledge base
  if (editingKb.namespace !== 'default') {
    // Group knowledge base - trigger refresh via refreshKey
    setGroupRefreshKey(prev => prev + 1);
+ } else {
+   // Personal knowledge base - refresh directly
+   personalKb.refresh();
  }
  setEditingKb(null);
};
🤖 Prompt for AI Agents
frontend/src/features/knowledge/document/components/KnowledgeDocumentPage.tsx
around lines 148 to 157: the update handler only triggers a group refresh when
editingKb.namespace !== 'default' and never refreshes the personal KB UI; modify
handleUpdate to call personalKb.refresh() when editingKb.namespace === 'default'
(mirror the handleCreate behavior), then clear editingKb as before; ensure the
call is awaited or fired after the personalKb.update completes and keep existing
group refreshKey increment for non-default namespaces.
const handleDelete = async () => {
  if (!deletingKb) return;
  await personalKb.remove(deletingKb.id);
  // Refresh the appropriate list based on whether it's a group or personal knowledge base
  if (deletingKb.namespace !== 'default') {
    // Group knowledge base - trigger refresh via refreshKey
    setGroupRefreshKey(prev => prev + 1);
  }
  setDeletingKb(null);
};
Missing personal knowledge base refresh after delete.
Similar to the update handler, the delete handler is missing the personal knowledge base refresh. When deleting a personal KB (namespace === 'default'), personalKb.refresh() should be called to update the UI.
🔎 Proposed fix
const handleDelete = async () => {
  if (!deletingKb) return;
  await personalKb.remove(deletingKb.id);
  // Refresh the appropriate list based on whether it's a group or personal knowledge base
  if (deletingKb.namespace !== 'default') {
    // Group knowledge base - trigger refresh via refreshKey
    setGroupRefreshKey(prev => prev + 1);
+ } else {
+   // Personal knowledge base - refresh directly
+   personalKb.refresh();
  }
  setDeletingKb(null);
};
🤖 Prompt for AI Agents
In frontend/src/features/knowledge/document/components/KnowledgeDocumentPage.tsx
around lines 159 to 168, the delete handler does not refresh the personal
knowledge base UI when removing a personal KB (namespace === 'default'); update
handleDelete to call and await personalKb.refresh() when deleting a personal KB,
keep the existing setGroupRefreshKey increment for non-default namespaces, and
then clear setDeletingKb(null) after the refresh so the UI reflects the
deletion.
const data = response.data || [];
// Sort by type priority based on scope, then by name
// - Personal scope: user > public
// - Group scope: group > public
const typePriority: Record<string, number> =
  scope === 'group' ? { group: 0, public: 1 } : { user: 0, public: 1 };
data.sort((a, b) => {
  const priorityA = typePriority[a.type] ?? 1;
  const priorityB = typePriority[b.type] ?? 1;
  if (priorityA !== priorityB) {
    return priorityA - priorityB;
  }
  return (a.name || '').localeCompare(b.name || '');
});
setRetrievers(data);
🧩 Analysis chain
🏁 Scripts executed:

# Find the UnifiedRetriever type definition
rg -n "type UnifiedRetriever" --type ts --type tsx -A 10

# Look for retriever type enum definitions
rg -n "RetrieverType|retriever.*type" --type ts --type tsx -B 2 -A 2

# Check the useRetrievers.ts file itself to understand the context
fd "useRetrievers.ts" --type f

# Search for where the API is called to see what types are returned
rg -n "retrievers|getRetrievers" --type ts --type tsx -B 3 -A 3

# Read the useRetrievers.ts file to see the full context
cat -n frontend/src/features/knowledge/document/hooks/useRetrievers.ts

# Search for UnifiedRetriever type definition without tsx flag
rg -n "type UnifiedRetriever|interface UnifiedRetriever" -A 15

# Search for retriever type values in the codebase
rg -n "type.*['\"]user['\"]|type.*['\"]group['\"]|type.*['\"]public['\"]" -i

# Look for the API endpoint being called
rg -n "getRetrievers\|/retrievers" -A 5

# Find the backend retriever API endpoint
rg -n "def.*retriever|def.*Retriever" backend/app/api/endpoints/adapter/retrievers.py -A 10

# Check the retriever service to understand scope handling
rg -n "unified.*retriever|get.*retriever" backend/app/services/ -i -A 5

# Look for how the API handles personal vs group scope
rg -n "scope.*personal|scope.*group" backend/app/services/ -i -A 3
Fix sorting priority logic to handle all scope cases correctly.
The sorting implementation has a bug when scope='all'. The code correctly handles the individual 'personal' and 'group' scopes, but falls back to personal scope logic for 'all', which causes group type retrievers to share the same priority (1) as public retrievers and be sorted by name only.
Backend returns:
- scope='personal': user + public types
- scope='group': group + public types
- scope='all': user + group + public types (all three)

Frontend sorting applies:
- scope='group': { group: 0, public: 1 } ✓
- All other scopes: { user: 0, public: 1 }, which misses the group type for the 'all' scope

Update the sorting logic to handle 'all' scope:

const typePriority: Record<string, number> =
  scope === 'group' ? { group: 0, public: 1 } :
  scope === 'all' ? { user: 0, group: 1, public: 2 } :
  { user: 0, public: 1 };

🤖 Prompt for AI Agents
In frontend/src/features/knowledge/document/hooks/useRetrievers.ts around lines
17 to 31, the sort priority falls back to the personal scope logic for
scope='all', causing group retrievers to be ranked equal to public; update the
typePriority assignment to explicitly handle scope === 'all' by returning
priorities for user, group, and public (e.g., user:0, group:1, public:2), keep
the existing group and personal cases, and leave the rest unchanged so that
data.sort uses the correct priority mapping for all scopes.
def dict(self, **kwargs) -> Dict[str, Any]:
    """Override dict method to exclude None values"""
    # Exclude None values by default
-   kwargs.setdefault('exclude_none', True)
+   kwargs.setdefault("exclude_none", True)
    return super().dict(**kwargs)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check Pydantic version in requirements files
echo "=== Checking Pydantic version in requirements ==="
fd -e txt -e in 'requirements' --exec cat {} \; 2>/dev/null | grep -i pydantic || echo "No match found in standard requirements files"
echo -e "\n=== Checking for Pydantic imports across codebase ==="
rg -n --type py 'from pydantic import' 2>/dev/null | head -20 || echo "No imports found"
echo -e "\n=== Checking for .dict() method usage ==="
rg -n --type py '\.dict\(' -g '!**/test*' 2>/dev/null | head -20 || echo "No .dict() usage found"
echo -e "\n=== Checking for .model_dump() method usage ==="
rg -n --type py '\.model_dump\(' 2>/dev/null | head -10 || echo "No .model_dump() usage found"
echo -e "\n=== Checking shared/models/task.py ==="
cat -n shared/models/task.py 2>/dev/null || echo "File not found"
Replace deprecated .dict() method with .model_dump() for Pydantic v2 compatibility.
The project uses Pydantic 2.11.2, where .dict() is deprecated in favor of .model_dump(). Lines 71–75 (and the identical pattern at lines 82–86 in the ExecutionResult class) need to be updated. Replace super().dict(**kwargs) with super().model_dump(**kwargs).
🤖 Prompt for AI Agents
In shared/models/task.py around lines 71–75 (and the identical pattern in the
ExecutionResult class at lines 82–86), the override calls super().dict(**kwargs)
which is deprecated in Pydantic v2; replace those calls with
super().model_dump(**kwargs) and keep kwargs.setdefault("exclude_none", True) so
the method returns a Dict[str, Any] of the model dump with None values excluded.
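A minimal sketch of the Pydantic v2 replacement, assuming the override only needs the exclude_none default; the fields on ExecutionResult are illustrative, not the real model's:

from typing import Any, Dict, Optional
from pydantic import BaseModel

class ExecutionResult(BaseModel):
    status: str
    detail: Optional[str] = None

    def model_dump(self, **kwargs: Any) -> Dict[str, Any]:
        # Exclude None values by default, mirroring the old dict() override
        kwargs.setdefault("exclude_none", True)
        return super().model_dump(**kwargs)

print(ExecutionResult(status="ok").model_dump())  # {'status': 'ok'}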
Summary
Test plan