feat(backend): implement document parser system for structured block output #802

base: main

Conversation
- Add PublicRetrieverService for managing system-level retrievers (user_id=0)
- Add admin API endpoints for public retriever CRUD operations
- Update retriever_kinds_service to include public retrievers with fallback
- Add PublicRetrieverList frontend component for admin management
- Add public retriever tab to admin navigation
- Update RetrievalSettingsSection to display source type labels (Personal/Public/Group)
- Add i18n translations for public retriever features
…ptions
- Add complete index strategy options (per_user, per_dataset, fixed, rolling)
- Add retrieval methods configuration with checkboxes (vector, keyword, hybrid)
- Add connection test functionality
- Add storage type specific auth fields (username/password for ES, API key for Qdrant)
- Add password visibility toggles
- Add dynamic retrieval methods loading from API
- Match personal retriever configuration page layout and functionality
Add logging to help diagnose why public retrievers are not being returned in the list API. Logs the count of found public retrievers and each added retriever name.
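A minimal sketch of what such diagnostic logging might look like; the function and the `query_public_retrievers` helper are illustrative assumptions, not the actual implementation:

```python
import logging

logger = logging.getLogger(__name__)

def list_retrievers_with_public(db, user_id: int) -> list:
    """Hypothetical listing helper that logs public retriever resolution."""
    public_retrievers = query_public_retrievers(db)  # assumed helper, not from this PR
    logger.info("Found %d public retrievers", len(public_retrievers))

    results = []
    for retriever in public_retrievers:
        logger.info("Adding public retriever: %s", retriever.name)
        results.append(retriever)
    return results
```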
…-retriever-support
- Add RetrieverFormFields component for reusable form fields
- Update PublicRetrieverList with create/edit/delete functionality
- Add API endpoints for public retriever CRUD operations
- Update storage base to support public retriever configuration
- Update DocumentList and RetrievalSettingsSection for retriever selection
- Add public retriever support in RetrievalSettingsSection
- Update useEmbeddingModels and useRetrievers hooks to fetch public models
- Fix storage base to handle retriever configuration properly
- Add i18n translations for public model labels
…-retriever-support
…-retriever-support
…output

Add document parsing module that converts uploaded documents into structured Block format for preview, editing, and RAG/vectorization support.

Key features:
- DocumentBlock model and database migration for storing parsed blocks
- Parser factory with auto-selection based on content type or file extension
- Parsers for Markdown, PDF, DOCX, and images
- Storage service interface with local filesystem implementation
- OCR service interface with mock implementation
- API endpoints for triggering parsing and managing blocks
- Celery task template for async parsing
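A minimal sketch of how the factory auto-selection described above could work. Only `ParserFactory`, `can_handle`, and the parser class names are taken from this PR; the registry layout and signatures here are assumptions:

```python
# Hypothetical sketch of parser auto-selection by content type or extension.
from typing import Optional, Type


class BaseParser:
    def can_handle(self, content_type: Optional[str] = None,
                   filename: Optional[str] = None) -> bool:
        raise NotImplementedError


class MarkdownParser(BaseParser):
    def can_handle(self, content_type=None, filename=None):
        if content_type == "text/markdown":
            return True
        return bool(filename and filename.lower().endswith((".md", ".markdown")))


class ParserFactory:
    # Registered parser classes; the real PR also registers PDF, DOCX,
    # image, and fallback parsers.
    _parsers: list[Type[BaseParser]] = [MarkdownParser]

    @classmethod
    def get_parser(cls, content_type: Optional[str] = None,
                   filename: Optional[str] = None) -> BaseParser:
        # Ask each registered parser whether it can handle the input
        for parser_cls in cls._parsers:
            parser = parser_cls()
            if parser.can_handle(content_type=content_type, filename=filename):
                return parser
        raise ValueError("No parser available for the given input")
```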
📝 Walkthrough

This PR introduces a comprehensive document parsing subsystem with database storage, multiple parser implementations, OCR/storage service abstractions, and API endpoints to parse documents into structured blocks, with support for Markdown, PDF, DOCX, and image formats.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant API
    participant Factory
    participant Parser
    participant Storage
    participant OCR
    participant DB

    Client->>API: POST /documents/{id}/parse-sync<br/>(with file)
    API->>Factory: get_parser(content_type, filename)
    Factory->>Parser: instantiate with storage/OCR
    API->>Parser: parse(binary_data, document_id, filename)
    par Parallel Processing
        Parser->>Storage: save_image(optional)
        Parser->>OCR: extract_text(optional)
        Parser->>OCR: describe_image(optional)
    end
    Parser-->>API: List[DocumentBlockData]
    API->>DB: delete existing blocks for document
    API->>DB: insert new blocks<br/>(id, document_id, type, content, etc.)
    DB-->>API: committed
    API-->>Client: ParseDocumentResponse<br/>(status, total_blocks, parse_time_ms)
```
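For orientation, a hedged sketch of exercising the synchronous parse endpoint shown in the diagram. The URL shape and response keys come from the diagram; the host, document id, and upload field name are assumptions:

```python
# Illustrative client call against the parse-sync endpoint from the diagram.
# BASE_URL, document_id, and the "file" form field are placeholders.
import requests

BASE_URL = "http://localhost:8000"  # assumed host
document_id = "doc-123"             # hypothetical document id

with open("report.pdf", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/documents/{document_id}/parse-sync",
        files={"file": ("report.pdf", f, "application/pdf")},
    )

response.raise_for_status()
result = response.json()
# Fields shown in the diagram's ParseDocumentResponse
print(result["status"], result["total_blocks"], result["parse_time_ms"])
```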
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Add source_type field to track document origin type (markdown, pdf, docx, image, git, ai) for better content categorization and RAG support.

Changes:
- Add source_type column to DocumentBlock model and migration
- Add SourceType enum to data models and Pydantic schemas
- Update all parsers to set appropriate source_type
- Update API endpoints and Celery task to persist source_type
Actionable comments posted: 16
Note
Due to the large number of review comments, comments with Critical or Major severity were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
backend/app/services/rag/embedding/factory.py (1)
17-18: Add a return type hint to the function signature.

The function signature is missing a return type annotation. Per coding guidelines, Python code MUST include type hints for all functions.

🔎 Proposed fix

Based on the function's return statements (`CustomEmbedding` or `OpenAIEmbedding`), add a return type:

```diff
+from typing import Any, Dict, Union
+from llama_index.core.embeddings import BaseEmbedding
+
 def create_embedding_model_from_crd(
     db: Session,
     user_id: int,
     model_name: str,
     model_namespace: str = "default"
-):
+) -> BaseEmbedding:
```

Note: Verify that both `CustomEmbedding` and `OpenAIEmbedding` inherit from `BaseEmbedding`, or adjust the type accordingly.

As per coding guidelines, Python code MUST include type hints.
executor/agents/claude_code/progress_state_manager.py (3)
160-222: Refactor the oversized `_build_workbench_structure` method.

This method spans 62 lines and performs multiple responsibilities: extracting user info, determining git type, and constructing the workbench dictionary. Split it into smaller helper methods to improve clarity and testability, aligning with the 50-line function guideline.

🔎 Suggested refactoring approach

Extract helper methods:

- `_determine_git_type(git_domain: str) -> str`
- `_determine_summary(result_value: str | None) -> str`

```python
def _determine_git_type(self, git_domain: str) -> str:
    """Determine git type from domain."""
    return "github" if "github.com" in git_domain.lower() else "gitlab"

def _determine_summary(self, result_value: str | None) -> str:
    """Determine summary from multiple sources."""
    if result_value:
        return result_value
    if self.task_data.get("subtask_title"):
        return self.task_data.get("subtask_title")
    if self.task_data.get("prompt"):
        return self.task_data.get("prompt")
    return ""

def _build_workbench_structure(
    self, status: str = "running", result_value: str | None = None
) -> Dict[str, Any]:
    """Build workbench data structure."""
    current_time = datetime.now().isoformat()
    git_domain = self.task_data["git_domain"]
    workbench = {
        "taskTitle": self.task_data.get("task_title", ""),
        "taskNumber": str(self.task_data.get("task_id", "")),
        "status": status,
        "completedTime": current_time if status == "completed" else "",
        "repository": self.task_data.get("git_repo", ""),
        "branch": self.task_data.get("branch_name", ""),
        "sessions": 1,
        "premiumRequests": 0,
        "lastUpdated": current_time,
        "summary": self._determine_summary(result_value),
        "changes": [],
        "originalPrompt": self.task_data.get("prompt", ""),
        "file_changes": [],
        "git_info": {
            "initial_commit_id": "",
            "initial_commit_message": "",
            "task_commits": [],
            "source_branch": "",
            "target_branch": "",
        },
        "git_domain": git_domain,
        "git_type": self._determine_git_type(git_domain),
    }
    return workbench
```

261-351: Refactor the oversized `_update_task_commits` method.

This method spans 90 lines, handling commit iteration, commit info collection, markdown formatting, and workbench updates. Break it into focused helper methods to improve maintainability and testability per the 50-line guideline.

🔎 Suggested refactoring approach

Extract helper methods:

- `_build_commit_info(commit) -> Dict[str, Any]`
- `_format_commits_as_markdown(task_commits: List[Dict]) -> str`
- `_get_commit_range() -> List[Dict[str, Any]]` (handles repo iteration)

```python
def _build_commit_info(self, commit) -> Dict[str, Any]:
    """Build commit information dictionary."""
    return {
        "commit_id": commit.hexsha,
        "short_id": commit.hexsha[:8],
        "message": commit.message.strip(),
        "author": commit.author.name,
        "author_email": commit.author.email,
        "committed_date": datetime.fromtimestamp(
            commit.committed_date
        ).isoformat(),
        "stats": {
            "files_changed": len(commit.stats.files),
            "insertions": commit.stats.total["insertions"],
            "deletions": commit.stats.total["deletions"],
        },
    }

def _format_commits_as_markdown(self, task_commits: List[Dict[str, Any]]) -> str:
    """Format commits as markdown."""
    if not task_commits:
        return ""
    commit_lines = [
        f"### commit id: {c['commit_id']}\n{c['message']}\n---\n"
        for c in task_commits
    ]
    return "\n".join(commit_lines)
```

Then simplify `_update_task_commits()` to call these helpers.

353-492: Refactor the oversized `_get_git_file_changes` method.

This method spans 139 lines and combines repository access, diff computation, and line-counting logic. Decompose it into focused helpers (e.g., `_collect_all_diffs()`, `_parse_diff_lines()`, `_build_file_change_record()`) to improve readability and testability per the 50-line guideline.

🔎 Suggested refactoring approach

Extract key helpers:

- `_collect_all_diffs(repo: Repo) -> List[DiffIndex]`: consolidates committed, staged, and unstaged diffs
- `_parse_diff_line_counts(diff_text: str) -> tuple[int, int]`: counts added/removed lines
- `_build_file_change_record(diff) -> Dict[str, Any]`: creates a file change dict

```python
def _parse_diff_line_counts(self, diff_text: str) -> tuple[int, int]:
    """Parse diff text to count added and removed lines."""
    added = removed = 0
    for line in diff_text.split("\n"):
        if line.startswith("+") and not line.startswith("+++"):
            added += 1
        elif line.startswith("-") and not line.startswith("---"):
            removed += 1
    return added, removed

def _build_file_change_record(self, diff) -> Dict[str, Any]:
    """Build file change record from diff."""
    old_path = diff.a_path if diff.a_path else diff.b_path
    new_path = diff.b_path if diff.b_path else diff.a_path

    added_lines = removed_lines = 0
    if diff.diff:
        diff_text = diff.diff.decode("utf-8", errors="ignore")
        added_lines, removed_lines = self._parse_diff_line_counts(diff_text)

    return {
        "old_path": old_path,
        "new_path": new_path,
        "new_file": diff.new_file,
        "renamed_file": diff.renamed_file,
        "deleted_file": diff.deleted_file,
        "added_lines": added_lines,
        "removed_lines": removed_lines,
        "diff_title": os.path.basename(new_path),
    }
```

Then simplify the main method to iterate diffs and call `_build_file_change_record()`.
🟡 Minor comments (7)
frontend/src/features/knowledge/document/components/DocumentList.tsx (1)

114-114: Handle the edge case for files without extensions.

The current implementation using `file.name.split('.').pop()` will return the entire filename if there is no extension (no dot in the filename). This could lead to unexpected behavior or display issues.

🔎 Proposed fix to handle files without extensions

```diff
-  const extension = file.name.split('.').pop() || '';
+  const parts = file.name.split('.');
+  const extension = parts.length > 1 ? parts.pop() || '' : '';
```

This ensures that files without extensions get an empty string instead of the full filename.
frontend/src/features/knowledge/document/components/RetrievalSettingsSection.tsx (1)

137-155: Inconsistent namespace check for embedding model existence.

The retriever existence check (line 119) verifies both `name` and `namespace`, but the embedding model check (line 141) only verifies `name`. If two embedding models share the same name but have different namespaces, the wrong model might be considered as "existing."

🔎 Proposed fix

```diff
 const currentModelExists =
   config.embedding_config?.model_name &&
-  embeddingModels.some(m => m.name === config.embedding_config?.model_name);
+  embeddingModels.some(
+    m =>
+      m.name === config.embedding_config?.model_name &&
+      (m.namespace || 'default') === config.embedding_config?.model_namespace
+  );
```

frontend/src/features/settings/components/RetrieverList.tsx (1)
402-402: Missing namespace prefix in translation key.

Same issue here: `t('retrievers.public')` should include the `common:` namespace prefix.

🔎 Proposed fix

```diff
-  publicLabel={t('retrievers.public')}
+  publicLabel={t('common:retrievers.public')}
```

frontend/src/features/settings/components/RetrieverList.tsx (1)
388-388: Missing namespace prefix in translation key.

The translation key `t('retrievers.public_retrievers')` is missing the `common:` namespace prefix. Other translation calls in this file consistently use the `t('common:retrievers...')` format. This will cause a missing translation at runtime.

🔎 Proposed fix

```diff
- {t('retrievers.public_retrievers')} ({publicRetrievers.length})
+ {t('common:retrievers.public_retrievers')} ({publicRetrievers.length})
```

backend/alembic/versions/o5p6q7r8s9t0_add_document_blocks_table.py (1)
29-35: Duplicate index creation for `document_id`.

The column definition at line 33 includes `index=True`, and there is also an explicit `create_index` call at lines 95-100. This may cause an error or create duplicate indexes depending on how the migration runs. Remove one of them.

🔎 Proposed fix: remove the redundant index=True from the column

```diff
 sa.Column(
     "document_id",
     sa.String(36),
     nullable=False,
-    index=True,
     comment="Reference to uploaded document",
 ),
```

Alternatively, remove the explicit `create_index` call if you prefer the inline approach. The explicit approach gives you control over the index name.

Also applies to: 94-100

executor/agents/claude_code/progress_state_manager.py (1)
177-179: Remove the unused variable `bot_info`.

Line 177 assigns to `bot_info`, but it is never referenced. The assignment can be simplified to access `self.task_data.get("bot", [{}])[0]` directly where needed, or removed if not required.

🔎 Proposed fix

```diff
 # Extract task information from task_data
 user_info = self.task_data.get("user", {})
-bot_info = (
-    self.task_data.get("bot", [{}])[0] if self.task_data.get("bot") else {}
-)
```

If `bot_info` is intended for future use, document that intent or store it as an instance variable.

executor/agents/claude_code/progress_state_manager.py (1)
46-46: Update parameter type hints to use modern union syntax.

Line 46 uses `Optional[str]`. Replace it with `str | None` (PEP 604, fully supported in Python 3.10+). The project requires Python 3.10+, so update all `Optional[T]` instances to `T | None` throughout the file:

- Line 46: `project_path: Optional[str]` → `str | None`
- Line 61: `workbench_data: Optional[Dict[str, Any]]` → `Dict[str, Any] | None`
- Line 62: `initial_commit_id: Optional[str]` → `str | None`
- Line 63: `_monitor_timer: Optional[threading.Timer]` → `threading.Timer | None`
- Line 113: `extra_result: Optional[Dict[str, Any]]` → `Dict[str, Any] | None`

Also remove `Optional` from the imports on line 15.
🧹 Nitpick comments (49)
frontend/src/features/settings/components/prompt-fine-tune/index.ts (1)
5-7: Remove trailing semicolons to align with TypeScript guidelines.

Lines 5-7 end with semicolons, which violates the coding guideline stating that TypeScript files MUST NOT use semicolons. Please remove them for consistency with the Prettier and ESLint configuration.

🔎 Proposed fix

```diff
-export { default as PromptFineTuneDialog } from './PromptFineTuneDialog';
-export { default as PromptTestPanel } from './PromptTestPanel';
-export { default as PromptComparePanel } from './PromptComparePanel';
+export { default as PromptFineTuneDialog } from './PromptFineTuneDialog'
+export { default as PromptTestPanel } from './PromptTestPanel'
+export { default as PromptComparePanel } from './PromptComparePanel'
```

As per coding guidelines, TypeScript/React code must follow Prettier formatting with no semicolons.
frontend/src/features/knowledge/document/hooks/useEmbeddingModels.ts (1)
30-44: Verify priority handling for the 'all' scope and avoid mutating API response data.

Two concerns with the sorting implementation:

1. Incomplete priority map for the 'all' scope: When `scope` is `'all'`, the API response may include models with `type` of `'group'`, `'user'`, and `'public'`. However, the priority map for non-group scopes (lines 34-35) only defines `'user'` and `'public'`. Models with `type: 'group'` will fall back to priority `1` via the nullish coalescing on line 37, placing them at the same priority as public models. Please verify this is the intended behavior when viewing all models across scopes.

2. Array mutation: Line 36 calls `data.sort()`, which mutates the original array from the API response. If `response?.data` is not null, the original array reference is being modified. Consider creating a shallow copy before sorting to avoid potential side effects if the data is cached or reused elsewhere.

🔎 Suggested refactor to address both concerns

```diff
   const data = response?.data || [];
-  // Sort by type priority based on scope, then by name
-  // - Personal scope: user > public
-  // - Group scope: group > public
-  const typePriority: Record<string, number> =
-    scope === 'group' ? { group: 0, public: 1 } : { user: 0, public: 1 };
-  data.sort((a, b) => {
+  // Sort by type priority based on scope, then by name
+  // - Personal scope: user > public
+  // - Group scope: group > public
+  // - All scope: user > group > public
+  const typePriority: Record<string, number> =
+    scope === 'group'
+      ? { group: 0, public: 1 }
+      : scope === 'all'
+        ? { user: 0, group: 1, public: 2 }
+        : { user: 0, public: 1 };
+  const sortedData = [...data].sort((a, b) => {
-    const priorityA = typePriority[a.type] ?? 1;
-    const priorityB = typePriority[b.type] ?? 1;
+    const priorityA = typePriority[a.type] ?? 999;
+    const priorityB = typePriority[b.type] ?? 999;
     if (priorityA !== priorityB) {
       return priorityA - priorityB;
     }
     return (a.name || '').localeCompare(b.name || '');
   });
-  setModels(data);
+  setModels(sortedData);
```

Note: The fallback priority is changed from `1` to `999` to ensure any unexpected types appear last in the sorted list.

backend/app/services/rag/embedding/factory.py (2)
17-164: Consider breaking this function down into smaller components.

The function is 147 lines long, which significantly exceeds the recommended 50-line guideline. It handles multiple responsibilities: database querying, CRD parsing, validation, and model construction.

💡 Suggested refactoring approach

Consider extracting helper functions:

```python
def _query_model_kind(
    db: Session, user_id: int, model_name: str, model_namespace: str
) -> Kind:
    """Query Model CRD from database."""
    # Lines 40-72
    ...

def _parse_and_validate_model_config(
    model_data: Dict[str, Any], model_name: str
) -> tuple[str, Dict[str, Any]]:
    """Parse CRD and validate it's an embedding model. Returns (protocol, config)."""
    # Lines 75-103
    ...

def _create_embedding_instance(protocol: str, model_name: str, config: Dict[str, Any]):
    """Create the appropriate embedding model instance."""
    # Lines 111-164
    ...
```

This would improve readability, testability, and maintainability.

As per coding guidelines, function length SHOULD NOT exceed 50 lines.

88-88: Consider extracting the default model type to a named constant.

The magic string `"llm"` is used as a default value. Extracting it to a module-level constant would improve maintainability and make the codebase more searchable.

💡 Suggested constant extraction

At the top of the file:

```diff
 from app.services.rag.embedding.custom import CustomEmbedding
+
+DEFAULT_MODEL_TYPE = "llm"
```

Then use it in the fallback:

```diff
-    model_type = model_config.get("modelType", "llm")
+    model_type = model_config.get("modelType", DEFAULT_MODEL_TYPE)
```

As per coding guidelines, Python MUST extract magic numbers to named constants.
shared/models/task.py (3)
66-66: Extract the magic string to a constant or enum.

The default value `"continue"` for `next_action` is a magic string. Consider defining valid actions as an enum or as named constants for better type safety and discoverability.

🔎 Proposed refactor using Enum

```diff
+from enum import Enum
+
+class NextAction(str, Enum):
+    """Valid next actions for thinking steps"""
+    CONTINUE = "continue"
+    STOP = "stop"
+    RETRY = "retry"
+
 class ThinkingStep(BaseModel):
     """Thinking step model for recording agent reasoning process"""

     title: str = Field(..., description="Title of thinking step")
-    next_action: str = Field(default="continue", description="Next action to take")
+    next_action: NextAction = Field(default=NextAction.CONTINUE, description="Next action to take")
     details: Optional[Dict[str, Any]] = Field(
         default=None, description="Detailed structured data for this step"
     )
```

82-86: Extract the duplicated `.dict()` override to a base class or mixin.

The `dict()` method override with `exclude_none=True` is duplicated between `ThinkingStep` and `ExecutionResult`. Consider extracting this common pattern to reduce code duplication.

🔎 Proposed refactor using a mixin

```diff
+class ExcludeNoneMixin:
+    """Mixin that excludes None values when serializing to dict"""
+
+    def dict(self, **kwargs) -> Dict[str, Any]:
+        """Override dict method to exclude None values"""
+        kwargs.setdefault("exclude_none", True)
+        return super().dict(**kwargs)
+
+
-class ThinkingStep(BaseModel):
+class ThinkingStep(ExcludeNoneMixin, BaseModel):
     """Thinking step model for recording agent reasoning process"""

     title: str = Field(..., description="Title of thinking step")
     next_action: str = Field(default="continue", description="Next action to take")
     details: Optional[Dict[str, Any]] = Field(
         default=None, description="Detailed structured data for this step"
     )
-
-    def dict(self, **kwargs) -> Dict[str, Any]:
-        """Override dict method to exclude None values"""
-        # Exclude None values by default
-        kwargs.setdefault("exclude_none", True)
-        return super().dict(**kwargs)

-class ExecutionResult(BaseModel):
+class ExecutionResult(ExcludeNoneMixin, BaseModel):
     value: Optional[str] = None
     thinking: List[ThinkingStep] = []
-
-    def dict(self, **kwargs) -> Dict[str, Any]:
-        """Override dict method to exclude None values"""
-        # Exclude None values by default
-        kwargs.setdefault("exclude_none", True)
-        return super().dict(**kwargs)
```

89-90: Add a docstring to the public class.

As per coding guidelines, Python classes should have docstrings. The `TasksRequest` model should include a docstring explaining its purpose and usage context.

🔎 Proposed addition

```diff
 class TasksRequest(BaseModel):
+    """Request model containing a list of tasks for batch operations."""
+
     tasks: List[Task]
```

frontend/src/features/knowledge/document/components/RetrievalSettingsSection.tsx (2)
107-132: Potential re-render cascade with `config` in effect dependencies.

Including `config` in the dependency array while calling `onChange({...config, ...})` inside the effect can cause extra re-renders. The condition guards against infinite loops, but if `onChange` is not memoized or if the parent recreates `config` on each render, this could trigger unnecessary effect executions.

Consider either:

- Memoizing `config` in the parent, or
- Using a ref to track whether auto-selection has already occurred

222-226: Type assertion without fallback.

The `type as 'user' | 'public' | 'group'` cast assumes the input will always be one of these values. If an unexpected type is passed, the translation key won't exist and may display a raw key or an empty string.

🔎 Consider adding a fallback

```diff
 const getSourceTypeLabel = (type: string) => {
-  const typeKey = type as 'user' | 'public' | 'group';
-  return t(`document.retrieval.sourceType.${typeKey}`);
+  const validTypes = ['user', 'public', 'group'] as const;
+  const typeKey = validTypes.includes(type as typeof validTypes[number])
+    ? type
+    : 'user';
+  return t(`document.retrieval.sourceType.${typeKey}`);
 };
```

frontend/src/features/knowledge/document/components/CreateKnowledgeBaseDialog.tsx (1)
52-60: Consider extracting the initial config to avoid duplication.

The initial `retrievalConfig` object is duplicated in three places: initial state (lines 52-60), after submit (lines 101-109), and on close (lines 119-127). This could lead to inconsistencies if one is updated but the others are forgotten.

🔎 Proposed refactor

```typescript
const INITIAL_RETRIEVAL_CONFIG: Partial<RetrievalConfig> = {
  retrieval_mode: 'vector',
  top_k: 5,
  score_threshold: 0.7,
  hybrid_weights: {
    vector_weight: 0.7,
    keyword_weight: 0.3,
  },
}

// Then use:
setRetrievalConfig(INITIAL_RETRIEVAL_CONFIG)
```

Also applies to: 99-109, 119-127
backend/app/services/adapters/retriever_kinds.py (1)
125-125: Use truthy checks instead of explicit `== True` comparisons.

Lines 125, 193, and 222 use explicit `== True` comparisons with boolean columns, which is not idiomatic Python. Use direct truthy checks instead.

🔎 Proposed fix

```diff
 public_retrievers = (
     db.query(Kind)
     .filter(
         Kind.user_id == 0,
         Kind.kind == "Retriever",
         Kind.namespace == "default",
-        Kind.is_active == True,
+        Kind.is_active,
     )
     .order_by(Kind.created_at.desc())
     .all()
 )
```

Apply similar changes to lines 193 and 222.

Also applies to: 193-193, 222-222
backend/app/services/document_parser/parsers/__init__.py (1)
1-23: Consider sorting `__all__` for consistency.

The package initialization is well-structured and correctly exports all parser implementations. As a minor style improvement, consider sorting the `__all__` list alphabetically to align with the Ruff linting rule (RUF022).

🔎 Optional refactor to sort `__all__`

```diff
 __all__ = [
-    "MarkdownParser",
-    "PDFParser",
     "DocxParser",
+    "FallbackParser",
     "ImageParser",
-    "FallbackParser",
+    "MarkdownParser",
+    "PDFParser",
 ]
```

backend/app/services/document_parser/parsers/fallback_parser.py (2)
16-16: Unused import.

`ParserFactory` is imported but not used in this module. The `FallbackParser` is consumed by the factory, not the other way around.

🔎 Proposed fix

```diff
 from app.services.document_parser.base import BaseParser
-from app.services.document_parser.factory import ParserFactory
 from app.services.document_parser.models.block import BlockType, DocumentBlockData, SourceType
```
38-45: Use explicit `Optional` type hints per PEP 484.

The parameters use implicit `Optional` syntax (`str = None`), which is prohibited by PEP 484. Use `Optional[str]` or `str | None` for clarity.

🔎 Proposed fix

```diff
+from typing import List, Optional
+
-    def can_handle(self, content_type: str = None, filename: str = None) -> bool:
+    def can_handle(self, content_type: Optional[str] = None, filename: Optional[str] = None) -> bool:
```

backend/app/api/endpoints/rag.py (2)
553-558: The unused variable should be prefixed with an underscore.

The `retriever` variable from unpacking is not used. Prefix it with an underscore to indicate an intentional discard.

🔎 Proposed fix

```diff
 # Get retriever and create backend
-retriever, storage_backend = get_retriever_and_backend(
+_retriever, storage_backend = get_retriever_and_backend(
     retriever_name=retriever_name,
     retriever_namespace=retriever_namespace,
     user_id=current_user.id,
     db=db,
 )
```

574-577: Use exception chaining with `raise ... from e`.

When re-raising as `HTTPException`, use `raise ... from e` to preserve the exception chain for better debugging.

🔎 Proposed fix

```diff
 except HTTPException:
     raise
 except Exception as e:
-    raise HTTPException(status_code=500, detail=str(e))
+    raise HTTPException(status_code=500, detail=str(e)) from e
```

backend/app/services/rag/storage/base.py (1)
128-133: Use `raise ... from` for proper exception chaining.

The static analysis hint B904 is valid. When re-raising a different exception inside an `except` block, chain it to preserve the original traceback for debugging.

🔎 Proposed fix

```diff
 try:
     kb_id_num = int(knowledge_id)
-except ValueError:
-    raise ValueError(
+except ValueError as err:
+    raise ValueError(
         f"knowledge_id must be a numeric value for 'rolling' mode, got: {knowledge_id}"
-    )
+    ) from err
```

backend/app/models/document_block.py (1)
93-105: `datetime.utcnow` is deprecated in Python 3.12+.

`datetime.utcnow()` is deprecated and returns a naive datetime. For timezone-aware timestamps, use `datetime.now(timezone.utc)`.

🔎 Proposed fix

```diff
-from datetime import datetime
+from datetime import datetime, timezone

 # In the Column definitions:
 created_at = Column(
     DateTime,
     nullable=False,
-    default=datetime.utcnow,
+    default=lambda: datetime.now(timezone.utc),
     comment="Creation time",
 )
 updated_at = Column(
     DateTime,
     nullable=False,
-    default=datetime.utcnow,
-    onupdate=datetime.utcnow,
+    default=lambda: datetime.now(timezone.utc),
+    onupdate=lambda: datetime.now(timezone.utc),
     comment="Update time",
 )
```

backend/app/services/document_parser/ocr/base.py (1)
49-75: Sync wrapper code is duplicated and uses the deprecated `get_event_loop()`.

The sync wrapper logic is duplicated here and in `BaseStorageService`. Additionally, `asyncio.get_event_loop()` is deprecated in Python 3.10+ when no running loop exists. Consider extracting a utility function and using the modern pattern.

🔎 Proposed fix: utility function approach

Create a shared utility:

```python
# In a utils module
import asyncio
from typing import Coroutine, TypeVar

T = TypeVar("T")

def run_async_in_sync(coro: Coroutine[None, None, T]) -> T:
    """Run an async coroutine from a sync context."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No running loop, safe to use asyncio.run
        return asyncio.run(coro)

    # Loop is running, use a thread pool
    import concurrent.futures
    with concurrent.futures.ThreadPoolExecutor() as executor:
        future = executor.submit(asyncio.run, coro)
        return future.result()
```

Then simplify the sync methods:

```diff
 def extract_text_sync(self, image_data: bytes) -> str:
-    import asyncio
-
-    try:
-        loop = asyncio.get_event_loop()
-        if loop.is_running():
-            import concurrent.futures
-
-            with concurrent.futures.ThreadPoolExecutor() as executor:
-                future = executor.submit(
-                    asyncio.run,
-                    self.extract_text(image_data),
-                )
-                return future.result()
-        else:
-            return loop.run_until_complete(self.extract_text(image_data))
-    except RuntimeError:
-        return asyncio.run(self.extract_text(image_data))
+    from app.utils.async_helpers import run_async_in_sync
+    return run_async_in_sync(self.extract_text(image_data))
```

Also applies to: 77-103

backend/app/services/document_parser/storage/local_storage.py (1)

88-90: Use `logger.exception` for automatic traceback logging.

Per Ruff TRY400, `logger.exception` automatically includes the traceback, which is more useful for debugging than `logger.error`.

🔎 Proposed fix

```diff
 except Exception as e:
-    logger.error(f"Failed to save image: {e}")
+    logger.exception(f"Failed to save image: {e}")
     raise
```

Apply the same pattern to lines 116 and 145.

backend/app/services/document_parser/tasks.py (3)

78-86: Add a `synchronize_session` parameter to the bulk delete.

When using `query().delete()` in SQLAlchemy, it's best practice to specify `synchronize_session` to control how the session handles deleted objects. Without it, the default behavior may vary by SQLAlchemy version.

🔎 Proposed fix

```diff
 # Delete any existing blocks for this document
 existing_count = (
     db.query(DocumentBlock)
     .filter(DocumentBlock.document_id == document_id)
-    .delete()
+    .delete(synchronize_session="fetch")
 )
```

89-89: `datetime.utcnow()` is deprecated.

Same issue as in the model file. Use `datetime.now(timezone.utc)` instead.

🔎 Proposed fix

```diff
+from datetime import datetime, timezone

-    now = datetime.utcnow()
+    now = datetime.now(timezone.utc)
```

75-114: Consider using a context manager for the database session.

The try/finally pattern is correct, but a context manager provides cleaner resource management.

🔎 Proposed fix using contextlib

```diff
+from contextlib import closing

-    db = SessionLocal()
-    try:
+    with closing(SessionLocal()) as db:
         # Delete any existing blocks...
         # ... rest of the code ...
         db.commit()
         logger.info(f"Saved {len(blocks)} blocks for document {document_id}")
-
-    finally:
-        db.close()
```

backend/alembic/versions/o5p6q7r8s9t0_add_document_blocks_table.py (1)

36-42: Use `server_default` for database-level defaults.

The `default=` parameter in `op.create_table` sets a Python-side default for migrations, not a database-level default. Use `server_default` to ensure defaults apply to all inserts, including those outside the ORM.

🔎 Proposed fix

```diff
 sa.Column(
     "source_type",
     sa.String(20),
     nullable=False,
-    default="markdown",
+    server_default="markdown",
     comment="Source type: markdown, pdf, docx, image, git, ai",
 ),
 ...
 sa.Column(
     "editable",
     sa.Boolean,
-    default=False,
+    server_default=sa.false(),
     comment="Whether user can edit this block",
 ),
```

Also applies to: 55-60

backend/app/services/document_parser/storage/base.py (1)

71-112: Same sync wrapper pattern duplicated here.

This sync wrapper implementation duplicates the pattern in `BaseOCRService`. Consider extracting a shared utility function to avoid code duplication and centralize the async-to-sync bridging logic. See the earlier comment on `backend/app/services/document_parser/ocr/base.py` for the suggested utility approach.

backend/app/services/document_parser/parsers/image_parser.py (2)

31-41: Annotate the mutable class attribute with `ClassVar`.

Per Ruff RUF012, mutable class attributes should be annotated with `typing.ClassVar` to make the intent explicit.

🔎 Proposed fix

```diff
+from typing import ClassVar, Dict, List, Optional
-from typing import List, Optional

 class ImageParser(BaseParser):
     # Content type to extension mapping
-    CONTENT_TYPE_MAP = {
+    CONTENT_TYPE_MAP: ClassVar[Dict[str, str]] = {
         "image/png": ".png",
```

121-121: Remove the extraneous f-string prefix.

This f-string has no placeholders, making the `f` prefix unnecessary.

🔎 Proposed fix

```diff
-    logger.info(f"Parsed image file into 1 block")
+    logger.info("Parsed image file into 1 block")
```

frontend/src/features/admin/components/PublicRetrieverList.tsx (1)
79-80: Consider removing unused state variables.

`_total` and `_setPage` are declared but never used meaningfully. If pagination is planned for the future, consider adding a TODO comment; otherwise, remove them to reduce dead code.

backend/app/services/document_parser/factory.py (3)
32-32: Annotate the mutable class attribute with `ClassVar`.

Per Ruff RUF012, mutable class attributes should be annotated with `typing.ClassVar`.

🔎 Proposed fix

```diff
-from typing import Dict, List, Optional, Type
+from typing import ClassVar, Dict, List, Optional, Type

 class ParserFactory:
-    _parsers: List[Type[BaseParser]] = []
+    _parsers: ClassVar[List[Type[BaseParser]]] = []
```

71-81: Remove the unused `noqa` directive.

The `# noqa: F401` comment is flagged as unused by Ruff RUF100. Since F401 (unused import) is not enabled in your linter configuration, this directive has no effect and can be removed.

🔎 Proposed fix

```diff
-    from app.services.document_parser.parsers import (  # noqa: F401
+    from app.services.document_parser.parsers import (
         docx_parser,
```

83-123: Consider caching parser instances to avoid repeated instantiation.

`get_parser` instantiates every registered parser class on each call to check `can_handle`. For better efficiency, you could maintain a cache of instantiated parsers keyed by (storage_service, ocr_service). This is a minor optimization since the registry is small; a rough sketch follows below.
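A rough sketch of such a cache, assuming `ParserFactory` keeps a class-level registry as described in this PR; the cache key, constructor signature, and structure here are illustrative only:

```python
# Hypothetical parser-instance cache keyed by the injected services.
# ParserFactory, _parsers, and can_handle come from this PR; the caching
# layer itself is only a sketch.
from typing import Dict, List, Optional, Tuple, Type


class ParserFactory:
    _parsers: List[Type["BaseParser"]] = []
    _instance_cache: Dict[Tuple[int, int], List["BaseParser"]] = {}

    @classmethod
    def get_parser(
        cls,
        content_type: Optional[str] = None,
        filename: Optional[str] = None,
        storage_service=None,
        ocr_service=None,
    ) -> "BaseParser":
        # Key the cache on the identity of the injected services
        key = (id(storage_service), id(ocr_service))
        if key not in cls._instance_cache:
            cls._instance_cache[key] = [
                parser_cls(storage_service, ocr_service)  # assumed constructor
                for parser_cls in cls._parsers
            ]
        for parser in cls._instance_cache[key]:
            if parser.can_handle(content_type=content_type, filename=filename):
                return parser
        raise ValueError("No parser available for the given input")
```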
backend/app/services/document_parser/models/block.py (2)

12-15: Remove the unused `datetime` import.

The `datetime` module is imported but never used in this file.

🔎 Proposed fix

```diff
 from dataclasses import dataclass, field
-from datetime import datetime
 from enum import Enum
 from typing import Any, Dict, Optional
```

85-102: Consider adding a type hint for the `blocks` field.

The `blocks` field uses `list` without a type parameter. Adding a type hint improves clarity and enables better static analysis.

🔎 Proposed fix

```diff
 @dataclass
 class ParseResult:
     document_id: str
-    blocks: list = field(default_factory=list)
+    blocks: List[DocumentBlockData] = field(default_factory=list)
     success: bool = True
```

backend/app/services/document_parser/parsers/docx_parser.py (3)
32-42: Annotate the mutable class attribute with `ClassVar`.

Per Ruff RUF012, mutable class attributes should be annotated with `typing.ClassVar`.

🔎 Proposed fix

```diff
-from typing import List, Optional
+from typing import ClassVar, Dict, List, Optional

 class DocxParser(BaseParser):
     # Mapping of Word heading styles to levels
-    HEADING_STYLES = {
+    HEADING_STYLES: ClassVar[Dict[str, int]] = {
         "Heading 1": 1,
```

76-77: Use `logging.exception` to include the traceback.

When logging an error in an exception handler, `logging.exception` automatically includes the traceback, which aids debugging.

🔎 Proposed fix

```diff
 except ImportError:
-    logger.error("python-docx not installed. Install with: pip install python-docx")
+    logger.exception("python-docx not installed. Install with: pip install python-docx")
```

225-237: Use `logging.exception` for errors with a traceback.

Replace `logger.error` with `logger.exception` to automatically capture the full traceback for debugging.

🔎 Proposed fix

```diff
 except Exception as e:
-    logger.error(f"Error parsing DOCX: {e}")
+    logger.exception("Error parsing DOCX")
     blocks.append(
```

backend/app/services/document_parser/parsers/pdf_parser.py (3)
217-244: Unused `is_bold` variable.

The `is_bold` variable is computed but never used in the classification logic. Either incorporate it into the heading detection heuristics or remove it.

🔎 Proposed fix

```diff
 def _classify_text_block(
     self,
     block: dict,
     text: str,
 ) -> tuple:
     # Get average font size from spans
     font_sizes = []
-    is_bold = False
     for line in block.get("lines", []):
         for span in line.get("spans", []):
             font_sizes.append(span.get("size", 12))
-            flags = span.get("flags", 0)
-            if flags & 16:  # Bold flag
-                is_bold = True

     avg_font_size = sum(font_sizes) / len(font_sizes) if font_sizes else 12
```

Alternatively, if bold detection should influence heading classification:

```diff
     # Heuristics for heading detection
     # Large font or bold with short text likely a heading
     is_short = len(text) < 100 and "\n" not in text.strip()
-    if avg_font_size >= 14 and is_short:
+    if (avg_font_size >= 14 or is_bold) and is_short:
```

62-62: Consider using `logging.exception` to preserve the stack trace.

When handling the `ImportError`, using `logging.exception` would capture the full stack trace, which aids debugging.

🔎 Proposed fix

```diff
 except ImportError:
-    logger.error("PyMuPDF not installed. Install with: pip install PyMuPDF")
+    logger.exception("PyMuPDF not installed. Install with: pip install PyMuPDF")
```

164-176: Use `logging.exception` and chain the exception for better debugging.

When catching parsing errors, `logging.exception` preserves the stack trace. Additionally, the error message exposes raw exception details, which could leak internal information in some contexts.

🔎 Proposed fix

```diff
 except Exception as e:
-    logger.error(f"Error parsing PDF: {e}")
+    logger.exception(f"Error parsing PDF: {e}")
     blocks.append(
         DocumentBlockData(
             id=str(uuid.uuid4()),
             document_id=document_id,
             source_type=SourceType.PDF,
             block_type=BlockType.UNSUPPORTED,
-            content=f"Error parsing PDF: {str(e)}",
+            content="Error parsing PDF. Please check the file format.",
             editable=False,
             order_index=0,
         )
     )
```

backend/app/api/endpoints/document_parser.py (2)
191-197: Chain the exception and use `logging.exception` for the stack trace.

When re-raising as `HTTPException`, chain from the original exception to preserve context. Also, using `logging.exception` captures the full stack trace.

🔎 Proposed fix

```diff
 except Exception as e:
-    logger.error(f"Error parsing document {document_id}: {e}")
+    logger.exception(f"Error parsing document {document_id}: {e}")
     db.rollback()
     raise HTTPException(
         status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
-        detail=f"Failed to parse document: {str(e)}",
-    )
+        detail="Failed to parse document. Please check the file format.",
+    ) from e
```

161-161: Update `datetime.utcnow()` to use a timezone-aware UTC datetime.

`datetime.utcnow()` is deprecated in Python 3.12+. Replace it with `datetime.now(timezone.utc)` for forward compatibility.

🔎 Proposed fix

```diff
-from datetime import datetime
+from datetime import datetime, timezone
 ...
-    now = datetime.utcnow()
+    now = datetime.now(timezone.utc)
```

backend/app/services/adapters/public_retriever.py (2)
96-100: Use HTTP 409 Conflict for a duplicate resource.

When a retriever with the same name already exists, HTTP 409 (Conflict) is more semantically correct than 400 (Bad Request).

🔎 Proposed fix

```diff
 if existing:
     raise HTTPException(
-        status_code=400,
+        status_code=409,
         detail=f"Public retriever '{retriever.metadata.name}' already exists in namespace '{namespace}'",
     )
```

Apply the same change at lines 253-256 for the update conflict check.

37-57: Narrow the exception type in the adapter.

The broad `Exception` catch could mask unexpected errors. Consider catching `ValidationError` from Pydantic specifically.

🔎 Proposed fix

```diff
+from pydantic import ValidationError
 ...
 if isinstance(kind.json, dict):
     try:
         retriever = Retriever.model_validate(kind.json)
         display_name = retriever.metadata.displayName
         storage_type = retriever.spec.storageConfig.type
         description = retriever.spec.description
-    except Exception as e:
+    except (ValidationError, AttributeError, KeyError) as e:
         logger.warning(f"Failed to parse retriever {kind.name}: {e}")
```

frontend/src/features/admin/components/RetrieverFormFields.tsx (1)
215-227: Add an `aria-label` to the password visibility toggle for accessibility.

The toggle button lacks an accessible label, which impacts screen reader users.

🔎 Proposed fix

```diff
 <Button
   type="button"
   variant="ghost"
   size="icon"
   className="absolute right-2 top-1/2 -translate-y-1/2 h-7 w-7"
   onClick={() => setShowPassword(!showPassword)}
+  aria-label={showPassword ? t('common:hide_password') : t('common:show_password')}
 >
```

Apply similar changes to the API key toggle button at lines 248-256.
backend/app/services/document_parser/parsers/markdown_parser.py (1)
33-38: Table regex may not match all valid Markdown tables.

The `TABLE_PATTERN` requires a separator line `|[-:| ]+|`, but the regex expects at least one header row before it. Some edge cases (e.g., tables with only separator and data rows) may not match. Additionally, the pattern doesn't handle tables without trailing newlines at EOF.

Consider testing with edge cases like the following (a small test sketch appears after this list):

- Tables at end of file without trailing newline
- Tables with varying column counts
- Tables with escaped pipes in content
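A small test harness for these cases; the pattern below is a stand-in assumption, since the actual `TABLE_PATTERN` lives in the parser module:

```python
# Stand-in pattern for illustration only; substitute the parser's real
# TABLE_PATTERN when running these checks.
import re

TABLE_PATTERN = re.compile(
    r"^\|.+\|\s*\n\|[-:| ]+\|\s*\n(?:\|.+\|\s*\n?)*",
    re.MULTILINE,
)

edge_cases = {
    "no_trailing_newline": "| a | b |\n|---|---|\n| 1 | 2 |",
    "varying_columns": "| a | b |\n|---|---|\n| 1 | 2 | 3 |\n",
    "escaped_pipes": "| a | b |\n|---|---|\n| x \\| y | 2 |\n",
}

# Report which edge cases the pattern recognizes as tables
for name, text in edge_cases.items():
    matched = bool(TABLE_PATTERN.search(text))
    print(f"{name}: {'matched' if matched else 'NOT matched'}")
```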
backend/app/schemas/document_block.py (2)
18-39: Consider reusing enums from `models/block.py` to avoid duplication.

`BlockTypeEnum` and `SourceTypeEnum` duplicate the definitions in `backend/app/services/document_parser/models/block.py`. This duplication could lead to drift if one is updated without the other.

🔎 Proposed fix

```diff
-from enum import Enum
 from typing import Any, Dict, List, Optional

 from pydantic import BaseModel, Field

+from app.services.document_parser.models.block import BlockType, SourceType
-
-class BlockTypeEnum(str, Enum):
-    """Enumeration of supported block types."""
-
-    HEADING = "heading"
-    PARAGRAPH = "paragraph"
-    LIST = "list"
-    CODE = "code"
-    TABLE = "table"
-    IMAGE = "image"
-    AI_SUMMARY = "ai_summary"
-    UNSUPPORTED = "unsupported"
-
-
-class SourceTypeEnum(str, Enum):
-    """Enumeration of document source types."""
-
-    MARKDOWN = "markdown"
-    PDF = "pdf"
-    DOCX = "docx"
-    IMAGE = "image"
-    GIT = "git"
-    AI = "ai"
+# Re-export for backward compatibility
+BlockTypeEnum = BlockType
+SourceTypeEnum = SourceType
```

92-98: `document_id` in `ParseDocumentRequest` may be redundant.

The `trigger_parse` endpoint already receives `document_id` as a path parameter. Having it in the request body creates potential for mismatch. Consider removing it from the schema.

🔎 Proposed fix

```diff
 class ParseDocumentRequest(BaseModel):
     """Schema for triggering document parsing."""

-    document_id: str = Field(..., description="ID of the document to parse")
     force: bool = Field(
         False, description="Force re-parsing even if blocks already exist"
     )
```

executor/agents/claude_code/progress_state_manager.py (2)
523-523: Extract the magic number `2.0` to a named constant.

The hardcoded interval of 2.0 seconds for git monitoring should be extracted as a module-level constant for clarity and maintainability.

🔎 Proposed fix

Add near the top of the file after imports:

```python
# Monitoring interval in seconds
GIT_MONITORING_INTERVAL = 2.0
```

Then update line 523:

```diff
-        self._monitor_timer = threading.Timer(2.0, self._check_git_changes)
+        self._monitor_timer = threading.Timer(
+            GIT_MONITORING_INTERVAL, self._check_git_changes
+        )
```
15-15: Update type hints to use Python 3.10+ native syntax.

The project requires Python 3.10+, so use native union syntax (`|`) and built-in generic types instead of typing imports:

- Replace `Optional[X]` with `X | None` (5 occurrences)
- Replace `Dict[K, V]` with `dict[K, V]`
- Replace `List[X]` with `list[X]`

Remove the now-unused imports `Optional`, `Dict`, and `List` from the typing import on line 15 after updating all usages.
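For illustration, a minimal before/after sketch of this modernization; the function name is hypothetical:

```python
# Before: typing-module generics and Optional
from typing import Any, Dict, List, Optional

def load_result_old(path: Optional[str] = None) -> Optional[Dict[str, Any]]:
    ...

# After: PEP 604 unions and built-in generics (Python 3.10+)
from typing import Any

def load_result(path: str | None = None) -> dict[str, Any] | None:
    ...
```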
```python
class PublicRetrieverResponse(BaseModel):
    """Public retriever response model"""

    id: int
    name: str
    namespace: str
    displayName: Optional[str] = None
    storageType: str
    description: Optional[str] = None
    retriever_json: dict = Field(..., alias="json", serialization_alias="json")
    is_active: bool
    created_at: datetime
    updated_at: datetime

    class Config:
        from_attributes = True
        populate_by_name = True
```
Fix inconsistent field naming conventions.

Lines 232 and 233 use camelCase (`displayName`, `storageType`), which is inconsistent with Python naming conventions and the existing `PublicModelResponse` schema (line 90 uses `display_name`). This creates API inconsistency.

🔎 Proposed fix

```diff
 class PublicRetrieverResponse(BaseModel):
     """Public retriever response model"""

     id: int
     name: str
     namespace: str
-    displayName: Optional[str] = None
-    storageType: str
+    display_name: Optional[str] = None
+    storage_type: str
     description: Optional[str] = None
     retriever_json: dict = Field(..., alias="json", serialization_alias="json")
     is_active: bool
     created_at: datetime
     updated_at: datetime
```
🤖 Prompt for AI Agents
In backend/app/schemas/admin.py around lines 226-242, the fields use camelCase
(displayName, storageType) which is inconsistent with Python and the existing
PublicModelResponse; rename them to snake_case (display_name: Optional[str] =
None, storage_type: str) and, if the API must preserve camelCase payloads, add
Field(..., alias="displayName") and Field(..., alias="storageType") respectively
so populate_by_name remains true; update any code that constructs or reads these
attributes to use the new snake_case names.
```python
def delete_retriever(
    self, db: Session, *, retriever_id: int, current_user: User
) -> None:
    """
    Delete public retriever from kinds table (hard delete)

    Args:
        db: Database session
        retriever_id: Retriever ID
        current_user: Current admin user

    Raises:
        HTTPException: If retriever not found
    """
    kind = (
        db.query(Kind)
        .filter(
            Kind.user_id == 0,
            Kind.kind == "Retriever",
            Kind.id == retriever_id,
            Kind.is_active == True,
        )
        .first()
    )
    if not kind:
        raise HTTPException(status_code=404, detail="Public retriever not found")

    db.delete(kind)
    db.commit()
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# First, explore the repository structure to understand the layout
find . -type f -name "*.py" | grep -E "(retriever|delete)" | head -20
```

Repository: wecode-ai/Wegent
Length of output: 301

🏁 Script executed:

```bash
#!/bin/bash
# Search for retriever_kinds.py file
fd "retriever_kinds" --type f
```

Repository: wecode-ai/Wegent
Length of output: 108

🏁 Script executed:

```bash
#!/bin/bash
# Search for all delete_retriever methods in Python files
rg "def delete_retriever" --type py -B 2 -A 15
```

Repository: wecode-ai/Wegent
Length of output: 4280

🏁 Script executed:

```bash
#!/bin/bash
# Get the full delete_retriever method from retriever_kinds.py
rg "def delete_retriever" backend/app/services/adapters/retriever_kinds.py -A 30
```

Repository: wecode-ai/Wegent
Length of output: 1022

🏁 Script executed:

```bash
#!/bin/bash
# Check for any comments or documentation about hard vs soft delete patterns
rg "hard.delete|soft.delete" --type py -i -B 2 -A 2
```

Repository: wecode-ai/Wegent
Length of output: 15547

🏁 Script executed:

```bash
#!/bin/bash
# Check if there are other places where public retrievers are deleted
rg "delete.*public.*retriever" --type py -i -B 3 -A 3
```

Repository: wecode-ai/Wegent
Length of output: 1528

🏁 Script executed:

```bash
#!/bin/bash
# Get complete implementation of delete_retriever in retriever_kinds.py
rg "def delete_retriever" backend/app/services/adapters/retriever_kinds.py -A 50 | head -60
```

Repository: wecode-ai/Wegent
Length of output: 1728

🏁 Script executed:

```bash
#!/bin/bash
# Get the soft delete line from retriever_kinds.py
rg "def delete_retriever" backend/app/services/adapters/retriever_kinds.py -A 70 | tail -20
```

Repository: wecode-ai/Wegent
Length of output: 587
Clarify the deletion strategy for public retrievers.

The `delete_retriever` method in this file performs a hard delete, while `retriever_kinds.py` uses a soft delete (`is_active=False`) for user-managed retrievers. This inconsistency impacts data recovery and audit capabilities.

If hard delete is intentional for admin-only public retrievers, add a comment explaining the rationale. Otherwise, align with the soft delete pattern used elsewhere in the codebase; a sketch of the soft-delete variant follows below.
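A minimal sketch of the soft-delete variant, assuming the `Kind` model, `Session`, `User`, and `HTTPException` context shown in the method above; the exact audit behavior is a design choice, not part of this PR:

```python
# Assumes the imports/context of public_retriever.py shown above
# (Session, Kind, User, HTTPException).
def delete_retriever(
    self, db: Session, *, retriever_id: int, current_user: User
) -> None:
    """Soft-delete a public retriever by flagging it inactive."""
    kind = (
        db.query(Kind)
        .filter(
            Kind.user_id == 0,
            Kind.kind == "Retriever",
            Kind.id == retriever_id,
            Kind.is_active,
        )
        .first()
    )
    if not kind:
        raise HTTPException(status_code=404, detail="Public retriever not found")

    # Flag inactive instead of removing the row, matching retriever_kinds.py
    kind.is_active = False
    db.add(kind)
    db.commit()
```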
🧰 Tools
🪛 Ruff (0.14.10)

267-267: Unused method argument: `current_user` (ARG002)

286-286: Avoid equality comparisons to `True`; use `if Kind.is_active:` for truth checks. Replace with `Kind.is_active` (E712)
🤖 Prompt for AI Agents
In backend/app/services/adapters/public_retriever.py around lines 266 to 293,
the method currently performs a hard delete of a public retriever which is
inconsistent with the soft-delete pattern (is_active=False) used elsewhere;
either document the intentional rationale for hard delete with a clear comment
above the delete block, or change the implementation to follow the soft-delete
pattern: set kind.is_active = False (and optionally set a deleted_at/deleted_by
audit field if available), add db.add(kind) or ensure the ORM tracks the change,
and commit the session instead of db.delete(kind). Ensure the code path and
tests reflect the chosen strategy.
```python
# Sort blocks by their order_index (they may be out of order due to special element processing)
# Actually, we should sort by source position for proper document order
blocks.sort(key=lambda b: b.source_ref.get("line", 0) if b.source_ref else 0)

# Re-assign order indices
for idx, block in enumerate(blocks):
    block.order_index = idx
```
Sorting logic bug: inconsistent source_ref keys.

Code blocks and tables use `{"start": ..., "end": ...}` while headings, paragraphs, and lists use `{"line": ...}`. The sort key `source_ref.get("line", 0)` will default code/table blocks to position 0, causing incorrect document order.

🔎 Proposed fix

```diff
 # Sort blocks by their source position for proper document order
-blocks.sort(key=lambda b: b.source_ref.get("line", 0) if b.source_ref else 0)
+def get_sort_key(block: DocumentBlockData) -> int:
+    if not block.source_ref:
+        return 0
+    # Handle both "line" and "start" keys for different block types
+    return block.source_ref.get("line", block.source_ref.get("start", 0))
+
+blocks.sort(key=get_sort_key)
```
```python
try:
    # Extract path from URL
    relative_path = image_url.replace(self.url_prefix, "").lstrip("/")
    file_path = self.base_path / relative_path

    if file_path.exists():
        file_path.unlink()
        logger.debug(f"Deleted image: {file_path}")
        return True
    else:
        logger.warning(f"Image not found: {file_path}")
        return False

except Exception as e:
    logger.error(f"Failed to delete image: {e}")
    return False
```
Potential path traversal vulnerability in delete_image.

The `image_url` is user-influenced, and the path extraction doesn't validate that the resulting path stays within `base_path`. A malicious URL like `/api/document-images/../../../etc/passwd` could escape the intended directory.

🔎 Proposed fix with path validation

```diff
 async def delete_image(self, image_url: str) -> bool:
     try:
         # Extract path from URL
         relative_path = image_url.replace(self.url_prefix, "").lstrip("/")
         file_path = self.base_path / relative_path

+        # Validate path is within base_path to prevent path traversal
+        try:
+            file_path = file_path.resolve()
+            base_resolved = self.base_path.resolve()
+            if not str(file_path).startswith(str(base_resolved)):
+                logger.warning(f"Path traversal attempt detected: {image_url}")
+                return False
+        except (OSError, ValueError):
+            return False
+
         if file_path.exists():
             file_path.unlink()
             logger.debug(f"Deleted image: {file_path}")
             return True
         else:
             logger.warning(f"Image not found: {file_path}")
             return False

     except Exception as e:
-        logger.error(f"Failed to delete image: {e}")
+        logger.exception(f"Failed to delete image: {e}")
         return False
```

🧰 Tools
🪛 Ruff (0.14.10)

115-115: Do not catch blind exception: `Exception` (BLE001)

116-116: Use `logging.exception` instead of `logging.error`. Replace with `exception` (TRY400)
🤖 Prompt for AI Agents
In backend/app/services/document_parser/storage/local_storage.py around lines
102 to 117, the delete_image implementation builds a Path from a user-controlled
image_url without validating it, allowing path traversal; change it to derive
the relative path, construct the candidate path with self.base_path /
relative_path, then resolve the candidate path (use
candidate_path.resolve(strict=False)) and verify it is inside self.base_path
(use Path.is_relative_to if available, or compare resolved paths/commonpath)
before unlinking; if the resolved path is outside base_path or resolution fails,
log a warning and return False, and handle symlinks and exceptions safely.
doc_dir = self.base_path / document_id

if not doc_dir.exists():
    return 0

try:
    # Count files before deletion
    file_count = sum(1 for _ in doc_dir.iterdir() if _.is_file())

    # Remove entire directory
    shutil.rmtree(doc_dir)
Validate document_id to prevent path traversal.
Similar to delete_image, if document_id contains path components like ../, it could escape the base directory. Consider validating that document_id is a valid UUID or sanitizing it.
🔎 Proposed fix
async def delete_document_images(self, document_id: str) -> int:
+   # Validate document_id format to prevent path traversal
+   import re
+   if not re.match(r'^[a-zA-Z0-9_-]+$', document_id):
+       logger.warning(f"Invalid document_id format: {document_id}")
+       return 0
+
    doc_dir = self.base_path / document_id
+   # Additional safety: ensure the resolved path stays within base_path
+   # (is_relative_to avoids the prefix-collision pitfall of a raw startswith check)
+   try:
+       doc_dir = doc_dir.resolve()
+       if not doc_dir.is_relative_to(self.base_path.resolve()):
+           logger.warning(f"Path traversal attempt in document_id: {document_id}")
+           return 0
+   except (OSError, ValueError):
+       return 0
+
    if not doc_dir.exists():
        return 0

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In backend/app/services/document_parser/storage/local_storage.py around lines
129 to 139, the code builds doc_dir = self.base_path / document_id without
validating document_id which allows path traversal (e.g. ../) to escape
base_path; update the function to first validate document_id (preferably by
attempting to parse it as a UUID and rejecting non-UUIDs) or at minimum reject
any path separators or parent components, then build the path and call doc_dir =
(self.base_path / document_id).resolve(); after resolving, verify that
str(doc_dir).startswith(str(self.base_path.resolve())) (or use path comparison
like self.base_path.resolve() in doc_dir.parents or doc_dir ==
self.base_path.resolve() / document_id) and if the check fails raise or return
an error (e.g., ValueError or return 0) before counting files and calling
shutil.rmtree; this prevents directory escapes while preserving the existing
file-count-and-delete flow.
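As a concrete illustration of the UUID route the prompt suggests, a minimal sketch (the helper name is hypothetical) that accepts only canonical UUID strings:

import uuid

def is_valid_document_id(document_id: str) -> bool:
    # Round-trip through uuid.UUID so variants like "{...}" or "urn:uuid:..." are rejected
    try:
        return str(uuid.UUID(document_id)) == document_id.lower()
    except ValueError:
        return False

print(is_valid_document_id("123e4567-e89b-12d3-a456-426614174000"))  # True
print(is_valid_document_id("../../etc/passwd"))                      # False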
  handleStorageTypeChange,
  handleRetrievalMethodToggle,
}) => {
  const { t } = useTranslation(['admin', 'common', 'wizard']);
🛠️ Refactor suggestion | 🟠 Major
Avoid array format in useTranslation per coding guidelines.
The coding guidelines specify: "MUST use single namespace with useTranslation() - never use array format like useTranslation(['common', 'groups'])". Use cross-namespace key format t('namespace:key') instead.
🔎 Proposed fix
- const { t } = useTranslation(['admin', 'common', 'wizard']);
+ const { t } = useTranslation();

The existing translation calls already use the correct cross-namespace format (e.g., t('admin:public_retrievers.form.name'), t('common:retrievers.retrieval_method_vector')), so this change should work without further modifications.
Based on coding guidelines, the array format is prohibited.
🤖 Prompt for AI Agents
frontend/src/features/admin/components/RetrieverFormFields.tsx lines 114-114:
the code uses array namespace format useTranslation(['admin', 'common',
'wizard']) which is prohibited; change it to a single-namespace call
useTranslation() and ensure all translation keys use cross-namespace notation
like t('admin:public_retrievers.form.name') or
t('common:retrievers.retrieval_method_vector'); if any t(...) calls omit the
namespace, prepend the correct namespace to those keys so all translations
resolve correctly.
const handleUpdate = async (data: Parameters<typeof personalKb.update>[1]) => {
  if (!editingKb) return;
  await personalKb.update(editingKb.id, data);
  // Refresh the appropriate list based on whether it's a group or personal knowledge base
  if (editingKb.namespace !== 'default') {
    // Group knowledge base - trigger refresh via refreshKey
    setGroupRefreshKey(prev => prev + 1);
  }
  setEditingKb(null);
};
Missing personal knowledge base refresh after update.
The group knowledge base refresh logic is correctly implemented, but the personal knowledge base refresh is missing. When updating a personal KB (namespace === 'default'), personalKb.refresh() should be called to update the UI.
Compare with handleCreate (lines 140-144) which explicitly calls personalKb.refresh() for personal KBs.
🔎 Proposed fix
const handleUpdate = async (data: Parameters<typeof personalKb.update>[1]) => {
  if (!editingKb) return;
  await personalKb.update(editingKb.id, data);
  // Refresh the appropriate list based on whether it's a group or personal knowledge base
  if (editingKb.namespace !== 'default') {
    // Group knowledge base - trigger refresh via refreshKey
    setGroupRefreshKey(prev => prev + 1);
+ } else {
+   // Personal knowledge base - refresh directly
+   personalKb.refresh();
  }
  setEditingKb(null);
};
🤖 Prompt for AI Agents
frontend/src/features/knowledge/document/components/KnowledgeDocumentPage.tsx
around lines 148 to 157: the update handler only triggers a group refresh when
editingKb.namespace !== 'default' and never refreshes the personal KB UI; modify
handleUpdate to call personalKb.refresh() when editingKb.namespace === 'default'
(mirror the handleCreate behavior), then clear editingKb as before; ensure the
call is awaited or fired after the personalKb.update completes and keep existing
group refreshKey increment for non-default namespaces.
const handleDelete = async () => {
  if (!deletingKb) return;
  await personalKb.remove(deletingKb.id);
  // Refresh the appropriate list based on whether it's a group or personal knowledge base
  if (deletingKb.namespace !== 'default') {
    // Group knowledge base - trigger refresh via refreshKey
    setGroupRefreshKey(prev => prev + 1);
  }
  setDeletingKb(null);
};
Missing personal knowledge base refresh after delete.
Similar to the update handler, the delete handler is missing the personal knowledge base refresh. When deleting a personal KB (namespace === 'default'), personalKb.refresh() should be called to update the UI.
🔎 Proposed fix
const handleDelete = async () => {
  if (!deletingKb) return;
  await personalKb.remove(deletingKb.id);
  // Refresh the appropriate list based on whether it's a group or personal knowledge base
  if (deletingKb.namespace !== 'default') {
    // Group knowledge base - trigger refresh via refreshKey
    setGroupRefreshKey(prev => prev + 1);
+ } else {
+   // Personal knowledge base - refresh directly
+   personalKb.refresh();
  }
  setDeletingKb(null);
};
🤖 Prompt for AI Agents
In frontend/src/features/knowledge/document/components/KnowledgeDocumentPage.tsx
around lines 159 to 168, the delete handler does not refresh the personal
knowledge base UI when removing a personal KB (namespace === 'default'); update
handleDelete to call and await personalKb.refresh() when deleting a personal KB,
keep the existing setGroupRefreshKey increment for non-default namespaces, and
then clear setDeletingKb(null) after the refresh so the UI reflects the
deletion.
const data = response.data || [];
// Sort by type priority based on scope, then by name
// - Personal scope: user > public
// - Group scope: group > public
const typePriority: Record<string, number> =
  scope === 'group' ? { group: 0, public: 1 } : { user: 0, public: 1 };
data.sort((a, b) => {
  const priorityA = typePriority[a.type] ?? 1;
  const priorityB = typePriority[b.type] ?? 1;
  if (priorityA !== priorityB) {
    return priorityA - priorityB;
  }
  return (a.name || '').localeCompare(b.name || '');
});
setRetrievers(data);
🧩 Analysis chain
🏁 Scripts executed:

# Find the UnifiedRetriever type definition
rg -n "type UnifiedRetriever" --type ts --type tsx -A 10

# Look for retriever type enum definitions
rg -n "RetrieverType|retriever.*type" --type ts --type tsx -B 2 -A 2

# Check the useRetrievers.ts file itself to understand the context
fd "useRetrievers.ts" --type f

# Search for where the API is called to see what types are returned
rg -n "retrievers|getRetrievers" --type ts --type tsx -B 3 -A 3

# Read the useRetrievers.ts file to see the full context
cat -n frontend/src/features/knowledge/document/hooks/useRetrievers.ts

# Search for UnifiedRetriever type definition without tsx flag
rg -n "type UnifiedRetriever|interface UnifiedRetriever" -A 15

# Search for retriever type values in the codebase
rg -n "type.*['\"]user['\"]|type.*['\"]group['\"]|type.*['\"]public['\"]" -i

# Look for the API endpoint being called
rg -n "getRetrievers\|/retrievers" -A 5

# Find the backend retriever API endpoint
rg -n "def.*retriever|def.*Retriever" backend/app/api/endpoints/adapter/retrievers.py -A 10

# Check the retriever service to understand scope handling
rg -n "unified.*retriever|get.*retriever" backend/app/services/ -i -A 5

# Look for how the API handles personal vs group scope
rg -n "scope.*personal|scope.*group" backend/app/services/ -i -A 3
Fix sorting priority logic to handle all scope cases correctly.
The sorting implementation has a bug when scope='all'. The code correctly handles the individual 'personal' and 'group' scopes, but falls back to personal scope logic for 'all', which causes group type retrievers to share the same priority (1) as public retrievers and be sorted by name only.
Backend returns:
- scope='personal': user + public types
- scope='group': group + public types
- scope='all': user + group + public types (all three)

Frontend sorting applies:
- scope='group': { group: 0, public: 1 } ✓
- All other scopes: { user: 0, public: 1 }, which misses the group type for the 'all' scope

Update the sorting logic to handle 'all' scope:

const typePriority: Record<string, number> =
  scope === 'group' ? { group: 0, public: 1 } :
  scope === 'all' ? { user: 0, group: 1, public: 2 } :
  { user: 0, public: 1 };

🤖 Prompt for AI Agents
In frontend/src/features/knowledge/document/hooks/useRetrievers.ts around lines
17 to 31, the sort priority falls back to the personal scope logic for
scope='all', causing group retrievers to be ranked equal to public; update the
typePriority assignment to explicitly handle scope === 'all' by returning
priorities for user, group, and public (e.g., user:0, group:1, public:2), keep
the existing group and personal cases, and leave the rest unchanged so that
data.sort uses the correct priority mapping for all scopes.
def dict(self, **kwargs) -> Dict[str, Any]:
    """Override dict method to exclude None values"""
    # Exclude None values by default
-   kwargs.setdefault('exclude_none', True)
+   kwargs.setdefault("exclude_none", True)
    return super().dict(**kwargs)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check Pydantic version in requirements files
echo "=== Checking Pydantic version in requirements ==="
fd -e txt -e in 'requirements' --exec cat {} \; 2>/dev/null | grep -i pydantic || echo "No match found in standard requirements files"
echo -e "\n=== Checking for Pydantic imports across codebase ==="
rg -n --type py 'from pydantic import' 2>/dev/null | head -20 || echo "No imports found"
echo -e "\n=== Checking for .dict() method usage ==="
rg -n --type py '\.dict\(' -g '!**/test*' 2>/dev/null | head -20 || echo "No .dict() usage found"
echo -e "\n=== Checking for .model_dump() method usage ==="
rg -n --type py '\.model_dump\(' 2>/dev/null | head -10 || echo "No .model_dump() usage found"
echo -e "\n=== Checking shared/models/task.py ==="
cat -n shared/models/task.py 2>/dev/null || echo "File not found"
Replace deprecated .dict() method with .model_dump() for Pydantic v2 compatibility.
The project uses Pydantic 2.11.2, where .dict() is deprecated in favor of .model_dump(). Lines 71–75 (and the identical pattern at lines 82–86 in the ExecutionResult class) need to be updated. Replace super().dict(**kwargs) with super().model_dump(**kwargs).
🤖 Prompt for AI Agents
In shared/models/task.py around lines 71–75 (and the identical pattern in the
ExecutionResult class at lines 82–86), the override calls super().dict(**kwargs)
which is deprecated in Pydantic v2; replace those calls with
super().model_dump(**kwargs) and keep kwargs.setdefault("exclude_none", True) so
the method returns a Dict[str, Any] of the model dump with None values excluded.
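A minimal sketch of the Pydantic v2 replacement, assuming the override only needs the exclude_none default; the fields on ExecutionResult are illustrative, not the real model's:

from typing import Any, Dict, Optional
from pydantic import BaseModel

class ExecutionResult(BaseModel):
    status: str
    detail: Optional[str] = None

    def model_dump(self, **kwargs: Any) -> Dict[str, Any]:
        # Exclude None values by default, mirroring the old dict() override
        kwargs.setdefault("exclude_none", True)
        return super().model_dump(**kwargs)

print(ExecutionResult(status="ok").model_dump())  # {'status': 'ok'}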
Summary
Test plan