Add fuzzy package index search helper #69
Add a no-dependency APT Packages index search helper so CX Linux can show useful fuzzy, synonym-aware package matches before broader AI-backed install flows exist. The PR keeps the implementation scoped to repository metadata and fixture tests because the distro build scripts require privileged host operations and are not safe to run during this change.

Constraint: Existing repository install/build scripts include privileged and destructive host operations, so verification is limited to static Python and fixture shell tests.
Rejected: Implement full LLM-backed cortex search in this repo | the issue spans cx-core behavior and would be too broad for a small distro-side PR.
Confidence: medium
Scope-risk: narrow
Directive: Keep this helper read-only against Packages indexes; do not make it install or execute package contents.
Tested: python3 -m py_compile apt/scripts/search-packages.py
Tested: bash -n tests/search-packages-test.sh
Tested: make test-search
Tested: git diff --check
Not-tested: Full distro build or package installation due to privileged host-side scripts flagged by repo vet.
Code Review
This pull request introduces a 'Smart Package Search' utility for querying APT package indexes with fuzzy matching and synonym support. The changes include the core Python script, documentation in the README, and a new test suite integrated into the Makefile. The code review feedback focuses on performance optimizations and bug fixes: specifically, optimizing the Levenshtein distance calculation, pre-calculating query and package tokens to avoid redundant processing within the search loop, and preventing duplicate results by ensuring only one index file (compressed or uncompressed) is loaded per directory.
    return {token for token in normalize(value).split() if token}

def levenshtein(a: str, b: str) -> int:
    query_norm = normalize(query)
    query_terms = expanded_query_terms(query)
These query-related values (query_norm, query_terms, and the tokens used in Levenshtein/fuzzy checks) are recalculated for every package in the search loop. For better performance, especially with larger indexes, these should be computed once in the search function and passed as arguments to score_package.
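The suggestion above can be sketched as follows. This is a minimal illustration, not the PR's code: `QueryState`, `build_query_state`, the `SYNONYMS` table, and the scoring weights are all hypothetical, and `normalize` is an assumed stand-in for the helper of the same name in `search-packages.py`.

```python
import re
from dataclasses import dataclass

# Hypothetical alias table; the real synonym data in the PR is not shown here.
SYNONYMS = {"web server": {"nginx", "apache2"}}


def normalize(value: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics to one space (assumed behavior)."""
    return re.sub(r"[^a-z0-9]+", " ", value.lower()).strip()


@dataclass(frozen=True)
class QueryState:
    """Everything derived from the query, built once per search."""
    norm: str
    terms: frozenset


def build_query_state(query: str) -> QueryState:
    norm = normalize(query)
    terms = set(norm.split()) | SYNONYMS.get(norm, set())
    return QueryState(norm=norm, terms=frozenset(terms))


def score_package(state: QueryState, name: str, description: str) -> int:
    """Score one package using the precomputed query state (illustrative weights)."""
    package_terms = set(normalize(f"{name} {description}").split())
    score = 100 if state.norm == normalize(name) else 0
    return score + 10 * len(state.terms & package_terms)


def search(query: str, packages: list) -> list:
    state = build_query_state(query)  # computed once, outside the loop
    scored = [(score_package(state, name, desc), name) for name, desc in packages]
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [name for score, name in ranked if score > 0]
```

Passing a single state object keeps the `score_package` signature stable if more derived query values are added later.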
    query_norm = normalize(query)
    query_terms = expanded_query_terms(query)
    name_norm = normalize(package.name)
    package_terms = tokens(f"{package.name} {package.description}")
package_terms is calculated for every package on every search. This involves string concatenation, regex normalization, and tokenization. Since the package index is static during the script execution, these tokens should be pre-calculated once when the index is parsed and stored in the Package object to improve search responsiveness.
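One way to do this, as a rough sketch: compute the token set when the `Package` object is constructed. The field names, the `tokens` regex, and the simplified `parse_stanzas` function below are assumptions for illustration. In particular, this parser ignores continuation lines, which a real Packages parser would need to handle.

```python
import re
from dataclasses import dataclass, field


def tokens(value: str) -> frozenset:
    """Split a string into lowercase alphanumeric tokens (assumed tokenization)."""
    return frozenset(re.findall(r"[a-z0-9]+", value.lower()))


@dataclass(frozen=True)
class Package:
    name: str
    description: str
    # Derived once at construction time, so the search loop never re-tokenizes.
    terms: frozenset = field(init=False, compare=False)

    def __post_init__(self):
        # Frozen dataclasses block normal assignment, hence object.__setattr__.
        object.__setattr__(self, "terms", tokens(f"{self.name} {self.description}"))


def parse_stanzas(text: str) -> list:
    """Parse blank-line-separated stanzas, keeping only Package and Description."""
    packages = []
    for stanza in text.strip().split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            if line[:1] in (" ", "\t") or ":" not in line:
                continue  # continuation lines are skipped in this sketch
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        if "Package" in fields:
            packages.append(Package(fields["Package"], fields.get("Description", "")))
    return packages
```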
    dists = repo_root / "dists"
    if not dists.exists():
        return []
    return sorted(dists.glob("**/Packages")) + sorted(dists.glob("**/Packages.gz"))
The current implementation returns both Packages and Packages.gz if they both exist in the same directory. This causes load_packages to parse the same data twice, leading to duplicate results in the search output. You can use a dictionary to keep only one index file per directory, preferring the compressed version.
-    return sorted(dists.glob("**/Packages")) + sorted(dists.glob("**/Packages.gz"))
+    found = {p.parent: p for p in sorted(dists.glob("**/Packages*")) if p.name in ("Packages", "Packages.gz")}
+    return sorted(found.values())
Fix the default repository discovery path so plain Packages files win over matching Packages.gz files, preventing duplicate results in normal APT repositories. Also let typoed synonym queries resolve through a bounded fuzzy alias pass while keeping the guard narrow enough to avoid unrelated alias expansion.

Constraint: Follow-up review found the documented default path could duplicate package rows when both index encodings exist.
Rejected: Parse both encodings and dedupe by package name | would hide real duplicate package stanzas across suites or architectures.
Confidence: high
Scope-risk: narrow
Tested: python3 -m py_compile apt/scripts/search-packages.py
Tested: bash -n tests/search-packages-test.sh
Tested: make test-search
Tested: git diff --check
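A minimal sketch of discovery that keeps one index per directory and lets the plain Packages file win, as this commit describes. The function name `find_indexes` and the loop structure are assumptions. Only the `dists` layout and the two file names come from the review snippet.

```python
from pathlib import Path


def find_indexes(repo_root: Path) -> list:
    """Return one index file per directory under dists/, preferring plain Packages."""
    dists = repo_root / "dists"
    if not dists.exists():
        return []
    chosen = {}
    # Visit Packages.gz first so a plain Packages file in the same
    # directory overwrites it in the mapping.
    for name in ("Packages.gz", "Packages"):
        for path in dists.glob(f"**/{name}"):
            chosen[path.parent] = path
    return sorted(chosen.values())
```

Keying on the parent directory avoids parsing the same stanzas twice while still treating different suites and architectures as separate indexes.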
Review feedback called out redundant scoring work in the smart package search loop. The search path now builds normalized query state once and caches edit-distance calculations while keeping the ranking behavior covered by the existing fixture tests.

Constraint: Keep the helper dependency-free and preserve the current CLI output
Rejected: Add a heavier search index abstraction | unnecessary for the current APT Packages helper
Confidence: high
Scope-risk: narrow
Tested: python3 -m py_compile apt/scripts/search-packages.py
Tested: bash -n tests/search-packages-test.sh
Tested: make test-search
Tested: git diff --check
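Cached edit distance can be sketched with the standard library alone. The body below is a common two-row Levenshtein implementation wrapped in `functools.lru_cache`. It is not necessarily what the commit does, and the cache size is an arbitrary assumption.

```python
from functools import lru_cache


@lru_cache(maxsize=4096)  # cache size is an arbitrary choice for this sketch
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b, keeping only two rows of the DP table."""
    if a == b:
        return 0
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]
```

Because many packages share tokens, the same (query token, package token) pair tends to recur within one search, which is where the cache helps.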
Follow-up pushed in 21c5408 after reviewing the current Gemini feedback:
Verification rerun:
Subagent review found two correctness gaps in the package search helper: punctuation-only queries normalized to an empty string and matched every package, and overlapping indexes could return duplicate package rows. The search path now returns no results for empty normalized queries and deduplicates both index paths and parsed package identities.

Constraint: Preserve the existing no-dependency CLI and current ranking behavior for valid queries
Rejected: Add persistent package index state | unnecessary for a single-command helper
Confidence: high
Scope-risk: narrow
Tested: python3 -m py_compile apt/scripts/search-packages.py
Tested: bash -n tests/search-packages-test.sh
Tested: make test-search
Tested: git diff --check
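Both fixes can be illustrated with a short sketch. The commit does not say what a "package identity" consists of, so the (Package, Version, Architecture) tuple below is an assumption, as are the function names and the substring matching used in place of the real scoring.

```python
import re


def normalize(value: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics to one space (assumed behavior)."""
    return re.sub(r"[^a-z0-9]+", " ", value.lower()).strip()


def dedupe(packages: list) -> list:
    """Drop repeated stanzas, keeping the first occurrence of each identity."""
    seen = set()
    unique = []
    for pkg in packages:
        # Assumed identity: name, version, and architecture together.
        identity = (pkg["Package"], pkg.get("Version", ""), pkg.get("Architecture", ""))
        if identity not in seen:
            seen.add(identity)
            unique.append(pkg)
    return unique


def search(query: str, packages: list) -> list:
    query_norm = normalize(query)
    if not query_norm:
        # Punctuation-only input normalizes to "", which would otherwise
        # be a substring of every package and match everything.
        return []
    return [
        pkg for pkg in dedupe(packages)
        if query_norm in normalize(f"{pkg['Package']} {pkg.get('Description', '')}")
    ]
```

Including version and architecture in the identity keeps genuinely distinct stanzas across suites or architectures visible, which matches the concern raised in the earlier "Rejected" trailer.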
Second follow-up pushed in 17a422e after an additional review pass found two correctness gaps:
Verification rerun:
Summary
Adds a scoped package-search helper for generated APT Packages indexes, focused on the first layer of #32: fuzzy matching, typo handling, synonym/natural-language queries, ranked results, tests, and docs.

This stays read-only against repository metadata and does not install or execute package contents.
Covered behavior:
Typo handling, such as postgress -> postgresql.
Synonym/natural-language queries, such as web server, graphics card, and hardening.
Packages and Packages.gz index loading.
Fixture tests via make test-search.

Closes #32.
Scope note
The issue also mentions LLM-powered search, category filtering, and search history. This PR intentionally provides a deterministic, no-new-dependency foundation inside cx-distro first. It can be wired into a larger cortex search or LLM-backed flow later without requiring privileged distro build/install scripts.

Verification
python3 -m py_compile apt/scripts/search-packages.py - passed
bash -n tests/search-packages-test.sh - passed
make test-search - passed
git diff --check - passed

AI disclosure
AI-assisted with Codex; I reviewed the final diff and verification output before submitting.