
Add fuzzy package index search helper#69

Open
thedalbee wants to merge 4 commits into cxlinux-ai:main from thedalbee:codex/package-fuzzy-search-32

Conversation

@thedalbee


Summary

Adds a scoped package-search helper for generated APT Packages indexes, focused on the first layer of #32: fuzzy matching, typo handling, synonym/natural-language queries, ranked results, tests, and docs.

This stays read-only against repository metadata and does not install or execute package contents.

Covered behavior:

  • Exact and fuzzy package-name matching, e.g. postgress -> postgresql.
  • Short natural-language/synonym queries, e.g. web server, graphics card, and hardening.
  • Ranked result output with score and match reason.
  • Plain Packages and Packages.gz index loading.
  • Graceful missing-index errors.
  • Focused fixture tests via make test-search.
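As a rough illustration of the plain `Packages` and `Packages.gz` loading listed above, here is a minimal sketch using only the standard library. The function names `read_index_text` and `parse_stanzas` are hypothetical and are not taken from `apt/scripts/search-packages.py`; the sketch assumes the usual blank-line-separated stanza layout of an APT index.

```python
import gzip
from pathlib import Path


def read_index_text(path: Path) -> str:
    """Return the decoded text of a plain or gzip-compressed Packages index."""
    if path.suffix == ".gz":
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
            return handle.read()
    return path.read_text(encoding="utf-8", errors="replace")


def parse_stanzas(text: str) -> list[dict[str, str]]:
    """Split an index into stanzas and collect each stanza's 'Field: value' pairs."""
    packages = []
    for block in text.split("\n\n"):
        fields: dict[str, str] = {}
        last_key = None
        for line in block.splitlines():
            if line[:1] in (" ", "\t") and last_key:
                # Continuation line: append it to the previous field.
                fields[last_key] += "\n" + line.strip()
            elif ":" in line:
                key, _, value = line.partition(":")
                last_key = key.strip()
                fields[last_key] = value.strip()
        # Keep only stanzas that actually name a package.
        if "Package" in fields:
            packages.append(fields)
    return packages
```

Both functions only read metadata, which matches the read-only constraint stated in the summary.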

Closes #32.

Scope note

The issue also mentions LLM-powered search, category filtering, and search history. This PR intentionally provides a deterministic, no-new-dependency foundation inside cx-distro first. It can be wired into a larger cortex search or LLM-backed flow later without requiring privileged distro build/install scripts.

Verification

  • python3 -m py_compile apt/scripts/search-packages.py - passed
  • bash -n tests/search-packages-test.sh - passed
  • make test-search - passed
  • git diff --check - passed
  • Repo vet was run and reported existing CRITICAL-risk privileged/destructive scripts in this repository, so I did not run host install/build/package commands.

AI disclosure

AI-assisted with Codex; I reviewed the final diff and verification output before submitting.

Add a no-dependency APT Packages index search helper so CX Linux can show useful fuzzy, synonym-aware package matches before broader AI-backed install flows exist. The PR keeps the implementation scoped to repository metadata and fixture tests because the distro build scripts require privileged host operations and are not safe to run during this change.

Constraint: Existing repository install/build scripts include privileged and destructive host operations, so verification is limited to static Python and fixture shell tests.

Rejected: Implement full LLM-backed cortex search in this repo | the issue spans cx-core behavior and would be too broad for a small distro-side PR.

Confidence: medium

Scope-risk: narrow

Directive: Keep this helper read-only against Packages indexes; do not make it install or execute package contents.

Tested: python3 -m py_compile apt/scripts/search-packages.py

Tested: bash -n tests/search-packages-test.sh

Tested: make test-search

Tested: git diff --check

Not-tested: Full distro build or package installation due to privileged host-side scripts flagged by repo vet.
@thedalbee thedalbee requested a review from a team as a code owner May 13, 2026 01:05
@coderabbitai

coderabbitai Bot commented May 13, 2026


Warning

Rate limit exceeded

@thedalbee has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 31 minutes and 52 seconds before requesting another review.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bedda739-f48a-4ed3-b5bc-930276253b22

📥 Commits

Reviewing files that changed from the base of the PR and between cfd0627 and 17a422e.

📒 Files selected for processing (4)
  • Makefile
  • apt/README.md
  • apt/scripts/search-packages.py
  • tests/search-packages-test.sh

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a 'Smart Package Search' utility for querying APT package indexes with fuzzy matching and synonym support. The changes include the core Python script, documentation in the README, and a new test suite integrated into the Makefile. The code review feedback focuses on performance optimizations and bug fixes: specifically, optimizing the Levenshtein distance calculation, pre-calculating query and package tokens to avoid redundant processing within the search loop, and preventing duplicate results by ensuring only one index file (compressed or uncompressed) is loaded per directory.

return {token for token in normalize(value).split() if token}


def levenshtein(a: str, b: str) -> int:


medium

The Levenshtein distance calculation is called for every package name and term during search. Since the scoring logic only cares about distances up to 2 (see lines 149 and 160), you can optimize this by returning early if the length difference between a and b is already greater than the threshold.
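A minimal sketch of that early return, assuming a threshold of 2 as described in the comment. The name `bounded_levenshtein` and the `limit` parameter are hypothetical; the PR's actual `levenshtein` body is not shown here. Since every insertion or deletion changes the length by one, a length gap larger than the limit already guarantees the distance exceeds it.

```python
def bounded_levenshtein(a: str, b: str, limit: int = 2) -> int:
    """Return the edit distance between a and b, capped at limit + 1."""
    # A length gap larger than the limit cannot be bridged within the limit.
    if abs(len(a) - len(b)) > limit:
        return limit + 1
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b), # substitution or match
            ))
        # Row minima never decrease, so the final distance is already too large.
        if min(current) > limit:
            return limit + 1
        previous = current
    return min(previous[-1], limit + 1)
```

With this shape, callers can treat any return value above `limit` as "no fuzzy match" without caring about the exact distance.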

Comment thread apt/scripts/search-packages.py Outdated
Comment on lines +125 to +126
query_norm = normalize(query)
query_terms = expanded_query_terms(query)


medium

These query-related values (query_norm, query_terms, and the tokens used in Levenshtein/fuzzy checks) are recalculated for every package in the search loop. For better performance, especially with larger indexes, these should be computed once in the search function and passed as arguments to score_package.
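One way to read this suggestion is sketched below: build the query state once in `search` and hand it to `score_package`. The `QueryState` class, the `ALIASES` table, and the scoring weights are illustrative assumptions, not the PR's real data or ranking.

```python
import re
from dataclasses import dataclass

# Hypothetical alias table; the PR's real synonym data is not shown in this thread.
ALIASES = {"web server": ("nginx", "apache2")}


def normalize(value: str) -> str:
    """Lowercase and collapse every non-alphanumeric run to a single space."""
    return re.sub(r"[^a-z0-9]+", " ", value.lower()).strip()


@dataclass(frozen=True)
class QueryState:
    norm: str
    terms: frozenset


def build_query_state(query: str) -> QueryState:
    norm = normalize(query)
    terms = set(norm.split())
    terms.update(ALIASES.get(norm, ()))
    return QueryState(norm, frozenset(terms))


def score_package(state: QueryState, name: str, description: str) -> int:
    """Score one package against query state that was prepared by the caller."""
    package_terms = set(normalize(f"{name} {description}").split())
    score = 100 if normalize(name) == state.norm else 0
    return score + 10 * len(state.terms & package_terms)


def search(query: str, packages: list[tuple[str, str]]) -> list[str]:
    state = build_query_state(query)  # computed once, not once per package
    scored = [(score_package(state, name, desc), name) for name, desc in packages]
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [name for score, name in ranked if score > 0]
```

Passing a single immutable state object keeps `score_package` free of repeated normalization while leaving its per-package logic unchanged.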

query_norm = normalize(query)
query_terms = expanded_query_terms(query)
name_norm = normalize(package.name)
package_terms = tokens(f"{package.name} {package.description}")


medium

package_terms is calculated for every package on every search. This involves string concatenation, regex normalization, and tokenization. Since the package index is static during the script execution, these tokens should be pre-calculated once when the index is parsed and stored in the Package object to improve search responsiveness.
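A possible shape for that pre-calculation, sketched with a frozen dataclass that tokenizes once at construction time. The field name `terms` and this `Package` layout are assumptions for illustration; the PR's actual `Package` object may carry different fields.

```python
import re
from dataclasses import dataclass, field


def tokens(value: str) -> frozenset:
    """Return the set of lowercase alphanumeric tokens in value."""
    return frozenset(re.sub(r"[^a-z0-9]+", " ", value.lower()).split())


@dataclass(frozen=True)
class Package:
    name: str
    description: str
    # Derived once when the stanza is parsed; excluded from equality and hashing.
    terms: frozenset = field(init=False, compare=False, repr=False)

    def __post_init__(self) -> None:
        # Frozen dataclasses block normal assignment, so set the cache directly.
        object.__setattr__(self, "terms", tokens(f"{self.name} {self.description}"))
```

The scoring loop can then read `package.terms` directly instead of concatenating and tokenizing on every search.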

Comment thread apt/scripts/search-packages.py Outdated
dists = repo_root / "dists"
if not dists.exists():
    return []
return sorted(dists.glob("**/Packages")) + sorted(dists.glob("**/Packages.gz"))


medium

The current implementation returns both Packages and Packages.gz if they both exist in the same directory. This causes load_packages to parse the same data twice, leading to duplicate results in the search output. You can use a dictionary to keep only one index file per directory, preferring the compressed version.

Suggested change
- return sorted(dists.glob("**/Packages")) + sorted(dists.glob("**/Packages.gz"))
+ found = {p.parent: p for p in sorted(dists.glob("**/Packages*")) if p.name in ("Packages", "Packages.gz")}
+ return sorted(found.values())
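The suggested comprehension depends on sort order: `Packages` sorts before `Packages.gz`, so the compressed file overwrites the plain one for the same directory. A slightly more explicit sketch makes the preference a parameter; the function name `pick_index_files` and the `prefer` argument are hypothetical and not part of the PR.

```python
from pathlib import Path


def pick_index_files(dists: Path, prefer: str = "Packages.gz") -> list[Path]:
    """Return one index path per directory under dists, favouring `prefer`."""
    chosen: dict[Path, Path] = {}
    for path in sorted(dists.glob("**/Packages*")):
        # Ignore other encodings such as Packages.xz in this sketch.
        if path.name not in ("Packages", "Packages.gz"):
            continue
        current = chosen.get(path.parent)
        # Take the first file seen, then replace it only with the preferred one.
        if current is None or path.name == prefer:
            chosen[path.parent] = path
    return sorted(chosen.values())
```

Calling it with `prefer="Packages"` would make the plain file win instead, which only changes which encoding is read, not how many rows each directory contributes.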

dalbee added 2 commits May 13, 2026 10:07
Fix the default repository discovery path so plain Packages files win over matching Packages.gz files, preventing duplicate results in normal APT repositories. Also let typoed synonym queries resolve through a bounded fuzzy alias pass while keeping the guard narrow enough to avoid unrelated alias expansion.

Constraint: Follow-up review found the documented default path could duplicate package rows when both index encodings exist.

Rejected: Parse both encodings and dedupe by package name | would hide real duplicate package stanzas across suites or architectures.

Confidence: high

Scope-risk: narrow

Tested: python3 -m py_compile apt/scripts/search-packages.py

Tested: bash -n tests/search-packages-test.sh

Tested: make test-search

Tested: git diff --check
Review feedback called out redundant scoring work in the smart package search loop. The search path now builds normalized query state once and caches edit-distance calculations while keeping the ranking behavior covered by the existing fixture tests.

Constraint: Keep the helper dependency-free and preserve the current CLI output

Rejected: Add a heavier search index abstraction | unnecessary for the current APT Packages helper

Confidence: high

Scope-risk: narrow

Tested: python3 -m py_compile apt/scripts/search-packages.py

Tested: bash -n tests/search-packages-test.sh

Tested: make test-search

Tested: git diff --check
@thedalbee

Author

Follow-up pushed in 21c5408 after reviewing the current Gemini feedback:

  • Query normalization, query tokens, and expanded synonym terms are now built once per search instead of once per package.
  • Levenshtein distance results are cached to reduce repeated fuzzy-scoring work.
  • Existing duplicate Packages/Packages.gz loading coverage remains in tests.
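The thread does not show how the edit-distance cache is implemented. One dependency-free way to get this effect, offered here only as an assumption, is `functools.lru_cache` around a plain Levenshtein function:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def levenshtein(a: str, b: str) -> int:
    """Classic two-row Levenshtein distance; repeated (a, b) pairs hit the cache."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b), # substitution or match
            ))
        previous = current
    return previous[-1]
```

Because the same query term is compared against many package tokens that recur across descriptions, repeated pairs are answered from the cache; `levenshtein.cache_info()` reports the hit count if that needs checking.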

Verification rerun:

  • python3 -m py_compile apt/scripts/search-packages.py
  • bash -n tests/search-packages-test.sh
  • make test-search
  • git diff --check

Subagent review found two correctness gaps in the package search helper: punctuation-only queries normalized to an empty string and matched every package, and overlapping indexes could return duplicate package rows. The search path now returns no results for empty normalized queries and deduplicates both index paths and parsed package identities.

Constraint: Preserve the existing no-dependency CLI and current ranking behavior for valid queries

Rejected: Add persistent package index state | unnecessary for a single-command helper

Confidence: high

Scope-risk: narrow

Tested: python3 -m py_compile apt/scripts/search-packages.py

Tested: bash -n tests/search-packages-test.sh

Tested: make test-search

Tested: git diff --check
@thedalbee

Author

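Second follow-up pushed in 17a422e after an additional review pass found two correctness gaps (empty normalized queries and duplicate package rows). The push also includes a fuzzy-scoring shortcut and regression tests: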

  • Punctuation-only queries such as !!! now return no results instead of matching every package through an empty normalized string.
  • Duplicate explicit --index paths are skipped.
  • Overlapping package indexes are deduped by package identity so the same package/version/description is not listed twice.
  • The fuzzy-term scoring loop now avoids edit-distance work when token lengths are already too far apart.
  • Added regression coverage for empty normalized queries and duplicate index/package handling.
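The two correctness fixes can be sketched as follows. The substring matching here is a deliberately simplified stand-in for the real scoring, and `dedupe_packages` with its `(name, version, description)` identity tuple is an assumption based on the description above, not the PR's actual code. The empty-query problem arises because `"" in some_string` is always true in Python.

```python
import re


def normalize(value: str) -> str:
    """Lowercase and collapse every non-alphanumeric run to a single space."""
    return re.sub(r"[^a-z0-9]+", " ", value.lower()).strip()


def dedupe_packages(packages: list[dict]) -> list[dict]:
    """Keep the first package seen for each (name, version, description) identity."""
    seen = set()
    unique = []
    for package in packages:
        identity = (
            package["Package"],
            package.get("Version", ""),
            package.get("Description", ""),
        )
        if identity in seen:
            continue
        seen.add(identity)
        unique.append(package)
    return unique


def search(query: str, packages: list[dict]) -> list[dict]:
    query_norm = normalize(query)
    # "!!!" normalizes to "", and an empty string is a substring of every name.
    if not query_norm:
        return []
    return [
        package
        for package in dedupe_packages(packages)
        if query_norm in normalize(package["Package"])
    ]
```

Keying on the full identity tuple rather than the name alone keeps genuinely different versions of a package visible.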

Verification rerun:

  • python3 -m py_compile apt/scripts/search-packages.py
  • bash -n tests/search-packages-test.sh
  • make test-search
  • git diff --check


Development

Successfully merging this pull request may close these issues.

Smart Package Search with Fuzzy Matching

1 participant