Add fuzzy package index search helper by thedalbee · Pull Request #69 · cxlinux-ai/cx-distro

thedalbee · 2026-05-13T01:05:22Z

Summary

Adds a scoped package-search helper for generated APT Packages indexes, focused on the first layer of #32: fuzzy matching, typo handling, synonym/natural-language queries, ranked results, tests, and docs.

This stays read-only against repository metadata and does not install or execute package contents.

Covered behavior:

Exact and fuzzy package-name matching, e.g. postgress -> postgresql.
Short natural-language/synonym queries, e.g. web server, graphics card, and hardening.
Ranked result output with score and match reason.
Plain Packages and Packages.gz index loading.
Graceful missing-index errors.
Focused fixture tests via make test-search.

Closes #32.

Scope note

The issue also mentions LLM-powered search, category filtering, and search history. This PR intentionally provides a deterministic, no-new-dependency foundation inside cx-distro first. It can be wired into a larger cortex search or LLM-backed flow later without requiring privileged distro build/install scripts.

Verification

python3 -m py_compile apt/scripts/search-packages.py - passed
bash -n tests/search-packages-test.sh - passed
make test-search - passed
git diff --check - passed
Repo vet was run and reported existing CRITICAL-risk privileged/destructive scripts in this repository, so I did not run host install/build/package commands.

AI disclosure

AI-assisted with Codex; I reviewed the final diff and verification output before submitting.

Add a no-dependency APT Packages index search helper so CX Linux can show useful fuzzy, synonym-aware package matches before broader AI-backed install flows exist. The PR keeps the implementation scoped to repository metadata and fixture tests because the distro build scripts require privileged host operations and are not safe to run during this change.

Constraint: Existing repository install/build scripts include privileged and destructive host operations, so verification is limited to static Python and fixture shell tests.
Rejected: Implement full LLM-backed cortex search in this repo | the issue spans cx-core behavior and would be too broad for a small distro-side PR.
Confidence: medium
Scope-risk: narrow
Directive: Keep this helper read-only against Packages indexes; do not make it install or execute package contents.
Tested: python3 -m py_compile apt/scripts/search-packages.py
Tested: bash -n tests/search-packages-test.sh
Tested: make test-search
Tested: git diff --check
Not-tested: Full distro build or package installation, due to privileged host-side scripts flagged by repo vet.

coderabbitai · 2026-05-13T01:06:08Z

Warning

Rate limit exceeded

@thedalbee has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 31 minutes and 52 seconds before requesting another review.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bedda739-f48a-4ed3-b5bc-930276253b22

📥 Commits

Reviewing files that changed from the base of the PR and between cfd0627 and 17a422e.

📒 Files selected for processing (4)

Makefile
apt/README.md
apt/scripts/search-packages.py
tests/search-packages-test.sh

gemini-code-assist

Code Review

This pull request introduces a 'Smart Package Search' utility for querying APT package indexes with fuzzy matching and synonym support. The changes include the core Python script, documentation in the README, and a new test suite integrated into the Makefile. The code review feedback focuses on performance optimizations and bug fixes: specifically, optimizing the Levenshtein distance calculation, pre-calculating query and package tokens to avoid redundant processing within the search loop, and preventing duplicate results by ensuring only one index file (compressed or uncompressed) is loaded per directory.

gemini-code-assist · 2026-05-13T01:06:53Z

+    return {token for token in normalize(value).split() if token}
+
+
+def levenshtein(a: str, b: str) -> int:


The Levenshtein distance calculation is called for every package name and term during search. Since the scoring logic only cares about distances up to 2 (see lines 149 and 160), you can optimize this by returning early if the length difference between a and b is already greater than the threshold.
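
The early exit suggested in this comment could look roughly like the sketch below. It is not the code from the PR: the name `bounded_levenshtein` and the `max_distance` parameter are hypothetical, and the default of 2 is taken from the threshold mentioned in the comment. The function returns `max_distance + 1` as soon as the true distance is known to exceed the bound.

```python
def bounded_levenshtein(a: str, b: str, max_distance: int = 2) -> int:
    """Return the edit distance, capped at max_distance + 1.

    Hypothetical helper: the cap value simply means "too far apart".
    """
    # The distance is never smaller than the length difference,
    # so the table can be skipped entirely in this case.
    if abs(len(a) - len(b)) > max_distance:
        return max_distance + 1
    if a == b:
        return 0

    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        # Row minima never decrease, so once a whole row is over the
        # bound the final distance must be over it as well.
        if min(current) > max_distance:
            return max_distance + 1
        previous = current
    return min(previous[-1], max_distance + 1)
```

Callers that only compare the result against the threshold do not need the exact distance for distant pairs, which is why capping the return value is acceptable here.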

gemini-code-assist · 2026-05-13T01:06:53Z

+    query_norm = normalize(query)
+    query_terms = expanded_query_terms(query)


These query-related values (query_norm, query_terms, and the tokens used in Levenshtein/fuzzy checks) are recalculated for every package in the search loop. For better performance, especially with larger indexes, these should be computed once in the search function and passed as arguments to score_package.
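
A minimal sketch of the hoisting described in this comment, under assumptions: `normalize` here is a stand-in, not the PR's implementation; `QueryState` and `build_query_state` are hypothetical names; the scoring is a toy; and synonym expansion is omitted. The only point is that query-derived values are built once in `search` and passed into `score_package`.

```python
import re
from dataclasses import dataclass


def normalize(value: str) -> str:
    # Stand-in normalizer (assumption): lowercase, and every run of
    # non-alphanumeric characters becomes a single space.
    return " ".join(re.sub(r"[^a-z0-9]+", " ", value.lower()).split())


@dataclass(frozen=True)
class QueryState:
    """Hypothetical container for values derived from the query."""
    norm: str
    terms: frozenset


def build_query_state(query: str) -> QueryState:
    norm = normalize(query)
    return QueryState(norm=norm, terms=frozenset(norm.split()))


def score_package(state: QueryState, name: str, description: str) -> int:
    # Toy scoring (assumption): an exact name match beats term overlap.
    if normalize(name) == state.norm:
        return 100
    package_terms = set(normalize(f"{name} {description}").split())
    return 10 * len(state.terms & package_terms)


def search(query: str, packages: list) -> list:
    state = build_query_state(query)  # built once, not once per package
    scored = [(score_package(state, name, desc), name) for name, desc in packages]
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [name for score, name in ranked if score > 0]
```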

gemini-code-assist · 2026-05-13T01:06:53Z

+    query_norm = normalize(query)
+    query_terms = expanded_query_terms(query)
+    name_norm = normalize(package.name)
+    package_terms = tokens(f"{package.name} {package.description}")


package_terms is calculated for every package on every search. This involves string concatenation, regex normalization, and tokenization. Since the package index is static during the script execution, these tokens should be pre-calculated once when the index is parsed and stored in the Package object to improve search responsiveness.
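
One way to pre-calculate the tokens at parse time is sketched below. The `Package` fields, the `tokens` stand-in, and the simplified `parse_index` are assumptions made for illustration; the real script's parser and data model may differ. Only the `Package` and `Description` fields of a stanza are read.

```python
import re
from dataclasses import dataclass, field


def tokens(value: str) -> frozenset:
    # Stand-in tokenizer (assumption): lowercase alphanumeric words.
    return frozenset(re.sub(r"[^a-z0-9]+", " ", value.lower()).split())


@dataclass
class Package:
    name: str
    description: str
    # Computed once when the object is created, then reused by every search.
    terms: frozenset = field(init=False, repr=False)

    def __post_init__(self) -> None:
        self.terms = tokens(f"{self.name} {self.description}")


def parse_index(text: str) -> list:
    """Parse a small subset of the Packages stanza format."""
    packages = []
    for stanza in text.strip().split("\n\n"):
        fields = {}
        for line in stanza.splitlines():
            # Continuation lines start with whitespace; skip them here.
            if line[:1].isspace() or ":" not in line:
                continue
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        if "Package" in fields:
            packages.append(Package(fields["Package"], fields.get("Description", "")))
    return packages
```

With this shape, scoring can read `package.terms` directly instead of re-tokenizing the name and description on every query.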

gemini-code-assist · 2026-05-13T01:06:54Z

+    dists = repo_root / "dists"
+    if not dists.exists():
+        return []
+    return sorted(dists.glob("**/Packages")) + sorted(dists.glob("**/Packages.gz"))


The current implementation returns both Packages and Packages.gz if they both exist in the same directory. This causes load_packages to parse the same data twice, leading to duplicate results in the search output. You can use a dictionary to keep only one index file per directory, preferring the compressed version.

Suggested change

-    return sorted(dists.glob("**/Packages")) + sorted(dists.glob("**/Packages.gz"))
+    found = {p.parent: p for p in sorted(dists.glob("**/Packages*")) if p.name in ("Packages", "Packages.gz")}
+    return sorted(found.values())
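
The suggestion above relies on sort order: `Packages` sorts before `Packages.gz`, so the later dictionary assignment wins and the compressed file is kept. A more explicit sketch of the same idea follows. The function name `discover_indexes` and the `prefer_compressed` flag are hypothetical; which encoding should win is a design choice, not something the comment settles.

```python
from pathlib import Path

INDEX_NAMES = ("Packages", "Packages.gz")


def discover_indexes(repo_root: Path, prefer_compressed: bool = True) -> list:
    """Return at most one index file per directory under repo_root/dists."""
    dists = repo_root / "dists"
    if not dists.exists():
        return []
    wanted = "Packages.gz" if prefer_compressed else "Packages"
    found = {}
    for path in sorted(dists.glob("**/Packages*")):
        # Ignore other matches such as Packages.xz, and directories.
        if path.name not in INDEX_NAMES or not path.is_file():
            continue
        # Keep the first file seen for a directory, unless the preferred
        # encoding shows up, in which case it replaces the earlier one.
        if path.parent not in found or path.name == wanted:
            found[path.parent] = path
    return sorted(found.values())
```

Making the preference a named parameter avoids depending on the lexical ordering of the two file names.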

Fix the default repository discovery path so plain Packages files win over matching Packages.gz files, preventing duplicate results in normal APT repositories. Also let typoed synonym queries resolve through a bounded fuzzy alias pass while keeping the guard narrow enough to avoid unrelated alias expansion.

Constraint: Follow-up review found the documented default path could duplicate package rows when both index encodings exist.
Rejected: Parse both encodings and dedupe by package name | would hide real duplicate package stanzas across suites or architectures.
Confidence: high
Scope-risk: narrow
Tested: python3 -m py_compile apt/scripts/search-packages.py
Tested: bash -n tests/search-packages-test.sh
Tested: make test-search
Tested: git diff --check

Review feedback called out redundant scoring work in the smart package search loop. The search path now builds normalized query state once and caches edit-distance calculations while keeping the ranking behavior covered by the existing fixture tests.

Constraint: Keep the helper dependency-free and preserve the current CLI output.
Rejected: Add a heavier search index abstraction | unnecessary for the current APT Packages helper.
Confidence: high
Scope-risk: narrow
Tested: python3 -m py_compile apt/scripts/search-packages.py
Tested: bash -n tests/search-packages-test.sh
Tested: make test-search
Tested: git diff --check

thedalbee · 2026-05-13T01:19:42Z

Follow-up pushed in 21c5408 after reviewing the current Gemini feedback:

Query normalization, query tokens, and expanded synonym terms are now built once per search instead of once per package.
Levenshtein distance results are cached to reduce repeated fuzzy-scoring work.
Existing duplicate Packages/Packages.gz loading coverage remains in tests.

Verification rerun:

python3 -m py_compile apt/scripts/search-packages.py
bash -n tests/search-packages-test.sh
make test-search
git diff --check

Subagent review found two correctness gaps in the package search helper: punctuation-only queries normalized to an empty string and matched every package, and overlapping indexes could return duplicate package rows. The search path now returns no results for empty normalized queries and deduplicates both index paths and parsed package identities.

Constraint: Preserve the existing no-dependency CLI and current ranking behavior for valid queries.
Rejected: Add persistent package index state | unnecessary for a single-command helper.
Confidence: high
Scope-risk: narrow
Tested: python3 -m py_compile apt/scripts/search-packages.py
Tested: bash -n tests/search-packages-test.sh
Tested: make test-search
Tested: git diff --check

thedalbee · 2026-05-13T01:21:57Z

Second follow-up pushed in 17a422e after an additional review pass found two correctness gaps:

Punctuation-only queries such as !!! now return no results instead of matching every package through an empty normalized string.
Duplicate explicit --index paths are skipped.
Overlapping package indexes are deduped by package identity so the same package/version/description is not listed twice.
The fuzzy-term scoring loop now avoids edit-distance work when token lengths are already too far apart.
Added regression coverage for empty normalized queries and duplicate index/package handling.
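
The two correctness fixes in the list above can be illustrated with a small sketch. Everything here is an assumption for illustration: the `Package` tuple, the stand-in `normalize`, and the substring matching do not come from the PR. It shows why an empty normalized query must be rejected explicitly (an empty string is a substring of every name) and how identity-based deduplication keeps distinct versions apart.

```python
import re
from typing import NamedTuple


class Package(NamedTuple):
    # Identity (assumption): name, version, and description together.
    name: str
    version: str
    description: str


def normalize(value: str) -> str:
    # Stand-in normalizer (assumption): lowercase alphanumeric words.
    return " ".join(re.sub(r"[^a-z0-9]+", " ", value.lower()).split())


def dedupe(packages: list) -> list:
    """Drop repeated packages while preserving first-seen order."""
    seen = set()
    unique = []
    for package in packages:
        if package in seen:
            continue
        seen.add(package)
        unique.append(package)
    return unique


def search(query: str, packages: list) -> list:
    query_norm = normalize(query)
    if not query_norm:
        # "!!!" normalizes to "", which would otherwise match everything.
        return []
    return [
        package
        for package in dedupe(packages)
        if query_norm in normalize(f"{package.name} {package.description}")
    ]
```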

Verification rerun:

python3 -m py_compile apt/scripts/search-packages.py
bash -n tests/search-packages-test.sh
make test-search
git diff --check

thedalbee requested a review from a team as a code owner May 13, 2026 01:05

gemini-code-assist Bot reviewed May 13, 2026

dalbee added 2 commits May 13, 2026 10:07

thedalbee wants to merge 4 commits into cxlinux-ai:main from thedalbee:codex/package-fuzzy-search-32
