A Python library for cloning, indexing, and semantically searching Git repositories using embeddings (OpenAI or SBERT) and Chroma, plus a high-level `agent_query` for building autonomous code agents.
## Features

- Clone or update any Git repository with a single call
- Extract semantic code chunks via Tree-sitter grammars (functions, classes, methods, etc.)
- Fall back to line-based chunking for unsupported languages or large files
- Embed code or text with your choice of:
  - OpenAI's `text-embedding-ada-002` via `OpenAIEmbeddings`
  - A local SBERT model (e.g. `microsoft/graphcodebert-base`) via `SBERTEmbeddings`
- Persist the vector store in a `.kno/` folder using Chroma
- Auto-commit & push the embedding database back to your repo
- Fast similarity search over indexed code chunks
- Autonomous agent for code analysis via `agent_query()`
## Installation

```bash
pip install kno-sdk
```

## Quick Start

```python
from kno_sdk import clone_and_index, search, EmbeddingMethod

# 1. Clone (or pull) and index a repository
repo_index = clone_and_index(
    repo_url="https://github.com/SyedGhazanferAnwar/NestJs-MovieApp",
    branch="master",
    embedding=EmbeddingMethod.SBERT,  # or EmbeddingMethod.OPENAI
    cloned_repo_base_dir="repos"      # where to clone locally
)
print("Indexed at:", repo_index.path)
print("Directory snapshot:\n", repo_index.digest)

# 2. Perform semantic search
results = search(
    repo_url="https://github.com/SyedGhazanferAnwar/NestJs-MovieApp",
    branch="master",
    embedding=EmbeddingMethod.SBERT,
    cloned_repo_base_dir="repos",
    query="NestFactory",
    k=5
)
for i, chunk in enumerate(results, 1):
    print(f"--- Result #{i} ---\n{chunk}\n")
```
```python
# 3. Autonomous code-analysis agent
from kno_sdk import agent_query, EmbeddingMethod, LLMProvider

# First create a repo index
repo_index = clone_and_index(
    repo_url="https://github.com/WebGoat/WebGoat",
    branch="main",
    embedding=EmbeddingMethod.SBERT,
    cloned_repo_base_dir="repos"
)

# Then use the index with agent_query
result = agent_query(
    repo_index=repo_index,
    llm_provider=LLMProvider.ANTHROPIC,
    llm_model="claude-3-haiku-20240307",
    llm_temperature=0.0,
    llm_max_tokens=4096,
    llm_system_prompt="You are a senior code-analysis agent.",
    prompt="Find issues, bugs and vulnerabilities in this repo, and explain each with exact code locations.",
    MODEL_API_KEY="your_api_key_here"
)
print(result)
```

## API Reference

### `clone_and_index`

Clone (or pull) a repository, embed its files, and persist a Chroma database in a `.kno/` folder. Finally, commit & push the `.kno/` folder back to the original repo.
```python
def clone_and_index(
    repo_url: str,
    branch: str = "main",
    embedding: EmbeddingMethod = EmbeddingMethod.SBERT,
    cloned_repo_base_dir: str = "."
) -> RepoIndex
```

- `repo_url` — Git HTTPS/SSH URL
- `branch` — branch to clone or update (default: `main`)
- `embedding` — `EmbeddingMethod.OPENAI` or `EmbeddingMethod.SBERT`
- `cloned_repo_base_dir` — local directory to clone into (default: current working dir)

Returns a `RepoIndex` object with:

- `path: pathlib.Path` — local clone directory
- `digest: str` — textual snapshot of the directory tree
- `vector_store: Chroma` — the Chroma collection instance
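For reference, the returned object can be pictured as a simple dataclass. This is only a sketch for illustration: the field names come from the list above, but the exact class definition in the library is an assumption.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any

@dataclass
class RepoIndex:
    """Sketch of the object returned by clone_and_index (illustrative)."""
    path: Path        # local clone directory
    digest: str       # textual snapshot of the directory tree
    vector_store: Any  # the Chroma collection instance
```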
### `search`

Run a similarity search on an existing `.kno/` Chroma database.
```python
def search(
    repo_url: str,
    branch: str = "main",
    embedding: EmbeddingMethod = EmbeddingMethod.SBERT,
    query: str = "",
    k: int = 8,
    cloned_repo_base_dir: str = "."
) -> List[str]
```

- `query` — your natural-language or code search prompt
- `k` — number of top results to return

Returns a list of the top-`k` matching code/text chunks.
### `agent_query`

High-level agent that clones, indexes, and then iteratively uses tools (`search_code`, `read_file`, etc.) plus an LLM to fulfill your prompt.
```python
def agent_query(
    repo_url: str,
    branch: str = "main",
    embedding: EmbeddingMethod = EmbeddingMethod.SBERT,
    cloned_repo_base_dir: str = str(Path.cwd()),
    llm_provider: LLMProvider = LLMProvider.ANTHROPIC,
    llm_model: str = "claude-3-haiku-20240307",
    llm_temperature: float = 0.0,
    llm_max_tokens: int = 4096,
    llm_system_prompt: str = "",
    prompt: str = "",
    MODEL_API_KEY: str = "",
) -> str
```

- `repo_url`, `branch`, `embedding`, `cloned_repo_base_dir` — same as above
- `llm_provider` — `LLMProvider.OPENAI` or `LLMProvider.ANTHROPIC`
- `llm_model` — model name (e.g. `"gpt-4"` or `"claude-3-haiku-20240307"`)
- `llm_temperature`, `llm_max_tokens` — sampling parameters
- `llm_system_prompt` — initial system message for the agent
- `prompt` — your user query/task description
- `MODEL_API_KEY` — sets `OPENAI_API_KEY` or `ANTHROPIC_API_KEY` accordingly

Returns the agent's Final Answer as a string.
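The way a "Final Answer" is typically pulled out of the model's last message can be sketched as follows. This helper is purely illustrative; the library's actual parsing logic is not shown in this document and may differ.

```python
from typing import Optional

def extract_final_answer(llm_output: str) -> Optional[str]:
    """Return the text after a 'Final Answer:' marker, or None if absent.

    Illustrative sketch: agent loops of this style keep iterating until
    the model emits the marker, then return everything after it.
    """
    marker = "Final Answer:"
    idx = llm_output.find(marker)
    if idx == -1:
        return None
    return llm_output[idx + len(marker):].strip()
```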
### `EmbeddingMethod`

```python
class EmbeddingMethod(str, Enum):
    OPENAI = "OpenAIEmbeddings"
    SBERT = "SBERTEmbeddings"
```

Choose between OpenAI's hosted embeddings or a local SBERT model.
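Because the enum subclasses `str`, members compare equal to their string values, which is convenient when the choice comes from a config file or environment variable. A small self-contained sketch (the enum is repeated here so the snippet runs on its own; `embedding_from_config` is a hypothetical helper, not part of the library):

```python
from enum import Enum

class EmbeddingMethod(str, Enum):
    OPENAI = "OpenAIEmbeddings"
    SBERT = "SBERTEmbeddings"

def embedding_from_config(name: str) -> EmbeddingMethod:
    """Map a config string (e.g. from an env var) to an EmbeddingMethod.

    Hypothetical helper; raises ValueError for unknown names.
    """
    return EmbeddingMethod(name)
```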
## How It Works

1. **Clone or Pull** — uses GitPython to clone depth-1 or pull the latest changes.
2. **Directory Snapshot** — builds a small "digest" of files/folders (up to ~1K tokens).
3. **Chunk Extraction**
   - Tree-sitter for language-aware extraction of functions, classes, etc.
   - Fallback to fixed-size line chunks for unknown languages or large files.
4. **Embedding**
   - Streams each chunk into your chosen embedding backend.
   - Respects a 16,000-token cap per chunk.
5. **Vector Store**
   - Persists embeddings in a namespaced Chroma collection under `.kno/`.
   - Only indexes files once (skips already-populated collections).
6. **Commit & Push** — automatically stages, commits, and pushes `.kno/` back to your remote.
7. **Autonomous Agent**
   - RAG prompt
   - Tool calls (`search_code`, `read_file`, …)
   - Iterative LLM planning & execution
   - Stops on `"Final Answer:"` or max iterations
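The line-based fallback in the chunk-extraction step can be pictured as a simple fixed-size splitter. This is a sketch under an assumed chunk size of 40 lines; the library's actual chunk sizes and overlap handling are not documented here.

```python
from typing import List

def chunk_by_lines(text: str, lines_per_chunk: int = 40) -> List[str]:
    """Split source text into fixed-size line chunks (fallback path).

    Illustrative only: real chunkers often add overlap between chunks
    and enforce a token cap rather than a pure line count.
    """
    lines = text.splitlines()
    return [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]
```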
## Default Exclusions

- Skip directories: `.git`, `node_modules`, `build`, `dist`, `target`, `.vscode`, `.kno`
- Skip files: `package-lock.json`, `yarn.lock`, `.prettierignore`
- Binary extensions: common image, audio, video, archive, font, and binary file types

All of the above can be modified by forking the source and adjusting the `skip_dirs`, `skip_files`, and `BINARY_EXTS` sets.
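A sketch of how such skip sets typically drive the directory walk. The directory and file names below mirror the lists above; the binary extensions shown are representative examples (the library's full set is larger), and the walk itself is illustrative rather than the library's actual implementation.

```python
import os
from pathlib import Path
from typing import Iterator

SKIP_DIRS = {".git", "node_modules", "build", "dist", "target", ".vscode", ".kno"}
SKIP_FILES = {"package-lock.json", "yarn.lock", ".prettierignore"}
BINARY_EXTS = {".png", ".jpg", ".mp3", ".mp4", ".zip", ".woff", ".exe"}  # sample subset

def iter_indexable_files(root: str) -> Iterator[Path]:
    """Yield files worth indexing, pruning skipped directories in place."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Mutating dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            if name in SKIP_FILES:
                continue
            if Path(name).suffix.lower() in BINARY_EXTS:
                continue
            yield Path(dirpath) / name
```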
## Contributing

1. Fork this repo
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

Please run `pytest` before submitting and follow the existing code style.