A command-line chatbot agent for building and exploring a knowledge base of research papers from ArXiv. The system provides an interactive terminal interface for finding, downloading, indexing, summarizing, and semantically searching through research papers with LLM-powered capabilities.
Keeping up with the latest research in fast-moving fields like generative AI can be overwhelming. This tool helps researchers:
- Discover relevant papers - Search ArXiv with keyword queries and intelligent ranking
- Build a personal knowledge base - Download, index, and organize papers locally
- Generate AI summaries - Create structured summaries of papers with customizable prompts
- Perform semantic search - Find information across your paper collection using natural language queries
- Conduct deep research - Use hierarchical RAG (Retrieval-Augmented Generation) to synthesize insights across multiple papers with citations
- Manage your collection - Add personal notes, validate storage integrity, and maintain your paper repository
The system uses a state machine-driven workflow with a rich terminal interface, providing visual feedback and markdown rendering for an enhanced research experience.
- Python 3.12 or higher
- OpenAI API key - Set as the OPENAI_API_KEY environment variable
- uv package manager - Install from https://docs.astral.sh/uv/
- Operating System - macOS or Linux (Windows not currently supported)
- Clone the repository:
git clone https://github.com/yourusername/my-research-assistant.git
cd my-research-assistant
- Install uv if you haven't already:
curl -LsSf https://astral.sh/uv/install.sh | sh
- Install dependencies:
uv sync
- Set up environment variables:
export OPENAI_API_KEY='your-api-key-here'
export DOC_HOME='/path/to/your/documents' # Where papers will be stored
Optional environment variables:
export DEFAULT_MODEL='gpt-4o' # LLM model to use (default: gpt-4o)
export DEFAULT_EMBEDDING_MODEL='text-embedding-ada-002' # Embedding model (default: text-embedding-ada-002)
export MODEL_API_BASE='https://api.openai.com/v1' # API endpoint (default: OpenAI, can use gateway)
export PDF_VIEWER='/usr/bin/open' # PDF viewer executable (default: terminal viewer)
Note: The MODEL_API_BASE variable allows you to use an API gateway or alternative OpenAI-compatible endpoint. This is useful for load balancing, cost management, or using local model servers.
- Optional: Set up Google Custom Search for enhanced paper discovery:
The find command uses Google Custom Search as the primary discovery method when credentials are configured. This provides more reliable and higher-quality search results than the ArXiv API keyword search alone. If credentials are not configured, the system automatically falls back to ArXiv API search.
To enable Google Custom Search:
a. Get a Google Custom Search API key:
- Go to Google Cloud Console
- Create or select a project
- Enable the Custom Search API
- Create credentials (API key)
b. Create a Custom Search Engine:
- Go to Google Programmable Search Engine
- Create a new search engine
- Configure it to search only arxiv.org
- Copy the Search Engine ID
c. Set environment variables:
export GOOGLE_SEARCH_API_KEY='your-google-api-key'
export GOOGLE_SEARCH_ENGINE_ID='your-search-engine-id'
Free tier: Google provides 100 queries/day for free. Each find command uses 1 query.
Troubleshooting:
- If you see "Google Custom Search failed: API request failed with status code 429", you've exceeded your quota. Wait for the daily reset or upgrade your quota.
- If credentials are not configured, you'll see "Google Custom Search not configured, using ArXiv API search..." in the logs, and the system will use the legacy ArXiv API search method.
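The credential-based choice between Google Custom Search and the ArXiv API can be sketched as follows. This is a minimal illustration, not the project's code: the function names are hypothetical, and only the two environment variable names and the public Custom Search JSON API endpoint are taken as given.

```python
from typing import Mapping
from urllib.parse import urlencode

# Public endpoint of the Google Custom Search JSON API.
GOOGLE_CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def choose_backend(env: Mapping[str, str]) -> str:
    """Return "google" only when both credentials are set, otherwise "arxiv"."""
    if env.get("GOOGLE_SEARCH_API_KEY") and env.get("GOOGLE_SEARCH_ENGINE_ID"):
        return "google"
    return "arxiv"


def google_search_url(env: Mapping[str, str], query: str, num: int = 10) -> str:
    """Build the request URL for one query (the API returns at most 10 results per call)."""
    params = {
        "key": env["GOOGLE_SEARCH_API_KEY"],
        "cx": env["GOOGLE_SEARCH_ENGINE_ID"],
        "q": query,
        "num": num,
    }
    return f"{GOOGLE_CSE_ENDPOINT}?{urlencode(params)}"
```

Calling choose_backend with an empty mapping returns "arxiv", which mirrors the fallback described in the troubleshooting notes.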
my-research-assistant/
├── src/my_research_assistant/
│ ├── __init__.py
│ ├── chat.py # Rich terminal interface
│ ├── state_machine.py # Workflow state management
│ ├── workflow.py # LlamaIndex workflow orchestration
│ ├── arxiv_downloader.py # ArXiv API integration
│ ├── google_search.py # Google Custom Search integration
│ ├── vector_store.py # ChromaDB dual vector stores & PDF text extraction
│ ├── summarizer.py # LLM-powered summarization
│ ├── paper_manager.py # Paper resolution utilities
│ ├── result_storage.py # Save/manage research results
│ ├── paper_removal.py # Remove papers from store
│ ├── validate_store.py # Store validation
│ ├── prompt.py # Template-based prompt system
│ ├── models.py # Centralized LLM config
│ ├── file_locations.py # Storage configuration
│ ├── project_types.py # Data structures
│ └── prompts/ # Markdown prompt templates
├── tests/ # Pytest test suite
├── designs/ # Design documentation
├── screenshots/ # Example screenshots
├── pyproject.toml # Project configuration
├── LICENSE # Apache 2.0 license
└── README.md
Launch the interactive chat interface:
uv run chat
Or run the main module directly:
uv run python -m my_research_assistant
The chat interface supports optional logging for debugging and troubleshooting:
# Enable terminal logging at INFO level
uv run chat --loglevel INFO
# Write logs to a file (appends to file on each run)
uv run chat --logfile research-assistant.log
# Combine both options
uv run chat --loglevel DEBUG --logfile debug.log
Log levels (from least to most verbose):
- ERROR - Only errors with stack traces
- WARNING - Errors and warnings
- INFO - Errors, warnings, and progress information
- DEBUG - All messages including detailed debugging information
Log formats:
- Terminal: Single-character level indicator (E/W/I/D) + message
- File: ISO timestamp + level + message
Notes:
- Logs are appended to the file (not overwritten) on each run
- API keys are automatically redacted in log output
- LlamaIndex verbose logging is suppressed by default
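The two log formats and the key redaction described above could be approximated with the standard logging module as shown below. This is a sketch under assumptions: the class and constant names are hypothetical, and the redaction pattern only covers keys that start with "sk-", which may be narrower than what the tool actually redacts.

```python
import logging
import re

# Assumption: keys look like "sk-..."; the real redaction rules may cover more patterns.
_KEY_PATTERN = re.compile(r"sk-[A-Za-z0-9_-]{8,}")


class RedactKeysFilter(logging.Filter):
    """Replace anything that looks like an API key before the record is formatted."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = _KEY_PATTERN.sub("[REDACTED]", record.getMessage())
        record.args = ()  # the message is already fully interpolated
        return True


# Terminal: single-character level indicator (E/W/I/D) followed by the message.
TERMINAL_FORMAT = logging.Formatter("%(levelname).1s %(message)s")
# File: ISO-style timestamp, level name, message.
FILE_FORMAT = logging.Formatter(
    "%(asctime)s %(levelname)s %(message)s", datefmt="%Y-%m-%dT%H:%M:%S"
)


def make_handler(formatter: logging.Formatter, stream=None) -> logging.Handler:
    """Create a stream handler that redacts keys and applies the given format."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(formatter)
    handler.addFilter(RedactKeysFilter())
    return handler
```

Attaching the filter to the handler rather than to a single logger means records from any library that reaches the handler are redacted as well.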
Here's a typical workflow for researching transformer attention mechanisms:
- Find papers on ArXiv:
> find transformer attention mechanisms
The system searches ArXiv and displays ranked results with titles, authors, and categories.
- Summarize a paper:
> summarize 1
Downloads the PDF, extracts text, generates an LLM summary, and indexes the content.
- View summaries:
> summary 1
Displays the structured summary for the selected paper.
- Perform semantic search:
> sem-search how does multi-head attention work?
Searches across all indexed papers and returns a synthesized answer with page references.
- Conduct deep research:
> research what are the key innovations in transformer architectures?
Uses hierarchical RAG: searches summaries first, retrieves detailed content from relevant papers, and synthesizes a comprehensive report with citations.
- Save results:
> save
Saves the research report with an LLM-generated title to the results directory.
- Add personal notes:
> notes
Opens your editor to add personal notes for the current paper.
- List your collection:
> list
Displays all downloaded papers with pagination.
Available from any state.
| Command | Description | Example |
|---|---|---|
| find <query> | Search ArXiv for papers (uses Google Custom Search if configured, otherwise ArXiv API) | find deep learning optimization |
| list | List all downloaded papers | list |
Work with individual papers.
| Command | Description | Available From |
|---|---|---|
| summarize <number\|id> | Download and summarize a paper | After find (or any state with ArXiv ID) |
| summary <number\|id> | View existing summary | After list, sem-search, or research |
| open <number\|id> | View paper content | Same as summary |
| notes | Edit personal notes | After selecting a paper |
| improve <feedback> | Improve current summary/results | While viewing summary or results |
Available from any state.
| Command | Description | Example |
|---|---|---|
| sem-search <query> | Semantic search across papers | sem-search attention mechanisms |
| research <query> | Deep research with hierarchical RAG | research transformer architectures |
| save | Save search/research results | After sem-search or research |
System maintenance and utilities.
| Command | Description | Available From |
|---|---|---|
| remove-paper <number\|id> | Remove paper from repository | Any state |
| rebuild-index | Rebuild vector store indexes | Any state |
| summarize-all | Generate summaries for all papers | Any state |
| validate-store | Check repository integrity | Any state |
Tools for testing and debugging search functionality.
| Command | Description | Example |
|---|---|---|
| uv run search-tester [OPTIONS] QUERY | Test search APIs directly | uv run search-tester "neural networks" |
search-tester options:
- --summary - Search the summary index instead of content index
- --papers PAPER_IDS - Filter search to specific papers (comma-separated)
- -k N - Number of chunks to return (default: 20 for content, 5 for summary)
- --content-similarity-threshold T - Minimum similarity score (0.0-1.0, default: 0.6)
- --use-mmr - Use Maximum Marginal Relevance for diversity
- --mmr-alpha A - MMR alpha parameter (0.0-1.0, default: 0.5, requires --use-mmr)
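For readers unfamiliar with Maximum Marginal Relevance, the sketch below shows the general greedy selection on plain Python lists. It is a generic illustration, not the tool's implementation: the function names are hypothetical, and it assumes alpha weights relevance (1.0 means pure relevance ranking), which may not match how --mmr-alpha is interpreted here.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(
    query: Sequence[float],
    docs: List[Sequence[float]],
    k: int,
    alpha: float = 0.5,
) -> List[int]:
    """Greedily pick up to k document indices, trading relevance against redundancy."""
    relevance = [cosine(query, d) for d in docs]
    selected: List[int] = []
    remaining = list(range(len(docs)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((cosine(docs[i], docs[j]) for j in selected), default=0.0)
            return alpha * relevance[i] - (1 - alpha) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicate chunks and one unrelated chunk, a balanced alpha tends to return one of the duplicates plus the unrelated chunk, while alpha set to 1.0 returns both duplicates.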
Examples:
# Search content index
uv run search-tester "attention mechanisms"
# Search summary index with custom k
uv run search-tester --summary -k 10 "transformers"
# Search specific papers with MMR
uv run search-tester --papers 2503.12345,2503.67890 --use-mmr "optimization"
# Adjust similarity threshold
uv run search-tester --content-similarity-threshold 0.7 "deep learning"
Available from any state.
| Command | Description |
|---|---|
| help | Show valid commands for current state |
| status | Display current workflow status |
| history | Show conversation history |
| clear | Clear conversation history |
| quit or exit | Exit the application |
Papers can be referenced in two ways:
- By number (1-indexed): When you have results from find, list, sem-search, or research
> summary 3 # View summary of paper #3 from current results
- By ArXiv ID: From any state using the full ArXiv ID
> summary 2404.16130v2 # View summary by ArXiv ID
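The two reference styles can be resolved with a small helper like the one below. This is a sketch only: resolve_paper is a hypothetical name, and the pattern assumes new-style ArXiv identifiers (such as 2404.16130 or 2404.16130v2), so older identifiers with a category prefix are not handled.

```python
import re
from typing import List

# New-style ArXiv identifiers, optionally with a version suffix.
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def resolve_paper(arg: str, current_results: List[str]) -> str:
    """Turn a command argument into an ArXiv ID.

    An ArXiv ID is accepted as-is; a number is looked up (1-indexed)
    in the IDs from the most recent result list.
    """
    arg = arg.strip()
    if ARXIV_ID.match(arg):
        return arg
    if arg.isdigit():
        if not current_results:
            raise ValueError("No current results to select from")
        index = int(arg)
        if 1 <= index <= len(current_results):
            return current_results[index - 1]
        raise ValueError(f"Paper number {index} is out of range (1-{len(current_results)})")
    raise ValueError(f"Not a paper number or ArXiv ID: {arg!r}")
```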
The system stores all data in a directory specified by the DOC_HOME environment variable:
${DOC_HOME}/
├── pdfs/ # Downloaded PDF files
│ └── <arxiv-id>.pdf # Named using ArXiv conventions
├── paper_metadata/ # ArXiv metadata (JSON)
│ └── <arxiv-id>.json
├── extracted_paper_text/ # Extracted markdown text
│ └── <arxiv-id>.md
├── summaries/ # LLM-generated summaries
│ ├── <arxiv-id>_summary.md
│ └── images/ # Extracted figures
│ └── <arxiv-id>/
├── notes/ # Personal notes (markdown)
│ └── <arxiv-id>_notes.md
├── results/ # Saved search/research results
│ └── <timestamp>_<title>.md
└── index/ # ChromaDB vector stores
├── content/ # Paper content chunks
└── summary/ # Summaries + notes
Papers in your collection can be in different states:
- Downloaded: PDF exists in pdfs/
- Extracted: Text extracted to extracted_paper_text/
- Summarized: Summary generated in summaries/
summaries/ - Indexed (content): Paper chunks indexed for semantic search
- Indexed (summary): Summary indexed for research
Use validate-store to see the status of all papers.
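The file-based states can be checked directly against the directory layout shown earlier. The helper below is a hypothetical sketch and not the validate-store implementation; it covers only the three states that correspond to files on disk, since the two index states would require querying the vector stores.

```python
from pathlib import Path
from typing import Dict


def file_states(doc_home: Path, arxiv_id: str) -> Dict[str, bool]:
    """Report which file-backed processing stages exist for one paper."""
    return {
        "downloaded": (doc_home / "pdfs" / f"{arxiv_id}.pdf").is_file(),
        "extracted": (doc_home / "extracted_paper_text" / f"{arxiv_id}.md").is_file(),
        "summarized": (doc_home / "summaries" / f"{arxiv_id}_summary.md").is_file(),
    }
```

A paper whose PDF is present but whose summary file is missing would report downloaded as True and summarized as False.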
The system uses a state machine-driven workflow with a pipeline architecture for paper processing:
ArXiv Search → Download PDF → Extract Text → Index Content → Generate Summary → Index Summary
- 6 states: initial, select-new, select-view, summarized, sem-search, research
- 3 state variables: last_query_set, selected_paper, draft
- Command validation based on current state
- Conditional query set preservation for seamless navigation
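To make state-based command validation concrete, here is a heavily simplified sketch. The six state names and three state variables come from the list above, and the global commands come from the command tables; the validation rules themselves are assumptions made for illustration, and the authoritative specification is in designs/workflow-state-machine-and-commands.md.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class State(Enum):
    INITIAL = "initial"
    SELECT_NEW = "select-new"
    SELECT_VIEW = "select-view"
    SUMMARIZED = "summarized"
    SEM_SEARCH = "sem-search"
    RESEARCH = "research"


# Commands the tables list as available from any state.
GLOBAL_COMMANDS = {
    "find", "list", "sem-search", "research",
    "remove-paper", "rebuild-index", "summarize-all", "validate-store",
    "help", "status", "history", "clear", "quit", "exit",
}


@dataclass
class Session:
    state: State = State.INITIAL
    last_query_set: List[str] = field(default_factory=list)
    selected_paper: Optional[str] = None
    draft: Optional[str] = None

    def is_valid(self, command: str) -> bool:
        """Assumed rules: a rough reading of the 'Available From' columns."""
        if command in GLOBAL_COMMANDS:
            return True
        if command == "notes":
            return self.selected_paper is not None
        if command == "improve":
            return self.draft is not None
        if command == "save":
            return self.state in (State.SEM_SEARCH, State.RESEARCH) and self.draft is not None
        if command in ("summarize", "summary", "open"):
            # Selecting by number needs a result list; selecting by ArXiv ID is not modeled.
            return bool(self.last_query_set)
        return False
```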
- Content index: Chunks from paper PDFs for semantic search
- Summary index: Summaries + personal notes for research
- ChromaDB backend with persistent storage
- Metadata enrichment with ArXiv categories, authors, and page numbers
- Event-driven pipeline for paper processing
- Structured result objects (QueryResult, ProcessingResult, SaveResult)
- Async workflow support for efficient operations
- Integration with OpenAI embeddings and LLMs
- Keyword Search: Google Custom Search (if configured) or ArXiv API for candidate papers
- Version Deduplication: Automatic selection of latest paper versions when multiple versions found
- Semantic Reranking: LlamaIndex embeddings for similarity ranking
- Paper ID Sorting: Results sorted by ArXiv ID for consistent numbering across commands
- Hierarchical RAG: Summary-level search → targeted content retrieval → synthesis
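The version deduplication and ID sorting steps in this list can be illustrated with a short function. This is a sketch, not the project's code: latest_versions is a hypothetical name, and it assumes an ID without a version suffix ranks below any explicit version of the same paper.

```python
import re
from typing import Dict, Iterable, List, Tuple

_ID = re.compile(r"^(?P<base>\d{4}\.\d{4,5})(?:v(?P<version>\d+))?$")


def latest_versions(paper_ids: Iterable[str]) -> List[str]:
    """Keep only the highest version of each paper, then sort by ArXiv ID."""
    best: Dict[str, Tuple[int, str]] = {}
    for pid in paper_ids:
        match = _ID.match(pid)
        if match is None:
            raise ValueError(f"Unrecognized ArXiv ID: {pid!r}")
        base = match.group("base")
        # Assumption: a missing suffix counts as version 0.
        version = int(match.group("version") or 0)
        if base not in best or version > best[base][0]:
            best[base] = (version, pid)
    return sorted(pid for _, pid in best.values())
```

Sorting by ID after deduplication keeps the numbering stable, so the same paper gets the same number regardless of the order in which the search backend returned it.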
- Template-based markdown prompts in src/my_research_assistant/prompts/
- Variable substitution with {{VAR_NAME}} syntax
- Versioned prompts (v1, v2) for base summaries and improvements
- Centralized management for consistency
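Substitution with the {{VAR_NAME}} syntax can be sketched with a regular expression as follows. The render function and the MissingPromptVariable exception are hypothetical stand-ins; the project's prompt.py and its PromptVarError may behave differently, for example in which characters a variable name may contain.

```python
import re
from typing import Mapping

# Assumption: variable names are upper-case letters, digits, and underscores.
_VAR = re.compile(r"\{\{([A-Z0-9_]+)\}\}")


class MissingPromptVariable(KeyError):
    """Hypothetical stand-in for the project's PromptVarError."""


def render(template: str, variables: Mapping[str, str]) -> str:
    """Replace every {{NAME}} in the template, failing loudly on unknown names."""

    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in variables:
            raise MissingPromptVariable(name)
        return str(variables[name])

    return _VAR.sub(replace, template)
```

Raising on a missing variable, instead of leaving the placeholder in place, prevents a half-filled prompt from being sent to the model.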
- Search: ArXiv API → metadata extraction → semantic reranking
- Download: PDF retrieval → local storage in pdfs/
- Extract: PyMuPDF → markdown text + images
- Index: Document chunking → embeddings → ChromaDB storage
- Summarize: LLM with versioned prompts → markdown summary
- Search/Research: Query → vector similarity → LLM synthesis → citations
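The retrieval half of the Search/Research step, where summaries narrow the set of papers before content chunks are ranked, can be sketched on toy vectors. Everything here is illustrative: the function names and data shapes are hypothetical, embeddings are plain lists rather than ChromaDB queries, and the LLM synthesis step is omitted.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Chunk = Tuple[str, int, Vector]  # (paper_id, page, embedding)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hierarchical_retrieve(
    query: Vector,
    summaries: Dict[str, Vector],
    chunks: List[Chunk],
    top_papers: int = 2,
    top_chunks: int = 3,
) -> List[Tuple[str, int]]:
    """Return (paper_id, page) citations for the best chunks of the best papers."""
    # Stage 1: rank papers by how well their summary matches the query.
    ranked = sorted(summaries, key=lambda pid: cosine(query, summaries[pid]), reverse=True)
    keep = set(ranked[:top_papers])
    # Stage 2: rank content chunks, but only within the papers kept in stage 1.
    candidates = [c for c in chunks if c[0] in keep]
    candidates.sort(key=lambda c: cosine(query, c[2]), reverse=True)
    return [(pid, page) for pid, page, _ in candidates[:top_chunks]]
```

The design trade-off is visible in the second stage: a chunk from a paper whose summary did not match is never considered, even if that chunk alone would have scored well.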
Comprehensive test suite with pytest:
- State machine tests: 30+ tests covering all workflows and transitions
- Command tests: Integration tests for chat interface
- Component tests: Unit tests for individual modules
- Mock-based testing: Isolated component testing
- Async support: pytest-asyncio for workflow testing
Run tests:
uv run pytest
uv run pytest -v # Verbose output
uv run pytest tests/test_state_machine.py # Specific test file
- Custom exceptions: IndexError, ConfigError, PromptFileError, PromptVarError
- State machine recovery: Automatic transitions to safe states on failures
- Graceful fallbacks: Text-based similarity when embedding fails
- Robust validation: File existence checks before operations
- Structured errors: Detailed error messages in result objects
- Centralized config via models.py
- Environment-based selection:
  - DEFAULT_MODEL - LLM model name (defaults to gpt-4o)
  - DEFAULT_REASONING_MODEL - Reasoning model for deep analytical tasks (defaults to gpt-5.1)
  - DEFAULT_EMBEDDING_MODEL - Embedding model (defaults to text-embedding-ada-002)
  - MODEL_API_BASE - API endpoint URL (defaults to OpenAI, supports API gateways)
  - OPENAI_API_KEY - API authentication key
- Reasoning model: Configured with reasoning_effort="high" for maximum analytical capability
- OpenAI integration: Configurable model parameters
- API gateway support: Use MODEL_API_BASE to route through proxies or local servers
- Caching support: Performance optimization for repeated queries
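Environment-based model selection with the documented defaults might look like the sketch below. ModelSettings and load_model_settings are hypothetical names and do not reflect the actual contents of models.py; only the variable names and default values are taken from this README.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ModelSettings:
    """Hypothetical container for the model-related environment variables."""
    api_key: str
    model: str
    reasoning_model: str
    embedding_model: str
    api_base: str


def load_model_settings(env: Mapping[str, str] = os.environ) -> ModelSettings:
    """Read the API key and apply the documented defaults to everything else."""
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return ModelSettings(
        api_key=api_key,
        model=env.get("DEFAULT_MODEL", "gpt-4o"),
        reasoning_model=env.get("DEFAULT_REASONING_MODEL", "gpt-5.1"),
        embedding_model=env.get("DEFAULT_EMBEDDING_MODEL", "text-embedding-ada-002"),
        api_base=env.get("MODEL_API_BASE", "https://api.openai.com/v1"),
    )
```

Passing a plain dictionary instead of os.environ makes the function easy to exercise in tests without touching the real environment.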
# Run all tests
uv run pytest
# Run with verbose output
uv run pytest -v
# Run specific test file
uv run pytest tests/test_summarizer.py
# Run tests for a specific function
uv run pytest -k test_state_machine_transitions
The project uses pytest-cov to measure test coverage. Here are the common commands:
# Basic coverage report
uv run pytest --cov=my_research_assistant
# Coverage with missing lines (recommended)
uv run pytest --cov=my_research_assistant --cov-report=term-missing
This shows which specific lines aren't covered by tests, making it easy to identify gaps.
# Generate HTML coverage report
uv run pytest --cov=my_research_assistant --cov-report=html
This creates an interactive HTML report in htmlcov/index.html that you can open in a browser for detailed coverage analysis.
# Combined terminal + HTML reports
uv run pytest --cov=my_research_assistant --cov-report=term-missing --cov-report=html
This gives you both the terminal summary and the detailed HTML report.
# Add runtime dependency
uv add <package-name>
# Add dev dependency
uv add --group dev <package-name>
- Write new unit tests in tests/ rather than creating throwaway tests
- Use FileLocations to override default locations in tests (prevents modifying docs/)
- Use structured result objects for workflow methods
The designs/ directory contains comprehensive design documentation:
- workflow-state-machine-and-commands.md - State machine specification - implemented
- command-arguments.md - Paper argument parsing - implemented
- command-types.md - Command categorization and usage patterns - implemented
- open-command.md - PDF viewer integration and terminal fallback - implemented
- remove-paper-command.md - Paper removal from all storage locations - implemented
- research-command.md - Hierarchical RAG design - implemented
- find-command.md - Google Custom Search integration - implemented
- constants.md - Centralized constants for search and retrieval hyperparameters - implemented
- validate-command.md - Store validation command - implemented
- error-handling-and-logging.md - Error reporting and logging system - implemented
- file-store.md - Data storage architecture - implemented
- user-stories.md - User operations and workflows - partially implemented
- improved-pagination.md - Single-key pagination design - implemented
Contributions are welcome! This project is in early development. Please:
- Fork the repository
- Create a feature branch
- Write tests for new functionality
- Ensure all tests pass with uv run pytest
- Submit a pull request
Copyright 2025 Benedat LLC
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Built with:
- LlamaIndex - Workflow orchestration and RAG
- ChromaDB - Vector database
- Rich - Terminal interface
- PyMuPDF - PDF processing
- ArXiv API - Paper metadata and downloads





