My Research Assistant

A command-line chatbot agent for building and exploring a knowledge base of research papers from ArXiv. The system provides an interactive terminal interface for finding, downloading, indexing, summarizing, and semantically searching through research papers with LLM-powered capabilities.

(Screenshot: initial command list)

What It Does and Why

Keeping up with the latest research in fast-moving fields like generative AI can be overwhelming. This tool helps researchers:

  • Discover relevant papers - Search ArXiv with keyword queries and intelligent ranking
  • Build a personal knowledge base - Download, index, and organize papers locally
  • Generate AI summaries - Create structured summaries of papers with customizable prompts
  • Perform semantic search - Find information across your paper collection using natural language queries
  • Conduct deep research - Use hierarchical RAG (Retrieval-Augmented Generation) to synthesize insights across multiple papers with citations
  • Manage your collection - Add personal notes, validate storage integrity, and maintain your paper repository

The system uses a state machine-driven workflow with a rich terminal interface, providing visual feedback and markdown rendering for an enhanced research experience.

Getting Started

Prerequisites

  • Python 3.12 or higher
  • OpenAI API key - Set as OPENAI_API_KEY environment variable
  • uv package manager - Install from https://docs.astral.sh/uv/
  • Operating System - macOS or Linux (Windows not currently supported)

Installation

  1. Clone the repository:
git clone https://github.com/yourusername/my-research-assistant.git
cd my-research-assistant
  2. Install uv if you haven't already:
curl -LsSf https://astral.sh/uv/install.sh | sh
  3. Install dependencies:
uv sync
  4. Set up environment variables:
export OPENAI_API_KEY='your-api-key-here'
export DOC_HOME='/path/to/your/documents'  # Where papers will be stored

Optional environment variables:

export DEFAULT_MODEL='gpt-4o'  # LLM model to use (default: gpt-4o)
export DEFAULT_EMBEDDING_MODEL='text-embedding-ada-002'  # Embedding model (default: text-embedding-ada-002)
export MODEL_API_BASE='https://api.openai.com/v1'  # API endpoint (default: OpenAI, can use gateway)
export PDF_VIEWER='/usr/bin/open'  # PDF viewer executable (default: terminal viewer)

Note: The MODEL_API_BASE variable allows you to use an API gateway or alternative OpenAI-compatible endpoint. This is useful for load balancing, cost management, or using local model servers.

  5. Optional: Set up Google Custom Search for enhanced paper discovery:

The find command uses Google Custom Search as the primary discovery method when credentials are configured. This provides more reliable and higher-quality search results than the ArXiv API keyword search alone. If credentials are not configured, the system automatically falls back to ArXiv API search.

To enable Google Custom Search:

a. Get a Google Custom Search API key:

  • Go to Google Cloud Console
  • Create or select a project
  • Enable the Custom Search API
  • Create credentials (API key)

b. Create a Custom Search Engine:

  • Go to the Programmable Search Engine control panel (https://programmablesearchengine.google.com)
  • Create a new search engine and configure it to search arxiv.org
  • Copy the Search Engine ID

c. Set environment variables:

export GOOGLE_SEARCH_API_KEY='your-google-api-key'
export GOOGLE_SEARCH_ENGINE_ID='your-search-engine-id'

Free tier: Google provides 100 queries/day for free. Each find command uses 1 query.
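
To sanity-check your credentials outside the app, you can call the underlying Custom Search JSON API directly. A minimal standalone check (not part of this project):

import os
import requests

# Query the Custom Search JSON API with the same credentials the find
# command uses. A 429 response here means the daily quota is exhausted.
resp = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={
        "key": os.environ["GOOGLE_SEARCH_API_KEY"],
        "cx": os.environ["GOOGLE_SEARCH_ENGINE_ID"],
        "q": "transformer attention mechanisms",
    },
    timeout=30,
)
resp.raise_for_status()
for item in resp.json().get("items", []):
    print(item["title"], item["link"])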

Troubleshooting:

  • If you see "Google Custom Search failed: API request failed with status code 429", you've exceeded your quota. Wait for the daily reset or upgrade your quota.
  • If credentials are not configured, you'll see "Google Custom Search not configured, using ArXiv API search..." in the logs, and the system will use the legacy ArXiv API search method.

Project File Layout

my-research-assistant/
├── src/my_research_assistant/
│   ├── __init__.py
│   ├── chat.py                    # Rich terminal interface
│   ├── state_machine.py           # Workflow state management
│   ├── workflow.py                # LlamaIndex workflow orchestration
│   ├── arxiv_downloader.py        # ArXiv API integration
│   ├── google_search.py           # Google Custom Search integration
│   ├── vector_store.py            # ChromaDB dual vector stores & PDF text extraction
│   ├── summarizer.py              # LLM-powered summarization
│   ├── paper_manager.py           # Paper resolution utilities
│   ├── result_storage.py          # Save/manage research results
│   ├── paper_removal.py           # Remove papers from store
│   ├── validate_store.py          # Store validation
│   ├── prompt.py                  # Template-based prompt system
│   ├── models.py                  # Centralized LLM config
│   ├── file_locations.py          # Storage configuration
│   ├── project_types.py           # Data structures
│   └── prompts/                   # Markdown prompt templates
├── tests/                         # Pytest test suite
├── designs/                       # Design documentation
├── screenshots/                   # Example screenshots
├── pyproject.toml                 # Project configuration
├── LICENSE                        # Apache 2.0 license
└── README.md

Running the Application

Launch the interactive chat interface:

uv run chat

Or run the main module directly:

uv run python -m my_research_assistant

Logging Options

The chat interface supports optional logging for debugging and troubleshooting:

# Enable terminal logging at INFO level
uv run chat --loglevel INFO

# Write logs to a file (appends to file on each run)
uv run chat --logfile research-assistant.log

# Combine both options
uv run chat --loglevel DEBUG --logfile debug.log

Log levels (from least to most verbose):

  • ERROR - Only errors with stack traces
  • WARNING - Errors and warnings
  • INFO - Errors, warnings, and progress information
  • DEBUG - All messages including detailed debugging information

Log formats:

  • Terminal: Single-character level indicator (E/W/I/D) + message
  • File: ISO timestamp + level + message

Notes:

  • Logs are appended to the file (not overwritten) on each run
  • API keys are automatically redacted in log output (one common pattern is sketched after this list)
  • LlamaIndex verbose logging is suppressed by default
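
One common way to implement append-mode file logging with key redaction is a logging filter. A sketch of the general pattern (illustrative; not necessarily this project's exact mechanism):

import logging
import re

class RedactAPIKeys(logging.Filter):
    """Mask anything that looks like an OpenAI API key before it is emitted."""
    KEY_RE = re.compile(r"sk-[A-Za-z0-9_-]{10,}")

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = self.KEY_RE.sub("sk-***REDACTED***", str(record.msg))
        return True

handler = logging.FileHandler("research-assistant.log", mode="a")  # append, never overwrite
handler.addFilter(RedactAPIKeys())
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logging.getLogger().addHandler(handler)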

Example Session

Here's a typical workflow for researching transformer attention mechanisms:

(Screenshot: find and summarize commands)

  1. Find papers on ArXiv:
> find transformer attention mechanisms

The system searches ArXiv and displays ranked results with titles, authors, and categories.

  2. Summarize a paper:
> summarize 1

Downloads the PDF, extracts text, generates an LLM summary, and indexes the content.

(Screenshot: summary command)

  3. View summaries:
> summary 1

Displays the structured summary for the selected paper.

  4. Perform semantic search:
> sem-search how does multi-head attention work?

Searches across all indexed papers and returns a synthesized answer with page references.

(Screenshot: semantic search command)

  5. Conduct deep research:
> research what are the key innovations in transformer architectures?

Uses hierarchical RAG: searches summaries first, retrieves detailed content from relevant papers, and synthesizes a comprehensive report with citations.

(Screenshot: research command)

  6. Save results:
> save

Saves the research report with an LLM-generated title to the results directory.

  7. Add personal notes:
> notes

Opens your editor to add personal notes for the current paper.

  8. List your collection:
> list

Displays all downloaded papers with pagination.

(Screenshot: list command)

Command Reference

Discovery Commands

Available from any state.

  • find <query> - Search ArXiv for papers (uses Google Custom Search if configured, otherwise the ArXiv API). Example: find deep learning optimization
  • list - List all downloaded papers. Example: list

Paper Processing Commands

Work with individual papers.

  • summarize <number|id> - Download and summarize a paper. Available from: after find (or any state, with an ArXiv ID)
  • summary <number|id> - View an existing summary. Available from: after list, sem-search, or research
  • open <number|id> - View paper content. Available from: same as summary
  • notes - Edit personal notes. Available from: after selecting a paper
  • improve <feedback> - Improve the current summary or results. Available from: while viewing a summary or results

Search & Research Commands

Available from any state.

  • sem-search <query> - Semantic search across papers. Example: sem-search attention mechanisms
  • research <query> - Deep research with hierarchical RAG. Example: research transformer architectures
  • save - Save search or research results (available after sem-search or research)

Management Commands

System maintenance and utilities.

  • remove-paper <number|id> - Remove a paper from the repository
  • rebuild-index - Rebuild the vector store indexes
  • summarize-all - Generate summaries for all papers
  • validate-store - Check repository integrity

All four are available from any state.

Development/Testing Commands

Tools for testing and debugging search functionality.

  • uv run search-tester [OPTIONS] QUERY - Test the search APIs directly. Example: uv run search-tester "neural networks"

search-tester options:

  • --summary - Search the summary index instead of content index
  • --papers PAPER_IDS - Filter search to specific papers (comma-separated)
  • -k N - Number of chunks to return (default: 20 for content, 5 for summary)
  • --content-similarity-threshold T - Minimum similarity score (0.0-1.0, default: 0.6)
  • --use-mmr - Use Maximum Marginal Relevance to diversify results (see the sketch after the examples below)
  • --mmr-alpha A - MMR alpha parameter (0.0-1.0, default: 0.5, requires --use-mmr)

Examples:

# Search content index
uv run search-tester "attention mechanisms"

# Search summary index with custom k
uv run search-tester --summary -k 10 "transformers"

# Search specific papers with MMR
uv run search-tester --papers 2503.12345,2503.67890 --use-mmr "optimization"

# Adjust similarity threshold
uv run search-tester --content-similarity-threshold 0.7 "deep learning"
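
For reference, --use-mmr implements the standard Maximum Marginal Relevance trade-off: each pick maximizes alpha times query similarity minus (1 - alpha) times redundancy with already-selected chunks. A minimal sketch of that selection rule (illustrative, not the project's exact code):

def mmr_select(query_sims, pairwise_sims, k, alpha=0.5):
    """Greedy MMR: balance query relevance against redundancy.

    query_sims[i] is the similarity of chunk i to the query;
    pairwise_sims[i][j] is the similarity between chunks i and j.
    """
    selected, candidates = [], list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((pairwise_sims[i][j] for j in selected), default=0.0)
            return alpha * query_sims[i] - (1 - alpha) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected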

System Commands

Available from any state.

  • help - Show valid commands for the current state
  • status - Display the current workflow status
  • history - Show conversation history
  • clear - Clear conversation history
  • quit or exit - Exit the application

Paper References

Papers can be referenced in two ways:

  1. By number (1-indexed): When you have results from find, list, sem-search, or research

    > summary 3  # View summary of paper #3 from current results
    
  2. By ArXiv ID: From any state using the full ArXiv ID

    > summary 2404.16130v2  # View summary by ArXiv ID
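
Resolving between the two forms is straightforward; a sketch of the idea (the real logic lives in paper_manager.py and may differ):

import re

ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")  # e.g. 2404.16130v2

def resolve_paper(ref, current_results):
    """Map a user reference to an ArXiv ID (illustrative)."""
    if ref.isdigit():                 # 1-indexed number into the current query set
        return current_results[int(ref) - 1]
    if ARXIV_ID.match(ref):           # full ArXiv ID, usable from any state
        return ref
    raise ValueError(f"Not a paper number or ArXiv ID: {ref}")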
    

Document Store Layout

The system stores all data in a directory specified by the DOC_HOME environment variable:

${DOC_HOME}/
├── pdfs/                          # Downloaded PDF files
│   └── <arxiv-id>.pdf            # Named using ArXiv conventions
├── paper_metadata/                # ArXiv metadata (JSON)
│   └── <arxiv-id>.json
├── extracted_paper_text/          # Extracted markdown text
│   └── <arxiv-id>.md
├── summaries/                     # LLM-generated summaries
│   ├── <arxiv-id>_summary.md
│   └── images/                    # Extracted figures
│       └── <arxiv-id>/
├── notes/                         # Personal notes (markdown)
│   └── <arxiv-id>_notes.md
├── results/                       # Saved search/research results
│   └── <timestamp>_<title>.md
└── index/                         # ChromaDB vector stores
    ├── content/                   # Paper content chunks
    └── summary/                   # Summaries + notes

Storage States

Papers in your collection can be in different states:

  • Downloaded: PDF exists in pdfs/
  • Extracted: Text extracted to extracted_paper_text/
  • Summarized: Summary generated in summaries/
  • Indexed (content): Paper chunks indexed for semantic search
  • Indexed (summary): Summary indexed for research

Use validate-store to see the status of all papers.
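
The first three states map directly onto file presence in the layout above; a minimal sketch of the checks (illustrative; validate-store also verifies the two indexes):

import os
from pathlib import Path

DOC_HOME = Path(os.environ["DOC_HOME"])

def storage_state(arxiv_id):
    """File-presence checks behind the downloaded/extracted/summarized states."""
    return {
        "downloaded": (DOC_HOME / "pdfs" / f"{arxiv_id}.pdf").exists(),
        "extracted": (DOC_HOME / "extracted_paper_text" / f"{arxiv_id}.md").exists(),
        "summarized": (DOC_HOME / "summaries" / f"{arxiv_id}_summary.md").exists(),
    }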

Implementation Overview

Architecture

The system uses a state machine-driven workflow with a pipeline architecture for paper processing:

ArXiv Search → Download PDF → Extract Text → Index Content → Generate Summary → Index Summary

Key Components

State Machine (state_machine.py)

  • 6 states: initial, select-new, select-view, summarized, sem-search, research
  • 3 state variables: last_query_set, selected_paper, draft
  • Command validation based on current state (sketched after this list)
  • Conditional query set preservation for seamless navigation
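
A condensed sketch of state-based command validation, using state and command names from this README (the mapping shown is a small illustrative subset, not the full table):

# Which commands are valid in which states (subset for illustration).
VALID_COMMANDS = {
    "initial":    {"find", "list", "sem-search", "research", "help", "quit"},
    "select-new": {"summarize", "find", "list", "help", "quit"},
    "summarized": {"summary", "open", "notes", "improve", "help", "quit"},
}

def validate_command(state, command):
    """Reject commands that are not valid in the current state."""
    return command in VALID_COMMANDS.get(state, set())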

Dual Vector Stores (vector_store.py)

  • Content index: Chunks from paper PDFs for semantic search
  • Summary index: Summaries + personal notes for research
  • ChromaDB backend with persistent storage
  • Metadata enrichment with ArXiv categories, authors, and page numbers
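
A minimal sketch of the dual-store setup with ChromaDB's persistent client (collection names follow the index/ layout shown later; the exact names and metadata fields are assumptions):

import chromadb

client = chromadb.PersistentClient(path="/path/to/doc_home/index")
content_index = client.get_or_create_collection("content")
summary_index = client.get_or_create_collection("summary")

# Each content chunk carries metadata for filtering and citations.
content_index.add(
    ids=["2404.16130v2:chunk-0"],
    documents=["...first chunk of extracted paper text..."],
    metadatas=[{"arxiv_id": "2404.16130v2", "page": 1, "category": "cs.CL"}],
)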

LlamaIndex Workflow (workflow.py)

  • Event-driven pipeline for paper processing
  • Structured result objects (QueryResult, ProcessingResult, SaveResult)
  • Async workflow support for efficient operations
  • Integration with OpenAI embeddings and LLMs

Search & Retrieval Strategy

  1. Keyword Search: Google Custom Search (if configured) or ArXiv API for candidate papers
  2. Version Deduplication: Automatic selection of latest paper versions when multiple versions found
  3. Semantic Reranking: LlamaIndex embeddings for similarity ranking
  4. Paper ID Sorting: Results sorted by ArXiv ID for consistent numbering across commands
  5. Hierarchical RAG: Summary-level search → targeted content retrieval → synthesis
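
In outline, step 5 is a coarse-to-fine retrieval. A sketch of the flow (the retriever and LLM objects here are duck-typed stand-ins, not this project's API):

def deep_research(query, summary_retriever, content_retriever, llm):
    """Coarse-to-fine retrieval then synthesis (illustrative flow only)."""
    # Coarse pass: locate the most relevant papers via the summary index.
    summary_hits = summary_retriever.retrieve(query, k=5)
    paper_ids = {hit.metadata["arxiv_id"] for hit in summary_hits}

    # Fine pass: pull detailed chunks only from those papers.
    chunks = content_retriever.retrieve(query, k=20, paper_filter=paper_ids)

    # Synthesis: produce a report that cites the retrieved chunks.
    return llm.synthesize_report(query, chunks)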

Prompt System (prompt.py)

  • Template-based markdown prompts in src/my_research_assistant/prompts/
  • Variable substitution with {{VAR_NAME}} syntax
  • Versioned prompts (v1, v2) for base summaries and improvements
  • Centralized management for consistency
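
A minimal sketch of {{VAR_NAME}} substitution over a markdown template (the loader shown here mirrors names from this README but is otherwise an assumption):

import re
from pathlib import Path

PROMPTS_DIR = Path("src/my_research_assistant/prompts")

def render_prompt(template_name, **variables):
    """Load a markdown template and substitute {{VAR_NAME}} placeholders."""
    text = (PROMPTS_DIR / f"{template_name}.md").read_text()

    def replace(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing prompt variable: {name}")  # cf. PromptVarError
        return variables[name]

    return re.sub(r"\{\{([A-Z_]+)\}\}", replace, text)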

Core Data Flow

  1. Search: ArXiv API → metadata extraction → semantic reranking
  2. Download: PDF retrieval → local storage in pdfs/
  3. Extract: PyMuPDF → markdown text + images
  4. Index: Document chunking → embeddings → ChromaDB storage
  5. Summarize: LLM with versioned prompts → markdown summary
  6. Search/Research: Query → vector similarity → LLM synthesis → citations

Testing

Comprehensive test suite with pytest:

  • State machine tests: 30+ tests covering all workflows and transitions
  • Command tests: Integration tests for chat interface
  • Component tests: Unit tests for individual modules
  • Mock-based testing: Isolated component testing
  • Async support: pytest-asyncio for workflow testing

Run tests:

uv run pytest
uv run pytest -v  # Verbose output
uv run pytest tests/test_state_machine.py  # Specific test file

Error Handling

  • Custom exceptions: IndexError, ConfigError, PromptFileError, PromptVarError
  • State machine recovery: Automatic transitions to safe states on failures
  • Graceful fallbacks: Text-based similarity when embedding fails
  • Robust validation: File existence checks before operations
  • Structured errors: Detailed error messages in result objects

Model Configuration

  • Centralized config via models.py
  • Environment-based selection:
    • DEFAULT_MODEL - LLM model name (defaults to gpt-4o)
    • DEFAULT_REASONING_MODEL - Reasoning model for deep analytical tasks (defaults to gpt-5.1)
    • DEFAULT_EMBEDDING_MODEL - Embedding model (defaults to text-embedding-ada-002)
    • MODEL_API_BASE - API endpoint URL (defaults to OpenAI, supports API gateways)
    • OPENAI_API_KEY - API authentication key
  • Reasoning model: Configured with reasoning_effort="high" for maximum analytical capability
  • OpenAI integration: Configurable model parameters
  • API gateway support: Use MODEL_API_BASE to route through proxies or local servers
  • Caching support: Performance optimization for repeated queries
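
The pattern in models.py amounts to environment-driven defaults resolved in one place; a sketch of that pattern (the module's actual interface may differ):

import os

DEFAULT_MODEL = os.environ.get("DEFAULT_MODEL", "gpt-4o")
DEFAULT_EMBEDDING_MODEL = os.environ.get("DEFAULT_EMBEDDING_MODEL", "text-embedding-ada-002")
MODEL_API_BASE = os.environ.get("MODEL_API_BASE", "https://api.openai.com/v1")
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]  # required; no default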

Development

Running Tests

# Run all tests
uv run pytest

# Run with verbose output
uv run pytest -v

# Run specific test file
uv run pytest tests/test_summarizer.py

# Run tests for a specific function
uv run pytest -k test_state_machine_transitions

Code Coverage

The project uses pytest-cov to measure test coverage. Here are the common commands:

# Basic coverage report
uv run pytest --cov=my_research_assistant

# Coverage with missing lines (recommended)
uv run pytest --cov=my_research_assistant --cov-report=term-missing

This shows which specific lines aren't covered by tests, making it easy to identify gaps.

# Generate HTML coverage report
uv run pytest --cov=my_research_assistant --cov-report=html

This creates an interactive HTML report in htmlcov/index.html that you can open in a browser for detailed coverage analysis.

# Combined terminal + HTML reports
uv run pytest --cov=my_research_assistant --cov-report=term-missing --cov-report=html

This gives you both the terminal summary and the detailed HTML report.

Adding Dependencies

# Add runtime dependency
uv add <package-name>

# Add dev dependency
uv add --group dev <package-name>

Code Structure

  • Write new unit tests in tests/ rather than creating throwaway tests
  • Use FileLocations to override default storage locations in tests so they never touch your real document store (see the sketch after this list)
  • Follow existing patterns for state machine integration
  • Use structured result objects for workflow methods
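
A sketch of that pattern with pytest's tmp_path fixture (the FileLocations constructor and attribute shown are assumptions about its interface):

from my_research_assistant.file_locations import FileLocations

def test_paths_resolve_under_tmp(tmp_path):
    locations = FileLocations(doc_home=tmp_path)  # keep all writes inside the test dir
    assert str(locations.doc_home).startswith(str(tmp_path))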

Design Documents

The designs/ directory contains comprehensive design documentation:

  • workflow-state-machine-and-commands.md - State machine specification - implemented
  • command-arguments.md - Paper argument parsing - implemented
  • command-types.md - Command categorization and usage patterns - implemented
  • open-command.md - PDF viewer integration and terminal fallback - implemented
  • remove-paper-command.md - Paper removal from all storage locations - implemented
  • research-command.md - Hierarchical RAG design - implemented
  • find-command.md - Google Custom Search integration - implemented
  • constants.md - Centralized constants for search and retrieval hyperparameters - implemented
  • validate-command.md - Store validation command - implemented
  • error-handling-and-logging.md - Error reporting and logging system - implemented
  • file-store.md - Data storage architecture - implemented
  • user-stories.md - User operations and workflows - partially implemented
  • improved-pagination.md - Single-key pagination design - implemented

Contributing

Contributions are welcome! This project is in early development. Please:

  1. Fork the repository
  2. Create a feature branch
  3. Write tests for new functionality
  4. Ensure all tests pass with uv run pytest
  5. Submit a pull request

License

Copyright 2025 Benedat LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Acknowledgments

Built with:

  • LlamaIndex - workflow orchestration and retrieval
  • ChromaDB - persistent vector storage
  • OpenAI - LLM and embedding models
  • PyMuPDF - PDF text and image extraction
  • Rich - terminal interface
  • uv - packaging and environment management
  • pytest - test suite
