Scholar Query Splitter

A powerful Python tool that helps researchers overcome Google Scholar's 1000-result limit by intelligently splitting large queries into smaller, manageable chunks using AI-powered semantic analysis.

🎯 Two Modes of Operation

1. Discovery Mode (main.py)

Find interesting query combinations and research gaps by exploring modifier combinations.

2. Exhaustive Mode (main_exhaustive.py)

Systematically split large queries to capture ALL results when a query returns >1000 hits.

Overview

Google Scholar caps search results at 1000 per query. This tool works around that limit by:

  • Analyzing your query using OpenAlex metadata
  • Extracting relevant keywords and entities using AI/NLP
  • Generating strategic subqueries that together capture all results
  • Tracking coverage to ensure no results are missed (the idea is sketched below)
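
The core splitting idea fits in a few lines of Python. This is an illustrative sketch only, not the tool's implementation; hit_count is a hypothetical stand-in for whatever counts Google Scholar hits:

def split_query(base_query, modifiers, hit_count, target_size=900):
    """Sketch: partition a large query into subqueries of <= target_size hits."""
    subqueries, used = [], []
    for m in modifiers:
        q = f'{base_query} AND "{m}"'
        if hit_count(q) <= target_size:  # hit_count is hypothetical
            subqueries.append(q)
            used.append(m)
    # One exclusion query sweeps up whatever the modifiers missed, so the
    # subqueries together still cover the full result set.
    rest = " OR ".join(f'"{m}"' for m in used)
    subqueries.append(f'{base_query} AND NOT ({rest})')
    return subqueries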

Features

  • πŸš€ GPU Acceleration: Optimized for NVIDIA GPUs (tested on RTX 4090)
  • πŸ“Š OpenAlex Integration: Downloads open-access metadata for analysis
  • 🧠 AI-Powered Extraction:
    • KeyBERT for semantic keyword extraction
    • spaCy for named entity recognition with type classification
  • πŸ“ˆ Smart Query Generation: Creates optimal subqueries using:
    • Single modifiers
    • Modifier combinations
    • Exclusion queries
  • πŸ“… Temporal Splitting: Year-by-year processing for better control
  • πŸ“‹ Complete Coverage Tracking: Ensures no results are missed
  • πŸ’Ύ Incremental Saving: Results saved as processing continues
  • πŸ“Š Comprehensive Analysis: Statistics and insights on query effectiveness

Installation

Basic Installation

# Clone the repository
git clone https://github.com/hcss-utils/scholar-query-splitter.git
cd scholar-query-splitter

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Download spaCy model
python -m spacy download en_core_web_sm

GPU Setup (Optional but Recommended)

For GPU acceleration with NVIDIA GPUs:

# Install CUDA support for spaCy
pip install -r requirements-gpu.txt

# Or manually:
pip install cupy-cuda12x      # For CUDA 12.x
pip install "spacy[cuda12x]"  # Quoted so the shell does not expand the brackets
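
To confirm that spaCy actually sees the GPU (spacy.prefer_gpu() returns True when a usable GPU is found):

# Verify spaCy can use the GPU
python -c "import spacy; print(spacy.prefer_gpu())"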

Usage

Discovery Mode - Find Research Gaps

Use this mode to explore interesting query combinations and find research niches:

# Basic usage
python main.py '(police OR "law enforcement") AND (strategic OR strategy)'

# With GPU and custom parameters
python main.py '(police OR "law enforcement") AND (strategic OR strategy)' \
    --use-gpu \
    --max-results 2000 \
    --top-keywords 30 \
    --top-entities 30 \
    --output-csv results.csv

Exhaustive Mode - Get ALL Results

Use this mode when you need to capture ALL results from queries returning >1000 hits:

# Split a query exhaustively for complete coverage
python main_exhaustive.py '(police OR "law enforcement") AND (strategic OR strategy)' \
    --start-year 2020 \
    --end-year 2024 \
    --use-gpu

# With all options:
#   --target-size 800     max hits per subquery
#   --max-metadata 10000  more metadata = better modifiers
#   --top-keywords 200    extract more keywords
#   --top-entities 200    extract more entities
python main_exhaustive.py '(police OR "law enforcement") AND (strategic OR strategy)' \
    --start-year 2020 \
    --end-year 2024 \
    --target-size 800 \
    --max-metadata 10000 \
    --top-keywords 200 \
    --top-entities 200 \
    --use-gpu \
    --spacy-model en_core_web_lg

Quick Test

Test the exhaustive mode with a single year:

python test_exhaustive.py

Parameters

Common Parameters

  • base_query: Your search query (required)
  • --use-gpu: Enable GPU acceleration
  • --spacy-model: spaCy model to use (default: en_core_web_sm)
  • --top-keywords: Number of keywords to extract
  • --top-entities: Number of entities to extract

Discovery Mode Specific

  • --max-results: Max results from OpenAlex (default: 1000)
  • --max-queries: Max subqueries to generate (default: 100)
  • --output-csv: Output CSV filename

Exhaustive Mode Specific

  • --start-year & --end-year: Year range (REQUIRED)
  • --target-size: Max hits per query (default: 900)
  • --max-metadata: Max metadata to analyze (default: 5000)

Output Files

Discovery Mode

  • query_results.csv: Queries with their Google Scholar hit counts (a loading example follows below)
  • query_results_analysis.json: Statistical analysis
  • outputs/keywords_*.json: Extracted keywords
  • outputs/entities_*.json: Extracted entities with types
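
For a quick look at the discovery output, the CSV loads directly into pandas. The query and hits column names here are assumptions about the schema; adjust them to the actual header:

import pandas as pd

df = pd.read_csv("query_results.csv")
# Keep subqueries that fit under Scholar's 1000-result limit, rarest first.
niche = df[df["hits"].between(1, 1000)].sort_values("hits")
print(niche.head(20))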

Exhaustive Mode

  • outputs/exhaustive/all_queries_*.csv: Complete query list
  • outputs/exhaustive/final_report_*.json: Coverage report
  • outputs/exhaustive/year_*_results.json: Year-specific results
  • outputs/exhaustive/coverage_map_*.json: Coverage tracking

How It Works

1. Metadata Collection

  • Queries OpenAlex for relevant papers
  • Downloads metadata including titles, abstracts, and concepts (see the sketch below)
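
A minimal sketch of this step against the public OpenAlex API. The endpoint and parameters are OpenAlex's documented ones; how the tool itself pages through results is an assumption:

import requests

# Fetch one page of metadata from OpenAlex (maximum page size is 200).
resp = requests.get(
    "https://api.openalex.org/works",
    params={
        "search": '(police OR "law enforcement") AND (strategic OR strategy)',
        "filter": "publication_year:2023",
        "per-page": 200,
        "mailto": "you@example.org",  # polite-pool identification
    },
    timeout=30,
)
works = resp.json()["results"]
titles = [w["title"] for w in works]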

2. Modifier Extraction

  • Keywords: Extracts meaningful uni-/bi-grams using KeyBERT
  • Entities: Identifies organizations, locations, and people using spaCy
  • Filtering: Removes stopwords and terms already in the base query (extraction is sketched below)
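
A minimal sketch of the extraction step with the two libraries named above, using their default models; the tool's exact settings are assumptions:

from keybert import KeyBERT
import spacy

text = "Community policing strategies adopted by police departments across the United States."

# Keywords: uni- and bi-grams ranked by semantic similarity to the document.
kw_model = KeyBERT()
keywords = kw_model.extract_keywords(
    text, keyphrase_ngram_range=(1, 2), stop_words="english", top_n=10
)

# Entities: typed spans (ORG, GPE, PERSON, ...) from spaCy's NER pipeline.
nlp = spacy.load("en_core_web_sm")
entities = [(ent.text, ent.label_) for ent in nlp(text).ents]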

3. Query Generation

Discovery Mode

  • Combines modifiers to create interesting subqueries
  • Focuses on finding niche combinations

Exhaustive Mode

  • Tests modifier effectiveness
  • Creates optimal splitting strategy
  • Uses combinations and exclusions to ensure complete coverage (the three query shapes are sketched below)
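
The generated subqueries are plain boolean strings. Built by hand here to show the three shapes (single modifier, combination, exclusion); the modifiers are examples, not the tool's output:

base = '(police OR "law enforcement") AND (strategic OR strategy)'

single      = f'{base} AND "community policing"'
combination = f'{base} AND "community policing" AND "united states"'
# The exclusion query sweeps up records that no inclusion modifier matched.
exclusion   = f'{base} AND NOT ("community policing" OR "united states")'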

4. Validation

  • Counts actual Google Scholar hits for each query
  • Tracks coverage to ensure completeness (see the accounting sketch below)
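
Coverage checking reduces to simple accounting: every subquery must fit under the result limit, and if the subqueries' summed hit counts fall short of the year's total, records have been missed (overlap means the sum can legitimately exceed the total). A sketch of that accounting, not the tool's tracker:

def coverage_ok(subquery_hits, total_hits, target_size=900):
    """subquery_hits: per-subquery Google Scholar hit counts for one year."""
    # Each piece must be fully retrievable within Scholar's limit.
    all_small = all(h <= target_size for h in subquery_hits)
    # If every record matched at least one subquery, the counts (with
    # overlap) sum to at least total_hits; a smaller sum means gaps.
    covers_all = sum(subquery_hits) >= total_hits
    return all_small and covers_all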

Entity Types

The tool extracts and classifies these entity types:

  • ORG: Organizations (FBI, United Nations)
  • GPE: Geopolitical entities (United States, London)
  • PERSON: People names
  • LOC: Locations (Africa, Pacific Ocean)
  • FAC: Facilities (JFK Airport)
  • EVENT: Events (World War II)

Tips for Best Results

  1. Start with Exhaustive Mode for systematic literature reviews
  2. Use GPU acceleration for 5-10x faster processing
  3. Extract more modifiers for better coverage (--top-keywords 200)
  4. Process year by year for large time ranges
  5. Use larger spaCy models for better entity recognition

Example Workflow

# Step 1: Test with one year
python main_exhaustive.py 'your query' --start-year 2023 --end-year 2023 --use-gpu

# Step 2: If successful, run full range
python main_exhaustive.py 'your query' --start-year 2015 --end-year 2024 --use-gpu

# Step 3: Use generated queries to download papers systematically
# The CSV output contains all queries needed for complete coverage (see the loop below)
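
Step 3 can be scripted by iterating over the generated CSVs. The query column name is an assumption about the schema; plug in whatever downloader you use:

import csv
import glob

for path in glob.glob("outputs/exhaustive/all_queries_*.csv"):
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            query = row["query"]  # column name is an assumption
            print(query)          # feed this to your downloader of choice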

Performance

With an RTX 4090 and 90 GB of RAM:

  • Processes ~1000 documents in 2-3 minutes
  • Extracts keywords/entities from 10,000 documents in ~10 minutes
  • GPU acceleration provides 5-10x speedup for NLP tasks

Troubleshooting

CUDA/GPU Issues

# Check GPU availability
python -c "import torch; print(torch.cuda.is_available())"

Memory Issues

  • Reduce --max-metadata if running out of RAM
  • Process fewer years at once

Rate Limiting

  • Use --use-proxy for Google Scholar requests
  • Add delays between requests

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this tool in your research, please cite:

@software{scholar_query_splitter,
  title = {Scholar Query Splitter: Automated Query Decomposition for Complete Google Scholar Coverage},
  author = {HCSS Utils},
  year = {2024},
  url = {https://github.com/hcss-utils/scholar-query-splitter}
}

Acknowledgments

  • OpenAlex for providing free academic metadata
  • The KeyBERT and spaCy teams for excellent NLP tools
  • Google Scholar for being an invaluable research resource
