A comprehensive pipeline for performing Named Entity Recognition (NER) on text documents using spaCy.
- Extract standard named entities using spaCy's pre-trained models
- Add custom entity patterns for domain-specific NER
- Generate visualisations of recognised entities
- Produce detailed analysis of entity type distribution
- Save results in structured formats
- Python 3.12+
- uv package manager
-
Clone the repository:
git clone https://github.com/ai-mindset/ner_playground.git cd ner_playground
-
Create a virtual environment with UV:
uv venv source .venv/bin/activate # On Windows: .venv\Scripts\activate
-
Install the package and development dependencies:
uv pip install -e ".[dev]"
-
Install the spaCy model:
# Download the model wheel directly curl -LO https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl # Install it directly with UV uv pip install en_core_web_sm-3.8.0-py3-none-any.whl
Run the NER analysis on a text file:
python -m src.ner.main --input texts/sample.txt --output plots/entities.html
from src.ner.main import perform_ner_analysis
# Sample text for analysis
text = """spaCy is an open-source software library for advanced natural language processing, written in Python and Cython. The main developers are Matthew Honnibal and Ines Montani."""
# Run the analysis
results = perform_ner_analysis(text)
# Access the entities found
entities_df = results["all_entities"]
print(entities_df)
# The HTML visualisation is available at
html_viz = results["visualization_html"]
ner_playground/
├── pyproject.toml # Project configuration and dependencies
├── src/
│ └── ner/
│ └── main.py # Main NER pipeline implementation
├── texts/ # Sample texts for analysis
└── plots/ # Output directory for visualisations
The NER pipeline performs the following steps:
- Loads the spaCy language model (
en_core_web_sm
) - Processes the input text to create a spaCy document
- Extracts standard named entities (people, organisations, locations, etc.)
- Applies custom entity patterns for domain-specific terminology
- Combines all entities and sorts them by position in the text
- Generates a summary of entity type distribution
- Creates an HTML visualisation of the entities in context
- Returns structured results for further analysis
You can customise the NER pipeline by modifying the custom patterns in src/ner/main.py
. The default implementation includes patterns for programming languages and libraries.
If you encounter errors with spaCy model loading:
-
Verify the model is installed correctly:
python -c "import spacy; print(spacy.util.get_installed_models())"
-
If the model is not listed, reinstall using the method in the Installation section.
MIT