LexiScan Auto is an AI-powered legal document analysis system that extracts structured entities from legal contracts. The system processes PDF contracts and automatically identifies key legal information such as dates, parties, jurisdictions, monetary values, and signatories.
This project combines OCR, Natural Language Processing (NLP), and rule-based post-processing to convert unstructured legal documents into structured JSON data.
Named Entity Recognition (NER) is handled by a BERT transformer model loaded through Hugging Face's transformers library. The default model in the examples and tests is dslim/bert-base-NER, which transformers downloads on first use.
To speed up the first run (recommended), pre-download the model once in your environment:
Windows / PowerShell:
& .\venv\Scripts\Activate.ps1
@'
from transformers import AutoTokenizer, AutoModelForTokenClassification
AutoTokenizer.from_pretrained("dslim/bert-base-NER")
AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
print("Downloaded")
'@ | python -
A small smoke test (smoke_test.py) is included. It demonstrates the transformer integration and prints the extracted entities along with the regex fallbacks for DATE and MONEY. Run it like this:
& .\venv\Scripts\Activate.ps1
python -u smoke_test.py
If the model download stalls, check your internet connectivity and disk space. See requirements.txt for pinned dependency versions.
- Extracts entities from legal contracts
- Handles scanned or image-based PDFs using OCR
- Transformer-based BERT Named Entity Recognition (NER) model
- Rule-based layer to improve precision
- REST API using FastAPI
- End-to-end automated pipeline
- Unit tests using pytest
The system extracts the following entities:
- DATE – contract dates and deadlines
- PARTY – organizations involved in the contract
- PERSON – signatories or individuals
- JURISDICTION – locations and governing law
- MONEY – contract values and payments
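As a rough sketch of how the transformer stage might map onto these entity types: dslim/bert-base-NER emits the generic CoNLL-2003 labels PER, ORG, LOC and MISC, so some mapping onto the labels above is needed. The mapping, the score threshold, and the function names below are assumptions made for illustration, not the code in src/ner_transformer.py. DATE and MONEY are absent because the base model does not produce them.

```python
from collections import defaultdict

# Assumed mapping from the CoNLL-2003 labels emitted by dslim/bert-base-NER
# to the entity types listed above. Hypothetical, for illustration only.
LABEL_MAP = {"ORG": "PARTY", "PER": "PERSON", "LOC": "JURISDICTION"}


def group_entities(predictions, min_score=0.5):
    """Group aggregated pipeline predictions by mapped label, dropping duplicates."""
    grouped = defaultdict(list)
    for pred in predictions:
        label = LABEL_MAP.get(pred["entity_group"])
        text = pred["word"].strip()
        # Skip unmapped labels (e.g. MISC), low-confidence spans, and empty text.
        if label is None or pred["score"] < min_score or not text:
            continue
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


def run_ner(text, model_name="dslim/bert-base-NER"):
    """Run the Hugging Face token-classification pipeline on one chunk of text."""
    from transformers import pipeline  # imported lazily: loading the model is slow

    ner = pipeline("token-classification", model=model_name,
                   aggregation_strategy="simple")
    return group_entities(ner(text))
```

Calling run_ner on a paragraph of OCR text would return a dictionary such as {"PARTY": [...], "PERSON": [...]}; a real pipeline would also need to split long contracts into chunks that fit the model's input limit.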
PDF Contract
│
▼
OCR (pdf2image + Tesseract)
│
▼
Transformer NER (Hugging Face BERT)
│
▼
Regex / Rule-based Post Processing
│
▼
Structured JSON Output
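The regex stage in this flow can be pictured as a fallback that fills in DATE and MONEY, which a general-purpose NER model does not label. The patterns and the function name below are hypothetical and deliberately narrow (ISO dates, "Month D, YYYY" dates, and dollar amounts); the rules in src/post_processing.py may well differ.

```python
import re

# Illustrative patterns only; real contracts need much broader coverage.
MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")
DATE_RE = re.compile(rf"\b(?:\d{{4}}-\d{{2}}-\d{{2}}|{MONTHS} \d{{1,2}}, \d{{4}})\b")
MONEY_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")


def apply_regex_fallback(text, entities):
    """Return a copy of `entities` with regex-found DATE and MONEY values added."""
    merged = {label: list(values) for label, values in entities.items()}
    for label, pattern in (("DATE", DATE_RE), ("MONEY", MONEY_RE)):
        found = merged.setdefault(label, [])
        for match in pattern.findall(text):
            if match not in found:  # keep first occurrence, preserve order
                found.append(match)
    return merged
```

The function leaves the NER output untouched and only appends, so a regex miss can never remove an entity the model already found.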
legal-ner-lexiscan-main
│
├── data
│   └── raw_pdfs
│       └── sample.pdf
│
├── models
│   └── legal_ner_model
│
├── src
│   ├── api.py
│   ├── extract_entities.py
│   ├── main.py
│   ├── ner_transformer.py
│   ├── ocr.py
│   ├── post_processing.py
│   ├── train_ner.py
│   └── evaluate.py
│
├── tests
│   ├── test_entities.py
│   └── test_pipeline.py
│
├── smoke_test.py
├── temp_uploads
├── README.md
└── requirements.txt
git clone https://github.com/your-username/legal-ner-lexiscan.git
cd legal-ner-lexiscan
python -m venv venv
Activate:
Windows
venv\Scripts\activate
Mac/Linux
source venv/bin/activate
pip install -r requirements.txt
Install Tesseract OCR
https://github.com/tesseract-ocr/tesseract
Install Poppler (required by pdf2image; the link below provides Windows builds)
https://github.com/oschwartz10612/poppler-windows
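Both tools are external binaries rather than Python packages, so pip install does not provide them. One quick way to confirm they are reachable is to look for their executables on PATH. The sketch below assumes the usual executable names, tesseract for Tesseract OCR and pdftoppm for Poppler (one of the tools pdf2image invokes); the helper name is made up for this example.

```python
import shutil

# Executable names assumed here: `tesseract` (Tesseract OCR) and
# `pdftoppm` (part of Poppler, used by pdf2image to rasterize pages).
REQUIRED_BINARIES = {"Tesseract OCR": "tesseract", "Poppler": "pdftoppm"}


def check_ocr_dependencies():
    """Map each required tool to the path of its executable, or None if not on PATH."""
    return {name: shutil.which(exe) for name, exe in REQUIRED_BINARIES.items()}


if __name__ == "__main__":
    for name, path in check_ocr_dependencies().items():
        print(f"{name}: {path or 'NOT FOUND on PATH'}")
```

If Poppler is installed but not on PATH, pdf2image's convert_from_path also accepts a poppler_path argument pointing at the folder that contains the binaries.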
Start the FastAPI server:
uvicorn src.api:app --reload
Open API documentation:
http://127.0.0.1:8000/docs
POST /extract
Upload a PDF contract and the system will return extracted entities.
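A minimal client sketch for this endpoint, using only the standard library. It assumes the server is running locally on port 8000 and that the upload is read from a multipart form field named "file"; the field name is a guess and should be checked against src/api.py or the /docs page. Both function names are hypothetical.

```python
import json
import uuid
from pathlib import Path
from urllib import request


def build_multipart(field, filename, payload):
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + payload + tail, f"multipart/form-data; boundary={boundary}"


def extract_entities(pdf_path, base_url="http://127.0.0.1:8000", field="file"):
    """POST a PDF to /extract and return the decoded JSON response."""
    path = Path(pdf_path)
    # `field` is an assumed form-field name, not confirmed by the project.
    body, content_type = build_multipart(field, path.name, path.read_bytes())
    req = request.Request(f"{base_url}/extract", data=body,
                          headers={"Content-Type": content_type}, method="POST")
    with request.urlopen(req, timeout=120) as resp:
        return json.load(resp)
```

With the server started, extract_entities("data/raw_pdfs/sample.pdf") would return the parsed response as a dictionary. The generous timeout reflects that OCR and model inference on a scanned contract can take a while.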
{
  "filename": "sample.pdf",
  "entities": {
    "DATE": [
      "2025-01-15",
      "2027-01-14"
    ],
    "PARTY": [
      "GlobalTech Innovations Pvt Ltd",
      "Apex Financial Holdings Inc"
    ],
    "JURISDICTION": [
      "State of California, USA",
      "San Francisco, California"
    ],
    "MONEY": [
      "$2,500,000",
      "$500,000"
    ],
    "PERSON": [
      "Arjun Mehta",
      "Laura Thompson"
    ]
  }
}
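Every value in this response is a string, so downstream code usually has to convert them before doing arithmetic or date comparisons. The sketch below assumes the response shape shown above, with ISO-formatted dates and US dollar amounts; the helper names are invented for illustration.

```python
from datetime import date
from decimal import Decimal


def parse_money(value):
    """Convert a string such as '$2,500,000' to a Decimal (US dollar format assumed)."""
    return Decimal(value.replace("$", "").replace(",", ""))


def summarize(response):
    """Derive typed values from an /extract response shaped like the example above."""
    entities = response["entities"]
    dates = sorted(date.fromisoformat(d) for d in entities.get("DATE", []))
    amounts = [parse_money(m) for m in entities.get("MONEY", [])]
    return {
        "filename": response["filename"],
        "first_date": dates[0] if dates else None,
        "last_date": dates[-1] if dates else None,
        "largest_amount": max(amounts, default=None),
    }
```

Decimal is used rather than float so that contract values are not subject to binary rounding.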
pytest
Expected output:
2 passed
- Python
- spaCy
- transformers
- BERT (dslim/bert-base-NER)
- FastAPI
- pdf2image
- Tesseract OCR
- pytest
- Legal contract analysis
- Compliance automation
- Document intelligence systems
- Enterprise document processing
- Support multiple document formats
- Improve NER model accuracy with larger datasets
- Deploy using Docker
- Integrate with cloud storage APIs
Riddhi Saraogi
B.Tech Computer Science Artificial Intelligence & Data Science Enthusiast