riddhisaraogi11/Updated-Legal-NER

Automated Legal Entity Extractor (LexiScan Auto)

Overview

LexiScan Auto is an AI-powered legal document analysis system that extracts structured entities from legal contracts. The system processes PDF contracts and automatically identifies key legal information such as dates, parties, jurisdictions, monetary values, and signatories.

This project combines OCR, Natural Language Processing (NLP), and rule-based post-processing to convert unstructured legal documents into structured JSON data.


Features

Transformer model (BERT)

This project now uses a BERT transformer model for Named Entity Recognition (NER) via Hugging Face's transformers library. The default model used in examples and tests is dslim/bert-base-NER, which transformers downloads from the Hugging Face Hub on first use and then caches locally.

To speed up the first run (recommended), pre-download the model once in your environment:

Windows / PowerShell:

& .\venv\Scripts\Activate.ps1
@'
from transformers import AutoTokenizer, AutoModelForTokenClassification
AutoTokenizer.from_pretrained("dslim/bert-base-NER")
AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
print("Downloaded")
'@ | python -

The included smoke test (smoke_test.py) exercises the transformer integration and prints the extracted entities along with the regex fallbacks for DATE and MONEY. Run it like this:

& .\venv\Scripts\Activate.ps1
python -u smoke_test.py
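
To illustrate what a regex fallback for DATE and MONEY might look like, here is a minimal sketch that uses only the standard library. The patterns and the function name `regex_fallback` are assumptions made for this example; the rules in src/post_processing.py may differ.

```python
import re

# Assumed patterns for illustration only; the project's real rules may differ.
MONEY_RE = re.compile(r"(?:\$|USD\s?)(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d{2})?")
DATE_RE = re.compile(
    r"\b\d{4}-\d{2}-\d{2}\b"            # ISO style: 2025-01-15
    r"|\b\d{1,2}/\d{1,2}/\d{4}\b"       # slash style: 01/15/2025
    r"|\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"  # January 15, 2025
)


def regex_fallback(text: str) -> dict:
    """Return DATE and MONEY strings found by pattern matching alone."""
    return {
        "DATE": DATE_RE.findall(text),
        "MONEY": MONEY_RE.findall(text),
    }


if __name__ == "__main__":
    sample = "Effective January 15, 2025, the fee is $2,500,000, due by 2027-01-14."
    print(regex_fallback(sample))
```

Patterns like these favour precision over recall: they only fire on a few unambiguous formats, which suits a fallback layer that runs alongside a statistical model.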

If the model download stalls, check your internet connectivity and disk space. See requirements.txt for pinned dependency versions.

  • Extracts entities from legal contracts
  • Handles scanned or image-based PDFs using OCR
  • Transformer-based BERT Named Entity Recognition (NER) model
  • Rule-based layer to improve precision
  • REST API using FastAPI
  • End-to-end automated pipeline
  • Unit tests using pytest

Extracted Entities

The system extracts the following entities:

  • DATE – contract dates and deadlines
  • PARTY – organizations involved in the contract
  • PERSON – signatories or individuals
  • JURISDICTION – locations and governing law
  • MONEY – contract values and payments
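
dslim/bert-base-NER is trained on CoNLL-2003 and emits the labels PER, ORG, LOC, and MISC, so its output has to be translated into the entity names above, and DATE and MONEY have to come from the regex layer. The sketch below shows one way that translation could work. The mapping, the score threshold, and the helper names `load_ner` and `group_predictions` are assumptions for illustration, not the code in src/ner_transformer.py.

```python
from collections import defaultdict

# Assumed mapping from the model's CoNLL-style labels to this project's
# entity names; MISC has no counterpart here and is dropped.
LABEL_MAP = {"PER": "PERSON", "ORG": "PARTY", "LOC": "JURISDICTION"}


def load_ner(model_name: str = "dslim/bert-base-NER"):
    """Build a Hugging Face token-classification pipeline (downloads on first use)."""
    from transformers import pipeline  # imported lazily: heavy dependency

    return pipeline(
        "token-classification",
        model=model_name,
        aggregation_strategy="simple",  # merge word pieces into whole entities
    )


def group_predictions(predictions, min_score: float = 0.5) -> dict:
    """Group aggregated pipeline output by project entity type, without duplicates."""
    grouped = defaultdict(list)
    for pred in predictions:
        label = LABEL_MAP.get(pred["entity_group"])
        word = pred["word"].strip()
        if label is None or not word or pred["score"] < min_score:
            continue  # unmapped label, empty span, or low-confidence prediction
        if word not in grouped[label]:
            grouped[label].append(word)
    return dict(grouped)


# Usage (requires transformers and a model download):
# ner = load_ner()
# print(group_predictions(ner("Arjun Mehta signed for Apex Financial Holdings Inc.")))
```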

Project Architecture

PDF Contract
     │
     ▼
OCR (pdf2image + Tesseract)
     │
     ▼
Transformer NER (Hugging Face BERT)
     │
     ▼
Regex / Rule-based Post Processing
     │
     ▼
Structured JSON Output
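
The stages in the diagram can be chained as plain functions. The following is a sketch under stated assumptions rather than the contents of src/main.py: `ocr_pdf` and `run_pipeline` are hypothetical names, and the use of pytesseract as the Python binding for Tesseract is assumed. Passing the stages in as callables keeps each one replaceable and makes the chain testable without OCR or a model.

```python
import os
from typing import Callable


def ocr_pdf(pdf_path: str) -> str:
    """OCR every page of a PDF with pdf2image (Poppler) and pytesseract (Tesseract)."""
    from pdf2image import convert_from_path  # lazy imports: optional heavy deps
    import pytesseract  # assumed binding; the project may call Tesseract differently

    pages = convert_from_path(pdf_path, dpi=300)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def run_pipeline(
    pdf_path: str,
    ocr: Callable[[str], str],
    ner: Callable[[str], dict],
    post_process: Callable[[str, dict], dict],
) -> dict:
    """Chain OCR, NER, and rule-based post-processing into one structured result."""
    text = ocr(pdf_path)                      # PDF -> raw text
    entities = ner(text)                      # raw text -> model entities
    entities = post_process(text, entities)   # add or correct entities with rules
    return {"filename": os.path.basename(pdf_path), "entities": entities}
```

In a test, each stage can be replaced by a stub, for example `run_pipeline("sample.pdf", ocr=lambda p: "text", ner=lambda t: {}, post_process=lambda t, e: e)`.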

Project Structure

legal-ner-lexiscan-main
│
├── data
│   └── raw_pdfs
│       └── sample.pdf
│
├── models
│   └── legal_ner_model
│
├── src
│   ├── api.py
│   ├── extract_entities.py
│   ├── main.py
│   ├── ner_transformer.py
│   ├── ocr.py
│   ├── post_processing.py
│   ├── train_ner.py
│   └── evaluate.py
│
├── tests
│   ├── test_entities.py
│   └── test_pipeline.py
│
├── smoke_test.py
├── temp_uploads
├── README.md
└── requirements.txt

Installation

1. Clone the Repository

git clone https://github.com/riddhisaraogi11/Updated-Legal-NER.git
cd Updated-Legal-NER

2. Create Virtual Environment

python -m venv venv

Activate:

Windows

venv\Scripts\activate

Mac/Linux

source venv/bin/activate

3. Install Dependencies

pip install -r requirements.txt

4. Install OCR Dependencies

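Install Tesseract OCR (the OCR engine):

https://github.com/tesseract-ocr/tesseract

Install Poppler, which pdf2image uses to render PDF pages (the link below provides Windows builds):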

https://github.com/oschwartz10612/poppler-windows
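
A quick way to confirm that both tools are installed is to check whether their executables are on PATH. This small sketch is not part of the repository and `check_ocr_tools` is a hypothetical helper. It looks for `tesseract` and for `pdftoppm`, the Poppler program that pdf2image invokes.

```python
import shutil


def check_ocr_tools() -> dict:
    """Report whether the Tesseract and Poppler command-line tools are on PATH."""
    tools = {"tesseract": "tesseract", "poppler": "pdftoppm"}
    return {name: shutil.which(exe) is not None for name, exe in tools.items()}


if __name__ == "__main__":
    for name, found in check_ocr_tools().items():
        print(f"{name}: {'found' if found else 'missing from PATH'}")
```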


Running the API

Start the FastAPI server:

uvicorn src.api:app --reload

Open API documentation:

http://127.0.0.1:8000/docs

API Endpoint

Upload Contract for Entity Extraction

POST /extract

Upload a PDF contract and the system will return extracted entities.
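
As a rough illustration of calling the endpoint from Python, the sketch below posts a PDF using only the standard library. The form-field name `file` is an assumption (check src/api.py for the name the endpoint actually expects), and `encode_multipart` and `extract_entities` are hypothetical helpers, not part of the project.

```python
import json
import os
import urllib.request
import uuid


def encode_multipart(field: str, filename: str, data: bytes) -> tuple:
    """Encode one file as multipart/form-data; returns (body, Content-Type value)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract_entities(pdf_path: str, url: str = "http://127.0.0.1:8000/extract") -> dict:
    """POST a PDF to the running API and return the decoded JSON response."""
    with open(pdf_path, "rb") as fh:
        # "file" is an assumed form-field name, not confirmed by the README.
        body, content_type = encode_multipart(
            "file", os.path.basename(pdf_path), fh.read()
        )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (requires the server from "Running the API" to be up):
# print(extract_entities("data/raw_pdfs/sample.pdf"))
```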


Example Output

{
 "filename": "sample.pdf",
 "entities": {
   "DATE": [
      "2025-01-15",
      "2027-01-14"
   ],
   "PARTY": [
      "GlobalTech Innovations Pvt Ltd",
      "Apex Financial Holdings Inc"
   ],
   "JURISDICTION": [
      "State of California, USA",
      "San Francisco, California"
   ],
   "MONEY": [
      "$2,500,000",
      "$500,000"
   ],
   "PERSON": [
      "Arjun Mehta",
      "Laura Thompson"
   ]
 }
}
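
The sample output lists dates in ISO format and each value only once. One way a rule-based layer could produce that shape is sketched below. The accepted input formats and the helper names `to_iso_date` and `clean_entities` are assumptions for illustration, and the `%m/%d/%Y` entry assumes US month-first ordering.

```python
from datetime import datetime
from typing import Optional

# Assumed input formats; extend the tuple to match the contracts being processed.
DATE_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%d %B %Y", "%m/%d/%Y")


def to_iso_date(raw: str) -> Optional[str]:
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None


def clean_entities(entities: dict) -> dict:
    """Normalize DATE values to ISO format and drop duplicates, keeping order."""
    cleaned = {}
    for label, values in entities.items():
        if label == "DATE":
            # Keep the original string when a date cannot be parsed.
            values = [to_iso_date(value) or value for value in values]
        cleaned[label] = list(dict.fromkeys(value.strip() for value in values))
    return cleaned
```

Normalizing before deduplicating matters: "January 15, 2025" and "2025-01-15" only collapse into one entry once both are in the same format.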

Running Tests

pytest

Expected output:

2 passed

Technologies Used

  • Python
  • spaCy
  • transformers
  • BERT (dslim/bert-base-NER)
  • FastAPI
  • pdf2image
  • Tesseract OCR
  • pytest

Applications

  • Legal contract analysis
  • Compliance automation
  • Document intelligence systems
  • Enterprise document processing

Future Improvements

  • Support multiple document formats
  • Improve NER model accuracy with larger datasets
  • Deploy using Docker
  • Integrate with cloud storage APIs

Author

Riddhi Saraogi

B.Tech Computer Science Artificial Intelligence & Data Science Enthusiast

