GitHub - oidlabs-com/Lexoid: Multimodal document parser for high quality data understanding and extraction

 ___      _______  __   __  _______  ___   ______  
|   |    |       ||  |_|  ||       ||   | |      | 
|   |    |    ___||       ||   _   ||   | |  _    |
|   |    |   |___ |       ||  | |  ||   | | | |   |
|   |___ |    ___| |     | |  |_|  ||   | | |_|   |
|       ||   |___ |   _   ||       ||   | |       |
|_______||_______||__| |__||_______||___| |______|

Lexoid is an efficient document parsing library that supports both LLM-based and non-LLM-based (static) PDF document parsing.

Documentation

Motivation:

Use the multi-modal advancement of LLMs
Enable convenience for users
Collaborate with a permissive license

Installation

Installing with pip

pip install lexoid

To use LLM-based parsing, define the following environment variables or create a .env file with the following definitions

OPENAI_API_KEY=""
GOOGLE_API_KEY=""

Optionally, to use Playwright for retrieving web content (instead of the requests library):

playwright install --with-deps --only-shell chromium

Building `.whl` from source

make build

Creating a local installation

To install dependencies:

make install

or, to install with dev-dependencies:

make dev

To activate virtual environment:

source .venv/bin/activate

Usage

Example Notebook

Example Colab Notebook

Here's a quick example to parse documents using Lexoid:

from lexoid.api import parse
from lexoid.api import ParserType

parsed_md = parse("https://www.justice.gov/eoir/immigration-law-advisor", parser_type="LLM_PARSE")["raw"]
# or
pdf_path = "path/to/immigration-law-advisor.pdf"
parsed_md = parse(pdf_path, parser_type="LLM_PARSE")["raw"]

print(parsed_md)

Parameters

path (str): The file path or URL.
parser_type (str, optional): The type of parser to use ("LLM_PARSE" or "STATIC_PARSE"). Defaults to "AUTO".
pages_per_split (int, optional): Number of pages per split for chunking. Defaults to 4.
max_threads (int, optional): Maximum number of threads for parallel processing. Defaults to 4.
**kwargs: Additional arguments for the parser.

Supported API Providers

Google
OpenAI
Hugging Face
Together AI
OpenRouter
Fireworks

Benchmark

Results aggregated across 11 documents.

Note: Benchmarks are currently done in the zero-shot setting.

Rank	Model	SequenceMatcher Similarity	TFIDF Similarity	Time (s)	Cost ($)
1	gemini-2.5-pro	0.907 (±0.151)	0.973 (±0.053)	22.23	0.02305
2	AUTO	0.905 (±0.111)	0.967 (±0.051)	10.31	0.00068
3	gemini-2.5-flash	0.902 (±0.151)	0.984 (±0.030)	48.67	0.01051
4	gemini-2.0-flash	0.900 (±0.127)	0.971 (±0.040)	12.43	0.00081
5	mistral-ocr-latest	0.890 (±0.097)	0.930 (±0.095)	5.69	0.00127
6	claude-3-5-sonnet-20241022	0.873 (±0.195)	0.937 (±0.095)	16.86	0.01779
7	gemini-1.5-flash	0.868 (±0.198)	0.965 (±0.041)	17.19	0.00044
8	claude-sonnet-4-20250514	0.814 (±0.197)	0.903 (±0.150)	21.99	0.02045
9	accounts/fireworks/models/llama4-scout-instruct-basic	0.804 (±0.242)	0.931 (±0.067)	9.76	0.00087
10	claude-opus-4-20250514	0.798 (±0.230)	0.878 (±0.159)	21.01	0.09233
11	gpt-4o	0.796 (±0.264)	0.898 (±0.117)	28.23	0.01473
12	accounts/fireworks/models/llama4-maverick-instruct-basic	0.792 (±0.206)	0.914 (±0.128)	10.71	0.00149
13	gemini-1.5-pro	0.782 (±0.341)	0.833 (±0.252)	27.13	0.01275
14	gpt-4.1-mini	0.767 (±0.243)	0.807 (±0.197)	22.64	0.00352
15	gpt-4o-mini	0.727 (±0.245)	0.832 (±0.136)	17.20	0.00650
16	meta-llama/Llama-Vision-Free	0.682 (±0.223)	0.847 (±0.135)	12.31	0.00000
17	meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo	0.677 (±0.226)	0.850 (±0.134)	7.23	0.00015
18	microsoft/phi-4-multimodal-instruct	0.665 (±0.258)	0.800 (±0.217)	10.96	0.00049
19	claude-3-7-sonnet-20250219	0.634 (±0.395)	0.752 (±0.298)	70.10	0.01775
20	google/gemma-3-27b-it	0.624 (±0.357)	0.750 (±0.327)	24.51	0.00020
21	gpt-4.1	0.622 (±0.314)	0.782 (±0.191)	34.66	0.01461
22	meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo	0.559 (±0.233)	0.822 (±0.119)	27.74	0.01102
23	ds4sd/SmolDocling-256M-preview	0.486 (±0.378)	0.583 (±0.355)	108.91	0.00000
24	qwen/qwen-2.5-vl-7b-instruct	0.469 (±0.364)	0.617 (±0.441)	13.23	0.00060

Name		Name	Last commit message	Last commit date
Latest commit History 209 Commits
.github/workflows		.github/workflows
docs		docs
examples		examples
lexoid		lexoid
tests		tests
.env_example		.env_example
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Motivation:

Installation

Installing with pip

Building `.whl` from source

Creating a local installation

Usage

Parameters

Supported API Providers

Benchmark

About

Uh oh!

Releases 18

Uh oh!

Contributors 7

Uh oh!

Languages

License

oidlabs-com/Lexoid

Folders and files

Latest commit

History

Repository files navigation

Motivation:

Installation

Installing with pip

Building .whl from source

Creating a local installation

Usage

Parameters

Supported API Providers

Benchmark

About

Topics

Resources

License

Code of conduct

Uh oh!

Stars

Watchers

Forks

Releases 18

Uh oh!

Contributors 7

Uh oh!

Languages

Building `.whl` from source