Inkwell is a modular Python library for extracting information from PDF documents with state-of-the-art Vision Language Models. We use layout understanding models to improve the accuracy of the Vision Language Models. You can easily swap out the layout models or the Vision Language Models to create pipelines that work well for specific document layouts.
Inkwell uses the following models, with more integrations in the works:
- Layout Detection: Faster RCNN, LayoutLMv3, Paddle
- Table Detection: Table Transformer
- Table Data Extraction: Phi3.5-Vision, Qwen2 VL 2B, Table Transformer, OpenAI GPT-4o Mini
- OCR: Tesseract, PaddleOCR, Phi3.5-Vision, Qwen2 VL 2B
Install the core library with pip:

pip install py-inkwell
In addition, install detectron2:
pip install git+https://github.com/facebookresearch/detectron2.git
Install Tesseract
For Ubuntu:
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
and for macOS:
brew install tesseract
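To confirm that the Tesseract binary is actually discoverable before running the pipeline, a quick check with the Python standard library is enough (nothing here is Inkwell-specific):

import shutil

# Tesseract must be on PATH for the Tesseract-based OCR backend to work
if shutil.which("tesseract") is None:
    raise RuntimeError("tesseract not found - install it with apt or brew as shown above")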
For GPUs, install flash attention for faster inference.
pip install flash-attn --no-build-isolation
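Before opting into the GPU pipeline, you may want to verify that PyTorch can see a CUDA device and that flash-attn is importable. This is a minimal sketch using standard PyTorch and stdlib calls only, not an Inkwell API:

import importlib.util

import torch

# GPU inference requires a visible CUDA device
print("CUDA available:", torch.cuda.is_available())

# flash-attn is optional; this only reports whether the package is installed
print("flash-attn installed:", importlib.util.find_spec("flash_attn") is not None)

Once everything is installed, a basic end-to-end run looks like this: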
from inkwell.pipeline import Pipeline
pipeline = Pipeline()
document = pipeline.process("/path/to/file.pdf")
for page in document.pages:
    figures = page.image_fragments()
    tables = page.table_fragments()
    text_blocks = page.text_fragments()

    # Check the content of the image fragments
    for figure in figures:
        figure_image = figure.content.image
        print(f"Text in figure:\n{figure.content.text}")

    # Check the content of the table fragments
    for table in tables:
        table_image = table.content.image
        print(f"Table detected: {table.content.data}")

    # Check the content of the text blocks
    for text_block in text_blocks:
        text_block_image = text_block.content.image
        print(f"Text block detected: {text_block.content.text}")
Default models: We have defined a config class here, and the pipeline uses the default CPU config for the best out-of-the-box results. If you want to use the default GPU pipeline, you can instantiate it with the GPU config class:
from inkwell.pipeline import DefaultGPUPipelineConfig, Pipeline
config = DefaultGPUPipelineConfig()
pipeline = Pipeline(config=config)
If you want to change the default models, you can replace them with the models listed below by passing them to the config during pipeline initialization:
from inkwell.pipeline import PipelineConfig, Pipeline
from inkwell.layout_detector import LayoutDetectorType
from inkwell.ocr import OCRType
from inkwell.table_detector import TableDetectorType, TableExtractorType
config = PipelineConfig(
    layout_detector=LayoutDetectorType.FASTER_RCNN,
    table_extractor=TableExtractorType.PHI3_VISION,
)
pipeline = Pipeline(config=config)
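The customized pipeline is used in exactly the same way as the default one:

# Same interface as before - only the underlying models change
document = pipeline.process("/path/to/file.pdf")

for page in document.pages:
    print(f"Found {len(page.table_fragments())} tables on this page")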
You can add custom detectors and other components to the pipeline yourself by following the instructions in the Custom Components notebook.
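The exact interface for custom components is defined in that notebook; as a purely illustrative, hypothetical sketch (the class name, method name, and return value below are assumptions, not Inkwell's actual API), a custom OCR component could be a small class that takes an image and returns recognized text:

# Hypothetical sketch only - follow the Custom Components notebook for the real interface
class MyCustomOCR:
    """Wraps an in-house OCR engine; the process() signature here is assumed."""

    def process(self, image):
        # Run your own OCR engine on the image and return the recognized text
        return "recognized text"  # replace with your engine's output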
We drew inspiration from several open-source libraries, such as Layout Parser and Deepdoctection, and would like to thank the contributors to these libraries for their work.