A simple, efficient Python client for GROBID REST services that provides concurrent processing capabilities for PDF documents, reference strings, and patents.
- Features
- Prerequisites
- Installation
- Quick Start
- Usage
- Configuration
- Services
- Testing
- Performance
- Development
- License
## Features

- **Concurrent Processing**: Efficiently process multiple documents in parallel
- **Flexible Input**: Process PDF files, text files with references, and XML patents
- **Configurable**: Customizable server settings, timeouts, and processing options
- **Command Line & Library**: Use as a standalone CLI tool or import into your Python projects
- **Coordinate Extraction**: Optional PDF coordinate extraction for precise element positioning
- **Sentence Segmentation**: Layout-aware sentence segmentation capabilities
- **JSON Output**: Convert TEI XML output to structured JSON with a CORD-19-like structure
- **Markdown Output**: Convert TEI XML output to clean Markdown with structured sections
## Prerequisites

- **Python**: 3.8 - 3.13 (tested versions)
- **GROBID Server**: A running GROBID service instance
  - Local installation: see the GROBID documentation
  - Docker:

    ```shell
    docker run -t --rm -p 8070:8070 lfoppiano/grobid:0.8.2
    ```

  - Default server: `http://localhost:8070`
  - Online demo: https://lfoppiano-grobid.hf.space (usage limits apply)

> **Important**: GROBID supports Windows only through Docker containers. See the Docker documentation for details.
## Installation

Choose one of the following installation methods:

```shell
# From PyPI
pip install grobid-client-python

# Latest development version from GitHub
pip install git+https://github.com/kermitt2/grobid_client_python.git

# From source, in editable mode
git clone https://github.com/kermitt2/grobid_client_python
cd grobid_client_python
pip install -e .
```

## Quick Start

```shell
# Process PDFs in a directory
grobid_client --input ./pdfs --output ./output processFulltextDocument

# Process with custom server
grobid_client --server https://your-grobid-server.com --input ./pdfs processFulltextDocument
```

```python
from grobid_client.grobid_client import GrobidClient

# Create client instance
client = GrobidClient(config_path="./config.json")

# Process documents
client.process("processFulltextDocument", "/path/to/pdfs", n=10)
```

## Usage

The client provides a comprehensive CLI with the following syntax:
```shell
grobid_client [OPTIONS] SERVICE
```

| Service | Description | Input Format |
|---|---|---|
| `processFulltextDocument` | Extract full document structure | PDF files |
| `processHeaderDocument` | Extract document metadata | PDF files |
| `processReferences` | Extract bibliographic references | PDF files |
| `processCitationList` | Parse citation strings | Text files (one citation per line) |
| `processCitationPatentST36` | Process patent citations | XML ST36 format |
| `processCitationPatentPDF` | Process patent PDFs | PDF files |
| Option | Description | Default |
|---|---|---|
| `--input` | Input directory path | Required |
| `--output` | Output directory path | Same as input |
| `--server` | GROBID server URL | `http://localhost:8070` |
| `--n` | Concurrency level | 10 |
| `--config` | Config file path | Optional |
| `--force` | Overwrite existing files | False |
| `--verbose` | Enable verbose logging | False |
| Option | Description |
|---|---|
| `--generateIDs` | Generate random XML IDs |
| `--consolidate_header` | Consolidate header metadata |
| `--consolidate_citations` | Consolidate bibliographic references |
| `--include_raw_citations` | Include raw citation text |
| `--include_raw_affiliations` | Include raw affiliation text |
| `--teiCoordinates` | Add PDF coordinates to XML |
| `--segmentSentences` | Segment sentences with coordinates |
| `--flavor` | Processing flavor for fulltext extraction |
| `--json` | Convert TEI output to JSON format |
| `--markdown` | Convert TEI output to Markdown format |
```shell
# Basic fulltext processing
grobid_client --input ~/documents --output ~/results processFulltextDocument

# High concurrency with coordinates
grobid_client --input ~/pdfs --output ~/tei --n 20 --teiCoordinates processFulltextDocument

# Process with JSON output
grobid_client --input ~/pdfs --output ~/results --json processFulltextDocument

# Process with Markdown output
grobid_client --input ~/pdfs --output ~/results --markdown processFulltextDocument

# Process citations with custom server
grobid_client --server https://grobid.example.com --input ~/citations.txt processCitationList

# Force reprocessing with sentence segmentation and JSON output
grobid_client --input ~/docs --force --segmentSentences --json processFulltextDocument
```

The client can also be used as a Python library:

```python
from grobid_client.grobid_client import GrobidClient

# Initialize with default localhost server
client = GrobidClient()

# Initialize with custom server
client = GrobidClient(grobid_server="https://your-server.com")

# Initialize with config file
client = GrobidClient(config_path="./config.json")

# Process documents
client.process(
    service="processFulltextDocument",
    input_path="/path/to/pdfs",
    output_path="/path/to/output",
    n=20
)

# Process with specific options
client.process(
    service="processFulltextDocument",
    input_path="/path/to/pdfs",
    output_path="/path/to/output",
    n=10,
    generateIDs=True,
    consolidate_header=True,
    teiCoordinates=True,
    segmentSentences=True
)

# Process with JSON output
client.process(
    service="processFulltextDocument",
    input_path="/path/to/pdfs",
    output_path="/path/to/output",
    json_output=True
)

# Process with Markdown output
client.process(
    service="processFulltextDocument",
    input_path="/path/to/pdfs",
    output_path="/path/to/output",
    markdown_output=True
)

# Process citation lists
client.process(
    service="processCitationList",
    input_path="/path/to/citations.txt",
    output_path="/path/to/output"
)
```

## Configuration

Configuration can be provided via a JSON file. When using the CLI, the `--server` argument overrides the config file settings:
```json
{
  "grobid_server": "http://localhost:8070",
  "batch_size": 1000,
  "sleep_time": 5,
  "timeout": 60,
  "coordinates": [
    "persName",
    "figure",
    "ref",
    "biblStruct",
    "formula",
    "s"
  ]
}
```

| Parameter | Description | Default |
|---|---|---|
| `grobid_server` | GROBID server URL | `http://localhost:8070` |
| `batch_size` | Thread pool size; tune carefully, since a large batch size results in the data being written less frequently | 1000 |
| `sleep_time` | Wait time when the server is busy (seconds) | 5 |
| `timeout` | Client-side timeout (seconds) | 180 |
| `coordinates` | XML elements for coordinate extraction | See above |
| `logging` | Logging configuration (level, format, file output) | See Logging section |
> **Tip**: Since version 0.0.12, the config file is optional. The client will use default localhost settings if no configuration is provided.
## Logging

The client provides configurable logging with different verbosity levels. By default, only essential statistics and warnings are shown:

- Without `--verbose`: shows only essential information and warnings/errors
- With `--verbose`: shows detailed processing information at INFO level

The following information is always displayed regardless of the `--verbose` flag:

```text
Found 1000 file(s) to process
Processing completed: 950 out of 1000 files processed
Errors: 50 out of 1000 files processed
Processing completed in 120.5 seconds
```

When the `--verbose` flag is used, additional detailed information is displayed:
- Server connection status
- Individual file processing details
- JSON conversion messages
- Detailed error messages
- Processing progress information
```shell
# Clean output - only essential statistics
grobid_client --input pdfs/ processFulltextDocument
# Output:
# Found 1000 file(s) to process
# Processing completed: 950 out of 1000 files processed
# Errors: 50 out of 1000 files processed
# Processing completed in 120.5 seconds

# Verbose output - detailed processing information
grobid_client --input pdfs/ --verbose processFulltextDocument
# Output includes all essential stats PLUS:
# GROBID server http://localhost:8070 is up and running
# JSON file example.json does not exist, generating JSON from existing TEI...
# Successfully created JSON file: example.json
# ... and other detailed processing information
```

The config file can include logging settings:
```json
{
  "grobid_server": "http://localhost:8070",
  "logging": {
    "level": "WARNING",
    "format": "%(asctime)s - %(levelname)s - %(message)s",
    "console": true,
    "file": null
  }
}
```

> **Note**: The `--verbose` command line flag always takes precedence over configuration file logging settings.
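The logging block maps naturally onto Python's standard `logging` module. A hedged sketch of how such settings could be applied, with the verbose flag taking precedence — the helper name and wiring are assumptions, not the client's internal implementation:

```python
import logging

def apply_logging_settings(settings, verbose=False):
    """Build a logger from config-style settings; verbose forces INFO level (illustrative)."""
    # --verbose wins over the configured level, mirroring the documented precedence.
    level = logging.INFO if verbose else getattr(logging, settings.get("level", "WARNING"))
    logger = logging.getLogger("grobid_client_example")
    logger.handlers.clear()
    if settings.get("console", True):
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            settings.get("format", "%(asctime)s - %(levelname)s - %(message)s")))
        logger.addHandler(handler)
    logger.setLevel(level)
    return logger
```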
## Services

### processFulltextDocument

Extracts complete document structure including headers, body text, figures, tables, and references.

```shell
grobid_client --input pdfs/ --output results/ processFulltextDocument
```

### JSON Output

When using the `--json` flag, the client converts TEI XML output to a structured JSON format similar to CORD-19. This provides:
- Structured Bibliography: Title, authors, DOI, publication date, journal information
- Body Text: Paragraphs and sentences with metadata and reference annotations
- Figures and Tables: Structured JSON format for tables with headers, rows, and metadata
- Reference Information: In-text citations with offsets and targets
```json
{
  "level": "paragraph",
  "biblio": {
    "title": "Document Title",
    "authors": [
      "Author 1",
      "Author 2"
    ],
    "doi": "10.1000/example",
    "publication_date": "2023-01-01",
    "journal": "Journal Name",
    "abstract": [
      ...
    ]
  },
  "body_text": [
    {
      "id": "p_12345",
      "text": "Paragraph text with citations [1].",
      "head_section": "Introduction",
      "refs": [
        {
          "type": "bibr",
          "target": "b1",
          "text": "[1]",
          "offset_start": 25,
          "offset_end": 28
        }
      ]
    }
  ],
  "figures_and_tables": [
    {
      "id": "table_1",
      "type": "table",
      "label": "Table 1",
      "head": "Sample Data",
      "content": {
        "headers": [
          "Header 1",
          "Header 2"
        ],
        "rows": [
          [
            "Value 1",
            "Value 2"
          ]
        ],
        "metadata": {
          "row_count": 1,
          "column_count": 2,
          "has_headers": true
        }
      }
    }
  ]
}
```

```shell
# Generate both TEI and JSON outputs
grobid_client --input pdfs/ --output results/ --json processFulltextDocument

# JSON output with coordinates and sentence segmentation
grobid_client --input pdfs/ --output results/ --json --teiCoordinates --segmentSentences processFulltextDocument
```

```python
# Python library usage
client.process(
    service="processFulltextDocument",
    input_path="/path/to/pdfs",
    output_path="/path/to/output",
    json_output=True
)
```

> **Note**: When using `--json`, the `--force` flag only checks for existing TEI files. If a TEI file is rewritten (due to `--force`), the corresponding JSON file is automatically rewritten as well.
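Since `offset_start`/`offset_end` index directly into a paragraph's `text`, in-text citation markers can be recovered by slicing. A small sketch against a hand-written record shaped like the output above (not real GROBID output; the offsets here are computed for this sample text):

```python
# Sample paragraph record shaped like the documented JSON output (hand-written).
paragraph = {
    "text": "Paragraph text with citations [1].",
    "refs": [
        {"type": "bibr", "target": "b1", "text": "[1]",
         "offset_start": 30, "offset_end": 33},
    ],
}

def extract_citation_spans(para):
    """Slice each ref out of the paragraph text using its offsets."""
    return [para["text"][r["offset_start"]:r["offset_end"]] for r in para["refs"]]
```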
### Markdown Output

When using the `--markdown` flag, the client converts TEI XML output to a clean, readable Markdown format. This provides:

- **Structured Sections**: Title, Authors, Affiliations, Publication Date, Fulltext, Annex, and References
- **Clean Formatting**: Human-readable format suitable for documentation and sharing
- **Preserved Content**: All text content with proper section organization
- **Reference Formatting**: Bibliographic references in a readable format

The generated Markdown follows this structure:
```markdown
# Document Title

## Authors
- Author Name 1
- Author Name 2

## Affiliations
- Affiliation 1
- Affiliation 2

## Publication Date
January 1, 2023

## Fulltext

### Introduction
Content of the introduction section...

### Methods
Content of the methods section...

## Annex

### Acknowledgements
Acknowledgement text...

### Competing Interests
Competing interests statement...

## References
**[1]** Paper Title. *Author Name*. *Journal Name* (2023).
**[2]** Another Paper. *Author et al.* *Conference* (2022).
```

```shell
# Generate both TEI and Markdown outputs
grobid_client --input pdfs/ --output results/ --markdown processFulltextDocument

# Markdown output with coordinates and sentence segmentation
grobid_client --input pdfs/ --output results/ --markdown --teiCoordinates --segmentSentences processFulltextDocument
```

```python
# Python library usage
client.process(
    service="processFulltextDocument",
    input_path="/path/to/pdfs",
    output_path="/path/to/output",
    markdown_output=True
)
```

> **Note**: When using `--markdown`, the `--force` flag only checks for existing TEI files. If a TEI file is rewritten (due to `--force`), the corresponding Markdown file is automatically rewritten as well.
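As a rough illustration of the conversion's shape (not the client's actual implementation), a function like the following could render parsed metadata into the documented skeleton; the `doc` fields are assumptions for this sketch:

```python
def render_markdown(doc):
    """Render title and authors in the documented Markdown layout (illustrative)."""
    lines = [f"# {doc['title']}", "", "## Authors"]
    # One bullet per author, matching the structure shown above.
    lines.extend(f"- {author}" for author in doc["authors"])
    return "\n".join(lines)
```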
### processHeaderDocument

Extracts only document metadata (title, authors, abstract, etc.).

```shell
grobid_client --input pdfs/ --output headers/ processHeaderDocument
```

### processReferences

Extracts and structures bibliographic references from documents.

```shell
grobid_client --input pdfs/ --output refs/ processReferences
```

### processCitationList

Parses raw citation strings from text files.

```shell
grobid_client --input citations.txt --output parsed/ processCitationList
```

> **Tip**: For citation lists, input should be text files with one citation string per line.
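Preparing the input is just a matter of writing one citation per line; a minimal sketch (the file name and citation strings are arbitrary examples):

```python
# Hypothetical citation strings, one per line, as expected by processCitationList.
citations = [
    "Smith, J. An example paper title. Example Journal, 12(3), 2020.",
    "Doe, A. Another example title. Proceedings of an Example Conference, 2021.",
]

with open("citations.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(citations) + "\n")
```

The resulting file can then be passed via `--input citations.txt` as in the command above.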
## Testing

The project includes comprehensive unit and integration tests using pytest.

```shell
# Install development dependencies
pip install -e .[dev]

# Run all tests
pytest

# Run with coverage
pytest --cov=grobid_client

# Run specific test file
pytest tests/test_client.py

# Run with verbose output
pytest -v
```

Test layout:

- `tests/test_client.py` - Unit tests for the base API client
- `tests/test_grobid_client.py` - Unit tests for the GROBID client
- `tests/test_integration.py` - Integration tests with a real GROBID server
- `tests/conftest.py` - Test configuration and fixtures
Tests are automatically run via GitHub Actions on:
- Push to main branch
- Pull requests
- Multiple Python versions (3.8-3.13)
## Performance

Benchmark results for processing 136 PDFs (3,443 pages total, ~25 pages per PDF) on an Intel Core i7-4790K CPU @ 4.00GHz:
| Concurrency | Runtime (s) | s/PDF | PDF/s |
|---|---|---|---|
| 1 | 209.0 | 1.54 | 0.65 |
| 2 | 112.0 | 0.82 | 1.21 |
| 3 | 80.4 | 0.59 | 1.69 |
| 5 | 62.9 | 0.46 | 2.16 |
| 8 | 55.7 | 0.41 | 2.44 |
| 10 | 55.3 | 0.40 | 2.45 |
- Header processing: 3.74s for 136 PDFs (36 PDF/s) with n=10
- Reference extraction: 26.9s for 136 PDFs (5.1 PDF/s) with n=10
- Citation parsing: 4.3s for 3,500 citations (814 citations/s) with n=10
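The throughput columns follow directly from the runtimes; a quick arithmetic check using two rows copied from the fulltext table:

```python
total_pdfs = 136

# Runtime in seconds at concurrency 1 and 10, from the benchmark table above.
runtimes = {1: 209.0, 10: 55.3}

for n, seconds in runtimes.items():
    pdfs_per_second = total_pdfs / seconds
    seconds_per_pdf = seconds / total_pdfs
    print(f"n={n}: {pdfs_per_second:.2f} PDF/s, {seconds_per_pdf:.2f} s/PDF")
```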
## Development

```shell
# Clone the repository
git clone https://github.com/kermitt2/grobid_client_python
cd grobid_client_python

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install in development mode with test dependencies
pip install -e .[dev]

# Install pre-commit hooks (optional)
pre-commit install
```

The project uses bump-my-version for version management:
```shell
# Install bump-my-version
pip install bump-my-version

# Bump version (patch, minor, or major)
bump-my-version bump patch

# The release will be automatically published to PyPI
```

To contribute:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes
4. Add tests for new functionality
5. Run the test suite (`pytest`)
6. Commit your changes (`git commit -m 'Add amazing feature'`)
7. Push to the branch (`git push origin feature/amazing-feature`)
8. Open a Pull Request
## License

Distributed under the Apache 2.0 License. See LICENSE for more information.

**Main Author**: Patrice Lopez ([email protected])

**Maintainer**: Luca Foppiano ([email protected])