🪓 wtpsplit is a Python package that offers training, inference, and evaluation of state-of-the-art Segment any Text (SaT) models for partitioning text into sentences.
✂️ wtpsplit-lite is a lightweight version of wtpsplit that only retains accelerated ONNX inference of SaT models with minimal dependencies:
- huggingface-hub to download the model
- numpy to process the model input and output
- onnxruntime to run the model
- tokenizers to tokenize the text for the model
To install this package, run:
pip install wtpsplit-lite
Tip
For a complete list of Segment any Text (SaT) models and all SaT.split keyword arguments, see the wtpsplit README.
Example usage:
from wtpsplit_lite import SaT
text = """
It is known that Maxwell’s electrodynamics—as usually understood at the
present time—when applied to moving bodies, leads to asymmetries which do
not appear to be inherent in the phenomena. Take, for example, the recipro-
cal electrodynamic action of a magnet and a conductor.
"""
# Fast (~150ms/page), good quality:
sat = SaT("sat-3l-sm")
sentences = sat.split(text, stride=128, block_size=256)
# Slow, highest quality:
sat = SaT("sat-12l-sm")
sentences = sat.split(text)
This package also contributes a new 'hat' weighting scheme to wtpsplit that improves output quality when using large strides. To enable it, set weighting="hat" as follows:
# Fast (~150ms/page), better quality:
sat = SaT("sat-3l-sm")
sentences = sat.split(text, stride=128, block_size=256, weighting="hat")
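To build intuition for why a hat-shaped weighting can help with large strides, here is a minimal pure-Python sketch. It is not the package's implementation: the function names hat_weights and combine_blocks, the exact triangular formula, and the toy scores are all assumptions made for illustration. The idea shown is that when overlapping blocks disagree about a position, the block that sees the position near its center (with more context on both sides) gets more say than the block that sees it near its edge.

```python
def hat_weights(block_size):
    """Hypothetical triangular weights: highest at the block center, lowest at the edges."""
    center = (block_size - 1) / 2
    return [1.0 - abs(i - center) / (center + 1) for i in range(block_size)]


def combine_blocks(block_scores, starts, length, weights):
    """Weighted average of per-position scores coming from overlapping blocks."""
    totals = [0.0] * length
    norms = [0.0] * length
    for scores, start in zip(block_scores, starts):
        for offset, score in enumerate(scores):
            totals[start + offset] += weights[offset] * score
            norms[start + offset] += weights[offset]
    return [total / norm for total, norm in zip(totals, norms)]


# Toy setup: 6 positions, block_size=4, stride=2, so blocks start at 0 and 2.
# Position 3 sits at the edge of the first block and near the center of the second.
block_scores = [
    [0.1, 0.1, 0.1, 0.2],  # first block, covers positions 0..3
    [0.1, 0.8, 0.1, 0.1],  # second block, covers positions 2..5
]
starts = [0, 2]

uniform = combine_blocks(block_scores, starts, 6, [1.0] * 4)
hat = combine_blocks(block_scores, starts, 6, hat_weights(4))
print(uniform[3], hat[3])
```

With uniform weights the two blocks are averaged equally at position 3; with the triangular weights the second block, which sees that position closer to its center, pulls the combined score toward its own prediction.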
Note
In wtpsplit, the SaT implementation treats newlines as sentence boundaries by default. However, this leads to poor results on text extracted from PDFs, such as the example above. In wtpsplit-lite, newlines are therefore treated as whitespace by default. You can choose which behavior you prefer with the treat_newline_as_space boolean keyword argument of the SaT.split method.
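As a small sketch of how the two behaviors could be compared side by side, the helper below splits the same text once with each setting. The helper name split_both_ways is hypothetical; it relies only on the SaT class, the "sat-3l-sm" model name, and the treat_newline_as_space keyword argument described above. The import is placed inside the function so that defining the helper does not require the model to be downloaded.

```python
def split_both_ways(text, model_name="sat-3l-sm"):
    """Split `text` with newlines treated as whitespace and as sentence boundaries.

    Hypothetical helper for comparison purposes; requires wtpsplit-lite to be installed.
    """
    from wtpsplit_lite import SaT  # imported lazily, only when the helper is called

    sat = SaT(model_name)
    # wtpsplit-lite default: newlines are treated as whitespace.
    as_space = sat.split(text, treat_newline_as_space=True)
    # wtpsplit default: newlines are treated as sentence boundaries.
    as_boundary = sat.split(text, treat_newline_as_space=False)
    return as_space, as_boundary
```

Calling split_both_ways(text) on hard-wrapped text such as the PDF excerpt above is one way to check which setting suits your input.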
Prerequisites
1. Set up Git to use SSH
- Generate an SSH key and add the SSH key to your GitHub account.
- Configure SSH to automatically load your SSH keys:
cat << EOF >> ~/.ssh/config
Host *
  AddKeysToAgent yes
  IgnoreUnknown UseKeychain
  UseKeychain yes
  ForwardAgent yes
EOF
2. Install Docker
- Install Docker Desktop.
- Linux only:
- Export your user's user id and group id so that files created in the Dev Container are owned by your user:
cat << EOF >> ~/.bashrc
export UID=$(id --user)
export GID=$(id --group)
EOF
3. Install VS Code or PyCharm
- Install VS Code and VS Code's Dev Containers extension. Alternatively, install PyCharm.
- Optional: install a Nerd Font such as FiraCode Nerd Font and configure VS Code or configure PyCharm to use it.
Development environments
The following development environments are supported:
- ⭐️ GitHub Codespaces: click on Code and select Create codespace to start a Dev Container with GitHub Codespaces.
- ⭐️ Dev Container (with container volume): click on Open in Dev Containers to clone this repository in a container volume and create a Dev Container with VS Code.
- Dev Container: clone this repository, open it with VS Code, and run Ctrl/⌘ + ⇧ + P → Dev Containers: Reopen in Container.
- PyCharm: clone this repository, open it with PyCharm, and configure Docker Compose as a remote interpreter with the dev service.
- Terminal: clone this repository, open it with your terminal, and run docker compose up --detach dev to start a Dev Container in the background, and then run docker compose exec dev zsh to open a shell prompt in the Dev Container.
Developing
- This project follows the Conventional Commits standard to automate Semantic Versioning and Keep A Changelog with Commitizen.
- Run poe from within the development environment to print a list of Poe the Poet tasks available to run on this project.
- Run poetry add {package} from within the development environment to install a run time dependency and add it to pyproject.toml and poetry.lock. Add --group test or --group dev to install a CI or development dependency, respectively.
- Run poetry update from within the development environment to upgrade all dependencies to the latest versions allowed by pyproject.toml.
- Run cz bump to bump the package's version, update the CHANGELOG.md, and create a git tag.
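For contributors unfamiliar with Conventional Commits, the header line of a commit message has the shape type(scope)!: description, where the scope and the ! (breaking change marker) are optional. The sketch below is only an illustration of that shape: the regular expression and the parse_commit_header helper are assumptions written for this example, they cover the header line only, and they are not how Commitizen itself validates messages.

```python
import re

# Simplified pattern for a Conventional Commits header line (illustrative only).
CONVENTIONAL_COMMIT = re.compile(
    r"^(?P<type>[a-z]+)"              # e.g. feat, fix, docs
    r"(?:\((?P<scope>[^()\r\n]+)\))?"  # optional scope in parentheses
    r"(?P<breaking>!)?"               # optional breaking change marker
    r": (?P<description>\S.*)$"       # colon, space, short description
)


def parse_commit_header(header):
    """Return the parts of a commit header, or None if it does not match the pattern."""
    match = CONVENTIONAL_COMMIT.match(header)
    if match is None:
        return None
    return {
        "type": match["type"],
        "scope": match["scope"],
        "breaking": match["breaking"] is not None,
        "description": match["description"],
    }


print(parse_commit_header("feat(split): add hat weighting"))
print(parse_commit_header("Update README"))
```

A feat commit typically maps to a minor version bump and a fix commit to a patch bump, which is what lets cz bump derive the next version from the commit history.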