Releases · segment-any-text/wtpsplit · GitHub

11 Apr 10:49

markus583

Release 2.2.1 Latest

What's Changed

Add compatibility with hf-hub >1 by @markus583 in #176
Add long description to PyPI info
[DEV] Update release.sh for toolchain compat

Full Changelog: 2.2.0...2.2.1

Contributors

markus583

26 Feb 12:51

markus583

Release 2.2.0

What's Changed

New Features

Length-constrained segmentation (#164 by @harikesavan): Control segment lengths with min_length and max_length parameters. Uses Viterbi (optimal) or greedy algorithms with configurable priors (uniform, gaussian, lognormal, clipped_polynomial) and language-aware defaults. Useful for embedding pipelines, storage limits, or any downstream task requiring fixed-size chunks. See docs/LENGTH_CONSTRAINTS.md.
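To make the idea behind length-constrained segmentation concrete, here is a minimal sketch of a Viterbi-style dynamic program with a uniform length prior. It is an illustration only, not the wtpsplit implementation: the function name `constrained_split` and the per-character boundary probabilities `probs` are assumptions made for this example, and docs/LENGTH_CONSTRAINTS.md remains the reference for the actual behaviour.

```python
import math
from typing import List


def constrained_split(text: str, probs: List[float],
                      min_length: int, max_length: int) -> List[str]:
    """Hypothetical helper: choose cut points so that every segment has
    between min_length and max_length characters, maximising the summed
    log-odds of the chosen boundaries (uniform prior over lengths)."""
    n = len(text)
    if len(probs) != n:
        raise ValueError("need one boundary probability per character")
    if n == 0:
        return []

    eps = 1e-9
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)   # best[i]: best score for text[:i]
    back = [-1] * (n + 1)        # back[i]: start of the last segment
    best[0] = 0.0

    for end in range(1, n + 1):
        if end == n:
            gain = 0.0           # the final cut is forced, so it is free
        else:
            p = min(max(probs[end - 1], eps), 1.0 - eps)
            gain = math.log(p) - math.log(1.0 - p)   # log-odds of cutting here
        # Only starts that give a segment length within the limits.
        for start in range(max(0, end - max_length), end - min_length + 1):
            if best[start] == neg_inf:
                continue
            score = best[start] + gain
            if score > best[end]:
                best[end] = score
                back[end] = start

    if best[n] == neg_inf:
        raise ValueError("no segmentation satisfies the length limits")

    # Walk the back-pointers from the end of the text to recover segments.
    segments = []
    i = n
    while i > 0:
        segments.append(text[back[i]:i])
        i = back[i]
    return segments[::-1]


text = "one. two. three."
probs = [0.01] * len(text)
probs[4] = 0.9   # boundary after "one. "
probs[9] = 0.9   # boundary after "two. "

print(constrained_split(text, probs, min_length=1, max_length=16))
print(constrained_split(text, probs, min_length=6, max_length=16))
```

With loose limits the cuts follow the high-probability boundaries; raising `min_length` forces the first two sentences to merge. A greedy variant would commit to cuts left to right instead of scoring all valid segmentations.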

Bug Fixes & Improvements

Transformers ≥5 compatibility (#172 by @markus583): Full support for transformers v5 while remaining backward-compatible with v4. Also removes the adapters library as a hard inference dependency - LoRA weights can now be merged without it installed.
Auto-detect num_labels for LoRA on sm models (#170 by @markus583): Fixes loading LoRA adapters trained with num_labels > 1 onto -sm models, which previously caused a shape mismatch error (#168).

Other

Minimum supported Python version: 3.9
Python 3.13 added to CI matrix
CI now runs new length-constrained segmentation tests

New Contributors

@harikesavan made their first contribution in #164

Full Changelog: 2.1.7...2.2.0

Contributors

harikesavan and markus583

19 Nov 08:41

markus583

Release 2.1.7

Suppress annoying warnings from upstream dependencies in some Python versions
Add possibility to not merge LoRA weights (still defaults to merging for efficiency reasons)

Full Changelog: 2.1.6...2.1.7

23 Jun 03:49

markus583

Release 2.1.6

What's Changed

Improve postprocessing efficiency by @kevinhu in #157

New Contributors

@kevinhu made their first contribution in #157

Full Changelog: 2.1.5...2.1.6

Contributors

kevinhu

01 Apr 13:35

markus583

Release 2.1.5

Changelog

Avoid unnecessary len check by using is None for tokenizer, leading to major speedups (#150)
Change the default onnxruntime install from CPU-only to a flexible install that supports both GPU and CPU (#152)
Allow using pre-downloaded tokenizer so SaT can be used offline (#151)
Add checks when setting an ONNX model object (#149)
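The first changelog item (#150) can be illustrated with a small sketch. It assumes the earlier check relied on truthiness, which in Python falls back to `__len__` when an object defines no `__bool__`; the actual change in the library may differ. `CountingTokenizer` and both helper functions are hypothetical stand-ins written for this example.

```python
class CountingTokenizer:
    """Hypothetical stand-in for a tokenizer whose __len__ is costly."""

    def __init__(self) -> None:
        self.len_calls = 0

    def __len__(self) -> int:
        # A real tokenizer might compute its vocabulary size here.
        self.len_calls += 1
        return 250_000


def needs_loading_truthy(tokenizer) -> bool:
    # `not tokenizer` calls __len__ because no __bool__ is defined.
    return not tokenizer


def needs_loading_identity(tokenizer) -> bool:
    # An identity comparison never touches __len__.
    return tokenizer is None


tok = CountingTokenizer()

for _ in range(1000):
    needs_loading_truthy(tok)
truthy_calls = tok.len_calls

for _ in range(1000):
    needs_loading_identity(tok)
identity_calls = tok.len_calls - truthy_calls

print(truthy_calls, identity_calls)
```

When such a check sits on a hot path, every call pays for `__len__` under the truthiness version, while the `is None` version pays nothing.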

25 Jan 16:43

markus583

Release 2.1.4

Introduce optional hat weighting by @lsorber
Clarify LoRA adaptation
Rename treat_newline_as_space to split_on_input_newlines for clarity. treat_newline_as_space will be deprecated in a future release.

Contributors

lsorber

14 Dec 11:06

markus583

Release 2.1.2

Fixes #142: AssertionError when the input string consists only of newlines or whitespace, or is an empty string.

27 Oct 14:19

markus583

Release 2.1.1

Change default behaviour for newlines in SaT.split.
- The model still ignores newlines, but they are now used to split the text in a simple post-processing step.
Small bugfixes for LoRA training
Update Readme for advanced usage
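The newline change in this release can be pictured as a second pass over the model's output. The sketch below is an assumption about the general idea, not the library's code: `split_on_newlines` is a hypothetical helper, and whether the real post-processing keeps or drops the newline characters is not stated in these notes (this version drops them).

```python
from typing import List


def split_on_newlines(segments: List[str]) -> List[str]:
    """Hypothetical post-processing step: break each model-produced
    segment at newlines and discard empty pieces."""
    out: List[str] = []
    for segment in segments:
        for piece in segment.split("\n"):
            if piece:
                out.append(piece)
    return out


# Pretend the model returned two segments and ignored the newline.
model_segments = ["First line\nsecond line. ", "Third."]
final_segments = split_on_newlines(model_segments)
print(final_segments)
```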

24 Sep 21:37

markus583

Release 2.1.0

Adds ONNX support for SaT models.
- Including export scripts and an updated README.
- This results in 50% improved inference time on GPU.

09 Sep 10:49

markus583

Release 2.0.8

Fix splitting of short sequences into individual characters (#127)
