Releases: segment-any-text/wtpsplit

Release 2.2.1

11 Apr 10:49

What's Changed

  • Add compatibility with hf-hub >1 by @markus583 in #176
  • Add long description to PyPI info
  • [DEV] Update release.sh for toolchain compat

Full Changelog: 2.2.0...2.2.1

Release 2.2.0

26 Feb 12:51

What's Changed

New Features

  • Length-constrained segmentation (#164 by @harikesavan): Control segment lengths with min_length and max_length parameters. Uses Viterbi (optimal) or greedy algorithms with configurable priors (uniform, gaussian, lognormal, clipped_polynomial) and language-aware defaults. Useful for embedding pipelines, storage limits, or any downstream task requiring fixed-size chunks. See docs/LENGTH_CONSTRAINTS.md.
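The idea behind optimal length-constrained segmentation can be sketched in a few lines of plain Python. This is an illustrative dynamic program under stated assumptions, not the library's implementation: the function name constrained_split, the per-character boundary_probs input, and the log-odds scoring are all hypothetical, and a uniform prior over segment lengths is assumed. Only the min_length and max_length names come from the release note.

```python
import math


def constrained_split(text, boundary_probs, min_length, max_length):
    """Split text so every segment has min_length..max_length characters.

    boundary_probs[i] is the (assumed) probability that a segment ends
    right after character i. Hypothetical helper, for illustration only.
    """
    n = len(text)
    eps = 1e-9
    neg_inf = float("-inf")

    def gain(p):
        # Log-odds of cutting here versus not cutting here.
        p = min(max(p, eps), 1.0 - eps)
        return math.log(p) - math.log(1.0 - p)

    best = [neg_inf] * (n + 1)  # best[j]: best score for segmenting text[:j]
    back = [-1] * (n + 1)       # back[j]: start index of the last segment
    best[0] = 0.0
    for j in range(1, n + 1):
        # The cut at the very end of the text is mandatory, so it is free.
        cut = 0.0 if j == n else gain(boundary_probs[j - 1])
        for i in range(max(0, j - max_length), j - min_length + 1):
            if best[i] == neg_inf:
                continue
            score = best[i] + cut
            if score > best[j]:
                best[j] = score
                back[j] = i
    if best[n] == neg_inf:
        raise ValueError("no segmentation satisfies the length constraints")

    # Walk the back-pointers from the end to recover the segments.
    pieces = []
    j = n
    while j > 0:
        i = back[j]
        pieces.append(text[i:j])
        j = i
    return pieces[::-1]


probs = [0.01] * 10
probs[3], probs[6] = 0.9, 0.8
print(constrained_split("abcdefghij", probs, min_length=2, max_length=5))
# → ['abcd', 'efg', 'hij']
```

A greedy variant would instead scan left to right and cut at the most likely boundary inside each allowed window; it is cheaper but can miss the globally best set of cuts.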

Bug Fixes & Improvements

  • Transformers ≥5 compatibility (#172 by @markus583): Full support for transformers v5 while remaining backward-compatible with v4. Also removes the adapters library as a hard inference dependency: LoRA weights can now be merged without it installed.
  • Auto-detect num_labels for LoRA on sm models (#170 by @markus583): Fixes loading LoRA adapters trained with num_labels > 1 onto -sm models, which previously caused a shape mismatch error (#168).

Other

  • Minimum supported Python version: 3.9
  • Python 3.13 added to CI matrix
  • CI now runs new length-constrained segmentation tests

Full Changelog: 2.1.7...2.2.0

Release 2.1.7

19 Nov 08:41

  • Suppress noisy warnings from upstream dependencies on some Python versions
  • Add an option to skip merging LoRA weights (merging remains the default for efficiency reasons)

Full Changelog: 2.1.6...2.1.7

Release 2.1.6

23 Jun 03:49

Full Changelog: 2.1.5...2.1.6

Release 2.1.5

01 Apr 13:35

Changelog

  • Avoid unnecessary len check by using is None for tokenizer, leading to major speedups (#150)
  • Change the default onnxruntime install from CPU-only to a flexible install that supports both GPU and CPU (#152)
  • Allow using pre-downloaded tokenizer so SaT can be used offline (#151)
  • Add checks when setting an ONNX model object (#149)
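The speedup from the is None change in #150 follows from how Python evaluates truthiness: for an object that defines __len__ but not __bool__, a check like "if not tokenizer" calls __len__, while "tokenizer is None" is a plain identity comparison. The sketch below illustrates this with a hypothetical stand-in class; CountingTokenizer and both helper functions are made up for the example and are not part of wtpsplit.

```python
class CountingTokenizer:
    """Hypothetical stand-in for a tokenizer whose __len__ is costly."""

    def __init__(self):
        self.len_calls = 0

    def __len__(self):
        # A real tokenizer might have to assemble its vocabulary here.
        self.len_calls += 1
        return 250_000


def has_tokenizer_truthy(tokenizer):
    # Truthiness falls back to __len__ when __bool__ is not defined.
    return bool(tokenizer)


def has_tokenizer_identity(tokenizer):
    # Identity comparison never touches __len__.
    return tokenizer is not None


tok = CountingTokenizer()
has_tokenizer_truthy(tok)
print(tok.len_calls)  # → 1
has_tokenizer_identity(tok)
print(tok.len_calls)  # → 1 (unchanged)
```

When such a check sits in a hot path that runs once per batch or per call, avoiding the hidden __len__ call can add up quickly.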

Release 2.1.4

25 Jan 16:43

  • Introduce optional hat weighting by @lsorber
  • Clarify LoRA adaptation
  • Rename treat_newline_as_space to split_on_input_newlines for clarity. treat_newline_as_space will be deprecated in a future release.
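The release note does not spell out what hat weighting does. One plausible reading, assumed here, is that when a long text is processed in overlapping windows, each window's predictions are averaged with triangular ("hat") weights so that positions near a window's centre count more than positions near its edges. The sketch below follows that assumption; hat_weights, merge_windows, and the small offset that keeps edge weights non-zero are hypothetical and not taken from the library.

```python
def hat_weights(n, floor=1e-2):
    """Triangular weights peaking at the window centre (assumed shape)."""
    if n == 1:
        return [1.0]
    # floor keeps the edge weights strictly positive.
    return [1.0 - abs(2.0 * i / (n - 1) - 1.0) + floor for i in range(n)]


def merge_windows(windows, total_length, weighting="hat"):
    """Average overlapping window scores into one score per position.

    windows is a list of (start, scores) pairs. Every position in
    range(total_length) is assumed to be covered by at least one window.
    """
    sums = [0.0] * total_length
    norms = [0.0] * total_length
    for start, scores in windows:
        if weighting == "hat":
            weights = hat_weights(len(scores))
        else:
            weights = [1.0] * len(scores)  # uniform averaging
        for offset, (score, weight) in enumerate(zip(scores, weights)):
            sums[start + offset] += weight * score
            norms[start + offset] += weight
    return [s / w for s, w in zip(sums, norms)]


# Two windows of length 4 that overlap on positions 2 and 3.
windows = [(0, [0.0] * 4), (2, [1.0] * 4)]
print(merge_windows(windows, 6, weighting="uniform"))
print(merge_windows(windows, 6, weighting="hat"))
```

With uniform weighting both overlapping positions get the plain average. With hat weighting each position leans towards the window in which it lies closer to the centre.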

Release 2.1.2

14 Dec 11:06

  • Fixes #142: AssertionError when the string consists only of newlines or whitespace, or is an empty string.

Release 2.1.1

27 Oct 14:19

  • Change default behaviour for newlines in SaT.split.
    • The model still ignores newlines, but they are now used to split the output as a simple post-processing step.
  • Small bugfixes for LoRA training
  • Update README for advanced usage
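The newline post-processing described above can be pictured as a second pass over the segments the model produced. The following is a minimal sketch under that assumption, not the code used in SaT.split: the helper name split_segments_on_newlines is hypothetical, and keeping each newline attached to the preceding piece is a choice made here so that joining the pieces reproduces the input.

```python
def split_segments_on_newlines(segments):
    """Further split model-predicted segments at newline characters.

    Hypothetical helper for illustration. Each newline stays at the end
    of the piece it terminates, so "".join(result) equals "".join(segments).
    """
    result = []
    for segment in segments:
        start = 0
        for i, ch in enumerate(segment):
            if ch == "\n":
                result.append(segment[start:i + 1])
                start = i + 1
        if start < len(segment):
            result.append(segment[start:])
    return result


model_output = ["First line\nsecond line. ", "Third."]
print(split_segments_on_newlines(model_output))
# → ['First line\n', 'second line. ', 'Third.']
```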

Release 2.1.0

24 Sep 21:37

  • Adds ONNX support for SaT models.
    • Including export scripts and an updated README.
    • This results in 50% improved inference time on GPU.

Release 2.0.8

09 Sep 10:49

Choose a tag to compare

  • Fix splitting of short sequences into individual characters (#127)