v0.24.1

@benbrandt benbrandt released this 24 Feb 09:46
· 7 commits to main since this release

What's Changed

Added a new chunk_char_indices method to the Rust splitters in #607

use text_splitter::{ChunkCharIndex, TextSplitter};

let text = "\r\na̐éö̲\r\n";
let splitter = TextSplitter::new(3);
let chunks = splitter.chunk_char_indices(text).collect::<Vec<_>>();

assert_eq!(
    vec![
        ChunkCharIndex {
            chunk: "a̐é",
            byte_offset: 2,
            char_offset: 2
        },
        ChunkCharIndex {
            chunk: "ö̲",
            byte_offset: 7,
            char_offset: 5
        }
    ],
    chunks
);

This pulls the character-offset logic from the Python bindings down into the core library. Tracking character offsets is more expensive than tracking byte offsets alone, and for most usage in Rust, byte offsets are sufficient.

However, when interfacing with other languages or systems that require character offsets, this will track the character offsets for you, accounting for any trimming that may have occurred.
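To illustrate the difference between the two kinds of offsets, here is a minimal standard-library sketch. The helper `char_offset_for_byte` is hypothetical and is not part of text_splitter, nor is it how the library computes offsets internally. It simply counts the Unicode scalar values (Rust `char`s) that precede a given byte offset, which is the kind of index a language like Python uses for strings.

```rust
/// Hypothetical helper for illustration only (not a text_splitter API):
/// returns the number of `char`s that precede `byte_offset` in `text`.
/// Returns `None` if the offset is out of range or falls inside a
/// multi-byte character.
fn char_offset_for_byte(text: &str, byte_offset: usize) -> Option<usize> {
    // `str::get` checks both the bounds and the char boundary for us.
    text.get(..byte_offset).map(|prefix| prefix.chars().count())
}
```

With the text from the example above written out with explicit escapes, `"\r\na\u{0310}\u{00e9}o\u{0308}\u{0332}\r\n"`, byte offset 2 maps to character offset 2 and byte offset 7 maps to character offset 5, matching the values in the assertion. Recounting from the start of the text like this for every chunk is what makes a naive conversion costly on long inputs.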

Full Changelog: v0.24.0...v0.24.1