v0.7.1

@stevenobiajulu released this 27 Mar 03:07 · 21 commits to main since this release · 16f3198

Fixed

  • Paragraph similarity alignment on legal boilerplate documents (#62, fixes #61): The hierarchical paragraph matcher used a greedy first-match algorithm, so a low-similarity match early in the document could consume a revised paragraph that belonged to a higher-similarity match later on. On documents with extensive shared legal vocabulary (e.g., the NVCA Certificate of Incorporation), this caused incorrect paragraph alignment, garbled reject-all output, and a fallback to the rebuild reconstruction path with 949 phantom insertions.

    Fix (two parts):

    1. Order-constrained gap matching: Pass 1 exact-hash anchors divide the documents into gaps. Pass 2 similarity matching is scoped to each gap and run as a small longest-common-subsequence (mini-LCS) alignment, which guarantees that matches preserve document order.
    2. TF-IDF cosine similarity: Replaces Jaccard word similarity, which over-weights common boilerplate. IDF down-weights high-frequency terms such as "holders", "Preferred Stock", and "Corporation", so cosine similarity on TF-IDF vectors produces more discriminating scores.
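
The two-pass idea in item 1 can be sketched roughly as follows. This is a minimal Python illustration, not the docx-core implementation: the names `para_hash`, `lcs_pairs`, `align`, and `word_overlap`, the SHA-256 hash, and the 0.5 threshold are all assumptions made for the example. Pass 1 finds exact-hash anchors in document order; Pass 2 runs a small LCS inside each gap between anchors, so a similarity match can never cross an anchor or reorder paragraphs.

```python
import hashlib
from typing import Callable, List, Sequence, Tuple


def para_hash(text: str) -> str:
    """Hash of whitespace-normalized text; equal hashes mean an exact match."""
    return hashlib.sha256(" ".join(text.split()).encode("utf-8")).hexdigest()


def lcs_pairs(a: Sequence, b: Sequence, eq: Callable) -> List[Tuple[int, int]]:
    """Longest common subsequence under `eq`, as ordered (pos_a, pos_b) pairs."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            best = max(dp[i + 1][j], dp[i][j + 1])
            if eq(a[i], b[j]):
                best = max(best, dp[i + 1][j + 1] + 1)
            dp[i][j] = best
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if eq(a[i], b[j]) and dp[i][j] == dp[i + 1][j + 1] + 1:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def align(original: List[str], revised: List[str],
          similarity: Callable[[str, str], float],
          threshold: float = 0.5) -> List[Tuple[int, int, str]]:
    """Return (original_index, revised_index, kind) matches in document order."""
    # Pass 1: exact-hash anchors, kept in document order.
    anchors = lcs_pairs([para_hash(p) for p in original],
                        [para_hash(p) for p in revised],
                        lambda x, y: x == y)

    def similar(i: int, j: int) -> bool:
        return similarity(original[i], revised[j]) >= threshold

    matches: List[Tuple[int, int, str]] = []
    prev_i, prev_j = -1, -1
    # A sentinel anchor past the end closes the final gap.
    for ai, aj in anchors + [(len(original), len(revised))]:
        gap_o = list(range(prev_i + 1, ai))
        gap_r = list(range(prev_j + 1, aj))
        # Pass 2: similarity matching scoped to this gap only.
        for gi, gj in lcs_pairs(gap_o, gap_r, similar):
            matches.append((gap_o[gi], gap_r[gj], "similar"))
        if ai < len(original):
            matches.append((ai, aj, "exact"))
        prev_i, prev_j = ai, aj
    return matches


def word_overlap(a: str, b: str) -> float:
    """Stand-in similarity for the demo (plain word-set overlap)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


original = ["Article I", "The Corporation shall issue Preferred Stock.", "Article II"]
revised = ["Article I", "The Corporation shall issue Series A Preferred Stock.",
           "A new paragraph.", "Article II"]
print(align(original, revised, word_overlap))
# [(0, 0, 'exact'), (1, 1, 'similar'), (2, 3, 'exact')]
```

Because every similarity match is confined to one gap, the unmatched revised paragraph ("A new paragraph.") surfaces as a single insertion rather than dragging later paragraphs out of alignment.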

    Before: 949 phantom insertions (rebuild fallback) → After: 332 insertions (inplace, correct)
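
To illustrate why the similarity metric in item 2 matters, here is a small Python sketch comparing Jaccard word similarity with TF-IDF cosine similarity on boilerplate-heavy text. The release does not state the tokenizer, the IDF formula, or the corpus used, so the smoothed IDF `log((1 + n) / (1 + df))`, the regex tokenizer, and the sample paragraphs below are assumptions for illustration only.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a: str, b: str) -> float:
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def build_idf(paragraphs: List[str]) -> Dict[str, float]:
    """Assumed smoothed IDF: terms present in every paragraph get weight 0."""
    n = len(paragraphs)
    df = Counter()
    for p in paragraphs:
        df.update(set(tokenize(p)))
    return {t: math.log((1 + n) / (1 + c)) for t, c in df.items()}


def tfidf_cosine(a: str, b: str, idf: Dict[str, float]) -> float:
    # Terms missing from `idf` are ignored; here the corpus covers both documents.
    va = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(a)).items()}
    vb = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(b)).items()}
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    na = math.sqrt(sum(w * w for w in va.values()))
    nb = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


boiler = "The holders of Preferred Stock of the Corporation shall"
original = boiler + " receive dividends"
decoy = boiler + " vote"  # shares only boilerplate with `original`
target = "The holders shall receive cumulative dividends"  # shares the substance
corpus = [original, decoy, boiler + " elect directors",
          boiler + " approve amendments", target]
idf = build_idf(corpus)

# Jaccard ranks the boilerplate-only decoy above the intended target ...
print(jaccard(original, decoy), jaccard(original, target))
# ... while TF-IDF cosine ranks the target above the decoy.
print(tfidf_cosine(original, decoy, idf), tfidf_cosine(original, target, idf))
```

In this toy corpus the shared legal phrasing dominates the Jaccard score, whereas the IDF weights push the comparison toward the rarer terms ("receive", "dividends") that actually distinguish one paragraph from another.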

What's Changed

Bug Fixes

  • fix(docx-core): order-constrained TF-IDF paragraph matching prevents phantom redlines by @stevenobiajulu in #62

Full Changelog: v0.7.0...v0.7.1