v0.7.1
Fixed
- Paragraph similarity alignment on legal boilerplate documents (#62, fixes #61): The hierarchical paragraph matching used a greedy first-match algorithm, so a low-similarity match could consume a revised paragraph that a higher-similarity match later in the document needed. On documents with extensive shared legal vocabulary (e.g., the NVCA Certificate of Incorporation), this caused incorrect paragraph alignment, garbled reject-all output, and a fallback to the rebuild reconstruction path with ~950 phantom insertions.
Fix: Two-part improvement:
- Order-constrained gap matching: Exact-hash anchors from Pass 1 divide both documents into gaps. Pass 2 similarity matching is scoped to each gap via a mini-LCS, so every match preserves document order.
- TF-IDF cosine similarity — Replaces Jaccard word similarity, which over-weights common boilerplate. IDF down-weights high-frequency terms like "holders", "Preferred Stock", "Corporation"; cosine similarity on TF-IDF vectors produces more discriminating scores.
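The order-constrained matching described above can be sketched roughly as follows. This is an illustrative Python sketch, not the docx-core implementation: the function names (`lcs_pairs`, `align`), the 0.5 threshold, and the use of plain string equality in place of paragraph hashes are all assumptions made for the example. The similarity function is passed in, so any scorer can be plugged in.

```python
from typing import Callable, List, Tuple

Pair = Tuple[int, int]


def lcs_pairs(a: List[str], b: List[str],
              score: Callable[[str, str], float]) -> List[Pair]:
    """Order-preserving alignment that maximizes the total score.

    A score of 0 means "these two paragraphs may not be paired".
    """
    n, m = len(a), len(b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            s = score(a[i], b[j])
            take = dp[i + 1][j + 1] + s if s > 0 else 0.0
            dp[i][j] = max(dp[i + 1][j], dp[i][j + 1], take)
    # Trace back through the table to recover the chosen pairs.
    pairs: List[Pair] = []
    i = j = 0
    while i < n and j < m:
        s = score(a[i], b[j])
        if s > 0 and dp[i][j] == dp[i + 1][j + 1] + s:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif dp[i][j] == dp[i + 1][j]:
            i += 1
        else:
            j += 1
    return pairs


def align(orig: List[str], rev: List[str],
          sim: Callable[[str, str], float],
          threshold: float = 0.5) -> List[Pair]:
    """Pass 1: exact anchors. Pass 2: similarity matching inside each gap."""
    def exact(x: str, y: str) -> float:
        return 1.0 if x == y else 0.0  # stand-in for an exact-hash comparison

    def fuzzy(x: str, y: str) -> float:
        s = sim(x, y)
        return s if s >= threshold else 0.0

    anchors = lcs_pairs(orig, rev, exact)
    matches: List[Pair] = []
    prev_i, prev_j = -1, -1
    # A sentinel anchor past the end closes the final gap.
    for ai, aj in anchors + [(len(orig), len(rev))]:
        gap_o = orig[prev_i + 1:ai]
        gap_r = rev[prev_j + 1:aj]
        for gi, gj in lcs_pairs(gap_o, gap_r, fuzzy):
            matches.append((prev_i + 1 + gi, prev_j + 1 + gj))
        if ai < len(orig):
            matches.append((ai, aj))
        prev_i, prev_j = ai, aj
    return matches


def word_jaccard(x: str, y: str) -> float:
    """Placeholder scorer for the demo only."""
    sx, sy = set(x.lower().split()), set(y.lower().split())
    return len(sx & sy) / len(sx | sy) if sx | sy else 0.0


orig = ["Title",
        "the holders of preferred stock shall vote",
        "the holders of common stock shall vote annually",
        "End"]
rev = ["Title",
       "the holders of common stock shall vote every year",
       "End"]
result = align(orig, rev, word_jaccard)
print(result)  # [(0, 0), (2, 1), (3, 2)]
```

In this toy input both original clauses clear the threshold against the single revised clause (0.6 and 0.7). A greedy first-match pass would pair the first one it sees, whereas the gap-scoped alignment keeps the higher-scoring pair and cannot produce matches that cross an anchor.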
Before: 949 phantom insertions (rebuild fallback) → After: 332 insertions (inplace, correct)
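To illustrate why TF-IDF cosine similarity is more discriminating than Jaccard on boilerplate-heavy text, here is a minimal Python sketch. It is not the docx-core code: the whitespace tokenizer, the unsmoothed `log(n / df)` IDF formula, the choice of corpus (all paragraphs from both documents), and every function name are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def jaccard(a: str, b: str) -> float:
    """Word-set overlap: every shared word counts equally."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def idf_table(paragraphs: List[str]) -> Dict[str, float]:
    """IDF = log(n / df); a term found in every paragraph gets weight 0."""
    n = len(paragraphs)
    df: Counter = Counter()
    for p in paragraphs:
        df.update(set(tokenize(p)))
    return {term: math.log(n / count) for term, count in df.items()}


def tfidf_vector(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    tf = Counter(tokenize(text))
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Two original clauses and one revised clause that share the same boilerplate.
unrelated = "the holders of preferred stock shall vote on mergers"
related = "the holders of preferred stock shall receive dividends"
revised = "the holders of preferred stock shall receive quarterly dividends"

idf = idf_table([unrelated, related, revised])
vec = {p: tfidf_vector(p, idf) for p in (unrelated, related, revised)}

jac_unrelated = jaccard(unrelated, revised)          # 0.5, boilerplate only
cos_unrelated = cosine(vec[unrelated], vec[revised])  # 0.0
cos_related = cosine(vec[related], vec[revised])

print(jac_unrelated, cos_unrelated, cos_related)
```

Under Jaccard the unrelated clause scores 0.5 against the revised clause purely because of the shared boilerplate, which can be enough to pass a similarity threshold. Under TF-IDF those shared words carry zero weight in this corpus, so only the related clause keeps a nonzero score.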
What's Changed
Bug Fixes
- fix(docx-core): order-constrained TF-IDF paragraph matching prevents phantom redlines by @stevenobiajulu in #62
Other Changes
- Add translated READMEs (zh, es, pt-br, de) by @stevenobiajulu in #60
- chore(release): bump to 0.7.1 — fix paragraph similarity alignment (#61) by @stevenobiajulu in #63
Full Changelog: v0.7.0...v0.7.1