fix: fix optimal chunking edge cases #32

Merged
lsorber merged 1 commit into main from ls-fix-chunking
Oct 15, 2024

Conversation

lsorber
Member

@lsorber commented Oct 15, 2024

This PR fixes or improves the following:

  1. If there are 0 or 1 sentences, exit early with 1 chunk.
  2. If the total length of all sentences is less than the maximum chunk size, exit early with 1 chunk.
  3. The discourse vector is now normalised to unit length.
  4. The discourse vector is only removed if it would not lead to all-zero embeddings (which can happen for an array of identical sentences).
  5. The sentence window size is automatically reduced if there are fewer sentences than the window size.
  6. Preconditions are verified at the beginning.
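
For readers who want to see the edge cases concretely, here is a minimal sketch of items 1 to 5. It is not the code from this PR: the function names (`needs_single_chunk`, `clamp_window`, `remove_discourse_vector`) are hypothetical, sentence length is assumed to be measured in characters, and the discourse vector is assumed to be the mean sentence embedding.

```python
import numpy as np


def needs_single_chunk(sentences: list[str], max_size: int) -> bool:
    """Items 1 and 2: return True when a single chunk is enough.

    Assumption: length is counted in characters.
    """
    if len(sentences) <= 1:
        return True
    return sum(len(s) for s in sentences) < max_size


def clamp_window(num_sentences: int, window_size: int) -> int:
    """Item 5: shrink the sentence window when there are too few sentences."""
    return min(window_size, num_sentences)


def remove_discourse_vector(embeddings: np.ndarray) -> np.ndarray:
    """Items 3 and 4: project out a unit-length discourse vector.

    Assumption: the discourse vector is the mean of the sentence embeddings.
    The original embeddings are returned unchanged if the projection would
    leave nothing but zeros (e.g. when all sentences are identical).
    """
    discourse = embeddings.mean(axis=0)
    norm = np.linalg.norm(discourse)
    if norm == 0.0:
        return embeddings  # no direction to remove
    discourse = discourse / norm  # item 3: normalise to unit length
    projected = embeddings - np.outer(embeddings @ discourse, discourse)
    if np.allclose(projected, 0.0):
        return embeddings  # item 4: avoid all-zero embeddings
    return projected
```

With identical sentences every embedding is parallel to the mean, so the projection collapses to zero and the guard in `remove_discourse_vector` keeps the original array instead.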

@lsorber self-assigned this Oct 15, 2024
@lsorber merged commit 996b9ee into main Oct 15, 2024
2 checks passed
@lsorber deleted the ls-fix-chunking branch October 15, 2024 08:36