feat/chunk_elements #3921

Open
Jimmy-web169 opened this issue Feb 14, 2025 · 0 comments
Labels
enhancement New feature or request

Jimmy-web169 commented Feb 14, 2025

When I call chunk_elements on my List[Element], a Table element is always combined with other elements, resulting in a CompositeElement. In addition, whenever the length of text_as_html exceeds the default maximum value, I run into the same issue: the result is a CompositeElement.

According to the official unstructured documentation for the chunk_elements function:

A single element that exceeds the hard-max is isolated (never combined with another element) and then divided into two or more chunks using text splitting.
A Table element is always isolated and never combined with another element. If a Table is oversized (exceeding the hard-max), it is divided into two or more TableChunk elements using text splitting.
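
As a rough illustration of the "divided into two or more chunks using text splitting" part of the rules quoted above, the sketch below cuts an oversized text into pieces no longer than a hard maximum. The helper name split_oversized and the whitespace-based strategy are assumptions made for illustration; this is not unstructured's actual splitter.

```python
from typing import List


def split_oversized(text: str, hard_max: int) -> List[str]:
    """Split `text` into pieces of at most `hard_max` characters.

    Hypothetical helper: prefers to break at the last space inside the
    window and falls back to a hard cut when no space is available.
    """
    if hard_max <= 0:
        raise ValueError("hard_max must be positive")

    pieces: List[str] = []
    remaining = text.strip()
    while len(remaining) > hard_max:
        # -- look for the last space among the first hard_max + 1 characters --
        cut = remaining.rfind(" ", 0, hard_max + 1)
        if cut <= 0:
            # -- no usable space in the window: cut at the limit --
            cut = hard_max
        pieces.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        pieces.append(remaining)
    return pieces
```

Under this model, a table whose text is longer than the hard-max would yield several pieces, each of which would become its own TableChunk rather than being merged with neighboring elements.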

Based on this, I expected that a Table element would never be combined with other elements into a CompositeElement.

One approach is to refactor the will_fit() method as follows:

def will_fit(self, element: Element) -> bool:
    # -- if the new element is a Table, it can only fit in an empty pre-chunk --
    if isinstance(element, Table):
        return len(self._elements) == 0

    # -- if the pre-chunk already contains a Table, no additional element should fit --
    if any(isinstance(e, Table) for e in self._elements):
        return False

    # -- an empty pre-chunk will accept any element (including an oversized element) --
    if len(self._elements) == 0:
        return True

    # -- a pre-chunk that already exceeds the soft-max is considered "full" --
    if self._text_length > self._opts.soft_max:
        return False

    # -- don't add an element if it would increase total size beyond the hard-max --
    return not self._remaining_space < len(element.text)

With this change, a Table fits only in an empty pre-chunk, and a pre-chunk that already contains a Table accepts no further elements, so a Table is never combined with another element.
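
To check how these rules behave end to end, here is a minimal, self-contained sketch. Element, Table, PreChunkBuilder, and pack are simplified stand-ins written for this example, not unstructured's real classes, and the size check ignores the separator length that the library's _remaining_space presumably accounts for.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Element:
    """Stand-in for an element; only `text` matters here."""
    text: str


class Table(Element):
    """Stand-in for a Table element."""


@dataclass
class PreChunkBuilder:
    """Hypothetical, simplified builder applying the proposed rules."""
    soft_max: int
    hard_max: int
    elements: List[Element] = field(default_factory=list)

    @property
    def text_length(self) -> int:
        return sum(len(e.text) for e in self.elements)

    def will_fit(self, element: Element) -> bool:
        # -- a Table only fits in an empty pre-chunk --
        if isinstance(element, Table):
            return not self.elements
        # -- a pre-chunk holding a Table accepts nothing else --
        if any(isinstance(e, Table) for e in self.elements):
            return False
        # -- an empty pre-chunk accepts any element --
        if not self.elements:
            return True
        # -- a pre-chunk beyond the soft-max is "full" --
        if self.text_length > self.soft_max:
            return False
        # -- stay within the hard-max (separators ignored in this sketch) --
        return self.text_length + len(element.text) <= self.hard_max


def pack(elements: Iterable[Element], soft_max: int, hard_max: int) -> List[List[Element]]:
    """Greedily group elements, starting a new pre-chunk when one won't fit."""
    pre_chunks: List[List[Element]] = []
    builder = PreChunkBuilder(soft_max, hard_max)
    for element in elements:
        if not builder.will_fit(element):
            pre_chunks.append(builder.elements)
            builder = PreChunkBuilder(soft_max, hard_max)
        builder.elements.append(element)
    if builder.elements:
        pre_chunks.append(builder.elements)
    return pre_chunks


if __name__ == "__main__":
    doc = [Element("intro"), Table("a | b"), Element("outro"), Element("more")]
    groups = pack(doc, soft_max=100, hard_max=100)
    print([[e.text for e in g] for g in groups])
    # prints [['intro'], ['a | b'], ['outro', 'more']]
```

In this model the table ends up alone in its pre-chunk even though all four elements would fit within the size limits, which is the isolation behavior the documentation describes.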

If you have any suggestions, or if I have misunderstood something, please let me know.
Thanks!

Jimmy-web169 added the enhancement (New feature or request) label Feb 14, 2025