When I call chunk_elements on my List[Element], each Table element is combined with neighboring elements into a CompositeElement. The same thing happens when the table's text_as_html is longer than the default maximum chunk size: the result is still a CompositeElement.
According to the official unstructured documentation for the chunk_elements function:
A single element that exceeds the hard-max is isolated (never combined with another element) and then divided into two or more chunks using text splitting.
A Table element is always isolated and never combined with another element. If a Table is oversized (exceeding the hard-max), it is divided into two or more TableChunk elements using text splitting.
Based on this, I expected that a Table element would never be combined with other elements into a CompositeElement.
One approach is to refactor the will_fit() method as follows:
    def will_fit(self, element: Element) -> bool:
        # -- if the new element is a Table, it can only fit in an empty pre-chunk --
        if isinstance(element, Table):
            return len(self._elements) == 0
        # -- if the pre-chunk already contains a Table, no additional element should fit --
        if any(isinstance(e, Table) for e in self._elements):
            return False
        # -- an empty pre-chunk will accept any element (including an oversized element) --
        if len(self._elements) == 0:
            return True
        # -- a pre-chunk that already exceeds the soft-max is considered "full" --
        if self._text_length > self._opts.soft_max:
            return False
        # -- don't add an element if it would increase total size beyond the hard-max --
        return not self._remaining_space < len(element.text)
With this change, a Table element fits only in an empty pre-chunk, and no other element can be added to a pre-chunk that already contains a Table, so a Table always ends up alone.
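To check the intended behavior outside the library, here is a minimal, self-contained sketch of the proposed rule. The `Element`, `Table`, `PreChunkBuilder`, and `pre_chunk` names below are simplified stand-ins written for this illustration, not the actual unstructured classes, and the `soft_max`/`hard_max` values are arbitrary. Text length is approximated as the sum of element text lengths, ignoring separators.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """Hypothetical stand-in for a document element."""
    text: str


class Table(Element):
    """Hypothetical stand-in for a table element."""


@dataclass
class PreChunkBuilder:
    """Simplified stand-in for a pre-chunk builder (not the library class)."""
    soft_max: int = 50   # arbitrary illustrative value
    hard_max: int = 100  # arbitrary illustrative value
    _elements: List[Element] = field(default_factory=list)

    @property
    def _text_length(self) -> int:
        # Approximation: ignores any separator between elements.
        return sum(len(e.text) for e in self._elements)

    @property
    def _remaining_space(self) -> int:
        return self.hard_max - self._text_length

    def will_fit(self, element: Element) -> bool:
        # A Table only fits in an empty pre-chunk.
        if isinstance(element, Table):
            return len(self._elements) == 0
        # Nothing may join a pre-chunk that already holds a Table.
        if any(isinstance(e, Table) for e in self._elements):
            return False
        # An empty pre-chunk accepts any element, even an oversized one.
        if len(self._elements) == 0:
            return True
        # A pre-chunk past the soft-max is considered full.
        if self._text_length > self.soft_max:
            return False
        # Otherwise the element must not push the size past the hard-max.
        return not self._remaining_space < len(element.text)

    def add(self, element: Element) -> None:
        self._elements.append(element)


def pre_chunk(elements: List[Element]) -> List[List[Element]]:
    """Group elements into pre-chunks, starting a new one when will_fit() is False."""
    groups: List[List[Element]] = []
    builder = PreChunkBuilder()
    for element in elements:
        if not builder.will_fit(element):
            groups.append(builder._elements)
            builder = PreChunkBuilder()
        builder.add(element)
    if builder._elements:
        groups.append(builder._elements)
    return groups


if __name__ == "__main__":
    groups = pre_chunk([Element("intro"), Table("a | b"), Element("outro")])
    print([[type(e).__name__ for e in g] for g in groups])
    # prints [['Element'], ['Table'], ['Element']]
```

In this sketch the Table lands in its own group regardless of how much room the neighboring pre-chunks have left, while ordinary text elements are still combined as before.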
If you have any suggestions, or if I have misunderstood the intended behavior, please let me know.
Thanks!