[FEAT] Support RecursiveRules (in reverse) for OverlapRefiner #150

Open

Sankgreall opened this issue Jan 19, 2025 · 3 comments
@Sankgreall (Contributor) commented Jan 19, 2025

📋 Quick Check

  • [x] I've checked this feature isn't already implemented or proposed
  • [x] This feature is relevant to Chonkie's purpose (text chunking for RAG)

💡 Feature Description

OverlapRefinery is limited to taking a fixed token/word prefix, but it would be much more powerful if we could apply recursive rules, with a minimum token length, in reverse.
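
As a purely illustrative example (hypothetical strings, not Chonkie output), here is the tail of a previous transcript chunk and the two kinds of prefix that could be drawn from it:

```python
# Hypothetical illustration only: the tail of a previous transcript chunk
# and the two kinds of prefix that could be drawn from it.
previous_chunk_tail = (
    "[Alice]: ...and that's why the migration slipped into Q3.\n"
    "[Bob]: Agreed. Let's revisit the rollout plan next week."
)

# Fixed token/word prefix: can start mid-utterance and drop the speaker label.
fixed_prefix = "revisit the rollout plan next week."

# Reverse recursive prefix: split backwards on the speaker delimiter "[",
# keeping whole turns until a minimum token length is reached.
recursive_prefix = "[Bob]: Agreed. Let's revisit the rollout plan next week."
```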

🛠️ Implementation Approach

Commentary on the issue here, and a sample implementation: https://sankgreall.medium.com/intelligent-transcript-chunking-63bfca62a5ad. Pasting the relevant code below as well.

```python
from typing import Any, Optional, Tuple

# Imports assumed from Chonkie's public API; adjust to your installed version.
from chonkie import Chunk, Context, OverlapRefinery, RecursiveRules


class HierarchicalOverlapRefinery(OverlapRefinery):
    """OverlapRefinery variant that builds the prefix context by applying
    RecursiveRules in reverse, down to a minimum token length."""

    def __init__(
        self,
        context_size: int = 128,
        min_tokens: int = 50,
        tokenizer: Any = None,
        rules: Optional[RecursiveRules] = None,
        merge_context: bool = True,
        inplace: bool = True,
        approximate: bool = True,
    ) -> None:
        super().__init__(
            context_size=context_size,
            tokenizer=tokenizer,
            merge_context=merge_context,
            inplace=inplace,
            approximate=approximate
        )
        self.min_tokens = min_tokens
        self.rules = rules
        self.mode = "prefix"

    def _get_token_count(self, text: str) -> int:
        if hasattr(self, "tokenizer") and not self.approximate:
            return len(self.tokenizer.encode(text))
        return int(len(text) / self._AVG_CHAR_PER_TOKEN)

    def _find_minimal_speaker_boundary(self, text: str) -> Optional[Tuple[str, int]]:
        """Find the smallest speaker chunk from the end that meets min_tokens."""
        if not self.rules or not self.rules.levels:
            return None
        
        speaker_rule = self.rules.levels[0]
        if not speaker_rule.delimiters:
            return None

        # Split text by speaker delimiter
        for delimiter in speaker_rule.delimiters:
            # Split text preserving delimiters
            parts = text.split(delimiter)
            if len(parts) <= 1:
                continue

            # Build chunks from the end until we meet min_tokens
            current_text = ""
            accumulated_parts = []
            
            # Work backwards through parts
            for part in reversed(parts):
                if part:  # Skip empty parts
                    test_text = part + delimiter + current_text
                    token_count = self._get_token_count(test_text)
                    
                    print(f"Testing chunk: {test_text[:50]}...")
                    print(f"Token count: {token_count}")
                    
                    if token_count >= self.min_tokens:
                        # Found smallest valid chunk
                        if accumulated_parts:
                            final_text = part + delimiter + delimiter.join(accumulated_parts)
                        else:
                            final_text = test_text
                            
                        # Find position in original text
                        start_pos = text.rindex(final_text)
                        return final_text, start_pos
                        
                    # Keep accumulating if under min_tokens
                    accumulated_parts.insert(0, part)
                    current_text = test_text

        return None

    def _ensure_min_tokens(self, text: str, start_level: int = 1) -> Optional[str]:
        """Recursively ensure text meets minimum token requirement using rules."""
        current_tokens = self._get_token_count(text)
        if current_tokens >= self.min_tokens:
            return text
            
        # Try each remaining rule level to extend context
        for level in range(start_level, len(self.rules.levels)):
            rule = self.rules.levels[level]
            if not rule.delimiters:
                continue
                
            for delimiter in rule.delimiters:
                prefix_text = text
                pos = text.rfind(delimiter)
                while pos > 0 and self._get_token_count(prefix_text) < self.min_tokens:
                    prefix_text = text[pos:]
                    pos = text[:pos].rfind(delimiter)
                    
                if self._get_token_count(prefix_text) >= self.min_tokens:
                    return prefix_text
                    
        return None

    def _hierarchical_prefix_context(self, chunk: Chunk) -> Optional[Context]:
        """Get prefix context using hierarchical rules."""
        if not self.rules or not self.rules.levels:
            return self._prefix_overlap_token(chunk)

        # First try to find minimal speaker boundary
        speaker_result = self._find_minimal_speaker_boundary(chunk.text)
        
        if speaker_result:
            context_text, start_pos = speaker_result
        else:
            # No speaker boundary found - use entire chunk
            context_text = chunk.text
            start_pos = 0

        try:
            context_tokens = self._get_token_count(context_text)
            return Context(
                text=context_text,
                token_count=context_tokens,
                start_index=chunk.start_index + start_pos,
                end_index=chunk.end_index
            )
        except ValueError:
            print(f"Warning: Could not find context in original text")
            return self._prefix_overlap_token(chunk)

    def _get_prefix_overlap_context(self, chunk: Chunk) -> Optional[Context]:
        return self._hierarchical_prefix_context(chunk)
```
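
To give a sense of intended usage, here is a rough sketch. The RecursiveRules/RecursiveLevel construction and the chunk/refine calls follow Chonkie's public API as I understand it; the "\n[" speaker delimiter and transcript_text are illustrative assumptions, and exact constructor arguments may differ between versions.

```python
from chonkie import RecursiveChunker, RecursiveLevel, RecursiveRules

# Level 0 splits on speaker turns (e.g. lines like "\n[Alice]: ..."),
# level 1 falls back to sentence boundaries.
rules = RecursiveRules(
    levels=[
        RecursiveLevel(delimiters=["\n["]),
        RecursiveLevel(delimiters=[". ", "? ", "! "]),
    ]
)

chunker = RecursiveChunker(chunk_size=600, rules=rules)
chunks = chunker.chunk(transcript_text)  # transcript_text: your raw transcript string

refinery = HierarchicalOverlapRefinery(
    context_size=128,
    min_tokens=50,
    rules=rules,
    merge_context=True,
)
refined_chunks = refinery.refine(chunks)
```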

🎯 Why is this needed?

Extends the capability of OverlapRefinery so that it can consider document/chunk structure when making prefix decisions.

I'll do my best to submit a PR in the future, but I'm raising this as a feature request in case there is any comment or feedback.

@Sankgreall added the enhancement (New feature or request) label on Jan 19, 2025
@bhavnicksm (Collaborator) commented:

Hey @Sankgreall! 😊

Firstly, thank you for creating the issue; I am positively amazed at the due diligence put into this issue and the blog post as well~ This is awesome! I'd like to boost the blog from Chonkie's main channels as well, if you don't mind 🫶🚀

Some comments I had on this enhancement are as follows (in no particular order):

  • Yes, I think it makes sense to give the user the option to add structural context with the overlap refinery. Currently, the OverlapRefinery does support dynamic overlap, but only for sentences: if it sees SentenceChunks or SemanticChunks (as indications of having used SentenceChunker or SemanticChunker respectively), it uses sentences to build the overlap. I haven't exposed this yet because I didn't want to add to the user's cognitive load, but we can add that as well. Similarly, we could add a method = 'recursive' option in the overlap refinery to do it recursively, taking in RecursiveRules when required (a rough signature sketch follows this list).
  • I am a bit concerned about the latency of the HierarchicalOverlapRefinery. If I understand this correctly, it would recursively chunk every (N-1)th chunk to make context for the Nth chunk, which would add to the calls to the tokenizer and splitter. I'll need to test it out to understand how much time it would take exactly. What timings were you seeing?
  • Another concern: say we used a RecursiveChunker, which already creates chunks that are "dynamic" in terms of tokens. If the OverlapRefinery is initialised with a context_size higher than the RecursiveChunker's chunk_size, then you'd usually be guaranteed to get the entire previous chunk as the context. Would there be a difference for the LM if we supplied such contextual chunks? Since the fallback in the HierarchicalOverlapRefinery is to use the entire previous chunk, if we just went with that, would we be worse off? I'm just trying to understand the value of it, not trying to disparage this, of course!
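
Something like this is what I have in mind for the first bullet (purely a hypothetical signature, not the current API):

```python
# Hypothetical future signature, not the current Chonkie API: a single
# OverlapRefinery with a `method` switch instead of a separate subclass.
refinery = OverlapRefinery(
    context_size=128,
    method="recursive",   # e.g. "token" | "sentence" | "recursive"
    rules=rules,          # RecursiveRules; only used when method == "recursive"
    min_tokens=50,        # minimum prefix size in recursive mode
    merge_context=True,
)
```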

Happy to discuss more on this over on our Discord channel! Hit me up, @bhavnicksm and we can chat on one of the Voice Channels as well~

Thanks! 😊

@Sankgreall (Contributor, Author) commented Jan 20, 2025

Thanks for the reply, some excellent points, and I would be greatly honoured if you wanted to circulate the blog!

With regards to latency, I just ran some tests. My existing transcript was about 15,000 tokens and I was making chunks with a token limit of 600. I increased the transcript size by copying and pasting it many times, and then reran the operation with some timers. Here are the results (just for the refining).

On my device at least, I don't notice a significant issue.

```
Average chunk processing time: 0.0008 seconds
Maximum chunk processing time: 0.0077 seconds
Average boundary search time:  0.0002 seconds

Total processing time: 2.00 seconds
Total tokens processed: 1,949,136
Processing speed: 973,971.4 tokens/second
```
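
For anyone wanting to reproduce similar numbers, a minimal timing sketch (a hypothetical harness, not the exact script used for the figures above; it assumes `chunks` and `refinery` from the earlier example):

```python
import time

# Time each per-chunk context computation plus the overall pass.
per_chunk = []
start = time.perf_counter()
for chunk in chunks:
    t0 = time.perf_counter()
    refinery._get_prefix_overlap_context(chunk)  # per-chunk prefix computation
    per_chunk.append(time.perf_counter() - t0)
total = time.perf_counter() - start

print(f"Average chunk processing time: {sum(per_chunk) / len(per_chunk):.4f} seconds")
print(f"Maximum chunk processing time: {max(per_chunk):.4f} seconds")
print(f"Total processing time: {total:.2f} seconds")
```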

On your last bullet, as we've now chatted about, the goal with this refinery is to return the minimum prefix necessary to establish speaker context for the LLM. The way I've implemented this is by leveraging the RecursiveRules to select the best-fitting portion of the previous chunk whose token count is greater than min_tokens.

This is because we want to provide the LLM with the maximum useful information (i.e., who is speaking), whilst being aware that more tokens increase the chance of the model losing coherence.

The implications of this approach also go well beyond transcripts. My grander plan would be for the refinery to be capable of scrolling past chunk[n-1] and applying RecursiveRules further up the historical context. This would allow, for example, a prefix containing the last major heading or title of a markdown document, or, in the case of transcripts, the last known speaker even if that speaker started speaking 5 chunks ago.
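
A rough sketch of what that look-back might look like (a hypothetical helper; the name, delimiter, and logic are illustrative only, not part of the proposed class):

```python
from typing import List, Optional

def find_last_speaker_turn(chunks: List[Chunk], n: int, delimiter: str = "\n[") -> Optional[str]:
    """Walk backwards from chunk[n-1] until a chunk containing a speaker
    delimiter is found, and return its tail from the last turn onward, so
    the prefix can carry the most recent speaker even if they started
    speaking several chunks earlier."""
    for i in range(n - 1, -1, -1):
        pos = chunks[i].text.rfind(delimiter)
        if pos != -1:
            return chunks[i].text[pos:]
    return None
```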

I've been working on a more complete class today. Would you be happy for me to raise a PR?

@bhavnicksm (Collaborator) commented:

Hey @Sankgreall! 😄

PRs welcome 🤗

It would be great to merge the HierarchicalOverlapRefinery functionality into the current OverlapRefinery and add an additional signature parameter, method, which, when set to 'recursive', follows the logic of HierarchicalOverlapRefinery as you have set it out.

Let me know if you have any issues or discussion points; happy to connect!

Thanks ☺️
