Conversation

@OGuggenbuehl OGuggenbuehl commented Jul 29, 2025

Proposed Changes:

Implement MarkdownHeaderSplitter, a component that splits Markdown (.md) Documents based on their headers.
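
For illustration, here is a minimal sketch of what header-based splitting can look like in plain Python. It is not the code from this PR and does not use the Haystack API: `split_by_headers`, `HEADER_RE`, `FENCE`, and the chunk dictionary layout are hypothetical names chosen for the example, and it only handles ATX-style headers (`#` to `######`).

```python
import re
from typing import Dict, List, Optional

# ATX-style header: 1-6 '#' characters, whitespace, then the title text.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
# Marker that opens/closes a fenced code block (built this way to keep the example readable).
FENCE = "`" * 3


def split_by_headers(text: str) -> List[Dict[str, object]]:
    """Split Markdown text into one chunk per header (hypothetical helper).

    Text before the first header is kept as a chunk with header=None.
    Lines inside fenced code blocks are never treated as headers.
    """
    chunks: List[Dict[str, object]] = []
    header: Optional[str] = None
    level = 0
    lines: List[str] = []
    in_fence = False

    def flush() -> None:
        # Emit the current chunk unless it is an empty, header-less preamble.
        content = "\n".join(lines).strip()
        if header is not None or content:
            chunks.append({"header": header, "level": level, "content": content})

    for line in text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            flush()
            header, level, lines = match.group(2), len(match.group(1)), []
        else:
            lines.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    sample = "intro\n# A\nbody a\n## B\nbody b\n"
    for chunk in split_by_headers(sample):
        print(chunk)
```

Keeping the header level next to each chunk is a design choice of this sketch: it lets a caller reconstruct the parent headers of a chunk later without re-parsing the text.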

How did you test it?

unit tests

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@CLAassistant

CLAassistant commented Jul 29, 2025

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added the topic:tests and type:documentation (Improvements on the docs) labels Jul 29, 2025
@OGuggenbuehl OGuggenbuehl changed the title from "Feature/md header splitter" to "feat:MarkdownHeaderSplitter" Jul 29, 2025
@sjrl sjrl self-assigned this Aug 19, 2025
@sjrl
Contributor

sjrl commented Aug 19, 2025

@OGuggenbuehl this definitely looks like an interesting approach! I've left an initial set of comments, but to review further I'd appreciate it if you could add a set of tests like the ones we have for the DocumentSplitter: https://github.com/deepset-ai/haystack/blob/main/test/components/preprocessors/test_document_splitter.py

This will help me review the actual splitting algorithm, since it's easier to understand with examples.

@sjrl sjrl changed the title from "feat:MarkdownHeaderSplitter" to "feat: MarkdownHeaderSplitter" Aug 27, 2025
@OGuggenbuehl OGuggenbuehl force-pushed the feature/md-header-splitter branch from ba90272 to 7ef16a7 on September 23, 2025 12:22
@OGuggenbuehl OGuggenbuehl force-pushed the feature/md-header-splitter branch from 56dd0a0 to d7d4f18 on September 25, 2025 09:07
@OGuggenbuehl OGuggenbuehl force-pushed the feature/md-header-splitter branch from abb0a84 to 44e0454 on September 26, 2025 15:22
@OGuggenbuehl
Author

@sjrl I have been thinking about whether keeping _infer_header_levels as a method makes sense for this. It's only useful in certain cases, the algorithm does not perfectly recreate the document structure in all cases, and it is a distinct concern from the component as a whole. Do you think it makes sense to keep it in, or to move it to a more experimental / internal repo instead?

@sjrl
Contributor

sjrl commented Sep 29, 2025

> @sjrl I have been thinking about whether keeping _infer_header_levels as a method makes sense for this. It's only useful in certain cases, the algorithm does not perfectly recreate the document structure in all cases, and it is a distinct concern from the component as a whole. Do you think it makes sense to keep it in, or to move it to a more experimental / internal repo instead?

Good question! It does sound different from what our other splitters do, and it almost fits better into the DocumentCleaner abstraction that we have. If you find that you don't use it often and/or that it doesn't always work well, I could see it as a separate component that we add to https://github.com/deepset-ai/haystack-experimental first, so we can easily make breaking changes to it and gather feedback before fully committing to it. What do you think?

@OGuggenbuehl
Author

> > @sjrl I have been thinking about whether keeping _infer_header_levels as a method makes sense for this. It's only useful in certain cases, the algorithm does not perfectly recreate the document structure in all cases, and it is a distinct concern from the component as a whole. Do you think it makes sense to keep it in, or to move it to a more experimental / internal repo instead?
>
> Good question! It does sound different from what our other splitters do, and it almost fits better into the DocumentCleaner abstraction that we have. If you find that you don't use it often and/or that it doesn't always work well, I could see it as a separate component that we add to https://github.com/deepset-ai/haystack-experimental first, so we can easily make breaking changes to it and gather feedback before fully committing to it. What do you think?

I feel that this would be a more appropriate approach, also because it would improve the separation of concerns between components. I suggest that I:

  • remove the header-level inference functionality from this component
  • re-implement it as its own component and put up a PR in haystack-experimental

@sjrl
Contributor

sjrl commented Sep 29, 2025

@OGuggenbuehl that sounds good!

@OGuggenbuehl
Author

@sjrl Done! MarkdownHeaderLevelsInferrer now lives here and has been entirely removed from this PR.

@OGuggenbuehl OGuggenbuehl requested a review from sjrl September 30, 2025 11:19