feat: MarkdownHeaderSplitter #9660

OGuggenbuehl · 2025-07-29T13:55:52Z

Proposed Changes:

Implement MarkdownHeaderSplitter, a component that splits Markdown (.md) Documents at their headers.
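For readers unfamiliar with header-based splitting, the idea can be sketched as below. This is a minimal illustration only, not the code in this PR: the helper name `split_by_headers`, the returned dictionary keys, and the regular expression are all assumptions made for the example. It handles ATX-style (`#`) headers only, drops any text before the first header, and does not skip `#` lines inside fenced code blocks.

```python
import re
from typing import Dict, List

# ATX-style headers only: one to six '#' characters, whitespace, then the title.
# (Assumed pattern for illustration; the component's own pattern may differ.)
HEADER_PATTERN = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def split_by_headers(text: str) -> List[Dict[str, object]]:
    """Split Markdown text into one chunk per header (hypothetical helper)."""
    matches = list(HEADER_PATTERN.finditer(text))
    chunks: List[Dict[str, object]] = []
    for i, match in enumerate(matches):
        # A section's body runs from the end of its header line
        # to the start of the next header (or the end of the text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        chunks.append(
            {
                "header": match.group(2),
                "level": len(match.group(1)),
                "content": text[match.end():end].strip(),
            }
        )
    return chunks


if __name__ == "__main__":
    sample = "# Title\nIntro text.\n\n## Sub\nMore text.\n"
    for chunk in split_by_headers(sample):
        print(chunk)
```

Compiling the pattern once at module level avoids recompiling it for every document.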

How did you test it?

unit tests

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
I documented my code
I ran pre-commit hooks and fixed any issue

CLAassistant · 2025-07-29T13:55:59Z

All committers have signed the CLA.

sjrl · 2025-08-19T11:46:09Z

@OGuggenbuehl definitely looks like an interesting approach! I've left an initial set of comments, but to review further I'd appreciate it if you could add a set of tests like the ones we have for the DocumentSplitter: https://github.com/deepset-ai/haystack/blob/main/test/components/preprocessors/test_document_splitter.py

This will help me review the actual splitting algorithm, since it's easier to understand with examples.

OGuggenbuehl · 2025-09-27T10:42:53Z

@sjrl I have been thinking about whether it makes sense to keep _infer_header_levels as a method here. It is only useful in certain cases, the algorithm does not recreate the document structure perfectly in every case, and it is a concern distinct from the component as a whole. Do you think it makes sense to keep it in, or to move it to a more experimental / internal repo instead?

sjrl · 2025-09-29T06:45:41Z

> @sjrl I have been thinking about whether it makes sense to keep _infer_header_levels as a method here. It is only useful in certain cases, the algorithm does not recreate the document structure perfectly in every case, and it is a concern distinct from the component as a whole. Do you think it makes sense to keep it in, or to move it to a more experimental / internal repo instead?

Good question! It does sound different from what our other splitters do and almost fits better into the DocumentCleaner abstraction that we have. If you find that you don't use it often and/or that it doesn't always work well I could see it as a separate component that we could add to https://github.com/deepset-ai/haystack-experimental first so we can easily make breaking changes to it and gather feedback before fully committing to it. What do you think?

OGuggenbuehl · 2025-09-29T08:43:37Z

> > @sjrl I have been thinking about whether it makes sense to keep _infer_header_levels as a method here. It is only useful in certain cases, the algorithm does not recreate the document structure perfectly in every case, and it is a concern distinct from the component as a whole. Do you think it makes sense to keep it in, or to move it to a more experimental / internal repo instead?
>
> Good question! It does sound different from what our other splitters do and almost fits better into the DocumentCleaner abstraction that we have. If you find that you don't use it often and/or that it doesn't always work well I could see it as a separate component that we could add to https://github.com/deepset-ai/haystack-experimental first so we can easily make breaking changes to it and gather feedback before fully committing to it. What do you think?

I feel that this would be the more appropriate approach, also because it would improve the separation of concerns between components. I suggest that I:

remove the header-level inference functionality from this component
re-implement it as its own component and put up a PR in haystack-experimental

sjrl · 2025-09-29T08:47:24Z

@OGuggenbuehl that sounds good!

OGuggenbuehl · 2025-09-29T14:55:32Z

@sjrl done! MarkdownHeaderLevelsInferrer now lives here and is entirely removed from this PR

OGuggenbuehl added 3 commits July 11, 2025 16:17

implement md-header-splitter and add tests

a2a4f86

rework md-header splitter to rewrite md-header levels

4337a5b

remove deprecated test

393cd53

github-actions bot added topic:tests type:documentation Improvements on the docs labels Jul 29, 2025

OGuggenbuehl changed the title ~~Feature/md header splitter~~ feat:MarkdownHeaderSplitter Jul 29, 2025

Merge branch 'main' into feature/md-header-splitter

de6b0d9

sjrl self-assigned this Aug 19, 2025

sjrl reviewed Aug 19, 2025

haystack/components/preprocessors/markdown_header_splitter.py (10 review threads, all marked outdated)

sjrl changed the title ~~feat:MarkdownHeaderSplitter~~ feat: MarkdownHeaderSplitter Aug 27, 2025

OGuggenbuehl and others added 9 commits September 9, 2025 14:32

Update haystack/components/preprocessors/markdown_header_splitter.py

0e9f955

use haystack logging Co-authored-by: Sebastian Husch Lee <[email protected]>

use native types

fad1ed7

move to haystack logging

8910485

Update haystack/components/preprocessors/markdown_header_splitter.py

b3114e6

remove temp toc Co-authored-by: Sebastian Husch Lee <[email protected]>

docstrings improvements

2abec16

Merge branch 'feature/md-header-splitter' of https://github.com/OGuggenbuehl/haystack into feature/md-header-splitter

3917116

fix CustomDocumentSplitter arguments

7f92dc9

remove header prefix from content

6d75b58

rework split_id assignment to avoid collisions

c1bb05e

OGuggenbuehl force-pushed the feature/md-header-splitter branch from ba90272 to 7ef16a7 Compare September 23, 2025 12:22

OGuggenbuehl and others added 4 commits September 23, 2025 15:54

Merge branch 'main' into feature/md-header-splitter

876b244

unified page-counting

5203603

simplify conditional secondary-split initialization and usage

debe17e

Merge branch 'feature/md-header-splitter' of https://github.com/OGuggenbuehl/haystack into feature/md-header-splitter

fc2cc58

sjrl reviewed Sep 24, 2025

haystack/components/preprocessors/markdown_header_splitter.py (5 review threads, 4 marked outdated)

test/components/preprocessors/test_markdown_header_splitter.py (1 review thread, marked outdated)

OGuggenbuehl added 4 commits September 24, 2025 13:54

fix linting error

b74cefc

clearly specify the use of ATX-style headers (#) only

e7e9872

reference doc_id when logging no headers found

f66e77b

initialize md-header pattern as private variable once

445ffe8

OGuggenbuehl force-pushed the feature/md-header-splitter branch from 56dd0a0 to d7d4f18 Compare September 25, 2025 09:07

OGuggenbuehl added 5 commits September 26, 2025 17:22

add example to for inferring header levels to docstring

1b2160b

improve empty document handling

94218fa

add more logging for empty documents

more explicit testing for inferred headers

b6e2486

fix linting issue

530eafa

improved empty content handling test cases

44e0454

OGuggenbuehl force-pushed the feature/md-header-splitter branch from abb0a84 to 44e0454 Compare September 26, 2025 15:22

remove all functionality related to inferring md-header levels

47e3b9e

compile regex-pattern in init for performance gains

12fbf8b

OGuggenbuehl requested a review from sjrl September 30, 2025 11:19
