feat(converters): CSVToDocument supports row-level conversion #9773
Pull Request Test Coverage Report for Build 18377253514 (Coveralls)
Yes, you are right. I’m working on uploading the right code
On Mon, Sep 8, 2025 at 20:10 Michele Pangrazzi wrote:
Hi @xoaryaa! Your PR claims to fix #8848, but the pushed changes are completely unrelated to CSVToDocument. The PR instead implements a request_headers feature for the LinkContentFetcher component. Are you aware of this? Maybe you pushed the wrong code by mistake?
944f9fa to 35c44a6
I've left a few comments!
e45faae to 3b907fa
Thanks! I've added another comment.
    "Could not parse CSV rows for {source}. Falling back to file mode. Error: {error}",
    source=source,
    error=e,
)
@mpangrazzi In this case, I would recommend raising an error and stopping. If the user wants to convert rows to documents, they are unlikely to want the whole CSV as a fallback. We should notify them and stop execution instead.
WDYT?
@Amnah199 I agree, I wasn't considering the fallback properly. We may want to raise an error directly here (I was thinking about a raise flag, but maybe I am overthinking that).
@xoaryaa Here as well, let's remove the fallback implementation: if the row conversion of the CSV fails, raise a RuntimeError with a clear message and stop execution.
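A minimal sketch of the behaviour requested here, not the code that ended up in the PR. `parse_rows_or_raise` is a hypothetical standalone helper: it reads rows with `csv.DictReader` and, on a parsing failure, raises a `RuntimeError` instead of falling back to file mode. Passing `strict=True` to the reader is an assumption made so that malformed input actually triggers `csv.Error`.

```python
import csv
import io


def parse_rows_or_raise(text: str, source: str) -> list[dict[str, str]]:
    """Parse CSV text into row dicts; raise RuntimeError if parsing fails."""
    try:
        # strict=True (an assumption of this sketch) makes the csv module
        # raise csv.Error on malformed input instead of guessing.
        reader = csv.DictReader(io.StringIO(text), strict=True)
        return list(reader)
    except csv.Error as e:
        # No fallback to file mode: report the problem and stop.
        raise RuntimeError(f"Could not parse CSV rows for {source}. Error: {e}") from e
```

Chaining with `from e` keeps the original `csv.Error` visible in the traceback, so the user sees both the clear message and the underlying cause.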
def _build_document_from_row(
    self, row: dict[str, Any], base_meta: dict[str, Any], row_index: int, content_column: Optional[str]
) -> Document:
    """
    Create a Document from a single CSV row. Does not catch exceptions; caller wraps.
    """
Would be nice to have parameter details in the docstrings
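One way the requested parameter details could look, as a sketch only. The `Document` dataclass below is a stand-in so the example runs on its own, the function is written as a free function rather than a method, and the zero-based `row_index` and the `KeyError` behaviour are assumptions of this sketch rather than something the PR confirms.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Document:
    """Stand-in for Haystack's Document, only so this sketch is self-contained."""

    content: str
    meta: dict[str, Any] = field(default_factory=dict)


def build_document_from_row(
    row: dict[str, Any], base_meta: dict[str, Any], row_index: int, content_column: str
) -> Document:
    """
    Create a Document from a single CSV row.

    :param row: Mapping of column name to cell value for this row.
    :param base_meta: Metadata shared by all rows of the file; copied into each Document.
    :param row_index: Position of the row (zero-based here), stored as `row_number` in the metadata.
    :param content_column: Name of the column whose value becomes the Document content.
    :returns: A Document whose remaining columns are stored in `meta`.
    :raises KeyError: If `content_column` is not a key of `row`.
    """
    meta = dict(base_meta)  # copy so rows do not share one mutable dict
    meta.update({k: v for k, v in row.items() if k != content_column})
    meta["row_number"] = row_index
    return Document(content=str(row[content_column]), meta=meta)
```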
if content_column:
    content = self._safe_value(row.get(content_column))
else:
    content = "\n".join(f"{k}: {self._safe_value(v)}" for k, v in row.items())
@mpangrazzi Not sure about this fallback either. I would rather require the user to pass the column that should be used as Document.content.
@sjrl I remember we had a small discussion about this. Do you have an opinion?
@Amnah199 Yeah, I agree on this as well; let's be consistent with the one above.
@mpangrazzi @Amnah199 what exactly do you want me to do?
@xoaryaa It would be nice if you could update the PR to make content_column a required parameter of the run method instead of init. Then we can remove the fallback logic that handles a missing content_column.
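The request above could be sketched roughly as follows. `rows_to_documents` is a hypothetical free function standing in for the component's run method, plain dicts stand in for Document objects, and the choice of `ValueError` for an unknown column is an assumption; none of this is taken from the final PR code.

```python
import csv
import io
from typing import Any


def rows_to_documents(csv_text: str, content_column: str) -> list[dict[str, Any]]:
    """Turn each CSV row into a {"content", "meta"} dict; content_column is mandatory."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    if content_column not in header:
        # No "key: value" fallback: an unknown column is reported as an error.
        raise ValueError(f"content_column '{content_column}' not found in CSV header {header}")
    docs = []
    for row in reader:
        # Every column except the content column goes into the metadata.
        meta = {k: v for k, v in row.items() if k != content_column}
        docs.append({"content": row[content_column], "meta": meta})
    return docs
```

Because `content_column` has no default, a caller who omits it gets a `TypeError` from Python itself, which is the simplest way to enforce the parameter.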
@xoaryaa Hi! Would you be able to address what we discussed in the last comments? Let us know if it's not clear enough. We are looking forward to including your changes in the next Haystack release!
Yes, I will do that. I just need a couple of days, and I'll be done by the end of this week, if that's fine?
12c44bc to 18a0fe6
@mpangrazzi @Amnah199 The integration tests are not running; it looks like it's due to path filters since only
Hi @xoaryaa! Sorry about the delay. Since you added content_column as a new parameter to the run() method, you also need to update the following tests in test/core/pipeline/features/test_run.py:
test_running_a_correct_pipeline[AsyncPipeline-that is a file conversion pipeline with two joiners]
test_running_a_correct_pipeline[Pipeline-that is a file conversion pipeline with two joiners]
… columns→meta) + tests + releasenote Signed-off-by: Arya Tayshete <[email protected]>
…e fallbacks; improve docstrings; update tests Signed-off-by: Arya Tayshete <[email protected]>
d10d22d to 9da567d
LGTM!
Related Issues
- CSVToDocumentConverter to Support Row-Level Conversion #8848

Proposed Changes:
- conversion_mode: Literal["file","row"] (default: "file") to CSVToDocument.
- conversion_mode="row", convert each CSV row into one Document.
- content comes from a user-selected content_column (if provided).
- Document.meta as {column_name: value}.
- row_number to meta for traceability.
- delimiter and quotechar (passed to csv.DictReader).
- "file" mode remains the default and unchanged.
- "file" mode.

How did you test it?
- test_row_mode_with_content_column: asserts per-row Document creation, content from selected column, and remaining columns in meta.
- test_row_mode_without_content_column: asserts readable "key: value" listing when content_column=None.
- test_row_mode_meta_merging: verifies ByteStream/meta merging into each row’s meta.
- file_path handling and delimiters.

Notes for the reviewer
- "file" as the default conversion_mode.
- store_full_path behavior is respected in row mode (we still shorten file_path unless explicitly requested).

Checklist
- fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
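To illustrate the delimiter, quotechar, and row_number points from the proposed changes, here is a small sketch using only the standard library. `read_rows` is a hypothetical helper, and numbering rows from zero is an assumption; the PR description does not say where row_number starts.

```python
import csv
import io
from typing import Any


def read_rows(text: str, delimiter: str = ",", quotechar: str = '"') -> list[dict[str, Any]]:
    """Read CSV rows with a custom dialect and tag each row with its position."""
    # delimiter and quotechar are forwarded to csv.DictReader as format parameters.
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    # row_number is zero-based in this sketch (an assumption).
    return [{**row, "row_number": i} for i, row in enumerate(reader)]


rows = read_rows("id;comment\n1;'semi;colon inside'\n2;plain\n", delimiter=";", quotechar="'")
```

With a semicolon delimiter and a single-quote quotechar, the quoted cell keeps its embedded semicolon instead of being split into two fields.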