Skip to content

Conversation

@xoaryaa
Copy link
Contributor

@xoaryaa xoaryaa commented Sep 8, 2025

Related Issues

Proposed Changes:

  • Add optional conversion_mode: Literal["file","row"] (default: "file") to CSVToDocument.
  • In conversion_mode="row", convert each CSV row into one Document.
    • content comes from a user-selected content_column (if provided).
    • All remaining CSV columns are copied into Document.meta as {column_name: value}.
    • Adds row_number to meta for traceability.
  • Add CSV parsing options delimiter and quotechar (passed to csv.DictReader).
  • Preserve existing behavior & API: "file" mode remains the default and unchanged.
  • Friendly fallback: if row parsing fails, log a warning and fall back to "file" mode.
  • Docs & release note added.

How did you test it?

  • Unit tests (all passing locally):
    • test_row_mode_with_content_column: asserts per-row Document creation, content from selected column, and remaining columns in meta.
    • test_row_mode_without_content_column: asserts readable "key: value" listing when content_column=None.
    • test_row_mode_meta_merging: verifies ByteStream/meta merging into each row’s meta.
  • Existing CSV converter tests still pass (backward-compat).
  • Manual verification with a small CSV (2–3 rows) to check file_path handling and delimiters.

Notes for the reviewer

  • Backward compatibility is preserved by keeping "file" as the default conversion_mode.
  • store_full_path behavior is respected in row mode (we still shorten file_path unless explicitly requested).
  • Column names that collide with existing meta keys are not overwritten; row columns are only added where keys don’t exist.
  • Happy to extend with optional type-casting for common numeric/boolean columns in a follow-up if desired.

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@xoaryaa xoaryaa requested a review from a team as a code owner September 8, 2025 13:08
@xoaryaa xoaryaa requested review from mpangrazzi and removed request for a team September 8, 2025 13:08
@coveralls
Copy link
Collaborator

coveralls commented Sep 8, 2025

Pull Request Test Coverage Report for Build 18377253514

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 4 unchanged lines in 1 file lost coverage.
  • Overall coverage remained the same at 92.062%

Files with Coverage Reduction New Missed Lines %
components/converters/csv.py 4 95.45%
Totals Coverage Status
Change from base Build 18340248289: 0.0%
Covered Lines: 13233
Relevant Lines: 14374

💛 - Coveralls

Copy link
Contributor

@mpangrazzi mpangrazzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @xoaryaa!

Your PR claims to fix #8848, but the pushed changes are completely unrelated to CSVToDocument. The PR instead implements a request_headers feature for the LinkContentFetcher component. Are you aware of this? Maybe you pushed the wrong code by mistake?

@xoaryaa
Copy link
Contributor Author

xoaryaa commented Sep 8, 2025 via email

@github-actions github-actions bot added the type:documentation Improvements on the docs label Sep 9, 2025
@xoaryaa xoaryaa requested a review from mpangrazzi September 10, 2025 04:44
Copy link
Contributor

@mpangrazzi mpangrazzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left a few comments!

Copy link
Contributor

@mpangrazzi mpangrazzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I've added another comment.

Comment on lines 158 to 161
"Could not parse CSV rows for {source}. Falling back to file mode. Error: {error}",
source=source,
error=e,
)
Copy link
Contributor

@Amnah199 Amnah199 Sep 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpangrazzi In this case, I would recommend raising an error and stopping. If the user wants to convert the rows to docs, there is a less chance they would want to use the whole CSV as fallback. We should rather notify and stop execution.
WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Amnah199 I agree on that, I wasn't considering the fallback properly. We may want to raise directly an error here (I was thinking about a raise flag, but maybe I am overthinking that)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xoaryaa Here as well, lets remove the fallback implementation i.e. if the row conversion of the csv fails, raise a RuntimeError with a clear message and stop execution.

Comment on lines 199 to 207
def _build_document_from_row(
self, row: dict[str, Any], base_meta: dict[str, Any], row_index: int, content_column: Optional[str]
) -> Document:
"""
Create a Document from a single CSV row. Does not catch exceptions; caller wraps.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be nice to have parameter details in the docstrings

Comment on lines 208 to 212
if content_column:
content = self._safe_value(row.get(content_column))
else:
content = "\n".join(f"{k}: {self._safe_value(v)}" for k, v in row.items())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpangrazzi Not sure about this fallback either. I would rather enforce the user to pass a column that should be marked as Document.content.

@sjrl I remember we had a small discussion about this. Do you have an opinion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Amnah199 yeah I agree also on this, let's be consistent with the above one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mpangrazzi @Amnah199 what exactly do you want me to do?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xoaryaa Would be nice if you can update the PR to make content_column a required param in run method instead of init. And we can remove this fallback logic if the content_column is missing.

@mpangrazzi
Copy link
Contributor

@xoaryaa Hi! Would you be able to address what we discussed in the last comments? Let us know if it's not clear enough. We are looking forward to include your changes in the next Haystack release!

@xoaryaa
Copy link
Contributor Author

xoaryaa commented Sep 16, 2025 via email

@xoaryaa
Copy link
Contributor Author

xoaryaa commented Sep 23, 2025

@mpangrazzi @Amnah199 The integration tests are not running — looks like it’s due to path filters since only csv.py and its unit test were changed. Please confirm this is expected.

Copy link
Contributor

@mpangrazzi mpangrazzi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @xoaryaa! Sorry about the delay. Since you added content_column as a new parameter to the run() method, you need also to update the following tests:

  • test_running_a_correct_pipeline[AsyncPipeline-that is a file conversion pipeline with two joiners]
  • test_running_a_correct_pipeline[Pipeline-that is a file conversion pipeline with two joiners]

in test/core/pipeline/features/test_run.py.

@xoaryaa xoaryaa force-pushed the feat/csv-row-mode branch from d10d22d to 9da567d Compare October 5, 2025 06:26
@xoaryaa xoaryaa requested a review from mpangrazzi October 6, 2025 07:26
Copy link
Contributor

@Amnah199 Amnah199 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@mpangrazzi mpangrazzi enabled auto-merge (squash) October 9, 2025 13:03
@mpangrazzi mpangrazzi merged commit f8d6757 into deepset-ai:main Oct 9, 2025
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

topic:tests type:documentation Improvements on the docs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enhance CSVToDocument Converter to Support Row-Level Conversion

4 participants