Predicate validation in parquet could happen before compute #753

Open
phofl opened this issue Jan 16, 2024 · 4 comments
Comments

@phofl
Collaborator

phofl commented Jan 16, 2024

    import dask.dataframe as dd
    import pytest

    # tmp_path and engine are assumed to be pytest fixtures from the test suite
    ddf = dd.from_dict(
        {"A": range(8), "B": [1, 1, 2, 2, 3, 3, 4, 4]},
        npartitions=4,
    )
    ddf.to_parquet(tmp_path, engine=engine)

    with pytest.raises(ValueError, match="not a valid operator in predicates"):
        unsupported_op = [[("B", "not eq", 1)]]
        dd.read_parquet(tmp_path, engine=engine, filters=unsupported_op)

Ideally, this would raise before we trigger compute.

@phofl
Collaborator Author

phofl commented Jan 16, 2024

cc @rjzamora if you have thoughts where this belongs

@rjzamora
Member

It should raise once dask-expr actually passes the filters to pyarrow, but this won't happen until we need to know the number of partitions. So, dd.read_parquet(...).compute() should raise. If this is not the case, then there is indeed a bug somewhere.

@rjzamora
Member

rjzamora commented Jan 16, 2024

Just to clarify: We can still add an extra/earlier validation step to dask-expr. I'll take a look to see where that would be easiest.
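As a rough illustration of what such an earlier validation step could look like, here is a minimal sketch in plain Python. It is not dask-expr's or pyarrow's actual code: `validate_filters` and `SUPPORTED_OPS` are hypothetical names, and the operator set is an assumption based on the operators pyarrow documents for its `filters` argument. The idea is just to walk the predicate tuples when the expression is built, without reading any data.

```python
# Hypothetical sketch of an eager filter check; not dask-expr's or pyarrow's code.

# Assumed operator set, based on the operators pyarrow documents for `filters`.
SUPPORTED_OPS = frozenset({"=", "==", "!=", "<", ">", "<=", ">=", "in", "not in"})


def validate_filters(filters):
    """Raise ValueError for invalid predicates without touching any data."""
    if filters is None:
        return
    # Filters are either a flat list of tuples (a single AND group)
    # or a list of lists of tuples (an OR of AND groups).
    if all(isinstance(f, tuple) for f in filters):
        groups = [filters]
    else:
        groups = filters
    for group in groups:
        for predicate in group:
            if not isinstance(predicate, tuple) or len(predicate) != 3:
                raise ValueError(
                    f"{predicate!r} is not a (column, op, value) predicate"
                )
            _col, op, _val = predicate
            if op not in SUPPORTED_OPS:
                # Message wording chosen to match the pattern used in the test above.
                raise ValueError(
                    f'"{predicate}" is not a valid operator in predicates.'
                )


validate_filters([[("B", "==", 1)]])  # a valid filter passes silently

try:
    validate_filters([[("B", "not eq", 1)]])
except ValueError as exc:
    print(exc)
```

If something like this were called when `read_parquet` builds its expression, the error would surface at call time instead of when the number of partitions is first needed. Where exactly such a hook belongs in dask-expr is left open here.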

@phofl
Collaborator Author

phofl commented Jan 16, 2024

Sorry, you are completely correct: it does raise when I call compute.

Ideally we would fail earlier, but this is by no means urgent. I'll change the title.

phofl changed the title from "Predicate validation in parquet seems to miss something" to "Predicate validation in parquet could happen before compute" on Jan 16, 2024