Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "[TPC-H] Refactor query 9 to avoid recursion error in dask-expr" #1299

Merged

Conversation

phofl
Copy link
Contributor

@phofl phofl commented Jan 29, 2024

Reverts #1277

can not reproduce locally

@phofl phofl closed this Jan 29, 2024
@phofl phofl reopened this Jan 29, 2024
@phofl phofl closed this Jan 30, 2024
@phofl phofl reopened this Jan 30, 2024
@milesgranger
Copy link
Contributor

milesgranger commented Jan 30, 2024

can not reproduce locally

Weird, I can still produce it with the latest release (0.3.5) on this PR (dask 2024.1.1)

Nvm, I see #1296 ought to fix it. I'm for reverting it, but I think it's also good to cover this aspect/style of writing as a typical user might also write queries this way. Will also make it obvious to us what aspects of the API need performance improvements. As opposed to just not used those APIs.

Further edit, okay, maybe we just go for this then if it's more performant, a lot of the CI failures include query 9 timeout failure.

@phofl
Copy link
Contributor Author

phofl commented Jan 30, 2024

Using lambdas is inherently bad in systems that can be optimised, so we should start warning users about this at some point when we have column expressions. There is not much point in benchmarking this and shooting ourselves in the foot on purpose while doing so. You are shuffling and reading the whole dataset which is just incredibly slow, which explains the timeouts

I was doing something dumb, which hid the error from me, sorry.

@phofl
Copy link
Contributor Author

phofl commented Jan 30, 2024

@milesgranger Could you push this over the finish line? it looks like the query is schematically off

@milesgranger
Copy link
Contributor

@phofl ready now, failures are unrelated.

FAILED tests/tpch/test_dask.py::test_query_10 - RuntimeError: shuffle_barrier failed during shuffle 9a0b09c0c0f40cd95e7fa7560513bb54
ERROR tests/tpch/test_dask.py::test_query_11 - asyncio.exceptions.TimeoutError: 1/16 nanny worker(s) did not shut down within 120s

Copy link
Member

@hendrikmakait hendrikmakait left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@hendrikmakait hendrikmakait merged commit a9918cc into main Jan 30, 2024
@hendrikmakait hendrikmakait deleted the revert-1277-milesgranger/tpch-query9-dask-refactor branch January 30, 2024 13:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants