[ADAP-658] [Feature] Spark Connect as connection method #493

Open
3 tasks done
timvw opened this issue Jun 24, 2023 · 11 comments · May be fixed by dbt-labs/dbt-spark#899
Labels
help_wanted Extra attention is needed pkg:dbt-spark Issue affects dbt-spark type:enhancement New feature request

Comments

@timvw

timvw commented Jun 24, 2023

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt-spark functionality, rather than a Big Idea better suited to a discussion

Describe the feature

I would like to be able to use dbt-spark via the Spark Connect API.

Describe alternatives you've considered

We could decide not to support this

Who will this benefit?

All users that have a Spark Connect endpoint available

Are you interested in contributing this feature?

Yes

Anything else?

https://spark.apache.org/docs/latest/spark-connect-overview.html

@timvw timvw added type:enhancement New feature request triage:product In Product's queue labels Jun 24, 2023
@github-actions github-actions bot changed the title [Feature] Spark Connect as connection method [ADAP-658] [Feature] Spark Connect as connection method Jun 24, 2023
@dataders
Contributor

dataders commented Jul 4, 2023

@timvw I agree this could unlock quite a bit for us over time. 👁️ @Fokko do you know much about this new feature?

@dataders dataders removed the triage:product In Product's queue label Jul 4, 2023
@Fokko
Contributor

Fokko commented Jul 7, 2023

@dataders Thanks for pinging me. I worked with Databricks' Spark connect quite a bit, and it is great to see that it is now part of Spark Open Source. I think it makes a lot of sense to add this.

@ssabdb

ssabdb commented Jul 7, 2023

@Fokko - I would be interested in your take on my interpretation of spark-connect's suitability in dbt-labs/dbt-spark#821?

I have no experience with spark connect, but if the objective is to support the execution of SQL from DBT I can see how this would work.

I'm not sure it would support python models as presently implemented but this is perhaps not the intent of this issue.

@timvw timvw closed this as completed Aug 28, 2023
@timvw
Author

timvw commented Aug 28, 2023

Closed (as the possibility to connect via Livy seems more favorable for now)

@vakarisbk vakarisbk linked a pull request Oct 3, 2023 that will close this issue
4 tasks
@vakarisbk

vakarisbk commented Oct 4, 2023

Hi! I would like to reopen this discussion as I have made a PR dbt-labs/dbt-spark#899 introducing support for Spark Connect SQL models (well probably should have done this before the PR, but water under the bridge now :) ).

I believe it makes sense to introduce support for Spark Connect SQL models because it unlocks an additional way of using dbt with open source Spark without many code changes on the dbt side (the implementation is based on the existing Spark Session code). Currently the only way to run dbt with open source Spark in production is a Thrift connection, so adding at least one alternative would open up dbt to more users.

Livy as an alternative was also discussed in issue dbt-labs/dbt-spark#821. Livy would work well for SQL models, but the Livy open source project is pretty much dead. Some cloud providers (AWS EMR, Google Cloud Dataproc, maybe some others) still expose Livy-compatible APIs, though, so users on those providers would benefit from dbt Livy support. There is also a fairly new open source project called Lighter, which aims to replace Livy and has a Livy-compatible API.

But I don't think the question should be Spark Connect OR Livy. I think we can support both, especially since supporting Spark Connect would probably not require much additional effort: the implementation is closely tied to Spark Session, which dbt already supports.

I would like to hear what dbt and the community think about introducing Spark connect SQL models and whether it's worth supporting this feature.

@vakarisbk

vakarisbk commented Oct 4, 2023

Regarding Python models:

Livy would be much better suited for dbt Python models, as it would stick to the dbt philosophy of generating the code locally and then shipping it somewhere else to execute. It would also support running arbitrary Python code remotely, not just the subset of APIs supported by Spark Connect. But again, the open source Livy project is pretty much dead.

Spark Connect, on the other hand, is a fairly good alternative. It is limited in that it only supports the DataFrame API and, as of the latest Spark release, the Pandas API on Spark and PyTorch, but maybe that's enough for most use cases? And there are always UDFs, which are also executed remotely AFAIK.
It also allows easier local development, as spinning up a local Spark Connect cluster is very easy.
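
To make the "DataFrame API only" constraint concrete, here is a minimal sketch of a dbt Python model that stays within that subset. The `model(dbt, session)` entry point follows the dbt Python model convention; the upstream model name `orders` and the column `customer_id` are hypothetical, and whether dbt-spark would actually run such a model over Spark Connect is exactly what is still open in this thread.

```python
def model(dbt, session):
    """A dbt Python model that uses nothing beyond DataFrame API calls."""
    dbt.config(materialized="table")

    # "orders" is a hypothetical upstream model, used only for illustration.
    orders = dbt.ref("orders")

    # groupBy/count are plain DataFrame operations. Under Spark Connect the
    # client would describe them as a plan for the server to execute; no
    # arbitrary Python is shipped to the cluster here.
    return orders.groupBy("customer_id").count()
```

A model that imported a third-party package and called it directly in the function body would run that call in the client process instead, which is the limitation being discussed here.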

I think it would make sense to split the discussion on Python models on Spark Connect into a separate issue if anyone wants to continue discussing it.

@ben-schreiber
Contributor

ben-schreiber commented Feb 6, 2024

@vakarisbk I agree 100%. I would also add two points:

  1. Since the SparkSession used for executing SQL with Spark Connect is exactly the one we would use to execute Python, adding support for dbt Python models on Spark Connect as well is low-hanging fruit.
  2. Based on what I've seen (and you mentioned), Livy is an older technology that is dying, and Thrift supports only SQL. Additionally, Spark Connect seems to be the incoming generation of technology for remotely connecting to a Spark application.

@timvw timvw reopened this Feb 6, 2024
@ssabdb
Copy link

ssabdb commented Feb 9, 2024

I proposed dbt-labs/dbt-spark#821 and agree with the recommendation to split them into two separate sets of requirements, one for spark connect as a method to support SQL and one for a means (spark connect or whatever) to implement python dbt models in OSS spark.

This ticket focuses on using Spark Connect as an alternative to the Thrift server method; even with SQL-only support, that would still bring advantages.

I haven't tried it yet (I might if I get around to it), but it may well be possible to do this without any changes at all, just by setting

export SPARK_REMOTE="sc://localhost"

That would bring SQL support only, but it would improve on the current basic Spark session implementation.
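
As a rough illustration of the idea above, the following sketch builds a Spark Connect connection string and opens a session either from an explicit URL or from `SPARK_REMOTE`. The helper names `build_connect_url` and `open_session` are made up for this example and are not part of dbt-spark or PySpark. It assumes PySpark 3.4 or later with the Spark Connect client dependencies installed and a Spark Connect server listening on the default port 15002.

```python
import os
from typing import Optional

DEFAULT_CONNECT_PORT = 15002  # default port of a Spark Connect server


def build_connect_url(host: str, port: int = DEFAULT_CONNECT_PORT,
                      token: Optional[str] = None,
                      use_ssl: bool = False) -> str:
    """Build a connection string of the form sc://host:port/;key=value;..."""
    url = f"sc://{host}:{port}"
    params = []
    if use_ssl:
        params.append("use_ssl=true")
    if token is not None:
        params.append(f"token={token}")
    if params:
        url += "/;" + ";".join(params)
    return url


def open_session(url: Optional[str] = None):
    """Open a Spark Connect session from a URL, or from SPARK_REMOTE if unset."""
    # Imported lazily so build_connect_url stays usable without pyspark.
    from pyspark.sql import SparkSession

    if url is not None:
        return SparkSession.builder.remote(url).getOrCreate()
    if "SPARK_REMOTE" not in os.environ:
        raise RuntimeError("Pass a URL or set SPARK_REMOTE, e.g. sc://localhost")
    # With SPARK_REMOTE set, the regular builder creates a Spark Connect session.
    return SparkSession.builder.getOrCreate()
```

With a server running locally, `open_session(build_connect_url("localhost")).sql("select 1").show()` would be the equivalent of exporting `SPARK_REMOTE` and using the plain session builder.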

@ben-schreiber to be clear, I think there would be a limitation of Spark Connect, which is highlighted by @vakarisbk:

Livy would be much better suited for dbt Python models as it would stick to dbt philosophy of generating the code locally and then shipping it somewhere else to execute. And it would support running arbitrary Python code remotely, not just a subset of APIs that are supported by Spark Connect

Or to put it another way, spark connect cannot run arbitrary python remotely - AFAIK, there's no way to access an available python interpreter, and no requirement for one to be available. That's different to the approach taken by the other connectors which have all the relevant bits of python executed on the remote server. Quite possibly that's an acceptable limitation but a potentially confusing one - packages would only be installed locally, for example, whilst the configuration makes it clear this is for remote installation.

I do share the concerns around Livy's aliveness as well.

@ben-schreiber
Contributor

@ssabdb Agreed that there is a limitation; I think this is the key point:

Quite possibly that's an acceptable limitation but a potentially confusing one

Additionally, since there are numerous ways to connect to and use Spark, I'm not sure a "one size fits all" approach to Python DBT models for OSS Spark is the correct one. In any event, let's leave the Python model discussion for a dedicated issue (#415 ?)

@Fleid Fleid added the help_wanted Extra attention is needed label Feb 22, 2024
@GeorgiiKolpakov

@ben-schreiber @vakarisbk
I'm curious what the resolution of this discussion is. Is it fine to introduce Spark Connect as a SQL-only solution to this issue?
If yes, what hurdles remain before the existing PR can be merged? I see a review with one comment that has been resolved. Would it be possible to merge that PR and close this issue?

@ben-schreiber
Contributor

@GeorgiiKolpakov We would need to involve someone with merge permissions to move the PR forward

@mikealfare mikealfare added the pkg:dbt-spark Issue affects dbt-spark label Jan 13, 2025
@mikealfare mikealfare transferred this issue from dbt-labs/dbt-spark Jan 13, 2025
VanTudor pushed a commit that referenced this issue Jan 14, 2025
Co-authored-by: Jeremy Cohen <[email protected]>
mikealfare pushed a commit that referenced this issue Jan 20, 2025
Co-authored-by: colin-rogers-dbt <[email protected]>
9 participants