[Databricks] Supporting OAuth & Serverless compute #127

Open
wants to merge 7 commits into main
Conversation

@zacdav-db commented Oct 14, 2024

Not the prettiest first attempt; I've added support for serverless and OAuth.

  • Using the Databricks SDK and the sdkConfig as the mechanism to connect and authenticate
  • Serverless defaults to FALSE and currently still requires version to be specified (strictly, this isn't required)
  • Added a boolean check to strip Spark configs when on serverless, as they can't be applied there (rough usage sketch below)
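
For context, a rough sketch of what the user-facing call might look like with these changes. Illustrative only: the serverless argument is what this PR proposes and its exact shape may still change, and credentials are assumed to be resolved through the Databricks SDK config rather than an explicit token.

library(sparklyr)

# Hypothetical usage of the options proposed in this PR: auth is delegated to
# the Databricks SDK config, and serverless = TRUE (default FALSE) drops the
# cluster-style Spark configs that serverless compute cannot accept.
sc <- spark_connect(
  method     = "databricks_connect",
  serverless = TRUE,
  version    = "15.1"  # still has to be supplied for now, see discussion below
)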

Comment on lines 45 to 49
# # Checks for OAuth Databricks token inside the RStudio API
# if (is.null(token) && exists(".rs.api.getDatabricksToken")) {
# getDatabricksToken <- get(".rs.api.getDatabricksToken")
# token <- set_names(getDatabricksToken(databricks_host()), "oauth")
# }
Author

This should be handled by the SDK config component.

Collaborator

Hey, are we talking about this SDK? https://github.com/databricks/databricks-sdk-py/ And if so, can you point me to where it handles the RStudio token? I can't seem to find it.

Author

The SDK won't detect the .rs.api.getDatabricks* functions, but maybe there's a gap in my understanding; I thought Connect would also write to a config file, which the SDK should pick up?
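
For reference, the kind of config file the SDK would pick up on its own is the standard ~/.databrickscfg profile file, sketched below; whether Connect actually writes one out is the open question here. The values are placeholders.

# ~/.databrickscfg (placeholder values; the SDK reads the DEFAULT profile unless told otherwise)
[DEFAULT]
host  = https://example.cloud.databricks.com
token = dapi-placeholder   # or OAuth client credentials instead of a PAT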

Collaborator

@edgararuiz left a comment

Thank you for sending over this PR, it's looking great!


@@ -71,22 +72,28 @@ spark_connect_method.spark_method_databricks_connect <- function(
method <- method[[1]]
token <- databricks_token(token, fail = FALSE)
Collaborator

Based on your comment on line 137, I think we should remove this line and have token only populated when the user passes it as an argument in the spark_connect() call.

Author

@zacdav-db Oct 15, 2024

My thinking for leaving this was that users explicitly setting the DATABRICKS_TOKEN and DATABRICKS_HOST vars should have those respected, since they were set explicitly. The Databricks Python SDK won't detect those when it's done from R.

The databricks_token function also looks for CONNECT_DATABRICKS_TOKEN, so I think it's probably important to leave that intact?

I was expecting the hierarchy to be:

  1. Explicit token
  2. DATABRICKS_TOKEN
  3. CONNECT_DATABRICKS_TOKEN
  4. .rs.api.getDatabricksToken(host)
  5. Python SDK explicit setting of profile
  6. Python SDK detection of DEFAULT profile

Where 1-4 are handled by databricks_token (rough sketch below).
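
Roughly, in code (an illustrative sketch only, the helper name is made up; the real pieces are databricks_token and the SDK config):

# Sketch of the precedence described above -- not the actual pysparklyr internals.
resolve_token <- function(token = NULL, host = NULL) {
  if (!is.null(token)) return(token)                      # 1. explicit argument
  for (var in c("DATABRICKS_TOKEN", "CONNECT_DATABRICKS_TOKEN")) {
    val <- Sys.getenv(var, unset = "")
    if (nzchar(val)) return(val)                          # 2. / 3. env vars
  }
  if (exists(".rs.api.getDatabricksToken")) {             # 4. RStudio-managed OAuth
    return(get(".rs.api.getDatabricksToken")(host))
  }
  NULL  # 5. / 6. fall through and let the Python SDK resolve a profile
}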

# sdk config
conf_args <- list(host = master)
# if token is found, propagate
# otherwise trust in sdk to detect and do what it can?
Collaborator

Yes, if we remove line 72, then this if makes sense to leave.

conn <- exec(databricks_session, !!!remote_args)
sdk_config <- db_sdk$core$Config(!!!conf_args)

# unsure if this is needed anymore?
Collaborator

I think we need to remove this from here, especially since we can't use httr2:::is_hosted_session() (::: is not allowed). Do you think this is important for the package to do if the user is on desktop? If so, what do you think about isolating it in its own exported function? Maybe pysparklyr::databricks_desktop_login()?
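
If it does turn out to matter for desktop users, the exported helper could stay quite small. A sketch only, assuming the Databricks CLI is installed and reusing its existing databricks auth login OAuth flow; the function itself is hypothetical, matching the name suggested above.

# Hypothetical pysparklyr::databricks_desktop_login(), not implemented in this PR.
# Assumes the Databricks CLI is on the PATH; "databricks auth login" opens the
# browser-based OAuth (U2M) flow and caches the resulting credentials.
databricks_desktop_login <- function(host = Sys.getenv("DATABRICKS_HOST")) {
  if (!interactive()) {
    stop("Browser-based OAuth login only makes sense in an interactive session.")
  }
  system2("databricks", c("auth", "login", "--host", host))
  invisible(host)
}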

Author

I don't think this is required; I'll do some testing without it.

@zacdav-db (Author)

I'm reviewing this with a clearer mind and I think we should attempt to defer as much as possible to the SDK for Databricks auth and logic. I'll have an attempt.

@edgararuiz (Collaborator)

Hey, want me to do another review, or are you still working on this?

@zacdav-db (Author)

@edgararuiz it's at the point where I'd appreciate some input if you have a spare moment. It's not finalised, but I want to ensure we agree on the direction / shape it's taking! 🙏

@edgararuiz (Collaborator)

@zacdav-db - Looking good. Feel free to remove sanitize_host(); we can always restore it later if needed. Also, is it passing tests locally for you?

@zacdav-db (Author)

@edgararuiz not passing tests locally; it seems there's an install issue.

There's also one complication with using the SDK for auth everywhere: it requires the Python env to be loaded, which depends on version being known, and version can't be determined without the API, which (if using the SDK) requires the env.

We probably don't want to force version to be specified, but I think the only way we can defer to the SDK is to do so.
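
One possible way out of that circularity, sketched below rather than part of this PR: for classic clusters the runtime version can be looked up with a plain REST call (here via httr2) before any Python env is loaded. Serverless has no cluster_id, so it would still need separate handling.

library(httr2)

# Sketch only: fetch the cluster's runtime version over REST so the Python env
# (and therefore the SDK) doesn't have to exist yet. Classic clusters only.
cluster_spark_version <- function(cluster_id,
                                  host = Sys.getenv("DATABRICKS_HOST"),
                                  token = Sys.getenv("DATABRICKS_TOKEN")) {
  resp <- request(host) |>
    req_url_path("/api/2.0/clusters/get") |>
    req_url_query(cluster_id = cluster_id) |>
    req_auth_bearer_token(token) |>
    req_perform()
  resp_body_json(resp)$spark_version
}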
