[OpenLineage] Added Openlineage support for DatabricksCopyIntoOperator #45257

Open · wants to merge 6 commits into main

Conversation

rahul-madaan
Contributor


This PR adds OpenLineage support for DatabricksCopyIntoOperator (/providers/databricks/operators/databricks_sql.py), using CopyFromExternalStageToSnowflakeOperator, which already has OL support, as a reference.
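For context, Airflow operators expose lineage to the OpenLineage provider through a `get_openlineage_facets_on_complete` method that returns an `OperatorLineage` of input and output datasets. Below is a minimal, self-contained sketch of that shape; the dataclasses are stand-ins for the real OpenLineage/provider types so the snippet runs without Airflow installed, and the mapping logic is illustrative, not the exact code in this PR.

```python
from dataclasses import dataclass, field
from typing import List
from urllib.parse import urlparse

# Stand-ins for the real OpenLineage types (Dataset, OperatorLineage),
# used only so this sketch is runnable without Airflow/OpenLineage installed.
@dataclass
class Dataset:
    namespace: str
    name: str

@dataclass
class OperatorLineage:
    inputs: List[Dataset] = field(default_factory=list)
    outputs: List[Dataset] = field(default_factory=list)

def copy_into_lineage(file_location: str, table_name: str) -> OperatorLineage:
    """Illustrative mapping from COPY INTO arguments to lineage datasets."""
    parsed = urlparse(file_location)
    # The source file becomes an input dataset keyed by its storage URI.
    source = Dataset(
        namespace=f"{parsed.scheme}://{parsed.netloc}",
        name=parsed.path.lstrip("/") or "/",
    )
    # table_name is catalog.schema.table in Unity Catalog; the real operator
    # would also resolve the Databricks host into the output namespace.
    target = Dataset(namespace="databricks", name=table_name)
    return OperatorLineage(inputs=[source], outputs=[target])
```

For example, `copy_into_lineage("s3a://kreative360/yoyo/sample.csv", "wide_world_importers.astronomer_assets.sample")` yields one input dataset for the S3 object and one output dataset for the target table.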

Tested using this DAG:
"""
Example DAG demonstrating the usage of DatabricksCopyIntoOperator with OpenLineage support.
"""

import logging
logging.getLogger('databricks.sql').setLevel(logging.DEBUG)

from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.databricks.operators.databricks_sql import DatabricksCopyIntoOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'databricks_copy_into_example',
    default_args=default_args,
    description='Example DAG for DatabricksCopyIntoOperator with OpenLineage',
    schedule=None,
    start_date=datetime(2024, 12, 13),
    catchup=False,
    tags=['example', 'databricks', 'openlineage'],
) as dag:

    # Example with S3
    copy_from_s3 = DatabricksCopyIntoOperator(
        task_id='copy_from_s3',
        databricks_conn_id='databricks_default',
        table_name='wide_world_importers.astronomer_assets.sample',
        file_location='s3a://kreative360/yoyo/sample.csv',
        file_format='CSV',
        format_options={
            "header": "true",
            "inferSchema": "true",
            "delimiter": ","
        },
        copy_options={
            "force": "true",
            "mergeSchema": "true"
        },
        http_path='/sql/1.0/warehouses/ca43e87568a0b22e',
        credential={
            "AWS_ACCESS_KEY": "<redacted>",
            "AWS_SECRET_KEY": "<redacted>",
            "AWS_SESSION_TOKEN": "<redacted>",
            "AWS_REGION": "ap-south-1"
        }
    )

    # Example with Azure Blob Storage using wasbs protocol
    copy_from_azure = DatabricksCopyIntoOperator(
        task_id='copy_from_azure',
        databricks_conn_id='databricks_default',  
        table_name='wide_world_importers.astronomer_assets.sample',
        file_location='wasbs://[email protected]/sample.csv',
        file_format='CSV',
        # Using Azure storage credential
        credential={
            "AZURE_SAS_TOKEN": "<redacted>", # Replace with actual SAS token
        },
        format_options={
            "header": "true",
            "inferSchema": "true"
        },
        copy_options={
            "force": "true",
            "mergeSchema": "true"
        },
        http_path='/sql/1.0/warehouses/ca43e87568a0b22e'
    )

    # Example with GCS
    copy_from_gcs = DatabricksCopyIntoOperator(
        task_id='copy_from_gcs',
        databricks_conn_id='databricks_default',
        table_name='wide_world_importers.astronomer_assets.sample',
        file_location='gs://kreative360/yoyo/sample.csv',
        file_format='CSV',
        format_options={
            "header": "true",
            "inferSchema": "true",
            "delimiter": ","
        },
        copy_options={
            "force": "true",
            "mergeSchema": "true"
        },
        http_path='/sql/1.0/warehouses/ca43e87568a0b22e',
    )

    [copy_from_s3, copy_from_azure, copy_from_gcs]

Note: tests have been performed with an S3 object on AWS only; the other cloud providers (Azure and GCS) have been tested only via FAIL events.
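One detail worth checking while verifying the emitted events: the OpenLineage naming convention normalizes some storage schemes (e.g. `s3a://` and `s3n://` both map to an `s3://{bucket}` namespace). A small sketch of that normalization for the three stores exercised in the DAG above; the alias map reflects my reading of the naming convention, not the provider's actual implementation:

```python
from urllib.parse import urlparse

# Assumed scheme aliases per the OpenLineage dataset naming convention;
# the provider's real implementation may handle more cases.
_SCHEME_ALIASES = {"s3a": "s3", "s3n": "s3"}

def ol_dataset_for(file_location: str) -> tuple:
    """Return a (namespace, name) pair for a COPY INTO file_location."""
    parsed = urlparse(file_location)
    scheme = _SCHEME_ALIASES.get(parsed.scheme, parsed.scheme)
    namespace = f"{scheme}://{parsed.netloc}"  # wasbs keeps container@account
    name = parsed.path.lstrip("/")
    return namespace, name
```

So `ol_dataset_for("s3a://kreative360/yoyo/sample.csv")` gives `("s3://kreative360", "yoyo/sample.csv")`, while the `wasbs://` and `gs://` locations keep their schemes unchanged.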

OL events:


@rahul-madaan rahul-madaan changed the title [OpenLineage] Added Openlineage support to DatabricksCopyIntoOperator [OpenLineage] Added Openlineage support for DatabricksCopyIntoOperator Dec 28, 2024
@rahul-madaan
Contributor Author

@kacpermuda @potiuk could you please take a look at the PR and approve?
