A pipeline comprises one or more nodes that are (in many cases) connected to define execution dependencies. Each node is implemented by a component and typically performs only a single task, such as loading data, processing data, training a model, or sending an email. Note that in Apache Airflow components are called operators, but for the sake of consistency the Elyra documentation refers to them as components.
A generic pipeline comprises nodes that are implemented using generic components. Elyra includes generic components that run Jupyter notebooks, Python scripts, and R scripts. Generic components are supported in every Elyra pipeline runtime environment: local/JupyterLab, Kubeflow Pipelines, and Apache Airflow.
The following tutorials cover generic pipelines:
- Introduction to generic pipelines
- Run generic pipelines on Kubeflow Pipelines
- Run generic pipelines on Apache Airflow
A runtime-specific pipeline comprises nodes that are implemented using generic components or custom components. Custom components are runtime-specific and user-provided.
In this intermediate tutorial you will learn how to add Apache Airflow components to Elyra and how to utilize them in pipelines.
The features described in this tutorial require Elyra v3.3 or later. The tutorial instructions were last updated using Elyra v3.3.0 and Airflow v1.10.12.
- JupyterLab 3.x with Elyra extension v3.3 (or later) installed.
- Access to an Apache Airflow deployment.
Some familiarity with Apache Airflow and Apache Airflow operators (i.e., components) is required to complete the tutorial. If you are new to Elyra, please review the Run generic pipelines on Apache Airflow tutorial. It introduces concepts and tasks that are used in this tutorial, but not explained here to avoid content duplication.
Collect the following information for your Apache Airflow installation:
- API endpoint, e.g. `http://your-airflow-webserver:port`
- GitHub API endpoint, e.g. `https://api.github.com`, if the repository is hosted on GitHub
- GitHub DAG repository name, e.g. `your-git-org/your-dag-repo`
- GitHub DAG repository branch, e.g. `main`
- GitHub access token, e.g. `4d79206e616d6520697320426f6e642e204a616d657320426f6e64`
Detailed instructions for setting up a DAG repository and generating an access token can be found in the User Guide.
Elyra utilizes S3-compatible cloud storage to make data available to Jupyter notebooks and R or Python scripts while they are executed. Any kind of S3-based cloud storage should work (e.g. IBM Cloud Object Storage or Minio) as long as it can be accessed from the machine where JupyterLab/Elyra is running and from the Apache Airflow cluster.
Elyra also writes the STDOUT (including STDERR) run output to a file when the environment variable `ELYRA_GENERIC_NODES_ENABLE_SCRIPT_OUTPUT_TO_S3` is set to `true` or is not present in the runtime container, which is the default. This happens in addition to logging and writing to STDOUT and STDERR at runtime.

`.ipynb` file execution run/STDOUT output is written to S3-compatible object storage in the following files:

- `<notebook name>-output.ipynb`
- `<notebook name>.html`

`.r` and `.py` file execution run/STDOUT output is written to S3-compatible object storage in the following file:

- `<r or python filename>.log`

Note: If you prefer to use S3-compatible storage only for the transfer of files between pipeline steps and not for the logging information / run output of R, Python, and Jupyter notebook files, either set the environment variable `ELYRA_GENERIC_NODES_ENABLE_SCRIPT_OUTPUT_TO_S3` to `false` in runtime container builds or pass that value explicitly in the env section of the pipeline editor, either at Pipeline Properties - Generic Node Defaults - Environment Variables or at Node Properties - Additional Properties - Environment Variables.
Collect the following information:
- S3-compatible object storage endpoint, e.g. `http://minio-service.kubernetes:9000`
- S3 object storage username, e.g. `minio`
- S3 object storage password, e.g. `minio123`
- S3 object storage bucket, e.g. `pipelines-artifacts`
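If you want to verify these values before configuring Elyra, a minimal sketch using `boto3` could look like the following. This is optional and not part of the tutorial; `boto3` is an assumption (any S3 client works), and the endpoint, credentials, and bucket are simply the example values listed above.

```python
import boto3
from botocore.client import Config

# Optional sanity check (sketch): confirm the S3-compatible endpoint,
# credentials, and bucket are usable before creating the runtime configuration.
# Replace the example values below with your own.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio-service.kubernetes:9000",
    aws_access_key_id="minio",
    aws_secret_access_key="minio123",
    config=Config(signature_version="s3v4"),
)

s3.head_bucket(Bucket="pipelines-artifacts")  # raises an error if unreachable or missing
print("Object storage endpoint and bucket look good.")
```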
Create a runtime environment configuration for your Apache Airflow installation as described in the Runtime configuration topic in the User Guide or in the Run generic pipelines on Apache Airflow tutorial.
One of the components used in this tutorial utilizes a pre-configured `http_conn_id`, which is set to `http_github` in the completed tutorial pipeline. You must configure a connection with that id in order for the pipeline run to succeed:
- Open the Airflow GUI.
- Navigate to Admin > Connections.
- Create a new connection, specifying the following:
  - Connection id: `http_github`
  - Connection type: `HTTP`
  - Host: `https://api.github.com`
  - Schema: `https`
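If you prefer to script this step instead of using the GUI, a minimal sketch using Airflow's Python API could look like the following. Run it where Airflow 1.10.x is installed; the session-based approach is an assumption and not part of the tutorial, while the connection values are the ones listed above.

```python
from airflow import settings
from airflow.models import Connection

# Sketch: create the http_github connection programmatically instead of
# through the Airflow GUI (assumes Airflow 1.10.x is importable here).
conn = Connection(
    conn_id="http_github",
    conn_type="HTTP",
    host="https://api.github.com",
    schema="https",
)

session = settings.Session()
# Only add the connection if it does not exist yet.
if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
    session.add(conn)
    session.commit()
session.close()
```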
This tutorial uses the `run-pipelines-on-apache-airflow` sample from the https://github.com/elyra-ai/examples GitHub repository.

- Launch JupyterLab.
- Open the Git clone wizard (Git > Clone A Repository).
- Enter `https://github.com/elyra-ai/examples.git` as Clone URI.
- In the File Browser navigate to `examples/pipelines/run-pipelines-on-apache-airflow`. The cloned repository includes the resources needed to run the tutorial pipeline.
You are ready to start the tutorial.
Elyra stores information about custom components in component catalogs and makes those components available in the Visual Pipeline Editor's palette. Components can be grouped into categories to make them more easily discoverable.
Custom components are managed in the JupyterLab UI using the Pipeline components panel. You access the panel by:
- Selecting Component Catalogs from the JupyterLab sidebar.
- Clicking the Open Component Catalogs button in the pipeline editor toolbar.
- Searching for Manage URL Component Catalog, Manage Filesystem Component Catalog, or Manage Directory Component Catalog in the JupyterLab command palette.

You can automate the component management tasks using the `elyra-metadata install component-catalogs` CLI command.
The component catalog can access component specifications that are stored in the local file system or on remote sources. In this tutorial 'local' refers to the file system where JupyterLab/Elyra is running. For example, if you've installed Elyra on your laptop, local refers to the laptop's file system. If you've installed Elyra in a container image, local refers to the container's file system.
To add component specifications to the registry that are stored locally:
- Open the Component Catalogs panel using one of the approaches mentioned above.
- Add a new component catalog entry by clicking `+` and selecting New Filesystem Component Catalog. The first tutorial component you are adding to the registry makes an HTTP request.
- Enter or select the following:
  - Name: `request data`
  - Description: `request data from GitHub API`
  - Runtime: `APACHE_AIRFLOW`
  - Category Names: `request`
  - Base Directory: `.../examples/pipelines/run-pipelines-on-apache-airflow/components` (on Windows: `...\examples\pipelines\run-pipelines-on-apache-airflow\components`)
  - Paths: `http_operator.py`

  Note: Replace `...` with the path to the location where you cloned the Elyra examples repository. The base directory can include `~` or `~user` to indicate the home directory. The concatenation of the base directory and each path must resolve to an absolute path or Elyra won't be able to locate the specified files.

- Save the component catalog entry.
There are two approaches you can take to add multiple related component specifications to the registry:
- Specify multiple Path values.
- Store the related specifications in the same directory and use the Directory catalog type. Elyra searches the directory for specifications. Check the Include Subdirectories checkbox to search subdirectories for component specifications as well.
Refer to the descriptions in the linked documentation topic for details and examples.
Locally stored component specifications have the advantage that they can be quickly loaded by Elyra. If you need to share component specifications with other users, ensure that the given Paths are the same relative paths across installations. The Base Directory can differ across installations.
The URL Component Catalog type only supports web resources that can be downloaded using HTTP `GET` requests that don't require authentication.
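Before adding a URL to the catalog you can optionally confirm that it is downloadable anonymously. Here is a minimal sketch using the `requests` package (an assumption; the URL is the one used later in this tutorial):

```python
import requests

# Optional sanity check (sketch): verify the component specification can be
# fetched with an anonymous HTTP GET before adding it to a URL component catalog.
url = (
    "https://raw.githubusercontent.com/elyra-ai/examples/main/"
    "pipelines/run-pipelines-on-apache-airflow/components/bash_operator.py"
)

response = requests.get(url, timeout=30)
response.raise_for_status()   # fails for 4xx/5xx, e.g. if authentication is required
print(response.text[:200])    # preview the beginning of the operator source
```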
To add component specifications to the catalog that are stored remotely:
- Open the Pipeline components panel.
- Add a second component catalog entry, this time selecting New URL Component Catalog from the dropdown menu. This component executes a given bash command.
- Enter the following information:
  - Name: `run command`
  - Description: `run a shell script`
  - Runtime: `APACHE_AIRFLOW`
  - Category Names: `scripting`
  - URLs: `https://raw.githubusercontent.com/elyra-ai/examples/main/pipelines/run-pipelines-on-apache-airflow/components/bash_operator.py`
- Save the component catalog entry.
The catalog is now populated with the custom components you'll use in the tutorial pipeline.
Next, you'll create a pipeline that uses the registered components.
The pipeline editor's palette is populated from the component catalog. To use the components in a pipeline:
- Open the JupyterLab Launcher.
- Click the Apache Airflow pipeline editor tile to open the Visual Pipeline Editor for Apache Airflow.
- Expand the palette panel. Two new component categories are displayed (request and scripting), each containing one component entry that you added.
- Drag the 'SimpleHttpOperator' component onto the canvas to create the first pipeline node.
- Drag the 'BashOperator' component onto the canvas to create a second node and connect the two nodes as shown. The components require inputs, which you need to specify to render the nodes functional.
- Open the properties of the 'SimpleHttpOperator' node:
  - select the node and expand (↤) the properties slideout panel on the right, OR
  - right click on the node and select Open Properties
- Review the node properties. The properties are a combination of Elyra metadata and information that was extracted from the component's specification:
```python
class SimpleHttpOperator(BaseOperator):
    """
    Calls an endpoint on an HTTP system to execute an action

    :param http_conn_id: The connection to run the operator against
    :type http_conn_id: str
    :param endpoint: The relative part of the full url. (templated)
    :type endpoint: str
    :param method: The HTTP method to use, default = "POST"
    :type method: str
    :param data: The data to pass. POST-data in POST/PUT and params
        in the URL for a GET request. (templated)
    :type data: For POST/PUT, depends on the content-type parameter,
        for GET a dictionary of key/value string pairs
    :param headers: The HTTP headers to be added to the GET request
    :type headers: a dictionary of string key/value pairs
    :param response_check: A check against the 'requests' response object.
        Returns True for 'pass' and False otherwise.
    ...
```
The Elyra properties include:
- Label: If specified, the value is used as the node name in the pipeline instead of the component name. Use labels to resolve naming conflicts that might arise if a pipeline uses the same component multiple times. For example, if a pipeline utilizes the 'SimpleHttpOperator' component to make two requests, you could override the node names by specifying 'HTTP Request 1' and 'HTTP Request 2' as labels.
- Component source: A read-only property that identifies source information about a component, such as the type of catalog in which this component is stored and any unique identifying information. This property is displayed for informational purposes only.
- Enter the following values for the SimpleHttpOperator properties:
  - endpoint -> `/repos/elyra-ai/examples/contents/pipelines/run-pipelines-on-apache-airflow/resources/command.txt`
    - Since this property is implicitly required in the operator specification file, the pipeline editor displays a red bar and enforces the constraint.
  - method -> `GET`
  - data -> `{"ref": "master"}`
    - This information tells the GitHub API which branch to use when returning the file contents.
  - headers -> `{"Accept": "Accept:application/vnd.github.v3.raw"}`
    - This tells the API what format the returned data should be.
    - In this case, we want the raw GitHub file.
  - xcom_push -> check the checkbox for `True`
    - This property indicates to Airflow that the output of this component (in this case, the contents of the requested file) should be made available to later nodes in the pipeline.
  - http_conn_id -> `http_github`
    - This property tells the Airflow instance which Connection id it will use as the API base URL.
    - This connection was configured in the section above, Create a new connection id.
- Open the properties of the 'BashOperator' node. The specification for the underlying component looks as follows:

```python
class BashOperator(BaseOperator):
    """
    Execute a Bash script, command or set of commands.
    ...
    :param bash_command: The command, set of commands or reference to a
        bash script (must be '.sh') to be executed. (templated)
    :type bash_command: str
    :param xcom_push: If xcom_push is True, the last line written to stdout
        will also be pushed to an XCom when the bash command completes.
    :type xcom_push: bool
    :param env: If env is not None, it must be a mapping that defines the
        environment variables for the new process; these are used instead of
        inheriting the current process environment, which is the default
        behavior. (templated)
    :type env: dict
    :param output_encoding: Output encoding of bash command
    :type output_encoding: str
    ...
```
In Apache Airflow, the output of a component can be used as a property value for any downstream node. (A downstream node is a node that is connected to and executed after the node in question). The pipeline editor renders a selector widget for each property that allows you to choose between two options as a value:
- A raw value, entered manually
- The output of an upstream node
- The contents of the file requested by the SimpleHttpOperator are made available to the downstream nodes in the pipeline by setting the xcom_push property of SimpleHttpOperator to True. This output value will be the input of the bash_command property. Choose 'Please select an output from a parent :' from the dropdown menu and select SimpleHttpOperator.

  Since the 'BashOperator' node is only connected to one upstream node ('SimpleHttpOperator'), you can only choose the output of that node. If a node is connected to multiple upstream nodes, you can choose the output of any of these nodes as input, as shown in this example: the output of the EmailOperator node cannot be consumed by the 'SlackAPIPostOperator' node, because the two nodes are not connected in this pipeline. Ensure that the xcom_push property is set to True for any node whose output will be used in a subsequent node.

  Elyra intentionally only supports explicit dependencies between nodes to avoid potential usability issues.
- The bash command requested and returned by the SimpleHttpOperator node includes an environment variable called name that can be set via the env property of the BashOperator. Enter `{'name': 'World'}` as the value for this field. You can use another name in place of 'World', if desired. (A rough hand-written Airflow equivalent of both configured nodes is sketched after these steps.)
- Save the pipeline.
- Rename the pipeline to something meaningful:
  - right click on the pipeline editor tab and select Rename Pipeline..., OR
  - in the JupyterLab File Browser, right click on the .pipeline file
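To make the mapping between node properties and Airflow code more concrete, here is a rough hand-written DAG that approximates the two configured nodes. This is a sketch only: the DAG id, task ids, and schedule are assumptions, the imports assume the stock Airflow 1.10 operators, and the DAG that Elyra actually generates and pushes to the DAG repository will differ in detail.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.http_operator import SimpleHttpOperator

with DAG(
    dag_id="run_pipelines_on_apache_airflow_sketch",  # assumed name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:

    # SimpleHttpOperator node: fetch command.txt from the GitHub API using the
    # pre-configured http_github connection and push the response to XCom.
    fetch_command = SimpleHttpOperator(
        task_id="fetch_command",  # assumed task id
        http_conn_id="http_github",
        endpoint="/repos/elyra-ai/examples/contents/pipelines/"
                 "run-pipelines-on-apache-airflow/resources/command.txt",
        method="GET",
        data={"ref": "master"},
        headers={"Accept": "Accept:application/vnd.github.v3.raw"},
        xcom_push=True,
    )

    # BashOperator node: pull the upstream node's XCom output and run it as
    # the bash command, with the `name` environment variable set.
    run_command = BashOperator(
        task_id="run_command",  # assumed task id
        bash_command="{{ ti.xcom_pull(task_ids='fetch_command') }}",
        env={"name": "World"},
        xcom_push=True,
    )

    fetch_command >> run_command
```

The templated `xcom_pull` call in `bash_command` is how raw Airflow consumes an upstream task's output; in the Visual Pipeline Editor you express the same wiring through the 'Please select an output from a parent' selector instead.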
Next, let's run the pipeline!
To run the pipeline on Apache Airflow:
- Click the Run button in the pipeline editor toolbar.

  You can also use the `elyra-pipeline submit` command to run the pipeline using the command line interface.

- In the run pipeline dialog select the runtime configuration you created when you completed the setup for this tutorial.
- Start the pipeline run and monitor its execution progress in the Apache Airflow Dashboard.

  You can also click the GitHub Repository link to inspect the DAG, if desired.

- Review the logs of each pipeline task. The output of the 'BashOperator' node should show that `Hello, World` is printed in the log.

  Elyra does not store custom component outputs in cloud storage. (It only does this for generic pipeline components.) To access the output of custom components, use the Apache Airflow Dashboard.
This concludes the Run pipelines on Apache Airflow tutorial. You've learned how to:
- add custom Apache Airflow components to the Elyra component registry
- create a pipeline from custom components
- Creating a custom operator topic in the Apache Airflow documentation
- Pipelines topic in the Elyra User Guide
- Pipeline components topic in the Elyra User Guide
- Component catalog connector marketplace
- Requirements and best practices for custom pipeline components topic in the Elyra User Guide
- Command line interface topic in the Elyra User Guide