A pipeline comprises one or more nodes that are (in many cases) connected to define execution dependencies. Each node is implemented by a component and typically performs only a single task, such as loading data, processing data, training a model, or sending an email. Note that in Apache Airflow components are called operators, but for the sake of consistency the Elyra documentation refers to them as components.
A generic pipeline comprises nodes that are implemented using generic components. Elyra includes generic components that run Jupyter notebooks, Python scripts, and R scripts. Generic components are supported in every Elyra pipeline runtime environment: local/JupyterLab, Kubeflow Pipelines, and Apache Airflow.
The following tutorials cover generic pipelines:
- Introduction to generic pipelines
- Run generic pipelines on Kubeflow Pipelines
- Run generic pipelines on Apache Airflow
A runtime-specific pipeline comprises nodes that are implemented using generic components or custom components. Custom components are runtime-specific and user-provided.
In this intermediate tutorial you will learn how to add Apache Airflow components to Elyra and how to utilize them in pipelines.
The features described in this tutorial require Elyra v3.3 or later. The tutorial instructions were last updated using Elyra v3.3.0 and Airflow v1.10.12.
- JupyterLab 3.x with Elyra extension v3.3 (or later) installed.
- Access to an Apache Airflow deployment.
Some familiarity with Apache Airflow and Apache Airflow operators (i.e., components) is required to complete the tutorial. If you are new to Elyra, please review the Run generic pipelines on Apache Airflow tutorial. It introduces concepts and tasks that are used in this tutorial, but not explained here to avoid content duplication.
Collect the following information for your Apache Airflow installation:
- API endpoint, e.g. `http://your-airflow-webserver:port`
- GitHub API endpoint, e.g. `https://api.github.com`, if the repository is hosted on GitHub
- GitHub DAG repository name, e.g. `your-git-org/your-dag-repo`
- GitHub DAG repository branch, e.g. `main`
- GitHub access token, e.g. `4d79206e616d6520697320426f6e642e204a616d657320426f6e64`
Detailed instructions for setting up a DAG repository and generating an access token can be found in the User Guide.
Elyra utilizes S3-compatible cloud storage to make data available to Jupyter notebooks and R or Python scripts while they are executed. Any kind of S3-based cloud storage should work (e.g. IBM Cloud Object Storage or Minio) as long as it can be accessed from the machine where JupyterLab/Elyra is running and from the Apache Airflow cluster.
Elyra also writes the STDOUT (including STDERR) run output to a file when the environment variable `ELYRA_GENERIC_NODES_ENABLE_SCRIPT_OUTPUT_TO_S3` is set to `true` or is not present in the runtime container, which is the default. This happens in addition to logging and writing to STDOUT and STDERR at runtime.

`.ipynb` file execution run/STDOUT output is written to S3-compatible object storage in the following files:

- `<notebook name>-output.ipynb`
- `<notebook name>.html`

`.r` and `.py` file execution run/STDOUT output is written to S3-compatible object storage in the following file:

- `<r or python filename>.log`

Note: If you prefer to use S3-compatible storage only for the transfer of files between pipeline steps and not for the logging information / run output of R, Python, and Jupyter notebook files, either set the environment variable `ELYRA_GENERIC_NODES_ENABLE_SCRIPT_OUTPUT_TO_S3` to `false` in runtime container builds or pass that value explicitly in the env section of the pipeline editor, either at Pipeline Properties - Generic Node Defaults - Environment Variables or at Node Properties - Additional Properties - Environment Variables.
Collect the following information:
- S3-compatible object storage endpoint, e.g. `http://minio-service.kubernetes:9000`
- S3 object storage username, e.g. `minio`
- S3 object storage password, e.g. `minio123`
- S3 object storage bucket, e.g. `pipelines-artifacts`
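If you want to verify these values before configuring Elyra, a minimal sketch using `boto3` could look like the following. This is optional and not part of the tutorial; `boto3` is an assumption (any S3 client works), and the endpoint, credentials, and bucket are simply the example values listed above.

```python
import boto3
from botocore.client import Config

# Optional sanity check (sketch): confirm the S3-compatible endpoint,
# credentials, and bucket are usable before creating the runtime configuration.
# Replace the example values below with your own.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio-service.kubernetes:9000",
    aws_access_key_id="minio",
    aws_secret_access_key="minio123",
    config=Config(signature_version="s3v4"),
)

s3.head_bucket(Bucket="pipelines-artifacts")  # raises an error if unreachable or missing
print("Object storage endpoint and bucket look good.")
```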
Create a runtime environment configuration for your Apache Airflow installation as described in the Runtime configuration topic in the User Guide or in the Run generic pipelines on Apache Airflow tutorial.
One of the components used in this tutorial utilizes a pre-configured `http_conn_id`, which is set to `http_github` in the completed tutorial pipeline. You must configure a connection with that id in order for the pipeline run to succeed:
- Open the Airflow GUI.
- Navigate to Admin > Connections.
- Create a new connection, specifying the following:
  - Connection id: `http_github`
  - Connection type: `HTTP`
  - Host: `https://api.github.com`
  - Schema: `https`
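If you prefer to script this step instead of using the GUI, a minimal sketch using Airflow's Python API could look like the following. Run it where Airflow 1.10.x is installed; the session-based approach is an assumption and not part of the tutorial, while the connection values are the ones listed above.

```python
from airflow import settings
from airflow.models import Connection

# Sketch: create the http_github connection programmatically instead of
# through the Airflow GUI (assumes Airflow 1.10.x is importable here).
conn = Connection(
    conn_id="http_github",
    conn_type="HTTP",
    host="https://api.github.com",
    schema="https",
)

session = settings.Session()
# Only add the connection if it does not exist yet.
if not session.query(Connection).filter(Connection.conn_id == conn.conn_id).first():
    session.add(conn)
    session.commit()
session.close()
```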
This tutorial uses the `run-pipelines-on-apache-airflow` sample from the https://github.com/elyra-ai/examples GitHub repository.

- Launch JupyterLab.
- Open the Git clone wizard (Git > Clone A Repository).
- Enter `https://github.com/elyra-ai/examples.git` as Clone URI.
- In the File Browser navigate to `examples/pipelines/run-pipelines-on-apache-airflow`. The cloned repository includes the resources needed to run the tutorial pipeline.
You are ready to start the tutorial.
Elyra stores information about custom components in component catalogs and makes those components available in the Visual Pipeline Editor's palette. Components can be grouped into categories to make them more easily discoverable.
Custom components are managed in the JupyterLab UI using the Pipeline components panel. You access the panel by:
- Selecting Component Catalogs from the JupyterLab sidebar.
- Clicking the Open Component Catalogs button in the pipeline editor toolbar.
- Searching for Manage URL Component Catalog, Manage Filesystem Component Catalog, or Manage Directory Component Catalog in the JupyterLab command palette.

You can automate the component management tasks using the `elyra-metadata install component-catalogs` CLI command.
The component catalog can access component specifications that are stored in the local file system or on remote sources. In this tutorial 'local' refers to the file system where JupyterLab/Elyra is running. For example, if you've installed Elyra on your laptop, local refers to the laptop's file system. If you've installed Elyra in a container image, local refers to the container's file system.
To add component specifications to the registry that are stored locally:
- Open the Component Catalogs panel using one of the approaches mentioned above.
- Add a new component catalog entry by clicking `+` and selecting New Filesystem Component Catalog. The first tutorial component you are adding to the registry makes an HTTP request.
- Enter or select the following:
  - Name: `request data`
  - Description: `request data from GitHub API`
  - Runtime: `APACHE_AIRFLOW`
  - Category Names: `request`
  - Base Directory: `.../examples/pipelines/run-pipelines-on-apache-airflow/components` (on Windows: `...\examples\pipelines\run-pipelines-on-apache-airflow\components`)
  - Paths: `http_operator.py`

  Note: Replace `...` with the path to the location where you cloned the Elyra examples repository. The base directory can include `~` or `~user` to indicate the home directory. The concatenation of the base directory and each path must resolve to an absolute path or Elyra won't be able to locate the specified files.

- Save the component catalog entry.
There are two approaches you can take to add multiple related component specifications to the registry:
- Specify multiple Path values.
- Store the related specifications in the same directory and use the Directory catalog type. Elyra searches the directory for specifications. Check the Include Subdirectories checkbox to search subdirectories for component specifications as well.
Refer to the descriptions in the linked documentation topic for details and examples.
Locally stored component specifications have the advantage that they can be quickly loaded by Elyra. If you need to share component specifications with other users, ensure that the given Paths are the same relative paths across installations. The Base Directory can differ across installations.
The URL Component Catalog type only supports web resources that can be downloaded using HTTP `GET` requests that don't require authentication.
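Before adding a URL to the catalog you can optionally confirm that it is downloadable anonymously. Here is a minimal sketch using the `requests` package (an assumption; the URL is the one used later in this tutorial):

```python
import requests

# Optional sanity check (sketch): verify the component specification can be
# fetched with an anonymous HTTP GET before adding it to a URL component catalog.
url = (
    "https://raw.githubusercontent.com/elyra-ai/examples/main/"
    "pipelines/run-pipelines-on-apache-airflow/components/bash_operator.py"
)

response = requests.get(url, timeout=30)
response.raise_for_status()   # fails for 4xx/5xx, e.g. if authentication is required
print(response.text[:200])    # preview the beginning of the operator source
```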
To add component specifications to the catalog that are stored remotely:
- Open the Pipeline components panel.
- Add a second component catalog entry, this time selecting New URL Component Catalog from the dropdown menu. This component executes a given bash command.
- Enter the following information:
  - Name: `run command`
  - Description: `run a shell script`
  - Runtime: `APACHE_AIRFLOW`
  - Category Names: `scripting`
  - URLs: `https://raw.githubusercontent.com/elyra-ai/examples/main/pipelines/run-pipelines-on-apache-airflow/components/bash_operator.py`
- Save the component catalog entry.
The catalog is now populated with the custom components you'll use in the tutorial pipeline.
Next, you'll create a pipeline that uses the registered components.
The pipeline editor's palette is populated from the component catalog. To use the components in a pipeline:
- Open the JupyterLab Launcher.
- Click the Apache Airflow pipeline editor tile to open the Visual Pipeline Editor for Apache Airflow.
- Expand the palette panel. Two new component categories are displayed (request and scripting), each containing one component entry that you added.
- Drag the 'SimpleHttpOperator' component onto the canvas to create the first pipeline node.
- Drag the 'BashOperator' component onto the canvas to create a second node and connect the two nodes as shown. The components require inputs, which you need to specify to render the nodes functional.
- Open the properties of the 'SimpleHttpOperator' node:
  - select the node and expand (↤) the properties slideout panel on the right, OR
  - right click on the node and select Open Properties
- Review the node properties. The properties are a combination of Elyra metadata and information that was extracted from the component's specification:
```python
class SimpleHttpOperator(BaseOperator):
    """
    Calls an endpoint on an HTTP system to execute an action

    :param http_conn_id: The connection to run the operator against
    :type http_conn_id: str
    :param endpoint: The relative part of the full url. (templated)
    :type endpoint: str
    :param method: The HTTP method to use, default = "POST"
    :type method: str
    :param data: The data to pass. POST-data in POST/PUT and params
        in the URL for a GET request. (templated)
    :type data: For POST/PUT, depends on the content-type parameter,
        for GET a dictionary of key/value string pairs
    :param headers: The HTTP headers to be added to the GET request
    :type headers: a dictionary of string key/value pairs
    :param response_check: A check against the 'requests' response object.
        Returns True for 'pass' and False otherwise.
    ...
```
The Elyra properties include:
- Label: If specified, the value is used as the node name in the pipeline instead of the component name. Use labels to resolve naming conflicts that might arise if a pipeline uses the same component multiple times. For example, if a pipeline utilizes the 'SimpleHttpOperator' component to make two requests, you could override the node names by specifying 'HTTP Request 1' and 'HTTP Request 2' as labels.
- Component source: A read-only property that identifies source information about a component, such as the type of catalog in which this component is stored and any unique identifying information. This property is displayed for informational purposes only.
- Enter the following values for the SimpleHttpOperator properties:
  - endpoint -> `/repos/elyra-ai/examples/contents/pipelines/run-pipelines-on-apache-airflow/resources/command.txt`
    - Since this property is implicitly required in the operator specification file, the pipeline editor displays a red bar and enforces the constraint.
  - method -> `GET`
  - data -> `{"ref": "master"}`
    - This information tells the GitHub API which branch to use when returning the file contents.
  - headers -> `{"Accept": "Accept:application/vnd.github.v3.raw"}`
    - This tells the API what format the returned data should be.
    - In this case, we want the raw GitHub file.
  - xcom_push -> check the checkbox for `True`
    - This property indicates to Airflow that the output of this component (in this case, the contents of the requested file) should be made available to later nodes in the pipeline.
  - http_conn_id -> `http_github`
    - This property tells the Airflow instance which Connection id it will use as the API base URL.
    - This connection was configured in the section above, Create a new connection id.
- Open the properties of the 'BashOperator' node. The specification for the underlying component looks as follows:

```python
class BashOperator(BaseOperator):
    """
    Execute a Bash script, command or set of commands.
    ...
    :param bash_command: The command, set of commands or reference to a
        bash script (must be '.sh') to be executed. (templated)
    :type bash_command: str
    :param xcom_push: If xcom_push is True, the last line written to stdout
        will also be pushed to an XCom when the bash command completes.
    :type xcom_push: bool
    :param env: If env is not None, it must be a mapping that defines the
        environment variables for the new process; these are used instead of
        inheriting the current process environment, which is the default
        behavior. (templated)
    :type env: dict
    :param output_encoding: Output encoding of bash command
    :type output_encoding: str
    ...
```
In Apache Airflow, the output of a component can be used as a property value for any downstream node. (A downstream node is a node that is connected to and executed after the node in question). The pipeline editor renders a selector widget for each property that allows you to choose between two options as a value:
- A raw value, entered manually
- The output of an upstream node
- The contents of the file requested by the SimpleHttpOperator are made available to the downstream nodes in the pipeline by setting the xcom_push property of SimpleHttpOperator to True. This output value will be the input of the bash_command property. Choose 'Please select an output from a parent :' from the dropdown menu and select SimpleHttpOperator.

  Since the 'BashOperator' node is only connected to one upstream node ('SimpleHttpOperator'), you can only choose the output of that node. If a node is connected to multiple upstream nodes, you can choose the output of any of these nodes as input, as shown in this example: the output of the EmailOperator node cannot be consumed by the 'SlackAPIPostOperator' node, because the two nodes are not connected in this pipeline. Ensure that the xcom_push property is set to True for any node whose output will be used in a subsequent node.

  Elyra intentionally only supports explicit dependencies between nodes to avoid potential usability issues.
- The bash command requested and returned by the SimpleHttpOperator node includes an environment variable called name that can be set via the env property of the BashOperator. Enter `{'name': 'World'}` as the value for this field. You can use another name in place of 'World', if desired. (A rough hand-written Airflow equivalent of both configured nodes is sketched after these steps.)
- Save the pipeline.
- Rename the pipeline to something meaningful:
  - right click on the pipeline editor tab and select Rename Pipeline..., OR
  - in the JupyterLab File Browser, right click on the .pipeline file
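To make the mapping between node properties and Airflow code more concrete, here is a rough hand-written DAG that approximates the two configured nodes. This is a sketch only: the DAG id, task ids, and schedule are assumptions, the imports assume the stock Airflow 1.10 operators, and the DAG that Elyra actually generates and pushes to the DAG repository will differ in detail.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.http_operator import SimpleHttpOperator

with DAG(
    dag_id="run_pipelines_on_apache_airflow_sketch",  # assumed name
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
) as dag:

    # SimpleHttpOperator node: fetch command.txt from the GitHub API using the
    # pre-configured http_github connection and push the response to XCom.
    fetch_command = SimpleHttpOperator(
        task_id="fetch_command",  # assumed task id
        http_conn_id="http_github",
        endpoint="/repos/elyra-ai/examples/contents/pipelines/"
                 "run-pipelines-on-apache-airflow/resources/command.txt",
        method="GET",
        data={"ref": "master"},
        headers={"Accept": "Accept:application/vnd.github.v3.raw"},
        xcom_push=True,
    )

    # BashOperator node: pull the upstream node's XCom output and run it as
    # the bash command, with the `name` environment variable set.
    run_command = BashOperator(
        task_id="run_command",  # assumed task id
        bash_command="{{ ti.xcom_pull(task_ids='fetch_command') }}",
        env={"name": "World"},
        xcom_push=True,
    )

    fetch_command >> run_command
```

The templated `xcom_pull` call in `bash_command` is how raw Airflow consumes an upstream task's output; in the Visual Pipeline Editor you express the same wiring through the 'Please select an output from a parent' selector instead.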
Next, let's run the pipeline!
To run the pipeline on Apache Airflow:
- Click the Run button in the pipeline editor toolbar.

  You can also use the `elyra-pipeline submit` command to run the pipeline using the command line interface.

- In the run pipeline dialog select the runtime configuration you created when you completed the setup for this tutorial.
- Start the pipeline run and monitor its execution progress in the Apache Airflow Dashboard.

  You can also click the GitHub Repository link to inspect the DAG, if desired.

- Review the logs of each pipeline task. The output of the 'BashOperator' node should show that `Hello, World` is printed in the log.

  Elyra does not store custom component outputs in cloud storage. (It only does this for generic pipeline components.) To access the output of custom components, use the Apache Airflow Dashboard.
This concludes the Run pipelines on Apache Airflow tutorial. You've learned how to:
- add custom Apache Airflow components to the Elyra component registry
- create a pipeline from custom components
- Creating a custom operator topic in the Apache Airflow documentation
- Pipelines topic in the Elyra User Guide
- Pipeline components topic in the Elyra User Guide
- Component catalog connector marketplace
- Requirements and best practices for custom pipeline components topic in the Elyra User Guide
- Command line interface topic in the Elyra User Guide