Airflow Integration #117

Closed
wants to merge 9 commits into from

gonzaloetjo
Contributor

@gonzaloetjo gonzaloetjo commented May 5, 2023

Which issue(s) this PR fixes

Fixes #105
Fixes #59

Additional comments

For the moment it:

  • Installs Airflow 2.2.5 (compatible with Python 3.6)
  • Creates a virtual environment for the installation.
  • Uses Celery as executor, Redis as broker, and Flower to monitor executors.
  • Adds additional providers (hive, spark, hdfs, kerberos)
  • Adds SSL and kerberos configuration.
  • Adds the systemd services necessary for Airflow to work.
  • Creates connectors for Hive, Spark and HDFS
  • Creates default test DAGs for Hive (table creation, data insert, NYC taxi manipulation)
  • Creates default test DAGs for Spark (example usage)
  • Allows for impersonation through tdp_user for Hive and Spark.

Issues:

  • Certain features are not available in the current Airflow version. For example, run_as_user, which allows specifying the system user executing a task, is not supported. However, proxy_user is functional, enabling impersonation of another user when submitting queries or jobs to external systems like Hive or Spark.
  • Due to the absence of run_as_user, it is not possible for logged-in users to be automatically recognized as the executing user in the DAG or connector. This necessitates the use of impersonation through proxy_user (specified either in the DAG or the connection), which poses a security risk. A partial workaround is to have a single administrator execute the DAGs. Future Airflow versions aim to provide improved access control.
  • Implementing impersonation with HDFS presents challenges, which need to be addressed for smoother integration.
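
To make the proxy_user concern above concrete, here is a minimal sketch of what impersonation amounts to at the spark-submit level. It is not taken from this PR: the helper name `build_spark_submit` and the application path are hypothetical, and the only real interface it relies on is spark-submit's `--proxy-user` flag.

```python
from typing import List, Optional


def build_spark_submit(application: str,
                       proxy_user: Optional[str] = None,
                       master: str = "yarn") -> List[str]:
    """Return a spark-submit argv, optionally impersonating proxy_user.

    Hypothetical helper for illustration only; it is not part of the role.
    """
    cmd = ["spark-submit", "--master", master]
    if proxy_user:
        # The value is passed through unchecked: whoever can edit the DAG
        # or the connection chooses which user the job runs as.
        cmd += ["--proxy-user", proxy_user]
    cmd.append(application)
    return cmd


# Hypothetical application path; tdp_user is the user named in this PR.
print(" ".join(build_spark_submit("/path/to/job.py", proxy_user="tdp_user")))
```

Because the user name comes straight from the DAG or connection definition, nothing ties it to the person logged in to the Airflow UI, which is the security risk described above.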

Agreements

@gonzaloetjo gonzaloetjo requested a review from Pierrotws May 5, 2023 12:30
@gonzaloetjo gonzaloetjo force-pushed the 105-airflow-integration branch from 111a711 to e7567ca Compare June 1, 2023 15:28
@gonzaloetjo
Contributor Author

  • Adding Scheduler HA
  • Testing with 2.6.3

@gonzaloetjo gonzaloetjo marked this pull request as ready for review August 11, 2023 10:18
@gonzaloetjo
Contributor Author

gonzaloetjo commented Aug 11, 2023

Current body of work:

Ansible Airflow TDP Extra

This role deploys the Apache Airflow release. In general it installs and/or configures:

  • Airflow: Installs packaged Airflow 2.2.5 or 2.6.2 and all its dependencies with pip inside a virtual environment.
  • Scheduler: Responsible for monitoring and ensuring scheduled execution of all tasks. Decides where and when tasks are run.
  • Webserver: A Flask server that serves the Airflow UI. Helps monitor, trigger and debug DAGs.
  • Broker: Uses Redis. It facilitates communication between the Airflow Scheduler and the Workers, handling task messages and their status updates.
  • Executor: Uses Celery. Responsible for determining how tasks are run, in parallel or sequentially, locally or distributed.
  • Flower: Used to monitor task progress and history in Celery executors.
  • Workers: Execute the tasks. They pick up and run tasks sent to the queue by the executor.
  • Database (Metadata DB): Uses PostgreSQL. Stores metadata about the state of tasks and workflows, and assists in recovery in case of failures.
  • DAG directory: Folder of DAG files. It is read by the scheduler and executor, and must also exist on every worker.
  • Services: Creates systemd services for the webserver, scheduler, workers, Redis and Flower.
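
As a rough illustration of how the components above are wired together, the sketch below renders the handful of `airflow.cfg` keys that select the Celery executor, the Redis broker and the PostgreSQL metadata database. It is not the template used by the role: the function name, host names and credentials are placeholders. The only version-specific detail is that Airflow 2.3 moved `sql_alchemy_conn` from `[core]` to `[database]`, so 2.2.5 needs the legacy section.

```python
import configparser
import io


def render_airflow_cfg(db_conn: str, broker_url: str, result_backend: str,
                       legacy_db_section: bool = False) -> str:
    """Render a minimal airflow.cfg fragment for a Celery/Redis/PostgreSQL setup."""
    # interpolation=None so '%' characters in passwords are kept verbatim.
    cfg = configparser.ConfigParser(interpolation=None)
    cfg["core"] = {"executor": "CeleryExecutor"}

    # Airflow < 2.3 (e.g. 2.2.5) reads sql_alchemy_conn from [core];
    # later versions read it from [database].
    db_section = "core" if legacy_db_section else "database"
    if not cfg.has_section(db_section):
        cfg.add_section(db_section)
    cfg[db_section]["sql_alchemy_conn"] = db_conn

    cfg["celery"] = {"broker_url": broker_url, "result_backend": result_backend}

    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


# Placeholder hosts and credentials, for illustration only.
print(render_airflow_cfg(
    db_conn="postgresql+psycopg2://airflow:CHANGEME@db.example:5432/airflow",
    broker_url="redis://broker.example:6379/0",
    result_backend="db+postgresql://airflow:CHANGEME@db.example:5432/airflow",
))
```

The scheduler, webserver and every worker need to see the same values, which is why these settings belong in configuration shared across the host groups.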

Prerequisites

  • python3 and python3-pip installed on all nodes.
  • Airflow 2.2.5 requires Python >= 3.6; Airflow versions > 2.6.3 require Python >= 3.8
  • Hadoop TDP release .tar.gz (hadoop_dist_file role variable) file available in files
  • Groups airflow_webserver, airflow_scheduler, airflow_broker and airflow_worker defined in the Ansible hosts file
  • Postgres user and database to store metadata. These are created in the prerequisites.
  • Certificate files {{ fqdn }}.key and {{ fqdn }}.pem for every node available in files
  • Admin access to a KDC with the realm, kadmin_principal and kadmin_password role vars provided
  • Extra steps done for hadoop and hive configuration. This is currently added in this collection through tdp-cluster.yml.
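
The Python version rule in the prerequisites can be expressed as a small pre-flight check. This is a sketch under the assumption that only the two cases stated above are known; the function names are hypothetical, and versions the prerequisites do not cover are reported as unknown rather than guessed.

```python
import sys
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.6.3' into (2, 6, 3). Assumes plain numeric versions."""
    return tuple(int(part) for part in text.split("."))


def required_python(airflow_version: str) -> Optional[Tuple[int, int]]:
    """Minimum Python for an Airflow version, per the prerequisites list."""
    version = parse_version(airflow_version)
    if version == (2, 2, 5):
        return (3, 6)
    if version > (2, 6, 3):
        return (3, 8)
    return None  # not stated in the prerequisites


def check_python(airflow_version: str,
                 python: Optional[Tuple[int, int]] = None) -> bool:
    """Return True if the given (or running) Python satisfies the rule."""
    python = python or tuple(sys.version_info[:2])
    minimum = required_python(airflow_version)
    if minimum is None:
        raise ValueError(f"no Python requirement known for Airflow {airflow_version}")
    return tuple(python) >= minimum


print(check_python("2.2.5", (3, 6)))
```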

Extra Work (Utils)

These are optional tasks that the user may or may not want to include in the installation. They include:

  • Policies: Currently used to limit what a user (i.e. tdp_user) can do when editing a DAG.
  • Roles: Used to limit what a user/group can view or do within a DAG.
  • DAG directory structure: Work in progress on a DAG folder structure to handle groups/users.
  • Connectors: Creates connectors with impersonation capabilities for Hive, Spark and HDFS.
  • Example DAGs that can be used directly on top of TDP (provided the previous steps have been included).

Security implementations:

  • Policies: Currently used to limit what a user (i.e. tdp_user) can do when editing a DAG.
  • Roles: Used to limit what a user/group can view or do within a DAG.
  • DAG directory structure: Work in progress on a DAG folder structure to handle groups/users.
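
For readers unfamiliar with Airflow cluster policies, the sketch below shows the general shape of a `task_policy` hook, which Airflow loads from `airflow_local_settings.py` and calls for every task it parses. This is not the policy shipped in this PR. The allow-list and the assumption that a task exposes its impersonated user as a `proxy_user` attribute are hypothetical; real operators may store it under a different name.

```python
from types import SimpleNamespace

try:
    from airflow.exceptions import AirflowClusterPolicyViolation
except ImportError:
    class AirflowClusterPolicyViolation(Exception):
        """Stand-in so the sketch runs where Airflow is not installed."""


# Hypothetical allow-list of users that tasks may impersonate.
ALLOWED_PROXY_USERS = {"tdp_user"}


def task_policy(task) -> None:
    """Reject tasks that try to impersonate a user outside the allow-list."""
    # Assumption: the operator exposes the impersonated user as `proxy_user`.
    proxy_user = getattr(task, "proxy_user", None)
    if proxy_user is not None and proxy_user not in ALLOWED_PROXY_USERS:
        raise AirflowClusterPolicyViolation(
            f"Task {task.task_id!r} may not impersonate {proxy_user!r}"
        )


# Stand-in task objects for a quick local check.
task_policy(SimpleNamespace(task_id="ok_task", proxy_user="tdp_user"))
try:
    task_policy(SimpleNamespace(task_id="bad_task", proxy_user="hdfs"))
except AirflowClusterPolicyViolation as err:
    print(err)
```

A policy of this kind narrows the impersonation risk but does not remove it: it restricts which users can be named, not who is allowed to name them.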

@gonzaloetjo
Contributor Author

Airflow 2.7 is out. Relevant changes:

  • Cluster Activity UI: New overview page for the cluster. May make Flower redundant.
  • Deferrable Mode Enablement: Enable for all deferrable tasks with 1 config setting.
  • OpenLineage Integration: Now a built-in feature for reliable lineage publishing.
  • Executors Moved into Providers: Improved bug-fix release process. The roles may need to be reconfigured.
  • Required Provider Versions:
    • Celery provider 3.3.0+ for Celery executors.
    • Kubernetes provider 7.4.0+ for Kubernetes executor.

@rpignolet
Contributor

Inactivity

@rpignolet rpignolet closed this Jan 10, 2025

Successfully merging this pull request may close these issues.

  • Fully integrate airflow to TDP
  • Airflow: refactor role to be compatible with tdp-lib
3 participants