This project demonstrates the process of building an ELT pipeline from scratch using DBT, Snowflake, and Airflow. The pipeline extracts data from Snowflake's TPCH dataset, performs transformations using DBT, and orchestrates the workflow using Airflow. The primary focus is on data modeling, fact table creation, and business logic transformations.
Traditional ETL (Extract, Transform, Load) transforms data before loading it into a data warehouse. ELT (Extract, Load, Transform), in contrast, loads raw data first and applies transformations inside the warehouse. Modern cloud data warehouses such as Snowflake make ELT efficient by pairing cheap storage with scalable compute.
- DBT: Data transformation
- Snowflake: Data warehouse
- Airflow: Workflow orchestration
- Set up a Snowflake account.
- Create a warehouse, database, and role in Snowflake using the following SQL commands:
  ```sql
  CREATE WAREHOUSE IF NOT EXISTS DBT_WAREHOUSE;
  CREATE DATABASE IF NOT EXISTS DBT_DATABASE;
  CREATE ROLE IF NOT EXISTS DBT_ROLE;
  ```
- Grant the required privileges to the role and user, for example:
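  A minimal sketch of the grants, assuming a user named `dbt_user` (the actual user name is not specified here):

  ```sql
  -- Hypothetical grants; adjust the user name and scope to your environment.
  GRANT USAGE ON WAREHOUSE DBT_WAREHOUSE TO ROLE DBT_ROLE;
  GRANT ALL ON DATABASE DBT_DATABASE TO ROLE DBT_ROLE;
  GRANT ROLE DBT_ROLE TO USER dbt_user;
  ```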
- Install DBT Core and the Snowflake adapter:

  ```bash
  pip install dbt-core dbt-snowflake
  ```
- Initialize the DBT project:
  ```bash
  dbt init data_pipeline
  ```
- Configure the Snowflake connection profile in DBT's `profiles.yml`.
- Staging Tables: Pull data from Snowflake's TPCH sample dataset and create staging views for the `orders` and `lineitems` tables.
- Fact Tables: Create fact tables using business logic and transformations, for example (see the sketches after this list):
  - Creating a surrogate key for dimensional modeling.
  - Aggregating data into a fact table.
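The exact model code is not part of this README; a minimal sketch of a staging view over the Snowflake sample data could look like the following (the model and column alias names are illustrative):

```sql
-- models/staging/stg_tpch_orders.sql (illustrative name)
-- A dbt source for snowflake_sample_data would normally be declared in a .yml file;
-- the fully qualified table name is used here to keep the sketch self-contained.
{{ config(materialized='view') }}

select
    o_orderkey    as order_key,
    o_custkey     as customer_key,
    o_orderstatus as status_code,
    o_totalprice  as total_price,
    o_orderdate   as order_date
from snowflake_sample_data.tpch_sf1.orders
```

A fact model can then reference the staging view and generate a surrogate key. This sketch assumes the `dbt_utils` package is installed for `generate_surrogate_key`:

```sql
-- models/marts/fct_orders.sql (illustrative name; assumes dbt_utils is listed in packages.yml)
select
    {{ dbt_utils.generate_surrogate_key(['order_key', 'order_date']) }} as order_surrogate_key,
    order_key,
    customer_key,
    status_code,
    total_price,
    order_date
from {{ ref('stg_tpch_orders') }}
```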
- Defined reusable macros in DBT for business logic, such as calculating the discounted amount from the extended price (see the macro sketch after this list).
- Added DBT tests (both generic and singular) to ensure data integrity:
  - Unique and not-null checks on primary keys.
  - Valid range checks on dates and values (see the singular test sketch after this list).
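The macro body is not reproduced in this README; a minimal sketch, assuming the discount is stored as a fraction and the business rule is simply extended price times discount, could be:

```sql
-- macros/discounted_amount.sql (illustrative; the exact business rule is an assumption)
{% macro discounted_amount(extended_price, discount_percentage, scale=2) %}
    round({{ extended_price }} * {{ discount_percentage }}, {{ scale }})
{% endmacro %}
```

A model would then call it as `{{ discounted_amount('l_extendedprice', 'l_discount') }}`. Generic tests (`unique`, `not_null`) are declared in the models' `.yml` files, while a singular test is a SQL file under `tests/` that fails if it returns any rows. A sketch of a date-range check, with assumed bounds:

```sql
-- tests/fct_orders_date_valid.sql (illustrative name; the date bounds are assumptions)
select *
from {{ ref('fct_orders') }}
where order_date > current_date
   or order_date < date '1990-01-01'
```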
- Installed Airflow and Astronomer Cosmos to orchestrate the DBT transformations.
- Set up Airflow DAGs to trigger DBT runs:
  - Scheduled the DAGs to run daily.
  - Configured the Snowflake connection within Airflow.
```
data_pipeline/
├── models/
│   ├── staging/
│   └── marts/
├── macros/
├── tests/
├── dags/
│   └── dbt_pipeline_dag.py
└── Dockerfile
```
- `models/`: Contains all the DBT models, including staging and fact tables.
- `macros/`: Contains reusable business logic for transformations.
- `tests/`: Includes DBT tests for data validation.
- `dags/`: Contains the Airflow DAG for orchestrating the DBT runs.
- DBT Models: run the transformations with:

  ```bash
  dbt run
  ```

- Airflow DAG:
  - Start Airflow:

    ```bash
    airflow webserver
    airflow scheduler
    ```

  - Trigger the `dbt_pipeline_dag` from the Airflow UI.
- Add more comprehensive data tests.
- Integrate additional sources or incremental data loading (see the sketch after this list).
- Deploy to cloud services (AWS, GCP) for scalability.
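For the incremental-loading idea above, a dbt model can be switched to incremental materialization. A minimal sketch, with a hypothetical model name and an assumed unique key and timestamp column:

```sql
-- models/marts/fct_orders_incremental.sql (hypothetical)
{{ config(materialized='incremental', unique_key='order_key') }}

select
    order_key,
    customer_key,
    total_price,
    order_date
from {{ ref('stg_tpch_orders') }}

{% if is_incremental() %}
  -- On incremental runs, only process orders newer than what is already loaded.
  where order_date > (select max(order_date) from {{ this }})
{% endif %}
```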
This project demonstrates how to create an ELT pipeline using modern tools like DBT, Snowflake, and Airflow. It covers the complete workflow from environment setup to orchestration with Airflow. The result is a scalable and reusable ELT pipeline, ready for deployment in a production environment.
This project was built by following a hands-on tutorial, with adaptations for Windows and other changes as needed (see the original video).
This project is open-source and available under the MIT License.