This project uses Apache Airflow to orchestrate a data pipeline that extracts data, converts its format, and uploads it to a data lake (Google Cloud Storage, GCS).
- Airflow is installed using the official Apache Airflow Docker image.
- DAGs are defined in Python files; a minimal DAG sketch follows this list.
- The DAGs are triggered from the Airflow webserver UI.
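Below is a minimal sketch of what such a DAG file could look like, assuming Airflow 2.x. The `dag_id`, schedule, dataset URL, local file paths, and bucket/object names are illustrative placeholders rather than this project's actual values, and the `csv_to_parquet` and `upload_to_gcs` helpers are stubs whose fuller sketches appear in the data section below.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

# Placeholder URL and paths; the real values depend on the dataset release.
DATASET_URL = "https://example.com/yellow_tripdata_2021-01.csv"
CSV_FILE = "/opt/airflow/data/yellow_tripdata_2021-01.csv"
PARQUET_FILE = "/opt/airflow/data/yellow_tripdata_2021-01.parquet"


def csv_to_parquet(src: str, dest: str) -> None:
    """Stub; a fuller pandas sketch appears in the data section below."""
    ...


def upload_to_gcs(local_path: str, bucket_name: str, blob_name: str) -> None:
    """Stub; a fuller google.cloud sketch appears in the data section below."""
    ...


with DAG(
    dag_id="nyc_taxi_ingestion",  # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    # Extract: pull the monthly CSV with curl.
    download = BashOperator(
        task_id="download_dataset",
        bash_command=f"curl -sSL {DATASET_URL} -o {CSV_FILE}",
    )

    # Transform: rewrite the CSV as a compressed Parquet file with pandas.
    transform = PythonOperator(
        task_id="csv_to_parquet",
        python_callable=csv_to_parquet,
        op_kwargs={"src": CSV_FILE, "dest": PARQUET_FILE},
    )

    # Load: push the Parquet file into the GCS data lake.
    upload = PythonOperator(
        task_id="upload_to_gcs",
        python_callable=upload_to_gcs,
        op_kwargs={
            "local_path": PARQUET_FILE,
            "bucket_name": "my-data-lake",
            "blob_name": "raw/yellow_tripdata_2021-01.parquet",
        },
    )

    download >> transform >> upload
```

Once the file is placed in the `dags/` folder mounted into the Airflow container, the DAG shows up in the webserver UI, where it can be triggered manually.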
The NYC taxi trip dataset was used for this project.
- Downloaded the data using `curl`.
- Transformed the data from CSV to Parquet using the `pandas` library (Parquet is a compressed, columnar format that is easier to query); see the transform sketch after this list.
- Uploaded the data to GCS using the `google.cloud` Python library; see the upload sketch below.
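A sketch of the CSV-to-Parquet step with `pandas`, assuming a Parquet engine such as `pyarrow` is installed alongside it; the file names are placeholders.

```python
import pandas as pd


def csv_to_parquet(src: str, dest: str) -> None:
    """Read the trip-data CSV and rewrite it as a compressed Parquet file."""
    df = pd.read_csv(src)
    # Parquet is columnar and compressed, so it is smaller on disk and
    # faster to query than the raw CSV. Requires pyarrow or fastparquet.
    df.to_parquet(dest, compression="snappy", index=False)


csv_to_parquet(
    "yellow_tripdata_2021-01.csv",      # placeholder input file
    "yellow_tripdata_2021-01.parquet",  # placeholder output file
)
```

For monthly files too large to fit in memory, the usual variation is `pd.read_csv(..., chunksize=...)` combined with an incremental Parquet writer.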
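A sketch of the upload step with the `google.cloud` storage client (the `google-cloud-storage` package). The bucket and object names are placeholders, and the client is assumed to find credentials in the environment (for example via `GOOGLE_APPLICATION_CREDENTIALS`).

```python
from google.cloud import storage


def upload_to_gcs(local_path: str, bucket_name: str, blob_name: str) -> None:
    """Upload a local file to a GCS bucket under the given object name."""
    client = storage.Client()  # picks up credentials from the environment
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)


upload_to_gcs(
    "yellow_tripdata_2021-01.parquet",      # placeholder local file
    "my-data-lake",                         # placeholder bucket name
    "raw/yellow_tripdata_2021-01.parquet",  # placeholder object path
)
```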