
Airflow_Ingestion

Introduction

This project uses Apache Airflow to orchestrate a data pipeline that extracts data, converts its format, and uploads it to a data lake (Google Cloud Storage, GCS).

Process

  • Airflow is installed using the official Airflow Docker image.
  • DAGs are defined in a Python file (a minimal sketch follows this list).
  • The DAGs are triggered from the Airflow webserver UI.
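
As a rough sketch, the DAG file might look something like the following; the DAG id, schedule, dataset URL, and file paths here are illustrative placeholders, not the project's actual values:

```python
# dags/ingest_nyc_taxi.py -- minimal, illustrative DAG skeleton
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nyc_taxi_ingestion",        # hypothetical DAG id
    schedule_interval="@monthly",
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    # Download one month of NYC taxi trip data with curl.
    download = BashOperator(
        task_id="download_dataset",
        bash_command=(
            "curl -sSL https://example.com/yellow_tripdata_2021-01.csv "  # placeholder URL
            "-o /opt/airflow/data/yellow_tripdata_2021-01.csv"
        ),
    )
```

In the full pipeline, this download task would be chained to transform and upload tasks (e.g. `download >> transform >> upload`), corresponding to the steps in the sections below.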

Download and Transformation of Data

The NYC taxi trip data was used for this project.

  • The data was downloaded using curl.
  • The data was converted from CSV to Parquet using the pandas library (Parquet is a compressed, columnar format that is more efficient to query), as sketched below.
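
A pandas-based conversion along these lines; the function name and file paths are placeholders, and a Parquet engine such as pyarrow (or fastparquet) is assumed to be installed:

```python
import pandas as pd

def csv_to_parquet(csv_path: str, parquet_path: str) -> None:
    """Convert a CSV file to the compressed Parquet format."""
    df = pd.read_csv(csv_path)
    # to_parquet requires a Parquet engine such as pyarrow or fastparquet.
    df.to_parquet(parquet_path, compression="snappy")

csv_to_parquet(
    "yellow_tripdata_2021-01.csv",      # placeholder input file
    "yellow_tripdata_2021-01.parquet",  # placeholder output file
)
```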

Uploading Data

  • The data was uploaded to GCS using the google.cloud Python library (see the sketch below).
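
A minimal upload sketch using the google-cloud-storage client; the bucket and object names are placeholders, and authentication is assumed to come from default GCP credentials:

```python
from google.cloud import storage

def upload_to_gcs(bucket_name: str, object_name: str, local_file: str) -> None:
    """Upload a local file to a GCS bucket."""
    client = storage.Client()  # picks up default GCP credentials
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(object_name)
    blob.upload_from_filename(local_file)

upload_to_gcs(
    "my-data-lake-bucket",                  # placeholder bucket name
    "raw/yellow_tripdata_2021-01.parquet",  # placeholder object path
    "yellow_tripdata_2021-01.parquet",      # local file produced earlier
)
```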