- Languages: Python, SQL
- Databases: PostgreSQL, MySQL, AWS RDS, DynamoDB, Amazon Redshift, Apache Cassandra
- Modeling: Dimensional data modeling
- Batch ETL: Python, Spark SQL
- Workflow Management: Airflow
-
- Created a relational database using PostgreSQL to meet the diverse needs of data consumers.
- Developed a star schema database with optimized definitions of fact and dimension tables, and normalized the tables.
- Created a NoSQL database using Apache Cassandra based on the schema outlined above.
- Developed denormalized tables optimized for a specific set of queries and business questions.
- Built an ETL pipeline for the PostgreSQL and Apache Cassandra databases (sketched below).
Proficiencies learned and used: Python, PostgreSQL, Star Schema, ETL pipelines, Normalization, Apache Cassandra, Denormalization
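A minimal sketch of how the PostgreSQL star-schema load might look; the table and column names and the connection string are illustrative placeholders, not the project's actual schema.

```python
# Hypothetical star-schema DDL and load step; table/column names and the
# connection string are placeholders, not the project's actual schema.
import psycopg2

CREATE_TABLES = """
CREATE TABLE IF NOT EXISTS dim_users (
    user_id INT PRIMARY KEY,
    name    TEXT,
    level   TEXT
);
CREATE TABLE IF NOT EXISTS fact_events (
    event_id    SERIAL PRIMARY KEY,
    user_id     INT REFERENCES dim_users (user_id),
    occurred_at TIMESTAMP,
    item_id     INT
);
"""

def load_row(cur, row):
    # Upsert the dimension row first, then insert the fact row that references it.
    cur.execute(
        "INSERT INTO dim_users (user_id, name, level) VALUES (%s, %s, %s) "
        "ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level",
        (row["user_id"], row["name"], row["level"]),
    )
    cur.execute(
        "INSERT INTO fact_events (user_id, occurred_at, item_id) VALUES (%s, %s, %s)",
        (row["user_id"], row["occurred_at"], row["item_id"]),
    )

conn = psycopg2.connect("dbname=analytics user=etl")  # placeholder DSN
cur = conn.cursor()
cur.execute(CREATE_TABLES)
load_row(cur, {"user_id": 1, "name": "Ada", "level": "paid",
               "occurred_at": "2020-01-01 12:00:00", "item_id": 42})
conn.commit()
conn.close()
```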
-
- Applied the data warehouse architectures learned and built a data warehouse on the AWS cloud.
- Developed an ETL pipeline that extracts data from S3 buckets, stages it in a Redshift cluster, and transforms it into dimension and fact tables for analytics teams (staging step sketched below).
- Gained deeper experience with Amazon Redshift clusters, IAM roles, and security groups.
Proficiencies learned and used: Python, Amazon Redshift, AWS CLI, Amazon SDK, SQL, PostgreSQL
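A minimal sketch of the S3-to-Redshift staging step, assuming a psycopg2 connection to the cluster; the bucket, IAM role, cluster endpoint, and table names are placeholders, not the actual project resources.

```python
# Hypothetical S3 -> Redshift staging and transform step; the bucket, IAM role,
# endpoint, and table names below are placeholders.
import psycopg2

COPY_STAGING_EVENTS = """
COPY staging_events
FROM 's3://example-bucket/log_data'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
FORMAT AS JSON 'auto'
REGION 'us-west-2';
"""

INSERT_FACT = """
INSERT INTO fact_events (user_id, occurred_at, item_id)
SELECT user_id, occurred_at, item_id
FROM staging_events
WHERE user_id IS NOT NULL;
"""

conn = psycopg2.connect(
    host="example-cluster.abc123.us-west-2.redshift.amazonaws.com",  # placeholder endpoint
    port=5439, dbname="analytics", user="etl", password="***",
)
cur = conn.cursor()
cur.execute(COPY_STAGING_EVENTS)   # bulk-load raw files from S3 into the staging table
cur.execute(INSERT_FACT)           # transform staged rows into the fact table
conn.commit()
conn.close()
```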
-
- Built a data lake on the AWS cloud using Spark and an AWS EMR cluster.
- Scaled up the current ETL pipeline by moving the data warehouse to a data lake.
- The data lake acts as a single-source analytics platform; ETL jobs written in Spark extract data from S3, stage it in Redshift, process it into analytics tables, and load the results back into S3 (see the Spark sketch below).
Proficiencies learned and used: Spark, S3, EMR, Parquet.
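A minimal sketch of one Spark job in the data-lake pipeline; the bucket paths and column names are illustrative, not the actual project layout.

```python
# Hypothetical data-lake ETL job; bucket paths and column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("data-lake-etl")
         .getOrCreate())

# Read raw JSON events from the landing zone in S3.
events = spark.read.json("s3a://example-bucket/raw/events/")

# Build a simple analytics table with a derived date column.
analytics = (events
             .withColumn("event_date", F.to_date("occurred_at"))
             .select("user_id", "item_id", "event_date"))

# Write the result back to S3 as Parquet, partitioned for efficient scans.
(analytics.write
          .mode("overwrite")
          .partitionBy("event_date")
          .parquet("s3a://example-bucket/analytics/events/"))
```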
-
- Created and automated a set of data pipelines with Airflow and Python.
- Wrote custom operators and plugins to stage data, transform it into a star schema of dimension and fact tables, and validate it through data quality checks.
- Scheduled ETL jobs in Airflow and created project-specific plugins and operators, automating pipeline execution and improving the monitoring and debugging of production pipelines (see the DAG sketch below).
Proficiencies learned and used: Apache Airflow, S3, Amazon Redshift, Python.
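A minimal sketch of the Airflow side: a custom data-quality operator plus a small DAG that wires it after placeholder staging and load tasks. The DAG id, task ids, connection id, and table names are illustrative, and some import paths vary between Airflow versions.

```python
# Hypothetical Airflow DAG with a custom data-quality operator. DAG id, task
# ids, connection id, and table names are placeholders; import paths differ
# slightly across Airflow versions.
from datetime import datetime

from airflow import DAG
from airflow.models import BaseOperator
from airflow.operators.dummy import DummyOperator  # EmptyOperator in newer Airflow
from airflow.providers.postgres.hooks.postgres import PostgresHook


class DataQualityOperator(BaseOperator):
    """Fail the task if any listed table is empty."""

    def __init__(self, redshift_conn_id, tables, **kwargs):
        super().__init__(**kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.tables = tables

    def execute(self, context):
        hook = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for table in self.tables:
            records = hook.get_records(f"SELECT COUNT(*) FROM {table}")
            if not records or records[0][0] == 0:
                raise ValueError(f"Data quality check failed: {table} is empty")
            self.log.info("Data quality check passed for %s", table)


with DAG(
    dag_id="example_star_schema_etl",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    stage_events = DummyOperator(task_id="stage_events")   # stands in for an S3-to-Redshift staging task
    load_fact = DummyOperator(task_id="load_fact_events")  # stands in for the fact-table load
    quality_checks = DataQualityOperator(
        task_id="run_data_quality_checks",
        redshift_conn_id="redshift",
        tables=["fact_events", "dim_users"],
    )

    stage_events >> load_fact >> quality_checks
```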
-
- Developed an ETL pipeline for Airbnb that extracts data from an S3 bucket, stages it in a Redshift cluster, and transforms it into dimension and fact tables for analytics teams.
- Created an automated set of data pipelines in Apache Airflow with custom operators and plugins, validated through data quality checks.
- Created scheduled ETL jobs in Airflow.
Proficiencies learned and used: PostgreSQL, Apache Spark, S3, EMR, Parquet, Amazon Redshift, Python, Snowflake Schema.
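A minimal DDL sketch of a snowflake-schema layout, where a dimension is further normalized into a sub-dimension table; the table and column names are illustrative, not the actual model used for the Airbnb data.

```python
# Hypothetical snowflake-schema DDL: the listing dimension is normalized
# further into a city sub-dimension. Names and the DSN are placeholders.
import psycopg2

SNOWFLAKE_DDL = """
CREATE TABLE IF NOT EXISTS dim_city (
    city_id   INT PRIMARY KEY,
    city_name TEXT,
    state     TEXT
);

CREATE TABLE IF NOT EXISTS dim_listing (
    listing_id   INT PRIMARY KEY,
    listing_name TEXT,
    city_id      INT REFERENCES dim_city (city_id)  -- normalized out of the listing dimension
);

CREATE TABLE IF NOT EXISTS fact_booking (
    booking_id BIGINT PRIMARY KEY,
    listing_id INT REFERENCES dim_listing (listing_id),
    booked_at  TIMESTAMP,
    price      NUMERIC(10, 2)
);
"""

conn = psycopg2.connect("dbname=analytics user=etl")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute(SNOWFLAKE_DDL)
conn.close()
```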