
Data-Engineering-Concepts

Data Engineering skills and tools

  • Languages: Python, SQL
  • Databases: PostgreSQL, MySQL, AWS RDS, DynamoDB, Amazon Redshift, Apache Cassandra
  • Modeling: Dimensional data modeling
  • Batch ETL: Python, SparkSQL
  • Workflow Management: Airflow

Data Engineering Projects and Concepts Learned

    • Created a relational database in PostgreSQL to serve the diverse query needs of data consumers.
    • Designed a star-schema database with optimized fact and dimension table definitions and normalized tables.
    • Created a NoSQL database in Apache Cassandra based on the schema outlined above.
      • Developed denormalized tables optimized for specific application and business queries.
    • Built ETL pipelines for both the PostgreSQL and Apache Cassandra databases (a minimal load-step sketch follows below).

    Proficiencies learned and used: Python, PostgreSQL, Star Schema, ETL pipelines, Normalization, Apache Cassandra, Denormalization
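A minimal sketch of one PostgreSQL load step in this style. Table names, columns, and connection settings here are illustrative placeholders, not the repository's actual schema:

```python
# Minimal sketch of a star-schema load step with psycopg2.
# dim_users / fact_plays and the connection string are hypothetical examples.
import psycopg2

CREATE_DIM_USERS = """
CREATE TABLE IF NOT EXISTS dim_users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    level      VARCHAR
);
"""

CREATE_FACT_PLAYS = """
CREATE TABLE IF NOT EXISTS fact_plays (
    play_id    SERIAL PRIMARY KEY,
    start_time TIMESTAMP NOT NULL,
    user_id    INT REFERENCES dim_users (user_id),
    song_id    VARCHAR,
    location   VARCHAR
);
"""

UPSERT_USER = """
INSERT INTO dim_users (user_id, first_name, last_name, level)
VALUES (%s, %s, %s, %s)
ON CONFLICT (user_id) DO UPDATE SET level = EXCLUDED.level;
"""

def etl_load_user(conn, row):
    # Upsert one dimension row extracted from the source files.
    with conn.cursor() as cur:
        cur.execute(UPSERT_USER, (row["user_id"], row["first_name"],
                                  row["last_name"], row["level"]))
    conn.commit()

if __name__ == "__main__":
    conn = psycopg2.connect("host=127.0.0.1 dbname=etl_demo user=student password=student")
    with conn.cursor() as cur:
        cur.execute(CREATE_DIM_USERS)
        cur.execute(CREATE_FACT_PLAYS)
    conn.commit()
    etl_load_user(conn, {"user_id": 1, "first_name": "Ada", "last_name": "Lovelace", "level": "free"})
    conn.close()
```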

    • Applied the data warehouse architectures learned and built a data warehouse on the AWS cloud.
    • Developed an ETL pipeline that extracts data from S3 buckets, stages it in a Redshift cluster, and transforms it into dimension and fact tables for analytics teams (see the staging sketch below).
    • Learned more about Amazon Redshift clusters, IAM roles, and security groups.

    Proficiencies learned and used: Python, Amazon Redshift, AWS CLI, Amazon SDK, SQL, PostgreSQL
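A hedged sketch of the S3-to-Redshift staging and transform step. The bucket path, IAM role ARN, table names, and endpoint are placeholders, not the project's actual values:

```python
# Stage raw JSON from S3 into Redshift, then populate an analytics fact table.
import psycopg2

STAGING_COPY = """
COPY staging_events
FROM 's3://my-data-bucket/log_data'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
FORMAT AS JSON 'auto'
REGION 'us-west-2';
"""

INSERT_FACT = """
INSERT INTO fact_events (event_time, user_id, level)
SELECT ts, user_id, level
FROM staging_events
WHERE user_id IS NOT NULL;
"""

def run_etl(dsn):
    # COPY pulls the raw files into a staging table; the INSERT reshapes them.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(STAGING_COPY)
        cur.execute(INSERT_FACT)

if __name__ == "__main__":
    run_etl("host=<redshift-endpoint> dbname=dwh user=dwh_user password=*** port=5439")
```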

    • Built a data lake on the AWS cloud using Spark on an AWS EMR cluster.
    • Scaled up the existing ETL pipeline by moving the data warehouse to a data lake.
    • The data lake acts as a single-source analytics platform: ETL jobs written in Spark extract data from S3, process it into analytics tables, and load the results back into S3 as Parquet (sketched below).

    Proficiencies learned and used: Spark, S3, EMR, Parquet.
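A minimal PySpark sketch of that S3-to-S3 data-lake job. Bucket names and columns are illustrative, not the repository's actual paths:

```python
# Extract raw JSON from S3, transform with Spark, load partitioned Parquet back to S3.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("data-lake-etl")
         .getOrCreate())

# Extract: read raw JSON logs from the input bucket.
logs = spark.read.json("s3a://my-input-bucket/log_data/*.json")

# Transform: keep valid rows and derive a partition column.
events = (logs
          .filter(F.col("user_id").isNotNull())
          .withColumn("event_date", F.to_date("ts")))

# Load: write the analytics table back to S3 as partitioned Parquet.
(events.write
       .mode("overwrite")
       .partitionBy("event_date")
       .parquet("s3a://my-output-bucket/analytics/events/"))
```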

    • Created and automated a set of data pipelines with Airflow and Python.
    • Wrote custom operators and plugins to stage data, transform it into a star schema of dimension and fact tables, and validate it through data quality checks (a DAG sketch follows below).
    • Scheduled the ETL jobs in Airflow, created project-specific plugins and operators, and automated pipeline execution, making production pipelines easier to monitor and debug.

    Proficiencies learned and used: Apache Airflow, S3, Amazon Redshift, Python.
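A hedged sketch of a custom Airflow operator and a minimal DAG following the staging, load, data-quality pattern described above. Operator, task, and DAG names are illustrative, not the repo's actual plugins:

```python
# Custom data-quality operator plus a small DAG wiring stage -> load -> checks.
from datetime import datetime
from airflow import DAG
from airflow.models import BaseOperator
from airflow.operators.empty import EmptyOperator

class DataQualityOperator(BaseOperator):
    """Fail the task if any supplied check callable returns False."""

    def __init__(self, checks=None, **kwargs):
        super().__init__(**kwargs)
        # Each check is a (description, callable) pair returning True on success.
        self.checks = checks or []

    def execute(self, context):
        for description, check in self.checks:
            if not check():
                raise ValueError(f"Data quality check failed: {description}")
            self.log.info("Passed data quality check: %s", description)

with DAG(
    dag_id="etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    stage_events = EmptyOperator(task_id="stage_events")
    load_fact = EmptyOperator(task_id="load_fact")
    run_quality_checks = DataQualityOperator(
        task_id="data_quality_checks",
        checks=[("fact table is non-empty", lambda: True)],
    )

    stage_events >> load_fact >> run_quality_checks
```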

    • Developed an ETL pipeline for Airbnb data that extracts data from an S3 bucket, stages it in a Redshift cluster, and transforms it into dimension and fact tables for analytics teams.
    • Built an automated set of data pipelines in Apache Airflow with custom operators and plugins, validated through data quality checks.
    • Scheduled the ETL jobs in Airflow (a staging-task sketch follows below).

    Proficiencies learned and used: PostgreSQL, Apache Spark, S3, EMR, Parquet, Amazon Redshift, Python, Snowflake schema.
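A small sketch of an Airflow task function that stages Airbnb files from S3 into Redshift via PostgresHook. The connection id, bucket path, IAM role, and table name are placeholders, not the project's actual configuration:

```python
# Run a Redshift COPY from inside an Airflow task using PostgresHook.
from airflow.providers.postgres.hooks.postgres import PostgresHook

COPY_SQL = """
COPY staging_listings
FROM 's3://my-airbnb-bucket/listings/'
IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
FORMAT AS CSV
IGNOREHEADER 1
REGION 'us-west-2';
"""

def stage_listings_to_redshift():
    # Uses the cluster referenced by the 'redshift' Airflow connection.
    hook = PostgresHook(postgres_conn_id="redshift")
    hook.run(COPY_SQL)
```

Wrapped in a PythonOperator and scheduled, a task like this forms the staging step of the pipeline described above.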
