A Professional Data Engineer enables data-driven decision making by
collecting, transforming, and publishing data.
A data engineer should be able to
design, build, operationalize, secure, and monitor data processing systems with a particular emphasis on
- security and compliance;
- scalability and efficiency;
- reliability and fidelity;
- flexibility and portability.
A data engineer should also be able to
leverage, deploy, and continuously train pre-existing machine learning models.
The exam guide groups these skills into four sections:
- Designing data processing systems
  - Selecting the appropriate storage technologies
  - Designing data pipelines
  - Designing a data processing solution
  - Migrating data warehousing and data processing
- Building and operationalizing data processing systems
  - Building and operationalizing storage systems
  - Building and operationalizing pipelines
  - Building and operationalizing processing infrastructure
- Operationalizing machine learning models
  - Deploying an ML pipeline
  - Leveraging pre-built ML models as a service
  - Choosing the appropriate training and serving infrastructure
  - Measuring, monitoring, and troubleshooting machine learning models
- Ensuring solution quality
  - Designing for security and compliance
  - Ensuring scalability and efficiency
  - Ensuring reliability and fidelity
  - Ensuring flexibility and portability

Key Big Data and ML services on Google Cloud:
- Cloud SQL
- Datastore
- Bigtable
- Cloud Spanner
- Realtime messaging with Pub/Sub
- Data Pipelines with Cloud Dataflow
- Dataproc
- BigQuery
- AI Platform
- Pretrained ML APIs
- Datalab
- Dataprep
- Data Studio
- Cloud Composer

Google Cloud also provides supporting services that help you work with its Big Data and ML services:
- Scalable and high-performance virtual machines
- Fine-grained access control and visibility for centrally managing cloud resources
- Monitoring for applications on Google Cloud and AWS
- Logging for applications on Google Cloud and AWS
- Fully managed relational database service for MySQL, PostgreSQL, and SQL Server
- Typical uses: WordPress, backends, game states, CRM tools, MySQL, PostgreSQL, and Microsoft SQL Server workloads
- Comparable services: AWS RDS, AWS Aurora, Azure Database, Azure SQL Database
- Cloud Storage allows worldwide storage and retrieval of any amount of data at any time. You can use Cloud Storage for a range of scenarios including serving website content, storing data for archival and disaster recovery, or distributing large data objects to users via direct download.
- Every bucket name must be globally unique.
- Fully managed, scalable, relational database service for regional and global application data
- Cloud Spanner is a scalable relational database service built to support transactions, strong consistency, and high availability across regions and continents.
- Comparable services: Cassandra (with CQL), AWS Aurora, AWS DynamoDB, Azure Cosmos DB
- The Firebase Realtime Database is a cloud-hosted NoSQL database that lets you store and sync data between your users in real time.
- Comparable services: MongoDB, AWS DynamoDB, Azure Cosmos DB
- Cloud Firestore is a fast, fully managed, serverless, cloud-native NoSQL document database.
- Enterprise-grade, scalable NoSQL
- Sync data across devices, online or offline
- Comparable services: MongoDB, AWS DynamoDB, Azure Cosmos DB
- Cloud Memorystore is a fully managed in-memory data store service for Redis built on scalable, more secure, and highly available infrastructure.
- Easy lift-and-shift migration of applications from open source Redis to Memorystore.
- Comparable services: AWS ElastiCache, Azure Cache for Redis
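
A common reason to put an in-memory store such as Memorystore in front of a database is the cache-aside pattern: read from the cache first and fall back to the slower source only on a miss. The sketch below is a minimal illustration under stated assumptions. It uses a plain Python dictionary as a stand-in for Redis, and `TTLCache`, `read_through`, and `slow_lookup` are hypothetical names, not part of any Google Cloud or Redis client library.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-process stand-in for a Redis-style cache (hypothetical helper)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._items: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired entries behave like a cache miss.
            del self._items[key]
            return None
        return value

    def set(self, key: str, value: str, ttl_seconds: float) -> None:
        self._items[key] = (value, self._clock() + ttl_seconds)


def read_through(
    cache: TTLCache,
    key: str,
    load_from_db: Callable[[str], str],
    ttl_seconds: float = 60.0,
) -> str:
    """Cache-aside read: try the cache, otherwise load and populate it."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(key)
    cache.set(key, value, ttl_seconds)
    return value
```

With a real Redis endpoint the dictionary would be replaced by a Redis client, but the read path keeps the same shape: the database is only queried when the key is missing or has expired.
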
- MapReduce
- Apache Hadoop & HDFS
- Apache Spark
- Apache Pig
- Apache Tez
- Apache Kafka
- Global messaging and event ingestion
- Pub/Sub is a fully managed real-time messaging service that allows you to send and receive messages between independent applications.
- Decouple background data and event processing from the code that handles user-facing requests
- Streamed events, IoT data, and metrics can be ingested into Pub/Sub
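
The decoupling idea above can be pictured with a toy in-memory topic. This is only a sketch of the publish/subscribe shape, not the Pub/Sub client library: `Topic`, `subscribe`, `publish`, and `pull` are hypothetical names, and real Pub/Sub features such as acknowledgements, retries, and at-least-once delivery are left out.

```python
from collections import deque
from typing import Deque, Dict, List


class Topic:
    """Toy in-memory topic (hypothetical): every subscription sees every message."""

    def __init__(self) -> None:
        self._subscriptions: Dict[str, Deque[bytes]] = {}

    def subscribe(self, name: str) -> None:
        # Each subscription keeps its own backlog of undelivered messages.
        self._subscriptions.setdefault(name, deque())

    def publish(self, data: bytes) -> None:
        # The publisher does not know or care who consumes the message.
        for backlog in self._subscriptions.values():
            backlog.append(data)

    def pull(self, name: str, max_messages: int = 10) -> List[bytes]:
        backlog = self._subscriptions[name]
        batch: List[bytes] = []
        while backlog and len(batch) < max_messages:
            batch.append(backlog.popleft())
        return batch


def handle_request(topic: Topic, order_id: str) -> str:
    """User-facing code: publish the event and return without waiting for workers."""
    topic.publish(order_id.encode("utf-8"))
    return "accepted"
```

A background worker would call `pull` on its own subscription at its own pace, so slow processing never blocks the request handler, and adding a second subscription (for example, one for analytics) requires no change to the publisher.
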
- Managed Apache Beam: fast, unified stream and batch data processing
- Dataflow is a fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing. With its serverless approach to resource provisioning and management, you have access to virtually limitless capacity to solve your biggest data processing challenges, while paying only for what you use.
- Horizontal autoscaling of worker resources to maximize resource utilization
- Can be connected with Pub/Sub to process data in batch or streaming mode
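
One way to picture unified batch and stream processing is a fixed-window aggregation over event timestamps. The sketch below is plain Python rather than the Apache Beam SDK, and the `Event` layout and `sum_per_fixed_window` function are assumptions made for illustration. The same function accepts a finite list (batch) or a generator that yields events as they arrive (stream-like), which is the spirit of the unified model.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed event layout: (event_time_seconds, key, value)
Event = Tuple[float, str, int]


def sum_per_fixed_window(
    events: Iterable[Event], window_seconds: int
) -> Dict[Tuple[int, str], int]:
    """Sum values per key inside fixed, non-overlapping event-time windows."""
    totals: Dict[Tuple[int, str], int] = defaultdict(int)
    for event_time, key, value in events:
        # Align each event to the start of the window that contains it.
        window_start = int(event_time // window_seconds) * window_seconds
        totals[(window_start, key)] += value
    return dict(totals)
```

A real streaming pipeline also has to decide when a window is complete and how to treat late data; this sketch ignores both and simply aggregates whatever it is given.
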
- Managed Apache Spark and Hadoop clusters
- Also supports Apache Pig and Apache Hive
- Dataproc is a managed Apache Spark and Apache Hadoop service that lets you take advantage of open source data tools for batch processing, querying, streaming, and machine learning. Dataproc automation helps you create clusters quickly, manage them easily, and save money by turning clusters off when you don't need them. With less time and money spent on administration, you can focus on your jobs and your data.
- Cloud Bigtable is Google's NoSQL Big Data database service. It's the same database that powers many core Google services, including Search, Analytics, Maps, and Gmail.
- Globally distributed, built around the row key concept
- Suited to running large analytical workloads and building low-latency applications
- Comparable services: HBase, Cassandra, AWS DynamoDB, Azure Cosmos DB
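
Bigtable keeps rows sorted lexicographically by row key, so key design decides which reads are cheap. The sketch below illustrates one common layout, a device identifier followed by a reversed timestamp, using an ordinary sorted Python list in place of a table. `MAX_TS_MS`, `make_row_key`, and `prefix_scan` are hypothetical names chosen for this example and are not part of the Bigtable client library.

```python
import bisect
from typing import List

# Assumed upper bound for millisecond timestamps (13 digits); hypothetical constant.
MAX_TS_MS = 9_999_999_999_999


def make_row_key(device_id: str, timestamp_ms: int) -> str:
    """Build 'device#reversed_timestamp' so newer rows sort first per device."""
    reversed_ts = MAX_TS_MS - timestamp_ms
    # Zero-pad so that string order matches numeric order.
    return f"{device_id}#{reversed_ts:013d}"


def prefix_scan(sorted_keys: List[str], prefix: str) -> List[str]:
    """Return all keys starting with prefix from a lexicographically sorted list."""
    start = bisect.bisect_left(sorted_keys, prefix)
    end = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[start:end]
```

Because all rows for one device share a prefix and sit next to each other, "latest readings for device X" becomes a short contiguous scan instead of a full-table read.
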
- BigQuery is Google's fully managed, petabyte-scale, low-cost analytics data warehouse. BigQuery is NoOps: there is no infrastructure to manage and you don't need a database administrator, so you can focus on analyzing data to find meaningful insights, use familiar SQL, and take advantage of the pay-as-you-go model.
- Serverless, real-time analytics, advanced and predictive analytics, large-scale events, and enterprises
- Comparable services: AWS Redshift, Snowflake, and Azure SQL Data Warehouse
- Use Cloud Datalab to easily explore, visualize, analyze, and transform data interactively using familiar languages such as Python and SQL. Pre-installed Jupyter introductory, sample, and tutorial notebooks show you how to:
  - Access, analyze, monitor, and visualize data
  - Use notebooks with Python, TensorFlow Machine Learning, and the Google Analytics, Google BigQuery, and Google Charts APIs
- Notebooks can be stored in Cloud Storage (GCS) and reopened at any time
- Serverless BI reporting and dashboards
- Google Data Studio is a fully managed visual analytics service that can help anyone in your organization unlock insights from data through easy-to-create and interactive dashboards that inspire smarter business decision-making.
- When Data Studio is combined with BigQuery BI Engine, an in-memory analysis service, data exploration and visual interactivity reach sub-second speeds over massive datasets.
- Cloud Composer is a managed Apache Airflow service that helps you create, schedule, monitor, and manage workflows.
- Cloud Composer automation helps you create Airflow environments quickly and use Airflow-native tools, such as the powerful Airflow web interface and command line tools, so you can focus on your workflows and not your infrastructure.
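
Airflow workflows are directed acyclic graphs of tasks, and the core scheduling idea is that a task runs only after everything upstream of it has finished. The sketch below shows that ordering with Python's standard `graphlib` module (Python 3.9+). It does not use Airflow itself, and `run_workflow` and the task names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_workflow(
    tasks: Dict[str, Callable[[], None]],
    upstream: Dict[str, Set[str]],
) -> List[str]:
    """Run each task after all of its upstream tasks; return the execution order."""
    # Map every task to the set of tasks that must finish before it starts.
    graph = {name: upstream.get(name, set()) for name in tasks}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        tasks[name]()
    return order
```

For example, with tasks `extract`, `transform`, and `load`, and `upstream = {"transform": {"extract"}, "load": {"transform"}}`, the tasks run in extract, transform, load order. A cycle in the dependencies raises `graphlib.CycleError`, which mirrors the requirement that a workflow graph be acyclic.
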