Module 1: Setting up the Data Engineering Environment
- Installing and Configuring SQL and Python Environments
- IDEs and Tools for Data Engineering
Module 2: Database Essentials for Data Engineering
- Introduction to PostgreSQL and Database Management
- Creating and Managing Tables
- Indexing and Query Optimization
- Utilizing Pre-defined Functions in Data Engineering
- Advanced SQL Queries for Data Manipulation
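The table creation, indexing, and query topics above can be sketched in a few lines. The course works with PostgreSQL; this example uses `sqlite3` from the Python standard library only so it runs without a database server, and the `orders` table is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table, then an index to speed up lookups by customer.
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
cur.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

cur.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 120.0), ("bob", 75.5), ("alice", 30.0)],
)

# An aggregate query of the kind covered under "Advanced SQL Queries".
cur.execute(
    "SELECT customer, SUM(amount) AS total FROM orders "
    "GROUP BY customer ORDER BY total DESC"
)
rows = cur.fetchall()
print(rows)  # [('alice', 150.0), ('bob', 75.5)]
conn.close()
```

The same `CREATE TABLE` / `CREATE INDEX` / `GROUP BY` statements carry over to PostgreSQL with minor type changes.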
Module 3: Data Engineering Programming with Python
- Basic Programming Constructs in Python
- Working with Collections (Lists, Dictionaries, etc.)
- Data Manipulation with Pandas Library
- Database Interaction with Python
- Error Handling and Exceptions
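A short sketch tying together the Python topics in this module: lists, dictionaries, and handling an exception raised while processing a record. The record layout is invented for illustration.

```python
# Raw records as a list of dictionaries; one value is malformed on purpose.
records = [
    {"name": "alice", "age": "34"},
    {"name": "bob", "age": "not a number"},
    {"name": "carol", "age": "29"},
]

ages = {}
for rec in records:
    try:
        ages[rec["name"]] = int(rec["age"])
    except ValueError:
        # Skip records whose age field cannot be parsed as an integer.
        continue

print(ages)  # {'alice': 34, 'carol': 29}
```

Catching a narrow exception type (`ValueError`) rather than a bare `except` keeps genuine bugs visible, a pattern that recurs in the pipeline modules later.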
Module 4: Data Engineering with Spark Dataframe APIs (PySpark)
- Introduction to PySpark and Spark Dataframes
- Data Transformation with select, filter, groupBy, orderBy, etc.
- Advanced Data Manipulation Techniques
- Joins and Aggregations with Dataframes
Module 5: Advanced Data Engineering with Spark SQL (PySpark and Spark SQL)
- Writing High-Quality Spark SQL Queries
- Complex SQL Operations: SELECT, WHERE, GROUP BY, ORDER BY, etc.
- Window Functions in Spark SQL
- Optimization Techniques for Spark SQL
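Window functions such as `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` are standard SQL, so the Spark SQL form looks essentially the same as below. This sketch runs the query with `sqlite3` (which supports window functions) purely to stay runnable without a Spark cluster; the `sales` table is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100), ("east", 200), ("west", 50)],
)

# Rank rows within each region by amount, highest first.
cur.execute(
    "SELECT region, amount, "
    "ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS rn "
    "FROM sales ORDER BY region, rn"
)
rows = cur.fetchall()
print(rows)  # [('east', 200, 1), ('east', 100, 2), ('west', 50, 1)]
conn.close()
```

In Spark the same string could be passed to `spark.sql(...)` against a registered temporary view.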
Module 6: Spark Metastore and Integration
- Understanding Spark Metastore and its Role
- Integrating Dataframes and Spark SQL
- Managing Metadata in Spark
Module 7: Building Data Engineering Pipelines with Spark and Python
- Designing Data Pipelines with Spark and Python
- Implementing ETL Processes
- Error Handling and Logging in Data Pipelines
Module 8: Working with Different File Formats
- Handling Parquet, JSON, CSV, and Other Formats
- Data Serialization and Deserialization
- File Formats for Efficient Data Storage and Processing
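Serialization and deserialization of the same records to JSON and CSV can be sketched with the standard library alone. Parquet, also covered in the module, needs a third-party library (e.g. pyarrow) and is omitted here; the records are invented for illustration.

```python
import csv
import io
import json

records = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]

# JSON: a single self-describing document that preserves types.
json_text = json.dumps(records)

# CSV: a header row plus one line per record; all values become strings.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "value"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# Deserialize both formats back into Python objects.
from_json = json.loads(json_text)
from_csv = list(csv.DictReader(io.StringIO(csv_text)))
print(from_json)  # [{'id': 1, 'value': 'a'}, {'id': 2, 'value': 'b'}]
print(from_csv)   # [{'id': '1', 'value': 'a'}, {'id': '2', 'value': 'b'}]
```

Note that the CSV round trip loses the integer type (`'1'` instead of `1`), one reason typed columnar formats like Parquet are preferred for pipeline storage.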
Module 9: Setting up Hadoop and Spark Cluster on GCP
- Deploying Hadoop and Spark Cluster on Google Cloud Platform (GCP) using Dataproc
- Configuring Cluster Settings and Scaling
- Data Partitioning and Shuffling in Distributed Systems
- Managing Resources and Jobs on a Cluster
- Implementing Fault Tolerance and High Availability
- Monitoring and Optimization of Cluster Performance
- Security and Access Control in Hadoop and Spark Clusters
- Integrating External Storage and Data Sources with GCP Cluster
Module 10: Final Project: Applying Data Engineering Concepts
- Designing and Implementing an End-to-End Data Engineering Project
- Utilizing SQL, Python, PySpark, and Cluster Setup on GCP
- Building Data Pipelines, Performing Data Transformation, and Loading Data
- Presenting the Final Project and Demonstrating Proficiency in Data Engineering Concepts
Requirements
- Laptop with a decent configuration (minimum 4 GB RAM and a dual-core processor)
- Sign up for GCP with the available credit, or AWS access
- Set up a self-support lab on a cloud platform (you may have to pay the applicable cloud fees unless you have credit)
- A CS or IT degree or prior IT experience is highly desirable
Who this course is for:
- Computer Science or IT students, or other graduates with a passion for getting into IT
- Data Warehouse Developers who want to transition to Data Engineering roles
- ETL Developers who want to transition to Data Engineering roles
- Database or PL/SQL Developers who want to transition to Data Engineering roles
- BI Developers who want to transition to Data Engineering roles
- QA Engineers who want to learn about Data Engineering
- Application Developers who want to gain Data Engineering skills