Module 1: Setting up the Data Engineering Environment
- Installing and Configuring SQL and Python Environments
- IDEs and Tools for Data Engineering
Module 2: Database Essentials for Data Engineering
- Introduction to PostgreSQL and Database Management
- Creating and Managing Tables
- Indexing and Query Optimization
- Utilizing Pre-defined Functions in Data Engineering
- Advanced SQL Queries for Data Manipulation
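The table creation, indexing, and query topics above can be sketched in a few lines. The course works with PostgreSQL; this example uses `sqlite3` from the Python standard library only so it runs without a database server, and the `orders` table is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table, then an index to speed up lookups by customer.
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
cur.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

cur.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 120.0), ("bob", 75.5), ("alice", 30.0)],
)

# An aggregate query of the kind covered under "Advanced SQL Queries".
cur.execute(
    "SELECT customer, SUM(amount) AS total FROM orders "
    "GROUP BY customer ORDER BY total DESC"
)
rows = cur.fetchall()
print(rows)  # [('alice', 150.0), ('bob', 75.5)]
conn.close()
```

The same `CREATE TABLE` / `CREATE INDEX` / `GROUP BY` statements carry over to PostgreSQL with minor type changes.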
Module 3: Data Engineering Programming with Python
- Basic Programming Constructs in Python
- Working with Collections (Lists, Dictionaries, etc.)
- Data Manipulation with Pandas Library
- Database Interaction with Python
- Error Handling and Exceptions
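A short sketch tying together the Python topics in this module: lists, dictionaries, and handling an exception raised while processing a record. The record layout is invented for illustration.

```python
# Raw records as a list of dictionaries; one value is malformed on purpose.
records = [
    {"name": "alice", "age": "34"},
    {"name": "bob", "age": "not a number"},
    {"name": "carol", "age": "29"},
]

ages = {}
for rec in records:
    try:
        ages[rec["name"]] = int(rec["age"])
    except ValueError:
        # Skip records whose age field cannot be parsed as an integer.
        continue

print(ages)  # {'alice': 34, 'carol': 29}
```

Catching a narrow exception type (`ValueError`) rather than a bare `except` keeps genuine bugs visible, a pattern that recurs in the pipeline modules later.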
Module 4: Data Engineering with Spark Dataframe APIs (PySpark)
- Introduction to PySpark and Spark Dataframes
- Data Transformation with select, filter, groupBy, orderBy, etc.
- Advanced Data Manipulation Techniques
- Joins and Aggregations with Dataframes
Module 5: Advanced Data Engineering with Spark SQL (PySpark and Spark SQL)
- Writing High-Quality Spark SQL Queries
- Complex SQL Operations: SELECT, WHERE, GROUP BY, ORDER BY, etc.
- Window Functions in Spark SQL
- Optimization Techniques for Spark SQL
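Window functions such as `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` are standard SQL, so the Spark SQL form looks essentially the same as below. This sketch runs the query with `sqlite3` (which supports window functions) purely to stay runnable without a Spark cluster; the `sales` table is invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100), ("east", 200), ("west", 50)],
)

# Rank rows within each region by amount, highest first.
cur.execute(
    "SELECT region, amount, "
    "ROW_NUMBER() OVER (PARTITION BY region ORDER BY amount DESC) AS rn "
    "FROM sales ORDER BY region, rn"
)
rows = cur.fetchall()
print(rows)  # [('east', 200, 1), ('east', 100, 2), ('west', 50, 1)]
conn.close()
```

In Spark the same string could be passed to `spark.sql(...)` against a registered temporary view.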
Module 6: Spark Metastore and Integration
- Understanding Spark Metastore and its Role
- Integrating Dataframes and Spark SQL
- Managing Metadata in Spark
Module 7: Building Data Engineering Pipelines with Spark and Python
- Designing Data Pipelines with Spark and Python
- Implementing ETL Processes
- Error Handling and Logging in Data Pipelines
Module 8: Working with Different File Formats
- Handling Parquet, JSON, CSV, and Other Formats
- Data Serialization and Deserialization
- File Formats for Efficient Data Storage and Processing
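Serialization and deserialization of the same records to JSON and CSV can be sketched with the standard library alone. Parquet, also covered in the module, needs a third-party library (e.g. pyarrow) and is omitted here; the records are invented for illustration.

```python
import csv
import io
import json

records = [{"id": 1, "value": "a"}, {"id": 2, "value": "b"}]

# JSON: a single self-describing document that preserves types.
json_text = json.dumps(records)

# CSV: a header row plus one line per record; all values become strings.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "value"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# Deserialize both formats back into Python objects.
from_json = json.loads(json_text)
from_csv = list(csv.DictReader(io.StringIO(csv_text)))
print(from_json)  # [{'id': 1, 'value': 'a'}, {'id': 2, 'value': 'b'}]
print(from_csv)   # [{'id': '1', 'value': 'a'}, {'id': '2', 'value': 'b'}]
```

Note that the CSV round trip loses the integer type (`'1'` instead of `1`), one reason typed columnar formats like Parquet are preferred for pipeline storage.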
Module 9: Setting up Hadoop and Spark Cluster on GCP
- Deploying Hadoop and Spark Cluster on Google Cloud Platform (GCP) using Dataproc
- Configuring Cluster Settings and Scaling
- Data Partitioning and Shuffling in Distributed Systems
- Managing Resources and Jobs on a Cluster
- Implementing Fault Tolerance and High Availability
- Monitoring and Optimization of Cluster Performance
- Security and Access Control in Hadoop and Spark Clusters
- Integrating External Storage and Data Sources with GCP Cluster
Module 10: Final Project: Applying Data Engineering Concepts
- Designing and Implementing an End-to-End Data Engineering Project
- Utilizing SQL, Python, PySpark, and Cluster Setup on GCP
- Building Data Pipelines, Performing Data Transformation, and Loading Data
- Presenting the Final Project and Demonstrating Proficiency in Data Engineering Concepts
Requirements
- Laptop with a decent configuration (minimum 4 GB RAM and a dual-core processor)
- Sign up for GCP with the available credit, or AWS access
- Set up a self-support lab on a cloud platform (you may have to pay the applicable cloud fees unless you have credit)
- A CS or IT degree or prior IT experience is highly desirable
Who this course is for:
- Computer Science or IT students, or other graduates with a passion for getting into IT
- Data Warehouse Developers who want to transition to Data Engineering roles
- ETL Developers who want to transition to Data Engineering roles
- Database or PL/SQL Developers who want to transition to Data Engineering roles
- BI Developers who want to transition to Data Engineering roles
- QA Engineers who want to learn about Data Engineering
- Application Developers who want to gain Data Engineering skills