Explore Big Data tools and design pipelines using Hadoop, Spark, Kafka, and Hive.
Course Description
In the Big Data paradigm, we trust distributed systems to process information across server clusters, relying on a growing set of technologies to manage the massive volumes of data generated by social media, online transactions, web logs, and sensors. These technologies handle unstructured, semi-structured, and structured data, and they support batch processing, real-time analytics, and visualization. They are especially useful for reporting when a relational database approach is ineffective or too costly.
In this comprehensive introductory course for managers, analysts, architects, and developers, you will gain insight into cloud-based Big Data architectures. We will cover Hadoop, Spark, and SQL-based Big Data platforms such as Hive.
This course provides an overview of Big Data technologies and frameworks, including HDFS, MapReduce, Spark, Kafka, and Hive. In the final project, you will apply this knowledge to design a complete Big Data pipeline.
Topics
- Evolution of Big Data
- Big Data use cases
- Big Data applications architecture
- Understanding the Hadoop Distributed File System (HDFS)
- How the MapReduce framework works
- Introduction to HBase (Hadoop NoSQL database)
- Introduction to Apache Kafka
- Introduction to Spark and SparkSQL
- Developing Spark/SparkSQL applications
- Managing tables and query development in Hive
- Introduction to data pipelines
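To give a feel for the MapReduce programming model covered in the topics above, here is a minimal pure-Python sketch of the classic word-count job. The function names (`map_phase`, `shuffle`, `reduce_phase`) are illustrative only, not part of any Hadoop API; in a real cluster, Hadoop distributes each phase across many machines.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as Hadoop does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data tools", "big data pipelines"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts == {"big": 2, "data": 2, "tools": 1, "pipelines": 1}
```

The same map/shuffle/reduce structure underlies Hadoop MapReduce jobs and many Spark transformations, just scaled out over a cluster.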
Prerequisites / Skills Needed
Intermediate programming skills in Python and SQL
