1. Hadoop
- Why Big Data and Hadoop?
- Problems in Data-Driven Businesses
- How Hadoop Solves Them and Why Big Data Solutions Are Needed
- Hadoop Fundamentals
- What Hadoop Comprises: Subprojects and Ecosystem
- Core Hadoop Components
- Apache Subprojects
- Hadoop Ecosystem
2. HDFS
- HDFS
- HDFS Features
- HDFS Architecture – Non HA
- HDFS Architecture – HA
- Writing and Reading Files in HDFS
- NameNode Memory and Load Handling
- Basic HDFS Security
- HDFS Commands
- Hands-on: Writing and Reading Files in HDFS, Permissions, Viewing Blocks, and Other Basic HDFS Operations
3. YARN
- MapReduce and YARN – Basics
- Why a Computational Framework
- YARN Architecture
- MapReduce Architecture and Hands-on
- Spark Architecture
- How YARN executes MR and Spark jobs
- How to see YARN Applications in Web UIs and Shell
- YARN Application Logs
4. Sqoop
- Importing RDBMS Data to Hadoop
- Introduction to Apache Sqoop
- Sqoop Architecture
- Using Sqoop to import RDBMS Table to HDFS
- Change the Delimiter and File Format of imported Tables
- Control which columns are imported
- Sqoop Performance Improvement
- Import and Export using Sqoop
- Incremental Data Load using Sqoop
5. Hive
- Hive Architecture and Data model
- How to query Hive and Impala/Tez
- How Hive and Impala/Tez differ from an RDBMS
- Usage of the Hive Metastore by Hive and Impala
- HiveQL and Impala SQL for query operations
- Managed and External Tables
- Introduction to Hue
- Create Tables using Hue
- Load Data into Hive Tables using Hive, Impala, and Sqoop import
- Overview of Partitions
- Partitions in Hive and Impala
- Dealing with Hive Partition Tables
6. Hadoop Data Formats
- Introduction to Data Formats
- Various Data Formats
- Introduction to Avro
- Parquet
- Evolution of Avro Schema – Compatibility
- Extracting Metadata and Data from an Avro Data File
- Using Avro with Hive, Sqoop
- Using Parquet with Hive, Sqoop
7. Spark
- What is Spark
- Spark Architecture
- Introduction to RDDs
- Transformations & Actions
- DataFrame API
- Spark Execution Framework