Learning Outcome
- Understand what Big Data is, the challenges it poses, and how Hadoop proposes to solve them
- Work with and navigate a Hadoop cluster with ease
- Install and configure a Hadoop cluster on cloud services like Amazon Web Services (AWS)
- Understand the different phases of MapReduce in detail
- Write optimized Pig Latin instructions to perform complex data analysis
- Write optimized Hive queries to perform data analysis on simple and nested datasets
- Work with file formats such as SequenceFile and Avro
- Understand Hadoop architecture, Single Points of Failure (SPOF), Secondary/Checkpoint/Backup nodes, HA configuration and YARN
- Tune and optimize slow-running MapReduce jobs, Pig instructions and Hive queries
- Understand how joins work behind the scenes and write optimized join statements
- Wherever possible, students will be introduced to difficult questions that are asked in real Hadoop interviews
Course Covers
Hadoop Course Content
- Hadoop Overview, Architecture Considerations, Infrastructure, Platforms and Automation
Use case walkthrough
- ETL
- Log Analytics
- Real Time Analytics
HBase for Developers
NoSQL Introduction
- Traditional RDBMS approach
- NoSQL introduction
- Hadoop & HBase positioning
HBase Introduction
- What it is, what it is not, its history and common use-cases
- HBase Client – Shell, Exercise
HBase Architecture
- Building Components
- Storage, B+ tree, Log Structured Merge Trees
- Region Lifecycle
- Read/Write Path
HBase Schema Design
- Introduction to HBase schema
- Column Family, Rows, Cells, Cell timestamp
- Deletes
- Exercise – build a schema, load data, query data
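As a rough illustration of the schema concepts listed above (rows, column families, cells, cell timestamps and deletes), the sketch below models a table in plain Python. It is not the HBase client API. The class `ToyTable` and its methods are hypothetical names invented for this example, and the integer timestamps come from a fake counter rather than a real clock.

```python
import itertools


class ToyTable:
    """In-memory stand-in for one table; NOT the HBase client API."""

    def __init__(self):
        # (row key, "family:qualifier") -> {timestamp: value}
        self._cells = {}
        # Fake, strictly increasing timestamps (an assumption of this sketch).
        self._clock = itertools.count(1)

    def put(self, row, column, value):
        """Store a new version of a cell and return its timestamp."""
        ts = next(self._clock)
        self._cells.setdefault((row, column), {})[ts] = value
        return ts

    def delete(self, row, column):
        """Write a tombstone (None) instead of erasing older versions."""
        return self.put(row, column, None)

    def get(self, row, column):
        """Return the newest version, or None if absent or deleted."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        return versions[max(versions)]

    def versions(self, row, column):
        """Return all stored (timestamp, value) pairs, newest first."""
        stored = self._cells.get((row, column), {})
        return sorted(stored.items(), reverse=True)
```

A short usage example: after `t = ToyTable()`, two calls to `t.put("user1", "info:name", ...)` leave two versions of the same cell, and `t.get` returns the newer one until `t.delete` masks it with a tombstone.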
HBase Java API – Exercises
- Connection
- CRUD API
- Scan API
- Filters
- Counters
- HBase MapReduce
- HBase Bulk Load
HBase Operations, Cluster Management
- Performance Tuning
- Advanced Features
- Exercise
- Recap and Q&A
MapReduce for Developers
Introduction
- Traditional Systems / Why Big Data / Why Hadoop
- Hadoop Basic Concepts/Fundamentals
Hadoop in the Enterprise
- Where Hadoop Fits in the Enterprise
- Review Use Cases
Architecture
- Hadoop Architecture & Building Blocks
- HDFS and MapReduce
Hadoop CLI
- Walkthrough
- Exercise
MapReduce Programming
- Fundamentals
- Anatomy of MapReduce Job Run
- Job Monitoring, Scheduling
- Sample Code Walk Through
- Hadoop API Walk Through
- Exercise
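As a rough illustration of the job anatomy covered in this module, the sketch below simulates the map, shuffle and reduce phases of a word count in plain Python. It is not Hadoop API code and runs on an in-memory list rather than HDFS. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group values by key, sorted by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Run the three phases in sequence on an in-memory list of lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

In a real cluster the map and reduce calls run in parallel on different nodes, and the shuffle moves data across the network; the sketch only mirrors the order of the phases.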
MapReduce Formats
- Input Formats, Exercise
- Output Formats, Exercise
Hadoop File Formats
MapReduce Design Considerations
MapReduce Algorithms
- Walkthrough of 2-3 Algorithms
MapReduce Features
- Counters, Exercise
- Map Side Join, Exercise
- Reduce Side Join, Exercise
- Sorting, Exercise
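To give a feel for the reduce-side join listed above, the sketch below tags each record with its source, groups the tagged records by join key, and pairs them up per key. It is a plain-Python simulation under simplifying assumptions: records are `(key, value)` tuples, the join is an inner join, and the names `tag_records` and `reduce_side_join` are hypothetical, not part of any Hadoop API.

```python
from collections import defaultdict


def tag_records(records, tag):
    """Map step: emit (join_key, (tag, record)) so the reducer knows the source."""
    return [(record[0], (tag, record)) for record in records]


def reduce_side_join(left, right):
    """Inner join of two lists of (key, value) tuples on their key."""
    grouped = defaultdict(lambda: {"L": [], "R": []})
    # Shuffle step: all tagged records that share a key land in one group.
    for key, (tag, record) in tag_records(left, "L") + tag_records(right, "R"):
        grouped[key][tag].append(record)
    joined = []
    # Reduce step: pair every left record with every right record per key.
    for key in sorted(grouped):
        for left_record in grouped[key]["L"]:
            for right_record in grouped[key]["R"]:
                joined.append((key, left_record[1], right_record[1]))
    return joined
```

A map-side join avoids the shuffle by loading the smaller dataset into memory on every mapper, which is why it only suits cases where one side is small.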
Use Case A (Long Exercise)
- Input Formats, Exercise
- Output Formats, Exercise
MapReduce Testing
Hadoop Ecosystem
- Oozie
- Flume
- Sqoop
- Exercise 1 (Sqoop)
- Streaming API
- Exercise 2 (Streaming API)
- HCatalog
- ZooKeeper
HBase Introduction
- Introduction
- HBase Architecture
VIEW Types
- Default Views
- Overridden Views
- Normal Views
MapReduce Performance Tuning
Development Best Practice and Debugging
Apache Hadoop for Administrators
Hadoop Fundamentals and Architecture
- Why Hadoop, Hadoop Basics and Hadoop Architecture
- HDFS and MapReduce
Hadoop Ecosystems Overview
- Hive
- HBase
- ZooKeeper
- Pig
- Mahout
- Flume
- Sqoop
- Oozie
Hardware and Software requirements
- Hardware, Operating System and Other Software
- Management Console
Deploy Hadoop ecosystem services
- Hive
- ZooKeeper
- HBase
- Administration
- Pig
- Mahout
- MySQL
- Setup Security
Enable Security – Configure Users, Groups, Secure HDFS, MapReduce, HBase and Hive
- Configuring User and Groups
- Configuring Secure HDFS
- Configuring Secure MapReduce
- Configuring Secure HBase and Hive
Manage and Monitor your cluster
Command Line Interface
Troubleshooting your cluster
Introduction to Big Data and Hadoop
Hadoop Overview
- Why Hadoop
- Hadoop Basic Concepts
- Hadoop Ecosystem – MapReduce, Hadoop Streaming, Hive, Pig, Flume, Sqoop, HBase, Oozie, Mahout
- Where Hadoop fits in the Enterprise
- Review use cases
Apache Hive & Pig for Developers
Overview of Hadoop
- Why Hadoop
- Hadoop Basic Concepts
- Hadoop Ecosystem – MapReduce, Hadoop Streaming, Hive, Pig, Flume, Sqoop, HBase, Oozie, Mahout
- Where Hadoop fits in the Enterprise
- Review use cases
Overview of Hadoop
- Big Data and the Distributed File System
- MapReduce
Hive Introduction
- Why Hive?
- Compare vs SQL
- Use Cases
Hive Architecture – Building Blocks
- Hive CLI and Language (Exercise)
- HDFS Shell
- Hive CLI
- Data Types
- Hive Cheat-Sheet
- Data Definition Statements
- Data Manipulation Statements
- Select, Views, GroupBy, SortBy/DistributeBy/ClusterBy/OrderBy, Joins
- Built-in Functions
- Union, Sub Queries, Sampling, Explain
Hive Use Case Implementation (Exercise)
- Use Case 1
- Use Case 2
- Best Practices
Advanced Features
- Transform and Map-Reduce Scripts
- Custom UDF
- UDTF
- SerDe
- Recap and Q&A
Pig Introduction
- Position Pig in Hadoop ecosystem
- Why Pig and not MapReduce
- Simple example (slides) comparing Pig and MapReduce
- Who is using Pig now and what are the main use cases
- Pig Architecture
- Discuss high level components of Pig
- Pig Grunt – How to Start and Use
Pig Latin Programming
- Data Types
- Cheat sheet
- Schema
- Expressions
- Commands and Exercise
- Load, Store, Dump, Relational Operations, Foreach, Filter, Group, Order By, Distinct, Join, Cogroup, Union, Cross, Limit, Sample, Parallel
Use Cases (working exercise)
- Use Case 1
- Use Case 2
- Use Case 3 (compare pig and hive)
Advanced Features, UDFs
Best Practices and common pitfalls
Mahout & Machine Learning
- Mahout Overview
- Mahout Installation
- Introduction to the Math Library
- Vector implementation and Operations (Hands-on exercise)
- Matrix Implementation and Operations (Hands-on exercise)
- Anatomy of a Machine Learning Application
Classification
- Introduction to Classification
- Classification Workflow
- Feature Extraction
- Classification Techniques (Hands-on exercise)
- Evaluation (Hands-on exercise)
Clustering
- Use Cases
- Clustering algorithms in Mahout
- K-means clustering (Hands-on exercise)
- Canopy clustering (Hands-on exercise)
Clustering
- Mixture Models
- Probabilistic Clustering – Dirichlet (Hands-on exercise)
- Latent Dirichlet Allocation (Hands-on exercise)
- Evaluating and Improving Clustering quality (Hands-on exercise)
- Distance Measures (Hands-on exercise)
Recommendation Systems
- Overview of Recommendation Systems
- Use cases
- Types of Recommendation Systems
- Collaborative Filtering (Hands-on exercise)
- Recommendation System Evaluation (Hands-on exercise)
- Similarity Measures
- Architecture of Recommendation Systems
- Wrap Up
Hadoop Course Duration
Track | Regular Track | Weekend Track | Fast Track |
---|---|---|---|
Course Duration | 45 – 60 Days | 8 Weekends | 5 Days |
Hours | 2 hours a day | 3 hours a day | 6+ hours a day |
Training Mode | Live Classroom | Live Classroom | Live Classroom |
Online and Offline Mode Available