RKCP Information Technology




Hadoop

In this training, participants learn HDFS, Hadoop administration and maintenance, job scheduling, and MapReduce, and get an introduction to Hive, HBase, Flume, Sqoop, Oozie, and Pig.

Overview

Apache Hadoop, the open-source data management software that helps organizations analyze massive volumes of structured and unstructured data, is a widely discussed topic across the tech industry. This course teaches you to use the technology and prepares you for industry work. After attending, a developer or architect should be able to use Apache Hadoop with confidence.


Objective

In this training, participants learn:

  • What is Big Data
  • What is Hadoop and why is it important
  • Hadoop Distributed File System (HDFS)
  • Hadoop Deployment
  • Hadoop Administration and Maintenance
  • Map-Reduce
  • Hive, HBase, Flume, Sqoop, Oozie and Pig

Suggested Audience:

Developers, Architects, System Engineers

Total Duration - 3 Days


Prerequisites

Prior experience in database and programming is preferred.


Syllabus

  • 1. Introduction to Big Data
    • What kinds of data qualify as Big Data
    • Business use cases for Big Data
    • The need for Big Data in traditional data warehousing and BI
    • Big Data solutions
  • 2. Introduction to Hadoop
    • The volume of data processed today
    • What Hadoop is and why it is important
    • Comparison of Hadoop with traditional systems
    • History of Hadoop
    • Main components of Hadoop and its architecture
  • 3. Hadoop Distributed File System (HDFS)
    • HDFS overview and design
    • HDFS architecture
    • HDFS file storage
    • Component failures and recoveries
    • Block placement
    • Balancing the Hadoop cluster
  • 4. Hadoop Deployment
    • Different types of Hadoop deployment
    • Hadoop distribution options
    • Hadoop competitors
    • Hadoop installation procedure
    • Distributed cluster architecture
    • Lab: Hadoop Installation
  • 5. Working with HDFS
    • Ways to access data in HDFS
    • Common operations and commands of HDFS
    • Different HDFS commands
    • Internals of a file read in HDFS
    • Data copying with 'distcp'
    • Lab: Working with HDFS
  • 6. Hadoop Cluster Configuration
    • Configuration overview of Hadoop and the important configuration files
    • Parameters and values for configuration
    • Parameters of HDFS
    • MapReduce parameters
    • Environment setup of Hadoop
    • 'Include' and 'Exclude' configuration files
    • Lab: MapReduce Performance Tuning
  • 7. Hadoop Administration and Maintenance
    • Directory structures and files of the NameNode and DataNode
    • Filesystem image and Edit log
    • The Checkpoint Procedure
    • NameNode failure and recovery procedure
    • Safe Mode
    • Metadata and Data backup
    • Potential problems and solutions / What to look for
    • Adding and removing nodes
    • Lab: MapReduce Filesystem Recovery
  • 8. Job Scheduling
    • How to schedule Hadoop Jobs on the same cluster
    • Default Hadoop FIFO Scheduler
    • Fair Scheduler and its configuration
  • 9. Map-Reduce Abstraction
    • What MapReduce is and why it is popular
    • The big picture of MapReduce
    • MapReduce process and terminology
    • MapReduce components, failures and recoveries
    • Working with MapReduce
    • Lab: Working with MapReduce
  • 10. Programming MapReduce Jobs
    • Java MapReduce implementation
    • Map() and Reduce() methods
    • Code for calling Java MapReduce
    • Lab: Programming Word Count
  • 11. Input/Output Formats and Conversion Between Different Formats
    • Default Input and Output formats
    • Sequence File structure
    • Sequence File Input and Output formats
    • Sequence File access via Java API and HDFS
    • MapFile
    • Lab: Input Format
    • Lab: Format Conversion
  • 12. MapReduce Features
    • Joining Data Sets in MapReduce Jobs
    • How to write a Map-Side Join
    • How to write a Reduce-Side Join
    • MapReduce Counters
    • Built-in and user-defined counters
    • Retrieving MapReduce counters
    • Lab: Map-Side Join
    • Lab: Reduce-Side Join
  • 13. Introduction to Hive, HBase, Flume, Sqoop, Oozie and Pig
    • Hive as a data warehouse infrastructure
    • HBase as the Hadoop Database
    • Using Pig as a scripting language for Hadoop
  • 14. Hadoop Case studies
    • How different organizations are using Hadoop cluster in their infrastructure