Course Outline
Introduction
- Introduction to Cloud Computing and Big Data solutions
- Overview of Apache Hadoop Features and Architecture
Setting up Hadoop
- Planning a Hadoop cluster (on-premises, cloud, etc.)
- Selecting the OS and Hadoop distribution
- Provisioning resources (hardware, network, etc.)
- Downloading and installing the software
- Sizing the cluster for flexibility
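As a rough illustration of the sizing step, the following minimal sketch estimates raw storage and DataNode count from a logical data volume. The replication factor of 3 matches the HDFS default; the headroom fraction, the disk capacity per node, and the function name `nodes_needed` are assumptions chosen only for illustration, not recommendations from the course.

```python
import math


def nodes_needed(data_tb, replication=3, headroom=0.25, disk_per_node_tb=48.0):
    """Estimate raw storage (TB) and DataNode count for a logical data volume.

    replication      -- HDFS block replication factor (3 is the HDFS default)
    headroom         -- fraction of raw disk kept free for temp data and growth (assumed)
    disk_per_node_tb -- usable disk per DataNode in TB (assumed)
    """
    # Every block is stored `replication` times, and only (1 - headroom)
    # of the raw capacity is meant to be filled.
    raw_tb = data_tb * replication / (1.0 - headroom)
    return raw_tb, math.ceil(raw_tb / disk_per_node_tb)


raw, nodes = nodes_needed(100)
print(f"raw storage: {raw:.0f} TB, DataNodes: {nodes}")
```

A real sizing exercise would also account for CPU, memory, network, and expected data growth; this sketch covers storage only.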
Working with HDFS
- Understanding the Hadoop Distributed File System (HDFS)
- Overview of HDFS Command Reference
- Accessing HDFS
- Performing Basic File Operations on HDFS
- Using S3 as a complement to HDFS
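The basic file operations listed above map onto the `hdfs dfs` command-line tool. The sketch below only builds and prints the command lines, so it runs without a cluster. All paths, the file name `salaries.csv`, and the bucket name `example-bucket` are hypothetical, and the `s3a://` example assumes the S3A connector and credentials are already configured.

```python
import shlex


def hdfs_cmd(*args):
    """Return the argv list for one `hdfs dfs` invocation."""
    return ["hdfs", "dfs", *args]


# Hypothetical paths, chosen only for illustration.
commands = [
    hdfs_cmd("-mkdir", "-p", "/user/demo/input"),            # create a directory
    hdfs_cmd("-put", "salaries.csv", "/user/demo/input/"),   # upload a local file
    hdfs_cmd("-ls", "/user/demo/input"),                     # list a directory
    hdfs_cmd("-cat", "/user/demo/input/salaries.csv"),       # print file contents
    hdfs_cmd("-rm", "/user/demo/input/salaries.csv"),        # delete a file
    hdfs_cmd("-ls", "s3a://example-bucket/input/"),          # list an S3 path via S3A
]

for argv in commands:
    print(shlex.join(argv))
    # On a machine with a configured Hadoop client, each command could be
    # executed with: subprocess.run(argv, check=True)
```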
Overview of MapReduce
- Understanding Data Flow in the MapReduce Framework
- Map, Shuffle, Sort and Reduce
- Demo: Computing Top Salaries
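To make the map, shuffle, sort, and reduce stages concrete, here is a single-process sketch in plain Python. It is not the course demo and does not use the Hadoop API; the records, the per-department grouping, and the function names `map_phase` and `reduce_phase` are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records: (employee, department, salary).
records = [
    ("ana", "eng", 120), ("bo", "eng", 150), ("cy", "ops", 90),
    ("di", "ops", 110), ("ed", "eng", 135), ("flo", "ops", 95),
]


def map_phase(record):
    """Emit one (key, value) pair per record: department -> (salary, employee)."""
    name, dept, salary = record
    yield dept, (salary, name)


def reduce_phase(dept, values, n=2):
    """Keep the n highest (salary, employee) pairs seen for one key."""
    return dept, sorted(values, reverse=True)[:n]


# Map: turn every input record into key/value pairs.
pairs = [kv for rec in records for kv in map_phase(rec)]

# Shuffle and sort: order pairs by key so equal keys become adjacent,
# then group them, as the framework would do between mappers and reducers.
pairs.sort(key=itemgetter(0))
grouped = groupby(pairs, key=itemgetter(0))

# Reduce: one call per key, over all values collected for that key.
top = dict(reduce_phase(k, [v for _, v in group]) for k, group in grouped)
print(top)
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network; the data flow is the same.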
Working with YARN
- Understanding resource management in Hadoop
- Working with ResourceManager, NodeManager, Application Master
- Scheduling jobs under YARN
- Scheduling for large numbers of nodes and clusters
- Demo: Job scheduling
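One piece of arithmetic that often comes up when reasoning about YARN resource management is how many containers fit on a node. The sketch below is a simplification under stated assumptions: it considers memory only, and it assumes requests are rounded up to a multiple of the scheduler's minimum allocation (the value set by `yarn.scheduler.minimum-allocation-mb`), with node memory given by `yarn.nodemanager.resource.memory-mb`. Actual rounding depends on the scheduler and resource calculator in use, and the numbers are hypothetical.

```python
import math


def containers_per_node(node_mb, request_mb, min_alloc_mb):
    """Estimate how many equally sized containers fit on one NodeManager.

    node_mb      -- memory the NodeManager offers to containers
    request_mb   -- memory requested per container
    min_alloc_mb -- scheduler minimum allocation (assumed rounding unit)
    """
    # Assumed behaviour: the request is rounded up to the next multiple
    # of the minimum allocation before it is granted.
    granted_mb = math.ceil(request_mb / min_alloc_mb) * min_alloc_mb
    return granted_mb, node_mb // granted_mb


granted, count = containers_per_node(node_mb=65536, request_mb=3000, min_alloc_mb=1024)
print(f"each container is granted {granted} MB; {count} fit on the node")
```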
Integrating Hadoop with Spark
- Setting up storage for Spark (HDFS, Amazon S3, NoSQL, etc.)
- Understanding Resilient Distributed Datasets (RDDs)
- Creating an RDD
- Implementing RDD Transformations
- Demo: Implementing a Text Search Program for Movie Titles
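The idea behind RDD transformations, which are lazy until an action asks for results, can be sketched in plain Python with generators. This is an analogy, not Spark code and not the course demo; the movie titles and the `search` function are hypothetical. A comment shows a rough PySpark counterpart, assuming an existing SparkContext named `sc`.

```python
# Hypothetical movie titles, standing in for a text file loaded into an RDD.
titles = ["The Matrix", "Matrix Reloaded", "Toy Story", "The Story of Us", "Up"]


def search(lines, term):
    """Lazily select titles containing `term`, ignoring case."""
    term = term.lower()
    lowered = ((t, t.lower()) for t in lines)        # plays the role of map()
    hits = (t for t, low in lowered if term in low)  # plays the role of filter()
    return hits                                      # nothing has been computed yet


# list() plays the role of an action such as collect(): it forces evaluation.
matches = list(search(titles, "story"))
print(matches)

# Rough PySpark counterpart (assumes a SparkContext `sc` is available):
#   sc.parallelize(titles).filter(lambda t: "story" in t.lower()).collect()
```

Unlike these generators, an RDD is partitioned across the cluster and can be recomputed from its lineage if a partition is lost.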
Managing a Hadoop Cluster
- Monitoring Hadoop
- Securing a Hadoop cluster
- Adding and removing nodes
- Running a performance benchmark
- Tuning a Hadoop cluster to optimize performance
- Backup, recovery and business continuity planning
- Ensuring high availability (HA)
Upgrading and Migrating a Hadoop Cluster
- Assessing workload requirements
- Upgrading Hadoop
- Moving from on-premises to cloud and vice versa
- Recovering from failures
Troubleshooting
Summary and Conclusion
Requirements
- System administration experience
- Experience with Linux command line
- An understanding of big data concepts
Audience
- System administrators
- DBAs
Testimonials (5)
A lot of practical examples, different ways to approach the same problem, and sometimes not-so-obvious tricks for improving the current solution
Rafał - Nordea
Course - Apache Spark MLlib
very interactive...
Richard Langford
Course - SMACK Stack for Data Science
Sufficient hands-on, trainer is knowledgeable
Chris Tan
Course - A Practical Introduction to Stream Processing
Trainer's preparation & organization, and quality of materials provided on GitHub.
Mateusz Rek - MicroStrategy Poland Sp. z o.o.
Course - Impala for Business Intelligence
Get to learn Spark Streaming, Databricks and AWS Redshift