Learning how to tame the Big Data with Hadoop and related technologies

Hadoop

Hadoop is an open source software platform for distributed storage and distributed processing of very large datasets on computer clusters built from commodity hardware
Why Hadoop?
- Data is too big
- Vertical scaling isn't an option
  - Disk seek times
  - Hardware failures
  - Processing times
- Horizontal scaling is linear
- You can do much more instead of just batch processing

Installation

Download Virtual Box from https://www.virtualbox.org/
Download image of Hadoop to run on Virtual Box
- (Horton Works Data Platform) HDP 2.5 Sandbox is preffered because it boots up faster than new versions
  - Download from https://hortonworks.com/downloads/#sandbox
Import the image into Virtual Box
Once you bootup, you will have CentOS instance that has Hadoop up and running
We can use CLI, it also has browser interface
- Ambari is available to easily navigate and manage different systems on Hadoop
- Goto http://localhost:8888
Launch Dashboard and login to Ambari
- Username: maria_dev
- Password: maria_dev
Trouble shooting
- Enable virtualization in your BIOS
- Disable Hyper-V acceleration in Windows