Hadoop is a free, Java-based programming framework designed to support the processing of large data sets in a distributed computing environment, typically built from commodity hardware.
The rise of big data as an essential part of business strategy, used to modernise operations and better serve customers, has in part been driven by the emergence of Hadoop.
Hadoop is part of the Apache Software Foundation, which supports the development of open-source software projects.
At its core, Hadoop consists of a storage part called the Hadoop Distributed File System (HDFS), and a processing part called MapReduce.
Basically, Hadoop works by splitting large files into blocks, which are then distributed across the nodes in a cluster and processed in parallel.
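To give a sense of what that block structure looks like in practice, the sketch below uses the HDFS Java client API to print the blocks a single file has been split into, along with the nodes that hold each one. It assumes a reachable cluster whose settings are picked up from the standard configuration files, and the file path used is a hypothetical placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // connects to the default file system (HDFS)

        Path file = new Path("/data/logs.txt");     // hypothetical path, for illustration only
        FileStatus status = fs.getFileStatus(file);

        // Ask the namenode which blocks make up the file and where they live.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```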
The base framework is made up of Hadoop Common, which contains libraries and utilities for the other Hadoop modules; HDFS, a distributed file system that stores data on commodity machines; YARN, which works as a resource management platform; and MapReduce, a programming model for large-scale data processing.
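To make the MapReduce part of the model concrete, here is a minimal sketch of the canonical word-count job, closely following the standard Hadoop tutorial example: the map phase emits a count of one for each word it sees, and the reduce phase sums those counts per word. The class names and the input and output paths (taken from the command line) are illustrative rather than tied to any particular distribution.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in every input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer to pre-aggregate on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not already exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```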
The MapReduce and HDFS components of Hadoop were originally inspired by Google's papers on the Google File System and MapReduce, published in 2003 and 2004 respectively.
Why is it called Hadoop?
Doug Cutting, the creator of Hadoop, named it after his son’s toy elephant. This is why many of the logos of Hadoop vendors, and of Hadoop itself, are elephants.
Java is the most common language used with the Hadoop framework, although there is some native code written in C, and command-line utilities are written as shell scripts.
Hadoop can be used for low-cost storage and active data archiving, as a staging area for a data warehouse and analytics store, as a data lake, as a sandbox for discovery and analysis, and for recommendation systems.
Companies such as Facebook, LinkedIn, Netflix and eBay are all users of Hadoop.
The Hadoop ecosystem is extremely large, including vendors such as Hortonworks, Cloudera, MapR, and Teradata, among many others.
Some of the key software components include Cassandra, a distributed database system; Spark, a cluster computing framework with in-memory analytics; and Oozie, a Hadoop job scheduler.