Hadoop is an open-source software framework from Apache designed for handling massive amounts of data. Apache Hadoop provides distributed storage and distributed processing of very large data sets (Big Data) on computer clusters built from commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines, or racks of machines) are commonplace and should therefore be handled automatically in software by the framework.
The core of Apache Hadoop consists of a storage part (the Hadoop Distributed File System, HDFS) and a processing part (MapReduce). Hadoop splits files into large blocks (64 MB or 128 MB by default) and distributes the blocks among the nodes in the cluster. To process the data, Hadoop MapReduce transfers code (packaged as JAR files) to the nodes that hold the required data, and those nodes then process it in parallel. This approach takes advantage of data locality, allowing the data to be processed faster and more efficiently through distributed processing than with a more conventional supercomputer architecture that relies on a parallel file system in which computation and data are connected via high-speed networking.
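As a concrete illustration, here is a minimal sketch of a word-count job written against the Hadoop Java MapReduce API. The class name and the input/output paths are placeholder assumptions; the point is that the framework splits the input into blocks, ships the JAR to the nodes holding those blocks, and schedules map tasks close to the data.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs on the nodes holding the HDFS blocks and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word across all mappers.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);      // this JAR is shipped to the data nodes
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/user/demo/input"));   // placeholder paths
    FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Running it amounts to packaging the class into a JAR and submitting it with the hadoop jar command.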
Hadoop is
- designed for massive scale
- designed to run on commodity servers
- designed to recover from failures
- built around a parallel programming model
- designed to handle structured and unstructured data
- used by companies such as Yahoo, Facebook, eBay, and LinkedIn
Apache Hadoop is an open source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer.
Hadoop changes the economics and the dynamics of large-scale computing. Its impact can be boiled down to four salient characteristics.
- Scalable - A cluster can be expanded by adding new servers or resources without having to move, reformat, or change the dependent analytic workflows or applications.
- Flexible - Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways enabling deeper analyses than any one system can provide.
- Cost Effective - Hadoop brings massively parallel computing to commodity servers. The result is a sizable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
- Fault Tolerant - When you lose a node, the system redirects work to another node holding a copy of the data and continues processing without missing a beat.
Hadoop Common :-
A module containing the utilities that support the other Hadoop components.
Hadoop Distributed File System (HDFS):-
A file system that provides reliable data storage and access across all the nodes in a Hadoop cluster. It links together the file systems on many local
nodes to create a single file system.
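As a brief sketch, an application can treat the whole cluster as one file system through the Java FileSystem API; the NameNode URI and the file path below are placeholder assumptions.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "hdfs://namenode:8020" is a placeholder for the cluster's NameNode address.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path file = new Path("/user/demo/hello.txt");

    // Write: the client streams data; HDFS splits it into blocks and replicates them.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back as if it were a single local file.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }

    fs.close();
  }
}
```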
MapReduce:-
A framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines,
in a reliable, fault-tolerant manner.
YARN (Yet Another Resource Negotiator) :-
The next-generation MapReduce, which assigns CPU, memory and storage
to applications running on a Hadoop cluster. It enables application frameworks
other than MapReduce to run on Hadoop, opening up a wealth of possibilities.
Hadoop Data Access
HBase :-
A column-oriented, non-relational (NoSQL) database that runs on top of HDFS and is often used for sparse data sets.
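A minimal sketch of storing and reading a single cell through the HBase Java client API (1.x and later); the table, column family and row key here are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {  // hypothetical table

      // Store one sparse cell: row key, column family, qualifier, value.
      Put put = new Put(Bytes.toBytes("user-001"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("a@example.com"));
      table.put(put);

      // Read it back by row key.
      Result result = table.get(new Get(Bytes.toBytes("user-001")));
      byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
      System.out.println(Bytes.toString(value));
    }
  }
}
```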
Flume :-
A distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of log data. Its main goal is to deliver data from applications to HDFS.
Hive :-
A Hadoop runtime component that allows those fluent with SQL to write Hive Query Language (HQL) statements, which are similar to SQL statements. These are broken down into MapReduce jobs and executed across the cluster.
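For instance, HQL can be submitted from a Java program through Hive's JDBC interface (HiveServer2). The connection URL, credentials and table name below are placeholder assumptions, and the hive-jdbc driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC URL; host, port and database are placeholders.
    String url = "jdbc:hive2://hiveserver:10000/default";

    try (Connection con = DriverManager.getConnection(url, "hive", "");
         Statement stmt = con.createStatement()) {

      // An HQL query; Hive compiles it into one or more MapReduce jobs.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page");

      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```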
Pig :-
A programming language designed to handle any type of data, helping users focus more on analyzing large data sets and less on writing map programs and reduce programs.
HCatalog :-
A table and storage management service for Hadoop data that presents a table abstraction so the user does not need to know where or how the data is stored.
Avro :-
An Apache open source project that provides data serialization and data exchange services for Hadoop.
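A small sketch of Avro serialization from Java, using a generic record and a hypothetical two-field schema; the schema travels with the data file, which is what makes the format convenient for data exchange.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
  public static void main(String[] args) throws Exception {
    // A hypothetical schema describing one record type.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Alice");
    user.put("age", 30);

    File file = new File("users.avro");

    // Write the record in Avro's compact binary container format.
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);
      writer.append(user);
    }

    // Read it back; the embedded schema is used to decode the records.
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      while (reader.hasNext()) {
        System.out.println(reader.next());
      }
    }
  }
}
```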
Sqoop :-
An ELT tool to support the transfer of data between Hadoop and structured data sources.
Spark :-
An open-source cluster computing framework with in-memory analytics performance that is up to 100 times faster than MapReduce, depending on the application.
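As a brief illustration, here is a word count written against the Spark 2.x Java API; the application name and HDFS paths are placeholders. Because intermediate results stay in memory, multi-stage jobs avoid the disk round-trips that a chain of MapReduce jobs would incur.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark-word-count");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load the input from HDFS; transformations below are computed in memory.
    JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input");   // placeholder path

    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);

    counts.saveAsTextFile("hdfs:///user/demo/output");                // placeholder path

    sc.stop();
  }
}
```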