Hadoop is an open-source software framework from Apache designed for handling massive amounts of data. Apache Hadoop provides distributed storage and distributed processing of very large data sets (Big Data) on computer clusters built from commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines, or racks of machines) are commonplace and should therefore be handled automatically in software by the framework.
The core of Apache Hadoop consists of a storage part (the Hadoop Distributed File System, HDFS) and a processing part (MapReduce). Hadoop splits files into large blocks (64 MB or 128 MB by default) and distributes the blocks among the nodes in the cluster. To process the data, Hadoop MapReduce transfers code (packaged as JAR files) to the nodes that hold the required data, and those nodes then process it in parallel. This approach takes advantage of data locality, allowing the data to be processed faster and more efficiently through distributed processing than with a more conventional supercomputer architecture that relies on a parallel file system in which computation and data are connected via high-speed networking.
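As a concrete illustration, here is a minimal sketch of a word-count job written against the Hadoop Java MapReduce API. The class name and the input/output paths are placeholder assumptions; the point is that the framework splits the input into blocks, ships the JAR to the nodes holding those blocks, and schedules map tasks close to the data.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: runs on the nodes holding the HDFS blocks and emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word across all mappers.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);      // this JAR is shipped to the data nodes
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/user/demo/input"));   // placeholder paths
    FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Running it amounts to packaging the class into a JAR and submitting it with the hadoop jar command.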
Hadoop is
- designed for massive scale
- designed to run on commodity servers
- designed to recover from failures
- built around a parallel programming model
- designed to handle structured and unstructured data
- used by companies such as Yahoo, Facebook, eBay, and LinkedIn
Apache Hadoop is an open source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer.
Hadoop changes the economics and the dynamics of large-scale computing. Its impact can be boiled down to four salient characteristics.
- Scalable - A cluster can be expanded by adding new servers or resources without having to move, reformat, or change the dependent analytic workflows or applications.
- Flexible - Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways enabling deeper analyses than any one system can provide.
- Cost Effective - Hadoop brings massively parallel computing to commodity servers. The result is a sizable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
- Fault Tolerant - When you lose a node, the system redirects work to another node holding a copy of the data and continues processing without missing a beat.
Hadoop Common :-
A module containing the utilities that support the other Hadoop components.
Hadoop Distributed File System (HDFS):-
A file system that provides reliable data storage and access across all the nodes in a Hadoop cluster. It links together the file systems on many local
nodes to create a single file system.
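As a brief sketch, an application can treat the whole cluster as one file system through the Java FileSystem API; the NameNode URI and the file path below are placeholder assumptions.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "hdfs://namenode:8020" is a placeholder for the cluster's NameNode address.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path file = new Path("/user/demo/hello.txt");

    // Write: the client streams data; HDFS splits it into blocks and replicates them.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read the file back as if it were a single local file.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }

    fs.close();
  }
}
```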
MapReduce:-
A framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines,
in a reliable, fault-tolerant manner.
YARN (Yet Another Resource Negotiator) :-
The next-generation MapReduce, which assigns CPU, memory and storage
to applications running on a Hadoop cluster. It enables application frameworks
other than MapReduce to run on Hadoop, opening up a wealth of possibilities.
Hadoop Data Access
HBase :-
A column-oriented, non-relational (NoSQL) database that runs on top of HDFS and is often used for sparse data sets.
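A minimal sketch of storing and reading a single cell through the HBase Java client API (1.x and later); the table, column family and row key here are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {  // hypothetical table

      // Store one sparse cell: row key, column family, qualifier, value.
      Put put = new Put(Bytes.toBytes("user-001"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("a@example.com"));
      table.put(put);

      // Read it back by row key.
      Result result = table.get(new Get(Bytes.toBytes("user-001")));
      byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"));
      System.out.println(Bytes.toString(value));
    }
  }
}
```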
Flume :-
A distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of log data. Its main goal is to deliver data from applications to HDFS.
Hive :-
A Hadoop runtime component that allows those fluent with SQL to write Hive Query Language (HQL) statements, which are similar to SQL statements. These are broken down into MapReduce jobs and executed across the cluster.
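For instance, HQL can be submitted from a Java program through Hive's JDBC interface (HiveServer2). The connection URL, credentials and table name below are placeholder assumptions, and the hive-jdbc driver must be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC URL; host, port and database are placeholders.
    String url = "jdbc:hive2://hiveserver:10000/default";

    try (Connection con = DriverManager.getConnection(url, "hive", "");
         Statement stmt = con.createStatement()) {

      // An HQL query; Hive compiles it into one or more MapReduce jobs.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page");

      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```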
Pig :-
A programming language designed to handle any type of data, helping users focus more on analyzing large data sets and less on writing map programs and reduce programs.
HCatalog :-
A table and storage management service for Hadoop data that presents a table abstraction so the user does not need to know where or how the data is stored.
Avro :-
An Apache open source project that provides data serialization and data exchange services for Hadoop.
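A small sketch of Avro serialization from Java, using a generic record and a hypothetical two-field schema; the schema travels with the data file, which is what makes the format convenient for data exchange.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
  public static void main(String[] args) throws Exception {
    // A hypothetical schema describing one record type.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

    GenericRecord user = new GenericData.Record(schema);
    user.put("name", "Alice");
    user.put("age", 30);

    File file = new File("users.avro");

    // Write the record in Avro's compact binary container format.
    try (DataFileWriter<GenericRecord> writer =
             new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, file);
      writer.append(user);
    }

    // Read it back; the embedded schema is used to decode the records.
    try (DataFileReader<GenericRecord> reader =
             new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
      while (reader.hasNext()) {
        System.out.println(reader.next());
      }
    }
  }
}
```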
Sqoop :-
An ELT tool to support the transfer of data between Hadoop and structured data sources.
Spark :-
An open-source cluster computing framework with in-memory analytics performance that is up to 100 times faster than MapReduce, depending on the application.
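As a brief illustration, here is a word count written against the Spark 2.x Java API; the application name and HDFS paths are placeholders. Because intermediate results stay in memory, multi-stage jobs avoid the disk round-trips that a chain of MapReduce jobs would incur.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("spark-word-count");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Load the input from HDFS; transformations below are computed in memory.
    JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input");   // placeholder path

    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey(Integer::sum);

    counts.saveAsTextFile("hdfs:///user/demo/output");                // placeholder path

    sc.stop();
  }
}
```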