There is a huge rise in unstructured data, and there is so much untapped information stored in it.
Hadoop is central to big data. Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware. Hadoop is an Apache top-level project built and used by a global community of contributors and users. Let's first start with the benefits of Hadoop:
1) Distributed storage: this feature enables massive clusters and large-scale processing
2) Scalable
3) Flexible
4) Fault tolerant
5) Intelligent
So let's go a little high level. I will start with some basic, high-level big data technologies and then try to go a little deeper.
Let's describe the Hadoop technology stack.
Core to Hadoop are HDFS, MapReduce, and YARN.
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop instance typically has a single namenode; a cluster of datanodes forms the HDFS cluster. The setup is flexible because not every node needs a datanode to be present. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses TCP/IP sockets for communication.
HDFS is self-healing, high-bandwidth clustered storage. If we put a petabyte file into Hadoop, HDFS will break the file up into blocks and distribute them across all the nodes in the cluster.
When we set up HDFS, we set a replication factor; the default is 3. This means that when we put a file into Hadoop, it makes 3 copies of every block, spread across all the nodes in the cluster. So if we lose a node, we still have all the data, and HDFS re-replicates the blocks that were on that node onto the other nodes. That's the fault tolerance in Hadoop.
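To make this concrete, here is a minimal sketch using the HDFS Java API, assuming a reachable cluster configured through the usual core-site.xml/hdfs-site.xml on the classpath; the path name is made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        // Loads core-site.xml/hdfs-site.xml from the classpath;
        // dfs.replication defaults to 3 unless overridden there.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS splits it into blocks and replicates
        // each block according to the replication factor.
        Path file = new Path("/demo/data.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello HDFS");
        }

        // The replication factor can also be changed per file afterwards.
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}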
How does it do this?
As I said earlier, there are namenodes and datanodes: there is one namenode, and the rest are datanodes. The namenode is the metadata server; it holds in memory the location of every block on every node. And how do we retrieve and process the data itself? Here comes MapReduce: it's a 2-step process, a mapper and a reducer.
MapReduce is the retrieval and processing. Programmers write the mapper function, which goes out and tells us which data points to retrieve. The reducer then takes all that data and aggregates it.
YARN (Yet Another Resource Negotiator) is essentially MapReduce 2.0. The fundamental idea is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM).
Hadoop essentials:
· Pig
· Hive
· HBase
· Avro
· ZooKeeper
· Oozie
· Sqoop
So let's describe them in detail.
Pig is a high-level data flow scripting language. It sits at the data access layer of the Hadoop architecture. It's all about getting the data: loading it up, transforming it, filtering it, and returning the results.
There are 2 core components to Pig:
1) Pig Latin: the programming language
2) Pig runtime: it compiles Pig Latin and converts it into MapReduce jobs
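Here is a minimal sketch of that load-filter-transform-return flow, written in Pig Latin and driven through Pig's embedded Java API (PigServer); the file name and field layout are assumptions for illustration:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigDemo {
    public static void main(String[] args) throws Exception {
        // Local mode for illustration; ExecType.MAPREDUCE would run on the cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Load, filter, transform: the Pig runtime compiles these
        // Pig Latin statements into MapReduce jobs.
        pig.registerQuery("logs = LOAD 'access_log.txt' AS (user:chararray, bytes:int);");
        pig.registerQuery("big = FILTER logs BY bytes > 1024;");
        pig.registerQuery("names = FOREACH big GENERATE user;");

        // Return the results by storing them back into the file system.
        pig.store("names", "heavy_users");
        pig.shutdown();
    }
}

The same statements could equally go into a .pig script and be run from the pig command line.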
Hive is another data access project. Hive is a way to project structure onto the data inside the cluster, so it is really a database and data warehouse built on top of Hadoop. It contains a query language called HiveQL, which is extremely similar to SQL.
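Because HiveQL reads like SQL, one common way to query Hive is over plain JDBC. A sketch, assuming a HiveServer2 endpoint on localhost:10000, the hive-jdbc driver jar on the classpath, and a hypothetical table named pages:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDemo {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; host, port, and database are assumptions.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL, but Hive compiles it into jobs
            // that run over the data stored in HDFS.
            ResultSet rs = stmt.executeQuery(
                "SELECT url, COUNT(*) AS hits FROM pages GROUP BY url");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}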
Under the data storage layer
HBase is an open source, non-relational, distributed
database modeled after Google's BigTable and written in Java. It runs on top of HDFS
(Hadoop Distributed Filesystem), providing
BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data (small amounts of information caught within a large
collection of empty or unimportant data, such as finding the 50 largest items
in a group of 2 billion records, or finding the non-zero items representing
less than 0.1% of a huge collection).
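A minimal sketch with the HBase Java client (the Connection/Table API from HBase 1.x onward); the table 'metrics' and column family 'd' are made-up names, and the table is assumed to already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("metrics"))) {

            // Sparse storage: only the cells you actually write exist.
            Put put = new Put(Bytes.toBytes("row-42"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("count"), Bytes.toBytes(7L));
            table.put(put);

            // Random-access read by row key, served from HDFS-backed storage.
            Result result = table.get(new Get(Bytes.toBytes("row-42")));
            long count = Bytes.toLong(
                result.getValue(Bytes.toBytes("d"), Bytes.toBytes("count")));
            System.out.println("count = " + count);
        }
    }
}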
Apache
Cassandra is an open source distributed database
management system designed to handle large amounts of data across many
commodity servers, providing high availability with no single point of failure.
Cassandra offers robust support for clusters spanning multiple datacenters, with asynchronous masterless
replication allowing low latency operations for all clients.
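To see the masterless, multi-datacenter replication from the client side, here is a sketch using the DataStax Java driver (3.x API); the contact point and the datacenter names dc1/dc2 are assumptions:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CassandraDemo {
    public static void main(String[] args) {
        // Any node can take the request: Cassandra is masterless.
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")   // assumed contact point
                .build();
             Session session = cluster.connect()) {

            // NetworkTopologyStrategy replicates each row per datacenter:
            // 3 copies in dc1, 2 copies in dc2 (names are illustrative).
            session.execute(
                "CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                + "{'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 2}");

            session.execute("CREATE TABLE IF NOT EXISTS demo.users "
                + "(id int PRIMARY KEY, name text)");
            session.execute("INSERT INTO demo.users (id, name) VALUES (1, 'ada')");
        }
    }
}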
Under the interaction, visualization, execution, and development layer
HCatalog is a table and storage management layer
for Hadoop that enables users with different data processing tools – Apache
Pig, Apache MapReduce, and Apache Hive – to more easily read and write data on
the grid. HCatalog’s table abstraction presents users with a relational view of
data in the Hadoop Distributed File System (HDFS) and ensures that users need
not worry about where or in what format their data is stored. HCatalog displays
data from RCFile format, text files, or sequence files in a tabular view. It
also provides REST APIs so that external systems can access these tables’
metadata.
Lucene is a free, open-source information retrieval software library.
Under the data serialization layer
Avro is a remote
procedure call and serialization framework developed within Apache's
Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a
compact binary format. In other words, Avro is a data serialization system. Its
primary use is in Apache Hadoop,
where it can provide both a serialization format for persistent data, and a
wire format for communication between Hadoop nodes, and from client programs to
the Hadoop services.
It is similar to Thrift,
but does not require running a code-generation program when a schema changes (unless desired for statically-typed languages).
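A small sketch of both halves of that claim: the data type defined in JSON, and a record serialized into Avro's compact binary container format, using the generic (no code generation) Java API; the User schema is made up:

import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroDemo {
    public static void main(String[] args) throws Exception {
        // The data type is defined in JSON, as described above.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "ada");
        user.put("age", 36);

        // Records are serialized into a compact binary container format.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<>(schema))) {
            writer.create(schema, out);
            writer.append(user);
        }
        System.out.println("serialized " + out.size() + " bytes");
    }
}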
Under the data intelligence layer
Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets.
Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused primarily in the areas of collaborative filtering, clustering, and classification.
Apache
Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Under the management, monitoring, and orchestration layer
ZooKeeper is a centralized service
for maintaining configuration information, naming, providing distributed
synchronization, and providing group services. All of these kinds of services
are used in some form or another by distributed applications.
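The primitive underneath all of those services is the znode, a small piece of replicated, consistently ordered state. A tiny sketch with the ZooKeeper Java client, assuming an ensemble reachable at localhost:2181 (the node path is hypothetical):

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkDemo {
    public static void main(String[] args) throws Exception {
        // Block until the session is actually connected.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Znodes form a small, replicated tree; distributed apps use
        // them for configuration, naming, and coordination.
        String path = "/demo-config";   // hypothetical node
        if (zk.exists(path, false) == null) {
            zk.create(path, "v1".getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                      CreateMode.PERSISTENT);
        }
        byte[] data = zk.getData(path, false, null);
        System.out.println(path + " = " + new String(data));
        zk.close();
    }
}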
In the next few blogs we will go into more detail, so stay tuned.
