Early adopters of Apache Hadoop, including high-profile users such as Yahoo and Facebook, had to rely on the partnership of the Hadoop Distributed File System (HDFS) and the MapReduce programming and resource management environment. Together, those technologies enabled users to process, manage and store large amounts of structured, unstructured and semi-structured data in Hadoop clusters.
But there were limitations inherent in the Hadoop-MapReduce pairing. For example, Yahoo and other users have cited issues with the first generation of Hadoop technology being unable to keep pace with the deluge of information they collect online, largely because of MapReduce's batch-oriented processing model.
Hadoop
2, an upgrade released by the Apache Software Foundation in October 2013,
offers performance improvements that can benefit related technologies in the
Hadoop ecosystem, including the HBase database and Hive data warehouse. But the
most notable addition in Hadoop 2 -- which originally was referred to as Hadoop
2.0 -- is YARN, a new component that takes over MapReduce's resource management
and job scheduling duties. YARN (short for Yet Another Resource Negotiator)
enables users to deploy Hadoop systems without MapReduce. Running MapReduce
applications is still an option, but other kinds of programs can now be run
natively as well -- for example, real-time querying and streaming data
applications. The enhanced flexibility opens the door to broader uses for big
data and Hadoop 2 implementations; in addition, YARN allows users to
consolidate multiple Hadoop clusters into one system to lower costs and
streamline management tasks. The upgrades in Hadoop 2 also boost cluster
availability and scalability, two other issues that held back the first version
of Hadoop.
Even
with the added capabilities, Hadoop 2 still has a long way to go in moving
beyond the early adopter stage, particularly in mainstream IT shops. But the
new version heralds a maturing technology and a revamped concept for developing
and implementing big data applications. This guide explores the features of
Hadoop 2 and potential new uses for Hadoop tools and systems with insight and
advice from experienced Hadoop users as well as industry analysts and
consultants.
Apache Hadoop
Hadoop
is a free, Java-based programming framework that supports the processing of
large data sets in a distributed computing environment. It is part of the
Apache project sponsored by the Apache Software Foundation.
Hadoop makes it possible to run applications on clusters with thousands of nodes and thousands of terabytes of data. Its distributed file system facilitates rapid
data transfer rates among nodes and allows the system to continue operating
uninterrupted in case of a node failure. This approach lowers the risk of
catastrophic system failure, even if a significant number of nodes become
inoperative.
Hadoop
was inspired by Google's MapReduce, a software framework in which an
application is broken down into numerous small parts. Any of these parts (also
called fragments or blocks) can be run on any node in the cluster. Doug
Cutting, Hadoop's creator, named the framework after his child's stuffed toy
elephant. The current Apache Hadoop ecosystem consists of the Hadoop kernel,
MapReduce, the Hadoop distributed file system (HDFS) and a number of related
projects such as Apache Hive, HBase and Zookeeper.
The Hadoop framework is used by major players including Yahoo, Facebook and IBM, largely for applications involving search engines and advertising. The preferred operating system is Linux, but Hadoop can also work with Windows, BSD and OS X.
Apache Hadoop 2
Apache
Hadoop 2 (Hadoop 2.0) is the second iteration of the Hadoop framework for
distributed data processing.
One
important change that comes with the Hadoop 2 upgrade is the separation of the
Hadoop Distributed File System from MapReduce. The articles in this section
explore the changing dynamics of big data and Hadoop 2 applications triggered
by that breakup as well as the role of the new YARN resource manager and Hadoop
2's other new features.
Hadoop
2 adds support for running non-batch applications through the introduction of
YARN, a redesigned cluster resource manager that eliminates Hadoop's sole
reliance on the MapReduce programming model. Short for Yet Another Resource
Negotiator, YARN puts resource management and job scheduling functions in a
separate layer beneath the data processing one, enabling Hadoop 2 to run a
variety of applications. Overall, the changes made in Hadoop 2 position the
framework for wider use in big data analytics and other enterprise
applications. For example, it is now possible to run event processing as well
as streaming, real-time and operational applications. The capability to support
programming frameworks other than MapReduce also means that Hadoop can serve as
a platform for a wider variety of analytical applications.
Hadoop
2 also includes new features designed to improve system availability and
scalability. For example, it introduced a Hadoop Distributed File System
(HDFS) high-availability (HA) feature that brings a new NameNode architecture
to Hadoop. Previously, Hadoop clusters had one NameNode that maintained a
directory tree of HDFS files and tracked where data was stored in a cluster.
The Hadoop 2 high-availability scheme allows users to configure clusters with
redundant NameNodes, removing the chance that a lone NameNode will become a
single point of failure (SPoF) within a cluster. Meanwhile, a new HDFS federation capability lets clusters be built out horizontally with multiple NameNodes that work independently but share a common data storage pool, offering better namespace scaling than Apache Hadoop 1.x.
Hadoop
2 also added support for Microsoft Windows and a snapshot capability that makes
read-only point-in-time copies of a file system available for data backup and
disaster recovery (DR). In addition, the revision offers all-important binary
compatibility with existing MapReduce applications built for Hadoop 1.x
releases.
MapReduce
MapReduce
is a software framework that allows developers to write programs that process
massive amounts of unstructured data in parallel across a distributed cluster
of processors or stand-alone computers. It was developed at Google for indexing
Web pages and replaced their original indexing algorithms and heuristics in
2004.
The framework is divided into two parts:
· Map, a function that parcels out work to different nodes in the distributed cluster.
· Reduce, another function that collates the work and resolves the results into a single value.
The MapReduce framework is fault-tolerant because each node in the cluster is expected to report back periodically with completed work and status updates. If a node remains silent for longer than the expected interval, a master node makes note and re-assigns the work to other nodes.
"The key to how MapReduce works is to take input as,
conceptually, a list of records. The records are split among the different
computers in the cluster by Map. The result of the Map computation is a list of
key/value pairs. Reduce then takes each set of values that has the same key and
combines them into a single value. So Map takes a set of data chunks and
produces key/value pairs and Reduce merges things, so that instead of a set of
key/value pair sets, you get one result. You can't tell whether the job was
split into 100 pieces or 2 pieces...MapReduce isn't intended to replace
relational databases: it's intended to provide a lightweight way of programming
things so that they can run fast by running in parallel on a lot of machines."
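To make that flow concrete, here is a minimal, single-machine Python sketch of the map-shuffle-reduce pattern for counting words. It is illustrative only; a real Hadoop job would distribute the map and reduce functions across cluster nodes.

# A minimal, single-machine sketch of the MapReduce model: the map step
# emits key/value pairs, a shuffle step groups values by key, and the
# reduce step combines each group into a single result.
from collections import defaultdict

def map_phase(records):
    # Emit (word, 1) for every word in every input record.
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    # Group all values that share the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Combine each key's values into a single value (here, a sum).
    return {key: sum(values) for key, values in grouped.items()}

records = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(records))))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}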
MapReduce
is important because it allows ordinary developers to use MapReduce library
routines to create parallel programs without having to worry about programming
for intra-cluster communication, task monitoring or failure handling. It is
useful for tasks such as data mining, log file analysis, financial analysis and
scientific simulations. Several implementations of MapReduce are available in a
variety of programming languages, including Java, C++, Python, Perl, Ruby, and
C.
Apache Hive
Apache
Hive is an open-source data warehouse system for querying and analyzing large
datasets stored in Hadoop files. Hadoop is a framework for handling large
datasets in a distributed computing environment.
Hive has three main functions: data summarization, querying and analysis. It supports queries expressed in a language called HiveQL, which automatically translates SQL-like queries into MapReduce jobs executed on Hadoop. In addition, HiveQL allows custom MapReduce scripts to be plugged into queries. Hive also handles data serialization/deserialization and increases flexibility in schema design by including a system catalog called the Hive Metastore.
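For illustration, here is a hedged sketch of issuing a HiveQL query from Python. It assumes the third-party PyHive client library and a HiveServer2 endpoint on localhost; the web_logs table and its columns are hypothetical.

# A hedged sketch of submitting a HiveQL query from Python, assuming the
# third-party PyHive library and a HiveServer2 endpoint on localhost:10000.
# The table name (web_logs) and its columns are hypothetical.
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()

# An SQL-like HiveQL query; Hive plans and runs it as MapReduce jobs
# over files stored in HDFS.
cursor.execute("""
    SELECT status_code, COUNT(*) AS hits
    FROM web_logs
    GROUP BY status_code
""")

for status_code, hits in cursor.fetchall():
    print(status_code, hits)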
According
to the Apache Hive wiki, "Hive is not designed for OLTP workloads and does
not offer real-time queries or row-level updates. It is best used for batch
jobs over large sets of append-only data (like web logs)."
Hive supports text files (also called flat files), SequenceFiles (flat files consisting of binary key/value pairs) and RCFiles (Record Columnar Files, which store the columns of a table in a column-oriented layout).
Apache ZooKeeper
Apache ZooKeeper is an open source coordination service, with an application program interface (API), that allows distributed processes in large systems to synchronize with each other so that all clients making requests receive consistent data.
The ZooKeeper service, which is a sub-project of Hadoop, is provided by a cluster of servers to avoid a single point of failure. ZooKeeper uses a distributed consensus protocol to determine which node in the ZooKeeper service is the leader at any given time.
The leader assigns a timestamp to each update to keep updates ordered. Once a majority of the nodes have acknowledged receipt of a time-stamped update, a quorum has been reached and the update is committed, so it can be applied consistently across the data store. The use of a quorum ensures that the service always returns consistent answers.
According to the Hadoop developer's wiki, the service is named ZooKeeper because "coordinating distributed services is a zoo."
Apache HBase
Apache
HBase is a column-oriented key/value data store built to run on top of the
Hadoop Distributed File System (HDFS). Hadoop is a framework for handling large
datasets in a distributed computing environment.
HBase
is designed to support high table-update rates and to scale out horizontally in
distributed compute clusters. Its focus on scale enables it to support very
large database tables -- for example, ones containing billions of rows and
millions of columns. Currently, one of the most prominent uses of HBase is as a
structured data handler for Facebook's basic messaging infrastructure.
HBase is known for providing strong data consistency on reads and writes, which distinguishes it from many other NoSQL databases. As in Hadoop, an important aspect of the HBase architecture is the use of master nodes to manage region servers that distribute and process parts of the data tables.
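For illustration, the following hedged sketch uses the third-party HappyBase Python client (an assumption, along with an HBase Thrift gateway on localhost and a hypothetical messages table) to write and read rows.

# A hedged sketch of reading and writing HBase from Python, assuming the
# third-party HappyBase client and an HBase Thrift gateway on localhost.
# The table name (messages) and column family (d) are hypothetical and
# would need to be created in HBase beforehand.
import happybase

connection = happybase.Connection("localhost")
table = connection.table("messages")

# Rows are addressed by a key; values live in column families.
table.put(b"user42-2014-01-01", {b"d:subject": b"hello", b"d:body": b"hi there"})

row = table.row(b"user42-2014-01-01")
print(row[b"d:subject"])

# Scan a contiguous range of row keys belonging to one user.
for key, data in table.scan(row_prefix=b"user42-"):
    print(key, data)

connection.close()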
HBase
is part of a long list of Apache Hadoop add-ons that includes tools such as
Hive, Pig and ZooKeeper. Like Hadoop, HBase is typically programmed using Java,
not SQL. As an open source project, its development is managed by the Apache
Software Foundation. HBase became a top-level Apache project in 2010.
Apache Pig
Apache
Pig is an open-source technology that offers a high-level mechanism for the
parallel programming of MapReduce jobs to be executed on Hadoop clusters.
Pig
enables developers to create query execution routines for analyzing large,
distributed data sets without having to do low-level work in MapReduce, much
like the way the Apache Hive data warehouse software provides a SQL-like
interface for Hadoop that doesn't require direct MapReduce programming.
The
key parts of Pig are a compiler and a scripting language known as Pig Latin.
Pig Latin is a data-flow language geared toward parallel processing. Managers
of the Apache Software Foundation's Pig project position the language as being
part way between declarative SQL and the procedural Java approach used in
MapReduce applications. Proponents say, for example, that data joins are easier
to create with Pig Latin than with Java. However, through the use of
user-defined functions (UDFs), Pig Latin applications can be extended to
include custom processing tasks written in Java as well as languages such as
JavaScript and Python.
Apache
Pig grew out of work at Yahoo Research and was first formally described in a
paper published in 2008. Pig is intended to handle all kinds of data, including
structured and unstructured information and relational and nested data. That
omnivorous view of data likely had a hand in the decision to name the
environment for the common barnyard animal. That view also extends to Pig's take on application frameworks: while the technology is primarily associated with Hadoop, it can reportedly be used with other execution frameworks as well.
The underlying Hadoop framework grew out of large-scale Web applications whose architects chose non-SQL methods to economically collect and analyze massive amounts of data. The framework has plenty of add-on help for handling big data applications: Apache Pig is just one entry in a long list of Hadoop ecosystem technologies that also includes Hive, HBase, ZooKeeper and other utilities intended to fill in functionality gaps in the framework.
Apache Hadoop YARN
Apache
Hadoop YARN (Yet Another Resource Negotiator) is a cluster management
technology.
YARN
is one of the key features in the second-generation Hadoop 2 version of the
Apache Software Foundation's open source distributed processing framework.
Originally described by Apache as a redesigned resource manager, YARN is now
characterized as a large-scale, distributed operating system for big data
applications.
In
2012, YARN became a sub-project of the larger Apache Hadoop project. Sometimes
called MapReduce 2.0, YARN is a software rewrite that decouples MapReduce's
resource management and scheduling capabilities from the data processing
component, enabling Hadoop to support more varied processing approaches and a
broader array of applications. For example, Hadoop clusters can now run
interactive querying and streaming data applications simultaneously with
MapReduce batch jobs. The original incarnation of Hadoop closely paired the Hadoop Distributed File System (HDFS) with the batch-oriented MapReduce programming framework, which handled resource management and job scheduling on Hadoop systems as well as the parsing and condensing of data sets in parallel.
YARN
combines a central resource manager that reconciles the way applications use
Hadoop system resources with node manager agents that monitor the processing
operations of individual cluster nodes. Running on commodity hardware
clusters, Hadoop has attracted particular interest as a staging area and data
store for large volumes of structured and unstructured data intended for use in
analytics applications. Separating HDFS from MapReduce with YARN makes the
Hadoop environment more suitable for operational applications that can't wait
for batch jobs to finish.
yacc (yet another compiler compiler)
Yacc (for "yet another compiler compiler") is the standard parser generator for the Unix operating system. An open source program, yacc generates code for the parser in the C programming language. The acronym is usually rendered in lowercase but is occasionally seen as YACC or Yacc. The original version of yacc was written by Stephen Johnson at American Telephone and Telegraph (AT&T). Yacc-style parser generators have since been written for Ada, Java and a number of other programming languages.
NoSQL (Not Only SQL)
NoSQL, also called Not Only SQL, is an approach to data management and database design that's useful for very large sets of distributed data.
NoSQL, which encompasses a wide range of technologies and architectures, seeks to solve the scalability and big data performance issues that relational databases weren't designed to address. NoSQL is especially useful when an enterprise needs to access and analyze massive amounts of unstructured data or data that's stored remotely on multiple virtual servers in the cloud.
Contrary to misconceptions caused by its name, NoSQL does not prohibit structured query language (SQL). While it's true that some NoSQL systems are entirely non-relational, others simply avoid selected relational functionality such as fixed table schemas and join operations. For example, instead of using tables, a NoSQL database might organize data into objects, key/value pairs or tuples.
Arguably, the most popular NoSQL database is Apache Cassandra. Cassandra, which was once Facebook's proprietary database, was released as open source in 2008. Other NoSQL implementations include SimpleDB, Google BigTable, HBase, MemcacheDB and Voldemort. Companies that use NoSQL include Netflix, LinkedIn and Twitter.
NoSQL is often mentioned in conjunction with other big data tools and approaches such as massively parallel processing, columnar databases and database as a service (DBaaS).
Relational
databases built around the SQL programming language have long been the top --
and, in many cases, only -- choice of database technologies for organizations.
Now, with the emergence of various NoSQL software platforms, IT managers and
business executives involved in technology decisions have more options on
database deployments. NoSQL databases support dynamic schema design, offering
the potential for increased flexibility, scalability and customization compared
to relational software. That makes them a good fit for Web applications,
content management systems and other uses involving large amounts of
non-uniform data requiring frequent updates and varying field formats. In
particular, NoSQL technologies are designed with "big data" needs in
mind.
But
for prospective users, the array of NoSQL database choices may seem confusing
or even overwhelming. NoSQL databases are grouped into four primary product
categories with different architectural characteristics: document databases,
graph databases, key-value databases and wide column stores. Many NoSQL
platforms are also tailored for specific purposes, and they may or may not work
well with SQL technologies, which could be a necessity in some organizations.
In addition, most NoSQL systems aren't suitable replacements for relational
databases in transaction processing applications, because they lack full ACID
compliance for guaranteeing transactional integrity and data consistency.
As
a result, IT and business professionals making database buying decisions must
carefully evaluate whether the available NoSQL options fit their business
needs. In this guide, you can learn more about what NoSQL software can do and
how it differs from relational databases. Trend stories and user case studies
document how NoSQL databases can be used to support big data, cloud computing
and business analytics applications. And experienced users from companies that
have already deployed NoSQL tools offer advice on how to make the technology
selection and implementation process smoother.
MongoDB
MongoDB
is an open source database that uses a document-oriented data model.
MongoDB is one of several database types to arise in the mid-2000s under the NoSQL banner. Instead of using tables and rows as in relational databases, MongoDB is built on an architecture of collections and documents. Documents comprise sets of key-value pairs and are the basic unit of data in MongoDB. Collections contain sets of documents and function as the equivalent of relational database tables.
Like other NoSQL databases, MongoDB supports dynamic schema design, allowing the documents in a collection to have different fields and structures. The database uses a document storage and data interchange format called BSON, which provides a binary representation of JSON-like documents. Automatic sharding enables data in a collection to be distributed across multiple systems for horizontal scalability as data volumes increase.
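As a short illustration of collections and documents, the sketch below uses the official PyMongo driver; the MongoDB server on localhost and the demo database, users collection and field names are assumptions made for the example.

# A hedged sketch of working with MongoDB's collections and documents from
# Python, assuming the official PyMongo driver and a MongoDB server on
# localhost:27017. Database, collection and field names are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["demo"]                 # a database holds collections
users = db["users"]                 # a collection holds documents

# Documents are sets of key-value pairs; they need not share a schema.
users.insert_one({"name": "Ada", "languages": ["python", "c"]})
users.insert_one({"name": "Grace", "rank": "rear admiral"})

# Query by example; the stored document comes back as a Python dict.
print(users.find_one({"name": "Ada"}))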
MongoDB was created by Dwight Merriman and Eliot Horowitz, who had encountered development and scalability issues with traditional relational database approaches while building Web applications at DoubleClick, an Internet advertising company that is now owned by Google Inc. According to Merriman, the name of the database was derived from the word humongous to represent the idea of supporting large amounts of data. Merriman and Horowitz helped form 10gen Inc. in 2007 to commercialize MongoDB and related software. The company was renamed MongoDB Inc. in 2013.
The database was released to open source in 2009 and is available under the terms of the Free Software Foundation's GNU Affero General Public License (AGPL) Version 3.0. At the time of this writing, among other users, the insurance company MetLife is using MongoDB for customer service applications, the website Craigslist is using it for archiving data, the CERN physics lab is using it for data aggregation and discovery, and The New York Times newspaper is using MongoDB to support a form-building application for photo submissions.
Sharding
Sharding is a type of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole.
Here's how Jason Tee explains sharding on The Server Side: "In the simplest sense, sharding your database involves breaking up your big database into many, much smaller databases that share nothing and can be spread across multiple servers."
Technically, sharding is a synonym for horizontal partitioning. In practice, the term is often used to refer to any database partitioning that is meant to make a very large database more manageable.
The governing concept behind sharding is based on the idea that as the size of a database and the number of transactions per unit of time made on the database increase linearly, the response time for querying the database increases exponentially.
Additionally, the costs of creating and maintaining a very large database in one place can increase exponentially because the database will require high-end computers. In contrast, data shards can be distributed across a number of much less expensive commodity servers. Data shards have comparatively little restriction as far as hardware and software requirements are concerned.
In some
cases, database sharding can be done fairly simply. One common example is
splitting a customer database geographically. Customers located on the East
Coast can be placed on one server, while customers on the West Coast can be
placed on a second server. Assuming there are no customers with multiple
locations, the split is easy to maintain and build rules around.
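A minimal Python sketch of that geographic split might route each customer to a shard by region; the connection strings and region values here are hypothetical.

# A minimal sketch of the geographic sharding example above: customer rows
# are routed to one of two databases based on region. The connection
# strings and region values are hypothetical.
SHARDS = {
    "east": "postgresql://db-east.example.com/customers",
    "west": "postgresql://db-west.example.com/customers",
}

def shard_for(customer):
    # The shard key is the customer's region; every query that includes
    # the region can be sent straight to the right server.
    return SHARDS[customer["region"]]

print(shard_for({"id": 42, "name": "Acme Corp", "region": "west"}))
# postgresql://db-west.example.com/customers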
Data
sharding can be a more complex process in some scenarios, however. Sharding a
database that holds less structured data, for example, can be very complicated
and the resulting shards may be difficult to maintain.
Key-value pair
A key-value pair (KVP) is a set of two linked data items: a key, which is a unique identifier for some item of data, and the value, which is either the data that is identified or a pointer to the location of that data. Key-value pairs are frequently used in lookup tables, hash tables and configuration files.
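As a small illustration, a Python dictionary is simply a collection of key-value pairs:

# Key-value pairs in Python: a dict maps each unique key to its value.
config = {
    "host": "localhost",   # key "host", value "localhost"
    "port": 8080,          # key "port", value 8080
}
print(config["port"])      # look up a value by its key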
JSON (JavaScript Object Notation)
JSON (JavaScript Object Notation) is a text-based, human-readable data interchange format used for representing simple data structures and objects in Web browser-based code. JSON is also sometimes used in desktop and server-side programming environments. JSON is based on a subset of the JavaScript programming language, which was originally introduced as the page scripting language for the Netscape Navigator Web browser.
JSON is used in JavaScript on the Internet as an alternative to XML for organizing data. Like XML, JSON is language-independent and may be combined with C++, Java, Python, Lisp and many other languages. Unlike XML, however, JSON is simply a way to represent data structures, as opposed to a full markup language. JSON documents are relatively lightweight and are processed quickly on Web servers.
JSON consists of name/value pairs and punctuation in the form of braces, brackets, commas and colons. Each name, such as "text" or "image", is paired with the value for that name. Because of its simple structure and the absence of mathematical notation or algorithms, JSON is easy to understand and quickly mastered, even by users with limited formal programming experience, which has spurred adoption of the format as a quick, approachable way to create interactive pages.
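A short example using Python's built-in json module shows the structure and how it parses into native objects:

# A small example of JSON and Python's built-in json module: a document is
# name/value pairs inside braces, with brackets for arrays.
import json

text = '{"user": "ada", "active": true, "tags": ["admin", "ops"]}'

data = json.loads(text)            # parse JSON text into Python objects
print(data["tags"][0])             # -> admin

data["active"] = False
print(json.dumps(data, indent=2))  # serialize back to JSON text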
Novice users of JSON need to be aware of potential security implications. Because JSON delivered to a Web page may be executed automatically as script by the browser, it can be used to mount JavaScript insertion attacks against a Web client, such as command injection or cross-site scripting. For example, if a hacker inserts non-JSON code into the string, such as a Trojan horse, the targeted algorithm executes the text as if it were JavaScript and then returns the value of the last statement. If the only statement was a JSON value, there's no effect. If a previous statement contains other JavaScript code, however, that code will be executed by the script. The hacker might then have access to all the variables the script has access to, potentially compromising a user's PC.
Difference between SAP BWA and HANA?
BWA is one of the ancestors of HANA, so there are definite similarities between the two SAP technologies. Both are in-memory, columnar, parallel database architectures. This means that both systems can distribute large queries evenly among hundreds of processors across several servers. However, BWA is very narrowly focused on its use case as an analytics accelerator and lacks many of the capabilities that allow HANA to act as a full database and application platform. For example, BWA requires that whenever data changes, its columnar indexes are fully rebuilt.
HANA allows for much more fine-grained updates, allowing indices to be rebuilt much less often, making HANA workable as a transactional and an analytic database. HANA also provides standard interfaces (like SQL and MDX), a modelling environment, analytic and statistical function libraries and a full-fledged application platform on top of its database capabilities in the form of XS Engine and River -- all things that BWA never really aspired to.
So is there any place for BWA anymore? Not really. If a company already has BWA and doesn't want to go through the potential expense and disruption of migrating to BW on HANA, then it might make sense to stick with BWA for the moment. But the future looks to be BW on HANA. On the bright side, BW on HANA may not be as expensive a proposition as it seems at first, because BW and HANA both provide ample options for keeping data in cheaper, disk-based storage systems like SAP's Sybase IQ or Apache Hive while still getting good (but not as good, in most cases) performance.