Tuesday, September 13, 2016

Big Data and Hadoop

Early adopters of Apache Hadoop, including high-profile users such as Yahoo and Facebook, had to rely on the partnership of the Hadoop Distributed File System (HDFS) and the MapReduce programming and resource management environment. Together, those technologies enabled users to process, manage and store large amounts of structured, unstructured and semi-structured data in Hadoop clusters.

But there were limitations inherent in the Hadoop-MapReduce pairing. For example, Yahoo and other users cited problems with the first generation of the technology: because of MapReduce's batch-oriented processing model, it couldn't keep pace with the deluge of information they were collecting online.

Hadoop 2, an upgrade released by the Apache Software Foundation in October 2013, offers performance improvements that can benefit related technologies in the Hadoop ecosystem, including the HBase database and Hive data warehouse. But the most notable addition in Hadoop 2 -- which originally was referred to as Hadoop 2.0 -- is YARN, a new component that takes over MapReduce's resource management and job scheduling duties. YARN (short for Yet Another Resource Negotiator) enables users to deploy Hadoop systems without MapReduce. Running MapReduce applications is still an option, but other kinds of programs can now be run natively as well -- for example, real-time querying and streaming data applications. The enhanced flexibility opens the door to broader uses for big data and Hadoop 2 implementations; in addition, YARN allows users to consolidate multiple Hadoop clusters into one system to lower costs and streamline management tasks. The upgrades in Hadoop 2 also boost cluster availability and scalability, two other issues that held back the first version of Hadoop.

Even with the added capabilities, Hadoop 2 still has a long way to go in moving beyond the early adopter stage, particularly in mainstream IT shops. But the new version heralds a maturing technology and a revamped concept for developing and implementing big data applications. This guide explores the features of Hadoop 2 and potential new uses for Hadoop tools and systems with insight and advice from experienced Hadoop users as well as industry analysts and consultants.

Apache Hadoop

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is a top-level project of the Apache Software Foundation.

Hadoop makes it possible to run applications on systems with thousands of nodes handling thousands of terabytes of data. Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted if a node fails. This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.

Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. Any of these parts (also called fragments or blocks) can be run on any node in the cluster. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, HDFS and a number of related projects such as Apache Hive, HBase and ZooKeeper.

The Hadoop framework is used by major players including Yahoo, Facebook and IBM, largely for applications involving search engines and advertising. Linux is the most common operating system for Hadoop clusters, but Hadoop can also run on Windows, BSD and OS X.

Apache Hadoop 2

Apache Hadoop 2 (Hadoop 2.0) is the second iteration of the Hadoop framework for distributed data processing.  

One important change that comes with the Hadoop 2 upgrade is the separation of the Hadoop Distributed File System from MapReduce. The articles in this section explore the changing dynamics of big data and Hadoop 2 applications triggered by that breakup as well as the role of the new YARN resource manager and Hadoop 2's other new features.

Hadoop 2 adds support for running non-batch applications through the introduction of YARN, a redesigned cluster resource manager that eliminates Hadoop's sole reliance on the MapReduce programming model. Short for Yet Another Resource Negotiator, YARN puts resource management and job scheduling functions in a separate layer beneath the data processing one, enabling Hadoop 2 to run a variety of applications. Overall, the changes made in Hadoop 2 position the framework for wider use in big data analytics and other enterprise applications. For example, it is now possible to run event processing as well as streaming, real-time and operational applications. The capability to support programming frameworks other than MapReduce also means that Hadoop can serve as a platform for a wider variety of analytical applications.

Hadoop 2 also includes new features designed to improve system availability and scalability. For example, it introduced a high-availability (HA) capability that brings a new NameNode architecture to the Hadoop Distributed File System (HDFS). Previously, Hadoop clusters had one NameNode that maintained a directory tree of HDFS files and tracked where data was stored in a cluster. The Hadoop 2 high-availability scheme allows users to configure clusters with redundant NameNodes, removing the chance that a lone NameNode will become a single point of failure (SPoF) within a cluster. Meanwhile, a new HDFS federation capability lets clusters be built out horizontally with multiple NameNodes that work independently but share a common data storage pool, offering better scaling compared with Apache Hadoop 1.x.
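
To make the HA scheme concrete, here is a minimal sketch of the client-side settings behind a highly available HDFS nameservice, using the standard Hadoop Configuration API. The cluster name mycluster and the nn1/nn2 hostnames are placeholders; in practice these values normally live in hdfs-site.xml rather than in code.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HaClientDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // One logical nameservice ("mycluster") backed by two NameNodes.
            conf.set("dfs.nameservices", "mycluster");
            conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
            conf.set("dfs.namenode.rpc-address.mycluster.nn1", "nn1.example.com:8020");
            conf.set("dfs.namenode.rpc-address.mycluster.nn2", "nn2.example.com:8020");
            // The failover proxy provider retries against the standby NameNode
            // if the active one stops responding.
            conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");
            // Clients address the nameservice, never an individual NameNode host.
            FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
            System.out.println(fs.exists(new Path("/")));
        }
    }

Because clients address the logical nameservice rather than a physical host, a failover from the active NameNode to the standby is transparent to applications.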

Hadoop 2 also added support for Microsoft Windows and a snapshot capability that makes read-only point-in-time copies of a file system available for data backup and disaster recovery (DR). In addition, the revision offers all-important binary compatibility with existing MapReduce applications built for Hadoop 1.x releases.

MapReduce

MapReduce is a software framework that allows developers to write programs that process massive amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers. It was developed at Google for indexing Web pages and replaced their original indexing algorithms and heuristics in 2004.

The framework is divided into two parts:

·   Map, a function that parcels out work to different nodes in the distributed cluster.

·   Reduce, another function that collates the work and resolves the results into a single value.

The MapReduce framework is fault-tolerant because each node in the cluster is expected to report back periodically with completed work and status updates. If a node remains silent for longer than the expected interval, a master node makes note and reassigns the work to other nodes.

     "The key to how MapReduce works is to take input as, conceptually, a list of records. The records are split among the different computers in the cluster by Map. The result of the Map computation is a list of key/value pairs. Reduce then takes each set of values that has the same key and combines them into a single value. So Map takes a set of data chunks and produces key/value pairs and Reduce merges things, so that instead of a set of key/value pair sets, you get one result. You can't tell whether the job was split into 100 pieces or 2 pieces...MapReduce isn't intended to replace relational databases: it's intended to provide a lightweight way of programming things so that they can run fast by running in parallel on a lot of machines."

MapReduce is important because it allows ordinary developers to use MapReduce library routines to create parallel programs without having to worry about programming for intra-cluster communication, task monitoring or failure handling. It is useful for tasks such as data mining, log file analysis, financial analysis and scientific simulations. Several implementations of MapReduce are available in a variety of programming languages, including Java, C++, Python, Perl, Ruby, and C.
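
To illustrate the map and reduce division of labor described above, here is the canonical word-count job written against the Hadoop 2 MapReduce Java API -- a minimal sketch rather than production code:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every token in this node's slice of the input.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce: the framework groups the pairs by key, so each call sees
        // every count emitted for one word; summing them gives the result.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // pre-aggregate on each node
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The framework handles the shuffle between the two phases: every (word, 1) pair emitted by the mappers is grouped by key before being handed to a reducer, so each reducer call sees all the counts for a given word.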

Apache Hive

Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files.

Hive has three main functions: data summarization, query and analysis. It supports queries expressed in a language called HiveQL, which automatically translates SQL-like queries into MapReduce jobs executed on Hadoop. HiveQL also allows custom MapReduce scripts to be plugged into queries. In addition, Hive supports data serialization/deserialization and increases flexibility in schema design by including a system catalog called the Hive Metastore.
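
As a sketch of how that SQL-like interface looks from application code, the following queries Hive over JDBC via a HiveServer2 instance; the hive.example.com host and the web_logs table are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryDemo {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hive.example.com:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {
                // Hive compiles this SQL-like query into MapReduce jobs behind the scenes.
                ResultSet rs = stmt.executeQuery(
                    "SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status");
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }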

According to the Apache Hive wiki, "Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (like web logs)." 

Hive supports text files (also called flat files), SequenceFiles (flat files consisting of binary key/value pairs) and RCFiles (Record Columnar Files, which store the columns of a table in a columnar layout).

Apache ZooKeeper

Apache ZooKeeper is an open source coordination service with a file system-like application program interface (API) that allows distributed processes in large systems to synchronize with each other so that all clients making requests receive consistent data.

The ZooKeeper service, which began as a sub-project of Hadoop, is provided by a cluster of servers to avoid a single point of failure. ZooKeeper uses a distributed consensus protocol to determine which node in the ZooKeeper service is the leader at any given time.

The leader assigns a timestamp to each update to keep updates ordered. Once a majority of nodes have acknowledged receipt of a time-stamped update, the leader declares a quorum and commits the update, which is then applied consistently across the service's data store. Requiring a quorum ensures that the service always returns consistent answers.
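
The following sketch uses ZooKeeper's Java client to create and read a small znode -- the unit of data ZooKeeper replicates across the ensemble. The zk.example.com address and /app/config path are placeholders, and production code would also wait for the session to connect before issuing requests:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkDemo {
        public static void main(String[] args) throws Exception {
            // Connect to the ensemble; the watcher lambda receives session events.
            ZooKeeper zk = new ZooKeeper("zk.example.com:2181", 15000, event -> {});
            // Writes go through the leader and must be quorum-acknowledged.
            if (zk.exists("/app/config", false) == null) {
                zk.create("/app/config", "v1".getBytes(),
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }
            // Any server in the ensemble can then answer reads consistently.
            byte[] data = zk.getData("/app/config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }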

According to the Hadoop developer's wiki, the service is named ZooKeeper because "coordinating distributed services is a zoo."

Apache HBase

Apache HBase is a column-oriented key/value data store built to run on top of the Hadoop Distributed File System (HDFS).

HBase is designed to support high table-update rates and to scale out horizontally in distributed compute clusters. Its focus on scale enables it to support very large database tables -- for example, ones containing billions of rows and millions of columns. Currently, one of the most prominent uses of HBase is as a structured data handler for Facebook's basic messaging infrastructure.

HBase is known for providing strong data consistency on reads and writes, which distinguishes it from other NoSQL databases. As in Hadoop itself, an important aspect of the HBase architecture is its use of master nodes to manage region servers, which distribute and process parts of the data tables.

HBase is part of a long list of Apache Hadoop add-ons that includes tools such as Hive, Pig and ZooKeeper. Like Hadoop, HBase is typically programmed using Java, not SQL. As an open source project, its development is managed by the Apache Software Foundation. HBase became a top-level Apache project in 2010.
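
As a small illustration of that Java-centric programming model, this sketch writes and reads one row using the HBase client API; the messages table, the d column family and the row key format are assumptions for the example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("messages"))) {
                // Each row is a keyed set of columns grouped into families ("d" here).
                Put put = new Put(Bytes.toBytes("user123#msg456"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("body"), Bytes.toBytes("hello"));
                table.put(put); // single-row writes are atomic and strongly consistent
                Result r = table.get(new Get(Bytes.toBytes("user123#msg456")));
                System.out.println(Bytes.toString(
                    r.getValue(Bytes.toBytes("d"), Bytes.toBytes("body"))));
            }
        }
    }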

Apache Pig

Apache Pig is an open-source technology that offers a high-level mechanism for the parallel programming of MapReduce jobs to be executed on Hadoop clusters. 

Pig enables developers to create query execution routines for analyzing large, distributed data sets without having to do low-level work in MapReduce, much like the way the Apache Hive data warehouse software provides a SQL-like interface for Hadoop that doesn't require direct MapReduce programming.

The key parts of Pig are a compiler and a scripting language known as Pig Latin. Pig Latin is a data-flow language geared toward parallel processing. Managers of the Apache Software Foundation's Pig project position the language as being part way between declarative SQL and the procedural Java approach used in MapReduce applications. Proponents say, for example, that data joins are easier to create with Pig Latin than with Java. However, through the use of user-defined functions (UDFs), Pig Latin applications can be extended to include custom processing tasks written in Java as well as languages such as JavaScript and Python.
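
The sketch below shows a few Pig Latin statements -- load, group, count -- submitted through Pig's Java entry point, PigServer; the /data paths and the tab-separated log layout are assumptions:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigDemo {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            // Each Pig Latin statement describes one step in the data flow;
            // Pig compiles the whole flow into MapReduce jobs.
            pig.registerQuery("logs = LOAD '/data/web_logs' USING PigStorage('\\t') "
                + "AS (user:chararray, url:chararray);");
            pig.registerQuery("by_user = GROUP logs BY user;");
            pig.registerQuery("counts = FOREACH by_user GENERATE group, COUNT(logs);");
            pig.store("counts", "/data/user_counts"); // triggers execution
        }
    }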

Apache Pig grew out of work at Yahoo Research and was first formally described in a paper published in 2008. Pig is intended to handle all kinds of data, including structured and unstructured information and relational and nested data. That omnivorous view of data likely had a hand in the decision to name the environment for the common barnyard animal. It also extends to Pig's take on application frameworks; while the technology is primarily associated with Hadoop, it is said to be capable of being used with other frameworks as well.

The underlying Hadoop framework grew out of large-scale Web applications whose architects chose non-SQL methods to economically collect and analyze massive amounts of data. Hadoop also has plenty of add-on help for big data applications: Apache Pig is just one entry on a long list of ecosystem technologies -- Hive, HBase, ZooKeeper and other utilities among them -- intended to fill functionality gaps in the framework.

Apache Hadoop YARN

Apache Hadoop YARN (Yet Another Resource Negotiator) is a cluster management technology.

YARN is one of the key features in the second-generation Hadoop 2 version of the Apache Software Foundation's open source distributed processing framework. Originally described by Apache as a redesigned resource manager, YARN is now characterized as a large-scale, distributed operating system for big data applications.

In 2012, YARN became a sub-project of the larger Apache Hadoop project. Sometimes called MapReduce 2.0, YARN is a software rewrite that decouples MapReduce's resource management and scheduling capabilities from the data processing component, enabling Hadoop to support more varied processing approaches and a broader array of applications. For example, Hadoop clusters can now run interactive querying and streaming data applications simultaneously with MapReduce batch jobs. The original incarnation of Hadoop closely paired the Hadoop Distributed File System (HDFS) with the batch-oriented MapReduce programming framework, which handles resource management and job scheduling on Hadoop systems and supports the parsing and condensing of data sets in parallel.

YARN combines a central resource manager that reconciles the way applications use Hadoop system resources with node manager agents that monitor the processing operations of individual cluster nodes.  Running on commodity hardware clusters, Hadoop has attracted particular interest as a staging area and data store for large volumes of structured and unstructured data intended for use in analytics applications. Separating HDFS from MapReduce with YARN makes the Hadoop environment more suitable for operational applications that can't wait for batch jobs to finish.
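
A small sketch of that division of labor: the YarnClient API below asks the central ResourceManager for the applications it is tracking, which on a Hadoop 2 cluster can include MapReduce jobs and non-MapReduce applications side by side. It assumes a reachable cluster configured via yarn-site.xml.

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class YarnAppsDemo {
        public static void main(String[] args) throws Exception {
            // Connects to the ResourceManager named in yarn-site.xml.
            YarnClient yarn = YarnClient.createYarnClient();
            yarn.init(new YarnConfiguration());
            yarn.start();
            // The ResourceManager tracks every application on the cluster,
            // MapReduce or otherwise -- the point of YARN's generality.
            for (ApplicationReport app : yarn.getApplications()) {
                System.out.println(app.getApplicationId() + "\t" + app.getApplicationType());
            }
            yarn.stop();
        }
    }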

yacc (yet another compiler compiler) 

Yacc (for "yet another compiler compiler") is the standard parser generator for the Unix operating system. An open source program, yacc generates code for the parser in the C programming language. The acronym is usually rendered in lowercase but is occasionally seen as YACC or Yacc. The original version of yacc was written by Stephen Johnson at American Telephone and Telegraph (AT&T). Versions of yacc have since been written for use with Ada, Java and several other less well-known programming languages.

NoSQL (Not Only SQL)

NoSQL, also called Not Only SQL, is an approach to data management and database design that's useful for very large sets of distributed data.

NoSQL, which encompasses a wide range of technologies and architectures, seeks to solve the scalability and big data performance issues that relational databases weren't designed to address. NoSQL is especially useful when an enterprise needs to access and analyze massive amounts of unstructured data or data that's stored remotely on multiple virtual servers in the cloud.

Contrary to misconceptions caused by its name, NoSQL does not prohibit structured query language (SQL). While it's true that some NoSQL systems are entirely non-relational, others simply avoid selected relational functionality such as fixed table schemas and join operations. For example, instead of using tables, a NoSQL database might organize data into objects, key-value pairs or tuples.

Arguably the most popular NoSQL database is Apache Cassandra. Cassandra, which was once Facebook's proprietary database, was released as open source in 2008. Other NoSQL implementations include Amazon SimpleDB, Google BigTable, Apache HBase, MemcacheDB and Voldemort. Companies that use NoSQL include Netflix, LinkedIn and Twitter.

NoSQL is often mentioned in conjunction with other big data technologies such as massively parallel processing, columnar databases and database as a service (DBaaS).

Relational databases built around the SQL programming language have long been the top -- and, in many cases, only -- choice of database technologies for organizations. Now, with the emergence of various NoSQL software platforms, IT managers and business executives involved in technology decisions have more options on database deployments. NoSQL databases support dynamic schema design, offering the potential for increased flexibility, scalability and customization compared to relational software. That makes them a good fit for Web applications, content management systems and other uses involving large amounts of non-uniform data requiring frequent updates and varying field formats. In particular, NoSQL technologies are designed with "big data" needs in mind.

But for prospective users, the array of NoSQL database choices may seem confusing or even overwhelming. NoSQL databases are grouped into four primary product categories with different architectural characteristics: document databases, graph databases, key-value databases and wide column stores. Many NoSQL platforms are also tailored for specific purposes, and they may or may not work well with SQL technologies, which could be a necessity in some organizations. In addition, most NoSQL systems aren't suitable replacements for relational databases in transaction processing applications, because they lack full ACID compliance for guaranteeing transactional integrity and data consistency.

As a result, IT and business professionals making database buying decisions must carefully evaluate whether the available NoSQL options fit their business needs. In this guide, you can learn more about what NoSQL software can do and how it differs from relational databases. Trend stories and user case studies document how NoSQL databases can be used to support big data, cloud computing and business analytics applications. And experienced users from companies that have already deployed NoSQL tools offer advice on how to make the technology selection and implementation process smoother.

MongoDB

MongoDB is an open source database that uses a document-oriented data model.

MongoDB is one of several database types to arise in the mid-2000s under the NoSQL banner. Instead of using tables and rows as in relational databases, MongoDB is built on an architecture of collections and documents. Documents comprise sets of key-value pairs and are the basic unit of data in MongoDB. Collections contain sets of documents and function as the equivalent of relational database tables.

Like other NoSQL databases, MongoDB supports dynamic schema design, allowing the documents in a collection to have different fields and structures. The database uses a document storage and data interchange format called BSON, which provides a binary representation of JSON-like documents. Automatic sharding enables data in a collection to be distributed across multiple systems for horizontal scalability as data volumes increase.
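
The sketch below illustrates those ideas with the MongoDB Java driver (the com.mongodb.client API of more recent 3.x+ drivers): two documents with different fields land in the same collection. The app database and users collection are hypothetical, and a locally running mongod is assumed.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import java.util.Arrays;
    import org.bson.Document;

    public class MongoDemo {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> users =
                    client.getDatabase("app").getCollection("users");
                // Dynamic schema: documents in one collection can have different fields.
                users.insertOne(new Document("name", "Ada")
                    .append("tags", Arrays.asList("admin", "ops")));
                users.insertOne(new Document("name", "Bob").append("age", 42));
                // Query by example; the driver speaks BSON under the hood.
                Document found = users.find(new Document("name", "Ada")).first();
                System.out.println(found.toJson());
            }
        }
    }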

MongoDB was created by Dwight Merriman and Eliot Horowitz, who had encountered development and scalability issues with traditional relational database approaches while building Web applications at DoubleClick, an Internet advertising company now owned by Google Inc. According to Merriman, the name of the database was derived from the word humongous to represent the idea of supporting large amounts of data. Merriman and Horowitz helped form 10gen Inc. in 2007 to commercialize MongoDB and related software. The company was renamed MongoDB Inc. in 2013.

The database was released to open source in 2009 and is available under the terms of the GNU AGPL Version 3.0 license. At the time of this writing, the insurance company MetLife is using MongoDB for customer service applications, the website Craigslist is using it for archiving data, the CERN physics lab is using it for data aggregation and discovery, and The New York Times is using MongoDB to support a form-building application for photo submissions.

Sharding

Sharding is a type of database partitioning that separates very large databases into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole.

Here's how Jason Tee explains sharding on The Server Side: "In the simplest sense, sharding your database involves breaking up your big database into many, much smaller databases that share nothing and can be spread across multiple servers."

Technically, sharding is a synonym for horizontal partitioning. In practice, the term is often used to refer to any database partitioning that is meant to make a very large database more manageable.

The governing concept behind sharding is based on the idea that as the size of a database and the number of transactions per unit of time made on the database increase linearly, the response time for querying the database increases exponentially.

Additionally, the costs of creating and maintaining a very large database in one place can increase exponentially because the database will require high-end computers. In contrast, data shards can be distributed across a number of much less expensive commodity servers. Data shards have comparatively little restriction as far as hardware and software requirements are concerned.

In some cases, database sharding can be done fairly simply. One common example is splitting a customer database geographically. Customers located on the East Coast can be placed on one server, while customers on the West Coast can be placed on a second server. Assuming there are no customers with multiple locations, the split is easy to maintain and build rules around.
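
Routing logic like that geographic split often lives in the application tier. As an illustration, here is a minimal hash-based router -- a different scheme from the geographic one above -- that maps each customer ID to a fixed shard; the JDBC URLs are placeholders:

    import java.util.Arrays;
    import java.util.List;

    public class ShardRouter {
        private final List<String> shards; // one connection string per shard

        public ShardRouter(List<String> shards) {
            this.shards = shards;
        }

        // Hash routing: the same customer ID always lands on the same shard,
        // so reads and writes for one customer stay on one server.
        public String shardFor(String customerId) {
            int bucket = Math.floorMod(customerId.hashCode(), shards.size());
            return shards.get(bucket);
        }

        public static void main(String[] args) {
            ShardRouter router = new ShardRouter(Arrays.asList(
                "jdbc:postgresql://shard-east.example.com/customers",
                "jdbc:postgresql://shard-west.example.com/customers"));
            System.out.println(router.shardFor("cust-1001"));
        }
    }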

Data sharding can be a more complex process in some scenarios, however. Sharding a database that holds less structured data, for example, can be very complicated and the resulting shards may be difficult to maintain.

Key-value pair

A key-value pair (KVP) is a set of two linked data items: a key, which is a unique identifier for some item of data, and the value, which is either the data that is identified or a pointer to the location of that data. Key-value pairs are frequently used in lookup tables, hash tables and configuration files.
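
In Java terms, a key-value pair is simply an entry in a map; a minimal sketch:

    import java.util.HashMap;
    import java.util.Map;

    public class KvpDemo {
        public static void main(String[] args) {
            // Each entry is a key-value pair: a unique key mapped to its value.
            Map<String, String> config = new HashMap<>();
            config.put("log.level", "INFO");  // key "log.level", value "INFO"
            config.put("cache.size", "256");
            System.out.println(config.get("log.level")); // look up by key
        }
    }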

JSON (JavaScript Object Notation)

JSON (JavaScript Object Notation) is a text-based, human-readable data interchange format used for representing simple data structures and objects in Web browser-based code. JSON is also sometimes used in desktop and server-side programming environments. JSON is based on a subset of the JavaScript programming language, which was introduced as the page scripting language for the Netscape Navigator Web browser.

JSON is used in JavaScript on the Internet as an alternative to XML for organizing data. Like XML, JSON is language-independent and may be combined with C++, Java, Python, Lisp and many other languages. Unlike XML, however, JSON is simply a way to represent data structures, as opposed to a full markup language. JSON documents are relatively lightweight and are processed rapidly by Web servers.

JSON consists of name/value pairs and punctuation in the form of braces, brackets, commas and colons. Each name, such as "text" or "image", is followed by a colon and then the value associated with that name. Because of its simple structure and absence of mathematical notation or algorithms, JSON is easy to understand and quickly mastered, even by users with limited formal programming experience, which has spurred adoption of the format as a quick, approachable way to create interactive pages.
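
For example, here is a small sketch that parses and regenerates a JSON document in Java using the widely used Jackson library (an assumption -- any JSON library would do):

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.util.Map;

    public class JsonDemo {
        public static void main(String[] args) throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            String json = "{\"name\": \"Ada\", \"age\": 36, \"tags\": [\"admin\", \"dev\"]}";
            // Parse the JSON text into a language-native structure...
            Map<?, ?> obj = mapper.readValue(json, Map.class);
            System.out.println(obj.get("name")); // Ada
            // ...and serialize the structure back to JSON text.
            System.out.println(mapper.writeValueAsString(obj));
        }
    }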

Novice users of JSON need to be aware of potential security implications. Because JSON text has often been parsed by handing it directly to a JavaScript interpreter, it can be used to mount JavaScript insertion attacks against a Web client, such as command injection or cross-site scripting. For example, if a hacker inserts non-JSON code into the string, like a Trojan horse, the targeted parser executes the text as if it were JavaScript and then returns the value of the last statement. If the only statement was a JSON value, there's no effect. If a previous statement contains other JavaScript code, however, that code will be executed by the script. The hacker might then have access to all the variables the script has access to, potentially compromising a user's PC.

What is the difference between SAP BWA and HANA?

BWA is one of the ancestors of HANA, so there are definite similarities between the two SAP technologies. Both are in-memory, columnar, parallel database architectures. This means that both systems can distribute large queries evenly among hundreds of processors across several servers. However, BWA is very narrowly focused on its use case as an analytics accelerator and lacks many of the capabilities that allow HANA to act as a full database and application platform. For example, BWA requires that whenever data changes, its columnar indexes are fully rebuilt.

HANA allows for much more fine-grained updates, allowing indices to be rebuilt much less often, making HANA workable as a transactional and an analytic database. HANA also provides standard interfaces (like SQL and MDX), a modelling environment, analytic and statistical function libraries and a full-fledged application platform on top of its database capabilities in the form of XS Engine and River -- all things that BWA never really aspired to.

So is there any place for BWA anymore? Not really. If a company already has BWA and doesn't want to go through the potential expense and disruption of migrating to BW on HANA, then it might make sense to stick with BWA for the moment. But the future looks to be BW on HANA. On the bright side, BW on HANA may not be as expensive a proposition as it seems at first, because BW and HANA both provide ample options for keeping data in cheaper, disk-based storage systems like SAP's Sybase IQ or Apache Hive while still getting good (but not as good, in most cases) performance.

