Introduction to Hadoop and Big Data

Big Data

What is Big Data

Definition in Oxford English Dictionary

big data n. Computing (also with capital initials) data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges; (also) the branch of computing involving such data.

Definition in Wikipedia

Big data is a term for data sets that are so large or complex that traditional data processing application software is inadequate to deal with them. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, and information privacy.

Lately, the term "big data" tends to refer to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set.

Challenges of Big Data

Scalability: the ability to accommodate (rapid) growth in data, whether in traffic or in volume.

  • Scaling up, or vertical scaling, involves obtaining a faster server with more powerful processors and more memory.

  • Scaling out, or horizontal scaling, involves adding servers for parallel computing.

Complexity: scalable systems are more complex than traditional ones. Techniques like shards (or partitioning), replicas, queues, and resharding scripts can be difficult to set up and maintain. Parallel programming is complex.
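To make that concrete, consider hash-based sharding: each record is routed to one of N servers by hashing its key. The routing itself is only a few lines of code, but when the number of shards changes, most keys hash to a different shard, which is what makes resharding painful in practice. The sketch below is a minimal illustration in Java; the shard names are hypothetical.

```java
import java.util.List;

// Minimal sketch of hash-based sharding (partitioning), with hypothetical shard
// names; real systems add replication, rebalancing, and failure handling.
public class ShardRouter {
  private final List<String> shards;

  public ShardRouter(List<String> shards) {
    this.shards = shards;
  }

  // Deterministically route a record key to one of the shards.
  public String shardFor(String key) {
    return shards.get(Math.floorMod(key.hashCode(), shards.size()));
  }

  public static void main(String[] args) {
    ShardRouter router = new ShardRouter(List.of("shard-0", "shard-1", "shard-2"));
    for (String user : List.of("alice", "bob", "carol", "dave")) {
      System.out.println(user + " -> " + router.shardFor(user));
    }
    // Note: adding a fourth shard changes shards.size(), so most keys would then
    // route to a different shard. This is why resharding is hard in practice.
  }
}
```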

Reliability: corrupted data, downtime, human error, etc. are more likely to occur in a complex system.

Hadoop

History of Hadoop

The set of storage and processing methodologies commonly known as “Big Data” emerged from the search engine providers of the early 2000s, principally Google and Yahoo!. These providers were the first group of users faced with Internet-scale problems, mainly how to process and store indexes of all the documents in the Internet universe.

In 2003, Google released a whitepaper called “The Google File System”. Subsequently, in 2004, Google released another whitepaper called “MapReduce: Simplified Data Processing on Large Clusters”. At the same time, Doug Cutting (who is generally acknowledged as the initial creator of Hadoop) was working on an open source web indexing project called Nutch, work he would later continue after joining Yahoo!.

The Google whitepapers inspired Doug Cutting to take the work he had done to date on the Nutch project and incorporate the storage and processing principles outlined in these whitepapers. The resultant product is what is known today as Hadoop.

In January 2008, Hadoop was made its own top-level project at Apache, confirming its success and its diverse, active community. Today, Hadoop is widely used in mainstream enterprises.

What is Hadoop

Hadoop is a framework that allows for distributed storage and distributed processing of large data sets across clusters of computers using simple programming models.

  • A software framework is an abstraction in which common code providing generic functionality can be selectively overridden or specialized by user code providing specific functionality.

  • A computer cluster is a set of loosely or tightly connected computers that work together so that, in many respects, they can be viewed as a single system. Individual computers within a cluster are referred to as nodes.

Hadoop is designed to scale out from a single server to thousands of machines, with each machine offering local computation and storage.
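The "simple programming models" referred to above are exemplified by MapReduce: the user writes a map function and a reduce function, and the framework takes care of distributing the work across the cluster, scheduling tasks near the data, and recovering from failures. As a minimal sketch, here is the classic WordCount job written against the Hadoop MapReduce Java API; the input and output paths (assumed here to be HDFS directories) are passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in each input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged as a JAR, such a job is typically submitted with the "hadoop jar" command; the class name and paths above are illustrative.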

Hadoop Ecosystem

(Figure: a non-exhaustive illustration of the Hadoop ecosystem.)

With other components such as:

  • Sqoop: used to transfer bulk data between Apache Hadoop and structured datastores such as relational databases.

  • Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into HDFS.

  • Oozie: a workflow scheduler system to manage Apache Hadoop jobs.

  • Kafka: a highly scalable, fast, and fault-tolerant messaging system used for streaming applications and data processing (see the producer sketch after this list).

  • Hue: a web application for querying and visualizing data by interacting with Apache Hadoop.

  • Ambari: provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.
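To make one of these components concrete, the sketch below publishes a single event to a Kafka topic using Kafka's Java producer API. The broker address ("localhost:9092") and topic name ("clickstream") are assumptions chosen for illustration, not values from this text.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal sketch of a Kafka producer; broker address and topic name are assumed.
public class ClickstreamProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    // try-with-resources: close() flushes any buffered records before exiting.
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Each record is appended to the "clickstream" topic, where downstream
      // stream-processing or data-loading jobs can consume it.
      producer.send(new ProducerRecord<>("clickstream", "user-42", "page=/home"));
    }
  }
}
```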

Hadoop Commercial Landscape

Hadoop is an open source project, but many commercial vendors supply distributions, support, management utilities, and more.

In 2008, the first commercial vendor, Cloudera, was formed by engineers from Google, Yahoo!, and Facebook. Cloudera subsequently released the first commercial distribution of Hadoop, called CDH (Cloudera Distribution of Hadoop).

In 2009, MapR was founded as a company delivering a “Hadoop-derived” software solution implementing a custom adaptation of the Hadoop filesystem (called MapRFS) with Hadoop API compatibility.

In 2011, Hortonworks was spun off from Yahoo! as a Hadoop vendor providing a distribution called HDP (Hortonworks Data Platform).

Cloudera, Hortonworks, and MapR are referred to as “pure play” Hadoop vendors, as their business models are founded upon Hadoop. Many other vendors, such as IBM, Pivotal, and Teradata, would follow with their own distributions.
