           MADONNA UNIVERSITY, NIGERIA            
A SEMINAR REPORT ON
HADOOP MAPREDUCE
BY
####### ####### ######
CS/##/##

SUBMITTED TO THE
DEPARTMENT OF COMPUTER SCIENCE
FACULTY OF SCIENCE

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR
THE AWARD OF BACHELOR OF SCIENCE (B.Sc.)
DEGREE IN COMPUTER SCIENCE



DECEMBER, 2015


 

DEDICATION

This report is dedicated to my youngest brother – ####### ####### ######.



ACKNOWLEDGEMENT

My utmost gratitude goes to the Almighty and ever-living God, for His divine grace from which I have always benefited. In Him, I live, move and have my being.
My profound gratitude goes to all members of staff of the Department of Computer Science, Madonna University, Elele, Rivers State, for their unwavering support and attention.
I specially wish to express my gratitude to my supervisor, ####### ####### ######, for all her help throughout the preparation of this report.





DECLARATION
This is to certify that this seminar report on “Hadoop MapReduce” by ####### ####### ###### (CS/##/###) has met the conditions for the award of Bachelor of Science (B.Sc.) degree in Computer Science, Madonna University Nigeria, Elele Campus. It is hereby approved for its contribution to knowledge.


___________________________                                                _____________________
####### ####### ######                                                                            DATE
(Name of Student)

                               
 

####### ####### ######                                                        DATE
(Supervisor)      
               
               
                                                                                                                                               
####### ####### ######                                                                        DATE
(Acting Head of Department)








ABSTRACT

With the continuous advance in computer technology over the years, the quantity of data being generated is growing exponentially. These data may be structured, semi-structured or unstructured, which poses a great challenge for analysis because conventional data-processing techniques are not suited to handling them. This is where Hadoop MapReduce comes in. Hadoop MapReduce is a programming model for developing applications that process large amounts of data in parallel across clusters of commodity hardware in a reliable, fault-tolerant manner. This report covers the origin of Hadoop MapReduce, its features and mode of operation; it describes how it is implemented, and reviews how some information technology companies are making use of it. The study found MapReduce to be an efficient model for developing applications that process Big Data.





CHAPTER 1
INTRODUCTION


1.1  BACKGROUND OF STUDY
The quantity of data being generated on a daily basis is staggering. For example, statistics from a 2014 infographic show that every minute:
·         Facebook users share nearly 2.5 million pieces of content.
·         Twitter users tweet nearly 300,000 times.
·         Instagram users post nearly 220,000 new photos.
·         YouTube users upload 7 hours of new video content.
·         Apple users download nearly 50,000 apps.
·         Email users send over 200 million messages.
·         Amazon generates over $80,000 in online sales.
Also, with the Internet of Things, the amount of data generated by car sensors, railway sensors, GPS devices and other devices is also very high. Various medical and research institutions likewise gather enormous amounts of data each day. In fact, according to IBM, every day we create an average of 2.5 quintillion bytes of data (about 2.5 billion gigabytes). These large amounts of data come in various formats ranging from images, videos, JSON strings, audio files, PDFs, web pages and text documents to 3D models. Data of this nature is known as Big Data.
Efforts to simplify data processing have long been a subject of research and development. In 2000, Seisint, Inc. developed a C++-based distributed file-sharing framework for data storage and query. The system could store and distribute structured, semi-structured and unstructured data across multiple servers, and developers could build queries in a modified C++ dialect called ECL. LexisNexis acquired Seisint in 2004 and ChoicePoint, Inc. in 2008, merging their high-speed parallel-processing platforms into HPCC Systems. Also in 2004, Google published a paper on MapReduce, an architecture similar to that of HPCC Systems. The framework proved successful, and others sought to replicate it; hence, an implementation of MapReduce was adopted by an Apache open-source project named Hadoop.
Hadoop allows for distributed processing of large data sets across clusters of computers using simple programming models. It is highly scalable, growing from single servers to thousands of machines, each offering local computation and storage. The library is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers that may individually be prone to failure. Hadoop MapReduce is a software framework, or programming model, for developing applications that process large amounts of data in parallel across clusters of commodity hardware in a reliable, fault-tolerant manner.

1.2  PROBLEM STATEMENT

Some developers still use conventional data storage and processing techniques (such as relational databases) to store and harness information from large data sets, an approach that is increasingly inefficient at this scale. Hence, there is a need to adopt a new method of developing applications that can process such information in an efficient and fault-tolerant way.
1.3  AIM AND OBJECTIVES

The aim of this study is to examine how Hadoop MapReduce technology solves the problem of Big Data processing.

1.4  SIGNIFICANCE OF STUDY

After going through this report, readers will understand how Hadoop MapReduce works, why it is important, and how it is implemented.

1.5  SCOPE OF STUDY

In order to achieve the objective of this study as earlier stated, this report will cover the history of Hadoop MapReduce, its features, its mode of operation, its implementation, its applications, its benefits and its limitations. The scope of this study is strictly limited to Hadoop MapReduce and not MapReduce in general; there exists a slight distinction between the two, and whenever MapReduce is mentioned in this report, it refers strictly to Hadoop MapReduce. Also, implementing Hadoop MapReduce is a broad subject which this report cannot cover in full detail. This report will, however, provide a proper introduction to the methodologies used in implementing Hadoop MapReduce, in order to aid readers who wish to go further into its implementation.

1.6  LIMITATIONS

Several factors may limit the quality of this report: poor Internet connectivity during research, the limited time frame for delivering the report, and limited resources for a first-hand study of Hadoop MapReduce. The last limitation arises because MapReduce is mostly implemented and appreciated on server-class machines, which were not accessible at the time of this study.

1.7  GLOSSARY

The keywords and phrases used in this report include:
·         Dataset – A collection of data.
·         Semi-structured data – Data that does not conform to the formal structure of data models associated with relational databases or other data tables, but contains tags or other markers to separate semantic elements and show hierarchies of records and fields within the data.
·         Structured data – Data that resides within a record or file, such as data in relational databases and spreadsheets. It conforms to a particular data model.
·         Unstructured data – Information that either does not have a pre-defined data model or is not organized in a predefined manner.
·         File System – The methods and data structures an operating system uses to keep track of files on a disk or partition. The term can also refer to a partition or disk used to store files.
·         Distributed System – A software system in which components located on networked computers communicate and coordinate their actions by passing messages.
·         Node – A device or data point on a larger network, such as a personal computer, mobile phone or printer. On the Internet, a node is anything that has an IP address.
·         Cluster – A set of loosely or tightly connected computers that work together so that they can be viewed as a single system. Each node performs tasks controlled and scheduled by software.
·         Job – An application which implements Hadoop MapReduce.
·         Class – In object-oriented programming, a template used to create objects and to define object data types and methods.
·         Interface – A collection of method names without actual definitions, specifying a set of behaviors in addition to the behaviors a class gets from its superclass. A class does not inherit an interface; it implements it.
·         Abstract Class – A class which may or may not include abstract methods. It cannot be instantiated but can be inherited.
·         Method – A named block of code that can be invoked at any point in a program to carry out a specific function.
·         HDFS – Hadoop Distributed File System, the component of Hadoop responsible for file distribution and storage.
·         DBMS – Database Management System, software for creating and managing databases.

1.8  ORGANIZATION OF CHAPTERS

This report is organized into four chapters.
Chapter 2 is the literature review; it discusses Hadoop MapReduce and related literature. Hadoop is an Apache software library, or framework, that allows for distributed processing of large data sets across clusters of computers using simple programming models; MapReduce is one of its components. Hadoop MapReduce is a software framework, or programming model, for developing applications that process large amounts of data in parallel across clusters of commodity hardware in a reliable, fault-tolerant manner. Chapter 3 presents the findings of this study: the features of Hadoop MapReduce, its mode of operation, its implementation, its benefits and its limitations, as well as how renowned companies like Facebook, Yahoo and Amazon make use of it. Chapter 4 presents the conclusions drawn from the study.

CHAPTER 2
LITERATURE REVIEW

Since its inception, the MapReduce framework has been a subject of research and discussion among various experts, professionals and institutions. Many of these researchers have praised it. MapReduce is arguably the most successful parallelization framework, especially for processing large data sets in datacenters comprising commodity computers (Zhiqiang & Lin, 2010). MapReduce has proven to be a useful abstraction that greatly simplifies large-scale computations (Prasad, 2009). The research community uses MapReduce and Hadoop to solve data-intensive problems in bioinformatics, computational finance, chemistry, and environmental science (Serge, 2013). In praise of data-driven technologies such as Hadoop MapReduce, Andrew McAfee and Erik Brynjolfsson stated: “The evidence is clear: Data-driven decisions tend to be better decisions. Leaders will either embrace this fact or be replaced by others who do” (McAfee & Brynjolfsson, 2012).
Despite its numerous benefits, a number of researchers have criticized MapReduce for different reasons. Even though high-level, declarative-style abstractions exist and have been widely adopted, Hadoop MapReduce is still far from offering interactive analysis capabilities (Vasiliki & Vladimir, 2014). The most popular of these criticisms is found in DeWitt and Stonebraker (2008), where MapReduce was described as “a giant step backwards”:
“As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is a giant step backward in the programming paradigm for large-scale data intensive applications”. (DeWitt & Stonebraker, 2008)
DeWitt and Stonebraker further argued that MapReduce is not novel at all – that it is a specific implementation of well-known techniques developed nearly 25 years earlier, that it is missing most of the features routinely included in current DBMSs, and that it is incompatible with the tools DBMS users have come to depend on.
But is MapReduce truly missing most of the features routinely included in current database management systems? Another component of Hadoop, HBase, is responsible for providing those features, so the burden should not be placed on MapReduce itself. It is fair to say that MapReduce is not the best possible solution to the problem of Big Data, but it is good enough and has room for improvement.

 

CHAPTER 3
FINDINGS

3.1    FEATURES OF HADOOP MAPREDUCE

The Hadoop MapReduce framework consists of a single JobTracker (known as the master) and one TaskTracker per node in the cluster (known as slaves or workers). The JobTracker is responsible for scheduling a job's component tasks, monitoring them and re-executing any failed tasks. The TaskTrackers execute the tasks as directed by the JobTracker. Both the input and the output of the job are stored in a file system.
Fig 3.1 Features of Hadoop MapReduce. Source: (Shi, 2008)

Applications which implement MapReduce usually specify the input and output locations and supply Map and Reduce functions via implementations of the appropriate interfaces and/or abstract classes. These and other parameters are known as the job configuration. The Hadoop client then submits the job (by now in jar/executable form) and the configuration to the JobTracker, which assumes responsibility for distributing the software and configuration to the TaskTrackers, scheduling the tasks, monitoring them, and providing status and diagnostic information to the job client.
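To make this concrete, the following is a minimal job-configuration and submission sketch in Java, using the newer org.apache.hadoop.mapreduce API (the WordCountMapper and WordCountReducer classes are illustrative user-supplied implementations, sketched later in this chapter):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");    // the job configuration
        job.setJarByClass(WordCountDriver.class);         // jar to distribute to the TaskTrackers
        job.setMapperClass(WordCountMapper.class);        // user-supplied Map implementation
        job.setCombinerClass(WordCountReducer.class);     // optional local combine step
        job.setReducerClass(WordCountReducer.class);      // user-supplied Reduce implementation
        job.setOutputKeyClass(Text.class);                // type of the output keys
        job.setOutputValueClass(IntWritable.class);       // type of the output values
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input location
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output location
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit the job and wait
    }
}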

3.2    MODE OF OPERATION

A MapReduce job usually splits the input data set into independent blocks of data, which are assigned to Map tasks (functions) in a completely parallel manner. The output of the map phase is then sorted and given as input to the Reduce tasks (functions) to produce the final result. Hadoop MapReduce works exclusively on <key, value> pairs: it views the input to the job as <key, value> pairs and produces the output of the job as a set of <key, value> pairs. It serializes the key and value classes through an interface known as the Writable interface, which those classes have to implement. In addition, the key class has to implement the WritableComparable interface to facilitate sorting by the framework. The input and output of a MapReduce job can be depicted as:
(input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output)
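To illustrate the serialization requirement, a minimal custom key type might look like the following sketch (WordKey is a hypothetical example, not part of Hadoop; built-in types such as Text and IntWritable already implement these interfaces):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class WordKey implements WritableComparable<WordKey> {
    private String word;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);                 // serialize the key for transport/storage
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();                // deserialize the key
    }

    @Override
    public int compareTo(WordKey other) {
        return word.compareTo(other.word);  // ordering used by the framework's sort phase
    }
}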
Applications which implement MapReduce usually implement the Mapper and Reducer interfaces, which provide several methods for different tasks.

3.2.1   MAPPER

The Mapper is responsible for mapping input key/value pairs to a set of intermediate key/value pairs. The individual tasks that transform input records into intermediate records are known as maps. The intermediate records do not necessarily have to be of the same data type as the input records. The outputs of the maps are sorted and then partitioned per Reducer; the total number of partitions equals the number of reduce tasks for the job. The MapReduce framework subsequently groups the intermediate values associated with a given output key and passes them to the Reducer(s) to determine the final output. When developing applications with the Hadoop MapReduce framework, users can control which keys go to which Reducer by implementing an interface known as Partitioner, as sketched below.
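The following sketch shows a word-count Mapper together with a custom Partitioner, using the newer org.apache.hadoop.mapreduce API in which Mapper and Partitioner are abstract classes to extend (in the older org.apache.hadoop.mapred API they are interfaces to implement). FirstLetterPartitioner is a hypothetical example of a partitioning policy:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;

// Emits an intermediate <word, 1> pair for every word in its input split.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);       // one intermediate record per word
        }
    }
}

// Controls which Reducer receives each intermediate key.
class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Hypothetical policy: route keys to partitions by their first character.
        return (key.toString().charAt(0) & Integer.MAX_VALUE) % numPartitions;
    }
}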

3.2.2   REDUCER

After the Mapper has successfully mapped the input, the Reducer reduces the set of intermediate values which share a common key into a smaller set of values. The user specifies the number of reduces for the job using specific functions. The Reducer consists of three primary phases:
1.                  Shuffle
2.                  Sort
3.                  Reduce

Shuffle
In this phase, the MapReduce framework fetches the relevant partition of the output of every Mapper via the Hypertext Transfer Protocol (HTTP).
Sort
In this phase, the framework groups Reducer inputs by key (because different Mappers may have produced outputs with the same key). The shuffle and sort phases occur simultaneously; that is, as map outputs are fetched, they are merged.
Reduce
In this phase, the reduce() method is called for each <key, list of values> pair in the grouped input, and the output is written to the file system.
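Continuing the earlier sketches, a matching Reducer in the newer Java API receives each key together with the iterable of grouped values produced by the shuffle and sort phases:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the grouped counts for each word and writes the final pair.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();                // combine all values sharing this key
        }
        result.set(sum);
        context.write(key, result);        // final <key, value> written to the file system
    }
}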

SAMPLE PROBLEM
To understand the whole MapReduce process, let us assume we have an application which counts the number of times a Computer Science student of Madonna University used certain words during a CSC317 examination. For the sake of this example, assume the following words appear in different sentences:
Variable Array Pointer
Object Pointer Object
Object Variable Array
The main steps MapReduce would take to process the data would be:
1.                  Get the input data.
2.                  Split the data into separate blocks.
3.                  Assign the blocks to Map tasks.
4.                  Sort the output of the Map tasks.
5.                  Reduce the sorted data using the Reduce tasks.
6.                  Store the output in the file system.
The following diagram illustrates the process:
Fig 3.2 Execution of a MapReduce count program. Modified from: (Zaharia, 2010)
Breaking these steps down, we can see how each activity takes place.
First, the input is split into three blocks:
·         Block 1-Variable Array Pointer
·         Block 2-Object Pointer Object
·         Block 3-Object Variable Array
Then the Mapper transforms the inputs to intermediate <key, value> pairs:


<Variable, 1>  <Array, 1>  <Pointer, 1>
<Object, 1>  <Pointer, 1>  <Object, 1>
<Object, 1>  <Variable, 1>  <Array, 1>



These values are passed to the Reducer. In the Reducer, the data passes through the Shuffle and Sort phases which then group the data by keys:
<Array, (1, 1)>  <Object, (1, 1, 1)>  <Pointer, (1, 1)>  <Variable, (1, 1)>
Finally, the Reduce phase reduces the data and stores them in the file system.
<Array, 2>  <Object, 3>  <Pointer, 2>  <Variable, 2>
When one pictures this behaviour on much larger data sets, the value of Hadoop MapReduce becomes truly apparent.

3.3    IMPLEMENTING HADOOP MAPREDUCE

Hadoop MapReduce is a component of Hadoop. Hence, it can operate on any hardware that Hadoop can run on, and the software involved in using MapReduce is software that supports Hadoop. We shall see some of both in this section.

3.3.1   SOFTWARE

Operating Systems
Hadoop is known to run best on Linux, though Windows is also supported. Operating systems of choice include:
  • Red Hat Enterprise Linux (RHEL)
  • CentOS
  • Ubuntu (Server edition)
  • Windows
Usage
Hadoop MapReduce is a programming model and, as such, is used mainly by developers. Several tools have been developed for building MapReduce applications. Examples include:
·         HBase
·         Hive
·         Pig
·         Oozie
·         Mahout
Most applications using MapReduce are written in Java, mainly because the Hadoop framework itself is implemented in Java. Apart from Java, other languages can be used, such as:
·         JavaScript
·         Python
·         C#
·         Pig Latin
·         HQL (Hive Query Language)
With any of these tools, one can successfully develop a MapReduce application. One thing is common to all of them: they must include the Hadoop library, and the programs must follow the same structure of having map and reduce functions or methods. The following pseudocode shows an example of a MapReduce program. The program counts the occurrences of each word in a large input document – similar to the sample problem we saw while examining the mode of operation of Hadoop MapReduce.
Map(String fileName, String fileContents)
{
    // fileName: the input key; fileContents: the input value
    for each word w in fileContents:
        EmitIntermediate(w, "1");
}

Reduce(String word, Iterator values)
{
    // word: an intermediate key; values: the list of counts emitted for that word
    int occurrence = 0;
    for each v in values:
        occurrence += ParseInt(v);
    Emit(word, AsString(occurrence));
}
In the Map method, the fileName parameter is the input key, while the fileContents parameter is the input value. With the aid of the MapReduce API, the code loops through fileContents, finds each word, and passes the word together with the value "1" to a framework method that emits the intermediate <key, value> pairs. The framework then sorts this data and triggers the Reducer.
In the Reduce method, the word parameter represents the input key and the values parameter represents the list of counts emitted for that word. Just as in the sample problem, the reduce method is called for each <key, list of values> pair in the grouped input. It adds up the occurrences of the key, and the output is written to the file system.
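Once a job like this is compiled and packaged into a jar (the file and directory names below are illustrative), it is typically launched with the standard hadoop command-line tool, passing the input and output locations as arguments:

hadoop jar wordcount.jar WordCountDriver /user/hadoop/input /user/hadoop/output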

3.3.2   HARDWARE

Since MapReduce is a component of Hadoop, it runs on any hardware that Hadoop can run on. Hadoop runs on commodity hardware, but this does not mean it runs on cheap hardware; in practice it runs on decent server-class machines. Average hardware specifications for Hadoop nodes are shown below:

              Medium                    High End
CPU           8 physical cores          12 physical cores
Memory        16 GB                     48 GB
Disk          4 disks x 1 TB = 4 TB     12 disks x 3 TB = 36 TB
Network       1 Gigabit Ethernet        10 Gigabit Ethernet or InfiniBand

Examples of Hadoop servers include:
  • HP ProLiant DL380
  • Dell C2100 series
  • Supermicro Hadoop series
3.4    SOME EXISTING USERS

Hadoop MapReduce is applicable in every area of life where data is generated, because that data will certainly need to be stored and processed. Owing to these benefits, a number of large companies have already adopted MapReduce. They include:
  • Facebook
  • Yahoo
  • Amazon
  • eBay
  • Google
  • IBM
  • The New York Times
  • Walmart
Facebook started using Hadoop in 2007. At the time, developers at Facebook used it to import a few data sets and write MapReduce jobs to manipulate them. Following this success, they began using it for major projects. One such project is the popular Facebook Lexicon – the tool where users can see the buzz surrounding different words and phrases on Facebook Walls. Most of the data stored in Hadoop's file system is published as tables. Over time, Facebook added classic data-warehouse features like partitioning, sampling and indexing to this environment. This in-house data-warehousing layer over Hadoop is called Hive – one of the tools mentioned in the preceding sections as implementing Hadoop MapReduce.
Amazon implements MapReduce in Amazon Elastic MapReduce (Amazon EMR), a web service that makes it easy to process vast amounts of data quickly and cost-effectively.
Every day, more users adopt Hadoop MapReduce for processing their data and join in enjoying its benefits.

3.5    BENEFITS OF HADOOP MAPREDUCE

MapReduce provides a lot of benefits. They include:
1.      Cost-efficiency – MapReduce needs only commodity hardware to function, and its fault tolerance is automatic, so fewer administrators are needed on the network.
2.      Simplicity – Developers intending to implement MapReduce can write applications with their language of choice such as Java, C++ or Python. Also, MapReduce jobs are easy to run.
3.      Scalability – MapReduce can process petabytes of data which are stored in Hadoop Distributed File System (HDFS) in one cluster.
4.      Speed – With parallel processing, MapReduce can take problems that used to take days to solve and solve them in hours or minutes. For instance, in July 2008, one of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, which beat the previous record of 297 seconds in the annual general purpose (Daytona) terabyte sort benchmark.
5.      Recovery – MapReduce handles failures. Due to redundancy in HDFS, if a machine with one copy of the data is unavailable, another machine has a copy of the same key/value pair, which can be used to solve the same sub-task. The JobTracker keeps track of it all.
6.      Minimal Data Motion – With MapReduce, processes are moved to data and not data to processes. Processing normally occurs on the same physical node where the data resides. This is known as data locality. This reduces the network input/output patterns and contributes to the processing speed.

3.6    LIMITATIONS OF HADOOP MapReduce

Despite the amazing benefits of Hadoop MapReduce, there are some limitations to it. Some are listed below:
1.      The development of efficient MapReduce applications requires advanced programming skills and a deep understanding of the system's architecture. Analysts who are used to SQL-like or declarative languages may view the MapReduce programming model as too “low-level”, since MapReduce provides no built-in relational operations such as joins; these must be implemented by hand.
2.      MapReduce is batch-oriented by nature. Data always needs to be uploaded to the file system, and even when the same dataset is analyzed multiple times, it must be re-read each time.
3.      The master node (which hosts the JobTracker) has to be more sophisticated than the other nodes, because it can easily become a single point of failure.
4.      Since Google has been granted a patent on MapReduce, questions arise about the long-term viability of using the mechanism in open environments.
5.      The “one-way scalability” of its design: MapReduce allows a program to scale up to process very large datasets, but constrains a program's ability to process smaller data items.




CHAPTER 4
CONCLUSION
4.1 Summary
Hadoop MapReduce is a software framework, or programming model, for developing applications that process large amounts of data in parallel across clusters of commodity hardware in a reliable, fault-tolerant manner. Hadoop MapReduce works exclusively on <key, value> pairs. A MapReduce job usually splits the input data set into independent blocks of data, which are assigned to Map tasks (functions) in a completely parallel manner. The output of the map phase is then sorted and given as input to the Reduce tasks (functions) to produce the final result. Hadoop MapReduce applications can be written in several languages; one thing is common to all of them: they must include the Hadoop library, and the programs must follow the same structure of having map and reduce functions or methods. Several companies already make use of it, such as Facebook, Yahoo and Google. This is due to the benefits of developing applications with MapReduce: its scalability, speed, simplicity, minimal data motion and more.

4.2 Recommendation
The MapReduce model is a very efficient model for processing large data sets. Interest in Hadoop MapReduce is currently at its peak, and many problems and challenges remain to be addressed. The future of Big Data is very bright, as businesses and organizations increasingly realize the value of the information they can store and analyze. With the continued improvement and adoption of Hadoop MapReduce, data processing will become faster and more efficient than ever before, and life will be made easier. Hence, it should be adopted by developers who need to build applications that process large amounts of data.




 

References

DeWitt, D., & Stonebraker, M. (2008, January). MapReduce: A major step backwards. The Database Column. Retrieved October 30, 2015, from http://databasecolumn.vertica.com/2008/01/mapreduce_a_major_step_back.html
McAfee, A., & Brynjolfsson, E. (2012, October). Big Data: The Management Revolution. Retrieved October 20, 2015, from http://hbr.org/2012/10/big-data-the-management-revolution/ar
Prasad. (2009). MapReduce Architecture. Retrieved October 10, 2015, from http://cecs.wright.edu/~tkprasad/courses/cs707/L06MapReduce.ppt
Serge, B. (2013). Introduction to Hadoop, MapReduce and HDFS for Big Data Applications. Chicago: Storage Networking Industry Association.
Shi, X. (2008). Take a close look at MapReduce. Retrieved October 2015, from http://slideplayer.com/slide/7070314
Vasiliki, K., & Vladimir, V. (2014). MapReduce: Limitations, Optimizations and Open Issues. Stockholm: KTH The Royal Institute of Technology.
Zaharia, M. (2010). Introduction to MapReduce and Hadoop. Retrieved from http://www.cs.berkeley.edu/~matei/talks/2010/amp_mapreduce.pdf
Zhiqiang, M., & Lin, G. (2010). The Limitation of MapReduce: A Probing Case and a Lightweight Solution. CLOUD COMPUTING 2010: The First International Conference on Cloud Computing, GRIDs, and Virtualization (pp. 68-73). IARIA.


