Describe the background of Apache YARN, especially its origin.
Apache YARN and Apache Hadoop
Apache Hadoop is a collection of open-source utilities that let a network of
many computers work together on problems involving large amounts of data and
computation. It provides a software framework for the distributed storage and
processing of big data using the MapReduce programming model. Hadoop manages
both the data processing and the storage for big data applications running on
clustered systems. It sits at the center of a growing ecosystem of big data
technologies that are widely used to support advanced analytics initiatives,
including machine learning, predictive analytics and data mining. Hadoop can
handle many types of structured and unstructured data, which gives users more
flexibility to collect, process and analyze data than data warehouses and
relational databases provide (Shetty et al., 2019).
Hadoop is developed as an open-source project within the Apache Software
Foundation. Four primary big data platform vendors currently offer commercial
distributions of Apache Hadoop: MapR Technologies, Cloudera, Hortonworks and
Amazon Web Services. In addition, Microsoft, Google and other vendors offer
cloud-based managed services built on Hadoop and related technologies that run
on top of Apache YARN. The addition of YARN significantly expanded the
potential uses of Hadoop. The original incarnation of Hadoop closely paired the
Hadoop Distributed File System (HDFS) with the batch-oriented MapReduce
programming framework and processing engine, which also functioned as the
platform's resource manager and job scheduler. As a result, the system could
run only MapReduce applications, a limitation that YARN eliminated (Iconiq
Inc., 2016).
Before it received its official name, YARN was informally known as NextGen
MapReduce or MapReduce 2. YARN introduced a new approach that decouples cluster
resource management and scheduling from the data-processing component of
MapReduce. This lets Hadoop support a broader array of processing types and
applications. For instance, a Hadoop cluster can run interactive querying,
streaming data and real-time analytics applications on Apache Spark and other
processing engines simultaneously with MapReduce batch jobs. Apache Hadoop
MapReduce remains the most popular open-source implementation of the MapReduce
model. Two components of Apache YARN are central to this design. The
ResourceManager is unique for the whole cluster; its main task is to grant
resources and balance the cluster load. The NodeManager runs on every compute
node; its main task is to start and monitor the containers assigned to it
(Shetty et al., 2019).
Study the technical details of Apache YARN and describe (with
appropriate diagrams) its process flow illustrating task parallelism function.
This section discusses the technical details of Apache YARN and its process
flow, beginning with its main features. Apache Hadoop YARN sits between HDFS
and the processing engines that are used to run applications on the cluster. It
combines a central resource manager with containers, application coordinators
and node-level agents that monitor processing operations on the individual
cluster nodes. YARN can allocate resources to applications dynamically as
needed, a capability designed to improve resource utilization and application
performance compared with the more static allocation approach of MapReduce
(DATAFLAIR TEAM, 2019).
YARN supports multiple scheduling methods, all based on a queue format for
submitting processing jobs. The FIFO scheduler runs applications in first-in,
first-out order, as its name suggests. However, this may not be optimal for
clusters that are shared by multiple users. Hadoop's pluggable Fair Scheduler
instead assigns each job running at the same time its fair share of the cluster
resources, based on a weighting metric that the scheduler calculates. YARN also
includes a reservation system that lets users reserve cluster resources in
advance for important processing jobs, to make sure that those jobs run
smoothly. To avoid overloading a cluster with reservations, IT managers can
limit the amount of resources that individual users may reserve and can set
automated policies that reject reservation requests exceeding those limits
(DATAFLAIR TEAM, 2019).
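
The three scheduling ideas described above can be illustrated with a small
Python sketch. It is a simplified model, not YARN's scheduler code: the
function names, the job names and the single numeric "resource" are assumptions
made only for this illustration.

```python
def fifo_start_times(jobs):
    """FIFO: each (name, duration) job starts only when all earlier jobs finish."""
    clock = 0
    starts = {}
    for name, duration in jobs:
        starts[name] = clock      # job waits for everything queued before it
        clock += duration
    return starts


def fair_shares(total_resources, weights):
    """Fair sharing: split resources among running jobs in proportion to weight."""
    total_weight = sum(weights.values())
    return {job: total_resources * w / total_weight for job, w in weights.items()}


def accept_reservation(requested, already_reserved, per_user_limit):
    """Reservation policy: reject a request that would exceed the user's limit."""
    return already_reserved + requested <= per_user_limit


if __name__ == "__main__":
    # Under FIFO the one-minute query is stuck behind the hour-long job.
    print(fifo_start_times([("long-etl", 60), ("quick-query", 1)]))
    # With fair sharing both jobs receive resources at the same time.
    print(fair_shares(100, {"long-etl": 1, "quick-query": 1}))
    print(accept_reservation(requested=30, already_reserved=50, per_user_limit=64))
```

The FIFO result shows why that policy suits a shared cluster poorly: a short
job queued behind a long one cannot start until the long one has finished,
whereas under fair sharing both run concurrently on their share of resources.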
Components of Apache YARN and Apache Hadoop
This section describes the key components of Apache Hadoop YARN, and a figure
is given to show how they relate. YARN decentralizes the execution and
monitoring of processing jobs by separating the responsibilities among the
following components. The global ResourceManager accepts job submissions from
users, schedules the jobs and allocates resources to them. A NodeManager slave
is installed on every node and functions as a monitoring and reporting agent
of the ResourceManager. An ApplicationMaster is created for every application;
it negotiates for resources and works with the NodeManagers to execute and
monitor the tasks. Resource containers are controlled by the NodeManagers and
hold the system resources allocated to individual applications (Chiang and
Dawson, 2015).
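
The process flow among these components, and the task parallelism it enables,
can be sketched as a toy simulation. The classes below borrow the component
names from the text, but their methods (`allocate`, `release`, `launch`, `run`)
and the notion of "slots" are hypothetical simplifications and do not
correspond to the real YARN API. Threads stand in for containers running on
different nodes.

```python
from concurrent.futures import ThreadPoolExecutor


class NodeManager:
    """Per-node agent: hosts containers and runs the work given to them."""

    def __init__(self, name, slots):
        self.name = name
        self.free_slots = slots   # how many more containers this node can host

    def launch(self, task, arg):
        # A "container" is modelled as a plain function call on this node.
        return self.name, task(arg)


class ResourceManager:
    """Cluster-wide arbiter: grants and reclaims container slots."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, count):
        granted = []
        for node in self.nodes:
            while node.free_slots > 0 and len(granted) < count:
                node.free_slots -= 1
                granted.append(node)
        return granted

    def release(self, containers):
        for node in containers:
            node.free_slots += 1


class ApplicationMaster:
    """Per-application coordinator: negotiates containers, then runs tasks."""

    def __init__(self, rm):
        self.rm = rm

    def run(self, task, inputs):
        if not inputs:
            return []
        containers = self.rm.allocate(len(inputs))   # step 1: negotiate resources
        try:
            if len(containers) < len(inputs):
                raise RuntimeError("not enough free containers")
            # step 2: launch one task per container; the tasks run in parallel
            with ThreadPoolExecutor(max_workers=len(containers)) as pool:
                futures = [pool.submit(nm.launch, task, x)
                           for nm, x in zip(containers, inputs)]
                return [f.result() for f in futures]  # step 3: collect results
        finally:
            self.rm.release(containers)               # step 4: return the slots


if __name__ == "__main__":
    nodes = [NodeManager("node-a", 2), NodeManager("node-b", 2)]
    master = ApplicationMaster(ResourceManager(nodes))
    print(master.run(lambda x: x * x, [1, 2, 3]))
```

The point of the sketch is the division of labour: the ResourceManager only
hands out capacity, while the per-application ApplicationMaster decides what
runs in each container, so several independent tasks can proceed at once.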
Compare the parallelism functions of Apache YARN and MapReduce. Give a
critique of MapReduce in terms of its strengths, weaknesses, and
application areas.
YARN is a newer framework for managing cluster resources such as CPU and
memory. It provides the essential APIs and daemons needed to develop any type
of distributed application. Its architecture combines a central resource
manager with containers, application coordinators and node-level agents that
monitor processing operations on the individual cluster nodes. YARN schedules
and manages the resource requests that come from applications and helps the
resulting processes to execute. YARN is therefore a generic platform for
running any distributed application, and MapReduce version 2 is one such
distributed application that runs on top of YARN. MapReduce, by contrast, is
the processing unit of Hadoop. Applications running on YARN process data in
parallel within the distributed environment and store their results in HDFS.
MapReduce in particular works on very large volumes of data, and data held
this way is easier to retrieve than from conventional storage (Janbask
Training, 2018).
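
To make the MapReduce side of the comparison concrete, the following is a
minimal single-machine sketch of the map, shuffle and reduce phases, using word
counting as the example. It illustrates the programming model only: it is not
Hadoop code, the function names are chosen for this illustration, and Python
threads are used merely to show which steps are independent of each other.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(mapped):
    """Shuffle: group every emitted value under its key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single count."""
    return key, sum(values)


def word_count(splits):
    with ThreadPoolExecutor() as pool:
        # Each split is mapped independently, so map tasks can run in parallel.
        mapped = list(pool.map(map_phase, splits))
        # All map output must be grouped before any reduce task can start.
        groups = shuffle(mapped)
        # Each key is reduced independently, so reduce tasks can run in parallel.
        reduced = pool.map(lambda item: reduce_phase(*item), groups.items())
        return dict(reduced)


if __name__ == "__main__":
    print(word_count(["yarn runs jobs", "jobs run on yarn"]))
```

The fixed map, shuffle, reduce shape is what makes MapReduce easy to
parallelize for batch workloads, and it is also why the model is less generic
than YARN, which places no such structure on the applications it runs.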
Comparison between YARN and MapReduce
1. Hadoop 1 has two components: the first is MapReduce and the second is
   HDFS. Hadoop 2 also has two components: the first is YARN/MRv2, generally
   known as MapReduce version 2, and the second is HDFS.
2. In Hadoop 1, when the MapReduce master stops working, all of its slave
   nodes stop working automatically. This is one scenario in which job
   execution may be interrupted, and it is known as a single point of failure.
   The YARN-based architecture of Hadoop 2 overcomes this issue through the
   concept of an active name node and a standby name node. When the active
   node stops working, the passive (standby) node takes over as the active
   node and continues the execution.
3. Hadoop version 1 has a single master and multiple slaves. If the master
   goes down, all of the slave nodes stop working, because the master is a
   single point of failure. Hadoop version 2, which is based on the YARN
   architecture, has the concept of multiple masters and slaves. If one master
   goes down, another master resumes its processes and continues the
   execution.
4. The main difference between the Hadoop version 1 and Hadoop version 2
   ecosystems can be seen in the diagram illustrated below. Component-wise,
   YARN's resource management interacts with both HDFS and MapReduce (EDUCBA,
   2019).
Basis for comparison | MapReduce | Apache YARN |
Version | Introduced in Hadoop 1.0. | Introduced in Hadoop 2.0. |
Meaning | MapReduce is self-defined. | YARN stands for Yet Another Resource Negotiator. |
Execution model | Less generic than YARN. | The YARN execution model is more generic than MapReduce. |
Architecture | In the earlier version, MR1, YARN is not present. In its place, the job tracker and task tracker handle the execution of applications or jobs. | YARN is introduced in MR2. In place of the job tracker and task tracker, the ApplicationMaster comes into the picture. |
Responsibility | Earlier, MapReduce was responsible for resource management as well as data processing. | Now YARN is responsible for the resource management part. |
Size | The default HDFS block size in Hadoop 1 (MapReduce) is 64 MB. | The default HDFS block size in Hadoop 2 (YARN) is 128 MB. |
Daemons | A Hadoop 1 (MapReduce) cluster runs the NameNode, DataNode, Secondary NameNode, job tracker and task tracker. | A Hadoop 2 (YARN) cluster runs the NameNode, DataNode, Secondary NameNode, ResourceManager and NodeManager. |
Application execution | MapReduce can execute only applications based on its own model. | YARN can also execute applications that do not follow the MapReduce model. |
Flexibility | Less scalable than YARN. | YARN is more isolated and scalable. |
Limitation | Single point of failure, low resource utilization (max of 4200 clusters by Yahoo) and less scalability compared with YARN. | There is no single point of failure in YARN because it has multiple masters: if one fails, another master picks up and resumes the execution. |
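
The single point of failure contrasted in the comparison above can be shown
with a very small model. The `Master` and `Cluster` classes and the master
names are hypothetical and exist only for this illustration. How a real
cluster decides which master is active is outside the scope of this sketch,
which simply picks the first master that is still alive.

```python
class Master:
    """A master daemon that is either alive or has failed."""

    def __init__(self, name):
        self.name = name
        self.alive = True


class Cluster:
    """One active master plus zero or more standby masters."""

    def __init__(self, masters):
        self.masters = list(masters)

    def active(self):
        # The first master that is still alive acts as the active one.
        for master in self.masters:
            if master.alive:
                return master
        return None

    def run_job(self, job):
        master = self.active()
        if master is None:
            raise RuntimeError(f"no master available, job {job!r} cannot run")
        return f"{job} scheduled by {master.name}"


if __name__ == "__main__":
    # Hadoop 1 style: a single master, so its failure stops all work.
    single = Cluster([Master("jobtracker")])
    single.masters[0].alive = False
    try:
        single.run_job("wordcount")
    except RuntimeError as err:
        print(err)

    # Hadoop 2 style: a standby master takes over when the active one fails.
    paired = Cluster([Master("master-1"), Master("master-2")])
    paired.masters[0].alive = False
    print(paired.run_job("wordcount"))
```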
References on Apache YARN and Apache Hadoop
Chiang, R. and Dawson, D. (2015) Untangling Apache Hadoop YARN, Part 1: Cluster and YARN Basics, [Online], Available: https://blog.cloudera.com/untangling-apache-hadoop-yarn-part-1-cluster-and-yarn-basics/ [16 December 2019].
DATAFLAIR TEAM (2019) Hadoop Architecture in Detail – HDFS, Yarn & MapReduce, [Online], Available: https://data-flair.training/blogs/hadoop-architecture/ [16 December 2019].
EDUCBA (2019) Learn The 10 Best Difference Between MapReduce vs Yarn, [Online], Available: https://www.educba.com/mapreduce-vs-yarn/ [16 December 2019].
Iconiq Inc. (2016) Top 6 Hadoop Vendors providing Big Data Solutions in Open Data Platform, [Online], Available: https://www.dezyre.com/article/top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93 [16 December 2019].
Janbask Training (2018) An Introduction and Differences Between YARN and MapReduce, [Online], Available: https://www.janbasktraining.com/blog/yarn-vs-mapreduce/ [16 December 2019].
Shetty, N.R., Patnaik, L.M., Nagaraj, H.C., Hamsavath, P.N. and Nalini, N. (2019) Emerging Research in Computing, Information, Communication and Applications: ERCICA 2018, Volume 1, Springer.