Essay on Apache YARN and Apache Hadoop

Category: Education Paper Type: Essay Writing Reference: APA Words: 2150

Describe the background of Apache YARN, especially its origin.

Apache Hadoop is a collection of open-source utilities that uses a network of many computers to solve problems involving large amounts of data and computation. It provides a software framework for the distributed storage and processing of big data using the MapReduce programming model. Hadoop manages both the processing and the storage of data for big data applications running on clustered systems. It sits at the center of a growing ecosystem of big data technologies that are widely used to support advanced analytics initiatives, including predictive analytics, data mining, and machine learning applications. Hadoop can handle many kinds of structured and unstructured data, which gives users more flexibility to collect, process, and analyze data than data warehouses and relational databases provide (Shetty et al., 2019).
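The MapReduce programming model mentioned above can be illustrated with a minimal single-process sketch in Python. This is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions chosen only to show the three phases that a real cluster would spread over many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical two-line input standing in for a large distributed dataset.
counts = word_count(["big data on Hadoop", "Hadoop runs big jobs"])
```

In a real deployment each call to `map_phase` and `reduce_phase` could run on a different node, which is where the parallelism comes from.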

Hadoop is developed as an open-source project within the Apache Software Foundation. Four primary big data platform vendors currently offer commercial distributions of Hadoop: Amazon Web Services, Cloudera, Hortonworks, and MapR Technologies. In addition, Google, Microsoft, and other vendors offer cloud-based managed services built on top of Hadoop and related technologies. The addition of YARN significantly expanded Hadoop's potential uses. The original incarnation of Hadoop closely paired the Hadoop Distributed File System (HDFS) with the batch-oriented MapReduce programming framework and processing engine, which also functioned as the platform's resource manager and job scheduler. As a result, the system could run only MapReduce applications, a limitation that YARN eliminated (Iconiq Inc., 2016).

Before it received its official name, YARN was informally known as MapReduce 2 or NextGen MapReduce. YARN introduced a new approach that decouples cluster resource management and scheduling from the data processing component of MapReduce. This enables Hadoop to support a broader range of processing types and applications: Hadoop clusters can run interactive querying, streaming data, and real-time analytics applications on Apache Spark and other processing engines simultaneously with MapReduce batch jobs. Apache Hadoop MapReduce remains the most popular open-source implementation of the MapReduce model. YARN has two main components. The ResourceManager is unique for the whole cluster; its main task is to grant resources to applications and to balance the cluster load. The NodeManager runs on every compute node; its main task is to start and monitor the containers assigned to it (Shetty et al., 2019).
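The division of labor between the ResourceManager and the NodeManagers can be sketched as a toy model in Python. The classes below only borrow the names of the real daemons; their methods (`free_mb`, `launch`, `allocate`), the memory sizes, and the "most free memory first" placement rule are assumptions made for illustration, not the actual YARN API or scheduling policy.

```python
from dataclasses import dataclass, field

@dataclass
class NodeManager:
    """Hypothetical per-node agent: starts and tracks the containers assigned to it."""
    name: str
    capacity_mb: int
    containers: list = field(default_factory=list)

    def free_mb(self):
        return self.capacity_mb - sum(size for _, size in self.containers)

    def launch(self, app_id, size_mb):
        self.containers.append((app_id, size_mb))

class ResourceManager:
    """Hypothetical cluster-wide arbiter: a single instance for the whole cluster."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, size_mb):
        # Balance the load by choosing the node with the most free memory.
        node = max(self.nodes, key=lambda n: n.free_mb())
        if node.free_mb() < size_mb:
            return None  # the request cannot be granted right now
        node.launch(app_id, size_mb)
        return node.name

rm = ResourceManager([NodeManager("node-1", 4096), NodeManager("node-2", 4096)])
# Request five 2048 MB containers; only four fit into the toy cluster.
placements = [rm.allocate("app-1", 2048) for _ in range(5)]
```

The requests alternate between the two nodes until both are full, after which the fifth request is refused.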

Study the technical details of Apache YARN and describe (with appropriate diagrams) its process flow illustrating task parallelism function.

This section covers the technical details of Apache YARN and its process flow, starting with its main features. YARN sits between HDFS and the processing engines used to run applications in the cluster architecture. It combines a central resource manager with containers, application coordinators, and node-level agents that monitor processing operations on individual cluster nodes. YARN can allocate resources to applications dynamically as they are needed, a capability designed to improve resource utilization and application performance compared with the more static allocation approach of MapReduce (DATAFLAIR TEAM, 2019).

YARN supports multiple scheduling methods, all based on a queue format for submitting processing jobs. As its name indicates, the FIFO Scheduler runs applications on a first-in, first-out basis. However, that may not be optimal for clusters shared by multiple users. The pluggable Fair Scheduler instead gives each job running at the same time its fair share of cluster resources, based on a weighting metric that the scheduler calculates. YARN also includes a Reservation System feature that lets users reserve cluster resources in advance for important processing jobs, so that those jobs run smoothly. To avoid overloading a cluster with reservations, IT managers can limit the amount of resources that individual users can reserve and set automated policies that reject reservation requests exceeding the limits (DATAFLAIR TEAM, 2019).
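The difference between first-in, first-out and fair-share scheduling can be made concrete with a small simulation. The sketch below is a simplification under stated assumptions: resources are counted in whole containers, the job names and demands are invented, and `fair_allocate` uses a simple "lowest allocation-to-weight ratio first" rule rather than the real Fair Scheduler algorithm.

```python
def fifo_allocate(capacity, demands):
    """FIFO: serve jobs strictly in submission order until capacity runs out."""
    allocation = {}
    for job, demand in demands:  # demands is an ordered list of (job, containers)
        granted = min(demand, capacity)
        allocation[job] = granted
        capacity -= granted
    return allocation

def fair_allocate(capacity, demands, weights=None):
    """Fair share: hand out one container at a time to the unsatisfied job
    with the lowest allocation-to-weight ratio."""
    weights = weights or {job: 1 for job, _ in demands}
    want = dict(demands)
    allocation = {job: 0 for job in want}
    while capacity > 0:
        hungry = [j for j in want if allocation[j] < want[j]]
        if not hungry:
            break  # every job is fully satisfied
        job = min(hungry, key=lambda j: allocation[j] / weights[j])
        allocation[job] += 1
        capacity -= 1
    return allocation

# Hypothetical workload: three jobs competing for 10 containers.
jobs = [("etl", 8), ("report", 4), ("adhoc", 4)]
fifo = fifo_allocate(10, jobs)   # {'etl': 8, 'report': 2, 'adhoc': 0}
fair = fair_allocate(10, jobs)   # {'etl': 4, 'report': 3, 'adhoc': 3}
```

Under FIFO the last job receives nothing until earlier jobs finish, which is why that policy tends to suit single-user clusters better than shared ones.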

Components of Apache YARN and Apache Hadoop

The key components of Apache Hadoop YARN are discussed in this section and shown in the accompanying figure. YARN decentralizes the execution and monitoring of processing jobs by separating the responsibilities among the following components. A global ResourceManager accepts job submissions from users, schedules the jobs, and allocates resources to them. An ApplicationMaster is created for each application to negotiate for resources and to work with the NodeManagers to execute and monitor tasks. A NodeManager agent is installed on every slave node and functions as a monitoring and reporting agent of the ResourceManager. Resource containers are controlled by the NodeManagers and assign the allocated system resources to individual applications (Chiang and Dawson, 2015).
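The process flow described above, in which an ApplicationMaster splits a job into tasks that run in parallel in containers, can be sketched in Python with a thread pool standing in for the containers. Everything here is an assumption made for illustration: the function names, the summing workload, and the use of `ThreadPoolExecutor` in place of containers launched by NodeManagers.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(split):
    """One task in one container: here it simply sums the numbers in its split."""
    return sum(split)

def application_master(data, num_containers):
    """Hypothetical ApplicationMaster: split the job, run tasks in parallel, combine."""
    # 1. Divide the input into at most one split per granted container.
    size = max(1, -(-len(data) // num_containers))  # ceiling division
    splits = [data[i:i + size] for i in range(0, len(data), size)]
    # 2. Each worker thread stands in for a container on some node.
    with ThreadPoolExecutor(max_workers=num_containers) as containers:
        partials = list(containers.map(run_task, splits))
    # 3. Aggregate the partial results into the final answer.
    return partials, sum(partials)

# Sum the numbers 1..100 using four simulated containers.
partials, total = application_master(list(range(1, 101)), num_containers=4)
```

Because the tasks share no state, they can run in any order or at the same time, which is the essence of task parallelism in this model.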

Compare the parallelism functions of Apache YARN and MapReduce. Give a critique of MapReduce in terms of its strengths, weaknesses, and application areas.

YARN is a framework for managing cluster resources such as CPU and memory, and its addition significantly expanded Hadoop's potential uses. It provides the APIs and daemons needed to develop any type of distributed application: YARN schedules and manages the resource requests that come from an application and helps the process execute them. As described earlier, it combines a central ResourceManager, which is unique for the whole cluster, with application coordinators, containers, and node-level agents that monitor processing on individual nodes. YARN is therefore a generic platform for running any distributed application, and MapReduce version 2 is one such distributed application that runs on top of YARN, while MapReduce remains the processing unit of Hadoop. With the pluggable Fair Scheduler, each job running at the same time can receive its fair share of cluster resources, and IT managers can limit the resources that individual users reserve. Running on YARN, MapReduce processes data in parallel across the distributed environment and saves its results in HDFS, so it works well on large volumes of data, and retrieving that data is easier than with typical storage (Janbask Training, 2018).

Comparison between YARN and MapReduce

1. Hadoop 1 has two components: MapReduce and HDFS. Hadoop 2 also has two components: YARN/MRv2, generally known as MapReduce version 2, and HDFS.

2. In Hadoop 1, all slave nodes stop working when the MapReduce master halts, so job execution can be interrupted; this is known as a single point of failure. The Hadoop 2 architecture with YARN overcomes this issue through the concept of an active name node and a standby name node. When the active node stops working, the passive (standby) node takes over as the active node and continues the execution.

3. Hadoop version 1 has a single-master, multiple-slave architecture: if the master goes down, all slave nodes stop working, which is the single point of failure in Hadoop version 1. Hadoop version 2, based on the YARN architecture, follows a multiple-master, multiple-slave concept: if one master goes down, another master resumes its process and continues the execution.

4. The main differences between the two ecosystems, Hadoop version 1 and Hadoop version 2, are summarized in the comparison table in this section. In Hadoop version 2, YARN resource management interacts with both MapReduce and HDFS (EDUCBA, 2019).
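The failover behavior described in points 2 and 3 can be sketched as a toy active/standby pair. The `Master` and `MasterPair` classes are hypothetical and exist only for this illustration; real Hadoop high availability relies on additional machinery (such as shared state and failure detection) that is not modeled here.

```python
class Master:
    """Hypothetical master daemon that can be marked as failed."""
    def __init__(self, name):
        self.name = name
        self.alive = True

class MasterPair:
    """Hypothetical active/standby pair: the standby is promoted when the active fails."""
    def __init__(self, active, standby):
        self.active = active
        self.standby = standby

    def submit(self, job):
        if not self.active.alive and self.standby.alive:
            # Failover: swap roles so the former standby becomes the active master.
            self.active, self.standby = self.standby, self.active
        if not self.active.alive:
            raise RuntimeError("no master available: single point of failure")
        return f"{job} handled by {self.active.name}"

pair = MasterPair(Master("master-a"), Master("master-b"))
before = pair.submit("job-1")
pair.active.alive = False  # simulate a crash of the active master
after = pair.submit("job-2")
```

With only one master and no standby, the second submission would raise the error instead, which corresponds to the single point of failure attributed to Hadoop version 1.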

Basis for comparison	MapReduce	Apache YARN
Version	Introduced in Hadoop 1.0.	Introduced in Hadoop 2.0.
Meaning	The name MapReduce is self-explanatory.	YARN stands for Yet Another Resource Negotiator.
Execution model	Less generic than YARN.	More generic than MapReduce.
Architecture	In MR1 there was no YARN; a JobTracker and TaskTrackers handled the execution of applications or jobs.	YARN was introduced with MR2. In place of the JobTracker and TaskTrackers, the ResourceManager, NodeManagers, and per-application ApplicationMasters come into the picture.
Responsibility	Previously, MapReduce was responsible for both resource management and data processing.	YARN is now responsible for resource management.
Size	The default HDFS block size in Hadoop 1 (MapReduce) is 64 MB.	The default HDFS block size in Hadoop 2 (YARN) is 128 MB.
Daemons	NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker.	NameNode, DataNode, Secondary NameNode, ResourceManager, and NodeManager.
Application execution	MapReduce can execute only applications based on its own model.	YARN can also execute applications that do not follow the MapReduce model.
Flexibility	Less scalable than YARN.	More isolated and scalable.
Limitation	Single point of failure, low resource utilization (a reported maximum of about 4,200 nodes at Yahoo), and less scalability than YARN.	No single point of failure, because YARN supports multiple masters: if one fails, another picks up and resumes the execution.
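The 64 MB and 128 MB defaults in the comparison are HDFS block sizes, and their practical effect can be shown with a short calculation. The sketch assumes a hypothetical 1000 MB input file; the helper name `block_count` is invented for this example.

```python
import math

def block_count(file_size_mb, block_size_mb):
    """Number of HDFS blocks needed for a file (the last block may be partly filled)."""
    return math.ceil(file_size_mb / block_size_mb)

file_mb = 1000  # hypothetical 1000 MB input file
blocks_hadoop1 = block_count(file_mb, 64)   # 64 MB default block size -> 16 blocks
blocks_hadoop2 = block_count(file_mb, 128)  # 128 MB default block size -> 8 blocks
```

Since MapReduce by default creates roughly one map task per block, the larger block size halves the number of map tasks for the same file in this example, which reduces per-task scheduling overhead.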

References on Apache Yarn and Apache Hadoop

Chiang, R. and Dawson, D. (2015) Untangling Apache Hadoop YARN, Part 1: Cluster and YARN Basics, [Online], Available: https://blog.cloudera.com/untangling-apache-hadoop-yarn-part-1-cluster-and-yarn-basics/ [16 December 2019].

DATAFLAIR TEAM (2019) Hadoop Architecture in Detail – HDFS, Yarn & MapReduce, [Online], Available: https://data-flair.training/blogs/hadoop-architecture/ [16 December 2019].

EDUCBA (2019) Learn The 10 Best Difference Between MapReduce vs Yarn, [Online], Available: https://www.educba.com/mapreduce-vs-yarn/ [16 December 2019].

Iconiq Inc. (2016) Top 6 Hadoop Vendors providing Big Data Solutions in Open Data Platform, [Online], Available: https://www.dezyre.com/article/top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93 [16 December 2019].

Janbask Training (2018) An Introduction and Differences Between YARN and MapReduce, [Online], Available: https://www.janbasktraining.com/blog/yarn-vs-mapreduce/ [16 December 2019].

Shetty, N.R., Patnaik, L.M., Nagaraj, H.C., Hamsavath, P.N. and Nalini, N. (2019) Emerging Research in Computing, Information, Communication and Applications: ERCICA 2018, Volume 1, Springer.