YARN

Apache YARN (Yet Another Resource Negotiator) is Hadoop’s cluster resource management system. YARN was introduced in Hadoop 2 to improve the MapReduce implementation, but it is general enough to support other distributed computing paradigms as well.

YARN provides APIs for requesting and working with cluster resources, but these APIs are not typically used directly by user code.

Instead, users write to higher-level APIs provided by distributed computing frameworks, which themselves are built on YARN and hide the resource management details from the user.

Pig and Hive are examples of processing frameworks that run on MapReduce, Spark, or Tez (or on all three), which in turn run on YARN, so they don’t interact with YARN directly.

The YARN cluster architecture is a master-slave framework like HDFS: a master node daemon called the ResourceManager, and one or more slave node daemons called NodeManagers that run on the worker (slave) nodes in the cluster.

ResourceManager

The ResourceManager is responsible for granting cluster compute resources to applications running on the cluster.

Resources are granted in units called containers, which are predefined combinations of CPU cores and memory.
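To make the idea of a container as a fixed bundle of CPU and memory concrete, here is a minimal Python sketch. It is a toy model, not YARN's API: the `Resource` class, its method names, and the node and container sizes are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A bundle of compute capacity: CPU cores plus memory in MB (hypothetical type)."""
    vcores: int
    memory_mb: int

    def fits_in(self, other: "Resource") -> bool:
        # A request fits only if it fits in every dimension.
        return self.vcores <= other.vcores and self.memory_mb <= other.memory_mb

    def minus(self, other: "Resource") -> "Resource":
        return Resource(self.vcores - other.vcores, self.memory_mb - other.memory_mb)


# Illustrative numbers only, not defaults of any real cluster.
node_capacity = Resource(vcores=8, memory_mb=16384)
container = Resource(vcores=2, memory_mb=4096)

# Count how many identical containers fit on the node.
free, granted = node_capacity, 0
while container.fits_in(free):
    free = free.minus(container)
    granted += 1

print(granted, free)
```

The point of the sketch is that a container is granted only when both dimensions are available, so whichever resource runs out first limits how many containers a node can host.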

The ResourceManager also tracks the available capacity on the cluster, which changes as applications finish and release their reserved resources, and the status of the applications running on the cluster.

Clients submit applications to the ResourceManager. The ResourceManager then allocates the first container on an available NodeManager in the cluster as a delegate process for the application, called the ApplicationMaster. The ApplicationMaster then negotiates all of the further containers required to run the application.
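The bookkeeping described in this section can be sketched as a small in-memory model. This is a simplification under stated assumptions, not the real ResourceManager: the class `ToyResourceManager`, its methods, the node names, and the memory sizes are hypothetical, and only memory is tracked.

```python
class ToyResourceManager:
    """Tracks free memory per node and which containers are allocated (toy model)."""

    def __init__(self, node_memory_mb):
        self.free = dict(node_memory_mb)   # node name -> free memory in MB
        self.containers = {}               # container id -> (node, memory in MB)
        self.app_masters = {}              # application name -> container id
        self._next_id = 0

    def allocate(self, memory_mb):
        # Grant a container on the first node with enough free memory.
        for node, free_mb in self.free.items():
            if free_mb >= memory_mb:
                self.free[node] -= memory_mb
                self._next_id += 1
                cid = f"container_{self._next_id}"
                self.containers[cid] = (node, memory_mb)
                return cid
        return None  # no node can satisfy the request right now

    def release(self, cid):
        # Return a finished container's memory to its node.
        node, memory_mb = self.containers.pop(cid)
        self.free[node] += memory_mb

    def submit_application(self, app_name, am_memory_mb):
        # The first container of an application hosts its ApplicationMaster.
        cid = self.allocate(am_memory_mb)
        if cid is not None:
            self.app_masters[app_name] = cid
        return cid


rm = ToyResourceManager({"node1": 4096, "node2": 4096})
am = rm.submit_application("wordcount", 1024)
# The ApplicationMaster then negotiates further containers for its tasks.
workers = [rm.allocate(2048), rm.allocate(2048)]
print(rm.free)
for cid in workers + [am]:
    rm.release(cid)
print(rm.free)
```

A first-fit placement is used here only because it is the shortest policy to write down; real schedulers apply queue and fairness policies that this sketch does not attempt to model.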

NodeManager Daemons

The NodeManager is the YARN daemon that runs on each slave node and manages the containers on that host.

Containers execute the tasks involved in an application.

As Hadoop’s approach to solving large problems is to “divide and conquer,” a large problem is decomposed into a set of tasks, many of which can be run in parallel.

These tasks are run in containers on hosts running the NodeManager process.
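The "divide and conquer" idea can be illustrated on a single machine. In the sketch below, a thread pool stands in for containers on NodeManager hosts; that substitution, the `count_words` helper, and the sample lines are assumptions made for illustration, not part of YARN.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk):
    """One task: count the words in its own slice of the input."""
    return sum(len(line.split()) for line in chunk)


lines = [
    "the quick brown fox",
    "jumps over",
    "the lazy dog",
    "again and again",
]

# Decompose the problem: each chunk of two lines becomes an independent task.
chunks = [lines[i:i + 2] for i in range(0, len(lines), 2)]

# Run the tasks in parallel; each worker thread plays the role of a container.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_words, chunks))

# Combine the partial results into the final answer.
total = sum(partial_counts)
print(partial_counts, total)
```

Because the tasks share no state, they can run in any order and on any worker, which is what makes this kind of decomposition suitable for a cluster.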

The ApplicationMaster

The ApplicationMaster is the first container allocated by the ResourceManager to run on a NodeManager for an application.

Its job is to plan the application, including determining what resources are required—often based upon how much data is being processed—and to work out resourcing for application stages.

The ApplicationMaster requests these resources from the ResourceManager on behalf of the application.

The ResourceManager grants resources on the same or other NodeManagers to the ApplicationMaster, to use within the lifetime of the specific application.

The ApplicationMaster also monitors progress of tasks, stages (groups of tasks that can be performed in parallel), and dependencies.

It reports this summary information to the ResourceManager.
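As a rough sketch of the planning and monitoring described above, the following Python code sizes an application from its input and collapses task states into a summary. The one-task-per-split rule, the 128 MB split size, and the function names `plan_tasks` and `summarize` are assumptions for illustration; a real ApplicationMaster is framework-specific and considerably more involved.

```python
import math

SPLIT_BYTES = 128 * 1024 * 1024  # assumed split size: one task per split


def plan_tasks(input_bytes, split_bytes=SPLIT_BYTES):
    """Number of tasks needed: one per input split, and at least one."""
    return max(1, math.ceil(input_bytes / split_bytes))


def summarize(task_states):
    """Collapse per-task states into a summary suitable for reporting upward."""
    done = sum(1 for state in task_states.values() if state == "DONE")
    return {
        "tasks": len(task_states),
        "done": done,
        "progress": done / len(task_states),
    }


tasks_needed = plan_tasks(1024 ** 3)  # 1 GiB of input
states = {f"task_{i}": "PENDING" for i in range(tasks_needed)}
states["task_0"] = "DONE"
states["task_1"] = "DONE"
summary = summarize(states)
print(tasks_needed, summary)
```

Under these assumptions, the amount of data drives the number of containers requested, and the ResourceManager only ever sees the aggregated summary rather than the state of every task.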

How YARN runs an application

  1. A client contacts the resource manager and asks it to run an application master process.
  2. The resource manager then finds a node manager that can launch the application master in a container.
  3. The application master then either runs a computation in the container it is running in and returns the result to the client, or requests more containers from the resource manager and uses them to run a distributed computation.
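The steps above can be traced with a short simulation. It is a sketch only: `run_application`, the container counts, and the choice of summing numbers as the "computation" are hypothetical, and no real YARN calls are involved.

```python
def run_application(work_items, free_containers):
    """Trace the submission steps and return (result, log of messages)."""
    if free_containers < 1:
        raise RuntimeError("no container available for the application master")

    log = ["client -> resource manager: run application master"]
    log.append("resource manager -> node manager: launch application master")
    free_containers -= 1  # the application master occupies one container

    if len(work_items) <= 1 or free_containers == 0:
        # Simple case: compute inside the application master's own container.
        log.append("application master: compute locally")
        result = sum(work_items)
    else:
        # Distributed case: ask for more containers and split the work.
        n = min(len(work_items), free_containers)
        log.append(f"application master -> resource manager: request {n} containers")
        shares = [work_items[i::n] for i in range(n)]  # round-robin split
        result = sum(sum(share) for share in shares)   # each share is one task

    log.append("application master -> client: result")
    return result, log


result, log = run_application([1, 2, 3, 4], free_containers=3)
print(result)
for message in log:
    print(message)
```

Both branches return the same answer for the same input; the difference is only where the work runs, which mirrors the choice the application master makes in the last step.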
