When a Spark application runs on YARN, an Application Master process is created first; in cluster deploy mode, this is the container that runs the Spark Driver.
The Spark Driver then negotiates with YARN's ResourceManager for the resources needed to run the application.
Once the resources are granted, it asks the YARN NodeManagers to launch containers for the Spark Executors.
After that, the Spark Executors are the processes that are assigned tasks and actually execute them.
Note that it is the SparkContext (inside the driver) that actually submits jobs (sets of tasks) to the executors.
https://www.slideshare.net/datamantra/spark-on-yarn-54201193
https://spark.apache.org/docs/latest/cluster-overview.html
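As a concrete illustration, a minimal Spark application in Scala might look like the sketch below. The class name, resource values, and input path are placeholders; when packaged as a JAR and launched with `spark-submit --master yarn --deploy-mode cluster`, YARN runs this driver code inside the Application Master container.

```scala
import org.apache.spark.sql.SparkSession

// Minimal driver sketch. Submitted (hypothetically) with:
//   spark-submit --master yarn --deploy-mode cluster \
//     --class WordCount --num-executors 4 --executor-memory 2g wordcount.jar
// In cluster deploy mode, YARN runs this main() inside the Application Master container.
object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WordCount")   // name shown in the YARN ResourceManager UI
      .getOrCreate()          // master / deploy mode come from spark-submit
    val sc = spark.sparkContext

    // The SparkContext turns the action below into a job and submits its
    // tasks to the executors launched by YARN's NodeManagers.
    val counts = sc.textFile("hdfs:///tmp/input.txt")  // placeholder input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```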
The steps involved in launching a Spark application when running Spark in standalone mode are explained in detail at the link below.
https://www.samsungsds.com/global/ko/support/insights/Spark-Cluster-job-server.html
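For comparison, from the driver's point of view the main thing that changes in standalone mode is the master URL, which points at the standalone Master process instead of YARN. A minimal sketch (host and port are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Standalone mode: the master URL targets the standalone Master process
// instead of YARN. "master-host:7077" is a placeholder for your cluster.
val spark = SparkSession.builder()
  .appName("StandaloneExample")
  .master("spark://master-host:7077")
  .getOrCreate()
```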
The following covers the concepts needed to understand the explanation above.
The workflow of the Spark architecture was illustrated in the source article as an infographic ("Spark Architecture Infographic"); its steps are reproduced below:
STEP 1: The client submits the Spark user application code. When the application code is submitted, the driver implicitly converts the user code containing transformations and actions into a logical directed acyclic graph (DAG). At this stage, it also performs optimizations such as pipelining transformations.
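For example, transformations are lazy, and the lineage (logical DAG) the driver builds can be inspected with `RDD.toDebugString`. A small sketch, assuming a `spark` session as above:

```scala
// Transformations are lazy: nothing runs until an action is called.
// map followed by filter has no shuffle between them, so Spark can
// pipeline both into the same stage.
val nums    = spark.sparkContext.parallelize(1 to 1000)
val squared = nums.map(n => n * n)        // transformation (lazy)
val evens   = squared.filter(_ % 2 == 0)  // transformation (lazy)

println(evens.toDebugString)  // prints the lineage / logical DAG
val total = evens.count()     // action: triggers the actual job
```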
STEP 2: After that, it converts the logical DAG into a physical execution plan made up of stages. Within each stage it creates physical execution units called tasks. The tasks are then bundled and sent to the cluster.
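Stage boundaries correspond to shuffles. In the sketch below (again assuming a `spark` session; the input path is a placeholder), `reduceByKey` introduces a shuffle, so the plan is split into two stages, and each stage runs as one task per partition:

```scala
val words = spark.sparkContext.textFile("hdfs:///tmp/input.txt", minPartitions = 4)

// Stage 1: textFile -> flatMap -> map are pipelined together (no shuffle).
// Stage 2: reduceByKey requires a shuffle, so it starts a new stage.
val counts = words
  .flatMap(_.split("\\s+"))
  .map((_, 1L))
  .reduceByKey(_ + _)

// toDebugString shows the stage split via indentation and the ShuffledRDD.
println(counts.toDebugString)
counts.count()  // 4 input partitions => Stage 1 runs as 4 tasks
```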
STEP 3: Now the driver talks to the cluster manager and negotiates resources. The cluster manager launches executors on worker nodes on behalf of the driver. At this point, the driver sends tasks to the executors based on data placement. When the executors start, they register themselves with the driver, so the driver has a complete view of the executors that are executing its tasks.
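The resources the driver negotiates for can be controlled when the application is configured; the values below are illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative resource settings; the cluster manager (e.g. YARN) uses them
// when launching executor containers on the worker nodes.
val spark = SparkSession.builder()
  .appName("ResourceExample")
  .config("spark.executor.instances", "4")  // number of executors to request
  .config("spark.executor.memory", "2g")    // heap per executor container
  .config("spark.executor.cores", "2")      // task slots per executor
  .getOrCreate()
```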
STEP 4: During the course of task execution, the driver program monitors the set of executors that are running. The driver also schedules future tasks based on data placement.
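The driver's view of its registered executors can also be inspected programmatically. A sketch using the public `SparkStatusTracker` API, assuming a running `spark` session:

```scala
// The driver keeps track of every registered executor; SparkStatusTracker
// exposes that view. Each entry reports the executor's host, port, cached
// data size, and number of currently running tasks.
val tracker = spark.sparkContext.statusTracker
tracker.getExecutorInfos.foreach { e =>
  println(s"executor at ${e.host}:${e.port}, " +
          s"running tasks = ${e.numRunningTasks}, cache = ${e.cacheSize} bytes")
}
```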
https://www.edureka.co/blog/spark-architecture/
The slides below walk through the Spark architecture and its interaction with YARN step by step, with diagrams. Highly recommended.
https://www.slideshare.net/FerranGalReniu/yarn-by-default-spark-on-yarn