Spark theory summary:
https://eyeballs.tistory.com/m/206
Spark questions are meant to test understanding of the underlying principles plus practical judgment.
Mentioning the Spark UI = a big jump in how credible your hands-on experience sounds.
You don't need to answer every question perfectly.
Being able to explain why you made each choice is enough.
| What is Apache Spark and why is it used? |
Apache Spark is a distributed data processing framework designed for large-scale data processing. It provides in-memory computation, fault tolerance, and a high-level API, which makes batch and iterative workloads much faster compared to traditional MapReduce. |
| What is the difference between RDD and DataFrame? |
RDD is a low-level API that provides fine-grained control but lacks automatic optimization. DataFrames are higher-level, schema-aware, and benefit from the Catalyst optimizer and the Tungsten execution engine, so they are generally preferred for most workloads. |
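A minimal PySpark sketch of the same aggregation in both APIs (the sample rows and column names are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

# Hypothetical (category, amount) records, used only for illustration.
rows = [("a", 1), ("b", 2), ("a", 3)]

# RDD API: fine-grained control, but Spark cannot optimize inside the lambda.
rdd = spark.sparkContext.parallelize(rows)
rdd_result = rdd.reduceByKey(lambda x, y: x + y).collect()

# DataFrame API: schema-aware, planned by Catalyst and executed on Tungsten.
df = spark.createDataFrame(rows, ["category", "amount"])
df_result = df.groupBy("category").sum("amount").collect()
```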
| What is lazy evaluation in Spark? |
Spark does not execute transformations immediately. Instead, it builds a logical execution plan and only triggers computation when an action is called, which allows Spark to optimize the execution plan. |
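For example, in this sketch (the data is made up) nothing runs until the action on the last line; the same snippet also illustrates the transformation vs. action distinction in the next answer:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lazy-eval").getOrCreate()
df = spark.createDataFrame([("a", 50), ("b", 150)], ["category", "amount"])

# Transformations only build up a logical plan; no job is launched here.
filtered = df.filter(F.col("amount") > 100)
projected = filtered.select("category", "amount")

# The action triggers optimization of the whole plan and the actual execution.
print(projected.count())
```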
| What is the difference between transformation and action? |
Transformations define how data should be processed and are lazily evaluated, while actions trigger the actual execution and return results or write data. |
| Explain Spark’s execution flow. |
When an action is called, Spark creates a job, which is divided into stages based on shuffle boundaries. Each stage consists of tasks that are executed in parallel on executors. |
| What is a shuffle and why is it expensive? |
A shuffle involves redistributing data across executors, usually during joins or aggregations. It is expensive because it requires disk I/O, network transfer, and serialization. |
| What causes shuffle in Spark? |
Operations like groupBy, reduceByKey, join, distinct, and repartition can trigger shuffle because they require data to be reorganized across partitions. |
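A small sketch contrasting a narrow transformation with a shuffle-inducing one (the data is made up); the Exchange node in the explain output marks the shuffle:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

# Narrow transformation: each output partition depends on a single input partition.
narrow = df.withColumn("doubled", F.col("value") * 2)

# Wide transformation: rows with the same key must be co-located,
# so data is exchanged across partitions (a shuffle).
wide = df.groupBy("key").agg(F.sum("value").alias("total"))

# The physical plan for the wide transformation contains an Exchange node.
wide.explain()
```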
| How do you optimize joins in Spark? |
I try to reduce shuffle by using broadcast joins when one dataset is small enough. I also ensure proper partitioning and avoid skewed join keys when possible. |
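A sketch of a broadcast join, assuming a hypothetical large `orders` table and a small `customers` dimension table:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

# Hypothetical data: in practice `orders` would be large and `customers` small.
orders = spark.createDataFrame(
    [(1, "a", 100), (2, "b", 200)], ["order_id", "cust_id", "amount"]
)
customers = spark.createDataFrame([("a", "Kim"), ("b", "Lee")], ["cust_id", "name"])

# Broadcasting the small side ships a full copy to every executor,
# so the large side can be joined locally without a shuffle.
joined = orders.join(F.broadcast(customers), on="cust_id", how="left")
joined.show()
```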
| What is data skew and how do you handle it? |
Data skew occurs when a few keys dominate the data distribution, causing some tasks to take much longer. Common approaches include salting keys, filtering hot keys, or using broadcast joins. |
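A rough sketch of key salting, assuming a skewed `key` column and a salt factor of 10 (both invented for the example):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salting").getOrCreate()
SALT_BUCKETS = 10  # tuning knob, picked arbitrarily here

left = spark.createDataFrame([("hot", 1), ("hot", 2), ("cold", 3)], ["key", "value"])
right = spark.createDataFrame([("hot", "x"), ("cold", "y")], ["key", "attr"])

# Add a random salt to the skewed side so one hot key is split across buckets.
left_salted = left.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Explode the other side so every (key, salt) combination still finds a match.
right_salted = right.withColumn(
    "salt", F.explode(F.array([F.lit(i) for i in range(SALT_BUCKETS)]))
)

joined = left_salted.join(right_salted, on=["key", "salt"]).drop("salt")
```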
| What is partitioning and why is it important? |
Partitioning determines how data is distributed across executors. Proper partitioning improves parallelism and resource utilization, while poor partitioning can lead to performance bottlenecks. |
| What is the difference between repartition and coalesce? |
repartition increases or decreases partitions and triggers a shuffle, while coalesce typically reduces partitions without a full shuffle. |
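For example (a small sketch using a synthetic dataset):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-coalesce").getOrCreate()
df = spark.range(1_000_000)  # synthetic example data

# repartition: full shuffle; can increase or decrease the partition count
# and redistributes rows evenly (optionally by a column).
evenly_spread = df.repartition(200)

# coalesce: merges existing partitions without a full shuffle; only suited
# to reducing the partition count, e.g. before writing output files.
fewer_files = df.coalesce(10)

print(evenly_spread.rdd.getNumPartitions(), fewer_files.rdd.getNumPartitions())
```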
| When would you cache or persist data? |
I cache data when it is reused multiple times across different actions, especially if the computation is expensive. I choose the storage level based on memory availability. |
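A minimal sketch of persisting a reused, expensive intermediate result (the data is made up):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder.appName("caching").getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

# An intermediate result that several actions below will reuse.
expensive = df.groupBy("key").agg(F.sum("value").alias("total"))

# MEMORY_AND_DISK spills to disk when the data does not fit in memory;
# plain cache() would use the default storage level instead.
expensive.persist(StorageLevel.MEMORY_AND_DISK)

print(expensive.count())  # the first action materializes the cache
expensive.show()          # later actions reuse the cached result

expensive.unpersist()     # release it once it is no longer needed
```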
| How does Spark handle failures? |
Spark uses lineage information to recompute lost partitions. If a task or executor fails, Spark retries the task automatically on another executor. |
| Why is Parquet commonly used with Spark? |
Parquet is a columnar storage format that supports compression and predicate pushdown, which reduces I/O and improves query performance. |
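A small sketch of writing and reading Parquet where column pruning and predicate pushdown apply (the path and schema are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()
df = spark.createDataFrame(
    [("2024-01-01", "a", 100), ("2024-01-02", "b", 200)],
    ["event_date", "category", "amount"],
)

# Write as Parquet (hypothetical path).
df.write.mode("overwrite").parquet("/tmp/events_parquet")

# Columnar layout: only the selected columns are read, and the filter can be
# pushed down to the scan so non-matching row groups are skipped.
events = (
    spark.read.parquet("/tmp/events_parquet")
    .filter(F.col("event_date") == "2024-01-02")
    .select("category", "amount")
)
events.explain()  # the scan node lists PushedFilters
```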
| How do you debug a slow Spark job? |
I start by checking Spark UI to identify slow stages or skewed tasks, then review shuffle size, partition count, and executor utilization before applying optimizations. |
| Types of Spark execution models and their differences |
| Spark execution flow (logical model, physical model, stages, tasks, etc.) |
| Partitions and parallelism. What is the relationship between the number of input files and the number of partitions? |
| Join types, and which join to choose in which situation |
| What happens when an executor fails, and how would you recover? |
| How would you respond when skew occurs? |
| Why is this Spark job slow? When a job slows down, where do you start looking for the cause and how do you fix it? |
| How would you optimize it? Tuning approaches |
| What are RDDs in Spark? |
| Points to check in the Spark UI https://eyeballs.tistory.com/715 |
< Jobs Tab >
- How many jobs were executed
- Which action triggered each job
One action creates one job. Remember this.

< Stages Tab >
- Stage execution time
- Shuffle Read / Write size
- Failed / Skipped stages
Stage boundaries are drawn at shuffles.

< Tasks detail view >
- Distribution of task execution times
- Whether only certain tasks are unusually slow (to check for skew)
- Differences in input size
If a few tasks run far longer than the rest, skew can be suspected.

< SQL Tab (when using DataFrames) >
- Physical plan
- Whether BroadcastHashJoin is used
- Whether SortMergeJoin is used
The physical execution plan shows which join strategy was actually used.

< Storage Tab >
- Cached DataFrames/RDDs
- Memory usage
- Storage level

< Executors Tab >
- Number of executors
- CPU/memory utilization
- Shuffle Read/Write per executor
- GC time
Shows whether the resources of each node are being used evenly.
In an interview it sounds good to say something like: When a Spark job is slow, I usually start with the Spark UI, especially the Stages and Tasks tabs, to identify shuffle-heavy stages or data skew.
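Outside the UI, the same join-strategy check can also be done from code with explain(); a small sketch with made-up tables:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("join-plan-check").getOrCreate()
orders = spark.createDataFrame([(1, "a", 100)], ["order_id", "cust_id", "amount"])
customers = spark.createDataFrame([("a", "Kim")], ["cust_id", "name"])

# The formatted physical plan (Spark 3.0+) names the join operator: look for
# BroadcastHashJoin vs SortMergeJoin, matching what the SQL tab shows.
orders.join(F.broadcast(customers), "cust_id").explain(mode="formatted")
```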