- Cluster
- Job
- Stages
- Tasks
- DAG(Directed Acyclic Graph)
- Executor
- Driver
- Master
- Slave
1.Spark revolves around the concept of a DataFrame or a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel.
2.DataFrames or RDDs support two types of operations (a short sketch follows below):
- Transformations :- create a new dataset from an existing one.
- Actions :- return a value to the driver program after running a computation on the dataset.
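As a minimal sketch of the difference (assuming a local PySpark installation; the app name and data here are illustrative, not from the article):

```python
# A minimal PySpark sketch, assuming pyspark is installed locally.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("transformations-vs-actions").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize([1, 2, 3, 4, 5])

# Transformation: builds a new (lazy) dataset from an existing one; nothing runs yet.
squares = nums.map(lambda x: x * x)

# Action: triggers the computation and returns a value to the driver program.
print(squares.collect())   # [1, 4, 9, 16, 25]
```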
3.Cluster :- A Cluster is a group of JVMs (nodes) connected by a network, each of which runs Spark in either a Driver or a Worker role.
4.Job :- A piece of code which reads some input from DBFS or local storage, performs some computation on the data, and writes some output data.
- A Job is a sequence of Stages.
- A Job is started when an Action such as count(), collect(), or saveAsTextFile() is called (see the sketch after this list).
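For instance (a hedged sketch that reuses the `sc` from the first snippet; the file paths are hypothetical), the Job below only starts when saveAsTextFile() is called:

```python
# Assumes the `sc` (SparkContext) from the first sketch; paths are made up for illustration.
lines = sc.textFile("/tmp/input.txt")                          # read some input
words = lines.flatMap(lambda line: line.split())               # transformations are lazy
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("/tmp/word_counts")                      # Action: the Job starts here
```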
5.Stages :- Jobs are divided into stages. Stages are classified as Map or Reduce stages. Stages are divided at computational boundaries: not all computations (operators) can be executed in a single Stage, so the work happens over many stages.
- A Stage is a sequence of Tasks that don’t require a Shuffle in-between.
- The number of Tasks in a Stage also depends upon the number of Partitions your datasets have.
Types of Spark Stages
- ShuffleMapStage
- ResultStage
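As a rough sketch (again reusing `sc`), a job that shuffles data, e.g. via reduceByKey(), is typically split into a ShuffleMapStage that writes the map-side shuffle output and a ResultStage that computes the final result of the action:

```python
# Assumes the `sc` from the first sketch.
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

# Everything before the shuffle runs in a ShuffleMapStage;
# the stage that produces the result of collect() is the ResultStage.
print(pairs.reduceByKey(lambda a, b: a + b).collect())
```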
6.Tasks :- Each stage has some tasks, one task per partition. One task is executed on one partition of data on one executor (machine).
- A Task is a single operation (e.g. .map or .filter) applied to a single Partition.
- Each Task is executed as a single thread in an Executor.
- If your dataset has 2 Partitions, an operation such as filter() will trigger 2 Tasks, one for each Partition.
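For example (a sketch reusing `sc`), a dataset with 2 partitions means a filter() runs as 2 tasks:

```python
# Assumes the `sc` from the first sketch.
data = sc.parallelize(range(10), numSlices=2)   # 2 partitions
print(data.getNumPartitions())                  # 2

# filter() is applied partition by partition: 2 partitions -> 2 tasks,
# each task running as a single thread inside an executor.
evens = data.filter(lambda x: x % 2 == 0)
print(evens.collect())
```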
7.Spark translates the DataFrame transformations into something called a DAG (Directed Acyclic Graph) and starts the execution. At a high level, when any action is called on the DataFrame, Spark creates the DAG and submits it to the DAG scheduler.
- Directed :- each edge points from one node to the next, which creates a sequence.
- Acyclic :- there is no cycle or loop; the flow never returns to an earlier node.
- Graph :- a combination of vertices (nodes) and edges, with all the connections in a sequence.
- The DAG scheduler divides operators into stages of tasks. A stage comprises tasks based on partitions of the input data.
- The DAG scheduler pipelines operators together. For example, many map operators can be scheduled in a single stage. The final result of the DAG scheduler is a set of stages.
- The Stages are passed on to the Task Scheduler. The Task Scheduler launches tasks via the cluster manager; it does not know about the dependencies between stages.
- The Worker executes the tasks on the Slave (see the lineage sketch below).
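One way to peek at the lineage the DAG scheduler works from (a sketch reusing `sc`; the exact output format varies by Spark version) is an RDD's toDebugString(), where indentation changes mark shuffle boundaries between stages:

```python
# Assumes the `sc` from the first sketch.
rdd = (sc.parallelize(range(100), 4)
         .map(lambda x: (x % 10, x))            # narrow: pipelined into the same stage
         .reduceByKey(lambda a, b: a + b)       # wide: introduces a shuffle / new stage
         .map(lambda kv: kv[1]))

# In PySpark, toDebugString() returns bytes; decode it for readability.
print(rdd.toDebugString().decode("utf-8"))
```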
Let's see how the DAG works in Spark.
- At a higher level, we can apply two types of DataFrame transformations: narrow transformations (e.g. map(), filter()) and wide transformations (e.g. reduceByKey()).
- Narrow transformation :- does not require shuffling data across partitions; narrow transformations are grouped into a single stage.
- Wide transformation :- the data is shuffled. Hence, wide transformations result in stage boundaries (see the sketch after this list).
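A small sketch of the two kinds of transformation (reusing `sc`): filter() and map() below are narrow and get pipelined into one stage, while repartition() shuffles data across partitions and therefore starts a new stage:

```python
# Assumes the `sc` from the first sketch.
nums = sc.parallelize(range(20), 4)

# Narrow: filter() and map() work within each partition, so they are
# grouped (pipelined) into a single stage.
small_squares = nums.filter(lambda x: x < 10).map(lambda x: x * x)

# Wide: repartition() shuffles data across partitions, marking a stage boundary.
shuffled = small_squares.repartition(2)

print(shuffled.collect())
```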
Working of DAG Optimizer in Spark
- We optimize the DAG in Apache Spark by rearranging and combining operators wherever possible.
- For example, if we submit a Spark job which has a map() operation followed by a filter() operation, the DAG Optimizer will rearrange the order of these operators, since filtering reduces the number of records that undergo the map operation (a sketch of inspecting the optimized plan follows below).
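Whether and how operators are rearranged depends on the API; for DataFrames, the plan Spark actually runs can be inspected with explain(). A hedged sketch (reusing the `spark` session from the first snippet; the column names are made up):

```python
# Assumes the `spark` session from the first sketch; names are illustrative.
df = spark.range(0, 1000).withColumnRenamed("id", "value")

query = df.selectExpr("value * 2 AS doubled").filter("doubled < 100")

# explain(True) prints the parsed, analyzed, optimized, and physical plans,
# which shows how Spark has arranged the operators before execution.
query.explain(True)
```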
Advantages of DAG in Spark
- A lost RDD can be recovered using the Directed Acyclic Graph.
- MapReduce has just two levels, map and reduce, whereas a DAG can have many levels, so the DAG is more flexible for executing SQL-style queries.
- The DAG helps to achieve fault tolerance, so we can recover lost data.
- It can do better global optimization than a system like Hadoop MapReduce.
8.Executor:- The process responsible for executing a task.
- Executors are JVMs that run on Worker nodes.
- These are the JVMs that actually run Tasks on data Partitions.
9.Driver:- The program/process responsible for running the Job over the Spark Engine.
- The Driver is one of the nodes in the Cluster.
- The driver does not run computations (filter, map, reduce, etc).
- It plays the role of a master node in the Spark cluster.
- When you call collect() on a DataFrame, RDD, or Dataset, the whole dataset is sent to the Driver. This is why you should be careful when calling collect(); see the sketch below.
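A cautious sketch (reusing `sc`; the data is illustrative): prefer bringing back only a bounded amount of data to the Driver.

```python
# Assumes the `sc` from the first sketch.
big = sc.parallelize(range(1_000_000))

# collect() ships every element to the Driver and can exhaust its memory:
# all_rows = big.collect()            # risky on large datasets

# Safer alternatives pull back only what you need.
print(big.take(5))      # just the first 5 elements
print(big.count())      # a single number returned to the driver
```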
10.Master:- The machine on which the Driver program runs.
11.Slave:- The machine on which the Executor program runs.