Are you a candidate? Complete list of Spark interview questions 👇

Index

General

  1. What is PageRank in GraphX?
  2. What is the significance of Sliding Window operation?
  3. What do you understand by Transformations in Spark?
  4. What makes Spark good at low latency workloads like graph processing and Machine Learning?
  5. How is Streaming implemented in Spark?
  6. Explain Idle Assessment.
  7. Illustrate some demerits of using Spark.
  8. What are the types of Transformation on DStream?
  9. Explain Executor Memory in a Spark application.
  10. Why is BlinkDB used?
  11. Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark?
  12. Explain the key features of Apache Spark.
  13. What are the functions of Spark SQL?
  14. What are Spark Datasets?
  15. Explain Caching in Spark Streaming.
  16. What is Executor Memory in a Spark application?
  17. What does MLlib do?
  18. What is YARN in Spark?
  19. Is there any benefit of learning MapReduce if Spark is better than MapReduce?
  20. What are benefits of Spark over MapReduce?
  21. What is shuffling in Spark? When does it occur?
  22. What is the difference between createOrReplaceTempView and createGlobalTempView?
  23. How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
  24. Compare map() and flatMap() in Spark.
  25. Define Actions in Spark.
  26. Explain Sparse Vector.
  27. What is a Parquet file?
  28. Explain Pair RDD.
  29. What do you understand by SchemaRDD in Apache Spark RDD?
  30. Explain Lazy Evaluation.
  31. What are the various levels of persistence in Apache Spark?
  32. What are receivers in Apache Spark Streaming?
  33. What is SparkContext in PySpark?
  34. Does Apache Spark provide check pointing?
  35. Under what scenarios do you use Client and Cluster modes for deployment?
  36. Name the types of Cluster Managers in Spark.
  37. What is a Lineage Graph?
  38. What is RDD?
  39. What are DStreams?
  40. What are scalar and aggregate functions in Spark SQL?
  41. Explain coalesce in Spark.
  42. How does Spark Streaming handle caching?
  43. What is RDD Lineage?
  44. Why is there a need for broadcast variables when working with Apache Spark?
  45. What are the steps involved in structured API execution in Spark?
  46. Explain Immutable in reference to Spark.
  47. What are the various functionalities supported by Spark Core?
  48. What do you understand by worker node?
  49. Define Partitions in Apache Spark.
  50. How can you trigger automatic clean-ups in Spark to handle accumulated metadata?
  51. Explain receivers in Spark Streaming.
  52. Does Apache Spark provide checkpoints?
  53. What is the role of accumulators in Spark?


The Questions
General
1. What is PageRank in GraphX?

PageRank measures the importance of each vertex in a graph, assuming an edge from u to v represents an endorsement of v’s importance by u. For example, if a Twitter user is followed by many others, the user will be ranked highly.

GraphX comes with static and dynamic implementations of PageRank as methods on the PageRank Object. Static PageRank runs for a fixed number of iterations, while dynamic PageRank runs until the ranks converge (i.e., stop changing by more than a specified tolerance). GraphOps allows calling these algorithms directly as methods on Graph.
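GraphX's actual implementation is distributed, but the iterative core of the algorithm is easy to sketch in plain Python. The helper below is a hypothetical illustration, not the GraphX API: it uses the standard damping factor of 0.85 and stops when the ranks change by less than a tolerance, mirroring dynamic PageRank.

```python
def pagerank(edges, num_vertices, damping=0.85, tol=1e-6, max_iter=100):
    """Dynamic-style PageRank: iterate until ranks stop changing by more than tol."""
    ranks = {v: 1.0 for v in range(num_vertices)}
    out_links = {v: [] for v in range(num_vertices)}
    for u, v in edges:
        out_links[u].append(v)          # edge u -> v endorses v's importance
    for _ in range(max_iter):
        contribs = {v: 0.0 for v in range(num_vertices)}
        for u, targets in out_links.items():
            for v in targets:
                contribs[v] += ranks[u] / len(targets)
        new_ranks = {v: (1 - damping) + damping * c for v, c in contribs.items()}
        if max(abs(new_ranks[v] - ranks[v]) for v in ranks) < tol:
            return new_ranks
        ranks = new_ranks
    return ranks

# Vertex 0 is "followed" by vertices 1, 2 and 3, so it ends up ranked highest.
ranks = pagerank([(1, 0), (2, 0), (3, 0), (0, 1)], 4)
```

Static PageRank corresponds to dropping the tolerance check and always running for the fixed number of iterations.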

2. What is the significance of Sliding Window operation?

A sliding window controls which range of a data stream is included in each computation. The Spark Streaming library provides windowed computations, in which transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within that particular window are combined and operated upon to produce new RDDs of the windowed DStream.
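The combining step can be sketched in plain Python (a conceptual illustration, not the Spark API, where you would use operations such as reduceByWindow()): with a window length of three batches sliding by one batch, each result covers the union of the last three micro-batches.

```python
def windowed_sums(batches, window_length=3, slide_interval=1):
    """Combine the batches that fall inside each position of a sliding window."""
    results = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]
        combined = [x for batch in window for x in batch]  # union of the window's batches
        results.append(sum(combined))
    return results

# Four micro-batches; window length 3, slide 1 -> two overlapping window results.
print(windowed_sums([[1, 2], [3], [4, 5], [6]]))  # [15, 18]
```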

3. What do you understand by Transformations in Spark?

Transformations are functions applied to RDDs that result in another RDD. They do not execute until an action occurs. Functions such as map() and filter() are examples of transformations: map() applies the supplied function to every element in the RDD and produces a new RDD of the results, while filter() creates a new RDD by selecting the elements of the current RDD that pass the function argument.
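Plain Python's built-in map() and filter() follow the same contract, which makes the semantics easy to demonstrate outside Spark (an illustration of the concept, not the RDD API):

```python
lines = ["spark is fast", "hadoop uses disk"]

# map(): exactly one output element per input element.
word_counts = list(map(lambda line: len(line.split()), lines))

# filter(): keep only the elements for which the predicate returns True.
spark_lines = list(filter(lambda line: "spark" in line, lines))

print(word_counts)  # [3, 3]
print(spark_lines)  # ['spark is fast']
```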

4. What makes Spark good at low latency workloads like graph processing and Machine Learning?

Apache Spark stores data in-memory, which speeds up processing and the building of machine learning models. Machine Learning algorithms require multiple iterations and different conceptual steps to create an optimal model, and graph algorithms traverse all the nodes and edges of a graph. Because Spark can keep the working data cached in memory across these iterations instead of re-reading it from disk, such low latency, iterative workloads see a large performance gain.

5. How is Streaming implemented in Spark?

Spark Streaming is used for processing real-time streaming data and is thus a useful addition to the core Spark API. It enables high-throughput, fault-tolerant stream processing of live data streams. The fundamental stream unit is the DStream, which is essentially a series of RDDs (Resilient Distributed Datasets) representing the real-time data. Data from sources like Flume or HDFS is streamed in and finally written out to file systems, live dashboards and databases. Processing resembles batch processing, since the input data is divided into small batches (micro-batches).

6. Explain Idle Assessment.

Idle assessment, also known as call-by-need (i.e., lazy evaluation), is a strategy that defers the evaluation of an expression until its value is actually needed.

7. Illustrate some demerits of using Spark.

Since Spark utilizes more memory compared to Hadoop MapReduce, which relies on disk, memory-related problems can arise. Developers need to be careful while running their applications on Spark. To mitigate the issue, they can distribute the workload over multiple nodes of the cluster instead of running everything on a single node.

8. What are the types of Transformation on DStream?

There are two types of transformations on DStreams:

1) Stateless transformations: the processing of each batch does not depend on the data of previous batches. Each stateless transformation applies separately to each RDD in the stream. Examples: map(), flatMap(), filter(), repartition(), reduceByKey(), groupByKey().

2) Stateful transformations: these use data or intermediate results from previous batches to compute the result of the current batch, allowing us to combine data across time. Examples: updateStateByKey() and mapWithState().
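The updateStateByKey idea can be sketched in plain Python (a conceptual illustration with made-up names, not the Spark API): the state for each key is carried forward and updated with the values arriving in each new batch.

```python
def running_counts(batches):
    """Carry a per-key running total across batches, like updateStateByKey."""
    state = {}
    snapshots = []
    for batch in batches:              # each batch is a list of (key, value) pairs
        for key, value in batch:
            state[key] = state.get(key, 0) + value
        snapshots.append(dict(state))  # state as seen after this batch
    return snapshots

batches = [[("a", 1), ("b", 1)], [("a", 2)]]
print(running_counts(batches))  # [{'a': 1, 'b': 1}, {'a': 3, 'b': 1}]
```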

9. Explain Executor Memory in a Spark application.

Every Spark application has the same fixed heap size and fixed number of cores for each Spark executor. The heap size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag. In the default standalone configuration, a Spark application has one executor on each worker node. The executor memory is essentially a measure of how much of a worker node's memory the application will utilize.

10. Why is BlinkDB used?

BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data and renders query results marked with meaningful error bars. BlinkDB helps users balance ‘query accuracy’ with response time. BlinkDB builds a few stratified samples of the original data and then executes the queries on the samples, rather than the original data in order to reduce the time taken for query execution. The sizes and numbers of the stratified samples are determined by the storage availability specified when importing the data. BlinkDB consists of two main components:

  • Sample building engine: determines the stratified samples to be built based on workload history and data distribution.
  • Dynamic sample selection module: selects the correct sample files at runtime based on the time and/or accuracy requirements of the query.
11. Hadoop uses replication to achieve fault tolerance. How is this achieved in Apache Spark?

Data storage model in Apache Spark is based on RDDs. RDDs help achieve fault tolerance through lineage. RDD always has the information on how to build from other datasets. If any partition of a RDD is lost due to failure, lineage helps build only that particular lost partition.

12. Explain the key features of Apache Spark.

Polyglot: Spark provides high-level APIs in Java, Scala, Python and R. Spark code can be written in any of these four languages. It provides a shell in Scala and Python. The Scala shell can be accessed through ./bin/spark-shell and Python shell through ./bin/pyspark from the installed directory.

Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. Spark is able to achieve this speed through controlled partitioning. It manages data using partitions that help parallelize distributed data processing with minimal network traffic.

Multiple Formats: Spark supports multiple data sources such as Parquet, JSON, Hive and Cassandra. The Data Sources API provides a pluggable mechanism for accessing structured data through Spark SQL. Data sources can be more than just simple pipes that convert data and pull it into Spark.

Lazy Evaluation: Apache Spark delays its evaluation until it is absolutely necessary. This is one of the key factors contributing to its speed. Spark adds transformations to a DAG of computation, and only when the driver requests some data does this DAG actually get executed.

Real Time Computation: Spark’s computation is real-time and has low latency because of its in-memory computation. Spark is designed for massive scalability; the Spark team has documented users running production clusters with thousands of nodes, and Spark supports several computational models.

Hadoop Integration: Apache Spark provides smooth compatibility with Hadoop. This is a great boon for all the Big Data engineers who started their careers with Hadoop. Spark is a potential replacement for the MapReduce functions of Hadoop, while Spark has the ability to run on top of an existing Hadoop cluster using YARN for resource scheduling.

Machine Learning: Spark’s MLlib is the machine learning component which is handy when it comes to big data processing. It eradicates the need to use multiple tools, one for processing and one for machine learning. Spark provides data engineers and data scientists with a powerful, unified engine that is both fast and easy to use.

13. What are the functions of Spark SQL?

Spark SQL is Apache Spark’s module for working with structured data.

Spark SQL loads the data from a variety of structured data sources.

It queries data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC).

It provides a rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables and expose custom functions in SQL.

14. What are Spark Datasets?

Datasets are data structures in Spark (added since Spark 1.6) that provide the JVM object benefits of RDDs (the ability to manipulate data with lambda functions), alongside a Spark SQL-optimized execution engine.

15. Explain Caching in Spark Streaming.

DStreams allow developers to cache/persist the stream’s data in memory. This is useful if the data in the DStream will be computed multiple times. This can be done using the persist() method on a DStream. For input streams that receive data over the network (such as Kafka, Flume, sockets, etc.), the default persistence level replicates the data to two nodes for fault-tolerance.

16. What is Executor Memory in a Spark application?

Every Spark application has the same fixed heap size and fixed number of cores for each Spark executor. The heap size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag. In the default standalone configuration, a Spark application has one executor on each worker node. The executor memory is essentially a measure of how much of a worker node's memory the application will utilize.

17. What does MLlib do?

MLlib is a scalable Machine Learning library provided by Spark. It aims at making Machine Learning easy and scalable, with common learning algorithms and use cases like classification, regression, clustering, collaborative filtering, and dimensionality reduction.

18. What is YARN in Spark?
  • YARN is a central resource management platform that Spark can run on, delivering scalable operations across the cluster.
  • YARN is a cluster management technology from the Hadoop ecosystem, whereas Spark is a tool for distributed data processing.
19. Is there any benefit of learning MapReduce if Spark is better than MapReduce?

Yes. MapReduce is a paradigm used by many big data tools, including Spark. Understanding it remains relevant as data grows bigger and bigger, and tools like Pig and Hive convert their queries into MapReduce phases under the hood, so knowing the model helps in optimizing them.

20. What are benefits of Spark over MapReduce?
  • Due to in-memory processing, Spark performs processing around 10 to 100 times faster than Hadoop MapReduce, which uses persistent storage for its data processing tasks.
  • Unlike Hadoop, Spark provides inbuilt libraries to perform multiple tasks from the same core, such as batch processing, streaming, machine learning, and interactive SQL queries. Hadoop, by contrast, only supports batch processing.
  • Hadoop is highly disk-dependent whereas Spark promotes caching and in-memory data storage.
  • Spark is capable of performing computations multiple times on the same dataset. This is called iterative computation while there is no iterative computing implemented by Hadoop.
21. What is shuffling in Spark? When does it occur?

Shuffling is the process of redistributing data across partitions, which may lead to data movement across the executors. It occurs when joining two tables or when performing byKey operations such as groupByKey() or reduceByKey().

22. What is the difference between CreateOrReplaceTempView and createGlobalTempView?

createOrReplaceTempView is used when you want to register the table for a particular Spark session only, while createGlobalTempView is used when you want to share the temporary view across multiple Spark sessions of the same application (it lives in the global_temp database until the application ends).

23. How can you trigger automatic clean-ups in Spark to handle accumulated metadata?

You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long running jobs into different batches and writing the intermediary results to the disk.

24. Compare map() and flatMap() in Spark.

In Spark, map() transformation is applied to each row in a dataset to return a new dataset. flatMap() transformation is also applied to each row of the dataset, but a new flattened dataset is returned. In case of flatMap, if a record is nested (e.g. a column which is in itself made up of a list, array), the data within that record gets extracted and is returned as a new row of the returned dataset.

Both map() and flatMap() transformations are narrow, which means that they do not result in shuffling of data in Spark.

flatMap() is said to be a one-to-many transformation, as each input row can produce zero or more output rows, so the result may contain more rows than the current DataFrame. map() returns exactly the same number of records as were present in the input DataFrame.

flatMap() can give a result which contains redundant data in some columns.

flatMap() can be used to flatten a column which contains arrays or lists. It can be used to flatten any other nested collection too.
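The contrast is easy to demonstrate in plain Python (an illustration of the semantics, not the Spark API): map() keeps the nesting, while flatMap() extracts the inner elements into individual rows.

```python
from itertools import chain

lines = ["hello world", "spark"]

# map(): one output element per input line -> a list of lists.
mapped = [line.split() for line in lines]

# flatMap(): each line's words are flattened into separate rows.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

print(mapped)       # [['hello', 'world'], ['spark']]
print(flat_mapped)  # ['hello', 'world', 'spark']
```

Note that two input rows produced three output rows: the one-to-many behaviour.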

25. Define Actions in Spark.

An action brings data back from an RDD to the local machine (the driver). An action's execution is the result of all previously created transformations: actions trigger execution using the lineage graph to load the data into the original RDD, carry out all intermediate transformations, and return the final results to the driver program or write them out to the file system.

reduce() is an action that applies the passed function repeatedly until only one value is left. take(n) is an action that returns the first n values from the RDD to the local node (the driver).
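Python's functools.reduce applies a function pairwise in the same way, which makes for a quick sketch of these two actions (plain Python, not the RDD API):

```python
from functools import reduce

data = [1, 2, 3, 4]

# reduce(): apply the function repeatedly until a single value is left.
total = reduce(lambda a, b: a + b, data)

# take(n) returns the first n elements to the driver; a slice plays that role here.
first_two = data[:2]

print(total)      # 10
print(first_two)  # [1, 2]
```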

26. Explain Sparse Vector.

A vector is a one-dimensional array of elements. In many applications, however, most of the vector's elements are zero; such vectors are said to be sparse, and a sparse representation stores only the indices and values of the non-zero elements.
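A minimal sketch of the representation in plain Python (illustrative helper names; Spark's MLlib has its own SparseVector class): only the non-zero entries and the overall size are stored.

```python
def to_sparse(dense):
    """Keep only the non-zero entries of a dense vector."""
    return {"size": len(dense), "entries": {i: v for i, v in enumerate(dense) if v != 0}}

def to_dense(sparse):
    """Reconstruct the full vector from the sparse form."""
    return [sparse["entries"].get(i, 0) for i in range(sparse["size"])]

v = [0, 0, 3.5, 0, 0, 0, 1.2, 0]
s = to_sparse(v)
print(s["entries"])      # {2: 3.5, 6: 1.2} -- two stored entries instead of eight
print(to_dense(s) == v)  # True
```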

27. What is a Parquet file?

Parquet is a columnar format file supported by many data processing systems. Spark SQL performs both read and write operations with Parquet files and considers it one of the best big data analytics formats so far.

The advantages of columnar storage are as follows:

  • Columnar storage limits IO operations.
  • It can fetch only the specific columns that you need to access.
  • Columnar storage consumes less space.
  • It gives better-summarized data and follows type-specific encoding.

28. Explain Pair RDD.

Special operations can be performed on RDDs in Spark using key/value pairs and such RDDs are referred to as Pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey () method that collects data based on each key and a join () method that combines different RDDs together, based on the elements having the same key.
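The semantics of these two methods can be sketched with plain Python dictionaries (a conceptual illustration, not the Pair RDD API):

```python
def reduce_by_key(pairs, func):
    """Combine the values that share a key, like reduceByKey()."""
    out = {}
    for key, value in pairs:
        out[key] = func(out[key], value) if key in out else value
    return out

def join(left, right):
    """Inner-join two key/value collections on matching keys, like join()."""
    return {k: (left[k], right[k]) for k in left if k in right}

sales = [("apples", 2), ("pears", 5), ("apples", 3)]
prices = {"apples": 1.5, "bananas": 0.5}

totals = reduce_by_key(sales, lambda a, b: a + b)
print(totals)                # {'apples': 5, 'pears': 5}
print(join(totals, prices))  # {'apples': (5, 1.5)}
```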

29. What do you understand by SchemaRDD in Apache Spark RDD?

SchemaRDD is an RDD that consists of row objects (wrappers around the basic string or integer arrays) with schema information about the type of data in each column.

SchemaRDD was designed as an attempt to make life easier for developers in their daily routines of code debugging and unit testing on SparkSQL core module. The idea can boil down to describing the data structures inside RDD using a formal description similar to the relational database schema. On top of all basic functions provided by common RDD APIs, SchemaRDD also provides some straightforward relational query interface functions that are realized through SparkSQL.

30. Explain Lazy Evaluation.

Spark is intelligent in the manner in which it operates on data. When you tell Spark to operate on a given dataset, it heeds the instructions and makes a note of them, so that it does not forget, but it does nothing unless asked for the final result. When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow.
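Python generator expressions behave the same way, which gives a compact demonstration of the principle (plain Python, not Spark itself): building the pipeline records the work, and only the terminal call, the analogue of an action, performs it.

```python
log = []

def traced_double(x):
    log.append(x)   # record when the work actually happens
    return x * 2

pipeline = (traced_double(x) for x in range(4))  # "transformation": nothing runs yet
print(log)  # [] -- no work has been done so far

result = list(pipeline)  # "action": forces the pipeline to execute
print(result)  # [0, 2, 4, 6]
print(log)     # [0, 1, 2, 3]
```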

31. What are the various levels of persistence in Apache Spark?

Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist() method on any RDD they plan to reuse. Spark has various persistence levels to store RDDs on disk, in memory, or as a combination of both, with different replication levels.

The various storage/persistence levels in Spark are:

  • MEMORY_ONLY
  • MEMORY_ONLY_SER
  • MEMORY_AND_DISK
  • MEMORY_AND_DISK_SER
  • DISK_ONLY
  • OFF_HEAP

32. What are receivers in Apache Spark Streaming?

Receivers are those entities that consume data from different data sources and then move them to Spark for processing. They are created by using streaming contexts in the form of long-running tasks that are scheduled for operating in a round-robin fashion. Each receiver is configured to use up only a single core. The receivers are made to run on various executors to accomplish the task of data streaming. There are two types of receivers depending on how the data is sent to Spark:

  • Reliable receivers: Here, the receiver sends an acknowledgement to the data sources post successful reception of data and its replication on the Spark storage space.
  • Unreliable receiver: Here, there is no acknowledgement sent to the data sources.
33. What is SparkContext in PySpark?

A SparkContext represents the entry point to connect to a Spark cluster. It can be used to create RDDs, accumulators and broadcast variables on that particular cluster. Only one SparkContext can be active per JVM, so an existing SparkContext has to be stopped before creating a new one. PySpark uses the library Py4J to launch a JVM and create a JavaSparkContext. By default, the PySpark shell has a SparkContext available as ‘sc’, so creating another SparkContext will not work until ‘sc’ is stopped.

34. Does Apache Spark provide check pointing?

Lineage graphs are always useful for recovering RDDs from a failure, but recovery is generally time-consuming if the RDDs have long lineage chains. Spark provides an API for checkpointing, i.e., persisting an RDD to reliable storage (along with a REPLICATE flag for persist()); however, the decision on which data to checkpoint is left to the user. Checkpoints are most useful when the lineage graphs are long and have wide dependencies.

35. Under what scenarios do you use Client and Cluster modes for deployment?
  • If the client machines are not close to the cluster, Cluster mode should be used for deployment. This avoids the network latency during communication between the driver and the executors that would occur in Client mode. Also, in Client mode, the entire process is lost if the client machine goes offline.
  • If we have the client machine inside the cluster, then the Client mode can be used for deployment. Since the machine is inside the cluster, there won’t be issues of network latency and since the maintenance of the cluster is already handled, there is no cause of worry in cases of failure.
36. Name the types of Cluster Managers in Spark.

The Spark framework supports three major types of Cluster Managers.

  • Standalone: A basic Cluster Manager to set up a cluster
  • Apache Mesos: A generalized/commonly-used Cluster Manager, running Hadoop MapReduce and other applications
  • YARN: A Cluster Manager responsible for resource management in Hadoop
37. What is a Lineage Graph?

A Lineage Graph is a graph of the dependencies between an existing RDD and the new RDDs derived from it. It means that all the dependencies between RDDs are recorded in a graph, rather than replicating the original data.

The need for an RDD lineage graph happens when we want to compute a new RDD or if we want to recover the lost data from the lost persisted RDD. Spark does not support data replication in memory. So, if any data is lost, it can be rebuilt using RDD lineage. It is also called an RDD operator graph or RDD dependency graph.
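The recovery idea can be sketched in a few lines of plain Python (a toy class, nothing like Spark's internals): record the source and the chain of transformations instead of the data, and rebuild lost results by replaying them.

```python
class LineageRDD:
    """Toy RDD that records its lineage instead of replicating its data."""
    def __init__(self, source, transformations=()):
        self.source = source                        # how to obtain the base data
        self.transformations = list(transformations)

    def map(self, func):
        # A transformation returns a new "RDD" with an extended lineage.
        return LineageRDD(self.source, self.transformations + [func])

    def compute(self):
        # Replaying the lineage rebuilds the data, e.g. after a partition is lost.
        data = self.source()
        for func in self.transformations:
            data = [func(x) for x in data]
        return data

rdd = LineageRDD(lambda: [1, 2, 3]).map(lambda x: x * 10).map(lambda x: x + 1)
print(rdd.compute())  # [11, 21, 31] -- recomputable at any time from lineage
```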

38. What is RDD?

RDD stands for Resilient Distributed Dataset. An RDD is a fault-tolerant collection of elements that can be operated on in parallel. The partitioned data of an RDD is distributed and immutable. There are two types of datasets:

  • Parallelized collections: Meant for running in parallel.
  • Hadoop datasets: These perform operations on file record systems on HDFS or other storage systems.
39. What are DStreams?

DStreams, or discretized streams, are high-level abstractions provided in Spark Streaming that represent a continuous stream of data. DStreams can be created either from input sources such as Kafka, Flume or Kinesis, or by applying high-level operations on existing DStreams.

Internally, a DStream is represented by a continuous series of RDDs. Each RDD in a DStream contains data from a certain interval.

40. What are scalar and aggregate functions in Spark SQL?

In Spark SQL, Scalar functions are those functions that return a single value for each row. Scalar functions include built-in functions including array functions and map functions. Aggregate functions return a single value for a group of rows. Some of the built-in aggregate functions include min(), max(), count(), countDistinct(), avg(). Users can also create their own scalar and aggregate functions.
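The distinction is plain SQL, so it can be demonstrated with Python's built-in sqlite3 module (standard SQL here; Spark SQL exposes the same scalar-versus-aggregate split, plus additional array and map functions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (item TEXT, qty INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("apple", 3), ("apple", 5), ("pear", 2)])

# Scalar function: one value per row.
scalar = conn.execute("SELECT upper(item) FROM orders").fetchall()
print(scalar)  # [('APPLE',), ('APPLE',), ('PEAR',)]

# Aggregate functions: one value per group of rows.
agg = conn.execute(
    "SELECT item, count(*), avg(qty) FROM orders GROUP BY item ORDER BY item"
).fetchall()
print(agg)  # [('apple', 2, 4.0), ('pear', 1, 2.0)]
```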

41. Explain coalesce in Spark.

Coalesce in Spark is a method used to reduce the number of partitions in a DataFrame. Reducing partitions with the repartition() method is an expensive operation because it performs a full shuffle. The coalesce method avoids a full shuffle: instead of creating new partitions and redistributing every record, it merges data into a subset of the existing partitions. The coalesce method can only be used to decrease the number of partitions. Coalesce is ideally used when one wants to store the same data in a smaller number of files.
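A rough sketch of the difference in plain Python (conceptual only; Spark's actual partition placement is more involved): coalesce concatenates whole existing partitions rather than redistributing individual elements.

```python
def coalesce(partitions, num_partitions):
    """Merge existing partitions into fewer, without touching individual elements."""
    if num_partitions >= len(partitions):
        return partitions  # coalesce only ever decreases the partition count
    merged = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        merged[i % num_partitions].extend(part)  # whole partitions are combined
    return merged

parts = [[1, 2], [3], [4, 5], [6]]
print(coalesce(parts, 2))  # [[1, 2, 4, 5], [3, 6]]
print(coalesce(parts, 8))  # unchanged: [[1, 2], [3], [4, 5], [6]]
```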

42. How does Spark Streaming handle caching?

Caching can be handled in Spark Streaming by means of a change in settings on DStreams. A Discretized Stream (DStream) allows users to keep the stream’s data persistent in memory. By using the persist() method on a DStream, every RDD of that particular DStream is kept persistent on memory, and can be used if the data in the DStream has to be used for computation multiple times. Unlike RDDs, in the case of DStreams, the default persistence level involves keeping the data serialized in memory.

43. What is RDD Lineage?

Spark does not replicate data in memory; if any data is lost, it is rebuilt using RDD lineage. RDD lineage is the process of reconstructing lost data partitions. The best part is that an RDD always remembers how to build itself from other datasets.

44. Why is there a need for broadcast variables when working with Apache Spark?

Broadcast variables are read-only variables present in the in-memory cache of every machine. When working with Spark, using broadcast variables eliminates the need to ship a copy of a variable with every task, so data can be processed faster. Broadcast variables help in storing a lookup table in memory, which enhances retrieval efficiency compared to an RDD lookup().

45. What are the steps involved in structured API execution in Spark?
  • Writing DataFrame/Dataset/SQL code.
  • If valid code, Spark converts this to a Logical Plan.
  • Spark transforms this Logical Plan to a Physical Plan, checking for optimizations along the way.
  • Spark then executes this Physical Plan (RDD manipulations) on the cluster.
46. Explain Immutable in reference to Spark.

If a value has been generated and assigned, it cannot be changed. This attribute is called immutability. Spark's RDDs are immutable by nature: they do not accept updates or alterations. Note that it is the data content that is immutable, not the underlying storage.

47. What are the various functionalities supported by Spark Core?

Spark Core is the engine for parallel and distributed processing of large data sets. The various functionalities supported by Spark Core include:

  • Scheduling and monitoring jobs
  • Memory management
  • Fault recovery
  • Task dispatching
48. What do you understand by worker node?

Worker node refers to any node that can run the application code in a cluster. The driver program must listen for and accept incoming connections from its executors and must be network addressable from the worker nodes.

The worker node is basically the slave node: the master node assigns work, and worker nodes actually perform the assigned tasks. Worker nodes process the data stored on the node and report their resources to the master. Based on resource availability, the master schedules tasks.

49. Define Partitions in Apache Spark.

As the name suggests, a partition is a smaller, logical division of data, similar to a ‘split’ in MapReduce: a logical chunk of a large distributed data set. Partitioning is the process of deriving logical units of data to speed up processing. Spark manages data using partitions that help parallelize distributed data processing with minimal network traffic for sending data between executors. By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, it creates partitions to hold the data chunks and to optimize transformation operations. Nearly everything in Spark is built on partitioned RDDs.

50. How can you trigger automatic clean-ups in Spark to handle accumulated metadata?

You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long running jobs into different batches and writing the intermediary results to the disk.

51. Explain receivers in Spark Streaming.

Receivers are special entities in Spark Streaming that consume data from various data sources and move them to Apache Spark. Receivers are usually created by streaming contexts as long running tasks on various executors and scheduled to operate in a round robin manner with each receiver taking a single core.

52. Does Apache Spark provide checkpoints?

Yes, Apache Spark provides an API for adding and managing checkpoints. Checkpointing is the process of making streaming applications resilient to failures. It allows you to save data and metadata to a checkpointing directory. In case of a failure, Spark can recover this data and resume from where it left off.

There are 2 types of data for which we can use checkpointing in Spark:

  • Metadata Checkpointing: Metadata means data about data. It refers to saving the metadata to fault-tolerant storage like HDFS. Metadata includes configurations, DStream operations, and incomplete batches.
  • Data Checkpointing: Here, we save the RDD to reliable storage, because the need for this arises in some of the stateful transformations, where an upcoming RDD depends on the RDDs of previous batches.

53. What is the role of accumulators in Spark?

Accumulators are variables used for aggregating information across the executors. This information can be about the data or API diagnosis like how many records are corrupted or how many times a library API was called.
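A toy sketch of the pattern in plain Python (illustrative only; real Spark accumulators are created via sparkContext.accumulator() and updated inside tasks):

```python
class Accumulator:
    """Write-only counter: tasks add to it, only the driver reads the total."""
    def __init__(self, value=0):
        self.value = value

    def add(self, amount):
        self.value += amount

corrupted = Accumulator()

def process(record, acc):
    if record is None:    # treat missing records as corrupted and count them
        acc.add(1)
        return 0
    return record * 2

results = [process(r, corrupted) for r in [1, None, 3, None]]
print(results)          # [2, 0, 6, 0]
print(corrupted.value)  # 2 -- corrupted records seen across all "tasks"
```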