Hadoop interview questions and answers 👇

  1. Hadoop Interview Questions For Freshers
  2. Hadoop Interview Questions For Freshers
  3. Hadoop Freshers Interview Questions
  4. Advanced Hadoop Interview Questions
  5. Hadoop Interview Questions For Experienced
  6. MCQ
  7. General


Hadoop Interview Questions For Freshers

1.

Explain the Storage Unit In Hadoop (HDFS).

HDFS, the Hadoop Distributed File System, is the storage layer of Hadoop. Files in HDFS are split into block-sized parts called data blocks, which are saved on the slave nodes in the cluster. By default, the block size is 128 MB, which can be configured as per our requirements. HDFS follows a master-slave architecture and contains two daemons: the NameNode and the DataNodes.

  • NameNode: The NameNode is the master daemon that runs on the master node. It stores the filesystem metadata, that is, file names, information about the blocks of each file, block locations, permissions, etc. It manages the DataNodes.
  • DataNode: The DataNodes are the slave daemons that run on the slave nodes. They store the actual business data and serve client read/write requests based on the NameNode's instructions. The DataNodes store the blocks of the files, while the NameNode stores metadata such as block locations and permissions.
2.

Mention different Features of HDFS.

  • Fault Tolerance: The Hadoop framework divides data into blocks and creates multiple copies of the blocks on several machines in the cluster. So, when any machine in the cluster fails, clients can still access their data from another machine containing the same copy of the data blocks.
  • High Availability: In the HDFS environment, the data is duplicated by generating copies of the blocks. So, whenever users want to obtain this data, or in case of an unfortunate situation, they can simply access it from other nodes, because duplicate copies of the blocks are already present on other nodes of the HDFS cluster.
  • High Reliability: HDFS splits the data into blocks, which the Hadoop framework stores on nodes in the cluster. It saves data by generating a duplicate of every block present in the cluster, which provides fault tolerance. By default, it creates 3 replicas of each block, so the data is readily available to users and there is no risk of data loss. Hence, HDFS is very reliable.
  • Replication: Replication resolves the problem of data loss under adverse conditions like device failure, crashing of nodes, etc. It manages the replication process at regular intervals of time, so there is a low probability of losing user data.
  • Scalability: HDFS stores data on multiple nodes, so in case of an increase in demand, the cluster can be scaled.
3.

Explain big data.

Gartner defined Big Data as: "Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."

Simply, big data is larger, more complex data sets, particularly from new data sources. These data sets are so large that conventional data processing software can’t manage them. But these massive volumes of data can be used to address business problems you wouldn’t have been able to tackle before.

4.

What are the three modes in which Hadoop can run?

  • Local Mode or Standalone Mode: By default, Hadoop is configured to run in a non-distributed mode. It runs as a single Java process. Instead of HDFS, this mode utilizes the local file system. This mode is useful for debugging, and there isn't any requirement to configure core-site.xml, hdfs-site.xml, mapred-site.xml, masters, and slaves. Standalone mode is ordinarily the quickest mode in Hadoop.
  • Pseudo-distributed Mode: In this mode, each daemon runs in a separate Java process. This mode requires custom configuration (core-site.xml, hdfs-site.xml, mapred-site.xml). HDFS is used for input and output. This mode of deployment is beneficial for testing and debugging purposes.
  • Fully Distributed Mode: This is the production mode of Hadoop. One machine in the cluster is designated exclusively as the NameNode and another as the ResourceManager; these are the masters. The rest of the nodes act as DataNodes and NodeManagers; these are the slaves. Configuration parameters and the environment need to be defined for the Hadoop daemons. This mode provides fully distributed computing capacity, security, fault tolerance, and scalability.

Hadoop Interview Questions For Freshers

1.

List characteristics of big data.

  • Volume: Volume refers to the large amount of data stored in data warehouses.
  • Velocity: Velocity typically refers to the pace at which data is being generated in real-time.
  • Variety: Variety of Big Data relates to structured, unstructured, and semistructured data that is collected from multiple sources.
  • Veracity: Data veracity generally refers to how accurate the data is.
  • Value: No matter how fast the data is produced or its amount, it has to be reliable and valuable. Otherwise, the information is not good enough for processing or analysis.
2.

Explain Hadoop MapReduce.

Hadoop MapReduce is a software framework for processing enormous data sets. It is the main component for data processing in the Hadoop framework. It divides the input data into several parts and runs a program on every data part in parallel. The term MapReduce refers to two separate and distinct tasks.

The first is the map operation, which takes a set of data and transforms it into a different collection of data, where individual elements are divided into tuples. The reduce operation consolidates those data tuples based on the key and subsequently modifies the value of the key.

3.

Find out the word count on the example_data.txt (the content of the example_data.txt file is: coding, ice, jamming, river, driving, ice, man, ice, jamming) using MapReduce.

To find the word count on example_data.txt using MapReduce, we look for the unique words and the number of times each unique word appears.

  • First, we break the input into three splits:
    1. coding, ice, jamming
    2. river, driving, ice
    3. man, ice, jamming
    This distributes the work among all the map nodes.
  • Then, the words are tokenized in each of the mappers, and each token is given a hardcoded value of 1. The reason for the hardcoded value of 1 is that every word by itself occurs at least once.
  • Now, a list of key-value pairs is created, where the key is an individual word and the value is one. So, for the first line (coding, ice, jamming), we have three key-value pairs: coding, 1; ice, 1; jamming, 1.
  • The mapping process remains the same on all the nodes.
  • Next, a partitioning process takes place, followed by sorting and shuffling, so that all the tuples with the same key are sent to the same reducer.
  • After the sorting and shuffling phase, every reducer has a unique key and a list of values matching that key. For example, ice, [1,1,1]; jamming, [1,1]; etc.
  • Now, each reducer adds up the values in its list. For example, the reducer gets the list of values [1,1] for the key jamming; it adds the ones in the list and gives the final output as jamming, 2.
  • Lastly, all the output key-value pairs are collected and written to the output file (a minimal mapper and reducer for this job are sketched below).
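A minimal WordCount mapper and reducer along these lines can be sketched with the new MapReduce API. This is only a sketch; the class names and the tokenizer delimiters are illustrative, not part of the question above.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString(), " ,");
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);          // e.g. (jamming, 1)
            }
        }
    }

    // Reducer: sums the 1s for each word, e.g. (jamming, [1, 1]) -> (jamming, 2).
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}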
4.

What is shuffling in MapReduce?

In Hadoop MapReduce, shuffling is the process of transferring data from the mappers to the reducers. The system sorts the map output and transfers it as input to the reducer. It is a significant process for reducers; otherwise, they would not receive any input. Moreover, since shuffling can begin even before the map phase is complete, it helps save time and finish the job sooner.

5.

What is Yarn?

YARN stands for Yet Another Resource Negotiator. It is the resource management layer of Hadoop and was introduced in Hadoop 2.x. YARN allows many data processing engines, such as graph processing, batch processing, interactive processing, and stream processing, to execute and process data saved in the Hadoop Distributed File System. YARN also offers job scheduling. It extends the capability of Hadoop to other evolving technologies so that they can take advantage of HDFS and economical clusters. Apache YARN is the data operating system of Hadoop 2.x. It consists of a master daemon known as the ResourceManager, a slave daemon called the NodeManager, and the ApplicationMaster.

6.

List Hadoop HDFS Commands.

  1. version: Displays the Hadoop version. hadoop version
  2. mkdir: Used to create a new directory. hadoop fs -mkdir /directory_name
  3. cat: Used to display the content of a file present in a directory of HDFS. hadoop fs -cat /path_to_file_in_hdfs
  4. mv: The HDFS mv command moves files or directories from the source to a destination within HDFS. hadoop fs -mv <src> <dest>
  5. copyToLocal: This command copies a file from HDFS to the local file system. hadoop fs -copyToLocal <hdfs source> <localdst>
  6. get: Copies the file from the Hadoop File System to the local file system. hadoop fs -get <src> <localdst>
7.

What are the differences between regular FileSystem and HDFS?

  • Regular FileSystem: In a regular FileSystem, data is maintained in a single system. If the machine crashes, data recovery is challenging due to low fault tolerance. Seek time is longer, and hence it takes more time to process the data.
  • HDFS: Data is distributed and maintained across multiple systems. If a DataNode crashes, data can still be recovered from other nodes in the cluster. The time taken to read data is comparatively higher, as data is read from local disks and coordinated across multiple systems.
8.

What are the two types of metadata that a NameNode server holds?

The two types of metadata that a NameNode server holds are:

  • Metadata on disk: This contains the edit log (EditLogs) and the FsImage
  • Metadata in RAM: This contains the information about the DataNodes
9.

What is the difference between a federation and high availability?

HDFS Federation:

  • There is no limit to the number of NameNodes, and the NameNodes are not related to each other.
  • All the NameNodes share a pool of metadata, in which each NameNode has its own dedicated pool.
  • It provides fault tolerance, i.e., if one NameNode goes down, it does not affect the data of the other NameNodes.

HDFS High Availability:

  • There are two NameNodes that are related to each other; both the active and standby NameNodes run all the time.
  • At any given time, the active NameNode will be up and running, while the standby NameNode will be idle, updating its metadata once in a while.
  • It requires two separate machines: the active NameNode is configured on one, while the standby NameNode is configured on the other.
10.

If you have an input file of 350 MB, how many input splits would HDFS create and what would be the size of each input split?

By default, each block in HDFS is 128 MB. The size of all the blocks except the last will be 128 MB. For an input file of 350 MB, there are three input splits in total, with sizes of 128 MB, 128 MB, and 94 MB.

11.

How does rack awareness work in HDFS?

HDFS Rack Awareness refers to the knowledge of the different DataNodes and how they are distributed across the racks of a Hadoop cluster.

By default, each block of data is replicated three times on various DataNodes present on different racks. Two replicas of the same block cannot be placed on the same DataNode. When a cluster is "rack-aware," all the replicas of a block cannot be placed on the same rack. If a DataNode crashes, you can retrieve the data block from a different DataNode.

12.

What would happen if you store too many small files in a cluster on HDFS?

Storing many small files on HDFS generates a lot of metadata. Keeping this metadata in RAM is a challenge, as each file, block, or directory takes about 150 bytes of metadata. Thus, the cumulative size of all the metadata becomes too large for the NameNode.

13.

When do you use the dfsadmin -refreshNodes and rmadmin -refreshNodes commands?

The commands below are used to refresh the node information while commissioning nodes, or when the decommissioning of nodes is complete.

dfsadmin -refreshNodes

This is used to run the HDFS client and it refreshes node configuration for the NameNode.

rmadmin -refreshNodes

This is used to perform administrative tasks for ResourceManager.

14.

Is there any way to change the replication of files on HDFS after they are already written to HDFS?

Yes, the following are ways to change the replication of files on HDFS:

We can change the dfs.replication value to a particular number in the $HADOOP_HOME/conf/hadoop-site.xml file, which will start replicating to the factor of that number for any new content that comes in.

If you want to change the replication factor for a particular file or directory, use:

$HADOOP_HOME/bin/hadoop dfs -setrep -w 4 /path_of_the_file

Example: $HADOOP_HOME/bin/hadoop dfs -setrep -w 4 /user/temp/questions.csv

15.

Who takes care of replication consistency in a Hadoop cluster?

In a cluster, it is always the NameNode that takes care of the replication consistency.

16.

What do under- and over-replicated blocks mean?

The fsck command provides information regarding over- and under-replicated blocks.

Under-replicated blocks: These are the blocks that do not meet their target replication for the files they belong to. HDFS will automatically create new replicas of under-replicated blocks until they meet the target replication.

Consider a cluster with three nodes and a replication factor of three. If, at any point, one of the DataNodes crashes, the blocks become under-replicated: a replication factor was set, but there are not enough replicas to satisfy it. If the NameNode does not receive information about the replicas, it waits for a limited amount of time and then starts re-replicating the missing blocks from the available nodes.

Over-replicated blocks: These are the blocks that exceed their target replication for the files they belong to. Usually, over-replication is not a problem, and HDFS will automatically delete excess replicas.

Consider a case of three nodes running with the replication of three, and one of the nodes goes down due to a network failure. Within a few minutes, the NameNode re-replicates the data, and then the failed node is back with its set of blocks. This is an over-replication situation, and the NameNode will delete a set of blocks from one of the nodes.

17.

What role do RecordReader, Combiner, and Partitioner play in a MapReduce operation?

  • RecordReader: This communicates with the InputSplit and converts the data into key-value pairs suitable for the mapper to read.

  • Combiner: This is an optional phase; it is like a mini reducer. The combiner receives data from the map tasks, works on it, and then passes its output to the reducer phase.

  • Partitioner: The partitioner decides how many reduce tasks will be used to summarize the data. It also determines how outputs from the combiners are sent to the reducer, and it controls the partitioning of the keys of the intermediate map outputs (a driver wiring these pieces together is sketched below).
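As a rough illustration of where these pieces plug in, a driver might be wired up as follows. This is only a sketch reusing the hypothetical TokenizerMapper and SumReducer from the WordCount example earlier; the RecordReader itself comes from the chosen InputFormat.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // TextInputFormat supplies the RecordReader that turns each split into (offset, line) pairs.
        job.setInputFormatClass(TextInputFormat.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);   // hypothetical mapper from the earlier sketch
        job.setCombinerClass(WordCount.SumReducer.class);      // optional mini-reducer on the map side
        job.setPartitionerClass(HashPartitioner.class);        // decides which reducer receives each key
        job.setReducerClass(WordCount.SumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}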

18.

Why is MapReduce slower in processing data in comparison to other processing frameworks?

This is quite a common question in Hadoop interviews; let us understand why MapReduce is slower in comparison to the other processing frameworks:

MapReduce is slower because:

  • It is batch-oriented when it comes to processing data; no matter what, you have to provide mapper and reducer functions to work on the data.
  • During processing, whenever the mapper function delivers an output, it is written to HDFS and the underlying disks. This data is then shuffled and sorted before being picked up for the reducing phase. The entire process of writing data to HDFS and retrieving it from HDFS makes MapReduce a lengthier process.
  • In addition to the above reasons, MapReduce uses the Java language, which is verbose to program and requires many lines of code.
19.

Is it possible to change the number of mappers to be created in a MapReduce job?

By default, you cannot change the number of mappers, because it is equal to the number of input splits. However, there are different ways in which you can either set a property or customize the code to change the number of mappers.

For example, if you have a 1 GB file that is split into eight blocks (of 128 MB each), there will be only eight mappers running on the cluster. One way to change this is to adjust the split size, as in the sketch below.
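A minimal sketch, assuming the job reads its input through FileInputFormat: capping the maximum split size forces more, smaller splits, and hence more map tasks. The 64 MB figure is purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static Job configure() throws Exception {
        Job job = Job.getInstance(new Configuration(), "more mappers");
        // With a 1 GB input and a 64 MB maximum split size, roughly 16 splits
        // (and therefore roughly 16 map tasks) are created instead of 8.
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024); // bytes
        return job;
    }
}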

20.

Name some Hadoop-specific data types that are used in a MapReduce program.

This is an important question, as you would need to know the different data types if you are getting into the field of Big Data.

For every data type in Java, you have an equivalent in Hadoop. Therefore, the following are some Hadoop-specific data types that you could use in your MapReduce program:

  • IntWritable
  • FloatWritable
  • LongWritable
  • DoubleWritable
  • BooleanWritable
  • ArrayWritable
  • MapWritable
  • ObjectWritable
21.

What is speculative execution in Hadoop?

If a DataNode is executing a task slowly, the master node can redundantly execute another instance of the same task on another node. The task that finishes first is accepted, and the other task is killed. Therefore, speculative execution is useful when working in an environment with intensive workloads.

For example, consider two nodes, A and B, where node A has a slower task. The scheduler keeps track of the available resources, and with speculative execution turned on, a copy of the slower task runs on node B. If node A's task is slower, the output from node B is accepted.

22.

How is identity mapper different from chain mapper?

Identity Mapper:

  • This is the default mapper, chosen when no mapper is specified in the MapReduce driver class.
  • It implements the identity function, which directly writes all its key-value pairs to the output.
  • It is defined in the old MapReduce API (MR1), in the org.apache.hadoop.mapred.lib package.

Chain Mapper:

  • This class is used to run multiple mappers in a single map task.
  • The output of the first mapper becomes the input to the second mapper, the second to the third, and so on.
  • It is defined in the org.apache.hadoop.mapreduce.lib.chain package as the ChainMapper class.
23.

What is the role of the OutputCommitter class in a MapReduce job?

MapReduce relies on the OutputCommitter for the following:

  • Setting up the job during initialization
  • Cleaning up the job after completion
  • Setting up the task's temporary output
  • Checking whether a task needs a commit
  • Committing the task output
  • Discarding the task commit
25.

What happens when a node running a map task fails before sending the output to the reducer?

If this ever happens, the map task will be assigned to a new node, and the entire task will be rerun to re-create the map output. In Hadoop v2, the YARN framework has a temporary daemon called the application master, which takes care of the execution of the application. If a task on a particular node fails due to the unavailability of that node, it is the role of the application master to have the task scheduled on another node.

26.

What benefits did YARN bring in Hadoop 2.0 and how did it solve the issues of MapReduce v1?

In Hadoop v1, MapReduce performed both data processing and resource management; there was only one master process for the processing layer known as JobTracker. JobTracker was responsible for resource tracking and job scheduling.

Managing jobs using a single JobTracker and the utilization of computational resources were inefficient in MapReduce v1. As a result, the JobTracker was overburdened with handling job scheduling and resource management, which led to issues with scalability, availability, and resource utilization. In addition, non-MapReduce jobs couldn't run in v1.

To overcome these issues, Hadoop 2 introduced YARN as the processing layer. In YARN, there is a processing master called the ResourceManager. In Hadoop v2, the ResourceManager runs in high availability mode. There are NodeManagers running on multiple machines and a temporary daemon called the application master. Here, the ResourceManager only handles the client connections and takes care of tracking the resources.

27.

Can we have more than one ResourceManager in a YARN-based cluster?

Yes, Hadoop v2 allows us to have more than one ResourceManager. You can have a high availability YARN cluster where you can have an active ResourceManager and a standby ResourceManager, where the ZooKeeper handles the coordination.

There can only be one active ResourceManager at a time. If an active ResourceManager fails, then the standby ResourceManager comes to the rescue.

28.

Why do we use Hadoop for Big Data?

Hadoop proves to be one of the best solutions for managing Big Data operations due to the following reasons:

  • High storage: Hadoop enables the storage of huge raw files easily without schema.
  • Cost-effective: Hadoop is an economical solution for Big Data distributed storage and processing, as the commodity hardware required to run it is not expensive.
  • Scalable, reliable, and secure: The data can be easily scaled with Hadoop systems, as any number of new nodes can be added. Additionally, the data can be stored and accessed despite machine failure. Lastly, Hadoop provides a high level of data security.
29.

What are some limitations of Hadoop?

  • File size limitations: HDFS cannot handle a large number of small files. If you use such files, the NameNode will be overloaded.
  • Support for batch processing only: Hadoop does not process streamed data and has support for batch processing only. This lowers overall performance.
  • Difficulty in management: It can become difficult to manage complex applications in Hadoop.

30.

What is indexing? How is indexing done in HDFS?

Indexing is used to speed up the process of access to the data. Hadoop has a unique way of indexing. It does not automatically index the data. Instead, it stores the last part of the data that shows where (the address) the next part of the data chunk is stored.

31.

What is meant by a block and block scanner?

In HDFS, a block is the minimum amount of data that is read or written. The default block size is 128 MB (64 MB in older Hadoop 1.x versions). A block scanner tracks the blocks present on a DataNode and then verifies them to find errors.

32.

Explain the three core methods of a reducer.

  • setup(): It is used to configure various parameters like input data size and heap size.
  • reduce(): It is the heart of the reducer.
  • cleanup(): This method is used to clear the temporary files at the end of the reduce task (a skeleton showing all three methods is sketched below).
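A minimal reducer skeleton showing where the three methods sit; the key/value types and the summation logic are illustrative, not tied to any particular job in this article.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ExampleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void setup(Context context) {
        // Called once before any reduce() call: read configuration, open resources, etc.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) {
        // Called once after all reduce() calls: close resources, delete temporary files.
    }
}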
33.

What are the different scheduling policies you can use in YARN?

The YARN scheduler uses three different scheduling policies to assign resources to various applications.

  • FIFO scheduler: This scheduler runs on a first-come-first-serve basis. It lines up the application requests based on submission and aligns the resources accordingly.
  • Capacity scheduler: The capacity scheduler is more logical than the FIFO scheduler. It has a separate queue for smaller requests that require only a handful of resources. As soon as such small requests are submitted, the capacity scheduler starts executing them.
  • Fair scheduler: The fair scheduler runs on the balancing criteria. It allocates the resources in such a way that all the requests get something, creating a balance. As soon as a new request is submitted and a task starts, the fair scheduler divides the resources and allocates each task’s share accordingly.
34.

Why is block size set to 128 MB in Hadoop HDFS?

A block is a continuous location on the hard drive where data is stored. In general, a FileSystem stores data as a collection of blocks. HDFS stores each file as blocks and distributes them across the Hadoop cluster. In HDFS, the default size of a data block is 128 MB, which we can configure as per our requirements. The block size is set to 128 MB:

  • To reduce disk seeks (IO): the larger the block size, the fewer the file blocks, so fewer disk seeks and block transfers are needed, and these can be done within reasonable limits and in parallel.
  • HDFS holds huge data sets, i.e., terabytes and petabytes of data. If we used a 4 KB block size, like the Linux file system, we would have too many blocks and therefore too much metadata. Managing this huge number of blocks and metadata would create huge overhead and traffic, which we don't want. So the block size is set to 128 MB. On the other hand, the block size can't be so large that the system waits a very long time for the last unit of data processing to finish its work.
35.

How is data or a file written into HDFS?

When a client wants to write a file to HDFS, it contacts the NameNode for metadata. The NameNode responds with the number of blocks, the replication factor, and the DataNodes to write to. Based on this information from the NameNode, the client splits the file into multiple blocks and starts sending them to the first DataNode. The client sends block A to DataNode 1 along with the details of the other two DataNodes. When DataNode 1 receives block A from the client, it copies the same block to DataNode 2 of the same rack; since both DataNodes are in the same rack, the block is transferred via the rack switch. DataNode 2 then copies the same block to DataNode 3; since these DataNodes are in different racks, the block is transferred via an out-of-rack switch. After the DataNodes receive the block, they send a write confirmation to the NameNode, and a write confirmation is then sent to the client. The same process repeats for each block of the file, and the transfers happen in parallel for faster writes.

36.

Can multiple clients write into an HDFS file concurrently?

Multiple clients cannot write into an HDFS file at the same time. Apache Hadoop HDFS follows a single-writer, multiple-reader model. When a client opens a file for writing, the NameNode grants it a lease. Now suppose some other client wants to write into that file; it asks the NameNode for the write operation. The NameNode first checks whether it has already granted the lease for writing into that file to someone else. If someone has already acquired the lease, it rejects the write request of the other client.

37.

How is data or a file read in HDFS?

To read from HDFS, the client first communicates with the NameNode for metadata. The NameNode responds with the list of blocks, their locations, and the replication factor. The client then communicates with the DataNodes where the blocks are present and starts reading data in parallel from the DataNodes, based on the information received from the NameNode. Once the client or application receives all the blocks of the file, it combines these blocks to form the file. To improve read performance, the locations of the blocks are ordered by their distance from the client, and HDFS selects the replica that is closest to the client. This reduces read latency and bandwidth consumption: the client first reads the block from the same node, then from another node in the same rack, and finally from a DataNode in another rack.

38.

Why does HDFS store data on commodity hardware despite the higher chance of failures?

HDFS stores data on commodity hardware because HDFS is highly fault-tolerant. HDFS provides fault tolerance by replicating the data blocks and distributing them among different DataNodes across the cluster. By default, the replication factor is 3, which is configurable. Replication of data solves the problem of data loss under unfavorable conditions, such as the crashing of a node or hardware failure. So when any machine in the cluster goes down, the client can easily access their data from another machine that contains the same copy of the data blocks.

39.

In HDFS, how does the NameNode determine which DataNode to write on?

The NameNode contains metadata, i.e., the number of blocks, replicas, their locations, and other details. This metadata is kept in memory on the master for faster retrieval of data. The NameNode maintains and manages the DataNodes and assigns tasks to them; using this metadata and the rack awareness replica placement policy, it chooses the DataNodes that the client should write to. Answer these types of Hadoop interview questions briefly and to the point.

40.

Why is reading done in parallel but writing is not in HDFS?

Clients read data in parallel because doing so lets them access the data faster, and reading in parallel makes the system fault-tolerant. But the client does not perform the write operation in parallel, because writing in parallel might result in data inconsistency. Suppose you have a file and two nodes are trying to write data into it in parallel; then the first node does not know what the second node has written, and vice versa. So we cannot identify which data to store and access. Instead, the client in Hadoop writes data in a pipeline fashion. There are various benefits of a pipeline write:

  • More efficient bandwidth consumption for the client: the client only has to transfer one replica to the first DataNode in the pipeline, so each node only receives and sends one replica over the network (except the last DataNode, which only receives data). This results in balanced bandwidth consumption, as compared to the client writing three replicas to three different DataNodes.
  • Smaller sent/ack window to maintain: the client maintains a much smaller sliding window, which records which blocks of the replica are being sent to the DataNodes and which blocks are waiting for acknowledgments to confirm the write has been done. In a pipeline write, the client appears to write data to only one DataNode.
41.

What is Mapper in Hadoop?

The Mapper task processes each input record (from the RecordReader) and generates a key-value pair. The key-value pairs generated by the mapper are completely different from the input pair. The mapper stores its intermediate output on the local disk, not on HDFS: it is temporary data, and writing it to HDFS would create unnecessary multiple copies.

The mapper only understands key-value pairs, so the data is first converted into key-value pairs before being passed to the mapper. The InputSplit and RecordReader perform this conversion: an InputSplit is the logical representation of the data, and the RecordReader communicates with the InputSplit and converts the data into key-value pairs.

The key is a reference to the input value, and the value is the data set on which to operate. The number of map tasks depends on the total size of the input, i.e., the total number of blocks of the input files:

Number of mappers = (total data size) / (input split size)

For example, if the data size is 1 TB and the input split size is 100 MB, then the number of mappers = (1,000 * 1,000) / 100 = 10,000.

42.

What is Reducer in Hadoop?

The Reducer takes the output of the Mapper (intermediate key-value pairs) as its input and runs a reduce function on each of them to generate the output. The output of the reducer is the final output, which is stored in HDFS. Usually, in the Reducer, we do aggregation or summation sorts of computation. The Reducer has three primary phases:

  • Shuffle: The framework fetches the relevant partition of the output of all the mappers for each reducer via HTTP.
  • Sort: The framework groups the reducer inputs by key in this phase. The shuffle and sort phases occur simultaneously.
  • Reduce: After shuffling and sorting, the reduce task aggregates the key-value pairs. In this phase, the reduce(Object, Iterator, OutputCollector, Reporter) method is called for each <key, (list of values)> pair in the grouped inputs.

With the help of Job.setNumReduceTasks(int), the user sets the number of reducers for the job. The right number of reducers is usually 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).

43.

How to set mappers and reducers for MapReduce jobs?

One can configure JobConf to set the number of mappers and reducers (see the sketch below):

  • For the mapper – job.setNumMapTasks()
  • For the reducer – job.setNumReduceTasks()
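A minimal sketch of both APIs. Note that in the old (mapred) API the map-task count set on JobConf is only a hint, while in the new (mapreduce) API only the reducer count is set directly, since the number of mappers follows the input splits.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapreduce.Job;

public class TaskCountExample {
    public static void main(String[] args) throws Exception {
        // Old (mapred) API: both values can be set on JobConf; the map count is only a hint.
        JobConf jobConf = new JobConf();
        jobConf.setNumMapTasks(10);
        jobConf.setNumReduceTasks(4);

        // New (mapreduce) API: only the number of reducers is set explicitly.
        Job job = Job.getInstance(new Configuration(), "task counts");
        job.setNumReduceTasks(4);
    }
}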
44.

What is the need of key-value pair to process the data in MapReduce?

Hadoop MapReduce works on unstructured and semi-structured data in addition to structured data. Structured data, like that stored in an RDBMS, can be read by columns, but handling unstructured data is feasible using key-value pairs. The core idea of MapReduce is based on these pairs: the framework maps the data into a collection of key-value pairs with the mapper, and the reducer works on all the pairs with the same key. In most computations, the map operation is applied to each logical "record" in the input to compute a set of intermediate key-value pairs, and then a reduce operation is applied to all the values that share the same key, combining the derived data properly. Thus, we can say that key-value pairs are the best solution for working on data problems in MapReduce.

45.

If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to the reducer?

Hadoop MapReduce uses the HashPartitioner by default. It uses the hashCode() method to determine to which partition a given (key, value) pair will be sent. The HashPartitioner has a method called getPartition, which takes key.hashCode() & Integer.MAX_VALUE and finds the modulus using the number of reduce tasks. For example, if there are 10 reduce tasks, getPartition will return values 0 through 9 for all keys.

public class HashPartitioner<K, V> extends Partitioner<K, V>
{
    // The partition for a key is its hash code (made non-negative) modulo the number of reducers.
    @Override
    public int getPartition(K key, V value, int numReduceTasks)
    {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
46.

How to write a custom partitioner for a Hadoop MapReduce job?

A custom partitioner distributes the map output across the different reducers, based on a user-defined condition.

By setting a partitioner to partition by the key, we can guarantee that records for the same key will go to the same reducer. It also ensures that only one reducer receives all the records for that particular key.

By the following steps, we can write Custom partitioner for a Hadoop MapReduce job:

  • Create a new class that extends the Partitioner class.
  • Then, override the getPartition method, in the wrapper that runs in MapReduce.
  • Add the custom partitioner to the job by using the setPartitionerClass method, or add the custom partitioner to the job as a config file (a sketch is given below).
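A minimal sketch of such a custom partitioner; the rule of partitioning by the first letter of the key is purely illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative rule: keys starting with 'a'-'m' go to one bucket, the rest to another,
// and each bucket is mapped onto the available reduce tasks.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks == 0 || key.getLength() == 0) {
            return 0; // map-only job or empty key: everything goes to partition 0
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        int bucket = (first >= 'a' && first <= 'm') ? 0 : 1;
        return bucket % numReduceTasks;
    }
}

The partitioner would then be registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class).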
47.

Why can't aggregation be done in the Mapper?

The Mapper task processes each input record (from the RecordReader) and generates a key-value pair, and it stores this intermediate output on the local disk. We cannot perform aggregation in the mapper because:

  • Sorting takes place only on the reducer side; there is no provision for sorting in the mapper function, and without sorting, aggregation is not possible.
  • To perform aggregation, we need the output of all the mapper functions, which may not be possible to collect in the map phase, because the mappers may be running on different machines where the data blocks are present.
  • If we try to perform aggregation of data at the mapper, it requires communication between all the mapper functions, which may be running on different machines. This would consume high network bandwidth and could cause a network bottleneck.
48.

Explain a map-only job.

MapReduce is the data processing layer of Hadoop. It is the framework for writing applications that process the vast amounts of data stored in HDFS. It processes huge amounts of data in parallel by dividing the job into a set of independent tasks (sub-jobs). In Hadoop, MapReduce has two phases of processing: Map and Reduce. In the Map phase, we specify all the complex logic/business rules/costly code. Map takes a set of data and converts it into another set of data, breaking individual elements into tuples (key-value pairs). In the Reduce phase, we specify lightweight processing like aggregation/summation. Reduce takes the output from the Map as input, combines the tuples (key-value pairs) based on the key, and then modifies the value of the key accordingly.

Consider a case where we just need to perform an operation and no aggregation is required. In such a case, we prefer a "map-only job" in Hadoop: the map does all the work on its InputSplit, the reducer does no work, and the map output is the final output. We can achieve this by setting job.setNumReduceTasks(0) in the driver configuration, which makes the number of reducers 0, so that only the mapper performs the complete task (see the sketch below).
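A minimal driver sketch for a map-only job, reusing the hypothetical TokenizerMapper from the WordCount sketch earlier; the output key/value classes must match what the mapper emits.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only job");
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class); // hypothetical mapper from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Zero reducers: the map output is written straight to HDFS as the final output,
        // and the sort/shuffle phase is skipped entirely.
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}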

49.

Define Writable data types in Hadoop MapReduce.

Hadoop reads and writes data in a serialized form through the Writable interface. The Writable interface has several implementing classes, such as Text, IntWritable, LongWritable, FloatWritable, and BooleanWritable. Users are also free to define their own Writable classes, as sketched below.
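A custom Writable is written by implementing write() and readFields(); the Point2D class below is a hypothetical example, not a class shipped with Hadoop.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// A hypothetical custom value type holding two coordinates.
public class Point2D implements Writable {
    private double x;
    private double y;

    public Point2D() { }                      // Hadoop needs a no-arg constructor for deserialization
    public Point2D(double x, double y) { this.x = x; this.y = y; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeDouble(x);                   // serialize fields in a fixed order
        out.writeDouble(y);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        x = in.readDouble();                  // deserialize in the same order
        y = in.readDouble();
    }

    @Override
    public String toString() { return x + "\t" + y; }
}

If such a type were used as a key rather than a value, it would need to implement WritableComparable instead, so that the framework can sort it.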

Hadoop Freshers Interview Questions

1.

List the YARN components.

  • Resource Manager: It runs on a master daemon and controls the resource allocation in the cluster.
  • Node Manager: It runs on the slave daemons and executes a task on every single DataNode.
  • Application Master: It controls the user job lifecycle and resource demands of single applications. It works with the Node Manager and monitors the execution of tasks.
  • Container: It is a combination of resources, including RAM, CPU, Network, HDD, etc., on a single node.

Advanced Hadoop Interview Questions

1.

What is the difference between an RDBMS and Hadoop MapReduce?

  • Size of data: A traditional RDBMS can handle up to gigabytes of data, whereas Hadoop MapReduce can handle petabytes of data or more.
  • Updates: An RDBMS supports reading and writing multiple times; MapReduce follows a write-once, read-many-times model.
  • Schema: An RDBMS has a static schema that needs to be pre-defined; MapReduce has a dynamic schema.
  • Processing model: An RDBMS supports both batch and interactive processing; MapReduce supports only batch processing.
  • Scalability: An RDBMS scales non-linearly; MapReduce scales linearly.
2.

When is it not recommended to use MapReduce paradigm for large scale data processing?

For iterative processing use cases, it is not suggested to use MapReduce, as it is not cost-effective; instead, Apache Pig can be used for the same.

3.

Explain the usage of Context Object.

With the help of the Context object, the Mapper can easily interact with the other Hadoop systems. It also helps in updating counters, so that the counters can report the progress and provide any application-level status updates. It contains the configuration details for the job.

4.

How many InputSplits will be made by the Hadoop framework?

InputFormat is responsible for creating the InputSplits, which are the logical representation of the data. The Hadoop framework further divides each split into records, and the Mapper then processes each record (which is a key-value pair). The MapReduce system uses storage locations to place map tasks as close to the split's data as possible. By default, the split size is approximately equal to the HDFS block size (128 MB). For example, if the file size is 514 MB, then: 128 MB: 1st block, 128 MB: 2nd block, 128 MB: 3rd block, 128 MB: 4th block, 2 MB: 5th block. So, 5 InputSplits are created based on the 5 blocks.

5.

How is the splitting of a file invoked in Hadoop?

InputFormat is responsible for creating the InputSplits, which are the logical representation of the data. The Hadoop framework further divides each split into records, and the Mapper then processes each record (which is a key-value pair). The Hadoop framework invokes the splitting of a file by running the getSplits() method, which belongs to the InputFormat class (such as FileInputFormat) defined by the user.

6.

What are the parameters of mappers and reducers?

The parameters for Mappers are:

  • LongWritable (input key)
  • Text (input value)
  • Text (intermediate output key)
  • IntWritable (intermediate output value)

The parameters for Reducers are:

  • Text (intermediate output key)
  • IntWritable (intermediate output value)
  • Text (final output key)
  • IntWritable (final output value)
7.

What is Chain Mapper?

We can use multiple Mapper classes within a single map task by using the ChainMapper class. The Mapper classes are invoked in a chained (or piped) fashion: the output of the first becomes the input of the second, and so on until the last mapper. The Hadoop framework writes the output of the last mapper to the task's output.

The key benefit of this feature is that the mappers in the chain do not need to be aware that they execute in a chain. This enables reusable, specialized mappers that can be combined to perform composite operations within a single task in Hadoop. The Hadoop framework takes special care when creating chains: the key/value pairs output by a mapper must be valid for the following mapper in the chain.

The class name is org.apache.hadoop.mapred.lib.ChainMapper in the old API (org.apache.hadoop.mapreduce.lib.chain.ChainMapper in the new API); a usage sketch follows.
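A usage sketch with the new-API ChainMapper; the two mapper classes here are illustrative. The important point is that each addMapper() call declares the input and output key/value classes of that link, and the output classes of one link must match the input classes of the next.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;

public class ChainExample {

    // First link: upper-cases each line and emits (upper-cased line, original line).
    public static class UpperCaseMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(value.toString().toUpperCase()), value);
        }
    }

    // Second link: consumes the first link's output and emits (line, length of the value).
    public static class LengthMapper extends Mapper<Text, Text, Text, IntWritable> {
        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, new IntWritable(value.getLength()));
        }
    }

    public static void configureChain(Job job) throws IOException {
        Configuration mapperConf = new Configuration(false);
        ChainMapper.addMapper(job, UpperCaseMapper.class,
                LongWritable.class, Text.class, Text.class, Text.class, mapperConf);
        ChainMapper.addMapper(job, LengthMapper.class,
                Text.class, Text.class, Text.class, IntWritable.class, mapperConf);
    }
}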

8.

Explain the process of spilling in MapReduce.

Spilling is a process of copying the data from memory buffer to disk when the buffer usage reaches a specific threshold size. This happens when there is not enough memory to fit all of the mapper output. By default, a background thread starts spilling the content from memory to disk after 80 percent of the buffer size is filled.

For a 100 MB size buffer, the spilling will start after the content of the buffer reaches a size of 80 MB.

9.

How do you add/delete a node to/from an existing cluster?

To add a node to the existing cluster: add the new node's hostname/IP address to the dfs.hosts/slaves file, and then refresh the cluster with $hadoop dfsadmin -refreshNodes. To remove a node from the existing cluster: add the hostname/IP address to dfs.hosts.exclude, remove the entry from the slaves file, and then refresh the cluster with $hadoop dfsadmin -refreshNodes.

10.

Is the NameNode machine the same as the DataNode machine in terms of hardware in Hadoop?

The NameNode is a highly available server, unlike the DataNode. The NameNode manages the file system namespace and maintains the metadata information: the number of blocks, their locations, replicas, and other details. It also executes file system operations such as naming, closing, and opening files/directories. For these reasons, the NameNode requires more RAM, for storing the metadata of millions of files. The DataNode, on the other hand, is responsible for storing the actual data in HDFS and performs read and write operations as per the clients' requests; therefore, the DataNode needs a higher disk capacity for storing huge data sets.

11.

How does the NameNode tackle DataNode failures in Hadoop?

HDFS has a master-slave architecture in which the master is the NameNode and the slaves are the DataNodes. An HDFS cluster has a single NameNode that manages the file system namespace (metadata) and multiple DataNodes that are responsible for storing the actual data in HDFS and performing read-write operations as per the clients' requests. The NameNode receives a heartbeat and a block report from each DataNode. Receipt of a heartbeat implies that the DataNode is alive and functioning properly, and a block report contains a list of all the blocks on that DataNode. When the NameNode observes that a DataNode has not sent a heartbeat message after a certain amount of time, the DataNode is marked as dead. The NameNode then replicates the blocks of the dead node to other DataNodes using the replicas created earlier. Hence, the NameNode can easily handle DataNode failure.

12.

How many Reducers run for a MapReduce job?

With the help of Job.setNumReduceTasks(int), the user sets the number of reducers for the job. To set the right number of reducers, use the formula: 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).

With 0.95, all the reducers can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes finish their first round of reduces and launch a second wave of reduces.

With the increase of number of reducers:

  • Load balancing increases.
  • Cost of failures decreases.
  • Framework overhead increases.
13.

What are counters in Hadoop MapReduce?

Counters in MapReduce are a useful channel for gathering statistics about the MapReduce job, whether for quality control or at the application level. They are also useful for problem diagnosis. Counters validate that:

  • the number of bytes read and written within the map/reduce job is correct;
  • the number of tasks launched and successfully run in the map/reduce job is correct;
  • the amount of CPU and memory consumed is appropriate for our job and cluster nodes.

There are two types of counters:

  • Built-in counters – Hadoop maintains some built-in counters for every job, which report various metrics; for example, there are counters for the number of bytes and records, which allow us to confirm that the expected amount of input is consumed and the expected amount of output is produced.
  • User-defined counters – Hadoop MapReduce permits user code to define a set of counters, which are then incremented as desired in the mapper or reducer. For example, in Java, counters are defined with an 'enum', as sketched below.
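A minimal sketch of a user-defined counter; the enum name, the mapper, and the notion of a "malformed" record are all illustrative.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // User-defined counters, grouped under the enum's name in the job's counter output.
    public enum RecordQuality { GOOD, MALFORMED }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.toString().trim().isEmpty()) {
            context.getCounter(RecordQuality.MALFORMED).increment(1); // count and skip bad records
            return;
        }
        context.getCounter(RecordQuality.GOOD).increment(1);
        context.write(new Text(value.toString()), new IntWritable(1));
    }
}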
14.

What happens if the number of reducers is set to 0 in Hadoop?

If we set the number of reducers to 0:

  • No reducer will execute and no aggregation will take place.
  • In such a case, we prefer a "map-only job" in Hadoop. In a map-only job, the map does all the work on its InputSplit, the reducer does no job, and the map output is the final output. Between the map and reduce phases there is a sort and shuffle phase, which is responsible for sorting the keys in ascending order and grouping values based on the same keys. This phase is very expensive; if the reduce phase is not required, we should avoid it, since avoiding the reduce phase eliminates the sort and shuffle phase as well. This also saves network congestion: during shuffling, the output of the mapper travels to the reducer, and when the data size is huge, a large amount of data travels to the reducer.
15.

What is KeyValueTextInputFormat in Hadoop?

KeyValueTextInputFormat treats each line of input as a separate record and breaks the line itself into a key and a value. It uses the tab character ('\t') to split the line into a key-value pair.

Key: everything up to the tab character.

Value: the remaining part of the line after the tab character. Consider the following input file, where → represents a (horizontal) tab character:

But→ his face you could not see
Account→ of his beaver hat

Output: Key: But, Value: his face you could not see; Key: Account, Value: of his beaver hat. A driver sketch for selecting this input format is given below.
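A driver would select this input format roughly as shown below. The separator property name is the Hadoop 2 one, and the tab character is the default even if the property is not set explicitly.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueInputExample {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();
        // Tab is the default separator; shown here only to make it explicit.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");

        Job job = Job.getInstance(conf, "key-value input");
        job.setInputFormatClass(KeyValueTextInputFormat.class); // keys and values are both Text
        return job;
    }
}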

16.

Explain the partitioning, shuffle, and sort phases in MapReduce.

  • Partitioning phase – Partitioning specifies that all the values for each key are grouped together and ensures that all the values of a single key go to the same reducer, thus allowing an even distribution of the map output over the reducers.

  • Shuffle phase – This is the process by which the system sorts the key-value output of the map tasks and transfers it to the reducer.

  • Sort phase – The mapper generates the intermediate key-value pairs. Before the reducer starts, the MapReduce framework sorts these key-value pairs by key. This also helps the reducer to easily distinguish when a new reduce task should start, which saves time for the reducer.

17.

What is meant by streaming access?

HDFS works on the principle of write once, read many. Its focus is on fast and accurate data retrieval. Streaming access means reading the complete data set instead of retrieving a single record from the database.

18.

Explain what happens if, during the PUT operation, HDFS block is assigned a replication factor 1 instead of the default value 3?

The replication factor can be set for the entire cluster to adjust the number of replicated blocks; it ensures high data availability. For every block present in HDFS, the cluster will have n-1 duplicated blocks. So, if the replication factor during the PUT operation is set to 1 in place of the default value 3, then there will be only a single copy of the data, and if that DataNode crashes under any circumstances, the only copy of the data will be lost.

19.

If the number of DataNodes increases, do we need to upgrade the NameNode in Hadoop?

The NameNode stores metadata, i.e., the number of blocks, their locations, and replicas. In Hadoop, this metadata is kept in memory on the master for faster retrieval of data. The NameNode manages and maintains the slave nodes and assigns tasks to them. It regulates clients' access to files and executes file system operations such as naming, closing, and opening files/directories. During Hadoop installation, the framework determines the NameNode based on the size of the cluster. We mostly don't need to upgrade the NameNode when DataNodes are added, because it does not store the actual data but only the metadata, so such a requirement rarely arises.

Hadoop Interview Questions For Experienced

1.

What is meant by a heartbeat in HDFS?

The signal sent periodically by the DataNodes to the NameNode is called a heartbeat. If the NameNode does not receive a heartbeat from a DataNode, it means there is an issue with that DataNode.

2.

What is DistCp?

It is a tool used for copying a very large amount of data to and from Hadoop file systems in parallel. It uses MapReduce to effect its distribution, error handling, recovery, and reporting. It expands a list of files and directories into input to map tasks, each of which copies a partition of the files specified in the source list.

3.

Why are blocks in HDFS huge?

By default, the size of the HDFS data block is 128 MB. The ideas for the large size of blocks are:

  • To reduce the cost of seeks: with large blocks, the time taken to transfer the data from the disk is significantly longer than the time taken to seek to the start of the block. As a result, reading a file made of multiple blocks operates at the disk transfer rate.
  • If the blocks were small, there would be too many blocks in Hadoop HDFS and too much metadata to store. Managing such a vast number of blocks and metadata would create overhead and lead to traffic in the network.
4.

What is the default replication factor?

By default, the replication factor is 3, and no two copies are placed on the same DataNode. Usually, the first two copies will be on the same rack, and the third copy will be on a different rack. It is advised to set the replication factor to at least three so that one copy is always safe, even if something happens to the rack. We can set the default replication factor for the file system as well as for each file and directory individually. For files that are not essential, we can lower the replication factor, while critical files should have a high replication factor.

5.

How can you skip the bad records in Hadoop?

Hadoop provides an option where a particular set of bad input records can be skipped when processing map inputs. Applications can control this feature through the SkipBadRecords class. This feature can be used when map tasks fail deterministically on particular inputs, which usually happens due to faults in the map function; the user would have to fix these issues.

6.

Where are the two types of metadata that NameNode server stores?

The two types of metadata that NameNode server stores are in Disk and RAM. Metadata is linked to two files which are:

  1. EditLogs: It contains all the recent changes to the file system made since the last FsImage.
  2. FsImage: It contains the whole state of the file system namespace since the creation of the NameNode. Once a file is deleted from HDFS, the NameNode immediately records this in the EditLogs. The Secondary NameNode continuously reads all the file system metadata present in the NameNode's RAM and writes it to the file system or hard disk. The EditLogs are combined with the FsImage in the NameNode: periodically, the Secondary NameNode downloads the EditLogs from the NameNode and applies them to the FsImage. The new FsImage is then copied back to the NameNode and is used the next time the NameNode starts.
7.

Explain the purpose of the dfsadmin tool?

The dfsadmin tools are a specific set of tools designed to help you root out information about your Hadoop Distributed File System (HDFS). As a bonus, you can use them to perform some administration operations on HDFS as well, such as -report, -refreshNodes, -finalizeUpgrade, -safemode enter | leave | get | wait, and -upgradeProgress status | details | force.

8.

Explain the actions followed by a Jobtracker in Hadoop.

  • The client application is used to submit the jobs to the Jobtracker.
  • The JobTracker associates with the NameNode to determine the data location.
  • With the help of the available slots and proximity to the data, the JobTracker locates suitable TaskTracker nodes.
  • It submits the work on the selected TaskTracker Nodes.
  • When a task fails, JobTracker notifies and decides the further steps.
  • JobTracker monitors the TaskTracker nodes.
9.

Explain the distributed Cache in MapReduce framework.

Distributed Cache is a significant feature provided by the MapReduce Framework, practiced when you want to share the files across all nodes in a Hadoop cluster. These files can be jar files or simple properties files.

Hadoop's MapReduce framework provides the facility to cache small to moderate read-only files such as text files, zip files, jar files, etc., and distribute them to all the DataNodes (worker nodes) where MapReduce jobs are running. Every DataNode gets a copy of the file (a local copy), which is sent by the Distributed Cache.

10.

List the actions that happen when a DataNode fails.

  • Both the JobTracker and the NameNode detect the failure and identify which blocks were on the failed DataNode.
  • All the tasks on the failed node are rescheduled by locating other DataNodes with copies of these blocks.
  • The user's data is replicated to another node by the NameNode to maintain the configured replication factor.
11.

What are the basic parameters of a mapper?

The primary parameters of a mapper are LongWritable, Text, Text, and IntWritable. The first two represent the input parameters, and the other two signify the intermediate output parameters.

12.

Mention the main configuration parameters that have to be specified by the user to run MapReduce.

The chief configuration parameters that the user of the MapReduce framework needs to mention are:

  • Job’s input Location
  • Job’s Output Location
  • The Input format
  • The Output format
  • The Class including the Map function
  • The Class including the reduce function
  • JAR file, which includes the mapper, the Reducer, and the driver classes.
13.

How can you restart NameNode and all the daemons in Hadoop?

The following commands will help you restart NameNode and all the daemons:

  • You can stop the NameNode with the ./sbin/hadoop-daemon.sh stop namenode command and then start the NameNode using the ./sbin/hadoop-daemon.sh start namenode command.
  • You can stop all the daemons with the ./sbin/stop-all.sh command and then start the daemons using the ./sbin/start-all.sh command.
14.

What is Apache Flume in Hadoop?

Apache Flume is a tool/service/data ingestion mechanism for assembling, aggregating, and carrying huge amounts of streaming data, such as log files and events, from various sources to a centralized data store. Flume is a very stable, distributed, and configurable tool. It is generally designed to copy streaming data (log data) from various web servers to HDFS.

15.

Mention the consequences of Distributed Applications.

  • Heterogeneity: The design of applications should allow users to access services and run applications over a heterogeneous collection of computers and networks, taking into consideration hardware devices, operating systems, networks, and programming languages.
  • Transparency: Distributed system designers must hide the complexity of the system as much as they can. Some forms of transparency are location, access, migration, relocation, and so on.
  • Openness: It is a characteristic that determines whether the system can be extended and reimplemented in various ways.
  • Security: Distributed system designers must take care of confidentiality, integrity, and availability.
  • Scalability: A system is said to be scalable if it can handle the addition of users and resources without suffering a noticeable loss of performance.
16.

Explain how YARN allocates resources to an application with the help of its architecture.

There is a client/application/API which talks to ResourceManager. The ResourceManager manages the resource allocation in the cluster. It has two internal components, scheduler, and application manager. The ResourceManager is aware of the resources that are available with every node manager. The scheduler allocates resources to various running applications when they are running in parallel. It schedules resources based on the requirements of the applications. It does not monitor or track the status of the applications.

The ApplicationsManager accepts job submissions and monitors and restarts the ApplicationMasters in case of failure. The ApplicationMaster manages the resource needs of an individual application: it interacts with the Scheduler to acquire the required resources and with the NodeManagers to execute and monitor the tasks. The NodeManager tracks the jobs running on its node and monitors each container's resource utilization.

A container is a collection of resources, such as RAM, CPU, or network bandwidth. It provides the rights to an application to use a specific amount of resources.

Whenever a job is submitted, the ResourceManager asks a NodeManager to hold some resources for processing, and the NodeManager guarantees a container that will be available for processing. Next, the ResourceManager starts a temporary daemon called the ApplicationMaster to take care of the execution. The ApplicationMaster, which the ApplicationsManager launches, runs in one of the containers; the remaining containers are used for the execution itself. This is, briefly, how YARN takes care of allocation.

17.

Explain Data Locality in Hadoop?

A major drawback of early Hadoop deployments was cross-switch network traffic caused by the huge volume of data. To overcome this, data locality came into the picture. It refers to moving the computation close to where the actual data resides on a node, instead of moving large amounts of data to the computation. Data locality increases the overall throughput of the system. In Hadoop, HDFS stores the datasets; they are divided into blocks and stored across the DataNodes in the cluster. When a user runs a MapReduce job, the framework schedules the tasks on the DataNodes where the relevant data resides, using the block-location information held by the NameNode. Data locality has three categories:

  • Data local – The data is on the same node as the mapper working on it, so the data is as close as possible to the computation. This is the most preferred scenario.
  • Intra-rack – The mapper runs on a different node but on the same rack. This is used when it is not possible to execute the mapper on the same DataNode due to resource constraints.
  • Inter-rack – The mapper runs on a node in a different rack. This is used when it is not possible to execute the mapper on any node in the same rack due to resource constraints.
18.

What is Safemode in Hadoop?

Safemode in Apache Hadoop is a maintenance state of the NameNode during which the NameNode doesn't allow any modifications to the file system. In Safemode, the HDFS cluster is read-only, and blocks are neither replicated nor deleted. At the startup of the NameNode:

  • It loads the file system namespace from the last saved FsImage and the edits log file into its main memory.
  • It merges the edits log into the FsImage, which results in a new file system namespace.
  • It then receives block reports containing block-location information from all the DataNodes. The NameNode enters Safemode automatically during startup and leaves Safemode after the DataNodes have reported that most blocks are available. Useful commands:

hadoop dfsadmin -safemode get     (check the status of Safemode)
hadoop dfsadmin -safemode enter   (enter Safemode)
hadoop dfsadmin -safemode leave   (come out of Safemode)

The NameNode front page also shows whether Safemode is on or off.
19.

How is security achieved in Hadoop?

Apache Hadoop achieves security by using Kerberos. At a high level, there are three steps that a client must take to access a service when using Kerberos, each of which involves a message exchange with a server.

  • Authentication – The client authenticates itself to the authentication server. Then, receives a timestamped Ticket-Granting Ticket (TGT).
  • Authorization – The client uses the TGT to request a service ticket from the Ticket Granting Server.
  • Service Request – The client uses the service ticket to authenticate itself to the server.
20.

Why does one remove or add nodes in a Hadoop cluster frequently?

One of the most important features of Hadoop is its use of commodity hardware. However, this leads to frequent DataNode crashes in a Hadoop cluster. Another striking feature of Hadoop is the ease of scaling with the rapid growth in data volume. For these reasons, administrators add and remove DataNodes in a Hadoop cluster frequently.

21.

What is throughput in Hadoop?

Throughput is the amount of work done in a unit of time. HDFS provides good throughput for the following reasons:

  • HDFS follows the write-once, read-many model. This simplifies data-coherency issues, as data written once cannot be modified, and thus provides high-throughput data access.
  • Hadoop works on the data-locality principle, which states that computation should be moved to the data instead of data to the computation. This reduces network congestion and therefore enhances the overall system throughput.
22.

What does jps command do in Hadoop?

The jps command helps us check whether the Hadoop daemons are running or not. It shows all the Hadoop daemons running on the machine, such as the NameNode, DataNode, ResourceManager, and NodeManager.

23.

What is fsck?

fsck is the File System Check. Hadoop HDFS uses the fsck (file system check) command to check for various inconsistencies and to report problems with the files in HDFS, for example missing blocks for a file or under-replicated blocks. It is different from the traditional fsck utility for the native file system: it does not correct the errors it detects. Normally, the NameNode automatically corrects most recoverable failures. The file system check also ignores open files by default, but provides an option to include all files during reporting. The HDFS fsck command is not a Hadoop shell command; it can also be run as bin/hdfs fsck. The file system check can run on the whole file system or on a subset of files.

hdfs fsck <path>
[-list-corruptfileblocks |
[-move | -delete | -openforwrite]
[-files [-blocks [-locations | -racks]]]]
[-includeSnapshots]

  • <path> – Start checking from this path.
  • -delete – Delete corrupted files.
  • -files – Print out the files being checked.
  • -files -blocks – Print out the block report.
  • -files -blocks -locations – Print out locations for every block.
  • -files -blocks -racks – Print out the network topology for DataNode locations.
  • -includeSnapshots – Include snapshot data if the given path indicates or includes a snapshottable directory.
  • -list-corruptfileblocks – Print the list of missing blocks and the files they belong to.

24.

How to debug Hadoop code?

First, check the list of MapReduce jobs currently running. Then check whether any orphaned jobs are running; if yes, you need to determine the location of the ResourceManager (RM) logs.

  1. First of all, run ps -ef | grep -i ResourceManager and look for the log directory in the displayed result. Find the job-id in the displayed list and check whether there is an error message associated with that job.
  2. On the basis of the RM logs, identify the worker node that was involved in the execution of the task.
  3. Log in to that node and run ps -ef | grep -i NodeManager.
  4. Examine the NodeManager log.
  5. The majority of errors come from the user-level logs for each map-reduce job.
25.

Explain Hadoop streaming?

The Hadoop distribution provides a generic utility and application programming interface (API) called Hadoop Streaming, which allows writing map and reduce jobs in any desired programming language. The utility lets you create and run jobs with any executable as the mapper and/or the reducer. For example:

hadoop jar hadoop-streaming-3.0.jar \
-input myInputDirs \
-output myOutputDir \
-mapper /bin/cat \
-reducer /usr/bin/wc

In the example, both the mapper and the reducer are executables that read input from stdin (line by line) and emit output to stdout. The utility creates and submits the Map/Reduce job to an appropriate cluster and monitors the progress of the job until it completes. Hadoop Streaming uses both streaming command options and generic command options. Be sure to place the generic options before the streaming options, otherwise the command will fail. The general command-line syntax is: hadoop command [genericOptions] [streamingOptions]

26.

How does Hadoop's CLASSPATH play a vital role in starting or stopping Hadoop daemons?

CLASSPATH includes all the directories containing the jar files required to start/stop Hadoop daemons. For example, HADOOP_HOME/share/hadoop/common/lib contains all the utility jar files. We cannot start or stop the Hadoop daemons if the CLASSPATH is not set. We can set the CLASSPATH inside the /etc/hadoop/hadoop-env.sh file. The next time you run hadoop, the CLASSPATH is added automatically; you don't need to specify it on the command line each time.

27.

What is configured in /etc/hosts and what is its role in setting up a Hadoop cluster?

The /etc/hosts file contains the hostnames and IP addresses of the hosts, i.e. it maps each IP address to a hostname. In a Hadoop cluster, we store the hostnames of all the nodes (master and slaves) with their IP addresses in /etc/hosts, so that we can use hostnames easily instead of IP addresses.

28.

How is the splitting of file invoked in Hadoop framework?

Input files store the data for a Hadoop MapReduce task, and these files typically reside in HDFS. InputFormat defines how these input files are split and read. It is responsible for creating InputSplits, which are the logical representation of data, and for dividing each split into records; the mapper then processes each record (a key-value pair). The Hadoop framework invokes the splitting of the file by calling the getSplits() method of the InputFormat class (such as FileInputFormat) configured for the job.
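A minimal sketch (newer org.apache.hadoop.mapreduce API; the class name is illustrative) showing that the InputFormat owns the splitting decision: overriding isSplitable() makes getSplits() produce a single InputSplit per file.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Sketch: making files non-splittable, so each input file becomes one logical split.
public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // getSplits() will not break this file into multiple InputSplits
    }
}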

29.

How to provide multiple input to Hadoop?

It is possible by using the MultipleInputs class. For example, if we had weather data from the UK Met Office and we wanted to combine it with the NCDC data for our maximum-temperature analysis, we could set up the input as follows:

MultipleInputs.addInputPath(job, ncdcInputPath, TextInputFormat.class, MaxTemperatureMapper.class);
MultipleInputs.addInputPath(job, metofficeInputPath, TextInputFormat.class, MetofficeMaxTemperatureMapper.class);

The above code replaces the usual calls to FileInputFormat.addInputPath() and job.setMapperClass(). Both the Met Office and NCDC data are text based, so we use TextInputFormat for each, but we use two different mappers because the two data sources have different line formats. MaxTemperatureMapper reads the NCDC input data and extracts the year and temperature fields, while MetofficeMaxTemperatureMapper reads the Met Office input data and extracts the same fields.

30.

How to have hadoop job output in multiple directories?

It is possible by using the following approaches (a cleaned-up sketch follows the list):

  1. Using the MultipleOutputs class – This class simplifies writing output data to multiple outputs. In the driver, register a named output with MultipleOutputs.addNamedOutput(job, "OutputFileName", OutputFormatClass, keyClass, valueClass); The API provides overloaded write methods on the MultipleOutputs instance, for example write("OutputFileName", new Text(key), new Text(value)); To write the output files to separate output directories, use the overloaded write method that takes an extra base-output-path parameter, write("OutputFileName", new Text(key), new Text(value), baseOutputPath); and change the baseOutputPath for each output.
  2. Rename/move the files in the driver class – This is the easiest hack for writing output to multiple directories: use MultipleOutputs to write all the output files to a single directory, with a different file name for each category, and then rename or move them into separate directories in the driver.
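A minimal sketch of the first approach (the class name CategoryReducer, the named output "categories", and the per-key sub-directory layout are illustrative assumptions, using the org.apache.hadoop.mapreduce.lib.output.MultipleOutputs API):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class CategoryReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // The last argument is the base output path, so each key's records land
            // in their own sub-directory under the job's output directory.
            out.write("categories", key, value, key.toString() + "/part");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        out.close(); // flush and close all the extra outputs
    }
}

// In the driver, the named output is registered before submitting the job, e.g.:
//   MultipleOutputs.addNamedOutput(job, "categories", TextOutputFormat.class, Text.class, Text.class);
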
31.

How to copy a file into HDFS with a different block size to that of existing block size configuration?

One can copy a file into HDFS with a different block size by using -Ddfs.blocksize=block_size, where block_size is in bytes. For example, suppose you want to copy a file called test.txt of size 128 MB into HDFS, and for this file you want the block size to be 32 MB (33554432 bytes) instead of the default 128 MB. You would issue the following command:

hadoop fs -Ddfs.blocksize=33554432 -copyFromLocal /home/dataflair/test.txt /sample_hdfs

You can then check the HDFS block size associated with this file by:

hadoop fs -stat %o /sample_hdfs/test.txt

Alternatively, you can use the NameNode web UI to inspect the HDFS directory.

32.

Why does HDFS perform replication, although it results in data redundancy?

In HDFS, replication provides fault tolerance. Data replication is one of the most important and unique features of HDFS. Replication solves the problem of data loss in unfavorable conditions such as node crashes and hardware failures. By default, HDFS creates 3 replicas of each block across the cluster, and we can change this as per our needs. So, if any node goes down, we can recover the data held on that node from another node. Replication does lead to the consumption of a lot of space, but the user can always add more nodes to the cluster if required; free-space issues are rare in practical clusters, since the very first reason to deploy HDFS is to store huge data sets. One can also change the replication factor to save HDFS space, or use the different codecs provided by Hadoop to compress the data.
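As a small illustrative sketch (the path is hypothetical), the replication factor can also be changed per file programmatically, in addition to the hadoop fs -setrep shell command:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Lower the replication factor of a cold data set to reclaim HDFS space.
        fs.setReplication(new Path("/archive/2019/logs.txt"), (short) 2); // hypothetical path
        fs.close();
    }
}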

33.

Explain Hadoop Archives?

Apache Hadoop HDFS stores and processes large (terabyte-scale) data sets. However, storing a large number of small files in HDFS is inefficient, since each file is stored in a block and the block metadata is held in memory by the NameNode. Reading many small files also causes lots of seeks and lots of hopping from DataNode to DataNode to retrieve each file, which is an inefficient data-access pattern. Hadoop Archives (HAR) deal with the small-files issue: a HAR packs a number of small files into a large file, so one can access the original files transparently (without expanding the archive), efficiently, and in parallel. Hadoop Archives are special-format archives that map to a file system directory and always have a *.har extension. In particular, Hadoop MapReduce can use Hadoop Archives as input.

34.

Explain the Single point of Failure in Hadoop?

In Hadoop 1.0, the NameNode is a single point of failure (SPOF): if the NameNode fails, all clients are unable to read or write files, and the whole Hadoop system is out of service until a new NameNode is brought up. Hadoop 2.0 overcomes this SPOF by providing support for multiple NameNodes. The high-availability feature adds an extra NameNode to the Hadoop architecture and provides automatic failover: if the active NameNode fails, the standby NameNode takes over all the responsibilities of the active node and the cluster continues to work. The initial implementation of NameNode high availability provided for a single active and a single standby NameNode. However, some deployments require a higher degree of fault tolerance, so Hadoop 3.0 allows the user to run multiple standby NameNodes. For instance, with three NameNodes and five JournalNodes configured, the cluster can tolerate the failure of two NameNodes rather than one.

35.

Explain Erasure Coding in Hadoop?

In Hadoop, by default HDFS replicates each block three times. Replication in HDFS is a very simple and robust form of redundancy to shield against DataNode failure, but it is expensive: a 3x replication scheme has 200% overhead in storage space and other resources. Hadoop 3.x therefore introduced Erasure Coding, a new feature that can be used in place of replication. It provides the same level of fault tolerance with far less storage, around 50% storage overhead. Erasure Coding borrows from RAID (Redundant Array of Inexpensive Disks), which implements EC through striping: logically sequential data (such as a file) is divided into smaller units (such as bits, bytes, or blocks) and stored on different disks. During encoding, parity cells are calculated for each stripe of data cells, and errors are recovered from the parity; in other words, erasure coding extends a message with redundant data for fault tolerance. An EC codec operates on uniformly sized data cells: it takes a number of data cells as input and produces parity cells as output, and the data cells and parity cells together are called an erasure coding group. There are two algorithms available for Erasure Coding:

  • XOR algorithm
  • Reed-Solomon algorithm

36.

What is Disk Balancer in Hadoop?

HDFS provides a command-line tool called diskbalancer. It distributes data evenly across all the disks of a DataNode: the tool operates against a given DataNode and moves blocks from one disk to another. Disk balancer works by creating a plan (a set of statements) and executing that plan on the DataNode; the plan describes how much data should move between two disks. A plan is composed of multiple move steps, and each move step has a source disk, a destination disk, and the number of bytes to move. The plan executes against an operational DataNode. Disk balancer is not enabled by default; to enable it, dfs.disk.balancer.enabled must be set to true in hdfs-site.xml. When we write a new block to HDFS, the DataNode uses a volume-choosing policy to select the disk for the block (each directory is a volume in HDFS terminology). There are two such policies:

  • Round-robin: It distributes the new blocks evenly across the available disks.
  • Available space: It writes data to the disk that has maximum free space (by percentage).
37.

Explain the difference between a MapReduce InputSplit and HDFS block using an example?

Consider an example where we need to store a file in HDFS. HDFS stores files as blocks, and a block is the smallest unit of data that can be stored or retrieved from disk; the default block size is 128 MB. HDFS breaks files into blocks and stores these blocks on different nodes in the cluster. Say we have a file of 130 MB, so HDFS breaks it into 2 blocks. Now, if we want to run a MapReduce operation on the blocks directly, processing would break, because a record can span the block boundary and the second block is incomplete on its own. InputSplit solves this problem: an InputSplit forms a logical grouping of blocks, since it includes the location of the next block and the byte offset of the data needed to complete the record. From this we can conclude that an InputSplit is only a logical chunk of data, i.e. it just holds information such as block addresses or locations. During MapReduce execution, Hadoop scans through the blocks and creates InputSplits.
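A brief sketch of the distinction in practice (the job name is illustrative): the block size is fixed when the file is written, while the split size is a per-job setting that the InputFormat uses when computing the logical splits.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        // With a 130 MB file and the default 128 MB block size, HDFS stores two blocks
        // (128 MB + 2 MB). Raising the minimum split size to 256 MB makes the InputFormat
        // compute a single logical InputSplit spanning both blocks; the physical blocks
        // stored in HDFS are unchanged.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
        // ... set mapper/reducer, formats and paths, then submit ...
    }
}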

38.

What is a Backup node in Hadoop?

Backup node provides the same checkpointing functionality as the Checkpoint node (Checkpoint node is a node which periodically creates checkpoints of the namespace. Checkpoint Node downloads FsImage and edits from the active NameNode, merges them locally, and uploads the new image back to the active NameNode). In Hadoop, Backup node keeps an in-memory, up-to-date copy of the file system namespace, which is always synchronized with the active NameNode state.

The Backup node does not need to download the FsImage and edits files from the active NameNode in order to create a checkpoint, as would be required with a Checkpoint node or Secondary NameNode, since it already has an up-to-date state of the namespace in memory. The Backup node's checkpoint process is more efficient, as it only needs to save the namespace into the local FsImage file and reset the edits. The NameNode supports one Backup node at a time, and no Checkpoint nodes may be registered while a Backup node is in use.

39.

What is active and passive NameNode in Hadoop?

  • Active NameNode – It is the NameNode which works and runs in the cluster. It is also responsible for all client operations in the cluster.
  • Passive NameNode – It is a standby NameNode, which holds data similar to the active NameNode. It simply acts as a slave and maintains enough state to provide a fast failover if necessary. If the active NameNode fails, the passive NameNode takes over all the responsibilities of the active node and the cluster continues to work.
40.

What are the most common OutputFormat in Hadoop?

The Reducer takes the mapper output as its input and produces output (zero or more key-value pairs). The RecordWriter writes these output key-value pairs from the reducer phase to output files, and the OutputFormat determines how the RecordWriter writes them. The FileOutputFormat.setOutputPath() method is used to set the output directory, and every reducer writes a separate file in that common output directory. The most common OutputFormats are listed below (a small driver sketch follows the list):

  • TextOutputFormat – It is the default OutputFormat in MapReduce. TextOutputFormat writes key-value pairs on individual lines of text files. Keys and values of this format can be of any type, because TextOutputFormat turns them into strings by calling toString() on them.
  • SequenceFileOutputFormat – This OutputFormat writes sequence files for its output. It is commonly used as an intermediate format between MapReduce jobs.
  • SequenceFileAsBinaryOutputFormat – It is a variant of SequenceFileOutputFormat that writes keys and values to a sequence file in binary format.
  • DBOutputFormat – We use this for writing to relational databases and HBase. It sends the reduce output to a SQL table and accepts key-value pairs where the key has a type extending DBWritable.
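A minimal driver sketch (the job name is illustrative) showing how the OutputFormat is chosen, here switching from the default TextOutputFormat to SequenceFileOutputFormat:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class OutputFormatExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sequence-file-output");
        // Use SequenceFileOutputFormat when the output feeds another MapReduce job.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // ... set mapper/reducer, input format and paths, then submit ...
    }
}
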
41.

What is LazyOutputFormat in Hadoop?

FileOutputFormat subclasses will create output files (part-r-nnnnn) even if they are empty. Some applications prefer not to create empty files, which is where LazyOutputFormat helps. LazyOutputFormat is a wrapper OutputFormat: it makes sure that an output file is created only when the first record is emitted for a given partition (see the sketch after the list below).

  • To use LazyOutputFormat in Java code, call its setOutputFormatClass() method, passing the Job (or JobConf) and the real OutputFormat to wrap.
  • For Streaming and Pipes jobs, the -lazyOutput option enables LazyOutputFormat.
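A minimal sketch with the newer (org.apache.hadoop.mapreduce) API, wrapping TextOutputFormat; the job name is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LazyOutputExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "lazy-output");
        // Wrap the real OutputFormat so empty part-r-nnnnn files are not created.
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        // ... rest of the driver configuration ...
    }
}
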
42.

How to handle record boundaries in Text files or Sequence files in MapReduce InputSplits?

An InputSplit's RecordReader in MapReduce must "start" and "end" at a record boundary. In a SequenceFile, a 20-byte sync mark is written roughly every 2 KB between records. These sync marks allow the RecordReader, which is given a file, a length, and an offset for its split, to seek to the first sync mark after the start of the split; it then continues processing records until it reaches the first sync mark after the end of the split.

Similarly, Text files use newlines instead of sync marks to handle record boundaries.

43.

What is Identity Mapper?

Identity Mapper is the default mapper provided by Hadoop. When a MapReduce program has not defined any mapper class, the identity mapper runs. It simply passes the input key-value pairs on to the reducer phase; it performs no computation or calculation on the input data and only writes the input data to the output. The class name is org.apache.hadoop.mapred.lib.IdentityMapper.
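As a sketch of its behaviour (written against the newer org.apache.hadoop.mapreduce API, where the base Mapper class itself acts as the identity mapper; the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;

// Every input key-value pair is passed straight through to the reducer phase.
public class PassThroughMapper<K, V> extends Mapper<K, V, K, V> {
    @Override
    protected void map(K key, V value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value); // no transformation, no computation
    }
}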

44.

What is Identity reducer?

Identity Reducer is the default reducer provided by Hadoop. When a MapReduce program has not defined any reducer class, the identity reducer runs. This does not mean that the reduce step will not take place: it will take place, and the related sorting and shuffling will also happen, but there will be no aggregation. So you can use the identity reducer if you want to sort the data coming from the map phase but don't care about any grouping.
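A matching sketch of the identity reducer's behaviour (newer API; the class name is illustrative):

import java.io.IOException;
import org.apache.hadoop.mapreduce.Reducer;

// Sorting and shuffling still happen, but each value is written out unchanged.
public class PassThroughReducer<K, V> extends Reducer<K, V, K, V> {
    @Override
    protected void reduce(K key, Iterable<V> values, Context context)
            throws IOException, InterruptedException {
        for (V value : values) {
            context.write(key, value); // no aggregation
        }
    }
}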

MCQ

1.

What is HBase used as?

  • A: Tool for Random and Fast Read/Write operations in Hadoop
  • B: MapReduce wrapper
  • C: Hadoop SQL interface
  • D: Fast MapReduce layer in Hadoop

Answer: A

2.

Hive can be used as?

  • A: Hadoop query engine
  • B: Small Batch Processing framework
  • C: Hadoop SQL interface
  • D: both (a) and (c)

Answer: D

3.

Where is the HDFS replication factor controlled?

  • A: mapred-site.xml
  • B: yarn-site.xml
  • C: core-site.xml
  • D: hdfs-site.xml

Answer: D

4.

Which of the following writable can be used to know the value from a mapper/reducer?

  • A: Text
  • B: IntWritable
  • C: Nullwritable
  • D: String

Answer: C

5.

Hive data models represent

  • A: Table in HbaseStorage
  • B: Table in HiveStorage
  • C: Directories in HDFS
  • D: None of the above

Answer: C

6.

Hive managed tables stores the data in

  • A: Local Linux path
  • B: Any HDFS path
  • C: HDFS warehouse path
  • D: None of the above

Answer: C

7.

Which of the following statements is correct?

  • A: Master and slaves files are optional in Hadoop 2.x
  • B: Master file has list of all name nodes
  • C: Core-site has hdfs and MapReduce related common properties
  • D: hdfs-site file is now deprecated in Hadoop 2.x

Answer: C

8.

Data from HBase can be loaded into Pig using

  • A: HiveStorage
  • B: SqoopStorage
  • C: DiskStorage
  • D: HbaseStorage

Answer: D

9.

The number of maps is usually driven by the total size of?

  • A: outputs
  • B: tasks
  • C: inputs
  • D: None of the mentioned

Answer: C

10.

Which function is accountable for consolidating the results produced by each of the Map() functions/tasks.

  • A: Reduce
  • B: Reducer
  • C: Map
  • D: All of the mentioned

Answer: A

11.

Select the correct statement.

  • A: MapReduce tries to place the data and the compute as close as possible
  • B: Reduce Task in MapReduce is performed using the Map() function
  • C: Map Task in MapReduce is performed using the Mapper() function
  • D: All of the mentioned

Answer: A

12.

Who will initiate the mapper?

  • A: Task tracker
  • B: Sqoop
  • C: Mapper
  • D: Reducer

Answer: A

13.

Which of the following are true for Hadoop Pseudo Distributed Mode?

  • A: It runs on multiple machines
  • B: Runs on multiple machines without any daemons
  • C: Runs on Single Machine with all daemons
  • D: Runs on Single Machine without all daemons

Answer: C

14.

Which of the following has replaced JobTracker from MapReduce v1?

  • A: NodeManager
  • B: ApplicationManager
  • C: ResourceManager
  • D: Scheduler

Answer: C

General

1.

Which one of the following is false about Hadoop?

  • A: It is a distributed framework
  • B: The main algorithm used in Hadoop is Map Reduce
  • C: Hadoop can work with commodity hardware
  • D: All are true

Answer: D

2.

Hadoop Framework is written in

  • A: C++
  • B: Python
  • C: Java
  • D: JavaScript

Answer: C
