- List characteristics of big data.
- Explain Hadoop MapReduce.
- Find the word count for example_data.txt using MapReduce. (The content of example_data.txt is: coding,jamming,ice,river,man,driving)
- What is shuffling in MapReduce?
- What is YARN?
- List Hadoop HDFS Commands.
- What are the differences between regular FileSystem and HDFS?
- What are the two types of metadata that a NameNode server holds?
- What is the difference between a federation and high availability?
- If you have an input file of 350 MB, how many input splits would HDFS create and what would be the size of each input split?
- How does rack awareness work in HDFS?
- What would happen if you store too many small files in a cluster on HDFS?
- When do you use the dfsadmin -refreshNodes and rmadmin -refreshNodes commands?
- Is there any way to change the replication of files on HDFS after they are already written to HDFS?
- Who takes care of replication consistency in a Hadoop cluster?
- What do under-replicated and over-replicated blocks mean?
- What role do RecordReader, Combiner, and Partitioner play in a MapReduce operation?
- Why is MapReduce slower in processing data in comparison to other processing frameworks?
- Is it possible to change the number of mappers to be created in a MapReduce job?
- Name some Hadoop-specific data types that are used in a MapReduce program.
- What is speculative execution in Hadoop?
- How is identity mapper different from chain mapper?
- What is the role of the OutputCommitter class in a MapReduce job?
- What happens when a node running a map task fails before sending the output to the reducer?
- What benefits did YARN bring in Hadoop 2.0 and how did it solve the issues of MapReduce v1?
- Can we have more than one ResourceManager in a YARN-based cluster?
- Why do we use Hadoop for Big Data?
- What are some limitations of Hadoop?
- What is indexing? How is indexing done in HDFS?
- What is meant by a block and block scanner?
- Explain the three core methods of a reducer.
- What are the different scheduling policies you can use in YARN?
- Why is block size set to 128 MB in Hadoop HDFS?
- How is data or a file written into HDFS?
- Can multiple clients write into an HDFS file concurrently?
- How is data or a file read from HDFS?
- Why does HDFS store data on commodity hardware despite the higher chance of failures?
- In HDFS, how does the NameNode determine which DataNode to write to?
- Why is reading done in parallel in HDFS while writing is not?
- What is a Mapper in Hadoop?
- What is a Reducer in Hadoop?
- How do you set mappers and reducers for MapReduce jobs?
- Why are key-value pairs needed to process data in MapReduce?
- If no custom partitioner is defined in Hadoop, how is data partitioned before it is sent to the reducer?
- How do you write a custom partitioner for a Hadoop MapReduce job?
- Why can't aggregation be done in the Mapper?
- Explain a map-only job.
- Define Writable data types in Hadoop MapReduce.
- What is the difference between an RDBMS and Hadoop MapReduce?
- When is it not recommended to use the MapReduce paradigm for large-scale data processing?
- Explain the usage of the Context object.
- How many InputSplits will be made by the Hadoop framework?
- How is the splitting of a file invoked in Hadoop?
- What are the parameters of mappers and reducers?
- What is Chain Mapper?
- Explain the process of spilling in MapReduce.
- How do you add a node to, or remove a node from, an existing cluster?
- Is the NameNode machine the same as the DataNode machine in terms of hardware in Hadoop?
- How does the NameNode handle DataNode failures in Hadoop?
- How many Reducers run for a MapReduce job?
- What is a counter in Hadoop MapReduce?
- What happens if the number of reducers is set to 0 in Hadoop?
- What is KeyValueTextInputFormat in Hadoop?
- Explain the partitioning, shuffle, and sort phases in MapReduce.
- What is meant by streaming access?
- Explain what happens if, during the PUT operation, an HDFS block is assigned a replication factor of 1 instead of the default value of 3.
- If the number of DataNodes increases, do we need to upgrade the NameNode in Hadoop?
- What is meant by a heartbeat in HDFS?
- What is DistCp?
- Why are blocks in HDFS huge?
- What is the default replication factor?
- How can you skip the bad records in Hadoop?
- Where does the NameNode server store the two types of metadata?
- Explain the purpose of the dfsadmin tool.
- Explain the actions followed by a JobTracker in Hadoop.
- Explain the DistributedCache in the MapReduce framework.
- List the actions that happen when a DataNode fails.
- What are the basic parameters of a mapper?
- Mention the main configuration parameters that have to be specified by the user to run MapReduce.
- How can you restart NameNode and all the daemons in Hadoop?
- What is Apache Flume in Hadoop?
- Mention the consequences of distributed applications.
- Explain how YARN allocates resources to an application with the help of its architecture.
- Explain data locality in Hadoop.
- What is Safemode in Hadoop?
- How is security achieved in Hadoop?
- Why does one remove or add nodes in a Hadoop cluster frequently?
- What is throughput in Hadoop?
- What does the jps command do in Hadoop?
- What is fsck?
- How do you debug Hadoop code?
- Explain Hadoop Streaming.
- How does Hadoop's CLASSPATH play a vital role in starting or stopping Hadoop daemons?
- What is configured in /etc/hosts and what is its role in setting up a Hadoop cluster?
- How is the splitting of a file invoked in the Hadoop framework?
- How do you provide multiple inputs to Hadoop?
- How do you write Hadoop job output to multiple directories?
- How do you copy a file into HDFS with a block size different from the configured block size?
- Why does HDFS perform replication, although it results in data redundancy?
- Explain Hadoop Archives.
- Explain the single point of failure in Hadoop.
- Explain erasure coding in Hadoop.
- What is the Disk Balancer in Hadoop?
- Explain the difference between a MapReduce InputSplit and an HDFS block using an example.
- What is a Backup node in Hadoop?
- What are the active and passive NameNodes in Hadoop?
- What are the most common OutputFormats in Hadoop?
- What is LazyOutputFormat in Hadoop?
- How to handle record boundaries in Text files or Sequence files in MapReduce InputSplits?
- What is an Identity Mapper?
- What is an Identity Reducer?
- What is HBase used as?
- What can Hive be used as?
- Where is the HDFS replication factor controlled?
- Which of the following Writables can be used to know the value from a mapper/reducer?
- Hive data models represent
- Hive managed tables store the data in
- Which of the following statements is correct?
- Data from HBase can be loaded into Pig using
- The number of maps is usually driven by the total size of?
- Which function is accountable for consolidating the results produced by each of the Map() functions/tasks?
- Select the correct statement.
- Who will initiate the mapper?
- Which of the following are true for Hadoop Pseudo Distributed Mode?
- Which of the following has replaced JobTracker from MapReduce v1?
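
For the word-count question on example_data.txt, the map, shuffle/sort, and reduce phases can be sketched in plain Python without a Hadoop cluster. This is only an illustrative simulation, not Hadoop's API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are hypothetical, and the file content is inlined as a string on the assumption that it is a single comma-separated line as given in the question.

```python
from collections import defaultdict

# Content of example_data.txt as stated in the question (assumed to be one line).
EXAMPLE_DATA = "coding,jamming,ice,river,man,driving"


def map_phase(line):
    """Map: emit a (word, 1) pair for every non-empty comma-separated token."""
    return [(word.strip(), 1) for word in line.split(",") if word.strip()]


def shuffle_phase(pairs):
    """Shuffle/sort: group the values by key and order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Reduce: sum the list of ones collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(text):
    """Run the three simulated phases over every line of the input text."""
    pairs = []
    for line in text.splitlines():
        pairs.extend(map_phase(line))
    return reduce_phase(shuffle_phase(pairs))


if __name__ == "__main__":
    for word, count in word_count(EXAMPLE_DATA).items():
        print(f"{word}\t{count}")
```

Each of the six words occurs once in the sample input, so every count is 1; the keys come out in sorted order because the simulated shuffle step sorts them, mirroring the sorted key order a reducer receives.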