In this post, we have put together the best Hive interview questions for beginner, intermediate and experienced candidates. These questions and quiz items are meant for quick browsing before an interview, or as a detailed guide to the Hive topics interviewers look for.
What is the major difference between local and remote meta-store?
- Local Meta-store: In the local meta-store design, the meta-store service runs in the same JVM as the Hive service and connects to a database running in a separate JVM, either on the same machine or on a remote machine.
- Remote Meta-store: In the remote meta-store design, the meta-store service runs in its own separate JVM, not in the Hive service JVM. Other processes communicate with the meta-store server using Thrift network APIs. You can run one or more meta-store servers in this setup to provide higher availability.
What kind of data warehouse application is suitable for Hive? What are the types of tables in Hive?
Hive is not considered a full database. The design constraints of Hadoop and HDFS restrict what Hive can do. However, Hive is most suitable for data warehouse applications that:
- Analyze relatively static data
- Do not require fast response times
- Do not make rapid changes to the data
Although Hive doesn’t provide the fundamental features required for Online Transaction Processing (OLTP), it is well suited to data warehouse applications over large datasets.
There are two types of tables in Hive:
- Managed tables
- External tables
Explain the SMB Join in Hive?
In an SMB join, every mapper reads a bucket from the first table and the corresponding bucket from the second table, and then performs a merge-sort join. Sort Merge Bucket (SMB) joins are widely used in Hive because they place no restriction on file, partition or table size, so they work best when the tables are large. For an SMB join, the tables must be bucketed and sorted on the join columns, and all tables should have the same number of buckets.
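The idea can be modelled outside Hive. The following is a minimal Python sketch, not Hive's implementation: the function names and the sample tables are hypothetical, and each bucket is assumed to be already sorted by the join key.

```python
def merge_join(left, right):
    """Join two lists of (key, value) rows that are already sorted by key."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit every pairing for this key, then advance the left side.
            k = j
            while k < len(right) and right[k][0] == lk:
                out.append((lk, left[i][1], right[k][1]))
                k += 1
            i += 1
    return out


def smb_join(left_buckets, right_buckets):
    """Each 'mapper' joins bucket n of one table with bucket n of the other."""
    if len(left_buckets) != len(right_buckets):
        raise ValueError("both tables need the same number of buckets")
    joined = []
    for left_bucket, right_bucket in zip(left_buckets, right_buckets):
        joined.extend(merge_join(left_bucket, right_bucket))
    return joined


# Hypothetical tables, bucketed into 2 buckets on key % 2 and sorted by key.
orders = [[(2, "o-a"), (4, "o-b")], [(1, "o-c"), (3, "o-d")]]
users = [[(2, "u-x"), (6, "u-y")], [(1, "u-z"), (3, "u-w")]]
result = smb_join(orders, users)
```

Because matching keys always land in the same bucket number, no bucket ever needs to be compared with a non-corresponding bucket, which is why the join can finish in the map phase.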
How is Hive different from an RDBMS?
- An RDBMS enforces schema on write, whereas Hive applies schema on read.
- Hive follows a write-once, read-many model, whereas in an RDBMS we can write as many times as we want.
- Hive can handle very large datasets, whereas a traditional RDBMS typically struggles beyond roughly 10 TB.
- Hive is highly scalable, whereas scaling an RDBMS costs a lot.
- Hive has a feature of Bucketing which is not there in RDBMS.
What types of databases does Hive support?
For single-user metadata storage, Hive uses the Derby database; for multi-user or shared metadata, Hive uses MySQL.
In Hive, how can you enable buckets?
In Hive, you can enable buckets by using the following command: set hive.enforce.bucketing=true;
Is Hive suitable to be used for OLTP systems? Why?
No, it is not suitable for OLTP systems, since it does not offer insert and update at the row level.
What is the Object Inspector functionality in Hive?
In Hive, the Object Inspector functionality is used to analyze the internal structure of rows, columns and complex objects. It also gives access to the internal fields that are present inside the objects.
What are limitations of Hive?
- We cannot perform real-time queries with Hive. Also, it does not offer row-level updates.
- Query latency is high, so Hive offers at best acceptable latency for interactive data browsing.
- Hive is not the right choice for online transaction processing.
What are the different Modes in the Hive?
Sometimes interviewers like to ask these basic questions to see how confident you are in your Hive knowledge. Answer by saying that Hive can operate in two modes, MapReduce mode and local mode, and explain that the choice depends on the number of DataNodes in Hadoop and the size of the data.
What is Hive Bucketing?
Bucketing splits the data of a table or partition into a fixed number of files (buckets) based on the hash of a column. When performing queries on large datasets in Hive, bucketing can offer better structure to Hive tables. You’ll also want to take your answer a step further by explaining some of the specific bucketing features, as well as some of the advantages of bucketing in Hive. For example, bucketing can give programmers more flexibility when it comes to record-keeping and can make it easier to debug large datasets when needed.
What is the difference between partition and bucketing?
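The main aim of both partitioning and bucketing is to execute queries more efficiently. Partitioning creates one subdirectory per distinct value of the partition column, so the number of slices grows with the data. With bucketing, the number of slices is fixed when you create the table, and each row is assigned to a bucket by hashing the bucketing column.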
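To make the contrast concrete, here is a small Python sketch of how rows could be laid out under both schemes. It is an illustration only: the `bucket_N` file names and the helper functions are hypothetical, and it assumes integer keys that hash to themselves.

```python
def bucket_for(key, num_buckets):
    # Assumption: an integer key hashes to itself; the bucket is the
    # non-negative hash modulo the fixed bucket count.
    return (key & 0x7FFFFFFF) % num_buckets


def layout(rows, partition_col, bucket_col, num_buckets):
    """Group rows by partition directory, then by bucket file."""
    files = {}
    for row in rows:
        bucket = bucket_for(row[bucket_col], num_buckets)
        path = f"{partition_col}={row[partition_col]}/bucket_{bucket}"
        files.setdefault(path, []).append(row)
    return files


rows = [
    {"country": "US", "user_id": 1},
    {"country": "US", "user_id": 2},
    {"country": "IN", "user_id": 3},
    {"country": "IN", "user_id": 5},
]
files = layout(rows, "country", "user_id", 2)
```

The partition directories (`country=US`, `country=IN`) appear only because those values occur in the data, while the bucket count of 2 is chosen up front and never changes.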
Where does the data of a Hive table get stored?
By default, a Hive table's data is stored in the HDFS directory /user/hive/warehouse. You can change this by specifying the desired directory in the hive.metastore.warehouse.dir configuration parameter in hive-site.xml.
How does data transfer happen from HDFS to Hive?
If the data is already present in HDFS, the user does not need to run LOAD DATA, which would move the files to /user/hive/warehouse/. Instead, the user simply defines the table with the EXTERNAL keyword, which creates the table definition in the Hive metastore and leaves the data where it is.
What does the Hive query processor do?
The Hive query processor converts a query into a graph of MapReduce jobs and provides the execution-time framework that runs those jobs in the order of their dependencies.
Explain about SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY in Hive.
- SORT BY – Data is ordered at each of the N reducers, where the reducers can have overlapping ranges of data.
- ORDER BY – This is similar to ORDER BY in SQL: a total ordering of the data is achieved by passing it through a single reducer.
- DISTRIBUTE BY – It is used to distribute the rows among the reducers. Rows that have the same DISTRIBUTE BY columns go to the same reducer.
- CLUSTER BY – It is a combination of DISTRIBUTE BY and SORT BY, where each of the N reducers gets a non-overlapping range of data, which is then sorted at the respective reducer.
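The four clauses can be compared with a toy Python model in which each inner list stands for one reducer's output. This is a sketch under simplifying assumptions: rows are plain integers, the function names are hypothetical, and DISTRIBUTE BY is modelled as key modulo the reducer count.

```python
def distribute_by(rows, n):
    """Rows with the same key always go to the same reducer."""
    reducers = [[] for _ in range(n)]
    for row in rows:
        reducers[row % n].append(row)
    return reducers


def sort_by(reducers):
    """Each reducer sorts only its own rows; there is no global order."""
    return [sorted(r) for r in reducers]


def order_by(rows):
    """A single reducer sees every row, so the order is total."""
    return sorted(rows)


def cluster_by(rows, n):
    """DISTRIBUTE BY followed by SORT BY on the same key."""
    return sort_by(distribute_by(rows, n))


rows = [5, 2, 8, 1, 4, 7]
arbitrary_split = [rows[:3], rows[3:]]    # rows reach reducers in no particular way
sort_by_result = sort_by(arbitrary_split)
order_by_result = order_by(rows)
cluster_by_result = cluster_by(rows, 2)
```

In this model, SORT BY leaves two sorted lists whose ranges overlap, ORDER BY yields one fully sorted list, and CLUSTER BY guarantees that no key appears on two reducers while each reducer's output is sorted.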
What is the difference between local and remote metastore?
Local Metastore: The metastore service runs in the same JVM in which the Hive service is running and connects to a database running in a separate JVM, either on the same machine or on a remote machine.
Remote Metastore: In this configuration, the metastore service runs on its own separate JVM and not in the Hive service JVM.
Which classes are used in Hive to Read and Write HDFS Files?
Following classes are used by Hive to read and write HDFS files:
- TextInputFormat/HiveIgnoreKeyTextOutputFormat: These 2 classes read/write data in plain text file format
- SequenceFileInputFormat/SequenceFileOutputFormat: These 2 classes read/write data in hadoop SequenceFile format.
Explain the functionality of ObjectInspector.
ObjectInspector helps analyze the internal structure of a row object and the individual structure of columns in Hive. It also provides a uniform way to access complex objects that can be stored in multiple formats in memory, including:
- An instance of a Java class
- A standard Java object
- A lazily initialized object
ObjectInspector tells the structure of the object and also the ways to access the internal fields inside the object.
What is ObjectInspector functionality in Hive?
Hive uses ObjectInspector to analyze the internal structure of rows, columns and complex objects. It also gives us ways to access the internal fields inside an object. It processes not only common data types like int, bigint and STRING, but also complex data types like arrays, maps, structs and unions.
How does bucketing help in the faster execution of queries?
If you have to join two large tables, you can go for a reduce-side join. But if both tables have the same number of buckets, or bucket counts that are multiples of each other, and are also sorted on the same column, a sort-merge-bucket map join (SMBMJ) is possible, in which the whole join takes place in the map phase by matching the corresponding buckets.
Buckets are basically files that are created inside the HDFS directory.
There are different properties which you need to set for bucket map joins and they are as follows:
set hive.enforce.sortmergebucketmapjoin = false;
set hive.auto.convert.sortmerge.join = false;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
Why will mapreduce not run if you run select * from table in hive?
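When you perform a "select * from" query without filters or aggregations, Hive reads the table's files directly through a fetch task instead of launching a MapReduce job, so the rows are simply dumped to the output.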
What is Hive MetaStore?
The metastore is Hive's central repository; it stores metadata in an external database. By default Hive stores metadata in a Derby database, but you can use MySQL or Oracle instead, depending on the project.
What are the three different modes in which hive can be run?
- Local mode
- Distributed mode
- Pseudodistributed mode
How can you prevent a large job from running for a long time?
This can be achieved by setting MapReduce jobs to execute in strict mode: set hive.mapred.mode=strict; Strict mode ensures that queries on partitioned tables cannot execute without a WHERE clause on the partition column.
When do we use explode in Hive?
Hadoop developers sometimes take an array as input and need to convert it into separate table rows. To convert complex data types into the desired table format, Hive uses explode.
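As a rough model of the behaviour, the Python sketch below turns each array element into its own row while carrying the other columns along, which is closest to what explode does when combined with a lateral view. The function name and sample table are hypothetical.

```python
def explode(rows, array_col):
    """Yield one output row per element of the array column."""
    for row in rows:
        for item in row[array_col]:
            out = {k: v for k, v in row.items() if k != array_col}
            out[array_col] = item
            yield out


table = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": []},      # an empty array produces no output rows
    {"id": 3, "tags": ["c"]},
]
exploded = list(explode(table, "tags"))
```

Note that the row with an empty array disappears from the output in this sketch.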
What are the different components of a Hive architecture?
The Hive architecture consists of:
- User Interface – The UI component of the Hive architecture calls the execute interface of the driver.
- Driver – The driver creates a session handle for the query and sends the query to the compiler to generate an execution plan for it.
- Metastore – Sends the metadata to the compiler for the execution of the query on receiving the sendMetaData request.
- Compiler – The compiler generates the execution plan, which is a DAG of stages where each stage is either a metadata operation, a map or reduce job, or an operation on HDFS.
- Execution Engine – The execution engine is responsible for submitting each of these stages to the relevant components, managing the dependencies between the stages in the execution plan generated by the compiler.
How can you connect an application, if you run Hive as a server?
When running Hive as a server, an application can connect in one of three ways:
- ODBC Driver – This supports the ODBC protocol.
- JDBC Driver – This supports the JDBC protocol.
- Thrift Client – This client can be used to make calls to all Hive commands from different programming languages such as PHP, Python, Java, C++ and Ruby.
Can we LOAD data into a view?
No. A view cannot be the target of an INSERT or LOAD statement.
Is it possible to add 100 nodes when we already have 100 nodes in Hive? If yes, how?
Yes, we can add the nodes by following the below steps:
- Step 1: Take a new system; create a new username and password
- Step 2: Install SSH and set up SSH connections with the master node
- Step 3: Add the SSH public RSA key to the authorized keys file
- Step 4: Add the new DataNode hostname, IP address, and other details to the /etc/hosts and slaves files: 192.168.1.102 slave3.in slave3
- Step 5: Start the DataNode on the new node
- Step 6: Log in to the new node, for example with su hadoop or: ssh -X email@example.com
- Step 7: Start HDFS on the newly added slave node by using the following command: ./bin/hadoop-daemon.sh start datanode
- Step 8: Check the output of the jps command on the new node
Can Hive process any type of data formats?
Yes, Hive uses the SerDe interface for IO operations. Different SerDe implementations can read and write different types of data. Hive processes plain data directly; for other types of data stored in Hadoop, it uses a different SerDe to process them.
How can you stop a partition from being queried?
You can stop a partition from being queried by using the ENABLE OFFLINE clause with the ALTER TABLE statement.
What is a Hive variable? What do we use it for?
Hive variables are created in the Hive environment and referenced by Hive scripts. They let you pass values to a Hive query when it starts executing. Scripts that use them are run with the source command.
What is SerDe in Apache Hive?
SerDe is a short name for Serializer/Deserializer. Hive uses a SerDe to read and write data from tables. An important concept behind Hive is that it DOES NOT own the Hadoop File System format that data is stored in. Users are able to write files to HDFS with whatever tools or mechanisms take their fancy ("CREATE EXTERNAL TABLE" or "LOAD DATA INPATH") and use Hive to correctly "parse" that file format in a way that can be used by Hive. A SerDe is a powerful (and customizable) mechanism that Hive uses to "parse" data stored in HDFS.
Whenever we run a Hive query, a new metastore_db is created. Why?
A local metastore is created when we run Hive in embedded mode. Before creating it, Hive checks whether the metastore already exists; the location is defined by a property in the configuration file hive-site.xml: javax.jdo.option.ConnectionURL, with the default value jdbc:derby:;databaseName=metastore_db;create=true. Because this default is a relative path, a new metastore_db is created in whichever directory Hive is started from. Therefore, we have to change the location to an absolute path so that the same metastore is used from every location.
Can we change the data type of a column in a hive table?
Yes, using the REPLACE COLUMNS option: ALTER TABLE table_name REPLACE COLUMNS
Why does Hive not store metadata information in HDFS?
Hive stores metadata information in the metastore using RDBMS instead of HDFS. The main reason for choosing RDBMS is to achieve low latency because HDFS read/write operations are time consuming processes.
How does Hive deserialize and serialize the data?
When reading or writing data, Hive first goes through the InputFormat, which uses a RecordReader to read or write individual records. Each record is then turned into a row: the custom SerDe uses an ObjectInspector to deserialize the data into fields.
What is RegexSerDe?
Regex stands for regular expression. Whenever fields have to be extracted from each line by pattern matching, you can use RegexSerDe, which is available as org.apache.hadoop.hive.contrib.serde2.RegexSerDe.
In the SERDEPROPERTIES, you have to define your input pattern and output fields. For example, to get the column values from the line xyz/pq@def, you would define a pattern that captures xyz, pq and def separately.
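The capture-group idea behind the xyz/pq@def example can be tried out in Python. This is only a sketch of the pattern-matching step: the regular expression, the column names `c1` to `c3`, and the choice to return empty fields for non-matching lines are assumptions made for illustration.

```python
import re

# Hypothetical input pattern: three capture groups split on "/" and "@".
INPUT_REGEX = re.compile(r"([^/]+)/([^@]+)@(.+)")


def deserialize(line, columns=("c1", "c2", "c3")):
    """Map each capture group to one column; unmatched lines give empty fields."""
    match = INPUT_REGEX.fullmatch(line)
    if match is None:
        return dict.fromkeys(columns)  # every column is None
    return dict(zip(columns, match.groups()))


row = deserialize("xyz/pq@def")
bad_row = deserialize("no separators here")
```

Each parenthesised group in the pattern corresponds to one table column, in order.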
While loading data into a Hive table using the LOAD DATA clause, how do you specify that it is an HDFS file and not a local file?
By omitting the LOCAL clause in the LOAD DATA statement.
Explain about the different types of partitioning in Hive?
Partitioning in Hive helps prune the data when executing the queries to speed up processing. Partitions are created when data is inserted into the table. In static partitions, the name of the partition is hardcoded into the insert statement whereas in a dynamic partition, Hive automatically identifies the partition based on the value of the partition field.
Depending on how data is loaded into the table, the requirements for the data, and the format in which the data is produced at the source, either static or dynamic partitioning can be chosen. With dynamic partitions, the complete data in the file is read and partitioned through a MapReduce job based on a particular field in the file. Dynamic partitions are usually helpful during ETL flows in the data pipeline.
When loading data from huge files, static partitions are preferred over dynamic partitions as they save time in loading data. The partition is added to the table and then the file is moved into the static partition. The partition column value can be obtained from the file name without having to read the complete file.
What is the significance of the ‘IF EXISTS’ clause while dropping a table?
When we issue the command DROP TABLE IF EXISTS table_name, Hive does not throw an error if the table being dropped does not exist in the first place. Without the clause, dropping a missing table raises an error.
How can Hive avoid mapreduce?
If we set the property hive.exec.mode.local.auto to true then hive will avoid mapreduce to fetch query results.
What is the relationship between MapReduce and Hive? Or: how are MapReduce jobs submitted to the cluster?
Hive provides no additional capabilities to MapReduce. The programs are executed as MapReduce jobs via the interpreter. The interpreter runs on a client machine and turns HiveQL queries into MapReduce jobs. The framework then submits those jobs to the cluster.
What is ObjectInspector functionality?
Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns. ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in the memory, including:
- An instance of a Java class (Thrift or native Java)
- A standard Java object (we use java.util.List to represent Struct and Array, and use java.util.Map to represent Map)
- A lazily-initialized object (for example, a Struct of string fields stored in a single Java string object with a starting offset for each field)
A complex object can be represented by a pair of ObjectInspector and Java Object. The ObjectInspector not only tells us the structure of the Object, but also gives us ways to access the internal fields inside the Object.
Suppose that I want to monitor all the open and aborted transactions in the system along with the transaction id and the transaction state. Can this be achieved using Apache Hive?
Yes. Hive 0.13.0 and later versions support the SHOW TRANSACTIONS command, which helps administrators monitor Hive transactions.
Can a partition be archived? What are the advantages and disadvantages?
Yes, a partition can be archived. The advantage is that it decreases the number of files the namenode has to track, and the archived data can still be queried using Hive. The disadvantage is that queries become less efficient, and archiving does not offer any space savings.
Does the archiving of Hive tables save space in HDFS?
No. It only reduces the number of files which becomes easier for namenode to manage.
Does Hive support record level Insert, delete or update?
Hive does not provide record-level update, insert, or delete. Hence, Hive does not provide transactions either. However, users can use CASE statements and Hive's built-in functions to achieve the effect of these DML operations. Thus, a complex update query in an RDBMS may need many lines of code in Hive.
What are the default record and field delimiters used for Hive text files?
The default record delimiter is \n, and the field delimiters are \001, \002 and \003.
What is difference between static and dynamic partition of a table?
Partitioning prunes data at query time and so minimizes query time. A partition is created when data is inserted into the table. Static partitioning can insert individual rows into a named partition, whereas dynamic partitioning can process an entire table based on a particular column. In strict mode, at least one static partition is required before any dynamic partition can be created. If you are partitioning a large dataset as part of an ETL flow, dynamic partitioning is recommended.
Why do we perform partitioning in Hive?
In a Hive table, Partitioning provides granularity. Hence, by scanning only relevant partitioned data instead of the whole dataset it reduces the query latency.
How does partitioning help in the faster execution of queries?
With the help of partitioning, a subdirectory will be created with the name of the partitioned column and when you perform a query using the WHERE clause, only the particular sub-directory will be scanned instead of scanning the whole table. This gives you faster execution of queries.
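A toy Python sketch of this pruning step is shown below. The directory names and the `prune` helper are hypothetical; the point is only that the filter is applied to directory names before any file is opened.

```python
def prune(partition_dirs, column, value):
    """Keep only the partition directories that match the WHERE condition."""
    wanted = f"{column}={value}"
    return [d for d in partition_dirs if wanted in d.split("/")]


# Hypothetical layout of a table partitioned by year.
dirs = ["sales/year=2021", "sales/year=2022", "sales/year=2023"]
scanned = prune(dirs, "year", "2022")
```

Only one of the three directories is scanned for a query that filters on year 2022.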
Can you list few commonly used Hive services?
- Command Line Interface (cli)
- Hive Web Interface (hwi)
- HiveServer (hiveserver)
- rcfilecat (a tool for printing the contents of an RC file)
What is the default maximum number of dynamic partitions that can be created by a mapper/reducer? How can you change it?
By default, the maximum number of partitions that can be created by a mapper or reducer is set to 100. One can change it by issuing the following command: SET hive.exec.max.dynamic.partitions.pernode =
Why do we need buckets?
Basically, for performing bucketing to a partition there are two main reasons:
- A map side join requires the data belonging to a unique join key to be present in the same partition.
- It allows us to decrease the query time. Also, makes the sampling process more efficient.
Can we give a view the same name as a Hive table?
No. The name of a view must be unique among all tables and views present in the same database.
What Options Are Available When It Comes to Attaching Applications to the Hive Server?
Explain the three different ways (Thrift Client, JDBC Driver, and ODBC Driver) you can connect applications to the Hive Server. You’ll also want to explain the purpose for each option: for example, using JDBC will support the JDBC protocol.
When should we use SORT BY instead of ORDER BY?
We should use SORT BY instead of ORDER BY when we have to sort huge datasets. The reason is that the SORT BY clause sorts the data using multiple reducers, whereas ORDER BY sorts all of the data together using a single reducer. Hence, ORDER BY takes a long time to execute on a large input.
What are the uses of Hive Explode?
Hadoop developers take an array as their input and convert it into separate table rows. To convert complicated data types into the desired table format, Hive uses explode.
Can we run UNIX shell commands from Hive? Can Hive queries be executed from script files? If yes, how?
Yes, we can run UNIX shell commands from Hive using an ‘!‘ mark before the command. For example, !pwd at Hive prompt will display the current directory. We can execute Hive queries from the script files using the source command.
How is ORC file format optimised for data storage and analysis?
ORC stores collections of rows in one file and within the collection the row data will be stored in a columnar format. With columnar format, it is very easy to compress, thus reducing a lot of storage cost. While querying also, it queries the particular column instead of querying the whole row as the records are stored in columnar format. ORC has got indexing on every block based on the statistics min, max, sum, count on columns so when you query, it will skip the blocks based on the indexing.
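The block-skipping behaviour can be sketched in a few lines of Python. This is a simplified model rather than the ORC format itself: each block is a plain list of values for one column, and the index keeps only min and max.

```python
def build_index(blocks):
    """Record min/max statistics for each block of column values."""
    return [{"min": min(b), "max": max(b)} for b in blocks]


def blocks_to_read(index, lo, hi):
    """Return the blocks whose [min, max] range can contain values in [lo, hi]."""
    return [i for i, stats in enumerate(index)
            if stats["max"] >= lo and stats["min"] <= hi]


# Hypothetical column data split into three blocks.
blocks = [[1, 5, 9], [10, 14, 19], [20, 25, 30]]
index = build_index(blocks)
hits = blocks_to_read(index, 12, 15)
```

For a filter on values between 12 and 15, only the middle block needs to be read; the other two are skipped from the statistics alone.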
What is the difference between Internal and External Table?
- Internal table: Hive manages both the metadata and the actual data, which is stored in the Hive warehouse directory. If the table is dropped, both the actual data and the metadata are lost.
- External table: Only the schema is stored in the metastore database; the actual data stays at an external location outside Hive's control. If an external table is dropped, only the metadata is lost, not the actual data.
Explain the different types of join in Hive.
HiveQL has 4 different types of joins:
- JOIN – Similar to an inner join in SQL; only rows that match in both tables are returned.
- FULL OUTER JOIN – Combines the records of both the left and right outer tables that fulfil the join condition.
- LEFT OUTER JOIN- All the rows from the left table are returned even if there are no matches in the right table.
- RIGHT OUTER JOIN-All the rows from the right table are returned even if there are no matches in the left table.
What is a metastore in Hive?
It is a relational database storing the metadata of Hive tables, partitions, Hive databases, etc.
What is the functionality of Query Processor in Apache Hive?
This component implements the processing framework for converting SQL to a graph of map/reduce jobs, and the execution-time framework to run those jobs in the order of dependencies.
What is the use of HCatalog?
HCatalog can be used to share data structures with external systems. HCatalog gives users of other tools on Hadoop access to the Hive metastore, so that they can read and write data to Hive’s data warehouse.
How will you optimize Hive performance?
There are various ways to run Hive queries faster -
- Using Apache Tez execution engine
- Using vectorization
- Using ORCFILE
- Do cost based query optimization.
In case of embedded Hive, can the same metastore be used by multiple users?
No. The embedded metastore cannot be used in sharing mode. It is suggested to use a standalone database such as PostgreSQL or MySQL.
When should MapReduce mode be used?
MapReduce mode is used when:
- The query will run on large datasets and needs to execute in parallel
- Hadoop has multiple data nodes and the data is distributed across different nodes
- Large datasets need to be processed with better performance
What is the importance of Thrift server & client, JDBC and ODBC driver in Hive?
Thrift is a cross-language RPC framework that generates code and combines it with a software stack to execute Thrift calls on a remote server. The Thrift compiler acts as an interpreter between server and client. The Thrift server allows a remote client to submit requests to Hive using different programming languages such as Python, Ruby and Scala.
JDBC driver: A JDBC driver is a software component enabling a Java application to interact with a database.
ODBC driver: ODBC accomplishes DBMS independence by using an ODBC driver as a translation layer between the application and the DBMS.
The property set to run hive in local mode as true so that it runs without creating a mapreduce job is
- A: hive.exec.mode.local.auto
- B: hive.exec.mode.local.override
- C: hive.exec.mode.local.settings
- D: hive.exec.mode.local.config
When a partition is archived in Hive it
- A: Reduces space through compression
- B: Reduces the length of records
- C: Reduces the number of files stored
- D: Reduces the block size
A user creates a UDF which accepts arguments of different data types, each time it is run. It is an example of
- A. Aggregate Function
- B. Generic Function
- C. Standard UDF
- D. Super Functions
While querying a Hive table for an Array type column, if the array index is nonexistent then
- A. NULL is returned
- B. Error is reported
- C. Partial results are returned
- D. "NA" is returned
Answer: A
A GenericUDF is a Function that
- A. Takes one or more columns from a row and returns a single value
- B. Takes one or more columns from many rows and returns a single value
- C. Take zero or more inputs and produce multiple columns or rows of output
- D. Detects the type of input programmatically and provides appropriate response
Which of the following scenarios are not prevented by enabling strict mode in Hive?
- A. Scanning all the partitions
- B. Generating random sample of data
- C. Running a order by clause without a LIMIT clause
- D. Cartesian product
In hive, what happens when the schema does not match the file content?
- A: It cannot read the file
- B: It reads only the string data type
- C: it throws an error and stops reading the file
- D: It returns null values for mismatched fields.
Answer: D. Explanation: Instead of returning an error, Hive returns null values for mismatches between the schema and the actual data.
The DISTRIBUTE BY clause in hive
- A. comes before the sort by clause
- B. comes after the sort by clause
- C. does not depend on position of sort by clause
- D. cannot be present along with the sort by clause
Answer: A
In ______ mode HiveServer2 only accepts valid Thrift calls.
- A: Remote
- B: HTTP
- C: Embedded
- D: Interactive
Answer: A. Explanation: HiveServer2 in remote mode only accepts valid Thrift calls; even in HTTP mode, the message body contains Thrift payloads.
The disadvantage of compressing files in HDFS is
- A. Unused HDFS blocks
- B. Less I/O
- C. Files do not become splittable
- D. Files have to move to local filesystem to be usable
Answer: C
The partitioning of a table in Hive creates more
- A. subdirectories under the database name
- B. subdirectories under the table name
- C. files under database name
- D. files under the table name
The Property that decides what is the maximum number of files that can be sampled during the use of the LIMIT clause is
- A: hive.limit.optimize.file.max
- B: hive.limit.optimize.limit.file
- C: hive.limit.optimize.file.restrict
- D: hive.limit.optimize.limit.most
Answer: B. Explanation: This property decides the number of files to be looked into for the sample result.
For optimizing join of three tables, the largest sized tables should be placed as
- A: the first table in the join clause
- B: second table in the join clause
- C: third table in the join clause
- D: third table in the join clause
Answer: C. Explanation: Hive reads the tables from left to right. Small tables should be read first and, if possible, cached in memory.
The drawback of managed tables in hive is
- A. They are always stored under default directory
- B. They cannot grow bigger than a fixed size of 100GB
- C. They can never be dropped
- D. They cannot be shared with other applications
Which of the following command sets the value of a particular configuration variable (key)?
- A. set -v
- B. set =
- C. set
- D. reset
The below expression in the where clause RLIKE '.*(Chicago|Ontario).*'; gives the result which match
- A. words containing both Chicago and Ontario
- B. words containing either Chicago or Ontario
- C. words Ending with Chicago or Ontario
- D. words starting with Chicago or Ontario
What is the disadvantage of using too many partitions in Hive tables?
- A: It slows down the namenode
- B: Storage space is wasted
- C: Join queries become slow
- D: All of the above
By default when a database is dropped in Hive:
- A: The tables are also deleted
- B: The directory is deleted if there are no tables
- C: The HDFS blocks are formatted
- D: None of the above
Explode in Hive is used to convert complex data types into desired table formats.
- A: True
- B: False
Point out the correct statement.
- A: list FILE[S] * executes a Hive query and prints results to standard output
- B: executes a Hive query and prints results to standard output
- C: executes a Hive query and prints results to standard output
- D: All of the mentioned
Point out the correct statement
- A: Hive is not a relational database, but a query engine that supports the parts of SQL
- B: Hive is a relational database with SQL support
- C: Pig is a relational database with SQL support
- D: None of the above
Point out the wrong statement:
- A. bfs executes a dfs command from the Hive shell
- B. source FILE executes a script file inside the CLI
- C. hive is Query language similar to SQL
- D. none of the mentioned
Hive converts queries to all except
- A: Apache tez
- B: Spark ten
- C: Map reduce
- D: Spark jobs
The thrift service component in hive is used for
- A: Moving hive data files between different servers
- B: Use multiple hive versions
- C: Submit hive queries from a remote client
- D: Installing hive