- What kind of data warehouse application is suitable for Hive? What are the types of tables in Hive?
- Explain the SMB Join in Hive?
- How is Hive different from an RDBMS?
- What types of databases does Hive support?
- In Hive, how can you enable buckets?
- Is Hive suitable to be used for OLTP systems? Why?
- What is the Object Inspector functionality in Hive?
- What are limitations of Hive?
- What are the different Modes in the Hive?
- What is Hive Bucketing?
- What is the difference between partition and bucketing?
- Where does the data of a Hive table get stored?
- How is data transferred from HDFS to Hive?
- What does the Hive query processor do?
- Explain about SORT BY, ORDER BY, DISTRIBUTE BY and CLUSTER BY in Hive.
- What is the difference between local and remote metastore?
- Which classes are used in Hive to Read and Write HDFS Files?
- Explain the functionality of ObjectInspector.
- What is ObjectInspector functionality in Hive?
- How does bucketing help in the faster execution of queries?
- Why will mapreduce not run if you run select * from table in hive?
- What is Hive MetaStore?
- What are the three different modes in which hive can be run?
- How can you prevent a large job from running for a long time?
- When do we use explode in Hive?
- What are the different components of a Hive architecture?
- How can you connect an application, if you run Hive as a server?
- Can we LOAD data into a view?
- Is it possible to add 100 nodes when we already have 100 nodes in Hive? If yes, how?
- Can Hive process any type of data formats?
- How can you stop a partition from being queried?
- What is a Hive variable? What do we use it for?
- What is SerDe in Apache Hive?
- Whenever we run a Hive query, a new metastore_db is created. Why?
- Can we change the data type of a column in a hive table?
- Why does Hive not store metadata information in HDFS?
- How does Hive deserialize and serialize the data?
- What is RegexSerDe?
- While loading data into a Hive table using the LOAD DATA clause, how do you specify that it is an HDFS file and not a local file?
- Explain about the different types of partitioning in Hive?
- What is the significance of the IF EXISTS clause while dropping a table?
- How can Hive avoid mapreduce?
- What is the relationship between MapReduce and Hive? Or, how are MapReduce jobs submitted to the cluster?
- What is ObjectInspector functionality?
- Suppose that I want to monitor all the open and aborted transactions in the system along with the transaction id and the transaction state. Can this be achieved using Apache Hive?
- Can a partition be archived? What are the advantages and disadvantages?
- Does the archiving of Hive tables save space in HDFS?
- Does Hive support record-level insert, delete, or update?
- What are the default record and field delimiter used for hive text files?
- What is difference between static and dynamic partition of a table?
- Why do we perform partitioning in Hive?
- How does partitioning help in the faster execution of queries?
- Can you list few commonly used Hive services?
- What is the default maximum dynamic partition that can be created by a mapper/reducer? How can you change it?
- Why do we need buckets?
- Can we name view the same as the name of a Hive table?
- What Options are Available When It Comes to Attaching Applications to the Hive Server?
- When should we use SORT BY instead of ORDER BY?
- What are the uses of Hive Explode?
- Can we run UNIX shell commands from Hive? Can Hive queries be executed from script files? If yes, how?
- How is ORC file format optimised for data storage and analysis?
- What is the difference between Internal and External Table?
- Explain the different types of join in Hive.
- What is a metastore in Hive?
- What is the functionality of Query Processor in Apache Hive?
- What is the utilization of Hcatalog?
- How will you optimize Hive performance?
- In case of embedded Hive, can the same metastore be used by multiple users?
- When to use Map reduce mode?
- What is the importance of Thrift server & client, JDBC and ODBC driver in Hive?
- The property set to run hive in local mode as true so that it runs without creating a mapreduce job is
- When a partition is archived in Hive it
- A user creates a UDF which accepts arguments of different data types, each time it is run. It is an example of
- While querying a Hive table for an Array type column, if the array index is nonexistent then
- A GenericUDF is a Function that
- Which of the following scenarios are not prevented by enabling strict mode in Hive?
- In hive, what happens when the schema does not match the file content?
- The DISTRIBUTE BY clause in Hive
- In ______ mode HiveServer2 only accepts valid Thrift calls.
- The disadvantage of compressing files in HDFS is
- The partitioning of a table in Hive creates more
- The Property that decides what is the maximum number of files that can be sampled during the use of the LIMIT clause is
- For optimizing join of three tables, the largest sized tables should be placed as
- The drawback of managed tables in hive is
- Which of the following command sets the value of a particular configuration variable (key)?
- The below expression in the where clause RLIKE '.*(Chicago|Ontario).*'; gives the result which match
- What is the disadvantage of using too many partitions in Hive tables?
- By default when a database is dropped in Hive:
- Explode in Hive is used to convert complex data types into desired table formats.
- Point out the correct statement.
- Point out the correct statement
- Point out the wrong statement:
- Hive converts queries to all except
- The thrift service component in hive is used for
Hive is not considered a full database. The design rules and regulations of Hadoop and HDFS have put restrictions on what Hive can do. However, Hive is most suitable for data warehouse applications because it:
- Analyzes relatively static data.
- Tolerates slower response times.
- Does not make rapid changes to data.
- Although Hive doesn’t provide fundamental features required for Online Transaction Processing (OLTP), it is suitable for data warehouse applications in large datasets.
There are two types of tables in Hive:
- Managed tables
- External tables
In an SMB join in Hive, each mapper reads a bucket from the first table and the corresponding bucket from the second table, and then a merge-sort join is performed. Sort Merge Bucket (SMB) join is used mainly because it places no restrictions on file, partition, or table size. SMB join works best when the tables are huge. In an SMB join the tables are bucketed and sorted on the join columns. All tables must have the same number of buckets.
- RDBMS supports schema on Write whereas Hive provides schema on Read.
- Hive follows a write-once, read-many model, whereas an RDBMS allows data to be written and updated as many times as we want.
- Hive is designed for very large datasets, whereas a traditional RDBMS typically struggles beyond roughly 10 TB.
- Hive is highly scalable, but scalability in an RDBMS costs a lot.
- Hive has a feature of Bucketing which is not there in RDBMS.
For single-user metadata storage, Hive uses the Derby database; for multi-user or shared metadata, Hive uses MySQL.
In Hive, you can enable buckets by using the following command: set hive.enforce.bucketing=true;
No, it is not suitable for OLTP system since it does not offer insert and update at the row level.
In Hive, the internal structure of rows, columns, and complex objects is analyzed using the Object Inspector functionality. Object Inspector also provides access to the internal fields inside those objects.
- We can not perform real-time queries with Hive. Also, it does not offer row-level updates.
- For interactive data browsing Hive offers acceptable latency.
- Hive is not the right choice for online transaction processing.
Sometimes interviewers like to ask these basic questions to see how confident you are when it comes to your Hive knowledge. Answer by saying that Hive can operate in two modes, MapReduce mode and local mode. Explain that the choice depends on the number of DataNodes in Hadoop and the size of the data.
When performing queries on large datasets in Hive, bucketing can offer better structure to Hive tables. You’ll also want to take your answer a step further by explaining some of the specific bucketing features, as well as some of the advantages of bucketing in Hive. For example, bucketing can give programmers more flexibility when it comes to record-keeping and can make it easier to debug large datasets when needed.
The main aim of both partitioning and bucketing is to execute queries more efficiently. With partitioning, the slices are fixed by the partition columns when you create the table. Bucketing instead distributes rows across a fixed number of files by hashing a column.
By default, the data of a Hive table is stored in the HDFS directory /user/hive/warehouse. You can change this by specifying the desired directory in the hive.metastore.warehouse.dir configuration parameter in hive-site.xml.
If the data is already present in HDFS, the user does not need LOAD DATA, which would move the files to /user/hive/warehouse/. The user simply defines the table with the EXTERNAL keyword, which creates only the table definition in the Hive metastore.
The Hive query processor converts a query into a graph of MapReduce jobs and uses the execution-time framework to run those jobs in the order of their dependencies.
- SORT BY – Data is ordered at each of ‘N’ reducers where the reducers can have overlapping range of data.
- ORDER BY- This is similar to the ORDER BY in SQL where total ordering of data takes place by passing it to a single reducer.
- DISTRIBUTE BY – It is used to distribute the rows among the reducers. Rows that have the same DISTRIBUTE BY column values go to the same reducer.
- CLUSTER BY- It is a combination of DISTRIBUTE BY and SORT BY on the same columns: each of the N reducers gets a non-overlapping set of key values, which is then sorted at that reducer.
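The four clauses can be pictured with a small simulation. The following is a minimal Python sketch, not Hive code: reducers are modeled as plain lists, rows are routed by `id % num_reducers` (an assumption standing in for Hive's hash of the DISTRIBUTE BY columns), and all function names are illustrative.

```python
def distribute_by(rows, key, num_reducers):
    """Route each row to one reducer based on its key (stand-in for hashing)."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[row[key] % num_reducers].append(row)
    return reducers


def sort_by(reducer_inputs, key):
    """Sort rows inside each reducer only; there is no global order."""
    return [sorted(part, key=lambda r: r[key]) for part in reducer_inputs]


def cluster_by(rows, key, num_reducers):
    """CLUSTER BY behaves like DISTRIBUTE BY plus SORT BY on the same column."""
    return sort_by(distribute_by(rows, key, num_reducers), key)


def order_by(rows, key):
    """Total order: every row passes through a single reducer."""
    return sorted(rows, key=lambda r: r[key])


rows = [{"id": i} for i in (5, 2, 8, 1, 4, 7)]
clustered_ids = [[r["id"] for r in part] for part in cluster_by(rows, "id", 2)]
ordered_ids = [r["id"] for r in order_by(rows, "id")]
print(clustered_ids)
print(ordered_ids)
```

Each reducer's output is sorted on its own, but only `order_by` yields a single globally sorted list, which is why it does not parallelize.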
Local Metastore: The metastore service runs in the same JVM as the Hive service and connects to a database running in a separate JVM, either on the same machine or on a remote machine.
Remote Metastore: In this configuration, the metastore service runs on its own separate JVM and not in the Hive service JVM.
Following classes are used by Hive to read and write HDFS files:
- TextInputFormat/HiveIgnoreKeyTextOutputFormat: These 2 classes read/write data in plain text file format.
- SequenceFileInputFormat/SequenceFileOutputFormat: These 2 classes read/write data in hadoop SequenceFile format.
ObjectInspector helps analyze the internal structure of a row object and the individual structure of columns in Hive. It also provides a uniform way to access complex objects that can be stored in multiple formats in memory, including:
- An instance of Java class.
- A standard Java object.
- A lazily initialized object
ObjectInspector tells the structure of the object and also the ways to access the internal fields inside the object.
Hive uses ObjectInspector to analyze the internal structure of rows, columns, and complex objects. It also gives us ways to access the internal fields inside an object. It handles not only common data types such as int, bigint, and string, but also complex data types such as arrays, maps, structs, and unions.
If you have to join two large tables, you can use a reduce-side join. But if both tables have the same number of buckets (or bucket counts that are multiples of each other) and are sorted on the same column, a sort-merge-bucket map join (SMBMJ) is possible, in which the whole join takes place in the map phase by matching the corresponding buckets.
Buckets are basically files that are created inside the HDFS directory.
There are different properties which you need to set for bucket map joins and they are as follows:
set hive.enforce.sortmergebucketmapjoin = false; set hive.auto.convert.sortmerge.join = false; set hive.optimize.bucketmapjoin = true; set hive.optimize.bucketmapjoin.sortedmerge = true;
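To see why matching buckets lets the join finish in the map phase, here is a minimal Python sketch of the idea rather than Hive's implementation. It assumes integer join keys that are unique within each table, assigns buckets with `key % num_buckets` in place of Hive's hash function, and uses illustrative table and function names.

```python
def bucketize(rows, num_buckets):
    """Place (key, value) rows into buckets by key, then sort each bucket."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))
    return [sorted(bucket) for bucket in buckets]


def merge_join(left, right):
    """Join two key-sorted buckets in one pass (keys assumed unique per side)."""
    joined, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key == right_key:
            joined.append((left_key, left[i][1], right[j][1]))
            i += 1
            j += 1
        elif left_key < right_key:
            i += 1
        else:
            j += 1
    return joined


def smb_join(table_a, table_b, num_buckets):
    """Join bucket n of one table only with bucket n of the other."""
    result = []
    pairs = zip(bucketize(table_a, num_buckets), bucketize(table_b, num_buckets))
    for left_bucket, right_bucket in pairs:
        result.extend(merge_join(left_bucket, right_bucket))
    return result


orders = [(1, "o1"), (2, "o2"), (3, "o3"), (4, "o4")]
users = [(2, "u2"), (3, "u3"), (5, "u5")]
joined = smb_join(orders, users, num_buckets=2)
print(joined)
```

Because equal keys always land in the same bucket number, no bucket ever needs to be compared with a non-matching bucket, and no shuffle to reducers is required.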
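When you perform a "select * from table" query, Hive fetches the data directly from the files as a fetch task instead of launching a MapReduce job, because no column projection or filtering is needed.
MetaStore is the central repository of Hive, which allows metadata to be stored in an external database. By default Hive stores metadata in a Derby database, but you can use MySQL or Oracle instead, depending on the project.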
- Local mode
- Distributed mode
- Pseudodistributed mode
This can be achieved by running MapReduce jobs in strict mode: set hive.mapred.mode=strict; Strict mode ensures that queries on partitioned tables cannot execute without a WHERE clause that filters on a partition column.
Hadoop developers sometimes take an array as input and convert it into separate table rows. To convert complex data types into the desired table format, Hive uses explode.
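The effect of explode can be sketched in a few lines of Python. This is an illustration of the row-multiplying behavior, not Hive's implementation: in HiveQL, explode is typically combined with LATERAL VIEW to keep the other columns next to each array element, and the function name, column names, and sample rows below are assumptions made for the example.

```python
def lateral_view_explode(rows, array_column, alias):
    """Emit one output row per array element, repeating the other columns."""
    for row in rows:
        base = {k: v for k, v in row.items() if k != array_column}
        for element in row[array_column]:
            yield {**base, alias: element}


rows = [
    {"user": "ann", "tags": ["a", "b"]},
    {"user": "bob", "tags": ["c"]},
    {"user": "cy", "tags": []},  # an empty array produces no output rows
]
exploded = list(lateral_view_explode(rows, "tags", "tag"))
print(exploded)
```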
Hive Architecture consists of a –
- User Interface – UI component of the Hive architecture calls the execute interface to the driver.
- Driver - Driver creates a session handle to the query and sends the query to the compiler to generate an execution plan for it.
- Metastore - Sends the metadata to the compiler for the execution of the query on receiving the sendMetaData request.
- Compiler- Compiler generates the execution plan which is a DAG of stages where each stage is either a metadata operation, a map or reduce job or an operation on HDFS.
- Execute Engine- Execution engine is responsible for submitting each of these stages to the relevant components by managing the dependencies between the various stages in the execution plan generated by the compiler.
When running Hive as a server, the application can be connected in one of the 3 ways-
- ODBC Driver-This supports the ODBC protocol
- JDBC Driver- This supports the JDBC protocol
- Thrift Client- This client can be used to make calls to all Hive commands using different programming languages such as PHP, Python, Java, C++, and Ruby.
No. A view cannot be the target of an INSERT or LOAD statement.
Yes, we can add the nodes by following the below steps:
- Step 1: Take a new system; create a new username and password
- Step 2: Install SSH and with the master node setup SSH connections
- Step 3: Add ssh public_rsa id key to the authorized keys file
- Step 4: Add the new DataNode hostname, IP address, and other details in /etc/hosts slaves file: 192.168.1.102 slave3.in slave3
- Step 5: Start the DataNode on a new node
- Step 6: Log in to the new node, for example with su hadoop or: ssh -X email@example.com
- Step 7: Start HDFS on the newly added slave node by using the following command: ./bin/hadoop-daemon.sh start datanode
- Step 8: Check the output of the jps command on the new node
Yes, Hive uses the SerDe interface for I/O operations, and different SerDe implementations can read and write different data formats. Hive processes plain delimited text directly with its default SerDe, and uses other SerDe implementations for other kinds of data stored in Hadoop.
You can stop a partition from being queried by using the ENABLE OFFLINE clause with the ALTER TABLE statement.
Hive variables are created in the Hive environment and can be referenced by Hive scripts. They allow values to be passed to a Hive query when the query starts executing. Scripts that use them are run with the source command.
A SerDe is a short name for a Serializer Deserializer. Hive uses SerDe to read and write data from tables. An important concept behind Hive is that it DOES NOT own the Hadoop File System format that data is stored in. Users are able to write files to HDFS with whatever tools/mechanism takes their fancy("CREATE EXTERNAL TABLE" or "LOAD DATA INPATH," ) and use Hive to correctly "parse" that file format in a way that can be used by Hive. A SerDe is a powerful (and customizable) mechanism that Hive uses to "parse" data stored in HDFS to be used by Hive.
A local metastore is created when we run Hive in embedded mode. Before creating it, Hive checks whether the metastore already exists. This behavior is controlled by the property javax.jdo.option.ConnectionURL in the configuration file hive-site.xml, whose default value is jdbc:derby:;databaseName=metastore_db;create=true. Because databaseName is a relative path, a new metastore_db is created in whichever directory Hive is started from. To reuse one metastore, change the location to an absolute path.
Using REPLACE column option: ALTER TABLE table_name REPLACE COLUMNS
Hive stores metadata information in the metastore using RDBMS instead of HDFS. The main reason for choosing RDBMS is to achieve low latency because HDFS read/write operations are time consuming processes.
When reading data, Hive first goes through the InputFormat, which uses a RecordReader to read each record. The record is then passed to the SerDe, which uses an ObjectInspector to deserialize it into the fields of a row. Writing follows the reverse path: the row is serialized by the SerDe and handed to the output format.
Regex stands for regular expression. RegexSerDe is used when the fields of each record have to be extracted by pattern matching. RegexSerDe is present in org.apache.hadoop.hive.contrib.serde2.RegexSerDe.
In the SERDEPROPERTIES, you have to define your input pattern and output fields. For example, to read column values from the line xyz/pq@def, you would define a pattern that extracts xyz, pq, and def as separate fields.
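The xyz/pq@def example can be tried out with an ordinary regular expression. The sketch below uses Python's `re` module as a stand-in for the Java regex engine that RegexSerDe relies on. The pattern and the `deserialize` helper are assumptions chosen for illustration, with each capture group playing the role of one table column.

```python
import re

# Illustrative pattern: three capture groups, split on "/" and "@".
INPUT_REGEX = r"([^/]*)/([^@]*)@(.*)"


def deserialize(line, pattern=INPUT_REGEX):
    """Return the captured groups as columns, or None if the line does not match."""
    match = re.fullmatch(pattern, line)
    return list(match.groups()) if match else None


columns = deserialize("xyz/pq@def")
unmatched = deserialize("no separators here")
print(columns)
print(unmatched)
```

In a table definition, a pattern like this would be supplied through the SerDe properties, with the number of capture groups matching the number of columns.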
By omitting the LOCAL clause in the LOAD DATA statement.
Partitioning in Hive helps prune the data when executing the queries to speed up processing. Partitions are created when data is inserted into the table. In static partitions, the name of the partition is hardcoded into the insert statement whereas in a dynamic partition, Hive automatically identifies the partition based on the value of the partition field.
Depending on how data is loaded into the table, the requirements for the data, and the format in which the data is produced at the source, either static or dynamic partitioning can be chosen. With dynamic partitions, the complete data in the file is read and partitioned by a MapReduce job based on a particular field in the file. Dynamic partitions are usually helpful during ETL flows in the data pipeline.
When loading data from huge files, static partitions are preferred over dynamic partitions as they save time in loading data. The partition is added to the table and then the file is moved into the static partition. The partition column value can be obtained from the file name without having to read the complete file.
When we issue the command DROP TABLE IF EXISTS table_name, Hive does not throw an error if the table being dropped does not exist in the first place. Without IF EXISTS, dropping a nonexistent table can raise an error, depending on the hive.exec.drop.ignorenonexistent setting.
If we set the property hive.exec.mode.local.auto to true, Hive can run sufficiently small queries in local mode instead of submitting a MapReduce job to the cluster.
Hive provides no additional capabilities to MapReduce. Queries are executed as MapReduce jobs via an interpreter. The interpreter runs on a client machine and turns HiveQL queries into MapReduce jobs, and the framework submits those jobs to the cluster.
Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns. ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in the memory, including:
- An instance of a Java class (Thrift or native Java).
- A standard Java object (we use java.util.List to represent Struct and Array, and use java.util.Map to represent Map).
- A lazily-initialized object (for example, a Struct of string fields stored in a single Java string object with a starting offset for each field).
A complex object can be represented by a pair of ObjectInspector and Java Object. The ObjectInspector not only tells us the structure of the Object, but also gives us ways to access the internal fields inside the Object.
Hive 0.13.0 and above version support SHOW TRANSACTIONS command that helps administrators monitor various hive transactions.
Yes, a partition can be archived. The advantage is that archiving decreases the number of files the NameNode has to track, and the archived partition can still be queried using Hive. The disadvantage is that queries become less efficient, and archiving does not offer any space savings.
No. It only reduces the number of files which becomes easier for namenode to manage.
Hive did not originally provide record-level update, insert, or delete, and hence did not provide transactions either. (Hive 0.14 and later add INSERT ... VALUES, UPDATE, and DELETE for tables configured as transactional.) Without these, users can rely on CASE statements and built-in functions of Hive to achieve the same effect. Thus, a complex update query in an RDBMS may need many lines of code in Hive.
The default record delimiter is \n. The default field delimiters are \001 (between columns), \002 (between collection items), and \003 (between map keys and values).
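To make the delimiters concrete, the following Python sketch parses one line of a default-format Hive text file. It relies on the usual roles of the three control characters: \001 between columns, \002 between collection items, and \003 between a map key and its value. The three-column layout (an int, an array, a map) and the `parse_record` helper are assumptions made for the example.

```python
FIELD_DELIM = "\x01"       # \001 (Ctrl-A) separates columns
COLLECTION_DELIM = "\x02"  # \002 (Ctrl-B) separates array items and map entries
MAP_KEY_DELIM = "\x03"     # \003 (Ctrl-C) separates a map key from its value


def parse_record(line):
    """Split one text record into an int, an array, and a map column."""
    user_id, tags, props = line.rstrip("\n").split(FIELD_DELIM)
    return {
        "id": int(user_id),
        "tags": tags.split(COLLECTION_DELIM),
        "props": dict(
            entry.split(MAP_KEY_DELIM) for entry in props.split(COLLECTION_DELIM)
        ),
    }


record = parse_record("1\x01a\x02b\x01k1\x03v1\x02k2\x03v2\n")
print(record)
```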
Partitioning prunes data during a query and so reduces query time. A partition is created when data is inserted into the table. A static partition is named explicitly in the insert statement, whereas a dynamic partition lets Hive split an entire input by the values of a particular column. In strict mode, at least one static partition is required before dynamic partitions can be created. If you are partitioning a large dataset as part of an ETL flow, dynamic partitioning is recommended.
In a Hive table, Partitioning provides granularity. Hence, by scanning only relevant partitioned data instead of the whole dataset it reduces the query latency.
With partitioning, a subdirectory is created for each value of the partition column. When you run a query with a WHERE clause on that column, only the matching subdirectory is scanned instead of the whole table, which gives faster query execution.
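Partition pruning can be pictured as choosing which directories to open. The Python sketch below models a partitioned table as a dictionary from directory path to rows. The `sales` table, its `country` partition column, and the `scan` helper are hypothetical names used only for illustration.

```python
# Hypothetical layout: one subdirectory per partition value.
WAREHOUSE = {
    "/user/hive/warehouse/sales/country=US": [{"amount": 10}, {"amount": 20}],
    "/user/hive/warehouse/sales/country=IN": [{"amount": 5}],
    "/user/hive/warehouse/sales/country=UK": [{"amount": 7}],
}


def scan(partition_column=None, value=None):
    """Return (rows, directories_read); skip directories a filter rules out."""
    rows, dirs_read = [], []
    for directory, data in WAREHOUSE.items():
        if partition_column is not None:
            if not directory.endswith(f"/{partition_column}={value}"):
                continue  # pruned: this directory is never opened
        dirs_read.append(directory)
        rows.extend(data)
    return rows, dirs_read


pruned_rows, pruned_dirs = scan("country", "US")
all_rows, all_dirs = scan()
print(len(pruned_dirs), len(all_dirs))
```

With the filter, one directory is read instead of three, which is where the saving in query time comes from.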
- Command Line Interface (cli)
- Hive Web Interface (hwi)
- HiveServer (hiveserver)
- rcfilecat, a tool for printing the contents of an RC file.
By default, the maximum number of dynamic partitions that a mapper or reducer can create is 100. One can change it by issuing the following command:
SET hive.exec.max.dynamic.partitions.pernode =
Basically, for performing bucketing to a partition there are two main reasons:
- A map side join requires the data belonging to a unique join key to be present in the same partition.
- It allows us to decrease the query time. Also, makes the sampling process more efficient.
No. The name of a view must be unique among all tables and views in the same database.
Explain the three different ways (Thrift Client, JDBC Driver, and ODBC Driver) you can connect applications to the Hive Server. You’ll also want to explain the purpose for each option: for example, using JDBC will support the JDBC protocol.
We should use SORT BY instead of ORDER BY when we have to sort huge datasets. The reason is that the SORT BY clause sorts the data using multiple reducers, whereas ORDER BY sorts all of the data together using a single reducer. Hence, ORDER BY takes a long time to execute on a large input.
Hadoop Developers consider an array as their input and convert it into a separate table row. To convert complicated data types into desired table formats, Hive uses Explode.
Yes, we can run UNIX shell commands from Hive by putting a '!' mark before the command. For example, !pwd at the Hive prompt displays the current directory. We can execute Hive queries from script files using the source command.
ORC stores collections of rows in one file and within the collection the row data will be stored in a columnar format. With columnar format, it is very easy to compress, thus reducing a lot of storage cost. While querying also, it queries the particular column instead of querying the whole row as the records are stored in columnar format. ORC has got indexing on every block based on the statistics min, max, sum, count on columns so when you query, it will skip the blocks based on the indexing.
- Internal (managed) table: Hive manages both the metadata and the actual data, which is stored under the Hive warehouse directory. Dropping the table deletes both the metadata and the data.
- External table: Hive manages only the schema in the metastore, and the actual data stays at an external location in HDFS. Dropping the table removes only the metadata, not the data.
HiveQL has 4 different types of joins –
- JOIN- Similar to INNER JOIN in SQL; returns only the rows that satisfy the join condition in both tables.
- FULL OUTER JOIN – Combines the records of both the left and right outer tables that fulfil the join condition.
- LEFT OUTER JOIN- All the rows from the left table are returned even if there are no matches in the right table.
- RIGHT OUTER JOIN-All the rows from the right table are returned even if there are no matches in the left table.
It is a relational database storing the metadata of Hive tables, partitions, Hive databases, etc.
This component implements the processing framework for converting SQL to a graph of map/reduce jobs, and the execution-time framework to run those jobs in the order of their dependencies.
HCatalog can be used to share data structures with external systems. HCatalog gives users of other tools on Hadoop access to the Hive metastore so that they can read and write data in Hive's data warehouse.
There are various ways to run Hive queries faster -
- Using Apache Tez execution engine
- Using vectorization
- Using ORCFILE
- Using cost-based query optimization
No, an embedded metastore cannot be used in sharing mode. It is suggested to use a standalone database such as PostgreSQL or MySQL.
MapReduce mode is used when:
- Hive operates on large datasets and the query should execute in parallel.
- Hadoop has multiple DataNodes and the data is distributed across different nodes.
- Large datasets need to be processed with better performance.
Thrift is a cross-language RPC framework that generates code and combines it with a software stack so that calls can be executed on a remote server. The Thrift compiler acts as an interpreter between server and client. The Thrift server allows a remote client to submit requests to Hive using different programming languages such as Python, Ruby, and Scala.
JDBC driver: A JDBC driver is a software component enabling a Java application to interact with a database.
ODBC driver: ODBC accomplishes DBMS independence by using an ODBC driver as a translation layer between the application and the DBMS.
- A: hive.exec.mode.local.auto
- B: hive.exec.mode.local.override
- C: hive.exec.mode.local.settings
- D: hive.exec.mode.local.config
- A: Reduces space through compression
- B: Reduces the length of records
- C: Reduces the number of files stored
- D: Reduces the block size
- A. Aggregate Function
- B. Generic Function
- C. Standard UDF
- D. Super Functions
- A. NULL is returned
- B. Error is reported.
- C. Partial results are returned
- D. "NA" is returned
Answer: A
- A. Takes one or more columns from a row and returns a single value
- B. Takes one or more columns from many rows and returns a single value
- C. Take zero or more inputs and produce multiple columns or rows of output
- D. Detects the type of input programmatically and provides appropriate response
- A. Scanning all the partitions
- B. Generating random sample of data
- C. Running a order by clause without a LIMIT clause
- D. Cartesian product
- A: It cannot read the file
- B: It reads only the string data type
- C: it throws an error and stops reading the file
- D: It returns null values for mismatched fields.
Answer: D Explanation: Instead of returning error, Hive returns null values for mismatch between schema and actual data.
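This schema-on-read behavior can be imitated in a few lines of Python. The sketch below is a simplified model, not Hive's reader: it assumes a comma-delimited line and a hypothetical three-column schema, and it returns `None` (standing in for NULL) whenever a field is missing or cannot be converted to the declared type.

```python
def cast_or_null(raw, target_type):
    """Convert a raw string to the declared type, or return None on mismatch."""
    if raw is None:
        return None
    try:
        return target_type(raw)
    except (TypeError, ValueError):
        return None


def read_row(line, schema, delimiter=","):
    """Apply the schema at read time; bad or missing fields become None."""
    fields = line.split(delimiter)
    row = {}
    for index, (name, target_type) in enumerate(schema):
        raw = fields[index] if index < len(fields) else None
        row[name] = cast_or_null(raw, target_type)
    return row


schema = [("id", int), ("price", float), ("name", str)]
good = read_row("1,9.5,apple", schema)
bad = read_row("abc,9.5", schema)
print(good)
print(bad)
```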
- A. comes before the sort by clause
- B. comes after the sort by clause
- C. does not depend on position of sort by clause
- D. cannot be present along with the sort by clause
Answer: A
- A: Remote
- B: HTTP
- C: Embedded
- D: Interactive
Answer: A Explanation: In HTTP mode, the message body contains Thrift payloads.
- A. Unused HDFS blocks
- B. Less I/O
- C. Files do not become splittable
- D. Files have to move to local filesystem to be usable
Answer: C
- A. subdirectories under the database name
- B. subdirectories under the table name
- C. files under database name
- D. files under the table name
- A: hive.limit.optimize.file.max
- B: hive.limit.optimize.limit.file
- C: hive.limit.optimize.file.restrict
- D: hive.limit.optimize.limit.most
Answer : B Explanation: This property decides the number files to be looked into for the sample result.
- A: the first table in the join clause
- B: second table in the join clause
- C: third table in the join clause
- D: third table in the join clause
Answer: C Explanation: Hive reads the tables from left to right. Small tables should be read first and if possible cached into the memory.
- A. They are always stored under default directory
- B. They cannot grow bigger than a fixed size of 100GB
- C. They can never be dropped
- D. They cannot be shared with other applications
- A. set -v
- B. set =
- C. set
- D. reset
- A. words containing both Chicago and Ontario
- B. words containing either Chicago or Ontario
- C. words Ending with Chicago or Ontario
- D. words starting with Chicago or Ontario
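The RLIKE pattern '.*(Chicago|Ontario).*' from this question can be checked quickly with a regular expression. The sketch below uses Python's `re` module as a stand-in for the Java regex engine behind RLIKE. The `rlike` helper and the sample city names are assumptions for illustration.

```python
import re

PATTERN = r".*(Chicago|Ontario).*"


def rlike(value, pattern=PATTERN):
    """True when the pattern is found anywhere in the value."""
    return re.search(pattern, value) is not None


cities = ["Chicago Heights", "Lake Ontario", "Toronto", "North Chicago"]
matches = [city for city in cities if rlike(city)]
print(matches)
```

A value matches if it contains either alternative, wherever it appears in the string.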
- A: It slows down the namenode
- B: Storage space is wasted
- C: Join queries become slow
- D: All of the above
- A: The tables are also deleted
- B: The directory is deleted if there are no tables
- C: The HDFS blocks are formatted
- D: None of the above
- A: True
- B: False
- A: list FILE[S] * executes a Hive query and prints results to standard output
- B: executes a Hive query and prints results to standard output
- C: executes a Hive query and prints results to standard output
- D: All of the mentioned
- A: Hive is not a relational database, but a query engine that supports the parts of SQL
- B: Hive is a relational database with SQL support
- C: Pig is a relational database with SQL support
- D: None of the above
- A. bfs executes a dfs command from the Hive shell
- B. source FILE executes a script file inside the CLI
- C. hive is Query language similar to SQL
- D. none of the mentioned
- A: Apache tez
- B: Spark ten
- C: Map reduce
- D: Spark jobs
- A: Moving hive data files between different servers
- B: Use multiple hive versions
- C: Submit hive queries from a remote client
- D: Installing hive