ETL Testing interview questions and answers 👇
What are Cubes and OLAP Cubes?
Cubes are data processing units composed of fact tables and dimensions from the data warehouse.
An OLAP cube is a data structure that overcomes the limitations of relational databases by providing rapid analysis of data. Cubes can display and sum large amounts of data while also providing users with searchable access to any data points. This way, the data can be rolled up, sliced, and diced as needed to handle the widest variety of questions that are relevant to a user's area of interest.
What is Lookup Transformation used for?
Lookup transformation is a passive transformation used to look up a source, source qualifier, or target to get the relevant data. Basically, it's a kind of join operation in which one of the joining tables is the source data, and the other joining table is the lookup table.
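As a rough illustration of the idea, a lookup can be sketched as a dictionary-based join: source rows are enriched with values from a lookup table via a shared key. All table and column names below are made up for the example.

```python
# Minimal sketch of a lookup transformation: enrich source rows with
# values from a lookup table, keyed on a shared column.
source_rows = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 20},
]
lookup_table = [
    {"customer_id": 10, "customer_name": "Acme"},
    {"customer_id": 20, "customer_name": "Globex"},
]

# Build an in-memory index on the lookup key, much like a lookup cache.
lookup_index = {row["customer_id"]: row for row in lookup_table}

enriched = [
    {**row, "customer_name": lookup_index[row["customer_id"]]["customer_name"]}
    for row in source_rows
]
print(enriched[0])  # {'order_id': 1, 'customer_id': 10, 'customer_name': 'Acme'}
```

Real ETL tools also let you configure what happens on a lookup miss (reject, default value, or insert); this sketch assumes every key is found.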
What are partitioning, hash partitioning, and round-robin partitioning?
Partitioning is the process of dividing a data store into smaller units, chiefly for performance and manageability. Keeping all data in one place without organization makes it harder for tools to locate and analyze it; a partitioned data warehouse is easier and faster to search and analyze.
Hash Partitioning: With hash partitioning, rows are evenly distributed across partitions based on a partition key. Using a hash function, the server creates partition keys to group data.
Round-robin Partitioning: This is a method in which data is evenly spread among all partitions. Therefore, each partition has approximately the same number of rows. Unlike hash partitioning, the partitioning columns do not need to be specified. New rows are assigned to partitions in round-robin style.
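The two strategies can be contrasted in a toy sketch: hash partitioning routes each row by its key, while round-robin deals rows out in turn. The three-partition setup below is purely illustrative.

```python
# Toy illustration of hash vs. round-robin partitioning across 3 partitions.
NUM_PARTITIONS = 3
rows = [{"key": k} for k in ["a", "b", "c", "d", "e", "f"]]

# Hash partitioning: partition chosen by hashing the partition key,
# so rows with equal keys always land in the same partition.
# (Python's hash() stands in for the stable hash a real engine would use.)
def hash_partition(row):
    return hash(row["key"]) % NUM_PARTITIONS

hash_parts = [[] for _ in range(NUM_PARTITIONS)]
for row in rows:
    hash_parts[hash_partition(row)].append(row)

# Round-robin partitioning: rows dealt out in turn regardless of value,
# so partition sizes stay nearly equal.
rr_parts = [[] for _ in range(NUM_PARTITIONS)]
for i, row in enumerate(rows):
    rr_parts[i % NUM_PARTITIONS].append(row)

print([len(p) for p in rr_parts])  # [2, 2, 2]
```

Note the trade-off: round-robin guarantees even sizes but cannot co-locate related rows; hash partitioning co-locates equal keys but can skew if key values are unevenly distributed.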
What is Bus Schema?
A BUS schema identifies the dimensions that are common across business processes, i.e. the conformed dimensions. It consists of conformed dimensions and standardized fact definitions that are shared across all enterprise data marts.
What are Schema Objects?
Schema objects are logical structures that you use to store and manipulate data. Schema objects include tables, views, indexes, sequences, functions, and so on.
What are incremental load and full load?
There are two primary methods to load data into a warehouse:
Full load: entire data dump that takes place the first time a data source is loaded into the warehouse
Incremental load: delta between target and source data is dumped at regular intervals. The last extract date is stored so that only records added after this date are loaded. Incremental loads come in two flavors that vary based on the volume of data you’re loading:
- Streaming incremental load: better for loading small data volumes
- Batch incremental load: better for loading large data volumes
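The two load types can be sketched with a stored "last extract" watermark; only rows updated after it are picked up on later runs. The column names and dates here are invented for the example.

```python
# Sketch of full vs. incremental load using a stored "last extract" watermark.
from datetime import datetime

source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 2, 1)},
    {"id": 3, "updated_at": datetime(2024, 3, 1)},
]

def full_load(source_rows):
    # First run: take everything.
    return list(source_rows)

def incremental_load(source_rows, last_extract):
    # Later runs: only the delta since the stored watermark.
    return [r for r in source_rows if r["updated_at"] > last_extract]

target = full_load(source)
last_extract = max(r["updated_at"] for r in target)  # persist this between runs

# A new record arrives in the source after the first load...
source.append({"id": 4, "updated_at": datetime(2024, 4, 1)})
delta = incremental_load(source, last_extract)
print([r["id"] for r in delta])  # [4]
```

In practice the watermark is persisted in a control table, and deletes in the source need separate handling (e.g. soft-delete flags), which this sketch omits.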
What are views?
A view is a mechanism for querying data. When you create a query on your database you can store the query (i.e. the SQL commands that perform the query) as a view, and by giving the view a unique name you can make that query available to use elsewhere in your database environment.
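This is easy to see with a small SQLite example: the SELECT is stored under a name, and querying that name re-runs the query. The table and view names are illustrative.

```python
# Sketch of a view as a stored, named query (SQLite in-memory for illustration).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 50.0, "paid"), (2, 75.0, "open"), (3, 20.0, "paid")])

# The SELECT is stored under a name; querying the view re-runs it.
conn.execute(
    "CREATE VIEW paid_orders AS "
    "SELECT id, amount FROM orders WHERE status = 'paid'")

rows = conn.execute("SELECT id, amount FROM paid_orders ORDER BY id").fetchall()
print(rows)  # [(1, 50.0), (3, 20.0)]
```

Because the view stores the query, not the result, it always reflects the current contents of the underlying table.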
What is the importance of ETL testing?
Effective ETL testing detects problems with the source data early, before it is loaded into the data repository, as well as inconsistencies or ambiguities in the business rules that govern data transformation and integration.
What is a data warehouse?
A data warehouse is a central repository of information that can be analyzed to make more informed decisions. Data flows into a data warehouse from transactional systems, relational databases, and other sources, typically on a regular cadence.
What do you mean by data purging?
Data purging is a mechanism that permanently deletes inactive or obsolete records from the database. It should be considered for entities with an evidently high growth rate, which can quickly reach considerable data volumes. All purging strategies must align with business and/or legal requirements.
What is data source view?
A data source view is a single, unified view of the metadata from the specified tables and views that the data source defines in the project. Storing the metadata in the data source view enables you to work with the metadata during development without an open connection to any underlying data source.
What is ETL Pipeline?
An ETL pipeline is the set of processes used to move data from a source or multiple sources into a database such as a data warehouse. ETL stands for “extract, transform, load,” the three interdependent processes of data integration used to pull data from one database and move it to another. Once loaded, data can be used for reporting, analysis, and deriving actionable business insights.
Explain the data cleaning process.
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. There is no one absolute way to prescribe the exact steps in the data cleaning process because the processes will vary from dataset to dataset.
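A few of the typical steps can be sketched in plain Python: normalizing inconsistent formats, dropping duplicates, and discarding incomplete records. The field names and values are invented for the example.

```python
# Sketch of common cleaning steps: normalize formats, drop duplicates,
# and discard incomplete records.
raw = [
    {"email": " Alice@Example.com ", "age": "34"},
    {"email": "alice@example.com",   "age": "34"},   # duplicate after normalization
    {"email": "bob@example.com",     "age": None},   # incomplete record
    {"email": "carol@example.com",   "age": "29"},
]

cleaned, seen = [], set()
for rec in raw:
    if rec["age"] is None:                 # discard incomplete records
        continue
    email = rec["email"].strip().lower()   # fix inconsistent formatting
    if email in seen:                      # drop duplicates
        continue
    seen.add(email)
    cleaned.append({"email": email, "age": int(rec["age"])})

print(len(cleaned))  # 2
```

Real pipelines add validation rules, outlier handling, and reference-data checks on top of these basics, and the exact steps depend on the dataset.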
What are facts in ETL?
A "fact" is a numeric measure that a business wishes to count or sum, such as a sales amount or quantity sold; facts are stored in fact tables.
What is staging area?
A staging area, or landing zone, is an intermediate storage area used for data processing during the extract, transform and load (ETL) process. The data staging area sits between the data source(s) and the data target(s), which are often data warehouses, data marts, or other data repositories.
What is a factless table?
A factless fact table captures the many-to-many relationships between dimensions, but contains no numeric or textual facts. They are often used to record events or coverage information.
What is SCD (Slowly Changing Dimension)?
A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. Handling SCDs is considered one of the most critical ETL tasks, as it determines how the history of dimension records is tracked.
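The most common history-preserving variant, SCD Type 2, can be sketched as follows: when an attribute changes, the current row is expired rather than overwritten, and a new current row is inserted. The customer/city columns are illustrative.

```python
# Sketch of SCD Type 2: expire the current version of a changed dimension
# row and insert a new one, so the full history is preserved.
from datetime import date

dim_customer = [
    {"customer_id": 10, "city": "Boston", "valid_from": date(2023, 1, 1),
     "valid_to": None, "is_current": True},
]

def apply_scd2(dim_rows, customer_id, new_city, change_date):
    for row in dim_rows:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return                       # no change, nothing to do
            row["valid_to"] = change_date    # expire the old version
            row["is_current"] = False
    dim_rows.append({"customer_id": customer_id, "city": new_city,
                     "valid_from": change_date, "valid_to": None,
                     "is_current": True})    # insert the new current version

apply_scd2(dim_customer, 10, "Chicago", date(2024, 6, 1))
print(len(dim_customer))  # 2: both the old and the new version are kept
```

By contrast, SCD Type 1 would simply overwrite the city in place, losing history; the choice between types is a business decision per attribute.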
What is a materialized view?
Materialized views are physical structures that improve data access time by pre-computing intermediary results.
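The idea of pre-computing intermediary results can be sketched with SQLite, which has no native materialized views, so a plain table stands in for one here; the refresh function and table names are assumptions for the example.

```python
# Sketch of "materializing" a query: precompute an aggregate into a real
# table so reads avoid recomputation, at the cost of explicit refreshes.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("east", 50.0), ("west", 30.0)])

def refresh_sales_by_region(conn):
    # Recompute and store the intermediary result.
    conn.execute("DROP TABLE IF EXISTS sales_by_region")
    conn.execute("""CREATE TABLE sales_by_region AS
                    SELECT region, SUM(amount) AS total
                    FROM sales GROUP BY region""")

refresh_sales_by_region(conn)
rows = conn.execute(
    "SELECT region, total FROM sales_by_region ORDER BY region").fetchall()
print(rows)  # [('east', 150.0), ('west', 30.0)]
```

The trade-off this illustrates: reads are fast because the aggregate is already stored, but the stored result goes stale until the next refresh.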
What are the disadvantages of indexes?
- Indexes allow fast lookups, but they decrease load performance: heavily indexed tables do not allow efficient DML operations, i.e. inserts and updates.
- Indexes take additional disk space.
- The database backend needs to update all relevant indexes whenever data changes.
- Creates additional overhead due to index fragmentation: Developers or DBAs have to take care of index maintenance, reorganization, and rebuilds.
- Index fragmentation causes serious performance issues. When new data is inserted into an index, the database engine has to find space for it. If the insert disturbs the existing order, the engine may have to split a data page in two, which leaves an excessive amount of free space on the pages (internal fragmentation).
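The read-side benefit that motivates these costs can be seen in SQLite's query plans: the same query moves from a full scan to an index search once an index exists, while every subsequent insert must also maintain that index. Table and index names are illustrative.

```python
# Sketch of the lookup/write trade-off: an index turns a full scan into an
# index search, but every INSERT must also maintain the index.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, kind TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i, "click") for i in range(1000)])

plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE id = 500").fetchall()

# Extra disk space, plus write overhead on every future insert/update:
conn.execute("CREATE INDEX idx_events_id ON events(id)")

plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE id = 500").fetchall()

print(plan_before[0][-1])  # e.g. SCAN events
print(plan_after[0][-1])   # e.g. SEARCH events USING INDEX idx_events_id (id=?)
```

The exact plan wording varies by SQLite version, but the scan-versus-index-search distinction is the point.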
How would you prepare logging for ETL process?
Logging is extremely important to keep track of all changes and failures during a load. The most common ways to prepare for logging are to use flat files or a logging table. That is, during the process, counts, timestamps, and metadata about the source and target are added and then dumped into a flat file or table.
This way the load can be checked for invalid runs. When such a table or file exists, the next step would be to prepare notifications. This could be a report, or a simple formatted email, describing the load as soon as it finishes (e.g. the number of processed records compared to the previous load).
To achieve that in an ETL process, a developer would add event handlers (SSIS) or use variables (like the system variable @@ROWCOUNT in Transact-SQL) to keep track of inserted, updated, and deleted records.
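A minimal logging-table sketch, assuming invented table and source/target names, might look like this: each run records counts, a timestamp, and metadata, and a mismatch between rows read and rows loaded flags the run for review.

```python
# Sketch of a logging table for an ETL load: record counts, timestamps,
# and source/target metadata per run, then flag suspicious runs.
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE etl_log (
    run_at TEXT, source TEXT, target TEXT,
    rows_read INTEGER, rows_loaded INTEGER, status TEXT)""")

def log_run(conn, source, target, rows_read, rows_loaded):
    status = "ok" if rows_read == rows_loaded else "check"
    conn.execute("INSERT INTO etl_log VALUES (?, ?, ?, ?, ?, ?)",
                 (datetime.now().isoformat(), source, target,
                  rows_read, rows_loaded, status))
    return status

log_run(conn, "crm.orders", "dw.fact_orders", 1000, 1000)
status = log_run(conn, "crm.customers", "dw.dim_customer", 500, 498)

print(status)  # 'check': this run would trigger a notification
```

A notification step would then query `etl_log` after each load and email a summary whenever a run's status is not "ok".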
What is partitioning in ETL?
Partitioning is an important technique for organizing datasets so they can be queried efficiently. It organizes data in a hierarchical directory structure based on the distinct values of one or more columns.
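The hierarchical directory layout described above (as used by Hive/Spark-style `key=value` paths) can be sketched with plain files; the column names and values are invented for the example.

```python
# Sketch of directory-style partitioning: files grouped under key=value
# directories built from the distinct values of the partition columns.
import os
import tempfile

rows = [
    {"country": "US", "year": 2023, "value": 1},
    {"country": "US", "year": 2024, "value": 2},
    {"country": "DE", "year": 2024, "value": 3},
]

base = tempfile.mkdtemp()
for row in rows:
    # Hierarchical path from the partition column values.
    part_dir = os.path.join(base,
                            f"country={row['country']}",
                            f"year={row['year']}")
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "data.csv"), "a") as f:
        f.write(f"{row['value']}\n")

# A query filtered on country/year only needs to read one directory.
print(sorted(os.listdir(base)))  # ['country=DE', 'country=US']
```

This is what makes partition pruning possible: a filter on the partition columns lets the engine skip whole directories instead of scanning every file.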
What are joiner and lookup?
A joiner is used to join data from different sources. A lookup is used to fetch related values from another table, or to check whether a record already exists in the target table (e.g. before an update).
How is ETL used in data migration projects?
Data migration projects commonly use ETL tools. For example, if an organization previously managed its data in Oracle 10g and now wants to move to a SQL Server cloud database, the data will need to be migrated from source to target. Without a tool, a developer would have to spend a lot of time writing migration code by hand. ETL tools are therefore very useful, since they make the job much simpler than coding it in PL/SQL or T-SQL.