- What are the five V’s of Big Data?
- How is Hadoop related to Big Data?
- How is big data analysis helpful in increasing business revenue?
- Explain the steps to be followed to deploy a Big Data solution.
- How is Big data different?
- Name some tools or systems used in big data processing?
- How can big data support organizations?
- Which are the best tools that can be used by a Data-Analyst?
- What is Data Cleansing?
- What are the sources of Unstructured data in Big Data?
- What are the different approaches to deal with Big Data?
- What are the different platforms to deal with Big Data?
- What kind of projects are better suitable for Big Data?
- How can you process Big Data?
- Why is Hadoop more suitable for Big Data?
- Describe Big Data deployment.
- Which language is preferred for Big Data - R, Python or any other language?
- How are Big Data and Data Science related?
- What are the factors or issues to be considered while building Big Data Models?
- Explain the ETL process concerning Big Data.
- What are the tools used to extract Big Data?
- What are the tools/languages to query Big Data?
- What is feature selection?
- What is overfitting?
- How are missing values handled in Big Data?
- What are outliers?
- What do you mean by model optimization?
- Is a cloud-based solution a good option for Big Data?
- How should you deal with outliers?
- What is Data Enrichment?
- What is Lambda Architecture?
- What is Graph Analytics concerning Big Data?
- Explain data preparation in Big Data.
- What is Dimensionality Reduction?
- What are the different techniques for Dimensionality Reduction?
- Is Hadoop different from other parallel computing systems? How?
The five V’s of Big Data are as follows:
- Volume – Volume refers to the amount of data, which is growing at a high rate and is now commonly measured in petabytes.
- Velocity – Velocity is the rate at which data grows. Social media plays a major role in the velocity of growing data.
- Variety – Variety refers to the different data types, i.e. various data formats such as text, audio, and video.
- Veracity – Veracity refers to the uncertainty of available data. It arises because the high volume of data brings incompleteness and inconsistency.
- Value – Value refers to turning data into value. By turning the big data they collect into value, businesses may generate revenue.
When we talk about Big Data, we talk about Hadoop. So, this is another Big Data interview question that you will definitely face in an interview.
Hadoop is an open-source framework for storing, processing, and analyzing complex unstructured data sets for deriving insights and intelligence.
Big data analysis has become very important for businesses. It helps businesses differentiate themselves from others and increase revenue. Through predictive analytics, big data analytics provides businesses with customized recommendations and suggestions. It also enables businesses to launch new products based on customer needs and preferences. These factors help businesses earn more revenue, which is why companies are adopting big data analytics. Companies may see a significant increase of 5-20% in revenue by implementing big data analytics. Some popular companies that use big data analytics to increase their revenue are Walmart, LinkedIn, Facebook, Twitter, and Bank of America.
The following are the three steps to deploy a Big Data solution:
i. Data Ingestion
The first step for deploying a big data solution is the data ingestion i.e. extraction of data from various sources. The data source may be a CRM like Salesforce, Enterprise Resource Planning System like SAP, RDBMS like MySQL or any other log files, documents, social media feeds etc. The data can be ingested either through batch jobs or real-time streaming. The extracted data is then stored in HDFS.
ii. Data Storage
After data ingestion, the next step is to store the extracted data. The data can be stored either in HDFS or in a NoSQL database (e.g. HBase). HDFS storage works well for sequential access, whereas HBase suits random read/write access.
iii. Data Processing
The final step in deploying a big data solution is the data processing. The data is processed through one of the processing frameworks like Spark, MapReduce, Pig, etc.
Big data needs specialized systems and software tools to process all unstructured data. In fact, according to some industry estimates, almost 85% of the data generated on the internet is unstructured. Relational databases usually hold data in a structured format, and the database is centralized. Hence, RDBMS processing can be done quickly using a query language such as SQL. Big data, on the other hand, is very large and is distributed across many sources and machines, so processing it requires distributed systems and tools to extract information. Big data needs specialized tools such as Hadoop and Hive, along with high-performance hardware and networks, to process it.
Big data processing and analysis can be done using:
Big data has the potential to support organizations in many ways. Information extracted from big data can be used in:
- Better coordination with customers and stakeholders and to resolve problems
- Improve reporting and analysis for product or service improvements
- Customize products and services to selected markets
- Ensure better information sharing
- Support in management decisions
- Identify new opportunities, product ideas, and new markets
- Gather data from multiple sources and archive them for future reference
- Maintain databases, systems
- Determine performance metrics
- Understand interdependencies between business functions
- Evaluate organizational performance
- Google Search Operators
- Wolfram Alpha
- Google Fusion Tables
Data cleansing, also known as data scrubbing, is the process of removing data that is incorrect, duplicated, or corrupted. This process enhances data quality by eliminating errors and irregularities.
The sources of Unstructured data are as follows:
- Text files and documents
- Server, website, and application logs
- Sensor data
- Images, Videos and audio files
- Social media Data
Because Big Data offers a business a competitive edge over its competitors, a business can decide to tap the potential of Big Data as per its requirements and streamline its activities as per its objectives.
So the approach to dealing with Big Data should be determined by your business requirements and the available budget.
First, you have to identify the business concerns you have right now. What kind of questions do you want your data to answer? What are your business objectives, and how do you want to achieve them?
As far as the approaches regarding Big Data processing are concerned, we can do it in two ways:
- Batch processing
- Stream processing
As per your business requirements, you can process Big Data in batches, daily or after some other fixed interval. If your business demands it, you can process it in a streaming fashion, every hour or every 15 seconds or so.
It all depends on your business objectives and the strategies you adopt.
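The difference between the two approaches can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the function names, the `readings` list, and the choice of a sliding-window sum are all hypothetical, and a real deployment would use a batch or streaming framework rather than in-memory lists.

```python
from collections import deque


def process_batch(records):
    """Batch processing: handle the whole accumulated data set in one run."""
    return sum(records)


def process_stream(records, window_size):
    """Stream processing: emit a result as each record arrives.

    Here the result is the sum over the most recent `window_size` records.
    """
    window = deque(maxlen=window_size)  # old records fall out automatically
    for record in records:
        window.append(record)
        yield sum(window)


readings = [3, 5, 2, 8]  # hypothetical input values
batch_total = process_batch(readings)               # one result at the end
stream_results = list(process_stream(readings, 2))  # one result per record
```

The batch function produces a single answer once all the data is available, while the streaming generator produces an up-to-date answer after every record.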
There are various platforms available for Big Data. Some of these are open source and the others are license based.
In the open-source category, Hadoop is the biggest Big Data platform. The other alternative is HPCC, which stands for High-Performance Computing Cluster.
In the licensed category, we have Big Data platform offerings from Cloudera (CDH), Hortonworks (HDP), MapR (MDP), etc. (Hortonworks has since merged with Cloudera.)
For stream processing, we have tools like Storm. The Big Data platform landscape is easier to understand if we group it by usage. For example, in the data storage and management category, we have big players like Cassandra, MongoDB, etc. In the data cleaning category, we have tools like OpenRefine, DataCleaner, etc. In the data mining category, we have IBM SPSS, RapidMiner, Teradata, etc. In the data visualization category, the tools are Tableau, SAS, Spark, Chartio, etc. Features and specialities of these Big Data platforms/tools are as follows:
Hadoop:
- Open Source
- Highly Scalable
- Runs on Commodity Hardware
- Has a good ecosystem
HPCC:
- Open Source
- Good Alternative to Hadoop
- Parallelism at Data, Pipeline and System Level
- High-Performance Online Query Applications
Storm:
- Open Source
- Distributed Stream Processing
- Log Processing
- Real-Time Analytics
Cloudera (CDH):
- Licence based (Limited Free Version available)
- Cloudera Manager for easy administration
- Easy implementation
- More Secure
Hortonworks (HDP):
- Licence based (Limited Free Version available)
- Dashboard with Ambari UI
- Data Analytics Studio
- HDP Sandbox available for VirtualBox, VMware, Docker.
MapR (MDP):
- Licence based (Limited Free Version available)
- On-premise and cloud support
- Features AI and ML
- Open APIs
Cassandra:
- Open Source
- NoSQL Database
- Log-Structured Storage
- Includes Cassandra Query Language (CQL)
MongoDB:
- Licence based (also Open Source)
- NoSQL Database
- Document Oriented
- Aggregation Pipeline etc.
All projects that involve a lot of data crunching (mostly unstructured) are good candidates for Big Data. Thus telecom, banking, healthcare, pharma, e-commerce, retail, energy, transportation, etc. are the major sectors making heavy use of Big Data. Apart from these, any business or sector that deals with a lot of data is a good candidate for implementing Big Data projects. Even manufacturing companies can utilize Big Data for product improvement, quality improvement, inventory management, reducing expenses, improving operations, predicting equipment failures, etc. Big Data is also being used in education. The education industry generates a lot of data related to students, courses, faculties, results, and so on. If this data is properly analyzed and studied, it can provide many useful insights that can be used to improve the operational efficiency and overall working of educational institutions.
By harnessing the potential of Big Data in the Educational field, we can expect the following benefits:
- Customized Contents
- Dynamic Learning Programs
- Enhanced Grading system
- Flexible Course Materials
- Success Prediction
- Better Career Options
Healthcare is one of the biggest domains that makes use of Big Data. Patients can be given better treatment because patient data provides the necessary details about their history. It helps you perform only the required tests, so the costs related to diagnosis are reduced. Outbreaks of epidemics can be better predicted, and hence the necessary steps for their prevention can be taken early. Some diseases can be prevented, or their severity reduced, by taking preventive steps and early medication.
Following are the observed benefits of using Big Data in Healthcare:
- Better Prediction
- Enhanced Treatment
- Only the necessary tests to be performed.
- Reduced Costs
- Increased Care
Another area suitable for the implementation of Big Data is 'Welfare Schemes'. Big Data assists in making informed decisions about various welfare schemes. We can identify the areas of concern that need immediate attention. National challenges like unemployment, health concerns, depletion of energy resources, exploration of new avenues for growth, etc. can be better understood and dealt with accordingly. Cyber security is another area where we can apply Big Data to detect security loopholes, cyber crimes, illegal online activities or transactions, etc. Not only can we detect such activities, but we can also predict them in advance and gain better control over such fraudulent activities.
Some of the benefits of using Big Data in Media and Entertainment Industry can be as given below:
- On-demand content delivery.
- Predicting the preferences and interests of the audience.
- Insights from reviews of the customers.
- Targeted Advertisements etc.
Projects related to weather forecasting, transportation, retail, logistics, etc. can also be good candidates for Big Data.
There are various frameworks for Big Data processing.
One of the most popular is MapReduce. It consists mainly of two phases, the Map phase and the Reduce phase. Between the Map and Reduce phases there is an intermediate phase called Shuffle. The given job is divided into two kinds of tasks:
- Map tasks
- Reduce tasks.
The input is divided into splits of fixed size. Each input split is assigned to a mapper. The mappers run in parallel, so the execution time is drastically reduced and we get the output much faster.
The input to the mapper is a key-value pair. The output of mappers is another key-value pair. This intermediate result is then shuffled and given to reducers. The output of reducers is your desired output.
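The Map, Shuffle, and Reduce phases described above can be illustrated with a small word-count sketch in plain Python. This is not Hadoop's API: the function names are hypothetical, and the mappers run one after another here, whereas a real cluster would run them in parallel on separate nodes.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) key-value pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all intermediate values by their key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(splits):
    # Each split would go to its own mapper; this sketch runs them sequentially.
    mapped = [pair for split in splits for pair in map_phase(split)]
    return reduce_phase(shuffle_phase(mapped))


counts = word_count(["big data", "big cluster"])
```

With the two input splits shown, the word "big" is emitted once by each mapper, and the reducer adds those two values together.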
Hadoop is probably the very first open-source Big Data platform. It is highly scalable and runs on commodity hardware. It includes HDFS which is Hadoop Distributed File System. It can store a very large amount of unstructured data in a distributed fashion.
Hadoop also includes MapReduce which is a data processing framework. It processes data in a highly parallel fashion.
For a large quantity of data, the processing time is drastically reduced. Many APIs and other tools can be integrated with Hadoop, which further extends its usefulness, enhances its capability, and makes it more suitable for Big Data.
The Hadoop framework lets users write and test distributed systems quickly.
It is fault-tolerant and automatically distributes the data across the cluster of machines. It makes use of massive parallelism. Hadoop does not depend on the underlying hardware to provide high availability and fault tolerance.
Instead, it provides that support at the application layer itself. We can add or remove nodes as per our requirements without making any changes to the application.
Apart from being open-source, the other big advantage of Hadoop is its compatibility with almost all platforms. The amount of data being generated is increasing by a very large quantity day by day, so the need for data storage and processing will increase accordingly. The best part of Hadoop is that by adding more commodity machines you can increase its storage and processing power without any other investment in software or other tools.
Thus, just by adding more machines, we can accommodate the ever-increasing volume of data. Thanks to Hadoop's fault tolerance, both the data and the application processing are protected against hardware failure.
If a particular node goes down, the jobs are redirected automatically to other nodes. This ensures that 'Distributed Computing' does not fail.
Hadoop automatically stores multiple copies of the data (3 by default).
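To see why keeping three copies protects the data, here is a toy simulation in plain Python. It rests on loud assumptions: the node names and function names are hypothetical, and the round-robin placement is chosen only for illustration. It is not HDFS's actual placement policy, which also takes racks into account.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (illustrative only)."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return [
        block
        for block, holders in placement.items()
        if any(node not in failed_nodes for node in holders)
    ]


placement = place_replicas(["b1", "b2"], ["n1", "n2", "n3", "n4"])
still_readable = readable_blocks(placement, failed_nodes={"n1", "n2"})
```

Even with two of the four nodes down, every block in this example still has a live copy, so processing can continue on the remaining nodes.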
Hadoop provides more flexibility in terms of data capture and storage. You can capture any data, in any format, from any source into Hadoop and store it as it is, without any kind of preprocessing. In traditional systems, by contrast, you are required to pre-process the data before storage.
So, in Hadoop, you can store any data and then later process it as per your requirements.
The ecosystem around Hadoop is very strong. There are so many tools available for different needs. We have tools for automatic data extraction, storage, transformation, processing, analysis etc.
There are a variety of cloud options available for Hadoop. So, you have a choice to use on-premise as well as cloud-based features/tools as per your requirements.
Thus, by considering all these features that Hadoop provides and the robustness, cost-effectiveness it offers and also by taking into consideration the nature of Big Data, we can say that Hadoop is more suitable for Big Data.
There are various things to be considered very carefully before going for Big Data deployment. First, the business objectives and the requirements for the Big Data solution must be well understood and written down. What kind of insights are needed from the Big Data must be clearly defined.
Then you have to find out the various sources of data collection. You have to decide the data extraction strategy. Find out the various architectures and tools for Big Data deployment. Compare and decide the best fit depending on your requirements and the drafted policy.
You also have to take into consideration the data ingestion policy, storage, and processing requirements for the Big Data deployment. You can either manually deploy the solution or you can choose to automate the process by using the automated deployment tools.
The choice of language for a particular Big Data project depends on the kind of solution we want to develop. For example, if we want to do data manipulation, certain languages are good at the manipulation of data.
If we are looking at Big Data analytics, another set of languages is preferred. As far as R and Python are concerned, both are preferred choices for Big Data. For the visualization aspect of Big Data, R is preferred, as it is rich in tools and libraries with graphics capabilities.
For Big Data development, model building, and testing, we choose Python.
R is more popular among statisticians, whereas developers prefer Python.
Next, Java is a popular language in the Big Data environment, as the most widely preferred Big Data platform, Hadoop, is itself written in Java. Other popular languages include Scala, SAS, and MATLAB.
There is also a community of Big Data people who prefer to use both R and Python. So we see that there are ways we can use a combination of both of these languages such as PypeR, PyRserve, rPython, rJython, PythonInR etc.
Thus, it is up to you to decide which one or a combination will be the best choice for your Big Data project.
Data science is a broad spectrum of activities involving the analysis of Big Data, finding patterns and trends in data, interpreting statistical terms, and predicting future trends. Big Data is just one part of Data Science. Though Data Science is a broad term and very important in overall business operations, it is nothing without Big Data.
All the activities we perform in Data Science are based on Big Data. Thus Big Data and Data Science are interrelated and cannot be seen in isolation.
Special emphasis is needed when building Big Data models, because Big Data itself is less predictable than traditional kinds of data. It is a somewhat complex process, as it involves reorganizing and rearranging the business data according to the business processes.
To support the business objectives, the data models need to be designed to have logical inter-relationships among the various business data.
Then these logical designs need to be translated into the corresponding physical models.
Big Data is significantly different from traditional data, and the old data modelling techniques no longer apply to it. So you are required to apply different approaches for modelling Big Data.
The data interfaces should be designed to incorporate elasticity and openness due to the unpredictable nature of the Big Data to accommodate future changes.
Here the focus should not be on a schema but on designing a system. We should also take into consideration the various Big Data modeling tools out there. Not all the Big Data present there should be considered for modeling. Only the data appropriate to your business concerns should be selected to build models around.
ETL stands for Extract-Transform-Load. Big Data is mostly unstructured, comes in very large quantities, and accumulates at a very fast pace. So, at the time of extraction, it becomes very difficult to transform it because of its sheer volume, velocity, and variety. Also, we cannot afford to lose Big Data. So it needs to be stored as it is, and it can then be transformed and analysed later as the business requirements dictate.
The process of extraction of Big Data involves the retrieval of data from various data sources. The enterprises extract data for various reasons such as:
- For further processing.
- Migrate it to some other data repository such as a data warehouse/data lake.
- For analyzing etc.
Sometimes, while extracting the data, it may be desirable to add some additional information to the data, depending on the business requirements. This additional information can be something like geolocation data, timestamps, etc. This is called data enrichment. Sometimes it may be required to consolidate the data with some other data in the target datastore. These different processes are collectively known as ETL, i.e. Extract-Transform-Load.
In ETL, Extraction is the very first step.
The Big Data tools for data extraction assist in collecting the data from a variety of different data sources. The functionalities of these tools can be as mentioned below:
- Extract the data from various homogeneous/heterogeneous sources.
- Transform it to store in a proper format/structure for further processing and querying.
- Load the data in the target store such as data mart, an operational data store, or a data warehouse.
In ETL tools, it is common for the three steps to be executed in parallel. As the extraction of data takes a long time, the transformation process starts before extraction has finished. It processes the data already pulled and prepares it for loading.
As the data becomes ready for loading into the target store, the process of loading the data starts immediately irrespective of the completion of previous steps.
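The overlapping of the three steps can be sketched with threads and queues from Python's standard library. This is a minimal illustration, not how any particular ETL tool is implemented: the stage functions, the `SENTINEL` marker, and the upper-casing transformation are all hypothetical.

```python
import queue
import threading

SENTINEL = object()  # marks the end of a stream between stages


def extract(source, out_q):
    """Extract: push records downstream as soon as each one is read."""
    for record in source:
        out_q.put(record)
    out_q.put(SENTINEL)


def transform(in_q, out_q):
    """Transform: clean each record as it arrives, without waiting for the rest."""
    while True:
        record = in_q.get()
        if record is SENTINEL:
            out_q.put(SENTINEL)
            break
        out_q.put(record.strip().upper())


def load(in_q, store):
    """Load: append each transformed record to the target store."""
    while True:
        record = in_q.get()
        if record is SENTINEL:
            break
        store.append(record)


def run_pipeline(source):
    extracted, transformed = queue.Queue(), queue.Queue()
    store = []
    stages = [
        threading.Thread(target=extract, args=(source, extracted)),
        threading.Thread(target=transform, args=(extracted, transformed)),
        threading.Thread(target=load, args=(transformed, store)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return store


loaded = run_pipeline([" alpha ", "beta ", " gamma"])
```

Because each stage runs in its own thread and hands records over through a queue, transformation and loading begin while extraction is still in progress.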
ETL for Structured Data: If the data under consideration is structured, then the extraction process is performed generally within the source system itself.
Following extraction strategies may be used:
- Full Extraction: In the full extraction method, the data is extracted completely from the source. Tracking changes is not required. The logic here is simpler, but the load on the system is greater.
- Incremental extraction: In the incremental extraction method, the changes occurring in the source data since the last successful extraction are tracked, so you are not required to extract all the data again every time a change occurs. For this, a change table is created to track the changes. Some data warehouses have a built-in functionality for this known as 'CDC' (Change Data Capture).
The logic required for incremental data extraction is a little bit more complex but the load on the system is reduced.
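The trade-off between the two strategies can be sketched as follows. This is a simplified illustration that assumes every source row carries an `updated_at` value (a hypothetical column name) and tracks changes with a stored watermark instead of a separate change table or built-in CDC.

```python
def full_extract(rows):
    """Full extraction: copy everything, no change tracking needed."""
    return list(rows)


def incremental_extract(rows, last_watermark):
    """Incremental extraction: copy only rows changed since the last successful run.

    Returns the changed rows and the new watermark to store for the next run.
    """
    changed = [row for row in rows if row["updated_at"] > last_watermark]
    new_watermark = max((row["updated_at"] for row in changed), default=last_watermark)
    return changed, new_watermark


source = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 205},
    {"id": 3, "updated_at": 310},
]
changed_rows, watermark = incremental_extract(source, last_watermark=200)
```

The full extraction reads all three rows on every run, while the incremental run with a watermark of 200 reads only the two rows changed after it and remembers 310 for next time.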
ETL for Unstructured Data: When the data under consideration is unstructured, a major part of the work goes into preparing the data so that it can be extracted. In most cases, such data is stored in data lakes until it needs to be extracted for some kind of processing, analysis, or migration.
The data is cleaned up by removing the so-called 'noise' from it.
It is done in the following ways:
- Removing whitespaces/symbols
- Removing duplicate results
- Handling missing values.
- Removing outliers etc.
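A few of these clean-up steps can be sketched in plain Python. The function name and the sample records are hypothetical, the choice to treat records that differ only in letter case as duplicates is an assumption, and missing values are simply skipped here rather than imputed. Outlier removal is not covered by this sketch.

```python
import re


def clean_records(records):
    """Strip symbols, normalise whitespace, and drop missing and duplicate records."""
    seen = set()
    cleaned = []
    for raw in records:
        if raw is None:
            continue  # missing value: skipped in this sketch
        text = re.sub(r"[^\w\s]", "", raw)        # remove symbols
        text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
        key = text.lower()                        # assumption: case-insensitive duplicates
        if not text or key in seen:
            continue
        seen.add(key)
        cleaned.append(text)
    return cleaned


result = clean_records(["  Hello,   world! ", "hello world", None, "###", "Big Data"])
```

Only the first spelling of each record survives, and records that are empty after cleaning are dropped.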
There are some challenges in the ETL process. When you are consolidating data from one system to the other system, you have to ensure that the combination is good/successful. A lot of strategic planning is required. The complexity of planning increases manyfold when the data under consideration is both structured and unstructured. The other challenges include maintaining the security of the data intact and complying with the various regulations.
Thus performing ETL on Big Data is a very important and sensitive process that is to be done with the utmost care and strategic planning.
There are numerous tools available for Big Data extraction. For example, Flume, Kafka, Nifi, Sqoop, Chukwa, Talend, Scriptella, Morphlines, etc. Apart from data extraction, these tools also assist in modification and formatting the data.
The Big Data extraction can be done in various modes :
There are other issues that also need to be addressed. The source and destination systems may have different I/O formats, different protocols, scalability, security issues, etc. So data extraction and storage need to be handled accordingly.
Open source tools: Open source tools can be more suitable for budget-constrained users. Such users are expected to have sufficient knowledge and the required supporting infrastructure in place. Some vendors do offer light or limited versions of their tools as open source.
Batch processing tools: Legacy data extraction tools combine/consolidate the data in batches. This is generally done in off-hours to have minimum impact on the working systems. For on-premise, closed environments, batch extraction seems to be a good approach.
Cloud-based tools: These are the new generation of data extraction tools. Here, the emphasis is on real-time extraction of the data. These tools offer the added advantage of data security and also take care of data compliance issues, so an enterprise has less to worry about on those fronts.
'Talend Open Studio' is a good tool that offers data extraction as one of its features. It is one of the most powerful data integration tools on the market. It is a set of versatile open-source products that can be used for developing, testing, deploying, and administering various data management applications and other integration projects.
'Scriptella' is an open-source ETL tool released under the Apache License. It has various features related to data extraction, transformation, loading, database migration, etc. It can also execute JavaScript, SQL, Velocity, JEXL, etc. It interoperates with JDBC, LDAP, XML, and many other data sources. It is a very popular tool due to its ease of use and simplicity.
Another good open-source tool is 'KETL', which is best suited for data warehousing. It is built on an open, multi-threaded, Java-based, XML-driven architecture. The major features of KETL include integration with security and data management tools and scalability across multiple servers.
'Kettle' is Pentaho Data Integration. It is the default tool in the 'Pentaho' Business Intelligence suite.
There are other tools also such as Jaspersoft ETL, Clover ETL, Apatar ETL, GeoKettle, Jedox, etc.
To query Big Data, there are various languages available. Some of these languages are either functional, dataflow, declarative, or imperative. Querying Big Data often involves certain challenges. For example:
- Unstructured data
- Latency
- Fault tolerance etc.
By 'unstructured data' we mean that the data, as well as the various data sources, do not follow any particular format or protocol.
By 'latency' we mean the time taken by certain processes such as MapReduce to produce the result.
By 'fault tolerance' we mean the steps in the analysis that support partial failures, rolling back to previous results, etc.
To query Big Data, there are various tools available. You have to decide which one to use as per your infrastructural requirements. The following are some of the tools/languages to query the Big Data: HiveQL, Pig Latin, Scriptella, BigQuery, DB2 Big SQL, JAQL, etc.
Tools such as Flume and Pig are based on the concept of an explicit processing pipeline. The other approach is to translate SQL into an equivalent construct in Big Data.
For example, HiveQL, Drill, Impala, Dremel, etc. follow this approach.
From a user's perspective, the second, SQL-based approach is usually preferable. It is easy to follow and widely known, and query optimization is left to the tool/system.
The major limitation of such a query language is its built-in operators, which are very limited. Dataflow languages such as Flume and Pig, in contrast, are designed to incorporate user-specified operators.
Therefore such languages are easily extensible. Their major limitation is that the processing pipelines have to be constructed explicitly.
'Presto' is a good example of a distributed SQL query engine that is also open source. It can run interactive analytical queries over various data stores.
One of the features of Presto which is worth mentioning is its ability to combine data from multiple stores by a single query. Thus it allows you to perform analytics across the entire organization.
Feature selection is a process of extracting only the required features from the given Big Data. Big Data may contain a lot of features that may not be needed at a particular time during processing, so we are required to select only the features in which we are interested and do further processing.
There are several methods for features selection:
- Filters method
- Wrappers method
- Embedded method
- Filters Method:
In this method, the selection of features does not depend on the designated classifier. A variable ranking technique is used to order the variables for selection.
In the technique of variable ranking, we take into consideration the importance and usefulness of a feature for classification. In the filters method, to filter out the less relevant features, we can apply the ranking method before classification.
Some of the examples of filters method are:
- Chi-Square Test
- Variance Threshold
- Information Gain etc.
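As an illustration of a filter method, here is a minimal variance threshold sketch using only the standard library. The function name, the sample rows, and the feature names are hypothetical. Libraries such as scikit-learn ship their own implementation, which this sketch does not attempt to reproduce.

```python
from statistics import pvariance


def variance_threshold(rows, feature_names, threshold=0.0):
    """Keep the features whose variance across all rows exceeds `threshold`."""
    selected = []
    for index, name in enumerate(feature_names):
        column = [row[index] for row in rows]
        if pvariance(column) > threshold:
            selected.append(name)
    return selected


rows = [
    [1, 0, 10],
    [1, 1, 20],
    [1, 0, 30],
]
kept = variance_threshold(rows, ["constant", "flag", "amount"])
```

The "constant" feature has the same value in every row, so it carries no information for a classifier and is filtered out before any model is trained.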
- Wrappers method:
In the wrappers method, the algorithm for feature subset selection exists as a 'wrapper' around the algorithm known as 'induction algorithm'.
The induction algorithm is considered as a 'Black Box'. It is used to produce a classifier that will be used in classifying.
It requires a heavy computation to obtain the subset of features. This is considered as a drawback of this technique.
Some of the examples of Wrappers Method are:
- Genetic Algorithms
- Recursive Feature Elimination
- Sequential Feature Selection
- Embedded Method:
This method combines the efficiencies of the Filters method and the Wrappers method.
It is generally specific to a given learning machine. The selection of variables is usually done in the training process itself. What this method learns is which features contribute most to the accuracy of the model.
Some of the examples of Embedded Method are:
- L1 Regularisation Technique (such as LASSO)
- Ridge Regression (also known as L2 Regularisation)
- Elastic Net etc.
Feature selection simplifies machine learning models, so they become easier to interpret. It reduces the burden of dimensionality and improves the generality of the model, so the overfitting problem is reduced.
Thus, we get various benefits by using Feature Selection methods. Following are some of the obvious benefits:
- A better understanding of data.
- Improved prediction performance.
- Reduced computation time.
- Reduced space etc.
Tools such as SAS, MATLAB, Weka also include methods/ tools for feature selection.
Overfitting refers to a model that is fitted too tightly to the data. It is a modeling error that occurs when a modeling function is fit too closely to a limited data set. Here the model is made too complex in order to explain the peculiarities of the data under consideration.
The predictive power of such models is reduced by overfitting, and their ability to generalize also suffers. Such models generally fail when applied to outside data, i.e. data that was not part of the sample.
There are several methodologies to avoid overfitting. These are:
- Cross-validation
- Early stopping
- Regularization etc.
Overfitting seems to be a common problem in the world of data science and machine learning. Such a model learns the noise along with the signal. It proves to be a poor fit when applied to new data sets.
A model should be considered overfitted when it performs well on the training set but poorly on the test set. Following is a description of the most widely used cross-validation method:
Cross-validation is considered one of the most powerful techniques for preventing overfitting. Here, the training data is used to generate multiple small train-test splits, and these splits are used to tune the model.
In the 'k-fold cross-validation' method, the data is partitioned into 'k' subsets called folds. The model is trained on 'k-1' folds and the remaining fold, also called the 'holdout fold', is used as the test set. This is repeated 'k' times so that each fold serves as the holdout fold once.
This method allows us to keep the test set as an unseen dataset and lets us select the final model.
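The partitioning described above can be sketched in a few lines of plain Python. The function name `k_fold_splits` and the sample sizes are assumptions made for this illustration; in practice the indices would usually be shuffled first and a library routine would be used.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, holdout_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        holdout = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, holdout
        start += size


splits = list(k_fold_splits(n_samples=10, k=5))
for train, holdout in splits:
    print("train:", train, "holdout:", holdout)
```

Each sample lands in the holdout fold exactly once, so every data point is used for both training and validation across the 'k' rounds.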
Missing values refer to the values that are not present for a particular column. If we do not take care of the missing values, it may lead to erroneous data and in turn incorrect results. So before processing the Big Data, we are required to properly treat the missing values so that we get the correct sample. There are various ways to handle missing values.
We can either drop the affected data or replace the missing values using data imputation.
If the number of missing values is small, the general practice is to leave those records out. If the number of cases is larger, data imputation is done.
There are certain techniques in statistics to handle or estimate missing values:
- Maximum Likelihood Estimation
- Listwise/pairwise Deletion
- Multiple data imputation etc.
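To make the two basic options concrete, here is a minimal sketch of listwise deletion and of simple mean imputation. The records and column names are hypothetical, and single mean imputation is used here as a simpler stand-in for the multiple imputation and maximum likelihood techniques listed above.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 45, "income": None},
    {"age": 29, "income": 61000},
]


def listwise_delete(rows):
    """Drop every record that has at least one missing value."""
    return [row for row in rows if all(value is not None for value in row.values())]


def mean_impute(rows, column):
    """Replace missing values in one column with the mean of the observed values."""
    fill = mean(row[column] for row in rows if row[column] is not None)
    return [{**row, column: fill if row[column] is None else row[column]} for row in rows]


complete_cases = listwise_delete(records)
imputed = mean_impute(records, "age")
print(len(complete_cases), "complete records")
print(imputed)
```

Deletion keeps only fully observed records and so shrinks the sample, while imputation keeps every record at the cost of introducing estimated values.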
Outliers are data points/values that are very far from the group. These do not belong to any particular group/cluster.
The presence of outliers may affect the behavior of the model. So proper care is to be taken to identify and properly treat the outliers.
The outliers may contain valuable and often useful information. So they should be handled very carefully. Most of the time, they are considered to be bad data points but their presence in the data set should also be investigated.
Outliers present in the input data may skew the result. They may mislead the process of training of machine learning algorithms. This results in:
- Longer Training Time
- Less Accurate Models
- Poor Results.
It is observed that many machine learning models are sensitive to:
- The range of attribute values
- Distribution of attribute values
The presence of outliers may create misleading representations. This will lead to misleading interpretations of the collected data.
In descriptive statistics, the presence of outliers may skew the mean and standard deviation of the attribute values. The effects can be observed in plots like scatterplots and histograms.
For some problems, outliers can be more relevant. For example anomalies in:
- Fraud detection
- Computer security.
Some of the outlier detection methods are:
- Extreme Value Analysis: Here we determine the statistical tails of the distribution of data. For example, statistical methods like 'z-scores' on univariate data.
- Probabilistic and Statistical Models: Here we determine the 'unlikely instances' from a 'probabilistic model' of data. For example, 'Gaussian mixture' models optimized using 'expectation-maximization'.
- Linear Models: Using linear correlations, the data is modeled into lower dimensions. For example, data points having large residual errors can be outliers.
- Proximity-based Models: Here, the data instances which are isolated from the group or mass of the data are determined by Cluster, Density or by the Nearest Neighbor Analysis.
- Information-Theoretic Models: Here the outliers can be detected as data instances that increase the complexity of the dataset (minimum code length).
- High-Dimensional Outlier Detection: In this method, we search subspaces for the outliers based on distance measures in higher dimensions.
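As an illustration of the first method in the list, the sketch below flags values by their z-score on univariate data. The data values and the threshold of 2.0 are assumptions chosen for a small sample; a threshold of 3.0 is a more common choice when there are many observations.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical measurements with one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 95]
flagged = z_score_outliers(readings)
print("flagged as outliers:", flagged)
```

Note that the extreme value itself inflates the mean and standard deviation, which is one reason robust alternatives based on the median are sometimes preferred.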
Model optimization means building or refining a model so that it is as realistic as possible. The model should reflect the real-life situation closely, so that it gives the expected results when applied to real-world data. This is achieved by capturing the significant or key components from the dataset.
Some tools are available for optimizing models. One such tool is the ‘TensorFlow Model Optimization Toolkit’. There are three major components in model optimization:
- An objective function
- Decision variables
- Constraints
An objective function is the function that we need to optimize. The solution to a given optimization problem is the set of values of the decision variables for which the objective function reaches its optimal value. The values of the decision variables are restricted by the constraints.
The classification of optimization problems is based on the nature of the objective function and the nature of the given constraints. In an unconstrained optimization problem, there are no constraints and the objective function can be of any kind, linear or nonlinear. In a linear optimization problem, the objective function is linear in the variables and the given constraints are also linear.
In a quadratic optimization problem, the objective function is quadratic in the variables and the given constraints are linear. In a nonlinear optimization problem, the objective function is an arbitrary nonlinear function of the decision variables, and the given constraints can be linear or nonlinear.
The objective of model optimization is to find the optimal values of the given decision variables.
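The relationship between the objective function, the decision variables and the constraints can be sketched with a small quadratic problem. Everything in this example (the objective, the bounds, the step size and the use of projected gradient descent) is an assumption chosen for illustration, not a method prescribed above.

```python
def objective(x, y):
    """Objective function to minimise; quadratic in the decision variables x and y."""
    return (x - 3.0) ** 2 + (y - 2.0) ** 2


def gradient(x, y):
    """Partial derivatives of the objective with respect to x and y."""
    return 2.0 * (x - 3.0), 2.0 * (y - 2.0)


def clip(value, low, high):
    """Project a value back into the interval allowed by the constraints."""
    return max(low, min(high, value))


def minimise(bounds, step=0.1, iterations=200):
    """Projected gradient descent under simple linear (bound) constraints."""
    (x_low, x_high), (y_low, y_high) = bounds
    x, y = x_low, y_low
    for _ in range(iterations):
        gx, gy = gradient(x, y)
        x = clip(x - step * gx, x_low, x_high)
        y = clip(y - step * gy, y_low, y_high)
    return x, y


# Constraints: 0 <= x <= 2 and 0 <= y <= 5.
best_x, best_y = minimise(bounds=((0.0, 2.0), (0.0, 5.0)))
print(best_x, best_y, objective(best_x, best_y))
```

Without constraints the minimum would be at x = 3, y = 2. The bound on x restricts the solution to x = 2, which shows how the constraints limit the values the decision variables can take.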
Using the cloud for Big Data development is a good choice. It helps businesses increase their operational efficiency with a minimal initial investment. They pay only for the facilities they use. Furthermore, they can upgrade or downgrade the facilities as business requirements change.
For some enterprises, deploying Big Data technologies on their premises proves to be a costly affair. Most of the time, they do not possess the expertise required for Big Data deployment. Furthermore, the initial investments in these technologies are high. Their business requirements also change with market conditions, and the tools and technologies related to Big Data evolve with the changing requirements. So keeping up to date with the latest versions and tools proves costly for enterprises, and the cloud seems to be a better way to start a Big Data initiative. There are several players in the cloud space. The major Big Data cloud providers are:
- Amazon Web Services
- Microsoft Azure
- Google Cloud Platform
- Qubole etc.
Developing a cloud-based solution for Big Data requires awareness of the various Big Data offerings of the different cloud providers. First of all, a business should be very clear about its Big Data requirements. These requirements can be something like:
- Kind of insights needed
- Data Sources
- Storage needs
- Processing requirements (batch/real-time) etc.
Once you are clear about your Big Data requirements and your strategies for future development, you can choose a suitable combination of storage solutions, processing platforms and analytical tools to get the required results from your Big Data initiatives.
Depending on the business constraints and state regulations, you can opt for some Big Data solutions from the cloud and employ other tools within the enterprise to achieve a better trade-off. This way you can comply with the various regulations while making efficient use of the available resources and budget.
In Big Data projects, one of the greatest concerns is data availability and accessibility. Cloud providers typically assure 99.9% uptime. They also employ various data checking and security mechanisms to keep the data available at all times. Making such provisions at the enterprise level requires heavy investment, not just in capital but also in tackling the operational challenges. Even with sound planning, demand is often difficult to anticipate. This results in under-allocation or over-allocation of resources, which ultimately affects your investment. With the cloud, new services, products and projects can start on a small scale at very low cost, which leaves a lot of room for innovation. So, opting for the cloud seems to be the better choice for the initial journey into the world of Big Data.
Outliers are observations that appear far away from the group. They diverge from the overall pattern outlined by the given sample. Due to the presence of outliers in the dataset, we can observe a drastic change in the results. There are various unfavourable effects of outliers in the data set. Some of the impacts can be stated as follows:
- They may increase the error variance.
- They may reduce the normality of the data.
- They may decrease the power of various statistical tests.
- They may lead to biased estimates.
Outliers must not be ignored and should be properly treated as their presence may change the basic assumptions in statistical modelling. The results may get skewed due to the presence of outliers. Before applying procedures to deal with the outliers, we should always try to reason out the presence of outliers.
If we know the reason for the presence of outliers in our dataset, we can use the methods accordingly, to deal with the outliers. The reasons for having outliers in the dataset can be as follows:
- Non-natural (Data Errors)
- Natural (True Outliers)
The non-natural reasons for outliers can be:
- Data Entry Errors
- Measurement Error
- Sampling Error
- Experimental Error
- Data Processing Error etc.
Natural or true outliers can be originally present in the dataset. To deal with outliers, the following approaches can be used:
- Deleting observations
- Imputing values
- Treating as a separate group
- Other statistical methods.
Trimming can also be used at one or both extremes to remove the outliers. Weights can also be assigned to different observations. The mean, median or mode can be used to replace (impute) outlier values. Before imputing values, we should analyze whether an outlier is natural or artificial. If the outliers are significantly large in number, it is advisable to treat them as a separate group. We can then build a model for each of the two groups and combine the outputs.
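Two of the treatments mentioned above, trimming at both ends and imputing a central value, can be sketched as follows. The data, the trimming fraction and the rule used to flag an outlier (`v > 50`) are all assumptions made for this example.

```python
from statistics import median


def trim(values, fraction):
    """Drop the given fraction of observations from each end of the sorted data."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k] if k else ordered


def impute_with_median(values, is_outlier):
    """Replace flagged values with the median of the values that were not flagged."""
    centre = median(v for v in values if not is_outlier(v))
    return [centre if is_outlier(v) else v for v in values]


# Hypothetical measurements with one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 11, 95]

trimmed = trim(readings, fraction=0.1)
imputed = impute_with_median(readings, is_outlier=lambda v: v > 50)
print("trimmed:", trimmed)
print("imputed:", imputed)
```

Trimming shortens the data set, whereas imputation keeps its length and order, which matters when each observation is tied to a record that should not be dropped.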
Data enrichment is a process to improve, refine or enhance data. It means adding further details to the existing data, including external data from trusted sources. Data enrichment helps you to have complete and accurate data, and more informed decisions can be made with enriched data. As data is the most valuable asset in the Big Data world, it must be ensured that the data is in good condition. It should not be incomplete, missing, redundant or inaccurate. If we do not have good data, we cannot expect good results from it.
By good data we mean data that is complete and accurate. The process of data enrichment helps us add more details to the existing data so that it becomes complete.
Incomplete or sparse data cannot give a complete picture of your customer. If you have insufficient information about your customers, you may not be able to give the expected service or customized offerings. This affects the business conversion rate and ultimately the business revenue. So having data in a good and complete condition is a must for Big Data analytics to give correct insights and hence produce the expected results. Data enrichment involves refining data that may be insufficient or inaccurate, or that may contain small errors. Extrapolating data is also a kind of data enrichment. Here we produce more data from the available raw data. There are several types of data enrichment methods. Out of these, the two significant methods are:
- Demographic Data Enrichment
- Geographic Data Enrichment
It is up to you to decide what kind of data enrichment you need, depending on your business requirements and objectives. Data enrichment is not a one-time process. It has to be done continuously because customer data tends to change with time. There are several data enrichment tools available. Some of these are:
- ZoomInfo etc.
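Geographic data enrichment, one of the two methods named above, amounts to joining existing records with an external reference source. The sketch below uses hypothetical customer records and a hypothetical postcode lookup table; it does not represent the interface of any enrichment tool.

```python
# Hypothetical customer records held by the business.
customers = [
    {"id": 1, "name": "Asha", "postcode": "10001"},
    {"id": 2, "name": "Ben", "postcode": "94105"},
    {"id": 3, "name": "Chen", "postcode": "00000"},
]

# Hypothetical external reference data from a trusted source.
postcode_lookup = {
    "10001": {"city": "New York", "region": "NY"},
    "94105": {"city": "San Francisco", "region": "CA"},
}


def enrich(customer, lookup):
    """Return a copy of the record extended with geographic details, if known."""
    extra = lookup.get(customer["postcode"], {})
    return {**customer, **extra}


enriched = [enrich(customer, postcode_lookup) for customer in customers]
for record in enriched:
    print(record)
```

Records without a match are left unchanged, so the enrichment step can be rerun later as the reference data improves.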
Lambda architecture is a Big Data processing architecture. To handle enormous quantities of data, it makes use of both batch and stream processing methods. It is a fault-tolerant architecture and achieves a balance between latency and throughput. Lambda architecture relies on a data model with an append-only, immutable data source that serves as the system of record.
In this architecture, new events are appended to the existing events. The new events do not overwrite existing events. The lambda architecture is designed for ingesting and processing timestamped events. The state can be determined from the natural, time-based ordering of the data.
In Lambda architecture, we have a system that consists of three layers:
- Batch processing
- Real-time processing
- Serving layer
The third layer responds to queries. The data is ingested into the processing layers from a master copy of the entire data set. This master copy is immutable. The real-time processing layer processes the data streams in real time. It does not require completeness or any fix-ups.
This layer provides real-time views of the most recent data. So the latency is minimized but the throughput is sacrificed. The real-time processing layer is also called the speed layer.
As the batch layer lags in providing views of the most recent data, the speed layer does the work of filling this gap. The benefit of the speed layer is that its view is available immediately once the data is received. This view may not be as complete or accurate as the view generated by the batch layer. However, you can always replace the view produced by the speed layer with the batch layer's view once the same data has been made available to the batch layer. The output of the batch layer and the speed layer is stored in the serving layer. In response to ad-hoc queries, the serving layer returns pre-computed views or builds views from the processed data.
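The interplay of the three layers can be illustrated with a toy, in-memory model. The class name `LambdaPipeline`, its methods and the page-view events are hypothetical and exist only for this sketch; a real deployment would use separate batch and stream processing systems.

```python
from collections import Counter


class LambdaPipeline:
    """Toy model of the batch, speed and serving layers over one master log."""

    def __init__(self):
        self.master_log = []         # append-only system of record
        self.batch_view = Counter()  # view last computed by the batch layer
        self.batch_offset = 0        # number of events the batch view covers

    def ingest(self, page):
        # New events are appended; existing events are never overwritten.
        self.master_log.append(page)

    def run_batch(self):
        # Batch layer: recompute the view from the complete master log.
        self.batch_view = Counter(self.master_log)
        self.batch_offset = len(self.master_log)

    def speed_view(self):
        # Speed layer: cover only the events the batch layer has not seen yet.
        return Counter(self.master_log[self.batch_offset:])

    def query(self, page):
        # Serving layer: merge the batch view with the real-time view.
        return self.batch_view[page] + self.speed_view()[page]


pipeline = LambdaPipeline()
for page in ["home", "home", "pricing"]:
    pipeline.ingest(page)
pipeline.run_batch()
pipeline.ingest("home")  # arrives after the last batch run
hits = pipeline.query("home")
print("views of 'home':", hits)
```

Once the batch layer runs again, its view absorbs the newer events and the speed layer's contribution for them is no longer needed.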
In Graph Analytics of Big Data, we model the given problem as a graph database and then perform analysis over that graph to get the required answers to our questions. There are several types of graph analytics, such as:
- Path Analysis
- Connectivity Analysis
- Community Analysis
- Centrality Analysis
Path Analysis is generally used to find the shortest distance between any two nodes in a given graph.
Route optimization is the best example of Path Analysis. It can be used in applications such as supply chain, logistics, traffic optimization, etc. Connectivity Analysis is used to determine the weaknesses in a network, for example in a utility power grid.
Connectivity Analysis can also be used to determine the connectivity across a network. Community Analysis is based on density and distance. It can be used to identify the different groups of people in a social network. Centrality Analysis enables us to determine the most 'Influential People' in a social network.
Using this analysis, we can also find the web pages that are most highly accessed. Various algorithms are used in Graph Analytics, for example PageRank, Eigenvector Centrality, Closeness Centrality, Betweenness Centrality, etc.
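As a small illustration of Centrality Analysis, the sketch below runs a simplified PageRank iteration over a hypothetical set of web pages. The link structure, the damping factor of 0.85 and the fixed number of iterations are assumptions for this example, and every link target is assumed to appear as a key in the dictionary.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank; links maps each page to the pages it links to."""
    nodes = list(links)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node, targets in links.items():
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank


# Hypothetical link structure between four pages.
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home"],
    "contact": ["home"],
}

ranks = pagerank(links)
most_central = max(ranks, key=ranks.get)
print("most central page:", most_central)
```

The page that receives links from all the others ends up with the highest score, which matches the intuition of an influential node.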
Graphs are made up of nodes/vertices and edges. When applied to real-life examples, 'people' can be considered as nodes. For example customers, employees, social groups, companies etc. There can be other examples also for nodes such as buildings, cities and towns, airports, bus depots, distribution points, houses, bank accounts, assets, devices, policies, products, grids, web pages, etc.
Edges represent relationships, for example social networking likes and dislikes, emails, payment transactions, phone calls, etc. Edges can be directed, non-directed or weighted. Examples of directed edges are 'John transferred money to Smith' and 'Peter follows David on some social platform'. An example of a non-directed edge is 'Sam likes America'. Examples of weighted edges are 'the number of transactions between any two accounts' and 'the time required to travel between any two stations or locations'. In a big data environment, we can do Graph Analytics using Apache Spark 'GraphX' by loading the given data into memory and then running the graph analysis in parallel.
There is also an interface called Apache TinkerPop that can be used to connect Spark with other graph databases. In this way, you can extract the data out of a graph database and load it into memory for faster graph analysis. For analyzing graphs, we can use tools such as Neo4j, GraphFrames, etc. GraphFrames is built on Spark DataFrames and is designed to scale to large graphs.
Graph analytics can be applied to fraud detection, financial crime detection, identification of social media influencers, route optimization, network optimization, etc.
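Route optimization, one of the applications listed above, is an instance of Path Analysis on a graph with weighted, directed edges. The following sketch uses Dijkstra's algorithm on a hypothetical delivery network; the node names and travel times are invented for illustration.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns (total_cost, path) or (inf, []) if unreachable."""
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return float("inf"), []


# Hypothetical travel times (in minutes) between locations.
routes = {
    "depot": {"A": 4, "B": 1},
    "B": {"A": 2, "C": 5},
    "A": {"C": 1},
    "C": {},
}

cost, path = shortest_path(routes, "depot", "C")
print(cost, path)
```

The direct-looking route through A costs 5, while the route through B and then A costs 4, so edge weights rather than the number of hops decide the result.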
Data preparation involves collecting, combining, organizing and structuring data so that it can be analyzed for patterns, trends and insights. Big Data needs to be preprocessed, cleansed, validated and transformed. For this, the required data is pulled in from different internal or external sources. One of the major goals of data preparation is to ensure that the data under consideration for analysis is consistent and accurate, because only accurate data will produce valid results.
When the data is collected, it is usually not complete. It may have missing values, outliers, etc. Data preparation is a major and very important activity in any Big Data project. Only good data will produce good results. Most of the time, the data resides in silos, in different databases and in different formats, so it needs to be reconciled. There are five D's associated with the process of data preparation.
The process of data preparation can largely be automated. Various machine learning algorithms can be used for data preparation tasks such as filling missing values, renaming fields, ensuring consistency, removing redundancy, etc. Several terms are related to data preparation, such as data cleansing, transforming variables, removing outliers, data curation, data enrichment, data structuring and modeling, etc. These terms refer to the various activities that are carried out as part of data preparation.
The time spent on data preparation is generally more than the time required for data analysis.
Though the methods used for data preparation are automated, preparing the data takes a lot of time because the volume of data is very large and tends to grow continuously.
Dimensionality reduction means reducing the number of dimensions or variables under consideration. Big Data contains a large number of variables. Most of the time, some of these variables are correlated. So there is always room to keep only the major or distinct variables that contribute most to the result. In Principal Component Analysis, the new uncorrelated variables derived from the original ones are called principal components.
In most cases, some features are redundant. We can always reduce the features where we observe a high correlation. The Dimensionality Reduction technique is also known as 'Low Dimensional Embedding'.
When the number of variables is huge, it becomes difficult to draw inferences from the given data set, and visualization also becomes difficult. In such situations it is desirable to reduce the number of features and use only the more significant ones. Dimensionality Reduction helps here by allowing us to reduce the number of dimensions and speed up our analytics. There are several obvious advantages of Dimensionality Reduction, such as:
- Reduced storage due to data compression.
- Reduced computation time.
- Removal of redundant features
- Visualization becomes easier.
- Dimensionality Reduction may cause some loss of information, but the advantages gained usually outweigh it.
There are two approaches to do Dimensionality Reduction:
- Feature Selection
- Feature Extraction
Following are the different ways by which we can perform 'Feature Selection':
- Filter Method
- Wrapper Method
- Embedded Method
In 'Feature Extraction', we reduce the data from a 'high dimensional space' to a 'lower-dimensional space' with fewer dimensions. The process of 'Dimensionality Reduction' can be linear or nonlinear. Several methods are used for Dimensionality Reduction.
Some of these are:
- PCA (Principal Component Analysis)
- LDA (Linear Discriminant Analysis)
- GDA (Generalized Discriminant Analysis)
When we use 'Principal Component Analysis', the requirement is that the variance of the data in the 'lower-dimensional space' should be maximal when the data is mapped from the 'higher dimensional space' to the 'lower-dimensional space'. The following steps are followed in Principal Component Analysis:
- Constructing the Covariance Matrix of the given data.
- Computing the eigenvectors and eigenvalues of this covariance matrix.
- Reconstructing a large fraction of the variance of the original data by using the eigenvectors corresponding to the largest eigenvalues.
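The steps above can be sketched in plain Python for a small two-dimensional data set. The sample values are hypothetical, and power iteration is used here only as a simple way to obtain the eigenvector with the largest eigenvalue; numerical libraries normally compute the full eigendecomposition instead.

```python
def covariance_matrix(rows):
    """Return (covariance matrix, mean-centred rows) for a list of samples."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in rows]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    return cov, centred


def dominant_eigenpair(matrix, iterations=100):
    """Power iteration: the eigenvector and eigenvalue with the largest eigenvalue."""
    p = len(matrix)
    vector = [1.0] * p
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(p)) for i in range(p)]
        norm = sum(v * v for v in product) ** 0.5
        vector = [v / norm for v in product]
    product = [sum(matrix[i][j] * vector[j] for j in range(p)) for i in range(p)]
    eigenvalue = sum(v * w for v, w in zip(vector, product))
    return vector, eigenvalue


# Hypothetical samples with two strongly correlated variables.
samples = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8]]

cov, centred = covariance_matrix(samples)
component, variance_kept = dominant_eigenpair(cov)

# Project each centred sample onto the first principal component (2-D to 1-D).
scores = [sum(c * w for c, w in zip(row, component)) for row in centred]

total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_ratio = variance_kept / total_variance
print("first component:", component)
print("share of variance kept:", explained_ratio)
```

Because the two variables are almost perfectly correlated, a single component retains nearly all of the variance, which is the situation in which reducing dimensions loses very little information.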
With 'Linear Discriminant Analysis', we try to find a linear combination of features that separates two or more classes of objects or events.
The 'Generalized Discriminant Analysis' method maps the given 'input vectors' into a 'high dimensional feature space', where the classes can then be separated.
Yes, it is. Hadoop includes a distributed file system (HDFS). It allows us to store and manage large amounts of data across a cluster of machines, and it handles data redundancy.
The main benefit is that, since the data is stored on multiple nodes, it can be processed in a distributed way. Each node processes the data stored on it instead of wasting time moving the data across the network.
In contrast, in a relational database system we can query data in real time, but storing data in tables, records and columns is not efficient when the data is huge.
The Hadoop ecosystem also provides HBase, a column-oriented database built on top of Hadoop, for real-time queries on rows.
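The idea of processing data where it is stored can be illustrated with a plain Python simulation. This is not the Hadoop API; the node names and text blocks are hypothetical, and the sketch only mimics the pattern of computing partial results on each node and then combining the small summaries.

```python
from collections import Counter

# Hypothetical blocks of text, grouped by the node that stores them.
blocks_by_node = {
    "node1": ["big data needs storage", "hadoop stores data"],
    "node2": ["hadoop processes data in place"],
}


def count_words_locally(lines):
    """Work done on one node, using only the lines stored on that node."""
    return Counter(word for line in lines for word in line.split())


# Each node produces a partial result from its own data.
partial_counts = {node: count_words_locally(lines)
                  for node, lines in blocks_by_node.items()}

# Only the small partial results are combined, not the raw data.
total_counts = sum(partial_counts.values(), Counter())
print(total_counts.most_common(3))
```

Only the per-node summaries travel across the network, which is far cheaper than moving the raw blocks to a central machine.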