Data Science is getting bigger and better with each passing day. As such, it is churning out plenty of opportunities for those interested in pursuing the career of a data scientist.

If you are just starting out with data science, you will first want to learn how to become a data scientist.

## Data Science Interview Questions

However, if you’re already past that and preparing for a data scientist job interview, here are the 20 most important data science interview questions with answers to help you secure the spot:

**Q**: **Can you enumerate the various differences between Supervised and Unsupervised Learning?**

**A**: Supervised learning is a type of machine learning where a function is inferred from labeled training data. The training data contains a set of training examples.

Unsupervised learning, on the other hand, is a type of machine learning where inferences are drawn from datasets containing input data without labeled responses. Following are the various other differences between the two types of machine learning:

- Algorithms Used – Supervised learning makes use of Decision Trees, K-nearest Neighbor algorithm, Neural Networks, Regression, and Support Vector Machines. Unsupervised learning uses Anomaly Detection, Clustering, Latent Variable Models, and Neural Networks.
- Enables – Supervised learning enables classification and regression, whereas unsupervised learning enables clustering, dimension reduction, and density estimation
- Use – While supervised learning is used for prediction, unsupervised learning finds use in analysis

**Q**: **What do you understand by the Selection Bias? What are its various types?**

**A**: Selection bias is typically associated with research that doesn’t have a random selection of participants. It is a type of error that occurs when a researcher decides who is going to be studied. On some occasions, selection bias is also referred to as the selection effect.

In other words, selection bias is a distortion of statistical analysis that results from the sample collecting method. When selection bias is not taken into account, some conclusions made by a research study might not be accurate. Following are the various types of selection bias:

- Sampling Bias – A systematic error caused by a non-random sample of a population, in which some members are less likely to be included than others, resulting in a biased sample.

- Time Interval – A trial might be ended at an extreme value, usually due to ethical reasons, but the extreme value is most likely to be reached by the variable with the most variance, even though all variables have a similar mean.

- Data – Results when specific data subsets are selected to support a conclusion, or when data is rejected as bad on arbitrary grounds.

- Attrition – Caused by the loss of participants, i.e. discounting trial subjects or tests that didn’t run to completion.

**Q**: **Please explain the goal of A/B Testing.**

**A**: A/B Testing is a form of statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B Testing is to identify which changes to a webpage (or other asset) increase the likelihood of an outcome of interest, such as a click or a purchase.

A highly reliable method for finding out the best online marketing and promotional strategies for a business, A/B Testing can be employed for testing everything, ranging from sales emails to search ads and website copy.
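As a rough illustration of the hypothesis test behind an A/B experiment, the sketch below applies a two-proportion z-test to conversion counts. The function name and the visitor and conversion numbers are made up for illustration, and the z-test is only one of several tests that could be used here.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants convert at the same rate.

    Assumes the pooled conversion rate is strictly between 0 and 1.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical experiment: 2,000 visitors per variant
z, p = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference between the variants is unlikely to be due to chance alone.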

**Q**: **How will you calculate the Sensitivity of machine learning models?**

**A**: In machine learning, Sensitivity is used for validating the accuracy of a classifier, such as Logistic Regression, Random Forest, and SVM. It is also known as REC (recall) or TPR (true positive rate).

Sensitivity can be defined as the ratio of correctly predicted positive events to the total number of actual positive events, i.e.:

Sensitivity = True Positives / Positives in Actual Dependent Variable = True Positives / (True Positives + False Negatives)

Here, true positives are the events that the machine learning model predicted as positive and that are actually positive. The best sensitivity is 1.0 and the worst sensitivity is 0.0.
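The formula can be checked with a few lines of plain Python. This is a minimal sketch; the function name and the label lists are illustrative only.

```python
def sensitivity(y_true, y_pred, positive=1):
    """Sensitivity (recall) = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp + fn == 0:
        raise ValueError("y_true contains no positive examples")
    return tp / (tp + fn)

# Five actual positives, four of them predicted correctly
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1]
print(sensitivity(y_true, y_pred))  # 0.8
```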

**Q**: **Could you draw a comparison between overfitting and underfitting?**

**A**: In order to make reliable predictions on unseen data in machine learning and statistics, a (machine learning) model has to be fitted to a set of training data. Overfitting and underfitting are two of the most common modeling errors that occur while doing so.

Following are the various differences between overfitting and underfitting:

- Definition – A statistical model suffering from overfitting describes some random error or noise in place of the underlying relationship. When underfitting occurs, a statistical model or machine learning algorithm fails to capture the underlying trend of the data.
- Occurrence – When a statistical model or machine learning algorithm is excessively complex, it can result in overfitting. An example of a complex model is one having too many parameters compared to the total number of observations. Underfitting occurs when a model is too simple for the data, for example when trying to fit a linear model to non-linear data.
- Poor Predictive Performance – Although both overfitting and underfitting yield poor predictive performance, the way in which each one of them does so is different. While the overfitted model overreacts to minor fluctuations in the training data, the underfit model under-reacts to even bigger fluctuations.

**Q**: **Between Python and R, which one would you pick for text analytics and why?**

**A**: For text analytics, Python has the upper hand over R for these reasons:

- The Pandas library in Python offers easy-to-use data structures as well as high-performance data analysis tools
- Python generally performs faster for most types of text analytics
- R is a better fit for machine learning than for plain text analysis

**Q**: **Please explain the role of data cleaning in data analysis.**

**A**: Data cleaning can be a daunting task because, as the number of data sources increases, the time required for cleaning the data grows steeply.

This is due to the vast volume of data generated by additional sources. Also, data cleaning alone can take up to 80% of the total time required for carrying out a data analysis task.

Nevertheless, there are several reasons for using data cleaning in data analysis. Two of the most important ones are:

- Cleaning data from different sources helps in transforming the data into a format that is easy to work with
- Data cleaning increases the accuracy of a machine learning model

**Q**: **What do you mean by cluster sampling and systematic sampling?**

**A**: When studying the target population spread throughout a wide area becomes difficult and applying simple random sampling becomes ineffective, the technique of cluster sampling is used. A cluster sample is a probability sample, in which each of the sampling units is a collection or cluster of elements.

In systematic sampling, elements are chosen at regular intervals from an ordered sampling frame. The list is traversed in a circular fashion, so that once the end of the list is reached, selection continues from the start, or top, again.
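The two techniques can be sketched in plain Python as follows. The function names, the region names, and the fixed seed are assumptions made for illustration; the systematic sampler uses a random start and wraps around the frame as described above.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every element inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

def systematic_sample(frame, n, rng):
    """Pick every k-th element of an ordered frame, wrapping around at the end."""
    if not 0 < n <= len(frame):
        raise ValueError("n must be between 1 and len(frame)")
    k = len(frame) // n                   # sampling interval
    start = rng.randrange(len(frame))     # random starting position
    return [frame[(start + i * k) % len(frame)] for i in range(n)]

rng = random.Random(0)
# Hypothetical population grouped by region
clusters = {"north": [1, 2, 3], "south": [4, 5], "east": [6, 7, 8, 9], "west": [10]}
print(cluster_sample(clusters, 2, rng))
print(systematic_sample(list(range(100)), 10, rng))
```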

**Q**: **Please explain Eigenvectors and Eigenvalues.**

**A**: Eigenvectors help in understanding linear transformations. They are calculated typically for a correlation or covariance matrix in data analysis.

In other words, eigenvectors are those directions along which some particular linear transformation acts by compressing, flipping, or stretching.

Eigenvalues can be understood either as the strengths of the transformation in the direction of the eigenvectors or as the factors by which the compression or stretching happens.
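To make this concrete, the sketch below approximates the dominant eigenvalue and eigenvector of a small symmetric matrix using power iteration. Power iteration is just one simple method chosen for this example, and the 2x2 matrix stands in for a covariance matrix. It assumes the starting vector is not orthogonal to the dominant eigenvector.

```python
import math

def power_iteration(matrix, steps=100):
    """Approximate the dominant eigenvalue and unit eigenvector of a square matrix."""
    n = len(matrix)
    v = [1.0] + [0.0] * (n - 1)   # arbitrary non-zero starting direction
    for _ in range(steps):
        # Apply the linear transformation, then rescale to unit length
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v gives the stretch factor along v
    av = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * av[i] for i in range(n))
    return eigenvalue, v

cov = [[2.0, 1.0], [1.0, 2.0]]    # illustrative covariance-like matrix
value, vector = power_iteration(cov)
print(value, vector)
```

For this matrix the direction (1, 1) is stretched by a factor of 3, which is what the iteration converges to.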

**Q**: **Can you compare the validation set with the test set?**

**A**: A validation set is a part of the training data that is set aside for hyperparameter selection as well as for avoiding overfitting of the machine learning model being developed. In contrast, a test set is meant for evaluating or testing the performance of a trained machine learning model.
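A minimal sketch of such a split is shown below. The function name, the 60/20/20 proportions, and the fixed seed are assumptions chosen for illustration, not a prescribed standard.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the rows and cut them into train, validation, and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test_set = rows[:n_test]                    # touched only for the final evaluation
    val_set = rows[n_test:n_test + n_val]       # used to tune hyperparameters
    train_set = rows[n_test + n_val:]           # used to fit the model
    return train_set, val_set, test_set

train_set, val_set, test_set = train_val_test_split(range(100))
print(len(train_set), len(val_set), len(test_set))  # 60 20 20
```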

**Q**: **What do you understand by linear regression and logistic regression?**

**A**: Linear regression is a statistical technique in which the score of some variable Y is predicted on the basis of the score of a second variable X, referred to as the predictor variable. The Y variable is known as the criterion variable.

Also known as the logit model, logistic regression is a statistical technique for predicting a binary outcome from a linear combination of predictor variables.
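The difference can be sketched in a few lines: a least-squares fit for the linear case, and a sigmoid applied to a linear combination for the logistic case. The function names, data, and weights below are illustrative assumptions; the logistic weights are given by hand here rather than learned.

```python
import math

def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def logistic_probability(weights, bias, features):
    """P(y = 1 | features): sigmoid of a linear combination of the predictors."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Data generated from y = 1 + 2x, so the fit should recover those values
intercept, slope = fit_simple_linear([1, 2, 3, 4], [3, 5, 7, 9])
print(intercept, slope)

# Hypothetical, hand-picked weights; the output is a probability between 0 and 1
print(logistic_probability([0.8, -0.4], bias=-1.0, features=[2.0, 1.0]))
```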

**Q**: **Please explain Recommender Systems along with an application.**

**A**: Recommender systems are a subclass of information filtering systems, meant for predicting the preference or rating that a user would give to some product.

An application of a recommender system is the product recommendations section in Amazon. This section contains items based on the user’s search history and past orders.
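As a toy illustration of recommending from past orders, the sketch below scores items by how much their buyers overlap with the target user. The names and order data are invented, and this simple overlap heuristic is an assumption for the example, not how any particular production system works.

```python
from collections import Counter

def recommend(orders, user, top_n=2):
    """Suggest items bought by users whose orders overlap with `user`'s orders."""
    own = orders[user]
    scores = Counter()
    for other, items in orders.items():
        if other == user:
            continue
        overlap = len(own & items)     # shared purchases act as a similarity weight
        for item in items - own:       # only score items the user does not have yet
            scores[item] += overlap
    return [item for item, score in scores.most_common(top_n) if score > 0]

orders = {
    "ana": {"laptop", "mouse"},
    "ben": {"laptop", "mouse", "keyboard"},
    "cy": {"mouse", "monitor"},
    "dee": {"novel"},
}
print(recommend(orders, "ana"))  # ['keyboard', 'monitor']
```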

**Q**: **What are outlier values and how do you treat them?**

**A**: Outlier values, or simply outliers, are data points in statistics that don’t belong to a certain population. An outlier value is an abnormal observation that is very much different from other values belonging to the set.

Identification of outlier values can be done by using univariate or some other graphical analysis method. A few outlier values can be assessed individually, but assessing a large set of outlier values requires substituting them with either the 99th or the 1st percentile values.

There are two popular ways of treating outlier values:

- To change the value so that it can be brought within a range
- To simply remove the value

**Note**: Not all extreme values are outlier values.
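Both treatments can be sketched in plain Python, using the 1st and 99th percentiles as the range. The function names and the linear-interpolation percentile are assumptions for this example; other percentile definitions give slightly different cut-offs.

```python
def percentile(values, pct):
    """Percentile with linear interpolation, pct in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * pct / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def cap_outliers(values, low_pct=1, high_pct=99):
    """Bring extreme values back within the [low, high] percentile range."""
    low, high = percentile(values, low_pct), percentile(values, high_pct)
    return [min(max(v, low), high) for v in values]

def drop_outliers(values, low_pct=1, high_pct=99):
    """Remove values that fall outside the [low, high] percentile range."""
    low, high = percentile(values, low_pct), percentile(values, high_pct)
    return [v for v in values if low <= v <= high]

data = list(range(1, 100)) + [10_000]   # one obviously abnormal observation
capped = cap_outliers(data)
print(max(data), max(capped))
print(len(data), len(drop_outliers(data)))
```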

**Q**: **Please enumerate the various steps involved in an analytics project.**

**A**: Following are the steps involved in an analytics project:

- Understanding the business problem
- Exploring the data and becoming familiar with it
- Preparing the data for modeling by means of detecting outlier values, transforming variables, treating missing values, et cetera
- Running the model and analyzing the result for making appropriate changes or modifications to the model (an iterative step that repeats until the best possible outcome is gained)
- Validating the model using a new dataset
- Implementing the model and tracking the result for analyzing its performance

**Q**: **Could you explain how to define the number of clusters in a clustering algorithm?**

**A**: The primary objective of clustering is to group together similar entities in such a way that while entities within a group are similar to each other, the groups remain different from one another.

Generally, the Within Sum of Squares (WSS) is used for explaining the homogeneity within a cluster. For defining the number of clusters in a clustering algorithm, WSS is plotted against a range of candidate cluster counts. The resultant graph is known as the Elbow Curve.

The Elbow Curve contains a point after which the WSS no longer decreases appreciably. This is known as the bending point and represents K in K–Means.

Although the aforementioned is the widely used approach, another important approach is hierarchical clustering. In this approach, dendrograms are created first and then distinct groups are identified from there.

**Q**: **What do you understand by Deep Learning?**

**A**: Deep Learning is a paradigm of machine learning that displays a great degree of analogy with the functioning of the human brain. It is a neural network method based on networks with many layers, of which convolutional neural networks (CNN) are one well-known example.

Deep learning has a wide array of uses, ranging from social network filtering to medical image analysis and speech recognition. Although Deep Learning has been present for a long time, it’s only recently that it has gained worldwide acclaim. This is mainly due to:

- An increase in the amount of data generation via various sources
- The growth in hardware resources required for running Deep Learning models

Caffe, Chainer, Keras, Microsoft Cognitive Toolkit, PyTorch, and TensorFlow are some of the most popular Deep Learning frameworks as of today.

**Q**: **Please explain Gradient Descent.**

**A**: The degree of change in the output of a function relative to changes made to its inputs is known as a gradient. In a neural network, it measures the change in error with respect to the change in each weight. A gradient can also be understood as the slope of a function.

Gradient Descent refers to descending to the bottom of a valley, as opposed to climbing up a hill. It is a minimization algorithm meant for minimizing a given cost function by repeatedly stepping in the direction opposite to the gradient.
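A minimal sketch of the idea, assuming a one-variable cost function whose gradient is known in closed form. The function name, learning rate, and step count are arbitrary choices for the example.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to move downhill."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# Cost f(x) = (x - 3)^2 has gradient 2 * (x - 3) and its minimum at x = 3
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(minimum)
```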

**Q**: **How does Backpropagation work? Also, state its various variants.**

**A**: Backpropagation refers to a training algorithm used for multilayer neural networks. Following the backpropagation algorithm, the error is propagated from the output end of the network back to all weights inside the network. Doing so allows for efficient computation of the gradient.

Backpropagation works in the following way:

- Forward propagation of training data
- Output and target are used for computing the derivative of the error
- Backpropagate for computing the derivative of the error with respect to the output activation
- Reusing the previously calculated derivatives to compute the derivatives for the earlier layers
- Updating the weights
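The steps above can be sketched on a deliberately tiny network: one input, one sigmoid hidden unit, and one linear output with a squared-error loss. The architecture, weights, input, and learning rate are all assumptions picked to keep the chain rule visible.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    h = sigmoid(w1 * x)      # hidden activation
    return h, w2 * h         # (hidden activation, network output)

def loss(w1, w2, x, target):
    _, out = forward(w1, w2, x)
    return 0.5 * (out - target) ** 2

def backprop(w1, w2, x, target):
    h, out = forward(w1, w2, x)        # forward propagation
    d_out = out - target               # derivative of the error w.r.t. the output
    d_w2 = d_out * h                   # gradient for the output weight
    d_h = d_out * w2                   # reuse d_out to reach the hidden layer
    d_w1 = d_h * h * (1 - h) * x       # sigmoid'(z) = h * (1 - h)
    return d_w1, d_w2

w1, w2, x, target = 0.5, -0.3, 1.5, 1.0
d_w1, d_w2 = backprop(w1, w2, x, target)
new_w1, new_w2 = w1 - 0.1 * d_w1, w2 - 0.1 * d_w2   # weight update
print(loss(w1, w2, x, target), loss(new_w1, new_w2, x, target))
```

Comparing `d_w1` and `d_w2` against finite-difference estimates of the loss is a common way to sanity-check a hand-written backward pass.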

Following are the variants of gradient descent used together with Backpropagation:

- Batch Gradient Descent – The gradient is calculated for the complete dataset and one update is performed on each iteration
- Mini-batch Gradient Descent – Mini-batch samples are used for calculating the gradient and updating parameters (a variant of the Stochastic Gradient Descent approach)
- Stochastic Gradient Descent – Only a single training example is used for calculating the gradient and updating parameters
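The three variants differ only in how many examples feed each update, which the sketch below exposes as a `batch_size` parameter. The model (a single slope fitted to noise-free data), the function name, and the hyperparameters are illustrative assumptions.

```python
import random

def fit_slope(xs, ys, batch_size, learning_rate=0.01, epochs=200, seed=0):
    """Fit y = w * x by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of mean((w * x - y)^2) over the current batch
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]                      # generated from y = 2x
print(fit_slope(xs, ys, batch_size=len(xs)))   # batch: one update per pass
print(fit_slope(xs, ys, batch_size=2))         # mini-batch
print(fit_slope(xs, ys, batch_size=1))         # stochastic: one example per update
```

On this clean toy data all three settings approach the same slope; on noisy or large datasets they trade off update cost against the noisiness of each step.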

**Q**: **What do you know about Autoencoders?**

**A**: Autoencoders are simple learning networks used for transforming inputs into outputs with the minimum possible error. This means that the resulting outputs are very close to the inputs.

A couple of layers are added between the input and the output with the size of each layer smaller than the size pertaining to the input layer. An autoencoder receives unlabeled input that is encoded for reconstructing the output.

**Q**: **Please explain the concept of a Boltzmann Machine.**

**A**: A Boltzmann Machine features a simple learning algorithm that enables it to discover interesting features representing complex regularities present in the training data. It is basically used for optimizing the weights and quantities for some given problem.

The simple learning algorithm involved in a Boltzmann Machine is very slow in networks that have many layers of feature detectors.

That completes the list of the 20 essential data science interview questions. Hope you will find them useful to prepare well for your upcoming data science job interview(s). Wish you good luck!

Check out these best data science tutorials to step up your data science game today.