Key Questions for Data Analysts: Taken from an R1 interview conducted by LatentView Analytics

Ashish Kumar Singh
5 min read · Jun 15, 2021


What are the mean and median?

The mean of a data set is found by adding all numbers in the data set and then dividing by the number of values in the set.

The median is the middle value when a data set is ordered from least to greatest. If the data set has an even number of values, the median is the average of the two middle values.
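As a quick illustration of the two definitions, here is a minimal Python sketch. The numbers in `values` are made up for this example and are not from the interview; `statistics.median` is part of the Python standard library.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [3, 7, 8, 12, 100]

# Mean: add all numbers, then divide by how many there are.
mean_value = sum(values) / len(values)

# Median: the middle value once the data are sorted.
median_value = statistics.median(values)

print(mean_value)    # 26.0
print(median_value)  # 8
```

Notice that the single large value (100) pulls the mean well above most of the data, while the median stays at 8. That is why the median is often preferred for skewed data.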

Difference between Normal and Gaussian distribution

There is no difference: the normal distribution is also known as the Gaussian distribution. It is a probability distribution that is symmetric about the mean, so data near the mean occur more frequently than data far from the mean. On a graph, the normal distribution appears as a bell curve.

The normal distribution has two parameters: the mean and the standard deviation. The standard normal distribution is the special case with a mean of 0 and a standard deviation of 1. For a normal distribution, about 68% of the observations are within +/- one standard deviation of the mean, about 95% are within +/- two standard deviations, and about 99.7% are within +/- three standard deviations.
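The 68/95/99.7 figures can be checked directly from the normal cumulative distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is just an illustrative choice.

```python
from statistics import NormalDist

# Mean 0 and standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The printed shares are roughly 0.68, 0.95 and 0.997, matching the rule quoted above.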

Real-life data rarely, if ever, follow a perfect normal distribution. The skewness and kurtosis coefficients measure how a given distribution differs from a normal distribution. Skewness measures the asymmetry of a distribution; the normal distribution is symmetric and has a skewness of zero. Kurtosis measures how heavy the tails are; the normal distribution has a kurtosis of 3 (an excess kurtosis of zero).
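To make skewness concrete, here is a small sketch that computes it as the average cubed z-score. This is the simple population-style formula (dividing by n), which is an assumption of this example; statistical packages often apply a small-sample correction. Both data sets are made up.

```python
def skewness(values):
    """Average cubed z-score (population-style formula, dividing by n)."""
    n = len(values)
    mean = sum(values) / n
    sd = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return sum(((v - mean) / sd) ** 3 for v in values) / n

symmetric = [1, 2, 3, 4, 5]       # hypothetical, perfectly symmetric data
right_skewed = [1, 2, 3, 4, 50]   # hypothetical data with one large value

print(skewness(symmetric))     # very close to 0
print(skewness(right_skewed))  # clearly positive
```

A positive value indicates a longer right tail, a negative value a longer left tail.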

What is the Central Limit Theorem?

The Central Limit Theorem (CLT) states that, given a sufficiently large sample size from a population with finite variance, the distribution of the sample mean is approximately normal, regardless of the shape of the population distribution. The sample means are centered on the population mean, and their variance is approximately equal to the variance of the population divided by the sample size.

  • As a rule of thumb, sample sizes of 30 or more are often considered sufficient for the CLT to hold.
  • A key aspect of the CLT is that the average of the sample means equals the population mean, and the standard deviation of the sample means (the standard error) equals the population standard deviation divided by the square root of the sample size.
  • A sufficiently large sample can therefore estimate the characteristics of a population accurately.
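A small simulation can make the theorem tangible. The sketch below is only an illustration under assumed settings: the population is exponential with rate 1 (so its mean and standard deviation are both 1), and the seed, sample size, and number of samples are arbitrary choices.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

SAMPLE_SIZE = 30     # the rule-of-thumb size
NUM_SAMPLES = 5000   # how many samples to draw (arbitrary choice)

# Population: exponential with rate 1, so mean = 1 and standard deviation = 1.
# It is strongly right-skewed, i.e. clearly not normal.
sample_means = []
for _ in range(NUM_SAMPLES):
    sample = [rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
    sample_means.append(sum(sample) / SAMPLE_SIZE)

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.stdev(sample_means)
expected_se = 1.0 / math.sqrt(SAMPLE_SIZE)  # population sd / sqrt(n)

print(round(mean_of_means, 3), "vs population mean 1.0")
print(round(sd_of_means, 3), "vs expected standard error", round(expected_se, 3))
```

Even though the population is skewed, the sample means cluster around 1 with a spread close to 1/√30, and a histogram of `sample_means` would look roughly bell-shaped.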

What is the Null Hypothesis?

Hypothesis testing is carried out to test the validity of a claim or assumption made about a larger population. The default claim being tested, typically that there is no effect or no difference, is known as the null hypothesis. The null hypothesis is denoted by H0.

The alternative hypothesis is the claim that is supported when the data give enough evidence to reject the null hypothesis. The evidence consists of the sample data and the statistical computations that accompany it. The alternative hypothesis is denoted by H1 or Ha.

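The p-value is the probability of obtaining results at least as extreme as the ones observed, assuming the null hypothesis is true. A small p-value (commonly below 0.05) is taken as evidence against the null hypothesis.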
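As one possible illustration, the sketch below computes a two-sided p-value for a one-sample z-test. The z-test itself, the function name `two_sided_p_value`, and all the numbers are assumptions made for this example; the interview question did not specify a particular test.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p_value(sample_mean, null_mean, sigma, n):
    """Two-sided p-value for a one-sample z-test with known population sd."""
    standard_error = sigma / sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    # Probability of a z-score at least this far from zero, in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical numbers: H0 says the population mean is 100.
p_value = two_sided_p_value(sample_mean=104, null_mean=100, sigma=15, n=50)
print(round(p_value, 4))
```

With these made-up numbers the p-value comes out slightly above 0.05, so at the 5% significance level we would not reject H0.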

Two types of error can occur in hypothesis testing:

Type I Error: a Type I error occurs when the researcher rejects a null hypothesis that is actually true (a false positive).

Type II Error: a Type II error occurs when the researcher fails to reject a null hypothesis H0 that is actually false (a false negative).
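The link between the significance level and the Type I error can be seen in a simulation. In the sketch below, the null hypothesis is true by construction (the data really come from a normal distribution with mean 0), so every rejection is a Type I error. The z-test, the seed, and the constants are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

rng = random.Random(7)  # fixed seed so the run is repeatable
ALPHA = 0.05            # significance level
N = 30                  # observations per experiment
TRIALS = 2000           # number of simulated experiments

def rejects_null(sample, null_mean=0.0, sigma=1.0):
    """Run a two-sided z-test and report whether H0 is rejected."""
    sample_mean = sum(sample) / len(sample)
    z = (sample_mean - null_mean) / (sigma / sqrt(len(sample)))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < ALPHA

# H0 is true here: every sample really has mean 0 and sd 1.
false_rejections = sum(
    rejects_null([rng.gauss(0.0, 1.0) for _ in range(N)])
    for _ in range(TRIALS)
)
type_i_rate = false_rejections / TRIALS
print(type_i_rate)
```

The observed rejection rate lands close to `ALPHA`, which is what the significance level means: the long-run share of true null hypotheses that get rejected.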

What is a Confidence Interval?

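It is a range of values, computed from sample data, within which we are fairly confident the true population value lies. The confidence level (for example, 95%) states how confident we are.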

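Formula for Calculating the Confidence Interval

$$\bar{X} \pm Z \frac{s}{\sqrt{n}}$$

Where:

  • X̄ (X-bar) is the sample mean
  • Z is the Z-value for the chosen confidence level (for example, 1.645 for 90%, 1.96 for 95%, and 2.576 for 99%)
  • s is the standard deviation
  • n is the number of observations
  • The value after the ± is called the margin of error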
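Using the quantities listed above, the interval is the mean plus or minus Z times s divided by the square root of n. A minimal sketch follows; the function name `confidence_interval` and the sample numbers are hypothetical, and the Z-value is looked up with `NormalDist().inv_cdf` from the standard library instead of a printed table.

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(mean, s, n, level=0.95):
    """Z-based confidence interval: mean ± Z * s / sqrt(n)."""
    # For a 95% level we need the 97.5th percentile, which is about 1.96.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin_of_error = z * s / sqrt(n)
    return mean - margin_of_error, mean + margin_of_error

# Hypothetical sample: 40 observations, mean 175, standard deviation 20.
low, high = confidence_interval(mean=175, s=20, n=40)
print(round(low, 2), round(high, 2))
```

For small samples, a t-value is often used in place of Z; this sketch sticks to the Z-based formula described in the text.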

What are Covariance and Correlation, and how will you interpret them?

Covariance provides insight into how two variables are related to one another. It measures how two random variables in a data set change together.

A positive covariance means that the two variables at hand are positively related, and they move in the same direction.

A negative covariance means that the variables are inversely related, or that they move in opposite directions.

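For a sample, covariance is commonly computed as:

$$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

Where:

  • x is the independent variable
  • y is the dependent variable
  • n represents the number of data points in the sample
  • x-bar represents the mean of the independent variable x
  • y-bar represents the mean of the dependent variable y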
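The definitions above correspond to the usual sample covariance: the sum of the products of each pair's deviations from their means, divided by n − 1. The sketch below assumes that n − 1 denominator (some texts divide by n instead) and uses made-up data.

```python
def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = sample_covariance(x, y)
print(cov_xy)  # 1.5
```

The result is positive, so in this toy data x and y tend to move in the same direction. Reversing the order of `y` would flip the sign.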

Correlation not only shows the kind of relation (in terms of direction) but also how strong the relationship is. The main result of a correlation is called the correlation coefficient.

The correlation coefficient is a dimensionless metric. Its value always lies between -1 and +1.

Thus, correlation values are standardized, whereas covariance values are not. Covariance cannot be used to compare how strong or weak relationships are, because its magnitude depends on the units of the variables and has no direct significance.

Covariance can vary between -∞ and +∞, while correlation ranges between -1 and +1.

When choosing between covariance and correlation, the latter is usually the first choice because it is unaffected by changes in the location and scale (units) of the variables, and it can be used to compare the relationships of two different pairs of variables. Since it is limited to a range of -1 to +1, it is useful for drawing comparisons between variables across domains.
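The scale point can be demonstrated with a short sketch. It assumes the Pearson correlation coefficient (covariance divided by the product of the two standard deviations), which is the most common definition; the data and function names are made up for illustration.

```python
from math import sqrt

def sample_covariance(x, y):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def pearson_correlation(x, y):
    """Covariance rescaled by both standard deviations, so it lies in [-1, +1]."""
    sd_x = sqrt(sample_covariance(x, x))
    sd_y = sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (sd_x * sd_y)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_rescaled = [100 * xi for xi in x]  # same data in different units

print(sample_covariance(x, y), sample_covariance(x_rescaled, y))
print(round(pearson_correlation(x, y), 4), round(pearson_correlation(x_rescaled, y), 4))
```

Changing the units of x multiplies the covariance by 100, while the correlation stays at about 0.77 in both cases.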
