Foundations of Statistics for Data Science and Analytics
An essential guide for entry-level Data Enthusiasts
Statistics provides tools and methods to find structure in data and to extract deeper insights. Both statistics and mathematics love facts and hate guesses. Knowing the fundamentals of these two important subjects will allow you to think critically and be creative when using data to solve business problems and make data-driven decisions.
So in simple terms:
“Statistics is the grammar of science.” — Karl Pearson
Let's get into this,
Sample and Population
Suppose you have a dataset for a company with 56k employees. If you draw 10k random rows from that data, those rows are a sample; when you consider all 56k rows, that is the population.
Statistics is broadly categorized into two types:
- Descriptive statistics (summarizes or describes the characteristics of a data set.)
- Inferential statistics (takes a sample from the full data set, called the population, and uses it to make inferences about that population.)
Descriptive statistics consists of two basic categories of measures:
- measures of central tendency (mean, median, mode)
- measures of variability or spread (standard deviation, variance)
Measures of central tendency describe the center of a data set, while measures of variability or spread describe the dispersion of the data within it.
Key tests in Inferential statistics
- Confidence interval
- Hypothesis testing
- Z-test
- T-test
- Chi-square test
- ANOVA
Difference between Descriptive statistics and Inferential statistics
In descriptive statistics, you take the data (or population) and analyze, visualize, and summarize it in the form of numbers and graphs.
In inferential statistics, on the other hand, we take a sample of the population and run statistical tests to draw inferences and conclusions about that population.
Random Variables
A random variable is a variable whose value is determined by the outcome of a random experiment; in other words, a function that assigns a value to each of an experiment's outcomes.
A random variable can be either discrete (having specific values) or continuous (any value in a continuous range).
Discrete: can take on a countable number of distinct values (whole numbers, not fractions or decimals). E.g., a bank account number in a random group.
Continuous: can represent any value within a specified range or interval, and so can take on an infinite number of possible values, including fractions and decimals. E.g., the height of people in a random group.
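To make the distinction concrete, here is a small sketch using Python's standard library. The die rolls and the height range are illustrative assumptions, not data from the article:

```python
import random

random.seed(42)

# Discrete random variable: a die roll can only be one of
# six countable values, all whole numbers.
die_rolls = [random.randint(1, 6) for _ in range(10)]
print(die_rolls)

# Continuous random variable: a simulated height in cm can be
# any real number within the interval, including decimals.
heights = [round(random.uniform(150.0, 200.0), 2) for _ in range(10)]
print(heights)
```

Counting outcomes (discrete) versus measuring on a scale (continuous) is the quickest way to tell the two apart in practice.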
Mean, Median, Mode Explained
Mean: the sum of all observations divided by the number of observations. The sample mean is denoted by x̄ (pronounced “x bar”), and the population mean by μ (pronounced “mu”).
Median: the value of the middlemost observation, obtained after arranging the data in ascending order. Why the median is used: if there are outliers in your data, the mean gets pulled toward them and misrepresents the center of the distribution, which can mislead your analysis.
Even a single outlier can create a significant difference between the mean and the median, so the median is used in such cases to measure central tendency.
Mode: the value that appears most often in the given data, i.e., the observation with the highest frequency.
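All three measures can be computed with Python's built-in `statistics` module. The salary figures below are made up for illustration; notice how a single outlier (300) drags the mean well above the median:

```python
import statistics

# Hypothetical salaries (in thousands) with one outlier: 300
salaries = [40, 42, 45, 47, 50, 52, 52, 300]

print(statistics.mean(salaries))    # 78.5 — pulled upward by the outlier
print(statistics.median(salaries))  # 48.5 — midpoint of the sorted data
print(statistics.mode(salaries))    # 52 — the most frequent value
```

This is why the median is usually reported for skewed quantities such as salaries or house prices.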
Measures of dispersion:
Range: stated simply, the difference between the largest (L) and smallest (S) values in a data set. It is the simplest measure of dispersion.
Quartiles: special percentiles that divide the data into quarters.
- The first quartile, Q1, is the same as the 25th percentile.
- The median is called both the second quartile, Q2, and the 50th percentile.
- The third quartile, Q3, is the same as the 75th percentile.
Interquartile Range (IQR): the IQR indicates how spread out the middle half (the middle 50%) of the dataset is, and can help determine outliers. It is the difference between Q3 and Q1.
Generally speaking, outliers are the data points that fall outside the lower and upper whiskers of a box plot, i.e., below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
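Here is a minimal sketch of the 1.5 × IQR rule using `statistics.quantiles`. The data set is invented, and note that `quantiles` defaults to the "exclusive" method, so other tools may compute slightly different quartiles:

```python
import statistics

data = [3, 5, 7, 8, 9, 11, 13, 15, 40]  # 40 is a suspected outlier

# quantiles with n=4 returns the three quartile cut points
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Whisker bounds under the common 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
print(q1, q2, q3, iqr, outliers)  # 6.0 9.0 14.0 8.0 [40]
```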
Standard deviation: measures the dispersion of a dataset relative to its mean. It is denoted by the symbol σ and is the square root of the variance.
Variance: a measurement of the spread between numbers in a data set, denoted by the symbol σ² (the square of the standard deviation).
Variance is used to see how far individual numbers deviate from the mean within a data set.
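In Python's `statistics` module, the population and sample versions are separate functions (divide by N versus N - 1). The numbers below are illustrative:

```python
import math
import statistics

data = [4, 8, 6, 5, 3, 7]  # mean is 5.5

# Population variance divides by N; sample variance divides by N - 1
pop_var = statistics.pvariance(data)   # 17.5 / 6 ≈ 2.917
samp_var = statistics.variance(data)   # 17.5 / 5 = 3.5

# Standard deviation is the square root of the variance
assert math.isclose(statistics.pstdev(data), math.sqrt(pop_var))
print(pop_var, samp_var)
```

Use the sample versions when your data is a sample drawn from a larger population, which is the usual case in analytics.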
Normal Distribution (also called Gaussian distribution)
Its values are symmetric about the mean, meaning data near the mean are more frequent in occurrence than data far from the mean. Normal distributions are symmetrical, but not all symmetrical distributions are normal.
Empirical rule in normal distribution:
The empirical rule, also referred to as the three-sigma rule or the 68-95-99.7 rule, is a statistical rule stating that for a normal distribution, almost all observed data will fall within three standard deviations of the mean.
In particular, the empirical rule predicts that
68% of observations fall within the first standard deviation (µ ± σ),
95% within the first two standard deviations (µ ± 2σ), and
99.7% within the first three standard deviations (µ ± 3σ).
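We can check the rule empirically by simulating draws from a standard normal distribution with `random.gauss`; the seed and sample size here are arbitrary choices:

```python
import random

random.seed(0)
# Simulate 100,000 draws from a normal distribution (mu=0, sigma=1)
draws = [random.gauss(0, 1) for _ in range(100_000)]

def within(k):
    """Fraction of draws within k standard deviations of the mean."""
    return sum(abs(x) <= k for x in draws) / len(draws)

# The fractions land roughly at 0.68, 0.95, and 0.997
print(round(within(1), 3), round(within(2), 3), round(within(3), 3))
```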
Central Limit Theorem:
The central limit theorem states that the sampling distribution of the mean approaches a normal distribution, as the sample size increases. This fact holds especially true for sample sizes over 30.
Here n is the sample size; as n increases, the sampling distribution of the mean starts looking like a normal distribution.
The central limit theorem tells us that no matter what the distribution of the population is, the shape of the sampling distribution of the mean will approach normality as the sample size n increases.
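A quick simulation illustrates this: we draw samples of size 30 from a uniform distribution (which is flat, not bell-shaped at all) and look at the distribution of their means. The seed and sizes are arbitrary assumptions for the sketch:

```python
import random
import statistics

random.seed(1)

def sample_mean(n):
    # One sample of n draws from a uniform (non-normal) distribution
    return statistics.mean(random.uniform(0, 1) for _ in range(n))

# The sampling distribution of the mean for n = 30
means = [sample_mean(30) for _ in range(5_000)]

# CLT: the means cluster around the population mean (0.5), with
# standard error sigma / sqrt(n) = (1 / sqrt(12)) / sqrt(30) ≈ 0.053
print(round(statistics.mean(means), 3))
print(round(statistics.stdev(means), 3))
```

Plotting a histogram of `means` would show the familiar bell shape, even though every individual draw came from a flat distribution.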
Once you understand the basics, there is, of course, much more to learn about statistics.
This article is part of #2articles1week count 1.
Thanks for reading✌, catch you in the next one.
I am a Data Analyst from India with a Bachelor's in Mechanical Engineering. I have been practicing analytics with Python, SQL, BI tools, and advanced Excel.