Statistics is used across many disciplines. Businesses, for example, rely on statistics to better understand their customers. Often shortened to "stats", statistics is a kind of mathematical analysis that uses quantified models, representations, and synopses for a given set of data obtained from experiments and real-life studies. It is also the study of methods for gathering, reviewing, and analyzing data and drawing conclusions from it. Statistics offers a number of theories and formulae for this purpose.
One such concept is correlation. Correlation measures the strength of association between two variables as well as the direction. There are mainly three types of correlation that are measured. One significant type is Pearson's correlation coefficient. This type of correlation is used to measure the relationship between two continuous variables.
In this blog, we will discuss Pearson's correlation coefficient in detail. We will start with a definition of statistics and of correlation. Later in the blog, we will look at the origin of Pearson's correlation coefficient and how it is calculated. We will also briefly discuss the other two types of correlation measured in statistics.
Statistics is not just a branch of mathematics; it is a science in its own right. It is the science of collecting, analyzing, presenting, and interpreting empirical data. Statistics is a highly interdisciplinary field. Research in statistics is applied in almost all scientific fields, and research questions in those fields in turn motivate the development of new statistical methods and theory.
Statistics is used in various disciplines such as psychology, business, physical and social sciences, humanities, government, and manufacturing. Statistics finds its use in business to make better-informed decisions. The two types of statistics are Descriptive statistics and Inferential statistics.
Descriptive statistics summarize the data in a sample using measures such as the mean or standard deviation. Inferential statistics are used when the data are viewed as a sample drawn from a larger population, about which we want to draw conclusions.
For more detailed knowledge of statistics you can read our blog on What is Statistics? Types, Variance and Bayesian Statistics.
“Statistics is the best area to be in because statistics are everywhere! They are all around us in our daily lives. It is important to be able to think critically about all of the data and information that surround us. Statistics and statistical thinking help us to make sense out of all of it.”
- Jeri Mulrow, Vice President, ASA
Correlation is a statistic that measures the relationship between two variables; it is widely used in the finance and investment industries, among others. It shows the strength of the relationship between the two variables as well as its direction, and is represented numerically by the correlation coefficient. The value of the correlation coefficient lies between -1.0 and +1.0.
A negative value of the correlation coefficient means that when one variable increases, the other tends to decrease. A positive value means that the two variables tend to change in the same direction.
When the value of the correlation coefficient is exactly +1.0, it is said to be a perfect positive correlation. This means that when one variable changes, whether up or down, the second variable changes in lockstep, in the same direction.
A perfect negative correlation (exactly -1.0) means that the two variables, for example two assets, move in lockstep in opposite directions, while a zero correlation implies no linear relationship at all. We can determine the strength of the relationship between two variables from the absolute value of the correlation coefficient.
In Statistics, the Pearson's Correlation Coefficient is also referred to as Pearson's r, the Pearson product-moment correlation coefficient (PPMCC), or bivariate correlation. It is a statistic that measures the linear correlation between two variables. Like all correlations, it also has a numerical value that lies between -1.0 and +1.0.
Whenever we discuss correlation in statistics, it is generally Pearson's correlation coefficient. However, it cannot capture nonlinear relationships between two variables and cannot differentiate between dependent and independent variables.
Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. The form of the definition involves a "product moment", that is, the mean (the first moment about the origin) of the product of the mean-adjusted random variables; hence the modifier product-moment in the name.
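As a minimal sketch, the definition above can be written in a few lines of Python. The function name `pearson_r` and the sample data are illustrative assumptions, not part of the original post. The example also illustrates the earlier caveat about nonlinear relationships: when y depends on x only through a rule such as y = x^2 over symmetric x values, r comes out as zero, and swapping the roles of the two variables does not change the result.

```python
import math

def pearson_r(xs, ys):
    """Covariance of xs and ys divided by the product of their standard deviations.

    The 1/n factors in the covariance and the standard deviations cancel,
    so plain sums of mean-adjusted products are used here.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up data: y is fully determined by x, but the relationship is not linear.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_nonlinear = pearson_r(xs, ys)   # no linear association is detected
r_swapped = pearson_r(ys, xs)     # same value: r treats both variables alike

print(r_nonlinear, r_swapped)
```

Note that this sketch divides by zero if either variable is constant, since a constant variable has zero standard deviation.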
Pearson's Correlation Coefficient is named after Karl Pearson. He formulated the correlation coefficient from a related idea by Francis Galton in the 1880s.
Using the formula proposed by Karl Pearson, we can quantify the linear relationship between two given variables. For example, a child's height tends to increase with age (although other factors also affect growth). We can measure the relationship between these two variables by obtaining the value of Pearson's correlation coefficient, r. There are certain requirements for Pearson's correlation coefficient:
Scale of measurement should be interval or ratio
Variables should be approximately normally distributed
The association should be linear
There should be no outliers in the data
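Of these requirements, the effect of outliers is the easiest to demonstrate. The sketch below is illustrative only: the helper `pearson_r` and the extra data point (age 60, income 1000) are made-up assumptions, added to the age and income values used in the worked example later in this post.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from mean-adjusted sums (hypothetical helper for illustration)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

ages = [20, 30, 40, 50]
incomes = [1500, 3000, 5000, 7500]
r_clean = pearson_r(ages, incomes)

# Add one invented outlier: an older person with a very low income.
r_with_outlier = pearson_r(ages + [60], incomes + [1000])

print(f"without outlier: {r_clean:.2f}")
print(f"with outlier:    {r_with_outlier:.2f}")
```

A single point that breaks the pattern pulls r well below the value obtained from the four original points, which is why the data should be checked for outliers first.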
The formula is:
r = [N * Σxy - (Σx)(Σy)] / √[{N * Σx^2 - (Σx)^2} * {N * Σy^2 - (Σy)^2}]
Where,
N = the number of pairs of scores
Σxy = the sum of the products of paired scores
Σx = the sum of x scores
Σy = the sum of y scores
Σx^2 = the sum of squared x scores
Σy^2 = the sum of squared y scores
To calculate r, follow these steps:
Step 1: Make a Pearson correlation coefficient table. Make a data chart using the two variables and name them X and Y. Add three additional columns for the values of XY, X^2, and Y^2. Refer to this table.

| Person | Age (X) | Income (Y) | XY | X^2 | Y^2 |
|---|---|---|---|---|---|
| 1 | | | | | |
| 2 | | | | | |
| 3 | | | | | |
| 4 | | | | | |
Step 2: Use basic multiplications to complete the table.

| Person | Age (X) | Income (Y) | XY | X^2 | Y^2 |
|---|---|---|---|---|---|
| 1 | 20 | 1500 | 30000 | 400 | 2250000 |
| 2 | 30 | 3000 | 90000 | 900 | 9000000 |
| 3 | 40 | 5000 | 200000 | 1600 | 25000000 |
| 4 | 50 | 7500 | 375000 | 2500 | 56250000 |
Step 3: Add up the values in each column and write the totals in a new row at the bottom.

| Person | Age (X) | Income (Y) | XY | X^2 | Y^2 |
|---|---|---|---|---|---|
| 1 | 20 | 1500 | 30000 | 400 | 2250000 |
| 2 | 30 | 3000 | 90000 | 900 | 9000000 |
| 3 | 40 | 5000 | 200000 | 1600 | 25000000 |
| 4 | 50 | 7500 | 375000 | 2500 | 56250000 |
| Total | 140 | 17000 | 695000 | 5400 | 92500000 |
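Steps 1 to 3 can also be reproduced programmatically. The following is a small sketch in Python; the variable names (`ages`, `incomes`, `rows`, `totals`) are our own choices for illustration.

```python
# Age (X) and Income (Y) for the four people in the table.
ages = [20, 30, 40, 50]
incomes = [1500, 3000, 5000, 7500]

# Steps 1 and 2: one row per person with X, Y, XY, X^2, and Y^2.
rows = [(x, y, x * y, x ** 2, y ** 2) for x, y in zip(ages, incomes)]

# Step 3: add up each column.
totals = tuple(sum(column) for column in zip(*rows))

print("Person | X | Y | XY | X^2 | Y^2")
for person, row in enumerate(rows, start=1):
    print(person, *row, sep=" | ")
print("Total", *totals, sep=" | ")
```

The `totals` tuple holds Σx, Σy, Σxy, Σx^2, and Σy^2 in that order, matching the Total row of the table.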
Step 4: Use these values in the formula to obtain the value of r.
r = [4 * 695000 - 140 * 17000] / √[{4 * 5400 - (140)^2} * {4 * 92500000 - (17000)^2}]
= [2780000 - 2380000] / √[{21600 - 19600} * {370000000 - 289000000}]
= 400000 / √[2000 * 81000000]
= 400000 / √162000000000
= 400000 / 402492.24
≈ 0.99
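The arithmetic in Step 4 can be double-checked with a short Python sketch. The function name `pearson_r_from_sums` and its parameter names are assumptions made for this example; the inputs are the column totals from Step 3.

```python
import math

def pearson_r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson's r from the number of pairs and the five column totals."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# N = 4 pairs; totals taken from the Total row of the table.
r = pearson_r_from_sums(
    n=4, sum_x=140, sum_y=17000, sum_xy=695000, sum_x2=5400, sum_y2=92500000
)
print(round(r, 2))  # 0.99
```

This form needs only running totals, so it can be computed in a single pass over the data.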
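The positive value of Pearson's correlation coefficient, close to +1, indicates a strong positive linear relationship between the two variables. In this example, higher ages go together with higher incomes. Keep in mind that the coefficient describes an association in the data; on its own it does not show that changing one variable causes the other to change.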
As we have learned from the definition of the Pearson product-moment correlation coefficient, it measures the strength and direction of the linear relationship between two variables.
The closer the value of the Pearson correlation coefficient is to -1 or +1, the stronger the association between the two variables.
Below, we have shown the guidelines to interpret the Pearson correlation coefficient:
A notable point is that the interpretation of the strength of association depends on the sample size and on what you measure.
We have been using the two terms 'strength' and 'direction' throughout the blog. Both are central to interpreting a correlation. Let us discuss them in detail.
Strength: Strength refers to how closely the two variables are related, that is, how consistently one variable changes as the other changes. Values near +1 or -1 indicate a strong relationship; these values are reached when the data points fall on or near a straight line. The further the data points lie from the line, the weaker the linear relationship. When the points are so scattered that no straight line can usefully be drawn through them, the linear relationship is at its weakest.
Direction: The direction of the line indicates whether the linear relationship between the variables is positive or negative. If the line slopes upward, the variables have a positive relationship: an increase in the value of one variable is accompanied by an increase in the value of the other. A negative relationship corresponds to a downward slope: an increase in the value of one variable is accompanied by a decrease in the value of the other.
As mentioned above, there are mainly three types of correlation:
Pearson Product Moment Correlation
Spearman's Rank Correlation
Kendall Rank Correlation
Spearman's Rank Correlation: Spearman's rank correlation was named after the statistician Charles Edward Spearman. It is computed on the ranks of the observations rather than on their raw values, so it measures how well the relationship can be described by a monotonic trend rather than a strictly linear one. It is often used in place of Pearson's correlation when the requirements of the latter are not met, and it can be applied to ordinal (ranked) data as well as to quantitative data. Its significance is assessed by testing a null hypothesis of no association, which is then either rejected or not rejected.
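As a rough sketch, Spearman's rank correlation can be obtained by replacing each value with its rank and then applying Pearson's formula to the ranks. The helper names `average_ranks`, `pearson_r`, and `spearman_rho` are hypothetical, and tied values are given the average of their ranks, which is a common convention rather than something stated in this post.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # average of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def pearson_r(xs, ys):
    """Pearson's r from mean-adjusted sums."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

ages = [20, 30, 40, 50]
incomes = [1500, 3000, 5000, 7500]
rho = spearman_rho(ages, incomes)
print(rho)
```

For the age and income data, income rises every time age rises, so the ranks agree perfectly even though the raw relationship is not exactly a straight line.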
Kendall Rank Correlation: The Kendall rank correlation was named after the British statistician Maurice Kendall. It measures the dependence between two random variables based on the ordering of their values, and it is sometimes used as an alternative to Spearman's rank correlation. It works by comparing pairs of observations: a pair is concordant when the observation with the larger value of one variable also has the larger value of the other, and discordant when the value of one variable is larger while the value of the other is smaller. The more the concordant pairs outnumber the discordant pairs, the stronger the positive correlation, and vice versa.
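The idea of concordant and discordant pairs can be sketched in Python as follows. This computes the simplest variant, usually called tau-a, which makes no adjustment for ties; the function name `kendall_tau_a` and the second data set are assumptions made for illustration.

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:      # both variables ordered the same way
            concordant += 1
        elif product < 0:    # the variables are ordered in opposite ways
            discordant += 1
        # product == 0 means a tie; tau-a counts it as neither
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Age and income data from the worked example: every pair is concordant.
tau_income = kendall_tau_a([20, 30, 40, 50], [1500, 3000, 5000, 7500])

# Made-up data with one discordant pair (the second and third observations).
tau_mixed = kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4])

print(tau_income, tau_mixed)
```

In the second data set, five of the six pairs are concordant and one is discordant, so the result is (5 - 1) / 6.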
In this blog, we learned that Pearson's correlation coefficient, denoted by r, measures the linear relationship between two variables. Karl Pearson gave the formula for the PPMCC. We also learned that statistics is a science rather than just a branch of mathematics. It finds its use in various disciplines such as psychology, the humanities, and the sciences.
We also learned that correlation is a statistic that measures the relationship between two variables. One notable point is that the value of a correlation coefficient lies between -1 and +1. The magnitude tells us the strength of the relationship, while the sign indicates its direction.