Data Science - Statistics Correlation vs. Causality

Correlation Does Not Imply Causality

Correlation measures the numerical relationship between two variables.

A high correlation coefficient (close to 1 or -1) does not mean we can conclude with certainty that there is an actual cause-and-effect relationship between the two variables.
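To make "correlation coefficient" concrete, here is a minimal sketch that computes the Pearson coefficient by hand (this is also the default method of the pandas `corr()` call used later in this lesson). The helper name `pearson_r` and the small `x`/`y` lists are hypothetical and chosen only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalised covariance)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalised variances)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y is exactly 2 * x, so the points lie on a straight line
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

r = pearson_r(x, y)
print(r)  # a perfect straight-line relationship gives a coefficient of 1
```

The coefficient only summarizes how well a straight line describes the points. It says nothing about why the two variables move together.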

A classic example

  • During the summer, ice cream sales at a beach increase
  • At the same time, drowning accidents also increase

Does this mean that the increase in ice cream sales is a direct cause of the increase in drowning accidents?

The Beach Example in Python

Here is a fictional data set for you to try:

Example

import pandas as pd
import matplotlib.pyplot as plt

# Fictional data: both series rise together
Drowning_Accident = [20,40,60,80,100,120,140,160,180,200]
Ice_Cream_Sale = [20,40,60,80,100,120,140,160,180,200]

# Build the DataFrame from the two lists
Drowning = pd.DataFrame({"Drowning_Accident": Drowning_Accident,
                         "Ice_Cream_Sale": Ice_Cream_Sale})

# Scatter plot of ice cream sales against drowning accidents
Drowning.plot(x="Ice_Cream_Sale", y="Drowning_Accident", kind="scatter")
plt.show()

# Correlation matrix of the two columns
correlation_beach = Drowning.corr()
print(correlation_beach)

Correlation vs Causality - The Beach Example

In other words: can we use ice cream sales to predict drowning accidents?

The answer is: probably not.

It is likely that these two variables are correlated without either one causing the other, for example because both rise during the summer.

What causes drowning then?

  • Unskilled swimmers
  • Waves
  • Cramp
  • Seizure disorders
  • Lack of supervision
  • Alcohol (mis)use
  • etc.
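The beach example can be simulated under one loudly stated assumption: that a shared driver, here a hypothetical daily `temperature`, pushes both ice cream sales and drowning accidents up, while neither affects the other. All numbers below are invented for illustration. The sketch compares the plain correlation with the partial correlation, which removes the linear influence of temperature.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)  # fixed seed so the run is repeatable
n = 500

# Assumption: temperature is the only thing linking the two series
temperature = [random.uniform(15, 35) for _ in range(n)]
ice_cream = [10 * t + random.gauss(0, 20) for t in temperature]
drowning = [0.5 * t + random.gauss(0, 2) for t in temperature]

r_xy = pearson_r(ice_cream, drowning)
r_xz = pearson_r(ice_cream, temperature)
r_yz = pearson_r(drowning, temperature)

# Partial correlation of ice cream and drowning, controlling for temperature
partial = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

print("correlation:", round(r_xy, 2))
print("partial correlation given temperature:", round(partial, 2))
```

In this simulated setting the plain correlation is high, while the partial correlation is close to zero: once temperature is accounted for, ice cream sales carry almost no further information about drowning accidents.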

Let us reverse the argument

Does a low correlation coefficient (close to zero) mean that a change in x does not affect y?

Back to the question

  • Can we conclude that Average_Pulse does not affect Calorie_Burnage because of a low correlation coefficient?

The answer is no.
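One way a low coefficient can hide a real dependence is a relationship that is not a straight line, because the Pearson coefficient only measures linear association. The sketch below uses hypothetical data (not the Average_Pulse and Calorie_Burnage columns) in which y is completely determined by x, yet the coefficient is zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y = x squared over a range that is symmetric around zero
x = list(range(-5, 6))
y = [v ** 2 for v in x]

r = pearson_r(x, y)
print(r)  # zero, even though every y value is fully determined by x
```

The positive and negative halves of the curve cancel each other out in the coefficient, so plotting the data remains worthwhile even when the coefficient is close to zero.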

There is an important difference between correlation and causality:

  • Correlation is a number that measures how closely two variables move together.
  • Causality is the conclusion that x causes y.

Tip

Always think critically about causality when making predictions!
