import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px
df = pd.DataFrame({
'x': [2, 1, 9, 7, 8, 2, 3, 3, 5, 3],
'y': [17, 18, 14, 4, 1, 15, 17, 3, 10, 6],
'z': [38, 38, 46, 22, 18, 34, 40, 12, 30, 18]
})
df
plt.scatter(df['x'], df['z'])
As we can see, the correlation between $x$ and $z$ is weak.
np.corrcoef(df['x'], df['z'])[1][0]
fig = px.scatter_3d(df, x = 'x', y = 'y', z = 'z')
fig.show()
Here, we see that as $x$ and $y$ increase, $z$ increases. You can think of our points as lying on the plane $z = 2x + 2y$.
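One way to check the plane claim numerically is a least-squares fit of $z$ on $x$ and $y$. The sketch below rebuilds the same data so that it runs on its own. The names `A`, `coef`, and `residual` are ours, and `np.linalg.lstsq` is just one convenient way to do the fit.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': [2, 1, 9, 7, 8, 2, 3, 3, 5, 3],
    'y': [17, 18, 14, 4, 1, 15, 17, 3, 10, 6],
    'z': [38, 38, 46, 22, 18, 34, 40, 12, 30, 18]
})

# Design matrix with columns x, y, and a constant term for the intercept
A = np.column_stack([df['x'].to_numpy(), df['y'].to_numpy(), np.ones(len(df))])
coef, *_ = np.linalg.lstsq(A, df['z'].to_numpy(), rcond=None)

# Largest gap between the actual z values and the plane z = 2x + 2y
residual = np.abs(df['z'] - (2 * df['x'] + 2 * df['y'])).max()

print(coef)      # fitted coefficients for x, y, and the intercept
print(residual)  # largest deviation from the plane
```

If the points really lie on the plane, the fitted coefficients should come out close to 2, 2, and 0, and the largest deviation should be zero.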
plt.scatter(2*df['x'] + 2*df['y'], df['z'], color = 'red')
plt.plot([10, 50], [10, 50]); # plotting y = x
When we scatter $z$ vs $2x + 2y$, we see that our points lie directly on the line $y = x$.
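To put numbers on this, we can compare three correlations with $z$: $x$ alone, $y$ alone, and the combination $2x + 2y$. This is a small sketch that rebuilds the data so it is self-contained; the variable names `combined`, `r_xz`, `r_yz`, and `r_combined` are ours.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'x': [2, 1, 9, 7, 8, 2, 3, 3, 5, 3],
    'y': [17, 18, 14, 4, 1, 15, 17, 3, 10, 6],
    'z': [38, 38, 46, 22, 18, 34, 40, 12, 30, 18]
})

combined = 2 * df['x'] + 2 * df['y']

# Correlation of z with each variable alone, and with the combination
r_xz = np.corrcoef(df['x'], df['z'])[0, 1]
r_yz = np.corrcoef(df['y'], df['z'])[0, 1]
r_combined = np.corrcoef(combined, df['z'])[0, 1]

print(r_xz, r_yz, r_combined)
```

A weak correlation between $x$ and $z$ does not mean $x$ is irrelevant to $z$: once $y$ is taken into account, the relationship is exact.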
Note: The code used to generate the plots for this problem is random, and so you'll get different results each time you run it.
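If you would rather get the same results on every run, one option is to seed the random number generator. The sketch below uses `np.random.default_rng` with a seed of 42. Both the seed and the choice of `default_rng` are ours, not part of the original code; calling `np.random.seed(42)` before the existing `np.random.normal` calls should have a similar effect.

```python
import numpy as np

# Seeded generator: the same seed always produces the same draws
rng = np.random.default_rng(42)
x = rng.normal(0, 1, 50)
y = x + rng.normal(0, 0.5, 50)

# A second generator with the same seed reproduces the same x values
rng_again = np.random.default_rng(42)
x_again = rng_again.normal(0, 1, 50)

print(np.array_equal(x, x_again))
```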
First, we re-create the leftmost plot and see what happens when we remove the outlier (the green dot).
x = np.random.normal(0, 1, 50)
y = x + np.random.normal(0, 0.5, 50)
outlier = (-3, 5)
plt.scatter(x, y);
plt.scatter(outlier[0], outlier[1], color = 'g');
# correlation with outlier
np.corrcoef(np.append(x, outlier[0]), np.append(y, outlier[1]))[1][0]
# correlation without outlier
np.corrcoef(x, y)[1][0]
# ignore this magic – creates the line of best fit
def plot_best_fit(x, y):
    plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)), color = 'r')
# line of best fit, with outlier
plot_best_fit(np.append(x, outlier[0]), np.append(y, outlier[1]));
plt.scatter(x, y);
plt.scatter(outlier[0], outlier[1], color = 'g');
# line of best fit, without outlier
plot_best_fit(x, y);
plt.scatter(x, y);
As we can see, when we drop the outlier, our "line of best fit" follows the remaining points much more closely, and the correlation between our `x` and `y` arrays becomes significantly stronger.
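To see the effect in numbers rather than in the plots, we can compare the fitted slope and the correlation with and without the outlier. This is a sketch under our own assumptions: it uses a seeded generator (seed 0, chosen arbitrarily) so the output is repeatable, and the names `x_all`, `y_all`, `slope_with`, and so on are ours.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only so the sketch is repeatable
x = rng.normal(0, 1, 50)
y = x + rng.normal(0, 0.5, 50)
outlier = (-3, 5)

# Data with the outlier appended
x_all = np.append(x, outlier[0])
y_all = np.append(y, outlier[1])

# np.polyfit with degree 1 returns (slope, intercept)
slope_with, intercept_with = np.polyfit(x_all, y_all, 1)
slope_without, intercept_without = np.polyfit(x, y, 1)

r_with = np.corrcoef(x_all, y_all)[0, 1]
r_without = np.corrcoef(x, y)[0, 1]

print("slope with / without outlier:", slope_with, slope_without)
print("r with / without outlier:", r_with, r_without)
```

Because the outlier sits up and to the left of a cloud that trends up and to the right, it pulls both the slope and the correlation down.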
Let's take a look at the second plot.
x = np.random.normal(0, 2, 50)
y = np.random.normal(0, 2, 50)
outlier = (8, 8)
plt.scatter(x, y);
plt.scatter(outlier[0], outlier[1], color = 'g');
# correlation with outlier
np.corrcoef(np.append(x, outlier[0]), np.append(y, outlier[1]))[1][0]
# correlation without outlier
np.corrcoef(x, y)[1][0]
# line of best fit, with outlier
plot_best_fit(np.append(x, outlier[0]), np.append(y, outlier[1]));
plt.scatter(x, y);
plt.scatter(outlier[0], outlier[1], color = 'g');
# line of best fit, without outlier
plot_best_fit(x, y);
plt.scatter(x, y);
In this case, the strength of our correlation drops when we remove the outlier. Here x and y are generated independently, so there is no real "up and to the right" trend in the data; the outlier alone creates the appearance of one. (Because the data is random, you may actually see a negative correlation in either case, depending on the run.)
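Since a single run can be misleading, one way to see the typical behavior is to repeat the experiment many times and summarize the results. The sketch below is our own addition: the seed, the number of trials (`n_trials`), and the array names are all arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed for repeatability
outlier = (8, 8)
n_trials = 1000

r_with = np.empty(n_trials)
r_without = np.empty(n_trials)

for i in range(n_trials):
    # x and y are drawn independently, so their true correlation is 0
    x = rng.normal(0, 2, 50)
    y = rng.normal(0, 2, 50)
    r_without[i] = np.corrcoef(x, y)[0, 1]
    r_with[i] = np.corrcoef(np.append(x, outlier[0]),
                            np.append(y, outlier[1]))[0, 1]

# Average correlation across trials, without and with the outlier
print(r_without.mean(), r_with.mean())

# Fraction of trials in which the correlation is negative
print((r_without < 0).mean(), (r_with < 0).mean())
```

Without the outlier, the correlations should scatter around zero, so roughly half of them are negative. With the outlier, the average shifts upward, though individual runs can still come out negative.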