by Suraj Rampure (suraj.rampure@berkeley.edu) and Mansi Shah (emansishah@berkeley.edu)
This notebook was designed to supplement Discussion 7 (Lab 8) in Fall 2019. It is not meant to be a substitute for lecture or other readings; rather, it's meant to provide an example of PCA in action, and explain steps along the way.
Click here to access this notebook as a .ipynb file.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline
df = pd.read_csv('https://gist.githubusercontent.com/surajrampure/76bef6c3561b6d99fb87106cb91ef2b1/raw/c19e1300a1f8cc36ba93713261fcdcea00f4e8ed/nba_19_stats.csv')
We will keep only players who started in at least half of their team's games (41 of 82), since this reduces noise in the visualizations later on.
df = df.query('GS >= 41')
df.head()
Furthermore, since PCA requires that our data comes in the form of a matrix, we will only look at numerical columns. We'll take everything starting from MP (minutes played), since this is where the "per game" statistics begin.
df.iloc[:, 7:].head()
We currently have 23 numerical columns, i.e. each row of our data is 23-dimensional. With this many columns, it's hard to identify any patterns. Our goal will be to use PCA to reduce the dimensionality of our data.
Let's first center our data by subtracting each column's mean from that column.
d = np.matrix(df.iloc[:, 7:].fillna(0)) # Replace NaNs with 0, then convert to a matrix
col_means = d.mean(axis = 0)
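As a quick check (a minimal sketch using the d and col_means defined above), subtracting the column means should leave every column with a mean of approximately zero:
# Centering check: after subtracting the column means, each column's mean should be (approximately) zero.
centered = d - col_means
print(np.allclose(centered.mean(axis=0), 0))  # should print True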
Recall, the singular value decomposition decomposes our data matrix $X$ into the product $U \Sigma V^T$, where
$U$ and $V$ are the left and right singular matrices, respectively
$\Sigma$ is a diagonal matrix of singular values, which are sorted from greatest to least on the diagonal
We will extract meaning from each of these quantities shortly. For now, let's just perform the decomposition.
u, s, vt = np.linalg.svd(d - col_means, full_matrices = False)
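As a quick sanity check (a small sketch using the quantities computed above), we can inspect the shapes of the three factors and confirm that they multiply back to our centered data matrix:
# U, Sigma, and V^T should reconstruct the centered data matrix.
print(u.shape, s.shape, vt.shape)
print(np.allclose(u @ np.diag(s) @ vt, d - col_means))  # should print True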
Before we actually compute our principal components – i.e. the projection of our data onto our new directions – let's determine the number of principal components to use.
Choosing the number of principal components involves a tradeoff: if you choose too many principal components, you may defeat the purpose of doing dimensionality reduction in the first place. On the other hand, you don't want to choose so few principal components that you no longer capture much of the variance in the data.
In this section, we will analyze the variance captured by each of the principal components.
The singular values of our data matrix $X$ tell us how much variance each principal component captures. The variance of principal component $i$ is $\frac{\sigma_i^2}{N}$, i.e. $\frac{(\text{the $i$th singular value})^2}{\text{the number of data points}}$.
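For example, here is a small sketch (using the SVD computed above) verifying that the variance of the first principal component is indeed $\frac{\sigma_1^2}{N}$:
# The first principal component is sigma_1 times the first column of U.
pc1 = np.asarray(u)[:, 0] * s[0]
N = d.shape[0]
print(np.var(pc1), s[0]**2 / N)  # these two numbers should match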
However, we don't care as much about the raw variance of a principal component – we care more about the proportion of the total variance in our dataset that a principal component captures. The proportion of variance captured by the $i$th principal component is $\frac{\sigma_i^2}{T}$, where $T = \sum_{i = 1}^r \sigma_i^2$ (i.e. the sum of the squares of all diagonal elements in $\Sigma$).
The proportion of variance captured by the first $k$ principal components together, then, is
$$\frac{\sigma_1^2 + \sigma_2^2 + ... + \sigma_k^2}{T}$$
Typically, we will look to choose $k$ such that the first $k$ principal components together capture a significant portion of our data's variance (usually > 80%).
# This function determines the proportion of variance captured by principal component i
def variance_captured(s, i):
    return s[i - 1]**2 / np.sum(s**2)
# This function determines the proportion of variance captured by principal components 1, 2, ..., k
def variance_captured_first_k(s, k):
    return sum([variance_captured(s, ki) for ki in range(1, k + 1)])
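Equivalently, here is a small sketch using numpy directly: np.cumsum gives the cumulative proportion of variance captured by the first 1, 2, ..., k principal components all at once.
# Cumulative proportion of variance captured by the first k components, for every k.
cumulative_var = np.cumsum(s**2) / np.sum(s**2)
print(cumulative_var[:5])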
To help us choose $k$, let's create a scree plot. Scree plots help us visualize how much of the total variance each principal component captures.
x = np.arange(1, 10)
var = [variance_captured(s, xi) for xi in x]
plt.plot(x, var);
plt.xlabel('Number of Principal Components');
plt.ylabel('Proportion of Total Variance Captured');
This tells us that PC1 captures ~70% of the variation in our data, PC2 captures ~15%, and so on.
for i in range(1, 8):
    print('{} of the total variance is captured by the first {} principal component(s)'.format(variance_captured_first_k(s, i), i))
The elbow method tells us that we should use either 2 or 3 principal components here, since there are diminishing returns for using any more. For ease of visualization, we will use 2.
As mentioned above, we have that $X = U \Sigma V^T$. We can right-multiply this equation by $V$, and since $V^TV = I$, we have that
$$\boxed{U\Sigma = XV}$$
$U \Sigma$ and $XV$ both compute our principal components, i.e., they both project our original data $X$ onto our new-found directions. The first column of each of these matrices is our first principal component, and more generally, the $i$th column of these matrices is our $i$th principal component.
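Here's a quick sketch (using the matrices computed earlier) confirming that $U\Sigma$ and $XV$ give the same result on our centered data:
# U Sigma and X V should agree (X here is the centered data matrix).
X_centered = d - col_means
print(np.allclose(u @ np.diag(s), X_centered @ vt.T))  # should print True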
Let's zoom in on the calculation of $XV$.
First, we can think of $X$ as a set of $d$ column vectors, each of which corresponds to one of the columns in our dataset. The rows in $X$, then, each correspond to a single datapoint (in our case, a single player).
$$X = \begin{bmatrix} | & | & ... & | \\ X_{:, 1} & X_{:, 2} & ... & X_{:, d} \\ | & | & ... & | \end{bmatrix}$$
$V$, then, contains the directions of our new principal components. The $i$th column of $V$ contains the direction vector corresponding to the $i$th principal component.
$$V = \begin{bmatrix} | & | & ... & | \\ v_{:, 1} & v_{:, 2} & ... & v_{:, d} \\ | & | & ... & | \end{bmatrix}$$
Now, consider the product $XV$. By the nature of matrix multiplication, we can consider each of the columns in $V$ separately. Let's start by multiplying by just the first column of $V$:
$$Xv_{:, 1} = X_{:, 1} v_{1, 1} + X_{:, 2} v_{2, 1} + ... + X_{:, d} v_{d, 1}$$
The $j$th element of $v_{:, i}$ tells us the weight that principal component $i$ puts on column $j$.
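To make this concrete, here is a small sketch (again using the quantities from above) showing that the first principal component is exactly this weighted sum of the centered columns, with weights given by the first row of vt:
# PC1 as a weighted sum of the centered columns.
X_centered = np.asarray(d - col_means)
v1 = np.asarray(vt)[0, :]                        # direction of PC1
pc1_weighted_sum = X_centered @ v1               # X_{:,1} v_{1,1} + ... + X_{:,d} v_{d,1}
pc1_from_svd = np.asarray(u)[:, 0] * s[0]        # sigma_1 times the first column of U
print(np.allclose(pc1_weighted_sum, pc1_from_svd))  # should print True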
Looking at $V^T$, we have
$$V^T = \begin{bmatrix} - & v_1^T & - \\ - & v_2^T & - \\ & \vdots & \\ - & v_d^T & - \end{bmatrix}$$
As a toy example, let's suppose our dataset has three columns – height, weight, and income (in that order). Suppose we compute the SVD of our data matrix, and get that the first row of $V^T$ is $v_1^T = \begin{bmatrix} -0.3 & 0.8 & 0.52 \end{bmatrix}$. This tells us that our first principal component is
$$PC1 = -0.3 \cdot \text{height} + 0.8 \cdot \text{weight} + 0.52 \cdot \text{income}$$
Great – now let's move back to our dataset. The weights dataframe that we create in the next cell will list each original column, along with the weights assigned to each column by the first two principal component directions. We extract the first two directions by taking the first two rows of vt.
weights = pd.DataFrame({
'labels': df.iloc[:, 7:].columns,
'PC1 weights': np.array(vt[0, :])[0],
'PC2 weights': np.array(vt[1, :])[0],
}).set_index('labels')
weights.sort_values('PC1 weights')
To be explicit, our first PC is computed by the following transformation:
$$PC1 = -0.626930 \cdot \text{PTS} - 0.442062 \cdot \text{FGA} + ... - 0.000199 \cdot \text{eFG\%}$$
We can see that in PC1, only the first 10 or so columns have significant weight, and unsurprisingly, PTS (i.e. average points per game) is weighed the heaviest. This suggests that PTS contributes significantly to PC1. Columns such as FT%, FG%, 3P%, 2P%, and eFG% contribute very little, and as such, they are assigned very little weight in both PC1 and PC2. (Note: when talking about "weight", we only care about absolute weight.)
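Since only the absolute weight matters, here's a small sketch ranking the columns by how strongly they contribute to PC1 (using the weights dataframe from above):
# Columns with the largest absolute PC1 weight contribute the most to PC1.
weights['PC1 weights'].abs().sort_values(ascending=False).head(10)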
There are columns that are assigned sizeable weight in PC1 but not in PC2, such as PTS ($-0.626930$ vs. $-0.072263$), and vice versa (such as TRB, which has weight $-0.116492$ in PC1 compared to $0.584020$ in PC2). This suggests that PTS is significantly more important in defining PC1 than it is in defining PC2, and that TRB is more important in defining PC2.
As we expect, our two principal component directions are orthogonal:
sum(weights.iloc[:, 1] * weights.iloc[:, 0])
This value is effectively 0; the difference is due to floating point computational error.
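More generally, all of our principal component directions are mutually orthogonal (and unit length); here's a quick sketch verifying this for every pair at once, using vt from above:
# The rows of V^T are orthonormal, so V^T (V^T)^T = V^T V = I.
print(np.allclose(vt @ vt.T, np.eye(vt.shape[0])))  # should print True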
The key takeaway here is that the rows of $V^T$, or equivalently, the columns of $V$, contain the direction vectors that define our principal components.
Now that we've gotten our data down to two dimensions, we can visualize it. Below, let's plot our data projected onto our new axes: PC1 and PC2.
pc = (d - col_means) @ vt.T[:, :2] # project the centered data; or, equivalently, u[:, :2] @ np.diag(s[:2])
plt.scatter(np.array(pc[:, 0]), np.array(pc[:, 1]))
plt.title('PC1 vs PC2');
plt.xlabel('PC1');
plt.ylabel('PC2');
Let's step back for a second. Our original data had 23 dimensions. Through PCA, we've generated a summary scatter plot of our data, in 2 dimensions, that captures ~86% of the variation in our original data. Pretty cool!
For fun, let's color each of our data points based on whether or not the corresponding player is a guard.
sns.scatterplot(x=np.asarray(pc[:, 0]).ravel(), y=np.asarray(pc[:, 1]).ravel(), hue=df['Pos'].apply(lambda x: 'G' in str(x)))
plt.legend(['Guard', 'Not Guard'])
plt.title('PC1 vs PC2');
plt.xlabel('PC1');
plt.ylabel('PC2');
This demonstrates the power of PCA – in addition to efficiently reducing the dimension of our data, our PCs implicitly clustered, or separated, our data for us. We see that the points that correspond to guards are generally above the points that correspond to non-guards.
This has nothing to do with dimensionality reduction; however, we will explore the idea of clustering and separation towards the end of the course.
Feel free to post on Piazza, or email us directly with questions.