In this notebook, we'll use dataframes containing (imaginary) students' roster information and grades to demonstrate various types of joins/merges. We'll also use the same dataframes to extract information using pandas
syntax you should be familiar with.
View the interactive notebook here.
import pandas as pd
import numpy as np
These demos use pandas
, but joining in SQL is the same as merging in pandas
.
roster = pd.DataFrame()
roster['sid'] = [123, 456, 194, 994, 430, 824, 881, 519]
roster['name'] = ['amy', 'billy', 'billy', 'sam', 'carly', 'johnson', 'ernie', 'shaq']
roster['email'] = ['amy@berkeley.edu', 'billy@berkeley.edu', 'notbilly@stanford.edu', 'sam2@berkeley.edu', 'icarly@berkeley.edu', 'magic@berkeley.edu', 'ern@berkeley.edu', 'big@berkeley.edu']
grades = pd.DataFrame()
grades['sid'] = [123, 456, 994, 519, 808, 456, 994, 430, 223, 881, 405, 519, 123, 456, 994, 430, 881, 519]
grades['assignment'] = ['hw1', 'hw1', 'hw1', 'hw1', 'hw1', 'hw2', 'hw2', 'hw2', 'hw2', 'hw2', 'hw2', 'hw2', 'hw3', 'hw3', 'hw3', 'hw3', 'hw3', 'hw3']
grades['scores'] = [94, 93, 91, 88, 75, 53, 100, 100, 89, 95, 96, 100, 82, 90, 91, 100, 25, 49]
roster
grades
You should note that there are some students with no grades (e.g. johnson
has no grades, since sid = 824
doesn't appear in grades
), and there are some submissions in grades
that belong to students who are not in the roster (e.g. sid = 808
doesn't appear in roster
).
Let's first perform an inner join:
roster.merge(grades, left_on = 'sid', right_on = 'sid', how = 'inner')
Here, we see that we have all grades of every student in the course, if they have any assignment submissions. You'll note that johnson
doesn't appear in this table, and neither does the billy
whose email is notbilly@stanford.edu
. We also see that the student with sid = 808
is not in this table.
What if we want at least one row for every single student in the roster, whether or not they submitted assignments? This comes up in practice from the instructor's side when computing grades at the end of the semester – even if you didn't submit something (or anything!), you still need to be assigned a grade. To do this, I can try a left join:
roster.merge(grades, how = 'left')
Now, we see that rows exist for johnson
and the other billy
, with NaN
for their assignment and scores. (Also, note that we can drop the left_on
and right_on
parameters if the column that we're merging on between the two tables has the same name. In practice, this won't be the case, so you should get used to specifying these values.)
What if I want to see information for all students that submitted things? I can try a right join:
roster.merge(grades, how = 'right')
Now, note that johnson
and other billy
don't appear in the output, but there are rows for sid = 808
and the other two students who submitted work but aren't in the class (with NaN
as their name and email).
It should be noted that LEFTJOIN(A, B)
is the same as RIGHTJOIN(B, A)
. In other words, to generate the above dataframe, I could've used roster.merge(grades, how = 'right')
OR grades.merge(grades, how = 'left')
. The resulting dataframes would've been the same, just with the columns in a different order.
Lastly, what if I want a combination of the above two dataframes? That is, rows for every student in the course, their assignment grades if they exist, and grades for students not in the class if they happened to submit. This can be achieved with an outer join:
df = roster.merge(grades, how = 'outer')
df
Now, let's try some queries. Moving forward we'll use the below merged dataframe.
df = roster.merge(grades, how = 'left')
df
To do this, I can use value_counts
:
df['assignment'].value_counts()
This requires a use of boolean filtering, which likely have already seen:
df[df['assignment'] == 'hw2']
# grades.loc[grades['assignment'] == 'hw2', :]
To do this, I can group by assignment
:
df.groupby('assignment').mean()
I can create a custom filter
function in order to help me do this.
def filter_fn(r):
# takes in all rows for a given student
# and returns true iff their mean homework score is > 80
return r['scores'].mean() > 80
df.groupby(['sid']).filter(filter_fn)
I can create a mapping between scores and letter grades, and then apply
this to the scores column.
def to_letter(x):
if x >= 90: return 'A'
elif x >= 80: return 'B'
elif x >= 70: return 'C'
elif x >= 60: return 'D'
else: return 'F'
df['letter'] = df['scores'].apply(to_letter)
df
This might be something like the gradebook your teachers in high school or elementary school used, with the rows containing students' names, and the columns containing the various assignments in their course. To do this, we can use pivot
(which you should've also seen in Data 8):
df.pivot(index = 'sid', columns = 'assignment', values = ['scores'])
In practice, this looks a little weird: let's get rid of the NaN
column and replace all NaN
values with a 0.
NaN
column, we can use .iloc[:, 1:]
to take all columns starting with hw1
.NaN
values, we can use .fillna(0)
.df.pivot(index = 'sid', columns = 'assignment', values = ['scores']).iloc[:, 1:].fillna(0)