Discussion 3 Supplemental Notebook

Data 100, Spring 2020

Suraj Rampure (suraj.rampure@berkeley.edu)

In this notebook, we'll use dataframes containing (imaginary) students' roster information and grades to demonstrate various types of joins/merges. We'll also use the same dataframes to extract information using pandas syntax you should be familiar with.

View the interactive notebook here.

In [1]:
import pandas as pd
import numpy as np

Demoing Joins

These demos use pandas, but joining in SQL is the same as merging in pandas.

In [2]:
roster = pd.DataFrame()
roster['sid'] = [123, 456, 194, 994, 430, 824, 881, 519]
roster['name'] = ['amy', 'billy', 'billy', 'sam', 'carly', 'johnson', 'ernie', 'shaq']
roster['email'] = ['amy@berkeley.edu', 'billy@berkeley.edu', 'notbilly@stanford.edu', 'sam2@berkeley.edu', 'icarly@berkeley.edu', 'magic@berkeley.edu', 'ern@berkeley.edu', 'big@berkeley.edu']

grades = pd.DataFrame()
grades['sid'] = [123, 456, 994, 519, 808, 456, 994, 430, 223, 881, 405, 519, 123, 456, 994, 430, 881, 519]
grades['assignment'] = ['hw1', 'hw1', 'hw1', 'hw1', 'hw1', 'hw2', 'hw2', 'hw2', 'hw2', 'hw2', 'hw2', 'hw2', 'hw3', 'hw3', 'hw3', 'hw3', 'hw3', 'hw3']
grades['scores'] = [94, 93, 91, 88, 75, 53, 100, 100, 89, 95, 96, 100, 82, 90, 91, 100, 25, 49]
In [3]:
roster
Out[3]:
sid name email
0 123 amy amy@berkeley.edu
1 456 billy billy@berkeley.edu
2 194 billy notbilly@stanford.edu
3 994 sam sam2@berkeley.edu
4 430 carly icarly@berkeley.edu
5 824 johnson magic@berkeley.edu
6 881 ernie ern@berkeley.edu
7 519 shaq big@berkeley.edu
In [4]:
grades
Out[4]:
sid assignment scores
0 123 hw1 94
1 456 hw1 93
2 994 hw1 91
3 519 hw1 88
4 808 hw1 75
5 456 hw2 53
6 994 hw2 100
7 430 hw2 100
8 223 hw2 89
9 881 hw2 95
10 405 hw2 96
11 519 hw2 100
12 123 hw3 82
13 456 hw3 90
14 994 hw3 91
15 430 hw3 100
16 881 hw3 25
17 519 hw3 49

You should note that there are some students with no grades (e.g. johnson has no grades, since sid = 824 doesn't appear in grades), and there are some submissions in grades that belong to students who are not in the roster (e.g. sid = 808 doesn't appear in roster).

Let's first perform an inner join:

In [5]:
roster.merge(grades, left_on = 'sid', right_on = 'sid', how = 'inner')
Out[5]:
sid name email assignment scores
0 123 amy amy@berkeley.edu hw1 94
1 123 amy amy@berkeley.edu hw3 82
2 456 billy billy@berkeley.edu hw1 93
3 456 billy billy@berkeley.edu hw2 53
4 456 billy billy@berkeley.edu hw3 90
5 994 sam sam2@berkeley.edu hw1 91
6 994 sam sam2@berkeley.edu hw2 100
7 994 sam sam2@berkeley.edu hw3 91
8 430 carly icarly@berkeley.edu hw2 100
9 430 carly icarly@berkeley.edu hw3 100
10 881 ernie ern@berkeley.edu hw2 95
11 881 ernie ern@berkeley.edu hw3 25
12 519 shaq big@berkeley.edu hw1 88
13 519 shaq big@berkeley.edu hw2 100
14 519 shaq big@berkeley.edu hw3 49

Here, we see that we have all grades of every student in the course, if they have any assignment submissions. You'll note that johnson doesn't appear in this table, and neither does the billy whose email is notbilly@stanford.edu. We also see that the student with sid = 808 is not in this table.

What if we want at least one row for every single student in the roster, whether or not they submitted assignments? This comes up in practice from the instructor's side when computing grades at the end of the semester – even if you didn't submit something (or anything!), you still need to be assigned a grade. To do this, I can try a left join:

In [6]:
roster.merge(grades, how = 'left')
Out[6]:
sid name email assignment scores
0 123 amy amy@berkeley.edu hw1 94.0
1 123 amy amy@berkeley.edu hw3 82.0
2 456 billy billy@berkeley.edu hw1 93.0
3 456 billy billy@berkeley.edu hw2 53.0
4 456 billy billy@berkeley.edu hw3 90.0
5 194 billy notbilly@stanford.edu NaN NaN
6 994 sam sam2@berkeley.edu hw1 91.0
7 994 sam sam2@berkeley.edu hw2 100.0
8 994 sam sam2@berkeley.edu hw3 91.0
9 430 carly icarly@berkeley.edu hw2 100.0
10 430 carly icarly@berkeley.edu hw3 100.0
11 824 johnson magic@berkeley.edu NaN NaN
12 881 ernie ern@berkeley.edu hw2 95.0
13 881 ernie ern@berkeley.edu hw3 25.0
14 519 shaq big@berkeley.edu hw1 88.0
15 519 shaq big@berkeley.edu hw2 100.0
16 519 shaq big@berkeley.edu hw3 49.0

Now, we see that rows exist for johnson and the other billy, with NaN for their assignment and scores. (Also, note that we can drop the left_on and right_on parameters if the column that we're merging on between the two tables has the same name. In practice, this won't be the case, so you should get used to specifying these values.)

What if I want to see information for all students that submitted things? I can try a right join:

In [7]:
roster.merge(grades, how = 'right')
Out[7]:
sid name email assignment scores
0 123 amy amy@berkeley.edu hw1 94
1 123 amy amy@berkeley.edu hw3 82
2 456 billy billy@berkeley.edu hw1 93
3 456 billy billy@berkeley.edu hw2 53
4 456 billy billy@berkeley.edu hw3 90
5 994 sam sam2@berkeley.edu hw1 91
6 994 sam sam2@berkeley.edu hw2 100
7 994 sam sam2@berkeley.edu hw3 91
8 430 carly icarly@berkeley.edu hw2 100
9 430 carly icarly@berkeley.edu hw3 100
10 881 ernie ern@berkeley.edu hw2 95
11 881 ernie ern@berkeley.edu hw3 25
12 519 shaq big@berkeley.edu hw1 88
13 519 shaq big@berkeley.edu hw2 100
14 519 shaq big@berkeley.edu hw3 49
15 808 NaN NaN hw1 75
16 223 NaN NaN hw2 89
17 405 NaN NaN hw2 96

Now, note that johnson and other billy don't appear in the output, but there are rows for sid = 808 and the other two students who submitted work but aren't in the class (with NaN as their name and email).

It should be noted that LEFTJOIN(A, B) is the same as RIGHTJOIN(B, A). In other words, to generate the above dataframe, I could've used roster.merge(grades, how = 'right') OR grades.merge(grades, how = 'left'). The resulting dataframes would've been the same, just with the columns in a different order.

Lastly, what if I want a combination of the above two dataframes? That is, rows for every student in the course, their assignment grades if they exist, and grades for students not in the class if they happened to submit. This can be achieved with an outer join:

In [8]:
df = roster.merge(grades, how = 'outer')
df
Out[8]:
sid name email assignment scores
0 123 amy amy@berkeley.edu hw1 94.0
1 123 amy amy@berkeley.edu hw3 82.0
2 456 billy billy@berkeley.edu hw1 93.0
3 456 billy billy@berkeley.edu hw2 53.0
4 456 billy billy@berkeley.edu hw3 90.0
5 194 billy notbilly@stanford.edu NaN NaN
6 994 sam sam2@berkeley.edu hw1 91.0
7 994 sam sam2@berkeley.edu hw2 100.0
8 994 sam sam2@berkeley.edu hw3 91.0
9 430 carly icarly@berkeley.edu hw2 100.0
10 430 carly icarly@berkeley.edu hw3 100.0
11 824 johnson magic@berkeley.edu NaN NaN
12 881 ernie ern@berkeley.edu hw2 95.0
13 881 ernie ern@berkeley.edu hw3 25.0
14 519 shaq big@berkeley.edu hw1 88.0
15 519 shaq big@berkeley.edu hw2 100.0
16 519 shaq big@berkeley.edu hw3 49.0
17 808 NaN NaN hw1 75.0
18 223 NaN NaN hw2 89.0
19 405 NaN NaN hw2 96.0

Demoing Pandas Methods

Now, let's try some queries. Moving forward we'll use the below merged dataframe.

In [9]:
df = roster.merge(grades, how = 'left')
df
Out[9]:
sid name email assignment scores
0 123 amy amy@berkeley.edu hw1 94.0
1 123 amy amy@berkeley.edu hw3 82.0
2 456 billy billy@berkeley.edu hw1 93.0
3 456 billy billy@berkeley.edu hw2 53.0
4 456 billy billy@berkeley.edu hw3 90.0
5 194 billy notbilly@stanford.edu NaN NaN
6 994 sam sam2@berkeley.edu hw1 91.0
7 994 sam sam2@berkeley.edu hw2 100.0
8 994 sam sam2@berkeley.edu hw3 91.0
9 430 carly icarly@berkeley.edu hw2 100.0
10 430 carly icarly@berkeley.edu hw3 100.0
11 824 johnson magic@berkeley.edu NaN NaN
12 881 ernie ern@berkeley.edu hw2 95.0
13 881 ernie ern@berkeley.edu hw3 25.0
14 519 shaq big@berkeley.edu hw1 88.0
15 519 shaq big@berkeley.edu hw2 100.0
16 519 shaq big@berkeley.edu hw3 49.0

What if I want to know how many submissions there were for each assignment?

To do this, I can use value_counts:

In [10]:
df['assignment'].value_counts()
Out[10]:
hw3    6
hw2    5
hw1    4
Name: assignment, dtype: int64

What if we want only scores on Homework 2?

This requires a use of boolean filtering, which likely have already seen:

In [11]:
df[df['assignment'] == 'hw2']
# grades.loc[grades['assignment'] == 'hw2', :]
Out[11]:
sid name email assignment scores
3 456 billy billy@berkeley.edu hw2 53.0
7 994 sam sam2@berkeley.edu hw2 100.0
9 430 carly icarly@berkeley.edu hw2 100.0
12 881 ernie ern@berkeley.edu hw2 95.0
15 519 shaq big@berkeley.edu hw2 100.0

What if we want the average homework score for each assignment?

To do this, I can group by assignment:

In [12]:
df.groupby('assignment').mean()
Out[12]:
sid scores
assignment
hw1 523.000000 91.500000
hw2 656.000000 89.600000
hw3 567.166667 72.833333

What if we only want to see students with an average homework score of over 80%?

I can create a custom filter function in order to help me do this.

In [13]:
def filter_fn(r):
    # takes in all rows for a given student
    # and returns true iff their mean homework score is > 80
    return r['scores'].mean() > 80

df.groupby(['sid']).filter(filter_fn)
Out[13]:
sid name email assignment scores
0 123 amy amy@berkeley.edu hw1 94.0
1 123 amy amy@berkeley.edu hw3 82.0
6 994 sam sam2@berkeley.edu hw1 91.0
7 994 sam sam2@berkeley.edu hw2 100.0
8 994 sam sam2@berkeley.edu hw3 91.0
9 430 carly icarly@berkeley.edu hw2 100.0
10 430 carly icarly@berkeley.edu hw3 100.0

What if I want a new column with scores converted to letter grades?

I can create a mapping between scores and letter grades, and then apply this to the scores column.

In [14]:
def to_letter(x):
    if x >= 90: return 'A'
    elif x >= 80: return 'B'
    elif x >= 70: return 'C'
    elif x >= 60: return 'D'
    else: return 'F'
In [15]:
df['letter'] = df['scores'].apply(to_letter)
df
Out[15]:
sid name email assignment scores letter
0 123 amy amy@berkeley.edu hw1 94.0 A
1 123 amy amy@berkeley.edu hw3 82.0 B
2 456 billy billy@berkeley.edu hw1 93.0 A
3 456 billy billy@berkeley.edu hw2 53.0 F
4 456 billy billy@berkeley.edu hw3 90.0 A
5 194 billy notbilly@stanford.edu NaN NaN F
6 994 sam sam2@berkeley.edu hw1 91.0 A
7 994 sam sam2@berkeley.edu hw2 100.0 A
8 994 sam sam2@berkeley.edu hw3 91.0 A
9 430 carly icarly@berkeley.edu hw2 100.0 A
10 430 carly icarly@berkeley.edu hw3 100.0 A
11 824 johnson magic@berkeley.edu NaN NaN F
12 881 ernie ern@berkeley.edu hw2 95.0 A
13 881 ernie ern@berkeley.edu hw3 25.0 F
14 519 shaq big@berkeley.edu hw1 88.0 B
15 519 shaq big@berkeley.edu hw2 100.0 A
16 519 shaq big@berkeley.edu hw3 49.0 F

What if I want a more traditional gradebook?

This might be something like the gradebook your teachers in high school or elementary school used, with the rows containing students' names, and the columns containing the various assignments in their course. To do this, we can use pivot (which you should've also seen in Data 8):

In [16]:
df.pivot(index = 'sid', columns = 'assignment', values = ['scores'])
Out[16]:
scores
assignment NaN hw1 hw2 hw3
sid
123 NaN 94.0 NaN 82.0
194 NaN NaN NaN NaN
430 NaN NaN 100.0 100.0
456 NaN 93.0 53.0 90.0
519 NaN 88.0 100.0 49.0
824 NaN NaN NaN NaN
881 NaN NaN 95.0 25.0
994 NaN 91.0 100.0 91.0

In practice, this looks a little weird: let's get rid of the NaN column and replace all NaN values with a 0.

  • To get rid of the NaN column, we can use .iloc[:, 1:] to take all columns starting with hw1.
  • To get rid of the NaN values, we can use .fillna(0).
In [17]:
df.pivot(index = 'sid', columns = 'assignment', values = ['scores']).iloc[:, 1:].fillna(0)
Out[17]:
scores
assignment hw1 hw2 hw3
sid
123 94.0 0.0 82.0
194 0.0 0.0 0.0
430 0.0 100.0 100.0
456 93.0 53.0 90.0
519 88.0 100.0 49.0
824 0.0 0.0 0.0
881 0.0 95.0 25.0
994 91.0 100.0 91.0