Discussion 3 Supplemental Notebook¶

Data 100, Spring 2020¶

Suraj Rampure (suraj.rampure@berkeley.edu)¶

In this notebook, we'll use dataframes containing (imaginary) students' roster information and grades to demonstrate various types of joins/merges. We'll also use the same dataframes to extract information using pandas syntax you should be familiar with.

View the interactive notebook here.

import pandas as pd
import numpy as np

Demoing Joins¶

These demos use pandas, but joining in SQL is the same as merging in pandas.

roster = pd.DataFrame()
roster['sid'] = [123, 456, 194, 994, 430, 824, 881, 519]
roster['name'] = ['amy', 'billy', 'billy', 'sam', 'carly', 'johnson', 'ernie', 'shaq']
roster['email'] = ['amy@berkeley.edu', 'billy@berkeley.edu', 'notbilly@stanford.edu', 'sam2@berkeley.edu', 'icarly@berkeley.edu', 'magic@berkeley.edu', 'ern@berkeley.edu', 'big@berkeley.edu']

grades = pd.DataFrame()
grades['sid'] = [123, 456, 994, 519, 808, 456, 994, 430, 223, 881, 405, 519, 123, 456, 994, 430, 881, 519]
grades['assignment'] = ['hw1', 'hw1', 'hw1', 'hw1', 'hw1', 'hw2', 'hw2', 'hw2', 'hw2', 'hw2', 'hw2', 'hw2', 'hw3', 'hw3', 'hw3', 'hw3', 'hw3', 'hw3']
grades['scores'] = [94, 93, 91, 88, 75, 53, 100, 100, 89, 95, 96, 100, 82, 90, 91, 100, 25, 49]

roster

grades

You should note that there are some students with no grades (e.g. johnson has no grades, since sid = 824 doesn't appear in grades), and there are some submissions in grades that belong to students who are not in the roster (e.g. sid = 808 doesn't appear in roster).

Let's first perform an inner join:

roster.merge(grades, left_on = 'sid', right_on = 'sid', how = 'inner')

Here, we see that we have all grades of every student in the course, if they have any assignment submissions. You'll note that johnson doesn't appear in this table, and neither does the billy whose email is notbilly@stanford.edu. We also see that the student with sid = 808 is not in this table.

What if we want at least one row for every single student in the roster, whether or not they submitted assignments? This comes up in practice from the instructor's side when computing grades at the end of the semester – even if you didn't submit something (or anything!), you still need to be assigned a grade. To do this, I can try a left join:

roster.merge(grades, how = 'left')

Now, we see that rows exist for johnson and the other billy, with NaN for their assignment and scores. (Also, note that we can drop the left_on and right_on parameters if the column that we're merging on between the two tables has the same name. In practice, this won't be the case, so you should get used to specifying these values.)

What if I want to see information for all students that submitted things? I can try a right join:

roster.merge(grades, how = 'right')

Now, note that johnson and other billy don't appear in the output, but there are rows for sid = 808 and the other two students who submitted work but aren't in the class (with NaN as their name and email).

It should be noted that LEFTJOIN(A, B) is the same as RIGHTJOIN(B, A). In other words, to generate the above dataframe, I could've used roster.merge(grades, how = 'right') OR grades.merge(grades, how = 'left'). The resulting dataframes would've been the same, just with the columns in a different order.

Lastly, what if I want a combination of the above two dataframes? That is, rows for every student in the course, their assignment grades if they exist, and grades for students not in the class if they happened to submit. This can be achieved with an outer join:

df = roster.merge(grades, how = 'outer')
df

Demoing Pandas Methods¶

Now, let's try some queries. Moving forward we'll use the below merged dataframe.

df = roster.merge(grades, how = 'left')
df

What if I want to know how many submissions there were for each assignment?¶

To do this, I can use value_counts:

df['assignment'].value_counts()

hw3    6
hw2    5
hw1    4
Name: assignment, dtype: int64

What if we want only scores on Homework 2?¶

This requires a use of boolean filtering, which likely have already seen:

df[df['assignment'] == 'hw2']
# grades.loc[grades['assignment'] == 'hw2', :]

What if we want the average homework score for each assignment?¶

To do this, I can group by assignment:

df.groupby('assignment').mean()

What if we only want to see students with an average homework score of over 80%?¶

I can create a custom filter function in order to help me do this.

def filter_fn(r):
    # takes in all rows for a given student
    # and returns true iff their mean homework score is > 80
    return r['scores'].mean() > 80

df.groupby(['sid']).filter(filter_fn)

What if I want a new column with scores converted to letter grades?¶

I can create a mapping between scores and letter grades, and then apply this to the scores column.

def to_letter(x):
    if x >= 90: return 'A'
    elif x >= 80: return 'B'
    elif x >= 70: return 'C'
    elif x >= 60: return 'D'
    else: return 'F'

df['letter'] = df['scores'].apply(to_letter)
df

What if I want a more traditional gradebook?¶

This might be something like the gradebook your teachers in high school or elementary school used, with the rows containing students' names, and the columns containing the various assignments in their course. To do this, we can use pivot (which you should've also seen in Data 8):

df.pivot(index = 'sid', columns = 'assignment', values = ['scores'])

In practice, this looks a little weird: let's get rid of the NaN column and replace all NaN values with a 0.

To get rid of the NaN column, we can use .iloc[:, 1:] to take all columns starting with hw1.
To get rid of the NaN values, we can use .fillna(0).

df.pivot(index = 'sid', columns = 'assignment', values = ['scores']).iloc[:, 1:].fillna(0)

	scores
assignment	hw1	hw2	hw3
sid
123	94.0	0.0	82.0
194	0.0	0.0	0.0
430	0.0	100.0	100.0
456	93.0	53.0	90.0
519	88.0	100.0	49.0
824	0.0	0.0	0.0
881	0.0	95.0	25.0
994	91.0	100.0	91.0

	sid	name	email
0	123	amy	amy@berkeley.edu
1	456	billy	billy@berkeley.edu
2	194	billy	notbilly@stanford.edu
3	994	sam	sam2@berkeley.edu
4	430	carly	icarly@berkeley.edu
5	824	johnson	magic@berkeley.edu
6	881	ernie	ern@berkeley.edu
7	519	shaq	big@berkeley.edu

	sid	assignment	scores
0	123	hw1	94
1	456	hw1	93
2	994	hw1	91
3	519	hw1	88
4	808	hw1	75
5	456	hw2	53
6	994	hw2	100
7	430	hw2	100
8	223	hw2	89
9	881	hw2	95
10	405	hw2	96
11	519	hw2	100
12	123	hw3	82
13	456	hw3	90
14	994	hw3	91
15	430	hw3	100
16	881	hw3	25
17	519	hw3	49

	sid	scores
assignment
hw1	523.000000	91.500000
hw2	656.000000	89.600000
hw3	567.166667	72.833333

	sid	assignment	scores
0	123	hw1	94
1	456	hw1	93
2	994	hw1	91
3	519	hw1	88
4	808	hw1	75
5	456	hw2	53
6	994	hw2	100
7	430	hw2	100
8	223	hw2	89
9	881	hw2	95
10	405	hw2	96
11	519	hw2	100
12	123	hw3	82
13	456	hw3	90
14	994	hw3	91
15	430	hw3	100
16	881	hw3	25
17	519	hw3	49

	sid	assignment	scores
0	123	hw1	94
1	456	hw1	93
2	994	hw1	91
3	519	hw1	88
4	808	hw1	75
5	456	hw2	53
6	994	hw2	100
7	430	hw2	100
8	223	hw2	89
9	881	hw2	95
10	405	hw2	96
11	519	hw2	100
12	123	hw3	82
13	456	hw3	90
14	994	hw3	91
15	430	hw3	100
16	881	hw3	25
17	519	hw3	49