A bit of Pandas and Seaborn
Data in Python: Intro to Pandas & a bit of Seaborn :panda_face: :ocean:
(Wrote these notes for McGill’s course COMP 598.)HackMD version
Motivation :zap:
- Tabular/matrix data (i.e. data entered in rows and columns of a table) is pervasive in many fields, including the life sciences :microscope:
- The most widely used software for dealing with this kind of data is Microsoft Excel.
Why use Python? :snake:
- I’ll give three (four) reasons why you might want to try working with this kind of data in Python instead
- It’s free. :money_with_wings:
- Automation and complex manipulations is much easier :pie:
- Seamlessly integrate with the rest of your Python tools (e.g. database management, machine learning training, etc.) :chart_with_upwards_trend:
- Clicking on buttons hurts :hospital: :mouse: :no_entry_sign:
Objectives :goal_net:
- Loading a dataset in Pandas
- Inspecting it
- Basic manipulations
- Visualization
:::danger
- These libraries are super extensive.
- This is not an exhaustive coverage of pandas, this is just to get a small taste and become accustomed to learning from documentation to best suit your needs. :::
Getting a Dataset
- I browsed Kaggle for some interesting COVID-related datasets (of course)
- We’ll be using the COVID by US county dataset.
::: danger Disclaimer: This is a real dataset, but the following visualizations and statistics are purely for illustration purposes, not as a real analysis of COVID. ::: —
:::info
- Download the CSV file (Comma Separated Values). A CSV is basically a text version of an Excel Table.
- Place the file in a folder
my_project/data/
(we’ll be working in themy_project
folder) :::
Loading the file in Python
:::info
Create a file in my_project/
called covid.py
:::
- If the source file is a CSV. doc
import pandas as pd
df = pd.read_csv('../data/covid.csv')
- If the source file is an Excel File. doc
import pandas as pd
df = pd.read_excel('../data/covid.xls')
DataFrames
- The core object that Pandas uses is called a DataFrame.
- We loaded our data into a DataFrame called
df
- This is object stores the table data and supports most of the needed functionality.
- Let’s take a look, using the
head
method which prints the first few rows of the table.
print(df.head())
::: warning
date county state fips cases deaths
0 2020-01-21 Snohomish Washington 53061.0 1 0
1 2020-01-22 Snohomish Washington 53061.0 1 0
2 2020-01-23 Snohomish Washington 53061.0 1 0
3 2020-01-24 Cook Illinois 17031.0 1 0
4 2020-01-24 Snohomish Washington 53061.0 1 0
:::
-
Each row is a day with number of deaths and cases for each US county.
Selections: Columns
- To get a list of columns
print(df.columns)
::: warning
Index(['date', 'county', 'state', 'fips', 'cases', 'deaths'], dtype='object')
:::
- To get a single column
dates = df['date']
print(dates)
Output:
0 2020-01-21
1 2020-01-22
2 2020-01-23
3 2020-01-24
4 2020-01-24
5 2020-01-25
...
- To get multiple columns
print(df[['date', 'county']])
Selections: Row
- To select a specific row, we use the
iloc
attribute. docs
print(df.iloc[2])
::: warning
date 2020-01-23
county Snohomish
state Washington
fips 53061
cases 1
deaths 0
Name: 2, dtype: object
::: —
- You can then select columns within a row
print(df.iloc[2]['date'])
::: warning
2020-01-23
:::
Advanced Selection
- Row and column selection methods are very extensive and powerful. You should really read the docs for a full understanding.
- Here is an example of getting the rows where deaths exceed cases by at least a factor of 2
.loc
is a powerful attribute which can take a condition on the column values to filter the rows.
print(df.loc[df['deaths'] > df['cases'] * 2 ])
::: warning
date county state fips cases deaths
26292 2020-04-02 Unknown California NaN 0 1
28473 2020-04-03 Unknown Arizona NaN 0 4
29475 2020-04-03 Unknown Minnesota NaN 0 3
29803 2020-04-03 Unknown New York NaN 37 608
30135 2020-04-03 Unknown Pennsylvania NaN 0 1
30475 2020-04-03 Unknown Utah NaN 0 3
:::
Adding columns
- Maybe we want to flag certain counties with lots of cases.
df['hot'] = df['cases']> 100
print(df.head())
::: warning
date county state fips cases deaths hot
0 2020-01-21 Snohomish Washington 53061.0 1 0 False
1 2020-01-22 Snohomish Washington 53061.0 1 0 False
2 2020-01-23 Snohomish Washington 53061.0 1 0 False
3 2020-01-24 Cook Illinois 17031.0 1 0 False
4 2020-01-24 Snohomish Washington 53061.0 1 0 False
:::
Writing DataFrames
- Let’s say we want to publish our DataFrame with the
hot
column. - We simply write it to a new CSV file which can be loaded later.
df.to_csv('new_data.csv')
#or to excel format
df.to_excel('new_data.xls')
Grouping
- Maybe we want to know averages for the columns per state
- the
groupby
method comes in handy here.
#groupby object
state_grouped = df.groupby(['state'])
#new dataframe
means_df = state_grouped.mean()
::: warning
state fips cases deaths hot
0 Alabama 1067.804565 95.768957 3.544569 0.222011
1 Alaska 2148.606250 23.179455 0.465347 0.059406
2 Arizona 4014.058216 411.876394 18.223048 0.372677
3 Arkansas 5075.446098 37.349700 0.763608 0.077983
4 California 6058.734863 671.394359 25.821298 0.382446
5 Colorado 8062.313048 206.642279 10.232297 0.233342
:::
:::info The logic behind grouping is a bit involved so I suggest reading the docs. :::
Sorting
- Which states have the most daily deaths on average?
print(means_df.sort_values(by=['deaths'], ascending=False)[['state', 'deaths']])
::: warning
state deaths
33 New York 274.444252
31 New Jersey 212.262184
6 Connecticut 171.599364
22 Massachusetts 152.209790
8 District of Columbia 140.775000
41 Puerto Rico 58.486486
:::
- I used the
sort_values
method withascending=False
for a decreasing sort bydeaths
- Then I print the
state
anddeaths
columns.
Plotting with Seaborn :ocean:
- Seaborn is a wrapper on top of matplotlib which works nicely with DataFrames.
Deaths per Day
- Let’s compare deaths over time between three states.
- Notice the use of
isin()
to filter rows whose column values are within a list of accepted values.
import seaborn as sns
import matplotlib.pyplot as plt
g = sns.lineplot(x="date", y="deaths", hue="state" data=df.loc[df['state'].isin(['New York', 'New Jersey', 'California'])])
plt.show()
::: spoiler
#I used this to get rid of some x-ticks for easier reading.
for ind, label in enumerate(g.get_xticklabels()):
if ind % 30 == 0: # every 30th label is kept
label.set_visible(True)
else:
label.set_visible(False)
:::
:hatching_chick:
Mortality by State
- Let’s do one more.
- Let’s group by state again and instead of averaging take totals (sum)
- We also add a new column called
mortality
tot = df.groupby(['state'], as_index=False).sum()
tot['mortality'] = tot['deaths'] / tot['cases']
df = df.sort_values(by='mortality', ascending=False)
sns.barplot(x='state', y='mortality')