Data in Python: Intro to Pandas & a bit of Seaborn :panda_face: :ocean:

(Wrote these notes for McGill’s course COMP 598.)HackMD version

Motivation :zap:

Tabular/matrix data (i.e. data entered in rows and columns of a table) is pervasive in many fields, including the life sciences :microscope:
The most widely used software for dealing with this kind of data is Microsoft Excel.

Why use Python? :snake:

I’ll give three (four) reasons why you might want to try working with this kind of data in Python instead
1. It’s free. :money_with_wings:
2. Automation and complex manipulations is much easier :pie:
3. Seamlessly integrate with the rest of your Python tools (e.g. database management, machine learning training, etc.) :chart_with_upwards_trend:
4. Clicking on buttons hurts :hospital: :mouse: :no_entry_sign:

Objectives :goal_net:

Loading a dataset in Pandas
Inspecting it
Basic manipulations
Visualization

:::danger

These libraries are super extensive.
This is not an exhaustive coverage of pandas, this is just to get a small taste and become accustomed to learning from documentation to best suit your needs. :::

Getting a Dataset

I browsed Kaggle for some interesting COVID-related datasets (of course)
We’ll be using the COVID by US county dataset.

::: danger Disclaimer: This is a real dataset, but the following visualizations and statistics are purely for illustration purposes, not as a real analysis of COVID. ::: —

:::info

Download the CSV file (Comma Separated Values). A CSV is basically a text version of an Excel Table.
Place the file in a folder my_project/data/ (we’ll be working in the my_project folder) :::

Loading the file in Python

:::info Create a file in my_project/ called covid.py :::

If the source file is a CSV. doc

import pandas as pd
df = pd.read_csv('../data/covid.csv')

If the source file is an Excel File. doc

import pandas as pd
df = pd.read_excel('../data/covid.xls')

DataFrames

The core object that Pandas uses is called a DataFrame.
We loaded our data into a DataFrame called df
This is object stores the table data and supports most of the needed functionality.
Let’s take a look, using the head method which prints the first few rows of the table.

print(df.head())

::: warning

         date     county       state     fips  cases  deaths
2020-01-21  Snohomish  Washington  53061.0      1       0
2020-01-22  Snohomish  Washington  53061.0      1       0
2020-01-23  Snohomish  Washington  53061.0      1       0
2020-01-24       Cook    Illinois  17031.0      1       0
2020-01-24  Snohomish  Washington  53061.0      1       0

:::

Each row is a day with number of deaths and cases for each US county.

Selections: Columns

To get a list of columns

print(df.columns)

::: warning

Index(['date', 'county', 'state', 'fips', 'cases', 'deaths'], dtype='object')

:::

To get a single column

dates = df['date']
print(dates)

Output:

To get multiple columns

print(df[['date', 'county']])

Selections: Row

To select a specific row, we use the iloc attribute. docs

print(df.iloc[2])

::: warning

date      2020-01-23
county     Snohomish
state     Washington
fips           53061
cases              1
deaths             0
Name: 2, dtype: object

::: —

You can then select columns within a row

print(df.iloc[2]['date'])

::: warning

2020-01-23

:::

Advanced Selection

Row and column selection methods are very extensive and powerful. You should really read the docs for a full understanding.
Here is an example of getting the rows where deaths exceed cases by at least a factor of 2
.loc is a powerful attribute which can take a condition on the column values to filter the rows.

print(df.loc[df['deaths'] > df['cases'] * 2 ])

::: warning

              date   county         state  fips  cases  deaths
 2020-04-02  Unknown    California   NaN      0       1
 2020-04-03  Unknown       Arizona   NaN      0       4
 2020-04-03  Unknown     Minnesota   NaN      0       3
 2020-04-03  Unknown      New York   NaN     37     608
 2020-04-03  Unknown  Pennsylvania   NaN      0       1
 2020-04-03  Unknown          Utah   NaN      0       3

:::

Adding columns

Maybe we want to flag certain counties with lots of cases.

df['hot'] = df['cases']> 100
print(df.head())

::: warning

         date     county       state     fips  cases  deaths    hot
2020-01-21  Snohomish  Washington  53061.0      1       0  False
2020-01-22  Snohomish  Washington  53061.0      1       0  False
2020-01-23  Snohomish  Washington  53061.0      1       0  False
2020-01-24       Cook    Illinois  17031.0      1       0  False
2020-01-24  Snohomish  Washington  53061.0      1       0  False

:::

Writing DataFrames

Let’s say we want to publish our DataFrame with the hot column.
We simply write it to a new CSV file which can be loaded later.

df.to_csv('new_data.csv')
#or to excel format
df.to_excel('new_data.xls')

Grouping

Maybe we want to know averages for the columns per state
the groupby method comes in handy here.

#groupby object
state_grouped = df.groupby(['state'])
#new dataframe
means_df = state_grouped.mean()

::: warning

                       state          fips        cases      deaths       hot
                  Alabama   1067.804565    95.768957    3.544569  0.222011
                   Alaska   2148.606250    23.179455    0.465347  0.059406
                  Arizona   4014.058216   411.876394   18.223048  0.372677
                 Arkansas   5075.446098    37.349700    0.763608  0.077983
               California   6058.734863   671.394359   25.821298  0.382446
                 Colorado   8062.313048   206.642279   10.232297  0.233342

:::

:::info The logic behind grouping is a bit involved so I suggest reading the docs. :::

Sorting

Which states have the most daily deaths on average?

print(means_df.sort_values(by=['deaths'], ascending=False)[['state', 'deaths']])

::: warning

                       state      deaths
                New York  274.444252
              New Jersey  212.262184
              Connecticut  171.599364
           Massachusetts  152.209790
     District of Columbia  140.775000
             Puerto Rico   58.486486

:::

I used the sort_values method with ascending=False for a decreasing sort by deaths
Then I print the state and deaths columns.

Plotting with Seaborn :ocean:

Seaborn is a wrapper on top of matplotlib which works nicely with DataFrames.

Deaths per Day

Let’s compare deaths over time between three states.
Notice the use of isin() to filter rows whose column values are within a list of accepted values.

import seaborn as sns
import matplotlib.pyplot as plt

g = sns.lineplot(x="date", y="deaths", hue="state" data=df.loc[df['state'].isin(['New York', 'New Jersey', 'California'])])
plt.show()

::: spoiler

#I used this to get rid of some x-ticks for easier reading.
for ind, label in enumerate(g.get_xticklabels()):
    if ind % 30 == 0:  # every 30th label is kept
        label.set_visible(True)
    else:
        label.set_visible(False)

:::

:hatching_chick:

Mortality by State

Let’s do one more.
Let’s group by state again and instead of averaging take totals (sum)
We also add a new column called mortality

tot = df.groupby(['state'], as_index=False).sum()
tot['mortality'] = tot['deaths'] / tot['cases']
df = df.sort_values(by='mortality', ascending=False)
sns.barplot(x='state', y='mortality')

Data in Python: Intro to Pandas & a bit of Seaborn :panda_face: :ocean:

Motivation :zap:

Why use Python? :snake:

Objectives :goal_net:

Getting a Dataset

Loading the file in Python

DataFrames

Each row is a day with number of deaths and cases for each US county.

Selections: Columns

Selections: Row

Advanced Selection

Adding columns

Writing DataFrames

Grouping

Sorting

Plotting with Seaborn :ocean:

Deaths per Day

:hatching_chick:

Mortality by State

:hatching_chick: