Data in Python: Intro to Pandas & a bit of Seaborn :panda_face: :ocean:

(Wrote these notes for McGill’s course COMP 598.)HackMD version


Motivation :zap:

  • Tabular/matrix data (i.e. data entered in rows and columns of a table) is pervasive in many fields, including the life sciences :microscope:
  • The most widely used software for dealing with this kind of data is Microsoft Excel.

Why use Python? :snake:

  • I’ll give three (four) reasons why you might want to try working with this kind of data in Python instead
    1. It’s free. :money_with_wings:
    2. Automation and complex manipulations is much easier :pie:
    3. Seamlessly integrate with the rest of your Python tools (e.g. database management, machine learning training, etc.) :chart_with_upwards_trend:
    4. Clicking on buttons hurts :hospital: :mouse: :no_entry_sign:

Objectives :goal_net:

  1. Loading a dataset in Pandas
  2. Inspecting it
  3. Basic manipulations
  4. Visualization

:::danger

  • These libraries are super extensive.
  • This is not an exhaustive coverage of pandas, this is just to get a small taste and become accustomed to learning from documentation to best suit your needs. :::

Getting a Dataset

  • I browsed Kaggle for some interesting COVID-related datasets (of course)
  • We’ll be using the COVID by US county dataset.

::: danger Disclaimer: This is a real dataset, but the following visualizations and statistics are purely for illustration purposes, not as a real analysis of COVID. ::: —

:::info

  1. Download the CSV file (Comma Separated Values). A CSV is basically a text version of an Excel Table.
  2. Place the file in a folder my_project/data/ (we’ll be working in the my_project folder) :::

Loading the file in Python

:::info Create a file in my_project/ called covid.py :::

  • If the source file is a CSV. doc
import pandas as pd
df = pd.read_csv('../data/covid.csv')

  • If the source file is an Excel File. doc
import pandas as pd
df = pd.read_excel('../data/covid.xls')

DataFrames

  • The core object that Pandas uses is called a DataFrame.
  • We loaded our data into a DataFrame called df
  • This is object stores the table data and supports most of the needed functionality.
  • Let’s take a look, using the head method which prints the first few rows of the table.

print(df.head())

::: warning

         date     county       state     fips  cases  deaths
0  2020-01-21  Snohomish  Washington  53061.0      1       0
1  2020-01-22  Snohomish  Washington  53061.0      1       0
2  2020-01-23  Snohomish  Washington  53061.0      1       0
3  2020-01-24       Cook    Illinois  17031.0      1       0
4  2020-01-24  Snohomish  Washington  53061.0      1       0

:::

  • Each row is a day with number of deaths and cases for each US county.

Selections: Columns

  • To get a list of columns
print(df.columns)

::: warning

Index(['date', 'county', 'state', 'fips', 'cases', 'deaths'], dtype='object')

:::

  • To get a single column
dates = df['date']
print(dates)

Output:

0         2020-01-21
1         2020-01-22
2         2020-01-23
3         2020-01-24
4         2020-01-24
5         2020-01-25
...
  • To get multiple columns
print(df[['date', 'county']])

Selections: Row

  • To select a specific row, we use the iloc attribute. docs
print(df.iloc[2])

::: warning

date      2020-01-23
county     Snohomish
state     Washington
fips           53061
cases              1
deaths             0
Name: 2, dtype: object

::: —

  • You can then select columns within a row
print(df.iloc[2]['date'])

::: warning

2020-01-23

:::


Advanced Selection

  • Row and column selection methods are very extensive and powerful. You should really read the docs for a full understanding.
  • Here is an example of getting the rows where deaths exceed cases by at least a factor of 2
  • .loc is a powerful attribute which can take a condition on the column values to filter the rows.
print(df.loc[df['deaths'] > df['cases'] * 2 ])

::: warning

              date   county         state  fips  cases  deaths
26292   2020-04-02  Unknown    California   NaN      0       1
28473   2020-04-03  Unknown       Arizona   NaN      0       4
29475   2020-04-03  Unknown     Minnesota   NaN      0       3
29803   2020-04-03  Unknown      New York   NaN     37     608
30135   2020-04-03  Unknown  Pennsylvania   NaN      0       1
30475   2020-04-03  Unknown          Utah   NaN      0       3

:::


Adding columns

  • Maybe we want to flag certain counties with lots of cases.
df['hot'] = df['cases']> 100
print(df.head())

::: warning

         date     county       state     fips  cases  deaths    hot
0  2020-01-21  Snohomish  Washington  53061.0      1       0  False
1  2020-01-22  Snohomish  Washington  53061.0      1       0  False
2  2020-01-23  Snohomish  Washington  53061.0      1       0  False
3  2020-01-24       Cook    Illinois  17031.0      1       0  False
4  2020-01-24  Snohomish  Washington  53061.0      1       0  False

:::


Writing DataFrames

  • Let’s say we want to publish our DataFrame with the hot column.
  • We simply write it to a new CSV file which can be loaded later.
df.to_csv('new_data.csv')
#or to excel format
df.to_excel('new_data.xls')


Grouping

  • Maybe we want to know averages for the columns per state
  • the groupby method comes in handy here.
#groupby object
state_grouped = df.groupby(['state'])
#new dataframe
means_df = state_grouped.mean()

::: warning

                       state          fips        cases      deaths       hot
0                    Alabama   1067.804565    95.768957    3.544569  0.222011
1                     Alaska   2148.606250    23.179455    0.465347  0.059406
2                    Arizona   4014.058216   411.876394   18.223048  0.372677
3                   Arkansas   5075.446098    37.349700    0.763608  0.077983
4                 California   6058.734863   671.394359   25.821298  0.382446
5                   Colorado   8062.313048   206.642279   10.232297  0.233342

:::

:::info The logic behind grouping is a bit involved so I suggest reading the docs. :::


Sorting

  • Which states have the most daily deaths on average?

print(means_df.sort_values(by=['deaths'], ascending=False)[['state', 'deaths']])

::: warning

                       state      deaths
33                  New York  274.444252
31                New Jersey  212.262184
6                Connecticut  171.599364
22             Massachusetts  152.209790
8       District of Columbia  140.775000
41               Puerto Rico   58.486486

:::

  • I used the sort_values method with ascending=False for a decreasing sort by deaths
  • Then I print the state and deaths columns.

Plotting with Seaborn :ocean:

  • Seaborn is a wrapper on top of matplotlib which works nicely with DataFrames.

Deaths per Day

  • Let’s compare deaths over time between three states.
  • Notice the use of isin() to filter rows whose column values are within a list of accepted values.
import seaborn as sns
import matplotlib.pyplot as plt

g = sns.lineplot(x="date", y="deaths", hue="state" data=df.loc[df['state'].isin(['New York', 'New Jersey', 'California'])])
plt.show()

::: spoiler

#I used this to get rid of some x-ticks for easier reading.
for ind, label in enumerate(g.get_xticklabels()):
    if ind % 30 == 0:  # every 30th label is kept
        label.set_visible(True)
    else:
        label.set_visible(False)

:::


:hatching_chick:


Mortality by State

  • Let’s do one more.
  • Let’s group by state again and instead of averaging take totals (sum)
  • We also add a new column called mortality
tot = df.groupby(['state'], as_index=False).sum()
tot['mortality'] = tot['deaths'] / tot['cases']
df = df.sort_values(by='mortality', ascending=False)
sns.barplot(x='state', y='mortality')

:hatching_chick: