Dynamic Programming: Nussinov RNA Folding
Happy new year!
In this post, I will introduce a basic Dynamic Programming algorithm which was one of the first attempts at predicting the structure of RNA molecules computationally. Dynamic Programming is a very powerful class of algorithms that can be used to greatly speed up tasks that would normally take astronomical amounts of time though the magic of recursion and divide and conquer. The core principle is that we break up the problem into smaller more manageable sub-problems for which we can easily compute the answer. We then store that answer in some data structure and when computing a larger problem, instead of re-computing we just look up the solution of the smaller sub-problem we previously computed. You will get the hang of this if you keep reading. Dymanic Programming algorithms can be found in a wide range of applications, from mathematics, to graph theroy, to bioinformatics. Hope you enjoy it! You can find the full script here.
RNA
RNA can be defined as a sequence $\omega$ consisting of letters/nucleotides from the alphabet ${U,A,C,G}$. The corresponding phenotype $s$ is the secondary structure of $\omega$ which can be thought of as a pairing between nucleotides in the primary sequence that give rise to a 2D architecture. Because it has been shown that the function of many biomolecules, including RNA, is driven by structure, knowing the structure of a sequence gives us valuable insight on function.
What is a secondary structure?
For our purposes, we think of a secondary structure as a pairing betwen all pairs of indices $i$ and $j$ in $s$ where $i$ and $j$ can be paired or not. By doing this we can see that such a pairing will give us a structure in 2 dimensions.
### 1
from IPython.display import Image
#This will load an image of an RNA secondary structure
Image(url='http://www.tbi.univie.ac.at/~pkerp/forgi/_images/1y26_ss.png')
As you can see, unparied positions are forming loop-like structures, and paired positions are forming stem-like structures. Another thing to notice is that, although in reality this is often not the case, in general we only allow pairs between ${C,G}$ and ${A, U}$ nucleotides, most modern approaches allow for non-canonical pairings and you will find some examples of this in the above structure.
Nussinov RNA Folding Algorithm
There are several approaches for solving this problem, we will look at the simplest one here which is known as the Nussinov Algorithm. This algorithm is a popular example of a class of algorithms know as dynamic programming algorithms. The main idea behind these algorithms is that we can break down the problem into many subproblems which are easier to compute than the full problem. Once we have obtained the solution for the subproblems, we can retrieve the solution to the full problem by doing something called a backtrace (more on the backtrace later).
Here, the problem is obtaining the optimal pairing on a string of nucleotides. In order to know how good our structure is, we assign a score to it. One possible scoring scheme could be adding 1 to the score per paired set of nucleotides, and 0 otherwise. So in other words, we want a pairing that will give us the highest possible score. We can write this quantity as $OPT(i, j)$ where $i$ and $j$ are the indices of the sequence between which we obtain the pairing score. Our algorithm is therefore going to compute a folding score for all substrings bound by $i$ and $j$ and store the value in what is known as a dynamic programming table. Our dynamic programming table will be a $N$ x $N$ array where $N$ is the length of our sequence. So now that we have a way of measuring how good a structure is, we need a way to evaluate scores given a subsequence. To do this, we set some rules on the structure of an RNA sequence:
If $i$ and $j$ form a pair:
- The pair $i$ and $j$ must form a valid watson-crick pair.
- $i < j-4$. This ensures that bonding is not happening between positions that are too close to each other, which would produce steric clashes.
- If pair $(i,j)$ and $(k, l)$ are in the structure, then $i < k < j < l$. This ensures that there is no crossing over of pairs which would result in pseudoknots.
- No base appears in more than one pair.
Using these rules we can begin to build our algorithm. The first part of our algorithm needs to take as input indices $i$ and $j$ and return the value $OPT(i,j)$ which is the optimal score of a structure between $i$ and $j$. We start by thinking about values of $i$ and $j$ for which we can immediately know the solution, this is known as a ‘base case’. This is a case where the solution is known and no further recursion is required. Once the algorithm reaches the base case, it can return a solution and propagate it upward to the first recursive call. So once we have reached $i$ and $j$ that are too close to form a structure (rule number 2), we know that the score is 0.
Otherwise, we must weigh the possibility of forming a pair or not forming a pair. If $i$ and $j$ are unpaired, then $OPT(i,j)$ is just $OPT(i, j-1)$ since the score will not increase for unpaired indices.
The other case is that $i$ is paired to some index $t$ on the interval $[i,j]$. We then add 1 to the score and consider the structure formed before and after the pairing between $i$ and $t$. We can write these two cases as $OPT(i, t-1)$ and $OPT(t+1, j)$. But how do we know which $t$ to pair $i$ with? Well we simply try all possible values of $t$ within the allowed range and choose the best one.
All of this can be summed up as follows:
\[OPT(i,j) = max\begin{cases} OPT(i, j-1) \quad \text{If $i$ and $j$ are not paired with each other.}\\ max(1 + OPT(i, t-1) + OPT(t+1, j)) \quad \text{Where we try all values of $t$ < j - 4} \end{cases}\]We can now use this recursion to fill our dynamic programming table. Once we have filled the table with scores, we can retrieve the optimal folding by a process called backtracking. We won’t go into detail on how this works, but the main idea is that we can start by looking at the entry containing the score for the full sequence $OPT[0][N]$. We can then look at adjacent entries and deduce which case (pairing or not pairing) resulted in the current value. We can continue like this for the full table until we have retrieved the full structure.
### 2
import numpy as np
min_loop_length = 4
def pair_check(tup):
if tup in [('A', 'U'), ('U', 'A'), ('C', 'G'), ('G', 'C')]:
return True
return False
def OPT(i,j, sequence):
""" returns the score of the optimal pairing between indices i and j"""
#base case: no pairs allowed when i and j are less than 4 bases apart
if i >= j-min_loop_length:
return 0
else:
#i and j can either be paired or not be paired, if not paired then the optimal score is OPT(i,j-1)
unpaired = OPT(i, j-1, sequence)
#check if j can be involved in a pairing with a position t
pairing = [1 + OPT(i, t-1, sequence) + OPT(t+1, j-1, sequence) for t in range(i, j-4)\
if pair_check((sequence[t], sequence[j]))]
if not pairing:
pairing = [0]
paired = max(pairing)
return max(unpaired, paired)
def traceback(i, j, structure, DP, sequence):
#in this case we've gone through the whole sequence. Nothing to do.
if j <= i:
return
#if j is unpaired, there will be no change in score when we take it out, so we just recurse to the next index
elif DP[i][j] == DP[i][j-1]:
traceback(i, j-1, structure, DP, sequence)
else:
#try pairing j with a matching index k to its left.
for k in [b for b in range(i, j-min_loop_length) if pair_check((sequence[b], sequence[j]))]:
#if the score at i,j is the result of adding 1 from pairing (j,k) and whatever score
#comes from the substructure to its left (i, k-1) and to its right (k+1, j-1)
if k-1 < 0:
if DP[i][j] == DP[k+1][j-1] + 1:
structure.append((k,j))
traceback(k+1, j-1, structure, DP, sequence)
break
elif DP[i][j] == DP[i][k-1] + DP[k+1][j-1] + 1:
#add the pair (j,k) to our list of pairs
structure.append((k,j))
#move the recursion to the two substructures formed by this pairing
traceback(i, k-1, structure, DP, sequence)
traceback(k+1, j-1, structure, DP, sequence)
break
def write_structure(sequence, structure):
dot_bracket = ["." for _ in range(len(sequence))]
for s in structure:
dot_bracket[min(s)] = "("
dot_bracket[max(s)] = ")"
return "".join(dot_bracket)
#initialize matrix with zeros where can't have pairings
def initialize(N):
#NxN matrix that stores the scores of the optimal pairings.
DP = np.empty((N,N))
DP[:] = np.NAN
for k in range(0, min_loop_length):
for i in range(N-k):
j = i + k
DP[i][j] = 0
return DP
def nussinov(sequence):
N = len(sequence)
DP = initialize(N)
structure = []
#fill the DP matrix
for k in range(min_loop_length, N):
for i in range(N-k):
j = i + k
DP[i][j] = OPT(i,j, sequence)
#copy values to lower triangle to avoid null references
for i in range(N):
for j in range(0, i):
DP[i][j] = DP[j][i]
traceback(0,N-1, structure, DP, sequence)
return (sequence, write_structure(sequence, structure))
print(nussinov("ACCCGAUGUUAUAUAUACCU"))
('ACCCGAUGUUAUAUAUACCU', '(...(..(((....).))))')