!pip install Bio
Psuedoalignment Implementation
Goal: Given some RNA-seq data in FASTA format, find the vector of equivalence class counts.
The pseudoalignment procedure should take in: - (1) an index from gene notation, or, a way of corresponding isoform name to the actual sequence (ref genome: chr11) - (2) RNA-seq data (reads: reads.tar.gz) - (3) a k-mer length (ideally between 21 ~ 36)
The output would be a table like:
- isoforms in eq. class - number of items in that eq. class: how many isoforms are in the eq. class - counts: number of reads that map to the equivalence class An equivalence class is a set of isoform targes that the read could potentially be from/map to.
To see the final implementation, go to the pseudoalign - Obatining table… section of this notebook.
Read data
The RNA-seq data and ref genome are both in FASTA forma, which we would parse with the biopython library.
import gzip
from Bio import SeqIO
import pandas as pd
from matplotlib import pyplot as plt
Obtaining RNA-seq reads:
The RNA-seq reads are all of length 100. In the read description, it gave an alignment result that we could potentially use to check our work.
= []
reads with gzip.open("reads.fasta.gz", "rt") as handle:
for seq_record in SeqIO.parse(handle, "fasta"):
= seq_record.description.split(';')[0]
des = seq_record.description.split(';')[1]
mate1 = seq_record.description.split(';')[2]
mate2 id = des.split('/')[0]
= des.split('/')[1]
gene = str(seq_record.seq)
seq = len(seq_record.seq)
length = {
d 'mate1':mate1,
'mate2':mate2,
'id':id,
'seq':seq,
'length':length,
'gene':gene
} reads.append(d)
1] reads[
{'mate1': 'mate1:221-320',
'mate2': 'mate2:302-400',
'id': 'read2',
'seq': 'CCCCCGTGCTGCCCTGAGGAAGTTGAAGCTCCTCGCTGCTTCTGGCACTTCAGCGGGAAGTTGGTTGGGGCGGGATCGCGCGCCCTCTGGTGGCGCCATG',
'length': 100,
'gene': 'ENST00000410108'}
#check for missing
for d in reads:
if len(d['seq']) < 10:
print(d)
Indexing: Optaining the reference genome: (chr11)
The target sequences have varied lengths, as shown in the histogram.
= []
chr11 with open("chr11_transcriptome.fasta", "rt") as handle11:
for seq_record in SeqIO.parse(handle11, "fasta"):
= seq_record.description
gene = str(seq_record.seq)
seq = len(seq_record)
length = {
d 'seq':seq,
'length':length,
'gene':gene
} chr11.append(d)
1] chr11[
{'seq': 'CCACGTCTGAGGCGGCTGTGGCCGCGTCGGTGTCCGCGTCGAGGAGCCGGGGCAGGGCACGATGGCGGACTGGGCTCGGGCTCAGAGCCCGGGCGCTGTGGAAGAGATTCTAGACCGGGAGAACAAGCGAATGGCTGACAGCCTGGCCTCCAAAGTCACCAGGCTCAAATCGCTCGCCCTGGACATCGATAGGGATGCAGAGGATCAGAACCGGTACCTGGATGGCATGGTAAGGGCCCACGGTGTGCGTGTATCAGTGCCCTGCCCTAGCACTACCTGCTGCAGAGCCTGCAGTGTCTCCTTCAGCACTGGTGCGGGTGGGTGGTCAAATCACCACTTCTGTGTGATCTTGCTGGGATTCCTCCCTTAGGACTCGGATTTCACAAGCATGACCAGCCTGCTTACAGGGAGCGTGAAGCGCTTTTCCACAATGGCAAGGTCCGGACAAGACAACCGGAAGCTTCTATGTGGCATGGCCGTGGGTCTAATTGTGGCCTTCTTCATCCTCTCCTACTTCTTGTCCAGGGCAAGGACGTGAGCCAGTGGGAGCTGGTGTCTGTGGGTGCCAAGGGCAGCCAGGGTCTTCCCTGCCTGGTGTTTTGGGCTCCAGAGGACTTACCTACAAAATACTCCTTTGCAATGATAATTGTGGGTCAGGAATCTTCTTCCTGTGTGGCAGGAGGCTGCGGCTGCCTGTGACCTGATGAGCTCATGTTGGCTGGTCCCATGTGTGAAAGGGACTCTCTCGGGGAAACCAAGGCCCAGCCCCTCCCCCCTCCCCCAGGTGCTAGAGAGCCAGGCAGAGCTGGGGCCACAGCGTGTCTGGAGAAGCCAGGTCTGAGCAGAGCAGCTGGTGAATCCCAGACGTGGCCACCTGTGGCTTGCAGCCCTTCTCCCCGCGCTTGGGACTCTGACATCTTAAGGCTGCACGGTCGTGTCCTTGTCTGGGTGAGGCCATGTCTGTGATCCAAGGTTCCTGGAACTGACACAGGAAGGGGCTGTGAACCCTAAGTGGGTGTCATCTCCTCCGACCGAGGCTTCTCACCCTGGAGATGGCAGTTACTCCTGGCCATGGTTGCTGAGCATGGGCAGACCAGTGGAGGCCACCCTACTGTGTTATCTGCGCCTTCGATGAAGTGAGACCCTTGGGGAGAACGGGCTGTGGATGAAGGAGTGGACTGCAGCCTTGGCCTAGCCACTGGGCTGGGATCTTCTGGGTCATGTGACTGTGTATCCAGGAGCAGAAACTTGTATTCTCAGGATTCAGGATCTACCCAGCACCAAAGATGTATTTTCAGGAGAACAGACCTAGAAATGGGCCTGTCTGGCATTTCAGAGTCAGGCAAAGCAGGCAGGGCCAGGGAGCTTCTGTGGGTCTACACAAGAAGGTTCCTGTGAGGGCTATCAGTTGTTGCCTTCTAGCTTGCTGGTAACTTTGGCGCCTCCGCCAAGCCCTGCCAGACTCCCCTGGCTGTGATGGCATTCTGTGCCATCCTCCTTGTCCCCAGCCTCTGCAGGATGCCCTCCCTACCCACCTCTCCCTGGGCCTTCCCTGTCCACTGGGCTGGATTCATGTTCAAACCACTGGACTGGCAGGGCAACGACTTCTTCCCACCTCAAGATGAGGTCCTCGCCCCCTTGTCTTGGCATAAAAACACCTTTAAAGCATGAGCCATGTGCTTCTTTGCCCTTCTCTGTCCTGTTCCAATCTTCTGCCTCCCAGTCACTCCCTGGGGACTATGGGATCACTGTCCCCCCACCTGTGTGGCCACACCATATGTCCTGTCAATCCAGAACTGCCTCTGAGCTCCAGGCTGACCACAGATCAGCCACAGCCTGATGCCTGCAGCCCCACTTTGCTCACCCTTCCCCTCCCCTCCTCCTTCCTTCCACACAGCAAGCCTACCTTTCTCCATCCATGCTCACCATAGCCCCCTTCCTTGTGACCTGGACCCTCCATTGTACCTGGCTGAGACTGTCAGCCTCCTGGAGGAGTGGGGTCCACCTTCTTCTTGCCCTATGCAGTGCAAGCTCACTTCTCACCCAGCAAGGTTGACTCATCTGCCTCCATGTCTCTGGGGCTTTGCTGTTGCCCTGAAACCTAGCTGGGCTGGTCTTGCTCCCAGCTTGCTTCCCCCTCCTCGGATGTCCCTTTGCAGGCCCCTGTCGTTCCTCCGGCACCAGTGTCCTTGGCTGCCATGGCAAGCTCATCAGGGGCTTGTACCCTGGTCACCAAGCATGGTAGCAGCTGCCTGCATTGTATCTCCATCTGGTCACTGCAGGTGCCAACCCTTCATCCCCCATGTTTTCCTGGGCCATGGAGGGCTGACCTCCGTTTCTGGGGAATGTGGCTGAGCTGTGGTAACCAGCTACACCCCAGGTGCTCTTTCCATGGTGGTGCCTGCTCATCTTGCTGATGCAAACTAGGAAGTTAGGCTGCATCTCGGAGTGGCTTTCGCTGGAGAGGTGCTTTGCTGTCTCTCAGACTCAGTCACTGTGTTCCCTCCCCGCCTCTCTTATCTCCATGGCTGTTTGCAGCTCTCCCAGGTACTTTGGGGTCTGAGCTGGAATTCCTTTGTGGTTTGCTCTTCTGCTTCTCACTCTTGTATTAAGAAGGATTCCACAAAGGGAGAGTGGCATCCCTGCTGCTGCTGTGCCAGACCAGAGTTTCCTGAGGGGCCCTGACCCTAACCCTCCAGCTCAGCCCTGTACACCTGACCCTGTAAATGAGTGGGGTTTGCTGACTGTAATCCCTGACACCAGTAAAACCAAAAGGACTCTTGGGGGCTCAGTGTGAGAGCCAGGGTTACCTACTCTGCCAAGTGAGGACAAACTGCTAGGCTGTATCCCATAATTTCAGGATGAGAAACATTAACAATAAAAATTTGTAGTAAACATAACCTCATGAAGAACTAA',
'length': 2916,
'gene': 'ENST00000325147'}
'length'] for g in chr11])
plt.hist([g['Isoform target length distribution') plt.title(
Text(0.5, 1.0, 'Isoform target length distribution')
Equivalence classes
Define helper functions for equivalence counts:
- Generate a list of k-mers from a sequence based on k-mer length
- Get the reverse complement of a strand
def kmers(k,seq):
if k < 1:
raise "k must be positive"
return [seq[i:i+k] for i in range(len(seq)-k+1)]
= 'ACTGCATATA'
seq = 3
k print(kmers(k,seq))
['ACT', 'CTG', 'TGC', 'GCA', 'CAT', 'ATA', 'TAT', 'ATA']
def rev(seq):
= "".join(reversed(seq))
r = {'A':'T','C':'G','G':'C','T':'A'}
complement return ''.join([complement[base] for base in r])
print(seq)
print(rev(seq))
ACTGCATATA
TATATGCAGT
Obtain all equivalence classes and isoforms in each eq. class.
In this part we would use a hash table (or python dict, where the key is not hashed to some number/string but just the kmer themselves) to store the equivalence classes.
Note that hash table is used to implement both dict and set in python.
def equivalence(k,sequence):
if k > sequence[0]['length']:
raise 'k should be smaller than isoform length'
= dict()
eq for s in sequence:
= kmers(k,s['seq'])
ks for mer in ks:
#if the kmer is not already in the dict, return an empty list/set
#convert that to list, and concat the new isoform with that kmer
#conver to set again to remove duplicate
set(list(eq.get(mer,list())) + [s['gene']])})
eq.update({mer:return eq
= equivalence(22,chr11) eq22
print(list(eq22.keys())[0],' ',list(eq22.values())[0])
GCCACGTCTGAGGCGGCTGTGG {'ENST00000410108', 'ENST00000382762'}
Psuedoalign
To account for reverse strand, we do the alignment twice, the first time with the original read, the second time with the reverse complementary of the original read. For each direction, we take the intersection of the lists of equivalence classes each kmer of the read maps to. When returning the set of isoforms the read could map to, we take the union of forward’s and reverse’s mapping.
Example of pseudo-align
= reads[1]
read1 def psuedoalign(read,k,eqdict):
#original
= [km for km in kmers(k,read)]
kmso = [eqdict.get(kmr,set()) for kmr in kmso]
eqclasso = set.intersection(*eqclasso)
os #reverse
= [km for km in kmers(k,rev(read))]
kmsr = [eqdict.get(kmr,set()) for kmr in kmsr]
eqclassr = set.intersection(*eqclassr)
rs #return the non-empty set
return list(rs.union(os))
Answer checking
= psuedoalign(read1['seq'],22,eq22)
iso print('p-aligned: ',iso[0])
print('labeled: ',read1['gene'])
p-aligned: ENST00000410108
labeled: ENST00000410108
Final Implementation
Obtaining the table from reads, reference, and k-length
wrapper please see visualization part for the final table for varying k-mer lengths = 22,28,36
Just a wrapper of the above that takes care of missing bases in reads. Mostly just 1, a very, very few with 2 (that we just pass).
= []
n_ct for r in reads:
if 'N' in r['seq']:
'seq'].count('N'))
n_ct.append(r[print("number of N's missing: "+str(set(n_ct)))
print("number of fragments: "+str([n_ct.count(i) for i in set(n_ct)]))
number of N's missing: {1, 2}
number of fragments: [5598, 12]
def palign(reads,k,eqdict):
= {} #alignment results
align
for r in reads:
= r['seq']
seq # We only handle the missing one base in read case
# By substituting in A/T/C/G
if seq.count('N') == 1:
for fix in ['A','T','C','G']:
= seq.replace('N',fix)
seq = psuedoalign(seq,k,eqdict)
iso if iso: #append the first sub that gives non empty alignment
break
#if not, empty list would be appended
elif seq.count('N') == 0:
= psuedoalign(seq,k,eqdict)
iso # We ignore the missing two bases cases because only 12
# We sort the list of aligned isoforms and then convert that to string
= list(iso)
ali = len(iso) #nums of isoforms in the eq class
num
ali.sort()= "".join(i+', ' for i in ali)
eqstr 'counts':align.get(eqstr,{'counts':0})['counts']+1,'num':num}})
align.update({eqstr:{
return pd.DataFrame({
'isoforms in the eq. class':list(align.keys()),
'counts':[d['counts'] for d in align.values()],
'num. of isoforms in the eq. class':[d['num'] for d in align.values()]}).sort_values('num. of isoforms in the eq. class')
# The actual alignment of each read to the isoform targets took approximately 2min
= palign(reads,22,eq22) result22
= result22.sort_values('num. of isoforms in the eq. class')
result result.head()
isoforms in the eq. class | counts | num. of isoforms in the eq. class | |
---|---|---|---|
2 | 230484 | 0 | |
8269 | ENST00000540830, | 13 | 1 |
8828 | ENST00000681339, | 1 | 1 |
8827 | ENST00000280346, | 1 | 1 |
2425 | ENST00000323959, | 2 | 1 |
print(sum(result22.counts))
print(len(reads))
1282526
1282526
= equivalence(28,chr11)
eq28 = equivalence(36,chr11)
eq36 = palign(reads,28,eq28)
result28 = palign(reads,36,eq36) result36
Visualization
Plotting the counts of reads mapping to equivalence classes of certain sizes v.s. certain sizes. For kmer length = 22,28,36, the visualizations are very similar, only very slight differences in counts.
def vis(res,k,ax):
= 'num. of isoforms in the eq. class'
num = res.groupby(num)['counts'].agg(sum).reset_index(0)
plot
plt.sca(ax)
plt.scatter(plot.counts,plot[num])'counts of reads')
plt.xlabel('number of isoforms')
plt.ylabel('kmer length = '+str(k))
plt.title(return plot
= plt.subplots(1,3,figsize=(15,5))
fig,ax = vis(result22,22,ax[0])
tbl22 = vis(result28,28,ax[1])
tbl28 = vis(result36,36,ax[2])
tbl36 #tbl42 = vis(result42,42,ax[3])
plt.tight_layout()
Discussion
- What I did:
Generate kmers based on a lists of given isoform targets (from chromosome 11)
Generate kmer-based equivalence classes, for a certain kmer, what isoforms has it
Write a psuedoalign function to align a read to equivalence classes which also handles the missing base (‘N’ in sequence) and the reverse complementary sitution.
- The implementaiton is hash table (python dict) based.
- The psuedoalignment is based on intersection, no backtracking or skipping if at one point the intersection gives an empty set.
The implementation time of the whole process is roughly:
- Create equivalence classes/indexing?: 0.5 ~ 1min
- Psuedoalign: 2.5 ~ 3min
- Potential problem:
- From the looks of it, I have about 22% of reads that ended up not being aligned to anything. This is proabably because my implementation has no ‘skipping’ or ‘backtracking’, so many reads are simply discarded when one of their kmers fail to match anything.
- Interesting part:
- When the kmer length is somewhere between 21-36, the equivalence class total counts are the same, just slight differences in the counts for each. Noticeably, the shape of the visualization of #isoform in eq.class of a certain size v.s. counts of reads mapping to classes of that size shows us that there are a few kmers/segment that could be mapped to a lot of taret isoforms. It’s possible that they are common segments in genes.