Skip to content

reidalderson/bioinformatics

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

86 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Python code for the bioinformatics analyses from Alderson, Adriaenssens, et al.

Searches a proteome for specific motifs and performs statistical tests.

Given a supplied proteome and queried peptide motif(s), the script will search for the number of instances of the motif(s) and calculate the expected number of instances based on amino acid frequencies in the proteome. If a list of disordered regions is supplied, then a comparison between structured and disordered regions of the proteome can be perofrmed. Flexible input within the amino-acid motif is permitted; for example, "X" allows any residue at that position and multiple residues indicate "or". Residue positions are separated by the _ character, and so the IxI/V motif would be inputted as, "IV_X_IV".

To execute this script, the user must have biopython, itertools, numpy, pandas, and scipy installed in their Python distribution. The user must provide a proteome file downloaded from UniProt or in FASTA format. For example outputs, see the images below:

  • Statistical tests to check for deviations from expectations based on amino-acid frequencies in the proteome


  • Comparison of IxI/V motif frequency in structured and disordered regions of the proteome



A UniProt proteome file can be loaded like this:

p = Proteome()
human = p.load_uniprot('human_proteome.tab')

# Note that the UniProt file should contain the following columns
# 'Entry', 'Entry name', 'Status', 'Protein names', 'Gene names', 'Organism', 'Length', 'Sequence'

We can calculate the length and amino-acid composition of the proteome as such:

length = p.proteome_length(human)                 # the number of amino acid in the proteome
composition = p.proteome_AA_fractions(human)      # the amino acid fractions in the proteome

print(length)
11373813

print(composition)
AA  Fraction
A  0.070190
C  0.022974
D  0.047391
E  0.071040
F  0.036479
G  0.065723
H  0.026239
I  0.043344
K  0.057259
L  0.099601
M  0.021318
N  0.035851
P  0.063116
Q  0.047705
R  0.056423
S  0.083255
T  0.053569
V  0.059672
W  0.012188
Y  0.026661

We can count the number of times that a specific motif occurs in the proteome:

query = 'IPV'                               # the motif of interest
count_motif = p.find_motifs(human, query)   # motif count and fraction WRT same-sized motifs

print(count_motif)
(2076, 0.0001825)

and for flexible input:

flex_query = 'IV_X_IV'                              # "_" separates residues, "X" is wild card, >1 AA means "or"
iterator = p.iterate_motif(flex_query)              # generates list of motifs consistent with flex input
count_motifs = p.find_flex_motifs(human, iterator)  # motif, count, and fraction WRT same-sized motifs

print(count_motifs)
    Motif  Count  Fraction
0    IAI   1322  0.000116
1    IAV   2035  0.000179
2    ICI    529  0.000047
3    ICV    617  0.000054
4    IDI   1228  0.000108
..   ...    ...       ...
75   VVV   2988  0.000263
76   VWI    379  0.000033
77   VWV    512  0.000045
78   VYI    900  0.000079
79   VYV   1137  0.000100

We can compare and plot the observed frequencies of particular motifs vs. those expected from amino acid composition alone:

expected_motifs = p.expected_motifs(human, iterator)   # Pandas dataframe : motif, probability, observed, expected

print(expected_motifs)
   Motif  Probability  Observed  Expected
0    IAI     0.000132    1322.0    1499.0
1    IAV     0.000182    2035.0    2064.0
2    ICI     0.000043     529.0     490.0
3    ICV     0.000059     617.0     675.0
4    IDI     0.000089    1228.0    1012.0
..   ...          ...       ...       ...
75   VVV     0.000212    2988.0    2416.0
76   VWI     0.000032     379.0     358.0
77   VWV     0.000043     512.0     493.0
78   VYI     0.000069     900.0     784.0
79   VYV     0.000095    1137.0    1079.0

# Create the label for the axes on the plot
xlab = []
for i, res in enumerate(query.split("_")):
    if len(res)>1:
        xlab.append('['+'/'.join(list(res))+']')
    else:
        if res == 'X':
            xlab.append(res.lower())
        else:
            xlab.append(res)
xlab = ''.join(xlab)

# Now set up the plot
xy_max = max(1.05*np.max(expected['Expected'][1:]), 1.05*np.max(expected['Observed'][1:]))
plt.plot([0, xy_max], [0, xy_max], 'k-', lw=0.5, label='y = x')
plt.plot(expected['Expected'][1:], expected['Observed'][1:], 'ro')
plt.xlabel('Expected number of %s motifs' % xlab)
plt.ylabel('Observed number of %s motifs' % xlab)
plt.legend(loc='upper left')
plt.xlim(0, xy_max)
plt.ylim(0, xy_max)
plt.tight_layout()
plt.show()


We can subtract out the contribution of a smaller region of a proteome from the entire proteome and perform a chi-squared analysis:
query = 'IV_X_IV'
IDRs = p.load_fasta("human_idrs.fasta")   # Load a FASTA file with disordered regions
p.calc_stats(human, query, difference='y', sub_proteome=IDRs)   # calculate x2 values (observed vs. expected)
p.plot_chisq(query, pval=0.01, diff='y', main=human, sub=IDRs, title='Structured', save='n') # plot x2 values

We can write out files containing the motif counts in structured vs. disordered regions

query = 'IV_X_IV'                                       # load the query motif
iterator = start.iterate_motif(query)                   # get all motifs 
data_array = np.zeros(shape=(2, len(iterator), 2))      # set up 3D data array 

# store the data in a numpy array and write to disk
fout1 = open(('structured_count_%s_motifs.txt' % query), 'w')
fout2 = open(('disordered_count_%s_motifs.txt' % query), 'w')
fout1.write('%s\t%s\n' % ('Motif', 'Count'))
fout2.write('%s\t%s\n' % ('Motif', 'Count'))

# first store the data, calculate the total count, and write out the total
for i, motif in enumerate(iterator):
    data_array[0, i, 0] = p.find_motifs(human, motif)[0]
    data_array[1, i, 0] = p.find_motifs(IDRs, motif)[0]
structured_count = np.sum(data_array[0,:,0]) - np.sum(data_array[1,:,0])
fout1.write('%s\t%.0f\n' % ('Total', structured_count))
fout2.write('%s\t%.0f\n' % ('Total', np.sum(data_array[1,:,0])))

# then write out the count for each motif 
for j, val in enumerate(iterator):
    fout1.write('%s\t%.0f\n' % (val, data_array[0,j,0]-data_array[1,j,0]))
    fout2.write('%s\t%.0f\n' % (val, data_array[1,j,0]))
fout1.close()
fout2.close()

Finally, we can identify all proteins that contain a particular motif

query = 'IPV'   # the motif we are searching for

# prepare headers for the text files
idr_header = ('ENSEMBL IDs of proteins in the IDR-ome that contain at least one %s motif' % query)
prot_header = ('UniProt IDs of proteins in the proteome that contain at least one %s motif' % query)

# write out text files with ENSEMBL or UniProt IDs for proteins that contain the queried motif
np.savetxt(('disordered_protein_%s_motifs.txt' % query), p.find_proteins(IDRs, query, 'fasta'), fmt='%s', header=idr_header)
np.savetxt(('proteome_proteins_%s_motifs.txt' % query), p.find_proteins(human, query, 'uniprot'), fmt='%s', header=prot_header)

See these pages for more information or installation of the various Python packages:
biopython
itertools
numpy
numpy via a pre-built package
matplotlib
pandas
scipy

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages