|
31 | 31 | "## Working with sequences"
|
32 | 32 | ]
|
33 | 33 | },
|
| 34 | + { |
| 35 | + "cell_type": "markdown", |
| 36 | + "metadata": {}, |
| 37 | + "source": [ |
| 38 | + "We can create a sequence by constructing a `Seq` object from a string: `Seq()` (from the `Bio.Seq` module) takes a string as input and converts it into a `Seq` object. We can then print the sequence, access individual residues, get its length, and use other methods to compute summary statistics. " |
| 39 | + ] |
| 40 | + }, |
34 | 41 | {
|
35 | 42 | "cell_type": "code",
|
36 | 43 | "execution_count": null,
|
|
49 | 56 | "print(my_seq.count( \"A\" ))"
|
50 | 57 | ]
|
51 | 58 | },
|
| 59 | + { |
| 60 | + "cell_type": "markdown", |
| 61 | + "metadata": {}, |
| 62 | + "source": [ |
| 63 | + "We can use functions from `Bio.SeqUtils` to compute summary properties of a sequence, such as its molecular weight. " |
| 64 | + ] |
| 65 | + }, |
52 | 66 | {
|
53 | 67 | "cell_type": "code",
|
54 | 68 | "execution_count": null,
|
|
63 | 77 | "print(molecular_weight( my_seq ))"
|
64 | 78 | ]
|
65 | 79 | },
|
| 80 | + { |
| 81 | + "cell_type": "markdown", |
| 82 | + "metadata": {}, |
| 83 | + "source": [ |
| 84 | + "Protein sequences in one-letter code can be converted to three-letter code using the `seq3` utility from `Bio.SeqUtils`. " |
| 85 | + ] |
| 86 | + }, |
66 | 87 | {
|
67 | 88 | "cell_type": "code",
|
68 | 89 | "execution_count": null,
|
|
75 | 96 | "print(seq3( my_seq ))"
|
76 | 97 | ]
|
77 | 98 | },
|
| 99 | + { |
| 100 | + "cell_type": "markdown", |
| 101 | + "metadata": {}, |
| 102 | + "source": [ |
| 103 | + "An alphabet defines how a string is treated as a sequence object. The `Bio.Alphabet` module defines the alphabets available in Biopython, and `Bio.Alphabet.IUPAC` provides the basic definitions for DNA, RNA and proteins. " |
| 104 | + ] |
| 105 | + }, |
78 | 106 | {
|
79 | 107 | "cell_type": "code",
|
80 | 108 | "execution_count": null,
|
|
129 | 157 | "### Parsing sequence file format: FASTA files"
|
130 | 158 | ]
|
131 | 159 | },
|
| 160 | + { |
| 161 | + "cell_type": "markdown", |
| 162 | + "metadata": {}, |
| 163 | + "source": [ |
| 164 | + "Sequence files are plain text, so they can be opened and read like any other text file. " |
| 165 | + ] |
| 166 | + }, |
132 | 167 | {
|
133 | 168 | "cell_type": "code",
|
134 | 169 | "execution_count": null,
|
|
141 | 176 | " print(fileObj.read())"
|
142 | 177 | ]
|
143 | 178 | },
|
| 179 | + { |
| 180 | + "cell_type": "markdown", |
| 181 | + "metadata": {}, |
| 182 | + "source": [ |
| 183 | + "Biopython also provides dedicated functions for parsing sequence files, such as `SeqIO.parse()`. " |
| 184 | + ] |
| 185 | + }, |
144 | 186 | {
|
145 | 187 | "cell_type": "code",
|
146 | 188 | "execution_count": null,
|
|
155 | 197 | "fileObj = open(\"data/glpa.fa\")\n",
|
156 | 198 | "\n",
|
157 | 199 | "for protein in SeqIO.parse(fileObj, 'fasta'):\n",
|
158 |
| - " print(protein.id)\n", |
159 |
| - " print(protein.seq)" |
| 200 | + " print(protein.id)\n", |
| 201 | + " print(protein.seq)" |
| 202 | + ] |
| 203 | + }, |
| 204 | + { |
| 205 | + "cell_type": "markdown", |
| 206 | + "metadata": {}, |
| 207 | + "source": [ |
| 208 | + "Sequence records can be written to a file with the function `SeqIO.write()`. We need to provide the records, an output file handle (or file name), and the sequence file format. " |
160 | 209 | ]
|
161 | 210 | },
|
162 | 211 | {
|
|
174 | 223 | "\n",
|
175 | 224 | "sequence = 'MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAHEVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFG'\n",
|
176 | 225 | "\n",
|
177 |
| - "fileObj = open( \"biopython.fa\", \"w\")\n", |
| 226 | + "fileObj = open( \"mySeqFile.fa\", \"w\")\n", |
178 | 227 | " \n",
|
179 | 228 | "seqObj = Seq(sequence, IUPAC.protein)\n",
|
180 | 229 | "proteinObjs = [SeqRecord(seqObj, id=\"MYID\", description='my description'),]\n",
|
|
194 | 243 | "## Connecting with biological databases"
|
195 | 244 | ]
|
196 | 245 | },
|
| 246 | + { |
| 247 | + "cell_type": "markdown", |
| 248 | + "metadata": {}, |
| 249 | + "source": [ |
| 250 | + "Sequences can be searched and downloaded from public databases. " |
| 251 | + ] |
| 252 | + }, |
197 | 253 | {
|
198 | 254 | "cell_type": "code",
|
199 | 255 | "execution_count": null,
|
|
246 | 302 | "collapsed": true
|
247 | 303 | },
|
248 | 304 | "source": [
|
249 |
| - "- Write the reverse complement function using BioPython\n", |
250 |
| - "- Retrieve a FASTA file, calculate sequence length, remove short sequences, calculate GC content, write sequence file into another file format" |
| 305 | + "- Read the FASTA file `data/sample.fa` and answer the following questions:\n", |
| 306 | + " - How many sequences are there in the file?\n", |
| 307 | + " - What are the IDs and lengths of the longest and the shortest sequences?\n", |
| 308 | + " - Create a new object that contains only the sequences longer than 500 bp. What is the average length of these sequences?\n", |
| 309 | + " - Calculate and print the GC content (as a percentage) of each sequence.\n", |
| 310 | + " - Write the newly created sequence object to a FASTA file named `sample.long.fa`." |
251 | 311 | ]
|
252 | 312 | },
|
253 | 313 | {
|
|
274 | 334 | "name": "python",
|
275 | 335 | "nbconvert_exporter": "python",
|
276 | 336 | "pygments_lexer": "ipython3",
|
277 |
| - "version": "3.5.2" |
| 337 | + "version": "3.4.3" |
278 | 338 | }
|
279 | 339 | },
|
280 | 340 | "nbformat": 4,
|
|
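The exercise added in this commit (count the sequences, find the longest and shortest, keep those above a length cutoff, report GC content, write the result back out) can be sketched in plain Python. This is only an illustration under stated assumptions, not the notebook's intended solution: the helpers `read_fasta`, `gc_percent`, `filter_long` and `write_fasta` are hypothetical names introduced here, they use only the standard library, and the in-memory text stands in for `data/sample.fa`. In the notebook itself, `SeqIO.parse()` and `SeqIO.write()` would take the place of the hand-written reader and writer.

```python
import io


def read_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA-formatted text handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record_id is not None:
                yield record_id, "".join(chunks)
            # The ID is the first whitespace-delimited token after '>'.
            header = line[1:].split()
            record_id, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def gc_percent(seq):
    """Return the percentage of G and C characters in seq (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def filter_long(records, min_length=500):
    """Keep only records whose sequence is strictly longer than min_length."""
    return [(rid, seq) for rid, seq in records if len(seq) > min_length]


def write_fasta(records, handle, width=60):
    """Write (record_id, sequence) pairs as FASTA, wrapping lines at width."""
    for rid, seq in records:
        handle.write(">" + rid + "\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")


# Hypothetical toy data standing in for data/sample.fa.
example = io.StringIO(">seq1 first record\nACGT\nGGCC\n>seq2\nATAT\n")
records = list(read_fasta(example))

print("number of sequences:", len(records))
longest = max(records, key=lambda rec: len(rec[1]))
shortest = min(records, key=lambda rec: len(rec[1]))
print("longest:", longest[0], len(longest[1]))
print("shortest:", shortest[0], len(shortest[1]))
for rid, seq in records:
    print(rid, "GC%:", gc_percent(seq))
```

With real data the cutoff would be the 500 bp from the exercise, for example `write_fasta(filter_long(records), open("sample.long.fa", "w"))`; the toy records above are far shorter, so a smaller `min_length` is needed to see any output.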