This repository was archived by the owner on Jun 25, 2018. It is now read-only.

Commit 4fcba7f

Added exercise 2.4.1 and its solution
1 parent ae20ad6 commit 4fcba7f

3 files changed: +1999 -6 lines changed

Diff for: Introduction_to_python_day_2_session_4.ipynb

+66 -6
@@ -31,6 +31,13 @@
 "## Working with sequences"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We can create a sequence by defining a `Seq` object from a string. `Bio.Seq.Seq()` takes a string as input and converts it into a `Seq` object. We can then print the sequence, its individual residues, and its length, and use other functions to get summary statistics."
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -49,6 +56,13 @@
 "print(my_seq.count( \"A\" ))"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"We can use functions from `Bio.SeqUtils` to compute summary properties of a sequence, such as its molecular weight."
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -63,6 +77,13 @@
 "print(molecular_weight( my_seq ))"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"One-letter protein sequences can be converted into three-letter codes using the `seq3` utility."
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -75,6 +96,13 @@
 "print(seq3( my_seq ))"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Alphabets define how strings are treated as sequence objects. The `Bio.Alphabet` module defines the alphabets available in Biopython, and `Bio.Alphabet.IUPAC` provides the basic definitions for DNA, RNA and proteins."
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -129,6 +157,13 @@
 "### Parsing sequence file format: FASTA files"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Sequence files are plain text, so they can be opened and read the same way as any other file."
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -141,6 +176,13 @@
 " print(fileObj.read())"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Biopython also provides dedicated functions for parsing sequence files, such as `SeqIO.parse()`."
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -155,8 +197,15 @@
 "fileObj = open(\"data/glpa.fa\")\n",
 "\n",
 "for protein in SeqIO.parse(fileObj, 'fasta'):\n",
-" print(protein.id)\n",
-" print(protein.seq)"
+" print(protein.id)\n",
+" print(protein.seq)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Sequence records can be written to a file with the function `SeqIO.write()`, which takes the records, an output file handle (or file name), and the sequence file format."
 ]
 },
 {
@@ -174,7 +223,7 @@
 "\n",
 "sequence = 'MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAHEVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFG'\n",
 "\n",
-"fileObj = open( \"biopython.fa\", \"w\")\n",
+"fileObj = open( \"mySeqFile.fa\", \"w\")\n",
 " \n",
 "seqObj = Seq(sequence, IUPAC.protein)\n",
 "proteinObjs = [SeqRecord(seqObj, id=\"MYID\", description='my description'),]\n",
@@ -194,6 +243,13 @@
 "## Connecting with biological databases"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"Sequences can be searched and downloaded from public databases. "
+]
+},
 {
 "cell_type": "code",
 "execution_count": null,
@@ -246,8 +302,12 @@
 "collapsed": true
 },
 "source": [
-"- Write the reverse complement function using BioPython\n",
-"- Retrieve a FASTA file, calculate sequence length, remove short sequences, calculate GC content, write sequence file into another file format"
+"- Read the FASTA file `data/sample.fa` and answer the following questions:\n",
+" - How many sequences are there in the file?\n",
+" - What are the IDs and the lengths of the longest and the shortest sequences?\n",
+" - Create a new object that contains only sequences with length > 500 bp. What is the average length of these sequences?\n",
+" - Calculate and print the GC content (as a percentage) of each sequence.\n",
+" - Write the newly created sequence object into a FASTA file named `sample.long.fa`"
 ]
 },
 {
@@ -274,7 +334,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.5.2"
+"version": "3.4.3"
 }
 },
 "nbformat": 4,

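The exercise added in this commit (count the records, find the longest and shortest, keep those above a length cutoff, report GC content, write a new FASTA file) can be sketched without any dependencies. The block below is a minimal standard-library sketch, not the solution file from this commit: `parse_fasta`, `gc_percent`, `write_fasta`, the toy sequences and the `min_length = 5` cutoff are all hypothetical stand-ins. In the notebook itself one would presumably use `SeqIO.parse()` and `SeqIO.write()` with the exercise's 500 bp cutoff.

```python
import io


def parse_fasta(handle):
    """Yield (record_id, sequence) tuples from FASTA-formatted text."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # The ID is the first whitespace-delimited token after '>'.
            fields = line[1:].split()
            record_id, chunks = (fields[0] if fields else ""), []
        elif record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def gc_percent(seq):
    """Return the percentage of G and C residues (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def write_fasta(records, handle, width=60):
    """Write (record_id, sequence) tuples as FASTA, wrapping lines at `width`."""
    for record_id, seq in records:
        handle.write(">{}\n".format(record_id))
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")


# Toy input standing in for data/sample.fa (hypothetical sequences).
toy_fasta = io.StringIO(
    ">seq1 first record\nATGC\nGGCC\n"
    ">seq2\nAT\n"
    ">seq3\nGGGCCCATAT\n"
)
records = list(parse_fasta(toy_fasta))

n_records = len(records)
longest = max(records, key=lambda rec: len(rec[1]))
shortest = min(records, key=lambda rec: len(rec[1]))

# The exercise asks for > 500 bp; the toy data uses a cutoff of 5 instead.
min_length = 5
long_records = [rec for rec in records if len(rec[1]) > min_length]
mean_length = sum(len(seq) for _, seq in long_records) / len(long_records)

gc_by_id = {record_id: gc_percent(seq) for record_id, seq in records}

# Write the filtered records to an in-memory handle instead of sample.long.fa.
out = io.StringIO()
write_fasta(long_records, out)

print(n_records, longest[0], shortest[0], mean_length)
for record_id, gc in sorted(gc_by_id.items()):
    print("{}\t{:.1f}".format(record_id, gc))
```

To run this against a real file, replace `toy_fasta` with `open("data/sample.fa")` and `out` with a handle opened for writing, and set `min_length = 500`. Note that `mean_length` assumes at least one record passes the cutoff.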