pycam
diff --git a/Introduction_to_python_day_2_session_4.ipynb b/Introduction_to_python_day_2_session_4.ipynb
+66-6
@@ -31,6 +31,13 @@
     "## Working with sequences"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can create a sequence by defining a `Seq` object from a string. `Bio.Seq.Seq()` takes a string as input and converts it into a `Seq` object. We can print the sequence, access individual residues, get its length and use other methods to compute summary statistics."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -49,6 +56,13 @@
     "print(my_seq.count( \"A\" ))"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can use functions from `Bio.SeqUtils` to compute summary properties of a sequence, such as its molecular weight."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -63,6 +77,13 @@
     "print(molecular_weight( my_seq ))"
    ]
   },
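As a plain-Python illustration of the kind of summary statistic discussed above, the sketch below computes the GC percentage of a nucleotide string. The function name `gc_percent` and the example sequence are hypothetical, and this is not Biopython's own implementation, only a minimal version of the same idea.

```python
def gc_percent(sequence):
    """Return the percentage of G and C residues in a nucleotide string."""
    if not sequence:
        # Avoid dividing by zero for an empty sequence.
        return 0.0
    upper = sequence.upper()
    gc_count = upper.count("G") + upper.count("C")
    return 100.0 * gc_count / len(upper)


example = "AGTACACTGGT"  # hypothetical example sequence
print(len(example))                    # 11
print(example.count("A"))              # 3
print(round(gc_percent(example), 2))   # 45.45
```

Counting on the upper-cased string keeps the result the same for soft-masked (lower-case) input.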
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Protein sequences in one-letter code can be converted into three-letter codes using the `seq3` utility."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -75,6 +96,13 @@
     "print(seq3( my_seq ))"
    ]
   },
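To make the one-letter to three-letter conversion concrete, here is a minimal plain-Python sketch of the mapping for the 20 standard amino acids. The names `THREE_LETTER` and `to_three_letter` are hypothetical, and the handling of unknown residues (`Xaa`) is an assumption of this sketch rather than a statement about how `seq3` behaves.

```python
# One-letter to three-letter codes for the 20 standard amino acids.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(protein, unknown="Xaa"):
    """Convert a one-letter protein string into concatenated three-letter codes."""
    # Residues outside the standard 20 are replaced by `unknown` (sketch assumption).
    return "".join(THREE_LETTER.get(residue, unknown) for residue in protein.upper())


print(to_three_letter("MYGK"))  # MetTyrGlyLys
```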
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Alphabets define how strings are treated as sequence objects. The `Bio.Alphabet` module defines the alphabets available in Biopython. `Bio.Alphabet.IUPAC` provides basic definitions for DNA, RNA and proteins."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -129,6 +157,13 @@
     "### Parsing sequence file format: FASTA files"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Sequence files are plain text, so they can be opened and read the same way we read other files."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -141,6 +176,13 @@
     "    print(fileObj.read())"
    ]
   },
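Reading the file as plain text returns one long string. The sketch below shows, under simple assumptions, how FASTA records could be separated by hand: a header line starts with `>` and the sequence may span several lines. The function `read_fasta` and the in-memory example data are hypothetical, and an `io.StringIO` object stands in for an open file.

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from an open FASTA handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            header = line[1:].split()
            identifier = header[0] if header else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# Hypothetical two-record FASTA text used in place of a real file.
fasta_text = ">seq1 first\nACGT\nAC\n>seq2\nGGGCC\n"
records = list(read_fasta(io.StringIO(fasta_text)))
for identifier, sequence in records:
    print(identifier, len(sequence))
```

This handles only the basic layout and does no validation.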
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Biopython provides dedicated functions for parsing sequence files, such as `SeqIO.parse()`."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -155,8 +197,15 @@
     "fileObj = open(\"data/glpa.fa\")\n",
     "\n",
     "for protein in SeqIO.parse(fileObj, 'fasta'):\n",
-    "  print(protein.id)\n",
-    "  print(protein.seq)"
+    "    print(protein.id)\n",
+    "    print(protein.seq)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Sequence objects can be written to a file with the function `SeqIO.write()`, which takes the records to write, the output file handle and the sequence file format."
    ]
   },
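For comparison, this is roughly what writing a FASTA record involves when done by hand. It is a plain-Python sketch, not how `SeqIO.write()` is implemented: the function `write_fasta`, the 60-character line width and the tuple layout of each record are assumptions made for illustration, and an in-memory buffer replaces the output file.

```python
import io


def write_fasta(records, handle, width=60):
    """Write (identifier, description, sequence) tuples to handle as FASTA.

    Returns the number of records written.
    """
    count = 0
    for identifier, description, sequence in records:
        header = ">{} {}".format(identifier, description).rstrip()
        handle.write(header + "\n")
        # Wrap the sequence so that no line exceeds `width` characters.
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")
        count += 1
    return count


sequence = (
    "MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAHEVSEI"
    "SVRTVYPPEEETGERVQLAHHFSEPEITLIIFG"
)
buffer = io.StringIO()  # stands in for open("mySeqFile.fa", "w")
written = write_fasta([("MYID", "my description", sequence)], buffer)
print(written)
print(buffer.getvalue())
```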
   {
@@ -174,7 +223,7 @@
     "\n",
     "sequence = 'MYGKIIFVLLLSEIVSISASSTTGVAMHTSTSSSVTKSYISSQTNDTHKRDTYAATPRAHEVSEISVRTVYPPEEETGERVQLAHHFSEPEITLIIFG'\n",
     "\n",
-    "fileObj = open( \"biopython.fa\", \"w\")\n",
+    "fileObj = open( \"mySeqFile.fa\", \"w\")\n",
     "  \n",
     "seqObj = Seq(sequence, IUPAC.protein)\n",
     "proteinObjs = [SeqRecord(seqObj, id=\"MYID\", description='my description'),]\n",
@@ -194,6 +243,13 @@
     "## Connecting with biological databases"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Sequences can be searched and downloaded from public databases. "
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -246,8 +302,12 @@
     "collapsed": true
    },
    "source": [
-    "- Write the reverse complement function using BioPython\n",
-    "- Retrieve a FASTA file, calculate sequence length, remove short sequences, calculate GC content, write sequence file into another file format"
+    "- Retrieve the FASTA file `data/sample.fa` and answer the following questions:\n",
+    "  - How many sequences are there in the file?\n",
+    "  - What are the IDs and the lengths of the longest and the shortest sequences?\n",
+    "  - Create a new object that contains only sequences longer than 500 bp. What is the average length of these sequences?\n",
+    "  - Calculate and print the GC content percentage of each sequence.\n",
+    "  - Write the newly created sequence object into a FASTA file named `sample.long.fa`."
    ]
   },
   {
@@ -274,7 +334,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.5.2"
+   "version": "3.4.3"
   }
  },
  "nbformat": 4,