CS 321 Bioinformatics Project

Table of contents:

Starter Code
Scrum Process
Project Requirements

Starter Code

This repository contains:

the expected project package structure, in the src/ folder
some partial implementation of classes, in the src/main/java/cs321/ folder
sample JUnit tests, in the src/test/java/cs321/ folder
sample input data and expected results, in the data/ folder
a wrapper for the gradle build tool, which simplifies installing and running gradle. In turn, gradle facilitates and handles:
- Java library (e.g., JUnit) dependency management
- Compiling the code
- Generating self-containing jars
- Running classes
- Running unit tests

❗ NOTE: Do NOT fork this repository, because the forked repository cannot have its own GitHub issues, which will be used as Scrum tasks.

❗ NOTE: Do NOT modify the package structure in the src/ folder, otherwise the project may not be built correctly using gradle.

Ensure you have the correct JDK version

Use the following command to check your JDK version installed:

$ javac -version

This project does not work with JDK 19.

Instead it is recommended to use either JDK 8, JDK 11 or JDK 17.

📖 See this wiki page for additional details regarding the supported Java versions and links to download the correct JDK versions.

One-Time Team Setup

One team member should:

Create a new private GitHub repository and
- Make sure to name the private repository as specified by your instructor
- Give repository write permissions to all the other team members
- Give repository read permissions to the instructor and the tutors or teaching assistants (they will provide you with their GitHub ids)
Clone (not fork) this repository and verify that gradle can be run.

$ git clone https://github.com/BoiseState/CS321_Bioinformatics.git
$ cd CS321_Bioinformatics
$ ./gradlew tasks

The last command should perform a one-time gradle setup, followed by listing all the available gradle tasks and their descriptions.

NOTE: On Windows, the ./gradlew command should be replaced with gradlew (which will call the gradlew.bat file)

The same team member should push the cloned repository to the new private repository. This can be done by changing the remote URL of the cloned repository to the new private repository's URL.

$ git remote set-url origin NEW_URL_OF_YOUR_NEW_PRIVATE_REPOSITORY
$ git remote -v
$ git push

The other team members should then clone the newly created student repository containing the starter code.

Compile and Run the Project from the Command Line

Gradle allows running unit tests and code from IDEs, or the command line, as described below.

Run all the JUnit tests and print a summary of the results:

$ ./gradlew test

Run the main method from GeneBankCreateBTree.java and pass the appropriate <arguments>:

$ ./gradlew createJarGeneBankCreateBTree
$ java -jar build/libs/GeneBankCreateBTree.jar <arguments>

Run the main method from GeneBankSearchBTree.java and pass the appropriate <arguments>:

$ ./gradlew createJarGeneBankSearchBTree
$ java -jar build/libs/GeneBankSearchBTree.jar <arguments>

Run the project from an IDE: IntelliJ IDEA, VSCode or Eclipse

Eclipse

This repository is an Eclipse project, and can be directly opened in Eclipse.

📖 See this wiki page for additional instructions to run this project in Eclipse.

IntelliJ IDEA

This project can be opened with IntelliJ IDEA.

💡 HINT: As a student, you can get IntelliJ IDEA for free (using an academic license) by signing up with your BSU email.

📖 See this wiki page for additional instructions to run this project in IntelliJ IDEA.

VSCode

Alternatively, this project can be opened with VSCode.

📖 See this wiki page for detailed instructions to run this project in VSCode.

Notes for creating additional files and tests, while keeping the Gradle project structure

You can add as many classes as you want in src/main/java, and gradle should build them automatically. In other words, you should not have to make any changes to the build.gradle.

Also, you can add new test files with new tests cases in src/test/java and those will be run automatically by gradle or your IDE.

Scrum Process

The focus of this project is to learn about data structures, while working effectively in a group. In addition, given the small project scope, and the fixed set of requirements that are already defined (and will not need to be elicited with the use of a Product Owner), the team can customize the Scrum process learned in CS-HU 271 and focus exclusively on:

creating tasks
linking commits to task IDs (e.g., Implements task #123)
Test-Driven Development and unit testing. The starter code already contains a few sample unit tests that can be run from the command line.

Scrum Board

Creating the tasks upfront will allow dividing and assigning the work in order to provide transparency and accountability within the team.

Use the Projects tab (i.e., a simplified version of ZenHub) to configure your own team Scrum board, based on this project example (feel free to copy the contents of these tasks to your Scrum board).

Your Scrum board should contain the following columns (pipelines):

Column Name	Description
Product Backlog	All (unassigned) tasks that are going to be completed by the team throughout the duration of the project
Sprint Backlog	Tasks proposed to be implemented in the current week (sprint), assigned to developers
In Progress	Tasks currently being worked on
Review/QA	Tasks ready to be reviewed by another team member
Closed	Completed tasks, whose corresponding code is integrated in the `master` branch

Tasks should be assigned to the developer working on them. Each team member should add to the project log file, Project-Log.md, the tasks (e.g., https://github.com/StudentUserNameHostingRepo/CS321_Bioinformatics/issues/123) completed that week, as described in the progress reports.

Here is an example of a valid task written in engineering language that is assigned to a developer. This task should be referenced by a commit containing a message similar to Implements task #3.

As a warm up exercise, each team member should create a task similar to task #2 and then edit the README-submission.md file with their information.

Here is a sample project log from a team from a previous semester: Project-Log-sample.md

Project Requirements

Table of contents:

Introduction
Background
Specifications
Design Issues
Implementation
Using a Cache
Using a Database
Useful Examples
Test Scripts
Progress Reports
Submission

1. Introduction

In this assignment, we will solve a problem from the field of Bioinformatics using BTrees. The amount of data that we have to handle is large and any data structure is not likely to fit in memory. Hence a BTree is a good choice for the data structure.

2. Background

Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. One of the primary aims of Bioinformatics is to attempt to determine the structure and meaning of the human genome. The human genome is a complete set of human DNA. The Human Genome project was started in 1990 by the United States Department of Energy and the U.S. National Institutes of Health. By April 14, 2003 99% of the Human Genome had been sequenced with 99.9% accuracy. The Human Genome is a big strand of 4 different organic chemicals, known as bases, which are:

Adenine
Cytosine
Thiamine
Guanine

Biologists often call them A, C, T, G for short. The bases A and T are always paired together. Similarly the bases C and G are always paired together. So when we look at the DNA representation, only one side is listed. For example: the DNA sequence: AATGC actually represents two sequences: AATGC and its complement TTACG (replace A by T, T by A, C by G and G by C). Even with only half the bases represented in a DNA sequence, the human genome is about 2.87 billion characters long!

See below an image of the DNA as well as the chemical structure of the bases.

The primary source for getting the human genome (as well as all other mapped organisms) is in the National Center for Biotechnology Information (NCBI) website (http://www.ncbi.nlm.nih.gov/). See this page for downloading human genome data

We will be using the GeneBank files from NCBI. The format is described with a sample file at http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

Most of the information in a GeneBank file is of interest to biologists. We will only be interested in the actual DNA sequences that are embedded in these files rather than in the intervening annotations.

3. Specifications

3.1. Input Files

The GeneBank files have a bunch of annotations followed by the keyword ORIGIN. The DNA sequences start from the next line. Each line has 60 characters (one of A, T, C, G, could be lower/upper case) until the end of sequence, which is denoted by // on a line by itself. Sometimes you will see the character N, which denotes that the sequence is not known at that character. You would skip these characters. One GeneBank file may have several DNA sequences in it, each marked by ORIGIN and // tags.

When we reach a character N, we assume that the sequence has ended. Similarly, when we reach //, we also assume that the sequence has ended. So at those points, we reset the sequence that we were building and start over when we find the next ORIGIN or the next valid character after seeing a N.

Some sample genome files (having the *.gbk extension) can be found in the data/files_gbk/ folder.

3.2. Problem

For a given GeneBank file, we want to convert it into a BTree with each object being a DNA subsequence of specified length k (where 1 ≤ k ≤ 31). We will take the DNA sequence from the GeneBank file and break it into subsequences of length k each. We are interested in all subsequences with length k. For example, in the sequence AATTCG, the subsequences of length three are: AAT, ATT, TTC and TCG. Once we have a BTree for a length k, we want to be able to search for query subsequences of length k. The search returns the frequency of occurrence of the query string (which can be zero if it is not found).

The biological motivation behind is to study the frequency of different length subsequences to see if they are random or that some subsequences are more likely to be found in the DNA.

4. Design Issues

4.1. Saving memory

Since we only have four possible bases (A, C, G and T), we can optimize on space by converting our DNA strings to base 4 numbers. Then we can represent each base by a 2-bit binary number, as shown below:

DNA Base	2-bit binary number
A	00
T	11
C	01
G	10

Note that we have made the binary representations for complementary bases be binary complements as well. With this compact representation, we can store a 31 length subsequence in a 64-bit long primitive type in Java.

4.2. Key Values

Note that the binary compact representation of the subsequences will result in a unique 64-bit integer value. Hence we can directly use that as our key value.

4.3. Class Design

We will need a BTree class as well as a BTreeNode class. The objects that we store in the BTree will be similar to the objects we stored in the previous Hashtable assignment. You may call the relevant class TreeObject to represent the objects.

5. Implementation

You will have two programs:

one that creates a BTree from a given GeneBank file and
another for searching in the specified BTree for subsequences of given length. The search program assumes that the user specified the proper BTree to use depending upon the query length.

The main Java classes should be named GeneBankCreateBTree and GeneBankSearchBTree.

5.1. Program Arguments

The required arguments for the two programs are shown below:

java -jar build/libs/GeneBankCreateBTree.jar --cache=<0|1>  --degree=<btree degree> 
	--gbkfile=<gbk file> --length=<sequence length> [--cachesize=<n>] [--debug=0|1|2|3]


java -jar build/libs/GeneBankSearchBTree.jar --cache=<0/1> --degree=<btree degree> 
	--btreefile=<BTree file> --length=<sequence length> --queryfile=<query file> 
	[--cachesize=<n>] [--debug=0|1|2]

Note that the arguments can be provided in any order.

<0 (no cache) | 1 (cache)> specifies whether the program should use cache (value 1) or no cache (value 0); if the value is 1, the <cache_size> has to be specified
<degree> is the degree to be used for the B-Tree. If the user specifies 0, then your program should choose the optimum degree based on a disk block size of 4096 bytes and the size of your B-Tree node on disk
<gbk_file> is the input *.gbk file containing the input DNA sequences
<subsequence_length> is an integer that must be between 1 and 31 (inclusive)
<b-tree_file> is the B-Tree file generated by the GeneBankCreateBTree program
<query file> contains all the DNA strings of a specific subsequence length that we want to search for in the specified B-Tree file. The strings are one per line and they all must have the same length as the DNA subsequences in the B-Tree file. The DNA strings use A, C, T, and G (either lower or upper case)
[<cache size>] is an integer between 100 and 500 (inclusive) that represents the maximum number of BTreeNode objects that can be stored in memory
[<debug level>] is an optional argument with a default value of zero.
- It must support at least the following values for GeneBankSearchBTree:
  - 0: The output of the queries should be printed on the standard output stream. Any diagnostic messages, help and status messages must be be printed on standard error stream.
- It must support at least the following values for GeneBankCreateBTree:
  - 0: Any diagnostic messages, help and status messages must be be printed on standard error stream.
  - 1: The program writes a text file named dump, containing the frequency and the DNA string (corresponding to the key stored) in an inorder traversal, and has the following line format:

<frequency> <DNA string>

5.2. Additional Implementation Remarks

5.2.1. Your programs should always keep the root node in the memory

Write the root node to disk file only at the end of the program and read it in when the program starts up. In addition, your program can only hold a few nodes in memory. In other words, you can only use a few BTreeNode variables (including the root) in your program (e.g., root, parent, current, child, temporary node for split). However, using the <1 (cache)> option, you can store <cache_size> BTreeNode objects in the cache.

5.2.2. Metadata storage

We need to store some metadata about the B-Tree on disk. For example, we can store the degree of the tree, the byte offset of the root node (so we can find it), the number of nodes, and so on. This information could be stored in separate metadata file or it can be stored at the beginning of the B-Tree file. That is a design choice that is left to the student team.

5.2.3. The B-Tree should be stored as a binary data file on the disk (and not as a text file)

If the name of the GeneBank file is xyz.gbk, the subsequence length is <k> and the B-Tree degree is <t>, then the name of the B-Tree file should be xyz.gbk.btree.data.<k>.<t>.

📖 Describe the layout of the B-Tree file on disk as well as any other relevant observations in the README-submission.md file.

5.2.4 To query for a subsequence, we also query for its complement

Note that our BTree stores only one side of the DNA strand but there is a complementary strand as well. This implies that when we search for a subsequence, we also need to search for its complement to get the correct answer. For example: the DNA sequence: AATGC actually represents two sequences: AATGC and its complement TTACG (replace A by T, T by A, C by G and G by C). To search for AATGC, we will search for both AATGC and its complement TTACG to get the result.

5.2.5. Running Tests

The data/files_gbk/ folder contains:

sample gbk data files

The data/queries/ folder contains:

sample query files
a sample program named QueryGenerator.java that generates random queries for testing

Start off by running your tests at home. If you do need to run them on onyx please only run the smallest test (test0.gbk).

We will setup Amazon AWS accounts for each student so that we can run larger tests in the cloud. Running your tests on AWS is required so we can all get experience using the cloud.

See the section on Testers for automated test scripts.

6. Using a Cache

You should incorporate the Cache object from Project 1 to improve the performance of your B-Tree implementation. The size of the Cache should be a command line argument. An entry in the Cache is a BTreeNode. With the <1 (cache)> command line option, the <cache_size> needs to be specified as well.

📖 Report the time improvement using a cache of size 100 and 500 in your README-submission.md file.

7. Using a Database

Design a simple database to store the results (sequences and frequencies) from the B-Tree. We will perform an inorder tree traversal to get the information to store in the database. This would be done at the end of creating the GeneBank BTree. Then we will create a separate search program named GeneBankSearchDatabase that uses the database instead of the BTree.

$ ./gradlew createJarGeneBankSearchDatabase
$ java -jar build/libs/GeneBankSearchDatabase.jar <path_to_SQLite_database> 
												  <query_file> [<debug_level>]

We will use the embedded SQLite database for this project. The SQLite database is fully contained in a jar file that gradle will automatically pull down for you. See the database example in the section below on how to use SQLite.

8. Useful Examples

The following examples from the class examples repository will be useful for this project.

Disk IO example: In particular, look at DiskReadWriteExample.java. It shows the implementation of an external binary search tree on disk.
SQLite example: A quick starter example on how to set up and use SQLite.
Bitwise operators example: In particular, look at BitwiseShiftDemo.java for helpful sequence utility code.

9. Test Scripts

Several test gbk files are provided in the folder: data/files_gbk.

These include:

test0.gbk
test1.gbk
test2.gbk
test3.gbk
test4.gbk
test5.gbk

An optional large test file is the Y chromosome for human beings. That is a large file (but the smallest of all chromosomes!). To run your program on this file is optional as it may take a long time!! For this test, we recommend a cache size of 5,000 nodes. Note this file isan't available on GitHub due to their limitation on file size. You can download it from here: hs_ref_chrY.gbk

Several pre-generated query files are provided in the folder: data/files_gbk.

In particular, results are provided for the following sample query files.

query1
query2
query3
query4
query5
query6
query7
query8
query9
query10
query20
query31

To generate additional queries, we can use the data/queries/QueryGenerator.java program.

Three test scripts are provided at the top-level of the project:

./create-btrees.sh 
Usage:  create-btrees.sh  <datafile> 

./check-dumpfiles.sh 
Usage:  check-dumpfiles.sh  <datafile> 

./check-queries.sh 
Usage:  check-queries.sh  <datafile>

Please note that the scripts require the name of the test file and then it will add the appropriate path to the name to fnd it. The check-btrees.sh script creates B-Trees for the given data file for strings of length 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 31. The check-dumpfiles.sh compares the dump files from our code to the reference dump files. The check-queries.sh script compares the results of queries to our program with the reference results.

The results are provided for test0.gbk and test5.gbk data files. You can use the test scripts to run and compare results using the three test scripts as follows.

./gradlew createJarGeneBankCreateBTree
./gradlew createJarGeneBankSearchBTree

./create-btrees.sh test0.gbk
./check-dumpfiles.sh test0.gbk
./check-queries.sh test0.gbk

Then repeat for test5.gbk.

10. Progress Reports

Each team member will fill out a progress report (via a survey) each week. The link to the survey will be provided by your instuctor.

In addition, each team member should log their project-related activities for the week, including the URL to the tasks (e.g., https://github.com/StudentUserNameHostingRepo/CS321_Bioinformatics/issues/123) completed that week, in a separate file named Project-Log.md.

Here is a sample log file: Project-log-sample.md

As a reminder, each commit should link (reference) in the commit message the completed task (e.g., Implements task #123), in order to automatically link the task to the commit, and make the code changes directly available from the task itself.

It is expected that each team should have at least one meeting every week.

Progress reports are confidential.

11. Submission

Before submission, make sure that you:

can clone the team repository on the onyx server
can compile and run the program from the command line on onyx and obtain the expected results (just try test0.gbk)
submit on onyx if your instructor requires a submission

One team member should email the URL of the team GitHub repository to the instructor and the teaching assistant(s).

Make sure the instructor and the teaching assistant(s) have access to the repository.

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
.github/workflows		.github/workflows
.settings		.settings
data		data
docs		docs
gradle/wrapper		gradle/wrapper
results		results
src		src
.classpath		.classpath
.gitignore		.gitignore
.project		.project
Project-log-sample.md		Project-log-sample.md
README-submission.md		README-submission.md
README.md		README.md
build.gradle		build.gradle
check-dumpfiles.sh		check-dumpfiles.sh
check-queries.sh		check-queries.sh
create-btrees.sh		create-btrees.sh
gradlew		gradlew
gradlew.bat		gradlew.bat
settings.gradle		settings.gradle

Eli-Woodard/CS321_Bioinformatics

Folders and files

Latest commit

History

Repository files navigation