Writing your own scripts

How scripts work

Scripts generally work in a fairly straightforward way. Helpers.py provides many helper functions that make setup easier, but the basic idea is this: first, all of the necessary dependencies are loaded. Then the settings hash is initialized. This hash holds the current settings for every algorithm and gets passed around through many of the functions and files. When an algorithm's setup function runs, it makes a copy of the settings hash, so all settings for a particular algorithm must be in place before that algorithm's setup function is called. Please see the files in the demo_scripts folder for an example script for each of the algorithms.
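Because setup makes that copy, any change to the settings hash after a job has been set up will not affect that job. A minimal sketch of the overall flow (GENIE3 and the num_trees setting are used here purely as illustrations; data_storage would come from ReadData, described below):

from __init__ import *                   # load dependencies and helpers

settings = {}
settings = ReadConfig(settings)          # initialize the settings hash

settings["genie3"]["num_trees"] = 500    # set options before setup...
job = GENIE3()
job.setup(data_storage, settings, "run_a")   # ...because setup copies the hash

settings["genie3"]["num_trees"] = 1000   # only affects jobs set up after this point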

Once the settings hash is initialized, we load the data by calling the ReadData function (see Reading in data below). With the data loaded, we can create a new instance of an algorithm.

The basic setup

  1. Load and run the setup function. This loads all of the various libraries and algorithm objects:

from __init__ import *

You can take a look at __init__.py to see what exactly is being loaded if you ever need to modify it.

  2. Initialize the settings hash. This is the hash that all algorithms use to initialize and set their runtime settings:

# Create a new blank hash
settings = {}
# This will create a new settings hash from the config.cfg file in the root
# directory of the Network Inference package
settings = ReadConfig(settings)

Next we want to set up some global settings. For example, we probably want to set a name for our experiment, which is what gets used to create the output folder. To do that we run:

settings["global"]["experiment_name"] = "experiment_name_here"

Now we want to set up the output directory. This bit of code can simply be copied and pasted into your script. It will also create a time stamp from when the script is launched and append it to the directory name.

# Set up output directory
t = datetime.now().strftime("%Y-%m-%d_%H.%M.%S")
settings["global"]["output_dir"] = settings["global"]["output_dir"] + "/" + \
    settings["global"]["experiment_name"] + "-" + t + "/"
os.mkdir(settings["global"]["output_dir"])

And, finally, we are ready to start working with data and using some inference algorithms.

Reading in data

Reading data is easy as long as it is in the same format as the output of GeneNetWeaver, the software used to create the simulation data in this package. We first need a string containing either a full path to the data file or a path relative to the root directory of this software, and we need to know what type of data it is (multifactorial, knockout, knockdown, wildtype, or timeseries). From there we can simply call the ReadData function:

data_storage = ReadData(filepath, "multifactorial")

which will return a DataStorage object that contains our data.

This object has a few different data structures inside of it. First is the gene_list, accessed by calling:

data_storage.gene_list

This is simply the list of genes that the dataset contains. Next are the experiments that the data_storage contains. These can be accessed with:

data_storage.experiments

Each column (i.e., microarray) in the data file is represented as an Experiment object. These are the data structures that contain the actual gene expression data. Within each Experiment object there is a hash called ratios, in which the gene expression ratios are indexed by gene name. So, to access the expression value of one gene in one experiment:

expression_value = data_storage.experiments[i].ratios[gene_name]
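
For example, to walk over every expression value in the dataset (a minimal sketch, assuming experiments is the list of Experiment objects described above):

for experiment in data_storage.experiments:
    for gene_name in data_storage.gene_list:
        expression_value = experiment.ratios[gene_name]
        # ... use expression_value here ...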

Data can also be normalized using functions from the DataStorage class. These can be called like this:

data_storage.normalize()

Another important thing to know is how to combine data objects. This is useful if you have data in multiple files but want to feed all of it into an algorithm like GENIE3. This can be done like this:

data_storage = ReadData(filepath, "multifactorial")
data_storage2 = ReadData(wildtype_path, "wildtype")
data_storage.combine(data_storage2)

The combine method will take care of mixing the two gene lists and expression data.
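
Calls to combine can also be chained to fold several files into a single storage object before handing it to an algorithm (a short sketch; the file paths are placeholders):

mf_storage = ReadData("datasets/multifactorial_data.csv", "multifactorial")
ko_storage = ReadData("datasets/ko_file.csv", "knockout")
wt_storage = ReadData("datasets/wt_file.csv", "wildtype")

# Merge everything into mf_storage, one dataset at a time
mf_storage.combine(ko_storage)
mf_storage.combine(wt_storage)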

Running experiments

Each algorithm has slightly different data requirements and parameters. This has been streamlined as much as possible, but you will still need to study each algorithm's class file a little. These files can be found in the wrappers folder. The main method should be all you need, and it will tell you what the different parameters are; generally these are just the datasets.

Additionally, the configuration settings may need to be tweaked. This is done through the settings data structure, and a list of the available settings for each algorithm can be found in config/default_values.

The general setup is this:

  • Set up the settings for the given algorithm in the settings hash
  • Create an algorithm object using your data
  • Add the algorithm object to the job manager's queue
  • Run the job manager
  • Analyze results

Below is an example of these steps:

from __init__ import *

# Initialize the settings hash from the config.cfg file
settings = {}
settings = ReadConfig(settings)

# Name the experiment, appending a label passed on the command line
settings["global"]["experiment_name"] = "GENIE3" + sys.argv[1]

# Get the filenames for the data files
ko_file = "datasets/ko_file.csv"
kd_file = "datasets/kd_file.csv"
wt_file = "datasets/wt_file.csv"
mf_file = "datasets/multifactorial_data.csv"
gold_file = "datasets/gold_standard_network.csv"

# Read in gold standard network
goldnet = Network()
goldnet.read_goldstd(gold_file)

# Create the necessary directories to store this run's code and results in
t = datetime.now().strftime("%Y-%m-%d_%H.%M.%S")
settings["global"]["output_dir"] = settings["global"]["output_dir"] + "/" + \
    settings["global"]["experiment_name"] + "-" + t + "/"
os.mkdir(settings["global"]["output_dir"])


# Read data into program
# Where the format is "FILENAME" "DATATYPE"
mf_storage = ReadData(mf_file, "multifactorial")
ko_storage = ReadData(ko_file, "knockout")
kd_storage = ReadData(kd_file, "knockdown")
wt_storage = ReadData(wt_file, "wildtype")

# Setup job manager
jobman = JobManager(settings)

# Modify settings for GENIE3
settings["genie3"]["num_trees"] = 1000

# Make GENIE3 jobs

# Run GENIE3 ONLY on multifactorial data
# The first argument to setup is the data storage, second is the settings hash, and the third is the
# unique name for this run.
genie3job = GENIE3()
genie3job.setup(mf_storage, settings, "MF")
jobman.queueJob(genie3job)

# Now combine the knockout data with the multifactorial data and
# create a new GENIE3 instance.
# When the setup method is run, everything is written to disk, so we are
# free to modify the data objects afterwards.
mf_storage.combine(ko_storage)
genie3job = GENIE3()
genie3job.setup(mf_storage, settings, "MF_KO")
jobman.queueJob(genie3job)

# Add the wildtype data
mf_storage.combine(wt_storage)
genie3job = GENIE3()
genie3job.setup(mf_storage, settings, "MF_KO_WT")
jobman.queueJob(genie3job)

# Add the knockdown data
mf_storage.combine(kd_storage)
genie3job = GENIE3()
genie3job.setup(mf_storage, settings, "MF_KO_WT_KD")
jobman.queueJob(genie3job)

# Now run all of these algorithms
jobman.runQueue()
jobman.waitToClear()

# And finally save all of the results and compare to the gold standard network
SaveResults(jobman.finished, goldnet, settings)
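
Because the experiment name above appends sys.argv[1], this script would be launched with a run label as its first argument, for example (the script filename here is just a placeholder):

python run_genie3.py trial1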