Software
Analyzing Gene Expression Data
Having obtained the gene expression data (on a log base 2 scale) for the knockout specimens, we analyzed them with several computational and statistical methods. The following is a summary of the tools and associated code used:
1. GenePattern: http://www.broad.mit.edu/cancer/software/genepattern/
GenePattern (GP) provides a free, easy-to-use, and powerful suite of tools for genomic analysis. The file formats for GP analysis are provided here; we used the .gct format. Analyses can be sped up by pre-filtering genes on a criterion such as coefficient of variation, fold change, or standard deviation. The shortened list of genes can then be subjected to various forms of analysis.
- a. Heat Maps – Provide a visual way to identify patterns in gene expression across various knockouts.
- b. Hierarchical Clustering – A method to organize genes or phenotypes into clusters arranged in a tree structure (dendrogram) using pairwise distances.
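The pre-filtering step described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the project: it assumes a GCT version 1.2 file (a `#1.2` line, a line with the row and column counts, a `Name`/`Description` header, then one row per gene) and filters on standard deviation. The gene names, sample names, and the 0.5 cutoff are made up for the example.

```python
import statistics

# Toy GCT v1.2 content; gene and sample names are hypothetical.
GCT_TEXT = "\n".join([
    "#1.2",
    "3\t3",
    "Name\tDescription\tko1\tko2\tko3",
    "geneA\tna\t1.0\t1.1\t0.9",
    "geneB\tna\t2.0\t5.0\t8.0",
    "geneC\tna\t4.0\t4.0\t4.0",
])


def parse_gct(text):
    """Parse GCT v1.2 text into (sample_names, {gene_name: [values]})."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    if lines[0].strip() != "#1.2":
        raise ValueError("expected a '#1.2' version line")
    n_rows, n_cols = (int(x) for x in lines[1].split("\t")[:2])
    header = lines[2].split("\t")
    samples = header[2:2 + n_cols]  # skip the Name and Description columns
    data = {}
    for ln in lines[3:3 + n_rows]:
        fields = ln.split("\t")
        data[fields[0]] = [float(v) for v in fields[2:2 + n_cols]]
    return samples, data


def filter_by_sd(data, min_sd):
    """Keep genes whose sample standard deviation is at least min_sd."""
    return {g: v for g, v in data.items() if statistics.stdev(v) >= min_sd}


samples, data = parse_gct(GCT_TEXT)
kept = filter_by_sd(data, min_sd=0.5)  # cutoff chosen only for illustration
print(sorted(kept))
```

Standard deviation is used here rather than coefficient of variation because the values are already on a log scale, where a mean near zero makes the coefficient of variation unstable; either criterion fits the same function shape.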

2. Significance Analysis of Microarrays (SAM):
SAM ranks genes based on a multi-class comparison of gene expression and provides an estimate of the false discovery rate for selecting a set of significant genes. The R code used to implement SAM is provided. (Data)
3. Correlation Network:
Given a set of genes and their expression profiles, a network based on the correlation between the genes can be constructed. The MATLAB code (Data) for calculating the correlation coefficients for sets of genes and the Python code (Data) for visualizing the network (via Graphviz or Cytoscape) are available.
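As a rough sketch of the idea (not the MATLAB or Python code linked above), the following computes pairwise Pearson correlations, keeps gene pairs whose absolute correlation passes a threshold, and writes the result as Graphviz DOT text. The gene names, profiles, and the 0.9 threshold are assumptions made for the example; a profile with zero variance would need to be removed first, since its correlation is undefined.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_edges(profiles, threshold):
    """Return (gene1, gene2, r) for every pair with |r| >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges


def to_dot(edges):
    """Render the edge list as an undirected Graphviz DOT graph."""
    lines = ["graph correlation {"]
    for g1, g2, r in edges:
        lines.append(f'  "{g1}" -- "{g2}" [label="{r:.2f}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical log2 expression profiles across four knockouts.
PROFILES = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, -1.0, -1.0, 1.0],
}

edges = correlation_edges(PROFILES, threshold=0.9)
print(to_dot(edges))
```

The DOT text can be saved to a file and rendered with Graphviz; Cytoscape would need the edge list exported in a format it imports, such as a tab-separated table.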

4. GO Enrichment:
Gene Ontology (GO) allows genes to be mapped to processes via their gene products. For a given list of genes (selected via a class comparison such as SAM), the GO processes that are enriched in it can be identified. Standard tools such as DAVID perform these functions. However, due to a lack of support for TAIR annotation, we use custom code (Python code and Data, and MATLAB code and Data) to implement a hypergeometric test for GO term enrichment.
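The hypergeometric test itself is short enough to sketch with the Python standard library. This is an illustration of the test, not the project's own code. With N background genes, K of them annotated with a term, n selected genes, and k selected genes carrying the term, the enrichment p-value is the probability of seeing k or more annotated genes in a random draw of n. The gene and term names below are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def go_enrichment(selected, background, term_to_genes):
    """Return {term: p-value} for each term, restricted to the background."""
    background = set(background)
    selected = set(selected) & background
    results = {}
    for term, genes in term_to_genes.items():
        annotated = set(genes) & background
        k = len(selected & annotated)
        results[term] = hypergeom_pvalue(
            len(background), len(annotated), len(selected), k
        )
    return results


# Hypothetical background of ten genes and two made-up terms.
BACKGROUND = [f"g{i}" for i in range(1, 11)]
TERMS = {
    "term_x": {"g1", "g2", "g3"},
    "term_y": {"g6", "g7"},
}
SELECTED = ["g1", "g2", "g4", "g5"]

for term, p in sorted(go_enrichment(SELECTED, BACKGROUND, TERMS).items()):
    print(term, round(p, 4))
```

In practice the term-to-gene mapping would come from the TAIR annotation, and the p-values would be adjusted for the number of terms tested.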
5. Bayesian Networks (Pebl)
The novel method of this project is the Bayesian network analysis. The Systems Biology group at the University of Michigan has developed a free, open-source project called Python Environment for Bayesian Learning (Pebl), which learns the structure of a Bayesian network from gene expression data and prior information. The installation and documentation page for Pebl is available here, and the code for learning a Bayesian network is here. The Pebl learner searches through millions of possible network structures and scores them on their ability to explain the data. The best network and the consensus network (which preserves strongly conserved connections) give a picture of the underlying causality of gene interactions.
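To make "scoring a structure on its ability to explain the data" concrete, here is a toy sketch that does not use Pebl or reproduce its scoring function. It assumes discretized data and uses the Bayesian Information Criterion (BIC) as a stand-in score: the log-likelihood of each gene given its parents, minus a penalty for the number of parameters. The variable names, data, and candidate structures are invented for the example, and a real learner would search the structure space rather than compare a hand-written list.

```python
import math
from collections import Counter


def bic_score(data, parents):
    """BIC of a network; parents maps node -> tuple of parent nodes."""
    n = len(next(iter(data.values())))
    total = 0.0
    for node, values in data.items():
        pa = parents.get(node, ())
        # One parent configuration per sample (an empty tuple if no parents).
        configs = list(zip(*(data[p] for p in pa))) if pa else [()] * n
        joint = Counter(zip(configs, values))
        marginal = Counter(configs)
        loglik = sum(
            c * math.log(c / marginal[cfg]) for (cfg, _), c in joint.items()
        )
        r = len(set(values))  # number of states of this node
        q = 1                 # number of possible parent configurations
        for p in pa:
            q *= len(set(data[p]))
        total += loglik - 0.5 * math.log(n) * (r - 1) * q
    return total


# Hypothetical discretized expression (0 = low, 1 = high) in eight samples.
DATA = {
    "A": [0, 0, 0, 0, 1, 1, 1, 1],
    "B": [0, 0, 0, 0, 1, 1, 1, 1],  # tracks A exactly
    "C": [0, 1, 0, 1, 0, 1, 0, 1],  # unrelated to A
}

CANDIDATES = {
    "no edges": {},
    "A->B": {"B": ("A",)},
    "A->C": {"C": ("A",)},
    "A->B, A->C": {"B": ("A",), "C": ("A",)},
}

scores = {name: bic_score(DATA, pa) for name, pa in CANDIDATES.items()}
best = max(scores, key=scores.get)
print(best)
```

Note that a score of this kind cannot by itself tell A->B from B->A, since both explain the data equally well; edge direction, and any causal reading of it, depends on additional structure or prior information.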


