Introduction

This section provides a brief summary of GO-PCA, a list of key features of the Python implementation, and a link to demo notebooks.

What is GO-PCA?

GO-PCA is an unsupervised method to explore gene expression data using prior knowledge. Briefly, GO-PCA combines principal component analysis (PCA) with nonparametric gene ontology (GO) enrichment analysis in order to define signatures, i.e., small sets of genes that are both strongly correlated and closely functionally related.

The expression profiles of all signatures generated can be conveniently visualized as a heat map. This visualization, referred to as the signature matrix, aims to provide a systematic and easily interpretable view of biologically relevant expression patterns in the data. Together with other GO-PCA visualizations, it can serve as a powerful starting point for exploratory data analysis and hypothesis generation. The method is described in detail in an open-access research article.
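
As a rough illustration of the PCA half of this idea, the sketch below ranks genes by their loading on a principal component, which is the kind of ranked list that a rank-based enrichment test could then examine. It is not GO-PCA's implementation: the function name `rank_genes_by_pc`, the genes-by-samples layout, and the toy data are all assumptions made for this example, and the GO enrichment step is omitted entirely.

```python
import numpy as np

def rank_genes_by_pc(expr, genes, pc=0):
    """Rank genes by their loading on one principal component.

    expr:  2-D array with genes in rows and samples in columns
           (a layout assumed for this sketch).
    genes: list of gene names, one per row of `expr`.
    pc:    zero-based index of the principal component to use.
    """
    # Center each gene (row) across samples.
    centered = expr - expr.mean(axis=1, keepdims=True)
    # SVD of the centered matrix: the columns of `u` hold the gene loadings.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = u[:, pc]
    # Sort genes from the most positive to the most negative loading.
    order = np.argsort(-loadings)
    return [genes[i] for i in order], loadings[order]

# Toy data (hypothetical): 6 genes x 5 samples, where the first three
# genes share a strong common pattern and the rest are noise.
rng = np.random.default_rng(0)
pattern = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
expr = rng.normal(scale=0.1, size=(6, 5))
expr[:3] += pattern
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]

ranked, loadings = rank_genes_by_pc(expr, genes)
print(ranked)
```

Because the sign of a principal component is arbitrary, the three correlated genes may appear together at either end of the ranking, so both ends of the list are of interest when looking for functionally related gene sets.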

Key features

GO-PCA is implemented in Python, a high-level programming language that is widely used in both scientific and non-scientific settings. The key features of GO-PCA are:

  • Accessibility and transparency: GO-PCA is free and open-source software.
  • Cross-platform compatibility: GO-PCA can be easily installed on Windows, OS X, and Linux, and runs under both Python 2.7.x and 3.5.x.
  • Simple command-line interface: GO-PCA can be run directly from the command line (go-pca.py), and command-line scripts can be used to write the generated signatures to output files in tab-separated text (*.tsv) or Excel spreadsheet (*.xlsx) format.
  • A powerful Python API (more documentation forthcoming): The GO-PCA Python API can be used to create high-quality figures displaying the signature matrix or individual signatures in detail. This API in turn relies on the open-source plotly plotting library.
  • Speed: GO-PCA takes about 60 seconds to run on the DMAP dataset, which consists of ~8,000 genes and ~200 samples. The most computationally intensive part of GO-PCA (GO enrichment analysis using the XL-mHG test) is implemented in Cython, a superset of Python that compiles to efficient C code.
  • Reproducibility: GO-PCA is a deterministic algorithm and supports the calculation of MD5 hash values for all input, configuration, and output data. These values make it easy to establish, for example, whether two GO-PCA runs used identical parameter settings.
  • Extensibility: GO-PCA’s code is modular and well-documented, making it straightforward to implement modifications, new features and extensions.
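
To make the reproducibility point concrete, the following minimal sketch shows how MD5 hash values can reveal whether two runs used identical parameter settings. It uses only Python's standard `hashlib` and `json` modules. The helper names and the parameter names (`n_components`, `pval_thresh`) are hypothetical, and the way GO-PCA itself serializes and hashes its data may well differ.

```python
import hashlib
import json

def md5_of_bytes(data):
    """Return the hexadecimal MD5 digest of a bytes object."""
    return hashlib.md5(data).hexdigest()

def md5_of_config(config):
    """Return the MD5 digest of a dictionary of parameter settings."""
    # Serialize with sorted keys so that key order does not affect the hash.
    serialized = json.dumps(config, sort_keys=True).encode("utf-8")
    return md5_of_bytes(serialized)

# Hypothetical parameter settings for three runs.
run_a = {"n_components": 5, "pval_thresh": 1e-6}
run_b = {"pval_thresh": 1e-6, "n_components": 5}  # same settings, different order
run_c = {"n_components": 5, "pval_thresh": 1e-5}  # one setting changed

print(md5_of_config(run_a) == md5_of_config(run_b))  # True
print(md5_of_config(run_a) == md5_of_config(run_c))  # False
```

The same approach extends to input and output files by hashing their raw bytes, so that matching digests indicate matching data.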

Demos

Demos of GO-PCA in action can be found in a separate GitHub repository. Note: These demos were created using an older version of this package that relied on the matplotlib library for plotting.