Running GO-PCA

This section describes the expression file format used by GO-PCA, and documents the go-pca.py command that is used to run GO-PCA. For information on how to provide the gene sets to GO-PCA, please see the previous section, Obtaining Gene Sets.

The expression file format

The main input to GO-PCA is an expression matrix, with rows representing genes, and columns representing samples. GO-PCA expects the expression matrix to be described in a tab-delimited text file that contains the gene expression values in a matrix layout. The first row contains the sample names, and the first column represents gene names (the content of the top left cell is ignored). A mini-example of a valid expression file with only five genes and three samples is shown below:

ignored Sample1 Sample2 Sample3
IGBP1   8.64947 8.01958 7.95444
MYC     7.61296 7.38281 7.58559
SMAD1   8.84338 8.41662 8.94365
MDM1    6.17908 6.07470 5.59411
CD44    7.64093 7.56293 7.58277

Running GO-PCA from the command-line

For users who are unfamiliar with Python, the most convenient way of running GO-PCA is through the command line. In general, this can be done in a two-step process:

  1. GO-PCA is run using the command go-pca.py All results are written to a binary result file.
  2. A variety of additional command-line scripts can be used to extract detailed information about the signatures generated, and to plot the GO-PCA signature matrix. These scripts use the result file as input data. In this way, GO-PCA does not need to be re-run for every operation.