Command-line interface

The GO-PCA command-line interface (CLI) consists of individual scripts for generating gene sets, running GO-PCA, and processing and visualizing the results of a GO-PCA run.

Generate custom GO-derived gene sets: gopca_extract_go_gene_sets.py

Generating custom GO-derived gene sets for use with GO-PCA is a two-step process. First, use the script ensembl_extract_protein_coding_genes.py from the genometools package to create a tab-delimited text file listing the protein-coding genes. The input for this script is an Ensembl GTF file (see the “Gene sets” column on Ensembl’s FTP Download page):

ensembl_extract_protein_coding_genes.py -a [gtf_file] -o [output_file]

Second, pass that output file as the “gene file” (-g) to the script gopca_extract_go_gene_sets.py.

usage: gopca_extract_go_gene_sets.py [-h] [--version] -g <file> -t <file> -a
                                     <file> -o <file>
                                     [-e [<evidence code, ...> [<evidence code, ...> ...]]]
                                     [--min-genes-per-term <int>]
                                     [--max-genes-per-term <int>]
                                     [--part-of-cc-only] [-l <file>] [-q] [-v]
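
The two steps above can be sketched as a pair of command lines assembled in Python. This is only an illustration: the file names are hypothetical placeholders, and the helper function is not part of GO-PCA. Only the script names and the flags (-a/-o for the first script, -g/-t/-a/-o for the second) are taken from this page.

```python
import shlex


def build_gene_set_commands(gtf_file, gene_file, obo_file, gaf_file, gene_set_file):
    """Return the two command lines (as argument lists) for the two-step process."""
    # Step 1: extract the protein-coding genes from the Ensembl GTF file.
    step1 = [
        "ensembl_extract_protein_coding_genes.py",
        "-a", gtf_file,
        "-o", gene_file,
    ]
    # Step 2: the output of step 1 becomes the gene file (-g) of step 2.
    step2 = [
        "gopca_extract_go_gene_sets.py",
        "-g", gene_file,
        "-t", obo_file,
        "-a", gaf_file,
        "-o", gene_set_file,
    ]
    return step1, step2


# All file names below are made up for illustration.
step1, step2 = build_gene_set_commands(
    gtf_file="annotations.gtf.gz",
    gene_file="protein_coding_genes.tsv",
    obo_file="go-basic.obo",
    gaf_file="gene_association.gaf.gz",
    gene_set_file="go_gene_sets.tsv",
)

for command in (step1, step2):
    # Print a shell-safe version of each command line.
    print(" ".join(shlex.quote(arg) for arg in command))
```

Each list can be handed to subprocess.run(command, check=True) to actually execute the step, provided the scripts are installed and on the PATH.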

Help

--version
 Output the GO-PCA version and exit.

Input and output files

-g, --gene-file
 File containing list of protein-coding genes (generated using the script ensembl_extract_protein_coding_genes.py).
-t, --gene-ontology-file
 Path of ontology file (in OBO format).
-a, --goa-association-file
 Path of UniProt-GOA Gene Association file (in GAF format).
-o, --output-file
 Path of output file.

Other options

-e, --evidence-codes
 List of three-letter evidence codes to include. If empty, include all evidence types. [IDA, IGI, IMP, ISO, ISS, IC, NAS, TAS]
 Default: IDA, IGI, IMP, ISO, ISS, IC, NAS, TAS

--min-genes-per-term
 Exclude GO terms that have fewer than the specified number of genes annotated with them. Set to 0 to disable. [5]
 Default: 5

--max-genes-per-term
 Exclude GO terms that have more than the specified number of genes annotated with them. Set to 0 to disable. [200]
 Default: 200

--part-of-cc-only
 If enabled, ignore part_of relations outside the cellular_component (CC) domain.
 Default: False

Reporting options

-l, --log-file
 Path of log file (if specified, report to stdout AND file).

-q, --quiet
 Only output errors and warnings.
 Default: False

-v, --verbose
 Enable verbose output. Ignored if --quiet is specified.
 Default: False

Running GO-PCA: go-pca.py

go-pca.py is the command to run GO-PCA. All parameters can be specified either directly on the command line or in a separate configuration file, which is passed using the -c option.

Note

The configuration file is expected to follow the Windows “INI-style” format, with a single “[GO-PCA]” section, followed by “parameter=value” entries. If a configuration file is given, and a parameter is set both in the configuration file and on the command line, the command line setting takes precedence.
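
As a rough sketch of the format described in this note, the following Python snippet writes an INI-style file with a single “[GO-PCA]” section and illustrates the precedence rule. The parameter names (sel_var_genes, n_components) are assumptions modeled on the long command-line flags; this page does not list the names that go-pca.py actually accepts in a configuration file. The merge function only mimics the documented behavior and is not GO-PCA's own code.

```python
import configparser
import io


def write_gopca_config(settings):
    """Render settings as an INI document with a single [GO-PCA] section."""
    config = configparser.ConfigParser()
    config.optionxform = str  # preserve the case of parameter names
    config["GO-PCA"] = {name: str(value) for name, value in settings.items()}
    buffer = io.StringIO()
    # Write "parameter=value" entries without spaces around "=".
    config.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()


def merge_settings(file_settings, cli_settings):
    """Combine both sources; a command-line value wins if it was given."""
    merged = dict(file_settings)
    merged.update({k: v for k, v in cli_settings.items() if v is not None})
    return merged


# Hypothetical parameter names and values, for illustration only.
file_settings = {"sel_var_genes": 8000, "n_components": -1}
cli_settings = {"sel_var_genes": 5000, "n_components": None}  # None = not given

print(write_gopca_config(file_settings))
print(merge_settings(file_settings, cli_settings))
```

Here sel_var_genes is set in both places, so the command-line value is the one that ends up in the merged settings, while n_components falls back to the value from the file.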

The only required parameters are:

-e  (The expression file.)
-s  (The gene set file.)
-o  (The output file.)

However, if the expression matrix is not pre-filtered to only contain expressed genes, it is also highly advisable to specify the -G option.

usage: go-pca.py [-h] [--version] [-c <file>] -e <file> -s <file> [-t <file>]
                 -o <file> [-l <file>] [-q] [-v] [-D <int>] [-G <int>]
                 [-P <float>] [-E <float>] [-R <float>] [-Xf <float>]
                 [-Xm <int>] [-L <int>] [--escore-pval-thresh <float>]
                 [--no-local-filter] [--no-global-filter]
                 [--go-part-of-cc-only] [-ps <int>] [-pp <int>] [-pz <float>]
                 [-pm <int>]
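
A minimal invocation with the three required parameters, plus the optional -G and -c options, could be assembled as follows. The helper function, the file names, and the value 8000 for -G are illustrative assumptions and not recommendations from the GO-PCA authors; only the flags themselves come from the usage message above.

```python
import shlex


def build_gopca_command(expression_file, gene_set_file, output_file,
                        sel_var_genes=None, config_file=None):
    """Assemble a go-pca.py command line as a list of arguments."""
    command = ["go-pca.py"]
    if config_file is not None:
        # Optional configuration file; command-line settings take precedence.
        command += ["-c", config_file]
    # The three required parameters.
    command += ["-e", expression_file, "-s", gene_set_file, "-o", output_file]
    if sel_var_genes is not None:
        # Variance filter, advisable if the matrix is not pre-filtered.
        command += ["-G", str(sel_var_genes)]
    return command


# File names and the -G value are made up for illustration.
gopca_command = build_gopca_command(
    expression_file="expression.tsv",
    gene_set_file="go_gene_sets.tsv",
    output_file="gopca_result.pickle",
    sel_var_genes=8000,
)
print(" ".join(shlex.quote(arg) for arg in gopca_command))
```

The resulting list can be passed to subprocess.run(gopca_command, check=True) once GO-PCA is installed.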

Help

--version
 Output the GO-PCA version and exit.

Separate configuration file

-c, --config-file
 GO-PCA configuration file. Note: Parameter values specified as command-line arguments (see below) override the corresponding values in the configuration file.

Input and output files

-e, --expression-file
 Tab-separated text file containing the gene expression matrix.
-s, --gene-set-file
 Tab-separated text file containing the gene sets.
-t, --gene-ontology-file
 OBO file containing the Gene Ontology.
-o, --output-file
 Output pickle file (extension “.pickle” is recommended).

Reporting options

-l, --log-file
 Path of log file (if specified, report to stdout AND file).

-q, --quiet
 Only output errors and warnings.
 Default: False

-v, --verbose
 Enable verbose output. Ignored if --quiet is specified.
 Default: False

GO-PCA parameters ([] = default value)

-D, --n-components
 Number of principal components to test (-1 = determine automatically using a permutation test). [-1]
 Default: -1

-G, --sel-var-genes
 Variance filter: Keep G most variable genes (0 = off). [0]
 Default: 0
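
To make the idea behind the variance filter concrete, here is a small stand-alone sketch that keeps the G genes with the largest expression variance across samples. It is not GO-PCA's implementation: the function name, the use of the population variance, and the handling of ties are all assumptions made for this example, and the expression values are invented.

```python
from statistics import pvariance


def select_most_variable_genes(expression, num_genes):
    """Keep the num_genes genes with the highest variance (0 = filter off).

    expression maps each gene name to its list of expression values,
    one value per sample.
    """
    if num_genes == 0 or num_genes >= len(expression):
        return dict(expression)
    # Rank genes from most to least variable.
    ranked = sorted(expression, key=lambda g: pvariance(expression[g]), reverse=True)
    keep = set(ranked[:num_genes])
    # Preserve the original gene order in the filtered result.
    return {gene: values for gene, values in expression.items() if gene in keep}


# Invented toy data: four genes measured in four samples.
toy_expression = {
    "GENE_A": [1.0, 1.0, 1.0, 1.0],    # constant, variance 0
    "GENE_B": [0.0, 10.0, 0.0, 10.0],  # variance 25
    "GENE_C": [4.0, 5.0, 6.0, 5.0],    # variance 0.5
    "GENE_D": [2.0, 8.0, 2.0, 8.0],    # variance 9
}

filtered = select_most_variable_genes(toy_expression, 2)
print(sorted(filtered))
```

With G = 2, only the two genes with the largest variance (GENE_B and GENE_D) remain; with G = 0, the input is returned unchanged.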

-P, --pval-thresh
 P-value threshold for GO enrichment test. [1.0e-06]
 Default: 1e-06

-E, --escore-thresh
 E-score threshold for GO enrichment test. [2.0]
 Default: 2.0

-R, --sig-corr-thresh
 Correlation threshold used in generating signatures. [0.50]
 Default: 0.5

-Xf, --mHG-X-frac
 X_frac parameter for GO enrichment test. [0.25]
 Default: 0.25

-Xm, --mHG-X-min
 X_min parameter for GO enrichment test. [5]
 Default: 5

-L, --mHG-L
 L parameter for GO enrichment test (0 = “off”; -1 = # genes / 8). [-1]
 Default: -1

--escore-pval-thresh
 P-value threshold for XL-mHG E-score calculation (“psi”). [1.0e-04]
 Default: 0.0001

Manually disable the GO-PCA filters

--no-local-filter
 Disable the “local” filter.
 Default: False

--no-global-filter
 Disable the “global” filter (if -t is specified).
 Default: False

Legacy options

--go-part-of-cc-only
 Only propagate “part of” GO relations for the CC domain.
 Default: False

Parameters for automatically determining the number of PCs to test ([] = default value)

-ps, --pc-seed
 Random number generator seed (-1 = arbitrary value). [0]
 Default: 0

-pp, --pc-num-permutations
 Number of permutations. [15]
 Default: 15

-pz, --pc-zscore-thresh
 Z-score threshold. [2.00]
 Default: 2.0

-pm, --pc-max-components
 Maximum number of PCs to test (0 = no maximum). [0]
 Default: 0

Inspecting the results: gopca_print_info.py

To get a summary of the results contained in a particular GO-PCA result file, use the gopca_print_info.py command. It prints, for example, the number of principal components analyzed and the number of signatures generated.

usage: gopca_print_info.py [-h] [--version] -g <file> [-u] [-s] [-l <file>]
                           [-q] [-v]

Help

--version
 Output the GO-PCA version and exit.

Input file (required)

-g, --gopca-file
 A GO-PCA run or result pickle.

-u, --print-user-config
 Print user-provided GO-PCA config data of the run.
 Default: False

-s, --print-signatures
 Print signatures of the GO-PCA result.
 Default: False

Reporting options

-l, --log-file
 Path of log file (if specified, report to stdout AND file).

-q, --quiet
 Only output errors and warnings.
 Default: False

-v, --verbose
 Enable verbose output. Ignored if --quiet is specified.
 Default: False

Extracting the signature matrix (as tab-delimited text file): gopca_extract_signature_matrix.py

This command generates a tab-delimited text file that contains a matrix with the signature expression values for each signature and each sample. (This is the data visualized by the gopca_plot_signature_matrix.py command.)
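
Once extracted, such a file can be read with Python's standard csv module. The sketch below assumes a layout that this page does not spell out: a header row whose first cell is a label and whose remaining cells are sample names, followed by one row per signature with the signature label in the first column. The signature labels, sample names, and values are invented for illustration.

```python
import csv
import io


def read_signature_matrix(handle):
    """Parse a tab-delimited matrix into (sample names, {signature: values})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # assumed: first header cell is a row label
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        matrix[row[0]] = [float(value) for value in row[1:]]
    return samples, matrix


# Invented example content standing in for a real output file.
example_text = (
    "Signature\tsample_1\tsample_2\tsample_3\n"
    "signature_1\t0.5\t-1.2\t0.7\n"
    "signature_2\t-0.3\t0.9\t-0.6\n"
)

samples, matrix = read_signature_matrix(io.StringIO(example_text))
print(samples)
print(matrix["signature_1"])
```

For a real file, replace the io.StringIO object with open(path, newline="") and check the header against the actual output before relying on this layout.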

usage: gopca_extract_signature_matrix.py [-h] [--version] -g <file> -o <file>
                                         [-l <file>] [-q] [-v]

Help

--version
 Output the GO-PCA version and exit.

Input and output files (required)

-g, --gopca-file
 The GO-PCA result file.
-o, --output-file
 The output file.

Reporting options

-l, --log-file
 Path of log file (if specified, report to stdout AND file).

-q, --quiet
 Only output errors and warnings.
 Default: False

-v, --verbose
 Enable verbose output. Ignored if --quiet is specified.
 Default: False

Plotting the signature matrix as a heatmap: gopca_plot_signature_matrix.py

This command generates an interactive plot (embedded into an HTML file) of the GO-PCA signature matrix, visualized as a heatmap.

The HTML file also allows exporting the figure to the PNG format.

Extracting the signatures (as tab-delimited text file): gopca_extract_signatures.py

This command generates a tab-delimited text file in which each row corresponds to a signature. The columns contain detailed information for each signature, e.g., the gene set enrichment it was based on, and the list of genes contained in it.

usage: gopca_extract_signatures.py [-h] [--version] -g <file> -o <file>
                                   [-l <file>] [-q] [-v]

Help

--version
 Output the GO-PCA version and exit.

Input and output files (required)

-g, --gopca-file
 The GO-PCA result file.
-o, --output-file
 The output file.

Reporting options

-l, --log-file
 Path of log file (if specified, report to stdout AND file).

-q, --quiet
 Only output errors and warnings.
 Default: False

-v, --verbose
 Enable verbose output. Ignored if --quiet is specified.
 Default: False

Extracting the signatures (as Excel spreadsheet): gopca_extract_signatures_excel.py

This command generates a file with the same information as gopca_extract_signatures.py, but in the form of an Excel spreadsheet.

usage: gopca_extract_signatures_excel.py [-h] [--version] -g <file> -o <file>
                                         [-l <file>] [-q] [-v]

Help

--version
 Output the GO-PCA version and exit.

Input and output files (required)

-g, --gopca-file
 The GO-PCA result file.
-o, --output-file
 The output file.

Reporting options

-l, --log-file
 Path of log file (if specified, report to stdout AND file).

-q, --quiet
 Only output errors and warnings.
 Default: False

-v, --verbose
 Enable verbose output. Ignored if --quiet is specified.
 Default: False