Network-based SEED API

Network-based SEED API

A. Introduction

The Network-Based SEED API offers a framework that supports programmatic access to current SEED data. The system is distributed as a small set of Perl packages that the user downloads and installs locally. These packages define an API that the programmer uses to communicate over the network with the SEED environment maintained by the Fellowship for Interpretation of Genomes (FIG), Argonne National Lab (ANL) and the University of Chicago (UC). The distribution can be easily installed on a Mac or Unix-based system, and with a little extra effort, on a Windows machine. In addition to the packages that are used in constructing Perl programs, we offer a library of utility programs that offer predefined commands that can be used to extract data from the SEED.

The bulk of the functionality is offered via SAPserver.pm, a module that supports access to a database of genomic data that includes data on over a thousand genomes. The database is described abstractly via an Entity-Relationship model and is managed via metadata which makes it straightforward to extend the model, and is implemented via a standard relational database. We also offer:

MODELserver.pm to support the construction and use of metabolic models and flux-based analysis,
ANNOserver.pm to support annotation of DNA and protein sequences
RASTserver.pm to support the submission of genomes to be annotated and the retrieval of the annotations.

B. Summary of the Individual Servers

B.1 The Sapling Server (API)

The SAPserver.pm package offers programmatic access to the data maintained in the Sapling DB within the SEED. The Sapling DB is described by an entity-relationship model that depicts the basic entities maintained within the database and the relationships that we have encoded between them. This offers the basic foundation upon which most of the SEED toolkit resides. The methods offered by Sapling Objects support a rich set of operations against genomic data. Using the methods described in the API, the user has access to genomes, annotations, functional coupling data, protein families, subsystems, and a rapidly growing number of more specialized forms of data.

To see the overall ER diagram and the relations that implement it see the Sapling webpage.

A complete tutorial is offered in SAP tutorial.

B.2 The Annotation Support Server (API)

The ANNOserver.pm package supports capabilities relating to annotation of genomes. It supports invocation of standard gene callers (Glimmer3 for protein-encoding genes), and newly-developed high-performace methods to assign function to protein sequences or regions of DNA fragments (based on FIGfams and a unique use of K-mers that act as signatures of FIGfams). We include an example application based on these methods that can be used to produce relatively acurate annotation of most microbial genomes within a few minutes.

B.3 The RAST server (API)

RAST is a publicly-available server for the annotation of microbial genomes. It is maintained by a team at Argonne National Lab and FIG. Currently, it has over 2600 registered users, and several thousand genomes have been run through the service in the last couple of years (often several times!). The RASTserver.pm package was created to support programmatic submision of genomes to RAST, the retrieval of status, and the retrieval of the final set of annotations.

B.4 The Model Server (API)

This server provides access to all data associated with the biochemistry database and the genome-scale metabolic models stored within the SEED. This server also provides the user with the ability to run a set of simple flux balance analysis studies with the SEED models. A detailed description of the interface is here.

C. Command Line Services (List)

If you don't want to write Perl programs but would like to use the SEED servers to process your data, we supply a number of predefined shell scripts that provide basic bioinformatics functions using the servers. These scripts are all prefaced with "svr_" and are found in the bin directory of the distribution. These are designed to use stdin and stdout and to be piped together to form more complex processing. Here is a complete description of the available command line services. See section F. for examples using these services.

D. Distribution of the SEED server packages

Macintosh Distribution

The Macintosh distribution is a .dmg file that you can download here. When you open the SeedServers.dmg, you will see a number of packages and one called SeedServers.mpkg. Click on SeedServers.mpkg and follow the directions. It will create a directory called sas in your applications directory. From there you can run the MacRast.app to submit genomes to the RAST annotation system. You can also write your own server programs as documented below. You can do this from anywhere on your computer, as long as you run your PERL scripts with the command svr instead of perl. The command svr knows about the server evironment and runs perl in that context.

If you are more inclined to install software yourself and manipulate your own environment, see the Linux instructions below.

Linux Distribution

D.1 Download

Download the Network-based SEED API distribution here. It is called sas.tgz. Depending on your browser and preferences, you may wind up with the sas.tgz file, or you may get the uncompressed sas.tar. Use the examples below that correspond to your file.

D.2 Installation

Place the tarball in a directory of your choice (we use sas) and untar it. It will create several subdirectories.
cd to the sas/modules directory and run BUILD_MODULES.
Then, put the bin directory in your path and the lib and modules/lib directories in your perl path and you should be good to go.

Bash Example for uncompressed files (sas.tar):

mkdir sas
cp ~/Downloads/sas.tar sas
cd sas
TOP=`pwd`
tar -xvf sas.tar
cd modules
./BUILD_MODULES
export PERL5LIB=$PERL5LIB:$TOP/lib:$TOP/modules/lib
export PATH=$PATH:$TOP/bin

Bash example for the compressed file (sas.tgz)

mkdir sas
cp ~/Downloads/sas.tgz sas
cd sas
TOP=`pwd`
tar -zxvf sas.tgz
cd modules
./BUILD_MODULES
export PERL5LIB=$PERL5LIB:$TOP/lib:$TOP/modules/lib
export PATH=$PATH:$TOP/bin

To verify correct installation, try this:

perl -e 'use SeedEnv'

It should produce no errors.

D.3 Server Packages

There are 4 server packages included in the distribution:

The Genomics ER model server - SAPserver.pm
The MODEL server - MODELserver.pm
The Annotation Support Server - ANNOserver.pm
The RAST server - RASTserver.pm

D.4. Utilities

Also included is a package of utilities called SeedUtils.pm that contain functions useful for bioinformatics, but that do not require access to the databases. Click here to see the current list of functions and api descriptions.

E. Programming using the Servers

The SEED servers work by providing all necessary network operations in a server package that you use to access the server functions. You use these as any other PERL package. For instance, to find all genomes in the SEED, you would do this:

#!/usr/bin/perl -w

use strict;
use SAPserver; 
use Data::Dumper;

my $sapObject = SAPserver->new();
my $genomes = $sapObject->all_genomes();

foreach my $g (sort { $genomes->{$a} cmp $genomes->{$b} }  keys(%$genomes)) {
    print "$g\t$genomes->{$g}\n";
}

Following are some examples of programming with the servers:

E.1 Example 1: Conversion of Gene and Protein IDs

In example1 we illustrate some basic capabilities that relate to determining the set of IDs attached to specific protein sequences. The program accepts a protein ID as input. The ID may be one of several that are maintained by the SEED, UniProt, RefSeq, KEGG and other groups. The program first accesses all IDs attached to identical protein sequences. This can be a fairly large set in cases in which many very similar genomes have been sequenced.

The program determines the set of IDs associated with proteins that have identical sequence to the input protein ID. We call these sequence-equivalent proteins. The program displays the associated protein sequence, and then it computes a table describing these proteins. There will be at least one row for each of the computed IDs. There will be several rows if multiple groups have used the same ID and attached functional assignments to the ID. The columns in the table are as follows:

The first column is the ID of the protein.
The second is the genus, species and strain of the associated genome (which we sometimes call "the scientific name of the genome")
If we believe that the ID corresponds precisely to the gene associated with the input ID, this field will contain a 1. In this case, we speak of the two IDs as "precisely equivalent".
Otherwise, if we only know that this ID is for a protein sequence identical to the sequence designated by the input ID, it will contain 0. In this case, we say "the IDs are sequence-equivalent, but not necessarily precisely equivalent."
The fourth column gives the functional assignment associated with the protein ID.
The fifth column gives the source of the assignment.
If we believe that this assertion was made by an expert, this field will contain 1. Otherwise, it will contain 0.

The need to map IDs between groups and compare asserted functions of proteins is quite basic. It would be straightforward for any group to write a short CGI script using the capabilities illustrated in example1 that supported connecting protein IDs to literature (via PubMed), to structure data (when present), and so forth. Every annotation team needs this class of functionality.

Example 1 Discussion

With the server packages installed as described in section C, the code in example1 can be run as follows

> perl server_paper_example1.pl "fig|83333.1.peg.145"

The work of this routine is in two parts. Here, we use the SAP server to build a hash of identifiers for precisely equivalent genes used in determining the contents of column 3 later on.

	    my %preciseHash;
	    my $precise_assertions_list = $sapObject->equiv_precise(-ids => $id);
	    $precise_assertions_list = $precise_assertions_list->{$id};
	    if (@$precise_assertions_list > 0) {
		my $inputID = $id;
	        for my $precise_assertion (@$precise_assertions_list) {
	            my ($newID, $function, $source, $expert) = @$precise_assertion;
	            $preciseHash{$newID} = 1;
	        }

Here, we again use the SAP server to get all sequence equivalent IDs and produce the output table shown below (truncated for this example).

	    my $assertions = $sapObject->equiv_sequence(-ids => $id);
	    my $assertions = $assertions->{$id};
	    if (@$assertions < 1) {
	        print STDERR "No results found.\n";
	    } else {
	        for my $assertion (@$assertions) {
	            my ($newID, $function, $source, $expert, $genomeName) = @$assertion;
	            $genomeName = '' if ! defined $genomeName;
	            my $column3 = ($preciseHash{$newID} ? 1 : 0);
	            print join("\t", $newID, $genomeName, $column3, $function, $source,
	                             $expert) . "\n";
	        }
	    }

Output Table

cmr|NT01SF0150		0	dnaK suppressor protein VC0596 [imported]	CMR	0
cmr|NT02EC0154		0	dnaK suppressor protein VC0596 [imported]	CMR	0
cmr|NT02SB0136		0	RNA polymerase-binding protein DksA	CMR	0
cmr|NT02SD0166		0	RNA polymerase-binding protein DksA	CMR	0
cmr|NT02SF0144		0	dnaK suppressor protein VC0596 [imported]	CMR	0
cmr|NT03EC0181		0	DksA homolog	CMR	0
cmr|NT04EC0178		0	dnaK suppressor protein VC0596 [imported]	CMR	0
cmr|NT04SF0143		0	RNA polymerase-binding protein DksA	CMR	0
cmr|NT04SS0162		0	RNA polymerase-binding protein DksA	CMR	0
cmr|NT10EC0149		0	RNA polymerase-binding protein DksA	CMR	0
.
.
.

E.2 Example 2: Metabolic Reconstructions Provided for Complete Prokaryotic Genomes

Given a set of functional roles one often wishes to understand what subsystems can be inferred from the set. The second example reads as input a set of functional roles and constructs a table of subsystems, along with their variation codes, that can be identified. The data displayed in this simple example could form the start of a research project to gather the functional roles not connected to subsystems, to determine whether they were not connected because a small set of functional roles were not present in the input, and to seek candidates for such "missing functional roles". The ability to easily map functional roles into subsystems will improve over time, as the SEED annotation effort improves its collection of encoded subsystems [PMID: 16214803].

Example 2 Discussion

With the server packages installed as described in section C, the code in example 2 can be run as follows

> perl server_paper_example2.pl < Buchnera_roles.tbl

The work of this routine is in two parts. First, we use the Subsystem Server to construct the list of role-id pairs for use later on.

	    my $sapObject = SAPserver->new();
	    my @pairs;
	    while () {
	        chomp;
	        my ($id, $role) = split(/\t/, $_);
	        push @pairs, [$role, $id];
	    }

Then, we call the Subsystem Server method metabolic_reconstruction and process the results to produce the output table.

	    my $reconstruction = 
	        $ssObject->metabolic_reconstruction(-roles => \@pairs);
	    for my $record (@$reconstruction) {
	        my ($variantID, $role, $id) = @$record;
	        my ($subsysID, $code) = $variantID =~ /^(.+):(.+)$/;
	        print join("\t", $subsysID, $code, $role, $id) . "\n";
	    }

Example 2 Input File (Truncated)

fig|107806.1.peg.1      Anthranilate synthase, aminase component (EC 4.1.3.27) # TrpAa          82
fig|107806.1.peg.2      Anthranilate synthase, amidotransferase component (EC 4.1.3.27) # TrpAb Buchnera aphidicola str. APS (Acyrthosiphon pisum)      167
fig|107806.1.peg.3      Anthranilate synthase, amidotransferase component (EC 4.1.3.27) # TrpAb Buchnera aphidicola str. APS (Acyrthosiphon pisum)      167
fig|107806.1.peg.7      2-isopropylmalate synthase (EC 2.3.3.13)        Buchnera aphidicola str. APS (Acyrthosiphon pisum)      508
fig|107806.1.peg.8      3-isopropylmalate dehydrogenase (EC 1.1.1.85)   Buchnera aphidicola str. APS (Acyrthosiphon pisum)      353
fig|107806.1.peg.9      3-isopropylmalate dehydratase large subunit (EC 4.2.1.33)       Buchnera aphidicola str. APS (Acyrthosiphon pisum)      462
fig|107806.1.peg.10     3-isopropylmalate dehydratase small subunit (EC 4.2.1.33)               28
fig|107806.1.peg.11     tRNA uridine 5-carboxymethylaminomethyl modification enzyme GidA        Buchnera aphidicola str. APS (Acyrthosiphon pisum)      150
fig|107806.1.peg.12     ATP synthase A chain (EC 3.6.3.14)      Buchnera aphidicola str. APS (Acyrthosiphon pisum)      264
.
.
.

Example 2 Output Table (Truncated)

tRNA processing	1	Ribonuclease P protein component (EC 3.1.26.5)	fig|107806.1.peg.24
tRNA processing	1	tRNA(Ile)-lysidine synthetase	fig|107806.1.peg.111
tRNA processing	1	tRNA-specific adenosine-34 deaminase (EC 3.5.4.-)	fig|107806.1.peg.247
tRNA processing	1	tRNA delta(2)-isopentenylpyrophosphate transferase (EC 2.5.1.8)	fig|107806.1.peg.541
tRNA processing	1	tRNA-i(6)A37 methylthiotransferase	fig|107806.1.peg.421
tRNA processing	1	Ribonuclease T (EC 3.1.13.-)	fig|107806.1.peg.187
tRNA processing	1	tRNA pseudouridine synthase B (EC 4.2.1.70)	fig|107806.1.peg.361
tRNA processing	1	tRNA pseudouridine synthase A (EC 4.2.1.70)	fig|107806.1.peg.198
tRNA aminoacylation, Arg	1.00	Arginyl-tRNA synthetase (EC 6.1.1.19)	fig|107806.1.peg.239
Experimental-EFP	1	Translation elongation factor P	fig|107806.1.peg.29
mnm5U34 biosynthesis bacteria	1	tRNA uridine 5-carboxymethylaminomethyl modification enzyme GidA	fig|107806.1.peg.11
mnm5U34 biosynthesis bacteria	1	tRNA 5-methylaminomethyl-2-thiouridine synthase TusD	fig|107806.1.peg.507
mnm5U34 biosynthesis bacteria	1	tRNA 5-methylaminomethyl-2-thiouridine synthase TusA	fig|107806.1.peg.427
mnm5U34 biosynthesis bacteria	1	tRNA 5-methylaminomethyl-2-thiouridine synthase TusC	fig|107806.1.peg.506
mnm5U34 biosynthesis bacteria	1	GTPase and tRNA-U34 5-formylation enzyme TrmE	fig|107806.1.peg.26
mnm5U34 biosynthesis bacteria	1	tRNA (5-methylaminomethyl-2-thiouridylate)-methyltransferase (EC 2.1.1.61)	fig|107806.1.peg.253
mnm5U34 biosynthesis bacteria	1	Cysteine desulfurase (EC 2.8.1.7)	fig|107806.1.peg.568
mnm5U34 biosynthesis bacteria	1	tRNA 5-methylaminomethyl-2-thiouridine synthase TusB	fig|107806.1.peg.505
Ribosome LSU bacterial	1	LSU ribosomal protein L18p (L5e)	fig|107806.1.peg.483
Ribosome LSU bacterial	1	LSU ribosomal protein L11p (L12e)	fig|107806.1.peg.47
Ribosome LSU bacterial	1	LSU ribosomal protein L23p (L23Ae)	fig|107806.1.peg.497
.
.
.

E.3 Example 3: Creating Custom Interfaces

Suppose that you had substantial expertise in graphical interfaces, understood the power of comparative analysis, and wished to support the ability to graphically display the chromosomal regions around a set of genes (normally from distinct genomes). The SEED offers one alternative for doing this (see the region displayed here for an example), but suppose that you did not like forcing users to find appropriate SEED IDs and you thought that you could develop a superior display.

Example 3 illustrates the functions required to determine the location of a SEED gene encoding a specific protein and to acquire the genes from a given region centered on that location. If you were to create a program to accept arbitrary protein IDs, use the conversion capabilities demonstrated in example1, and display the regions graphically around these genes, you would have the core of a useful tool. If you shaded genes from the same subsystem (determined using the capabilities described in example2), you could enhance the supported functionality. Of course, you could also compute which genes could be connected to literature or structures and encode that data as well.

Example 3 Discussion

With the server packages installed as described in section C, the code in example 3 can be run as follows

> perl server_paper_example3.pl "fig|100226.1.peg.3361" 2000

First, we get the location for the subject ID using the Sapling Server

	my $sapObject = SAPserver->new();
	my $geneLocH  = $sapObject->fid_locations(-ids => [$geneID]);
	my $geneLoc   = $geneLocH->{$geneID}->[0];

Next, we normalize the direction and compute the desired region.

	if ($geneLoc =~ /^(\S+)_(\d+)([+-])(\d+)/)  # retrieve an encoded location
	{
	    my($contig,$beg,$strand,$length) = ($1,$2,$3,$4);
	    my ($left,$right);
         if ($strand eq "+")
         {
             ($left,$right) = ($beg, $beg + ($length-1));
         }
         else
         {
             ($left,$right) = ($beg - ($length-1), $beg);
	    }
	    my $paddedLeft      = ($left > $max_distance) ? $left  - $max_distance : 1;
	    my $paddedRight     = $right + $max_distance;
	    my $sz              = ($paddedRight + 1) - $paddedLeft;
	    my $region          = $contig . "_" . $paddedLeft . "+" . $sz;

Then we use the Sapling Server to retrieve all genes in the region and display the results.

	    my $genesInRegionH  = $sapObject->genes_in_region(-locations => [$region],
	                                                      -includeLocation => 1);
	    my $genesInRegion   = $genesInRegionH->{$region};
	    foreach my $geneID2 (keys(%$genesInRegion)) {
	        my $location = $genesInRegion->{$geneID2};
	        $location =~ /^(\S+)_(\d+)([+-])(\d+)/;
		print "$geneID2\t$1\t$2\t$3\t$4\n";
	    }

Example 3 Output Table

fig|100226.1.peg.3358	100226.1:NC_003888	3764143	+	867
fig|100226.1.peg.3359	100226.1:NC_003888	3765006	+	504
fig|100226.1.peg.3360	100226.1:NC_003888	3765814	+	360
fig|100226.1.peg.3361	100226.1:NC_003888	3766170	+	612
fig|100226.1.peg.3362	100226.1:NC_003888	3766852	+	489
fig|100226.1.peg.3363	100226.1:NC_003888	3767961	-	606
fig|100226.1.peg.3364	100226.1:NC_003888	3770099	-	2007

E.4 Example 4: Access to Functional Coupling (Conserved Contiguity) Data

A great deal has been learned from studying genes that tend to occur close to one another in diverse genomes [PMID: 11471247 - change date to 1998, PMID: 9787636, PMID: 10077608, PMID: 11230160, PMID: 18712303].

Example 4 accesses the SEED server that offers access to the data we use to compute co-occurrence scores. The program illustrates the potential for constructing custom tools by going through all of the protein-encoding genes in all of the complete prokaryotic genomes maintained within the SEED looking for "hypothetical proteins" that tend to co-occur with genes encoding functions that can be connected to subsystems. The program constructs a table showing

the gene,
the function of the gene,
the genome id containing such a gene,
the description of the genome,
the non-hypothetical gene in a subsystem that appears to have the strongest co-occurrence score,
the co-occurrence score, and
the function assigned to the co-occurring gene contained in a subsystem.

We believe that there are many variations to this basic data mining capability that could be implemented on top of this basic co-occurrence data.

Example 4 Discussion

With the server packages installed as described in section C, the code in example 4 can be run as follows

> perl server_paper_example4.pl

First we retrieve all complete geneomes from the Sapling database

            my $genomeHash = $sapObject->all_genomes(-complete => 1);

Then we get all "Hypothetical" genes

	    my $geneHash = $sapObject->feature_assignments(-genome => $genome,
	                                                   -type => 'peg');
	    my @hypotheticalGenes = grep { &SeedUtils::hypo($geneHash->{$_}) } sort keys %$geneHash;

Then we use the Functional Coupling Server to get a hash of all genes in the neighborhood of the hypothetical gene

	    my $couplingHash = $sapObject->conserved_in_neighborhood(-ids => \@hypotheticalGenes,
	                                                            -hash => 1);

And finally, we process the coupling data for each hypothetical gene and format the output.

	    for my $gene (@hypotheticalGenes) {
	        my $couplingList = $couplingHash->{$gene};
	        if (defined $couplingList) {
	            my $subHash = $sapObject->ids_to_subsystems(-ids => [ map { $_->[1]} @$couplingList ]);
	            my ($bestCoupler, $bestScore, $bestFunction) = (undef, 0, '');
	            for my $coupling (@$couplingList) {
	                my ($score, $coupler, $function) = @$coupling;
	                if ($subHash->{$coupler} && $score > $bestScore) {
	                    $bestCoupler = $coupler;
	                    $bestScore = $score;
	                    $bestFunction = $function;
	                }
	            }
	            if (defined $bestCoupler) {
	                print join("\t", $gene, $geneHash->{$gene}, $genome, $genomeName,
	                                 $bestCoupler, $bestScore, $bestFunction) . "\n";
	            }
	        }
	    }

Example 4 output (truncated)

fig|273035.4.peg.1016	hypothetical protein	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.1018	31	DNA helicase
fig|273035.4.peg.1234	predicted metal-dependent hydrolase	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.1233	80	Diacylglycerol kinase (EC 2.7.1.107)
fig|273035.4.peg.1496	hypothetical protein	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.1491	9	Phage terminase large subunit
fig|273035.4.peg.1570	S1 RNA binding domain protein	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.1571	10	Glucose-6-phosphate isomerase (EC 5.3.1.9)
fig|273035.4.peg.328	hypothetical protein	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.329	60	Putative tRNA-m1A22 methylase
fig|273035.4.peg.403	hypothetical protein	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.402	54	LSU ribosomal protein L27p
fig|273035.4.peg.60	hypothetical protein	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.61	9	hypothetical protein
fig|273035.4.peg.600	hypothetical protein	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.598	18	Recombination protein U homolog
fig|273035.4.peg.61	hypothetical protein	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.60	9	hypothetical protein
fig|273035.4.peg.710	hypothetical protein	273035.4	Spiroplasma kunkelii CR2-3x	fig|273035.4.peg.708	12	LSU ribosomal protein L31p
.
.
.

E.5 Services to Support Annotation of Genes

E.5.a Identification of Genes

If one builds an annotation pipeline, one of the first steps involves identifying the putative genes. Example 5 illustrates some basic functions that can be invoked via the servers to identify protein-encoding genes and rRNA-encoding genes, and tRNA-encoding genes. These services utilize tools made available by JCVI, Niels Larsen, Gary Olsen, and Sean Eddy. They offer reasonably accurate, easily-invoked services to locate genes in prokaryotic genomes.

E.5.b Assigning Functions to Encoded Proteins

Once genes have been identified, the next step usually relates to making initial estimates of function for the products of the protein-encoding genes. Example 6 reads a fasta file of protein sequences and generates initial estimates of function. There are two levels of service offered: the first is a very fast technique that can assign functions to most proteins that have been placed in FIGfams. The second, much slower technique, involves invoking BLAT [PMID: 11932250] to seek an estimate of function against prokaryotic genes that have not been placed in FIGfams. Both techniques are far more rapid than the use of BLAST [PMID: 2231712], but they are also miss some similarities BLAST detects.

E.5.c Creation of a Metabolic Reconstruction

Example 7 illustrates how to create an initial metabolic reconstruction from the assignments generated via a technique like those illustrated in example6. It uses functionality already demonstrated in example2. The program takes the detected functions, composes a set of functional roles (splitting multi-functional function assignments into the atomic functional roles), and accesses the inferred set of subsystems.

F. Using the Command Line Services

If you don't want to write Perl programs but would like to use the SEED servers to process your data, we supply a number of predefined shell scripts that provide basic bioinformatics functions using the servers. These scripts are all prefaced with "svr_" and are found in the bin directory of the distribution. These are designed to use stdin and stdout and to be piped together to form more complex processing.

For instance, to get a two column table of all feature ids and associated functions for a given genome, you could do this:

svr_all_features 3702.1 peg | svr_function_of

This would produce a 2-column table. The first column would contain PEG IDs for genes occurring in genome 3702.1, and the second would contain the functions of those genes.

Here are examples of a number of basic functions using the servers that can be run from the command line and piped together to create small systems. These should serve as models for others who wish to create their own custom bioinformatics systems using the servers.

F.1 Find all features for a genome.

Simply retrieving all the features for a given genome is often the first step in an analysis sequence. This command is designed to be issued at the command line and takes as arguments a genome id and a feature type. The output is a single column table containing feature ids, suitable for piping into subsequent commands.

svr_all_features genome_id feature_type

This script returns all features of the specified type for the specified genome.

The code for this server is here. The man page is here.

Here is an example of running this command:

> svr_all_features 3702.1 peg
fig|3702.1.peg.1
fig|3702.1.peg.2
fig|3702.1.peg.3
fig|3702.1.peg.4
fig|3702.1.peg.5
fig|3702.1.peg.6
fig|3702.1.peg.7
fig|3702.1.peg.8
fig|3702.1.peg.9
fig|3702.1.peg.10
fig|3702.1.peg.11
fig|3702.1.peg.12
fig|3702.1.peg.13
fig|3702.1.peg.14
fig|3702.1.peg.15
.
.
.

F.2 Find Gene Function

Given a set of protein-encoding genes, a next step might be to retrieve the assigned function for each gene. This command takes as input a single column table of gene ids and returns a tab-separated two column table of gene id and function.

svr_function_of < table_of_gene_ids

Note that this script uses stdin and stdout and is designed to be part of a processing pipeline.

The code for this server is here. The man page is here.

Here is an example of running this command:

> svr_all_features 3702.1 peg | svr_function_of
fig|3702.1.peg.1	photosystem II protein D1 (PsbA)
fig|3702.1.peg.2	maturase
fig|3702.1.peg.3	SSU ribosomal protein S16p, chloroplast
fig|3702.1.peg.4	Photosystem II protein PsbK
fig|3702.1.peg.5	Photosystem II protein PsbI
fig|3702.1.peg.6	ATP synthase alpha chain (EC 3.6.3.14)
fig|3702.1.peg.7	ATP synthase CF0 B chain
fig|3702.1.peg.8	ATP synthase C chain (EC 3.6.3.14)
fig|3702.1.peg.9	ATP synthase CF0 A chain
fig|3702.1.peg.10	SSU ribosomal protein S2p (SAe), chloroplast
fig|3702.1.peg.11	DNA-directed RNA polymerase delta (= beta'') subunit (EC 2.7.7.6), chloroplast
fig|3702.1.peg.12	DNA-directed RNA polymerase gamma subunit (EC 2.7.7.6), chloroplast
fig|3702.1.peg.13	DNA-directed RNA polymerase beta subunit (EC 2.7.7.6), chloroplast
fig|3702.1.peg.14	Cytochrome b6-f complex subunit VIII (PetN)
fig|3702.1.peg.15	Photosystem II protein PsbM
.
.
.

F.3 Find Gene Aliases

Instead of function, perhaps you wish to see all the aliases by which a given feature or set of features is known in the SEED. You would use this command that behaves just like svr_function_of except it returns aliases:

svr_aliases_of < table_of_gene_ids

Note that this script uses stdin and stdout and is designed to be part of a processing pipeline.

The code for this server is here. The man page is here.

Here is an example of running this command:

> svr_all_features 3702.1 peg | svr_aliases_of
fig|3702.1.peg.1	gi|112382048,gi|113200888,gi|114054364,gi|114107113,gi|114329726,gi|115531894,gi|134286292,gi|134286378,
gi|134286553,gi|134286643,gi|134286733,gi|134286999,gi|139387232,gi|139389076,gi|139389398,gi|139389623,gi|139389781,gi|13938993
1,gi|156597939,gi|156598592,gi|157695865,gi|159792928,gi|159793098,gi|159895452,gi|159895537,gi|166344112,gi|167391785,gi|169142
690,gi|169142840,gi|169142925,gi|169143011,gi|169794053,gi|6723714,sp|A4QJR4,sp|A4QJZ9,sp|A4QKH2,sp|A4QKR1,sp|A4QL00,sp|A4QLR3,s
p|B0Z4K6,sp|B0Z4U0,sp|B0Z524,sp|B0Z5A8,sp|B1A915,sp|B1NWD0,sp|P83755,sp|P83755,sp|P83756,sp|P83756,sp|Q06FY1,sp|Q09G66,sp|Q0G9Y2
,tr|A4QJZ9,tr|A4QKH2,tr|A4QL00,tr|A4QLR3,tr|A9QAZ4,tr|A9QAZ4,tr|A9QBW0,tr|A9QBW0,tr|Q06FY1,tr|Q09G66,tr|Q0G9Y2,fig|3702.1.peg.1,
gi|515374,gi|5881674,gi|7525013,fig|85636.1.peg.1,gi|13518299
fig|3702.1.peg.2	gi|12002371,gi|12002415,gi|12002417,gi|12002419,gi|12002421,gi|12002423,gi|12002425,gi|12002427,gi|12002
429,gi|12002431,gi|126022795,gi|5881675
fig|3702.1.peg.3	sp|P56806,sp|P56806,gi|5881676,gi|7525015
fig|3702.1.peg.4	sp|P56782,sp|P56782,fig|3702.1.peg.4,gi|5881677,gi|7525016
.
.
.

F.4 Find Neighbors

Beyond the basics of finding aliases and function, a more advanced analysis might require finding the PEGs that are in the neighborhood of a given PEG. This command takes as input a tab-separated table where the last field in each line contains the PEG for which a list of neighbors is being requested. It takes an argument telling how many neighbors to find to the left and right. The output file is the input file with an extra column appended at the end (containing a list of neighbors).

svr_neighbors_of n < table_of_gene_ids

Note that this script uses stdin and stdout and is designed to be part of a processing pipeline.

The code for this server is here. The man page is here.

Here is an example of running this command:

> svr_all_features 3702.1 peg | svr_neighbors_of 5
fig|3702.1.peg.1	fig|3702.1.peg.2,fig|3702.1.peg.3,fig|3702.1.peg.4,fig|3702.1.peg.5,fig|3702.1.peg.6
fig|3702.1.peg.2	fig|3702.1.peg.1,fig|3702.1.peg.3,fig|3702.1.peg.4,fig|3702.1.peg.5,fig|3702.1.peg.6,fig|3702.1.peg.7
fig|3702.1.peg.3	fig|3702.1.peg.1,fig|3702.1.peg.2,fig|3702.1.peg.4,fig|3702.1.peg.5,fig|3702.1.peg.6,fig|3702.1.peg.7,fi
g|3702.1.peg.8
fig|3702.1.peg.4	fig|3702.1.peg.1,fig|3702.1.peg.2,fig|3702.1.peg.3,fig|3702.1.peg.5,fig|3702.1.peg.6,fig|3702.1.peg.7,fi
g|3702.1.peg.8,fig|3702.1.peg.9
fig|3702.1.peg.5	fig|3702.1.peg.1,fig|3702.1.peg.2,fig|3702.1.peg.3,fig|3702.1.peg.4,fig|3702.1.peg.6,fig|3702.1.peg.7,fi
g|3702.1.peg.8,fig|3702.1.peg.9,fig|3702.1.peg.10
.
.
.

F.5 Corresponding Genes

Download the file of corresponding genes here.

It is created by running svr_fig_id_corr -external > id.correspondence.2

F.6 The RAST batch interface.

We have supplied a set of RAST batch scripts. Each of these RAST scripts (except for the submission script, which has much more complex arguments) takes as the first two arguments the username and password of a valid account on the RAST server.

A single job may be submitted using svr__submit_RAST_job. This script takes a number of arguments which define the parameters for the submission:

--user username	RAST login for the submitting user
--passwd password	RAST password for the submitting user
--genbank filename	If submitting a genbank file, the file of input data.
--fasta filename	If submitting a FASTA file of contigs, the file of input data.
--domain Bacteria or
--domain Archaea	Domain of the submitted genome.
--taxon_id taxonomy-id	The NCBI taxonomy id of the submitted genome
--bioname “genus species str.”	Biological name of the submitted genome
--genetic_code ( 11 \| 4 )	Genetic code for the submitted genome, either 11 or 4.
--gene_caller	Gene caller to use (FigFam-base RAST gene caller or straight Glimmer-3)
--reannotate_only	Preserve the original gene calls and use RAST

A set of jobs may be submitted using svr_run_RAST_jobs. This script takes a file of Genbank contig ids on the standard input, and will determine all Genbank contigs that are part of the same Genbank project, and submit them as a single RAST job.

svr_run_RAST_jobs username password < contig-ids

The script will return information and statistics about the job submission, as well as a job number.

The status of one or more jobs may be checked using svr_status_of_RAST_job:

svr_status_of_RAST_job username password jobnumber jobnumber …

When it returns a status of “completed”, the job is ready for retrieval.

The results of a job may be retrieved using svr_retrieve_RAST_job. The supported formats are as follows.

• genbank	(Genbank format)
• genbank_stripped	(Genbank with EC numbers removed)
• embl	(EMBL format)
• embl_stripped	(EMBL with EC numbers stripped)
• gff3	(GFF3 format)
• gff3_stripped	(GFF3 with EC numbers stripped)
• rast_tarball	(gzipped tar file of the entire job)

svr_retrieve_RAST_job username password jobnumber format > output-file

If necessary, an executing job may be killed:

svr_kill_RAST_job username password jobnumber

And deleted from the system entirely (careful, this may not be undone):

svr_delete_RAST_job username password jobnumber