The Sapling Server is a web service that allows you to access data stored in the Sapling database, a large-scale MySQL database of genetic data. The Sapling database contains data on public genomes imported from RAST (http://rast.nmpdr.org) and curated by the annotation team of the Fellowship for Interpretation of Genomes.
The SAPserver package wraps calls to the Sapling Server so that you can use them in your PERL program.
All Sapling Server services are documented in the SAP module. Each method has a signature like
my $document = $sapObject->taxonomy_of($args);
where $document
is usually a hash reference and $args
is always a hash
reference. The method description includes a section called Parameter Hash
Fields that describes the fields in $args
. For example, SAP/taxonomy_of
has a field called -ids
that is to be a list of genome IDs and an optional
field called -format
that indicates whether you want taxonomy groups
represented by numbers, names, or both. To call the taxonomy_of service,
you create a SAPserver object and call a method with the same name as the
service.
use strict; use SAPserver;
my $sapServer = SAPserver->new(); my $resultHash = $sapServer->taxonomy_of({ -ids => ['360108.3', '100226.1'], -format => 'names' }); for my $id (keys %$resultHash) { my $taxonomy = $resultHash->{$id}; print "$id: " . join(" ", @$taxonomy) . "\n"; }
The output from this program (reformatted slightly for readability) is shown below.
360108.3: Bacteria, Proteobacteria, delta/epsilon subdivisions, Epsilonproteobacteria, Campylobacterales, Campylobacteraceae, Campylobacter, Campylobacter jejuni, Campylobacter jejuni subsp. jejuni, Campylobacter jejuni subsp. jejuni 260.94 100226.1: Bacteria, Actinobacteria, Actinobacteria (class), Actinobacteridae, Actinomycetales, Streptomycineae, Streptomycetaceae, Streptomyces, Streptomyces coelicolor, Streptomyces coelicolor A3(2)
For convenience, you can specify the parameters as a simple hash rather than a hash reference. So, for example, the above taxonomy_of call could also be written like this.
my $resultHash = $sapServer->taxonomy_of(-ids => ['360108.3', '100226.1'], -format => 'names');
Let's look at a simple program that pulls all the complete genomes from the database and displays their representative genomes plus their texonomies in name format.
Three Sapling Server methods are needed to perform this function.
The program starts by connecting to the Sapling Server itself.
use strict; use SAPserver; my $sapServer = SAPserver->new();
Now we use all_genomes to get a list of the complete genomes.
all_genomes will normally return all genomes, but we use the
-complete
option to restrict the output to those that are complete.
my $genomeIDs = $sapServer->all_genomes(-complete => 1);
All we want are the genome IDs, so we use a PERL trick to convert the
hash reference in $genomeIDs
to a list reference.
$genomeIDs = [ keys %$genomeIDs ];
Now we ask for the representatives and the taxonomies.
my $representativeHash = $sapServer->representative(-ids => $genomeIDs); my $taxonomyHash = $sapServer->taxonomy_of(-ids => $genomeIDs, -format => 'names');
Our data is now contained in a pair of hash tables. The following loop stiches them together to produce the output.
for my $genomeID (@$genomeIDs) { my $repID = $representativeHash->{$genomeID}; my $taxonomy = $taxonomyHash->{$genomeID}; # Format the taxonomy string. my $taxonomyString = join(" ", @$taxonomy); # Print the result. print join("\t", $genomeID, $repID, $taxonomyString) . "\n"; }
An excerpt from the output of this script is shown below. The first column contains a genome ID, the second contains the representative genome's ID, and the third is the full taxonomy. Note that the two genomes with very close taxonomies have the same representative genome: this is the expected behavior.
221109.1 221109.1 Bacteria Firmicutes Bacilli Bacillales Bacillaceae Oceanobacillus Oceanobacillus iheyensis Oceanobacillus iheyensis HTE831 204722.1 204722.1 Bacteria Proteobacteria Alphaproteobacteria Rhizobiales Brucellaceae Brucella Brucella suis Brucella suis 1330 391037.3 369723.3 Bacteria Actinobacteria Actinobacteria (class) Actinobacteridae Actinomycetales Micromonosporineae Micromonosporaceae Salinispora Salinispora arenicola Salinispora arenicola CNS205 339670.3 216591.1 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Burkholderiaceae Burkholderia Burkholderia cepacia complex Burkholderia cepacia Burkholderia cepacia AMMD 272560.3 216591.1 Bacteria Proteobacteria Betaproteobacteria Burkholderiales Burkholderiaceae Burkholderia pseudomallei group Burkholderia pseudomallei Burkholderia pseudomallei K96243 262768.1 262768.1 Bacteria Firmicutes Mollicutes Acholeplasmatales Acholeplasmataceae Candidatus Phytoplasma Candidatus Phytoplasma asteris Onion yellows phytoplasma Onion yellows phytoplasma OY-M 272624.3 272624.3 Bacteria Proteobacteria Gammaproteobacteria Legionellales Legionellaceae Legionella Legionella pneumophila Legionella pneumophila subsp. pneumophila Legionella pneumophila subsp. pneumophila str. Philadelphia 1 150340.3 150340.3 Bacteria Proteobacteria Gammaproteobacteria Vibrionales Vibrionaceae Vibrio Vibrio sp. Ex25 205914.1 205914.1 Bacteria Proteobacteria Gammaproteobacteria Pasteurellales Pasteurellaceae Histophilus Histophilus somni Haemophilus somnus 129PT 393117.3 169963.1 Bacteria Firmicutes Bacilli Bacillales Listeriaceae Listeria Listeria monocytogenes Listeria monocytogenes FSL J1-194 203119.1 203119.1 Bacteria Firmicutes Clostridia Clostridiales Clostridiaceae Clostridium Clostridium thermocellum Clostridium thermocellum ATCC 27405 380703.5 380703.5 Bacteria Proteobacteria Gammaproteobacteria Aeromonadales Aeromonadaceae Aeromonas Aeromonas hydrophila Aeromonas hydrophila subsp. hydrophila Aeromonas hydrophila subsp. hydrophila ATCC 7966 259536.4 259536.4 Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Moraxellaceae Psychrobacter Psychrobacter arcticus Psychrobacter arcticus 273-4
The Sapling Server processes lists of data (in this case a list of genome IDs) so that you can minimize the overhead of sending requests across the web. You can, however, specify a single ID instead of a list to a method call, and this would allow you to structure your program with a more traditional loop, as follows. To make this process simpler, you construct the Sapling Server object in singleton mode. In singleton mode, when you pass in only a single ID, you get back an actual result instead of a hash reference.
use strict; use SAPserver; my $sapServer = SAPserver->new(singleton => 1); my $genomeIDs = $sapServer->all_genomes(-complete => 1); $genomeIDs = [ keys %$genomeIDs ]; for my $genomeID (@$genomeIDs) { my $repID = $sapServer->representative(-ids => $genomeID); my $taxonomy = $sapServer->taxonomy_of(-ids => $genomeID, -format => 'names'); # Format the taxonomy string. my $taxonomyString = join(" ", @$taxonomy); # Print the result. print join("\t", $genomeID, $repID, $taxonomyString) . "\n"; }
At this point there is a risk that you are bewildered by all the options we've presented-- hashes, hash references, singleton mode-- however, the goal here is to provide a facility that will fit comfortably with different programming styles. The server software tries to figure out how you want to use it and adjusts accordingly.
Many of the Sapling Server services return data on genes (a term we use rather
loosely to include any kind of genetic locus or feature). The standard
method for identifying a gene is the FIG ID, an identifying string that
begins with the characters fig|
and includes the genome ID, the gene type,
and an additional number for uniqueness. For example, the FIG ID
fig|272558.1.peg.203
describes a protein encoding gene (peg) in
Bacillus halodurans C-125 (272558.1).
Frequently, however, you will have a list of gene IDs from some other
database (e.g. NCBI, UniProt) or in a community format such as Locus Tags
or gene names. Most services that take gene IDs as input allow you to specify a
-source
option that indicates the type of IDs being used. The acceptable
formats are as follows.
fig|272558.1.peg.203
are 10172815
and NTL01BH0204
.
accD
, which identifies the Acetyl-coenzyme A carboxyl
transferase beta chain, corresponds to 58 specific genes in the database.
fig|272558.1.peg.203
is
891745
.
fig|158878.1.peg.2821
is sav:SAV2628
.
fig|380703.5.peg.250
is
ABK38410.1
.
fig|272558.1.peg.203
are 10172815
, 15612766
, and 81788207
.
fig|272558.1.peg.203
is NP_241069.1
.
fig|243277.1.peg.3952
is O31153
.
fig|272558.1.peg.203
are Q9KGA9
and Q9KGA9_BACHD
.
You can also mix identifiers of different types by specifying mixed
for the source type. In this case, however, care must be taken, because the same
identifier can have different meanings in different databases.
The following examples use the Sapling Server in normal mode: that is, data is sent to the server in batches and the results are stitched together afterward. In this mode there is significantly reduced overhead, but there is also a risk that the server request might time out. If this happens, you may want to consider breaking the input into smaller batches. At some point, the server system will perform sophisticated flow control to reduce the risk of timeout errors, but we are not yet there.
There are two primary methods for retrieving functional roles.
In both cases, the list of incoming gene IDs is given as a list via the ids
parameter. It is assumed the IDs are FIG identifiers, but the source
parameter
can be used to specify a different ID type (see Specifying Gene IDs).
ids_to_functions provides the basic functional role. Almost every gene in the system will return a result with this method. The following example program reads a file of UniProt IDs and produces their functional roles. Note that we're assuming the file is a manageable size: since we're submitting the entire file at once, we risk a timeout error if it's too big.
use strict; use SAPserver; my $sapServer = SAPserver->new(); # Read all the gene IDs. my @genes = map { chomp; $_ } <STDIN>; # Compute the functional roles. my $results = $sapServer->ids_to_functions(-ids => \@genes, -source => 'UniProt'); # Loop through the genes. for my $gene (@genes) { # Did we find a result? my $role = $results->{$gene}; if (defined $role) { # Yes, print it. print "$gene\t$role\n"; } else { # No, emit a warning. print STDERR "$gene was not found.\n"; } }
Sample output from this script is shown below. Note that one of the input IDs was not found.
HYPA_ECO57 [NiFe] hydrogenase nickel incorporation protein HypA 17KD_RICBR rickettsial 17 kDa surface antigen precursor 18KD_MYCLE 18 KDA antigen (HSP 16.7) P72620_SYNY3 [NiFe] hydrogenase metallocenter assembly protein HypD 1A14_ARATH 1-aminocyclopropane-1-carboxylate synthase 4 / ACC synthase 4 (ACS4) / identical to gi:940370 [GB:U23481]; go_function: 1-aminocyclopropane-1-carboxylate synthase activity [goid 0016847]; go_process: ethylene biosynthesis [goid 0009693]; go_process: response to auxin stimulus [goid 0009733] Q2RXN5_RHORT [NiFe] hydrogenase metallocenter assembly protein HypE O29118 Glutamate N-acetyltransferase (EC 2.3.1.35) / N-acetylglutamate synthase (EC 2.3.1.1) Q8PZM3 tRNA nucleotidyltransferase (EC 2.7.7.25)
Q8YY27_ANASP was not found.
ids_to_subsystems returns roles in subsystems. Roles in subsystems have several differences from general functional roles. Only half of the genes in the database are currently associated with subsystems.A single gene may be in In addition, multiple subsystems and may have multiple roles in a subsystem.
As a result, instead of a single string per incoming gene, ids_to_subsystems returns a list. Each element of the list consists of the role name followed by the subsystem name. This makes the processing of the results a little more complex, because we have to iterate through the list. An empty list indicates the gene is not in a subsystem (although it could also indicate the gene was not found).
use SAPserver; my $sapServer = SAPserver->new(); # Read all the gene IDs. my @genes = map { chomp; $_ } <STDIN>; # Compute the functional roles. my $results = $sapServer->ids_to_subsystems(-ids => \@genes, -source => 'UniProt'); # Loop through the genes. for my $gene (@genes) { # Did we find a result? my $roleData = $results->{$gene}; if (! @$roleData) { # Not in a subsystem: emit a warning. print STDERR "$gene is not in a subsystem.\n"; } else { # Yes, print the entries. for my $ssData (@$roleData) { print "$gene\t$ssData->[0]\t($ssData->[1])\n" } } }
Sample output from this script is shown below. In this case, four of the input IDs
failed to find a result; however, two of them (O29118
and Q8PZM3
) produced multiple
results.
HYPA_ECO57 [NiFe] hydrogenase nickel incorporation protein HypA (NiFe hydrogenase maturation) P72620_SYNY3 [NiFe] hydrogenase metallocenter assembly protein HypD (NiFe hydrogenase maturation) Q2RXN5_RHORT [NiFe] hydrogenase metallocenter assembly protein HypE (NiFe hydrogenase maturation) O29118 N-acetylglutamate synthase (EC 2.3.1.1) (Arginine Biosynthesis extended) O29118 Glutamate N-acetyltransferase (EC 2.3.1.35) (Arginine Biosynthesis extended) O29118 N-acetylglutamate synthase (EC 2.3.1.1) (Arginine Biosynthesis) O29118 Glutamate N-acetyltransferase (EC 2.3.1.35) (Arginine Biosynthesis) Q8PZM3 tRNA nucleotidyltransferase (EC 2.7.7.25) (CBSS-316057.3.peg.1294) Q8PZM3 tRNA nucleotidyltransferase (EC 2.7.7.25) (tRNA nucleotidyltransferase)
17KD_RICBR is not in a subsystem. Q8YY27_ANASP is not in a subsystem. 18KD_MYCLE is not in a subsystem. 1A14_ARATH is not in a subsystem.
This next example finds all genes in subsystems for a specific genome. We will perform this operation in two phases. First, we will find the subsystems for each genome, and then the genes for those subsystems. This requires two Sapling Server functions.
Our example program will loop through the genome IDs in an input file. For each genome, it will call genomes_to_subsystems to get the subsystem list, and then feed the list to ids_in_subsystems to get the result.
SAP/ids_in_subsystems returns gene IDs rather than taking them as input.
Like SAP/ids_to_subsystems and SAP/ids_to_functions, it takes a source
parameter that indicates the type of ID desired (e.g. NCBI
, CMR
, LocusTag
).
In this case, however, the type describes how the gene IDs will be formatted in
the output rather than the type expected upon input. If a gene does not have an
ID for a particular source type (e.g. fig|100226.1.peg.3361
does not have a
locus tag), then the FIG identifier is used. The default source type (SEED
)
means that FIG identifiers will be used for everything.
The program is given below.
use strict; use SAPserver; my $sapServer = SAPserver->new(); # Loop through the input file. Note that this loop will stop on the first # blank line. while (my $genomeID = <STDIN>) { chomp $genomeID; # Get the subsystems for this genome. my $subHash = $sapServer->genomes_to_subsystems(-ids => $genomeID); # The data returned for each genome (and in our case there's only one) # includes the subsystem name and the variant code. The following # statement strips away the variant codes, leaving only the subsystem # names. my $subList = [ map { $_->[0] } @{$subHash->{$genomeID}} ]; # Ask for the genes in each subsystem, using NCBI identifiers. my $roleHashes = $sapServer->ids_in_subsystems(-subsystems => $subList, -genome => $genomeID, -source => 'NCBI', -roleForm => 'full'); # The hash maps each subsystem ID to a hash of roles to lists of feature # IDs. We therefore use three levels of nested loops to produce our # output lines. At the top level we have a hash mapping subsystem IDs # to role hashes. for my $subsystem (sort keys %$roleHashes) { my $geneHash = $roleHashes->{$subsystem}; # The gene hash maps each role to a list of gene IDs. for my $role (sort keys %$geneHash) { my $geneList = $geneHash->{$role}; # Finally, we loop through the gene IDs. for my $gene (@$geneList) { print "$genomeID\t$gene\t$subsystem\t$role\n"; } } } }
An excerpt of the output is shown here.
360108.3 85840651 Queuosine-Archaeosine Biosynthesis Queuosine Biosynthesis QueC ATPase 360108.3 85841812 Queuosine-Archaeosine Biosynthesis Queuosine Biosynthesis QueE Radical SAM 360108.3 85841520 Queuosine-Archaeosine Biosynthesis Queuosine biosynthesis QueD, PTPS-I 360108.3 85841162 Queuosine-Archaeosine Biosynthesis S-adenosylmethionine:tRNA ribosyltransferase-isomerase (EC 5.-.-.-) 360108.3 85842244 Queuosine-Archaeosine Biosynthesis tRNA-guanine transglycosylase (EC 2.4.2.29) 360108.3 85841653 Quinate degradation 3-dehydroquinate dehydratase II (EC 4.2.1.10) 360108.3 85840760 RNA polymerase bacterial DNA-directed RNA polymerase alpha subunit (EC 2.7.7.6) 360108.3 85841269 RNA polymerase bacterial DNA-directed RNA polymerase beta subunit (EC 2.7.7.6) 360108.3 85841348 RNA polymerase bacterial DNA-directed RNA polymerase beta' subunit (EC 2.7.7.6) 360108.3 85841887 RNA polymerase bacterial DNA-directed RNA polymerase omega subunit (EC 2.7.7.6) 360108.3 85841283 RNA processing and degradation, bacterial 3'-to-5' exoribonuclease RNase R 360108.3 85840820 RNA processing and degradation, bacterial Ribonuclease III (EC 3.1.26.3) 360108.3 85842272 Recycling of Peptidoglycan Amino Acids Aminoacyl-histidine dipeptidase (Peptidase D) (EC 3.4.13.3)
The ids_in_subsystems service has several useful options for changing the nature
of the output. For example, in the above program each role is represented by a
full description (roleForm
set to full
). If you don't need the roles, you
can specify none
for the role form. You can also request that the gene IDs
be returned in a comma-separated list instead of a list data structure. These
two changes can drastically simplify the above program.
use strict; use SAPserver; my $sapServer = SAPserver->new(); # Loop through the input file. Note that this loop will stop on the first # blank line. while (my $genomeID = <STDIN>) { chomp $genomeID; # Get the subsystems for this genome. my $subHash = $sapServer->genomes_to_subsystems(-ids => $genomeID); # The data returned for each genome (and in our case there's only one) # includes the subsystem name and the variant code. The following # statement strips away the variant codes, leaving only the subsystem # names. my $subList = [ map { $_->[0] } @{$subHash->{$genomeID}} ]; # Ask for the genes in each subsystem, using NCBI identifiers. my $genesHash = $sapServer->ids_in_subsystems(-subsystems => $subList, -genome => $genomeID, -source => 'NCBI', -roleForm => 'none', -grouped => 1); # The hash maps each subsystem ID to a comma-delimited list of gene IDs. for my $subsystem (sort keys %$genesHash) { my $genes = $genesHash->{$subsystem}; print "$genomeID\t$subsystem\t$genes\n"; } }
The sample output in this case looks quite different. The role information is missing, and all the data for a subsystem is in a single line.
360108.3 Queuosine-Archaeosine Biosynthesis 85841622, 85841791, 85840661, 85841162, 85841520, 85842244, 85840651, 85841812 360108.3 Quinate degradation 85841653 360108.3 RNA polymerase bacterial 85840760, 85841269, 85841348, 85841887 360108.3 RNA processing and degradation, bacterial 85840820, 85841283 360108.3 Recycling of Peptidoglycan Amino Acids 85842019, 85842272
In this example we'll string several services together to perform a more complex task: locating the upstream regions of operons involved in a particular metabolic pathway. The theory is that we can look for a common pattern in these regions.
A metabolic pathway is a subsystem, so we'll enter our database via the subsystems. To keep the data manageable, we'll limit our results to specific genomes. The program we write will take as input a subsystem name and a file of genome IDs.
The worst part of the task is finding the operon for a gene. This involves finding all the genes in a neighborhood and isolating the ones that point in the correct direction. Fortunately, there is a Sapling Server function-- SAP/make_runs-- that specifcally performs this task.
To start our program, we create a SAPserver object and pull the subsystem name from the command-line parameters. This program is going to be doing a lot of complicated, long-running stuff, so we'll usually want to deal with one result at a time. To facilitate that, we construct the server helper in singleton mode.
use strict; use SAPserver; my $sapServer = SAPserver->new(singleton => 1); # Get the subsystem name. my $ssName = $ARGV[0];
Our main loop will read through the list of genomes from the standard input and call a method PrintUpstream to process each one. We're going to be a bit cheesy here and allow our genome loop to stop on the first blank line.
while (my $genomeID = <STDIN>) { chomp $genomeID; PrintUpstream($sapServer, $ssName, $genomeID); }
Now we need to write PrintUpstream. Its first task is to find all the genes in the genome that belong to the subsystem. A single call to SAP/ids_in_subsystems will get this information. We then feed the results into SAP/make_runs to get operons and call SAP/upstream for each operon. The program is given below.
sub PrintUpstream { my ($sapServer, $ssName, $genomeID) = @_; # Because we specify "roleForm => none", we get back one long # gene list. my $geneList = $sapServer->ids_in_subsystems(-subsystems => $ssName, -genome => $genomeID, -roleForm => 'none'); # Convert the gene list to a comma-delimited string. my $geneString = join(", ", @$geneList); # Get a list of operons. my $opList = $sapServer->make_runs(-groups => $geneString); # Loop through the operons. for my $opString (@$opList) { # Get the first feature's ID. my ($front) = split /\s*,/, $opString, 2; # Grab its upstream region. We'll include the operon string as the # comment text. my $fasta = $sapServer->upstream(-ids => $front, -comments => { $front => $opString }, -skipGene => 1); # Print the result. print "$fasta\n"; } }