A complete tutorial is offered in SAP tutorial.
Bash Example for uncompressed files (sas.tar):
mkdir sasBash example for the compressed file (sas.tgz)
cp ~/Downloads/sas.tar sas
cd sas
TOP=`pwd`
tar -xvf sas.tar
cd modules
./BUILD_MODULES
export PERL5LIB=$PERL5LIB:$TOP/lib:$TOP/modules/lib
export PATH=$PATH:$TOP/bin
mkdir sas
cp ~/Downloads/sas.tgz sas
cd sas
TOP=`pwd`
tar -zxvf sas.tgz
cd modules
./BUILD_MODULES
export PERL5LIB=$PERL5LIB:$TOP/lib:$TOP/modules/lib
export PATH=$PATH:$TOP/bin
perl -e 'use SeedEnv'It should produce no errors.
#!/usr/bin/perl -w
use strict;
use SAPserver;
use Data::Dumper;
my $sapObject = SAPserver->new();
my $genomes = $sapObject->all_genomes();
foreach my $g (sort { $genomes->{$a} cmp $genomes->{$b} } keys(%$genomes)) {
print "$g\t$genomes->{$g}\n";
}
The program determines the set of IDs associated with proteins that have identical sequence to the input protein ID. We call these sequence-equivalent proteins. The program displays the associated protein sequence, and then it computes a table describing these proteins. There will be at least one row for each of the computed IDs. There will be several rows if multiple groups have used the same ID and attached functional assignments to the ID. The columns in the table are as follows:
The need to map IDs between groups and compare asserted functions of proteins is quite basic. It would be straightforward for any group to write a short CGI script using the capabilities illustrated in example1 that supported connecting protein IDs to literature (via PubMed), to structure data (when present), and so forth. Every annotation team needs this class of functionality.
> perl server_paper_example1.pl "fig|83333.1.peg.145"The work of this routine is in two parts. Here, we use the SAP server to build a hash of identifiers for precisely equivalent genes used in determining the contents of column 3 later on.
my %preciseHash;Here, we again use the SAP server to get all sequence equivalent IDs and produce the output table shown below (truncated for this example).
my $precise_assertions_list = $sapObject->equiv_precise(-ids => $id);
$precise_assertions_list = $precise_assertions_list->{$id};
if (@$precise_assertions_list > 0) {
my $inputID = $id;
for my $precise_assertion (@$precise_assertions_list) {
my ($newID, $function, $source, $expert) = @$precise_assertion;
$preciseHash{$newID} = 1;
}
my $assertions = $sapObject->equiv_sequence(-ids => $id);
my $assertions = $assertions->{$id};
if (@$assertions < 1) {
print STDERR "No results found.\n";
} else {
for my $assertion (@$assertions) {
my ($newID, $function, $source, $expert, $genomeName) = @$assertion;
$genomeName = '' if ! defined $genomeName;
my $column3 = ($preciseHash{$newID} ? 1 : 0);
print join("\t", $newID, $genomeName, $column3, $function, $source,
$expert) . "\n";
}
}
cmr|NT01SF0150 0 dnaK suppressor protein VC0596 [imported] CMR 0
cmr|NT02EC0154 0 dnaK suppressor protein VC0596 [imported] CMR 0
cmr|NT02SB0136 0 RNA polymerase-binding protein DksA CMR 0
cmr|NT02SD0166 0 RNA polymerase-binding protein DksA CMR 0
cmr|NT02SF0144 0 dnaK suppressor protein VC0596 [imported] CMR 0
cmr|NT03EC0181 0 DksA homolog CMR 0
cmr|NT04EC0178 0 dnaK suppressor protein VC0596 [imported] CMR 0
cmr|NT04SF0143 0 RNA polymerase-binding protein DksA CMR 0
cmr|NT04SS0162 0 RNA polymerase-binding protein DksA CMR 0
cmr|NT10EC0149 0 RNA polymerase-binding protein DksA CMR 0
.
.
.
> perl server_paper_example2.pl < Buchnera_roles.tblThe work of this routine is in two parts. First, we use the Subsystem Server to construct the list of role-id pairs for use later on.
my $sapObject = SAPserver->new();Then, we call the Subsystem Server method metabolic_reconstruction and process the results to produce the output table.
my @pairs;
while () {
chomp;
my ($id, $role) = split(/\t/, $_);
push @pairs, [$role, $id];
}
my $reconstruction =
$ssObject->metabolic_reconstruction(-roles => \@pairs);
for my $record (@$reconstruction) {
my ($variantID, $role, $id) = @$record;
my ($subsysID, $code) = $variantID =~ /^(.+):(.+)$/;
print join("\t", $subsysID, $code, $role, $id) . "\n";
}
fig|107806.1.peg.1 Anthranilate synthase, aminase component (EC 4.1.3.27) # TrpAa 82
fig|107806.1.peg.2 Anthranilate synthase, amidotransferase component (EC 4.1.3.27) # TrpAb Buchnera aphidicola str. APS (Acyrthosiphon pisum) 167
fig|107806.1.peg.3 Anthranilate synthase, amidotransferase component (EC 4.1.3.27) # TrpAb Buchnera aphidicola str. APS (Acyrthosiphon pisum) 167
fig|107806.1.peg.7 2-isopropylmalate synthase (EC 2.3.3.13) Buchnera aphidicola str. APS (Acyrthosiphon pisum) 508
fig|107806.1.peg.8 3-isopropylmalate dehydrogenase (EC 1.1.1.85) Buchnera aphidicola str. APS (Acyrthosiphon pisum) 353
fig|107806.1.peg.9 3-isopropylmalate dehydratase large subunit (EC 4.2.1.33) Buchnera aphidicola str. APS (Acyrthosiphon pisum) 462
fig|107806.1.peg.10 3-isopropylmalate dehydratase small subunit (EC 4.2.1.33) 28
fig|107806.1.peg.11 tRNA uridine 5-carboxymethylaminomethyl modification enzyme GidA Buchnera aphidicola str. APS (Acyrthosiphon pisum) 150
fig|107806.1.peg.12 ATP synthase A chain (EC 3.6.3.14) Buchnera aphidicola str. APS (Acyrthosiphon pisum) 264
.
.
.
tRNA processing 1 Ribonuclease P protein component (EC 3.1.26.5) fig|107806.1.peg.24
tRNA processing 1 tRNA(Ile)-lysidine synthetase fig|107806.1.peg.111
tRNA processing 1 tRNA-specific adenosine-34 deaminase (EC 3.5.4.-) fig|107806.1.peg.247
tRNA processing 1 tRNA delta(2)-isopentenylpyrophosphate transferase (EC 2.5.1.8) fig|107806.1.peg.541
tRNA processing 1 tRNA-i(6)A37 methylthiotransferase fig|107806.1.peg.421
tRNA processing 1 Ribonuclease T (EC 3.1.13.-) fig|107806.1.peg.187
tRNA processing 1 tRNA pseudouridine synthase B (EC 4.2.1.70) fig|107806.1.peg.361
tRNA processing 1 tRNA pseudouridine synthase A (EC 4.2.1.70) fig|107806.1.peg.198
tRNA aminoacylation, Arg 1.00 Arginyl-tRNA synthetase (EC 6.1.1.19) fig|107806.1.peg.239
Experimental-EFP 1 Translation elongation factor P fig|107806.1.peg.29
mnm5U34 biosynthesis bacteria 1 tRNA uridine 5-carboxymethylaminomethyl modification enzyme GidA fig|107806.1.peg.11
mnm5U34 biosynthesis bacteria 1 tRNA 5-methylaminomethyl-2-thiouridine synthase TusD fig|107806.1.peg.507
mnm5U34 biosynthesis bacteria 1 tRNA 5-methylaminomethyl-2-thiouridine synthase TusA fig|107806.1.peg.427
mnm5U34 biosynthesis bacteria 1 tRNA 5-methylaminomethyl-2-thiouridine synthase TusC fig|107806.1.peg.506
mnm5U34 biosynthesis bacteria 1 GTPase and tRNA-U34 5-formylation enzyme TrmE fig|107806.1.peg.26
mnm5U34 biosynthesis bacteria 1 tRNA (5-methylaminomethyl-2-thiouridylate)-methyltransferase (EC 2.1.1.61) fig|107806.1.peg.253
mnm5U34 biosynthesis bacteria 1 Cysteine desulfurase (EC 2.8.1.7) fig|107806.1.peg.568
mnm5U34 biosynthesis bacteria 1 tRNA 5-methylaminomethyl-2-thiouridine synthase TusB fig|107806.1.peg.505
Ribosome LSU bacterial 1 LSU ribosomal protein L18p (L5e) fig|107806.1.peg.483
Ribosome LSU bacterial 1 LSU ribosomal protein L11p (L12e) fig|107806.1.peg.47
Ribosome LSU bacterial 1 LSU ribosomal protein L23p (L23Ae) fig|107806.1.peg.497
.
.
.
Example 3 illustrates the functions required to determine the location of a SEED gene encoding a specific protein and to acquire the genes from a given region centered on that location. If you were to create a program to accept arbitrary protein IDs, use the conversion capabilities demonstrated in example1, and display the regions graphically around these genes, you would have the core of a useful tool. If you shaded genes from the same subsystem (determined using the capabilities described in example2), you could enhance the supported functionality. Of course, you could also compute which genes could be connected to literature or structures and encode that data as well.
> perl server_paper_example3.pl "fig|100226.1.peg.3361" 2000First, we get the location for the subject ID using the Sapling Server
my $sapObject = SAPserver->new();Next, we normalize the direction and compute the desired region.
my $geneLocH = $sapObject->fid_locations(-ids => [$geneID]);
my $geneLoc = $geneLocH->{$geneID}->[0];
if ($geneLoc =~ /^(\S+)_(\d+)([+-])(\d+)/) # retrieve an encoded locationThen we use the Sapling Server to retrieve all genes in the region and display the results.
{
my($contig,$beg,$strand,$length) = ($1,$2,$3,$4);
my ($left,$right);
if ($strand eq "+")
{
($left,$right) = ($beg, $beg + ($length-1));
}
else
{
($left,$right) = ($beg - ($length-1), $beg);
}
my $paddedLeft = ($left > $max_distance) ? $left - $max_distance : 1;
my $paddedRight = $right + $max_distance;
my $sz = ($paddedRight + 1) - $paddedLeft;
my $region = $contig . "_" . $paddedLeft . "+" . $sz;
my $genesInRegionH = $sapObject->genes_in_region(-locations => [$region],
-includeLocation => 1);
my $genesInRegion = $genesInRegionH->{$region};
foreach my $geneID2 (keys(%$genesInRegion)) {
my $location = $genesInRegion->{$geneID2};
$location =~ /^(\S+)_(\d+)([+-])(\d+)/;
print "$geneID2\t$1\t$2\t$3\t$4\n";
}
fig|100226.1.peg.3358 100226.1:NC_003888 3764143 + 867
fig|100226.1.peg.3359 100226.1:NC_003888 3765006 + 504
fig|100226.1.peg.3360 100226.1:NC_003888 3765814 + 360
fig|100226.1.peg.3361 100226.1:NC_003888 3766170 + 612
fig|100226.1.peg.3362 100226.1:NC_003888 3766852 + 489
fig|100226.1.peg.3363 100226.1:NC_003888 3767961 - 606
fig|100226.1.peg.3364 100226.1:NC_003888 3770099 - 2007
Example 4 accesses the SEED server that offers access to the data we use to compute co-occurrence scores. The program illustrates the potential for constructing custom tools by going through all of the protein-encoding genes in all of the complete prokaryotic genomes maintained within the SEED looking for "hypothetical proteins" that tend to co-occur with genes encoding functions that can be connected to subsystems. The program constructs a table showing
> perl server_paper_example4.plFirst we retrieve all complete geneomes from the Sapling database
my $genomeHash = $sapObject->all_genomes(-complete => 1);
Then we get all "Hypothetical" genes
my $geneHash = $sapObject->feature_assignments(-genome => $genome,
-type => 'peg');
my @hypotheticalGenes = grep { &SeedUtils::hypo($geneHash->{$_}) } sort keys %$geneHash;
Then we use the Functional Coupling Server to get a hash of all genes in the neighborhood of the hypothetical gene
my $couplingHash = $sapObject->conserved_in_neighborhood(-ids => \@hypotheticalGenes,
-hash => 1);
And finally, we process the coupling data for each hypothetical gene and format the output.
for my $gene (@hypotheticalGenes) {
my $couplingList = $couplingHash->{$gene};
if (defined $couplingList) {
my $subHash = $sapObject->ids_to_subsystems(-ids => [ map { $_->[1]} @$couplingList ]);
my ($bestCoupler, $bestScore, $bestFunction) = (undef, 0, '');
for my $coupling (@$couplingList) {
my ($score, $coupler, $function) = @$coupling;
if ($subHash->{$coupler} && $score > $bestScore) {
$bestCoupler = $coupler;
$bestScore = $score;
$bestFunction = $function;
}
}
if (defined $bestCoupler) {
print join("\t", $gene, $geneHash->{$gene}, $genome, $genomeName,
$bestCoupler, $bestScore, $bestFunction) . "\n";
}
}
}
fig|273035.4.peg.1016 hypothetical protein 273035.4 Spiroplasma kunkelii CR2-3x fig|273035.4.peg.1018 31 DNA helicase
fig|273035.4.peg.1234 predicted metal-dependent hydrolase 273035.4 Spiroplasma kunkelii CR2-3x fig|273035.4.peg.1233 80 Diacylglycerol kinase (EC 2.7.1.107)
fig|273035.4.peg.1496 hypothetical protein 273035.4 Spiroplasma kunkelii CR2-3x fig|273035.4.peg.1491 9 Phage terminase large subunit
fig|273035.4.peg.1570 S1 RNA binding domain protein 273035.4 Spiroplasma kunkelii CR2-3x fig|273035.4.peg.1571 10 Glucose-6-phosphate isomerase (EC 5.3.1.9)
fig|273035.4.peg.328 hypothetical protein 273035.4 Spiroplasma kunkelii CR2-3x fig|273035.4.peg.329 60 Putative tRNA-m1A22 methylase
fig|273035.4.peg.403 hypothetical protein 273035.4 Spiroplasma kunkelii CR2-3x fig|273035.4.peg.402 54 LSU ribosomal protein L27p
fig|273035.4.peg.60 hypothetical protein 273035.4 Spiroplasma kunkelii CR2-3x fig|273035.4.peg.61 9 hypothetical protein
fig|273035.4.peg.600 hypothetical protein 273035.4 Spiroplasma kunkelii CR2-3x fig|273035.4.peg.598 18 Recombination protein U homolog
fig|273035.4.peg.61 hypothetical protein 273035.4 Spiroplasma kunkelii CR2-3x fig|273035.4.peg.60 9 hypothetical protein
fig|273035.4.peg.710 hypothetical protein 273035.4 Spiroplasma kunkelii CR2-3x fig|273035.4.peg.708 12 LSU ribosomal protein L31p
.
.
.
svr_all_features 3702.1 peg | svr_function_ofThis would produce a 2-column table. The first column would contain PEG IDs for genes occurring in genome 3702.1, and the second would contain the functions of those genes.
svr_all_features genome_id feature_type
This script returns all features of the specified type for the specified genome.
The code for this server is here. The man page is here.
Here is an example of running this command:
> svr_all_features 3702.1 peg
fig|3702.1.peg.1
fig|3702.1.peg.2
fig|3702.1.peg.3
fig|3702.1.peg.4
fig|3702.1.peg.5
fig|3702.1.peg.6
fig|3702.1.peg.7
fig|3702.1.peg.8
fig|3702.1.peg.9
fig|3702.1.peg.10
fig|3702.1.peg.11
fig|3702.1.peg.12
fig|3702.1.peg.13
fig|3702.1.peg.14
fig|3702.1.peg.15
.
.
.
svr_function_of < table_of_gene_ids
Note that this script uses stdin and stdout and is designed to be part of a processing pipeline.
The code for this server is here. The man page is here.
Here is an example of running this command:
> svr_all_features 3702.1 peg | svr_function_of
fig|3702.1.peg.1 photosystem II protein D1 (PsbA)
fig|3702.1.peg.2 maturase
fig|3702.1.peg.3 SSU ribosomal protein S16p, chloroplast
fig|3702.1.peg.4 Photosystem II protein PsbK
fig|3702.1.peg.5 Photosystem II protein PsbI
fig|3702.1.peg.6 ATP synthase alpha chain (EC 3.6.3.14)
fig|3702.1.peg.7 ATP synthase CF0 B chain
fig|3702.1.peg.8 ATP synthase C chain (EC 3.6.3.14)
fig|3702.1.peg.9 ATP synthase CF0 A chain
fig|3702.1.peg.10 SSU ribosomal protein S2p (SAe), chloroplast
fig|3702.1.peg.11 DNA-directed RNA polymerase delta (= beta'') subunit (EC 2.7.7.6), chloroplast
fig|3702.1.peg.12 DNA-directed RNA polymerase gamma subunit (EC 2.7.7.6), chloroplast
fig|3702.1.peg.13 DNA-directed RNA polymerase beta subunit (EC 2.7.7.6), chloroplast
fig|3702.1.peg.14 Cytochrome b6-f complex subunit VIII (PetN)
fig|3702.1.peg.15 Photosystem II protein PsbM
.
.
.
svr_aliases_of < table_of_gene_ids
Note that this script uses stdin and stdout and is designed to be part of a processing pipeline.
The code for this server is here. The man page is here.
Here is an example of running this command:
> svr_all_features 3702.1 peg | svr_aliases_of
fig|3702.1.peg.1 gi|112382048,gi|113200888,gi|114054364,gi|114107113,gi|114329726,gi|115531894,gi|134286292,gi|134286378,
gi|134286553,gi|134286643,gi|134286733,gi|134286999,gi|139387232,gi|139389076,gi|139389398,gi|139389623,gi|139389781,gi|13938993
1,gi|156597939,gi|156598592,gi|157695865,gi|159792928,gi|159793098,gi|159895452,gi|159895537,gi|166344112,gi|167391785,gi|169142
690,gi|169142840,gi|169142925,gi|169143011,gi|169794053,gi|6723714,sp|A4QJR4,sp|A4QJZ9,sp|A4QKH2,sp|A4QKR1,sp|A4QL00,sp|A4QLR3,s
p|B0Z4K6,sp|B0Z4U0,sp|B0Z524,sp|B0Z5A8,sp|B1A915,sp|B1NWD0,sp|P83755,sp|P83755,sp|P83756,sp|P83756,sp|Q06FY1,sp|Q09G66,sp|Q0G9Y2
,tr|A4QJZ9,tr|A4QKH2,tr|A4QL00,tr|A4QLR3,tr|A9QAZ4,tr|A9QAZ4,tr|A9QBW0,tr|A9QBW0,tr|Q06FY1,tr|Q09G66,tr|Q0G9Y2,fig|3702.1.peg.1,
gi|515374,gi|5881674,gi|7525013,fig|85636.1.peg.1,gi|13518299
fig|3702.1.peg.2 gi|12002371,gi|12002415,gi|12002417,gi|12002419,gi|12002421,gi|12002423,gi|12002425,gi|12002427,gi|12002
429,gi|12002431,gi|126022795,gi|5881675
fig|3702.1.peg.3 sp|P56806,sp|P56806,gi|5881676,gi|7525015
fig|3702.1.peg.4 sp|P56782,sp|P56782,fig|3702.1.peg.4,gi|5881677,gi|7525016
.
.
.
svr_neighbors_of n < table_of_gene_ids
Note that this script uses stdin and stdout and is designed to be part of a processing pipeline.
The code for this server is here. The man page is here.
Here is an example of running this command:
> svr_all_features 3702.1 peg | svr_neighbors_of 5
fig|3702.1.peg.1 fig|3702.1.peg.2,fig|3702.1.peg.3,fig|3702.1.peg.4,fig|3702.1.peg.5,fig|3702.1.peg.6
fig|3702.1.peg.2 fig|3702.1.peg.1,fig|3702.1.peg.3,fig|3702.1.peg.4,fig|3702.1.peg.5,fig|3702.1.peg.6,fig|3702.1.peg.7
fig|3702.1.peg.3 fig|3702.1.peg.1,fig|3702.1.peg.2,fig|3702.1.peg.4,fig|3702.1.peg.5,fig|3702.1.peg.6,fig|3702.1.peg.7,fi
g|3702.1.peg.8
fig|3702.1.peg.4 fig|3702.1.peg.1,fig|3702.1.peg.2,fig|3702.1.peg.3,fig|3702.1.peg.5,fig|3702.1.peg.6,fig|3702.1.peg.7,fi
g|3702.1.peg.8,fig|3702.1.peg.9
fig|3702.1.peg.5 fig|3702.1.peg.1,fig|3702.1.peg.2,fig|3702.1.peg.3,fig|3702.1.peg.4,fig|3702.1.peg.6,fig|3702.1.peg.7,fi
g|3702.1.peg.8,fig|3702.1.peg.9,fig|3702.1.peg.10
.
.
.
--user username | RAST login for the submitting user |
--passwd password | RAST password for the submitting user |
--genbank filename | If submitting a genbank file, the file of input data. |
--fasta filename | If submitting a FASTA file of contigs, the file of input data. |
--domain Bacteria or | |
--domain Archaea | Domain of the submitted genome. |
--taxon_id taxonomy-id | The NCBI taxonomy id of the submitted genome |
--bioname “genus species str.” | Biological name of the submitted genome |
--genetic_code ( 11 | 4 ) | Genetic code for the submitted genome, either 11 or 4. |
--gene_caller | Gene caller to use (FigFam-base RAST gene caller or straight Glimmer-3) |
--reannotate_only | Preserve the original gene calls and use RAST |
• genbank | (Genbank format) |
• genbank_stripped | (Genbank with EC numbers removed) |
• embl | (EMBL format) |
• embl_stripped | (EMBL with EC numbers stripped) |
• gff3 | (GFF3 format) |
• gff3_stripped | (GFF3 with EC numbers stripped) |
• rast_tarball | (gzipped tar file of the entire job) |