Using NCBI eUtilities via CGI

DRAFT

The Entrez query system at NCBI allows users to query the various (over 30) primary NCBI databases through a single interface, the Entrez Core. The Core provides support for both the NCBI web interface and various program interfaces, especially the eUtilities that are available through the Web's Common Gateway Interface (CGI). The focus here is on using the eUtilities from programs written in Perl, Java, etc.; Perl is used for all examples.

The basic operation of the Core is fairly straightforward. For example, records in a particular database can be easily retrieved via searches based on the contents of specified fields.

However, searching for records in one database as a function of the contents of records in a second database is not as simple. Such a search requires the use of "link" information that connects records in the same or different primary data resources. (This is the same information that underlies the familiar "Links" pull-down options on NCBI Web pages.)

This presentation will describe basic use of the eUtilities through practical examples, covering the following topics:

Entrez manipulates sets of UIDs
The Entrez "query result database"
NCBI program interfaces to the Entrez Core
Using ePost in a Perl program
Using eSearch to create new query result database entries
Links, linksets, and using eLink to retrieve linksets
Using eLink to put UIDs into the query result database
Using eLink to get information about SNPs related to a specific gene
Using LWP to post large UID lists and retrieve results in batches
Retrieving eLinked records in batches using "index lists"
Additional information

Entrez manipulates sets of UIDs

Every database in the Entrez domain assigns unique IDs (UIDs) to major record-types in each database. These IDs are integer values unique within the database, but the same integer may be used to identify records in multiple databases. (Thus, to identify a particular record, one must specify both the database and the record's UID.)

The primary function of the Entrez programmatic interface is to help users manipulate sets of UIDs, and fetch data records identified by those UIDs. Entrez, itself, must also format data for Web-based users, but the programmatic interface leaves data display to the client program.

The interface allows programs to:

define a set of UIDs,
display the contents of records identified by a set of UIDs,
create a new UID set from an existing set by choosing members of the existing set whose data records satisfy specified criteria, and
create a new set of UIDs representing records that are in some way related to the records identified by an existing set of UIDs.

These primitive capabilities can be combined into powerful sequences that can integrate data from among most of the Entrez data resources. In fact, they make it possible to (partially) mimic relational database operations such as selects and joins on data in separate data resources.

Note, however, that record content may be retrieved in a limited number of report formats, where a report type contains a fixed subset of elements taken from the raw data record. As a result, additional processing may be required to prune report data for subsequent display or use, and/or multiple requests may be required to retrieve data in multiple report formats to obtain all desired data fields.

The Entrez "query result database"

The Entrez Core can keep a record of each query it processes, including the UID set resulting from each query. The database holding these records will be referred to as "the query result database" within this presentation, although it is described as "the History" or "the History server" in some NCBI documentation.

The query result database can be used by (most of the) programs that implement the Entrez set manipulation functions listed above, and is so important for efficient use of Entrez that this presentation is almost entirely oriented around it. "Efficient" use of the query result database allows users to download large numbers of records without violating the access rate limits that NCBI imposes upon remote queries.

Each UID set in the query result database is identified by 3 pieces of information:

a query identifier, known as the "query key",
the name of the database used to generate the associated UID set, and
an identifier for the state of the database at the time of the query, known as the "web environment".

Query keys are integers, but are often displayed as a pound sign (#) followed by an integer. The Entrez database names are strings like "snp", "nuc", "nucest", "gene", etc. Web environment identifiers are long (around 60 character) strings.

Here is a schematic query result database entry:

Database	Query Key	Web Env (edited)	UID set
snp	2	A3zq156CDS_p1DdWz...AU6u3yb5D3B634BAF50	242, 28853987

NCBI program interfaces to the Entrez Core

There exist several "technologies" for accessing remote data and computing resources programmatically.

The two most popular approaches are:

the Web Common Gateway Interface (CGI), and
Remote Procedure Calls (RPC) over SOAP, sometimes known as JAX-RPC or "Web Services".

NCBI supports both of these interfaces to the Entrez Core. In addition, NCBI provides an educational Perl module (NCBI_PowerScripting.pm) that defines a set of objects that call the CGI services behind the scenes.

The CGI and Web Services routines are known as the "eUtilities" or "eUtils", and may be categorized with respect to the UID manipulation functions listed above as:

Function	Generic name	CGI routine
define a set of UIDs	ePost (and sometimes eSearch)	epost.fcgi, esearch.fcgi
display the contents of records identified by UIDs	eSummary and eFetch	esummary.fcgi, efetch.fcgi
create a UID set from a previously defined set	eSearch	esearch.fcgi
create a UID set by finding links from an existing set	eLink	elink.fcgi

This presentation will deal only with the CGI functions, but the Web Services provide identical functionality within the JAX-RPC framework. (Note that the Web Services are not currently, circa 2007, available via Perl.)

Here is a URL that uses the epost.fcgi script to insert (or "post") 2 UIDs (242 and 28853987) into the query result database:

   http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=snp&id=242,28853987

If you enter this URL into a Web browser you will get a response like:

<?xml version="1.0"?>
<!DOCTYPE ePostResult PUBLIC "-//NLM//DTD ePostResult, 11 May 2002//EN"
 "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/ePost_020511.dtd">
<ePostResult>
  <QueryKey>1</QueryKey>
  <WebEnv>01yWrS_p1DdWzAUPU6eOwxX2...s@1FBE5D3B634BAF50_0012SID</WebEnv>
</ePostResult>

and the query result database will then include a new record containing the 2 UIDs specified by using the "id" parameter:

Database	Query Key	Web Env (edited)	UID set
snp	1	01yWrS_p1DdWzAUPU6e...E5D3B634BAF50_0012SID	242,28853987

If you then specify the "db", "query_key", and "WebEnv" parameters in a URL like:

  http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=snp&query_key=1&\
    WebEnv=01yWrS_p1DdWzAUPU6eOwxX2...s@1FBE5D3B634BAF50_0012SID

you will retrieve a summary record for each of the posted UIDs. The full result is shown in first-query-xml.html.

Note that summary records were retrieved for both of the SNP UIDs placed in the query result database PRIOR to this request for a summary. esummary.fcgi used the database name, the query key, and the web environment parameters to find the UID list, and then retrieved a record from the specified database for each UID on the list.
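Request URLs like the ones above can be assembled by a small helper rather than by hand. The following is only a sketch: the subroutine name build_eutil_url is hypothetical (it is not an NCBI routine), the WebEnv value is a made-up placeholder, and percent-encoding the values is an assumption made because Web environment strings can contain characters such as "@".

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Base address used by the examples in this document.
my $base = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/";

# build_eutil_url is a hypothetical helper (not an NCBI routine).
# It takes a script name and a list of name/value pairs, and returns
# a GET URL with each value percent-encoded.
sub build_eutil_url {
    my ( $script, @pairs ) = @_;
    my @parts;
    while ( my ( $name, $value ) = splice( @pairs, 0, 2 ) ) {
        # Encode everything except letters, digits, "_", ".", "~", "," and "-".
        $value =~ s/([^A-Za-z0-9_.~,-])/sprintf( "%%%02X", ord($1) )/ge;
        push @parts, "$name=$value";
    }
    return $base . $script . "?" . join( "&", @parts );
}

# The WebEnv value below is a placeholder, not a real session identifier.
my $summary_url = build_eutil_url(
    "esummary.fcgi",
    db        => "snp",
    query_key => 1,
    WebEnv    => "EXAMPLE\@WEBENV_0012SID",
);
print "$summary_url\n";
```

The same helper can build epost.fcgi, efetch.fcgi, or elink.fcgi requests by changing the script name and the parameter list.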

The following URL shows how to use efetch.fcgi to get a full XML record for these two SNP UIDs:

  http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&query_key=1&\
    WebEnv=01yWrS_p1DdWzAUPU6eOwxX2...s@1FBE5D3B634BAF50_0012SID&\
    report=sgml&mode=xml

The result may be examined in fetch-example-xml.html. Note that the "report" and "mode" options were used to specify the report contents and format. Selection of values for these options seems rather unusual.

Using ePost in a Perl program

The hand-entered queries shown above can all be sent to Entrez via programs. A Perl program to post 2 UIDs (242 and 28853987) to the query result database is shown below:

#!/usr/bin/perl -w

use LWP::Simple;

$url = 
 "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=snp&id=242,28853987";

# Note that get returns the whole XML response as one string (or undef on
# failure), so @result_array will hold a single element.
@result_array = get ( "$url" );

print @result_array;

Note that the query is identical to the one issued in the first example above, and the results will be identical, except for changes in the WebEnv identifier.

Now here is a slight modification that will filter a result_array to print only the query key and Web environment values:

#!/usr/bin/perl -w

use LWP::Simple;

$url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?"
     . "db=snp&id=242,28853987&email=account\@host.domain";

@result_array = get ( "$url" );

foreach $line ( @result_array )   # Search each line of the returned document for...
{
  if( $line =~ m/<QueryKey>(.*)<\/QueryKey>/ )   # ...the query key, and...
     { $query_key = $1; }
  if( $line =~ m/<WebEnv>(.*)<\/WebEnv>/ )       # ...the web environment.
     { $web_env = $1; }
}

print "Query Key: $query_key\nWeb Environment: $web_env\n";

Running this program from a Unix shell will get you output showing the query key and Web environment values that were returned by ePost:

bash-2.05$ perl test-ncbi-2.pl
Query Key: 1
Web Environment: 03bGQckqzaWiGXQqZHvYpXvTh...oNb-J@1FBE58C06361F480_0005SID
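Because LWP::Simple's get hands back the whole response as a single string, the two values can also be pulled directly out of that string, with a check for a failed request. The following is a minimal sketch under stated assumptions: parse_history_ids is a hypothetical name, and the response text is canned (with a shortened placeholder WebEnv) so that the example runs without network access.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_history_ids is a hypothetical helper name. It takes the full
# XML text of an ePost (or eSearch) response and returns the query key
# and Web environment, or an empty list if either one is missing.
sub parse_history_ids {
    my ($xml) = @_;
    return unless defined $xml;    # get returns undef on failure
    my ($query_key) = $xml =~ m{<QueryKey>\s*(\d+)\s*</QueryKey>};
    my ($web_env)   = $xml =~ m{<WebEnv>\s*(\S+?)\s*</WebEnv>};
    return unless defined $query_key && defined $web_env;
    return ( $query_key, $web_env );
}

# Canned response modeled on the ePost output shown earlier; the
# WebEnv value is a shortened placeholder.
my $response = <<'END_XML';
<?xml version="1.0"?>
<ePostResult>
  <QueryKey>1</QueryKey>
  <WebEnv>
    01yWrS_EXAMPLE_0012SID
  </WebEnv>
</ePostResult>
END_XML

my ( $query_key, $web_env ) = parse_history_ids($response)
    or die "Could not find QueryKey and WebEnv in the response\n";

print "Query Key: $query_key\nWeb Environment: $web_env\n";
```

In a real program $response would come from get( $url ); the patterns tolerate white space or line breaks around the values, which the line-by-line search does not.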

Now the eSummary query can be added to the program to get the contents of records identified by their UIDs. Note that LWP::Simple's get returns each response as one long string, and the eSummary data of interest is spread over many lines of that string. The long string can be split up into separate lines, each of which is placed into a separate array entry, by using the Perl function "split" and the Perl representation for a line separator: the "newline" or '\n' character. (The ePost searches above work without a split because each pattern matches within a single line of the response.)

The resulting program looks like this:

#!/usr/bin/perl -w

use LWP::Simple;

$post_url =
  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=snp&id=242,28853987";

@epost_result_array = get( "$post_url" );

foreach $line ( @epost_result_array )   # Search each returned line for...
{
  if( $line =~ m/<QueryKey>(.*)<\/QueryKey>/ )   # ... the query key, and ...
     { $query_key = $1; }
  if( $line =~ m/<WebEnv>(.*)<\/WebEnv>/ )       # ... the Web environment.
     { $web_environment = $1; }
}

$esummary_url =
  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=snp"
  . "&query_key=$query_key&WebEnv=$web_environment";

# eSummary returns a long string that can be split up into separate
# lines by using the Perl function "split" and the Perl representation for
# a line separator: the "newline" or '\n' character.

$esummary_result_string = get( "$esummary_url" );   # Get the summary info.
@esummary_result_array = split( '\n', $esummary_result_string );

foreach $line ( @esummary_result_array )   # Search each line for desired data.
{
  print "TAX_ID = $1\n"                if ( $line =~ m/^.*TAX_ID.*>(.*)<.*/ );
  print "\nSNP_ID = $1\n"              if ( $line =~ m/^.*SNP_ID.*>(.*)<.*/ );
  print "Gene = $1\n"                  if ( $line =~ m/^.*GENE.*>(.*)<.*/ );
  print "Chromosome : Position = $1\n" if $line =~ m/^.*CHRPOS.*>(.*)<.*/;
  print "Function class = $1\n"        if ( $line =~ m/^.*FXN_CLASS.*>(.*)<.*/ );
  print "SNP class = $1\n"             if ( $line =~ m/^.*SNP_CLASS.*>(.*)<.*/ );
}

Running this program from a Unix shell will produce output like:

bash-2.05$ perl test-ncbi-4.pl

SNP_ID = 242
Gene = 
Function class = 
TAX_ID = 9606
SNP class = in-del
Chromosome : Position = 1:20742047

SNP_ID = 28853987
Gene = LOC653635
Function class = locus-region
TAX_ID = 9606
SNP class = snp
Chromosome : Position = 1:800

Note that this program extracts significant information by using a set of print...if statements that constitute a "naive" XML parsing process that may be too inefficient for processing larger or more complex files. Such cases may require the use of a formal XML parser. See the Perl CPAN site for some options.

(Note also that this parsing step can occur "automatically" as part of the SOAP message processing performed by some of the Web Services versions of the eUtilities, but not by the CGI versions.)

Using eSearch to create new query result database entries

eSearch can be used to modify the query result database in two different ways. It can:

search an NCBI database and put the list of UIDs whose contents match the search conditions into a new query result database entry, and
search an existing query result entry UID list and put UIDs whose content matches search conditions into a new query result database entry.

eSearch can be used to insert a new entry in the query result database by using a query like:

  http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=snp&\
     usehistory=y&term=1[CHR]+AND+9606[TAX_ID]

which requests the UIDs for all SNPs located on human chromosome 1.

Note the presence of the "usehistory" parameter. Without this parameter eSearch will return a UID list to the user, but will NOT insert the UID list into the query result database.

eSearch can also be used to create a new UID list from an existing list by using a query like:

  http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=snp&query_key=1&\
    WebEnv=01yWrS_p1DdWzAUPU6eOwxX2...s@1FBE5D3B634BAF50_0012SID&\
    term=snp[FXN_CLASS]&usehistory=y

In this query the existing list is identified by the db, query_key, and WebEnv parameters. eSearch is directed to select only those SNPs whose records include a SNP function class tag and, because of the "usehistory" parameter, to store the resulting UID list in a new query result database entry.

Links, linksets, and using eLink to retrieve linksets

A "link" is a connection between two UIDs, not necessarily in the same database. For example, a link may identify a protein coded for by a particular gene.

Each type of link is assigned a specific name, so, for example, a link from a SNP UID to a UNISTS UID will be known as a "snp_unists" link. (There is a list of linknames usable in Entrez queries on the NCBI Web site.)

One can imagine that links are stored in a very large table with entries like this for links from SNP 242:

Link database
from UID	Linkname	to UID
242	snp_pubmed	8808290
242	snp_snp_genegenotype	242
242	snp_taxonomy	9606
242	snp_unists	71299

This table shows each link as a connection between a UID in one database and a UID in another (or possibly the same) database. A "linkset" is simply a set of such links. For example, the set of links from SNP UID 242 shown above constitutes a linkset.

The eUtility eLink can be used to query Entrez for linkset information. It can take a UID list and return the linkset encoded in XML. Links returned within XML will be encoded within LinkSetDb and LinkSetDbHistory elements, as shown in examples below.

Here is a Perl program that puts a single SNP UID (242) into a query result database entry and then requests a linkset of all the NCBI databases to which it is connected:

#!/usr/bin/perl -w

use LWP::Simple;

$post_url =
  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=snp&id=242";

@epost_result_array = get( "$post_url" );   # Post the SNP UID.

foreach $line ( @epost_result_array )   # Search the result document for
{                                       # query key and web environment.
  if( $line =~ m/<QueryKey>(.*)<\/QueryKey>/ ) { $query_key = $1; }
  if( $line =~ m/<WebEnv>(.*)<\/WebEnv>/ )     { $web_environment = $1; }
}

$elink_url =
  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=all"
  . "&query_key=$query_key&WebEnv=$web_environment"
  . "&dbfrom=snp";              # Now link from SNP to all databases.

$elink_result_string = get( "$elink_url" );
@elink_result_array = split( '\n', $elink_result_string );

foreach $line ( @elink_result_array )   # Print the returned document.
{
  print "$line\n";
}

The results of running this program from the Unix shell look like:

bash-2.05$ perl test-ncbi-6.pl
<?xml version="1.0"?>
<!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN"
 "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd">
<eLinkResult>
<LinkSet>
  <DbFrom>snp</DbFrom>
  <IdList>
    <Id>242</Id>
  </IdList>
  <LinkSetDb>
    <DbTo>pubmed</DbTo>
    <LinkName>snp_pubmed</LinkName>
    <Link>
      <Id>8808290</Id>
    </Link>
  </LinkSetDb>
  <LinkSetDb>
    <DbTo>snp</DbTo>
    <LinkName>snp_snp_genegenotype</LinkName>
    <Link>
      <Id>242</Id>
    </Link>
  </LinkSetDb>
  <LinkSetDb>
    <DbTo>taxonomy</DbTo>
    <LinkName>snp_taxonomy</LinkName>
    <Link>
      <Id>9606</Id>
    </Link>
  </LinkSetDb>
  <LinkSetDb>
    <DbTo>unists</DbTo>
    <LinkName>snp_unists</LinkName>
    <Link>
      <Id>71299</Id>
    </Link>
  </LinkSetDb>
</LinkSet>
</eLinkResult>

This output first identifies the database from which the discovered links originate, and then provides a list of each UID that was linked FROM. (In this case there is only one entry in the IdList.)

Next appear separate LinkSetDb elements for each database to which the originating UID links. Note that each link in a LinkSetDb entry is described by a particular "link name," that can be used in queries. For example links from dbSNP to the TAXONOMY database are "snp_taxonomy" links.
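The LinkSetDb elements can be collected into a Perl hash keyed by link name. The following is a rough sketch, not a formal XML parser: parse_linksets is a hypothetical name, and the XML is an abbreviated copy of the output shown above so that the example runs without contacting NCBI.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_linksets is a hypothetical helper name. It takes eLink XML text
# and returns a hash reference mapping each link name to a list of the
# UIDs linked to. It relies on regular expressions only.
sub parse_linksets {
    my ($xml) = @_;
    my %links;
    # "<LinkSetDb>" does not match "<LinkSetDbHistory>", so only
    # plain linkset entries are visited here.
    while ( $xml =~ m{<LinkSetDb>(.*?)</LinkSetDb>}gs ) {
        my $block = $1;
        my ($name) = $block =~ m{<LinkName>\s*(.*?)\s*</LinkName>}s;
        next unless defined $name;
        push @{ $links{$name} }, $block =~ m{<Id>\s*(\d+)\s*</Id>}g;
    }
    return \%links;
}

# Abbreviated copy of the eLink output shown above.
my $xml = <<'END_XML';
<eLinkResult>
<LinkSet>
  <DbFrom>snp</DbFrom>
  <IdList> <Id>242</Id> </IdList>
  <LinkSetDb>
    <DbTo>pubmed</DbTo>
    <LinkName>snp_pubmed</LinkName>
    <Link> <Id>8808290</Id> </Link>
  </LinkSetDb>
  <LinkSetDb>
    <DbTo>unists</DbTo>
    <LinkName>snp_unists</LinkName>
    <Link> <Id>71299</Id> </Link>
  </LinkSetDb>
</LinkSet>
</eLinkResult>
END_XML

my $links = parse_linksets($xml);
foreach my $name ( sort keys %$links ) {
    print "$name: @{ $links->{$name} }\n";
}
```

The UID inside the IdList element is deliberately skipped, because it identifies the record linked FROM rather than a record linked to.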

If an input UID set contains more than one element (e.g., "&id=242,28853987"), each resulting linkset will be a union of the linksets that would be produced by each individual UID. To avoid this aggregation, multiple input UID sets must be specified during a query, as with "&id=242&id=28853987".

In the example above, it is important to note that no query key or Web environment information was returned, since the linked UIDs were NOT placed on the query result database.

It is also important to note that the "db" parameter is used in a new way; it no longer specifies which database is being searched, but which database is being linked to.

Sometimes it is desirable to perform an eSearch for records in a database that have links to at least one UID in another database. For example, you may wish to write an eSearch term clause that loads a gene UID into the (new) query result database only if it links to a SNP UID. To do that you would include a search condition something like:

     gene_SNP[filter]

in the term clause, as in:

    term=11[CHR] AND gene_SNP[filter] AND mouse[orgn]

which will load all gene UIDs on mouse chromosome 11 that have links to the SNP database.

In most cases, a linkset maps UIDs in one database to UIDs in another database, but it can be useful to map into the same database. For example, one may wish to build a linkset that maps protein enzyme UIDs to protein substrates and products.

Using eLink to put UIDs into the query result database

It is possible to insert UIDs into the (new) result database by using the parameter "cmd=neighbor_history" with eLink. If this is done when linking to a single target database, the usual query key and Web environment information is returned. If this is done when linking to multiple databases (as with "db=all"), however, multiple query keys are returned along with a single Web environment. Here is example output from an eLink request that includes a UID list containing only the SNP UID 242:

 http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=all\
     &id=242&dbfrom=snp&cmd=neighbor_history

<?xml version="1.0"?>
<!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 11 May 2002//EN"
 "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_020511.dtd">
<eLinkResult>
<LinkSet>
  <DbFrom>snp</DbFrom>
  <IdList>
    <Id>242</Id>
  </IdList>
  <LinkSetDbHistory>
    <DbTo>pubmed</DbTo>
    <LinkName>snp_pubmed</LinkName>
    <QueryKey>2</QueryKey>
  </LinkSetDbHistory>
  <LinkSetDbHistory>
    <DbTo>taxonomy</DbTo>
    <LinkName>snp_taxonomy</LinkName>
    <QueryKey>3</QueryKey>
  </LinkSetDbHistory>
  <LinkSetDbHistory>
    <DbTo>unists</DbTo>
    <LinkName>snp_unists</LinkName>
    <QueryKey>4</QueryKey>
  </LinkSetDbHistory>
  <WebEnv>00O9qrUHud2D8LXbpvILvTB...1FBE6E626363B8F0_0105SID</WebEnv>
</LinkSet>
</eLinkResult>

This request left three UID lists (but not a linkset) in the query result database. Such lists will include the UIDs for the database entries that are "linked to". For example, the snp_unists link in the previous example showed UNISTS UID 71299, whereas the example just completed showed only a QueryKey XML element. UID 71299 was left in the query result database as:

Database	Query Key	Web Env (edited)	UID set
pubmed	2	00O9qrUHud2D8LXbpvILvTB...1FBE6E626363B8F0_0105SID	8808290
taxonomy	3	00O9qrUHud2D8LXbpvILvTB...1FBE6E626363B8F0_0105SID	9606
unists	4	00O9qrUHud2D8LXbpvILvTB...1FBE6E626363B8F0_0105SID	71299
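A program needs to capture one query key per LinkSetDbHistory element, plus the single shared Web environment. The following is a sketch only: parse_link_history is a hypothetical name, the XML is an abbreviated copy of the response above, and its WebEnv value is a shortened placeholder.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_link_history is a hypothetical helper name. It takes the XML
# text of an eLink "cmd=neighbor_history" response and returns a hash
# reference (link name => { db, query_key }) plus the shared WebEnv.
sub parse_link_history {
    my ($xml) = @_;
    my %entry_for;
    while ( $xml =~ m{<LinkSetDbHistory>(.*?)</LinkSetDbHistory>}gs ) {
        my $block = $1;
        my ($db)   = $block =~ m{<DbTo>\s*(.*?)\s*</DbTo>}s;
        my ($name) = $block =~ m{<LinkName>\s*(.*?)\s*</LinkName>}s;
        my ($key)  = $block =~ m{<QueryKey>\s*(\d+)\s*</QueryKey>};
        next unless defined $db && defined $name && defined $key;
        $entry_for{$name} = { db => $db, query_key => $key };
    }
    my ($web_env) = $xml =~ m{<WebEnv>\s*(\S+?)\s*</WebEnv>};
    return ( \%entry_for, $web_env );
}

# Abbreviated copy of the response above; the WebEnv is a placeholder.
my $xml = <<'END_XML';
<eLinkResult>
<LinkSet>
  <DbFrom>snp</DbFrom>
  <IdList> <Id>242</Id> </IdList>
  <LinkSetDbHistory>
    <DbTo>pubmed</DbTo>
    <LinkName>snp_pubmed</LinkName>
    <QueryKey>2</QueryKey>
  </LinkSetDbHistory>
  <LinkSetDbHistory>
    <DbTo>unists</DbTo>
    <LinkName>snp_unists</LinkName>
    <QueryKey>4</QueryKey>
  </LinkSetDbHistory>
  <WebEnv>EXAMPLE_WEBENV_0105SID</WebEnv>
</LinkSet>
</eLinkResult>
END_XML

my ( $entries, $web_env ) = parse_link_history($xml);
foreach my $name ( sort keys %$entries ) {
    my $entry = $entries->{$name};
    # These are the three parameters a later eSummary or eFetch call needs.
    print "db=$entry->{db}&query_key=$entry->{query_key}&WebEnv=$web_env\n";
}
```

Keying the hash by link name rather than by database name is a design choice made here so that two different link types into the same database would not overwrite each other.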

The information "linked to" can be retrieved by specifying the query key and Web environment in eSummary or eFetch requests. For example, the information in the unists database that is linked to by the request above can be retrieved by using a request like:

   http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=unists&\
   query_key=4&WebEnv=00O9qrUHud2D8LXbpvILvTB...1FBE6E626363B8F0_0105SID

and will contain only the 71299 entry and look something like:

<?xml version="1.0"?>
<!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD eSummaryResult, 29 October 2004//EN"
 "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSummary_041029.dtd">
<eSummaryResult>
<DocSum>
  <Id>71299</Id>
  <Item Name="Marker_Name" Type="String">D10S1196</Item>
  <Item Name="Map_Gene_Summary_List" Type="List">
    <Item Name="Map_Gene_Summary" Type="Structure">
      <Item Name="Org" Type="String">Homo sapiens</Item>
      <Item Name="Chr" Type="String"> chromosome 1</Item>
      <Item Name="Locus" Type="String"></Item>
    </Item>
    <Item Name="Map_Gene_Summary" Type="Structure">
      <Item Name="Org" Type="String">Pan troglodytes</Item>
      <Item Name="Chr" Type="String"> chromosome 1</Item>
      <Item Name="Locus" Type="String"></Item>
    </Item>
  </Item>
  <Item Name="EPCR_Summary" Type="String">Found by e-PCR in sequences from Homo sapiens and Pan troglodytes. </Item>
  <Item Name="LocusId" Type="String"></Item>
</DocSum>
</eSummaryResult>

Using eLink to get information about SNPs related to a specific gene

Here is a program that uses ePost, eLink, and eSummary in a sequence to get SNP information for every SNP in a specific gene. This program is similar to the earlier example that posted a UID and then printed summary information for that UID. The only difference is an eLink call between the ePost and eSummary calls that uses the query result entry for the first call to build a new query result entry containing a list of SNP UIDs, against which eSummary can be run.

#!/usr/bin/perl -w

use LWP::Simple;

$post_url =
  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=gene&id=3718";

@epost_result_array = get( "$post_url" );

foreach $line ( @epost_result_array )   # Search the returned document for the
{                                       # ePost query key and Web environment.
  if( $line =~ m/<QueryKey>(.*)<\/QueryKey>/ ) { $query_key = $1 };
  if( $line =~ m/<WebEnv>(.*)<\/WebEnv>/ )     { $web_environment = $1 };
}

$elink_url =
  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=snp"
  . "&query_key=$query_key&WebEnv=$web_environment"
  . "&dbfrom=gene&cmd=neighbor_history";

$elink_result_string = get( "$elink_url" );
@elink_result_array = split( '\n', $elink_result_string );

foreach $line ( @elink_result_array )   # Search the returned document for the
{                                       # eLink query key and Web environment.
  if( $line =~ m/<QueryKey>(.*)<\/QueryKey>/ ) { $query_key_2 = $1 };
  if( $line =~ m/<WebEnv>(.*)<\/WebEnv>/ )     { $web_environment_2 = $1 };
}

$esummary_url =
  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=snp"
  . "&query_key=$query_key_2&WebEnv=$web_environment_2";

$esummary_result_string = get( "$esummary_url" );
@esummary_result_array = split( '\n', $esummary_result_string );

foreach $line ( @esummary_result_array )   # Search each line for info to print.
{
  print "TAX_ID = $1\n"                if $line =~ m/^.*TAX_ID.*>(.*)<.*/;
  print "\nSNP_ID = $1\n"              if $line =~ m/^.*SNP_ID.*>(.*)<.*/;
  print "Gene = $1\n"                  if $line =~ m/^.*GENE.*>(.*)<.*/;
  print "Chromosome : Position = $1\n" if $line =~ m/^.*CHRPOS.*>(.*)<.*/;
  print "Function class = $1\n"        if $line =~ m/^.*FXN_CLASS.*>(.*)<.*/;
  print "SNP class = $1\n"             if $line =~ m/^.*SNP_CLASS.*>(.*)<.*/;
}

The output from this program for the JAK3 UID (3718) is a list of selected information for every SNP linked to from UID 3718. The output is an extended version of the following:

bash-2.05$ perl test-ncbi-7.pl|more

SNP_ID = 3008
Gene = JAK3
Function class = mrna-utr
TAX_ID = 9606
SNP class = snp
Chromosome : Position = 19:17798428

SNP_ID = 11888
Gene = JAK3
Function class = locus-region
TAX_ID = 9606
SNP class = snp
Chromosome : Position = 19:17796625

SNP_ID = 867174
Gene = JAK3
Function class = intron
TAX_ID = 9606
SNP class = snp
Chromosome : Position = 19:17813929

Note that the program above essentially implements an SQL select command similar to:

  select snp.SNP_ID, snp.GENE, ... ,snp.CHRPOS
    from gene, snp
    where gene.GENE_ID = '3718'
      and gene.GENE_ID = snp.GENE

which selects specified fields from a table that represents the join of the Gene and SNP databases. Although the similarities are only approximate, they demonstrate the power of the system to satisfy arbitrary user queries.

Using LWP to post large UID lists and retrieve results in batches

The examples in this web page have relied upon the Perl LWP::Simple package. Since LWP::Simple supports only HTTP GET requests, the number of characters allowed in a request URL may be limited (to around 1 or 2KB, depending on the HTTP server being accessed).

This limitation may cause eUtility requests containing long lists of UIDs to fail, and to avoid this problem you must use the Perl module LWP in place of LWP::Simple. If you use LWP you must rewrite all your eUtil calls, since both HTTP GETs and POSTs are handled differently with LWP.

Here is an example ePost invocation using LWP. Note that the "email" and "tool" parameters are used to identify the requester and requesting program, and they can be used with LWP::Simple get requests as well. NCBI uses this information to help manage server access problems such as request overloads:

#!/usr/bin/perl -w

use LWP;   # ...in place of LWP::Simple.

$db_list = "242,28853987";   # this list could be VERY long...

$virtual_browser = LWP::UserAgent->new;

$post_url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi";

# The "@" needs no backslash inside a single-quoted string.
$epost_response = $virtual_browser->post( $post_url,
   [ 'db'    => 'snp',
     'id'    => $db_list,
     'email' => 'user_address@user_mail_host.edu',
     'tool'  => 'the_name_of_this_program.pl' ] );

$epost_result_string = $epost_response->content;
@epost_result_array = split( "\n", $epost_result_string );

Having posted the UID list, the record content for each UID can be retrieved using either the LWP::Simple get method or the LWP post method. However, the number of UIDs in $db_list should be determined first, and eFetch content should be downloaded in "batches" of not more than 500 UIDs to conform to NCBI guidelines. In addition, request initiations should be spaced at least 15 seconds apart during peak-usage hours and 3 seconds apart during off-peak hours.

To accommodate these limits, the eFetch and eSummary routines allow users to specify the first UID entry to be processed and the maximum number to be processed, beginning with the first, by using the "retstart" and "retmax" parameters, respectively.
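The "retstart" and "retmax" values for each batch follow directly from the list size and the batch size. As a small sketch (batch_windows is a hypothetical name, and the list size of 1234 is made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# batch_windows is a hypothetical helper name. Given the number of
# UIDs in a query result entry and a batch size, it returns a list of
# [ retstart, retmax ] pairs that together cover the whole entry.
sub batch_windows {
    my ( $count, $batch_size ) = @_;
    die "batch size must be positive\n" unless $batch_size > 0;
    my @windows;
    for ( my $start = 0 ; $start < $count ; $start += $batch_size ) {
        my $remaining = $count - $start;
        # The last batch may be smaller than the batch size.
        my $retmax = $remaining < $batch_size ? $remaining : $batch_size;
        push @windows, [ $start, $retmax ];
    }
    return @windows;
}

# A made-up list of 1234 UIDs fetched 500 at a time.
foreach my $window ( batch_windows( 1234, 500 ) ) {
    print "retstart=$window->[0]&retmax=$window->[1]\n";
}
# prints:
#   retstart=0&retmax=500
#   retstart=500&retmax=500
#   retstart=1000&retmax=234
```

Trimming the final "retmax" is optional here; passing the full batch size on every request, as the program in this section does, simply asks for entries that do not exist in the last batch.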

Here is a continuation of the previous example that will print the eFetch content for a list of UIDs in the query result database entry. This code uses LWP rather than LWP::Simple to send eFetch requests, and waits at least 3 seconds between request initiations.

$number_of_UIDs_in_list = 2;   # ...from the example above.

# Make multiple requests (if necessary) to fetch sets of up to the
# maximum size of a batch.

$batch_size = 500;               # Define the maximum number of entries to get at one time.
$minimum_request_interval = 3;   # Define the minimum request interval (seconds).

foreach $line ( @epost_result_array )   # Find the ePost query key and Web environment.
{
  if( $line =~ m/<QueryKey>(.*)<\/QueryKey>/ ) { $query_key = $1 };
  if( $line =~ m/<WebEnv>(.*)<\/WebEnv>/ )     { $web_environment = $1 };
}

$efetch_url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi";

for( $start = 0; $start < $number_of_UIDs_in_list; $start += $batch_size )
{
  $start_time = time;

  $efetch_response = $virtual_browser->post( $efetch_url,
     [ 'db'        => 'snp',
       'query_key' => $query_key,
       'WebEnv'    => $web_environment,
       'retmode'   => 'xml',
       'rettype'   => 'xml',
       'retstart'  => $start,        # first UID to process...
       'retmax'    => $batch_size,   # number of UIDs to process
       'email'     => 'dgrobe@iupui.edu',
       'tool'      => 'this_program.pl' ] );

  $efetch_result_string = $efetch_response->content;
  print "$efetch_result_string";

  $current_time = time;
  while(    ( ( $current_time - $start_time ) < $minimum_request_interval )
         && ( ( $start + $batch_size ) < $number_of_UIDs_in_list ) )
            # this second clause prevents unnecessary waiting after the last batch.
  {
    sleep 1;   # sleep for one second.
    $current_time = time;
  }
}

This script assumes that the number of items in the list is contained in a variable named $number_of_UIDs_in_list. That number is known within this series of code fragments, but if the result list is loaded by an eSearch call, we can get the number by searching for the first <Count>...</Count> element returned by esearch.fcgi. (This can be done while searching for the WebEnv and query_key information.)
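Picking up the first Count element might look like the following sketch. The name first_count is hypothetical, and the canned response is illustrative only: it assumes the general layout of eSearch output (a top-level Count followed later by per-term Count elements), with made-up numbers and a placeholder WebEnv.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# first_count is a hypothetical helper name. It returns the value of
# the first <Count> element in an eSearch response, or undef if there
# is none. Without the /g modifier the match stops at the first one.
sub first_count {
    my ($xml) = @_;
    my ($count) = $xml =~ m{<Count>\s*(\d+)\s*</Count>};
    return $count;
}

# Illustrative response; all values are made up.
my $response = <<'END_XML';
<eSearchResult>
  <Count>1234</Count>
  <RetMax>20</RetMax>
  <RetStart>0</RetStart>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV</WebEnv>
  <TranslationStack>
    <TermSet> <Term>1[CHR]</Term> <Count>99999</Count> </TermSet>
  </TranslationStack>
</eSearchResult>
END_XML

my $number_of_UIDs_in_list = first_count($response);
print "UIDs in list: $number_of_UIDs_in_list\n";
```

Later Count elements describe individual search terms, which is why only the first one is used for the size of the result list.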

Also, if this code fragment is to operate on an extant query result database entry, you can find out how many items are in the list by using a search like:

  http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=$db\
    &term=%23$query_key&WebEnv=$web_environment&usehistory=y

which uses the earlier query key, preceded by a pound sign (#) URL-encoded as "%23", as the search term.

Note that most eSummary or eFetch requests asking for batches of 500 records will take more than 3 seconds, so the NCBI restrictions will not significantly degrade performance during off-peak work. They will, however, ensure that multiple simultaneous requests get their fair shares of the system resources.

Retrieving eLinked records in batches using "index lists"

The method described earlier, which uses eLink to get SNP lists for a single gene, may not generalize efficiently to getting separate SNP lists for a large number of genes. The problem is that there may be too many genes to process serially in a reasonable amount of time, and separate SNP lists for each gene might not result in large enough batch requests.

However, NCBI PowerScripting Lecture 4 describes a way to process such requests using batched eFetch requests. The general idea is to:

1. send an appropriate eLink request containing many "&id" parameters, each one specifying a UID, to be known below as a "linkFromUID". For example, a query with 2 "&id" parameters will produce 2 "LinkSet" elements, each with an "IdList" element identifying a linkFromUID, as well as a number of "Link" elements whose "Id" elements identify the linkToUIDs.
2. capture each returned LinkSet element, and append its UID entries to an "index list" that records each linkToUID linked to from a specified linkFromUID. The result can be represented as an index table, but you can use whatever representation or storage approach you want. In Perl it would be convenient to store the index list in a hash, using the linkFromUID as a key, and put each linkToUID into a Perl list to be stored as a single hash value. This approach would simplify subsequent use as well.
3. send a second eLink request via LWP with a single "&id" parameter specifying every linkFromUID, and including the "cmd=neighbor_history" parameter to force all the linkToUIDs into a single query result database entry,
4. send an eFetch request (or some number of batched requests) for the contents of every record for every UID within the database entry generated by the previous step (which will include all linkToUID values),
5. parse the returned records into a structure that can be searched to retrieve individual records identified by linkToUID, such as a hash using linkToUIDs as keys, and, finally,
6. use the index list constructed in Step 2, along with the records returned in Step 5, to construct an output list aggregated by linkFromUID.

This approach is rather complicated, but is fairly straightforward using Perl hashes, and will dramatically improve overall data retrieval rates. LWP appears to allow only one instance of a parameter name in a single request, so that Steps 1 and 2 must be repeated multiple times, using LWP::Simple get requests including multiple "&id" parameters to build the index list. (Alternatively, a socket could be opened for sending a LWP post request specifying multiple "&id" parameters at the same time.)
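The hash-based bookkeeping for the index list and the final regrouping can be sketched as follows. This is an illustration under stated assumptions, not the PowerScripting implementation: build_index and group_records are hypothetical names, the eLink XML is canned, gene UID 1111 and SNP UID 2222 are made-up placeholders, and the "records" are short strings standing in for parsed eFetch output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# build_index (hypothetical name): the index list, returned as a
# hash reference of linkFromUID => [ linkToUIDs ], one <LinkSet> each.
sub build_index {
    my ($xml) = @_;
    my %index;
    while ( $xml =~ m{<LinkSet>(.*?)</LinkSet>}gs ) {
        my $set = $1;
        my ($from) = $set =~ m{<IdList>\s*<Id>\s*(\d+)\s*</Id>};
        next unless defined $from;
        my @to;
        while ( $set =~ m{<LinkSetDb>(.*?)</LinkSetDb>}gs ) {
            my $db_block = $1;
            push @to, $db_block =~ m{<Id>\s*(\d+)\s*</Id>}g;
        }
        $index{$from} = \@to;
    }
    return \%index;
}

# group_records (hypothetical name): regroup records keyed by
# linkToUID into lists keyed by linkFromUID.
sub group_records {
    my ( $index, $record_for ) = @_;
    my %grouped;
    foreach my $from ( keys %$index ) {
        $grouped{$from} = [
            map  { $record_for->{$_} }
            grep { exists $record_for->{$_} } @{ $index->{$from} }
        ];
    }
    return \%grouped;
}

# Canned eLink output with one LinkSet per "&id" parameter.
my $elink_xml = <<'END_XML';
<eLinkResult>
<LinkSet>
  <DbFrom>gene</DbFrom>
  <IdList> <Id>3718</Id> </IdList>
  <LinkSetDb>
    <DbTo>snp</DbTo>
    <Link> <Id>3008</Id> </Link>
    <Link> <Id>11888</Id> </Link>
  </LinkSetDb>
</LinkSet>
<LinkSet>
  <DbFrom>gene</DbFrom>
  <IdList> <Id>1111</Id> </IdList>
  <LinkSetDb>
    <DbTo>snp</DbTo>
    <Link> <Id>2222</Id> </Link>
  </LinkSetDb>
</LinkSet>
</eLinkResult>
END_XML

# Stand-in for parsed eFetch records, keyed by linkToUID.
my %record_for = (
    3008  => "mrna-utr",
    11888 => "locus-region",
    2222  => "placeholder record",
);

my $index   = build_index($elink_xml);
my $grouped = group_records( $index, \%record_for );

foreach my $from ( sort { $a <=> $b } keys %$grouped ) {
    print "$from: ", join( ", ", @{ $grouped->{$from} } ), "\n";
}
```

Because the records arrive in one batched stream regardless of which gene they belong to, the index hash is the only place where the gene-to-SNP grouping is remembered.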

Additional information

The NCBI Web site includes a great deal of information describing the use of the eUtilities and the structure of queries for searches.

The Entrez Programming Utilities page describes each eUtility and provides examples.

The "PowerTools Technical Workshop Series" link on the "Education" page linked from the main NCBI page links to course slidesets including additional and thorough descriptions of the NCBI programmatic interfaces.

Users who plan to use the NCBI eUtils extensively will want to examine the NCBI_PowerScripting.pm package, which encapsulates many access details within straightforward object calls. Using that package, the programs shown above can be replaced by just a few lines of Perl code, and can even be generated "automatically" through the Web-based eBot service. Such scripts can simply be downloaded and run from the user's desktop to demonstrate how the eUtilities can be efficiently used.

The NCBI_PowerScripting package also includes routines for performing and managing "batched" Entrez queries. This helps programs access large amounts of NCBI information without violating request rate guidelines.

Michael Grobe
Principal Systems Analyst
Research Technologies
University Information Technical Services
Indiana University
Office: IT 330A Indianapolis, IN
Office Phone: 317-278-6891
Office e-mail: dgrobe@iupui.edu
July 4, 2007

Acknowledgements: This Web Page could not have been put together without the NCBI PowerScripting course, and in particular, the presentations and exercise sessions by Eric Sayers and Andrei Gabrielian.