The basic operation of the Core is fairly straightforward. For example, records in a particular database can easily be retrieved via searches based on the contents of specified fields.
However, searching for records in one database as a function of the contents of records in a second database is not as simple. Such a search requires the use of "link" information that connects records in the same or different primary data resources. (This is the same information that underlies the familiar "Links" pull-down options on NCBI Web pages.)
This presentation will describe basic use of the eUtilities through practical examples, covering the following topics:
The primary function of the Entrez programmatic interface is to help users
manipulate sets of UIDs, and fetch data records identified by those
UIDs.
Entrez, itself, must also format data for Web-based users, but the
programmatic interface leaves data display to the client program.
The interface allows programs to:
These primitive capabilities can be combined into powerful sequences
that can integrate data from among most of the Entrez data resources.
In fact, they make it possible to (partially) mimic relational database
operations such as selects and joins on data in separate data
resources.
Note, however, that record content may be retrieved in a limited number
of report formats, where a report type contains a fixed subset
of elements taken from the raw data record.
As a result, additional processing may be required to prune report
data for subsequent display or use, and/or multiple requests may
be required to retrieve data in multiple report formats to obtain all
desired data fields.
The query result database can be used by (most of the) programs that
implement the Entrez set manipulation functions listed above, and is
so important for efficient use of Entrez that this presentation
is almost entirely oriented around it.
"Efficient" use of the query result database allows users to download
large numbers of records without violating the access rate limits
that NCBI imposes upon remote queries.
Each UID set in the Core database is identified by 3 pieces of information:
Query keys are integers, but are often displayed as a pound sign (#) followed by
an integer.
The Entrez database names are strings like "snp", "nuc", "nucest",
"gene", etc.
Web environment identifiers are long (around 60-character) strings.
Here is a schematic query result database entry:
The two most popular approaches are:
NCBI supports both of these interfaces to the Entrez Core.
In addition, NCBI provides an educational Perl module (NCBI_PowerScripting.pm) that
defines a set of objects that call the CGI services behind the scenes.
The CGI and Web Services routines are known as the "eUtilities" or "eUtils",
and may be categorized with respect to
the UID manipulation functions listed above as:
This presentation will deal only with the CGI functions, but the
Web Services provide identical functionality within the JAX-RPC
framework. (Note that the Web Services are not currently, circa 2007,
available via Perl.)
Here is a URL that uses the epost.fcgi script to insert (or "post")
2 UIDs (242 and 28853987) into the query result database:
and the query result database will then include a new record containing
the 2 UIDs specified by using the "id" parameter:
If you then specify the "db", "query_key", and "WebEnv" parameters in a
URL like:
The full result is shown in first-query-xml.html.
Note that summary records were retrieved for both of the SNP UIDs placed in the
query result database PRIOR to this request for a summary. esummary.fcgi used
the database name, the query key, and the Web environment parameters to find
the UID list, and then retrieved a record from the specified database
for each UID on the list.
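The way these three values combine into a request can be sketched in a few lines of Perl. This is only an illustration: history_url is a hypothetical helper name, and "EXAMPLE_WEBENV" is a placeholder for a real Web environment string returned by ePost or eSearch.

```perl
use strict;
use warnings;

# Build a request URL that points an eUtility script at an existing
# query result database entry. history_url is a hypothetical helper,
# not an NCBI routine.
sub history_url {
    my ($script, $db, $query_key, $web_env) = @_;
    my $base = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils";
    return "$base/$script?db=$db&query_key=$query_key&WebEnv=$web_env";
}

# "EXAMPLE_WEBENV" is a placeholder, not a real identifier.
print history_url("esummary.fcgi", "snp", 1, "EXAMPLE_WEBENV"), "\n";
```

The same helper would serve efetch.fcgi, since both scripts locate the UID list from the same three values.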
The following URL shows how to use efetch.fcgi to get a full XML record for these two
SNP UIDs:
Now here is a slight modification that will filter a result_array to print only
the query key and Web environment values:
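A minimal sketch of such a filter is given here, assuming the ePost response carries its values in QueryKey and WebEnv elements. The sample XML is hand-written with a placeholder WebEnv value, and history_ids is a hypothetical helper name.

```perl
use strict;
use warnings;

# Pull the query key and Web environment out of an ePost (or eSearch)
# XML response with simple pattern matches. This is not a general
# XML parser.
sub history_ids {
    my ($xml) = @_;
    my ($query_key) = $xml =~ m{<QueryKey>\s*(\d+)\s*</QueryKey>};
    my ($web_env)   = $xml =~ m{<WebEnv>\s*([^<\s]+)\s*</WebEnv>};
    return ($query_key, $web_env);
}

# Hand-written sample standing in for a real response.
my $sample = join "\n",
    '<ePostResult>',
    '  <QueryKey>1</QueryKey>',
    '  <WebEnv>EXAMPLE_WEBENV</WebEnv>',
    '</ePostResult>';

my ($key, $env) = history_ids($sample);
print "Query Key: $key\n";
print "Web Environment: $env\n";
```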
Now the eSummary query can be added to the program to get the contents of
records identified by their UIDs. Note, however, that the eSummary response is
handled here as one long string rather than as an array of lines.
The long string can be split up into separate lines, each of which is placed
into a separate array entry, by using the Perl function "split" with the Perl
representation for a line separator: the "newline" or '\n' character.
The resulting program looks like this:
Note that this program extracts significant information by using a set of
print...if statements that constitute a "naive" XML parsing process
that may be too inefficient for processing larger or more complex
files.
Such cases may require the use of a formal XML parser.
See the Perl CPAN site for some options.
(Note also that this parsing step can occur "automatically" as part of
the SOAP message processing performed by some of the Web Services versions of
the eUtilities, but not by the CGI versions.)
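The "naive" line-by-line approach can be sketched as follows. The sketch assumes that each summary field arrives as an Item element with a Name attribute on a line of its own; the sample document is hand-written, and its item names are taken from the labels in the example output rather than from a verified response. pick_items is a hypothetical helper name.

```perl
use strict;
use warnings;

# Split an eSummary response into lines and collect the value of each
# <Item> whose Name attribute appears in @wanted.
sub pick_items {
    my ($xml, @wanted) = @_;
    my %want = map { $_ => 1 } @wanted;
    my @found;
    foreach my $line (split /\n/, $xml) {
        if ($line =~ m{<Item Name="([^"]+)"[^>]*>([^<]*)</Item>}) {
            push @found, "$1 = $2" if $want{$1};
        }
    }
    return @found;
}

# Hand-written sample; the item names are assumptions.
my $sample = join "\n",
    '<DocSum>',
    '  <Id>242</Id>',
    '  <Item Name="SNP_ID" Type="Integer">242</Item>',
    '  <Item Name="TAX_ID" Type="Integer">9606</Item>',
    '</DocSum>';

print "$_\n" for pick_items($sample, "SNP_ID", "TAX_ID");
```

Because the match is applied one line at a time, an element that spans several lines would be missed, which is one reason a formal XML parser may be preferable for larger files.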
eSearch can be used to insert a new entry in the query result database by using
a query like:
which requests the UIDs for all SNPs located on human chromosome 1.
Note the presence of the "usehistory" parameter. Without this parameter eSearch
will return a UID list to the user, but will NOT insert the UID list into the
query result database.
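The effect of "usehistory" can be seen in the response itself. The sketch below assumes that an eSearch response reports the total in a Count element and, when "usehistory" is set, also includes QueryKey and WebEnv elements. The sample response is hand-written with a placeholder WebEnv value, and search_summary is a hypothetical helper name.

```perl
use strict;
use warnings;

# Extract the first Count, QueryKey, and WebEnv values from an eSearch
# response. A value is undef when its element is absent, as QueryKey
# and WebEnv would be without usehistory.
sub search_summary {
    my ($xml) = @_;
    my %info;
    foreach my $tag (qw(Count QueryKey WebEnv)) {
        ($info{$tag}) = $xml =~ m{<$tag>([^<]*)</$tag>};
    }
    return %info;
}

# Hand-written sample standing in for a real response.
my $with_history = "<eSearchResult><Count>2</Count><QueryKey>1</QueryKey>"
                 . "<WebEnv>EXAMPLE_WEBENV</WebEnv></eSearchResult>";
my %info = search_summary($with_history);
print "Count=$info{Count} QueryKey=$info{QueryKey} WebEnv=$info{WebEnv}\n";
```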
eSearch can also be used to create a new UID list from an existing list
by using a query like:
In this query the existing list is identified by the db, query_key, and WebEnv
parameters, and eSearch is directed to select only those SNPs whose
records include a SNP function class tag, and store the resulting UID
list in a new query database entry (by the "usehistory" parameter).
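Both uses of eSearch differ only in whether an existing entry is named in the request. A small sketch, with esearch_url as a hypothetical helper and "EXAMPLE_WEBENV" as a placeholder, might build either form:

```perl
use strict;
use warnings;

# Build an eSearch URL that stores its result in the query result
# database. When query_key and WebEnv are supplied, the search is
# restricted to that existing entry.
sub esearch_url {
    my (%arg) = @_;
    my @pairs = ("db=$arg{db}", "term=$arg{term}", "usehistory=y");
    if (defined $arg{query_key} && defined $arg{WebEnv}) {
        push @pairs, "query_key=$arg{query_key}", "WebEnv=$arg{WebEnv}";
    }
    return "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
         . join("&", @pairs);
}

# A new search, then a refinement of entry 1 (placeholder WebEnv).
print esearch_url(db => "snp", term => '1[CHR]+AND+9606[TAX_ID]'), "\n";
print esearch_url(db => "snp", term => 'snp[FXN_CLASS]',
                  query_key => 1, WebEnv => "EXAMPLE_WEBENV"), "\n";
```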
Each type of link is assigned a specific name, so, for example, a link from
a SNP UID to a UNISTS UID will be known as a "snp_unists" link.
(There is a list of
linknames usable in Entrez queries on the NCBI Web site.)
One can imagine that links are stored in a very large table with entries
like this for links from SNP 242:
which shows links as a connection between a UID in one database and a UID in another
(or possibly the same) database, and a "linkset" is simply a set of such links.
For example, the set of links linking from SNP UID 242 shown above constitutes
a linkset.
The eUtility eLink can be used to query Entrez for linkset information.
It can take a UID list and return the linkset encoded in XML.
Links returned within XML will be encoded within LinkSetDb and
LinkSetDbHistory elements, as shown in examples below.
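As a rough illustration of reading such output, the sketch below collects the linked UIDs under each link name. It assumes each LinkSetDb element holds one LinkName element and its target UIDs in Id elements; the sample XML is hand-written from the SNP 242 links used in this presentation, and parse_linksets is a hypothetical helper name.

```perl
use strict;
use warnings;

# Map each link name to the list of UIDs it links to.
sub parse_linksets {
    my ($xml) = @_;
    my %links;    # link name => reference to list of target UIDs
    while ($xml =~ m{<LinkSetDb>(.*?)</LinkSetDb>}gs) {
        my $body = $1;
        my ($name) = $body =~ m{<LinkName>([^<]+)</LinkName>};
        next unless defined $name;
        my @ids = $body =~ m{<Id>(\d+)</Id>}g;
        $links{$name} = \@ids;
    }
    return %links;
}

# Hand-written sample standing in for a real eLink response.
my $sample = <<'XML';
<LinkSet>
  <DbFrom>snp</DbFrom>
  <IdList><Id>242</Id></IdList>
  <LinkSetDb>
    <DbTo>pubmed</DbTo>
    <LinkName>snp_pubmed</LinkName>
    <Link><Id>8808290</Id></Link>
  </LinkSetDb>
  <LinkSetDb>
    <DbTo>taxonomy</DbTo>
    <LinkName>snp_taxonomy</LinkName>
    <Link><Id>9606</Id></Link>
  </LinkSetDb>
</LinkSet>
XML

my %links = parse_linksets($sample);
foreach my $name (sort keys %links) {
    print "$name: @{$links{$name}}\n";
}
```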
Here is a Perl program that puts a single SNP UID (242) into a query result
database entry and then requests a linkset of all the NCBI databases to
which it is connected:
The results of running this program from the Unix shell look like:
Next appear separate LinkSetDb elements for each database to which the originating
UID links.
Note that each link in a LinkSetDb entry is described by a particular "link name"
that can be used in queries.
For example, links from dbSNP to the TAXONOMY database are "snp_taxonomy" links.
If an input UID set contains more than one element (e.g., "&id=242,28853987"), each
resulting linkset will be a union of the linksets that would be produced by each individual UID.
To avoid this aggregation, multiple input UID sets must be specified during a
query, as with "&id=242&id=28853987".
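The two forms of the "id" parameter can be generated from the same UID list. In this small sketch, id_params is a hypothetical helper name.

```perl
use strict;
use warnings;

# Format a UID list for eLink. With $separate false the UIDs share one
# "id" parameter (one aggregated linkset); with $separate true each UID
# gets its own "id" parameter (one linkset per UID).
sub id_params {
    my ($separate, @uids) = @_;
    return $separate
        ? join("&", map { "id=$_" } @uids)
        : "id=" . join(",", @uids);
}

print id_params(0, 242, 28853987), "\n";    # id=242,28853987
print id_params(1, 242, 28853987), "\n";    # id=242&id=28853987
```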
In the example above, it is important to note that no query key or Web
environment information was returned, since the linked UIDs were NOT placed on
the query result database.
It is also important to note that the "db" parameter is used in a new way; it no longer
specifies which database is being searched, but which database is being linked to.
Sometimes, it is desirable to perform an eSearch for records in a database that have links to
at least one UID in another database.
For example, you may wish to write an eSearch term clause that loads a
gene UID into the (new) query result database only if it links to a SNP UID.
To do that you would include a search condition something like:
In most cases, a linkset maps UIDs in one database to UIDs in another
database, but it can be useful to map into the same database.
For example, one may wish to build a linkset that maps protein enzyme
UIDs to protein substrates and products.
The information "linked to" can be retrieved by specifying the query key and Web
environment in eSummary or eFetch requests.
For example, the information in the unists database that is linked to by the
request above can be retrieved by using a request like:
The output from this program for the JAK3 UID (3718) is a list of selected
information for every SNP linked to from UID 3718. The output is an extended version
of the following:
Note that the program above essentially implements an SQL select
command similar to:
This limitation may cause eUtility requests containing long lists of UIDs to fail, and
to avoid this problem you must use the Perl module LWP in place of LWP::Simple.
If you use LWP you must rewrite all your eUtil calls, since both HTTP
GETs and POSTs are handled differently with LWP.
Here is an example ePost invocation using LWP. Note that the "email" and "tool"
parameters are used to identify the requester and requesting program, and they can
be used with LWP::Simple get requests as well.
NCBI uses this information to help manage server access problems
such as request overloads:
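A hedged sketch of such an invocation follows. It separates building the form fields from sending them, so the first part can be examined without contacting NCBI. The names epost_form and send_epost are hypothetical, and the tool and email values are placeholders to be replaced with your own.

```perl
use strict;
use warnings;

# Assemble the form fields for an ePost request. The tool and email
# values identify the calling program and its author to NCBI.
sub epost_form {
    my ($db, $uids, $tool, $email) = @_;
    return [
        db    => $db,
        id    => join(",", @$uids),
        tool  => $tool,
        email => $email,
    ];
}

# Send the form with an HTTP POST, so that the UID list travels in the
# request body rather than in the URL. LWP::UserAgent is loaded only
# when a request is actually sent.
sub send_epost {
    my ($form) = @_;
    require LWP::UserAgent;
    my $ua       = LWP::UserAgent->new;
    my $response = $ua->post(
        "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi", $form);
    die "ePost failed: ", $response->status_line, "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Placeholder tool name and address.
my $form = epost_form("snp", [242, 28853987], "example_tool",
                      'user@example.org');
# my $xml = send_epost($form);    # uncomment to contact NCBI
```

The returned XML can then be searched for the query key and Web environment in the same way as a response obtained with LWP::Simple.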
Having posted the UID list, the record content for each UID can be
retrieved using either the LWP::Simple get method or the LWP post method.
However, the number of UIDs in $db_list should be determined first,
and eFetch content should be downloaded in "batches" of not more than 500 UIDs
to conform to NCBI guidelines. In addition, request initiations should be
spaced at least 15 seconds apart during peak-usage hours and 3 seconds
apart during off-peak hours.
To help programs stay within these limits, the eFetch and eSummary routines allow
users to specify the first UID entry to be processed and the maximum
number to be processed, beginning with the first, by using the "retstart"
and "retmax" parameters, respectively.
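The batch boundaries can be computed before any request is sent. The sketch below assumes that "retstart" counts from zero; batch_windows is a hypothetical helper name, and the count of 1200 is an arbitrary example.

```perl
use strict;
use warnings;

# Return a list of [retstart, retmax] pairs that cover $count UIDs in
# batches of at most $size. retstart is assumed to be zero-based.
sub batch_windows {
    my ($count, $size) = @_;
    my @windows;
    for (my $start = 0; $start < $count; $start += $size) {
        my $remaining = $count - $start;
        my $n = $remaining < $size ? $remaining : $size;
        push @windows, [$start, $n];
    }
    return @windows;
}

# Three batches for 1200 UIDs: 500, 500, and 200.
foreach my $w (batch_windows(1200, 500)) {
    print "retstart=$w->[0]&retmax=$w->[1]\n";
}
```

Each printed fragment would be appended to an eFetch or eSummary URL that already names the database, query key, and Web environment.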
Here is a continuation of the previous example that will print
the eFetch content for a list of UIDs in the query result database entry.
This code uses LWP rather than LWP::Simple to send eFetch requests, and
waits at least 3 seconds between request initiations.
Also, if this code fragment is to operate on an extant query result database entry,
you can find out how many items are in the list by using a search like:
Note that most eSummary or eFetch requests asking for batches of 500
records will take more than 3 seconds, so that the NCBI restrictions
will not significantly degrade performance during off-peak work.
They will, however, assure that multiple
simultaneous requests will get their fair shares of the system resources.
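The spacing rule applies to request initiations, so a request that itself takes longer than the minimum gap adds no extra delay. A small sketch of that arithmetic, with seconds_to_wait as a hypothetical helper name:

```perl
use strict;
use warnings;

# Given the start time of the previous request, the current time, and
# the minimum gap between request initiations (all in seconds), return
# how long to pause before starting the next request.
sub seconds_to_wait {
    my ($last_start, $now, $min_gap) = @_;
    my $elapsed = $now - $last_start;
    return $elapsed >= $min_gap ? 0 : $min_gap - $elapsed;
}

print seconds_to_wait(100, 101, 3), "\n";    # 2: request took 1 second
print seconds_to_wait(100, 105, 3), "\n";    # 0: request took 5 seconds
```

In a download loop, the result would be passed to Perl's sleep function before each new request, using time() to record when the previous request began.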
The method described earlier, which uses eLink to get SNP lists for a single gene, may not
generalize efficiently to getting separate SNP lists for a large number of genes.
The problem is that there may be too many genes to process serially in a reasonable amount of time,
and separate SNP lists for each gene might not result in large enough batch requests.
However, NCBI PowerScripting Lecture 4 describes a way to process such requests
using batched eFetch requests.
The general idea is to:
The Entrez Programming Utilities page describes each eUtility and provides
examples.
The "PowerTools Technical Workshop Series" link on the "Education" page linked from
the main NCBI page links to course slidesets including additional and thorough
descriptions of the NCBI programmatic interfaces.
Users who plan to use the NCBI eUtils extensively will want to examine the
NCBI_PowerScripting.pm package, which encapsulates many access details within
straightforward object calls. Using that package, the programs shown above can
be replaced by just a few lines of Perl code, and can even be generated "automatically"
through the Web-based eBot service.
Such scripts can simply be downloaded and run from the user's desktop to demonstrate
how the eUtilities can be efficiently used.
The NCBI_PowerScripting package also includes routines for performing and managing
"batched" Entrez queries. This helps programs access large amounts of
NCBI information without violating request rate guidelines.
Michael Grobe
Acknowledgements: This Web Page could not have been put together
without the NCBI PowerScripting course, and in particular, the
presentations and exercise sessions by Eric Sayers and Andrei Gabrielian.
Entrez manipulates sets of UIDs
Every database in the Entrez domain assigns unique IDs (UIDs) to major
record-types in each database. These IDs are integer values unique
within the database, but the same integer may be used to identify
records in multiple databases.
(Thus, to identify a particular record, one must specify both the
database and the record's UID.)
The Entrez "query result database"
The Entrez Core can keep a record of each query it processes, including
the UID set resulting from each query.
The database holding these records will be referred to as "the query result
database" within this presentation, although it is described as
"the History" or "the History server" in some NCBI documentation.
Database   Query Key   Web Env (edited)                          UID set
snp        2           A3zq156CDS_p1DdWz...AU6u3yb5D3B634BAF50   242, 28853987
NCBI program interfaces to the Entrez Core
There exist several "technologies" for accessing remote data and
computing resources programmatically.
Function                                                 Generic name                    CGI routine
define a set of UIDs                                     ePost (and sometimes eSearch)   epost.fcgi, esearch.fcgi
display the contents of records identified by UIDs       eSummary and eFetch             esummary.fcgi, efetch.fcgi
create a UID set from a previously defined set           eSearch                         esearch.fcgi
create a UID set by finding links from an existing set   eLink                           elink.fcgi
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=snp&id=242,28853987
If you enter this URL into a Web browser you will get a response like:
Database   Query Key   Web Env (edited)                              UID set
snp        1           01yWrS_p1DdWzAUPU6e...E5D3B634BAF50_0012SID   242,28853987
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=snp&query_key=1&\
WebEnv=01yWrS_p1DdWzAUPU6eOwxX2...s@1FBE5D3B634BAF50_0012SID
where the "\" at the end of the line signifies that the line actually
continues onto the next line (but does NOT get typed in),
eSummary.fcgi will return a document like this (with many lines removed):
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&query_key=2&\
WebEnv=01yWrS_p1DdWzAUPU6eOwxX2...s@1FBE5D3B634BAF50_0012SID&\
report=sgml&mode=xml
The result may be examined in fetch-example-xml.html.
Note that the "report" and "mode" options were used to specify the
report contents and format.
Selection of values for these options seems rather unusual.
Using ePost in a Perl program
The hand-entered queries shown above can all be sent to Entrez via programs.
A Perl program to post 2 UIDs (242 and 28853987) to the query result database
is shown below:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

my $url =
  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=snp&id=242,28853987";
# get() returns the XML response as a single string, or undef on failure.
my @result_array = get($url);
die "ePost request failed\n" unless defined $result_array[0];
print @result_array;
Note that the query is identical to the one issued in the first example above, and
the results will be identical, except for changes in the WebEnv
identifier.
bash-2.05$ perl test-ncbi-2.pl
Query Key: 1
Web Environment: 03bGQckqzaWiGXQqZHvYpXvTh...oNb-J@1FBE58C06361F480_0005SID
bash-2.05$ perl test-ncbi-4.pl
SNP_ID = 242
GENE =
FXN_CLASS =
TAX_ID = 9606
SNP class = in-del
Chromosome:Position = 1:20742047
SNP_ID = 28853987
GENE = LOC653635
FXN_CLASS = locus-region
TAX_ID = 9606
SNP class = snp
Chromosome:Position = 1:800
Using eSearch to create new query result database entries
eSearch can be used to modify the query result database in two different ways. It can:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=snp&\
usehistory=y&term=1[CHR]+AND+9606[TAX_ID]
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=snp&query_key=1&\
WebEnv=01yWrS_p1DdWzAUPU6eOwxX2...s@1FBE5D3B634BAF50_0012SID&\
term=snp[FXN_CLASS]&usehistory=y
Links, linksets, and using eLink to retrieve linksets
A "link" is a connection between two UIDs, not necessarily in the same database.
For example, a link may identify a protein coded for by a particular gene.
Link database
from UID   Linkname               to UID
242        snp_pubmed             8808290
242        snp_snp_genegenotype   242
242        snp_taxonomy           9606
242        snp_unists             71299
gene_SNP[filter]
in the term clause, as in:
term=11[CHR] AND gene_SNP[filter] AND mouse[orgn]
which will load all gene UIDs on mouse chromosome 11 that have links to the SNP
database.
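When such a term is sent from a program, its spaces and brackets are usually percent-encoded. The sketch below does this with a small hand-written encoder that is suitable only for ASCII terms. The names url_encode and gene_search_url are hypothetical, and the use of db=gene with "usehistory" is an assumption based on the surrounding text.

```perl
use strict;
use warnings;

# Percent-encode every character outside the unreserved set
# (letters, digits, "_", ".", "~", "-"). ASCII input only.
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9_.~-])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

# Build an eSearch URL against the gene database for the given term.
sub gene_search_url {
    my ($term) = @_;
    return "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
         . "?db=gene&usehistory=y&term=" . url_encode($term);
}

print gene_search_url('11[CHR] AND gene_SNP[filter] AND mouse[orgn]'), "\n";
```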
Using eLink to put UIDs into the query result database
It is possible to insert UIDs into the (new) query result database by using
the parameter "cmd=neighbor_history" with eLink.
If this is done when linking to a single target database, the usual query key and
Web environment information is returned.
If this is done when linking to multiple databases (as with "db=all"), however,
multiple query keys are returned along with a single Web
environment.
Here is example output from an eLink request
that includes a UID list containing only the SNP UID 242:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?db=all\
&id=242&dbfrom=snp&cmd=neighbor_history
Database   Query Key   Web Env (edited)                                     UID set
pubmed     2           00O9qrUHud2D8LXbpvILvTB...1FBE6E626363B8F0_0105SID   8808290
taxonomy   3           00O9qrUHud2D8LXbpvILvTB...1FBE6E626363B8F0_0105SID   9606
unists     4           00O9qrUHud2D8LXbpvILvTB...1FBE6E626363B8F0_0105SID   71299
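A program would need to recover one query key per target database plus the shared Web environment. The sketch below assumes the response holds one LinkSetDbHistory element per target database, each with DbTo and QueryKey elements, and a single WebEnv element. The sample XML is hand-written with a placeholder WebEnv value, and parse_link_history is a hypothetical helper name.

```perl
use strict;
use warnings;

# Return the shared Web environment followed by a hash that maps each
# target database to its query key.
sub parse_link_history {
    my ($xml) = @_;
    my ($web_env) = $xml =~ m{<WebEnv>([^<]+)</WebEnv>};
    my %key_for;    # target database => query key
    while ($xml =~ m{<LinkSetDbHistory>(.*?)</LinkSetDbHistory>}gs) {
        my $body = $1;
        my ($db)  = $body =~ m{<DbTo>([^<]+)</DbTo>};
        my ($key) = $body =~ m{<QueryKey>(\d+)</QueryKey>};
        $key_for{$db} = $key if defined $db && defined $key;
    }
    return ($web_env, %key_for);
}

# Hand-written sample standing in for a real eLink response.
my $sample = <<'XML';
<LinkSet>
  <DbFrom>snp</DbFrom>
  <LinkSetDbHistory>
    <DbTo>pubmed</DbTo><LinkName>snp_pubmed</LinkName><QueryKey>2</QueryKey>
  </LinkSetDbHistory>
  <LinkSetDbHistory>
    <DbTo>unists</DbTo><LinkName>snp_unists</LinkName><QueryKey>4</QueryKey>
  </LinkSetDbHistory>
  <WebEnv>EXAMPLE_WEBENV</WebEnv>
</LinkSet>
XML

my ($env, %key_for) = parse_link_history($sample);
print "$_ => query key $key_for{$_}\n" for sort keys %key_for;
print "WebEnv: $env\n";
```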
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=unists&\
query_key=4&WebEnv=00O9qrUHud2D8LXbpvILvTB...1FBE6E626363B8F0_0105SID
and will contain only the 71299 entry and look something like:
Using eLink to get information about SNPs related to a specific gene
Here is a program that uses ePost, eLink, and eSummary in a sequence to get
SNP information for every SNP in a specific gene. This program is similar to
the earlier example that posted a UID and then printed summary information for
that UID. The only difference is an eLink call between the ePost and eSummary
calls that uses the query result entry for the first call to build a new
query result entry containing a list of SNP UIDs, against which eSummary
can be run.
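The sequence of calls can be sketched as follows. This is an illustration under assumptions, not the original program: it assumes that eLink accepts "query_key" and "WebEnv" as input together with "cmd=neighbor_history", and that its response for a single target database contains one QueryKey and one WebEnv element. The fetching routine is passed in as a code reference, so the flow can be tried with a stand-in before contacting NCBI. The names history_ids and snp_summaries_for_gene are hypothetical.

```perl
use strict;
use warnings;

# Read the query key and Web environment from an eUtility response.
sub history_ids {
    my ($xml) = @_;
    my ($key) = $xml =~ m{<QueryKey>(\d+)</QueryKey>};
    my ($env) = $xml =~ m{<WebEnv>([^<]+)</WebEnv>};
    die "no history information in response\n"
        unless defined $key && defined $env;
    return ($key, $env);
}

# $fetch is a code reference that takes a URL and returns the response
# text, for example \&LWP::Simple::get once LWP::Simple is loaded.
sub snp_summaries_for_gene {
    my ($fetch, $gene_uid) = @_;
    my $base = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils";

    # 1. ePost: place the gene UID in the query result database.
    my ($key, $env) = history_ids(
        $fetch->("$base/epost.fcgi?db=gene&id=$gene_uid"));

    # 2. eLink: build a new entry holding the SNP UIDs linked from the gene.
    ($key, $env) = history_ids($fetch->(
        "$base/elink.fcgi?dbfrom=gene&db=snp&cmd=neighbor_history"
        . "&query_key=$key&WebEnv=$env"));

    # 3. eSummary: retrieve summaries for the SNP entry.
    return $fetch->("$base/esummary.fcgi?db=snp&query_key=$key&WebEnv=$env");
}

# Stand-in fetcher that returns canned text instead of contacting NCBI.
my $stub = sub {
    my ($url) = @_;
    return "<DocSum><Id>1</Id></DocSum>" if $url =~ /esummary/;
    return "<QueryKey>1</QueryKey><WebEnv>EXAMPLE_WEBENV</WebEnv>";
};
print snp_summaries_for_gene($stub, 3718), "\n";
```

With LWP::Simple loaded, passing \&get in place of the stand-in would send the three requests to NCBI in turn.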
Using LWP to post large UID lists and retrieve results in batches
The examples in this web page have relied upon the Perl LWP::Simple package.
Since LWP::Simple supports only HTTP GET requests, the number of characters
allowed in a request URL may be limited (to around 1 or 2KB, depending on the
HTTP server being accessed).
Retrieving eLinked records in batches using "index lists"
This approach is rather complicated, but is fairly straightforward
using Perl hashes, and will dramatically improve overall data
retrieval rates. LWP appears to allow only one instance of a
parameter name in a single request, so that Steps 1 and 2 must be
repeated multiple times, using LWP::Simple get requests including
multiple $id parameters to build the index list.
(Alternatively, a socket could be opened for sending an LWP post
request specifying multiple $id parameters at the same time.)
Additional information
The NCBI Web site includes a great deal of information describing the use of the
eUtilities and the structure of queries for searches.
Principal Systems Analyst
Research Technologies
University Information Technical Services
Indiana University
Office: IT 330A Indianapolis, IN
Office Phone: 317-278-6891
Office e-mail: dgrobe@iupui.edu
July 4, 2007