Searching PubMed with Python

Update 2021-01: minor update to reflect some changes in the Pubmed API

PubMed is a search engine accessing millions of biomedical citations. Users can freely search for biomedical references. For some articles, the access to the full text paper is also open.

This post describes how you can programmatically search the PubMed database with Python, in order to integrate searching or browsing capabilities into your Python application.

There are two main options to consider:

Accessing the database via their public API
Using a package that does the above for you, e.g. Biopython

The Entrez Database a.k.a. the PubMed API

The PubMed API is called the Entrez Database. It’s a web service freely accessible, although there are some guidelines to follow (at the moment of this writing, they recommend not to post more than three requests per second).

There are in total 8 different functions, or e-utilities, which access the database in different ways. Most of the utilities will return XML data, although some of them have the option to return a more convenient JSON format.

In particular, the search API is available at the following URL:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi

If we want to search for the term fever, the URL we need is for example:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&retmax=20&sort=relevance&term=fever

The query string parameters used in this example:

db=pubmed, to narrow the search down to the pubmed DB only
retmode=json, to have a JSON string in response and not an XML
retmax=20, to obtain 20 results
sort=relevance, the results are sorted by relevance and not by added date which is the default ranking option on pubmed
term=[your query], the URL-encoded query

This search session will provide a number of PubMed IDs (probably 20) corresponding to the top citations which match our query.

In order to get some more details about these citations, we can use the efetch utility, which takes one or more citation ID as input. At the moment, the efetch utility does not return JSON, so XML is the only option to consider.

Given a list of citation IDs, the fetch operation can be built as follows

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=ID1,ID2,...

At this point, the response will be an XML to handle with e.g. minidom or other XML library. Please notice that we can query the efetch utility for multiple documents, simply by separating them with a comma.

Overall, it’s relatively easy to create the appropriate request using libraries like urllib.request or, better, requests. The response can be parsed with the json module, or minidom in case of XML.

An even more convenient way to do the job is to use an existing library that does what we need for us. A good example is Biopython, a comprehensive package for biological computation in Python.

Searching PubMed with Biopython

You can install the Biopython package with pip:

pip install biopython

The only component we need for searching PubMed is Entrez, which we can import with:

from Bio import Entrez

We can define a function for performing the search, e.g.

def search(query):
    Entrez.email = 'your.email@example.com'
    handle = Entrez.esearch(db='pubmed',
                            sort='relevance',
                            retmax='20',
                            retmode='xml',
                            term=query)
    results = Entrez.read(handle)
    return results

The list of citation IDs will be available as results[‘IdList’].

The next step is to fetch the details for all the retrieved articles via the efetch utility:

def fetch_details(id_list):
    ids = ','.join(id_list)
    Entrez.email = 'your.email@example.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    return results

A full example of search over the term fever:

if __name__ == '__main__':
    results = search('fever')
    id_list = results['IdList']
    papers = fetch_details(id_list)
    for i, paper in enumerate(papers['PubmedArticle']):
         print("{}) {}".format(i+1, paper['MedlineCitation']['Article']['ArticleTitle']))

Notice that the structure of the MedlineCitation dictionaries can get
really convoluted, so you can get familiar with it by doing some pretty-printing. For example after fetching the papers with the code above, you can print out the data for the first paper using the following snippet, so you can understand the structure of its record.

# Pretty print the first paper in full to observe its structure
import json
print(json.dumps(papers['PubmedArticle'][0], indent=2))

The reason for declaring your email address is to allow the NCBI to
contact you before blocking your IP, in case you’re violating the guidelines.

The Gist of the full example:

https://gist.github.com/bonzanini/5a4c39e4c02502a8451d

3 thoughts on “Searching PubMed with Python”

C says:

January 30, 2017 at 7:59 pm

Running the code as is from Git. Getting the following error:
TypeError: string indices must be integers.
Error is occurring on line 31.

LikeLike

1. Marco says:
  
  January 31, 2017 at 8:48 am
  It looks like the format returned by the efetch method is slightly different now
  
  If you replace papers with papers[‘PubmedArticle’] you should get the list or papers, e.g.
```
for i, paper in enumerate(papers['PubmedArticle']):
    print("{}) {}".format(i+1, paper['MedlineCitation']['Article']['ArticleTitle']))
```
  LikeLiked by 1 person
Justin Hart says:

April 4, 2017 at 11:01 pm

Love this.

LikeLike

Searching PubMed with Python

Published by

Marco

3 thoughts on “Searching PubMed with Python”

Leave a comment Cancel reply

Share this:

Related

Published by

Marco

3 thoughts on “Searching PubMed with Python”

Leave a comment Cancel reply