Metapub Cookbook

How to do some of the things you might want to do with Metapub.

Retrieve Article Metadata (including Abstract) by PubMed ID

Metapub's PubMedFetcher retrieves ALL of the metadata included in the PubMedArticle object, as defined by the Entrez XML schema.

That means you have instant access to all of the following information (and more) with a single request per article:

  • Title
  • First author, last author, list of authors
  • Abstract
  • MeSH tags
  • Keywords
  • Year of publication
  • Journal
  • ...and everything else PubMed contains for the given PubMed ID

This recipe demonstrates how you could rebuild the PubMed page for yourself using the features of the PubMedArticle object. (Page formatting is left as an exercise for the reader.)


import argparse
from metapub import PubMedFetcher

def fetch_article_metadata(pmid):
    # Initialize the PubMedFetcher
    fetch = PubMedFetcher()

    # Retrieve the article using the PubMed ID
    article = fetch.article_by_pmid(pmid)

    # Extract and format metadata
    first_author = str(article.author_list[0]) if article.author_list else "N/A"
    last_author = str(article.author_list[-1]) if article.author_list else "N/A"
    authors = ', '.join([str(author) for author in article.author_list]) if article.author_list else "N/A"
    mesh_tags = ', '.join(article.mesh) if article.mesh else 'N/A'
    keywords = ', '.join(article.keywords) if article.keywords else 'N/A'

    # Print metadata
    print(f"Title: {article.title}")
    print(f"First Author: {first_author}")
    print(f"Last Author: {last_author}")
    print(f"Authors: {authors}")
    print(f"Abstract: {article.abstract}")
    print(f"MESH Tags: {mesh_tags}")
    print(f"Keywords: {keywords}")
    print(f"Year of Publication: {article.year}")
    print(f"Journal: {article.journal}")

def main():
    parser = argparse.ArgumentParser(description='Fetch and display metadata for a given PubMed ID.')
    parser.add_argument('pmid', type=str, nargs='?', help='PubMed ID of the article to fetch')
    args = parser.parse_args()

    if not args.pmid:
        args.pmid = input("Please enter a PubMed ID: ")

    fetch_article_metadata(args.pmid)

if __name__ == "__main__":
    main()
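
Save the script under any name you like (fetch_article_metadata.py here is purely a placeholder) and pass it a PMID; run it with no arguments and it will prompt you for one:

python fetch_article_metadata.py 30848465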


Advanced Query Construction with PubMedFetcher and Python Dictionaries

Most researchers are well aware of how to construct an advanced query in PubMed using the PubMed Advanced Search Builder, and most people who use Metapub know you can simply copy that string into fetch.pmids_for_query(query) and get the expected results.

But did you know PubMedFetcher.pmids_for_query also accepts English-like keyword parameters such as "journal" and "first author" that produce the same results?

Moreover, you can feed PubMedFetcher.pmids_for_query a dictionary of parameters. This is particularly useful for web apps or text-mining projects where parameters are automatically generated.

This recipe demonstrates the versatility of PubMedFetcher's advanced query constructor in a simple command-line app you can try in any working Python environment, and shows that it generates the same results as traditional PubMed query codes.


import argparse
import logging
from metapub import PubMedFetcher

# Configure logging
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger()

def search_with_dictionary(params):
    fetch = PubMedFetcher()
    logger.debug(f"Searching with dictionary parameters: {params}")
    # Perform search using dictionary parameters
    pmids_dict = fetch.pmids_for_query(**params)
    return pmids_dict

def search_with_query_string(query):
    fetch = PubMedFetcher()
    logger.debug(f"Searching with query string: {query}")
    # Perform search using query string
    pmids_query = fetch.pmids_for_query(query)
    return pmids_query

def main():
    parser = argparse.ArgumentParser(description='Search PubMed using dictionary parameters.')
    parser.add_argument('--journal', type=str, help='Journal name')
    parser.add_argument('--author1', type=str, help='First author')
    parser.add_argument('--year', type=str, help='Year of publication')
    parser.add_argument('--keyword', type=str, help='Keyword')
    parser.add_argument('--query', type=str, help='Traditional PubMed query string')
    args = parser.parse_args()

    params = {}
    if args.journal:
        params['journal'] = args.journal
    if args.author1:
        params['first author'] = args.author1
    if args.year:
        params['year'] = args.year
    if args.keyword:
        params['keyword'] = args.keyword

    if args.query:
        pmids_query = search_with_query_string(args.query)
        logger.info(f"PMIDs from query string: {pmids_query}")

    if params:
        pmids_dict = search_with_dictionary(params)
        logger.info(f"PMIDs from dictionary: {pmids_dict}")

if __name__ == "__main__":
    main()
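
To check the equivalence yourself, run the script (saved here under a hypothetical name) both ways; [ta], [1au], and [dp] are the standard PubMed field tags for journal, first author, and date of publication:

python advanced_query.py --journal "Nature" --author1 "Smith J" --year 2020
python advanced_query.py --query 'Nature[ta] AND "Smith J"[1au] AND 2020[dp]'

Both runs should log the same list of PMIDs.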



Build a Citation Graph

This recipe demonstrates how to create a citation network using the Metapub library. The script below starts with a given PubMed ID and recursively builds a directed acyclic graph (DAG) of citations up to a specified depth. Each node in the graph represents a paper, and each directed edge points from a paper to a paper that cites it, via PubMed's "citedin" relation.

  • Input: Accepts a PubMed ID and optional depth parameter.
  • Output: Prints a dictionary containing nodes and edges of the citation network.
  • Logging: Provides detailed logs of the process, including node and edge additions, fetching actions, and error handling.

Usage:

python build_citation_graph.py --depth 3 30848465

Code Explanation:

1. Initialize Logging: Configures logging to output to stdout for detailed tracking.

2. Fetch and Process Articles: Uses PubMedFetcher to retrieve articles and their citations.

3. Build the Network: Adds nodes and edges recursively, building the citation network up to the specified depth.

4. Error Handling: Logs errors encountered during fetching so that a single failed lookup doesn't halt the traversal.

This script is a practical example for researchers and developers looking to leverage Metapub for bibliometric analysis and visualization of citation networks.


import argparse
import logging
from metapub import PubMedFetcher

# Configure logging to output to stdout
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger()

def fetch_citation_network(pmid, depth=1, max_depth=3):
    fetch = PubMedFetcher()
    citation_graph = {'nodes': [], 'edges': []}

    def add_node(pmid):
        if pmid not in citation_graph['nodes']:
            citation_graph['nodes'].append(pmid)
            logger.debug(f'Added node: {pmid}')

    def add_edge(citing_pmid, cited_pmid):
        if (citing_pmid, cited_pmid) not in citation_graph['edges']:
            citation_graph['edges'].append((citing_pmid, cited_pmid))
            logger.debug(f'Added edge from {citing_pmid} to {cited_pmid}')

    def build_network(pmid, current_depth):
        # Stop recursing past the requested depth. Note that already-seen PMIDs
        # may be revisited at deeper levels; add a visited set for large networks.
        if current_depth > max_depth:
            logger.debug(f'Max depth {max_depth} reached at PMID: {pmid}')
            return
        logger.info(f'Fetching article for PMID: {pmid}')
        try:
            article = fetch.article_by_pmid(pmid)
            add_node(pmid)
        except Exception as e:
            logger.error(f'Error fetching article for PMID {pmid}: {e}')
            return

        logger.info(f'Fetching related PMIDs (citedin) for: {pmid}')
        try:
            related_pmids = fetch.related_pmids(pmid).get('citedin', [])
            if not related_pmids:
                logger.debug(f'No related PMIDs (citedin) found for PMID: {pmid}')
            for related_pmid in related_pmids:
                add_node(related_pmid)
                add_edge(pmid, related_pmid)
                build_network(related_pmid, current_depth + 1)
        except Exception as e:
            logger.error(f'Error fetching related PMIDs for PMID {pmid}: {e}')

    build_network(pmid, depth)
    return citation_graph

def main():
    parser = argparse.ArgumentParser(description='Create a citation network for a given PubMed ID.')
    parser.add_argument('pmid', type=str, help='PubMed ID of the article to start with')
    parser.add_argument('--depth', type=int, default=3, help='Depth of the citation network')
    args = parser.parse_args()

    logger.info(f'Starting citation network creation for PMID: {args.pmid} with depth: {args.depth}')
    citation_network = fetch_citation_network(args.pmid, max_depth=args.depth)
    logger.info('Citation network creation completed')
    print(citation_network)

if __name__ == "__main__":
    main()
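
The returned dictionary maps straightforwardly onto graph libraries. Here's a minimal sketch of visualizing the result, assuming networkx and matplotlib are installed (neither is required by Metapub itself):

import matplotlib.pyplot as plt
import networkx as nx

# Build a directed graph from the nodes and edges collected above
graph = fetch_citation_network('30848465', max_depth=2)
G = nx.DiGraph()
G.add_nodes_from(graph['nodes'])
G.add_edges_from(graph['edges'])

print(f"{G.number_of_nodes()} papers, {G.number_of_edges()} citation links")
nx.draw(G, with_labels=True, node_size=300, font_size=6)
plt.show()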


Scrape Citations and Fetch DOIs

This recipe demonstrates how to scrape a webpage that lists citations as a title plus authors, where not every citation carries a DOI. We'll scrape metapub.org/citations and use BeautifulSoup to break apart the HTML, purely for demonstration purposes; you'll want to replace this with your own scraping function.

This recipe also demonstrates the complementary functions of PubMedFetcher and CrossRefFetcher and how to navigate their usage discrepancies -- for example, the fact that you want "year" from a PubMedArticle but "pubdate" from a CrossRefWork. (This is a potential avenue of API improvement for metapub...)


import requests
from bs4 import BeautifulSoup
from metapub import PubMedFetcher, CrossRefFetcher

# Function to scrape citations from the webpage
def scrape_citations(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    citations = soup.find_all('article', class_='citation')
    citation_data = []

    for citation in citations:
        title = citation.find('a').get_text()
        authors = citation.find_all('p')[1].get_text().replace('Authors: ', '')
        doi_element = citation.find('a', href=lambda href: href and 'doi.org' in href)
        doi = doi_element.get_text() if doi_element else None
        citation_data.append({'title': title, 'authors': authors, 'doi': doi, 'year': None, 'journal': None})

    return citation_data

# Function to check DOI using PubMedFetcher
def check_pubmed(citation):
    fetch = PubMedFetcher()
    query = f"{citation['title']} {citation['authors']}"
    pmids = fetch.pmids_for_query(query)

    for pmid in pmids:
        article = fetch.article_by_pmid(pmid)
        if article and citation['title'].lower() in article.title.lower():
            year = article.year
            journal = article.journal
            doi = article.doi if article.doi else None
            return doi, year, journal
    return None, None, None

# Function to check DOI using CrossRefFetcher
def check_crossref(citation):
    cr = CrossRefFetcher()
    results = cr.article_by_title(citation['title'])

    if not results:
        return None, None, None

    # If results is not a list, assume it's the correct result.
    # Note the discrepancy: CrossRefWork exposes 'pubdate' where PubMedArticle exposes 'year'.
    if not isinstance(results, list):
        return results.doi or None, results.pubdate, results.journal

    # If results is a list, check each result for a title and first-author match
    for result in results:
        if not result.title:
            continue
        title_matches = citation['title'].lower() == result.title.lower()
        first_author = citation['authors'].split(',')[0].lower()
        author_matches = first_author in [author['family'].lower() for author in result.authors]
        if title_matches and author_matches:
            return result.doi or None, result.pubdate, result.journal
    return None, None, None

# Function to get DOI using both PubMedFetcher and CrossRefFetcher
def get_doi(citation):
    doi, year, journal = check_pubmed(citation)

    if not doi:
        doi, crossref_pubdate, crossref_journal = check_crossref(citation)
        if doi:
            # Fall back on CrossRef's pubdate and journal when PubMed had neither
            year = year if year else crossref_pubdate
            journal = journal if journal else crossref_journal

    return doi, year, journal

# Main script
def main():
    url = 'https://metapub.org/citations/'
    citations = scrape_citations(url)

    for citation in citations:
        if not citation['doi']:
            doi, year, journal = get_doi(citation)
            citation['doi'] = doi if doi else 'DOI not found'
            citation['year'] = year if year else 'Year not found'
            citation['journal'] = journal if journal else 'Journal not found'

        print(f"Citation: {citation['title']}")
        print(f"Authors: {citation['authors']}")
        print(f"DOI: {citation['doi']}")
        print(f"Year: {citation['year']}")
        print(f"Journal: {citation['journal']}\n")

if __name__ == "__main__":
    main()
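
To try get_doi without scraping anything, you can hand it a citation dict directly (the title and authors below are placeholders; substitute a real paper):

citation = {'title': 'Your Paper Title Here', 'authors': 'Lastname, Firstname',
            'doi': None, 'year': None, 'journal': None}
print(get_doi(citation))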


Find Open Access Articles from a Specific Journal and Year

This recipe counts how many articles from a given journal and year can be retrieved as open access. It uses the FindIt engine to construct a valid PDF link for each PMID; when FindIt returns a URL, the article's fulltext is freely retrievable.


from metapub import PubMedFetcher
from metapub import FindIt

def find_open_access_articles(journal, year):
    fetch = PubMedFetcher()
    pmids = fetch.pmids_for_query(f"{journal}[journal] AND {year}[pdat]")
    open_access_pmids = []

    for pmid in pmids:
        src = FindIt(pmid, verify=True)
        if src.url:
            print(pmid, "available at ", src.url)
            open_access_pmids.append(pmid)
        else:
            print(pmid, "not available: ", src.reason)

    return open_access_pmids

# Example usage
journal = "cell"
year = 2023
open_access_articles = find_open_access_articles(journal, year)

print("Found", len(open_access_articles), "open access in journal", journal, "for year", year)


Download Article PDFs

If you've generated a long list of research you need to read and don't want to take the time to click an average of 2.5 times per article to download its fulltext, FindIt is for you. This recipe demonstrates how simple it is to track down every Open Access PDF you might want from your wish list in a small script.


import os

import requests
from metapub import FindIt

def download_article_pdfs(pmid_list, download_dir):
    for pmid in pmid_list:
        # FindIt locates a fulltext PDF URL (publisher or PMC) for the PMID
        src = FindIt(pmid)

        if src.url:
            response = requests.get(src.url)
            with open(os.path.join(download_dir, f"{pmid}.pdf"), 'wb') as f:
                f.write(response.content)
            print(f"Downloaded PDF for PMID {pmid}")
        else:
            print(f"No PDF available for PMID {pmid}: {src.reason}")

# Example usage
pmid_list = ["12345678", "23456789"]
download_dir = "/path/to/download"
download_article_pdfs(pmid_list, download_dir)
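
The download directory must exist before any PDF is written; you can guarantee that with one extra line at the top of download_article_pdfs:

os.makedirs(download_dir, exist_ok=True)  # create the directory if it's missing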