Metapub Cookbook
Recipes for some of the things you might want to do with Metapub.
Retrieve Article Metadata (including Abstract) by PubMed ID
Metapub's PubMedFetcher retrieves ALL of the metadata included with the PubMedArticle object, as per the Entrez XML definition.
That means you have instant access to all of the following information (and more) with a single request per article:
- Title
- First author, last author, list of authors
- Abstract
- MeSH tags
- Keywords
- Year of publication
- Journal
- ...and everything else PubMed contains for the given PubMed ID
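As a quick illustration, here is a minimal sketch of single-request access to these attributes (the PMID is the same example used later in this cookbook):

from metapub import PubMedFetcher

fetch = PubMedFetcher()
article = fetch.article_by_pmid('30848465')   # one network request per article
print(article.title)
print(article.journal, article.year)
print(article.abstract)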
This recipe demonstrates how you could rebuild the PubMed page for yourself using all of the features of the PubMedArticle object. (Page formatting is left as an exercise for the reader.)
import argparse

from metapub import PubMedFetcher

def fetch_article_metadata(pmid):
    # Initialize the PubMedFetcher
    fetch = PubMedFetcher()

    # Retrieve the article using the PubMed ID
    article = fetch.article_by_pmid(pmid)

    # Extract and format metadata
    first_author = str(article.author_list[0]) if article.author_list else "N/A"
    last_author = str(article.author_list[-1]) if article.author_list else "N/A"
    authors = ', '.join([str(author) for author in article.author_list]) if article.author_list else "N/A"
    mesh_tags = ', '.join(article.mesh) if article.mesh else 'N/A'
    keywords = ', '.join(article.keywords) if article.keywords else 'N/A'

    # Print metadata
    print(f"Title: {article.title}")
    print(f"First Author: {first_author}")
    print(f"Last Author: {last_author}")
    print(f"Authors: {authors}")
    print(f"Abstract: {article.abstract}")
    print(f"MeSH Tags: {mesh_tags}")
    print(f"Keywords: {keywords}")
    print(f"Year of Publication: {article.year}")
    print(f"Journal: {article.journal}")

def main():
    parser = argparse.ArgumentParser(description='Fetch and display metadata for a given PubMed ID.')
    parser.add_argument('pmid', type=str, nargs='?', help='PubMed ID of the article to fetch')
    args = parser.parse_args()

    if not args.pmid:
        args.pmid = input("Please enter a PubMed ID: ")

    fetch_article_metadata(args.pmid)

if __name__ == "__main__":
    main()
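Usage (assuming you saved this script as fetch_article_metadata.py; the PMID is just an example):

python fetch_article_metadata.py 30848465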
Advanced Query Construction with PubMedFetcher and Python Dictionaries
Most researchers are well aware of how to construct an advanced query in PubMed using the PubMed Advanced Search Builder, and most people who use Metapub know you can simply copy that string into fetch.pmids_for_query(query) and get the expected results.
But did you know PubMedFetcher.pmids_for_query also accepts any number of English-like parameters, such as "journal" and "author1", for the same results?
Moreover, you can feed PubMedFetcher.pmids_for_query a dictionary of parameters. This is particularly useful for web apps or text-mining projects where parameters are automatically generated.
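Here is a minimal sketch of the idea, using parameter names from the script below (the journal and year values are arbitrary examples):

from metapub import PubMedFetcher

fetch = PubMedFetcher()

# English-like keyword arguments...
pmids = fetch.pmids_for_query(journal='Cell', year='2023')

# ...are equivalent to unpacking a dictionary built elsewhere in your program:
params = {'journal': 'Cell', 'year': '2023'}
pmids = fetch.pmids_for_query(**params)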
This recipe demonstrates the versatility of PubMedFetcher's advanced query constructor in a simple command-line app that you can try in any working Python environment, and shows that it generates the same results as traditional PubMed query codes.
import argparse
import logging

from metapub import PubMedFetcher

# Configure logging
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger()

def search_with_dictionary(params):
    fetch = PubMedFetcher()
    logger.debug(f"Searching with dictionary parameters: {params}")
    # Perform search using dictionary parameters
    pmids_dict = fetch.pmids_for_query(**params)
    return pmids_dict

def search_with_query_string(query):
    fetch = PubMedFetcher()
    logger.debug(f"Searching with query string: {query}")
    # Perform search using query string
    pmids_query = fetch.pmids_for_query(query)
    return pmids_query

def main():
    parser = argparse.ArgumentParser(description='Search PubMed using dictionary parameters.')
    parser.add_argument('--journal', type=str, help='Journal name')
    parser.add_argument('--author1', type=str, help='First author')
    parser.add_argument('--year', type=str, help='Year of publication')
    parser.add_argument('--keyword', type=str, help='Keyword')
    parser.add_argument('--query', type=str, help='Traditional PubMed query string')
    args = parser.parse_args()

    params = {}
    if args.journal:
        params['journal'] = args.journal
    if args.author1:
        params['first author'] = args.author1
    if args.year:
        params['year'] = args.year
    if args.keyword:
        params['keyword'] = args.keyword

    if args.query:
        pmids_query = search_with_query_string(args.query)
        logger.info(f"PMIDs from query string: {pmids_query}")

    if params:
        pmids_dict = search_with_dictionary(params)
        logger.info(f"PMIDs from dictionary: {pmids_dict}")

if __name__ == "__main__":
    main()
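To check that both code paths agree, run the script (saved here, hypothetically, as search_pubmed.py) both ways and compare the logged PMID lists:

python search_pubmed.py --journal "Cell" --year 2023
python search_pubmed.py --query "cell[journal] AND 2023[pdat]"

The field tags [journal] and [pdat] in the second invocation are the same codes the PubMed Advanced Search Builder produces.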
Build a Citation Graph
This recipe demonstrates how to create a citation network using the Metapub library. The script below starts with a given PubMed ID and recursively builds a directed acyclic graph (DAG) of citations up to a specified depth. Each node in the graph represents a paper, and each directed edge indicates a citation from one paper to another.
- Input: Accepts a PubMed ID and optional depth parameter.
- Output: Prints a dictionary containing nodes and edges of the citation network.
- Logging: Provides detailed logs of the process, including node and edge additions, fetching actions, and error handling.
Usage:
python build_citation_graph.py --depth 3 30848465
Code Explanation:
1. Initialize Logging: Configures logging to output to stdout for detailed tracking.
2. Fetch and Process Articles: Uses PubMedFetcher to retrieve articles and their citations.
3. Build the Network: Adds nodes and edges recursively, building the citation network up to the specified depth.
4. Error Handling: Logs errors encountered during fetching to ensure smooth execution.
This script is a practical example for researchers and developers looking to leverage Metapub for bibliometric analysis and visualization of citation networks.
import argparse
import logging

from metapub import PubMedFetcher

# Configure logging to output to stdout
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger()

def fetch_citation_network(pmid, depth=1, max_depth=3):
    fetch = PubMedFetcher()
    citation_graph = {'nodes': [], 'edges': []}

    def add_node(pmid):
        if pmid not in citation_graph['nodes']:
            citation_graph['nodes'].append(pmid)
            logger.debug(f'Added node: {pmid}')

    def add_edge(citing_pmid, cited_pmid):
        if (citing_pmid, cited_pmid) not in citation_graph['edges']:
            citation_graph['edges'].append((citing_pmid, cited_pmid))
            logger.debug(f'Added edge from {citing_pmid} to {cited_pmid}')

    def build_network(pmid, current_depth):
        if current_depth > max_depth:
            logger.debug(f'Max depth {max_depth} reached at PMID: {pmid}')
            return

        logger.info(f'Fetching article for PMID: {pmid}')
        try:
            article = fetch.article_by_pmid(pmid)
            add_node(pmid)
        except Exception as e:
            logger.error(f'Error fetching article for PMID {pmid}: {e}')
            return

        logger.info(f'Fetching related PMIDs (citedin) for: {pmid}')
        try:
            related_pmids = fetch.related_pmids(pmid).get('citedin', [])
            if not related_pmids:
                logger.debug(f'No related PMIDs (citedin) found for PMID: {pmid}')
            for related_pmid in related_pmids:
                add_node(related_pmid)
                add_edge(pmid, related_pmid)
                build_network(related_pmid, current_depth + 1)
        except Exception as e:
            logger.error(f'Error fetching related PMIDs for PMID {pmid}: {e}')

    build_network(pmid, depth)
    return citation_graph

def main():
    parser = argparse.ArgumentParser(description='Create a citation network for a given PubMed ID.')
    parser.add_argument('pmid', type=str, help='PubMed ID of the article to start with')
    parser.add_argument('--depth', type=int, default=3, help='Depth of the citation network')
    args = parser.parse_args()

    logger.info(f'Starting citation network creation for PMID: {args.pmid} with depth: {args.depth}')
    citation_network = fetch_citation_network(args.pmid, max_depth=args.depth)
    logger.info('Citation network creation completed')
    print(citation_network)

if __name__ == "__main__":
    main()
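The returned dictionary of nodes and edges maps directly onto graph libraries. For example, a sketch of loading it into networkx (not a Metapub dependency; assumed to be installed separately) for analysis or visualization:

import networkx as nx

graph_data = fetch_citation_network('30848465', max_depth=2)

# Build a directed graph: edges point from the cited paper to the paper citing it.
G = nx.DiGraph()
G.add_nodes_from(graph_data['nodes'])
G.add_edges_from(graph_data['edges'])
print(G.number_of_nodes(), 'nodes and', G.number_of_edges(), 'edges')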
Scrape Citations and Fetch DOIs
This recipe demonstrates how to scrape a webpage of citations that have a title and authors but not necessarily DOIs. For demonstration purposes we'll scrape metapub.org/citations and use BeautifulSoup to break apart the HTML; for your own project, you'll need to substitute your own scraping function.
This recipe also demonstrates the complementary functions of PubMedFetcher and CrossRefFetcher and how to navigate their usage discrepancies -- for example, the fact that you want "year" from a PubMedArticle but "pubdate" from a CrossRefWork. (This is a potential avenue of API improvement for Metapub...)
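To make the parsing assumptions concrete, here is a hypothetical HTML fragment in the shape that scrape_citations() below expects -- the title inside the first link, an "Authors:" line in the second paragraph, and an optional doi.org link:

from bs4 import BeautifulSoup

html = '''
<article class="citation">
  <h3><a href="/papers/1">Example Paper Title</a></h3>
  <p>Journal of Examples, 2021</p>
  <p>Authors: Smith, Jane; Jones, Kim</p>
  <a href="https://doi.org/10.1000/example">10.1000/example</a>
</article>
'''

citation = BeautifulSoup(html, 'html.parser').find('article', class_='citation')
print(citation.find('a').get_text())          # the title comes from the first <a>
print(citation.find_all('p')[1].get_text())   # the authors come from the second <p>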
import requests
from bs4 import BeautifulSoup

from metapub import PubMedFetcher, CrossRefFetcher

# Function to scrape citations from the webpage
def scrape_citations(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    citations = soup.find_all('article', class_='citation')

    citation_data = []
    for citation in citations:
        title = citation.find('a').get_text()
        authors = citation.find_all('p')[1].get_text().replace('Authors: ', '')
        doi_element = citation.find('a', href=lambda href: href and 'doi.org' in href)
        doi = doi_element.get_text() if doi_element else None
        citation_data.append({'title': title, 'authors': authors, 'doi': doi, 'year': None, 'journal': None})
    return citation_data

# Function to check DOI using PubMedFetcher
def check_Pubmed(citation):
    fetch = PubMedFetcher()
    query = f"{citation['title']} {citation['authors']}"
    pmids = fetch.pmids_for_query(query)
    for pmid in pmids:
        article = fetch.article_by_pmid(pmid)
        if article and article.title and citation['title'].lower() in article.title.lower():
            year = article.year
            journal = article.journal
            doi = article.doi if article.doi else None
            return doi, year, journal
    return None, None, None

# Function to check DOI using CrossRefFetcher
def check_CrossRef(citation):
    cr = CrossRefFetcher()
    results = cr.article_by_title(citation['title'])
    if not results:
        return None, None, None

    def extract_year(work):
        # A CrossRefWork provides "pubdate" rather than "year" (see the note above);
        # take the year portion defensively.
        return str(work.pubdate)[:4] if work.pubdate else None

    # If results is not a list, assume it's the correct result
    if not isinstance(results, list):
        return results.doi if results.doi else None, extract_year(results), results.journal

    # If results is a list, check each result for a match
    for result in results:
        if (result.title and citation['title'].lower() == result.title.lower()
                and citation['authors'].split(",")[0].lower() in
                    [author.get('family', '').lower() for author in result.authors]):
            return result.doi if result.doi else None, extract_year(result), result.journal
    return None, None, None

# Function to get DOI using both PubMedFetcher and CrossRefFetcher
def get_doi(citation):
    doi, year, journal = check_Pubmed(citation)
    if not doi:
        doi, crossref_year, crossref_journal = check_CrossRef(citation)
        if doi:
            year = year if year else crossref_year
            journal = journal if journal else crossref_journal
    return doi, year, journal

# Main script
def main():
    url = 'https://metapub.org/citations/'
    citations = scrape_citations(url)

    for citation in citations:
        if not citation['doi']:
            doi, year, journal = get_doi(citation)
            citation['doi'] = doi if doi else 'DOI not found'
            citation['year'] = year if year else 'Year not found'
            citation['journal'] = journal if journal else 'Journal not found'

        print(f"Citation: {citation['title']}")
        print(f"Authors: {citation['authors']}")
        print(f"DOI: {citation['doi']}")
        print(f"Year: {citation['year']}")
        print(f"Journal: {citation['journal']}\n")

if __name__ == "__main__":
    main()
Find Open Access Articles from a Specific Journal and Year
This recipe helps you find out how many articles from a given journal and year can be retrieved as open access. It uses the FindIt engine to construct a valid PDF link for each PMID, which tells you whether each article is available as open access.
from metapub import PubMedFetcher, FindIt

def find_open_access_articles(journal, year):
    fetch = PubMedFetcher()
    pmids = fetch.pmids_for_query(f"{journal}[journal] AND {year}[pdat]")

    open_access_pmids = []
    for pmid in pmids:
        src = FindIt(pmid, verify=True)
        if src.url:
            print(pmid, "available at", src.url)
            open_access_pmids.append(pmid)
        else:
            print(pmid, "not available:", src.reason)
    return open_access_pmids

# Example usage
journal = "cell"
year = 2023
open_access_articles = find_open_access_articles(journal, year)
print("Found", len(open_access_articles), "open-access articles in journal", journal, "for year", year)
Download Article PDFs
If you've generated a long list of research you need to read and don't want to take the time to click an average of 2.5 times per article to download its fulltext, FindIt is for you. This recipe demonstrates how simple it is to track down every Open Access PDF you might want from your wish list in a small script.
import requests

from metapub import FindIt

def download_article_pdfs(pmid_list, download_dir):
    for pmid in pmid_list:
        # FindIt locates a legitimate open-access PDF url for the PMID, if one exists.
        src = FindIt(pmid)
        if src.url:
            response = requests.get(src.url)
            with open(f"{download_dir}/{pmid}.pdf", 'wb') as f:
                f.write(response.content)
            print(f"Downloaded PDF for PMID {pmid}")
        else:
            print(f"No PDF available for PMID {pmid} ({src.reason})")

# Example usage
pmid_list = ["12345678", "23456789"]
download_dir = "/path/to/download"
download_article_pdfs(pmid_list, download_dir)