Find papers authored by a specific number of authors
PubMed has a programmatic interface that you can call from a script. It was developed for exactly your class of problem, which cannot be solved through the provided user interface.
This is the main page of the NCBI Entrez API: https://www.ncbi.nlm.nih.gov/books/NBK25501/
What you need to do is query PubMed by keyword(s). For example, this is a search for "concrete":
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&retmax=100&sort=relevance&term=concrete
Run multiple queries to cover your field. For example, you could also search for "brick" or "cement".
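As an alternative to separate queries, several keywords can be combined into one search with PubMed's Boolean OR operator. The sketch below only builds the URL and sends nothing. The helper name `build_esearch_url` and its defaults are made up for illustration. Note that a combined query is ranked as a whole, so the first 100 hits may be dominated by one keyword.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(keywords, retmax=100):
    """Build one ESearch URL that matches any of the given keywords.

    Hypothetical helper, not part of any NCBI library.
    """
    # Quote multi-word keywords so they are searched as phrases.
    quoted = [f'"{k}"' if " " in k else k for k in keywords]
    params = {
        "db": "pubmed",
        "retmode": "json",
        "retmax": retmax,
        "sort": "relevance",
        "term": " OR ".join(quoted),
    }
    # urlencode escapes spaces (as '+') and quotes (as %22) in the term.
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


print(build_esearch_url(["concrete", "brick", "cement"]))
```

Building the query string with `urlencode` also keeps search terms containing spaces or special characters from breaking the URL.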
Each search returns a list of publication IDs. For each ID, check the number of authors and keep the publications with a single author. To get the author list for a publication, call:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=29510510&retmode=json
Determine the length of the "authors" array in the response, and keep only the publications where it is one.
Based on @Razvan P's hint, I wrote a small Python 3 script which solves your problem:
'''
Created on 01.04.2018
@author: OBu
'''
import requests
import json
from collections import Counter # for histogram
eutils_basepath = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
DB = 'pubmed' # please modify for other databases
RETMAX = '100' # max 100 results - modify if needed, maximum = 100.000
SEARCHTERM = "concrete" # replace with your search term
# Now build the search URL:
search_url = eutils_basepath + 'esearch.fcgi?db=' + DB + \
             '&retmode=json&retmax=' + RETMAX + \
             '&sort=relevance&term=' + SEARCHTERM
# for additional search parameters or more complex search terms see examples in
# https://www.ncbi.nlm.nih.gov/books/NBK25500/#chapter1.Searching_a_Database
# or the full doc under
# https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch
s = requests.Session()
r = s.get(search_url)
if r.status_code != 200:
    raise ConnectionError("Search failed with error code " + str(r.status_code))
search_results = json.loads(r.text)
# show some statistics
print(f"{search_results['esearchresult']['count']} publications found.")
# both values are strings, so convert them to int for a numeric comparison
if int(RETMAX) < int(search_results['esearchresult']['count']):
    print(f"Warning: Only the first {RETMAX} publications are processed.")
# walk through all retrieved ids and fetch detailed publication information
# An alternative solution could use one single query based on the previous search results as shown in
# https://www.ncbi.nlm.nih.gov/books/NBK25500/#_chapter1_Downloading_Document_Summaries_
# This would reduce the server load
histogram = Counter()
for pub_id in search_results['esearchresult']['idlist']:
    # print(f"Fetching {pub_id}", end=" ")  # uncomment for a more verbose version
    # Now build the fetch URL:
    fetch_url = eutils_basepath + 'esummary.fcgi?db=' + DB + '&retmode=json&id=' + pub_id
    r = s.get(fetch_url)
    if r.status_code != 200:
        raise ConnectionError(f"Fetching of publication {pub_id} failed with error code {r.status_code}")
    # else:  # uncomment for a more verbose version
    #     print("...success!")  # uncomment for a more verbose version
    fetch_result = json.loads(r.text)
    authors = fetch_result['result'][pub_id]['authors']
    if len(authors) == 1:
        print(f"UID: {pub_id}, author: {authors[0]['name']}, title: {fetch_result['result'][pub_id]['title']}")
    # count every publication, not only the single-author ones
    histogram[len(authors)] += 1
print("Histogram: (number of authors, number of papers with that many authors)")
print(sorted(histogram.items(), key=lambda x: x[0]))
You will need Python 3.6 or above to run this script (please remove the f-strings for earlier versions), and you'll have to install "requests" via pip install requests.
The script searches PubMed for the SEARCHTERM. For the search term "concrete" (I like this running gag ;-) ), it produces output like this:
14125 publications found.
Warning: Only the first 100 publications are processed.
UID: 28844248, author: Baroody AJ, title: The Use of Concrete Experiences in Early Childhood Mathematics Instruction.
UID: 28772472, author: Wang XY, title: Modeling of Hydration, Compressive Strength, and Carbonation of Portland-Limestone Cement (PLC) Concrete.
UID: 29159238, author: Paul SC, title: Data on optimum recycle aggregate content in production of new structural concrete.
UID: 27012788, author: Kovler K, title: The national survey of natural radioactivity in concrete produced in Israel.
Histogram: (number of authors, number of papers with that many authors)
[(1, 4), (2, 15), (3, 21), (4, 24), (5, 19), (6, 10), (7, 3), (8, 2), (10, 2)]
It should not be too difficult to modify the script for other search tasks...
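One such modification, hinted at in the script's comments, is to request summaries in batches instead of one ID per call, since ESummary accepts a comma-separated list of IDs. The following is a minimal sketch and not a drop-in replacement. The helper names and the batch size of 100 are assumptions. It also assumes that the JSON response lists the returned IDs under `result['uids']`. The sample data is hand-made, not real PubMed output.

```python
from collections import Counter

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
BATCH_SIZE = 100  # assumption: a conservative number of IDs per request


def batched_esummary_urls(pub_ids, db="pubmed", batch_size=BATCH_SIZE):
    """Yield one ESummary URL per batch of comma-separated IDs."""
    for start in range(0, len(pub_ids), batch_size):
        batch = pub_ids[start:start + batch_size]
        yield (EUTILS_BASE + "esummary.fcgi?db=" + db
               + "&retmode=json&id=" + ",".join(batch))


def author_histogram(summary):
    """Count papers per number of authors in one parsed ESummary response."""
    result = summary["result"]
    histogram = Counter()
    for uid in result["uids"]:
        # Treat a missing author list as zero authors.
        histogram[len(result[uid].get("authors", []))] += 1
    return histogram


# Hand-made stand-in for a parsed JSON response (not real PubMed data).
sample = {"result": {
    "uids": ["1", "2", "3"],
    "1": {"authors": [{"name": "A"}]},
    "2": {"authors": [{"name": "B"}, {"name": "C"}]},
    "3": {"authors": [{"name": "D"}]},
}}
print(sorted(author_histogram(sample).items()))
```

In the script, each URL from `batched_esummary_urls(search_results['esearchresult']['idlist'])` would be fetched with `s.get(...)`, and the per-batch counters added together. This cuts 100 requests down to one, which is also kinder to NCBI's request rate limits.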
If there are questions on how to use the script, please ask!