Tuesday, May 18, 2010

How to Be A PubMed Historian

Quite a lot of people seem to like those graphs I sometimes make showing the number of papers published about a certain topic in any given year, based on the number of PubMed hits.

But how do I do it? Surely I don't sit there manually searching PubMed for each term, for each year, right? That would mean dozens, maybe hundreds, of manual searches. Well, unfortunately, that is exactly how I've done it in the past. I really am that cool, see.


Actually it doesn't take very long once you get into the swing of it, but I've now worked out a better way. See below for a bash script which repeatedly searches PubMed for a given sequence of years, downloads the first page of the results, picks out the bit where it tells you how many hits you got, and puts it all into a single output text file ready to be pasted into Excel or whatever. This comes with no guarantees whatsoever, but it seems to work. Enjoy...
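The heart of the script is one search per term per year. As a rough sketch of that single step, this is how the search address gets put together, assuming the PubMed address used in the script. The function name build_url is made up for illustration and does not appear in the script itself.

```shell
#!/usr/bin/env bash
# Build the search URL for one term and one year.
# build_url is a hypothetical helper name, used here only for illustration.
build_url() {
  local term=$1 year=$2
  # Encode every space in the term as %20 so it survives inside a URL.
  printf 'http://www.ncbi.nlm.nih.gov/sites/entrez?term=%s+%s[Publication Date]\n' \
    "${term// /%20}" "$year"
}

build_url 'serotonin depression' 2005
```

Note the double slash in `${term// /%20}`: it replaces every space, where a single slash would only replace the first one.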

Edit 29/06/2010: Vastly improved version that searches for multiple different terms sequentially, accepts terms that include spaces, and outputs the data in a sensible format.
The search term text file should be a plain text file containing one search term per line, e.g.:
serotonin depression
dopamine depression
GABA depression
This would search for each of those terms and output the data for each year into a single text file, with three data columns in this case. That's good for comparing the relative popularity of many different terms across time.
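To make the file layout concrete, here is a minimal sketch that writes the three example terms to a file and prints the header row you would expect above the yearly counts. The function name print_header is made up for illustration, and the exact spacing of the real script's header may differ slightly.

```shell
#!/usr/bin/env bash
# Print a tab-separated header row for a terms file (one term per line).
# print_header is a hypothetical helper name, used only for this sketch.
print_header() {
  local terms_file=$1 term
  printf 'YEAR'
  while IFS= read -r term; do
    # Spaces become %20, matching how the script labels its columns.
    printf '\t%s' "${term// /%20}"
  done < "$terms_file"
  printf '\n'
}

# list_of_terms.txt is the example file name from the script's usage comment.
printf '%s\n' 'serotonin depression' 'dopamine depression' 'GABA depression' > list_of_terms.txt
print_header list_of_terms.txt
```

Each later row of the output then holds a year followed by one hit count per term, in the same column order.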

---
#! /bin/bash
# 29 . 06 . 2010
#PubMedHistory script by Neuroskeptic http://neuroskeptic.blogspot.com
# script to find out how many PubMed hits for a certain string in a given year range.

# usage: script (search term text file) (start year) (end year) (output file)
# e.g. script list_of_terms.txt 2000 2005 dope.txt

# First, print the HEADER line of the output file.
printf "YEAR\t" > "$4"
cat "$1" | while IFS= read -r subject
do
#pre-format the subject, replacing every space with %20
ffa=${subject// /%20}
printf '%s\t' "$ffa" >> "$4"
done
#and a newline
printf "\n" >> "$4"

#Now the real thing. The main loop is a YEAR loop:

for (( yearz=$2; yearz<=$3; yearz++ ))
do
#For each year, create a temporary file t.txt containing the output for this line.
#First, the year, then a tab.

printf '%s\t' "$yearz" > t.txt

#now, a second loop to go through the list of searches
cat "$1" | while IFS= read -r subject
do
#replace every space with %20 so the term is safe in a URL
one=${subject// /%20}
wget -O "$yearz.txt" "http://www.ncbi.nlm.nih.gov/sites/entrez?term=${one}+${yearz}[Publication Date]"
#find the line in the output with what we're interested in
output=$(grep ncbi_resultcount "$yearz.txt")
#now, change it to get rid of the bit containing the search term
#as this will screw up the next step if it contains spaces!
output=${output/content*publication/LOL}
#print to a temp file
echo $output > temp$one$2$3$4.txt
#find the bit we want using awk
output=`awk '{ print $22 }' temp$one$2$3$4.txt`
rm temp$one$2$3$4.txt
rm $yearz.txt
#trim output
trimmedout=${output#content\=\"}
trimmedoutB=${trimmedout%\"}
#replace "false" with 0 because that's what "false" means
trimmedoutC=${trimmedoutB/'false'/0}
echo "In year $yearz, I got $trimmedoutC. Saving to temp file t.txt"
#write the result, and a tab, to the TEMPORARY output file
printf '%s\t' "$trimmedoutC" >> t.txt
done
#Now we've done all the search terms for this YEAR, so send the temporary data to the final file
cat t.txt >> "$4"
#and give it a newline
printf "\n" >> "$4"
done
rm t.txt
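
The fiddliest part of the script is picking the hit count out of the downloaded page. A shorter way to do that one step might look like the sketch below. It assumes the page carries a tag shaped like `<meta name="ncbi_resultcount" content="1234" />`, which is a guess based on what the script greps for, not something checked against the live site. The function name extract_count is made up for illustration.

```shell
#!/usr/bin/env bash
# Read a saved results page on stdin and print the hit count.
# ASSUMPTION: the page contains a tag shaped like
#   <meta name="ncbi_resultcount" content="1234" />
# extract_count is a hypothetical helper name, not part of the script above.
extract_count() {
  local count
  # Take the first line mentioning ncbi_resultcount, then keep only
  # the value of its content="..." attribute.
  count=$(grep -m 1 'ncbi_resultcount' \
    | sed -n 's/.*ncbi_resultcount"[^>]*content="\([^"]*\)".*/\1/p')
  # The script treats a value of "false" as zero hits.
  if [ "$count" = "false" ]; then
    count=0
  fi
  printf '%s\n' "$count"
}

# Try it on a made-up sample line instead of a real download.
printf '%s\n' '<meta name="ncbi_resultcount" content="1234" />' | extract_count
```

Because this reads the attribute directly, it does not depend on the count being the 22nd whitespace-separated field, so a search term containing spaces should not throw it off.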
