Automatically fetch PDFs of UCSC browser views for locations within a given session
June 21, 2013

The script below (current version on bitbucket) takes a browser URL, a session URL (obtained by saving a session to a file and copying that file to a site accessible by the chosen browser), and a file listing positions (as browser strings, i.e. chr:start-end), and generates a PDF file for each position.
The interesting part is the pair of curl calls in the loop: curl is used first to fetch the URL that corresponds to the PDF/PS creation page. From the HTML returned by that URL, the path of the PDF for the browser view is parsed out and then used to actually download the PDF.
#! /bin/bash
set -o pipefail
SCRIPT=$(basename "$0")
USAGE="
NAME
    ${SCRIPT} - automatically fetch UCSC browser PDF files

SYNOPSIS
    ${SCRIPT} browser_url session_url position_file

DESCRIPTION
    Uses UCSC browser urls to automatically fetch PDFs for
    a list of regions provided in the position file.
    Saves pdf files to current directory.
    Respect usage limits imposed by the respective browsers.

ARGUMENTS
    browser_url
        base URL w/o protocol, e.g. niamssolexa.niams.nih.gov
    session_url
        complete URL pointing to a session file saved from
        the browser
    position_file
        one position in format chr:start-end per line for
        all the positions to be captured

DEPENDENCIES
    curl
    accessible UCSC browser or mirror
"
function usage {
    echo "$@" >&2
    echo "${USAGE}" >&2
    exit 1
}
function fail {
    echo "$@" >&2
    exit 1
}
# parse commandline
base="${1%/}"
[[ -z "${base}" ]] && usage "missing base url"
session="${2}"
[[ -z "${session}" ]] && usage "missing session url"
position_file="${3}"
[[ -z "${position_file}" ]] && usage "missing position file"
[[ -f "${position_file}" ]] || fail "${position_file} not found"
url1="http://${base}/cgi-bin/hgTracks?hgS_doLoadUrl=submit"
url1="${url1}&hgS_loadUrlName=${session}&hgt.psOutput=on"
while read -r pos
do
    echo "fetching ${pos}"
    # fetch the PDF/PS creation page and parse out the PDF path
    pdfpath=$(curl -s "${url1}&position=${pos}" \
        | grep 'the current browser graphic in PDF' \
        | perl -ple 's:.*HREF="../trash(.*).pdf.*:/trash$1.pdf:')
    if [[ -z "${pdfpath}" ]]
    then
        fail "could not retrieve pdf url from ${url1}&position=${pos}"
    fi
    # fetch the PDF itself; chr:start-end becomes chr_start_end.pdf
    curl -s -o "$(echo "${pos}" | tr ':-' '__').pdf" "http://${base}/${pdfpath}" \
        || fail "could not fetch http://${base}/${pdfpath}"
done < "${position_file}"
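A quick note on the output filenames: each position string is turned into a safe filename by translating `:` and `-` to underscores. The invocation below is hypothetical (the mirror and session URLs are placeholders), but the filename mapping is exactly what the script's `tr` call produces:

```shell
#! /bin/bash
# Hypothetical invocation (placeholder mirror and session URLs):
#   ./fetch_browser_pdfs.sh genome.ucsc.edu http://example.org/mysession.txt positions.txt

# A position like this one...
pos='chr1:1000-2000'
# ...is saved as chr1_1000_2000.pdf in the current directory:
echo "${pos}" | tr ':-' '__'   # prints chr1_1000_2000
```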