geoslurp.datapull package

Submodules

geoslurp.datapull.cds module

class geoslurp.datapull.cds.Cds(resource, jobqueue={}, auth=None)

Bases: object

clearRequests(removestates=['downloaded', 'unavailable', 'failed'])

Clears the requests whose state is listed in removestates and updates the job queue
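The pruning step can be illustrated with a minimal standalone sketch. It assumes, purely for the example, that the job queue is a dict mapping an output file name to a record with a "state" key; the layout actually used by Cds is not documented here, and clear_requests is a hypothetical helper rather than part of geoslurp.

```python
def clear_requests(jobqueue, removestates=("downloaded", "unavailable", "failed")):
    """Return a new queue without the jobs whose state is in removestates."""
    return {
        fout: job
        for fout, job in jobqueue.items()
        if job.get("state") not in removestates
    }


# Hypothetical queue layout (an assumption): output file -> request record
queue = {
    "era5_2020.nc": {"state": "downloaded"},
    "era5_2021.nc": {"state": "queued"},
    "era5_2022.nc": {"state": "failed"},
}
remaining = clear_requests(queue)
```

Returning a new dict instead of deleting in place keeps the original queue available, for instance to write a log of what was dropped.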

downloadQueue(sleep=30)
loadRequests()

Load previous requests from job queue

queueRequest(fout, requestDict)

geoslurp.datapull.crawler module

class geoslurp.datapull.crawler.CrawlerBase(url)

Bases: ABC

parallelDownload(outdir, check=False, maxconn=8, gzip=False, continueonError=False)

Download uris in parallel
:param outdir: directory to download to
:param check: only download when newer or non-existent (defaults to False)
:param maxconn: number of parallel downloads to execute
:param continueonError (bool): keep trying
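A bounded pool of workers is one common way to get this behaviour. The sketch below is not geoslurp's implementation: it uses the standard library's ThreadPoolExecutor, and parallel_download, fake_download and the ftp://host/... addresses are placeholders invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parallel_download(uris, download, maxconn=8, continue_on_error=False):
    """Call download(uri) for every uri using at most maxconn worker threads.

    Returns (results, errors); errors stays empty unless continue_on_error is set.
    """
    results, errors = [], []
    with ThreadPoolExecutor(max_workers=maxconn) as pool:
        futures = {pool.submit(download, uri): uri for uri in uris}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception as exc:
                if not continue_on_error:
                    raise
                # Remember which uri failed and carry on with the rest
                errors.append((futures[future], exc))
    return results, errors


def fake_download(uri):
    """Stand-in for a real transfer: fails for uris containing 'missing'."""
    if "missing" in uri:
        raise IOError("cannot fetch " + uri)
    return uri.rsplit("/", 1)[-1]


done, failed = parallel_download(
    ["ftp://host/a.nc", "ftp://host/missing.nc", "ftp://host/b.nc"],
    fake_download,
    maxconn=2,
    continue_on_error=True,
)
```

Results arrive in completion order, so callers that need a stable order should sort them or key them by uri.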

rooturl = None
abstract uris()

Generator which returns URIs to the requested datasets

geoslurp.datapull.ftp module

class geoslurp.datapull.ftp.Crawler(url, pattern='.*', followpattern='.*', auth=None)

Bases: CrawlerBase

Crawler for ftp directories

ls(subdirs='')

List directories and files (generator)

uris(check=False, subdirs='')

Lists the files in a directory and returns them as a list of URIs

class geoslurp.datapull.ftp.Uri(url, lastmod=None, subdirs='', auth=None)

Bases: UriBase

geoslurp.datapull.geodesyunr module

class geoslurp.datapull.geodesyunr.Crawler(catalogfile)

Bases: CrawlerBase

Crawl the gps tenv3 data on geodesy.unr.edu

uris(refresh=True)

List uris of available gps final data in tenv3 format

class geoslurp.datapull.geodesyunr.Uri(indict)

Bases: UriBase

derived class which additionally holds info from the inventory

geoslurp.datapull.github module

class geoslurp.datapull.github.Crawler(reponame, commitsha=None, filter=<geoslurp.datapull.github.GithubFilter object>, followfilt=<geoslurp.datapull.github.GithubFilter object>, oauthtoken=None)

Bases: CrawlerBase

Crawls a github repository fixed to a certain commit

getSubTree(url)
treeitems(rootelem=None, depth=10, dirpath=None)

generator which recursively list all elements in a git tree
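A depth-limited recursive generator of this kind can be sketched as follows. The nested dict layout ("type", "name", "children") and the exact meaning of depth are assumptions made for the example; the real crawler works on the tree objects returned by the GitHub API.

```python
def tree_items(node, depth=10, dirpath=""):
    """Yield the paths of all blobs below node, descending at most depth levels."""
    for child in node.get("children", []):
        path = dirpath + "/" + child["name"] if dirpath else child["name"]
        if child["type"] == "tree":
            # Only descend while there is depth budget left
            if depth > 0:
                yield from tree_items(child, depth - 1, path)
        else:
            yield path


# Hypothetical in-memory repository tree
repo = {"type": "tree", "name": "", "children": [
    {"type": "blob", "name": "README.md"},
    {"type": "tree", "name": "data", "children": [
        {"type": "blob", "name": "grid.nc"},
        {"type": "tree", "name": "raw", "children": [
            {"type": "blob", "name": "obs.csv"},
        ]},
    ]},
]}
all_paths = list(tree_items(repo))
shallow_paths = list(tree_items(repo, depth=1))
```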

uris(depth=10)

Construct Uris from tree nodes

class geoslurp.datapull.github.GithubFilter(regexdict={'type': 'blob'})

Bases: object

Filter used for testing a certain dict element

isValid(elem)

Returns True if all of the regex criteria match the elem
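The "all criteria must match" rule can be shown with a small standalone class. DictFilter is a hypothetical stand-in, and treating a missing key as a failed match is an assumption rather than documented GithubFilter behaviour.

```python
import re


class DictFilter:
    """Accept a dict element only if every regex in regexdict matches it."""

    def __init__(self, regexdict=None):
        self.regexdict = {"type": "blob"} if regexdict is None else dict(regexdict)

    def is_valid(self, elem):
        # A key that is absent from elem counts as a failed criterion (assumption)
        return all(
            key in elem and re.search(regex, str(elem[key])) is not None
            for key, regex in self.regexdict.items()
        )


nc_files = DictFilter({"type": "blob", "path": r"\.nc$"})
```

With this filter, `nc_files.is_valid({"type": "blob", "path": "data/grid.nc"})` is accepted, while a tree node or a blob with another extension is rejected.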

geoslurp.datapull.github.cachedGithubCatalogue(reponame, cachedir='.', commitsha=None, gfilter=<geoslurp.datapull.github.GithubFilter object>, gfollowfilter=<geoslurp.datapull.github.GithubFilter object>, depth=2, ghtoken=None)

Caches the result of a github query for later reuse
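The caching idea can be sketched with a JSON file on disk. The function name, the file name and the JSON format below are assumptions for illustration; the source does not state how cachedGithubCatalogue stores its cache.

```python
import json
import os
import tempfile


def cached_catalogue(cachefile, build):
    """Return the catalogue stored in cachefile, calling build() only on a cache miss."""
    if os.path.exists(cachefile):
        with open(cachefile) as fid:
            return json.load(fid)
    catalogue = build()
    with open(cachefile, "w") as fid:
        json.dump(catalogue, fid)
    return catalogue


calls = []


def build_catalogue():
    """Stand-in for an expensive remote query; records how often it runs."""
    calls.append(1)
    return [{"path": "data/grid.nc", "type": "blob"}]


with tempfile.TemporaryDirectory() as cachedir:
    cachefile = os.path.join(cachedir, "catalogue.json")
    first = cached_catalogue(cachefile, build_catalogue)   # builds and stores
    second = cached_catalogue(cachefile, build_catalogue)  # served from the cache
```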

geoslurp.datapull.http module

class geoslurp.datapull.http.Uri(url, lastmod=None, auth=None, headers=None, cookiefile=None, checkssl=True)

Bases: UriBase

geoslurp.datapull.motu module

class geoslurp.datapull.motu.MotuOpts(moturoot, service, product, auth, btdbox, fout, cache, variables=None)

Bases: object

A class which mimics the options from argparse as used by the motuclient command line program

auth_mode = 'cas'
block_size = 12001
btdbox = <geoslurp.tools.Bounds.BtdBox object>
cache = '.'
console_mode = False
date_max = '9999-12-31 23:59:59'
date_min = '1-01-01 00:00:00'
depth_max = None
depth_min = None
describe = False
extraction_geographic = True
extraction_vertical = False
fullname()
latitude_max = None
latitude_min = None
longitude_max = None
longitude_min = None
motu = None
out_dir = '.'
out_name = 'dataset.nc'
outputWritten = 'netcdf'
product_id = None
proxy_server = None
pwd = None
service_id = None
size = False
socket_timeout = 515
sync = False
syncbtdbox(bbox=None)

Sets the internal btdbox and synchronize the corresponding motu variables

syncfilename(fout)
user = None
user_agent = 'motu-api-client'
variable = None
class geoslurp.datapull.motu.MotuRecursive(mopts, keepfiles=False)

Bases: object

Class which recursively downloads netcdf files within the 1GB limit using motu and patches them together
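One way to stay under a server-side size limit is to halve the request until each piece fits. The sketch below assumes the split happens along an integer time index and that a size estimate is available up front; both are assumptions, and split_requests and estimate are hypothetical names that are not taken from geoslurp.

```python
def split_requests(start, end, estimate_kb, max_kb):
    """Split [start, end) into sub-ranges whose estimated size fits within max_kb.

    A range of a single step is returned as is, even if it exceeds the limit,
    because it cannot be halved any further.
    """
    if estimate_kb(start, end) <= max_kb or end - start <= 1:
        return [(start, end)]
    mid = (start + end) // 2
    return (split_requests(start, mid, estimate_kb, max_kb)
            + split_requests(mid, end, estimate_kb, max_kb))


def estimate(start, end):
    """Hypothetical size model: every time step costs 100 kB."""
    return 100 * (end - start)


chunks = split_requests(0, 8, estimate, max_kb=250)
```

Each resulting chunk would then be downloaded separately and the pieces concatenated afterwards.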

download()

Download file

keepfiles = False
class geoslurp.datapull.motu.Uri(Mopts)

Bases: UriBase

download(direc, check=False, gzip=False, outfile=None)

Download file into directory and possibly check the modification time
:param check: check whether the file needs updating
:param gzip: additionally gzips the file (adds .gz to file name)
:param continueonError (bool): don’t raise an exception when a download error occurs

info = False
kbsize = 0
maxbtdbox = <geoslurp.tools.Bounds.BtdBox object>
maxkbsize = 0
requestInfo()

Request info (modification time, size, data coverage) on this specific query from the server

updateModTime()

Requests data description from the motu service

updateSize()

Request information about the size of the query

geoslurp.datapull.rsync module

class geoslurp.datapull.rsync.Crawler(url, auth)

Bases: CrawlerBase

Crawler which wraps calls to the Linux rsync utility

ls()

list remote content (using dry run)

parallelDownload(outdir, check=False, includes=None, dryrun=False)

Download uris in parallel
:param direc: directory to download to
:param check: only download when newer or non-existent (defaults to False)
:param maxconn: number of parallel downloads to execute
:param continueOnError (bool): keep trying

startrsync(cmd)

Start rsync and returns the list of files as a generator

uris()

Generator which returns URIs to the requested datasets

geoslurp.datapull.sftp module

geoslurp.datapull.thredds module

class geoslurp.datapull.thredds.Crawler(catalogurl, filter=<geoslurp.datapull.thredds.ThreddsFilter object>, followfilter=<geoslurp.datapull.thredds.ThreddsFilter object>, auth=None)

Bases: CrawlerBase

A class to work with an Opendap server

static getCatalog(url, auth=None)

Retrieve a catalogue

static getServices(catalog, rooturl, depth=2)

Retrieves the root for serving files over http url from a catalogue

setResumePoint(filter, followfilt=None)

Sets the filters after which the normal filters will be applied.

unsetResumePoint()

Unset resume point

uris(depth=10)

Generates a list of thredds Uris (makes use of xmlitems())

xmlitems(xmlcatalog=None, url=None, depth=10)

Generator which returns the xml nodes which obey a certain filter. Nodes which obey the followFilter will be searched recursively.

class geoslurp.datapull.thredds.ThreddsFilter(xmltyp='*', attr=None, regex=None)

Bases: object

Helper class to aid traversing to opendap xml elements

AND(xmltyp, attr=None, regex=None)

Provides a method for chaining AND filters

OR(xmltyp, attr=None, regex=None)

Provides a method for chaining OR filters
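Chaining can be implemented by letting each call append a test and return the filter itself. The sketch below is an assumption about how such chaining might work, not the ThreddsFilter source: ChainFilter is a hypothetical class and it evaluates the chained tests strictly left to right.

```python
import re
import xml.etree.ElementTree as ET


class ChainFilter:
    """Chainable filter on XML elements (illustrative, evaluated left to right)."""

    def __init__(self, xmltyp="*", attr=None, regex=None):
        self._tests = [("and", xmltyp, attr, regex)]

    def AND(self, xmltyp, attr=None, regex=None):
        self._tests.append(("and", xmltyp, attr, regex))
        return self  # returning self is what makes chaining possible

    def OR(self, xmltyp, attr=None, regex=None):
        self._tests.append(("or", xmltyp, attr, regex))
        return self

    @staticmethod
    def _match(elem, xmltyp, attr, regex):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop a leading {namespace}
        if xmltyp != "*" and tag != xmltyp:
            return False
        if attr is None:
            return True
        value = elem.get(attr)
        if value is None:
            return False
        return regex is None or re.search(regex, value) is not None

    def is_valid(self, elem):
        result = True
        for op, xmltyp, attr, regex in self._tests:
            hit = self._match(elem, xmltyp, attr, regex)
            result = (result and hit) if op == "and" else (result or hit)
        return result


dataset = ET.fromstring('<dataset name="sst_2020.nc"/>')
catalog_ref = ET.fromstring('<catalogRef name="2020"/>')
service = ET.fromstring('<service name="http"/>')
nc_or_catalog = ChainFilter("dataset", attr="name", regex=r"\.nc$").OR("catalogRef")
```

Here nc_or_catalog accepts the dataset and the catalogRef element but rejects the service element.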

isCatalog()

Check if the filter type is a catalogRef

isValid(xmlelem)

Filter xmlelem on attributes

class geoslurp.datapull.thredds.Uri(dataxml, services, auth=None)

Bases: UriBase

Thredds URI class

opendap = None
suburl = None
geoslurp.datapull.thredds.getAttrib(xml, regex)

Search in xml attributes based on a regex

geoslurp.datapull.thredds.getDate(xml)

extracts the date from a dataset element

geoslurp.datapull.thredds.getTagEnding(xml)

Strip the leading junk ({…}) from a tag
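The "junk" is the namespace that ElementTree prepends to tag names in the form {namespace}tag. A minimal sketch of stripping it, assuming an xml.etree.ElementTree element as input (tag_ending is a hypothetical name and the namespace URL is a placeholder):

```python
import xml.etree.ElementTree as ET


def tag_ending(elem):
    """Return the tag name of elem without a leading {namespace} prefix."""
    return elem.tag.rsplit("}", 1)[-1]


namespaced = ET.fromstring('<dataset xmlns="http://example.org/catalog"/>')
plain = ET.fromstring("<dataset/>")
```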

geoslurp.datapull.thredds.gethref(input)

small function to extract a href link from a dictionary

geoslurp.datapull.uri module

class geoslurp.datapull.uri.UriBase(url, lastmod=None, auth=None, subdirs='', headers=None, cookiefile=None, checkssl=True)

Bases: object

Base class to store uri resource

auth = None
buffer()

Download file into a buffer (default uses curl)

download(direc, check=False, gzip=False, gunzip=False, outfile=None, continueonError=False, restdict=None)

Download file into directory and possibly check the modification time
:param check: check whether the file needs updating
:param gzip: additionally gzips the file (adds .gz to file name)
:param continueonError (bool): don’t raise an exception when a download error occurs
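The check option amounts to comparing a remote modification time against the local file. The following standalone sketch shows one plausible form of that test; needs_update is a hypothetical helper, and downloading whenever the remote time is unknown is an assumption, not documented behaviour.

```python
import os
import tempfile
from datetime import datetime, timedelta, timezone


def needs_update(path, remote_lastmod):
    """Return True when path is missing or older than remote_lastmod (aware datetime)."""
    if not os.path.exists(path):
        return True
    if remote_lastmod is None:
        # Server gave no modification time: download to be safe (assumption)
        return True
    local_mtime = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    return remote_lastmod > local_mtime


with tempfile.TemporaryDirectory() as direc:
    target = os.path.join(direc, "data.nc")
    missing = needs_update(target, None)  # no local copy yet
    with open(target, "w") as fid:
        fid.write("payload")
    now = datetime.now(timezone.utc)
    stale = needs_update(target, now + timedelta(days=1))  # remote is newer
    fresh = needs_update(target, now - timedelta(days=1))  # local is newer
```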

headers = None
lastmod = None
subdirs = ''
updateModTime()

Tries to retrieve the last modification time of a file. Note: this is often not supported by the server.

url = None
class geoslurp.datapull.uri.UriFile(url, lastmod=None)

Bases: UriBase

buffer()

Download file into a buffer (default uses curl)

updateModTime()

Tries to retrieve the last modification time of a file. Note: this is often not supported by the server.

geoslurp.datapull.uri.curlDownload(url, fileorfid, mtime=None, gzip=False, gunzip=False, auth=None, restdict=None, headers=None, customRequest=None, upfid=None, cookiefile=None, checkssl=True)

Download the content of a url to an open file or buffer using pycurl
:param url: url to download from
:param fileorfid: filename or open file or buffer
:param mtime: explicitly set the modification time to this (useful when modification times are not supported by the server)
:param gzip: additionally gzip the file on disk (note this routine does not append .gz to the file name)
:param gunzip: automatically gunzip the downloaded file
:param auth: supply authentication data (user and passw)
:param restdict: a set of (REST) API name-value pairs to be added to the url (provide as a dict)
:param headers (array of header values): additionally set header elements
:param customRequest: set a custom request (e.g. for WebDAV servers)
:return: modification time of remote file

geoslurp.datapull.uri.findFiles(dir, pattern, since=None)

Generator which recursively searches a directory
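A recursive search of this kind can be built on os.walk. In the sketch below the pattern is treated as a regular expression applied to the file name and since as a naive local datetime; both interpretations are assumptions, and find_files is a hypothetical stand-in rather than the geoslurp function.

```python
import os
import re
import tempfile
from datetime import datetime


def find_files(root, pattern, since=None):
    """Yield files below root whose name matches pattern and, if since is
    given, whose modification time is not older than since."""
    regex = re.compile(pattern)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not regex.search(name):
                continue
            full = os.path.join(dirpath, name)
            if since is not None:
                mtime = datetime.fromtimestamp(os.path.getmtime(full))
                if mtime < since:
                    continue
            yield full


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.nc", "b.txt", os.path.join("sub", "c.nc")):
        open(os.path.join(root, rel), "w").close()
    found = sorted(os.path.relpath(p, root) for p in find_files(root, r"\.nc$"))
    # Nothing can be newer than a date far in the future
    found_since = list(find_files(root, r"\.nc$", since=datetime(2999, 1, 1)))
```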

geoslurp.datapull.uri.setFtime(file, modTime=None)

change modification and access time of a file
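Setting both timestamps is typically done with os.utime. A minimal sketch, assuming modTime is a datetime and that omitting it means "use the current time" (set_ftime is a hypothetical stand-in for the documented function):

```python
import os
import tempfile
from datetime import datetime, timezone


def set_ftime(path, mod_time=None):
    """Set access and modification time of path to mod_time (now if None)."""
    if mod_time is None:
        os.utime(path, None)
        return
    stamp = mod_time.timestamp()
    os.utime(path, (stamp, stamp))  # (access time, modification time)


with tempfile.NamedTemporaryFile(delete=False) as fid:
    tmpname = fid.name
wanted = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
set_ftime(tmpname, wanted)
mtime = os.path.getmtime(tmpname)
os.remove(tmpname)
```

Stamping a downloaded file with the remote modification time is what makes a later "only download when newer" comparison meaningful.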

geoslurp.datapull.uri.timeFromStamp(stamp)

geoslurp.datapull.webdav module

class geoslurp.datapull.webdav.Crawler(rooturl, pattern, auth, depth=1)

Bases: CrawlerBase

Webdav Crawler (list content of a directory)

find(urlin, depth)

List files in a webdav directory and recursively do this for subdirectories until the depth is exhausted

pattern = None
uris()

Generator which returns URIs to the requested datasets