geoslurp.datapull package

Submodules

geoslurp.datapull.cds module

class geoslurp.datapull.cds.Cds(resource, jobqueue={}, auth=None)

Bases: object

clearRequests(removestates=['downloaded', 'unavailable', 'failed']): Clears the requests whose state is listed in removestates and updates the job queue

downloadQueue(sleep=30)

loadRequests(): Load previous requests from job queue

queueRequest(fout, requestDict)

geoslurp.datapull.crawler module

class geoslurp.datapull.crawler.CrawlerBase(url)

Bases: ABC

parallelDownload(outdir, check=False, maxconn=8, gzip=False, continueonError=False): Download uris in parallel
:param outdir: directory to download to
:param check: only download when newer or non-existent (defaults to False)
:param maxconn: number of parallel downloads to execute
:param continueonError (bool): keep trying

rooturl = None

abstract uris(): Generator which yields the uris of the requested datasets
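
The pattern described for CrawlerBase, an abstract uris() generator that a parallel download routine consumes, can be sketched with the standard library alone. This is an illustration under assumptions, not geoslurp's implementation: SketchCrawler, ListCrawler and FakeUri are hypothetical names, and FakeUri.download() only builds a target path instead of transferring anything.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor


class FakeUri:
    """Hypothetical stand-in for a uri object; nothing is fetched."""

    def __init__(self, url):
        self.url = url

    def download(self, outdir, check=False):
        # A real uri would transfer the file; here we only build the target path.
        return outdir.rstrip("/") + "/" + self.url.rsplit("/", 1)[-1]


class SketchCrawler(ABC):
    """Hypothetical base class: subclasses only have to implement uris()."""

    def __init__(self, url):
        self.rooturl = url

    @abstractmethod
    def uris(self):
        """Yield uri objects for the requested datasets."""

    def parallelDownload(self, outdir, check=False, maxconn=8):
        # Fan the uris out over at most maxconn worker threads.
        with ThreadPoolExecutor(max_workers=maxconn) as pool:
            return list(pool.map(lambda u: u.download(outdir, check=check), self.uris()))


class ListCrawler(SketchCrawler):
    """Hypothetical crawler over a fixed list of file names."""

    def __init__(self, url, names):
        super().__init__(url)
        self.names = names

    def uris(self):
        for name in self.names:
            yield FakeUri(self.rooturl + "/" + name)


crawler = ListCrawler("https://example.org/data", ["a.nc", "b.nc"])
paths = crawler.parallelDownload("/tmp/out", maxconn=2)
```

Because the base class leaves only uris() abstract, each protocol-specific crawler can reuse the same download loop.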

geoslurp.datapull.ftp module

class geoslurp.datapull.ftp.Crawler(url, pattern='.*', followpattern='.*', auth=None)

Bases: CrawlerBase

Crawler for ftp directories

ls(subdirs=''): List directories and files (generator)

uris(check=False, subdirs=''): Generate a list of files in a directory and return a list of uris

class geoslurp.datapull.ftp.Uri(url, lastmod=None, subdirs='', auth=None): Bases: UriBase

geoslurp.datapull.geodesyunr module

class geoslurp.datapull.geodesyunr.Crawler(catalogfile)

Bases: CrawlerBase

Crawl the gps tenv3 data on geodesy.unr.edu

uris(refresh=True): List uris of available gps final data in tenv3 format

class geoslurp.datapull.geodesyunr.Uri(indict)

Bases: UriBase

derived class which additionally holds info from the inventory

geoslurp.datapull.github module

class geoslurp.datapull.github.Crawler(reponame, commitsha=None, filter=<geoslurp.datapull.github.GithubFilter object>, followfilt=<geoslurp.datapull.github.GithubFilter object>, oauthtoken=None)

Bases: CrawlerBase

Crawls a github repository fixed to a certain commit

getSubTree(url)

treeitems(rootelem=None, depth=10, dirpath=None): Generator which recursively lists all elements in a git tree

uris(depth=10): Construct Uris from tree nodes

class geoslurp.datapull.github.GithubFilter(regexdict={'type': 'blob'})

Bases: object

Filter used for testing a certain dict element

isValid(elem): Returns True if all of the regex criteria match the elem
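
A filter of this kind, where every regular expression in a dictionary has to match the corresponding entry of an element, could look roughly like the sketch below. DictRegexFilter is a hypothetical name, and the use of re.search (rather than re.match) and the treatment of missing keys as a failed match are assumptions, not documented behaviour of GithubFilter.

```python
import re


class DictRegexFilter:
    """Hypothetical filter: every regex in regexdict must match the element."""

    def __init__(self, regexdict=None):
        # The default mirrors the documented default of selecting blobs only.
        self.regexdict = {"type": "blob"} if regexdict is None else regexdict

    def isValid(self, elem):
        for key, pattern in self.regexdict.items():
            value = elem.get(key)
            # Assumption: a missing key counts as a failed match.
            if value is None or not re.search(pattern, str(value)):
                return False
        return True


blobs_only = DictRegexFilter()
netcdf_blobs = DictRegexFilter({"type": "blob", "path": r"\.nc$"})

# Made-up tree entries, shaped like the items of a git tree listing.
tree = [
    {"type": "blob", "path": "data/grid.nc"},
    {"type": "tree", "path": "data"},
    {"type": "blob", "path": "README.md"},
]
selected = [item["path"] for item in tree if netcdf_blobs.isValid(item)]
```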

geoslurp.datapull.github.cachedGithubCatalogue(reponame, cachedir='.', commitsha=None, gfilter=<geoslurp.datapull.github.GithubFilter object>, gfollowfilter=<geoslurp.datapull.github.GithubFilter object>, depth=2, ghtoken=None): Caches the result of a github query for later reuse

geoslurp.datapull.http module

class geoslurp.datapull.http.Uri(url, lastmod=None, auth=None, headers=None, cookiefile=None, checkssl=True): Bases: UriBase

geoslurp.datapull.motu module

class geoslurp.datapull.motu.MotuOpts(moturoot, service, product, auth, btdbox, fout, cache, variables=None)

Bases: object

A class which mimics the options from argparse as used by the motuclient command line program

auth_mode = 'cas'

block_size = 12001

btdbox = <geoslurp.tools.Bounds.BtdBox object>

cache = '.'

console_mode = False

date_max = '9999-12-31 23:59:59'

date_min = '1-01-01 00:00:00'

depth_max = None

depth_min = None

describe = False

extraction_geographic = True

extraction_vertical = False

fullname()

latitude_max = None

latitude_min = None

longitude_max = None

longitude_min = None

motu = None

out_dir = '.'

out_name = 'dataset.nc'

outputWritten = 'netcdf'

product_id = None

proxy_server = None

pwd = None

service_id = None

size = False

socket_timeout = 515

sync = False

syncbtdbox(bbox=None): Sets the internal btdbox and synchronize the corresponding motu variables

syncfilename(fout)

user = None

user_agent = 'motu-api-client'

variable = None

class geoslurp.datapull.motu.MotuRecursive(mopts, keepfiles=False)

Bases: object

Class which recursively downloads netcdf files within the 1GB limit using motu and patches them together

download(): Download file

keepfiles = False
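
The idea of staying within a size limit by recursively splitting a request can be sketched as follows. How MotuRecursive actually chooses its splits is not stated here; the sketch assumes a split of the bounding box along its longer side, and Box, estimate_kb and split_until_small are hypothetical names. The size estimate is a made-up stand-in for asking the server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical longitude/latitude box in degrees."""
    w: float
    e: float
    s: float
    n: float


def estimate_kb(box, kb_per_sqdeg=50.0):
    # Stand-in for a size request to the server (made-up density).
    return (box.e - box.w) * (box.n - box.s) * kb_per_sqdeg


def split_until_small(box, max_kb):
    """Yield sub-boxes whose estimated size stays within max_kb."""
    if estimate_kb(box) <= max_kb:
        yield box
        return
    # Halve the longer side so the pieces stay roughly square.
    if (box.e - box.w) >= (box.n - box.s):
        mid = (box.w + box.e) / 2
        halves = (Box(box.w, mid, box.s, box.n), Box(mid, box.e, box.s, box.n))
    else:
        mid = (box.s + box.n) / 2
        halves = (Box(box.w, box.e, box.s, mid), Box(box.w, box.e, mid, box.n))
    for half in halves:
        yield from split_until_small(half, max_kb)


pieces = list(split_until_small(Box(0, 40, 0, 20), max_kb=12000))
```

Each piece would then be downloaded separately and the resulting files patched together afterwards.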

class geoslurp.datapull.motu.Uri(Mopts)

Bases: UriBase

download(direc, check=False, gzip=False, outfile=None): Download file into directory and possibly check the modification time
:param check: check whether the file needs updating
:param gzip: additionally gzips the file (adds .gz to file name)

info = False

kbsize = 0

maxbtdbox = <geoslurp.tools.Bounds.BtdBox object>

maxkbsize = 0

requestInfo(): Request info (modification time, size, datacoverage) on this specific query from the server

updateModTime(): Requests data description from the motu service

updateSize(): Request information about the size of the query

geoslurp.datapull.rsync module

class geoslurp.datapull.rsync.Crawler(url, auth)

Bases: CrawlerBase

Crawler wrapper which calls the Linux rsync utility

ls(): list remote content (using dry run)

parallelDownload(outdir, check=False, includes=None, dryrun=False): Download uris in parallel
:param outdir: directory to download to
:param check: only download when newer or non-existent (defaults to False)

startrsync(cmd): Start rsync and returns the list of files as a generator

uris(): Generator which returns uri’s to requested datasets

geoslurp.datapull.sftp module

geoslurp.datapull.thredds module

class geoslurp.datapull.thredds.Crawler(catalogurl, filter=<geoslurp.datapull.thredds.ThreddsFilter object>, followfilter=<geoslurp.datapull.thredds.ThreddsFilter object>, auth=None)

Bases: CrawlerBase

A class to work with an Opendap server

static getCatalog(url, auth=None): Retrieve a catalogue

static getServices(catalog, rooturl, depth=2): Retrieves from a catalogue the root url used for serving files over http

setResumePoint(filter, followfilt=None): Sets the filters after which the normal filters will be applied.

unsetResumePoint(): Unset resume point

uris(depth=10): Generates a list of threddsURI’s (makes use of xmlitems())

xmlitems(xmlcatalog=None, url=None, depth=10): Generator which returns xml nodes which obey a certain filter Nodes which obey the followFilter will be recursively searched

class geoslurp.datapull.thredds.ThreddsFilter(xmltyp='*', attr=None, regex=None)

Bases: object

Helper class to aid traversing to opendap xml elements

AND(xmltyp, attr=None, regex=None): Provides a method for chaining AND filters

OR(xmltyp, attr=None, regex=None): Provides a method for chaining OR filters

isCatalog(): Check if the filter type is a catalogRef

isValid(xmlelem): Filter xmlelem on attributes
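
Chaining filters with AND and OR, as described for ThreddsFilter, can be illustrated with a small stand-alone class. XmlFilter is a hypothetical name and the matching rules (tag name without namespace, optional attribute tested with re.search) are assumptions; the sample catalogue is made up.

```python
import re
import xml.etree.ElementTree as ET


class XmlFilter:
    """Hypothetical filter on the tag name and, optionally, one attribute."""

    def __init__(self, xmltyp="*", attr=None, regex=None):
        self.test = self._make_test(xmltyp, attr, regex)

    @staticmethod
    def _make_test(xmltyp, attr, regex):
        def test(elem):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop a leading {namespace}
            if xmltyp != "*" and tag != xmltyp:
                return False
            if attr is None:
                return True
            value = elem.get(attr)
            if value is None:
                return False
            return regex is None or re.search(regex, value) is not None
        return test

    def AND(self, xmltyp, attr=None, regex=None):
        first, second = self.test, self._make_test(xmltyp, attr, regex)
        self.test = lambda elem: first(elem) and second(elem)
        return self  # returning self is what allows chaining

    def OR(self, xmltyp, attr=None, regex=None):
        first, second = self.test, self._make_test(xmltyp, attr, regex)
        self.test = lambda elem: first(elem) or second(elem)
        return self

    def isValid(self, elem):
        return self.test(elem)


catalog = ET.fromstring(
    "<catalog>"
    '<dataset name="sla_2020.nc" dataSize="12"/>'
    '<dataset name="sla_2020.txt"/>'
    '<catalogRef name="older"/>'
    "</catalog>"
)
either = XmlFilter("dataset", attr="name", regex=r"\.nc$").OR("catalogRef")
both = XmlFilter("dataset").AND("dataset", attr="dataSize")
names_either = [el.get("name") for el in catalog if either.isValid(el)]
names_both = [el.get("name") for el in catalog if both.isValid(el)]
```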

class geoslurp.datapull.thredds.Uri(dataxml, services, auth=None)

Bases: UriBase

Thredds URI class

opendap = None

suburl = None

geoslurp.datapull.thredds.getAttrib(xml, regex): Search in xml attributes based on a regex

geoslurp.datapull.thredds.getDate(xml): extracts the date from a dataset element

geoslurp.datapull.thredds.getTagEnding(xml): Strip the leading junk ({…}) from a tag

geoslurp.datapull.thredds.gethref(input): small function to extract a href link from a dictionary

geoslurp.datapull.uri module

class geoslurp.datapull.uri.UriBase(url, lastmod=None, auth=None, subdirs='', headers=None, cookiefile=None, checkssl=True)

Bases: object

Base class to store uri resource

auth = None

buffer(): Download file into a buffer (default uses curl)

download(direc, check=False, gzip=False, gunzip=False, outfile=None, continueonError=False, restdict=None): Download file into directory and possibly check the modification time
:param check: check whether the file needs updating
:param gzip: additionally gzips the file (adds .gz to file name)
:param continueonError (bool): don’t raise an exception when a download error occurs

headers = None

lastmod = None

subdirs = ''

updateModTime(): Tries to retrieve the last modification time of a file Note: this is often not supported by the server

url = None

class geoslurp.datapull.uri.UriFile(url, lastmod=None)

Bases: UriBase

buffer(): Download file into a buffer (default uses curl)

updateModTime(): Tries to retrieve the last modification time of a file Note: this is often not supported by the server

geoslurp.datapull.uri.curlDownload(url, fileorfid, mtime=None, gzip=False, gunzip=False, auth=None, restdict=None, headers=None, customRequest=None, upfid=None, cookiefile=None, checkssl=True): Download the content of an url to an open file or buffer using pycurl :param url: url to download from :param fileorfid: filename or open file or buffer :param mtime: explicitly set the modification time to this (usefull when modification times are not supported b the server) :param gzip: additionally gzip the file on disk (note this routine does not append .gz to the file name) :param gunzip: automatically gunzip the downloaded file :param auth: supply authentification data (user and passw) :param restdic: a set of (REST) API name-value pairs to be added to the url (provide as a dict) :param headers (array of header values): additionally set header elements :param customRequest: set a custoi request (e.g. for WEBDAV servers) :return: modification time of remote file

geoslurp.datapull.uri.findFiles(dir, pattern, since=None): Generator to recursively search adirecctor (returns a generator)
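
A recursive file search that returns a generator can be written with os.walk. The sketch below is an assumption about what such a helper might do, not the code of findFiles: find_files is a hypothetical name, the pattern is applied to the file name with re.search, and since is taken to be a datetime compared against the file's modification time.

```python
import os
import re
import tempfile
from datetime import datetime, timedelta


def find_files(root, pattern, since=None):
    """Yield paths under root whose file name matches pattern.

    If since (a datetime) is given, only files modified after it are yielded.
    """
    regex = re.compile(pattern)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not regex.search(name):
                continue
            path = os.path.join(dirpath, name)
            mtime = datetime.fromtimestamp(os.path.getmtime(path))
            if since is not None and mtime <= since:
                continue
            yield path


# Build a small throwaway directory tree to search.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.nc", "b.txt", os.path.join("sub", "c.nc")):
        open(os.path.join(root, rel), "w").close()
    found = sorted(os.path.relpath(p, root) for p in find_files(root, r"\.nc$"))
    tomorrow = datetime.now() + timedelta(days=1)
    recent = list(find_files(root, r"\.nc$", since=tomorrow))
```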

geoslurp.datapull.uri.setFtime(file, modTime=None): change modification and access time of a file
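
Setting a file's timestamps explicitly is what makes a later "only download when newer" check possible when the local copy should carry the remote modification time. A minimal sketch with os.utime follows; set_ftime and needs_update are hypothetical helpers, and the rule that a missing or older local file needs updating is an assumption.

```python
import os
import tempfile
from datetime import datetime


def set_ftime(path, mod_time=None):
    """Set access and modification time of path; None means the current time."""
    if mod_time is None:
        os.utime(path, None)
    else:
        stamp = mod_time.timestamp()
        os.utime(path, (stamp, stamp))


def needs_update(path, remote_time):
    """True when the local file is missing or older than remote_time."""
    if not os.path.exists(path):
        return True
    return datetime.fromtimestamp(os.path.getmtime(path)) < remote_time


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "dataset.nc")
    missing = needs_update(target, datetime(2020, 1, 1))
    open(target, "w").close()
    # Pretend the server reported this modification time for the file.
    set_ftime(target, datetime(2020, 1, 1))
    stale = needs_update(target, datetime(2021, 1, 1))
    fresh = needs_update(target, datetime(2020, 1, 1))
```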

geoslurp.datapull.uri.timeFromStamp(stamp)

geoslurp.datapull.webdav module

class geoslurp.datapull.webdav.Crawler(rooturl, pattern, auth, depth=1)

Bases: CrawlerBase

Webdav Crawler (list content of a directory)

find(urlin, depth): List files in a webdav directory and recursively do this for directories until the depth is exhausted

pattern = None

uris(): Generator which returns uri’s to requested datasets