Download this Jupyter notebook from github


Download and register a dynamic dataset of (crowd-sourced?) US Marijuana street prices

To download and register a dataset, one can create a new class which is inherited from DataSet. Two function pull and register can then be implemented so the data can be downloaded and registered in the database.

Example: US street prices of Marijuana

I found this curious dataset (csv format) on this blog. So let’s use it as an example. It’s unclear where it get’s the data from but I suspect it has been scraped from here The accuracy is therefore questionable but it serves it’s purpose here nevertheless.

Create the stucture of the database table using sqalchemy ORM methods

Since we know how are table is going to look like we can set it up according to SQAlchemy’s tutorial

[1]:
from sqlalchemy import MetaData,Column,Float,Integer,String
from sqlalchemy.dialects.postgresql import TIMESTAMP
from sqlalchemy.ext.declarative import declarative_base

scheme="public"

# Setup the postgres table using methods as specified with sqlalchemy
WeedTBase=declarative_base(metadata=MetaData(schema=scheme))

class Weedtable(WeedTBase):
    """Defines the Marijuana PostgreSQL table"""
    __tablename__='usweedprices'
    id=Column(Integer,primary_key=True)
    state=Column(String)
    highq=Column(Float)
    medq=Column(Float)
    lowq=Column(Float)
    time=Column(TIMESTAMP)

Create a helper function to extract relevant information from the file

This will return dictionaries with keys who match the corresponding column names above

[2]:
from datetime import datetime
def metaExtract(csvfile):
    """A little generator which extracts rows from """
    with open(csvfile,'rt') as fid:
        ln=fid.readline().split(",")
        for ln in fid:
            lnspl=ln.split(",")
            dt = datetime.strptime(lnspl[7], "%Y-%m-%d ")
            try:
                meta={"state":lnspl[0],"highq":float(lnspl[1]),"medq":float(lnspl[3]),"lowq":float(lnspl[5]),"time":dt}
            except ValueError:
                #skip entries with NA values
                continue
            yield meta

Inherit from DataSet

In this case, we implement the pull method (where to download the data). and the register method. Note that we explicitly insert the sqlalchemy table as a member in the class, and specify the scheme name as a member.

[3]:
from geoslurp.dataset import DataSet
from geoslurp.datapull.http import Uri as http
import os

class USWeedPrices(DataSet):
    scheme=scheme
    csvfile="marijuana-street-price-clean.csv"
    table=Weedtable
    def __init__(self,dbcon):
        super().__init__(dbcon)

    def pull(self):
        """Pulls the csv file from the interwebs"""
        weedurl=http("http://blog.yhat.com/static/misc/data/marijuana-street-price-clean.csv")
        uri,updated=weedurl.download(self.cacheDir())

    def register(self):
        self.truncateTable()

        #insert in bulk mode
#         metalist=[meta for meta in metaExtract(os.path.join(self.cacheDir(),self.csvfile))]
#         self.bulkInsert(metalist)

        #insert by entry
        for meta in metaExtract(os.path.join(self.cacheDir(),self.csvfile)):
            self.addEntry(meta)

        self.updateInvent()

Create an instance of the class and call the pull, and register methods to download and register the data

[4]:
from geoslurp.config import setInfoLevel
from geoslurp.db import geoslurpConnect
setInfoLevel()


gpcon=geoslurpConnect(readonlyuser=False)


usWeed=USWeedPrices(gpcon)

usWeed.pull()
usWeed.register()

Geoslurp-INFO: Downloading /tmp/geoslurp/public/usweedprices/marijuana-street-price-clean.csv

Avoid reinventing the wheel

One can imagine that having a csv file is pretty common. The above could therefore also be simplified by inheriting from PandasBase, which is demonstrated bellow. Note that in this case, the register function of PandasBase works as is, and there is no need to reimplement it.

[5]:
#get rid of the previous dataset
usWeed.purgeentry() #entry in the database
usWeed.purgecache() # the cached data
Geoslurp-INFO: Deleting usweedprices entry
Geoslurp-INFO: Pruning directory /tmp/geoslurp/public/usweedprices
[6]:
from geoslurp.dataset import PandasBase

class  USWeedPrices2(PandasBase):
    def __init__(self,dbconn):
        super().__init__(dbconn)
        self.pdfile=os.path.join(self.cacheDir(),"marijuana-street-price-clean.csv")

    def pull(self):
        """same as above but we now also only download when newer than specified"""
        weedurl=http("http://blog.yhat.com/static/misc/data/marijuana-street-price-clean.csv",lastmod=datetime(2018,1,1))
        uri,updated=weedurl.download(self.cacheDir(),check=True)




Once the class is defined, one can again pull the data and call the register function. Examples for querying this dataset are found in the example notebook on discovery.

[7]:
usWeed2=USWeedPrices2(gpcon)

usWeed2.pull()

usWeed2.register()
Geoslurp-INFO: Already Downloaded, skipping /tmp/geoslurp/public/usweedprices2/marijuana-street-price-clean.csv
Geoslurp-INFO: Filling pandas table public.usweedprices2 with data from /tmp/geoslurp/public/usweedprices2/marijuana-street-price-clean.csv