{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Download and register a dynamic dataset of (crowd-sourced?) US Marijuana street prices\n", "\n", "To download and register a dataset, one can create a new class which is inherited from [DataSet](../references/geoslurp.dataset.html#geoslurp.dataset.dataSetBase.DataSet). Two function *pull* and *register* can then be implemented so the data can be downloaded and registered in the database. \n", "\n", "\n", "\n", "## Example: US street prices of Marijuana\n", "I found this curious dataset (csv format) on this [blog](http://blog.yhat.com/posts/7-funny-datasets.html). So let's use it as an example. It's unclear where it get's the data from but I suspect it has been scraped from [here](http://www.priceofweed.com/directory) The accuracy is therefore questionable but it serves it's purpose here nevertheless.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create the stucture of the database table using sqalchemy ORM methods\n", "Since we know how are table is going to look like we can set it up according to [SQAlchemy's tutorial](https://docs.sqlalchemy.org/en/13/orm/tutorial.html#declare-a-mapping)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from sqlalchemy import MetaData,Column,Float,Integer,String\n", "from sqlalchemy.dialects.postgresql import TIMESTAMP\n", "from sqlalchemy.ext.declarative import declarative_base\n", "\n", "scheme=\"public\"\n", "\n", "# Setup the postgres table using methods as specified with sqlalchemy\n", "WeedTBase=declarative_base(metadata=MetaData(schema=scheme))\n", "\n", "class Weedtable(WeedTBase):\n", " \"\"\"Defines the Marijuana PostgreSQL table\"\"\"\n", " __tablename__='usweedprices'\n", " id=Column(Integer,primary_key=True)\n", " state=Column(String)\n", " highq=Column(Float)\n", " medq=Column(Float)\n", " lowq=Column(Float)\n", " time=Column(TIMESTAMP)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a helper function to extract relevant information from the file\n", "This will return dictionaries with keys who match the corresponding column names above" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from datetime import datetime\n", "def metaExtract(csvfile):\n", " \"\"\"A little generator which extracts rows from \"\"\"\n", " with open(csvfile,'rt') as fid:\n", " ln=fid.readline().split(\",\")\n", " for ln in fid:\n", " lnspl=ln.split(\",\")\n", " dt = datetime.strptime(lnspl[7], \"%Y-%m-%d \")\n", " try:\n", " meta={\"state\":lnspl[0],\"highq\":float(lnspl[1]),\"medq\":float(lnspl[3]),\"lowq\":float(lnspl[5]),\"time\":dt}\n", " except ValueError:\n", " #skip entries with NA values\n", " continue\n", " yield meta" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inherit from [DataSet](../references/geoslurp.dataset.html#geoslurp.dataset.dataSetBase.DataSet)\n", "In this case, we implement the *pull* method (where to download the data). and the *register* method. Note that we explicitly insert the sqlalchemy table as a member in the class, and specify the scheme name as a member." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from geoslurp.dataset import DataSet\n", "from geoslurp.datapull.http import Uri as http\n", "import os\n", "\n", "class USWeedPrices(DataSet):\n", " scheme=scheme\n", " csvfile=\"marijuana-street-price-clean.csv\"\n", " table=Weedtable\n", " def __init__(self,dbcon):\n", " super().__init__(dbcon)\n", " \n", " def pull(self):\n", " \"\"\"Pulls the csv file from the interwebs\"\"\"\n", " weedurl=http(\"http://blog.yhat.com/static/misc/data/marijuana-street-price-clean.csv\")\n", " uri,updated=weedurl.download(self.cacheDir())\n", " \n", " def register(self):\n", " self.truncateTable()\n", " \n", " #insert in bulk mode\n", "# metalist=[meta for meta in metaExtract(os.path.join(self.cacheDir(),self.csvfile))]\n", "# self.bulkInsert(metalist)\n", " \n", " #insert by entry\n", " for meta in metaExtract(os.path.join(self.cacheDir(),self.csvfile)):\n", " self.addEntry(meta)\n", " \n", " self.updateInvent()\n", " \n", " \n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create an instance of the class and call the pull, and register methods to download and register the data" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Geoslurp-INFO: Downloading /tmp/geoslurp/public/usweedprices/marijuana-street-price-clean.csv\n" ] } ], "source": [ "from geoslurp.config import setInfoLevel\n", "from geoslurp.db import geoslurpConnect\n", "setInfoLevel()\n", "\n", "\n", "gpcon=geoslurpConnect(readonlyuser=False)\n", "\n", "\n", "usWeed=USWeedPrices(gpcon)\n", "\n", "usWeed.pull()\n", "usWeed.register()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Avoid reinventing the wheel\n", "One can imagine that having a csv file is pretty common. The above could therefore also be simplified by inheriting from [PandasBase](../references/geoslurp.dataset.html#geoslurp.dataset.dataSetBase.DataSet), which is demonstrated bellow. Note that in this case, the register function of **PandasBase** works as is, and there is no need to reimplement it." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Geoslurp-INFO: Deleting usweedprices entry\n", "Geoslurp-INFO: Pruning directory /tmp/geoslurp/public/usweedprices\n" ] } ], "source": [ "#get rid of the previous dataset \n", "usWeed.purgeentry() #entry in the database\n", "usWeed.purgecache() # the cached data" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from geoslurp.dataset import PandasBase\n", "\n", "class USWeedPrices2(PandasBase):\n", " def __init__(self,dbconn):\n", " super().__init__(dbconn)\n", " self.pdfile=os.path.join(self.cacheDir(),\"marijuana-street-price-clean.csv\")\n", " \n", " def pull(self):\n", " \"\"\"same as above but we now also only download when newer than specified\"\"\"\n", " weedurl=http(\"http://blog.yhat.com/static/misc/data/marijuana-street-price-clean.csv\",lastmod=datetime(2018,1,1))\n", " uri,updated=weedurl.download(self.cacheDir(),check=True)\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the class is defined, one can again pull the data and call the register function. Examples for querying this dataset are found in the example notebook on discovery." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Geoslurp-INFO: Already Downloaded, skipping /tmp/geoslurp/public/usweedprices2/marijuana-street-price-clean.csv\n", "Geoslurp-INFO: Filling pandas table public.usweedprices2 with data from /tmp/geoslurp/public/usweedprices2/marijuana-street-price-clean.csv\n" ] } ], "source": [ "usWeed2=USWeedPrices2(gpcon)\n", "\n", "usWeed2.pull()\n", "\n", "usWeed2.register()" ] } ], "metadata": { "kernelspec": { "display_name": "pyrr", "language": "python", "name": "pyrr" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.1" } }, "nbformat": 4, "nbformat_minor": 4 }