{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Download and register a dynamic dataset of (crowd-sourced?) US Marijuana street prices\n",
    "\n",
    "To download and register a dataset, one can create a new class which is inherited from [DataSet](../references/geoslurp.dataset.html#geoslurp.dataset.dataSetBase.DataSet). Two function *pull* and *register* can then be implemented so the data can be downloaded and registered in the database. \n",
    "\n",
    "\n",
    "\n",
    "## Example: US street prices of Marijuana\n",
    "I found this curious dataset (csv format) on this [blog](http://blog.yhat.com/posts/7-funny-datasets.html). So let's use it as an example. It's unclear where it get's the data from but I suspect it has been scraped from [here](http://www.priceofweed.com/directory) The accuracy is therefore questionable but it serves it's purpose here nevertheless.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create the stucture of the database table using sqalchemy ORM methods\n",
    "Since we know how are table is going to look like we can set it up according to [SQAlchemy's tutorial](https://docs.sqlalchemy.org/en/13/orm/tutorial.html#declare-a-mapping)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sqlalchemy import MetaData,Column,Float,Integer,String\n",
    "from sqlalchemy.dialects.postgresql import TIMESTAMP\n",
    "from sqlalchemy.ext.declarative import declarative_base\n",
    "\n",
    "scheme=\"public\"\n",
    "\n",
    "# Setup the postgres table using methods as specified with sqlalchemy\n",
    "WeedTBase=declarative_base(metadata=MetaData(schema=scheme))\n",
    "\n",
    "class Weedtable(WeedTBase):\n",
    "    \"\"\"Defines the Marijuana PostgreSQL table\"\"\"\n",
    "    __tablename__='usweedprices'\n",
    "    id=Column(Integer,primary_key=True)\n",
    "    state=Column(String)\n",
    "    highq=Column(Float)\n",
    "    medq=Column(Float)\n",
    "    lowq=Column(Float)\n",
    "    time=Column(TIMESTAMP)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a helper function to extract relevant information from the file\n",
    "This will return dictionaries with keys who match the corresponding column names above"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datetime import datetime\n",
    "def metaExtract(csvfile):\n",
    "    \"\"\"A little generator which extracts rows from \"\"\"\n",
    "    with open(csvfile,'rt') as fid:\n",
    "        ln=fid.readline().split(\",\")\n",
    "        for ln in fid:\n",
    "            lnspl=ln.split(\",\")\n",
    "            dt = datetime.strptime(lnspl[7], \"%Y-%m-%d \")\n",
    "            try:\n",
    "                meta={\"state\":lnspl[0],\"highq\":float(lnspl[1]),\"medq\":float(lnspl[3]),\"lowq\":float(lnspl[5]),\"time\":dt}\n",
    "            except ValueError:\n",
    "                #skip entries with NA values\n",
    "                continue\n",
    "            yield meta"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Inherit from [DataSet](../references/geoslurp.dataset.html#geoslurp.dataset.dataSetBase.DataSet)\n",
    "In this case, we implement the *pull* method (where to download the data). and the *register* method. Note that we explicitly insert the sqlalchemy table as a member in the class, and specify the scheme name as a member."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from geoslurp.dataset import DataSet\n",
    "from geoslurp.datapull.http import Uri as http\n",
    "import os\n",
    "\n",
    "class USWeedPrices(DataSet):\n",
    "    scheme=scheme\n",
    "    csvfile=\"marijuana-street-price-clean.csv\"\n",
    "    table=Weedtable\n",
    "    def __init__(self,dbcon):\n",
    "        super().__init__(dbcon)\n",
    "        \n",
    "    def pull(self):\n",
    "        \"\"\"Pulls the csv file from the interwebs\"\"\"\n",
    "        weedurl=http(\"http://blog.yhat.com/static/misc/data/marijuana-street-price-clean.csv\")\n",
    "        uri,updated=weedurl.download(self.cacheDir())\n",
    "        \n",
    "    def register(self):\n",
    "        self.truncateTable()\n",
    "        \n",
    "        #insert in bulk mode\n",
    "#         metalist=[meta for meta in metaExtract(os.path.join(self.cacheDir(),self.csvfile))]\n",
    "#         self.bulkInsert(metalist)\n",
    "        \n",
    "        #insert by entry\n",
    "        for meta in metaExtract(os.path.join(self.cacheDir(),self.csvfile)):\n",
    "            self.addEntry(meta)\n",
    "        \n",
    "        self.updateInvent()\n",
    "        \n",
    "        \n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create an instance of the class and call the pull, and register methods to download and register the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Geoslurp-INFO: Downloading /tmp/geoslurp/public/usweedprices/marijuana-street-price-clean.csv\n"
     ]
    }
   ],
   "source": [
    "from geoslurp.config import setInfoLevel\n",
    "from geoslurp.db import geoslurpConnect\n",
    "setInfoLevel()\n",
    "\n",
    "\n",
    "gpcon=geoslurpConnect(readonlyuser=False)\n",
    "\n",
    "\n",
    "usWeed=USWeedPrices(gpcon)\n",
    "\n",
    "usWeed.pull()\n",
    "usWeed.register()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Avoid reinventing the wheel\n",
    "One can imagine that having a csv file is pretty common. The above could therefore also be simplified by inheriting from [PandasBase](../references/geoslurp.dataset.html#geoslurp.dataset.dataSetBase.DataSet), which is demonstrated bellow. Note that in this case, the register function of **PandasBase** works as is, and there is no need to reimplement it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Geoslurp-INFO: Deleting usweedprices entry\n",
      "Geoslurp-INFO: Pruning directory /tmp/geoslurp/public/usweedprices\n"
     ]
    }
   ],
   "source": [
    "#get rid of the previous dataset \n",
    "usWeed.purgeentry() #entry in the database\n",
    "usWeed.purgecache() # the cached data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from geoslurp.dataset import PandasBase\n",
    "\n",
    "class  USWeedPrices2(PandasBase):\n",
    "    def __init__(self,dbconn):\n",
    "        super().__init__(dbconn)\n",
    "        self.pdfile=os.path.join(self.cacheDir(),\"marijuana-street-price-clean.csv\")\n",
    "    \n",
    "    def pull(self):\n",
    "        \"\"\"same as above but we now also only download when newer than specified\"\"\"\n",
    "        weedurl=http(\"http://blog.yhat.com/static/misc/data/marijuana-street-price-clean.csv\",lastmod=datetime(2018,1,1))\n",
    "        uri,updated=weedurl.download(self.cacheDir(),check=True)\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the class is defined, one can again pull the data and call the register function. Examples for querying this dataset are found in the example notebook on discovery."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Geoslurp-INFO: Already Downloaded, skipping /tmp/geoslurp/public/usweedprices2/marijuana-street-price-clean.csv\n",
      "Geoslurp-INFO: Filling pandas table public.usweedprices2 with data from /tmp/geoslurp/public/usweedprices2/marijuana-street-price-clean.csv\n"
     ]
    }
   ],
   "source": [
    "usWeed2=USWeedPrices2(gpcon)\n",
    "\n",
    "usWeed2.pull()\n",
    "\n",
    "usWeed2.register()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "pyrr",
   "language": "python",
   "name": "pyrr"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}