Recent Projects

For the Funzies

A free, ad-free site that parses free-form recipe text with NLP, enabling unit and servings conversions that actually work and can stack... unlike all the other ten thousand recipe websites out there :-). It also incorporates my art, matching drawings to each recipe's ingredients.
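The "stacking" idea can be sketched in a few lines. This is a hypothetical illustration, not the site's actual code: function names and conversion factors are mine, and the real parser handles far messier input.

```python
# Hypothetical sketch of 'stacking' conversions: a unit conversion
# composed with a servings rescale, applied one after the other.

# Volume units expressed in milliliters (illustrative factors).
ML_PER_UNIT = {"ml": 1.0, "cup": 236.588, "tbsp": 14.787, "tsp": 4.929}

def convert_amount(amount, from_unit, to_unit):
    """Convert via a common base unit (ml)."""
    return amount * ML_PER_UNIT[from_unit] / ML_PER_UNIT[to_unit]

def scale_servings(amount, original_servings, target_servings):
    """Rescale an ingredient amount for a new serving count."""
    return amount * target_servings / original_servings

# The two stack: 2 cups for 4 servings -> tablespoons for 6 servings.
amount_tbsp = convert_amount(2, "cup", "tbsp")
amount_tbsp = scale_servings(amount_tbsp, 4, 6)
```

Because each step is a pure function of the amount, any number of conversions can be chained without special-casing.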

The site is built on the Django (Python) framework (combined with Materialize, a nice CSS framework) and a PostgreSQL database (hosted on AWS). The website is deployed via AWS Elastic Beanstalk. I wrote all of the code. Yay. :)

- the website
- an example recipe
- some public recipes
- the code (github)
- the issues board (Waffle)
Notes on Bayesian Statistics - a booklet
The good: it does actually have about 40 pages, all written in lovely LaTeX. The bad: it's still unfinished, and has been stagnant for the past three years.

This Website
Nah, not a lovely long treatise on JavaScript's 'this' - just this website.

- the code (github)

Edu-related Projects (old)

Creating an R Package: Legislative Text Mapping and Directed Acyclic Graphs
From 2011 to 2012 I worked with the fantastic Mark Huberty, a graduate student in the Travers Department of Political Science at UC Berkeley (who is awesome and is basically the reason I got interested in doing what I'm now happily doing with my life. Props.). I helped develop an original R package, Leghist, to map the evolution of legislation from its introduction as a bill, through the amendment process, until finalization. The package has five core pieces of functionality:

1. Mapping of amendments to their locations in legislative bills.
2. Identification of discarded material.
3. Mapping of sections between editions of bills.
4. Modeling of the content of added and discarded material.
5. Visualization of the flow of bill content, by subject matter.

Although I was somewhat involved in all of the above, I wrote the code for 5). I created two master functions for the package; both took raw output from 1)-4) and produced customizable, yet automated, directed acyclic graphs, implemented through the R package igraph. I also wrote automated tests for these functions, and worked out how to document (via Roxygen2), test, and build the package in an automated fashion.
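The graph idea above can be illustrated in miniature. The real package was in R (igraph); this is a hedged pure-Python analogue with made-up data: bill sections are nodes, section mappings between editions are directed edges, and discarded material is whatever has no outgoing edge.

```python
# Illustrative sketch only: the actual Leghist package was R + igraph.
# Editions of a bill form layers of a DAG; section mappings are edges;
# a section with no successor (and not in the final edition) was dropped.

def build_flow_graph(editions, mappings):
    """editions: {edition name: [section ids]};
    mappings: list of (edition_a, sec_a, edition_b, sec_b) edges."""
    nodes = {(e, s) for e, secs in editions.items() for s in secs}
    edges = [((ea, sa), (eb, sb)) for ea, sa, eb, sb in mappings]
    return nodes, edges

def discarded(nodes, edges, edition, final_edition):
    """Sections of `edition` with no path forward (discarded material)."""
    has_successor = {src for src, _ in edges}
    return sorted(s for (e, s) in nodes
                  if e == edition and e != final_edition
                  and (e, s) not in has_successor)

editions = {"introduced": ["s1", "s2", "s3"], "final": ["s1", "s2b"]}
mappings = [("introduced", "s1", "final", "s1"),
            ("introduced", "s2", "final", "s2b")]
nodes, edges = build_flow_graph(editions, mappings)
# "s3" never maps forward, so it counts as discarded.
```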
Automated Pulling, XML Parsing, and Visualization of World Bank Country-wise Economic Indicators
This one was for a school project. I worked with a good group of kids, and we each took our own chunk of the project and ran with it. My goal was to completely automate, and make easily adjustable, the pulling of World Bank data, and to do the same for some really awesome-looking, and informative (!), graphs. I would show you the pretty pictures, but I left my laptop out in the rain, so...
Using Tweets and Bayesian Statistical Analysis to Model the 2012 Presidential Election
I created a forecasting model that predicts state-level vote-share probabilities, using a hierarchical Bayesian model to incorporate simple text analysis of state-specific tweets into the predictions. The model used Markov chain Monte Carlo methods to develop the final posterior distribution. Model priors were based on state-level 2004 and 2008 vote-share data; the data consisted of recent tweets mentioning 'Obama' or 'Romney'. Although simple text analysis of tweets is a terrible substitute for polling data (problems are discussed in the paper), it offered a potential way to bolster political forecasting models. (Note: tweets are hideously biased. In most real-life cases, -1 for using them in forecasting models.)
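The prior-plus-tweets mechanics can be shown with a toy, single-state version. This is not the paper's model (which was hierarchical across states); it's a minimal sketch with made-up numbers: a Beta prior from past vote shares and a Binomial likelihood from tweet sentiment counts, sampled with a simple Metropolis algorithm.

```python
import random
import math

# Toy Metropolis sampler for one state's vote share. Beta(a, b) prior
# (from past elections) times Binomial(k pro-candidate tweets of n).
# All numbers are illustrative, not from the actual paper.

def log_post(p, a, b, k, n):
    """Unnormalized log posterior: Beta prior x Binomial likelihood."""
    if not 0 < p < 1:
        return float("-inf")
    return ((a - 1) * math.log(p) + (b - 1) * math.log(1 - p)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def metropolis(a, b, k, n, steps=20000, width=0.05, seed=0):
    rng = random.Random(seed)
    p, samples = 0.5, []
    for _ in range(steps):
        q = p + rng.uniform(-width, width)        # random-walk proposal
        if math.log(rng.random()) < log_post(q, a, b, k, n) - log_post(p, a, b, k, n):
            p = q                                  # accept
        samples.append(p)
    return samples[steps // 2:]                    # discard burn-in

draws = metropolis(a=52, b=48, k=560, n=1000)      # prior ~52%, tweets 56%
mean = sum(draws) / len(draws)
```

Conjugacy makes the answer checkable by hand here (the posterior is Beta(612, 488), mean ≈ 0.556); MCMC earns its keep once the model becomes hierarchical and non-conjugate.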
Homegrown Random Forests
This was part of a fun machine learning project with two other students from Berkeley. We tried to predict baseball players' success from their past stats. We implemented a plethora of different machine learning methods; I wrote a random forest learner from scratch. Why? Because. It was pretty fun. Code can be found here and here.
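For flavor, here is what a "homegrown" random forest looks like in miniature. This is my illustrative re-sketch, not the project's code (which was in R and aimed at regression): bootstrap the rows, pick a random feature per tree, fit a one-split stump, and majority-vote.

```python
import random
from collections import Counter

# Minimal random-forest classifier: bagged single-split stumps with a
# random feature per stump. Illustrative only.

def fit_stump(X, y, feature):
    """Best single-threshold split on one feature (majority label per side)."""
    best = None
    for t in sorted({row[feature] for row in X}):
        left = [yi for row, yi in zip(X, y) if row[feature] <= t]
        right = [yi for row, yi in zip(X, y) if row[feature] > t]
        if not left or not right:
            continue
        lab_l = Counter(left).most_common(1)[0][0]
        lab_r = Counter(right).most_common(1)[0][0]
        acc = left.count(lab_l) + right.count(lab_r)
        if best is None or acc > best[0]:
            best = (acc, t, lab_l, lab_r)
    if best is None:                       # degenerate bootstrap sample
        lab = Counter(y).most_common(1)[0][0]
        return feature, float("inf"), lab, lab
    _, t, lab_l, lab_r = best
    return feature, t, lab_l, lab_r

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]       # bootstrap rows
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        forest.append(fit_stump(Xb, yb, rng.randrange(len(X[0]))))
    return forest

def predict(forest, row):
    votes = [lab_l if row[f] <= t else lab_r for f, t, lab_l, lab_r in forest]
    return Counter(votes).most_common(1)[0][0]

X = [[1, 5], [2, 6], [8, 1], [9, 2]]
y = [0, 0, 1, 1]
forest = fit_forest(X, y)
```

Real forests grow full trees and sample features at every split; the bagging-plus-voting skeleton is the same.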
Voting with your Tweet: Forecasting elections with social media data; Broadcasting live predictions online
This project broadcast live, out-of-sample congressional election predictions based on Mark Huberty's SuperLearner-based algorithm, which takes tweets as input. I helped Mark (who is awesome and taught me nearly everything) by cleaning up code, writing a bit myself, and gathering congressional candidate data.

Traded for Monies

(sorry, no links to code, just lovely vague descriptions)
SLA Schema + SLA Ingest and Evaluation Pipelines
The way our company made money was to sign contracts with SLAs (Service Level Agreements) basically saying that we would "gather X data, with Y requirements and Z restrictions", and then fulfill those contracts by paying contributors to capture that data. An example contract might require that we capture 30 price observations per month (with price SD < .2), and at least five per week, for two brands each of twenty specific consumer products, in each of seven different regions, with at least 9,000 observations per month overall.

This information, while human-readable, was not machine-readable, so things like progress dashboards had to be written specifically for each new contract instead of spun up automatically. This wasn't scalable, so I designed an 'SLA' schema that could describe a broad range of SLA requirements via a JSON blob.
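To make the idea concrete, here is a guess at what such a blob could look like for the example contract above. The field names are entirely hypothetical (the real schema was internal); only the shape matters: a list of requirements, each with a count, a period, optional constraints, and the facets it applies across.

```python
# Hypothetical SLA blob (shown as a Python dict; the real thing was JSON).
# Field names are illustrative, not the actual internal schema.
sla = {
    "contract_id": "C-001",
    "requirements": [
        {"metric": "price_observations",
         "min_count": 30,
         "period": "month",
         "constraints": {"price_sd_max": 0.2},
         "facets": ["product", "brand", "region"]},
        {"metric": "price_observations",
         "min_count": 5,
         "period": "week",
         "facets": ["product", "brand", "region"]},
        {"metric": "price_observations",
         "min_count": 9000,
         "period": "month",
         "facets": []},   # contract-wide floor
    ],
}
```

With contracts in this form, a single generic dashboard or tracker can iterate over `requirements` instead of being rewritten per contract.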

I then translated all current contracts to this machine-readable format and augmented an existing system to push this data (stored in Google Docs, editable by certain employees) to a Redshift DB. Next, I wrote a pipeline (run daily) to track SLAs against their contract's data. The pipeline tracked % SLA coverage, as well as budgeting overages, across all facets listed in the SLA (e.g. product × brand × month × region, product × brand × week, product × month, product, etc.).
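The facet-wise coverage computation reduces to a group-by-and-compare. A hedged sketch of that core step, with made-up data and names (the real pipeline ran against Redshift, not in-memory lists):

```python
from collections import defaultdict

# Sketch of facet-wise % SLA coverage: count observations per facet
# combination and compare against the requirement's minimum.

def coverage(observations, facets, required_min):
    """observations: list of dicts; facets: keys to group by.
    Returns {facet tuple: fraction of requirement met, capped at 1.0}."""
    counts = defaultdict(int)
    for obs in observations:
        counts[tuple(obs[f] for f in facets)] += 1
    return {key: min(1.0, n / required_min) for key, n in counts.items()}

obs = ([{"product": "milk", "brand": "A", "region": "west"}] * 24
       + [{"product": "milk", "brand": "B", "region": "west"}] * 30)
cov = coverage(obs, ["product", "brand"], required_min=30)
# ("milk", "A") sits at 80% of its monthly requirement; ("milk", "B") at 100%.
```

Running the same function over each facet list in the SLA blob gives every coverage breakdown the contract calls for.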
Interactive Data-Drill-Down App (Shiny)
Created an interactive web application that lets users visualize an aggregated time series' component series (or metrics about those component series) and drill down to discover more about a specific series. It comprised about a dozen unique plot types, various search mechanisms, metrics, and options.

The application was built entirely in R's Shiny framework.
Places Clustering
Given many observations, each with:

  • longitude/latitude coordinates with non-Gaussian error distributions
  • a coordinate 'accuracy' metric (expected SD in meters)
  • an associated user (observations within users were highly autocorrelated)
  • a hand-typed (read: messy) place name,

I developed a method of clustering these observations to back out 'true' places, along with confidence metrics for each clustered place's location, name, and actual existence. This was used to help better direct our users when capturing further observations (place-based tasks), and to create better price indices.
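One simplified piece of this idea, sketched with hypothetical code: greedy clustering on ground distance, with each observation's stated accuracy turned into an inverse-variance weight on the cluster centroid. The actual method also used names and per-user autocorrelation, which this sketch omits entirely.

```python
import math

# Greatly simplified, hypothetical sketch: cluster observations by
# ground distance; weight centroids by 1/SD^2 (inverse variance).

def dist_m(a, b):
    """Approximate ground distance in meters (equirectangular)."""
    lat = math.radians((a[0] + b[0]) / 2)
    dy = (a[0] - b[0]) * 111_320
    dx = (a[1] - b[1]) * 111_320 * math.cos(lat)
    return math.hypot(dx, dy)

def cluster(observations, radius_m=50):
    """observations: (lat, lon, accuracy_sd_m). Greedy assignment to the
    first cluster within radius_m; otherwise start a new cluster."""
    clusters = []
    for i, (lat, lon, sd) in enumerate(observations):
        w = 1.0 / (sd * sd)                        # inverse-variance weight
        for c in clusters:
            if dist_m(c["center"], (lat, lon)) < radius_m:
                tot = c["w"] + w                   # weighted centroid update
                c["center"] = ((c["center"][0] * c["w"] + lat * w) / tot,
                               (c["center"][1] * c["w"] + lon * w) / tot)
                c["w"] = tot
                c["members"].append(i)
                break
        else:
            clusters.append({"center": (lat, lon), "w": w, "members": [i]})
    return clusters

obs = [(37.0, -122.0, 10.0), (37.0001, -122.0001, 5.0), (37.01, -122.0, 10.0)]
clusters = cluster(obs)
```

The more accurate second observation (SD 5 m) pulls the shared centroid toward itself, which is the point of the weighting.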
Optimal Volume Allocation in Surveys
Developed a formula (via Lagrange minimization) defining optimal volume allocation, given a budget and the goal of minimizing the standard error of the end product: an aggregated (over weights) price time series.
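For intuition, here is one standard way such a Lagrange minimization works out. The assumptions (independent component series, per-observation cost $c_i$, aggregate $\hat{P}=\sum_i w_i \bar p_i$) are mine and may differ from the actual formula used:

```latex
\text{Minimize } V(n) = \sum_i \frac{w_i^2 \sigma_i^2}{n_i}
\quad\text{subject to}\quad \sum_i c_i n_i = B.

\mathcal{L} = \sum_i \frac{w_i^2\sigma_i^2}{n_i}
  + \lambda\Big(\sum_i c_i n_i - B\Big),
\qquad
\frac{\partial\mathcal{L}}{\partial n_i}
  = -\frac{w_i^2\sigma_i^2}{n_i^2} + \lambda c_i = 0
\;\Longrightarrow\;
n_i = \frac{w_i\sigma_i}{\sqrt{\lambda c_i}}.

\text{Substituting into the budget constraint gives}
\qquad
n_i^{*} = B\,\frac{w_i\sigma_i/\sqrt{c_i}}{\sum_j w_j\sigma_j\sqrt{c_j}}.
```

That is, volume goes to series that are heavily weighted, noisy, and cheap to observe, a Neyman-allocation-style result.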
Product Hierarchies & Weights Upkeep System
Developed and maintained a system to handle many (quite complex) weights hierarchies, enforcing a series of tests to ensure that end-node weights always summed to 1, etc...
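The sum-to-one invariant is easy to state in code. A minimal sketch, with an illustrative tree structure and tolerance (the real system's representation was surely richer):

```python
# Minimal sketch of the sum-to-one invariant on a weights hierarchy:
# within every parent, the children's weights must total 1.

def check_weights(node, tol=1e-9):
    """node: {'weight': float, 'children': [...]}; leaves have no children."""
    children = node.get("children", [])
    if not children:
        return True
    total = sum(c["weight"] for c in children)
    return abs(total - 1.0) < tol and all(check_weights(c, tol) for c in children)

tree = {"weight": 1.0, "children": [
    {"weight": 0.6, "children": [{"weight": 0.5}, {"weight": 0.5}]},
    {"weight": 0.4},
]}
ok = check_weights(tree)
bad = check_weights({"weight": 1.0,
                     "children": [{"weight": 0.7}, {"weight": 0.4}]})
```

Recursion makes the check depth-agnostic, so the same test covers every level of every hierarchy.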
Changepoint and Bimodality Detection in Price Time Series
Just another pipeline :). It's what it sounds like: data pull, estimation, plots, results push.
Automated Visualizations of Medicare Data
This is part of what I worked on during my summer (2012) internship with Acumen LLC. I wrote various R functions that took Excel workbooks as input, then parsed, organized, and plotted their contents in some way. For example:
- Fn: Plots normalized values of multiple variables over all districts or all states in the United States with a segmented scatter plot. A state s is highlighted and its values shown. All points more than n standard deviations from the mean become two-letter state abbreviations, or three-letter district abbreviations.
- Fn: Lets viewers spot the professional relationships among (often many thousands of) doctors. First, it creates a base distance metric representing professional closeness between two doctors Di and Dj. This involved variables like the number of beneficiaries Di and Dj share, weighted by billing; the % of beneficiaries Di and Dj share relative to their own unique beneficiary service counts, weighted by billing; etc.