Traded for Monies
(sorry, no links to code, just lovely vague descriptions)
SLA Schema + SLA Ingest and Evaluation Pipelines
The way our company made money was to sign contracts with SLAs (Service Level Agreements), basically saying that we would "gather X data, with Y requirements, and Z restrictions",
and then fulfill those contracts by paying contributors to capture that data. An example contract might require that we capture 30 price observations per month (with price SD < .2), and at least five per week,
for two brands each of twenty specific consumer products, all of this for each of seven different regions, with at least 9000 observations per month overall.
This information, while human readable, was not machine readable. So things like progress dashboards had to be written specifically for each new contract, instead of spun up automatically.
This wasn't scalable, so I designed an 'SLA' schema that could describe a broad range of SLA requirements via a JSON blob.
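To make that concrete, here's a made-up illustration of the kind of blob I mean (the field names and structure are invented for this post, not the actual schema), built as an R list and serialized with jsonlite:

```r
library(jsonlite)

# Hypothetical SLA blob mirroring the example contract above
sla <- list(
  metric        = "price_observation",
  per_month     = list(min_count = 30, max_price_sd = 0.2),
  per_week      = list(min_count = 5),
  facets        = list(brands_per_product = 2, products = 20, regions = 7),
  monthly_total = list(min_count = 9000)
)
toJSON(sla, auto_unbox = TRUE, pretty = TRUE)
```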
I translated all current contracts to this machine-readable format and augmented an existing system to push this data (kept in Google Docs, editable by certain employees) to a Redshift DB.
Then I wrote a pipeline (run daily) to track SLAs against their contract's data. The pipeline tracked % SLA coverage, as well as budgeting overages, across all facets listed in the SLA
(e.g. product X brand X month X region, product X brand X week, product X month, product, etc...)
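The heart of that daily evaluation step amounted to something like the sketch below (heavily simplified; the real pipeline covered many more facets plus budget overages, and the column and threshold names here are invented):

```r
library(dplyr)

# Coverage for one facet combination: product x brand x region x month
facet_coverage <- function(obs, min_count = 30, max_price_sd = 0.2) {
  # obs: one row per observation, with product, brand, region, month, price columns
  obs %>%
    group_by(product, brand, region, month) %>%
    summarise(n = n(), price_sd = sd(price), .groups = "drop") %>%
    mutate(coverage = pmin(1, n / min_count),
           sd_ok    = price_sd < max_price_sd)
}
```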
Interactive Data-Drill-Down App (Shiny)
Created an interactive web application that lets users visualize an aggregated time series'
component series, or metrics about those component series, and drill down to discover more about
a specific series. It consisted of about a dozen unique plot types, various search mechanisms,
metrics, and options.
The application was built entirely in R's Shiny.
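A toy version of the idea (nothing like the production app, and all the data here is fake): an aggregate series up top, with a dropdown to drill into any one component.

```r
library(shiny)
library(ggplot2)

set.seed(1)
# Fake component series: 5 components observed weekly for a year
dat <- expand.grid(week = 1:52, component = paste0("series_", 1:5))
dat$value <- 100 + ave(rnorm(nrow(dat), 0, 2), dat$component, FUN = cumsum)

ui <- fluidPage(
  selectInput("comp", "Drill into component:", unique(dat$component)),
  plotOutput("aggregate_plot"),
  plotOutput("component_plot")
)

server <- function(input, output, session) {
  output$aggregate_plot <- renderPlot({
    agg <- aggregate(value ~ week, dat, mean)
    ggplot(agg, aes(week, value)) + geom_line() + ggtitle("Aggregated series")
  })
  output$component_plot <- renderPlot({
    ggplot(subset(dat, component == input$comp), aes(week, value)) +
      geom_line() + ggtitle(paste("Component:", input$comp))
  })
}

shinyApp(ui, server)
```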
Given many observations, each with:
longitude/latitude coordinates with non-Gaussian error distributions,
a coordinate 'accuracy' metric (the expected SD in meters),
an associated user (observations within users were highly autocorrelated), and
a hand-typed (read: messy) place name,
I developed a method of clustering these observations to back out 'true' places, and confidence metrics regarding each clustered place's location, name, and actual existence.
This was used to help better direct our users when capturing further observations (place-based tasks), and to create better price indices.
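I won't reconstruct the actual method here, but one plausible shape for it (purely a sketch, with invented tuning constants and off-the-shelf packages the original may not have used) is to blend geographic distance with name dissimilarity and cut a hierarchical clustering tree:

```r
library(geosphere)    # distm(): haversine distances in meters
library(stringdist)   # stringdistmatrix(): fuzzy name distances

cluster_places <- function(obs, geo_scale = 50, name_weight = 0.5, cut_height = 1) {
  # obs: data.frame with columns lon, lat, accuracy (expected SD, m), name
  geo_d  <- distm(as.matrix(obs[, c("lon", "lat")])) / geo_scale
  name_d <- as.matrix(stringdistmatrix(tolower(obs$name), method = "jw"))
  hc <- hclust(as.dist(geo_d + name_weight * name_d), method = "average")
  cluster <- cutree(hc, h = cut_height)

  # Accuracy-weighted centroid of each observation's cluster as its 'true' location
  w <- 1 / pmax(obs$accuracy, 1)^2
  data.frame(
    cluster = cluster,
    lon_hat = ave(obs$lon * w, cluster, FUN = sum) / ave(w, cluster, FUN = sum),
    lat_hat = ave(obs$lat * w, cluster, FUN = sum) / ave(w, cluster, FUN = sum)
  )
}
```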
Optimal Volume Allocation in Surveys
Developed a formula (via Lagrange minimization) to define the optimal volume allocation, given a budget and the goal of minimizing
the standard error of the end product: an aggregated (over weights) price time series.
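The flavor of the result, reconstructed here as the textbook cost-constrained allocation (so treat the exact form as an approximation of what we actually used): with n_i the volume allocated to series i, w_i its weight, σ_i its price SD, c_i its per-observation cost, and B the budget,

\[
\min_{n_1,\dots,n_k}\ \sum_i \frac{w_i^2 \sigma_i^2}{n_i}
\quad \text{s.t.} \quad \sum_i c_i n_i = B
\qquad \Longrightarrow \qquad
n_i^{*} = B \,\frac{w_i \sigma_i / \sqrt{c_i}}{\sum_j w_j \sigma_j \sqrt{c_j}} .
\]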
Product Hierarchies & Weights Upkeep System
Developed and maintained a system to handle many (quite complex) weights hierarchies. Enforced a series of
tests to ensure that end-node weights always summed to 1, etc...
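A toy version of one such invariant test (the real hierarchies were nested several levels deep; the column names here are invented):

```r
# Check that, under every parent node, child weights sum to 1
check_weights <- function(h, tol = 1e-8) {
  # h: data.frame with columns parent, node, weight
  sums <- tapply(h$weight, h$parent, sum)
  bad  <- names(sums)[abs(sums - 1) > tol]
  if (length(bad) > 0)
    stop("Child weights don't sum to 1 under: ", paste(bad, collapse = ", "))
  invisible(TRUE)
}
```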
Changepoint, Bimodality detection in price time series
Just another pipeline :). What it sounds like. Data pull, estimation, plots, results push.
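If you want the gist, the estimation step amounted to something like this (sketched with off-the-shelf packages; I'm not claiming these were the exact methods used):

```r
library(changepoint)  # cpt.mean(): mean-shift changepoint detection
library(diptest)      # dip.test(): Hartigan's dip test for multimodality

analyze_prices <- function(prices) {
  cp  <- cpt.mean(prices, method = "PELT")   # where does the mean level shift?
  dip <- dip.test(prices)                    # is the price distribution bimodal?
  list(changepoints = cpts(cp), bimodality_p = dip$p.value)
}
```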
Automated Visualizations of Medicare Data
This is part of what I worked on during my summer (2012) internship with Acumen LLC. I wrote
various R functions that took Excel workbooks as input, which the functions parsed, organized, and plotted in some way. Ex:
Fn: Plots normalized values of multiple variables over all districts or all states in the
United States with a segmented scatter plot. A state s is highlighted and its values
shown. All points above n standard deviations from the mean become two-letter
state abbreviations, or three-letter district abbreviations.
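Re-sketched here in ggplot2 (not the original code, which worked from parsed Excel workbooks; the column names are mine):

```r
library(ggplot2)

# Normalize each variable, highlight one state, label points beyond n SDs
plot_state_outliers <- function(df, highlight, n_sd = 2) {
  # df: long-form data with columns state (two-letter abbreviation), variable, value
  df$z <- ave(df$value, df$variable, FUN = function(x) (x - mean(x)) / sd(x))
  ggplot(df, aes(variable, z)) +
    geom_point(alpha = 0.4) +
    geom_point(data = subset(df, state == highlight), colour = "red", size = 3) +
    geom_text(data = subset(df, abs(z) > n_sd), aes(label = state), vjust = -0.6)
}
```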
Fn: Allows viewers to spot the professional relationships among (often many thousands of)
doctors. First, it creates a base distance metric to represent professional closeness between two
doctors D_i and D_j. This involved variables like the number of beneficiaries D_i and D_j share,
weighted by billing, and the % of beneficiaries D_i and D_j share relative to their own unique
beneficiary service counts, weighted by billing, etc.
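I won't reproduce the exact formula, but a cartoon of how those pieces might combine into a single closeness score (every name and the weighting here are invented for illustration):

```r
# Closeness of doctors D_i and D_j: shared beneficiaries, weighted by billing,
# relative to each doctor's own unique beneficiary count
doctor_closeness <- function(shared_benes, shared_billing, benes_i, benes_j) {
  overlap_i <- shared_benes / benes_i
  overlap_j <- shared_benes / benes_j
  shared_billing * (overlap_i + overlap_j) / 2
}
```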