Projects

For the Fun of It

Noisedive.org
I recently spun up a small website where I could store notes about deep learning whitepapers. Mostly, I wanted a nice place where I could search for, store, and write LaTeX-friendly notes while maintaining an easy-to-use (e.g. Markdown) editing interface.
Malware Data Science
I teamed up with Joshua Saxe to write "Malware Data Science" (published by No Starch Press), a book that introduces data science techniques for malware detection and analysis. I wrote the two chapters that focus on neural networks, and edited other chapters.

100% of author proceeds are donated to the Environmental Defense Fund.

RescipeStasher
A free, ad-free site that parses free-form recipe text with NLP, enabling unit and servings conversions that actually work and can be stacked... I wrote this for fun during a work break a few years ago, after getting annoyed at all the recipe sites out there that can't handle unit conversion well. Unfortunately, it got spam-attacked once I was back at work, and I haven't had the time / inclination to put it back up.

The site uses the Django (Python) framework (combined with Materialize, a nice CSS framework) and a PostgreSQL database hosted on AWS. The website is deployed via AWS Elastic Beanstalk.

Notes on Bayesian Statistics - a booklet
Stagnant but fun personal project from back in the day.

***

Edu-related Projects (old)


Creating an R Package: Legislative Text Mapping and Directed Acyclic Graphs
From 2011 to 2012 I worked with the fantastic Mark Huberty, a graduate student in the Travers Department of Political Science at UC Berkeley (who is awesome, and basically the reason I got interested in doing what I'm now happily doing with my life. Props.). I helped develop an original R package to map the evolution of legislation from its introduction as a bill, through the amendment process, until finalization. There are five core pieces of functionality in the Leghist package:

1. Mapping of amendments to their locations in legislative bills.
2. Identification of discarded material.
3. Mapping of sections between editions of bills.
4. Modeling of the content of added and discarded material.
5. Visualization of the flow of bill content, by subject matter.

Although I was somewhat involved in all of the above, I wrote the code for part 5. I created two master functions for the package; both took raw output from parts 1-4 and created customizable, yet automated, directed acyclic graphs, implemented through the R package igraph. I also wrote automated scripts to test these functions, and worked out how to document (via Roxygen2), test, and build the R package in an automated fashion.
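The package itself was written in R, but the core idea behind part 5 is easy to sketch: build a directed acyclic graph whose nodes are bill sections (per edition) and whose edges are the section mappings from parts 1-4, then hand it to igraph for a layered layout. A minimal, hypothetical sketch using python-igraph (not the original code; the section names and mappings are made up):

```python
# Minimal sketch of the DAG-visualization idea behind part 5, using python-igraph.
# The real package was written in R; the section names and mappings below are made up.
# (Plotting requires the pycairo backend.)
import igraph as ig

# Each node is a bill section in a particular edition; each edge says
# "this section's content flowed into that section in the next edition".
sections = ["intro_v1", "sec1_v1", "sec2_v1", "intro_v2", "sec1_v2", "sec3_v2"]
flows = [("intro_v1", "intro_v2"), ("sec1_v1", "sec1_v2"), ("sec2_v1", "sec3_v2")]

g = ig.Graph(directed=True)
g.add_vertices(sections)
g.add_edges(flows)
g.vs["label"] = sections

assert g.is_dag()  # content only flows forward through editions

# A layered ("Sugiyama") layout puts each edition in its own layer.
layout = g.layout_sugiyama()
ig.plot(g, "bill_flow.png", layout=layout, bbox=(600, 400), margin=40)
```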
***
Automated Pulling, XML Parsing, and Visualization of World Bank Country-wise Economic Indicators.
This was actually for a school project. I worked with a good group of kids, and we each took our own chunk of the intended project and just sort of ran with it. My goal was to fully automate, and make easily adjustable, the pulling of World Bank data, and to do the same for some really good-looking and informative (!) graphs. I would show you the pretty pictures, but I left my laptop out in the rain, so...
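For flavor, here's roughly what the pull-and-parse step looks like against the (current) World Bank API v2, which can return XML; the indicator and country codes are just examples, not necessarily the ones we used:

```python
# Rough sketch of automated World Bank data pulling + XML parsing.
# Indicator/country codes are examples only; the original project's choices are gone with the laptop.
import requests
import xml.etree.ElementTree as ET

def pull_indicator(country: str, indicator: str, per_page: int = 2000):
    """Fetch one indicator for one country from the World Bank API as XML rows."""
    url = f"https://api.worldbank.org/v2/country/{country}/indicator/{indicator}"
    resp = requests.get(url, params={"format": "xml", "per_page": per_page}, timeout=30)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)

    rows = []
    # Tags are namespaced (wb:...), so match on the local name to stay robust.
    for record in root:
        date = value = None
        for child in record:
            tag = child.tag.split("}")[-1]
            if tag == "date":
                date = child.text
            elif tag == "value":
                value = float(child.text) if child.text else None
        if date is not None:
            rows.append((country, indicator, int(date), value))
    return rows

if __name__ == "__main__":
    # GDP (current US$) for a couple of countries, ready to feed into plotting code.
    for country in ["US", "DE"]:
        print(country, pull_indicator(country, "NY.GDP.MKTP.CD")[:3])
```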
***
Using Tweets and Bayesian Statistical Analysis to Model the 2012 Presidential Election
Basically, I created a forecasting model that predicts state-level vote-share probabilities, using a hierarchical Bayesian model to incorporate simple text analysis of state-specific tweets into the predictions. The model used Markov chain Monte Carlo methods to develop the final posterior distribution. Model priors were based on state-level 2004 and 2008 vote-share data, and the data consisted of recent tweets mentioning 'Obama' or 'Romney'. Although simple text analysis of tweets is a terrible substitute for polling data (the problems are discussed in the paper), it offered a potential way to bolster political forecasting models. (Note: tweets are hideously biased. In most real-life cases, -1 for using them in forecasting models.)
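The paper has the details; here's a minimal sketch of the model's general shape in modern PyMC (not the original 2012 implementation, and all numbers are fabricated placeholders):

```python
# Minimal sketch of the hierarchical model's shape in modern PyMC.
# NOT the original 2012 implementation; the data below are fabricated placeholders.
import numpy as np
import pymc as pm

n_states = 5
prior_share = np.array([0.62, 0.48, 0.55, 0.41, 0.51])   # e.g. from 2004/2008 vote shares
obama_tweets = np.array([620, 470, 560, 400, 520])        # pro-Obama tweets per state
total_tweets = np.array([1000, 1000, 1000, 1000, 1000])   # tweets mentioning either candidate

with pm.Model() as model:
    # National-level swing shared across states (partial pooling).
    national = pm.Normal("national", mu=0.0, sigma=0.5)
    state_sd = pm.HalfNormal("state_sd", sigma=0.5)

    # State-level effects, centered on the prior vote share (on the logit scale).
    state_effect = pm.Normal("state_effect", mu=0.0, sigma=state_sd, shape=n_states)
    logit_p = pm.math.logit(prior_share) + national + state_effect
    p = pm.Deterministic("p", pm.math.invlogit(logit_p))

    # Tweet "votes" as noisy binomial observations of the state-level share.
    pm.Binomial("obs", n=total_tweets, p=p, observed=obama_tweets)

    # MCMC to get the posterior over state-level vote shares.
    trace = pm.sample(1000, tune=1000, chains=2, progressbar=False)

print(trace.posterior["p"].mean(dim=("chain", "draw")).values)
```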
***
Homegrown Random Forests
This was part of a fun machine learning project I worked on with two other students from Berkeley. Basically, we tried to predict baseball players' success from their past stats. We implemented a plethora of different machine learning methods; I wrote a random forest learner from scratch. Why? Because. It was pretty fun. Code can be found here and here.
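For flavor, here's a compact sketch of the random forest recipe itself (bootstrap samples, random feature subsets, majority vote). It leans on scikit-learn's single decision tree for the split logic, so unlike the original project it isn't fully from scratch:

```python
# Compact sketch of the random-forest recipe: bootstrap samples + random feature
# subsets per split + majority vote. Uses sklearn's decision tree for the splits;
# the original project implemented the trees from scratch too.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class TinyForest:
    def __init__(self, n_trees=50, max_depth=None, seed=0):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.rng = np.random.default_rng(seed)
        self.trees = []

    def fit(self, X, y):
        n = len(X)
        for _ in range(self.n_trees):
            idx = self.rng.integers(0, n, size=n)          # bootstrap sample
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth,
                max_features="sqrt",                        # random feature subset per split
                random_state=int(self.rng.integers(1 << 30)),
            )
            tree.fit(X[idx], y[idx])
            self.trees.append(tree)
        return self

    def predict(self, X):
        votes = np.stack([t.predict(X) for t in self.trees])  # (n_trees, n_samples)
        # Majority vote across trees, column by column.
        return np.apply_along_axis(lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)

if __name__ == "__main__":
    # Toy data as a stand-in for the baseball stats.
    from sklearn.datasets import make_classification
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    forest = TinyForest(n_trees=50).fit(X[:400], y[:400])
    print("accuracy:", (forest.predict(X[400:]) == y[400:]).mean())
```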
***
Voting with your Tweet: Forecasting elections with social media data; Broadcasting live predictions online
This project broadcast live, out-of-sample congressional election predictions based on Mark Huberty's SuperLearner-based algorithm, which takes tweets as input. I helped Mark (who is awesome and taught me nearly everything) by cleaning up code, writing a bit myself, and gathering congressional candidate data.

Opendoor - Core Projects

Led a team of engineers to redesign and re-engineer the experimentation and production training pipeline for our most important neural network model (home price prediction). This included designing and implementing a better way to reproduce, track, and manage experiments via a much stronger MLflow-based configuration system (among other things).
Developed a custom neural-network interpretability tool (similar to LIME and SHAP) to help home valuation experts better understand the neural network's predictions (ported to production).
Researched ways to improve the accuracy of our most important neural network (home price prediction). Improvements included changes to our loss scheduler, loss function definition, architecture, epochs, features, and more. This led to an estimated $48m in increased profits, due to the largest accuracy gains any researcher had yet found for this model.

Sophos - Core Projects

Model Compression
Making models smaller. arXiv paper, original OBD-SD method slides.
Avoiding Catastrophic Forgetting in Neural Networks
Minimizing the forgetting effects created when you fine-tune trained models on new data. arXiv paper
Garbage In, Garbage Out
I wrote this paper to accompany a talk I gave at Black Hat in 2017. The basic idea was to show how malware detection model accuracy can vary wildly based on the test dataset used, which is extremely relevant when you're uncertain of your production "test" dataset. To showcase this, I trained our team's URL model on three different datasets, and tested each of these trained models on time-split validation sets from the same three datasets, yielding nine sets of results. I showed how analyzing the results can help tell us how sensitive a given model is to changes in the data distribution, and can indicate which datasets seem to be supersets of others. The paper for this talk is here, and the slide deck is here.
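The experimental setup is simple to express in code. A skeletal sketch (the dataset loader and stand-in model below are placeholders, not the actual URL model or datasets):

```python
# Skeleton of the 3x3 "train on A, test on B" experiment: train one model per
# dataset, evaluate each on the time-split test set of every dataset.
# `load_dataset` and the logistic-regression stand-in are placeholders,
# not the production URL model or data.
import zlib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def load_dataset(name):
    """Placeholder: return (X_train, y_train, X_test, y_test) with a time-based split."""
    rng = np.random.default_rng(zlib.crc32(name.encode()))
    X = rng.normal(size=(2000, 32))
    y = (X[:, :4].sum(axis=1) + rng.normal(scale=0.5, size=2000) > 0).astype(int)
    return X[:1500], y[:1500], X[1500:], y[1500:]

datasets = ["vendor_feed", "customer_telemetry", "public_corpus"]  # hypothetical names
splits = {name: load_dataset(name) for name in datasets}

results = np.zeros((len(datasets), len(datasets)))
for i, train_name in enumerate(datasets):
    X_tr, y_tr, _, _ = splits[train_name]
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    for j, test_name in enumerate(datasets):
        _, _, X_te, y_te = splits[test_name]
        results[i, j] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

# Rows: training dataset; columns: test dataset. Off-diagonal drops show
# sensitivity to distribution shift; asymmetries hint at superset relationships.
print(results.round(3))
```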
Vendor & Model Evaluation Dashboard
Our team develops many models to classify and detect malware. We want to be able to track how our competitors, our deployed models, and our test models are performing against different datasets and label definitions over time. We also want to be able to understand how our own current products will improve (or not) when combined with new detection models.

To do all of this, another engineer and I created an internal dashboard that runs evaluation queries in real time against billions of records in a Redshift database. Users select options like which models and vendors to evaluate, which label definitions to use, and the time period to inspect; the dashboard runs the resulting SQL queries and then generates various interactive plots to showcase the results to the user.

The dashboard also provides an interface to upload files to be evaluated by our team's models.
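Under the hood, each set of dashboard selections just parameterizes a templated evaluation query. A toy sketch of that idea (the table and column names are hypothetical, not our actual schema):

```python
# Toy sketch of turning dashboard selections into an evaluation query against Redshift.
# Table/column names are hypothetical, not the real schema.
import psycopg2

def evaluate(conn, model_name, vendor_name, label_def, start_date, end_date):
    """Compare one of our models against one vendor over a time window."""
    sql = """
        SELECT
            SUM(CASE WHEN m.score >= 0.5 AND l.is_malware THEN 1 ELSE 0 END)      AS model_tp,
            SUM(CASE WHEN m.score >= 0.5 AND NOT l.is_malware THEN 1 ELSE 0 END)  AS model_fp,
            SUM(CASE WHEN v.verdict = 'malicious' AND l.is_malware THEN 1 ELSE 0 END)     AS vendor_tp,
            SUM(CASE WHEN v.verdict = 'malicious' AND NOT l.is_malware THEN 1 ELSE 0 END) AS vendor_fp,
            COUNT(*) AS n
        FROM model_scores m
        JOIN vendor_verdicts v ON v.sha256 = m.sha256
        JOIN labels l ON l.sha256 = m.sha256
        WHERE m.model = %s
          AND v.vendor = %s
          AND l.label_definition = %s
          AND m.first_seen BETWEEN %s AND %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (model_name, vendor_name, label_def, start_date, end_date))
        return cur.fetchone()

# conn = psycopg2.connect(host="...", dbname="...", user="...", password="...")
# print(evaluate(conn, "url_model_v3", "vendor_a", "strict", "2017-01-01", "2017-06-30"))
```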
Endpoint Detection And Response
Signatures and malware detection models aren't always 100% sure whether a file is benign or actually something malicious. When this is the case, or even when a customer's analyst wants to be extra sure about a network event or suspicious file, we want to provide easy ways for the analyst to delve deeper into the threat.

On this project, I worked with one other engineer to develop Docker containers that serve up various models, run queries against an Elasticsearch cluster I designed, and output a large JSON object that is served to a customer's dashboard to help them analyze a given potential threat. We developed a single codebase that works for PE, RTF, DOC, and PDF files.
HTML Malware Detection Model
I worked on the team that developed our HTML malware detection model. This deep learning model examines the contents of an HTML file at hierarchical spatial scales. Specifically, simple tokens are extracted from an input HTML file via a simple regular expression search, and bin-counts of the numeric hashes of these tokens are analyzed at different scales (like zooming in and out on the file).

The paper for this model is here.
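A stripped-down sketch of that featurization step (the token regex, hash size, and number of scales are illustrative, not the production values):

```python
# Stripped-down sketch of the hierarchical hashed bag-of-tokens featurization:
# tokenize with a simple regex, hash tokens into a fixed number of bins, and
# bin-count the token stream at several "zoom levels". Parameters are illustrative.
import re
import zlib
import numpy as np

TOKEN_RE = re.compile(r"[A-Za-z0-9_]+")

def hashed_counts(tokens, n_bins):
    counts = np.zeros(n_bins, dtype=np.float32)
    for tok in tokens:
        counts[zlib.crc32(tok.encode()) % n_bins] += 1.0
    return counts

def hierarchical_features(html_text, n_bins=256, n_scales=4):
    tokens = TOKEN_RE.findall(html_text)
    features = []
    for scale in range(n_scales):
        n_chunks = 2 ** scale                     # 1, 2, 4, 8 chunks of the token stream
        chunks = np.array_split(np.array(tokens, dtype=object), n_chunks)
        # At each scale, aggregate the per-chunk counts (here: element-wise max),
        # so the model sees both whole-file and localized token statistics.
        per_chunk = np.stack([hashed_counts(c, n_bins) for c in chunks])
        features.append(per_chunk.max(axis=0))
    return np.concatenate(features)               # shape: (n_scales * n_bins,)

if __name__ == "__main__":
    x = hierarchical_features("<html><script>eval(unescape('%61%6c'))</script></html>")
    print(x.shape)  # (1024,)
```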
URL Malware Detection Model
I worked on the team that developed our URL malware detection model. The model is a convolutional neural network, which works by first applying a learned embedding to characters, and then running 1-D convolutions over these character embeddings. In this way, the neural network is able to develop its own fuzzy n-gram patterns, and ends up being able to detect suspicious short strings with surprising accuracy.

The paper for the (older, original) model is here.
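A minimal sketch of the architecture in PyTorch (layer sizes and the single kernel width are illustrative; the real model is larger and uses several parallel kernel widths):

```python
# Minimal character-level CNN for URL classification, sketched in PyTorch.
# Layer sizes and the single kernel width are illustrative, not the production model.
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, vocab_size=128, emb_dim=32, n_filters=64, kernel_size=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, n_filters, kernel_size)
        self.head = nn.Linear(n_filters, 1)

    def forward(self, char_ids):                  # (batch, seq_len) of character ids
        x = self.embed(char_ids)                  # (batch, seq_len, emb_dim)
        x = x.transpose(1, 2)                     # (batch, emb_dim, seq_len) for Conv1d
        x = torch.relu(self.conv(x))              # learned "fuzzy n-gram" detectors
        x = x.max(dim=2).values                   # global max-pool over the sequence
        return self.head(x).squeeze(-1)           # logit: higher = more suspicious

def encode(url, max_len=200):
    ids = [min(ord(c), 127) for c in url[:max_len]]
    return torch.tensor(ids + [0] * (max_len - len(ids))).unsqueeze(0)

if __name__ == "__main__":
    model = CharCNN()
    logit = model(encode("http://examp1e-login-update.xyz/verify"))
    print(torch.sigmoid(logit).item())
```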
***

Premise - Core Projects

SLA Schema + SLA Ingest and Evaluation Pipelines
The way our company made money was to sign contracts with SLAs (Service Level Agreements), basically saying that we would "gather X data, with Y requirements, and Z restrictions", and then fulfill those contracts by paying contributors to capture that data. An example contract might require that we capture 30 price observations per month (with price SD < .2) and at least five per week, for two brands each of twenty specific consumer products, in each of seven different regions, with at least 9000 observations per month overall.

This information, while human readable, was not machine readable, so things like progress dashboards had to be written specifically for each new contract instead of spun up automatically. This wasn't scalable, so I designed an SLA schema that could describe a broad range of SLA requirements via a JSON blob.
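The field names below are illustrative rather than the exact schema, but the flavor was something like this: one entry per requirement, with explicit facets to group by.

```python
# Illustrative (not exact) shape of the machine-readable SLA schema, shown as a
# Python dict that would be stored as a JSON blob. Field names are hypothetical.
example_sla = {
    "contract_id": "acme-2015-q3",
    "requirements": [
        {
            "metric": "price_observation_count",
            "target": 30,
            "period": "month",
            "group_by": ["product", "region"],           # facets the target applies to
            "constraints": {"price_sd_max": 0.2},
        },
        {
            "metric": "price_observation_count",
            "target": 5,
            "period": "week",
            "group_by": ["product", "brand", "region"],
            "constraints": {"brands_per_product_min": 2},
        },
        {
            "metric": "price_observation_count",
            "target": 9000,
            "period": "month",
            "group_by": [],                              # contract-wide total
        },
    ],
    "scope": {"products": 20, "regions": 7},
}
```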

I translated all current contracts into this machine-readable format and augmented an existing system to push this data (stored in Google Docs, editable by certain employees) to a Redshift database. I then wrote a pipeline, run daily, to track SLAs against their contract's data. The pipeline tracked % SLA coverage, as well as budgeting overages, across all facets listed in the SLA (e.g. product X brand X month X region, product X brand X week, product X month, product, etc...).
Interactive Data-Drill-Down App (Shiny)
Created an interactive web application that allows users to visualize an aggregated time series' component series, or metrics about those component series, and to drill down to discover more about a specific series. It consisted of about a dozen unique plot types, various search mechanisms, metrics, and options.

The application was built entirely in R's Shiny framework.
Places Clustering
Given many observations, each with:

  • longitude, latitude coordinates with non-Gaussian error distributions
  • a coordinate 'accuracy' (expected SD in meters) metric
  • an associated user (observations within users were highly autocorrelated)
  • a hand-typed (read: messy) place name,

I developed a method of clustering these observations to back out 'true' places, along with confidence metrics for each clustered place's location, name, and actual existence. This was used to help better direct our users when capturing further observations (place-based tasks), and to create better price indices.
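A simplified sketch of the geographic core of that method: density-based clustering on haversine distance, then a modal name per cluster. The real version also used the per-observation accuracy, user autocorrelation, and name similarity; the parameters here are illustrative.

```python
# Simplified sketch of the geographic core of the place-clustering step:
# density-based clustering on haversine distance, then a modal name per cluster.
# The real method also used per-observation accuracy, user autocorrelation, and
# name similarity; the 100 m radius here is illustrative.
from collections import Counter
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000

def cluster_places(lat, lon, names, radius_m=100, min_obs=3):
    coords = np.radians(np.column_stack([lat, lon]))
    labels = DBSCAN(
        eps=radius_m / EARTH_RADIUS_M,   # haversine distances are in radians
        min_samples=min_obs,
        metric="haversine",
    ).fit_predict(coords)

    places = []
    for label in set(labels) - {-1}:     # -1 = noise / unconfirmed places
        idx = labels == label
        places.append({
            "lat": float(np.mean(np.asarray(lat)[idx])),
            "lon": float(np.mean(np.asarray(lon)[idx])),
            "name": Counter(np.asarray(names, dtype=object)[idx]).most_common(1)[0][0],
            "n_obs": int(idx.sum()),
        })
    return places

if __name__ == "__main__":
    lat = [6.5244, 6.52441, 6.52439, 6.6000]
    lon = [3.3792, 3.37925, 3.37918, 3.4000]
    names = ["Mama Cass Cafe", "mama cass", "Mama Cass Cafe", "Somewhere Else"]
    print(cluster_places(lat, lon, names, min_obs=2))
```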
Optimal Volume Allocation in Surveys
Developed a formula (via Lagrange minimization) to define the optimal volume allocation, given a budget and the goal of minimizing the standard error of the end product: an aggregated (over weights) price time series.
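The original write-up isn't reproduced here, but under standard simplifying assumptions (independent strata i with weights w_i, per-observation costs c_i, per-stratum price SDs sigma_i, and budget B), the Lagrange-minimization step reduces to the classic optimal-allocation result:

```latex
% Minimize the variance of the weighted aggregate subject to a budget constraint.
% (Standard simplifying assumptions, not necessarily the exact original formulation.)
\min_{n_1,\dots,n_k} \; \mathrm{Var}\!\left(\sum_i w_i \bar{x}_i\right)
  = \sum_i \frac{w_i^2 \sigma_i^2}{n_i}
\quad \text{subject to} \quad \sum_i c_i n_i = B

% Lagrangian: L = \sum_i w_i^2 \sigma_i^2 / n_i + \lambda \left(\sum_i c_i n_i - B\right).
% Setting \partial L / \partial n_i = 0 gives the optimal allocation:
n_i^{\ast} \;=\; B \cdot \frac{w_i \sigma_i / \sqrt{c_i}}{\sum_j w_j \sigma_j \sqrt{c_j}}
```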
Product Hierarchies & Weights Upkeep System
Developed and maintained a system to handle many (quite complex) weights hierarchies. Enforced a series of tests to ensure that end-node weights always summed to 1, etc...
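The core invariant is simple to state in code. A toy sketch (the real system used a different representation and many more checks): every node's children weights sum to 1, which guarantees the effective end-node weights do too.

```python
# Toy sketch of the sum-to-1 invariant for a weights hierarchy: every node's
# children weights must sum to 1, which makes effective end-node weights sum to 1.
# The real system used a different representation and many more checks.
TOL = 1e-9

hierarchy = {                      # hypothetical CPI-style hierarchy
    "weight": 1.0,
    "children": {
        "food":      {"weight": 0.4, "children": {
            "bread": {"weight": 0.7},
            "rice":  {"weight": 0.3},
        }},
        "transport": {"weight": 0.6},
    },
}

def check(node, path="root"):
    children = node.get("children", {})
    if not children:
        return
    total = sum(child["weight"] for child in children.values())
    assert abs(total - 1.0) < TOL, f"weights under {path} sum to {total}, not 1"
    for name, child in children.items():
        check(child, f"{path}/{name}")

def leaf_weights(node, acc=1.0):
    children = node.get("children", {})
    if not children:
        yield acc
    for child in children.values():
        yield from leaf_weights(child, acc * child["weight"])

check(hierarchy)
assert abs(sum(leaf_weights(hierarchy)) - 1.0) < TOL
print("hierarchy OK")
```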
Changepoint and Bimodality Detection in Price Time Series
Just another pipeline :). It's what it sounds like: data pull, estimation, plots, results push.
Automated Visualizations of Medicare Data
This is part of what I worked on during my summer 2012 internship with Acumen LLC. I wrote various R functions that took Excel workbooks as input, which the functions parsed, organized, and plotted in some way. For example:
- Fn: Plots normalized values of multiple variables over all districts or all states in the United States with a segmented scatter plot. A state s is highlighted and its values shown. All points more than n standard deviations from the mean become two-letter state abbreviations or three-letter district abbreviations.
- Fn: Allows viewers to spot the professional relationships among (often many thousands of) doctors. First, it creates a base distance metric to represent professional closeness between two doctors Di and Dj. This involved variables like the number of beneficiaries Di and Dj share (weighted by billing), the % of beneficiaries Di and Dj share relative to their own unique beneficiary service counts (weighted by billing), etc.