Sophos - Core Projects
Garbage In, Garbage Out
I wrote this paper to accompany a talk I gave at Black Hat in 2017. The basic idea was to show how malware detection model
accuracy can vary wildly based on the test dataset used, which is extremely relevant when you're uncertain of your production "test"
dataset. To showcase this, I trained our team's URL model on three different datasets, and tested each of these trained models
on time-split validation sets from the same three datasets, yielding nine sets of results. I showed how analysing the results can
help tell us how sensitive a given model is to changes in the data distribution, and indicate what datasets seem to be supersets of others.
The paper for this talk is here
and the slide deck is here
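The three-by-three evaluation described above can be sketched roughly as follows. This is a minimal illustration, not the code behind the paper: the token-matching "model", the dataset names A, B and C, and the tiny URL lists are all made up for the example.

```python
from itertools import product

def accuracy(predict, examples):
    """Fraction of (input, label) examples the classifier labels correctly."""
    return sum(predict(x) == y for x, y in examples) / len(examples)

def cross_dataset_grid(train_fn, datasets):
    """Train one model per dataset, then score it on every dataset's validation split.

    datasets maps a name to (train_examples, validation_examples), where the
    validation examples are assumed to come later in time than the training ones.
    Returns {(train_name, test_name): accuracy}.
    """
    models = {name: train_fn(train) for name, (train, _) in datasets.items()}
    return {
        (tr, te): accuracy(models[tr], datasets[te][1])
        for tr, te in product(datasets, datasets)
    }

def train_token_model(train):
    """Toy stand-in for a real model: flag URLs containing malicious-only tokens."""
    bad, good = set(), set()
    for url, label in train:
        (bad if label else good).update(url.split("/"))
    bad_only = bad - good
    return lambda url: int(any(tok in bad_only for tok in url.split("/")))

# Hypothetical datasets: (training split, later validation split).
datasets = {
    "A": ([("a.com/login", 0), ("evil.biz/x", 1)],
          [("a.com/home", 0), ("evil.biz/y", 1)]),
    "B": ([("b.org/docs", 0), ("bad.ru/pay", 1)],
          [("b.org/faq", 0), ("bad.ru/win", 1)]),
    "C": ([("c.net/shop", 0), ("evil.biz/z", 1), ("bad.ru/q", 1)],
          [("c.net/cart", 0), ("evil.biz/w", 1)]),
}
grid = cross_dataset_grid(train_token_model, datasets)
```

In this toy setup the model trained on C scores well on A's and B's validation splits while the reverse is not true, which is the kind of asymmetry that suggests one dataset behaves like a superset of another.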
Vendor & Model Evaluation Dashboard
Our team develops many models to classify and detect malware. We want to be able to track how our competitors, our deployed models,
and our test models are performing against different datasets and label definitions over time. We also want to be able to understand
how our own current products will improve (or not) when combined with new detection models.
To do all of this, another engineer and I created an internal dashboard that runs evaluation queries in real time against billions
of records in a Redshift database. Users select options such as which models and vendors to evaluate, which label definitions to use,
and the time period to inspect. The dashboard then runs the resulting SQL queries and generates various
plots to showcase the results to the user.
The dashboard also provides an interface to upload files to be evaluated by our team's models.
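As a rough idea of how user selections can be turned into a query, here is a small sketch. The table name `detections`, its columns (`model_name`, `detected` as a 0/1 integer, `first_seen`) and the two label columns are hypothetical, not the real schema. Values are passed as DB-API `%s` placeholders (the style psycopg2 uses), while the label column, which cannot be parameterized, is checked against a whitelist.

```python
# Hypothetical label-definition columns; the real schema is not shown in this write-up.
ALLOWED_LABEL_COLUMNS = {"label_strict", "label_loose"}

def build_eval_query(models, label_column, start, end):
    """Return (sql, params) for per-model detection and false positive rates."""
    if label_column not in ALLOWED_LABEL_COLUMNS:
        raise ValueError(f"unknown label definition: {label_column}")
    if not models:
        raise ValueError("select at least one model or vendor")
    placeholders = ", ".join(["%s"] * len(models))
    sql = f"""
        SELECT model_name,
               AVG(CASE WHEN {label_column} = 1 THEN detected * 1.0 END) AS detection_rate,
               AVG(CASE WHEN {label_column} = 0 THEN detected * 1.0 END) AS false_positive_rate
        FROM detections
        WHERE model_name IN ({placeholders})
          AND first_seen >= %s AND first_seen < %s
        GROUP BY model_name
    """
    return sql, [*models, start, end]

sql, params = build_eval_query(
    ["url_model_v1", "vendor_x"], "label_strict", "2017-01-01", "2017-02-01"
)
```

The returned pair would be handed to a database cursor's `execute(sql, params)`, and the resulting rows fed to the plotting code.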
Endpoint Detection And Response
Signatures and malware detection models aren't always 100% sure whether a file is benign or malicious. When this is
the case, or even when a customer's analyst wants to be extra sure about a network event or suspicious file, we want to provide easy
ways for an analyst to delve deeper into the threat.
On this project, I worked with one other engineer to develop Docker containers that serve up various models, run queries against an
Elasticsearch cluster I designed, and output a large JSON object that is served to a customer's dashboard to help them analyze a
given potential threat. We developed a single codebase to work for PE, RTF, DOC, and PDF files.
HTML Malware Detection Model
I worked on the team to develop our HTML malware detection model. This deep learning model
examines the contents of an HTML file at hierarchical spatial scales. Specifically, tokens are extracted from an input
HTML file with a simple regular expression, and bin counts of the numeric hashes of these tokens are analyzed at different
scales (like zooming in and out on the file).
The paper for this model is here
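The feature extraction step described above might look something like this sketch. The `\w+` token pattern, the CRC32 hash, the 16 bins and the scales (1, 2, 4) are illustrative choices of mine, not the settings used in the actual model, and the neural network that consumes these counts is omitted.

```python
import re
import zlib

TOKEN_RE = re.compile(r"\w+")  # assumed token pattern, chosen for illustration

def bin_counts(tokens, n_bins):
    """Hash each token to one of n_bins buckets and count hits per bucket."""
    counts = [0] * n_bins
    for tok in tokens:
        counts[zlib.crc32(tok.encode("utf-8")) % n_bins] += 1
    return counts

def multiscale_features(html, n_bins=16, scales=(1, 2, 4)):
    """For each scale s, split the token stream into s chunks and bin-count each chunk.

    Scale 1 summarizes the whole file; larger scales "zoom in" on sections of it.
    """
    tokens = TOKEN_RE.findall(html)
    features = {}
    for s in scales:
        size = -(-len(tokens) // s) or 1  # ceiling division, never zero
        chunks = [tokens[i * size:(i + 1) * size] for i in range(s)]
        features[s] = [bin_counts(chunk, n_bins) for chunk in chunks]
    return features

features = multiscale_features("<html><body><script>eval(x)</script></body></html>")
```

Because tokens are hashed rather than looked up in a vocabulary, the feature size stays fixed no matter which tokens appear in a file.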
URL Malware Detection Model
I worked on the team to develop our URL malware detection model. The model is a convolutional neural network, which works by
first applying a learned embedding onto characters, and then running 1-d convolutions over these character embeddings. In this way,
the neural network is able to develop its own fuzzy n-gram patterns, and ends up being able to detect suspicious short strings with
The paper for the (older, original) model is here
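To make the architecture concrete, here is a forward pass of a character-level convolution written with NumPy. It is only a sketch: the weights are random rather than learned, and the vocabulary size, embedding width, filter count, filter width and maximum length are arbitrary values chosen for the example, not those of the real model.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMBED_DIM, N_FILTERS, WIDTH, MAX_LEN = 128, 8, 4, 3, 32  # assumed sizes

# Random stand-ins for parameters that would normally be learned in training.
embedding = rng.normal(size=(VOCAB, EMBED_DIM))                   # one row per character code
filters = 0.1 * rng.normal(size=(N_FILTERS, WIDTH, EMBED_DIM))    # each filter spans WIDTH characters
w_out = rng.normal(size=N_FILTERS)

def score_url(url):
    """Embed characters, slide 1-d filters over them, max-pool, and squash to [0, 1]."""
    codes = [min(ord(c), VOCAB - 1) for c in url[:MAX_LEN]]
    codes += [0] * (MAX_LEN - len(codes))                   # pad short URLs with code 0
    x = embedding[codes]                                    # shape (MAX_LEN, EMBED_DIM)
    windows = np.stack([x[i:i + WIDTH] for i in range(MAX_LEN - WIDTH + 1)])
    conv = np.einsum("twd,fwd->tf", windows, filters)       # one response per position and filter
    pooled = np.maximum(conv, 0).max(axis=0)                # ReLU, then strongest response per filter
    return float(1.0 / (1.0 + np.exp(-(pooled @ w_out))))

score = score_url("http://example.com/login.php")
```

Each filter responds to a window of WIDTH consecutive character embeddings, so after training it can act as a soft n-gram detector: similar characters have similar embeddings and produce similar responses.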
Premise - Core Projects
SLA Schema + SLA Ingest and Evaluation Pipelines
The way our company made money was to sign contracts with SLAs (Service Level Agreements) basically saying that we would "gather X
data, with Y requirements, and Z restrictions", and then fulfill those contracts by paying contributors money to capture that data.
An example contract might require that we capture 30 price observations per month (with price SD < .2), and at least five per week,
for two brands each for twenty specific consumer products, and all of this for each of seven different regions, with at least 9000
observations per month.
This information, while human readable, was not machine readable, so things like progress dashboards had to be written specifically for
each new contract instead of being spun up automatically.
This wasn't scalable, so I designed an 'SLA' schema that could describe a broad range of SLA requirements via a JSON blob.
I then translated all current contracts to this machine-readable format and augmented an existing system to push this data (kept in
Google Docs, editable by certain employees) to a Redshift database.
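For illustration only, the example contract above could be written as a JSON blob along these lines. The field names (`facets`, `requirements`, `group_by`, and so on) are invented for this sketch and are not the actual schema, and the grouping of the requirements is one possible reading of the example contract.

```python
import json

# Hypothetical encoding of the example contract; all field names are made up.
sla = {
    "contract_id": "example-contract",
    "facets": {"region": 7, "product": 20, "brands_per_product": 2},
    "requirements": [
        # 30 price observations per month for each region x product x brand cell
        {"group_by": ["region", "product", "brand"], "period": "month",
         "min_observations": 30, "max_price_sd": 0.2},
        # at least five per week for the same cells
        {"group_by": ["region", "product", "brand"], "period": "week",
         "min_observations": 5},
        # overall monthly floor, read here as a contract-wide total
        {"group_by": [], "period": "month", "min_observations": 9000},
    ],
}
blob = json.dumps(sla, indent=2)
```

Expressing each requirement as "a minimum per cell of some facet combination, per period" is what lets a generic dashboard or pipeline iterate over requirements instead of being hand-written per contract.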
Then I wrote a daily pipeline to track each SLA against its contract's data. The pipeline tracked % SLA coverage, as well as
budget overages, across all facets listed in the SLA
(e.g. product X brand X month X region, product X brand X week, product X month, product, etc.)
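A coverage calculation across facet combinations can be sketched as below. The observation fields, the per-cell minimums and the explicit list of expected cells are assumptions made for the example. Budget tracking is left out.

```python
from collections import Counter

def coverage(observations, group_by, required, expected_cells):
    """Percent of the per-cell minimum met for each expected cell, capped at 100.

    observations: dicts keyed by facet name (e.g. product, brand, month).
    group_by: tuple of facet names defining a cell.
    expected_cells: every cell the SLA requires, so empty cells show up as 0%.
    """
    counts = Counter(tuple(o[k] for k in group_by) for o in observations)
    return {cell: min(100.0, 100.0 * counts[cell] / required) for cell in expected_cells}

# Hypothetical observations for one month.
obs = [
    {"product": "rice", "brand": "a", "month": "2016-01"},
    {"product": "rice", "brand": "a", "month": "2016-01"},
    {"product": "rice", "brand": "b", "month": "2016-01"},
    {"product": "milk", "brand": "c", "month": "2016-01"},
]
by_brand = coverage(
    obs, ("product", "brand", "month"), 2,
    [("rice", "a", "2016-01"), ("rice", "b", "2016-01"),
     ("milk", "c", "2016-01"), ("milk", "d", "2016-01")],
)
by_product = coverage(
    obs, ("product", "month"), 4,
    [("rice", "2016-01"), ("milk", "2016-01")],
)
```

Running the same function once per facet combination listed in an SLA gives the drill-down from fine-grained cells (product X brand X month) up to coarse ones (product X month).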
Interactive Data-Drill-Down App (Shiny)
Created an interactive web application that allows users to visualize an aggregated time series'
component series, or metrics regarding their component series, and drill down to discover more about
a specific series. Consisted of about a dozen unique plot types, various search mechanisms,
metrics, and options.
The application was built entirely in R's Shiny.
Given many observations, each with:
longitude, latitude coordinates with non-gaussian error distributions
coordinate 'accuracy' (expected SD in meters) metric
an associated user (observations within users were highly autocorrelated)
a hand-typed (read: messy) place name,
I developed a method of clustering these observations to back out 'true' places, and confidence metrics regarding each clustered place's
location, name, and actual existence.
This was used to help better direct our users when capturing further observations (place based tasks), and used to create better price
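The method itself is not described here, so the following is a deliberately simplified stand-in rather than the approach I used: a greedy pass that attaches each observation to the first existing cluster within a fixed radius, weights coordinates by inverse squared accuracy, and keeps the distinct users and normalized names as crude confidence signals. The field names and the 50 m radius are assumptions, and the non-Gaussian error distributions are not modeled.

```python
import math
from collections import Counter

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def cluster_places(observations, radius_m=50.0):
    """Greedy clustering, most accurate observations first."""
    clusters = []
    for ob in sorted(observations, key=lambda o: o["accuracy_m"]):
        w = 1.0 / max(ob["accuracy_m"], 1.0) ** 2       # inverse-variance style weight
        name = ob["name"].strip().lower()               # light cleanup of hand-typed names
        for c in clusters:
            if haversine_m(ob["lat"], ob["lon"], c["lat"], c["lon"]) <= radius_m:
                total = c["weight"] + w
                c["lat"] = (c["lat"] * c["weight"] + ob["lat"] * w) / total
                c["lon"] = (c["lon"] * c["weight"] + ob["lon"] * w) / total
                c["weight"] = total
                c["users"].add(ob["user"])
                c["names"][name] += 1
                break
        else:  # no cluster close enough: start a new one
            clusters.append({"lat": ob["lat"], "lon": ob["lon"], "weight": w,
                             "users": {ob["user"]}, "names": Counter([name])})
    return clusters

obs = [
    {"lat": 37.77500, "lon": -122.41900, "accuracy_m": 5.0, "user": "u1", "name": "Corner Store"},
    {"lat": 37.77510, "lon": -122.41905, "accuracy_m": 10.0, "user": "u2", "name": " corner store"},
    {"lat": 37.77495, "lon": -122.41895, "accuracy_m": 15.0, "user": "u1", "name": "Cornr Store"},
    {"lat": 37.80000, "lon": -122.40000, "accuracy_m": 20.0, "user": "u3", "name": "Bakery"},
]
places = cluster_places(obs)
```

Counting distinct users instead of raw observations is one simple way to avoid overconfidence when a single user's observations are highly autocorrelated.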
Optimal Volume Allocation in Surveys
Developed a formula (Lagrange minimization) to define optimal volume allocation, given a budget and a goal to minimize
the standard error in the end product: an aggregated (over weights) price time series.
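As an illustration of the kind of result a Lagrange minimization gives, here is the textbook version of the problem under assumptions that are mine, not necessarily those of the original formula: components are independent, component i has weight w_i, price standard deviation s_i and per-observation cost c_i, and the squared standard error of the weighted series is sum((w_i * s_i)^2 / n_i). Minimizing that subject to sum(c_i * n_i) = budget gives n_i proportional to w_i * s_i / sqrt(c_i).

```python
import math

def optimal_allocation(weights, sds, costs, budget):
    """Observation counts n_i minimizing sum((w_i*s_i)^2 / n_i) with sum(c_i*n_i) = budget.

    Setting the derivative of the Lagrangian to zero gives
    n_i = budget * (w_i*s_i / sqrt(c_i)) / sum_j(w_j*s_j*sqrt(c_j)).
    Counts are continuous here; a real allocation would round them.
    """
    denom = sum(w * s * math.sqrt(c) for w, s, c in zip(weights, sds, costs))
    return [budget * w * s / (math.sqrt(c) * denom) for w, s, c in zip(weights, sds, costs)]

def standard_error(weights, sds, counts):
    """Standard error of the weighted aggregate, assuming independent components."""
    return math.sqrt(sum((w * s) ** 2 / n for w, s, n in zip(weights, sds, counts)))

# Hypothetical three-component example.
weights, sds, costs, budget = [0.5, 0.3, 0.2], [1.0, 2.0, 0.5], [1.0, 1.0, 4.0], 100.0
counts = optimal_allocation(weights, sds, costs, budget)
equal = [budget / sum(costs)] * 3  # same count everywhere, same total spend
```

Under these assumptions, more volume goes to components that are heavily weighted or noisy and less to components that are expensive to observe.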
Product Hierarchies & Weights Upkeep System
Developed and maintained a system to handle many (quite complex) weight hierarchies. Enforced a series of
tests to ensure that end node weights always summed to 1, etc.
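A test of this kind can be sketched as follows, assuming (for the example only) that a hierarchy is stored as nested dicts mapping a node name to its relative weight and its children. If every group of siblings sums to 1, the absolute end node weights do as well.

```python
import math

def check_weights(tree, path=(), tol=1e-9):
    """Raise ValueError unless the relative weights of every sibling group sum to 1.

    tree maps a node name to (relative_weight, children); children is a dict of
    the same shape, empty for an end node.
    """
    total = sum(weight for weight, _ in tree.values())
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"weights under {'/'.join(path) or 'root'} sum to {total}")
    for name, (_, children) in tree.items():
        if children:
            check_weights(children, path + (name,), tol)

def end_node_weights(tree, parent_weight=1.0, path=()):
    """Return {path: absolute weight} for every end node."""
    result = {}
    for name, (weight, children) in tree.items():
        absolute = parent_weight * weight
        if children:
            result.update(end_node_weights(children, absolute, path + (name,)))
        else:
            result[path + (name,)] = absolute
    return result

# Hypothetical two-level hierarchy.
tree = {"food": (0.6, {"rice": (0.5, {}), "milk": (0.5, {})}), "fuel": (0.4, {})}
check_weights(tree)
leaves = end_node_weights(tree)
```

Checking each sibling group separately also pinpoints where in the hierarchy a bad weight was introduced, which a single check on the end node total would not.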
Changepoint, Bimodality detection in price time series
Just another pipeline :). What it sounds like. Data pull, estimation, plots, results push.
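For a flavor of the estimation step, here is the simplest possible changepoint estimator: pick the split that minimizes the within-segment squared error of a single mean shift. It is a generic textbook sketch, not the method the pipeline used. It does not test whether a changepoint exists at all and does not cover bimodality.

```python
def best_changepoint(series, min_segment=2):
    """Index k splitting series into series[:k] and series[k:] with least squared error.

    Assumes at most one shift in the mean; raises ValueError if the series is
    too short to form two segments of min_segment points each.
    """
    def sse(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs)

    candidates = range(min_segment, len(series) - min_segment + 1)
    return min(candidates, key=lambda k: sse(series[:k]) + sse(series[k:]))

# Hypothetical price series with a jump after the fourth observation.
prices = [1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 2.9, 3.0]
k = best_changepoint(prices)
```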
Automated Visualizations of Medicare Data
This is part of what I worked on for my summer (2012) internship with Acumen LLC. I wrote
various R functions which took as input Excel workbooks, which the functions parsed, organized, and plotted in some way. Ex:
Fn: Plots normalized values of multiple variables over all districts or all states in the
United States with a segmented scatter plot. A state s is highlighted and its values
shown. All points above n standard deviations from the mean become two-letter
state abbreviations, or three-letter district abbreviations.
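The labeling rule in that function can be sketched like this. The original was written in R. This Python version is only an illustration, and it assumes "above n standard deviations" means an absolute distance from the mean in either direction. The values are made up.

```python
import statistics

def points_to_label(values_by_region, n=2.0):
    """Return {abbreviation: z-score} for regions more than n SDs from the mean.

    These are the points that would be drawn as text labels instead of dots.
    """
    values = list(values_by_region.values())
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return {}
    return {
        region: (value - mean) / sd
        for region, value in values_by_region.items()
        if abs(value - mean) / sd > n
    }

# Hypothetical normalized values for one variable.
labels = points_to_label({"CA": 10, "NY": 10, "TX": 10, "WA": 10, "FL": 10, "AK": 40})
```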
Fn: Allows viewers to spot the professional relationships among (often many thousands of)
doctors. First, creates a base distance metric to represent professional closeness between two
doctors Di and Dj. This involved variables like the number of beneficiaries Di and Dj share,
weighted by billing, the % of beneficiaries Di and Dj share with regard to their own unique
beneficiary service count, weighted by billing, etc.