Data Skeptic

For machine learning models created with the random forest algorithm, there is no obvious diagnostic to inform you which features are more important in the output of the model. Some straightforward but useful techniques exist revolving around removing a feature and measuring the decrease in accuracy or Gini values in the leaves. We broadly discuss these techniques in this episode.

Direct download: feature-importance.mp3
Category:general -- posted at: 9:24am PDT

As cities provide bike sharing services, they must also plan for how to redistribute bicycles as they inevitably build up at more popular destination stations. In this episode, Hui Xiong talks about the solution he and his colleagues developed to rebalance bike sharing systems.

Direct download: nyc-bike-share-rebalancing.mp3
Category:general -- posted at: 8:00am PDT

Random forest is a popular ensemble learning algorithm which leverages bagging both for sampling and feature selection. In this episode we make an analogy to the process of running a bookstore.

Direct download: random-forest.mp3
Category:general -- posted at: 8:00am PDT

Jo Hardin joins us this week to discuss the ASA's Election Prediction Contest. This is a competition aimed at forecasting the results of the upcoming US presidential election competition. More details are available in Jo's blog post found here.

You can find some useful R code for getting started automatically gathering data from 538 via Jo's github and official contest details are available here. During the interview we also mention Daily Kos and 538.

Direct download: asa-election-prediction-with-jo-hardin.mp3
Category:general -- posted at: 8:00am PDT

The F1 score is a model diagnostic that combines precision and recall to provide a singular evaluation for model comparison.  In this episode we discuss how it applies to selecting an interior designer.

Direct download: f1-score.mp3
Category:general -- posted at: 9:49am PDT

Urban congestion effects every person living in a city of any reasonable size. Lewis Lehe joins us in this episode to share his work on downtown congestion pricing. We explore topics of how different pricing mechanisms effect congestion as well as how data visualization can inform choices.

You can find examples of Lewis's work at His paper which we discussed during the interview isDistance-dependent congestion pricing for downtown zones.

On this episode, we discuss State of California data which can be found at

Direct download: urban-congestion.mp3
Category:general -- posted at: 8:00am PDT

Heteroskedasticity is a term used to describe a relationship between two variables which has unequal variance over the range.  For example, the variance in the length of a cat's tail almost certainly changes (grows) with age.  On the other hand, the average amount of chewing gum a person consume probably has a consistent variance over a wide range of human heights.

We also discuss some issues with the visualization shown in the tweet embedded below.

Image claiming relationship between income and tickets issued

Direct download: heteroskedasticity.mp3
Category:general -- posted at: 8:00am PDT

Our guest today is Michael Cuthbert, an associate professor of music at MIT and principal investigator of the Music21 project, which we focus our discussion on today.

Music21 is a python library making analysis of music accessible and fun. It supports integration with popular formats such as MIDI, MusicXML, Lilypond, and others. It's also well integrated with The Elvis Project, enabling users to import large volumes of music for easy analysis. Music21 is a great platform for musicologists and machine learning researchers alike to explore patterns and structure in music.

Direct download: music21.mp3
Category:general -- posted at: 8:00am PDT

Paxos is a protocol for arriving a consensus in a distributed computing system which accounts for unreliability of the nodes.  We discuss how this might be used in the real world in the event of a massive disaster.

Direct download: paxos.mp3
Category:general -- posted at: 8:00am PDT

Machine learning models are often criticized for being black boxes. If a human cannot determine why the model arrives at the decision it made, there's good cause for skepticism. Classic inspection approaches to model interpretability are only useful for simple models, which are likely to only cover simple problems.

The LIME project seeks to help us trust machine learning models. At a high level, it takes advantage of local fidelity. For a given example, a separate model trained on neighbors of the example are likely to reveal the relevant features in the local input space to reveal details about why the model arrives at it's conclusion.

In this episode, Marco Tulio Ribeiro joins us to discuss how LIME (Locally Interpretable Model-Agnostic Explanations) can help users trust machine learning models. The accompanying paper is titled "Why Should I Trust You?": Explaining the Predictions of Any Classifier.

Direct download: trust-in-ml.mp3
Category:general -- posted at: 8:00am PDT