Data Skeptic (general)

Today's episode is a reading of Isaac Asimov's Franchise.  As mentioned on the show, this is just a work of fiction to be enjoyed and not in any way some obfuscated political statement.  Enjoy, and happy holidays!

Direct download: 2016-holiday-special.mp3
Category:general -- posted at: 8:00am PDT

Classically, entropy is a measure of disorder in a system. From a statistical perspective, it is more useful to say it's a measure of the unpredictability of the system. In this episode we discuss how information reduces the entropy in deciding whether or not Yoshi the parrot will like a new chew toy. A few other everyday examples help us examine why entropy is a nice metric for constructing a decision tree.
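
As a rough illustration of the idea, the sketch below computes Shannon entropy and the entropy reduction (information gain) from a single split. The toy data about Yoshi's chew toys and the helper names `entropy` and `information_gain` are made up for this example and are not taken from the episode.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labels, groups):
    """Entropy of `labels` minus the size-weighted entropy of the split `groups`."""
    total = len(labels)
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted

# Hypothetical observations: (toy material, Yoshi's reaction)
toys = [
    ("wood", "like"), ("wood", "like"), ("wood", "like"),
    ("plastic", "dislike"), ("plastic", "dislike"), ("plastic", "like"),
]
labels = [outcome for _, outcome in toys]

# Group the outcomes by material, i.e. split on the "material" feature
by_material = {}
for material, outcome in toys:
    by_material.setdefault(material, []).append(outcome)

base = entropy(labels)                                    # uncertainty before the split
gain = information_gain(labels, list(by_material.values()))  # uncertainty removed by the split
print(f"entropy before split: {base:.3f} bits, information gain: {gain:.3f} bits")
```

A decision tree builder would compute this gain for every candidate feature and split on the one with the largest value.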

Direct download: entropy.mp3
Category:general -- posted at: 12:23am PDT

Cloud services are now ubiquitous in data science and more broadly in technology as well. This week, I speak to Mark Souza, Tobias Ternström, and Corey Sanders about various aspects of data at scale. We discuss the embedding of R into SQL Server, SQL Server on Linux, open source, and a few other cloud topics.

Direct download: ms-connect-conference.mp3
Category:general -- posted at: 8:24am PDT

Today's episode is all about Causal Impact, a technique for estimating the impact of a particular event on a time series. We talk to William Martin about his research into the impact releases have on apps, and we also chat with Karen Blakemore about a project she helped us build to explore the impact of a Saturday Night Live appearance on a musician's career.

Martin's work culminated in a paper Causal Impact for App Store Analysis. A shorter summary version can be found here. His company helping app developers do this sort of analysis can be found at

Direct download: causal-impact.mp3
Category:general -- posted at: 7:56am PDT

The Bootstrap is a method of resampling a dataset to possibly refine its accuracy and produce useful metrics on the result. The bootstrap is a useful statistical technique and is leveraged in Bagging (bootstrap aggregation) algorithms such as Random Forest. We discuss this technique related to polling and surveys.
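
To make the resampling step concrete, here is a minimal sketch of a percentile bootstrap confidence interval for a poll result. The poll numbers, the function name `bootstrap_ci`, and its parameters are assumptions chosen for illustration; the episode does not prescribe this particular recipe.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for `stat`, resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    # Recompute the statistic on many resamples of the same size as the original
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical poll of 100 people: 1 = supports the measure, 0 = does not
poll = [1] * 54 + [0] * 46
low, high = bootstrap_ci(poll)
print(f"observed support: {statistics.mean(poll):.2f}, "
      f"approximate 95% interval: [{low:.2f}, {high:.2f}]")
```

The spread of the resampled estimates stands in for the sampling variability you would see if the poll could be repeated many times.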

Direct download: the-bootstrap.mp3
Category:general -- posted at: 8:00am PDT

The Gini Coefficient (as it relates to decision trees) is one approach to choosing the best split to introduce when dividing your dataset as part of a decision tree. To pick the right feature to split on, it considers the frequency of the values of that feature and how well those values correlate with the specific outcomes you are trying to predict.
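
The sketch below shows one way this selection could look in code. In the decision tree setting the quantity is more commonly called Gini impurity, and that is what is computed here. The toy rows and the helper names (`gini_impurity`, `weighted_gini`, `best_feature`) are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class frequencies; 0 means a pure group."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def weighted_gini(rows, feature, target):
    """Size-weighted Gini impurity of the groups formed by splitting on `feature`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature], []).append(row[target])
    total = len(rows)
    return sum(len(g) / total * gini_impurity(g) for g in groups.values())

def best_feature(rows, features, target):
    """Pick the feature whose split leaves the least impurity behind."""
    return min(features, key=lambda f: weighted_gini(rows, f, target))

# Hypothetical data: does someone go out or stay in?
rows = [
    {"raining": "yes", "weekend": "yes", "outcome": "stay"},
    {"raining": "yes", "weekend": "no",  "outcome": "stay"},
    {"raining": "no",  "weekend": "yes", "outcome": "go"},
    {"raining": "no",  "weekend": "no",  "outcome": "go"},
    {"raining": "no",  "weekend": "yes", "outcome": "go"},
    {"raining": "yes", "weekend": "no",  "outcome": "go"},
]
choice = best_feature(rows, ["raining", "weekend"], "outcome")
print("split on:", choice)
```

Here "raining" wins because one of its groups is perfectly pure, while splitting on "weekend" leaves both groups as mixed as they were before.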

Direct download: gini-index.mp3
Category:general -- posted at: 8:00am PDT

Financial analysis techniques for studying numeric, well-structured data are very mature. While using unstructured data in finance is not necessarily a new idea, the area is still very greenfield. On this episode, Delia Rusu shares her thoughts on the potential of unstructured data and discusses her work analyzing Wikipedia to help inform financial decisions.

Delia's talk at PyData Berlin can be watched on YouTube (Estimating stock price correlations using Wikipedia). The slides can be found here and all related code is available on GitHub.

Direct download: unstructured-data-for-finance.mp3
Category:general -- posted at: 8:00am PDT

AdaBoost is a canonical example of the class of AnyBoost algorithms that create ensembles of weak learners. We discuss how a complex problem like predicting restaurant failure (which is surely caused by different problems in different situations) might benefit from this technique.
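
For readers who want to see the mechanics, below is a minimal sketch of AdaBoost using one-dimensional decision stumps as the weak learners. The data, the function names, and the choice of stumps are assumptions made for illustration; labels are encoded as +1 and -1.

```python
import math

def train_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best 1-D stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds=5):
    """Fit `rounds` stumps, reweighting the examples after each one."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # this stump's vote strength
        ensemble.append((alpha, threshold, polarity))
        preds = [polarity if x >= threshold else -polarity for x in xs]
        # Misclassified examples gain weight, correct ones lose weight
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * (pol if x >= t else -pol) for a, t, pol in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data that no single threshold can separate
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=5)
correct = sum(predict(ensemble, x) == y for x, y in zip(xs, ys))
print(f"training accuracy: {correct}/{len(xs)}")
```

Each stump on its own is a weak rule, but because later stumps concentrate on the examples earlier ones got wrong, the weighted vote can capture patterns that have different causes in different regions of the data.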

Direct download: adaboost.mp3
Category:general -- posted at: 8:00am PDT

Platform as a service is a growing trend in data science where services like fraud analysis and face detection can be provided via APIs. Such services turn the actual model into a black box to the consumer. But can the model be reverse engineered?

Florian Tramèr shares his work in this episode showing that it can. The paper Stealing Machine Learning Models via Prediction APIs is definitely worth your time to read if you enjoy this episode. Related source code can be found in

Direct download: stealing-models-from-the-cloud.mp3
Category:general -- posted at: 7:54am PDT

For machine learning models created with the random forest algorithm, there is no obvious diagnostic to inform you which features are more important in the output of the model. Some straightforward but useful techniques exist revolving around removing a feature and measuring the decrease in accuracy or Gini values in the leaves. We broadly discuss these techniques in this episode.
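
One common variant of the "remove a feature and measure the drop" idea is permutation importance, which shuffles a feature's values instead of deleting the column so the model does not need to be retrained. The sketch below uses that swapped-in variant. The `toy_model` function is a hypothetical stand-in for a trained random forest, not a real one, and the data is synthetic.

```python
import random

def accuracy(model, X, y):
    """Fraction of rows where the model's prediction matches the label."""
    correct = sum(1 for row, label in zip(X, y) if model(row) == label)
    return correct / len(y)

def permutation_importance(model, X, y, n_features, seed=0):
    """Accuracy drop observed when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    drops = []
    for j in range(n_features):
        column = [row[j] for row in X]
        rng.shuffle(column)  # break the link between feature j and the label
        X_shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        drops.append(baseline - accuracy(model, X_shuffled, y))
    return drops

# Hypothetical stand-in for a trained model: it only looks at feature 0
def toy_model(row):
    return 1 if row[0] > 0.5 else 0

data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_model(row) for row in X]

drops = permutation_importance(toy_model, X, y, n_features=2)
for j, drop in enumerate(drops):
    print(f"feature {j}: accuracy drop {drop:.2f}")
```

A large drop suggests the model relies on that feature, while a drop near zero suggests the feature contributes little. With a real forest, the same loop would wrap the fitted model's prediction call.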

Direct download: feature-importance.mp3
Category:general -- posted at: 9:24am PDT