Data Skeptic

This episode explores why we can trust encryption.  Surprisingly, a discussion of looking up a word in the dictionary (binary search) and efficiently going wine tasting (the travelling salesman problem) helps introduce computational complexity as well as the P ?= NP question, which is central to the trustworthiness of RSA encryption.
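As a rough illustration of the dictionary lookup mentioned above, here is a minimal binary search sketch in Python. The function name, the word list, and the step counter are assumptions made for this example; they are not taken from the episode.

```python
def binary_search(sorted_words, target):
    """Return (index, steps) for target in a sorted list, or (-1, steps) if absent."""
    lo, hi = 0, len(sorted_words) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2          # open the "dictionary" at the middle
        if sorted_words[mid] == target:
            return mid, steps
        if sorted_words[mid] < target:
            lo = mid + 1              # target sorts later: discard the first half
        else:
            hi = mid - 1              # target sorts earlier: discard the second half
    return -1, steps


# Hypothetical, already-sorted word list.
words = ["apple", "banana", "cherry", "grape", "melon", "peach", "plum"]
print(binary_search(words, "peach"))
```

Each comparison halves the remaining range, so the number of steps grows roughly with the logarithm of the list length rather than with the length itself.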

With a high-level foundation in computational theory, we talk about NP problems and why prime factorization is a difficult problem, which makes it a great basis for the RSA encryption algorithm, which most of the internet uses to encrypt data.  Unlike the encryption scheme Ray Romano used in "Everybody Loves Raymond", RSA has nice theoretical foundations.
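To make the link between factoring and RSA concrete, here is a toy sketch in Python. It assumes tiny textbook primes (61 and 53), omits padding entirely, and uses hypothetical helper names, so it is an illustration only and not a secure or complete implementation. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
def toy_keypair(p, q, e=17):
    """Build a toy RSA key pair from two primes (illustrative assumption, not secure)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)               # private exponent: inverse of e modulo phi
    return (n, e), (n, d)


def factor_by_trial_division(n):
    """Recover the two prime factors of n by brute force; feasible only for tiny n."""
    f = 2
    while f * f <= n:
        if n % f == 0:
            return f, n // f
        f += 1
    return None


public, private = toy_keypair(61, 53)
message = 65
cipher = pow(message, public[1], public[0])    # encrypt with the public key
plain = pow(cipher, private[1], private[0])    # decrypt with the private key
print(cipher, plain)
print(factor_by_trial_division(public[0]))     # anyone who can factor n can rebuild d
```

With a modulus this small, trial division finds the factors immediately. The security argument rests on the assumption that no efficient method is known for factoring a modulus that is hundreds of digits long.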

It should be noted that this episode gives good reason to trust that properly encrypted data, based on well-chosen public/private keys where the private key is not compromised, is safe.  However, having safe encryption doesn't necessarily mean that the Internet is secure.  Man-in-the-middle attacks and the Snowden revelations are topics for another day, not for this record-length "mini" episode.

Direct download: MINI_Is_the_Internet_Secure.mp3
Category:miniepisode -- posted at: 12:41am PDT

Jeff Stanton joins me in this episode to discuss his book An Introduction to Data Science, and some of the unique challenges and issues faced by someone doing applied data science. A challenge for any data scientist is making sure they have a good input data set and apply any necessary data munging steps before their analysis. We cover some good advice on how to approach such problems.

Direct download: Practicing_and_Communicating_Data_Science.mp3
Category:data science -- posted at: 10:18pm PDT

The t-test is this week's mini-episode topic. The t-test is a statistical testing procedure used to determine whether the means of two datasets differ by a statistically significant amount. We discuss how a wine manufacturer might apply a t-test to determine if the sweetness, acidity, or some other property of two separate grape vines might differ in a statistically meaningful way.
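As a small sketch of the calculation, the Python below computes a two-sample t statistic using only the standard library. It assumes Welch's variant (unequal variances), which the episode notes do not specify, and the sweetness readings are made-up numbers for illustration. Turning the statistic into a p-value additionally requires the t distribution, which is left out here.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Sample variance of each group divided by its size.
    va = statistics.variance(sample_a) / len(sample_a)
    vb = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (
        va ** 2 / (len(sample_a) - 1) + vb ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Hypothetical sweetness readings from two grape vines.
vine_a = [21.8, 22.4, 23.1, 22.0, 22.7]
vine_b = [23.5, 24.1, 23.8, 24.6, 23.9]
t_stat, dof = welch_t(vine_a, vine_b)
print(round(t_stat, 3), round(dof, 1))
```

A t statistic far from zero, relative to the t distribution with the computed degrees of freedom, suggests the two means differ by more than sampling noise alone would explain.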

Direct download: MINI_The_T-Test.mp3
Category:miniepisode -- posted at: 7:49pm PDT

This week I'm joined by Karl Mamer to discuss the data behind three well-known urban legends. Did a large blackout in New York and surrounding areas result in a baby boom nine months later? Do subliminal messages affect our behavior? Is placing beer alongside diapers a recipe for generating more revenue than placing these products in separate locations? Listen as Karl and I explore these claims.

Direct download: Data_Myths_with_Karl_Mamer.mp3
Category:skepticism -- posted at: 7:35pm PDT

The Data Skeptic Podcast is launching a contest: not one of chance, but one of skill. Listeners are encouraged to put their data science skills to good use, or if all else fails, guess!

The contest works as follows. Below is some data about the cumulative number of downloads the podcast has achieved on a few given dates. Your job is to predict the date and time at which the podcast will receive download number 27,182. Why this arbitrary number? It's as good as any other arbitrary number!

Use whatever means you want to formulate a prediction. Once you have it, wait until that time and then post a review of the Data Skeptic Podcast on iTunes. You don't even have to leave a good review! The review posted closest to the actual time at which this download occurs will win a free copy of Matthew Russell's "Mining the Social Web" courtesy of the Data Skeptic Podcast. "Price is Right" rules are in play: the winner is the person who posts their review closest to the actual time without going over.
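The "closest without going over" rule can be sketched in a few lines of Python. The reviewer names and timestamps below are hypothetical placeholders, and the function name is an assumption for this example rather than anything used to run the contest.

```python
from datetime import datetime


def pick_winner(review_times, actual_time):
    """Return the reviewer whose post is closest to actual_time without passing it."""
    # Reviews posted after the target download are disqualified ("going over").
    eligible = {name: t for name, t in review_times.items() if t <= actual_time}
    if not eligible:
        return None
    # Among the remaining reviews, the latest one is the closest.
    return max(eligible, key=eligible.get)


# Hypothetical reviewers and timestamps.
actual = datetime(2014, 9, 1, 12, 0)
reviews = {
    "alice": datetime(2014, 8, 30, 9, 0),
    "bob": datetime(2014, 9, 1, 11, 45),
    "carol": datetime(2014, 9, 1, 12, 5),   # five minutes late, so disqualified
}
print(pick_winner(reviews, actual))
```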

More information at

Direct download: contest.mp3
Category:statistics -- posted at: 9:49pm PDT

A discussion about conducting US presidential election polls helps frame a conversation about selection bias.
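A small simulation can show how selection bias distorts a poll. The sketch below assumes a made-up electorate in which half of voters have a landline and landline owners favor a candidate at 70% versus 30% for everyone else; all of these numbers and names are illustrative assumptions, not figures from the episode.

```python
import random


def simulate_poll(n_voters=20000, sample_size=1000, seed=0):
    """Compare a random poll with one that reaches only landline owners."""
    rng = random.Random(seed)
    voters = []
    for _ in range(n_voters):
        has_landline = rng.random() < 0.5
        p_support = 0.7 if has_landline else 0.3   # assumed preference gap
        voters.append((has_landline, rng.random() < p_support))

    true_support = sum(supports for _, supports in voters) / n_voters

    # Fair poll: every voter is equally likely to be sampled.
    fair_sample = rng.sample(voters, sample_size)
    # Biased poll: only voters with a landline can be reached.
    reachable = [v for v in voters if v[0]]
    biased_sample = rng.sample(reachable, sample_size)

    fair = sum(supports for _, supports in fair_sample) / sample_size
    biased = sum(supports for _, supports in biased_sample) / sample_size
    return true_support, fair, biased


truth, fair_estimate, biased_estimate = simulate_poll()
print(truth, fair_estimate, biased_estimate)
```

Under these assumptions the random sample lands near the true level of support, while the landline-only sample overstates it, no matter how many people are polled.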

Direct download: MINI_Selection_Bias.mp3
Category:miniepisode -- posted at: 1:00am PDT