Data Skeptic

Our episode this week begins with a correction. Back in episode 28 (Monkeys on Typewriters), Kyle made some bold claims about the probability that monkeys banging on typewriters might produce the entire works of Shakespeare by chance. The proof shown in the show notes turned out to be a bit dubious and Dave Spiegel joins us in this episode to set the record straight.

In addition to that, our discussion explores a number of interesting topics in astronomy and astrophysics. This includes a paper Dave wrote with Ed Turner titled "Bayesian analysis of the astrobiological implications of life's early emergence on Earth" as well as exoplanet discovery.

Direct download: Shakespeare-abiogenesis-exoplanets.mp3
Category:general -- posted at: 12:30am PDT

There are several factors that are important to selecting an appropriate sample size and dealing with small samples. The most important questions are around representativeness - how well does your sample represent the total population and capture all it's variance?

Linhda and Kyle talk through a few examples including elections, picking an Airbnb, produce selection, and home shopping as examples of cases in which the amount of observations one has are more or less important depending on how complex the underlying system one is observing is.

Direct download: sample_sizes.mp3
Category:general -- posted at: 11:47pm PDT

There's an old adage which says you cannot fit a model which has more parameters than you have data. While this is often the case, it's not a universal truth. Today's guest Jake VanderPlas explains this topic in detail and provides some excellent examples of when it holds and doesn't. Some excellent visuals articulating the points can be found on Jake's blog Pythonic Perambulations, specifically on his post The Model Complexity Myth.

We also touch on Jake's work as an astronomer, his noteworthy open source contributions, and forthcoming book (currently available in an Early Edition) Python Data Science Handbook.

Direct download: model_complexity_myth.mp3
Category:general -- posted at: 12:00am PDT

There are many occasions in which one might want to know the distance or similarity between two things, for which the means of calculating that distance is not necessarily clear. The distance between two points in Euclidean space is generally straightforward, but what about the distance between the top of Mount Everest to the bottom of the ocean? What about the distance between two sentences?

This mini-episode summarizes some of the considerations and a few of the means of calculating distance. We touch on Jaccard Similarity, Manhattan Distance, and a few others.

Direct download: distance_measures.mp3
Category:general -- posted at: 12:00am PDT

1