Data Skeptic (general)

Co-host Linh Da was in a biking accident after hitting a pothole. She sustained an injury that required stitches. This is the story of our quest to file a 311 complaint and track it through the City of Los Angeles's open data portal.

My guests this episode are Chelsea Ursaner (LA City Open Data Team), Ben Berkowitz (CEO and founder of SeeClickFix), and Russ Klettke (Editor of

Direct download: potholes.mp3
Category:general -- posted at: 8:24am PDT

Certain data mining algorithms (including k-means clustering and k-nearest neighbors) require a user defined parameter k. A user of these algorithms is required to select this value, which raises the questions: what is the "best" value of k that one should select to solve their problem?

This mini-episode explores the appropriate value of k to use when trying to estimate the cost of a house in Los Angeles based on the closests sales in it's area.

Direct download: the-elbow-method.mp3
Category:general -- posted at: 8:00am PDT

Today on Data Skeptic, Lachlan Gunn joins us to discuss his recent paper Too Good to be True. This paper highlights a somewhat paradoxical / counterintuitive fact about how unanimity is unexpected in cases where perfect measurements cannot be taken. With large enough data, some amount of error is expected.

The "Too Good to be True" paper highlights three interesting examples which we discuss in the podcast. You can also watch a lecture from Lachlan on this topic via youtube here.

Direct download: too_good_to_be_true.mp3
Category:general -- posted at: 8:00am PDT

How well does your model explain your data? R-squared is a useful statistic for answering this question. In this episode we explore how it applies to the problem of valuing a house. Aspects like the number of bedrooms go a long way in explaining why different houses have different prices. There's some amount of variance that can be explained by a model, and some amount that cannot be directly measured. R-squared is the ratio of the explained variance to the total variance. It's not a measure of accuracy, it's a measure of the power of one's model.

Direct download: r-squared.mp3
Category:general -- posted at: 8:00am PDT

Direct download: think_again.mp3
Category:general -- posted at: 7:30am PDT

[MINI] Multiple Regression

This episode is a discussion of multiple regression: the use of observations that are a vector of values to predict a response variable. For this episode, we consider how features of a home such as the number of bedrooms, number of bathrooms, and square footage can predict the sale price.

Unlike a typical episode of Data Skeptic, these show notes are not just supporting material, but are actually featured in the episode.

The site Redfin gratiously allows users to download a CSV of results they are viewing. Unfortunately, they limit this extract to 500 listings, but you can still use it to try the same approach on your own using the download link shown in the figure below.

Direct download: multiple_regressions.mp3
Category:general -- posted at: 7:30am PDT

Samuel Mehr joins us this week to share his perspective on why people are musical, where music comes from, and why it works the way it does. We discuss a number of empirical studies related to music and musical cognition, and dispense a few myths about music along the way.

Some of Sam's work discussed in this episode include Music in the Home: New Evidence for an Intergenerational Link,Two randomized trials provide no consistent evidence for nonmusical cognitive benefits of brief preschool music enrichment, and Miscommunication of science: music cognition research in the popular press. Additional topics we discussed are also covered in a Harvard Gazette article featuring Sam titled Muting the Mozart effect.

You can follow Sam on twitter via @samuelmehr.

Direct download: samuel.mp3
Category:general -- posted at: 7:30am PDT

This episode reviews the concept of k-d trees: an efficient data structure for holding multidimensional objects. Kyle gives Linhda a dictionary and asks her to look up words as a way of introducing the concept of binary search. We actually spend most of the episode talking about binary search before getting into k-d trees, but this is a necessary prerequisite.

Direct download: kd_trees.mp3
Category:general -- posted at: 7:13am PDT

Algorithms are pervasive in our society and make thousands of automated decisions on our behalf every day. The possibility of digital discrimination is a very real threat, and it is very plausible for discrimination to occur accidentally (i.e. outside the intent of the system designers and programmers). Christian Sandvig joins us in this episode to talk about his work and the concept of auditing algorithms.

Christian Sandvig (@niftyc) has a PhD in communications from Stanford and is currently an Associate Professor of Communication Studies and Information at the University of Michigan. His research studies the predictable and unpredictable effects that algorithms have on culture. His work exploring the topic of auditing algorithms has framed the conversation of how and why we might want to have oversight on the way algorithms effect our lives. His writing appears in numerous publications including The Social Media Collective, The Huffington Post, and Wired.

One of his papers we discussed in depth on this episode was Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms, which is well worth a read.

Direct download: auditing_algorithms.mp3
Category:general -- posted at: 7:00am PDT

Today's episode begins by asking how many left handed employees we should expect to be at a company before anyone should claim left handedness discrimination. If not lefties, let's consider eye color, hair color, favorite ska band, most recent grocery store used, and any number of characteristics could be studied to look for deviations from the norm in a company.

When multiple comparisons are to be made simultaneous, one must account for this, and a common method for doing so is with the Bonferroni Correction. It is not, however, a sure fire procedure, and this episode wraps up with a bit of skepticism about it.

Direct download: bonferroni-correction2.mp3
Category:general -- posted at: 7:00am PDT