Data Skeptic (miniepisode)

The multi-armed bandit problem is named with reference to slot machines (one-armed bandits). Given the chance to play from a pool of slot machines, all with unknown payout frequencies, how can you maximize your reward? If you knew in advance which machine was best, you would play exclusively that machine. Any other strategy will, on average, earn less payout, and the difference is called the "regret".

You can try each slot machine to learn about it, which we refer to as exploration. When you've spent enough time to be convinced you've identified the best machine, you can then double down and exploit that knowledge. But how do you best balance exploration and exploitation to minimize the regret of your play?

This mini-episode explores a few examples including restaurant selection and A/B testing to discuss the nature of this problem. In the end we touch briefly on Thompson sampling as a solution.
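
As a rough illustration of the exploration/exploitation balance, here is a minimal sketch of Thompson sampling for slot machines that either pay out or do not. The payout rates, the uniform Beta(1, 1) prior, and the function name are assumptions made for the example and do not come from the episode.

```python
import random

def thompson_sampling(true_rates, n_rounds, seed=0):
    """Play simulated slot machines with Thompson sampling.

    true_rates are hidden payout probabilities (hypothetical values used
    only to simulate pulls). Returns the number of pulls per machine.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    wins = [0] * n_arms    # payouts observed so far, per machine
    losses = [0] * n_arms  # non-payouts observed so far, per machine
    for _ in range(n_rounds):
        # Draw one plausible payout rate per machine from its Beta posterior
        # (assumed uniform prior), then play the machine with the best draw.
        draws = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(n_arms)]
        arm = max(range(n_arms), key=lambda i: draws[i])
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return [wins[i] + losses[i] for i in range(n_arms)]

pulls = thompson_sampling([0.2, 0.5, 0.7], n_rounds=2000)
print(pulls)  # most pulls should end up on the machine with the highest rate
```

Machines that look poor still get an occasional pull while their posterior is wide (exploration), but play concentrates on the apparent best machine as evidence accumulates (exploitation).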

Direct download: multi-armed-bandit.mp3
Category:miniepisode -- posted at: 12:00am PDT

[MINI] k-Nearest Neighbors

This episode explores the k-nearest neighbors algorithm, a supervised, non-parametric method that can be used for both classification and regression. The basic concept is that it uses some distance function on your dataset to find the $k$ closest other observations in the dataset and averages them to impute an unknown value or label an unlabelled datapoint.
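
The idea can be sketched in a few lines. This is only an illustration for the regression case, assuming Euclidean distance and made-up data; the function name is hypothetical.

```python
import math

def knn_predict(points, values, query, k):
    """Estimate a value for `query` by averaging its k nearest neighbors."""
    # Rank the known observations by Euclidean distance to the query point
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], query))
    nearest = order[:k]
    return sum(values[i] for i in nearest) / k

# Made-up observations: 2-D points, each with a known value
points = [(1, 1), (2, 2), (3, 3), (10, 10)]
values = [1.0, 2.0, 3.0, 10.0]

estimate = knn_predict(points, values, query=(2, 2), k=3)
print(estimate)  # 2.0, the mean of the three closest values
```

For classification, the average would be replaced by a majority vote among the k neighbors' labels.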

Direct download: MINI_knn.mp3
Category:miniepisode -- posted at: 10:51pm PDT

This mini-episode is a high-level explanation of the basic idea behind MapReduce, which is a fundamental concept in big data. The idea originates from a Google paper titled MapReduce: Simplified Data Processing on Large Clusters. This episode makes an analogy to tabulating paper voting ballots as a means of helping to explain how and why MapReduce is an important concept.
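
The ballot analogy can be sketched on a single machine as follows. The ballot boxes and candidate names are invented for the example; a real MapReduce system would run the map and reduce steps in parallel across many workers.

```python
from collections import defaultdict

def map_phase(ballot_box):
    # Each counter works on one box alone and emits a (candidate, 1) pair per ballot
    return [(candidate, 1) for candidate in ballot_box]

def shuffle(pairs):
    # Group the emitted pairs by key so that each candidate's tallies sit together
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Combine each candidate's list of 1s into a single total
    return {key: sum(vals) for key, vals in groups.items()}

# Hypothetical ballot boxes, one per polling place
boxes = [["Alice", "Bob", "Alice"], ["Bob", "Bob"], ["Alice"]]

mapped = [pair for box in boxes for pair in map_phase(box)]
totals = reduce_phase(shuffle(mapped))
print(totals)  # {'Alice': 3, 'Bob': 3}
```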

Direct download: MINI_Map_Reduce.mp3
Category:miniepisode -- posted at: 10:17pm PDT

More features are not always better! With an increasing number of features to consider, machine learning algorithms suffer from the curse of dimensionality, as they have a wider set and often sparser coverage of examples to consider. This episode explores a real life example of this as Kyle and Linhda discuss their thoughts on purchasing a home.

The term "curse of dimensionality" was coined by Richard Bellman, and it applies in several subtly different settings. This mini-episode discusses how it applies to machine learning.

This episode does not, however, discuss a slightly different version of the curse of dimensionality which appears in decision theoretic situations. Consider the game of chess. One must think ahead several moves in order to execute a successful strategy. However, thinking ahead another move requires a consideration of every possible move of every piece controlled, and every possible response one's opponent may take. The space of possible future states of the board grows exponentially with the horizon one wants to look ahead to. This is present in the notably useful Bellman equation.

Direct download: MINI_The_Curse_of_Dimensionality.mp3
Category:miniepisode -- posted at: 12:01am PDT

[MINI] Anscombe's Quartet

This mini-episode discusses Anscombe's Quartet, a series of four datasets which are clearly very different but share some nearly identical statistical properties with one another. For example, each of the four datasets has (nearly) the same mean and variance on both axes, as well as the same correlation coefficient and the same linear regression line.

The episode tries to add some context by imagining each of these datasets as data about a sports team, and why it can be important to look beyond basic summary statistics when exploring your dataset.

Direct download: MINI_Anscombes_Quartet.mp3
Category:miniepisode -- posted at: 1:00am PDT

Linhda and Kyle review a New York Times article titled How Your Hometown Affects Your Chances of Marriage. This article explores research about what correlates, county by county, with the likelihood of being married by age 26. Kyle and Linhda discuss some of the fine points of this research and the process of identifying factors for consideration.

Direct download: marriage-analysis.mp3
Category:miniepisode -- posted at: 12:07am PDT

This week's episode discusses z-scores, also known as standard scores. This score describes the distance (in standard deviations) that an observation is away from the mean of the population. A closely related topic is the 68-95-99.7 rule, which tells us that (approximately) 68% of a normally distributed population lies within one standard deviation of the mean, 95% within two, and 99.7% within three.

Kyle and Linhda discuss z-scores in the context of human height. They further discuss how a z-score can also describe how likely it is that some statistical result is due to chance. Thus, if the significance of a finding is said to be 3σ, a result that far from the mean would arise by chance only about 0.3% of the time if nothing but random variation were at work.
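
For anyone who wants to try the arithmetic, here is a small sketch. The mean and standard deviation used for height are placeholder values, not figures from the episode, and the function names are hypothetical.

```python
from statistics import NormalDist

def z_score(x, mean, std_dev):
    # Distance from the mean, measured in standard deviations
    return (x - mean) / std_dev

def share_within(z):
    # Fraction of a normal population within z standard deviations of the mean
    standard_normal = NormalDist()
    return standard_normal.cdf(z) - standard_normal.cdf(-z)

# Placeholder population: mean height 170 cm, standard deviation 10 cm
print(z_score(185, mean=170, std_dev=10))  # 1.5

# Reproduces the 68-95-99.7 rule (approximately 0.683, 0.954, 0.997)
for z in (1, 2, 3):
    print(z, round(share_within(z), 3))
```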

Direct download: z-scores.mp3
Category:miniepisode -- posted at: 10:08pm PDT

For our 50th episode we indulge a bit by cooking Linhda's previously mentioned "healthy" cornbread. This leads to a discussion of the statistical topic of overdispersion, in which the variance of some distribution is larger than what one's underlying model will account for.

Direct download: MINI_Cornbread_and_Overdispersion.mp3
Category:miniepisode -- posted at: 12:19am PDT

This episode gives an overview of some of the fundamental concepts of natural language processing, including stemming, n-grams, part-of-speech tagging, and the bag-of-words approach.
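
Two of these concepts are simple enough to sketch directly. The sentence is a made-up example and splitting on whitespace is a simplifying assumption; real tokenizers handle punctuation and casing more carefully.

```python
from collections import Counter

def ngrams(tokens, n):
    # Every run of n consecutive tokens, in order
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()

# Bag of words: word counts with word order thrown away
bag = Counter(tokens)
print(bag["the"])  # 2

# Bigrams (n = 2) keep a little of the local word order
bigrams = ngrams(tokens, 2)
print(bigrams[0])  # ('the', 'cat')
```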

Direct download: nlp.mp3
Category:miniepisode -- posted at: 11:44pm PDT

This episode explores what going wine tasting could teach us about using Markov chain Monte Carlo (MCMC).

Direct download: MINI_mcmc.mp3
Category:miniepisode -- posted at: 11:24pm PDT

This episode introduces the idea of a Markov Chain. A Markov Chain has a set of states describing a particular system, and a probability of moving from each state to every state it is connected to. Markov Chains are memoryless, meaning they don't rely on a long history of previous observations. The next state of the system depends only on the current state and the result of a random outcome.

Markov Chains are a useful method for describing non-deterministic systems. They are useful for describing the state and transition model of a stochastic system.

As examples of Markov Chains, we discuss stop light signals, bowling, and text prediction systems in light of whether or not they can be described with Markov Chains.
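
A Markov Chain is easy to simulate once its states and transition probabilities are written down. The stop light probabilities below are invented for illustration and are not taken from the episode.

```python
import random

# Hypothetical transition probabilities: transitions[current][next]
transitions = {
    "green": {"green": 0.8, "yellow": 0.2},
    "yellow": {"red": 1.0},
    "red": {"red": 0.7, "green": 0.3},
}

def simulate(start, steps, seed=0):
    """Walk the chain; each step looks only at the current state."""
    rng = random.Random(seed)
    state = start
    path = [state]
    for _ in range(steps):
        options = transitions[state]
        # Pick the next state at random, weighted by its transition probability
        state = rng.choices(list(options), weights=list(options.values()))[0]
        path.append(state)
    return path

path = simulate("green", steps=10)
print(path)
```

Note that the simulation never consults anything earlier than the current state, which is the memoryless property in action.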

Direct download: MINI_Markov_Chains.mp3
Category:miniepisode -- posted at: 12:00am PDT

This episode explores Ordinary Least Squares (OLS), a method for fitting a line to a dataset by minimizing the sum of the squared differences between the observed values and the values the line predicts.
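
For a single input variable, the best-fit line has a closed-form solution. The sketch below uses that textbook formula on made-up data; the function name is hypothetical.

```python
def ols_fit(xs, ys):
    """Slope and intercept of the least squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data lying exactly on y = 2x + 1
slope, intercept = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```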

Direct download: MINI_Ordinary_Least_Squares_Regression.mp3
Category:miniepisode -- posted at: 12:43am PDT

The k-means clustering algorithm assigns each point of an n-dimensional dataset a single label from a given number "k" of clusters. This mini-episode explores how the biological processes of Yoshi, our lilac-crowned amazon, might be a useful way of measuring where she sits when there are no humans around. Listen to find out how!
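
A bare-bones version of the usual iterative procedure (often called Lloyd's algorithm) might look like the sketch below. The points, the starting centroids, and the fixed iteration count are all assumptions made for the example.

```python
import math

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and moving centroids."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    labels = [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]
    return centroids, labels

# Made-up 2-D points forming two obvious groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = k_means(points, centroids=[(0, 0), (10, 10)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In practice the starting centroids are usually chosen at random, so different runs can settle on different clusterings.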

Direct download: MINI_k-means_clustering.mp3
Category:miniepisode -- posted at: 11:38pm PDT

The χ2 (Chi-Squared) test is a methodology for hypothesis testing. When one has categorical data, in the form of frequency counts or observations (e.g. Vegetarian, Pescetarian, and Omnivore), split into two or more categories (e.g. Male, Female), a question may arise such as "Are women more likely than men to be vegetarian?" or, put more accurately, "Does the frequency with which women report being vegetarian differ in a statistically significant way from the frequency with which men do?"
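
The test statistic compares the observed counts with the counts one would expect if the two variables were unrelated. Here is a sketch using invented survey counts; the function name is hypothetical.

```python
def chi_squared(table):
    """Chi-squared statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Invented counts. Rows: women, men. Columns: vegetarian, pescetarian, omnivore.
table = [[20, 10, 70],
         [30, 10, 60]]
stat, dof = chi_squared(table)
print(round(stat, 3), dof)  # 2.769 2
```

With 2 degrees of freedom, the critical value at the 5% level is about 5.99, so these invented counts would not show a statistically significant difference.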

Direct download: MINI_Chi-Squared_Test.mp3
Category:miniepisode -- posted at: 9:58pm PDT

When dealing with dynamic systems that are potentially undergoing constant change, it's helpful to describe what "state" they are in. In many applications the manner in which the state changes from one to another is not completely predictable, so there is uncertainty over how the system transitions from state to state. Further, in many applications, one cannot directly observe the true state, and thus we describe such situations as partially observable state spaces. This episode explores what this means and why it is important in the context of chess, poker, and the mood of Yoshi the lilac-crowned amazon parrot.

Direct download: MINI_Partially_Observable_State_Spaces.mp3
Category:miniepisode -- posted at: 11:41pm PDT

This episode introduces a high-level discussion on the topic of Data Provenance, with more MINI episodes to follow to get into specific topics. Thanks to listener Sara L, who wrote in to point out that the Data Skeptic Podcast has focused a lot on using data to be skeptical, but not necessarily on being skeptical of data.

Data Provenance is the concept of knowing the full origin of your dataset. Where did it come from? Who collected it? How was it collected? Does it combine independent sources or come from one single source? What are the error bounds on the way it was measured? These are just some of the questions one should ask to understand their data. After all, if the antecedent of an argument is built on dubious grounds, the consequent of the argument is equally dubious.

For a more technical discussion than what we get into in this mini-episode, I recommend A Survey of Data Provenance Techniques by authors Simmhan, Plale, and Gannon.

Direct download: MINI_Data_Provenance.mp3
Category:miniepisode -- posted at: 6:14pm PDT

In this quick holiday episode, we touch on how one would approach modeling the statistical distribution over the probability of belief in Santa Claus given age.

Direct download: MINI_Belief_in_Santa_Claus.mp3
Category:miniepisode -- posted at: 11:36pm PDT

Love and Data is the continued theme in this mini-episode as we discuss the game theory example of The Battle of the Sexes. In this textbook example, a couple must strategize about how to spend their Friday night. One partner prefers football games while the other partner prefers to attend the opera. Yet, each person would rather be at their non-preferred location so long as they are still with their spouse. So where should they decide to go?
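
One way to reason about the question is to look for outcomes where neither partner would gain by changing plans alone (pure-strategy Nash equilibria). The payoff numbers below are invented; they only encode the preferences described above.

```python
# Hypothetical payoffs: (partner A's choice, partner B's choice) -> (A's payoff, B's payoff)
# A prefers football, B prefers opera, and both dislike being apart.
payoffs = {
    ("football", "football"): (2, 1),
    ("football", "opera"): (0, 0),
    ("opera", "football"): (0, 0),
    ("opera", "opera"): (1, 2),
}
options = ["football", "opera"]

def pure_nash(payoffs, options):
    """Outcomes where neither player can do better by switching alone."""
    equilibria = []
    for a in options:
        for b in options:
            pay_a, pay_b = payoffs[(a, b)]
            a_is_best = all(pay_a >= payoffs[(alt, b)][0] for alt in options)
            b_is_best = all(pay_b >= payoffs[(a, alt)][1] for alt in options)
            if a_is_best and b_is_best:
                equilibria.append((a, b))
    return equilibria

equilibria = pure_nash(payoffs, options)
print(equilibria)  # [('football', 'football'), ('opera', 'opera')]
```

Both "together" outcomes are stable, which is exactly why the couple still has a coordination problem to solve.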

Direct download: MINIBattle_of_the_sexes.mp3
Category:miniepisode -- posted at: 6:34pm PDT

Economist Peter Backus put forward "The Girlfriend Equation" while working on his PhD - a probabilistic model attempting to estimate the likelihood of him finding a girlfriend. In this mini episode we explore the soundness of his model and also share some stories about how Linhda and Kyle met.

Direct download: MINI_The_Girlfriend_Equation.mp3
Category:miniepisode -- posted at: 12:03am PDT

What is randomness? How can we determine if some results are randomly generated or not? Why are random numbers important to us in our everyday life? These topics and more are discussed in this mini-episode on random numbers.

Many readers will be vaguely familiar with the idea of "X number of monkeys banging on Y number of typewriters for Z number of years" - the idea being that such a setup would produce random sequences of letters. The origin of this idea was the mathematician Borel, who was interested in whether or not 1,000,000 monkeys working for 10 hours per day might eventually reproduce the works of Shakespeare.

We explore this topic and provide some further details in the show notes which you can find over at dataskeptic.com

Direct download: MINI_Random_Numbers.mp3
Category:miniepisode -- posted at: 7:18pm PDT

This episode explores the basis of why we can trust encryption. Surprisingly, a discussion of looking up a word in the dictionary (binary search) and efficiently going wine tasting (the travelling salesman problem) helps introduce computational complexity as well as the P ?= NP question, which is paramount to the trustworthiness of RSA encryption.

With a high level foundation of computational theory, we talk about NP problems, and why prime factorization is a difficult problem, thus making it a great basis for the RSA encryption algorithm, which most of the internet uses to encrypt data.  Unlike the encryption scheme Ray Romano used in "Everybody Loves Raymond", RSA has nice theoretical foundations.

It should be noted that this episode gives good reason to trust that properly encrypted data, based on well-chosen public/private keys where the private key is not compromised, is safe. However, having safe encryption doesn't necessarily mean that the Internet is secure. Man in the Middle attacks and the Snowden revelations are topics for another day, not for this record-length "mini" episode.

Direct download: MINI_Is_the_Internet_Secure.mp3
Category:miniepisode -- posted at: 12:41am PDT

The t-test is this week's mini-episode topic. The t-test is a statistical testing procedure used to determine if the means of two datasets differ by a statistically significant amount. We discuss how a wine manufacturer might apply a t-test to determine if the sweetness, acidity, or some other property of two separate grape vines might differ in a statistically meaningful way.
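
As a sketch of the calculation, the snippet below computes the t statistic in its unequal-variance (Welch) form. Choosing that variant is an assumption, since the notes do not say which form of the test is meant, and the sweetness scores are invented.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """t statistic for the difference in means of two samples (Welch's form)."""
    # Squared standard error of each sample mean
    se_a = variance(a) / len(a)
    se_b = variance(b) / len(b)
    return (mean(a) - mean(b)) / math.sqrt(se_a + se_b)

# Invented sweetness scores for grapes from two vines
vine_a = [1, 2, 3, 4, 5]
vine_b = [2, 3, 4, 5, 6]

t_stat = welch_t(vine_a, vine_b)
print(t_stat)  # -1.0
```

The statistic would then be compared against a t distribution to judge whether a difference this large is statistically significant.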

Direct download: MINI_The_T-Test.mp3
Category:miniepisode -- posted at: 7:49pm PDT

A discussion about conducting US presidential election polls helps frame a conversation about selection bias.

Direct download: MINI_Selection_Bias.mp3
Category:miniepisode -- posted at: 1:00am PDT

Commute times and BBQ invites help frame a discussion about the statistical concept of confidence intervals.
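
A confidence interval for an average commute could be sketched like this. The commute times are invented, and the normal approximation is a simplifying assumption; with a sample this small, a t-based interval would be somewhat wider.

```python
import math
from statistics import NormalDist, mean, stdev

def confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for the mean of a sample."""
    # Number of standard errors needed to cover the requested level
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * stdev(sample) / math.sqrt(len(sample))
    center = mean(sample)
    return center - margin, center + margin

# Invented commute times in minutes
commutes = [30, 35, 28, 40, 32, 31, 36, 29]

low, high = confidence_interval(commutes)
print(round(low, 1), round(high, 1))
```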

Direct download: MINI_Confidence_Intervals.mp3
Category:miniepisode -- posted at: 10:47pm PDT

A discussion about getting ready in the morning, negotiating a used car purchase, and selecting the best AirBnB place to stay at helps frame a conversation about the decision theoretic principle known as the Value of Information equation.

Direct download: MINI_Value_of_Information.mp3
Category:miniepisode -- posted at: 12:29am PDT

Linhda and Kyle talk about Decision Tree Learning in this miniepisode. Decision Tree Learning is the algorithmic process of trying to generate an optimal decision tree to properly classify or forecast some future unlabeled element by following each step in the tree.

Direct download: MINI_Decision_Tree_Learning.mp3
Category:miniepisode -- posted at: 12:49am PDT

Our topic for this week is "noise" as in signal vs. noise. This is not a signal processing discussion, but rather a brief introduction to how the word noise is used to describe how much information in a dataset is useless (as opposed to useful).

Also, Kyle announces having recently had the pleasure of appearing as a guest on The Conspiracy Skeptic Podcast to discuss The Bible Code. Please check out this other fine program for this and its many other great episodes.

Direct download: MINI_Noise.mp3
Category:miniepisode -- posted at: 6:00am PDT

In this week's mini episode, Linhda and Kyle discuss Ant Colony Optimization - a numerical / stochastic optimization technique which models its search after the process ants employ in using random walks to find a goal (food) and then leaving a pheromone trail on their walk back to the nest. We even find some way of working the city of San Francisco and running a restaurant into the discussion.

Direct download: MINI_Ant_Colony_Optimization.mp3
Category:miniepisode -- posted at: 6:00am PDT

This episode loosely explores the topic of Experimental Design, including hypothesis testing, the importance of statistical tests, and both an everyday example and a business example.

Direct download: MINI_Experimental_Design.mp3
Category:miniepisode -- posted at: 6:00am PDT

In this minisode, we discuss Bayesian Updating - the process by which one can calculate how likely a hypothesis is to be true given one's older / prior belief and all new evidence.
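
The update itself is a short calculation: multiply each prior belief by how well that hypothesis explains the new evidence, then rescale so the beliefs sum to one. The coin example and its numbers are invented for illustration.

```python
def bayes_update(prior, likelihoods):
    """Posterior belief in each hypothesis after seeing one piece of evidence."""
    # Prior belief times the probability of the evidence under each hypothesis
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    evidence = sum(unnormalized.values())
    return {h: value / evidence for h, value in unnormalized.items()}

# Invented example: is a coin fair, or biased towards heads?
prior = {"fair": 0.5, "biased": 0.5}
# Probability of seeing heads under each hypothesis
heads_likelihood = {"fair": 0.5, "biased": 0.8}

posterior = bayes_update(prior, heads_likelihood)
print(posterior)  # belief shifts towards "biased" after one head
```

Feeding the posterior back in as the next prior lets the belief keep updating as more evidence arrives.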

Direct download: MINI_Bayesian_Updating.mp3
Category:miniepisode -- posted at: 6:00am PDT

In this mini, we discuss p-values and their use in hypothesis testing, in the context of a hypothetical experiment on plant flowering, and end with a reference to the Particle Fever documentary and how statistical significance played a role.

Direct download: MINI_p-values_.mp3
Category:miniepisode -- posted at: 6:00am PDT

In this first mini-episode of the Data Skeptic Podcast, we define and discuss Type I and Type II errors (a.k.a. false positives and false negatives).

Direct download: type_i_type_ii.mp3
Category:miniepisode -- posted at: 6:00am PDT
