Data Skeptic (general)

Today's episode is a reading of Isaac Asimov's The Machine that Won the War. I can't think of a story that's more appropriate for Data Skeptic.

Direct download: 2015_Holiday_Special.mp3
Category:general -- posted at: 12:00am PDT

In this interview with Aaron Halfaker of the Wikimedia Foundation, we discuss his research and career related to the study of Wikipedia. In his paper The Rise and Decline of an Open Collaboration Community, he highlights a decline in the number of active editors on Wikipedia which began in 2007. I asked Aaron about a variety of possible hypotheses for the phenomenon, in particular, how automated quality control tools that revert edits automatically could play a role. This led Aaron and his collaborators to develop Snuggle, an optimized interface to help Wikipedians better welcome newcomers to the community.

We discuss the details of these topics as well as ORES, which provides revision scoring as a service to any software developer who wants to consume the output of its machine-learning-based scoring.

You can find Aaron on Twitter as @halfak.

Direct download: wikipedia-revision-scoring-as-a-service.mp3
Category:general -- posted at: 6:30am PDT

Today's topic is term frequency-inverse document frequency (TF-IDF), a statistic for estimating how important a word or phrase is to one document within a set of documents.
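For listeners who like to see the arithmetic, here is a minimal sketch of one common TF-IDF variant. It assumes term frequency is the count divided by the document length and inverse document frequency is log(N / df); other weightings exist. The function name `tf_idf` and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document.

    Assumes tf = count / document length and idf = log(N / df),
    which is one common variant among several.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical toy corpus.
docs = [
    "the parrot ate the seed",
    "the dog ate the bone",
    "the parrot sang",
]
scores = tf_idf(docs)
```

A word such as "the" that appears in every document gets a score of zero, while a word found in only one document, such as "seed", scores highest in that document.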

Direct download: tf_-_idf.mp3
Category:general -- posted at: 8:45am PDT

Early astronomers could see several of the planets with the naked eye. The invention of the telescope allowed for further understanding of our solar system. The work of Isaac Newton allowed later scientists to predict the existence and position of Neptune, which was then observed almost exactly where predicted. It seemed only natural that a similar unknown body might explain anomalies in the orbit of Mercury, and thus began the search for the hypothesized planet Vulcan.

Thomas Levenson's book "The Hunt for Vulcan" is a narrative of the key scientific minds involved in the search and eventual refutation of an unobserved planet between Mercury and the sun. Thomas joins me in this episode to discuss his book and the fascinating story of the quest to find this planet.

During the discussion, we mention one of the contributions made by Urbain-Jean-Joseph Le Verrier: a set of complex calculations that enabled him to predict where to find the planet that would eventually be called Neptune. The calculus behind this work is difficult, and some of it is demonstrated in a Jupyter notebook I recently discovered from Paulo Marques titled The-Body Problem.

Thomas Levenson is a professor at MIT and head of its science writing program. He is the author of several books, including Einstein in Berlin and Newton and the Counterfeiter: The Unknown Detective Career of the World’s Greatest Scientist. He has also made ten feature-length documentaries (including a two-hour Nova program on Einstein) for which he has won numerous awards. In his most recent book, "The Hunt for Vulcan", he explores the century-spanning quest to explain the movement of the cosmos via theory and the role the hypothesized planet Vulcan played in the story.

Follow Thomas on Twitter @tomlevenson and check out his blog at https://inversesquare.wordpress.com/.

Pick up your copy of The Hunt for Vulcan at your local bookstore, preferred book buying place, or at the Penguin Random House site.

Direct download: the-hunt-for-vulcan.mp3
Category:general -- posted at: 12:00am PDT

Today's episode discusses the accuracy paradox. There are cases when one might prefer a less accurate model because it yields more predictive power or better captures the underlying causal factors describing the outcome variable you are interested in. This is especially relevant in machine learning when trying to predict rare events. We discuss how the accuracy paradox might apply if you were trying to predict the likelihood a person was a bird owner.
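To make the paradox concrete, here is a small sketch with invented numbers: suppose 30 of 1,000 people own a bird. The counts below are hypothetical and are not taken from the episode. One "model" always answers "not a bird owner", while the other flags 60 people and is right about 20 of them.

```python
def summarize(tp, fp, fn, tn):
    """Compute accuracy, recall, and precision from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {"accuracy": accuracy, "recall": recall, "precision": precision}

# Hypothetical population: 1,000 people, 30 of whom own a bird.
# Model 1 always predicts "not a bird owner".
always_no = summarize(tp=0, fp=0, fn=30, tn=970)
# Model 2 flags 60 people; 20 of them really are bird owners.
flagging = summarize(tp=20, fp=40, fn=10, tn=930)
```

Under these made-up counts the always-no model has the higher accuracy yet never finds a single bird owner, while the flagging model gives up a little accuracy in exchange for finding two thirds of them.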

Direct download: the-accuracy-paradox.mp3
Category:general -- posted at: 12:00am PDT

... or should this have been called data science from a neuroscientist's perspective? Either way, I'm sure you'll enjoy this discussion with Laurie Skelly. Laurie earned a PhD in Integrative Neuroscience from the Department of Psychology at the University of Chicago. In her life as a social neuroscientist, using fMRI to study the neural processes behind empathy and psychopathy, she learned the ropes of zooming in and out between the macroscopic and the microscopic -- how millions of data points come together to tell us something meaningful about human nature. She's currently at Metis Data Science, an organization that helps people learn the skills of data science to transition into industry.

In this episode, we discuss fMRI technology, Laurie's research studying empathy and psychopathy, as well as the skills and tools used in common between neuroscientists and data scientists. For listeners interested in more on this subject, Laurie recommended the blogs Neuroskeptic, Neurocritic, and Neuroecology.

We conclude the episode with a mention of the upcoming Metis Data Science San Francisco cohort which Laurie will be teaching. If anyone is interested in applying to participate, they can do so here.

Direct download: neuroscience.mp3
Category:general -- posted at: 12:00am PDT

A discussion of the expected number of cars at a stoplight frames today's discussion of the bias-variance tradeoff. The central idea of this concept relates to model complexity. A very simple model will likely perform similarly on training and testing data, but will have high bias, since its simplicity can prevent it from capturing the relationship between the covariates and the output. As a model grows more complex, it may capture more of the underlying structure, but the risk increases that it overfits the training data and therefore does not generalize (it has high variance). The tradeoff between minimizing bias and minimizing variance is an ongoing challenge for data scientists, and an important discussion for skeptics around how much we should trust models.
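The tradeoff can be sketched with a small simulation. This is an illustration under assumptions of my own, not an example from the episode: the true relationship is taken to be sin(x) with Gaussian noise, the simple model always predicts the training mean, and the flexible model copies the label of the nearest training point. In the usual definitions, bias is how far the average prediction sits from the truth, and variance is how much the prediction moves from one training set to the next.

```python
import math
import random

def true_f(x):
    # Assumed ground-truth relationship for this illustration.
    return math.sin(x)

def sample_training_set(rng, n=20, noise=0.3):
    xs = [rng.uniform(0, math.pi) for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0, noise)) for x in xs]

def mean_model(train):
    # Very simple model: ignore x and always predict the training mean.
    avg = sum(y for _, y in train) / len(train)
    return lambda x: avg

def nearest_neighbor_model(train):
    # Very flexible model: copy the label of the closest training point.
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def bias_variance(make_model, x0, rng, trials=2000):
    """Estimate squared bias and variance of predictions at x0
    across many independently drawn training sets."""
    preds = [make_model(sample_training_set(rng))(x0) for _ in range(trials)]
    mean_pred = sum(preds) / trials
    bias_sq = (mean_pred - true_f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / trials
    return bias_sq, variance

rng = random.Random(0)
x0 = math.pi / 2  # the peak of sin(x), far from the overall mean
simple_bias, simple_var = bias_variance(mean_model, x0, rng)
flexible_bias, flexible_var = bias_variance(nearest_neighbor_model, x0, rng)
```

In this setup the mean-only model shows the larger squared bias and the nearest-neighbor model shows the larger variance.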

Direct download: bias-variance-tradeoff.mp3
Category:general -- posted at: 12:00am PDT

The recent opinion piece Big Data Doesn't Exist on TechCrunch by Slater Victoroff is an interesting discussion about the usefulness of data both big and small. Slater joins me this episode to expand on the ideas in that piece.

Slater Victoroff is CEO of indico Data Solutions, a company whose services turn raw text and image data into human insight. He and his co-founders studied at Olin College of Engineering, where indico was born. indico was then accepted into the "Techstars Accelerator Program" in the Fall of 2014 and went on to raise $3M in seed funding. His recent essay "Big Data Doesn't Exist" received a lot of traction on TechCrunch, and I have invited Slater to join me today to discuss his perspective and touch on a few topics in the machine learning space as well.

Direct download: big-data-doesnt-exist.mp3
Category:general -- posted at: 12:00am PDT

The degree to which two variables change together can be calculated in the form of their covariance. This value can be normalized to the correlation coefficient, which has the advantage of being a unitless measure bounded between -1 and 1. This episode discusses how we arrive at these values and why they are important.
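Here is a minimal sketch of both calculations using the sample (n - 1) form of covariance and the Pearson correlation coefficient. The data are hypothetical hours studied and test scores; the second list simply re-expresses the hours in minutes to show the effect of changing units.

```python
import math

def covariance(xs, ys):
    # Sample covariance (divides by n - 1).
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    # Pearson correlation: covariance divided by both standard deviations.
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 64, 70]
minutes = [h * 60 for h in hours]  # same quantity in different units

cov_hours = covariance(hours, score)
cov_minutes = covariance(minutes, score)
corr_hours = correlation(hours, score)
corr_minutes = correlation(minutes, score)
```

Switching from hours to minutes multiplies the covariance by 60, while the correlation coefficient stays the same, which is the practical benefit of the normalization.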

Direct download: covariance_and_correlation.mp3
Category:general -- posted at: 12:00am PDT

Today's guest is Cameron Davidson-Pilon. Cameron has a master's degree in quantitative finance from the University of Waterloo. Think of it as statistics on stock markets. For the last two years he's been the team lead of data science at Shopify. He's the founder of dataorigami.net, which produces screencasts teaching methods and techniques of applied data science. He's also the author of the just-released print book Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference, which you can also get in digital form.

This episode focuses on the topic of Bayesian A/B testing, which spans just one chapter of the book. Related to today's discussion is the Data Origami post The class imbalance problem in A/B testing.
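For a flavor of the idea, here is a minimal standard-library sketch of a Bayesian A/B comparison. It is not necessarily the approach taken in the book: it assumes a uniform Beta(1, 1) prior on each conversion rate, so each posterior is also a Beta distribution, and it estimates by simulation the probability that variant B's rate exceeds variant A's. The function name and the conversion counts are hypothetical.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a conversion rate: Beta(1 + successes, 1 + failures).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical experiment: 40 of 1,000 visitors convert on A, 60 of 1,000 on B.
p_better = prob_b_beats_a(conv_a=40, n_a=1000, conv_b=60, n_b=1000)
# Identical results on both variants should give a probability near one half.
p_tie = prob_b_beats_a(conv_a=50, n_a=1000, conv_b=50, n_b=1000)
```

The output is a direct statement of belief, such as "B is very likely better than A", rather than a p-value.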

Lastly, Data Skeptic will be giving away a copy of the print version of the book to one lucky listener who has a US-based delivery address. To participate, you'll need to write a review of any site, book, course, or podcast of your choice on datasciguide.com. After it goes live, tweet a link to it with the hashtag #WinDSBook to be given an entry in the contest. This contest will end November 20th, 2015, at which time I'll draw a single winner at random and contact them for delivery details via direct message on Twitter.

Direct download: bayesian-methods-for-hackers.mp3
Category:general -- posted at: 12:00am PDT

The central limit theorem is an important statistical result which states that, typically, the mean of a large enough set of independent trials is approximately normally distributed. This episode explores how this might be used to determine if an Amazon parrot like Yoshi produces more or less waste than an African Grey, under the assumption that the individual distributions are not normal.
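A quick simulation can illustrate the theorem. The sketch below assumes each individual observation follows an exponential distribution, which is heavily skewed and clearly not normal; that choice is purely illustrative and has nothing to do with actual parrot data. It then compares the skewness of single observations with the skewness of means taken over 50 observations at a time.

```python
import random
import statistics

def sample_means(rng, sample_size, n_samples=5000):
    # Each observation is exponential with mean 1 (skewed, not normal).
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

def skewness(values):
    # Standardized third moment; roughly 0 for a symmetric distribution.
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)

rng = random.Random(0)
raw = sample_means(rng, sample_size=1)     # individual observations
means = sample_means(rng, sample_size=50)  # means of 50 observations each
raw_skew = skewness(raw)
mean_skew = skewness(means)
```

The individual observations are strongly skewed, while the sample means are much closer to symmetric and cluster around the true mean of 1.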

Direct download: Central_Limit_Theorem.mp3
Category:general -- posted at: 12:00am PDT

Today's guest is Chris Hofstader (@gonz_blinko), an accessibility researcher and advocate, as well as an activist for causes such as improving access to information for blind and vision impaired people. His background in computer programming enabled him to be the leader of JAWS, a Windows program that allowed people with a visual impairment to read their screen either through text-to-speech or a refreshable braille display. He's the Managing Member of 3 Mouse Technology. He's also a frequent blogger primarily at chrishofstader.com.

For web developers and site owners, Chris recommends two tools to help test for accessibility issues: tenon.io and dqtech.co.

A guest post from Chris titled Skepticism and Disability appeared on the Skepchick blog, which led to the formation of the sister site Skeptibility.

In a discussion of skepticism and favorite podcasts, Chris mentioned a number of great shows, most notably The Pod Delusion, to which he was a contributor. Chris has also appeared on The Atheist Nomads.

Lastly, a shout out from Chris to musician Shelley Segal whom he hosted just before the date of recording of this episode. Her music can be found on her site or via bandcamp.

Direct download: accessible-technology.mp3
Category:general -- posted at: 12:00am PDT

Our episode this week begins with a correction. Back in episode 28 (Monkeys on Typewriters), Kyle made some bold claims about the probability that monkeys banging on typewriters might produce the entire works of Shakespeare by chance. The proof shown in the show notes turned out to be a bit dubious and Dave Spiegel joins us in this episode to set the record straight.

In addition to that, our discussion explores a number of interesting topics in astronomy and astrophysics. This includes a paper Dave wrote with Ed Turner titled "Bayesian analysis of the astrobiological implications of life's early emergence on Earth" as well as exoplanet discovery.

Direct download: Shakespeare-abiogenesis-exoplanets.mp3
Category:general -- posted at: 12:30am PDT

There are several factors that are important to selecting an appropriate sample size and dealing with small samples. The most important questions are around representativeness: how well does your sample represent the total population and capture all its variance?

Linhda and Kyle talk through a few examples, including elections, picking an Airbnb, produce selection, and home shopping, in which the number of observations one has is more or less important depending on the complexity of the underlying system being observed.

Direct download: sample_sizes.mp3
Category:general -- posted at: 11:47pm PDT

There's an old adage which says you cannot fit a model which has more parameters than you have data. While this is often the case, it's not a universal truth. Today's guest Jake VanderPlas explains this topic in detail and provides some excellent examples of when it holds and doesn't. Some excellent visuals articulating the points can be found on Jake's blog Pythonic Perambulations, specifically on his post The Model Complexity Myth.

We also touch on Jake's work as an astronomer, his noteworthy open source contributions, and forthcoming book (currently available in an Early Edition) Python Data Science Handbook.

Direct download: model_complexity_myth.mp3
Category:general -- posted at: 12:00am PDT

There are many occasions in which one might want to know the distance or similarity between two things, for which the means of calculating that distance is not necessarily clear. The distance between two points in Euclidean space is generally straightforward, but what about the distance between the top of Mount Everest and the bottom of the ocean? What about the distance between two sentences?

This mini-episode summarizes some of the considerations and a few of the means of calculating distance. We touch on Jaccard Similarity, Manhattan Distance, and a few others.
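As a rough sketch, here are three of these measures in plain Python: Euclidean distance, Manhattan distance, and Jaccard similarity. The example points and sentences are made up, and treating a sentence as a set of lowercase words is just one simple way to compare text.

```python
import math

def euclidean(p, q):
    # Straight-line distance between two points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of absolute differences along each axis ("city block" distance).
    return sum(abs(a - b) for a, b in zip(p, q))

def jaccard_similarity(a, b):
    # Size of the intersection divided by the size of the union.
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)

# Hypothetical examples.
point_distance = euclidean((0, 0), (3, 4))
block_distance = manhattan((0, 0), (3, 4))
s1 = "the parrot ate the seed".split()
s2 = "the dog ate the bone".split()
sentence_similarity = jaccard_similarity(s1, s2)
```

The two sentences share two of six distinct words, so their Jaccard similarity is one third. Note that Jaccard is a similarity, so one minus the value is often used when a distance is needed.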

Direct download: distance_measures.mp3
Category:general -- posted at: 12:00am PDT

ContentMine is a project which provides the tools and workflow to convert scientific literature into machine-readable and machine-interpretable data in order to facilitate better and more effective access to the accumulated knowledge of humankind. The program's founder Peter Murray-Rust joins us this week to discuss ContentMine. Our discussion covers the project, the scientific publication process, copyright, and several other interesting topics.

Direct download: contentmine.mp3
Category:general -- posted at: 12:30am PDT

Today's mini-episode explains the distinction between structured and unstructured data, and debates which of these categories best describes recipes.

Direct download: structured_unstructured.mp3
Category:general -- posted at: 11:22pm PDT

Yusan Lin shares her research on using data science to explore the fashion industry in this episode. She has applied techniques from data mining, natural language processing, and social network analysis to explore who the innovators in the fashion world are and how their influence affects other designers.

If you found this episode interesting and would like to read more, Yusan's papers Text-Generated Fashion Influence Model: An Empirical Study on Style.com and The Hidden Influence Network in the Fashion Industry are worth reading.

Direct download: yusan_lin.mp3
Category:general -- posted at: 12:01am PDT

[MINI] PageRank

PageRank is the algorithm most famous for being one of the original innovations that made Google stand out as a search engine. It was defined in the classic paper The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Larry Page. While this algorithm clearly impacted web searching, it has also been useful in a variety of other applications. This episode presents a high-level description of this algorithm and how it might apply when trying to establish who writes the most influential academic papers.
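The core idea can be sketched as a simple power iteration. This is an illustrative simplification rather than Google's production algorithm: it assumes a damping factor of 0.85, a fixed number of iterations, and that a node with no outgoing links spreads its rank evenly over all nodes. The tiny citation graph is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each node to the list of nodes it links to (or cites)."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives a small baseline share of rank.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly among the nodes it cites.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical citation graph: each paper lists the papers it cites.
citations = {
    "paper_a": ["paper_c"],
    "paper_b": ["paper_c"],
    "paper_c": [],
    "paper_d": ["paper_c", "paper_a"],
}
ranks = pagerank(citations)
most_influential = max(ranks, key=ranks.get)
```

Here paper_c ends up with the highest score because three other papers cite it, and paper_a outranks paper_b because it picks up a citation of its own.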

Direct download: pagerank.mp3
Category:general -- posted at: 12:01am PDT

Crypto

How do people think rationally about small probability events?

What is the optimal statistical process by which one can update their beliefs in light of new evidence?

This episode of Data Skeptic explores questions like these as Kyle consults a cast of previous guests and experts to try to answer the question "What is the probability, however small, that Bigfoot is real?"

Direct download: Data_Skeptic_-_Crypto.mp3
Category:general -- posted at: 2:34am PDT

A recent episode of The Skeptics' Guide to the Universe included a slight rant by Dr. Novella and the rogues about a shortcoming in operating systems. This episode explores why such a (seemingly obvious) flaw might make sense from an engineering perspective, and how data science might be the solution.

In this solo episode, Kyle proposes the concept of "annoyance mining": the idea that, with proper logging and enough feedback, data scientists could be given a dataset from which they can automatically detect bugs, flaws, and annoyances in software and other systems, along with potential improvements that could make those systems better.

As system complexity grows, it seems that an abstraction like this might be required in order to maintain an effective development cycle. This episode is a bit of a soap box for Kyle as he explores why and how we might track an appropriate amount of data to be able to make better software and systems more suited to their users.

Direct download: annoyance_mining.mp3
Category:general -- posted at: 11:18pm PDT

This episode contains coverage of the 2015 Data Fest hosted at UCLA. Data Fest is an analysis competition that gives teams of students 48 hours to explore a new dataset and present novel findings. This year, data from Edmunds.com was provided, and students competed in three categories: best recommendation, best use of external data, and best visualization.

Direct download: Data_Fest_2015.mp3
Category:general -- posted at: 11:55pm PDT

Nicole Goebel joins us this week to share her experiences in oceanography studying phytoplankton and other aspects of the ocean and how data plays a role in that science.

We also discuss Thinkful where Nicole and I are both mentors for the Introduction to Data Science course.

Last but not least, check out Nicole's blog Data Science Girl and the videos Kyle mentioned on her YouTube channel, including one on the diversity of phytoplankton and how that changes in time and space.

Direct download: Oceanography_and_Data_Science.mp3
Category:general -- posted at: 12:06am PDT

I'm joined this week by Alex Boklin to explore the topic of magical thinking, especially in the context of Rhonda Byrne's "The Secret", and the similarities it bears to The Global Consciousness Project (GCP). The GCP puts forward the hypothesis that random number generators exhibit statistically significant changes as a result of major world events.

Direct download: The_Secret_and_the_Global_Consciousness_Project.mp3
Category:general -- posted at: 12:05am PDT

This mini-episode discusses the technique called cross validation, a process by which one randomly divides a dataset into several partitions. Next, (typically) one partition is held out, and the rest are used to train some model. The held-out set can then be used to validate how well the model does at describing or predicting new data, and the process is usually repeated so that each partition takes a turn as the held-out set.
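Here is a minimal sketch of k-fold cross validation using only the standard library. To keep the focus on the partitioning, the "model" is deliberately trivial and always predicts the mean of its training labels; the function names and the toy data are made up for illustration.

```python
import random

def k_fold_indices(n_items, k, seed=0):
    """Shuffle the indices and deal them into k folds of nearly equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(ys, k=5):
    """Mean squared error on each held-out fold for a trivial model
    that always predicts the mean of its training labels."""
    errors = []
    for held_out in k_fold_indices(len(ys), k):
        held = set(held_out)
        # Train on everything except the held-out fold.
        train_y = [y for i, y in enumerate(ys) if i not in held]
        prediction = sum(train_y) / len(train_y)
        # Validate on the held-out fold only.
        mse = sum((ys[i] - prediction) ** 2 for i in held_out) / len(held_out)
        errors.append(mse)
    return errors

# Hypothetical labels.
ys = [float(i % 7) for i in range(20)]
folds = k_fold_indices(len(ys), k=5)
fold_errors = cross_validate(ys, k=5)
average_error = sum(fold_errors) / len(fold_errors)
```

Averaging the per-fold errors gives a single estimate of how the model might perform on data it has not seen.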

Direct download: MINI_Cross_Validation.mp3
Category:general -- posted at: 7:51am PDT

This episode features a discussion with statistics PhD student Zach Seeskin about a project he was involved in as part of the Eric and Wendy Schmidt Data Science for Social Good Summer Fellowship.  The project involved exploring the relationship (if any) between streetlight outages and crime in the City of Chicago.  We discuss how the data was accessed via the City of Chicago data portal, how the analysis was done, and what correlations were discovered in the data.  Won't you listen and hear what was found? 

Direct download: Streetlight_Outage_and_Crime_Rate_Analysis_with_Zach_Seeskin.mp3
Category:general -- posted at: 6:00am PDT

In this week's episode, we discuss applied solutions to big data problems with big data engineer Jay Shankar. The episode explores approaches and design philosophies for solving real-world big data business problems, as well as the wide array of tools available.

Direct download: Data_Skeptic_Podcast_-_Big_Data_Tools.mp3
Category:general -- posted at: 6:00am PDT