Data Skeptic (general)
When working with time series data, several diagnostics help reveal the structure of the data. The autocorrelation function, plotted as a correlogram, shows how a given observation relates to recent preceding observations. A purely random process (like lottery numbers) would show values near zero, while temperature (our topic in this episode) correlates highly with the temperature of recent days.
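To make the contrast concrete, here is a minimal sketch of the sample autocorrelation at a few lags. The two series are made up for illustration (they are not the Chapel Hill data): one is independent random draws, the other drifts slowly from day to day the way temperature tends to.

```python
import random

def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    # Denominator: total sum of squared deviations from the mean.
    denom = sum((x - mean) ** 2 for x in series)
    # Numerator: products of deviations for observations `lag` steps apart.
    num = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return num / denom

random.seed(0)
# Hypothetical data: independent draws (lottery-like) ...
noise = [random.random() for _ in range(500)]
# ... versus a series where each value stays close to the previous one.
drift = [0.0]
for _ in range(499):
    drift.append(0.9 * drift[-1] + random.gauss(0, 1))

# Each printed row is one point on the two correlograms.
for lag in range(1, 4):
    print(lag,
          round(autocorrelation(noise, lag), 3),
          round(autocorrelation(drift, lag), 3))
```

Plotting these values against the lag gives the correlogram: the random series should hover around zero, while the drifting series should start high and decay gradually.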
 
See the show notes with details about Chapel Hill, NC weather data by visiting:
 
 
Direct download: acf.mp3
Category:general -- posted at: 8:00am PDT

This week I spoke with Elham Shaabani and Paulo Shakarian (@PauloShakASU) about their recent paper Early Identification of Violent Criminal Gang Members (also available on arXiv). In this paper, they use social network analysis techniques and machine learning to provide early detection of known criminal offenders who are in a high-risk group for committing violent crimes in the future. Their techniques outperform existing techniques used by the police. Elham and Paulo are part of the Cyber-Socio Intelligent Systems (CySIS) Lab.

Direct download: predicting-violent-offenders.mp3
Category:general -- posted at: 8:00am PDT

A dinner party at Data Skeptic HQ helps teach the uses of fractional factorial design for studying 2-way interactions.
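As a rough illustration of the idea (not the specific design discussed in the episode), the sketch below builds a two-level half-fraction design: three factors are varied over every combination, and a fourth is set to the product of the other three. The function name and the choice of four factors are assumptions made for the example.

```python
from itertools import product

def half_fraction(n_base):
    """Build a two-level half-fraction design: n_base full-factorial
    factors plus one extra factor set to the product of the base columns."""
    runs = []
    for levels in product((-1, 1), repeat=n_base):
        extra = 1
        for level in levels:
            extra *= level
        runs.append(levels + (extra,))
    return runs

# 8 runs cover 4 factors; a full factorial would need 16.
design = half_fraction(3)
for run in design:
    print(run)
```

The saving comes at a price: in this particular design each 2-way interaction is aliased with another 2-way interaction (for example, the first two factors' interaction column is identical to that of the last two), so the experimenter has to decide which interactions are plausible before interpreting the results.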

Direct download: Fractional_factorial_design.mp3
Category:general -- posted at: 8:00am PDT

Cheng-tao Chu (@chengtao_chu) joins us this week to discuss his perspective on common mistakes and pitfalls in machine learning. This episode is filled with sage advice for beginners and intermediate users of machine learning, and possibly some good reminders for experts as well. Our discussion parallels his recent blog post Machine Learning Done Wrong.

Cheng-tao Chu is an entrepreneur who has worked at many well-known Silicon Valley companies. His paper Map-Reduce for Machine Learning on Multicore is the basis for Apache Mahout. His most recent endeavor has just emerged from stealth, so please check out OneInterview.io.

Direct download: machine_learning_done_wrong.mp3
Category:general -- posted at: 8:00am PDT

Co-host Linh Da was in a biking accident after hitting a pothole. She sustained an injury that required stitches. This is the story of our quest to file a 311 complaint and track it through the City of Los Angeles's open data portal.

My guests this episode are Chelsea Ursaner (LA City Open Data Team), Ben Berkowitz (CEO and founder of SeeClickFix), and Russ Klettke (Editor of pothole.info).

Direct download: potholes.mp3
Category:general -- posted at: 8:24am PDT

Certain data mining algorithms (including k-means clustering and k-nearest neighbors) require a user-defined parameter k. A user of these algorithms is required to select this value, which raises the question: what is the "best" value of k that one should select to solve their problem?

This mini-episode explores the appropriate value of k to use when trying to estimate the cost of a house in Los Angeles based on the closest sales in its area.
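One common way to compare candidate values of k is to hold out each sale in turn, predict its price from the k nearest remaining sales, and see which k gives the lowest average error. The sketch below does this on made-up data; the coordinates, prices, and function names are all hypothetical, and this is not necessarily the procedure used in the episode.

```python
import math
import random

def knn_predict(train, location, k):
    """Average the prices of the k sales closest to `location`."""
    nearest = sorted(train, key=lambda sale: math.dist(sale[0], location))[:k]
    return sum(price for _, price in nearest) / k

def loo_error(sales, k):
    """Mean absolute leave-one-out prediction error for a given k."""
    total = 0.0
    for i, (location, price) in enumerate(sales):
        rest = sales[:i] + sales[i + 1:]
        total += abs(knn_predict(rest, location, k) - price)
    return total / len(sales)

random.seed(1)
# Hypothetical sales: ((x, y) position on a map, price in $1000s).
sales = []
for _ in range(60):
    x, y = random.uniform(0, 10), random.uniform(0, 10)
    sales.append(((x, y), 500 + 30 * x + 20 * y + random.gauss(0, 40)))

errors = {k: loo_error(sales, k) for k in range(1, 16)}
best_k = min(errors, key=errors.get)
print(best_k, round(errors[best_k], 1))
```

Plotting `errors` against k typically shows the error falling quickly and then flattening or rising again; the bend in that curve is a reasonable place to stop.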

Direct download: the-elbow-method.mp3
Category:general -- posted at: 8:00am PDT

Today on Data Skeptic, Lachlan Gunn joins us to discuss his recent paper Too Good to be True. This paper highlights a somewhat paradoxical / counterintuitive fact about how unanimity is unexpected in cases where perfect measurements cannot be taken. With large enough data, some amount of error is expected.

The "Too Good to be True" paper highlights three interesting examples which we discuss in the podcast. You can also watch a lecture from Lachlan on this topic via YouTube here.
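The intuition can be sketched with a toy Bayesian calculation. This is a simplified model made up for illustration, not the paper's own analysis: assume a small prior chance `p_failure` that the process is systematically broken (and therefore always returns the same answer), and otherwise that each measurement independently reports that answer with probability `p_correct`. Both parameter values are arbitrary assumptions.

```python
def prob_reliable(n_agree, p_correct=0.9, p_failure=0.01):
    """Posterior probability that the measurement process is sound,
    given n unanimous results, under a toy two-hypothesis model."""
    # Sound process: n imperfect, independent measurements all happen to agree.
    sound = (1 - p_failure) * p_correct ** n_agree
    # Broken process: agreement is guaranteed regardless of the truth.
    broken = p_failure * 1.0
    return sound / (sound + broken)

for n in (1, 5, 20, 50, 100):
    print(n, round(prob_reliable(n), 4))
```

Under these assumptions, each additional unanimous result makes the "sound process" explanation less likely relative to the "something is systematically wrong" explanation, which is the counterintuitive effect described above.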

Direct download: too_good_to_be_true.mp3
Category:general -- posted at: 8:00am PDT

How well does your model explain your data? R-squared is a useful statistic for answering this question. In this episode we explore how it applies to the problem of valuing a house. Aspects like the number of bedrooms go a long way in explaining why different houses have different prices. There's some amount of variance that can be explained by a model, and some amount that cannot be directly measured. R-squared is the ratio of the explained variance to the total variance. It's not a measure of accuracy; it's a measure of the explanatory power of one's model.
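As a small worked example, the sketch below fits a straight line to made-up bedroom counts and prices and computes R-squared as one minus the ratio of residual variation to total variation. For a least-squares line with an intercept, that equals the explained-to-total ratio described above. The data and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def r_squared(ys, preds):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - p) ** 2 for y, p in zip(ys, preds))
    return 1 - ss_resid / ss_total

# Hypothetical houses: bedrooms and sale price in $1000s.
bedrooms = [1, 2, 2, 3, 3, 4, 4, 5]
prices = [310, 420, 390, 540, 500, 610, 680, 720]

slope, intercept = fit_line(bedrooms, prices)
preds = [intercept + slope * b for b in bedrooms]
print(round(r_squared(prices, preds), 3))
```

A value near 1 means bedroom count accounts for most of the price variation in this toy sample; it says nothing about how far off any single prediction is in dollar terms.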

Direct download: r-squared.mp3
Category:general -- posted at: 8:00am PDT

 
Direct download: think_again.mp3
Category:general -- posted at: 7:30am PDT

[MINI] Multiple Regression

This episode is a discussion of multiple regression: the use of observations that are a vector of values to predict a response variable. For this episode, we consider how features of a home such as the number of bedrooms, number of bathrooms, and square footage can predict the sale price.
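For readers who want to see the mechanics, here is a minimal sketch of ordinary least squares with several predictors, solved through the normal equations in plain Python. The listings are invented and the prices are generated from known coefficients with no noise, so the fit should recover them; none of this is the Redfin data, and the function names are assumptions for the example.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(rows, ys):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y."""
    xs = [[1.0] + list(row) for row in rows]  # prepend an intercept column
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical listings: (bedrooms, bathrooms, thousands of square feet).
homes = [(2, 1, 0.9), (3, 2, 1.4), (3, 1, 1.25), (4, 3, 2.1),
         (2, 2, 1.1), (5, 3, 2.6), (4, 2, 1.8)]
# Prices in $1000s, generated from known coefficients with no noise.
prices = [50 + 20 * bed + 15 * bath + 250 * area for bed, bath, area in homes]

coefs = fit(homes, prices)
print([round(c, 3) for c in coefs])  # intercept, bedrooms, bathrooms, area
```

With real listings the fit will not be exact, and a numerical library is a better choice than hand-rolled elimination; the point here is only that each coefficient estimates the price change for one feature while the others are held fixed.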

Unlike a typical episode of Data Skeptic, these show notes are not just supporting material, but are actually featured in the episode.

The site Redfin graciously allows users to download a CSV of the results they are viewing. Unfortunately, they limit this extract to 500 listings, but you can still use it to try the same approach on your own using the download link shown in the figure below.

Direct download: multiple_regressions.mp3
Category:general -- posted at: 7:30am PDT