Data Skeptic

New York State approved the use of automated speed cameras within a set distance of schools. Tim Schmeier analyzed publicly available data related to these cameras as part of a project at the NYC Data Science Academy. Tim's work draws on several open data sets to ask the question: are the speed cameras succeeding in their intended purpose of increasing public safety near schools? What he found using open data may surprise you.

You can read Tim's write-up, titled Speed Cameras: Revenue or Public Safety?, on the NYC Data Science Academy blog. His original write-up, reproducible analysis, and figures are a great complement to this episode.

For his benevolent recommendation, Tim suggests listeners visit Maddie's Fund - a data-driven charity devoted to helping achieve and sustain a no-kill pet nation. And for his self-serving recommendation, Tim Schmeier will very shortly be on the job market. If you, your employer, or someone you know is looking for data science talent, you can reach Tim at his Gmail account, which is timothy.schmeier at gmail dot com.

Direct download: NYC_Speed_Camera_Analysis.mp3
Category:open data -- posted at: 10:16pm PDT

The k-means clustering algorithm assigns each point in an n-dimensional dataset to one of a given number "k" of clusters. This mini-episode explores how the biological processes of Yoshi, our lilac-crowned amazon, might be a useful way of measuring where she sits when there are no humans around. Listen to find out how!
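For readers who want to see the idea in code, here is a minimal sketch of k-means (Lloyd's algorithm) using only the Python standard library. The coordinates are made-up 2-D positions invented for illustration, not data from the episode, and the function names are hypothetical. Note that the result depends on the starting centroids; the fixed seed only makes a given run repeatable.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            new_centroids.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
            )
        else:
            # An empty cluster keeps its previous centroid.
            new_centroids.append(old)
    return new_centroids


def kmeans(points, k, max_iterations=100, seed=0):
    """Run Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:  # nothing moved, so we have converged
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


# Hypothetical positions (illustrative only).
spots = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = kmeans(spots, k=2)
print(centroids, labels)
```

The two alternating steps, assigning points to the nearest centroid and then recomputing each centroid as a mean, are the whole algorithm; everything else is bookkeeping.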

Direct download: MINI_k-means_clustering.mp3
Category:miniepisode -- posted at: 11:38pm PDT

Emre Sarigol joins me this week to discuss his paper Online Privacy as a Collective Phenomenon. This paper studies data collected from social networks and how the sharing behaviors of individuals can unintentionally reveal private information about other people, including those who have not even joined the social network! For the specific test discussed, the researchers were able to accurately predict the sexual orientation of individuals, even when this information was withheld during the training of their algorithm.

The research produces a surprisingly accurate predictor of this private piece of information, and it was constructed using only publicly available data from myspace.com found on archive.org. As Emre points out, this is a small shadow of the potential information available to modern social networks. For example, users who install the Facebook app on their mobile phones are (perhaps unknowingly) sharing all their phone contacts. Should a social network like Facebook choose to do so, this information could be aggregated to assemble "shadow profiles" containing rich data on users who may not even have an account.

Direct download: Shadow_Profiles_on_Social_Networks.mp3
Category:privacy -- posted at: 6:26pm PDT

The Chi-Squared test is a methodology for hypothesis testing. When one has categorical data, in the form of frequency counts or observations (e.g. Vegetarian, Pescetarian, and Omnivore), split into two or more categories (e.g. Male, Female), a question may arise such as "Are women more likely than men to be vegetarian?" or, put more accurately, "Does the frequency with which women report being vegetarian differ in a statistically significant way from the frequency with which men report it?"
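As a rough illustration of the calculation, the sketch below computes the chi-squared statistic for a contingency table using only the Python standard library. The counts are invented for this example and do not come from the episode, and the function name is hypothetical. The p-value line relies on the fact that, for exactly 2 degrees of freedom, the chi-squared survival function is exp(-x/2); other table shapes would need a statistics library.

```python
import math


def chi_squared(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed_count - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Made-up counts. Rows: Male, Female. Columns: Vegetarian, Pescetarian, Omnivore.
observed = [
    [10, 10, 80],
    [20, 10, 70],
]

statistic, dof = chi_squared(observed)

# Valid only because dof == 2 here (assumption noted above).
p_value = math.exp(-statistic / 2)

print(f"chi-squared = {statistic:.3f}, dof = {dof}, p = {p_value:.4f}")
```

A p-value above the chosen significance level (0.05 is a common choice) would mean the observed difference between the two groups is not statistically significant for these counts.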

Direct download: MINI_Chi-Squared_Test.mp3
Category:miniepisode -- posted at: 9:58pm PDT
