Data Skeptic (general)

Distributed systems cannot simultaneously guarantee consistency, availability, and partition tolerance. System architects need to think carefully about how to balance the needs of their application across these competing objectives. Linh Da and Kyle discuss the CAP Theorem using the analogy of a phone tree for alerting people about a school snow day.

Direct download: cap-theorem.mp3
Category:general -- posted at: 8:00am PDT

A startup is claiming that they can detect terrorists purely through facial recognition. In this solo episode, Kyle explores the plausibility of these claims.

Direct download: detecting-terrorists.mp3
Category:general -- posted at: 8:00am PDT

Goodhart's law states that "When a measure becomes a target, it ceases to be a good measure". In this mini-episode we discuss how this affects SEO, call centers, and Scrum.

Direct download: goodharts-law.mp3
Category:general -- posted at: 8:00am PDT

I'm joined this week by Jon Morra, director of data science at eHarmony, to discuss a variety of ways in which machine learning and data science are being applied to help connect people for successful long term relationships.

Interesting open source projects mentioned in the interview include Face-parts, a web service for detecting faces and extracting a robust set of fiducial markers (features) from the image, and Aloha, a Scala based machine learning library. You can learn more about these and other interesting projects at the eHarmony github page.

In the wrap up, Jon mentioned the LA Machine Learning meetup which he runs. This is a great resource for LA residents, separate from and complementary to the datascience.la groups, so consider signing up for all of the above and I hope to see you there in the future.

Direct download: data-science-at-eharmony.mp3
Category:general -- posted at: 8:00am PDT

Mystery shoppers and fruit cultivation help us discuss stationarity - a property of some time series that are invariant to time in several ways. Differencing is one approach that can often convert a non-stationary process into a stationary one. If you have a stationary process, you get the benefit of many known statistical properties that enable a significant amount of inference and prediction.
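
As a rough illustration of differencing (not taken from the episode), here is a minimal Python sketch. The `difference` helper and the trending series are made up for the example; the point is only that subtracting each value from the one after it removes a steady trend.

```python
def difference(series, lag=1):
    """Return the lag-differenced series: x[t] - x[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical series with a linear trend, so its mean drifts over time.
trend = [2 * t + 5 for t in range(10)]

# After first differencing, only the constant slope remains.
diffed = difference(trend)
print(diffed)
```

Real data rarely becomes perfectly flat after one pass, so in practice one would inspect the differenced series (or difference again) before treating it as stationary.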

Direct download: stationarity.mp3
Category:general -- posted at: 8:00am PDT

I'm joined by Wes McKinney (@wesmckinn) and Hadley Wickham (@hadleywickham) on this episode to discuss their joint project Feather. Feather is a file format for storing data frames along with some metadata, to help with interoperability between languages. At the time of recording, libraries are available for R and Python, making it easy for data scientists working in these languages to quickly and effectively share datasets and collaborate.

Direct download: feather.mp3
Category:general -- posted at: 8:00am PDT

Bargaining is the process of two (or more) parties attempting to agree on the price for a transaction. Game-theoretic approaches attempt to find two strategies from which neither party is motivated to deviate. These strategies are said to be in equilibrium with one another. The equilibria available in bargaining depend on the transaction mechanism and the information available to the parties. Discounting (how long parties are willing to wait) has a significant effect on this process. This episode discusses some of the choices Kyle and Linh Da made in deciding what offer to make on a house.

Direct download: bargaining.mp3
Category:general -- posted at: 8:00am PDT

Deepjazz is a project from Ji-Sung Kim, a computer science student at Princeton University. It is built using Theano, Keras, music21, and Evan Chow's project jazzml. Deepjazz is a computational music project that creates original jazz compositions using recurrent neural networks trained on Pat Metheny's "And Then I Knew". You can hear some of deepjazz's original compositions on soundcloud.

Direct download: deepjazz.mp3
Category:general -- posted at: 8:00am PDT

When working with time series data, there are a number of important diagnostics one should consider to help understand more about the data. The autocorrelation function, plotted as a correlogram, helps explain how a given observation relates to recent preceding observations. A purely random process (like lottery numbers) would show values near zero, while temperature (our topic in this episode) correlates highly with recent days.
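
For readers who want to see the calculation, the sketch below computes a sample autocorrelation in plain Python. The `autocorrelation` helper and the temperature values are invented for illustration and are not the Chapel Hill data from the show notes.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum(
        (series[t] - mean) * (series[t + lag] - mean)
        for t in range(n - lag)
    )
    return num / denom

# Made-up daily temperatures that drift slowly from day to day.
temps = [50, 52, 55, 57, 60, 62, 61, 59, 56, 53, 51, 50]

# The values one would plot as a correlogram for lags 0 through 3.
correlogram = [autocorrelation(temps, lag) for lag in range(4)]
print(correlogram)
```

Lag 0 is always 1 by construction; the interesting part of a correlogram is how quickly the later lags fall away.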
 
See the show notes with details about Chapel Hill, NC weather data by visiting:
 
 
Direct download: acf.mp3
Category:general -- posted at: 8:00am PDT

This week I spoke with Elham Shaabani and Paulo Shakarian (@PauloShakASU) about their recent paper Early Identification of Violent Criminal Gang Members (also available on arXiv). In this paper, they use social network analysis techniques and machine learning to provide early detection of known criminal offenders who are in a high risk group for committing violent crimes in the future. Their techniques outperform existing techniques used by the police. Elham and Paulo are part of the Cyber-Socio Intelligent Systems (CySIS) Lab.

Direct download: predicting-violent-offenders.mp3
Category:general -- posted at: 8:00am PDT

A dinner party at Data Skeptic HQ helps teach the uses of fractional factorial design for studying 2-way interactions.

Direct download: Fractional_factorial_design.mp3
Category:general -- posted at: 8:00am PDT

Cheng-tao Chu (@chengtao_chu) joins us this week to discuss his perspective on common mistakes and pitfalls that are made when doing machine learning. This episode is filled with sage advice for beginners and intermediate users of machine learning, and possibly some good reminders for experts as well. Our discussion parallels his recent blog post Machine Learning Done Wrong.

Cheng-tao Chu is an entrepreneur who has worked at many well known Silicon Valley companies. His paper Map-Reduce for Machine Learning on Multicore is the basis for Apache Mahout. His most recent endeavor has just emerged from stealth, so please check out OneInterview.io.

Direct download: machine_learning_done_wrong.mp3
Category:general -- posted at: 8:00am PDT

Co-host Linh Da was in a biking accident after hitting a pothole. She sustained an injury that required stitches. This is the story of our quest to file a 311 complaint and track it through the City of Los Angeles's open data portal.

My guests this episode are Chelsea Ursaner (LA City Open Data Team), Ben Berkowitz (CEO and founder of SeeClickFix), and Russ Klettke (Editor of pothole.info)

Direct download: potholes.mp3
Category:general -- posted at: 8:24am PDT

Certain data mining algorithms (including k-means clustering and k-nearest neighbors) require a user-defined parameter k. A user of these algorithms is required to select this value, which raises the question: what is the "best" value of k that one should select to solve their problem?

This mini-episode explores the appropriate value of k to use when trying to estimate the cost of a house in Los Angeles based on the closest sales in its area.
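
To make the effect of k concrete, here is a small nearest-neighbor sketch in Python. Everything in it is an assumption for illustration: the `knn_price_estimate` helper, the use of square footage as the only notion of "closeness", and the sales figures themselves.

```python
def knn_price_estimate(target_sqft, sales, k):
    """Average the prices of the k sales closest in square footage."""
    nearest = sorted(sales, key=lambda sale: abs(sale[0] - target_sqft))[:k]
    return sum(price for _, price in nearest) / k

# Invented (square feet, sale price) pairs, including one very large home.
sales = [
    (900, 450_000),
    (1000, 500_000),
    (1100, 540_000),
    (1500, 800_000),
    (3000, 2_000_000),
]

# Estimate the price of a 1,050 sq ft house for several values of k.
estimates = {k: knn_price_estimate(1050, sales, k) for k in (1, 3, 5)}
for k, estimate in estimates.items():
    print(k, round(estimate))
```

A small k leans on very few comparable sales, while a large k starts pulling in houses that are not comparable at all, which is the tension behind choosing k.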

Direct download: the-elbow-method.mp3
Category:general -- posted at: 8:00am PDT

Today on Data Skeptic, Lachlan Gunn joins us to discuss his recent paper Too Good to be True. This paper highlights a somewhat paradoxical / counterintuitive fact about how unanimity is unexpected in cases where perfect measurements cannot be taken. With large enough data, some amount of error is expected.

The "Too Good to be True" paper highlights three interesting examples which we discuss in the podcast. You can also watch a lecture from Lachlan on this topic via youtube here.

Direct download: too_good_to_be_true.mp3
Category:general -- posted at: 8:00am PDT

How well does your model explain your data? R-squared is a useful statistic for answering this question. In this episode we explore how it applies to the problem of valuing a house. Aspects like the number of bedrooms go a long way in explaining why different houses have different prices. There's some amount of variance that can be explained by a model, and some amount that cannot be directly measured. R-squared is the ratio of the explained variance to the total variance. It's not a measure of accuracy; it's a measure of the explanatory power of one's model.
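
A minimal sketch of the calculation follows, using the common form 1 - (residual sum of squares / total sum of squares), which equals the explained-to-total ratio for an ordinary least squares fit with an intercept. The `r_squared` helper and the prices (in thousands) are made up for illustration.

```python
def r_squared(actual, predicted):
    """R-squared computed as 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_total = sum((y - mean) ** 2 for y in actual)
    ss_residual = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_residual / ss_total

# Invented sale prices and a hypothetical model's predictions.
actual_prices = [300, 400, 500, 600]
predicted_prices = [320, 380, 520, 580]

score = r_squared(actual_prices, predicted_prices)
print(score)
```

A model that always predicts the average price scores 0 under this formula, which is a handy baseline to keep in mind when reading an R-squared value.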

Direct download: r-squared.mp3
Category:general -- posted at: 8:00am PDT

 
Direct download: think_again.mp3
Category:general -- posted at: 7:30am PDT

[MINI] Multiple Regression

This episode is a discussion of multiple regression: the use of observations that are a vector of values to predict a response variable. For this episode, we consider how features of a home such as the number of bedrooms, number of bathrooms, and square footage can predict the sale price.

Unlike a typical episode of Data Skeptic, these show notes are not just supporting material, but are actually featured in the episode.

The site Redfin graciously allows users to download a CSV of results they are viewing. Unfortunately, they limit this extract to 500 listings, but you can still use it to try the same approach on your own using the download link shown in the figure below.

Direct download: multiple_regressions.mp3
Category:general -- posted at: 7:30am PDT

Samuel Mehr joins us this week to share his perspective on why people are musical, where music comes from, and why it works the way it does. We discuss a number of empirical studies related to music and musical cognition, and dispel a few myths about music along the way.

Some of Sam's work discussed in this episode includes Music in the Home: New Evidence for an Intergenerational Link, Two randomized trials provide no consistent evidence for nonmusical cognitive benefits of brief preschool music enrichment, and Miscommunication of science: music cognition research in the popular press. Additional topics we discussed are also covered in a Harvard Gazette article featuring Sam titled Muting the Mozart effect.

You can follow Sam on twitter via @samuelmehr.

Direct download: samuel.mp3
Category:general -- posted at: 7:30am PDT

This episode reviews the concept of k-d trees: an efficient data structure for holding multidimensional objects. Kyle gives Linhda a dictionary and asks her to look up words as a way of introducing the concept of binary search. We actually spend most of the episode talking about binary search before getting into k-d trees, but this is a necessary prerequisite.
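
The dictionary lookup described above can be sketched as a binary search in Python. This is an illustrative implementation only; the `binary_search` function and the word list are not from the episode, and the step counter is there just to show how few comparisons are needed.

```python
def binary_search(sorted_words, target):
    """Return (index, steps) for target in a sorted list, or (-1, steps)."""
    lo, hi = 0, len(sorted_words) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_words[mid] == target:
            return mid, steps
        if sorted_words[mid] < target:
            lo = mid + 1   # target sorts after the middle word
        else:
            hi = mid - 1   # target sorts before the middle word
    return -1, steps

# A tiny alphabetized "dictionary".
words = ["apple", "banana", "cherry", "grape", "kiwi", "mango", "peach"]
index, steps = binary_search(words, "mango")
print(index, steps)
```

A k-d tree applies the same halving idea to points with several coordinates, splitting on a different dimension at each level of the tree.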

Direct download: kd_trees.mp3
Category:general -- posted at: 7:13am PDT

Algorithms are pervasive in our society and make thousands of automated decisions on our behalf every day. The possibility of digital discrimination is a very real threat, and it is very plausible for discrimination to occur accidentally (i.e. outside the intent of the system designers and programmers). Christian Sandvig joins us in this episode to talk about his work and the concept of auditing algorithms.

Christian Sandvig (@niftyc) has a PhD in communications from Stanford and is currently an Associate Professor of Communication Studies and Information at the University of Michigan. His research studies the predictable and unpredictable effects that algorithms have on culture. His work exploring the topic of auditing algorithms has framed the conversation of how and why we might want to have oversight on the way algorithms affect our lives. His writing appears in numerous publications including The Social Media Collective, The Huffington Post, and Wired.

One of his papers we discussed in depth on this episode was Auditing Algorithms: Research Methods for Detecting Discrimination on Internet Platforms, which is well worth a read.

Direct download: auditing_algorithms.mp3
Category:general -- posted at: 7:00am PDT

Today's episode begins by asking how many left-handed employees we should expect at a company before anyone should claim left-handedness discrimination. If not lefties, consider eye color, hair color, favorite ska band, most recent grocery store used, or any number of other characteristics that could be studied to look for deviations from the norm in a company.

When multiple comparisons are to be made simultaneously, one must account for this, and a common method for doing so is the Bonferroni Correction. It is not, however, a surefire procedure, and this episode wraps up with a bit of skepticism about it.
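
The correction itself is simple arithmetic: divide the significance level by the number of comparisons. The sketch below shows that step in Python; the `bonferroni_significant` helper and the p-values are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant after a Bonferroni correction."""
    threshold = alpha / len(p_values)  # stricter bar for every test
    return [p <= threshold for p in p_values]

# Hypothetical p-values from five separate comparisons
# (handedness, eye color, hair color, and so on).
p_values = [0.04, 0.001, 0.03, 0.20, 0.008]

flags = bonferroni_significant(p_values)
print(flags)
```

Two of these made-up results would have passed an uncorrected 0.05 cutoff but fail the corrected one, which shows both the protection the method offers and why it is often criticized as conservative.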

Direct download: bonferroni-correction2.mp3
Category:general -- posted at: 7:00am PDT

A recent paper in the journal Judgment and Decision Making titled On the reception and detection of pseudo-profound bullshit explores empirical questions around a reader's ability to detect statements which may sound profound but are actually a collection of buzzwords that fail to contain adequate meaning or truth. These statements are definitively different from lies and nonsense, as we discuss in the episode.

This paper proposes the Bullshit Receptivity scale (BSR) and empirically demonstrates that it correlates with existing metrics like the Cognitive Reflection Test, building confidence that this can be a useful, repeatable, empirical measure of a person's ability to detect pseudo-profound statements as being different from genuinely profound statements. Additionally, the correlative results provide some insight into possible root causes for why individuals might find great profundity in these statements based on other beliefs or cognitive measures.

The paper's lead author Gordon Pennycook joins me to discuss this study's results.

If you'd like some examples of pseudo-profound bullshit, you can randomly generate some based on Deepak Chopra's twitter feed.

To read other work from Gordon, check out his Google Scholar page and find him on twitter via @GordonPennycook.

And just for fun, if you think you've dreamed up a Data Skeptic related pseudo-profound bullshit statement, tweet it with hashtag #pseudoprofound. If I see an especially clever or humorous one, I might want to send you a free Data Skeptic sticker.

 
Direct download: pseudo-profound-episode.mp3
Category:general -- posted at: 7:00am PDT

Today's mini-episode discusses the widely known optimization algorithm gradient descent in the context of hiking on a foggy hillside.
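
In code, the foggy hike amounts to repeatedly taking a small step in the downhill direction. The sketch below is a bare-bones one-dimensional version; the function being minimized, the starting point, and the learning rate are all arbitrary choices for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Follow the negative gradient from `start` for a fixed number of steps."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)  # step downhill
    return x

# Example "hillside": f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
# The lowest point is at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=10.0)
print(minimum)
```

On a bumpier hillside the same procedure can settle in a local dip rather than the lowest valley, and the learning rate then matters a great deal.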

Direct download: gradient_descent.mp3
Category:general -- posted at: 12:00am PDT

This episode is a discussion of data visualization and a proposed New Year's resolution for Data Skeptic listeners. Let's kill the word cloud.

Direct download: lets_kill_the_word_cloud.mp3
Category:general -- posted at: 12:00am PDT

Today's episode is a reading of Isaac Asimov's The Machine that Won the War. I can't think of a story that's more appropriate for Data Skeptic.

Direct download: 2015_Holiday_Special.mp3
Category:general -- posted at: 12:00am PDT

In this interview with Aaron Halfaker of the Wikimedia Foundation, we discuss his research and career related to the study of Wikipedia. In his paper The Rise and Decline of an open Collaboration Community, he highlights a trend in the declining rate of active editors on Wikipedia which began in 2007. I asked Aaron about a variety of possible hypotheses for the phenomenon, in particular, how automated quality control tools that revert edits automatically could play a role. This led Aaron and his collaborators to develop Snuggle, an optimized interface to help Wikipedians better welcome newcomers to the community.

We discuss the details of these topics as well as ORES, which provides revision scoring as a service to any software developer that wants to consume the output of their machine learning based scoring.

You can find Aaron on Twitter as @halfak.

Direct download: wikipedia-revision-scoring-as-a-service.mp3
Category:general -- posted at: 6:30am PDT

Today's topic is term frequency-inverse document frequency (TF-IDF), which is a statistic for estimating the importance of words and phrases in a set of documents.
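
There are several variants of this statistic. The sketch below uses one common form (relative term frequency times the natural log of the inverse document frequency) and should be read as an illustration, not as the exact formula discussed in the episode. The `tf_idf` helper and the three tiny documents are made up.

```python
import math

def tf_idf(term, document, corpus):
    """One common TF-IDF variant: relative frequency times log(N / df)."""
    tf = document.count(term) / len(document)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

# Three tiny "documents", already split into words.
corpus = [
    "the parrot ate the seed".split(),
    "the house sold quickly".split(),
    "the parrot is loud".split(),
]

scores = {word: tf_idf(word, corpus[0], corpus) for word in ("the", "parrot", "seed")}
print(scores)
```

A word that appears in every document scores zero no matter how often it occurs, while a word unique to one document scores highest.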

Direct download: tf_-_idf.mp3
Category:general -- posted at: 8:45am PDT

Early astronomers could see several of the planets with the naked eye. The invention of the telescope allowed for further understanding of our solar system. The work of Isaac Newton allowed later scientists to accurately predict the existence of Neptune, which was later observed almost exactly where predicted. It seemed only natural that a similar unknown body might explain anomalies in the orbit of Mercury, and thus began the search for the hypothesized planet Vulcan.

Thomas Levenson's book "The Hunt for Vulcan" is a narrative of the key scientific minds involved in the search and eventual refutation of an unobserved planet between Mercury and the sun. Thomas joins me in this episode to discuss his book and the fascinating story of the quest to find this planet.

During the discussion, we mention one of the contributions made by Urbain-Jean-Joseph Le Verrier which involved some complex calculations which enabled him to predict where to find the planet that would eventually be called Neptune. The calculus behind this work is difficult, and some of that work is demonstrated in a Jupyter notebook I recently discovered from Paulo Marques titled The-Body Problem.

Thomas Levenson is a professor at MIT and head of its science writing program. He is the author of several books, including Einstein in Berlin and Newton and the Counterfeiter: The Unknown Detective Career of the World’s Greatest Scientist. He has also made ten feature-length documentaries (including a two-hour Nova program on Einstein) for which he has won numerous awards. His most recent book, "The Hunt for Vulcan", explores the century-spanning quest to explain the movement of the cosmos via theory and the role the hypothesized planet Vulcan played in the story.

Follow Thomas on twitter @tomlevenson and check out his blog at https://inversesquare.wordpress.com/.

Pick up your copy of The Hunt for Vulcan at your local bookstore, preferred book buying place, or at the Penguin Random House site.

Direct download: the-hunt-for-vulcan.mp3
Category:general -- posted at: 12:00am PDT

Today's episode discusses the accuracy paradox. There are cases when one might prefer a less accurate model because it yields more predictive power or better captures the underlying causal factors describing the outcome variable you are interested in. This is especially relevant in machine learning when trying to predict rare events. We discuss how the accuracy paradox might apply if you were trying to predict the likelihood a person was a bird owner.
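
A small numerical sketch makes the paradox easier to see. The counts below are invented: 5 bird owners in a group of 100, a "model" that never predicts ownership, and a second hypothetical model that finds most owners at the cost of some false alarms.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def recall(actual, predicted):
    """Fraction of actual positives that were predicted positive."""
    true_positives = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    return true_positives / sum(actual)

# 1 = bird owner, 0 = not a bird owner (made-up population).
actual = [1] * 5 + [0] * 95

# Baseline that always says "not a bird owner".
always_no = [0] * 100

# Hypothetical model: finds 4 of the 5 owners, with 10 false alarms.
model = [1, 1, 1, 1, 0] + [1] * 10 + [0] * 85

for name, predicted in (("always_no", always_no), ("model", model)):
    print(name, accuracy(actual, predicted), recall(actual, predicted))
```

The do-nothing baseline wins on accuracy yet never identifies a single owner, which is why accuracy alone is a poor yardstick for rare events.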

Direct download: the-accuracy-paradox.mp3
Category:general -- posted at: 12:00am PDT

... or should this have been called data science from a neuroscientist's perspective? Either way, I'm sure you'll enjoy this discussion with Laurie Skelly. Laurie earned a PhD in Integrative Neuroscience from the Department of Psychology at the University of Chicago. In her life as a social neuroscientist, using fMRI to study the neural processes behind empathy and psychopathy, she learned the ropes of zooming in and out between the macroscopic and the microscopic -- how millions of data points come together to tell us something meaningful about human nature. She's currently at Metis Data Science, an organization that helps people learn the skills of data science to transition into industry.

In this episode, we discuss fMRI technology, Laurie's research studying empathy and psychopathy, as well as the skills and tools used in common between neuroscientists and data scientists. For listeners interested in more on this subject, Laurie recommended the blogs Neuroskeptic, Neurocritic, and Neuroecology.

We conclude the episode with a mention of the upcoming Metis Data Science San Francisco cohort which Laurie will be teaching. If anyone is interested in applying to participate, they can do so here.

Direct download: neuroscience.mp3
Category:general -- posted at: 12:00am PDT

A discussion of the expected number of cars at a stoplight frames today's discussion of the bias-variance tradeoff. The central idea of this concept relates to model complexity. A very simple model will likely perform about as well on testing data as on training data, but will have high bias since its simplicity can prevent it from capturing the relationship between the covariates and the output. As a model grows more and more complex, it may capture more of the underlying data, but the risk increases that it overfits the training data and therefore does not generalize (has high variance). The tradeoff between minimizing bias and minimizing variance is an ongoing challenge for data scientists, and an important discussion for skeptics around how much we should trust models.

Direct download: bias-variance-tradeoff.mp3
Category:general -- posted at: 12:00am PDT

The recent opinion piece Big Data Doesn't Exist on TechCrunch by Slater Victoroff is an interesting discussion about the usefulness of data both big and small. Slater joins me this episode to expand on it.

Slater Victoroff is CEO of indico Data Solutions, a company whose services turn raw text and image data into human insight. He and his co-founders studied at Olin College of Engineering, where indico was born. indico was then accepted into the "Techstars Accelerator Program" in the Fall of 2014 and went on to raise $3M in seed funding. His recent essay "Big Data Doesn't Exist" received a lot of traction on TechCrunch, and I have invited Slater to join me today to discuss his perspective and touch on a few topics in the machine learning space as well.

Direct download: big-data-doesnt-exist.mp3
Category:general -- posted at: 12:00am PDT

The degree to which two variables change together can be calculated in the form of their covariance. This value can be normalized to the correlation coefficient, which has the advantage of transforming it to a unitless measure bounded between -1 and 1. This episode discusses how we arrive at these values and why they are important.
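
Here is a short Python sketch of both quantities, using the sample (n - 1) form of covariance. The helper functions and the house data are made up; the second calculation re-expresses price in dollars instead of thousands to show which value depends on units.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Invented square footage and prices (in thousands of dollars).
sqft = [1000, 1500, 2000, 2500]
price_thousands = [300, 420, 560, 650]
price_dollars = [p * 1000 for p in price_thousands]

print(covariance(sqft, price_thousands), covariance(sqft, price_dollars))
print(correlation(sqft, price_thousands), correlation(sqft, price_dollars))
```

Changing the units multiplies the covariance by 1,000 but leaves the correlation unchanged, which is the practical benefit of the normalization.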

Direct download: covariance_and_correlation.mp3
Category:general -- posted at: 12:00am PDT

Today's guest is Cameron Davidson-Pilon. Cameron has a master's degree in quantitative finance from the University of Waterloo. Think of it as statistics on stock markets. For the last two years he's been the team lead of data science at Shopify. He's the founder of dataorigami.net which produces screencasts teaching methods and techniques of applied data science. He's also the author of the just-released print book Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference, which you can also get in a digital form.

This episode focuses on the topic of Bayesian A/B Testing which spans just one chapter of the book. Related to today's discussion is the Data Origami post The class imbalance problem in A/B testing.

Lastly, Data Skeptic will be giving away a copy of the print version of the book to one lucky listener who has a US based delivery address. To participate, you'll need to write a review of any site, book, course, or podcast of your choice on datasciguide.com. After it goes live, tweet a link to it with the hashtag #WinDSBook to be given an entry in the contest. This contest will end November 20th, 2015, at which time I'll draw a single randomized winner and contact them for delivery details via direct message on Twitter.

Direct download: bayesian-methods-for-hackers.mp3
Category:general -- posted at: 12:00am PDT

The central limit theorem is an important statistical result which states that, under typical conditions, the mean of a large enough set of independent trials is approximately normally distributed. This episode explores how this might be used to determine if an Amazon parrot like Yoshi produces more or less waste than an African Grey, under the assumption that the individual distributions are not normal.
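
A quick simulation can make the theorem feel less abstract. The sketch below is an illustration only: it draws from an exponential distribution (which is strongly skewed, so clearly not normal), and the sample size, trial count, and seed are arbitrary choices.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from an exponential distribution with mean 1."""
    return statistics.mean(rng.expovariate(1.0) for _ in range(n))

# Repeat the experiment many times and collect the sample means.
means = [sample_mean(50) for _ in range(2000)]

center = statistics.mean(means)
spread = statistics.stdev(means)
# For a normal distribution, roughly 68% of values fall within one
# standard deviation of the center.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)
print(center, spread, within_one_sd)
```

Even though each individual draw is skewed, the collected means cluster symmetrically around 1 with a spread close to 1 / sqrt(50), as the theorem suggests they should.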

Direct download: Central_Limit_Theorem.mp3
Category:general -- posted at: 12:00am PDT

Today's guest is Chris Hofstader (@gonz_blinko), an accessibility researcher and advocate, as well as an activist for causes such as improving access to information for blind and vision impaired people. His background in computer programming enabled him to be the leader of JAWS, a Windows program that allowed people with a visual impairment to read their screen either through text-to-speech or a refreshable braille display. He's the Managing Member of 3 Mouse Technology. He's also a frequent blogger primarily at chrishofstader.com.

For web developers and site owners, Chris recommends two tools to help test for accessibility issues: tenon.io and dqtech.co.

A guest post from Chris appeared on the Skepchick blog titled Skepticism and Disability, which led to the formation of the sister site Skeptibility.

In a discussion of skepticism and favorite podcasts, Chris mentioned a number of great shows, most notably The Pod Delusion to which he was a contributor. Additionally, Chris has also appeared on The Atheist Nomads.

Lastly, a shout out from Chris to musician Shelley Segal whom he hosted just before the date of recording of this episode. Her music can be found on her site or via bandcamp.

Direct download: accessible-technology.mp3
Category:general -- posted at: 12:00am PDT

Our episode this week begins with a correction. Back in episode 28 (Monkeys on Typewriters), Kyle made some bold claims about the probability that monkeys banging on typewriters might produce the entire works of Shakespeare by chance. The proof shown in the show notes turned out to be a bit dubious and Dave Spiegel joins us in this episode to set the record straight.

In addition to that, our discussion explores a number of interesting topics in astronomy and astrophysics. This includes a paper Dave wrote with Ed Turner titled "Bayesian analysis of the astrobiological implications of life's early emergence on Earth" as well as exoplanet discovery.

Direct download: Shakespeare-abiogenesis-exoplanets.mp3
Category:general -- posted at: 12:30am PDT

There are several factors that are important to selecting an appropriate sample size and dealing with small samples. The most important questions are around representativeness - how well does your sample represent the total population and capture all its variance?

Linhda and Kyle talk through a few examples including elections, picking an Airbnb, produce selection, and home shopping as cases in which the number of observations one has is more or less important depending on how complex the underlying system being observed is.

Direct download: sample_sizes.mp3
Category:general -- posted at: 11:47pm PDT

There's an old adage which says you cannot fit a model which has more parameters than you have data. While this is often the case, it's not a universal truth. Today's guest Jake VanderPlas explains this topic in detail and provides some excellent examples of when it holds and doesn't. Some excellent visuals articulating the points can be found on Jake's blog Pythonic Perambulations, specifically on his post The Model Complexity Myth.

We also touch on Jake's work as an astronomer, his noteworthy open source contributions, and forthcoming book (currently available in an Early Edition) Python Data Science Handbook.

Direct download: model_complexity_myth.mp3
Category:general -- posted at: 12:00am PDT

There are many occasions in which one might want to know the distance or similarity between two things, for which the means of calculating that distance is not necessarily clear. The distance between two points in Euclidean space is generally straightforward, but what about the distance between the top of Mount Everest and the bottom of the ocean? What about the distance between two sentences?

This mini-episode summarizes some of the considerations and a few of the means of calculating distance. We touch on Jaccard Similarity, Manhattan Distance, and a few others.
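
Two of the measures named above are short enough to sketch directly in Python. The helper functions and example inputs are illustrative; treating a sentence as a set of words is just one of several reasonable choices.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)

def manhattan_distance(p, q):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(x - y) for x, y in zip(p, q))

# Comparing two sentences as sets of words.
similarity = jaccard_similarity("the cat sat".split(), "the cat ran".split())

# Walking a street grid from (0, 0) to (3, 4).
blocks = manhattan_distance((0, 0), (3, 4))
print(similarity, blocks)
```

The same pair of grid points is 5 units apart by Euclidean distance but 7 by Manhattan distance, a reminder that "distance" depends on the measure you pick.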

Direct download: distance_measures.mp3
Category:general -- posted at: 12:00am PDT

ContentMine is a project which provides the tools and workflow to convert scientific literature into machine readable and machine interpretable data in order to facilitate better and more effective access to the accumulated knowledge of humankind. The program's founder Peter Murray-Rust joins us this week to discuss ContentMine. Our discussion covers the project, the scientific publication process, copyright, and several other interesting topics.

Direct download: contentmine.mp3
Category:general -- posted at: 12:30am PDT

Today's mini-episode explains the distinction between structured and unstructured data, and debates which of these categories best describes recipes.

Direct download: structured_unstructured.mp3
Category:general -- posted at: 11:22pm PDT

Yusan Lin shares her research on using data science to explore the fashion industry in this episode. She has applied techniques from data mining, natural language processing, and social network analysis to explore who the innovators in the fashion world are and how their influence affects other designers.

If you found this episode interesting and would like to read more, Yusan's papers Text-Generated Fashion Influence Model: An Empirical Study on Style.com and The Hidden Influence Network in the Fashion Industry are worth reading.

Direct download: yusan_lin.mp3
Category:general -- posted at: 12:01am PDT

[MINI] PageRank

PageRank is the algorithm most famous for being one of the original innovations that made Google stand out as a search engine. It was defined in the classic paper The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Larry Page. While this algorithm clearly impacted web searching, it has also been useful in a variety of other applications. This episode presents a high level description of this algorithm and how it might apply when trying to establish who writes the most influential academic papers.
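
For a feel of how the ranking emerges, here is a compact power-iteration sketch in Python. It uses a normalized variant in which the ranks sum to 1, with a damping factor of 0.85; the `pagerank` function, the iteration count, and the tiny citation graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate ranks for a graph given as {node: [targets]}."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1 - damping) / n for node in nodes}
        for node, outgoing in links.items():
            if outgoing:
                share = damping * rank[node] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A node with no outgoing links spreads its rank evenly.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical citation graph: papers A, B, and D all cite C; C cites A.
citations = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(citations)
print(sorted(ranks.items(), key=lambda item: -item[1]))
```

Paper C ranks highest because it is cited most, and A outranks B and D because its single citation comes from the highly ranked C.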

Direct download: pagerank.mp3
Category:general -- posted at: 12:01am PDT

Crypto

How do people think rationally about small probability events?

What is the optimal statistical process by which one can update their beliefs in light of new evidence?

This episode of Data Skeptic explores questions like these as Kyle consults a cast of previous guests and experts to try and answer the question "What is the probability, however small, that Bigfoot is real?"

Direct download: Data_Skeptic_-_Crypto.mp3
Category:general -- posted at: 2:34am PDT

A recent episode of the Skeptics' Guide to the Universe included a slight rant by Dr. Novella and the rogues about a shortcoming in operating systems. This episode explores why such a (seemingly obvious) flaw might make sense from an engineering perspective, and how data science might be the solution.

In this solo episode, Kyle proposes the concept of "annoyance mining" - the idea that with proper logging and enough feedback, data scientists could be provided the right dataset from which they can detect flaws and annoyances in software and other systems and automatically detect potential bugs, flaws, and improvements which could make those systems better.

As system complexity grows, it seems that an abstraction like this might be required in order to keep maintaining an effective development cycle.  This episode is a bit of a soap box for Kyle as he explores why and how we might track an appropriate amount of data to be able to make better software and systems more suited for the users.

Direct download: annoyance_mining.mp3
Category:general -- posted at: 11:18pm PDT

This episode contains coverage of the 2015 Data Fest hosted at UCLA. Data Fest is an analysis competition that gives teams of students 48 hours to explore a new dataset and present novel findings. This year, data from Edmunds.com was provided, and students competed in three categories: best recommendation, best use of external data, and best visualization.

Direct download: Data_Fest_2015.mp3
Category:general -- posted at: 11:55pm PDT

Nicole Goebel joins us this week to share her experiences in oceanography studying phytoplankton and other aspects of the ocean and how data plays a role in that science.

 

We also discuss Thinkful where Nicole and I are both mentors for the Introduction to Data Science course.

Last but not least, check out Nicole's blog Data Science Girl and the videos Kyle mentioned on her Youtube channel featuring one on the diversity of phytoplankton and how that changes in time and space.

Direct download: Oceanography_and_Data_Science.mp3
Category:general -- posted at: 12:06am PDT

I'm joined this week by Alex Boklin to explore the topic of magical thinking, especially in the context of Rhonda Byrne's "The Secret", and the similarities it bears to The Global Consciousness Project (GCP). The GCP puts forward the hypothesis that random number generators exhibit statistically significant changes as a result of major world events.

Direct download: The_Secret_and_the_Global_Consciousness_Project.mp3
Category:general -- posted at: 12:05am PDT

This mini-episode discusses the technique called Cross Validation - a process by which one randomly divides up a dataset into numerous small partitions. Next, (typically) one is held out, and the rest are used to train some model. The held-out set can then be used to validate how well the model does at describing/predicting new data.
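
The partitioning step can be sketched in a few lines of Python. This is a minimal illustration of the splitting only, with no model attached; the `k_fold_splits` helper, the seed, and the toy dataset are hypothetical.

```python
import random

def k_fold_splits(data, k, seed=0):
    """Yield (training, held_out) pairs, holding out each fold once."""
    items = list(data)
    random.Random(seed).shuffle(items)          # random partitioning
    folds = [items[i::k] for i in range(k)]     # k roughly equal folds
    for i in range(k):
        held_out = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, held_out

# Ten toy observations split into five folds.
splits = list(k_fold_splits(range(10), k=5))
for training, held_out in splits:
    print(sorted(held_out), len(training))
```

Each observation is held out exactly once, so every data point gets used for both training and validation across the full set of rounds.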

Direct download: MINI_Cross_Validation.mp3
Category:general -- posted at: 7:51am PDT

This episode features a discussion with statistics PhD student Zach Seeskin about a project he was involved in as part of the Eric and Wendy Schmidt Data Science for Social Good Summer Fellowship.  The project involved exploring the relationship (if any) between streetlight outages and crime in the City of Chicago.  We discuss how the data was accessed via the City of Chicago data portal, how the analysis was done, and what correlations were discovered in the data.  Won't you listen and hear what was found? 

Direct download: Streetlight_Outage_and_Crime_Rate_Analysis_with_Zach_Seeskin.mp3
Category:general -- posted at: 6:00am PDT

In this week's episode, we discuss applied solutions to big data problems with big data engineer Jay Shankar. The episode explores approaches and design philosophy for solving real world big data business problems, and the wide array of tools available.

 

Direct download: Data_Skeptic_Podcast_-_Big_Data_Tools.mp3
Category:general -- posted at: 6:00am PDT