Data Skeptic

The scale and frequency with which information can be distributed on social media make fake news a rapidly metastasizing problem. Any content filtering or labeling at that scale demands an algorithmic solution.

In today's episode, Kyle interviews Kai Shu and Mike Tamir about their independent work exploring the use of machine learning to detect fake news.

Kai Shu and his co-authors published Fake News Detection on Social Media: A Data Mining Perspective, a research paper which both surveys the existing literature and organizes the structure of the problem in a robust way.

Mike Tamir led the development of fakerfact.org, a website and Chrome/Firefox plugin which leverages machine learning to try to predict the category of a previously unseen web page, with categories like opinion, wiki, and fake news.

Direct download: algorithmic-detection-of-fake-news.mp3
Category:general -- posted at: 8:06am PDT

If you prepared a list of creatures regarded as highly intelligent, it's unlikely ants would make the cut. This is expected, as on an individual level, ants do not generally display behavior that most humans would regard as intelligence. In fact, it might even be true that most species of ants are unable to learn. Despite this, ant colonies have evolved excellent survival mechanisms through the careful orchestration of ants.

Direct download: ant-intelligence.mp3
Category:general -- posted at: 8:00am PDT

With publications such as "Prior exposure increases perceived accuracy of fake news", "Lazy, not biased: Susceptibility to partisan fake news is better explained by lack of reasoning than by motivated reasoning", and "The science of fake news", Gordon Pennycook is asking and answering analytical questions about the nature of human intuition and fake news.

Gordon appeared on Data Skeptic in 2016 to discuss people's ability to recognize pseudo-profound bullshit. This episode explores his work on fake news.

Direct download: human-detection-of-fake-news.mp3
Category:general -- posted at: 8:00am PDT

Today's spam filters are advanced, data-driven tools. They rely on a variety of techniques to separate junk email from good email effectively and often seamlessly.

Whitelists, blacklists, traffic analysis, network analysis, and a variety of other tools are probably employed by most major players in this area. Naturally, content analysis can be an especially powerful tool for detecting spam.

Given the binary nature of the problem (spam or not spam), it's clear that this is a great problem to solve with machine learning. In order to apply machine learning, you first need a labeled training set. Thankfully, many standard corpora of labeled spam data are readily available. Further, if you're working for a company with a spam filtering problem, asking users to self-moderate or flag things as spam can often be an effective way to generate a large number of labels for "free".

With a labeled dataset in hand, a data scientist working on spam filtering must next do feature engineering. This should be done with consideration of the algorithm that will be used. The Naive Bayesian Classifier has been a popular choice for detecting spam because it tends to perform pretty well on high-dimensional data, unlike a lot of other ML algorithms. It is also very efficient to compute, making it possible to train a per-user classifier if one wished to. While we might do some basic NLP tricks, for the most part, we can turn each word in a document (or perhaps each bigram or n-gram in a document) into a feature.
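As a rough illustration of turning words or n-grams into features, here is a minimal Python sketch. The function names (`tokenize`, `ngram_features`) and the sample message are hypothetical and chosen only for this example; a real filter would likely add more careful tokenization and normalization.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_features(text, n=1):
    """Count every n-gram (run of n consecutive tokens) in the text.

    n=1 gives single-word features, n=2 gives bigrams, and so on.
    """
    tokens = tokenize(text)
    # Zip the token list against shifted copies of itself to form n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Hypothetical sample message, used only to show the shape of the output.
message = "Win cash now, win big"
unigrams = ngram_features(message, n=1)
bigrams = ngram_features(message, n=2)
```

Each distinct word or bigram becomes one dimension, which is why the feature space grows so quickly and why an algorithm that copes well with high-dimensional data is attractive here.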

The "Naive" part of the Naive Bayesian Classifier stems from the assumption that all features in one's analysis are independent of one another. If $x$ and $y$ are known to be independent, then $\Pr(x \cap y) = \Pr(x) \cdot \Pr(y)$. In other words, you just multiply the probabilities together. Shh, don't tell anyone, but this assumption is actually wrong! Certainly, if a document contains the word "algorithm", it's more likely to contain the word "probability" than some randomly selected document. Thus, $\Pr(\text{algorithm} \cap \text{probability}) > \Pr(\text{algorithm}) \cdot \Pr(\text{probability})$, violating the assumption. Despite this "flaw", the Naive Bayesian Classifier works remarkably well on many problems. If you employ the common approach of converting a document into bigrams (pairs of words instead of single words), then you can capture a good deal of this correlation indirectly.
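To make the "just multiply the probabilities" idea concrete, here is a minimal sketch of a word-count Naive Bayes classifier in Python. It is not the implementation discussed in the episode: the class name `NaiveBayes`, the add-one (Laplace) smoothing parameter `alpha`, and the four-message training set are all assumptions made for illustration. Summing log probabilities is equivalent to multiplying the probabilities and avoids numeric underflow on long documents.

```python
import math
from collections import Counter


class NaiveBayes:
    """Toy multinomial Naive Bayes over whitespace-separated words."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha          # smoothing constant (an assumption here)
        self.word_counts = {}       # label -> Counter of word occurrences
        self.doc_counts = Counter() # label -> number of training documents
        self.vocab = set()

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            words = doc.lower().split()
            self.doc_counts[label] += 1
            self.word_counts.setdefault(label, Counter()).update(words)
            self.vocab.update(words)
        return self

    def log_scores(self, doc):
        """Return log Pr(label) + sum of log Pr(word | label) per label."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            for word in doc.lower().split():
                if word in self.vocab:  # skip words never seen in training
                    # Independence assumption: each word contributes its
                    # own factor, regardless of the other words present.
                    score += math.log((counts[word] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)


# Hypothetical toy training data, for illustration only.
docs = [
    "win cash prize now",
    "cheap pills win big",
    "meeting agenda for monday",
    "lunch on monday with the team",
]
labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(docs, labels)
```

With this toy data, `model.predict("win a cash prize")` favors the spam label and `model.predict("agenda for the team meeting")` favors ham. Swapping the whitespace split for bigram features would be one way to capture some of the word correlation that the independence assumption ignores.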

In the final leg of the discussion, we explore the question of whether or not a Naive Bayesian Classifier would be a good choice for detecting fake news.

 
Direct download: spam-filtering.mp3
Category:general -- posted at: 8:00am PDT

How does fake news get spread online? It's not just a matter of manipulating search algorithms. The social platforms used for sharing play a major role in the distribution of fake news. But how large can that impact be, and how much can bots influence the spread of fake news?

In this episode, Kyle interviews Filippo Menczer, Professor of Computer Science and Informatics.

Fil is part of the Observatory on Social Media (OSoMe, https://osome.iuni.iu.edu/tools/). OSoMe are the creators of Hoaxy, Botometer, Fakey, and other tools for studying the spread of information on social media.

The interview explores these tools and the contributions bots make to the spread of fake news.

Direct download: the-spread-of-fake-news.mp3
Category:general -- posted at: 8:00am PDT

This episode kicks off our new theme of "Fake News" with guests Robert Sheaffer and Brad Schwartz.

Fake news is a new label for an old idea. For our purposes, we will define fake news as information created to deliberately mislead while masquerading as a legitimate, journalistic source of truth. It's become a modern topic of discussion as our cultures adapt to the fledgling mechanisms of communication introduced by online platforms.

What was the earliest incident of fake news? That's a question for which we may never find a satisfying answer. While not the earliest, we present a dramatization of an early example of fake news, which leads us into a discussion with UFO skeptic Robert Sheaffer. Following that, we get into our main interview with Brad Schwartz, author of Broadcast Hysteria: Orson Welles's War of the Worlds and the Art of Fake News.

Direct download: fake-news.mp3
Category:general -- posted at: 8:00am PDT

We revisit the 2018 Microsoft Build in this episode, focusing on the latest ideas in DevOps. Kyle interviews Cloud Developer Advocates Damian Brady, Paige Bailey, and Donovan Brown to talk about DevOps for data science and databases.

For a data scientist, what does it even mean to “build”? Packaging and deployment are things that a data scientist doesn't normally have to consider in their day-to-day work. The process of making an AI app is usually divided into two streams of work: data scientists building machine learning models and app developers building the application for end users to consume.

DevOps brings together all the parties involved in getting the application deployed and maintained, and asks each of them to think about the phases that precede and follow their part of the end solution. So what does DevOps mean for data science? Why should you adopt DevOps best practices?

In the first half, Paige and Damian share their views on what DevOps for data science would look like and how it can be introduced to provide continuous integration, delivery, and deployment of data science models. In the second half, Donovan and Damian talk about the DevOps life cycle of putting a database under version control and carrying out deployments through a release pipeline.

Direct download: devops-for-data-science.mp3
Category:general -- posted at: 1:23pm PDT

Logic is a foundation of mathematical systems. Its roots are the values true and false, and its power is in what its rules allow you to prove. Propositional logic provides its users with variables. This episode gets into First Order Logic, an extension of propositional logic.

Direct download: first-order-logic.mp3
Category:general -- posted at: 8:00am PDT

An intelligent agent trained in a simulated environment may be prone to making mistakes in the real world due to discrepancies between the training and real-world conditions. The areas where an agent makes such mistakes, known as "blind spots," are hard to find and can arise for various reasons. In this week’s episode, Kyle is joined by Ramya Ramakrishnan, a PhD candidate at MIT, to discuss the idea of “blind spots” in reinforcement learning and approaches to discovering them.

Direct download: blind-spots-in-reinforcement-learning.mp3
Category:data science -- posted at: 8:00am PDT

In this week’s episode, our host Kyle interviews Gokula Krishnan from ETH Zurich about his recent contributions to defenses against adversarial attacks. The discussion centers around his latest paper, titled “Defending Against Adversarial Attacks by Leveraging an Entire GAN,” and his proposed algorithm, aptly named ‘Cowboy.’

Direct download: defending-against-adversarial-attacks.mp3
Category:general -- posted at: 8:00am PDT