Data Skeptic

Kyle interviews Julia Silge about her path into data science, her book Text Mining with R, and some of the ways in which she's used natural language processing in projects both personal and professional.

Direct download: text-mining-in-r.mp3
Category:general -- posted at: 8:00am PDT

One of the most challenging NLP tasks is natural language understanding and reasoning. How can we construct algorithms that achieve human-level understanding of text and can answer general questions about it?

This is truly an open problem, one the bAbI dataset was constructed to facilitate progress on. bAbI presents a variety of language understanding and reasoning tasks and exists as a benchmark for comparing approaches.
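
To make the benchmark concrete, each bAbI task pairs a short story with questions whose answers require reasoning over the story. Below is a minimal Python sketch of reading the standard bAbI file format, in which sentences are numbered within a story and question lines carry a tab-separated answer and supporting-fact ids; the file name shown is one of the task files distributed with the dataset, and the local path is an assumption.

    # A sketch of a parser for the bAbI plain-text format.
    def parse_babi(path):
        stories, story = [], []
        for line in open(path, encoding="utf-8"):
            if not line.strip():
                continue
            idx, text = line.strip().split(" ", 1)
            if idx == "1":       # numbering restarts at 1 for each new story
                story = []
            if "\t" in text:     # question lines: question \t answer \t supporting facts
                question, answer, _supports = text.split("\t")
                stories.append((list(story), question.strip(), answer))
            else:
                story.append(text)
        return stories

    examples = parse_babi("qa1_single-supporting-fact_train.txt")
    facts, question, answer = examples[0]
    print(facts)     # e.g. ['Mary moved to the bathroom.', 'John went to the hallway.']
    print(question)  # e.g. 'Where is Mary?'
    print(answer)    # e.g. 'bathroom'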

In this episode, Kyle talks to Rasmus Berg Palm about his recent paper, Recurrent Relational Networks.

Direct download: recurrent-relational-networks.mp3
Category:general -- posted at: 7:47am PDT

In the first half of this episode, Kyle speaks with Marc-Alexandre Côté and Wendy Tay about TextWorld, an engine that simulates text adventure games. Developers are encouraged to try out their reinforcement learning skills by building agents that can programmatically interact with the generated text adventure games.
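
As a flavor of what building such an agent looks like, here is a minimal sketch of the interaction loop, assuming TextWorld's Python quickstart API (textworld.start, env.reset, env.step); the game file name is hypothetical, and a real agent would choose commands with a learned policy rather than at random.

    import random
    import textworld

    # Load a generated game (games can be produced with the tw-make tool;
    # this particular file name is made up for illustration).
    env = textworld.start("games/my_game.ulx")
    game_state = env.reset()

    # A trivial "agent": issue a random command each turn.
    commands = ["look", "inventory", "go north", "go south", "take key"]
    total_reward, done = 0, False
    for _ in range(100):
        game_state, reward, done = env.step(random.choice(commands))
        total_reward += reward
        if done:
            break

    print(game_state.feedback)      # the game's textual response
    print("score:", total_reward)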

In the second half of this episode, Kyle interviews Kevin Patel about his paper Towards Lower Bounds on Number of Dimensions for Word Embeddings. In this research, they explore the important question of how many hidden nodes to use when creating a word embedding.
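
One way to build intuition for that question is to train word2vec at several dimensionalities and watch how the nearest neighbors of a probe word change. A minimal sketch using gensim follows; the toy corpus and probe word are illustrative, and the parameter names assume gensim 4 (vector_size, epochs).

    from gensim.models import Word2Vec

    # Toy corpus; a real experiment would use a large text collection.
    sentences = [
        ["the", "king", "rules", "the", "kingdom"],
        ["the", "queen", "rules", "the", "kingdom"],
        ["the", "cat", "sat", "on", "the", "mat"],
    ] * 100

    # Sweep the embedding dimension and inspect neighbors of "king".
    for dim in (5, 25, 100):
        model = Word2Vec(sentences, vector_size=dim, window=2,
                         min_count=1, epochs=20, seed=42)
        print(dim, model.wv.most_similar("king", topn=3))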

Direct download: text-world-and-word-embedding-lower-bounds.mp3
Category:general -- posted at: 8:00am PDT

Word2vec is an unsupervised machine learning model, based on neural networks, that captures semantic information from the text it is trained on. Several large organizations like Google and Facebook have trained word embeddings (the result of word2vec) on large corpora and shared them for others to use.
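
Those shared embeddings can be loaded and queried directly. Here is a minimal sketch using gensim to load Google's pretrained News vectors; the local file path is an assumption, and the binary file itself is distributed by Google.

    from gensim.models import KeyedVectors

    # Load the pretrained 300-dimensional GoogleNews embeddings
    # (the file path here is hypothetical).
    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # Semantic structure: nearest neighbors and the classic analogy.
    print(vectors.most_similar("computer", topn=3))
    print(vectors.most_similar(positive=["king", "woman"],
                               negative=["man"], topn=1))  # typically 'queen'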

One of the key algorithmic ideas in word2vec is the continuous bag-of-words (CBOW) model. In this episode, Kyle uses excerpts from the 1983 cinematic masterpiece WarGames and challenges Linhda to guess a word Kyle leaves out of the transcript. This is similar to how word2vec is trained: a neural network learns to predict a hidden word based on the words that appear before and after the missing location.
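
To make the guessing game concrete: CBOW averages the vectors of the surrounding words and scores every vocabulary word as the candidate fill-in. Here is a minimal numpy sketch of that forward pass; the tiny vocabulary (drawn from the film's famous line) and random weights are illustrative, since a real model learns both matrices by gradient descent.

    import numpy as np

    vocab = ["shall", "we", "play", "a", "game"]
    word_to_id = {w: i for i, w in enumerate(vocab)}
    V, D = len(vocab), 8             # vocabulary size, embedding dimension

    rng = np.random.default_rng(0)
    W_in = rng.normal(size=(V, D))   # input embeddings (the learned word vectors)
    W_out = rng.normal(size=(D, V))  # output projection

    def cbow_predict(context_words):
        # Average the context vectors, project onto the vocabulary,
        # and softmax to get a distribution over the missing word.
        h = np.mean([W_in[word_to_id[w]] for w in context_words], axis=0)
        scores = h @ W_out
        probs = np.exp(scores - scores.max())
        return probs / probs.sum()

    # "shall we ___ a game" -> distribution over candidates for the blank
    for w, p in zip(vocab, cbow_predict(["shall", "we", "a", "game"])):
        print(f"{w}: {p:.2f}")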

Direct download: word2vec.mp3
Category:general -- posted at: 8:00am PDT
