Data Skeptic

If a CEO wants to know the state of their business, they ask their highest-ranking executives. These executives, in turn, should know the state of the business through reports from their subordinates. This structure is roughly analogous to a process observed in deep learning: each layer of the business passes observations, KPIs, and reports up to the next layer, which interprets them. In deep learning, this process can be thought of as automated feature engineering. DNNs built to recognize objects in images may learn structures that behave like edge detectors in the first hidden layer. Subsequent layers learn to compose more abstract features from those lower-level outputs. This episode explores that analogy in the context of automated feature engineering.
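For readers who like to see the idea in code, here is a minimal sketch of the "layers reporting upward" picture. It is only an illustration under stated assumptions: the two filters are written by hand to stand in for what a first hidden layer might learn, nothing here is trained, and the names (`convolve2d`, `corner_map`, `VERTICAL_EDGE`, `HORIZONTAL_EDGE`) are made up for this example rather than taken from the episode or the paper.

```python
def convolve2d(image, kernel):
    """Valid-mode 2D cross-correlation of a list-of-lists image with a kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Keep positive responses, zero out the rest."""
    return [[max(0, v) for v in row] for row in feature_map]


# ASSUMPTION: hand-written filters standing in for learned first-layer features.
VERTICAL_EDGE = [[-1, 1], [-1, 1]]    # dark-to-bright moving left to right
HORIZONTAL_EDGE = [[-1, -1], [1, 1]]  # dark-to-bright moving top to bottom


def corner_map(image):
    """Second 'layer': fires only where both edge features are active."""
    v = relu(convolve2d(image, VERTICAL_EDGE))
    h = relu(convolve2d(image, HORIZONTAL_EDGE))
    return [[min(a, b) for a, b in zip(row_v, row_h)]
            for row_v, row_h in zip(v, h)]


# A 4x4 image with a bright square in the bottom-right corner.
image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

corners = corner_map(image)
for row in corners:
    print(row)
```

The first layer only knows about edges. The second layer never looks at pixels at all; it combines the edge reports into a more abstract "corner" feature, much like an executive summarizing reports from subordinates.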

Linh Da and Kyle discuss a particular image in this episode. The image included below in the show notes is drawn from the work of Lee, Grosse, Ranganath, and Ng in their paper "Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations."


Direct download: automated-feature-engineering.mp3
Category:general -- posted at: 8:00am PDT

In this episode, I speak with Raghu Ramakrishnan, CTO for Data at Microsoft.  We discuss services, tools, and developments in the big data sphere as well as the underlying needs that drove these innovations.

Direct download: big-data-tools-and-trends.mp3
Category:general -- posted at: 9:28am PDT

In this episode, we give a high-level description of deep learning. Kyle presents a simple game (pictured below), which is really more of a puzzle, to try to give Linh Da the basic concept.



Thanks to our sponsor for this week, the Data Science Association. Please check out their upcoming Dallas conference at

Direct download: deep-learning-primer.mp3
Category:general -- posted at: 8:00am PDT

Versioning isn't just for source code. Being able to track changes to data is critical for answering questions about data provenance, quality, and reproducibility. Daniel Whitenack joins me this week to talk about these concepts and share his work on Pachyderm. Pachyderm is an open source containerized data lake.
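To make the idea of versioned data a bit more concrete, here is a toy sketch of tracking changes to a dataset with content hashes and parent pointers, loosely in the spirit of source control. This is not Pachyderm's API or how Pachyderm is implemented; the class `TinyDataRepo` and its methods are hypothetical names invented for this illustration, and everything lives in memory.

```python
import hashlib
import json


class TinyDataRepo:
    """Hypothetical in-memory store that versions snapshots of a dataset."""

    def __init__(self):
        self.commits = {}  # commit id -> metadata and serialized data
        self.head = None   # id of the most recent commit

    def commit(self, records, message):
        # Serialize deterministically so identical data hashes identically.
        payload = json.dumps(records, sort_keys=True).encode("utf-8")
        data_hash = hashlib.sha256(payload).hexdigest()
        # The commit id depends on the parent, so history is tamper-evident.
        commit_id = hashlib.sha256(
            f"{self.head}:{data_hash}".encode("utf-8")
        ).hexdigest()[:12]
        self.commits[commit_id] = {
            "parent": self.head,
            "data_hash": data_hash,
            "payload": payload,
            "message": message,
        }
        self.head = commit_id
        return commit_id

    def checkout(self, commit_id):
        """Return the exact records stored at a given commit."""
        return json.loads(self.commits[commit_id]["payload"])

    def history(self):
        """Walk parent pointers from the newest commit back to the first."""
        entries = []
        commit_id = self.head
        while commit_id is not None:
            entries.append((commit_id, self.commits[commit_id]["message"]))
            commit_id = self.commits[commit_id]["parent"]
        return entries


repo = TinyDataRepo()
first = repo.commit([{"id": 1, "score": 0.5}], "raw import")
second = repo.commit([{"id": 1, "score": 0.7}], "rescored")

print(repo.checkout(first))
print([message for _, message in repo.history()])
```

Because every snapshot is addressable by its commit id, a result can be traced back to the exact data that produced it, which is the provenance and reproducibility question the episode is about.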

During the show, Daniel mentioned the Gopher Data Science GitHub repo as a great resource for any data scientists interested in the Go language. Although we didn't mention it, Daniel also did an interesting analysis of the 2016 World Chess Championship that complements our recent episode on chess well. You can find that post here

Supplemental music is Lee Rosevere's Let's Start at the Beginning.


Thanks to Periscope Data for sponsoring this episode. More about them at Periscope Data.




Direct download: data-provenance-and-reproducibility-with-pachyderm.mp3
Category:general -- posted at: 9:09am PDT