How Did Google Build Duplex?

And where did they get their training data?

It’s been several weeks since Sundar Pichai showed off Google Duplex, in what will likely go down as a demo for the ages. Ars Technica wrote:

Google CEO Sundar Pichai played back two phone conversations that he alleged were 100-percent legitimate, in which Google’s AI-driven voice service called real-world businesses and scheduled appointments based on a user’s data. In both cases, voices that sounded decidedly more human and realistic than the default female Google Assistant voice used seemingly natural speech patterns. Phrases like “um” and a decidedly West Coast question-like lilt could be heard as Google Duplex confirmed both a salon appointment and a dinner reservation.

Google calls this product Google Duplex. It’s a tool for automating specific types of phone conversations. Here it is in action:

For about a day the tech world was stunned by the demo. The cadence and tone of Google Duplex’s speech were incredibly realistic. After 24 hours, awe matured into apprehension as people weighed the ethics of the application.

A central question emerged: “Is it wrong for an automated service to masquerade as a human?”

Google gave us a few days to ponder the question before saying Duplex would identify itself as a robot. But the froth subsided slowly. Axios called a few salons and restaurants around Google’s HQ (despite the developers appearing to be based in Israel) and used that experience as evidence that Pichai’s demo was rigged.

I think Google Duplex is fascinating. It is a perfect locus for discussing ethics, technology, culture, and our relationship with robots.

But I think we missed the real story behind Duplex. We were so struck by its ability to pass as human that we missed its smarts: the domain logic that enables it to successfully navigate a phone call with a salon or restaurant. Duplex’s domain logic is the real story, a feature only Google could build, and it’s worth further examination.

  1. How does Google Duplex work? Why can Google build it and not someone else?
  2. What data did Google use to train Duplex? If it used phone calls, where did they come from and how many do they have?
  3. Do we care where Google (or anyone else) gets training data? Why would we want to provide or withhold our own actions?
  4. When (if at all) is it okay for robots to try to pass as human?

What is Google Duplex made of?

Google hasn’t said much about Duplex, aside from the initial demo and a high-level Google AI blog post. Based on this limited information and Google’s demonstrations, Duplex likely has the following functional tasks:

  1. Speech Recognition: A system for parsing what the human on the other end of the line is saying.
  2. Domain Logic: A system which parses and recognizes sentences related to appointment booking, then generates appropriate replies to successfully create an appointment.
  3. Text Generation: A system for turning the domain logic’s output into natural language.
  4. Text Humanization: A system for making the stilted, computer-generated sentences appear more human.
  5. Speech Generation: A system for turning text into human-sounding speech.
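As a rough sketch of how these five pieces might fit together, here is a toy turn-handling loop. Every function below is a hypothetical stand-in of my own invention (the real components are large ML systems, and Google hasn’t published Duplex’s implementation); the domain logic is reduced to a couple of hand-written rules for a booking call:

```python
def recognize_speech(audio):
    # 1. Speech Recognition: stand-in -- treat the "audio" as already-transcribed text.
    return audio

def domain_logic(text, state):
    # 2. Domain Logic: toy rules for a booking call (the real system is learned, not rules).
    text = text.lower()
    if "what time" in text:
        return ("time", state["requested_time"])
    if "how many" in text:
        return ("party", state["party_size"])
    return ("confirm", None)

def generate_text(action, value):
    # 3. Text Generation: turn the domain logic's output into a sentence.
    if action == "time":
        return f"I'd like {value}."
    if action == "party":
        return f"It's for {value} people."
    return "Yes, that works. Thank you."

def humanize(sentence):
    # 4. Text Humanization: toy version -- tack on a filler word.
    return "Um, " + sentence

def handle_turn(audio, state):
    # One cycle of the pipeline. (5. Speech Generation / TTS is omitted here.)
    action, value = domain_logic(recognize_speech(audio), state)
    return humanize(generate_text(action, value))

state = {"requested_time": "7 pm", "party_size": 4}
print(handle_turn("What time would you like to come in?", state))
# Um, I'd like 7 pm.
```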

There’s probably a coordination system in there as well, negotiating between all those systems and handling interruptions as these five steps repeat in a cycle. Given the framework above, we can diagram an example call interaction like so:

Despite most of the hand-wringing focusing on steps 4 and 5, the meat of Google Duplex is step 2, the domain logic. Facebook, Amazon, Apple, and others can recognize and generate speech well enough for this application. But they are likely unable to build the domain logic Google is showing off with Duplex. Before we get into why, let’s look at what the domain logic needs to do and how you’d build one.

[let’s say you want to build a chat bot which books appointments OR…]

[alexa/siri example of playing a song]

The above explains how Duplex functions, but it doesn’t explain how it works. My diagram is probably too clean, too linear for an application built with deep learning. On their AI blog, Google diagrams Duplex more accurately:

If you aren’t familiar with deep learning, this diagram will likely be a bit confusing. Let’s simplify it a touch:

Imagine deep learning as a box. Into the box we put the audio from the phone call, the text generated from the speech recognition software, and additional contextual data: the duration of the call, the previous sentences spoken, what time of day it is, etc. All that data bangs around in the box until a response falls out, which is translated into audio. In this diagram, the deep learning box replaces the Domain Logic and Text Generation steps from our original diagram. (It is unclear from Google’s blog post whether our Text Humanization step lives in the deep learning box or in their text-to-speech (TTS) step.)
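To make the “box” concrete, here is a toy numerical version of the idea: encode the audio, the transcript, and the context as vectors, concatenate them into one input, and push that through a model which scores candidate responses. The random weights stand in for what training would learn; none of this is Google’s actual model (their blog post describes a recurrent network).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy inputs to the "deep learning box" -- in reality these would come
# from an acoustic model, a text encoder, and call metadata.
audio_features = rng.normal(size=8)    # encoding of the caller's audio
text_features = rng.normal(size=16)    # encoding of the recognized text
context = np.array([42.0, 3.0, 18.0])  # call duration (s), turn count, hour of day

# Everything goes into the box as a single vector.
x = np.concatenate([audio_features, text_features, context])

# A single random linear layer stands in for the trained network;
# this is only the shape of the idea, not a real architecture.
W = rng.normal(size=(5, x.size))
scores = W @ x                        # one score per candidate response type
response_id = int(np.argmax(scores))  # the response that "falls out" of the box
```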

[deep learning in a nutshell]

[this is what people mean when they say Deep Learning will replace some coding. They aren’t writing elaborate rules for each individual case, but throwing everything into a multi-layered model which handles the complexity as best it can.]

[tricking gpus to make problems look like videogames]

[word embedding]

[borrowing word embedding (google has enough) vs purpose built embedding (need lots of training data)]

[we’re upset about speech generation and text humanization.]

[oddly we aren’t upset about the domain logic, more on that later]

Case Study 1: Grubhub & Fax Machines

[Google duplex probably will save restaurants money because they won’t have to share a cut with Grubhub or OpenTable.]

Why So Mundane?

[If you had a magical talking robot that could trick people with uncanny realism, it seems like a waste to use it to make a haircut appointment.]

[This is the limitation of machine learning: it isn’t general. They could do this because they could acquire the data to do this. Find the phone numbers of a bunch of salons, record every time someone called them, sort through the calls to find only the ones about appointments, then train off that.]

[System for breaking up with a boyfriend. These calls would be hard to find. So you can’t abuse this system for anything other than the most mundane calls possible.]

[Where did these calls come from? Did Google strike deals with tons of salons? (Probably not, because it would be easier to get them to stock iPads to automate orders.) No, they probably got them through Android.]

 — — 

Notes on it from a product manager

Does it follow my rules?

  • It starts with low stakes challenges (though not as low as photo categorization)
  • It has mechanisms to prime the pump (hired workers to hand train and handle failed calls and Android/GVoice footprint)
  • With usage it gets more data. The phone calls to people are its new data.

But it also breaks a couple of them:

  • It can’t be used nationally, or even internationally, without issue due to call recording laws
  • It has terrible optics