
Add value to your chatbots through phone calls

Daphné Vermeiren

Leverage Twilio’s Media Streams to enable voice interactions with your chatbots.


In this article I’m going to walk you through the steps of using voice to interact with an API using:

  • Twilio Media Streams to get the audio stream from the phone call
  • Google Cloud Speech-to-Text for the transcription
  • ngrok to proxy a public URL to our locally running application
  • JavaScript (NodeJS) as the programming language

Our chatbot platform Oswald is already equipped with voice interaction, but we’d like to provide a simple tutorial so you can roll your own.

In most chatbot platforms, we use the ‘traditional’ channels such as website widgets, Facebook Messenger, WhatsApp, … in short: you type a message, it’s sent to the chatbot which responds in text.

The next step is adding voice interaction to the capabilities of our assistants. For the purpose of this tutorial, I used NodeJS, but Twilio offers their SDK in a wide variety of flavours like C#, PHP, Java and Python.


Prerequisites

  • Basic programming skills
  • A Google Cloud account, set up ($)
  • A Twilio account, set up ($)
  • ngrok installed

Getting started with our chatbot

To keep the tutorial brief, we’re going to create the simplest chatbot you have ever seen!

First, we’ll create the chatbot API. Create a basic Express application with a POST endpoint, /message, which holds our chatbot logic in the switch statement below.

switch (message.toLowerCase()) {
  case 'hello':
    answer = { text: 'Hi, how are you?' };
    break;
  case 'good':
    answer = { text: 'Nice, I\'m happy for you' };
    break;
  case 'bad':
    answer = { text: 'Oh no, you are such a loser' };
    break;
  case 'goodbye':
    answer = { text: 'Nice talking to you', action: 'hangup' };
    break;
  default:
    break;
}

One thing stands out: the action on ‘goodbye’, where we want to stop the conversation and hang up.
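Wired into Express, the endpoint could look roughly like this. This is a sketch: the getAnswer helper name, the default answer, and the port are my own choices, not taken from the repository.

```javascript
// Pure chatbot logic: maps an incoming message to an answer object.
function getAnswer(message) {
  let answer = { text: 'Sorry, I did not get that' }; // assumed fallback
  switch (message.toLowerCase()) {
    case 'hello':
      answer = { text: 'Hi, how are you?' };
      break;
    case 'good':
      answer = { text: 'Nice, I\'m happy for you' };
      break;
    case 'bad':
      answer = { text: 'Oh no, you are such a loser' };
      break;
    case 'goodbye':
      // The 'hangup' action tells the caller-side code to end the call.
      answer = { text: 'Nice talking to you', action: 'hangup' };
      break;
    default:
      break;
  }
  return answer;
}

// Express wiring (assumes express is installed); call startServer() to launch.
function startServer() {
  const express = require('express');
  const app = express();
  app.use(express.json());

  // POST /message with { "message": "hello" } returns the chatbot's answer.
  app.post('/message', (req, res) => {
    res.json(getAnswer(req.body.message || ''));
  });

  app.listen(process.env.PORT || 3000);
}
```

Keeping the switch in a pure function makes the bot trivial to unit-test without starting the server.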

You can look at the full code for the chatbot API here:

Twilio transcription service

Now for the interesting part: we’ll create an application that asks for the audio stream of the phone call and dispatches it to the speech-to-text service. When we receive the transcription, we ask our chatbot how to respond.

This is what our main entry file index.js will look like.

Doesn’t seem very complicated, does it? Let’s break it down!

1) The part with the dispatcher

const http = require('http');
const HttpDispatcher = require('httpdispatcher');

const dispatcher = new HttpDispatcher();
const wsserver = http.createServer((request, response) => {
  dispatcher.dispatch(request, response);
});

const HTTP_SERVER_PORT = process.env.PORT;
wsserver.listen(HTTP_SERVER_PORT, () => {
  console.log("Server listening on: http://localhost:%s", HTTP_SERVER_PORT);
});

In our chatbot API we used Express to handle our routes, but here a simple dispatcher will suffice. We create our server with the instruction to direct every request to our dispatcher. The server is launched on the port defined in our environment.

2) The part with the web socket

const WebSocketServer = require('websocket').server;
const MediaStreamHandler = require('./media-stream-handler');

const mediaws = new WebSocketServer({
  httpServer: wsserver,
  autoAcceptConnections: true
});

mediaws.on('connect', connection => {
  new MediaStreamHandler(connection);
});

We use our previously defined server to initialise our web socket server to accept the audio streams. Twilio will start sending messages on the socket stream. In a later section we’ll discuss how ‘MediaStreamHandler’ does its business.

3) The part with the responder

const RespondingService = require('./responding-service');

const respondingService = new RespondingService();
queue.listeners.push(respondingService);

Here we initialise the responding service. Its job is to talk to our chatbot API. The service wants to be notified when a message comes in, so we add it to the queue’s listeners.
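The shared queue module itself can stay tiny. This is a sketch of what ./queue.js might contain; the listener method name onMessage is an assumption for illustration, not necessarily what the repository uses.

```javascript
// A minimal shared queue: anything that pushes a message notifies every
// registered listener (e.g. the responding service).
const queue = {
  listeners: [],
  push(message) {
    // onMessage is an assumed listener interface for this sketch.
    this.listeners.forEach((listener) => listener.onMessage(message));
  }
};

module.exports = queue;
```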

4) The part that manages everything

const fs = require('fs');
const path = require('path');
const queue = require('./queue');
const TwilioCall = require('./twilio-call');

dispatcher.onPost('/voice/stream', (req, res) => {
  const params = new URLSearchParams(req.body);
  const twilioCall = new TwilioCall(
    params.get("AccountSid"),
    params.get("CallSid"),
    process.env["TWILIO_AUTH_TOKEN"]
  );

  respondingService.on('update', (response) => {
    twilioCall.update(response);
  });

  const filePath = path.join(__dirname + '/templates', 'streams.xml');
  const stat = fs.statSync(filePath);
  res.writeHead(200, {
    'Content-Type': 'text/xml',
    'Content-Length': stat.size
  });
  const readStream = fs.createReadStream(filePath);
  readStream.pipe(res);
});

The endpoint /voice/stream is the webhook we provide to Twilio as shown in the screenshot.

Twilio config panel

I’m using port 8080, so after you start ngrok, your terminal will look like this:


const params = new URLSearchParams(req.body);
const twilioCall = new TwilioCall(
  params.get("AccountSid"),
  params.get("CallSid"),
  process.env["TWILIO_AUTH_TOKEN"]
);

respondingService.on('update', (response) => {
  twilioCall.update(response);
});

Twilio also sends some information to the endpoint which we use to create the instance of the Twilio call on our side. The responding service updates the call instance with the response of our chatbot API.

Those are the main concepts we’re using for our voice interaction tutorial. Below, we’ll briefly cover the remaining pieces of code in the other files.



When we start receiving the audio content, we channel it to Google through the transcription service. When we receive the transcription, we notify all the listeners via the shared queue.
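As a sketch of what media-stream-handler.js does: it parses the JSON messages Twilio sends over the socket (the 'media' and 'stop' event names and the base64 media.payload field are Twilio’s Media Streams format; the injected transcriptionService is my own simplification, since the article constructs it internally).

```javascript
class MediaStreamHandler {
  constructor(connection, transcriptionService) {
    this.transcriptionService = transcriptionService;
    connection.on('message', (message) => this.processMessage(message));
    connection.on('close', () => this.close());
  }

  processMessage(message) {
    // The 'websocket' package wraps text frames as { type: 'utf8', utf8Data }.
    if (message.type !== 'utf8') return;
    const data = JSON.parse(message.utf8Data);

    // Twilio sends 'connected', 'start', 'media' and 'stop' events; the
    // call audio arrives base64-encoded in media.payload.
    if (data.event === 'media') {
      const audioChunk = Buffer.from(data.media.payload, 'base64');
      this.transcriptionService.send(audioChunk);
    } else if (data.event === 'stop') {
      this.close();
    }
  }

  close() {
    this.transcriptionService.close();
  }
}

module.exports = MediaStreamHandler;
```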



We construct a response, leveraging the Twilio SDK. We then send out an update with the VoiceResponse XML string. If an action is included in the answer, we add it to the VoiceResponse; in this case only ‘hangup’ is provided.



TwilioCall is used to send updates in the form of TwiML. Twilio will then use its own text-to-speech service to send the audio back to the phone.

Initial XML
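The TwiML served from templates/streams.xml could look roughly like this. It’s a sketch: Start/Stream is Twilio’s element for forking the call audio to a WebSocket, while the greeting text and pause length are placeholders of mine (the Pause keeps the call alive while we stream audio and wait for responses).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Stream url="wss://d08ca5a2.ngrok.io/" />
  </Start>
  <Say>Hello, say something to the chatbot!</Say>
  <Pause length="60" />
</Response>
```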

Replace d08ca5a2.ngrok.io with your ngrok proxy 😉




Tucked away here is the setup of the streaming recognizer from the Google Speech package. Nothing special, but necessary.
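For reference, that setup typically looks like the sketch below, using @google-cloud/speech. Twilio Media Streams deliver 8 kHz mulaw audio, which is why the config asks for MULAW at 8000 Hz; the createRecognizeStream wrapper and its onTranscript callback are my own framing.

```javascript
// Twilio streams the call audio as 8 kHz mulaw, so we configure the
// recognizer accordingly. interimResults: false means we only receive
// final transcripts, not partial guesses.
const request = {
  config: {
    encoding: 'MULAW',
    sampleRateHertz: 8000,
    languageCode: 'en-US'
  },
  interimResults: false
};

// Creating the stream itself (assumes @google-cloud/speech is installed
// and GOOGLE_APPLICATION_CREDENTIALS points at your service account key).
function createRecognizeStream(onTranscript) {
  const speech = require('@google-cloud/speech');
  const client = new speech.SpeechClient();
  return client
    .streamingRecognize(request)
    .on('error', console.error)
    .on('data', (data) => {
      const result = data.results[0];
      if (result && result.alternatives[0]) {
        onTranscript(result.alternatives[0].transcript);
      }
    });
}
```

Audio chunks from the media stream handler are then written straight into the returned stream.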

And 💥BOOM💥 you brought a voice enabled conversational interface into existence.

You can find the code for the streaming service here:



We use Twilio Media Streams to add voice to the ways of interacting with our chatbots, while ngrok makes our locally running program publicly available. Google’s Speech-to-Text service converts the stream into text, but it can be replaced by any provider.

Do you just want a chatbot platform that incorporates this feature with a one-click integration? Contact us via hello@oswald.ai

Next steps could include:

  • Respond with an audio stream so we can control the answering voice
  • Adapt the bridge to the speech-to-text service on every response
