
Add value to your chatbots through phone calls

Leverage Twilio’s Media Streams to enable voice interactions with your chatbots.


In this article I’m going to walk you through the steps of using voice to interact with an API using:

  • Twilio Media Streams to get the audio stream from the phone call
  • Google Cloud Speech-to-Text for the transcription
  • ngrok to proxy a public URL to the locally running application
  • JavaScript (Node.js) as the programming language

Our chatbot platform Oswald is already equipped with voice interaction but we would like to provide a simple tutorial to roll your own.

In most chatbot platforms, we use the ‘traditional’ channels such as website widgets, Facebook Messenger, WhatsApp, … in short: you type a message, it’s sent to the chatbot which responds in text.

The next step is adding voice interaction to the capabilities of our assistants. For the purpose of this tutorial, I used NodeJS, but Twilio offers their SDK in a wide variety of flavours like C#, PHP, Java and Python.


Prerequisites

  • Basic programming skills
  • Your Google Cloud account is set up ($)
  • Your Twilio account is set up ($)
  • ngrok is installed

Getting started with our chatbot

To keep the tutorial brief, we're going to create the simplest chatbot you have ever seen!

First, we'll create the chatbot API. Create a basic Express application with a POST endpoint, /message, which holds our chatbot in the switch-statement below.

switch (message.toLowerCase()) {
  case 'hello':
    answer = { text: 'Hi, how are you?' };
    break;
  case 'good':
    answer = { text: 'Nice, I\'m happy for you' };
    break;
  case 'bad':
    answer = { text: 'Oh no, you are such a loser' };
    break;
  case 'goodbye':
    answer = { text: 'Nice talking to you', action: 'hangup' };
    break;
}

One thing stands out: the action on 'goodbye', where we want to stop the conversation and hang up.
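One way to package that switch so both the Express endpoint and a quick test can call it is a small helper function. This is a sketch: the function name and the fallback answer are our own additions, not part of the original code.

```javascript
// A minimal sketch of the chatbot logic as a reusable function.
// The Express /message endpoint can then simply call getAnswer(req.body.message).
function getAnswer(message) {
  let answer = { text: "Sorry, I didn't get that" }; // hypothetical fallback
  switch (message.toLowerCase()) {
    case 'hello':
      answer = { text: 'Hi, how are you?' };
      break;
    case 'good':
      answer = { text: "Nice, I'm happy for you" };
      break;
    case 'bad':
      answer = { text: 'Oh no, you are such a loser' };
      break;
    case 'goodbye':
      answer = { text: 'Nice talking to you', action: 'hangup' };
      break;
  }
  return answer;
}
```

Keeping the logic in a plain function also makes it trivial to swap the switch out for a real NLU call later without touching the endpoint.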

You can look at the full code for the chatbot API here:

Twilio transcription service

Now for the interesting part: we will create an application that asks for the audio stream of the phone call and dispatches it to the speech-to-text service. When we receive the transcription we’ll ask our chatbot what to respond.

Our main entry file index.js is built from the four parts below.

It doesn't seem very complicated, so let's break it down!

1) The part with the dispatcher

const http = require('http');
const HttpDispatcher = require('httpdispatcher');
const dispatcher = new HttpDispatcher();

const wsserver = http.createServer((request, response) => {
  dispatcher.dispatch(request, response);
});

const HTTP_SERVER_PORT = process.env.PORT;
wsserver.listen(HTTP_SERVER_PORT, () => {
  console.log("Server listening on: http://localhost:%s", HTTP_SERVER_PORT);
});
In our chatbot API we used Express to handle our routes, but here a simple dispatcher will suffice. We create our server with the instruction to direct every request to our dispatcher. The server is launched on the port defined in our environment.

2) The part with the web socket

const WebSocketServer = require('websocket').server;
const MediaStreamHandler = require('./media-stream-handler');

const mediaws = new WebSocketServer({
  httpServer: wsserver,
  autoAcceptConnections: true
});

mediaws.on('connect', connection => {
  new MediaStreamHandler(connection);
});
We use our previously defined server to initialise our web socket server to accept the audio streams. Twilio will start sending messages on the socket stream. In a later section we’ll discuss how ‘MediaStreamHandler’ does its business.

3) The part with the responder

const RespondingService = require('./responding-service'); 

const respondingService = new RespondingService();

Here we initialise the responding service. Its job is to talk to our chatbot API. The service wants to be notified when a message comes in, so we add it to the listeners.

4) The part that manages everything

const fs = require('fs');
const path = require('path');
const queue = require('./queue');
const TwilioCall = require('./twilio-call');

dispatcher.onPost('/voice/stream', (req, res) => {
  const params = new URLSearchParams(req.body);
  const twilioCall = new TwilioCall(params);

  respondingService.on('update', (response) => {
    // (method name assumed) push the chatbot response to the live call
    twilioCall.send(response);
  });

  const filePath = path.join(__dirname, 'templates', 'streams.xml');
  const stat = fs.statSync(filePath);
  res.writeHead(200, {
    'Content-Type': 'text/xml',
    'Content-Length': stat.size
  });

  const readStream = fs.createReadStream(filePath);
  readStream.pipe(res);
});

The endpoint /voice/stream is the webhook we provide to Twilio as shown in the screenshot.

Twilio config panel

I'm using port 8080, so after you start ngrok with 'ngrok http 8080', your terminal will show the public forwarding URL to use in the webhook.

const params = new URLSearchParams(req.body);
const twilioCall = new TwilioCall(params);

respondingService.on('update', (response) => {
  twilioCall.send(response); // (method name assumed)
});

Twilio also sends some information to the endpoint which we use to create the instance of the Twilio call on our side. The responding service updates the call instance with the response of our chatbot API.

These were the concepts we’re using for our voice interaction tutorial. We’ll briefly cover the remaining pieces of code in other files below.



When we start receiving the audio content, we channel it to Google through the transcription service. When we receive the transcription, we notify all the listeners via the shared queue.
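As a sketch of how that channelling can look: Twilio sends JSON messages over the socket, and 'media' events carry a base64-encoded audio chunk. The class shape below is our own simplification; in particular, the recognizeStream argument is an assumption we make for testability. It can be any writable stream, such as the one returned by Google's streamingRecognize().

```javascript
// Minimal media-stream handler sketch: parse Twilio's socket messages and
// forward decoded audio chunks to the injected speech-to-text stream.
class MediaStreamHandler {
  constructor(connection, recognizeStream) {
    this.recognizeStream = recognizeStream;
    connection.on('message', message => this.processMessage(message));
  }

  processMessage(message) {
    if (message.type !== 'utf8') return; // the websocket package wraps frames
    const data = JSON.parse(message.utf8Data);
    if (data.event === 'media') {
      // decode the base64 audio chunk and channel it to speech-to-text
      this.recognizeStream.write(Buffer.from(data.media.payload, 'base64'));
    }
  }
}
```

Injecting the recognize stream rather than creating it inside the handler keeps the Twilio side and the Google side loosely coupled.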



We construct a response, leveraging the Twilio SDK. We then send out an update with the VoiceResponse XML string. If an action is included in the answer, we add it to the VoiceResponse; in this case only 'hangup' is provided.
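The real service goes through the SDK's VoiceResponse helper; as a dependency-free sketch of the same idea (the function name is ours), the equivalent TwiML can be assembled by hand:

```javascript
// Build the TwiML string for an answer from the chatbot API.
// (The actual service uses Twilio's VoiceResponse helper; this hand-rolled
// version produces the equivalent XML to keep the example self-contained.)
function buildTwiml(answer) {
  let body = `<Say>${answer.text}</Say>`;
  if (answer.action === 'hangup') {
    body += '<Hangup/>'; // the only action our chatbot currently emits
  }
  return `<?xml version="1.0" encoding="UTF-8"?><Response>${body}</Response>`;
}
```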



TwilioCall is used to send updates in the form of a TwiML. Twilio will then use its own text-to-speech service to send the audio back to the phone.

Initial XML

Replace with your ngrok proxy 😉
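For reference, a streams.xml along these lines tells Twilio to fork the call audio to our socket; the wss URL, greeting, and pause length are placeholders, while the TwiML verbs themselves (Start, Stream, Say, Pause) are real.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Start>
    <Stream url="wss://your-subdomain.ngrok.io/" />
  </Start>
  <Say>Hello, how can I help you?</Say>
  <Pause length="60" />
</Response>
```

The Pause keeps the call open while the stream is being processed; without it, Twilio would hang up as soon as the greeting finishes.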




Tucked away is the setup of the stream from the Google Speech package. Nothing special but necessary.
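The streaming request we hand to Google looks roughly like this. The encoding and sample rate match what Twilio's Media Streams deliver; interimResults is a choice, not a requirement, and the commented-out client setup shows where the config plugs in.

```javascript
// Configuration for Google's streamingRecognize call (a sketch; the client
// itself would come from the @google-cloud/speech package:
//   const speech = require('@google-cloud/speech');
//   const client = new speech.SpeechClient();
// ).
const request = {
  config: {
    encoding: 'MULAW',      // Twilio Media Streams send audio/x-mulaw
    sampleRateHertz: 8000,  // at 8000 Hz
    languageCode: 'en-US'
  },
  interimResults: false     // only deliver final transcripts
};

// const recognizeStream = client.streamingRecognize(request)
//   .on('data', data => {
//     // data.results[0].alternatives[0].transcript holds the recognised text
//   });
```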

And 💥BOOM💥 you brought a voice enabled conversational interface into existence.

You can find the code for the streaming service here:


We use Twilio Media Streams to add voice to the means of interacting with our chatbots, while ngrok makes our locally running program publicly available. Google's speech-to-text service converts the stream into text, but it can be replaced by any provider.

Do you just want a chatbot platform that incorporates this feature with a one-click integration? Contact us via

Next steps could include:

  • Respond with an audio stream so we can control the answering voice
  • Adapt the bridge to the speech-to-text service on every response
