I'm a creator, product manager and programmer based in London


Writings and musings about history, technology, development and anything else that seems interesting.

Voice Controlled Video with Wistia

As a favour for a friend who runs a video production agency, I recently developed a prototype for a "Choose Your Own Adventure" style video experience which users could watch and control with their voice. Unfortunately the project wasn't commissioned in the end – but it was particularly interesting to work on. Here’s how I went about building the proof of concept.

The initial brief was to create a web-based video experience whose narrator could guide the user through a series of videos about a car. The narrator would ask the user questions, with the user's spoken responses then recorded and parsed by speech recognition to dictate the course of the video. There would be three possible routes and, as this was a demo, the videos were pieced together from various YouTube clips. Here's an example of the story flow:

Voice Controlled Video Demo Flow

With a tight deadline and my fairly average web development skills, I knew I would need to find some clever solutions and lean on others' work and libraries to create the experience, rather than building it all from scratch. I broke the project down into three parts:

  • Video: Where will the videos be stored? How will they be played? As a developer, how can I control the video player to change the video after a command is recognised?

  • Audio: Will the audio be embedded in the video or included as separate files? How will they be played and controlled if we are using the latter method?

  • Speech Recognition: How will the user's commands be processed? How will these commands trigger changes?

Here's how I tackled them...


Video

My initial thought was to embed the videos in my web application, but after some experimentation it was apparent that buffering times made for a disjointed experience unless I put a significant amount of work into preloading videos.

My next idea was to embed a YouTube video player into the web page and manipulate the videos it plays using the YouTube JavaScript API. The API has the ability to seek to a particular point in a video, pause it and perform various other essential functions. In the end I decided to use the Wistia Player API. Wistia is essentially a white-labelled YouTube: it provides a service for businesses to host, manage and deliver their videos to customers, and allows you to add interactive components to videos easily. Their player API is similar to that of YouTube, but has more commands and events that I thought could be useful. The only downside of choosing Wistia was the pricey $99 a month fee to host more than three videos, whether they are being viewed by one or a thousand people.

Wistia's documentation was easy to understand, with plenty of samples, so I was able to build the code I needed very quickly:

This is the only HTML code required. To load a video on start I inserted the video ID in the marked location above.

The JavaScript code required to manipulate the player includes a function to replace a video in the player, and an object we push to the initialization queue whose functions will be called when _all videos are Ready (loaded) or Embedded (first appear on the page).

To add functions for a particular video, change the id from _all to that video's ID and push the object to the _wq array.

To be completely honest, I'm really not sure what this _wq array is all about or exactly how it functions, but I'm guessing it is specific to Wistia, as my searches for more details have come up dry.
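As far as I can tell, _wq follows a pattern common to scripts that load asynchronously: the page creates a plain array before the library arrives, and the library works through whatever has been queued once it loads. Here is a minimal sketch of the two pieces described above, not the exact code I used. The onReady and onEmbedded callbacks, bind, hashedId and replaceWith reflect my reading of Wistia's Player API documentation; currentVideo and swapVideo are hypothetical names, and I am assuming onReady fires again with a fresh handle after a replacement.

```javascript
// Create the queue if Wistia's script has not already done so.
// globalThis is `window` in a browser.
const wq = (globalThis._wq = globalThis._wq || []);

// Hypothetical: handle to the most recently ready player.
let currentVideo = null;

// Hypothetical helper: replace the video in the player with another one.
// Returns false if no player is ready yet.
function swapVideo(hashedId) {
  if (!currentVideo) return false;
  currentVideo.replaceWith(hashedId);
  return true;
}

// "_all" applies these callbacks to every video on the page.
wq.push({
  id: "_all",
  onEmbedded: (video) => console.log("embedded:", video.hashedId()),
  onReady: (video) => {
    // Assumption: called again after replaceWith, so the handle stays current.
    currentVideo = video;
    video.bind("end", () => console.log("finished:", video.hashedId()));
  },
});
```

Nothing happens when this runs on its own; the callbacks only fire once Wistia's script has loaded and processed the queue.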

With this code I had everything I needed to trigger events for different video states, swap a video in and out, and control playback. The only difficulty I had with the Player API involved getting some videos to not autoplay; the easiest way I found to prevent this was to edit the video settings on the Wistia site.

Editing video settings within Wistia to prevent autoplaying.


Audio

In order to prompt the user for a command and to add some explanatory narration, an audio track was required. I experimented with several options, and while playing separate audio tracks at certain points in the video worked best (as it allows a prompt to be played multiple times whilst the video is paused), we decided to incorporate the tracks into the video to simplify things.

Having previously had difficulty playing audio formats across browsers while working on Foundbite, I decided to use Howler.js to play my audio tracks:

If a track is already playing I pause it before creating a new one (otherwise Howler will play them both on top of each other), then play the new track and subscribe to its end event, where I have added the option of calling a function to begin listening to the user, if required (more below).
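A minimal sketch of that logic, rather than the exact code I used. The Howl constructor is passed in so the function can be exercised outside a browser; createNarrator, playTrack and the file path in the usage comment are hypothetical names. The calls on the Howl instance (playing, pause, once and play) are the ones I understand Howler.js to provide.

```javascript
// Factory so the Howl constructor can be injected. In the browser, pass the
// global `Howl` that howler.js defines.
function createNarrator(HowlCtor) {
  let current = null;

  return function playTrack(src, onFinished) {
    // Pause a track that is still playing so the two do not overlap.
    if (current && current.playing()) {
      current.pause();
    }

    current = new HowlCtor({ src: [src] });

    // Optionally run a callback when the track ends, e.g. to start
    // listening for a voice command once the prompt has been spoken.
    if (typeof onFinished === "function") {
      current.once("end", onFinished);
    }

    current.play();
    return current;
  };
}

// Browser usage (assumes howler.js is loaded and the file exists):
// const playTrack = createNarrator(Howl);
// playTrack("audio/question-1.mp3", () => console.log("prompt finished"));
```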

Google Chrome justly blocks sound from playing when a web page loads until the user has interacted with the page. In this case my Wistia settings to prevent the introduction video autoplaying mean the user has to click play anyway, thereby allowing Howler to begin working.

We decided to add a subtle backing track that plays on repeat throughout the experience and pauses while the user is being prompted for a command; this made it more obvious when it was their turn to speak.


Speech Recognition

In order to avoid getting dragged down a complicated speech recognition rabbit hole, I again elected to use a prebuilt solution: Annyang.js, a wrapper around the Web Speech API.

Without the Web Speech API I would have needed to write code to access the microphone, record the user's speech, upload the recording or an audio stream to a third-party speech-to-text service, and then parse the response for my desired terms: definitely not an easy undertaking in the time I had available. The biggest downside is the API's current lack of widespread browser support; it is basically only available in Google Chrome. For a proof of concept this is fine, but for a wider audience this solution probably wouldn't be acceptable.
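Given the patchy support, it is worth checking for the API before relying on it. The sketch below is one way to do that, assuming the recognition constructor is exposed either unprefixed or, as in Chrome, with a webkit prefix. The function names are hypothetical.

```javascript
// Returns the speech recognition constructor if the environment provides
// one, otherwise null. `root` is `window` in a browser.
function getSpeechRecognition(root = globalThis) {
  return root.SpeechRecognition || root.webkitSpeechRecognition || null;
}

// Convenience wrapper for a simple yes/no check.
function hasSpeechRecognition(root = globalThis) {
  return getSpeechRecognition(root) !== null;
}

console.log("speech recognition supported:", hasSpeechRecognition());
```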

The Web Speech API does most of the heavy lifting but Annyang adds a simpler API on top with the addition of commands - pre-configured words that, when recognised, call a function. For example:

var commands = {
  // annyang will capture anything after a splat (*) and pass it to the function.
  // e.g. saying "Show me Batman and Robin" will call showFlickr('Batman and Robin');
  'show me *tag': showFlickr,

  // A named variable is a one word variable, that can fit anywhere in your command.
  // e.g. saying "calculate October stats" will call calculateStats('October');
  'calculate :month stats': calculateStats,

  // By defining a part of the following command as optional, annyang will respond
  // to both: "say hello to my little friend" as well as "say hello friend"
  'say hello (to my little) friend': greeting
};

Using this, the code I added to my solution is as follows:

Here I first create the commands needed - two which use optional words - and the functions they should call when recognised. Annyang is then started in paused mode, to prevent it recognising words from the video. When awaiting a command the background music is paused and Annyang resumes listening. Afterwards Annyang is paused, the background music resumes and the relevant video replaces the one currently playing.
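The flow just described can be sketched roughly as follows. This is an illustration under assumptions rather than the exact code I used: the command phrases and video IDs are made up, createVoiceControl is a hypothetical name, and the dependencies are passed in so the logic can run outside a browser. The calls on the recogniser (addCommands, start with a paused option, pause and resume) are the ones I understand Annyang to expose.

```javascript
// In the page, `recogniser` would be the global `annyang`, `music` a looping
// Howl instance for the backing track, and `showVideo` a function that swaps
// the video in the Wistia player.
function createVoiceControl({ recogniser, music, showVideo }) {
  function choose(videoId) {
    recogniser.pause(); // stop listening so the next video is not "heard"
    music.play();       // bring the backing track back
    showVideo(videoId); // replace the video currently playing
  }

  // Parentheses mark optional words. Phrases and IDs are hypothetical.
  recogniser.addCommands({
    "(show me the) engine": () => choose("engine-video-id"),
    "(take me for a) drive": () => choose("drive-video-id"),
    "interior": () => choose("interior-video-id"),
  });

  // Start paused so the video's own soundtrack is ignored.
  recogniser.start({ paused: true });

  return {
    // Call this when the narrator has finished asking a question.
    awaitCommand() {
      music.pause();
      recogniser.resume();
    },
  };
}
```

Keeping the pause and resume calls in one place made it easier to reason about when the microphone was live.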

Putting this together was remarkably easy, with one important discovery: Annyang will not work unless you are viewing an HTTPS site!
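A small pre-flight check can make this failure easier to spot. The sketch below assumes the browser exposes isSecureContext on the window and that Annyang's global is missing or null when it cannot be used. Both assumptions match my understanding, but treat them as such. listeningProblem is a hypothetical name.

```javascript
// Returns a short reason why listening is unlikely to work, or null if
// nothing obvious is wrong. `root` is `window` in a browser.
function listeningProblem(root = globalThis) {
  if (!root.isSecureContext) {
    return "page is not served from a secure origin (HTTPS)";
  }
  if (!root.annyang) {
    return "annyang did not load or speech recognition is unsupported";
  }
  return null;
}

// Browser usage:
// const problem = listeningProblem(window);
// if (problem) console.warn("Voice control disabled:", problem);
```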

The demo playing in Google Chrome. Opening the developer console and running Annyang in developer mode will show you exactly what it is hearing.

Putting It All Together

The final bits of code to complete the solution are shown below: functions that will play sound or begin listening at certain points in a video based on a script we created:
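One way to express such a script is as a list of timed cues per video, as in the sketch below. It illustrates the idea rather than reproducing the exact code I used: the video ID, timings, track path and all names are hypothetical, and the wiring in the closing comment assumes Wistia's secondchange event passes the current second to its handler.

```javascript
// Hypothetical script: what should happen at which second of which video.
const script = {
  "intro-video-id": [
    { at: 12, action: "playSound", track: "audio/question-1.mp3" },
    { at: 15, action: "listen" },
  ],
};

// Tracks which cues have already fired so each one runs only once, even if
// a second is skipped or reported more than once.
function createCueTracker(cueScript) {
  const fired = new Set();

  return function dueCues(videoId, second) {
    const cues = cueScript[videoId] || [];
    return cues.filter((cue, index) => {
      const key = videoId + ":" + index;
      if (cue.at > second || fired.has(key)) return false;
      fired.add(key);
      return true;
    });
  };
}

// Browser wiring (assumption: `video` is a Wistia player handle):
// const dueCues = createCueTracker(script);
// video.bind("secondchange", (s) => {
//   dueCues(video.hashedId(), s).forEach((cue) => console.log("run", cue.action));
// });
```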

Here is a demo of how the experience worked. Unfortunately due to Wistia's free plan restrictions I could only use 3 videos for the experience. When you arrive on the page open up the developer console so you can see what speech is being recognised and full-screen the video to avoid it changing size:


The available options are as follows:

As I mentioned, this was a very quick proof of concept to put together and if it were to become a production piece later on there are quite a few things that would need to change:

  • Browser compatibility: Chrome has a huge audience (~67%) but that still excludes a lot of people from viewing what would most likely be marketing material which a client would want to be seen by as many people as possible. Clearly a more sophisticated speech recognition solution would have to be found or built to support as many browsers as possible. Mobile would present another challenge as on some platforms you have to press the screen before microphone recording is enabled to prevent abuse.

  • Clarity: As we all know, voice recognition is far from perfect, even in ideal conditions. In testing, some laptop microphones really struggled to pick up commands.

  • Timings: with a more detailed script and planning it would be easier to ensure we began listening for commands at exactly the right point. Sometimes in this demo Annyang picks up the end of the narrator’s questions or doesn't start listening quite soon enough.

  • Video Loading: Videos generally seem to load quickly when switching between them on a decent connection, but preloading videos before the transition would present a more seamless experience. Perhaps this might have to be without Wistia in the future.

Since I developed this, Netflix have released Bandersnatch, an interactive episode of Black Mirror with similar "Choose Your Own Adventure" mechanics using the trusty mouse: a much less ambitious option, but a real first for television! (Interestingly, they are also being sued over it.)

If you've found this and are looking to do something similar then do let me know if this has helped at all or if you have any ideas how it could have been done differently.