Introduction

If you’ve read my previous blog posts, you know that I’ve been working on adding speech-to-text functionality to Jitsi Meet, an open-source WebRTC JavaScript application that allows anyone to host a conferencing solution in the browser, as part of GSoC 2017. The summer is drawing to an end, autumn (and winter ;) ) is coming, and I want to talk about what has been happening over the last few months!

What have I accomplished?

It is now possible to use Jigasi as a transcriber in a Jitsi Meet room! Normally you dial Jigasi into a room with the “Call SIP” button, which creates a bridge between the Jitsi Meet conference and a SIP endpoint. Instead of a normal SIP URI, you can enter jitsi_meet_transcribe. This makes Jigasi join as a transcriber instead of forwarding the audio to a SIP endpoint. Note that using the “Call SIP” button to dial the transcriber into the room is only a temporary solution and will be replaced with a proper UI element.

While Jigasi is in a room as a transcriber, it provides nearly real-time transcription. Currently Jigasi can send speech-to-text results to the Jitsi Meet room as either plain text or JSON. If the results are sent as JSON, Jitsi Meet shows them as subtitles in the left corner of the video, while plain text is simply posted in the chat. Jigasi also provides a complete transcript of the call: when it enters the room, it posts a link to where the final, complete transcript can be found after the conference. You can see Jigasi in action as a transcriber in the video below!

How does it work?

When Jigasi acts as a gateway between a SIP endpoint and a Jitsi Meet conference, it joins the room with an XMPP account. It then mixes the audio of all other participants into a single audio stream before sending it off to the SIP endpoint. When it joins as a transcriber, it instead skips the audio mixing and forwards the audio of each participant to a speech-to-text service.
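To make the difference concrete, here is a minimal, hypothetical sketch of the forwarding side; the names (AudioForwarder, TranscriptionSession, onAudioReceived) are illustrative and not the actual Jigasi classes:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Illustrative only: every participant gets its own speech-to-text session,
// instead of all audio being mixed into a single stream for a SIP endpoint.
public class AudioForwarder
{
    /** Hypothetical stand-in for a streaming speech-to-text session. */
    public interface TranscriptionSession
    {
        void sendAudio(byte[] audio);
    }

    private final Map<String, TranscriptionSession> sessions
        = new ConcurrentHashMap<>();

    private final Function<String, TranscriptionSession> sessionFactory;

    public AudioForwarder(Function<String, TranscriptionSession> sessionFactory)
    {
        this.sessionFactory = sessionFactory;
    }

    // Called for every chunk of audio received from a single participant.
    public void onAudioReceived(String participantId, byte[] audio)
    {
        sessions.computeIfAbsent(participantId, sessionFactory)
            .sendAudio(audio);
    }
}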

Currently only the Google Cloud speech-to-text API is implemented. Google’s API supports streaming speech recognition for streams of up to one minute. This means we can simply forward the audio streams we receive from the participants to the API. The one-minute restriction is circumvented by setting single_utterance=true in the configuration and opening a new session as soon as an is_final=true result is received.
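Below is a minimal sketch of that restart logic against the Google Cloud Speech v1 Java client. It is not the actual Jigasi code; the exact client classes, packages and threading concerns may differ between library versions:

import com.google.api.gax.rpc.ApiStreamObserver;
import com.google.cloud.speech.v1.RecognitionConfig;
import com.google.cloud.speech.v1.SpeechClient;
import com.google.cloud.speech.v1.StreamingRecognitionConfig;
import com.google.cloud.speech.v1.StreamingRecognizeRequest;
import com.google.cloud.speech.v1.StreamingRecognizeResponse;
import com.google.protobuf.ByteString;

// Illustrative sketch of the "restart on is_final" trick; not the Jigasi code.
public class RestartingRecognizer
{
    private final SpeechClient client;
    private final StreamingRecognitionConfig streamingConfig;
    private ApiStreamObserver<StreamingRecognizeRequest> requestObserver;

    public RestartingRecognizer(SpeechClient client)
    {
        this.client = client;
        this.streamingConfig = StreamingRecognitionConfig.newBuilder()
            .setConfig(RecognitionConfig.newBuilder()
                .setEncoding(RecognitionConfig.AudioEncoding.LINEAR16)
                .setSampleRateHertz(16000)
                .setLanguageCode("en-US")
                .build())
            .setInterimResults(true)
            .setSingleUtterance(true) // session ends after each utterance
            .build();
        openSession();
    }

    // The first request of a session carries the config, later ones carry audio.
    private void openSession()
    {
        requestObserver = client.streamingRecognizeCallable().bidiStreamingCall(
            new ApiStreamObserver<StreamingRecognizeResponse>()
            {
                @Override
                public void onNext(StreamingRecognizeResponse response)
                {
                    // ...publish interim and final results to the room here...
                    boolean isFinal = response.getResultsList().stream()
                        .anyMatch(r -> r.getIsFinal());
                    if (isFinal)
                    {
                        // An utterance finished: close this session and open a
                        // new one, so the one-minute stream limit is never hit.
                        requestObserver.onCompleted();
                        openSession();
                    }
                }

                @Override
                public void onError(Throwable t) { }

                @Override
                public void onCompleted() { }
            });

        requestObserver.onNext(StreamingRecognizeRequest.newBuilder()
            .setStreamingConfig(streamingConfig)
            .build());
    }

    // Forward a chunk of participant audio to the current session.
    public void sendAudio(byte[] audio)
    {
        requestObserver.onNext(StreamingRecognizeRequest.newBuilder()
            .setAudioContent(ByteString.copyFrom(audio))
            .build());
    }
}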

However, the design allows for implementing another speech-to-text service. This can be a commercial one like IBM’s, or an open-source, self-hosted solution like DeepSpeech.

Whenever a new result comes in from the speech-to-text service, the transcribed text is sent back to the Jitsi Meet room as a chat message. Jigasi can be configured to send the transcription as plain text, as a JSON object, or both. A JSON object carries a lot more information, which was useful for implementing the subtitle feature. An example of a JSON object sent by Jigasi:

{
   "jitsi-meet-muc-msg-topic":"transcription-result",
   "payload":{
      "transcript":[
         {
            "confidence":0.0,
            "text":"this is an example Json message"
         }
      ],
      "is_interim":false,
      "language":"en-US",
      "message_id":"14fcde1c-26f8-4c03-ab06-106abccb510b",
      "event":"SPEECH",
      "participant":{
         "name":"Nik",
         "id":"d62f8c36"
      },
      "stability":0.0,
      "timestamp":"2017-08-24T11:04:05.637Z"
   }
}
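For completeness, here is a rough sketch of how a message in that shape could be assembled on the Java side with org.json; the helper is purely illustrative and not the actual Jigasi code:

import java.time.Instant;
import java.util.UUID;

import org.json.JSONArray;
import org.json.JSONObject;

public class TranscriptionMessageExample
{
    // Builds a JSON payload in the shape shown above. Illustrative only.
    public static JSONObject buildResultMessage(
        String participantName, String participantId,
        String text, double confidence, boolean isInterim)
    {
        JSONObject alternative = new JSONObject()
            .put("confidence", confidence)
            .put("text", text);

        JSONObject payload = new JSONObject()
            .put("transcript", new JSONArray().put(alternative))
            .put("is_interim", isInterim)
            .put("language", "en-US")
            .put("message_id", UUID.randomUUID().toString())
            .put("event", "SPEECH")
            .put("participant", new JSONObject()
                .put("name", participantName)
                .put("id", participantId))
            .put("stability", 0.0)
            .put("timestamp", Instant.now().toString());

        return new JSONObject()
            .put("jitsi-meet-muc-msg-topic", "transcription-result")
            .put("payload", payload);
    }
}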

All incoming results are also stored so that a final transcript can be created. When Jigasi joins the room, it posts the link where the final transcript will be served. Jigasi serves these files by also running a Jetty file server instance. The final transcript can be stored as either plain text or JSON. The JSON version of a final transcript holds an array of the JSON objects that are sent as chat messages. Saving the transcript as JSON was implemented to make it easier to, for example, generate a nice HTML page later; that is, however, future work. An example of a transcript in plain text:

Transcript of conference held at 24-Aug-2017 in room testjigasi@conference.meet.jit.si
Initial people present at 11:10:34:
	Nik

Transcript, started at 11:10:34:
________________________________________________________________________________
<11:10:40> Nik: this is an example transcripts 
<11:10:50> Nik: in this place what everyone in the conference has set 
<11:11:04> Nik: it also displays when someone enters or leaves the conference 
<11:11:11> Nik: I hope you like it goodbye 
________________________________________________________________________________

End of transcript at 24-Aug-2017 11:11:14

I would like to point out that I said

It displays what everyone in the conference has said

and the service incorrectly transcribed “it displays” as “in this place” and “said” as “set”. The phrases sound similar, which makes them hard to distinguish. Also keep in mind that I’m not a native American English speaker, which might lead to me pronouncing the words slightly differently. When reading this transcript, however, the main context is still clear or can be figured out.
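Coming back to how these transcript files are made available: serving a directory of finished transcripts with an embedded Jetty instance can be as simple as the sketch below. The port and directory are made up, and the actual Jigasi setup may differ:

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.server.handler.ResourceHandler;

public class TranscriptFileServer
{
    public static void main(String[] args) throws Exception
    {
        // Serve the directory holding finished transcripts over HTTP.
        ResourceHandler handler = new ResourceHandler();
        handler.setDirectoriesListed(false);
        handler.setResourceBase("/var/lib/jigasi/transcripts"); // made-up path

        Server server = new Server(8080); // made-up port
        server.setHandler(handler);
        server.start();
        server.join();
    }
}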

The programming bits :)

In this section I will briefly go over the Jitsi projects that needed modification and the code I wrote to make everything work. I will also introduce the important classes that can be used to extend my work.

Jigasi

You might have noticed me mentioning Jigasi a few times already. It is the bread and butter of my project! At its core, Jigasi is a Java application that joins the Jitsi Meet room as a participant. That way it gets access to all the audio streams. I implemented the TranscriptionGateway, which creates a TranscriptionGatewaySession every time Jigasi is dialed into a room as a transcriber. The TranscriptionGatewaySession handles the main logic: starting and stopping the transcriber, accessing the audio, sending the results and storing the transcript.

The TranscriptionService interface is used to implement the GoogleCloudTranscriptionService. In order to implement another speech-to-text API, you can implement the same interface and use it as an argument when creating a TranscriptionGatewaySession.
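The interface essentially boils down to turning incoming audio into transcription results. A hypothetical skeleton for a new backend might look like this; the method names are illustrative, so check the actual TranscriptionService interface in Jigasi for the real signatures:

// Hypothetical skeleton for plugging in a different speech-to-text backend.
// Method names are illustrative, not the exact Jigasi TranscriptionService API.
public class MyTranscriptionService // would implement TranscriptionService
{
    /** Result callback: what the gateway session wants to receive. */
    public interface ResultConsumer
    {
        void accept(String text, double confidence, boolean isFinal);
    }

    /** Opens a streaming session for a single participant's audio. */
    public Session startSession(String languageTag, ResultConsumer consumer)
    {
        return new Session(languageTag, consumer);
    }

    public static class Session
    {
        private final String languageTag;
        private final ResultConsumer consumer;

        Session(String languageTag, ResultConsumer consumer)
        {
            this.languageTag = languageTag;
            this.consumer = consumer;
        }

        /** Feed a chunk of audio to the external speech-to-text API. */
        public void give(byte[] audio)
        {
            // ...stream the audio to the backend; when it answers, call:
            // consumer.accept(recognizedText, confidence, isFinal);
        }
    }
}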

If you want to change the way the results are sent to Jitsi Meet, you can implement the TranscriptionResultPublisher interface. The same goes for publishing the transcript with the TranscriptPublisher interface. Implementations of these interfaces should be registered with the TranscriptHandler.
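Similarly, a hypothetical publisher could, for instance, write every final result to a log file instead of the chat; again, the signature below is illustrative and the real TranscriptionResultPublisher interface should be checked first:

import java.io.PrintWriter;
import java.io.Writer;

// Hypothetical publisher that writes every final result to a writer instead of
// (or in addition to) the chat. The method is illustrative, not the exact
// Jigasi TranscriptionResultPublisher interface.
public class WriterResultPublisher // would implement TranscriptionResultPublisher
{
    private final PrintWriter out;

    public WriterResultPublisher(Writer writer)
    {
        this.out = new PrintWriter(writer, true);
    }

    /** Called for every result coming back from the speech-to-text service. */
    public void publish(String participantName, String text, boolean isInterim)
    {
        if (!isInterim)
        {
            out.println(participantName + ": " + text);
        }
    }
}

Registering such an implementation with the TranscriptHandler would then make Jigasi use it alongside, or instead of, the chat-message publisher.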

Here is the list of all the merged pull requests I’ve made to Jigasi:

libjitsi

A small modification to libjitsi was necessary to bypass the audio mixing and receive the individual audio streams directly in Jigasi. This was done by adding the ReceiveStreamBufferListener interface. The pull request can be found here.
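Conceptually, the listener is handed every audio Buffer of every ReceiveStream before any mixing happens. A rough sketch of what an implementation could do (the exact libjitsi interface and the way the listener is registered may differ slightly):

import javax.media.Buffer;
import javax.media.rtp.ReceiveStream;

// Rough sketch: receive each participant's raw audio buffers before mixing.
// The exact libjitsi interface and registration mechanism may differ.
public class ForwardingBufferListener // would implement ReceiveStreamBufferListener
{
    public void bufferReceived(ReceiveStream receiveStream, Buffer buffer)
    {
        long ssrc = receiveStream.getSSRC();      // identifies the participant's stream
        byte[] audio = (byte[]) buffer.getData(); // raw audio samples
        int offset = buffer.getOffset();
        int length = buffer.getLength();
        // ...forward audio[offset .. offset+length) to the speech-to-text
        //    session associated with this SSRC...
    }
}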

lib-jitsi-meet

lib-jitsi-meet is the JavaScript library connecting the Jitsi Meet project with the videobridge backend. For example, it handles the events related to receiving data. Normally, it receives JSON messages from the videobridge. One such message can be a “stats” object, which has information about the quality of the connection, such as bit rate, packet loss, etc. This is called an ENDPOINT_MESSAGE. The library fires a JitsiConferenceEvents.ENDPOINT_MESSAGE_RECEIVED event when that happens.

My pull request allows the same to happen when the chat room receives a JSON object. This way Jigasi can send the transcription results as JSON objects to Jitsi Meet.

Jitsi Meet

Jitsi Meet is the front-end code of the Jitsi Meet project. In order to display the transcription results as subtitles, I created a React feature. The component uses the JSON objects received from Jigasi to render the subtitles.

Future work

I’m very happy with what I’ve been able to contribute to the Jitsi projects this summer! Unfortunately, as with every (software) project, there is always room for improvement and additional features. Here are some of them, in no particular order:

  • Implement another speech-to-text API. This is interesting in order to see if other APIs have better or worse accuracy, and to be able to offer a choice. An example is IBM’s API.
  • Add a button to stop the transcriber. Currently this can only be done by kicking it or when everyone leaves the room.
  • Move away from dialing jitsi_meet_transcribe with the SIP button to invite the transcriber, and replace it with a dedicated UI button/element.
  • Add a heuristic to detect silent audio and not send it. This might be a way to save money, as the Google API is unfortunately not free ($0.006 per 15 seconds, rounded up).
  • Add a UI element to select which language will be used in the conference.
  • Look into free alternatives for the speech-to-text, such as CMUSphinx or the upcoming Common Voice project combined with DeepSpeech from Mozilla.
  • Save the transcript in other formats and publish it in other ways than just serving the plain text and/or JSON file.
  • Move the plain text results away from the chat, so that it is not spammed, and display them in another UI element instead (one which is persistent, unlike the subtitles).
  • This work has not been put into production yet and thus has not been tested sufficiently. There will be (small) bugs which need to be ironed out.