A month has passed and a lot has happened. I now have a working prototype of a transcription service in Jitsi Meet!

Quick reminder

My GSoC project has 2 main goals: providing a live, as close to real-time as possible transcription of every participant and delivering a final, complete transcript at the end of a conference. The plan was to modify Jigasi to accomplish this.

The prototype

A picture (or 60 pictures per second in this case ;) ) is worth more than a thousand words, right? So I’ve made a demo, which you can watch below.

You can see that Jigasi can be dialed into a conference with the special URI jitsi_meet_transcribe. This will later be hidden behind a UI. After it is dialed in, Jigasi announces that it has started transcribing, and posts results to the chat as soon as it has finished transcribing a particular piece of audio from a participant. As this currently spams the chat with transcripts, it will also be moved to another UI element. Jigasi also stores a final transcript internally, but we still need to come up with a way to deliver it to the end user.

What does the demo conveniently not show?

There are a few hiccups which need to be ironed out. On a few occasions Jigasi will fail to start up correctly, or it might join a room without starting to transcribe. You can also see in the demo that the accuracy is not perfect (dog vs talk), and that I talk slower than you normally would in a meeting, which also improves the accuracy. I also did not use more difficult words, like jargon, which it might fail to recognize. These issues are not really under my control, but depend on what the speech-to-text backend can provide. There might of course be some optimisation possible, like sending the audio in a different file type, or providing the service with a vocabulary of words which might've been said. There is currently no good way to stop the transcriber other than kicking it or using a command in the browser console. There is also a major limitation set by the backend we use, which I will talk about in the next paragraph.
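For instance, the vocabulary idea maps to the speech contexts feature of the Google Cloud API. Here is a minimal sketch, assuming the same v1 Java client library used further below; the phrases are just placeholders for whatever jargon a conference is likely to contain:

import com.google.cloud.speech.v1.RecognitionConfig;
import com.google.cloud.speech.v1.SpeechContext;

public class ConfigWithHints
{
    /**
     * Build a RecognitionConfig which hints the expected vocabulary
     * to the API. The phrases below are placeholders.
     */
    static RecognitionConfig buildConfig()
    {
        SpeechContext hints = SpeechContext.newBuilder()
                .addPhrases("Jitsi")
                .addPhrases("Jigasi")
                .addPhrases("XMPP")
                .build();

        return RecognitionConfig.newBuilder()
                .setEncoding(RecognitionConfig.AudioEncoding.LINEAR16)
                .setLanguageCode("en-US")
                .setSampleRateHertz(16000)
                .addSpeechContexts(hints)
                .build();
    }
}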

The speech-to-text API behind Jigasi

We decided on using the Google Cloud speech-to-text API as our backend for the transcription. Google provides a REST and a gRPC API as well as a client library. Although the client library is currently in beta, we decided on using it as it was very easy to import into Jigasi and supports StreamingRecognize.

Google’s speech-to-text API provides 3 different services:

  1. Recognize: This is a synchronous API in which you provide a single audio file and your application will block until you receive the transcription. Currently the audio cannot be longer than 1 minute. Supported by REST, gRPC and the client library (see the short sketch after this list).
  2. LongRunningRecognize: This is an asynchronous API in which you provide a single audio file and your application will retrieve (interim) results via a provided interface. Currently the audio cannot be longer than 80 minutes. Supported by REST, gRPC and the client library.
  3. StreamingRecognize: This is an asynchronous API in which you open a “session” and provide a continuous stream of audio, and you will receive results as soon as they are ready via a provided interface. Currently you can keep sending audio for 1 minute. Only supported by gRPC and the client library.
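For comparison, this is roughly what the synchronous Recognize service looks like with the client library. It is a minimal sketch, assuming a raw PCM file of under one minute (the file name and the recognize(config, audio) call are assumptions about the v1 client surface):

import com.google.cloud.speech.spi.v1.SpeechClient;
import com.google.cloud.speech.v1.RecognitionAudio;
import com.google.cloud.speech.v1.RecognitionConfig;
import com.google.cloud.speech.v1.RecognizeResponse;
import com.google.cloud.speech.v1.SpeechRecognitionResult;
import com.google.protobuf.ByteString;

import java.nio.file.Files;
import java.nio.file.Paths;

public class SyncRecognizeExample
{
    public static void main(String[] args) throws Exception
    {
        try (SpeechClient speechClient = SpeechClient.create())
        {
            // Raw 16 kHz, 16-bit mono PCM audio, at most 1 minute long
            byte[] data = Files.readAllBytes(Paths.get("audio.raw"));

            RecognitionConfig config = RecognitionConfig.newBuilder()
                    .setEncoding(RecognitionConfig.AudioEncoding.LINEAR16)
                    .setLanguageCode("en-US")
                    .setSampleRateHertz(16000)
                    .build();
            RecognitionAudio audio = RecognitionAudio.newBuilder()
                    .setContent(ByteString.copyFrom(data))
                    .build();

            // Blocks until the whole file has been transcribed
            RecognizeResponse response = speechClient.recognize(config, audio);
            for (SpeechRecognitionResult result : response.getResultsList())
            {
                System.out.println(result.getAlternatives(0).getTranscript());
            }
        }
    }
}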

For this project the StreamingRecognize service seems like the best choice, as Jitsi Meet sends audio packets in 20 ms intervals, which we can then forward to the API (after locally buffering for ~100/500 ms to save bandwidth, best frame time to be determined :) ). This solves the issue of determining when someone is actively speaking and when someone has finished speaking, which would have been necessary when using (LongRunning)Recognize. It is also better than the other two APIs because StreamingRecognize can provide a result almost immediately when someone finishes speaking, whereas with the other two you can only send the audio for processing once someone has finished their sentence, which can lead to heavy delays the longer the audio is.
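To illustrate the buffering idea, here is a small sketch of how 20 ms frames could be accumulated before being handed off to the streaming session. The 100 ms target, the byte math (16 kHz, 16-bit mono PCM) and the AudioBuffer/ChunkConsumer names are assumptions for this example, not the actual Jigasi code:

import java.io.ByteArrayOutputStream;

/**
 * Collects small 20 ms audio frames and hands them to a consumer once
 * roughly 100 ms of audio has been buffered, to save on request overhead.
 * The exact buffer length is still to be tuned.
 */
public class AudioBuffer
{
    /** Called whenever a buffered chunk is ready to be sent to the API. */
    public interface ChunkConsumer
    {
        void onChunk(byte[] chunk);
    }

    // 16 kHz, 16-bit mono PCM -> 32 bytes per millisecond
    private static final int BYTES_PER_MS = 32;
    private static final int TARGET_MS = 100;

    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private final ChunkConsumer consumer;

    public AudioBuffer(ChunkConsumer consumer)
    {
        this.consumer = consumer;
    }

    /** Add a single 20 ms frame; flush once ~100 ms has accumulated. */
    public void addFrame(byte[] frame)
    {
        buffer.write(frame, 0, frame.length);
        if (buffer.size() >= TARGET_MS * BYTES_PER_MS)
        {
            consumer.onChunk(buffer.toByteArray());
            buffer.reset();
        }
    }
}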

Dealing with the 1 minute restriction

The only problem with StreamingRecognize is that Google decided to allow a stream of up to a single minute, after which it will unmercifully throw an exception at you(r application). This is very unfortunate for our use-case, as users often talk for a longer amount of time, especially in settings like business meetings or lectures. It would be a lot easier if we could just open a session for every participant at the start of a conference and keep it open for the entire duration, but alas, we have to work around it. My quick and dirty solution was to open an initial session when someone starts speaking and keep it open for 55 seconds. However, after 50 seconds we open an additional session, which will overlap with the “old” session for 5 seconds so that we catch words if the user is in the middle of a sentence. Then the new session becomes the old one, which will stay open for 55 seconds in total, and the cycle continues. This kinda works, but might produce some duplicate transcripts. It also sucks when you have to stop a session while someone is actively speaking, which heavily impacts the accuracy. There might be a solution in listening for is_final=true in the results, which is a flag set by Google. The problem with this approach is that if someone is actively speaking, Google might not detect the end of a sentence and thus not send is_final=true before your 1 minute is up. This will be a major task to solve in the upcoming weeks.
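In simplified Java, the rotation looks roughly like this. StreamingSession is the class shown at the end of this post; the RotatingTranscriber name, the scheduler and the error handling are a sketch rather than the actual Jigasi implementation:

import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Keeps a StreamingSession open for at most 55 seconds and opens a
 * replacement after 50 seconds, so both overlap for ~5 seconds and no
 * audio is lost at the boundary. Simplified sketch only.
 */
public class RotatingTranscriber
{
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    private StreamingSession oldSession;
    private StreamingSession currentSession;

    public synchronized void start() throws IOException
    {
        currentSession = new StreamingSession();
        scheduleRotation();
    }

    private void scheduleRotation()
    {
        // After 50 s, open the overlapping replacement session
        scheduler.schedule(this::rotate, 50, TimeUnit.SECONDS);
    }

    private synchronized void rotate()
    {
        try
        {
            oldSession = currentSession;
            currentSession = new StreamingSession();

            // Close the old session 5 s later, i.e. 55 s after it was opened
            scheduler.schedule(this::closeOldSession, 5, TimeUnit.SECONDS);
            scheduleRotation();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    private synchronized void closeOldSession()
    {
        try
        {
            if (oldSession != null)
            {
                oldSession.endSession();
            }
        }
        catch (Exception e)
        {
            e.printStackTrace();
        }
        finally
        {
            oldSession = null;
        }
    }

    /** Audio goes to both sessions while they overlap. */
    public synchronized void giveNextAudioFrame(byte[] audio)
    {
        if (oldSession != null)
        {
            oldSession.giveNextAudioFrame(audio);
        }
        currentSession.giveNextAudioFrame(audio);
    }
}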

Using StreamingRecognize with the Java client library on an audio stream

To end this post, I want to quickly explain how to use StreamingRecognize with the Java client library, as it is currently missing from their documentation.

This is a code sample using the streaming API.

import com.google.api.gax.grpc.ApiStreamObserver;
import com.google.api.gax.grpc.StreamingCallable;
import com.google.cloud.speech.spi.v1.SpeechClient;
import com.google.cloud.speech.v1.*;
import com.google.common.util.concurrent.SettableFuture;
import com.google.protobuf.ByteString;

import java.io.IOException;
import java.util.List;

public class StreamingSession
{
    /**
     * This listens for responses from Google
     */
    private ResponseApiStreamingObserver<StreamingRecognizeResponse> responseObserver;

    /**
     * This sends requests to Google
     */
    private ApiStreamObserver<StreamingRecognizeRequest> requestObserver;

    /**
     * The client managing all connections
     */
    private SpeechClient speechClient;

    /**
     * Create a session and send config
     *
     * @throws IOException when failing to connect, can be due to missing credentials
     */
    public StreamingSession()
            throws IOException
    {
        // Instantiates a client with GOOGLE_APPLICATION_CREDENTIALS
        speechClient = SpeechClient.create();


        // Configure request with raw PCM audio
        RecognitionConfig recConfig = RecognitionConfig.newBuilder()
                .setEncoding(RecognitionConfig.AudioEncoding.LINEAR16)
                .setLanguageCode("en-US")
                .setSampleRateHertz(16000)
                .build();
        StreamingRecognitionConfig config = StreamingRecognitionConfig.newBuilder()
                .setConfig(recConfig)
                .build();

        responseObserver = new ResponseApiStreamingObserver<StreamingRecognizeResponse>();

        StreamingCallable<StreamingRecognizeRequest, StreamingRecognizeResponse> callable =
                speechClient.streamingRecognizeCallable();

        requestObserver = callable.bidiStreamingCall(responseObserver);

        // The first request must **only** contain the audio configuration:
        requestObserver.onNext(StreamingRecognizeRequest.newBuilder()
                .setStreamingConfig(config)
                .build());
    }

    /**
     * Give a frame of continuous audio to Google
     *
     * @param audio the audio as an array of bytes
     */
    void giveNextAudioFrame(byte[] audio)
    {
        // Subsequent requests must **only** contain the audio data.
        requestObserver.onNext(StreamingRecognizeRequest.newBuilder()
                .setAudioContent(ByteString.copyFrom(audio))
                .build());
    }

    /**
     * Close the session and print results to standard output
     *
     * @throws Exception when failing to retrieve the results or to close the session
     */
    void endSession()
            throws Exception
    {
        // Mark transmission as completed after sending the data.
        requestObserver.onCompleted();

        List<StreamingRecognizeResponse> responses = responseObserver.future().get();

        for(StreamingRecognizeResponse response : responses)
        {
            for(StreamingRecognitionResult result : response.getResultsList())
            {
                for(SpeechRecognitionAlternative alternative : result.getAlternativesList())
                {
                    System.out.println(alternative.getTranscript());
                }
            }
        }
        speechClient.close();
    }

    /**
     * This class receives the text results once they come in with the #onNext message
     */
    class ResponseApiStreamingObserver<T> implements ApiStreamObserver<T> {
        private final SettableFuture<List<T>> future = SettableFuture.create();
        private final List<T> messages = new java.util.ArrayList<T>();

        @Override
        public void onNext(T message) {
            messages.add(message);
        }

        @Override
        public void onError(Throwable t) {
            future.setException(t);
        }

        @Override
        public void onCompleted() {
            future.set(messages);
        }

        // Returns the SettableFuture object to get received messages / exceptions.
        public SettableFuture<List<T>> future() {
            return future;
        }
    }
}

The important bits are the ResponseApiStreamingObserver<StreamingRecognizeResponse> responseObserver and ApiStreamObserver<StreamingRecognizeRequest> requestObserver objects. As their names suggest, they offer the ability to send audio (as a StreamingRecognizeRequest object) and receive transcripts (as a StreamingRecognizeResponse object). The giveNextAudioFrame() method can be used to feed a chunk of your audio stream, and when the stream is empty you can call endSession(), which will stop the session and print the final results.
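To tie it all together, using the class could look something like this. The frame size and the silent audio are placeholders for a real audio source, and the demo assumes it lives in the same package as StreamingSession:

public class StreamingSessionDemo
{
    public static void main(String[] args) throws Exception
    {
        // The constructor opens the stream and sends the configuration
        StreamingSession session = new StreamingSession();

        // Feed chunks of raw 16 kHz, 16-bit mono PCM audio. Here we just
        // send a few seconds of silence as a stand-in for real conference audio.
        byte[] frame = new byte[3200]; // 100 ms at 16 kHz, 16-bit mono
        for (int i = 0; i < 50; i++)   // ~5 seconds in total
        {
            session.giveNextAudioFrame(frame);
            Thread.sleep(100); // pretend this is a live 100 ms frame cadence
        }

        // Close the stream and print whatever Google transcribed
        session.endSession();
    }
}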