It has been 5 weeks since the start of GSoC 2017 in May! I would like to share with you what I have already accomplished and what should soon be on that list as well. Let’s dive in, shall we? :)
What have I been working on?
In my previous blog post I described the idea of using Jigasi to join a conference room of Jitsi Meet. Jigasi allows someone to be called into a conference over SIP. It does this by joining the room itself and acting as a bridge between the SIP call and the Jitsi Meet conference. Jigasi receives the audio of the conference, mixes it into a single audio stream and forwards that stream to the SIP call. The other direction is also handled: the audio from the person on the SIP call is sent to Jigasi and forwarded into the conference room.
The part where Jigasi can be brought into a room and receives all audio is very useful for our transcription use case. I have modified Jigasi so that, instead of mixing the audio and forwarding it to SIP, it can send the audio to a transcription service. A transcription service is an interface to a speech-to-text API or library: the service is given the audio, and the library behind the interface is expected to transcribe it. This allows multiple speech-to-text APIs to be implemented and compared for performance and accuracy. When the speech-to-text service returns a result, the transcription should be broadcast to the conference room in some way. It should also be stored and merged into a single file so that the complete transcript can be published after the conference is over.
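To make the idea concrete, here is a minimal sketch of what such a transcription-service abstraction could look like, with a stand-in backend in place of a real speech-to-text API. All class and method names here are illustrative assumptions, not the actual Jigasi API:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of the transcription-service abstraction described
 * in the post. A real backend would wrap a speech-to-text API or library.
 */
interface TranscriptionService {
    /** Sends a chunk of raw audio from one participant to the backend. */
    void sendAudio(long ssrc, byte[] audio);

    /** Registers a listener that is notified when a result arrives. */
    void addListener(ResultListener listener);

    interface ResultListener {
        void onResult(long ssrc, String transcript);
    }
}

/** A stand-in backend that "transcribes" every chunk to a fixed string. */
class DummyTranscriptionService implements TranscriptionService {
    private final List<ResultListener> listeners = new ArrayList<>();

    @Override
    public void sendAudio(long ssrc, byte[] audio) {
        // A real implementation would forward the audio to a speech-to-text
        // API and fire the listeners asynchronously when results come back.
        for (ResultListener l : listeners) {
            l.onResult(ssrc, "<" + audio.length + " bytes transcribed>");
        }
    }

    @Override
    public void addListener(ResultListener listener) {
        listeners.add(listener);
    }
}
```

Comparing different speech-to-text APIs would then only mean writing another implementation of the interface, without touching the rest of the module.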
What currently works?
Over the previous weeks I have been working on a transcription module in Jigasi. It consists of some general classes and interfaces which should facilitate a transcription session of a conference room. Currently it is able to join a conference room and receive all audio packets. It then uses the GoogleCloudTranscriptionService to send the audio packets to Google's speech-to-text API. For now, the results are simply printed to the Java console.
What problems have arisen?
The current set-up has a few flaws. One major issue is silent audio. Because we are currently using a paid service, every second matters. In a conference, however, most of the time only one person is actively speaking while all the others are listening. This means Jigasi would be sending, and paying for, the transcription of silent audio streams. Another problem is that the audio is currently raw 48 kHz PCM, while the Google Cloud API prefers 16 kHz audio with FLAC encoding. This might decrease the accuracy of the transcription, and it is also more bandwidth-intensive. Both of these problems will need to be looked into and addressed.
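As a rough illustration of how both problems could be attacked, the sketch below drops audio chunks whose average amplitude falls under a threshold, and naively decimates 48 kHz PCM down to 16 kHz by keeping every third sample. The threshold value is a made-up placeholder, and a real resampler would low-pass filter before decimating to avoid aliasing:

```java
/**
 * Sketch of a silence filter and a naive downsampler for 16-bit signed
 * PCM. Values and names are illustrative, not tuned for Jigasi.
 */
class AudioPreprocessor {
    /** Made-up amplitude threshold below which a chunk counts as silent. */
    private static final double SILENCE_THRESHOLD = 500.0;

    /** Returns true when the average absolute amplitude is below the threshold. */
    static boolean isSilent(short[] samples) {
        long sum = 0;
        for (short s : samples) {
            sum += Math.abs(s);
        }
        return samples.length == 0
            || (double) sum / samples.length < SILENCE_THRESHOLD;
    }

    /**
     * Downsamples 48 kHz PCM to 16 kHz by keeping every third sample.
     * A production resampler would apply a low-pass filter first.
     */
    static short[] downsample48to16(short[] samples) {
        short[] out = new short[samples.length / 3];
        for (int i = 0; i < out.length; i++) {
            out[i] = samples[i * 3];
        }
        return out;
    }
}
```

Chunks flagged as silent would simply never be sent to the paid API, which directly cuts the cost of transcribing a mostly-quiet conference.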
The Jigasi project itself also poses a problem. The goal of Jigasi is to be a bridge between Jitsi Meet conferences and SIP calls, and the transcription use case does not completely fit into that. We need to come up with a way to integrate the transcription module into the Jigasi project without breaking its current functionality; until then, my work cannot be merged into Jigasi.
What will the upcoming weeks have to offer?
Before the issues described above can be addressed, I want to focus on having an actual working transcription service for Jitsi Meet. This involves continuing my work on the transcription module in Jigasi. My first goal is to filter the received audio packets based on the SSRC of the participant. This way the audio of every participant is split and can be transcribed separately. After this has been accomplished, I can work on sending the transcription results to the chat of Jitsi Meet, so that participants can see them live. The third goal is to merge every result into a single file and publish that file to a static location such as a GitHub gist or Amazon Web Services.
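The first goal, splitting the incoming audio by SSRC, could look roughly like the sketch below, which routes each chunk into a per-participant buffer keyed on the SSRC. The class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of splitting received audio by SSRC so that each participant's
 * speech can be transcribed separately. Names are illustrative only.
 */
class SsrcDemultiplexer {
    /** One buffer of audio chunks per SSRC (i.e. per participant). */
    private final Map<Long, List<byte[]>> buffersBySsrc = new HashMap<>();

    /** Routes one audio chunk to the buffer of the participant it came from. */
    void onAudioReceived(long ssrc, byte[] audio) {
        buffersBySsrc.computeIfAbsent(ssrc, k -> new ArrayList<>()).add(audio);
    }

    /** Number of distinct participants heard so far. */
    int participantCount() {
        return buffersBySsrc.size();
    }

    /** The chunks received so far for one participant. */
    List<byte[]> bufferFor(long ssrc) {
        return buffersBySsrc.getOrDefault(ssrc, new ArrayList<>());
    }
}
```

Each per-SSRC buffer could then feed its own transcription session, so overlapping speakers no longer end up interleaved in a single stream.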
I hope to see you in a few weeks to talk about the progress!