With Love, A.I: Transcription (2 of 2)

Can a developer enhance Google’s Speech-to-Text system (GS2T)? The short answer is “yes”. Let’s take a look at how to go about it.

Google supports what are called phrase “hints”. By adding hints to the vocabulary, Google Speech-to-Text is able to “understand” the audio better and transcribe it more accurately.

As I shared in part 1 of this blog, I tried transcribing the following.

The lines in English and Telugu –

Nee Pada Sevaku Velaye Swami  
Aapada Baapava Aananda Nilaya 
Daari Tennu Teliyaga Leni  
Daasula Brovaga Vegame Raava

నీ పద సేవకు వేళాయె స్వామి  
ఆపద బాపవా ఆనంద నిలయ 
దారి తెన్నూ తెలియగ లేని  
దాసుల బ్రోవగ వేగమే రావా

Google transcribed that to –

Pada Sevaku Velaye Swami, Nee Pada Seva Vela 
Teliyadu 
Teliyagaane Leni Naa Manasulo

పద సేవకు వేళాయె స్వామి, నీ పాద సేవ వేళ 
తెలియదు 
తెలియగానే లేని నా మనసులో

I’d like to help Google’s S2T transcribe better by providing the words (Aapada, Baapava, Aananda, Nilaya, Daari, Tennu) as “phrase hints”. Once I do that, I will transcribe again and hope that something better comes out the other end this time.

To get my phrase hints across to GS2T, I need to enroll as a Google Cloud Platform developer, create an account, enable the Cloud Speech-to-Text API, and write a bit of JSON to get going. There are good examples in the Google Cloud documentation.

Here is how I fed GS2T my phrase hints:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
  -H "Content-Type: application/json; charset=utf-8" \
  --data '{
    "config": {
      "languageCode": "te-IN",
      "speechContexts": [{
        "phrases": ["దారి", "తెన్నూ", "తెలియక", "లేని", "ఆపద", "బాపవా"]
      }]
    },
    "audio": {
      "uri": "gs://ammas2t/12465.flac"
    }
  }' "https://speech.googleapis.com/v1/speech:longrunningrecognize"
$ ./resultcurlcmd
{
  "name": "4980738099736676025",
  "metadata": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeMetadata",
    "progressPercent": 100,
    "startTime": "2019-04-22T20:05:40.740461Z",
    "lastUpdateTime": "2019-04-22T20:06:41.100358Z"
  },
  "done": true,
  "response": {
    "@type": "type.googleapis.com/google.cloud.speech.v1.LongRunningRecognizeResponse",
    "results": [
      { "alternatives": [ { "transcript": "పద సేవకు వేళాయె నీ పదసేవ వేళాయె", "confidence": 0.6822279 } ] },
      { "alternatives": [ { "transcript": "తెలియక లేని నా", "confidence": 0.6823187 } ] },
      { "alternatives": [ { "transcript": " మమ్మేలుకో చరణములే నమ్మితి నీ పదసేవ వేళలో స్వామి", "confidence": 0.5497838 } ] },
      { "alternatives": [ { "transcript": " ఆనంద నిలయం", "confidence": 0.63640434 } ] },
      { "alternatives": [ { "transcript": " ఆశల జీవితం", "confidence": 0.3930311 } ] },
      { "alternatives": [ { "transcript": " లేని జీవితం", "confidence": 0.613313 } ] },
      { "alternatives": [ { "transcript": " నేను", "confidence": 0.41449854 } ] },
      { "alternatives": [ { "transcript": " హాయ్ బ్రదర్", "confidence": 0.59204257 } ] }
    ]
  }
}

The transcription seems to have gone south for some reason. I need to investigate further why my phrase hints not only failed to improve the result but actually made it worse.
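One thing worth ruling out is the request itself: mixing single and double quotes inside a shell-quoted JSON payload is an easy way to send a malformed body. As a sanity check, the same request body can be built in Python with the `json` module, which guarantees valid JSON (the bucket URI and phrase list below are the ones from my curl command above):

```python
import json

# Build the long-running-recognize request body programmatically so the
# quoting is guaranteed valid. Note: in the v1 REST API, speechContexts
# is a LIST of objects, not a single object.
request_body = {
    "config": {
        "languageCode": "te-IN",
        "speechContexts": [
            {"phrases": ["దారి", "తెన్నూ", "తెలియక", "లేని", "ఆపద", "బాపవా"]}
        ],
    },
    "audio": {"uri": "gs://ammas2t/12465.flac"},
}

# ensure_ascii=False keeps the Telugu phrases readable in the output.
payload = json.dumps(request_body, ensure_ascii=False, indent=2)
print(payload)
```

Saving this output to a file and passing it to curl with `--data @request.json` keeps the shell out of the quoting business entirely.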

If you want to follow along with how I set up the Google Cloud Speech-to-Text API, here are the screenshots; they are mostly self-evident.

Then download and install the Google Cloud SDK and tools.

in AI | 368 Words

With Love, A.I: Transcription (1 of 2)

tran·scrip·tion /tranˈskripSH(ə)n/
a written or printed representation of something.

A written representation of audio is what we normally call a transcription. How does one go from audio to the written word? We could listen to the audio word by word and note down the written representation of each word. This is a manual process; sometimes we may need to pause the audio to catch up, and some words might be transcribed incorrectly.

Can AI help speed up the process and reduce transcription errors? That’s a rhetorical question, because AI already does this to some extent; we have seen it in products from Apple, Amazon, Google, and others.

What would it take for a machine to listen and convert what it hears into the written word? In the simplest sense, assuming the machine knows the entire vocabulary of the language the audio is in, e.g. English, it can compare each spoken word against its vast library of phonemes to figure out which word the audio maps to and “type” that word into a text editor. Repeating this process for every uttered word produces a text document that is (hopefully) an exact representation of the audio.

For example, the spoken word “potato” could be recognized by software that processes each phoneme in the word against a library of phonemes, deconstructs the word into its basic phonemes, matches the candidate against a library of words, takes context into consideration, and figures out whether the textual representation of the spoken audio is really “Pohtahtoh” or “Pahtayto” or something else.
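The lookup step described above can be sketched as a toy pronunciation dictionary: each spoken word is reduced to a phoneme sequence, and the sequence is matched against known pronunciations. The phoneme symbols and dictionary entries here are my own illustration, not taken from any real recognizer:

```python
# Toy pronunciation dictionary: phoneme sequences -> written words.
# Two entries for "potato" cover the "Pohtahtoh"/"Pahtayto" variants.
PRONUNCIATIONS = {
    ("P", "OH", "T", "AH", "T", "OH"): "potato",
    ("P", "AH", "T", "EY", "T", "OH"): "potato",
    ("T", "AH", "M", "EY", "T", "OH"): "tomato",
}

def transcribe(phoneme_seqs):
    """Map each recognized phoneme sequence to a written word,
    or to a placeholder when the sequence is not in the dictionary."""
    return " ".join(
        PRONUNCIATIONS.get(tuple(seq), "<unknown>") for seq in phoneme_seqs
    )

print(transcribe([["P", "OH", "T", "AH", "T", "OH"]]))  # -> potato
```

A real system of course does fuzzy, probabilistic matching rather than exact dictionary lookup, which is where the models below come in.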

Apparently, most speech recognition systems use something called Hidden Markov Models.

[Figure: a specific Markov model for the word POTATO]
[Figure: the more general representation of a Markov model; source: Wikipedia]

Can you implement a speech recognition and transcription system for the Telugu language using off-the-shelf libraries? This is a question I don’t know the answer to, but let’s find out.

I set out looking for speech recognition libraries I could leverage and found a few. I don’t know which one is best suited for my purpose. I’ll start with the Google Cloud Speech-to-Text API, as it claims to support 120 languages, and Telugu is one of them.

I uploaded a Telugu song clip and Google STT produced the following –

The lines go –

Nee Pada Sevaku Velaye Swami 
Aapada Baapava Aananda Nilaya
Daari Tennu Teliyaga Leni 
Daasula Brovaga Vegame Raava

Google transcribed that to –

Pada Sevaku Velaye Swami, Nee Pada Seva Vela
Teliyadu
Teliyagaane Leni Naa Manasulo

What just happened? Why did Google’s transcription not work? In fact, it is so far off that the transcribed text reads like gibberish.

It’s possible the audio was not of great quality. It’s also possible that the Telugu vocabulary of Google’s Speech-to-Text system (GSTT) is limited. Perhaps the words Aapada, Baapava, Aananda, Nilaya, Daari, Tennu, and others are not transcribed properly because the related phonemes are missing from GSTT.

Can one add phonemes and new words to GSTT to improve its accuracy? The funny thing is, it is possible to add vocabulary to GSTT; it’s simple but not easy. It requires you to know programming and how to use Google’s STT Application Programming Interface (API). We will look at how to improve Google’s Speech-to-Text system by adding to its vocabulary in Part 2!

in AI | 563 Words