On rethinking how audio is captured, represented and retrieved in this new world of AI
Ishwarya Ananthabotla completed her BS and MS in Electrical Engineering and Computer Science from MIT. She is pursuing a PhD in the MIT Media Lab’s Responsive Environments group, exploring ways to capitalize on our knowledge of human perception, cognition, memory, and attention, to re-think traditional paradigms for audio capture, representation, and retrieval.
“If the whole truth is told, oral tradition stands out as the single most dominant communicative technology of our species as both a historical fact and, in many areas still, a contemporary reality.” __John Foley, Signs of Orality
Johannes Gutenberg ushered in the era of printing press and movable type around the year 1439. Thanks to him I’m able to write this blog but what intrigues me is why did we go from aural tradition to written tradition in the first place. What were the problems with aural methods that were addressed by printed word?
How did people transfer knowledge, news, gossip before printing was a thing? Handwritten manuscript was the prevalent method for writing and sharing ideas. However, the most popular method of transferring, sharing ideas with one another was through oral communication.
“If the whole truth is told, oral tradition stands out as the single most dominant communicative technology of our species as both a historical fact and, in many areas still, a contemporary reality.”
__John Foley, Signs of Orality
In ancient India, scriptures, folklore, stories were mainly transmitted orally. It is widely believed that srutis of Hinduism (Vedas) were never written down but have been transferred from generation to generation solely orally. Signs of that can be seen even today in the way music is taught in North India, Hindustani music, which is the main focus of one of the four vedas, Samaveda. The notations, structure of the composition (Raaga) and the Chalan (movements and interconnections between various notes) are some of the aspects that are still transferred between the Guru and Sishya in the so called Guru-Sishya-Parampara in oral methods.
“The Vedic texts were orally composed and transmitted, without the use of script, in an unbroken line of transmission from teacher to student that was formalized early on. This ensured an impeccable textual transmission superior to the classical texts of other cultures; it is, in fact, something like a tape-recording… Not just the actual words, but even the long-lost musical (tonal) accent (as in old Greek or in Japanese) has been preserved up to the present.”
— Michael Witzel
In Greece, it’s believed that Homer’s epic poetry (Lliad and Odyssey) was primarily composed, performed and transmitted orally.
Are we getting back to oral/aural mode of transmitting ideas more so than written script? This brings me to the point I’m trying to make with this writeup, if we can speak and the machines can understand and converse with us just like humans, if not better than humans, would oral communication become the predominant way we transmit ideas, commands, conversations?
If we can talk to the TV, talk to the garage door, talk to the thermostat and listen to books, listen to magazines, make annotations using audio markers rather than visual markers, you get the point, If I can talk to my phone and it can talk back to me (as Turing dreamed of), would we still want to type or read? or would we rather talk and listen?. What do you think?
“Alexa and Google Duplex are not perfect but so are humans, only difference is, Alexa and Duplex are making great strides forward”
During the years 2008-11 I had worked at a healthcare IT company that used Automated Speech Recognition (ASR), Speech-to-Text (STT) and Text-to-Speech (TTS) software to automate collecting insurance information from members on behalf of health insurance companies in the US e.g. Blue Cross, United.
Speaking to technology, as if you are chatting with a human, has been the holy-grail of digital user interface design and I believe we are in the golden age of speech recognition.
Speech recognition will become so ubiquitous that we wouldn’t have to type to chat with friends and family, in any language, or speak to an automated voice assistant that doesn’t understand what you are saying even after repeating 5 times. We would never have to fumble through many similar looking buttons on your TV remote to watch the show you want to watch, from any streaming app. It will be accomplished by a simple voice command, as if summoning a human assistant to find and play the show from Netflix for you.
Alexa And Duplex Are Not Perfect But So Are Humans
In 2017, when I spoke to my Alexa at home in Telugu (my mother tongue), it was missing what I said 9 out of 10 times but now in 2019 it’s already faring a lot better, magic of NLP in the cloud. I’d say it’s still far from perfect but not 1 out of 10, may be a 5!
Typing in non-english language into your iMessage or WhatsApp requires installing a language keyboard, finding several key combinations to type a single word, it’s not easy. Imagine simply speaking to the chat bot and it types the text in a language of your choice effortlessly. Imagine being able to communicate with a tourist in english while she sees text on her phone in her language instantly. It’s already possible to a large extent, with assistants like Google Duplex we are well on our way to this “utopian” world where language is not a barrier to communication any more.
Here is the sample google duplex making a call to reserve a spot at a hair salon, not bad ha?