How do voice assistants work?

Jan 8, 2022

People have wanted to talk to computers almost since the first computer was invented. Science fiction is full of computers that can hold a conversation, from HAL 9000 and the Starship Enterprise’s computer to J.A.R.V.I.S. and the legendary duo of C-3PO and R2-D2.

Just a few decades ago, the thought of having a meaningful conversation with a computer seemed far-fetched, but the technology to make voice interfaces effective and widely available is now here. Several consumer-level products have emerged in recent years, voice assistants have become more affordable, and more people are using them. New features and platforms are introduced all the time. Users can control a growing range of actions by voice, from basic informational queries to music playback, phone calls, operating a phone, and turning lights on and off. This article explores the basic workings and typical features of today’s voice assistants. It also goes through some of the privacy and security concerns that come with voice assistants, as well as possible future applications for these devices.

What Are Voice Assistants?

Simply put, voice assistants are the realization of the science fiction dream of interacting with our computers by talking to them. Apple’s Siri, Microsoft’s Cortana, Amazon’s Alexa, and Google’s Assistant are all software agents that run on purpose-built speaker devices or smartphones. The software constantly listens for a keyword to wake it up. Once it hears that keyword, it records the user’s voice and sends it to a specialized server, which processes and interprets it as a command. Depending on the command, the server will supply the voice assistant with appropriate information to be read back to the user, play the media requested by the user, or complete tasks with various connected services and devices. The number of services that support voice commands is growing rapidly, and Internet-of-Things device manufacturers are also building voice control into their products.
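The wake-word behaviour described above can be sketched in a few lines of Python. This is a simplified illustration, not how any vendor implements it: real devices detect the wake word in raw audio on the device itself, whereas this sketch assumes the speech has already been turned into words. The wake word "alexa", the `<silence>` marker, and the names `WakeWordGate` and `commands_from` are all assumptions made for the example.

```python
from dataclasses import dataclass, field

WAKE_WORD = "alexa"      # assumed wake word, for illustration only
END_OF_SPEECH = "<silence>"  # assumed marker for a pause in speech


@dataclass
class WakeWordGate:
    """Ignores input until the wake word is heard, then captures one command."""
    awake: bool = False
    captured: list = field(default_factory=list)

    def hear(self, word: str):
        """Feed one word; return a finished command, or None if there is none yet."""
        word = word.lower().strip(",.?!")
        if not self.awake:
            # While idle, nothing is stored: we only check for the wake word.
            self.awake = (word == WAKE_WORD)
            return None
        if word == END_OF_SPEECH:
            # A pause ends the command; this text is what would go to the server.
            command = " ".join(self.captured)
            self.awake, self.captured = False, []
            return command
        self.captured.append(word)
        return None


def commands_from(stream):
    """Return every command found in a stream of words."""
    gate = WakeWordGate()
    results = [gate.hear(w) for w in stream]
    return [c for c in results if c is not None]


if __name__ == "__main__":
    heard = "private chat here Alexa, what time is it <silence> more chat".split()
    print(commands_from(heard))
```

Only the words between the wake word and the pause are kept; everything said before and after is discarded, which is the behaviour the vendors describe.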

Apple’s Siri assistant has been around the longest, released as a standalone app in 2010 and bundled into iOS in 2011. Microsoft followed with Cortana in 2014. Amazon launched Alexa with its Echo connected home speaker in 2014, and Google’s Assistant was announced in 2016 along with its Home speaker and is also embedded in the Google app for Android-based smartphones.

Earlier voice-activated devices relied on a smaller set of “built-in” commands and responses. Recent advances in natural language processing, also known as computational linguistics, have allowed voice assistants to create meaningful responses quickly. Hirschberg and Manning credit these recent improvements in natural language processing to four things:

1. A vast increase in computing power
2. The availability of very large amounts of linguistic data
3. The development of highly successful machine learning (ML) methods, and
4. A much richer understanding of the structure of human language and its deployment in social contexts.

As personal computers have grown cheaper and more powerful, and people have created more and more online text to be analyzed, scientists have used that text to train voice assistants to listen and respond to our requests in more natural and meaningful ways. Voice assistants can parse requests phrased in a number of different ways and interpret what the user is most likely to want. For example, to ask Google’s Assistant to remember where one parked his or her car, a user can say any of a number of phrases: “Remember where I parked,” “I parked here,” “I left the car on 6th street,” or “the car is in the south lot” will all get a similar result. Google will remember where the user parked the car and, when asked later, will be able to respond accordingly. The user can ask questions in a similarly natural way; asking “where did I park,” “where did I leave the car,” or “do you remember where I parked” all trigger the expected response. Natural language processing avoids the user frustration of earlier voice recognition systems, which required specific phrases and patterns in order to work properly.
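To make the parking example concrete, here is a toy sketch of mapping differently phrased requests to the same intent. It uses hand-written patterns, which is exactly the brittle approach that modern assistants have moved away from; a real assistant learns these mappings from large amounts of data. The pattern lists and the intent names `store_parking` and `recall_parking` are invented for the illustration.

```python
import re

# Hypothetical hand-written patterns; real assistants learn these from data.
STORE_PATTERNS = [
    r"\bremember where i parked\b",
    r"\bi parked\b",
    r"\bi left the car\b",
    r"\bthe car is\b",
]
RECALL_PATTERNS = [
    r"\bwhere did i park\b",
    r"\bwhere did i leave the car\b",
    r"\bdo you remember where i parked\b",
]


def classify(utterance: str) -> str:
    """Map an utterance to an intent name, or 'unknown' if nothing matches."""
    text = utterance.lower().strip(" ?.!")
    # Check questions first: "do you remember where I parked" also contains
    # the statement pattern "remember where I parked".
    if any(re.search(p, text) for p in RECALL_PATTERNS):
        return "recall_parking"
    if any(re.search(p, text) for p in STORE_PATTERNS):
        return "store_parking"
    return "unknown"


if __name__ == "__main__":
    for phrase in ["I left the car on 6th street", "Where did I park?"]:
        print(phrase, "->", classify(phrase))
```

Every phrasing that is not in the lists falls through to "unknown", which shows why fixed patterns frustrated users of earlier systems.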

Evolution of Voice Assistants

The voice assistant journey can be broken down into two phases:

Phase 1 was all about getting consumers introduced to the idea of using voice to perform tasks.

Phase 2 is about voice becoming a pervasive interaction mode with more capabilities, used more frequently on more devices, in apps, and in different contexts.

What happens when you say “Okay, Google”?

You utter a sound, and a machine learning algorithm [or maybe a statistical algorithm like Markov chains, or a more general statistical algorithm] attempts to match your utterances to a syllable, letter, or word with maximum probability. This of course depends greatly on the corpora [the raw data] that you feed it. Now let’s assume that we have that feature in “OK Google”, which is ostensibly a voice search for Google.

The crux of the magic here is that Google’s assistant already has words it expects to see with high probability, and so the recognition will be better. Locally speaking, it is going to crawl your “appointments” and your “contacts” and attempt to correlate your human sounds with those and with other words that are low-hanging fruit for an assistant. If this correlation falls below a certain threshold [e.g. show me the “supercalifragilistic” restaurant near me], it is probably going to fall back to the generalized Google search.

But while it plays in the sandbox of assistant-ship, the recognition will be tighter; only when there is an anomaly will it default to the catch-all.

Now, the catch-all is trained on many people, accents included, searching for things on the internet that may have very unusual names, and so that AI must be open-minded. It probably doesn’t match whole words, but syllables or letters instead. This is a much more challenging problem, and you would expect to see a performance decrease.

In other words, the assistant knows you’re probably trying to say “Simon” since you just texted him, whereas the global algorithm will have to fight with “sigh a man”, “Sigh, Mon.”, etc.
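The “Simon” example can be sketched as a small rescoring step. This is a guess at the general idea, in the same speculative spirit as the explanation above, and not Google’s actual algorithm. The function name `recognize`, the `boost` factor, the `threshold`, and all the scores are made-up values, and the sketch assumes at least one candidate is supplied.

```python
def recognize(acoustic_scores, context_words, boost=3.0, threshold=0.5):
    """Pick the most probable candidate, favouring words the assistant expects.

    acoustic_scores: dict of candidate -> how well it matches the sound
    context_words:   words taken from contacts, appointments, and so on
    Returns (chosen candidate, which recogniser produced it).
    """
    # Expected words get a higher prior, so their score is multiplied up.
    weighted = {
        word: score * (boost if word in context_words else 1.0)
        for word, score in acoustic_scores.items()
    }
    total = sum(weighted.values())
    probability = {word: value / total for word, value in weighted.items()}

    best = max(probability, key=probability.get)
    if best in context_words and probability[best] >= threshold:
        return best, "assistant"

    # Below the threshold: fall back to the general recogniser, which has
    # no personal context and relies on the raw acoustic match alone.
    return max(acoustic_scores, key=acoustic_scores.get), "general search"


if __name__ == "__main__":
    sounds = {"simon": 0.30, "sigh a man": 0.40, "sigh mon": 0.30}
    print(recognize(sounds, context_words={"simon"}))
    print(recognize(sounds, context_words=set()))
```

With “simon” in the contact list, the assistant picks it even though the raw sound match is slightly weaker; with no context, the general recogniser is left to choose among the sound-alikes.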

Siri vs. Cortana vs. Alexa vs. Assistant

Voice assistants can add other features, often called “skills,” that expand their abilities by interfacing with other programs via voice commands. Amazon’s Alexa has skills for playing Jeopardy, ordering your usual drink from your local Starbucks, and summoning an Uber or Lyft using connected account data. Google’s Assistant has similar skills but lags behind Amazon in the sheer number of available skills, largely due to being released later. Google Assistant also integrates with several tools that allow users to create their own skills. Using tools like Tasker and IFTTT (If This Then That), users can craft skills that automate social media posts, turn devices on and off, and cover hundreds of other possibilities. For example, telling Assistant “Good morning” could launch a number of actions designed to speed up the user’s morning routine.

Voice assistants are available on a wide variety of hardware platforms. Amazon and Google both market dedicated home speaker devices for their voice assistants. Apple has entered the home speaker market with its lineup of Siri-enabled HomePod devices. Microsoft has focused on building Cortana into Windows 10 PCs and phones and recently partnered with Harman Kardon to develop a Cortana-enabled home speaker. As the voice assistant market stabilizes, it seems likely that there will be additional integration and that feature sets across the main voice assistants will become similar. For the moment, Amazon is the dominant player in the field, due to launching a home product first with a large media library available out of the box. Google is building capacity, and the addition of a home-based speaker and integration into other Google products will drive its market share up. Apple may also become more of a contender with the release of HomePods. Microsoft is not likely to gain much traction, as its share of the smartphone market is negligible and it lacks a compelling home-based product.

What Can Voice Assistants Do?

Although each currently available voice assistant has unique features, they share some similarities and are able to perform the following basic tasks:

1. send and read text messages, make phone calls, and send and read email messages;
2. answer basic informational queries (“What time is it? What’s the weather forecast? How many ounces are in a cup?”);
3. set reminders, make lists, and do basic math calculations;
4. control media playback from connected services such as Amazon, Google Play, Netflix, and Spotify;
5. control Internet-of-Things-enabled devices such as thermostats, lights, alarms, and locks;
6. tell jokes and stories.

How Does a Voice Assistant Work?

1. A voice assistant first records our speech. Because interpreting sounds takes a lot of computational power, the recording of our speech is sent to remote servers to be analyzed more efficiently.
2. Algorithms break down what we say into individual sounds. They then consult a database containing the pronunciations of various words to find which words most closely correspond to the combination of individual sounds.
3. The assistant then identifies key words to make sense of the task and carries out the corresponding function. For example, if Google Assistant notices words like “weather” or “temperature”, it opens the weather app.
4. Google’s servers send the information back to your device, and Google Assistant may speak. If Google Assistant needs to say anything back to us, it goes through the same process described above, but in reverse order.
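The steps above can be sketched as a toy pipeline: sounds are matched to the closest word in a pronunciation dictionary, and the recognised words are scanned for key words that trigger an action. Everything here is a stand-in chosen for illustration. The tiny dictionary uses ARPAbet-style symbols, the action names are hypothetical, and Python’s `difflib.SequenceMatcher` replaces the far more sophisticated acoustic models that real assistants use.

```python
from difflib import SequenceMatcher

# Tiny made-up pronunciation dictionary (ARPAbet-style symbols).
PRONUNCIATIONS = {
    "weather": ["W", "EH", "DH", "ER"],
    "what": ["W", "AH", "T"],
    "is": ["IH", "Z"],
    "the": ["DH", "AH"],
    "time": ["T", "AY", "M"],
}

# Hypothetical mapping from key words to actions.
KEYWORD_ACTIONS = {
    "weather": "open_weather_app",
    "temperature": "open_weather_app",
    "time": "read_clock",
}


def closest_word(sounds):
    """Step 2: find the dictionary word whose pronunciation best matches."""
    def similarity(word):
        return SequenceMatcher(None, PRONUNCIATIONS[word], sounds).ratio()
    return max(PRONUNCIATIONS, key=similarity)


def handle(sound_groups):
    """Steps 2 and 3: turn groups of sounds into words, then pick an action."""
    words = [closest_word(group) for group in sound_groups]
    for word in words:
        if word in KEYWORD_ACTIONS:
            return words, KEYWORD_ACTIONS[word]
    return words, "fallback_search"


if __name__ == "__main__":
    # "What is the weather", with one sound slightly misheard (D for DH).
    heard = [["W", "AH", "T"], ["IH", "Z"], ["DH", "AH"], ["W", "EH", "D", "ER"]]
    print(handle(heard))
```

Because the lookup picks the closest pronunciation and does not require an exact one, a slightly misheard sound still resolves to “weather” in this example.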

Challenges Faced by Voice Assistants

Yes, voice technology has problems. Call them challenges, or call them opportunities that one can tap into, pun intended!

Permission to access a lot of data:

“If you want automation, you have to give up control!”

The biggest disadvantage of the Assistant is that it requires users to give access to a lot of data. Want the Assistant to remind you of calendar events? You have to give access to your calendar. Want the Assistant to remind you when to leave for your event? You have to give access to your location. Want the Assistant to send texts for you? You have to give it SMS control. Want the Assistant to turn on your lights? Gotta give access for that.
The problem is that this is partly our own fault. People have become so used to having technology do things for them that they keep asking for more from it. Ideally, these things would be done without giving up any private data, but that is just not possible. There are also device issues: the assistant can heat up your phone, drain the battery, use a lot of data, or sometimes cause the device to hang.

Security and Privacy

One of the main issues with these voice-activated devices is security. Anyone with access to a voice-activated device can ask it questions, gather information about the accounts and services associated with the device, and ask it to perform tasks. This poses a major security risk because these devices will read out calendar contents, emails, and other highly personal information.
In one reported case, a man discovered that the iPad in his living room would unlock the front door for anyone who stood outside and asked Siri to let them in.
Google has recently upgraded its Assistant software to include voice printing, which uniquely identifies each user by voice and prevents the device from reading out personal information. Apple is also teaching Siri to recognize a user’s voice, but that feature had not yet been released at the time of this writing.
Amazon’s Alexa is just as prone to these security issues, and Amazon is working to deploy a similar voice printing system. Alexa has the added issue of being built into Amazon’s store interface. By default, anyone with voice access to the device can order items using the owner’s Amazon account. There are options to set a voice passcode to confirm purchases, and all goods will ship to the owner’s address on file, but there is still potential for malicious users to purchase goods on the owner’s account. Household members could make unauthorized purchases as well, like the six-year-old who ordered herself a dollhouse and four pounds of sugar cookies via Alexa.
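As a rough sketch of how voice printing and a voice passcode could work together to guard purchases, consider the check below. It is purely illustrative: the function `may_order` and its parameters are hypothetical, the speaker IDs are assumed to come from some separate voice-recognition step, and nothing here reflects how Amazon actually implements purchase confirmation.

```python
import hmac


def may_order(speaker_id: str, owner_id: str, spoken_code: str, stored_code: str) -> bool:
    """Allow a voice purchase only for the recognised owner with the right passcode.

    speaker_id is assumed to come from a separate voice-printing step.
    """
    is_owner = speaker_id == owner_id
    # compare_digest takes the same time whether or not the codes match early,
    # so the check does not leak how much of a guess was correct.
    code_ok = hmac.compare_digest(spoken_code.encode(), stored_code.encode())
    return is_owner and code_ok


if __name__ == "__main__":
    print(may_order("alice", "alice", "4921", "4921"))  # owner, right code
    print(may_order("child", "alice", "4921", "4921"))  # someone else's voice
```

Requiring both checks means that overhearing the passcode is not enough on its own, and neither is sounding like the owner.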
Voice assistants are also vulnerable to several other attacks. Researchers have recently shown that voice assistants will respond to inaudible commands delivered at ultrasonic frequencies. An attacker could approach a victim and play an ultrasonic command, and the victim’s device would respond.
Privacy is another major concern for voice assistant users. By their very nature, these devices must be listening at all times so that they can respond to users. Amazon, Apple, Google, and Microsoft all insist that their devices are not recording unless users speak the command to wake the assistant, but there has been at least one case where a malfunctioning device was recording at all times and sending those recordings back to Google’s servers.
Even if the companies developing these voice assistants are being careful and scrupulous, there is a potential for data to be stolen, leaked, or used to incriminate people.

Accuracy

Voice assistants don’t always understand what is spoken. There can be many reasons for this. Sometimes it is because of how we speak; an accent, for example, can cause problems. Sometimes it is because the voice assistant simply doesn’t know what to do with the question, since it has no instructions related to the query.

Lack of Vernacular Support

Speech recognition, perhaps the most critical component of a Voice Assistant, is not available for a lot of languages spoken around the world. The problem is not only limited to speech recognition but also extends to other critical functional areas of Voice Assistants.

In countries like India, with a massive population of Indic-language speakers, the lack of quality automatic speech recognition (ASR) models for vernacular languages is often a limiting factor in providing a good voice experience. In India, voice assistants are not just a source of convenience but a necessity.

Most natural language processing is done after translating spoken utterances from vernacular languages into English. In this process, a lot of contextual nuance is lost or changed.

What is the Future?

Millennial consumers are fueling the shift towards voice assistants powered by artificial intelligence. Significant AI adoption is driving the move to voice applications. Additionally, IoT devices such as thermostats, speakers, and smart appliances are making voice assistants ever more useful in the lives of everyday users.

1. Streamlined conversations

Google and Amazon recently announced that their voice assistants will stop requiring the user to say ‘wake’ words such as ‘Alexa’ or ‘Google’ to start a conversation. Such devices are also expected to get better at understanding contextual factors that make conversations more efficient.

2. Change in search behaviors

· The market value of voice-based shopping is projected to reach $40 billion by 2022, according to industry projections.
· Consumer spending via voice assistants is also projected to reach 18% by the year 2022.
· Not surprisingly, voice-based ad revenues are expected to reach $19 billion by 2022.

3. Personalized experiences

Voice-enabled devices saw a 39% increase year-over-year in online sales. Pretty soon, voice assistants will start providing even more personalized experiences as they become better at distinguishing different voices and tailoring results according to each individual user’s information.

4. Compatibility and integration

Samsung has already started integrating voice control into household appliances with the release of its Family Hub refrigerator. Google also recently rolled out a new product called Google Assistant Connect, which allows manufacturers to build custom devices integrated with this technology.

5. Focus on security

Amazon and Google introduced a number of security measures (including speaker ID and verification) to their voice assistant technologies. New solutions are also in the pipeline to make it more secure for customers to buy things using voice.

Conclusion:

The complexity and accuracy of voice recognition technology and voice assistant software have grown exponentially in the last few years. Currently available voice assistant products from Apple, Amazon, Google, and Microsoft allow users to ask questions and issue commands to computers in natural language. There are many possible future uses of this technology, from home automation to translation to companionship and support for the elderly. However, there are also several problems with the currently available voice assistant products. Privacy and security controls will need to be improved before voice assistants should be used for anything that requires confidentiality. Librarians should monitor these products and be ready to provide assistance to their patrons with these devices. They should also explore the possibilities for providing library materials via voice assistants as the technology matures.

Written by: Atharva Dongare, Yash Halgoankar, Mansi Jadhav, Aditya Giradkar and Rushikesh Lenekar