A virtual assistant is software that interprets natural language (through Natural Language Processing) and, if properly trained, can converse with human interlocutors in order to provide information or perform certain operations.
Dictation machines are the earliest devices that can be traced back to today’s voice assistants. In 1877, Thomas Edison invented the phonograph. Essentially, the phonograph consisted of a stylus that, in response to the pressure produced by the sound vibrations emitted by the “user”, engraved grooves on a rotating cylinder covered with a thin sheet of tinfoil. Once engraved, the cylinder could be played back by running a stylus over the grooves, much like a modern turntable. The recording quality was very low, but this first device laid the foundations of what would become voice assistants: it could, in fact, record speech and play it back.
In 1886, Alexander Graham Bell improved Edison’s technology by replacing the foil with wax. This allowed longer recordings and higher-quality speech reproduction.
To get really close to today’s voice assistants, we have to wait until 1952, the year in which Audrey, produced by Bell Laboratories, made its debut.
Audrey stands for Automatic Digit Recognizer: a machine about 1.80 meters tall that could recognize the digits from 0 to 9. To do so, the user had to pause between one digit and the next and run “speech tests” to adapt Audrey to their voice. In theory, Audrey could have been used for hands-free dialing of telephone numbers, but its size, price, and maintenance costs prevented mass adoption: it was still faster and more efficient to dial numbers by hand. Nevertheless, this invention was the founding basis of what is now called “speech recognition”: the technology underlying voice assistants.
Taking a step forward, IBM introduced Shoebox, the first voice-activated computer, at the 1962 World’s Fair in Seattle. It could understand 10 digits and 6 words (plus, minus, total, subtotal, false, and off). Connected to a calculator, Shoebox was able to solve simple mathematical operations. Like Audrey, the machine recognized and acted on the voice frequencies emitted by the user. The real innovation, however, was its ability to perform actions on the basis of the input it collected: another fundamental step on the road to voice assistants.
Almost 10 years later, in 1971, the Defense Advanced Research Projects Agency (DARPA) funded a five-year speech recognition project carried out by Carnegie Mellon University. This research led to the launch of Harpy (1976). Harpy had a vocabulary of 1,011 words and could understand entire sentences, distinguishing the different words that make up each one, as long as the speech followed pre-programmed grammatical structures, vocabulary, and pronunciations. One of the most fascinating aspects of Harpy, and the one that brought it closest to the voice assistants we all know today, was that it returned the message “I don’t know what you said, please repeat” when it couldn’t understand the user.
In 1986, IBM introduced an updated version of Shoebox: Tangora. The name pays tribute to Albert Tangora, who once held the world record as the fastest typist. Like Shoebox, Tangora was connected to a device, but this time it was a typewriter rather than a calculator. Tangora could recognize about 20,000 words, and its real peculiarity was that it could predict the most likely word based on what it had interpreted up to that point.
Finally, for the real commercialization of speech recognition systems, we have to wait for the 1990s and the arrival of Dragon’s NaturallySpeaking software. Launched in its first version in 1997, it could recognize and transcribe natural human speech (users did not have to pause between words) into a digital document at a rate of up to 100 words per minute. The software cost just under 700 USD, which made it “accessible” compared to previous speech recognition devices. Today, Dragon’s software is still on sale in its updated versions.
By this point, all the fundamental building blocks of what would become voice assistants had been laid. But the final realization only became possible with the development of artificial intelligence and machine learning.
So let’s summarize, step by step, how a modern voice assistant works:
- the assistant is activated when it receives a specific voice command: the so-called “hot word”, which in Amazon’s case is “Alexa”;
- once the audio input has been received, the words are identified through Automatic Speech Recognition (ASR) technology;
- once the words have been identified, Natural Language Understanding (NLU) is applied to attribute a meaning to the input (i.e. to understand what the “assisted person” wants);
- at this point, the software makes use of the different applications, or “skills”, available in the cloud to perform the assigned task (e.g. the weather application to give information about the weather);
- finally, the voice assistant must give “voice” to the identified response. To do so, it uses a Text-to-Speech synthesis engine that turns the written result into a natural-sounding voice.
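The steps above can be sketched, in very simplified form, as a small program. Everything here is illustrative: the function names, the keyword-based intent detection, and the toy “skills” registry are hypothetical stand-ins for the dedicated ASR, NLU, and TTS engines a real assistant relies on.

```python
def automatic_speech_recognition(audio: str) -> str:
    # Stand-in for a real ASR engine: here the "audio" is already text.
    return audio.lower().strip()

def natural_language_understanding(text: str) -> dict:
    # Toy keyword-based intent detection; real NLU uses trained models.
    if "weather" in text:
        return {"intent": "get_weather"}
    if "music" in text:
        return {"intent": "play_music"}
    return {"intent": "unknown"}

# Hypothetical "skills" registry: each intent maps to a cloud application.
SKILLS = {
    "get_weather": lambda: "Today it is sunny with a high of 22 degrees.",
    "play_music": lambda: "Playing your favourite playlist.",
}

def text_to_speech(text: str) -> str:
    # Stand-in for a TTS engine: a real system would synthesize audio.
    return f"[spoken] {text}"

def assistant(audio: str, hot_word: str = "alexa") -> str:
    text = automatic_speech_recognition(audio)      # step 2: ASR
    if not text.startswith(hot_word):               # step 1: hot-word check
        return ""                                   # stay idle otherwise
    intent = natural_language_understanding(text)   # step 3: NLU
    skill = SKILLS.get(intent["intent"])            # step 4: dispatch a skill
    reply = skill() if skill else "Sorry, I can't help with that."
    return text_to_speech(reply)                    # step 5: TTS

print(assistant("Alexa, what's the weather like today?"))
```

Note how the hot-word check comes first: until it matches, the assistant produces no response at all, which mirrors how smart speakers stay dormant until activated.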
Now, you are probably wondering at what stage Artificial Intelligence and Machine Learning intervene. The answer is: at several. These technologies allow voice assistants to react to a precise command by choosing from a set of pre-configured answers or solutions, to learn and improve their skills over time and, in more refined systems, to base their answers on the habits of the users that the assistant has learned to recognize.
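As a toy illustration of the habit-based behaviour mentioned above, one could simply track which skill a user invokes most often and suggest it proactively. This is only a sketch under simplified assumptions (the class name and frequency-counting approach are hypothetical); real systems use far richer user models.

```python
from collections import Counter

class HabitModel:
    """Toy sketch: count how often each intent is used so the assistant
    can predictively suggest the most frequent one."""

    def __init__(self):
        self.history = Counter()

    def record(self, intent: str) -> None:
        # Called every time the user triggers a skill.
        self.history[intent] += 1

    def suggest(self):
        # Return the most frequently used intent, or None if no history yet.
        if not self.history:
            return None
        return self.history.most_common(1)[0][0]

model = HabitModel()
for intent in ["get_weather", "play_music", "get_weather", "get_weather"]:
    model.record(intent)
print(model.suggest())  # the user's most frequent request
```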
The last decade can be called the “modern age” of voice assistants, starting with the birth of Siri, Apple’s voice assistant. In 2011, Apple decided for the first time to add a voice assistant to its smartphones. The iPhone 4S offered its users a fully voice-driven mode of interaction: scheduling appointments, playing music, searching for information, and performing other basic tasks became possible simply by talking to the phone. Thanks to this integration into Apple’s mobile devices, Siri was the first voice assistant to reach a mass audience. Other assistants such as Google Now and Microsoft’s Cortana soon followed.
In the same year, Google introduced online voice search: it became possible to use the computer microphone to search for content by voice in Chrome, taking advantage of the speech recognition features within Google Search. Later on, Google presented the famous Google Assistant, integrated into all Android devices, which include Google services by default.
2014 marked another great revolution in the world of voice assistants: Amazon introduced Alexa and, with it, Echo. It was the birth of the smart speaker: a stand-alone intelligent speaker. That day a truly competitive voice assistant market was born, with Google and Apple responding to Amazon’s challenge by presenting Google Home and the Apple HomePod respectively. These smart speakers, installed in homes, can also interact with external devices, allowing people to control home automation devices with their voice: appliances, lights, thermostats, and even security systems. Any latest-generation device with a WiFi connection can be controlled through the voice assistant in the speaker, making it possible to manage homes literally by voice.
Since then, the smart speaker market has been growing exponentially: according to the results of Strategy Analytics’ recent research, global smart speaker sales in the first quarter of 2020 reached 28.2 million units, an increase of 8.2% over the first quarter of 2019.
Amazon, according to Statista, is confirmed as the leading brand in this sector, with a market share of 23.5% worldwide. According to Canalys analysts, by the end of 2020 there will be 320 million active smart speakers in the world, a figure expected to double by 2024: exorbitant numbers that underline the importance these small voice assistants are assuming in people’s daily lives.
But the story of voice assistants is just beginning: according to Hej!, the digital innovation agency that applies artificial intelligence to conversation, users will begin to change the way they interact, and voice will become the primary activator of digital systems. As these technologies become more advanced, they will be able to provide truly complete experiences: users will no longer need to ask these systems for help; instead, the assistants will begin to suggest to users, predictively, what to do in certain circumstances.
The practical applications of voice assistants are therefore almost infinite, and in the years to come they will bring a real revolution to every economic and social field, while research focuses more and more on language understanding to make dialogue between humans and machines ever more natural. They will never replace human conversation, that’s for sure, but they will surely become more and more useful in our daily lives.
Subscribe to our newsletter so you don’t miss the next journey through the technologies that are revolutionizing our daily lives.