Neural Networks for Spoken Language Understanding


Yesterday, our CTO, Andrey Ryabov, shared our learnings from developing a Neural Network for Spoken Language Understanding with students at Stanford University, as part of the AI course on Neural Networks and Deep Learning. Here is a brief on the topics covered.

Spoken language is quite different from written language in the following ways:

  1. When speaking, people don’t always follow grammar or use punctuation, and they often split their sentences.
  2. Automatic Speech Recognition (ASR) introduces errors.
  3. Users tend to use more anaphora, referring back to earlier entities with pronouns.
  4. When writing, a person can go back and edit sentences; a speaker cannot, so corrections are appended to the sentence.

These specifics of spoken language have to be considered when developing Natural Language systems. Many classical NLP models trained on datasets of written language don’t perform well on spoken language.

How does one develop a Voice AI service that converts speech to meaning and offers human-like conversations? Here are some of the techniques we shared on Spoken Language Understanding.

  1. Develop Word Vectors for sentences using a classical NLP training set.
  2. Augment the LSTM to use word positions and context information.
  3. Use Attention so that the important words contribute more.
  4. Use dataset augmentation and Transfer Learning to better train the Neural Network.
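The talk did not include implementation details, but as one rough illustration of step 3, here is a minimal NumPy sketch of attention pooling: a query vector (hypothetical and randomly initialized here) scores each word’s LSTM hidden state, so the important words contribute more to the sentence representation.

```python
import numpy as np

def attention_pool(hidden_states, query):
    """Weight each timestep's hidden state by its relevance to a query
    vector, so important words contribute more to the sentence vector."""
    scores = hidden_states @ query                   # (T,) raw relevance scores
    scores = scores - scores.max()                   # subtract max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over timesteps
    return weights @ hidden_states, weights          # (D,) pooled sentence vector

# Toy example: 4 "words", each with a 3-dimensional hidden state.
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 3))   # stand-in for LSTM outputs
q = rng.standard_normal(3)        # hypothetical learned query vector
sentence_vec, w = attention_pool(H, q)
```

In a real model the query would be learned jointly with the LSTM; the point of the sketch is only the weighted pooling itself.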

The above are some of the Neural Network enhancements we shared; they will soon be released in our Voice AI service. If the spoken language understanding problem appeals to you and you are an engineer, email us to learn more. If you are an enterprise that wants to leverage the Voice AI service, contact us here.

2018 will be the year of Voice in Enterprise

2017 was a year of Cambrian explosion for voice-enabled devices. Amazon consolidated its leadership by launching a whole family of Echo devices; Echo Dot was Amazon’s best-seller during the holiday period. Google was not far behind: more than one Google Home device was sold every second since Google Home Mini started shipping. CES 2018 offered a glimpse into the future of conversational voice: a world where consumers can use voice to control everything from thermostats, lamps, sprinklers, and security systems to smartwatches, fitness bracelets, smart TVs, and their vehicles.

2018 will be the year for Voice Assistants to enter the Enterprise and transform how work gets done. Here are a few important factors for Voice Assistants to make a seminal impact in the enterprise:

  • Conversational Experience: An Enterprise Voice Assistant should provide a human-like conversational experience. Unlike a request-response system, it should let you interrupt at any time. While you are getting your daily briefing by saying “What does my day look like?” and listening to the summary of your meetings, urgent emails, direct messages, and important tasks, you should be able to cut in at any point and say “Text Laura that I am running 10 minutes late, and move my next meeting by half an hour.”
  • Contextually Intelligent: Today, we use a whole array of applications to get things done at work. An Enterprise Voice Assistant needs to preserve the context of the conversation so the user can have a deep, meaningful dialogue that completes a business workflow by drawing information from different data sources, without switching applications. For example, a technician may ask “Where is my next appointment?” followed by “Which cable modem should I bring?” The Assistant needs to carry the customer context from the first question to pull up the right modem information.
  • Proactive Predictions: A Voice Assistant for the enterprise should not just respond to user commands; it should also leverage Artificial Intelligence (AI) to proactively make recommendations and predictions that accelerate the task. For example, when a delivery driver asks “How do I get to the next delivery?”, the Voice Assistant should not only provide directions but also proactively add, “By the way, the door code to get into the building is 1290.”
  • Domain Specific: Every enterprise and every industry is unique and has its own lingua franca. An Enterprise Voice Assistant should apply Machine Learning (ML) to learn the unique vocabulary of each domain and improve speech-to-meaning accuracy. The Assistant should cumulatively learn from each user interaction so it can recommend to other colleagues what to say to complete a task.
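To make the context-preservation idea concrete, here is a hypothetical sketch in the spirit of the technician example: the intent names, slots, and replies are all invented for illustration. Entities resolved in one turn are kept in a session state so a follow-up question can reuse them.

```python
# Hypothetical sketch of context carry-over between dialogue turns.
class DialogSession:
    def __init__(self):
        self.context = {}  # slots remembered across turns

    def handle(self, intent, entities):
        self.context.update(entities)  # remember what was mentioned this turn
        if intent == "next_appointment":
            return f"Your next appointment is with {self.context['customer']}."
        if intent == "equipment_lookup":
            # No customer mentioned in this turn: fall back to carried context.
            customer = self.context.get("customer", "unknown")
            return f"Bring the modem on file for {customer}."
        return "Sorry, I didn't catch that."

session = DialogSession()
session.handle("next_appointment", {"customer": "Acme Corp"})
reply = session.handle("equipment_lookup", {})  # relies on carried context
```

A production assistant would resolve intents and entities with NLP models rather than hand-written branches, but the carry-over of slots between turns is the essential mechanic.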

Voice AI will not just be used for simple work tasks like controlling a meeting or booking a conference room. It will be used to complete complex business workflows, because the more complex the scenario, the more it benefits from Voice AI helping users get what they need quickly without switching applications.

2018 will be the groundbreaking year when business leaders in every industry wake up to the importance of Voice AI. The notion of opening an app will start to seem old-school when you can just talk to get work done.

Taking your Voice Assistant to the Office


Voice assistants have been finding their way into our daily lives. Among at-home voice-enabled speaker devices, Amazon has taken roughly a 70% market share and sold over 25 million units, while Google trails at 23.8% with about 5 million units sold. These devices, along with their mobile counterparts Siri, Cortana, and Google Assistant, have pushed voice into the mainstream, making consumers more comfortable using them as a hands-free way of getting things done.

Still, only 46% of U.S. adults use voice assistants, with the most frequent use cases being playing music, setting alarms, and checking the weather. The remaining 54% of U.S. adults say they don’t use voice assistants simply because they are not interested. And it’s no wonder: several of the most common use cases for voice can easily be achieved with a few taps on the phone. Not to mention that Siri’s and Google’s command-and-response can easily be bungled, taking more time than just tapping through to the desired app.

Each tech giant is playing to its own strengths in approaching the advantages voice enables: Amazon’s Echo is used for shopping, Apple is launching its HomePod in early 2018 and doubling down on music, and Google is optimizing its Home device for search. So far, Amazon seems to be the only one delivering on the advantages of voice, letting consumers say something as simple as “Amazon, order X”, which automatically completes the order and has the items shipped.

Traditional user interfaces are complex, and completing tasks or performing a search is time-consuming. With voice, tasks can be completed in only a few words. Today, for example, if you wanted to find the recent changes to a document, you would first have to open your docs app, search for the document, open it, then click on a separate button to view the recent changes. With voice, you could simply say “Are there any recent changes to the document?” and the result would be spoken back to you.

At work, we often use business applications that require training and practice to use effectively within our organization. These products help us solve complex problems, but they shouldn’t require us to jump through hoops. As Steve Jobs said, the computer is “a bicycle for the mind”. Our software at work should be intelligent enough to help us go faster, not slow us down. With voice, we’ll be able to go faster with less effort. This is the future of voice we’re building at Synqq.

Entering the Voice Information Era

We are entering an era where voice is being transformed from audio into information. Contributing factors are behavioral changes driven by smartphones and voice-controlled speakers, advancements in infrastructure technology, and falling infrastructure prices. Today we use products like Amazon Alexa, Apple Siri, and Google Assistant with voice commands and get a response back with each request. Behind the scenes, these products convert voice into text via Automatic Speech Recognition (ASR), use Natural Language Processing (NLP) to interpret it, and return the results either visually or as voice using Text to Speech (TTS). The rapid growth and ease of use of these products have instilled behavioral changes in consumers, making us more comfortable and more likely to use voice instead of traditional user interfaces.
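The ASR → NLP → TTS flow described above can be sketched as a simple pipeline. The three stage functions below are stand-ins for the cloud services (a real system would call vendor ASR, NLP, and TTS APIs); the transcript and intent logic are hard-coded purely for illustration.

```python
# Hypothetical end-to-end voice request flow: audio in, ASR to text,
# NLP to intent, response rendered back to audio with TTS.
def asr(audio):
    """Automatic Speech Recognition stub: audio -> transcript text."""
    return "what is on my calendar today"

def nlp(text):
    """Interpret the transcript into an intent (toy keyword matching)."""
    if "calendar" in text:
        return {"intent": "calendar_summary"}
    return {"intent": "unknown"}

def tts(text):
    """Text to Speech stub: response text -> audio bytes."""
    return text.encode("utf-8")

def handle_request(audio):
    transcript = asr(audio)
    result = nlp(transcript)
    if result["intent"] == "calendar_summary":
        response = "You have three meetings today."
    else:
        response = "Sorry, I didn't understand."
    return tts(response)
```

The interesting engineering is inside each stage; the pipeline shape itself is what the paragraph above describes.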

Every major part of the technology infrastructure required to convert voice into information is available from cloud vendors: Amazon, Google, and Microsoft. The three major voice services they provide are ASR, TTS, and NLP. For example, the ASR services from Amazon, Google, and Microsoft are priced around 2.4 cents per minute. Going forward, all the cloud vendors are embarking on Deep Learning to reduce the training workload, improve the accuracy of the transcribed text, and scale the complexity of ASR. In parallel, dropping hardware prices combined with Nvidia GPU cloud infrastructure (for both training and inference) will dramatically reduce the prices of ASR services.

What can we do as the speech infrastructure services ASR, NLP, and TTS improve and the prices come down?

To make a prediction, let’s look at the evolution of voice communication services over the last decade. The technology infrastructure for voice communications, such as high-bandwidth codecs, Acoustic Echo Cancellation (AEC), and broadband, set the stage for complete solutions like Skype, WeChat, Line, WebEx, WhatsApp, and many others. These complete solutions offered a better experience and benefited from network effects to become the dominant players in voice communication.

Along similar lines, we expect complete solutions built on these speech infrastructure services to emerge as the dominant players. At Synqq, we are excited to leverage this infrastructure to develop the world’s first Voice AI service for enterprises. Our focus is to provide the voice interface to access all the information in your enterprise. Just talk to Synqq!

Tapping into Voice at Work


At work, we’re always trying to be more productive: to get more out of doing less, whether with our time, our tools, or our method of communication. Digital tools like email, collaboration apps, and messengers have helped us do that, but there’s one method of communication everyone uses at work that’s never been optimized: voice. Voice remains the most used, most effective, and fastest form of communication at work, yet it is also the most difficult to store and consume. What if we could “see” what was said in our conversations instead? We are best at exchanging information with each other by voice, but we consume information best visually.

Advancements in Automatic Speech Recognition (ASR) from Google, Microsoft, and Baidu have brought word error rates down to about 4.9%. However, speech-to-text transcription for conversations at work requires another level of innovation. First, ASRs are not aware of the context of our conversations and thus cannot interpret the keywords, names, and entities we refer to, so their accuracy suffers on exactly those words. Second, ASRs have no concept of who said what in the conversation, which is required to preserve its structure: after most conversations, we would like to know what a given person said about a given topic. And even short conversations contain paragraphs of text when transcribed, which are a pain to read through to find the few sentences of useful information.
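For reference, word error rate figures like the 4.9% quoted above are computed from word-level edit distance: substitutions, insertions, and deletions against a reference transcript, divided by the reference length. A small self-contained sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# Two substituted words out of four: WER = 0.5
wer = word_error_rate("bring the cable modem", "bring a cable modern")
```

The example also shows why context matters: “modem” misrecognized as “modern” is exactly the kind of domain-keyword error the paragraph above describes.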

Now that we’re in the era of Infinite Computing and Artificial Intelligence, it’s finally possible to address this by developing voice NLP infrastructure. We need something that not only takes everything down in our meetings, but a voice NLP that knows our specific context and what’s important to us. Something that lets us stay engaged in our meetings while keeping everyone on the same page afterward. And that lets us get more out of doing less. This is the future of voice in the enterprise we’re building at Synqq.

Future of Voice in Enterprise


We talk with our co-workers, customers, partners, and other business stakeholders every day. In every meeting, whether in person, over a web conference, or on the phone, we use voice as the primary medium of communication. Be it project meetings, sales calls, support calls, or interviews, precious information is contained in these conversations. Today, most of this information is lost, as it is neither feasible nor practical to capture voice conversations with existing technology. And even when we can capture them, we can’t search what is inside them, we can’t see who said what, and we can’t get to the key moments. Have you ever tried to find a single sentence in an hour-long recording? Voice conversations are like dark matter.

Why? First, it isn’t easy to handle voice in all situations, and the devices we use to capture voice determine the quality. For the right quality, voice needs to be handled differently in each kind of conversation. It’s easy to capture voice in a web conference, but how can it be done for in-person meetings or phone calls? Once voice is captured, how can we uniquely identify each speaker? How can we make sure each speaker’s voice is heard equally well?

Second, transforming voice into a visual, searchable stream of information is hard. The words we use in our conversations depend on the context, and on our business domain. For example, “checkin” in a software context means committing code to a repository, but in an airline context it is two words, “check-in”. Intents and entities differ based on the context, the user, and the domain. And each of us pauses differently between sentences, which makes it hard to segment speech into sentences.
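As one hypothetical illustration of the segmentation problem, a naive approach splits on silence gaps in the ASR’s word timings. The fixed threshold below is exactly what per-speaker variation breaks; a real system would adapt the threshold to each speaker.

```python
# Naive pause-based sentence segmentation over ASR word timings.
def segment_by_pause(words, pause_threshold=0.6):
    """words: list of (word, start_sec, end_sec) tuples.
    Starts a new sentence whenever the silence gap exceeds the threshold."""
    sentences, current = [], []
    prev_end = None
    for word, start, end in words:
        if prev_end is not None and start - prev_end > pause_threshold:
            sentences.append(" ".join(current))  # gap too long: close sentence
            current = []
        current.append(word)
        prev_end = end
    if current:
        sentences.append(" ".join(current))
    return sentences

# Toy timings: a 0.9 s pause separates the two utterances.
timed = [("check", 0.0, 0.3), ("in", 0.35, 0.5),
         ("the", 1.4, 1.5), ("code", 1.55, 1.9)]
```

A speaker who pauses longer mid-sentence than this threshold would have sentences wrongly split, which is why segmentation needs to be learned per speaker rather than hard-coded.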

At Synqq, we have pioneered new technology to handle voice in all situations and have made seminal advancements in Natural Language Processing that are going to change the way we handle voice conversations in the enterprise. The current era of Infinite Computing makes it affordable. Voice conversations will no longer be dark matter; they can become the searchable record of every enterprise.

Era of Infinite Computing


We’re living in the era of Infinite Computing with the creation of cloud platforms like Amazon, Google, and Microsoft. Our daily lives are transformed by services like Google, Facebook, and Uber. The phone revolution leverages cloud services to let us talk to anyone on the planet, get to any place, order anything, and be entertained without taking up all our storage. These services are at the forefront of what is to come in the era of Infinite Computing.

Why is it so important? As we enter the era of Infinite Computing, technology companies will be able to build sophisticated AI models to organize, classify, and predict things limited only by our imagination. For example, the 1.3 trillion pictures we took on our phones in the last year are automatically organized and classified by the social networks. The transportation industry has been transformed by services like Uber, Lyft, and Didi, and in the future we will see fleets of self-driving vehicles, feasible only in the era of Infinite Computing. Every industry will be transformed in this era.

What can the era of Infinite Computing do for our daily work? Can it save us from the daily chores of capturing all the information we need and organizing it for us? Can it enable us to recall the snippet that matters at any time, with a tap or by using voice? Can we gain superhuman memory for ourselves and our teams? We believe all of this is possible. The era of Infinite Computing enables us to develop personalized machine learning models to do the heavy lifting so we and our teams can achieve superhuman things.