It worked for me with "what's the capital of the country where the Brandenburg Gate is", and "what's the population of the country with the Eiffel Tower", then "what's the current time there?"
Similarly, the restaurant and hotel demos from the other promo video worked fine. Also with follow-up questions like "and what about ones with free wifi?"
Does it hold context for subsequent questions? For example can you ask "what city is the Brandenburg Gate in?", "which country is that in?", "what is its population?"
They advertise it as "speech-to-meaning". I've been thinking about meaning in AI and how important it is for understanding and interacting with the world to be able to answer "What does this mean?" and "Does this make sense?"
Does anyone have any insight or references to recent research about how to model, train for, and represent meaning?
* Semantic parsing maps sentences into something actionable (a query is answered, the robot moves, etc.). For a more abstract, non-task-specific version, relevant keywords are "semantic role labelling" and, recently, "the AMR corpus". Systems stuff falls under "entailment" and "paraphrase".
* The theory driving this stuff is increasingly combinatory categorial grammar (CCG). Read Mark Steedman's book "The Syntactic Process" for an explanation of how to parse NL sentences into lambda terms. The premise of the theory is that we can make an efficient compiler that outputs the lambda terms directly, and that the syntactic structure is just the derivation structure: a trace of the algorithm. (See the sketch after this list.)
At the end of the day, though, we just want the output. There are lots of ways to compute the relevant representations.
* Much relevant work on CCG-based semantic parsing is happening at University of Washington, e.g. under Luke Zettlemoyer. With the new Allen Institute investing serious money, Seattle looks insanely good for NLP now.
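To make the lambda-term idea concrete, here's a toy Python sketch of a derivation for "capital of Germany". The lexicon and the single application rule are invented for illustration and are a drastic simplification of Steedman's formalism:

```python
# Toy illustration of parsing a phrase into a lambda term, in the
# spirit of CCG-style semantic parsing. The lexicon and combination
# rule here are made up for the example, not Steedman's actual system.

# Each lexical entry pairs a word with a constant or a function.
lexicon = {
    "Germany": "germany",                      # NP: a constant
    "of":      lambda x: x,                    # PP/NP: identity here
    "capital": lambda x: f"capital_of({x})",   # N/PP: a function
}

def apply(fn, arg):
    """Forward application (X/Y Y => X): apply a functor to its argument."""
    return fn(arg) if callable(fn) else fn

# Derivation for "capital of Germany":
#   of Germany          => germany
#   capital (germany)   => capital_of(germany)
pp = apply(lexicon["of"], lexicon["Germany"])
term = apply(lexicon["capital"], pp)
print(term)  # capital_of(germany)
```

The point is that the derivation itself is the "syntax": the output lambda term is what a downstream QA system would actually execute.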
word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method.
Yoav Goldberg and Omer Levy. arXiv 2014. [pdf]
The word2vec software of Tomas Mikolov and colleagues has gained a lot of traction lately, and provides state-of-the-art word embeddings. The learning models behind the software are described in two research papers. We found the description of the models in these papers to be somewhat cryptic and hard to follow. While the motivations and presentation may be obvious to the neural-networks language-modeling crowd, we had to struggle quite a bit to figure out the rationale behind the equations.
Why does this produce good word representations? Good question. We don't really know. The objective above clearly tries to increase the quantity v_w·v_c for good word-context pairs, and decrease it for bad ones. Intuitively, this means that words that share many contexts will be similar to each other (note also that contexts sharing many words will also be similar to each other). This is, however, very hand-wavy.
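For reference, the objective being described is the skip-gram negative-sampling score for a single (word, context) pair. A minimal numpy sketch, with vector dimensions and the number of negative samples chosen arbitrarily for the demo:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_pair_objective(v_w, v_c, negative_contexts):
    """Negative-sampling objective for one (word, context) pair, as in
    Goldberg & Levy's write-up: push v_w . v_c up for the observed
    context and down for k randomly sampled negative contexts."""
    score = np.log(sigmoid(v_w @ v_c))
    for v_neg in negative_contexts:
        score += np.log(sigmoid(-(v_w @ v_neg)))
    return score  # training maximizes this over the corpus

# Tiny demo with random vectors (dimension 50, k = 5 negatives).
rng = np.random.default_rng(0)
v_w = rng.normal(size=50)
v_c = rng.normal(size=50)
negs = [rng.normal(size=50) for _ in range(5)]
print(sgns_pair_objective(v_w, v_c, negs))
```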
The voice understanding is impressive, but obviously a video can show the best-case.
The factual questions themselves aren't hard. I've written a toy QA system, and it could handle the base case of those - they're just straight queries against Freebase/DBpedia.
The longer question ("the population, land area and capitals of Japan, India and China") was good.
If anyone has a working copy, I'd like to know which of these it can handle:
"Who is Bill Clinton?"
"Who is Bill Clinton's daughter?"
"Who did Bill Clinton's daughter marry?"
If it can get the third level then it's pretty good. From memory, something like OpenEphyra can sometimes get that third level, but usually fails.
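For concreteness, the third question reduces to a two-hop graph query once parsed. A rough sketch against DBpedia's public endpoint; the predicate names (dbo:child, dbo:spouse) are my guesses at the relevant properties and may differ across DBpedia releases:

```python
# Roughly what "Who did Bill Clinton's daughter marry?" compiles down
# to once the question is parsed: two chained property lookups.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?spouse WHERE {
        dbr:Bill_Clinton dbo:child ?daughter .   # hop 1: the daughter
        ?daughter dbo:spouse ?spouse .           # hop 2: who she married
    }
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["spouse"]["value"])
```

The hard part isn't executing this query; it's reliably producing it from the natural-language question, which is where most systems fall over at the second hop.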
I thought the contextual querying (where one query led to another and it had to remember the previous details) was pretty good.
So that indicates that the natural language parsing isn't super advanced.
"Traditional"-style dependency parsing should be able to solve that, and some of the Facebook AI group's recent work indicates that neural-network based parsing should be able to handle it too.
My toy QA system can handle the second case, without even doing dependency parsing.
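To illustrate what dependency parsing buys you here, a quick sketch using spaCy (assuming an English model is installed; exact parses can vary by model version):

```python
# Decomposing the third question via its dependency parse.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Who did Bill Clinton's daughter marry?")

for tok in doc:
    print(tok.text, tok.dep_, tok.head.text)

# The parse makes the two hops explicit: "daughter" is the subject of
# "marry", and "Clinton" attaches to "daughter" as a possessive. A QA
# system can then resolve (Bill Clinton, child, ?x) first, and
# (?x, spouse, ?y) second.
```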
It's not clear to me if this is a problem or not. If their voice recognition is as good as it seems that's pretty significant technology.
Soundhound programmer here: We just haven't developed that domain of knowledge yet. The natural language part of things can absolutely understand that, but the underlying base of understanding hasn't been developed.
For example, we haven't done movie show times yet either. Doesn't mean the natural language parsing isn't advanced ;)
It seems like you've turned some things off. One of the demo videos had a question like "What is the population of the capital of the state that has the Space Needle?" When I tried that question tonight, it returned a Wikipedia link for Washington DC, and the second answer was, strangely, Pyongyang. Asking directly about the population of Washington DC, it gave that answer. "Who is the President of the United States?" returned a Wikipedia article about the (office of the) POTUS. I intended to follow that question with "How tall is he?" -- a question Google demonstrated at Google I/O 3 years ago.
"How is the weather tomorrow?" "How about Saturday?" "And Sunday?" That line of questioning worked surprisingly well.
I can tell you what's so impressive: 1.) It does not just answer isolated questions; it has a representation of the discourse. You see that when the guy in the video follows up on a previous query related to mortgages. 2.) It is incremental, meaning it starts processing the input while you are still talking. That's what allows it to be so fast. In principle, that also allows it to interrupt your sentence and ask for clarification. 3.) It has a good understanding of complex topics, which allows it to ask clarifying questions and request the additional details needed to answer a question.
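On point 2, here's a schematic of what incremental processing means. The recognizer below is a stand-in that just accumulates text; a real system would run acoustic scoring and graph search on each chunk:

```python
class EchoRecognizer:
    """Stand-in recognizer: just accumulates text chunks. A real
    system would do acoustic scoring and search per chunk."""
    def __init__(self):
        self.words = []

    def feed(self, chunk):
        self.words.append(chunk)
        return " ".join(self.words)   # revisable partial hypothesis

    def finalize(self):
        return " ".join(self.words)   # committed final transcript

def transcribe_stream(audio_chunks, recognizer):
    """Yield a (possibly revised) hypothesis after every chunk,
    instead of waiting for the utterance to end."""
    for chunk in audio_chunks:
        yield recognizer.feed(chunk)  # caller can already act on this
    yield recognizer.finalize()

for partial in transcribe_stream(["what's", "the", "weather"], EchoRecognizer()):
    print(partial)
```

Because partial hypotheses are available mid-utterance, downstream components (query planning, clarification prompts) can start before the speaker finishes, which is where the perceived speed comes from.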
Google Now does surprisingly well on all but the third question, though if you ask "Who is Chelsea Clinton's husband's father" it gets the right answer.
Siri passes every single question, verbatim, to WolframAlpha and it gets none of them right.
@nl noob here. I'd like to try building a toy QA system (but in a limited domain, for example wildlife). Can you please provide some open-source pointers? Is OpenEphyra the one I should look into? Thanks!
The GitHub version of OpenEphyra[1] is a good place to start. From memory you need to muck around a bit to get the Bing-based knowledge miner to work, but it is pretty good once you do.
[2] is a good architectural overview.
For limited domain knowledge base building, DeepDive[3] is state-of-the-art. They don't have a QA interface though.
OK, so I have an invite, and it fails all the questions except "Who is Bill Clinton?"
Having tried it somewhat extensively, I'm quite disappointed - I don't mean to say it's bad, but that video was definitely curated in terms of questions.
Hmm... When your tech works this well, do you want to get acquired? This reminds me of Pied Piper in the way that it's just so much better than anything else out there. Applications for something like this stretch beyond a personal assistant; combining it with something like Watson or WolframAlpha could be very useful. I feel like I could actually use this to control my computer with confidence, for example.
They seem to have (plans for) a way[1] to integrate this into other applications. Looks like it would reduce a lot of the friction that exists with current-gen voice recognition systems.
Hope it's real. Reminds me of those cool expert-system demos we saw in the eighties; they had all the answers ... for a selected narrow group of questions with very specific semantics.
I am very impressed. The voice recognition seems significantly faster and more accurate than Google's. The interactive back and forth with the mortgage calculations was the coolest part, I think. How are you able to access population and location data that quickly? It feels like it must be stored locally.
Google is in the business of making things go very fast. I don't think the primary bottleneck in Google's speech recognition is the server load, surely they can add more processing power if that were the issue.
Well, it is actually very demanding. ASR systems usually work at a speed of 1 RT (RT = real-time factor, meaning recognizing 1 second of speech in 1 second). Approximately 60-70% of this processing goes to acoustic scoring. The rest is search in a large sub-phonetic + word graph, plus feature extraction (feature extraction takes a tiny percentage, actually).
Nowadays acoustic scoring is done by large deep neural networks, and they are quite computation-intensive. One can use a GPU for that, and indeed it works really fast if you have all the speech beforehand (off-line or batch mode). But for live recognition, GPUs lose their advantage quite a bit. That is probably why Google worked on quantization, vectorization and other tricks to make DNNs fast on the CPU [1].
I am quite sure this creates immense pressure on their servers when tens of thousands of concurrent speech streams are queued for recognition. Perhaps today's GPUs are better in that respect and more work can be delegated to them to decrease the pressure. There has also been interesting work that does almost all the processing on the GPU [2].
In short, ASR systems are very, very processing-hungry and a challenge for everyone, probably even for Google.
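To put rough numbers on that, a back-of-envelope sketch; the 60-70% acoustic-scoring share is the figure from the comment above, and the per-stream numbers are illustrative only:

```python
# Back-of-envelope real-time-factor (RTF) budget for live ASR.
audio_seconds = 1.0
rtf = 1.0                      # 1 second of audio takes 1 second to decode
compute_seconds = audio_seconds * rtf

acoustic_share = 0.65          # ~60-70% of decoding is acoustic scoring
search_share = 1.0 - acoustic_share

print(f"acoustic scoring: {compute_seconds * acoustic_share:.2f}s")
print(f"graph search etc: {compute_seconds * search_share:.2f}s")

# At RTF = 1, one CPU core is saturated by a single live stream, so
# 10,000 concurrent streams need on the order of 10,000 cores.
concurrent_streams = 10_000
print(f"cores needed at RTF 1: ~{int(concurrent_streams * rtf)}")
```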
This isn't entirely accurate. Or rather, it is accurate as far as it goes, but doesn't tell the whole story.
Training a neural network uses a lot of computational power. From memory, training the Android voice recognition took weeks on Google's GPU cluster ([1] talks about 95 hours for partial training, but I don't think that's the production system).
However, once the network is trained it doesn't use much power at all. The trained network can run on a mobile phone, and it doesn't even drain the battery much.
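As a rough sanity check on how cheap inference is relative to training, here's a back-of-envelope FLOP count for the forward pass of a DNN acoustic model; the layer sizes are invented but in the ballpark of mid-2010s ASR systems:

```python
# FLOPs for one forward pass of a hypothetical acoustic-model DNN.
layers = [440, 2048, 2048, 2048, 2048, 9000]  # input, hidden..., outputs

# Each fully connected layer costs ~2 * in * out FLOPs (multiply + add).
flops_per_frame = sum(2 * a * b for a, b in zip(layers, layers[1:]))
frames_per_second = 100  # 10 ms frame shift

print(f"~{flops_per_frame * frames_per_second / 1e9:.1f} GFLOP/s per stream")
# A few GFLOP/s is feasible on a phone CPU, while training multiplies
# this by backprop, many epochs, and a huge corpus.
```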
I was talking about run-time operations, not training. Yes, training DNNs is much more time-consuming, but my point is that using them is also not cheap. As mentioned, processing 1 second of speech in, let's say, 0.5 seconds is expensive, considering a web search is done in sub-millisecond time. Of course, I assume speech recognition is done server-side.
Very interesting links, thanks for sharing. I'm not sure if you're familiar with Android's speech recognition, but it seems to work offline as well. I wonder if they offload the computation to their servers when you're online and compute it locally when you're not. However the latency seems to be on the same order of magnitude.
Yes, it works offline, and it is a marvel IMO. It seems like all the work is done on the phone when you are offline, and it performs close to its server counterpart. The latency is probably due to the nature of live ASR processing: the system cannot recognize word sequences immediately, just as humans can't.
The transcript of the speech shows up almost instantly, and if you look closely you can see it corrects initial mistakes. This suggests that the speech recognition is running on the phone. Might drain the battery.
apkmirror link - http://www.apkmirror.com/apk/soundhound-inc/hound/soundhound...
Edit: Currently, only US devices are supported. You can sideload using the apk link.
In either case, you need an invite code, which you can register for in the app or on the website - www.soundhound.com/