Hound voice search NLP demo [video] (youtube.com)
142 points by sandGorgon on June 3, 2015 | 55 comments


Google play link - https://play.google.com/store/apps/details?id=com.hound.andr...

apkmirror link - http://www.apkmirror.com/apk/soundhound-inc/hound/soundhound...

Edit: Currently, only US devices are supported. You can sideload using the apk link.

In either case, you need an invite code, which you can register for in the app or on the website - www.soundhound.com/


Google Play won't let me install it, saying it's incompatible with my Nexus 4 (on Lollipop).


That's why the GP also posted a direct APK link.

In any case, you need an invite code in order to actually use the app.


I just installed it yesterday on my Nexus 4 (on 5.1.1).


"What the fuck was that ??" were the exact words I spoke out loud after I saw this video.

Is this for real, no editing, no time compression ?

Does it really understand those questions or is it preprogrammed ?

Looks almost too incredible to be true. If it is, though, I'm in awe.


It worked for me with "what's the capital of the country where the Brandenburg Gate is", and "what's the population of the country with the Eiffel Tower", then "what's the current time there?"

Similarly, the restaurant and hotel demos from the other promo video worked fine. Also with follow-up questions like "and what about ones with free wifi?"


Does it hold context for subsequent questions? For example can you ask "what city is the Brandenburg Gate in?", "which country is that in?", "what is its population?"


Yes it does. That's one of the main features.


I am impressed with the speed. But it says "internal demo" for good reason, I'm assuming :)


They advertise it as "speech-to-meaning". I've been thinking about meaning in AI and how important it is for understanding and interacting with the world to be able to answer "What does this mean?" and "Does this make sense?"

Does anyone have any insight or references to recent research about how to model, train for, and represent meaning?


On limited network connection, so can't give you links, but:

* Academic keyword is "semantic parsing", main publications open access at http://aclweb.org/anthology

* Semantic parsing maps sentences into something actionable (a query is answered, the robot moves, etc). For a more abstract, non-task specific version, relevant keywords are "semantic role labelling", and recently "the AMR corpus". Systems stuff falls under "entailment" and "paraphrase".

* The theory driving this stuff is increasingly combinatory categorial grammar (CCG). Read Mark Steedman's book "The Syntactic Process" for an explanation of how to parse NL sentences into lambda terms. The premise of the theory is that we can make an efficient compiler that outputs the lambda terms directly, and that the syntactic structure is just the derivation structure: a trace of the algorithm.

At the end of the day though, we just want the output. There are lots of ways to compute the relevant representations.

* Much relevant work on CCG-based semantic parsing is happening at University of Washington, e.g. under Luke Zettlemoyer. With the new Allen Institute investing serious money, Seattle looks insanely good for NLP now.
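To make the idea concrete, here is a toy sketch of semantic parsing: a question is mapped to an executable meaning representation (a lambda, standing in for a lambda term) over a tiny knowledge base. All patterns, names, and facts here are invented for illustration, not taken from any of the systems above.

```python
# Toy semantic parser: maps a question pattern to an executable
# "logical form" (a Python lambda) over a tiny hand-built knowledge base.
# All patterns and KB facts are invented for illustration.

KB = {
    ("capital_of", "France"): "Paris",
    ("capital_of", "Germany"): "Berlin",
    ("located_in", "Brandenburg Gate"): "Germany",
}

def parse(question):
    """Return a callable meaning representation for the question."""
    q = question.rstrip("?")
    if q.startswith("What is the capital of "):
        country = q[len("What is the capital of "):]
        return lambda: KB[("capital_of", country)]
    if q.startswith("Which country is ") and q.endswith(" in"):
        entity = q[len("Which country is "):-len(" in")]
        return lambda: KB[("located_in", entity)]
    raise ValueError("no parse")

def answer(question):
    return parse(question)()  # execute the meaning representation

print(answer("What is the capital of France?"))         # Paris
print(answer("Which country is Brandenburg Gate in?"))  # Germany
```

A real semantic parser learns the mapping from data and composes lambda terms rather than matching string prefixes, but the pipeline shape (parse, then execute) is the same.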


Here's a chronological reading list of techniques:

Traditional NLP -> 2012, IBM Watson papers[1]

Watson is probably the pinnacle of "traditional"-style NLP (i.e., tokenization / lemmatization / part-of-speech tagging / framing / knowledge engineering / etc.)

Word2Vec -> 2013, Google paper[2]

Word2Vec kind of exploded everyone's mind for a while.

Subgraph Embedding -> 2015, Facebook AI Group [3][4]

[1] http://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=617771...

[2] http://arxiv.org/pdf/1301.3781.pdf

[3] https://research.facebook.com/publications/1473550739586509/...

[4] http://arxiv.org/abs/1502.05698


You might be interested to read this: http://arxiv.org/pdf/1402.3722v1.pdf

word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method. Yoav Goldberg and Omer Levy. arXiv 2014. [pdf]

The word2vec software of Tomas Mikolov and colleagues has gained a lot of traction lately, and provides state-of-the-art word embeddings. The learning models behind the software are described in two research papers. We found the description of the models in these papers to be somewhat cryptic and hard to follow. While the motivations and presentation may be obvious to the neural-networks language-modeling crowd, we had to struggle quite a bit to figure out the rationale behind the equations.


Yeah. Pulling a quote:

Why does this produce good word representations? Good question. We don’t really know. The objective above clearly tries to increase the quantity vw·vc for good word-context pairs, and decrease it for bad ones. Intuitively, this means that words that share many contexts will be similar to each other (note also that contexts sharing many words will also be similar to each other). This is, however, very hand-wavy.

I love an honest paper.
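The objective quoted above can be sketched in a few lines. This is a minimal illustration of the skip-gram negative-sampling loss for a single (word, context) pair with sampled negative contexts; the notation is assumed, and it is not Mikolov's actual implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(v_w, v_c, negatives):
    """Negative log-likelihood for one observed (word, context) pair
    plus sampled negative contexts, per the objective quoted above:
    raise v_w . v_c for the observed pair, lower it for sampled ones."""
    loss = -math.log(sigmoid(dot(v_w, v_c)))
    for v_neg in negatives:
        loss -= math.log(sigmoid(-dot(v_w, v_neg)))
    return loss

# A well-aligned positive pair yields a lower loss than a misaligned one.
good = sgns_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
bad = sgns_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
print(good < bad)  # True
```

Training by gradient descent on this loss is what pushes words that share contexts toward similar vectors, which is exactly the hand-wavy part the paper owns up to.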


Check this[1] out: it describes some techniques used to convert natural language to machine-readable descriptions.

[1] https://github.com/wiseman/energid_nlp


What is it that impresses people here?

The voice understanding is impressive, but obviously a video can show the best-case.

The factual questions themselves aren't hard. I've written a toy QA system, and it could handle the base case of those - they are just straight querying on Freebase/DBpedia.

The longer question ("the population, land area and capitals of Japan, India and China") was good.

If anyone has a working copy, I'd like to know which of these it can handle:

"Who is Bill Clinton?"

"Who is Bill Clinton's daughter?"

"Who did Bill Clinton's daughter marry?"

If it can get the third level then it's pretty good. From memory something like OpenEphyra can sometimes get that 3rd level, but usually fails.
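A question like the third one typically compiles into a chained graph query over something like DBpedia. Here is a sketch of how such a chain could be turned into SPARQL; the property names (`dbo:child`, `dbo:spouse`) are assumptions and may not match DBpedia's actual schema for these facts.

```python
def chain_query(entity, relations):
    """Build a DBpedia-style SPARQL query that follows a chain of
    relations from a starting entity. Property names are assumed."""
    lines = [
        "PREFIX dbr: <http://dbpedia.org/resource/>",
        "PREFIX dbo: <http://dbpedia.org/ontology/>",
        "SELECT ?x%d WHERE {" % len(relations),
    ]
    subject = "dbr:%s" % entity
    for i, rel in enumerate(relations, start=1):
        lines.append("  %s dbo:%s ?x%d ." % (subject, rel, i))
        subject = "?x%d" % i  # each hop binds a new variable
    lines.append("}")
    return "\n".join(lines)

# "Who did Bill Clinton's daughter marry?" -> two hops: child, spouse
print(chain_query("Bill_Clinton", ["child", "spouse"]))
```

The hard part is not emitting the query but deciding from the natural-language question that there are two hops and which predicates they map to.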

I thought the contextual querying (where one query led to another and it had to remember the previous details) was pretty good.


The first question returns Wikipedia info, the second two just say "Showing search results for: ...".


Thanks.

So that indicates that the natural language parsing isn't super advanced.

"Traditional"-style dependency parsing should be able to solve that, and some of the Facebook AI group's recent work indicates that neural-network based parsing should be able to handle it too.

My toy QA system can handle the second case, without even doing dependency parsing.

It's not clear to me if this is a problem or not. If their voice recognition is as good as it seems that's pretty significant technology.
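For what it's worth, the surface pattern in those questions can even be decomposed without a full dependency parse. A naive, hypothetical sketch (invented function, English-only, no real parser):

```python
def parse_possessive_chain(question):
    """Split "Who is X's Y's Z?" into a root entity and a relation
    chain. A naive illustration; a real system would use a
    dependency parse rather than string splitting."""
    q = question.rstrip("?")
    for prefix in ("Who is ", "Who did "):
        if q.startswith(prefix):
            q = q[len(prefix):]
            break
    parts = q.split("'s ")
    entity, relations = parts[0], [p.strip() for p in parts[1:]]
    return entity, relations

print(parse_possessive_chain("Who is Bill Clinton's daughter's husband?"))
# ('Bill Clinton', ['daughter', 'husband'])
```

Each relation in the chain then becomes one hop in a knowledge-base lookup, which is why the third-level question is a good depth test.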


Soundhound programmer here: We just haven't developed that domain of knowledge yet. The natural language part of things can absolutely understand that, but the underlying base of understanding hasn't been developed.

For example, we haven't done movie show times yet either. Doesn't mean the natural language parsing isn't advanced ;)


It seems like you've turned some things off. One of the demo videos had a question like "What is the population of the capital of the nation that has the Space Needle?" When I tried that question tonight, it returned a Wikipedia link for Washington DC, and the second answer was, strangely, Pyongyang. Asking directly about the population of Washington DC, it gave that answer. "Who is the President of the United States" returned a Wikipedia article about the (office of the) POTUS. I intended to follow that question with "How tall is he?" -- a question Google demonstrated at Google I/O 3 years ago.

"How is the weather tomorrow?" "How about Saturday?" "And Sunday?" That line of questioning worked surprisingly well.


I can tell you what's so impressive:

1.) It does not just answer isolated questions; it has a representation of the discourse. You see that when the guy in the video follows up on a previous query related to mortgages.

2.) It is incremental, meaning it starts processing the input while you are still talking. That's what allows it to be so fast. In principle, that also allows it to interrupt your sentence and ask for clarification.

3.) It has a good understanding of complex topics, which allows it to ask clarification questions and request additional details needed to answer a question.


Who is Bill Clinton's daughter's husband?

Who is Bill Clinton's daughter's husband's daughter?

Who is Bill Clinton's daughter's husband's daughter's grandfather?

Who is Bill Clinton's house?


Google Now does surprisingly well on all but the third question, though if you ask "Who is Chelsea Clinton's husband's father" it gets the right answer.

Siri passes every single question, verbatim, to WolframAlpha and it gets none of them right.


Just tested it. It fails all of the above and just returns a websearch.


@nl noob here. I'd like to try building a toy QA system (but in a limited domain, for example wildlife). Can you please provide some open source pointers? Is OpenEphyra the one I should look into? Thanks!


The GitHub version of OpenEphyra[1] is a good place to start. From memory you need to muck around a bit to get the Bing-based knowledge miner to work, but it is pretty good once you do.

[2] is a good architectural overview.

For limited domain knowledge base building, DeepDive[3] is state-of-the-art. They don't have a QA interface though.

[1] https://github.com/TScottJ/OpenEphyra

[2] https://mu.lti.cs.cmu.edu/trac/Ephyra/wiki/Docs/Architecture...

[3] http://deepdive.stanford.edu/


Maybe a good idea to study this then, could be useful to help drive your understanding on the problem domain.

http://sirius.clarity-lab.org


OK. Thanks. Aside: I watched the Sirius video and kept craving a demo :)


I would like to know if you can ask the following group of questions:

"What is Bill Clinton's age?"

"Who did he marry?"


OK, so I have an invite, and it fails all questions except "Who is Bill Clinton".

Having tried it somewhat extensively I'm quite disappointed - I don't mean to say it's bad but that video was definitely curated in terms of questions.


Google Now gets one and two and flunks the third. Interesting test.


Regular Google search gets them all correct.


Hmm... When your tech works this well, do you want to get acquired? This reminds me of Pied Piper in the way that it's just so much better than anything else out there. Applications for something like this stretch beyond a personal assistant; combining it with something like Watson or WolframAlpha could be very useful. I feel like I could actually use this to control my computer with confidence, for example.


They seem to have (plans for) a way[1] to integrate this into other applications. Looks like it would reduce a lot of the friction that exists with current-gen voice recognition systems.

[1] https://www.houndify.com/


this already exists --> api.ai


If this is indeed a non scripted and uncompressed demo this technology is pretty outstanding.

SoundHound[1] are behind this product.

[1] http://www.soundhound.com/houndify


Hope it's real. Reminds me of those cool expert system demos we saw in the eighties; they had all the answers ... for a selected narrow group of questions with very specific semantics.


Incoming acquisition in 3...2.....1


Exactly the words I was going to post... This looks like a great addition to the google app. Just the speech-to-text is impressive enough...


The integration is the hard part. Do you just throw out the whole existing OK Google engine?


I am very impressed. The voice recognition seems significantly faster and more accurate than Google's. The interactive back and forth with the mortgage calculations was the coolest part, I think. How are you able to access population and location data that quickly? It feels like it must be stored locally.


"significantly faster and more accurate than Google's" Yes it's impressive, but it could be because Google has a lot more users.


Google is in the business of making things go very fast. I don't think the primary bottleneck in Google's speech recognition is the server load, surely they can add more processing power if that were the issue.


Well, it is actually very demanding. ASR systems usually work at a speed of around 1 RT (RT = real-time factor, meaning recognizing 1 second of speech takes 1 second). Approximately 60-70% of this processing goes to acoustic scoring. The rest is search in a large sub-phonetic + word graph, plus feature extraction (which actually takes a tiny percentage).

Nowadays acoustic scoring is done by large deep neural networks, and they are quite computation intensive. One can use a GPU for that, and indeed it works really fast if you have all the speech beforehand (offline or batch mode). But for live recognition, GPUs lose their advantage quite a bit. That is probably why Google worked on quantization, vectorization, and other tricks to make DNNs fast on the CPU [1].

I am quite sure this creates immense pressure on their servers when tens of thousands of concurrent speech streams are queued for recognition. Perhaps today's GPUs are better in that respect and more work can be delegated to them to decrease the pressure. There was other interesting work that does almost all processing on the GPU [2].

In short, ASR systems are very, very processing hungry and a challenge for everyone, probably even for Google.

[1] http://static.googleusercontent.com/media/research.google.co...

[2] http://www.cs.cmu.edu/~ianlane/hydra/#&panel1-1
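To make the real-time-factor arithmetic above concrete, here is a back-of-the-envelope sketch. All numbers are illustrative assumptions, not measurements of any real system.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration. RTF = 1.0 means
    recognizing 1 second of speech takes 1 second."""
    return processing_seconds / audio_seconds

def cores_needed(concurrent_streams, rtf):
    """Rough lower bound on CPU cores to keep up with live streams,
    assuming each stream occupies a fraction `rtf` of one core."""
    return concurrent_streams * rtf

rtf = real_time_factor(0.5, 1.0)      # assume 0.5x real time per stream
acoustic_share = 0.65                  # assume ~60-70% is acoustic scoring
print(rtf)                             # 0.5
print(cores_needed(10_000, rtf))       # 5000.0 cores for 10k live streams
print(rtf * acoustic_share)            # 0.325 s of acoustic scoring per 1 s of audio
```

Even with these optimistic assumptions, tens of thousands of concurrent streams translate into thousands of busy cores, which is the "immense pressure" the comment describes.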


This isn't entirely accurate. Or rather, it is accurate as far as it goes, but doesn't tell the whole story.

Training a neural network uses a lot of computational power. From memory I think training the Android voice recognition took weeks on Google's GPU cluster ([1] talks about 95 hours for partial training, but I don't think that's the production system).

However, once the network is trained it doesn't use much power at all. The trained network can run on a mobile phone, and it doesn't even drain the battery much.

[1] http://static.googleusercontent.com/media/research.google.co...


I was talking about run-time operations, not training. Yes, training DNNs is much more time consuming, but my point is that using them is also not cheap. As mentioned, processing 1 second of speech in, let's say, 0.5 seconds is expensive, considering a web search is done in sub-millisecond time. Of course, I assume speech recognition is done server-side.


Very interesting links, thanks for sharing. I'm not sure if you're familiar with Android's speech recognition, but it seems to work offline as well. I wonder if they offload the computation to their servers when you're online and compute it locally when you're not. However, the latency seems to be on the same order of magnitude either way.


Yes, it works offline, and it is a marvel IMO. It seems like all the work is done on the phone when you are offline, and it performs close to its server counterpart. The latency is probably due to the nature of live ASR processing: the system cannot recognize word sequences immediately, just as humans can't.

There is a paper from Google on the issue:

http://static.googleusercontent.com/media/research.google.co...


The transcript of the speech shows up almost instantly, and if you look closely you can see it correcting initial mistakes. This suggests that the speech recognition is running on the phone. Might drain the battery.


Looks good. Bit frustrating that it's not available in Canada though.


Or the UK though. I wish Google Play had a "Download Anyway" button. Region blocking is retarded.


I don't believe it will be exactly this good in everyday usage, if for no other reason than speed.

Even if the parsing is done locally, broad data queries will have to hit the cloud.


How can I cut in line and get an invite? I find it a bit unfair that those who supported SoundHound didn't get any special treatment.


The voice recognition and speech parsing is incredible. Combining it with some calculations and interactivity is over the top!


Very impressive!



