Deep convolutional networks, by design, are unable to integrate contextual and a...

MAXPOOL · on May 9, 2019

Learning multilayer convolutional representations of statistical features is roughly equal to taking the first few layers of the visual cortex and stacking them. Creating higher and higher stacks is not going to solve vision.

We are essentially building a frog with better and better visual perception in the hope that it could become a taxi driver. It will become a totally amazing super-frog with super-vision, but it's still just a frog with frog-like visual perception and limits. Using the equivalent of a pre-attentive feature recognition stage for complex object recognition can fake human-like object recognition when we force it, but it's the wrong approach. We get these catastrophic failures because we hit the limits.

Features seem to exist independently from one another in the early processing stages of human perception. They are not associated with a specific object either. Human perception is not gradually turning features into objects like we do in deep learning. How to properly distinguish feature integration from detection, and how to implement it, is an open question.

jacobush · on May 9, 2019

And people will be amazed at the totally super-froggy things these super-frogs can do, and understand even less why the super-frogs aren't taxiing already. :-)

AstralStorm · on May 9, 2019

They actually are, but self-driving cars use that subsystem as only one component of the whole, and most of the rest is not a super-frog.

pakl · on May 10, 2019

You are making a lot of incorrect statements about brains and vision. I would advise you to study some visual neuroscience.

> Learning multilayer convolutional representations of statistical features is roughly equal to taking the first few layers of the visual cortex and stacking them.

No, it isn't roughly equal to the first few layers of visual cortex. The first few layers of visual cortex have substantial feedback connectivity from higher areas, which affects the responses of even the most peripheral parts. (Citations in our arXiv preprint linked above.) Most of the brain has more feedback connectivity from elsewhere than feedforward ascending connectivity. This qualitatively affects activations.

>We are essentially building a frog...

I suspect frog vision is far more robust than anything we are "essentially building".

> Features seem to exist independently...

Please have a close look at some modern visual neuroscience. Or speak to a good, honest electrophysiologist.

sgt101 · on May 10, 2019

Which citations are you referring to? I would be grateful if you could please be specific.

ben_w · on May 9, 2019

What do you mean by “ambient”? If you hadn’t finished your comment with the words “our prototype” I would’ve assumed you meant things such as pictures of wolves having snow in them, and that snow being a clue that they are wolves, but I know that you can’t mean that.

gliop · on May 9, 2019

When you walk into a grocery store, you assume the fruit isn't plastic. When you walk into a furniture store, you do.

Why? Ambient context.

dec0dedab0de · on May 9, 2019

> When you walk into a grocery store, you assume the fruit isn't plastic. When you walk into a furniture store, you do.

> Why? Ambient context.

That was a really great way to get the point across, especially because I still sometimes think it's real, even when I know the context.

ben_w · on May 9, 2019

That’s an example, not an explanation. From only that example, I cannot differentiate “ambient context” from “common sense”, which is a phrase that means totally different things to everyone who I’ve seen use it.

heyitsguay · on May 9, 2019

Very much agreed with all this. I've been learning the same lessons working on more robust computer vision for biomedical imaging. I bet unsupervised predictive pretraining could be adapted to (static) 3D image volumes. The z axis replaces the t axis, and you predict the next 2D slice from previous ones. Hmm...
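The z-for-t substitution described above can be sketched as a data-preparation step. This is only an illustrative sketch, not anything from the paper or the commenter: the function name `slice_prediction_pairs`, the `context` parameter, and the random toy volume are all assumptions, and any predictive model could consume the resulting pairs.

```python
import numpy as np

def slice_prediction_pairs(volume, context=3):
    """Turn a (Z, H, W) volume into (previous slices -> next slice) pairs.

    The z axis plays the role that time plays in video prediction:
    inputs[i] holds slices i .. i+context-1, targets[i] is slice i+context.
    """
    volume = np.asarray(volume)
    if volume.ndim != 3:
        raise ValueError("expected a (Z, H, W) volume")
    depth = volume.shape[0]
    if depth <= context:
        raise ValueError("volume needs more slices than the context length")
    inputs = np.stack([volume[i:i + context] for i in range(depth - context)])
    targets = volume[context:]
    return inputs, targets

# Hypothetical toy volume: 10 slices of 4x4 pixels (random values, not real data).
rng = np.random.default_rng(0)
volume = rng.normal(size=(10, 4, 4))
inputs, targets = slice_prediction_pairs(volume, context=3)
print(inputs.shape, targets.shape)  # (7, 3, 4, 4) (7, 4, 4)
```

No labels are needed: the supervision signal is the next slice itself, which is what makes the scheme unsupervised.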

As an aside - from the paper it looks like you worked at Brain Corp a few years back. Any thoughts on them and what they're doing these days? I'll be looking for a job again soon and I see a lot of ads for them.

pjc50 · on May 9, 2019

> classifying pixel patterns in isolation isn't sufficient for robust visual perception

This seems to be only a very small step forward from Minsky's negative result about "perceptrons".

AstralStorm · on May 9, 2019

That's because DNNs are only a small step removed from multilayer perceptrons as well. (A few more layers, a tiny bit of internal structure, more advanced nonlinear activation functions, a better training schedule, and much more training data.)

They're not even close to the structural or training-algorithm complexity of natural neural networks yet.

_0ffh · on May 9, 2019

That result was not about multilayer perceptrons, but perceptrons. But, whatever.
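The single-layer versus multilayer distinction can be made concrete with XOR, the example most commonly associated with the perceptron limitation. The sketch below is illustrative only: the weight grid is an arbitrary assumption (the general impossibility follows from XOR not being linearly separable), and the two-layer weights are hand-picked rather than learned.

```python
import itertools

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def step(x):
    """Threshold activation used by classic perceptrons."""
    return 1 if x > 0 else 0

def single_unit_matches_xor(w1, w2, b):
    """Check whether one threshold unit reproduces the full XOR table."""
    return all(step(w1 * x1 + w2 * x2 + b) == y for (x1, x2), y in XOR.items())

# Brute-force a coarse grid of weights and biases (-4.0 to 4.0 in steps of 0.5).
grid = [i / 2 for i in range(-8, 9)]
single_layer_hits = sum(
    single_unit_matches_xor(w1, w2, b)
    for w1, w2, b in itertools.product(grid, repeat=3)
)

def two_layer(x1, x2):
    """Hand-built two-layer network: XOR = OR and not AND."""
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return step(h_or - h_and - 0.5)

two_layer_ok = all(two_layer(x1, x2) == y for (x1, x2), y in XOR.items())
print(single_layer_hits, two_layer_ok)  # 0 True
```

One hidden layer is enough to remove this particular limitation, which is why the original result applies to single-layer perceptrons only.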

AstralStorm · on May 9, 2019

Multilayer perceptrons share many of the same problems single-layer perceptrons have, such as trouble with high-level structure and generating weird non-robust features. They are much more nonlinear though, and thus somewhat more powerful. (I'm imprecise here, but it is easy to find papers on this ancient tech from before the AI winter.)

A DNN is essentially one of these with more layers than the typical 4 of an MLP, because we figured out a way to propagate error and training gradients. (Plus a few important but interesting details.) They are not really qualitatively different according to the math they use. The main difference is the use of gated or non-differentiable activation functions, with various ways to compute approximate gradients when faced with this feature. Convolutional nets in particular are similar to MLPs.
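One way to see the similarity between convolutional nets and MLPs: a convolution layer is a dense layer whose weight matrix is constrained to be sparse with shared weights. The sketch below is a minimal 1D illustration under that framing, not the commenter's code. The helper name `conv_as_dense_matrix` and the sample values are assumptions, and "convolution" here means the cross-correlation that deep learning libraries usually compute.

```python
import numpy as np

def conv_as_dense_matrix(kernel, input_len):
    """Build the dense weight matrix equivalent to a 'valid' 1D cross-correlation.

    Every row holds the same kernel, shifted by one position; all other
    entries are zero. That is weight sharing plus local connectivity.
    """
    k = len(kernel)
    out_len = input_len - k + 1
    weights = np.zeros((out_len, input_len))
    for row in range(out_len):
        weights[row, row:row + k] = kernel
    return weights

# Hypothetical input signal and kernel, chosen only for illustration.
x = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 1.0])
kernel = np.array([0.5, -1.0, 2.0])

weights = conv_as_dense_matrix(kernel, len(x))
dense_out = weights @ x                           # plain MLP-style linear layer
conv_out = np.correlate(x, kernel, mode="valid")  # sliding-window version

print(np.allclose(dense_out, conv_out))  # True
print(weights.size, kernel.size)         # 24 3
```

The dense matrix has 24 entries but only 3 free parameters, so the convolutional layer is a restricted special case of the fully connected one rather than a different kind of computation.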

marcosdumay · on May 9, 2019

It seems that we are finally at the point where throwing more hardware/data at a dumb algorithm won't give you much better results. This means that there will be space for smarts in AI again. And this is happening at the same time that throwing more money at general-purpose hardware has stopped generating good results too, with great opportunities for synergy.

> The field will advance when/if practitioners recognize that classifying pixel patterns in isolation isn't sufficient for robust visual perception

But this, well, it is very clearly sufficient, and we have well-accepted results showing this. It just won't work in practice. That probably means the change will be full of fighting while the old ways still work, and lots of failures and unexpected successes.

p1esk · on May 10, 2019

> we are finally at the point where throwing more hardware/data at a dumb algorithm won't give you much better results

The recent success of GPT-2 indicates otherwise.