Deep convolutional networks, by design, are unable to integrate contextual and ambient information present in an image (or in preceding images) to inform how they interpret the local features they use. So it's no surprise they struggle with unconstrained images, where ambient context varies.
It's intriguing how much focus there is on adversarial examples. You don't need adversarial examples to make a deep network fail - in a sense that's overkill. Just point the poor deep network at a sequence of images from the real world -- images from a self-driving car, security camera, or webcam. You'll see it make spontaneous errors, no matter how much training data you gave it.
The field will advance when/if practitioners recognize that classifying pixel patterns in isolation isn't sufficient for robust visual perception, and adopt alternative neural network designs that can interpret what they perceive in light of (no pun intended) context and physical expectations.
Learning multilayer convolutional representations of statistical features is roughly equal to taking the first few layers of visual cortex and stacking them. Creating higher and higher stacks is not going to solve vision.
We are essentially building a frog with better and better visual perception in the hope that it could become a taxi driver. It will become a totally amazing super-frog with super-vision, but it's still just a frog with frog-like visual perception and limits. Using the equivalent of a pre-attentive feature recognition stage for complex object recognition can fake human-like object recognition when we force it, but it's the wrong approach. We get these catastrophic failures because we hit the limits.
Features seem to exist independently from one another in the early processing stages of human perception. They are not associated with a specific object either. Human perception is not gradually turning features into objects like we do in deep learning. How to properly distinguish feature integration from detection, and how to do it, is an open question.
And people will marvel at the totally super-froggy things these super-frogs can do, and understand even less why the super-frogs aren't taxiing already. :-)
You are making a lot of incorrect statements about brains and vision. I would advise you to study some visual neuroscience.
> Learning multilayer convolutional representations of statistical features is roughly equal to taking the first few layers of visual cortex and stacking them.
No, it isn't roughly equal to the first few layers of visual cortex. The first few layers of visual cortex have substantial feedback connectivity from higher areas, which affects the responses of even the most peripheral parts. (Citations in our arXiv preprint linked above.) Most of the brain has more feedback connectivity from elsewhere than feedforward ascending connectivity. This qualitatively affects activations.
>We are essentially building a frog...
I suspect frog vision is far more robust than anything we are "essentially building".
> Features seem to exist independently...
Please have a close look at some modern visual neuroscience. Or speak to a good, honest electrophysiologist.
What do you mean by “ambient”? If you hadn’t finished your comment with the words “our prototype” I would’ve assumed you meant things such as pictures of wolves having snow in them, and that snow being a clue that they are wolves, but I know that you can’t mean that.
That’s an example, not an explanation. From only that example, I cannot differentiate “ambient context” from “common sense”, which is a phrase that means totally different things to everyone who I’ve seen use it.
Very much agreed with all this. I've been learning the same lessons working on more robust computer vision for biomedical imaging. I bet unsupervised predictive pretraining could be adapted to (static) 3D image volumes. The z axis replaces the t axis, and you predict the next 2D slice from previous ones. Hmm...
As an aside - from the paper it looks like you worked at Brain Corp a few years back. Any thoughts on them and what they're doing these days? I'll be looking for a job again soon and I see a lot of ads for them.
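To make the slice-prediction idea concrete, here is a minimal sketch of how the training pairs could be built, assuming the volume is a NumPy array shaped (Z, H, W). The function name `slice_prediction_pairs` and the `context` parameter are hypothetical, and this only covers the data preparation, not the predictive model itself.

```python
import numpy as np

def slice_prediction_pairs(volume, context=3):
    """Turn a (Z, H, W) volume into (previous slices, next slice) pairs.

    Hypothetical helper: z plays the role that t plays in video prediction.
    """
    volume = np.asarray(volume)
    if volume.ndim != 3:
        raise ValueError("expected a (Z, H, W) volume")
    depth = volume.shape[0]
    if depth <= context:
        raise ValueError("volume needs more than `context` slices")
    # inputs[i] holds slices i .. i+context-1; targets[i] is slice i+context.
    inputs = np.stack([volume[i:i + context] for i in range(depth - context)])
    targets = volume[context:]
    return inputs, targets

# Random data stands in for a real scan here.
volume = np.random.default_rng(0).random((10, 64, 64))
inputs, targets = slice_prediction_pairs(volume, context=3)
print(inputs.shape, targets.shape)  # (7, 3, 64, 64) (7, 64, 64)
```

Any model that maps `inputs[i]` to `targets[i]` could then be pretrained without labels; whether that transfers well to downstream tasks is an open question.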
That's because DNNs are only a small step removed from multilayer perceptrons as well. (A few more layers, a tiny bit of internal structure, more advanced nonlinear activation functions, a better training schedule, and much more training data.)
They're not even close to the structural or training-algorithm complexity of natural neural networks yet.
Multilayer perceptrons share many of the same problems single-layer perceptrons have, such as trouble with high-level structure and generating weird non-robust features. They are much more nonlinear, though, and thus somewhat more powerful. (I'm being imprecise here, but it is easy to find papers on this ancient tech from before the AI winter.)
A DNN is essentially one of these with more layers than the typical 4 of an MLP, because we figured out a way to propagate error and training gradients. (Plus a few important but interesting details.)
They are not really qualitatively different in the math they use... The main difference is the use of gated or non-differentiable activation functions, with various ways to compute approximate gradients when faced with this feature. Convolutional nets especially are similar to MLPs.
It seems that we are finally at the point where throwing more hardware/data at a dumb algorithm won't give you much better results. This means that there will be space for smarts in AI again. And this is happening at the same time that throwing more money at general-purpose hardware is no longer generating good results either, with great opportunities for synergy.
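One way to see the conv-net/MLP similarity: a 1D convolution layer computes the same thing as a dense layer whose weight matrix is forced to be banded, with the same kernel repeated on every row. The sketch below is only an illustration in NumPy; the function names are made up for this example, and it ignores bias, multiple channels, and nonlinearities.

```python
import numpy as np

def conv1d_valid(x, kernel):
    """'Valid' 1D cross-correlation, as conv layers compute it (no kernel flip)."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def conv_as_dense_matrix(kernel, input_len):
    """Dense-layer weight matrix that reproduces conv1d_valid on inputs of input_len."""
    k = len(kernel)
    out_len = input_len - k + 1
    weights = np.zeros((out_len, input_len))
    for row in range(out_len):
        # Same kernel on every row, shifted one step: weight sharing + locality.
        weights[row, row:row + k] = kernel
    return weights

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])
W = conv_as_dense_matrix(kernel, len(x))

# The dense matrix-vector product matches the convolution.
assert np.allclose(W @ x, conv1d_valid(x, kernel))
```

So the convolutional layer is a dense layer with most weights pinned to zero and the rest tied together; the math is otherwise the same.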
> The field will advance when/if practitioners recognize that classifying pixel patterns in isolation isn't sufficient for robust visual perception
But this, well, it is very clearly sufficient, and we have well-accepted results showing this. It just won't work in practice. That probably means the change will be full of fighting while the old ways still work, and lots of failures and unexpected successes.
It worked for our prototype.[0]
[0] https://arxiv.org/abs/1607.06854