Since you're an expert: what about these videos is hard? The things that jumped out at me are:
1 - the robot moving behind the table leg (i.e. you have to do depth recognition of objects in the scene)
2 - the user's hand interacting with the artificial elements in the scene. Some code had to recognize a hand and figure out which element it was touching.
What strikes you as the hard parts of those videos besides the real-time requirement?
Well, the second video is a mock-up. In the first video, notice that a) the observed things are floating in space and b) the camera motion is very smooth. This is how they sidestep the "layering problem" in the video. The desk leg occluding the robot is probably done using a depth sensor (sketch below).
These two things are non-trivial, but not particularly hard in themselves. Doing them at ultra-low latency, however, becomes quite a challenge. Doing anything at ultra-low latency is already hard, especially when what you're trying to do is run a deep neural net for entity recognition or gesture recognition.
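A minimal sketch of how the depth-sensor occlusion could work, assuming you have a sensed depth map registered to the camera image and a depth buffer for the rendered robot (all names and shapes here are illustrative assumptions, not anything confirmed by the videos):

    import numpy as np

    def composite_with_occlusion(real_rgb, real_depth, virt_rgb, virt_depth):
        """Per-pixel depth test: the virtual robot only shows where it is
        closer to the camera than the real scene, so the desk leg
        (smaller sensed depth) naturally occludes it.
        virt_depth is assumed to be +inf wherever the robot doesn't render."""
        visible = virt_depth < real_depth
        out = real_rgb.copy()
        out[visible] = virt_rgb[visible]
        return out

The compositing itself is trivial; the hard part is getting a clean depth map that is aligned to the camera image and delivered within the latency budget.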
Training an ANN is computationally intensive; using a trained ANN is not. No context switching for system calls, no memory management, just matrix math.
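To make the "just matrix math" point concrete: inference through a small fully-connected net is nothing but matrix products, adds, and a cheap nonlinearity (a toy sketch; the layer sizes and names are made up):

    import numpy as np

    def forward(x, weights, biases):
        """One inference pass: a fixed sequence of matrix products and ReLUs,
        with none of the training-time bookkeeping (gradients, optimizer state)."""
        for W, b in zip(weights[:-1], biases[:-1]):
            x = np.maximum(W @ x + b, 0.0)  # ReLU
        return weights[-1] @ x + biases[-1]  # linear output layer

    # e.g. a 2500-input net (a flattened 50x50 patch), one hidden layer of 256 units
    rng = np.random.default_rng(0)
    ws = [rng.standard_normal((256, 2500)), rng.standard_normal((10, 256))]
    bs = [np.zeros(256), np.zeros(10)]
    scores = forward(rng.standard_normal(2500), ws, bs)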
Well, first you need to know which image regions to feed to the ANN, and that can involve some segmentation and pre-recognition; otherwise you're going to evaluate the net at all feasible subwindows, and that's a LOT of matrix math for you (a back-of-envelope count below). A very big GPU can help, but GPUs have latency of their own, and FPGAs at such performance levels are inordinately expensive.
Done at scale, though, ASICs seem to be the sure-to-work way.
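A back-of-envelope count of why "all feasible subwindows" blows up, assuming a naive single-scale sliding window (all numbers illustrative):

    # Net evaluations for naive sliding-window detection on one frame.
    frame_w, frame_h = 1280, 720  # assumed camera resolution
    win, stride = 50, 4           # 50x50 window, evaluated every 4 px

    n = ((frame_w - win) // stride + 1) * ((frame_h - win) // stride + 1)
    print(n)  # 51744 net evaluations, per frame, per scale

Multiply that by several scales and 60+ frames per second and the total explodes, which is exactly why segmentation or region-of-interest detection up front matters.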
I'd be very surprised if a modern CPU couldn't handle the task, especially if you were clever about detecting regions of interest, predicting head movement, and maintaining the cache. But I'd also be surprised if they go to market with an x86 under the hood.
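As an aside on "predicting head movement": the simplest version is constant-velocity extrapolation of the tracked pose, so you render for where the head will be rather than where it was (a hypothetical sketch; a real system would at least handle orientation with quaternions):

    def predict_position(prev, curr, dt_sample, dt_ahead):
        """Constant-velocity extrapolation of head position to hide
        pipeline latency; orientation handling is omitted for brevity."""
        return [c + (c - p) / dt_sample * dt_ahead for p, c in zip(prev, curr)]

    # samples 10 ms apart, predicting 20 ms ahead of the current sample
    pos = predict_position([0.00, 0.0, 1.5], [0.01, 0.0, 1.5], 0.010, 0.020)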
I remember reading a while ago about how smart TVs were using ANNs for upscaling, so it has been done at scale. rimshot
(1) TVs don't have a strict latency requirement. I've heard latencies of 100 ms are common.
(2) Upscaling ANNs process a rather small image neighborhood, and the required processing power is on the order of O(r² * log r) in the window radius. If a minimally recognizable cat is 50x50 px while upscaling uses a generously large 16x16 window, that's already roughly a 14x difference in compute (quick arithmetic below).
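The quick arithmetic behind that figure, taking the O(r² * log r) cost model above at face value:

    from math import log2

    cost = lambda r: r * r * log2(r)  # the O(r^2 log r) model from above
    print(cost(50) / cost(16))        # ~13.8, i.e. roughly 14x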
Latencies of 100 ms may be common because TVs don't have strict latency requirements.
16x16 is a very small window; I have no idea what they're using for TVs, but 128 isn't uncommon in post-production ANN upscaling. Also consider that ANNs have not received anywhere near the level of optimization attention that compilers have, so there is a lot of potential slack to be taken up if real-time processing demands it.
1 - or have a premade 3D environment model and do accurate position tracking. Position tracking is a LOT easier to do in real time (a sketch below).
2 - a bullshit CGI "this is how we hope it would look if it were real" demo
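A sketch of the premade-model route from point 1: with a known model of the scene (here reduced to a point cloud) and a tracked camera pose, you render the model's depth and get occlusion essentially for free. Everything below is an illustrative assumption, not their confirmed pipeline:

    import numpy as np

    def model_depth(points, R, t, K, h, w):
        """Project a premade 3D model's points through the tracked camera
        pose (R, t) and intrinsics K, keeping the nearest depth per pixel.
        The result can be depth-tested against virtual content for occlusion."""
        cam = R @ points.T + t[:, None]  # world -> camera coordinates
        uvw = K @ cam
        u = (uvw[0] / uvw[2]).astype(int)
        v = (uvw[1] / uvw[2]).astype(int)
        z = uvw[2]
        depth = np.full((h, w), np.inf)
        ok = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        np.minimum.at(depth, (v[ok], u[ok]), z[ok])
        return depth

Per frame that is just a pose update plus a projection, which is why accurate position tracking against a premade model is so much easier to do in real time than live scene understanding.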
A few months ago their apparatus was one color only, stationary, and the size of a desk. Now all of a sudden it can be strapped to a camera and does colors? Color me sceptical :(