Since you're an expert: what about these videos is hard? The things that jumped out at me are:
1 - the robot moving behind the table leg (i.e. you have to do depth recognition of objects in the scene)
2 - the user's hand interacting with the artificial elements in the scene. Some code had to recognize a hand and figure out which element it was touching.
What strikes you as the hard parts of those videos besides the real-time requirement?
Well, the second video is a mock-up. In the first video, notice that a) the observed things are floating in space and b) the camera motion is very smooth. This is how they sidestep the "layering problem" in the video. The desk leg occluding the robot is probably done using a depth sensor (sketch below).
These two things are non-trivial, but not particularly hard in themselves. Doing them at ultra-low latency, however, becomes quite a challenge. Doing anything at ultra-low latency is already hard, especially when what you're trying to do is run a deep neural net for entity recognition or gesture recognition.
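A minimal sketch of how the depth-sensor occlusion could work, assuming you have a sensed depth map registered to the camera image and a depth buffer for the rendered robot (all names and shapes here are illustrative assumptions, not anything confirmed by the videos):

    import numpy as np

    def composite_with_occlusion(real_rgb, real_depth, virt_rgb, virt_depth):
        """Per-pixel depth test: the virtual robot only shows where it is
        closer to the camera than the real scene, so the desk leg
        (smaller sensed depth) naturally occludes it.
        virt_depth is assumed to be +inf wherever the robot doesn't render."""
        visible = virt_depth < real_depth
        out = real_rgb.copy()
        out[visible] = virt_rgb[visible]
        return out

The compositing itself is trivial; the hard part is getting a clean depth map that is aligned to the camera image and delivered within the latency budget.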
Training an ANN is computationally intensive; using a trained ANN is not. No context switching for system calls, no memory management, just matrix math.
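To make the "just matrix math" point concrete: inference through a small fully-connected net is nothing but matrix products, adds, and a cheap nonlinearity (a toy sketch; the layer sizes and names are made up):

    import numpy as np

    def forward(x, weights, biases):
        """One inference pass: a fixed sequence of matrix products and ReLUs,
        with none of the training-time bookkeeping (gradients, optimizer state)."""
        for W, b in zip(weights[:-1], biases[:-1]):
            x = np.maximum(W @ x + b, 0.0)  # ReLU
        return weights[-1] @ x + biases[-1]  # linear output layer

    # e.g. a 2500-input net (a flattened 50x50 patch), one hidden layer of 256 units
    rng = np.random.default_rng(0)
    ws = [rng.standard_normal((256, 2500)), rng.standard_normal((10, 256))]
    bs = [np.zeros(256), np.zeros(10)]
    scores = forward(rng.standard_normal(2500), ws, bs)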
Well, first you need to know which image regions to feed to the ANN, and that can involve some segmentation and pre-recognition; otherwise you're going to evaluate the net at all feasible subwindows, and that's a LOT of matrix math for you (a back-of-envelope count below). A very big GPU can help, but GPUs have latency of their own, and FPGAs at such performance levels are inordinately expensive.
Done at scale, though, ASICs seem to be the sure-to-work way.
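A back-of-envelope count of why "all feasible subwindows" blows up, assuming a naive single-scale sliding window (all numbers illustrative):

    # Net evaluations for naive sliding-window detection on one frame.
    frame_w, frame_h = 1280, 720  # assumed camera resolution
    win, stride = 50, 4           # 50x50 window, evaluated every 4 px

    n = ((frame_w - win) // stride + 1) * ((frame_h - win) // stride + 1)
    print(n)  # 51744 net evaluations, per frame, per scale

Multiply that by several scales and 60+ frames per second and the total explodes, which is exactly why segmentation or region-of-interest detection up front matters.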
I'd be very surprised if a modern CPU couldn't handle the task, especially if you were clever about detecting regions of interest, predicting head movement, and maintaining the cache. But I'd also be surprised if they go to market with an x86 under the hood.
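As an aside on "predicting head movement": the simplest version is constant-velocity extrapolation of the tracked pose, so you render for where the head will be rather than where it was (a hypothetical sketch; a real system would at least handle orientation with quaternions):

    def predict_position(prev, curr, dt_sample, dt_ahead):
        """Constant-velocity extrapolation of head position to hide
        pipeline latency; orientation handling is omitted for brevity."""
        return [c + (c - p) / dt_sample * dt_ahead for p, c in zip(prev, curr)]

    # samples 10 ms apart, predicting 20 ms ahead of the current sample
    pos = predict_position([0.00, 0.0, 1.5], [0.01, 0.0, 1.5], 0.010, 0.020)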
I remember reading a while ago about how smart TVs were using ANNs for upscaling, so it has been done at scale. rimshot
(1) TVs don't have a strict latency requirement. I've heard latencies of 100 ms are common.
(2) Upscaling ANNs process a rather small image neighborhood, and the required processing power is on the order of O(r² * log r) in the window radius. If a minimally recognizable cat is 50x50 px while upscaling uses a generously large 16x16 window, that's already roughly a 14x difference in compute (quick arithmetic below).
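The quick arithmetic behind that figure, taking the O(r² * log r) cost model above at face value:

    from math import log2

    cost = lambda r: r * r * log2(r)  # the O(r^2 log r) model from above
    print(cost(50) / cost(16))        # ~13.8, i.e. roughly 14x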
Latencies of 100 ms may be common because TVs don't have strict latency requirements.
16x16 is a very small window; I have no idea what they're using for TVs, but 128 isn't uncommon in post-production ANN upscaling. Also consider that ANNs have not received anywhere near the level of optimization attention that compilers have, so there is a lot of potential slack to be taken up if real-time processing demands it.
1 - or have a premade 3D environment model and do accurate position tracking. Position tracking is a LOT easier to do in real time (a sketch below).
2 - a bullshit CGI "this is how we hope it would look if it were real" demo
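A sketch of the premade-model route from point 1: with a known model of the scene (here reduced to a point cloud) and a tracked camera pose, you render the model's depth and get occlusion essentially for free. Everything below is an illustrative assumption, not their confirmed pipeline:

    import numpy as np

    def model_depth(points, R, t, K, h, w):
        """Project a premade 3D model's points through the tracked camera
        pose (R, t) and intrinsics K, keeping the nearest depth per pixel.
        The result can be depth-tested against virtual content for occlusion."""
        cam = R @ points.T + t[:, None]  # world -> camera coordinates
        uvw = K @ cam
        u = (uvw[0] / uvw[2]).astype(int)
        v = (uvw[1] / uvw[2]).astype(int)
        z = uvw[2]
        depth = np.full((h, w), np.inf)
        ok = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        np.minimum.at(depth, (v[ok], u[ok]), z[ok])
        return depth

Per frame that is just a pose update plus a projection, which is why accurate position tracking against a premade model is so much easier to do in real time than live scene understanding.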
A few months ago their apparatus was one color only, stationary, and the size of a desk. Now all of a sudden it can be strapped to a camera and does colors? Color me sceptical :(