Have Microsoft broken all the boundaries? As I was perusing the net the other day, I came across a video on the BBC website showing an application of Project Natal that Microsoft had been demonstrating at E3. The introduction to the video claimed that this was something pretty special and, I’ll have to be honest, at first glance it certainly did seem a little too good to be true. I thought it would be interesting to take a look at the video and analyse it a little. For those of you that haven’t seen it, or indeed can’t, I’ll give a short text description here.
The video starts with a woman walking up to a screen and greeting a small child who is playing on a swing. He walks over and greets her back. They then enter a discussion in which the woman, Claire, questions the boy, Milo, as to whether he has done his homework. The boy’s demeanour changes: he puts his head down and starts walking with shoulders hunched, not looking at Claire at all. The narrator points this out and describes a technology whereby Milo can recognise Claire’s emotions and vice versa. Interesting. As we continue, Claire offers to help Milo with his homework. He throws her a pair of goggles, which obviously can’t pass through the screen into the real world, yet Claire stoops to pick up the virtual goggles. He tells her to put the glasses on, and she uses her hands to make goggle-like shapes in front of her eyes. Milo acknowledges this, and the camera then shifts to look into a pool of water, where Claire is now able to interact, by waving her hands in front of the screen, to make small waves in the water. After this she decides to help Milo and draws him an orange fish on a piece of paper. She shows this to a device above the screen, and Milo reaches up and grabs what appears to be a copy of the drawing from above the screen. We hear him exclaim that it is orange shortly before the video finishes.
Clever stuff, I hear you cry. Well, yes and no; I feel that in some sense the video may be misrepresenting what is actually going on in front of our eyes. Now don’t get me wrong, the Natal framework certainly looks impressive, but I wanted to take a look at current technologies and see whether there is actually anything new in this at all. First of all we have facial recognition: Milo clearly recognises Claire and responds to her by name. Though facial recognition hasn’t been perfected, many machines are able to tell the difference between several faces, and head and face tracking are things that even digital cameras can do nowadays, so this doesn’t surprise me. Let’s also consider the market for this framework: it’s largely going to be used for home entertainment. Because of that, the number of faces it has to differentiate between is likely to be small, often consisting of two adults of differing gender along with two children separated in age by a few years. I’ll admit I’m stereotyping a little here, but it’s nothing to be concerned about; any family is going to have similar differences between its various members.
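To give a flavour of why a handful of household faces is a far easier problem than recognising faces in general, here’s a minimal Python sketch of nearest-neighbour matching against a few enrolled profiles. The feature vectors, names and threshold are entirely my own invention for illustration; a real system would derive such vectors from the camera.

```python
# Nearest-neighbour matching of a detected face against a small set of
# enrolled household profiles. The feature vectors here are invented
# for illustration; a real system would extract them from camera images.

def distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def identify(features, profiles, threshold=1.0):
    """Return the name of the closest enrolled profile, or None if no
    profile is close enough (an unknown visitor)."""
    best_name, best_dist = None, threshold
    for name, enrolled in profiles.items():
        d = distance(features, enrolled)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name

# Four household members, deliberately well separated in feature space.
profiles = {
    "dad":      [0.9, 0.1, 0.8],
    "mum":      [0.2, 0.9, 0.7],
    "teenager": [0.5, 0.5, 0.2],
    "toddler":  [0.1, 0.2, 0.1],
}

print(identify([0.85, 0.15, 0.75], profiles))  # closest match: "dad"
```

With only four well-separated people to tell apart, even a crude scheme like this is reliable; it’s the thousand-strangers case that remains hard.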
Moving on from this we have the voice recognition. Voice recognition hasn’t received a huge boost to its technology of late, but it’s still good enough for recognising a few keywords. Extending this to the Natal framework, it’s hard to see whether the conversation is free-form or scripted. Listening to the narrator speak about the project, and watching a few things on the screen, it concerns me that the video is little more than a glorified script. What makes me say this? The fact that the narrator explains that every time the pair of goggles is thrown at the interactee, they stoop down to pick them up. This seems to me to indicate that events are not at all free-flowing and still rely on a large amount of pre-scripted effort. This is further confirmed by the faint but still visible symbol on screen showing how to make the goggles gesture, and again at the beginning of the demonstration, where it appears Claire has been prompted to wave to Milo. It seems the NATAL system is driven by gestures and symbols. What did intrigue me is that as Milo skips off to the pond, he mentions in conversation “I don’t know until I try, do I?” This seemed a rather out-of-the-blue sentence and could indicate more realism in the whole system, or simply one of a string of random phrases that Milo may utter after discussing homework.
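If the demo really is keyword- and gesture-driven, the dialogue could be as simple as a scripted state machine: each scene accepts a small set of recognised events and anything else gets a stock fallback line. Here’s a rough Python sketch of that idea; the scene names, events and lines are all mine, not Microsoft’s.

```python
# A tiny scripted-dialogue state machine. Each state maps recognised
# events (keywords or gestures) to (spoken line, next state); anything
# unrecognised triggers a fallback that keeps the scene alive.
# Scenes and lines are invented for illustration.

SCRIPT = {
    "greeting": {"hello": ("Hello! Have you done your homework?", "homework")},
    "homework": {"help": ("Catch these goggles!", "goggles")},
    "goggles":  {"goggles_gesture": ("Great, follow me to the pond.", "pond")},
    "pond":     {},
}

def respond(state, event):
    """Return (spoken line, next state) for a recognised event,
    or a generic prompt if the event isn't in the current scene."""
    transitions = SCRIPT[state]
    if event in transitions:
        return transitions[event]
    return ("Hmm? Say that again.", state)  # fallback: stay in this scene

line, state = respond("greeting", "hello")
print(line)  # Milo's scripted opener
```

A system like this would look remarkably fluid on camera while understanding almost nothing, which is exactly what the repeated goggles routine suggests.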
The emotional state of Milo is something which the narrator touts quite heavily in this video. He claims that Milo is able to recognise emotions in the interactee and is also able to exhibit emotions back. The second claim is a little easier to stomach. It’s entirely possible to put modifiers on the motion sequences to make them look happy or sad; dropping the head and slouching forward is nothing special. The former of the two claims is more difficult to stomach. Just how can Milo recognise emotions from the interactee? In the video we do not actually see any evidence of this, but it could possibly be achieved by monitoring the person’s own stance and features of their voice. Milo’s voice does indeed seem to change with his emotion, varying considerably depending on his “emotion”. This could be achieved quite easily by having a number of responses dependent on the input of the interactee: some happy, some sad, some surprised, selected on the basis of keywords from the voice recognition and emotion analysis from stance and possibly the face.
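Such a keyword-plus-stance scheme might look something like the following minimal Python sketch, where the word lists, the slouch threshold and the canned responses are all my own invention:

```python
# Picking a response variant from crude emotion cues: keywords from the
# speech recogniser plus a posture reading from the body tracker.
# Word lists, threshold and responses are invented for illustration.

HAPPY_WORDS = {"great", "yes", "fun", "love"}
SAD_WORDS = {"no", "boring", "hate", "tired"}

def classify_emotion(words, head_drop):
    """Score speech keywords and posture; head_drop is how far the
    tracked head sits below its neutral height (0.0 = upright)."""
    score = sum(w in HAPPY_WORDS for w in words)
    score -= sum(w in SAD_WORDS for w in words)
    if head_drop > 0.2:  # slouching reads as negative
        score -= 1
    if score > 0:
        return "happy"
    if score < 0:
        return "sad"
    return "neutral"

RESPONSES = {
    "happy": "You seem cheerful today!",
    "sad": "What's wrong? You look a bit down.",
    "neutral": "How are you doing?",
}

cue = classify_emotion(["homework", "boring"], head_drop=0.3)
print(RESPONSES[cue])  # sulky words plus a slouch pick the "sad" line
```

Nothing here understands emotion in any deep sense, of course; it merely picks a plausible canned reaction, which is all the demo needs to show.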
The next subject is one which, unless the system is really limited, I can’t fully explain: the synthesis of speech, which is actually really good. Along with speech recognition, this appears to be an area which has been lacking in technological development in recent years. It could be that the demonstration has pre-scripted lines which Milo can speak, or it could be that the words are generated on the fly. The NATAL sensor is apparently equipped with a multi-array microphone which enables it to do acoustic source localisation and noise suppression, which could aid the speech recognition, but the speech synthesis would probably be handled by software on the console.
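For the curious, acoustic source localisation with a microphone pair boils down to estimating the time delay between the two signals, which cross-correlation does nicely; knowing the delay and the microphone spacing gives a bearing on the speaker. Here’s a small Python sketch; the test signal and figures are invented for illustration and have nothing to do with the actual sensor.

```python
import numpy as np

# Estimating the delay (in samples) between two microphone signals by
# finding the peak of their cross-correlation. A real system would
# convert the delay plus mic spacing into a direction of arrival.

def estimate_delay(sig_a, sig_b):
    """Return the lag (in samples) by which sig_b trails sig_a."""
    corr = np.correlate(sig_b, sig_a, mode="full")
    return int(np.argmax(corr)) - (len(sig_a) - 1)

rate = 16000                      # samples per second (invented)
t = np.arange(1024) / rate
pulse = np.sin(2 * np.pi * 440 * t) * np.exp(-t * 200)  # decaying tone

delay = 12                        # mic B hears the pulse 12 samples late
sig_a = np.concatenate([pulse, np.zeros(delay)])
sig_b = np.concatenate([np.zeros(delay), pulse])

print(estimate_delay(sig_a, sig_b))  # recovers the 12-sample lag
```

At 16 kHz a 12-sample lag is under a millisecond, which gives a sense of how fine the timing has to be for a shelf-top sensor to tell speakers apart.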
Next comes the interaction with water. Now in my mind, this is the easiest portion of the demonstration. There are a few nice touches, but again there is nothing ground-breaking here. The sensor in NATAL is apparently capable of doing 3D full-body motion capture of up to four people. Taking the movements of Claire and making her ripple the water really is child’s play. It was, however, refreshing to see her reflection in the water. Presumably the RGB camera in the sensor is used to map video onto a plane which is then “rippled”. To be honest, though not technically impressive, this was one of my favourite parts of the demonstration video. The camera is also used to take a quick photo when Claire draws a picture of a fish for Milo. Though we hear Milo exclaim that it’s orange, the video ends before we can see whether he recognises it as a fish or not. Assuming that Milo is expecting to see a certain set of shapes, it isn’t beyond the realms of possibility for the software to pick out rudimentary shapes from the drawing and convert them for Milo to process.
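Rippling a plane of water really is a well-worn trick: the classic two-buffer height-field algorithm, where each cell chases the average of its four neighbours with some damping, and the heights displace the reflected video texture. A rough Python sketch of that algorithm follows; the grid size and damping value are chosen arbitrarily by me.

```python
# The classic two-buffer height-field ripple effect: each cell moves
# toward the average of its neighbours, damped each frame. A renderer
# would use the heights to distort the video mapped onto the plane.
# Grid size and damping are invented for illustration.

SIZE = 32
DAMPING = 0.98

def step(current, previous):
    """Advance the ripple simulation one frame, returning a new grid."""
    nxt = [[0.0] * SIZE for _ in range(SIZE)]
    for y in range(1, SIZE - 1):
        for x in range(1, SIZE - 1):
            neighbours = (current[y - 1][x] + current[y + 1][x] +
                          current[y][x - 1] + current[y][x + 1])
            nxt[y][x] = (neighbours / 2.0 - previous[y][x]) * DAMPING
    return nxt

# A tracked hand "touches" the water at the centre of the grid...
prev = [[0.0] * SIZE for _ in range(SIZE)]
curr = [[0.0] * SIZE for _ in range(SIZE)]
curr[16][16] = 1.0

# ...and the disturbance spreads outward over the next few frames.
for _ in range(3):
    curr, prev = step(curr, prev), curr

print(curr[16][13])  # nonzero: the wave front has travelled three cells
```

It’s a few dozen lines that games have shipped since the nineties, which is why I call this the easiest part of the demo; the motion capture feeding it is where NATAL earns its keep.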
Some of you reading this who have watched the video may be thinking that I’m being a little harsh and that the video was pretty amazing. I’m not denying that the video was impressive. However, after my first watch I decided that I wanted to dig a little deeper and not take everything at face value. I wanted to see whether Microsoft were bringing anything ground-breaking to the market. In my personal opinion the technologies behind this are nothing new at all. What NATAL does appear to bring is a way to amalgamate all of these technologies together into a single package. If the API behind this is as good as the demonstration video, then it will be very interesting to see what the Xbox 360 has to offer once NATAL is released. To be honest it is all going to hinge on what Microsoft do with the technology. Having a great technical demo is one thing, but being able to turn that into an immersive gaming experience is a completely different thing altogether. After all, we all have virtual reality now, don’t we? Oh… yeah… what did happen to that?