Gesture Recognition
Question

Hi,
I want to implement some basic gestures like Lean Left/Right. For this, I used the Y-coordinate values of the Head and Spine joints to track the vertical distance between them, along with a constraint on the difference in their depth values.
Let's say that if the vertical distance is less than 100, the body is leaning (either left or right, which is further determined from the X-coordinates). This works fine when I stand about 1 m away from the Kinect, but as I move one or two steps further back, the skeleton gets smaller and the vertical distance between Head and Spine drops below 100, which falsely registers a leaning gesture.
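In pseudocode terms, the check I'm doing looks roughly like this (a minimal Python sketch; the joint values are made up, and 100 is the fixed threshold that stops working at larger distances):

```python
# Minimal sketch of the lean check described above; head and spine are
# hypothetical (x, y) joint coordinates, and 100 is the fixed threshold
# from the question -- the value that breaks when the user steps back.
LEAN_THRESHOLD = 100.0

def detect_lean(head, spine):
    vertical_distance = abs(head[1] - spine[1])
    if vertical_distance >= LEAN_THRESHOLD:
        return None  # standing upright
    # the lean direction is decided from the X coordinates
    return "left" if head[0] < spine[0] else "right"
```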
Any ideas how I can refine my gesture recognition?
Thank you
Answers

(Sorry about the English mistakes in the following.)
Well, I don't know if that can help, but here is what I did.
First of all, I extract the positions of the hands, the elbows and the shoulders of the skeleton every time a new skeleton frame is ready. Then I re-center those coordinates on the midpoint between the shoulders, and finally I normalise them, using the distance between the shoulders as the unit.
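As a minimal sketch of that step (in Python rather than C#, with illustrative joint names and made-up coordinates):

```python
import math

def normalize_pose(joints):
    """joints: dict mapping joint names to (x, y) tuples for
    'left_shoulder', 'right_shoulder', 'left_elbow', 'right_elbow',
    'left_hand', 'right_hand' (names are illustrative)."""
    ls, rs = joints["left_shoulder"], joints["right_shoulder"]
    # new origin: midpoint between the shoulders
    cx, cy = (ls[0] + rs[0]) / 2, (ls[1] + rs[1]) / 2
    # new unit: distance between the shoulders
    unit = math.hypot(ls[0] - rs[0], ls[1] - rs[1])
    vector = []
    for name in ("left_hand", "right_hand", "left_elbow",
                 "right_elbow", "left_shoulder", "right_shoulder"):
        x, y = joints[name]
        vector.extend([(x - cx) / unit, (y - cy) / unit])
    return vector  # 12 values, invariant to position and distance
```

Because the result is expressed in "shoulder widths" around the chest, the same pose gives (nearly) the same vector whether the user stands 1 m or 3 m from the sensor.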
I thus get a 12-dimensional vector (six joints, with an X and a Y value each). This is what I use for both position and gesture recognition.
I created a "Position recognizer". It holds a list of 12-dimensional vectors, each associated with a class (not a C# class). I simply use a nearest-neighbour algorithm to tell whether the current position belongs to a known class or not: I compute the Euclidean distance between the current position and each known position, and keep the closest known position. If the distance between that position and the current position is less than a fixed threshold, I win. Otherwise, the current position is considered "unknown".
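A sketch of that recognizer (in Python for brevity; the class and method names are illustrative, not from my actual code):

```python
import math

class PositionRecognizer:
    """Nearest-neighbour classifier over 12-dimensional pose vectors,
    with a rejection threshold for unknown positions."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.examples = []  # list of (vector, class_name) pairs

    def add_example(self, vector, class_name):
        self.examples.append((vector, class_name))

    def classify(self, vector):
        best_name, best_dist = None, float("inf")
        for example, name in self.examples:
            d = math.dist(vector, example)  # Euclidean distance
            if d < best_dist:
                best_dist, best_name = d, name
        # below the threshold => known class, otherwise "unknown"
        return best_name if best_dist < self.threshold else None
```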
Gesture recognition is a bit more complicated: whenever a new 12-dimensional vector is ready, I store it in a buffer that keeps the last N positions in memory. Then the position sequence contained in that buffer is passed to a "Gesture recognizer". It works pretty much the same way: it has a set of position sequences, each one associated with a gesture name. To compute the distance between the given sequence and a known sequence, I use a basic Dynamic Time Warping (DTW) algorithm.
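The buffering part is just a fixed-size sliding window; sketched in Python (N and the `recognize` method are hypothetical stand-ins for the DTW matching just described):

```python
from collections import deque

BUFFER_SIZE = 30  # hypothetical N, roughly one second at 30 fps
position_buffer = deque(maxlen=BUFFER_SIZE)  # keeps only the last N poses

def on_new_pose(vector, gesture_recognizer):
    position_buffer.append(vector)
    # the whole sequence currently in the buffer is handed to the
    # gesture recognizer on every new frame; `recognize` is a
    # hypothetical stand-in for the DTW-based sequence matching
    return gesture_recognizer.recognize(list(position_buffer))
```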
Actually, I use a slightly more-than-basic version. If SEQ[1..m] is the sequence to analyse and EX_SEQ[1..n] the known sequence:
- Reverse the two sequences to compare => SEQ_R and EX_SEQ_R
- Compute the DTW matrix using SEQ_R and EX_SEQ_R => DTWmatrix[1..m, 1..n]
- Return the value min(DTWmatrix[x, n]) for x between 1 and m.
If the smallest DTW distance between the analysed sequence and a known sequence is below a given threshold, the corresponding gesture is recognized.
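The three steps above, sketched in Python (assuming some `dist` function between two positions, e.g. the Euclidean distance already mentioned):

```python
def dtw_matrix(seq, ex_seq, dist):
    """Classic DTW cost matrix between seq[0..m-1] and ex_seq[0..n-1]."""
    m, n = len(seq), len(ex_seq)
    inf = float("inf")
    d = [[inf] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = dist(seq[i - 1], ex_seq[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d

def gesture_distance(seq, ex_seq, dist):
    """The variant described above: reverse both sequences, then take
    the minimum over the last column, i.e. the best-matching suffix
    of the analysed sequence."""
    d = dtw_matrix(seq[::-1], ex_seq[::-1], dist)
    n = len(ex_seq)
    return min(d[x][n] for x in range(1, len(seq) + 1))
```

Taking the min over the last column means the gesture may start anywhere in the buffer, as long as it ends at the most recent frame.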
So far, I have got pretty good results. The main difficulty is thresholding. For now, I just set the thresholds empirically, but there are smarter ways to do it...
Was that clear enough?
 Proposed as answer by Eddy EscardoRaffo [MSFT] Thursday, June 30, 2011 1:02 AM
 Marked as answer by Eddy EscardoRaffo [MSFT] Wednesday, July 13, 2011 6:04 AM
All replies









A vector representing a position is just an array with 12 values: six X values and six Y values.
double[] position = new double[12];
But as long as you can define a distance between two positions (I use a simple Euclidean distance, but any other usual or unusual distance would work as well), positions can be represented in any imaginable way, and you'll still be able to use a DTW algorithm.
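For example (a Python sketch; any metric over the 12 values would do), swapping the distance function changes nothing else in the pipeline:

```python
import math

# A position is just a flat list of 12 floats: [x1..x6, y1..y6].
# Any function taking two such lists and returning a non-negative
# number can serve as the distance used by the recognizers.
def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# e.g. a Manhattan distance would work just as well:
def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))
```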
I'll try to post a little video, if I can...

Understood.
Now, for the gesture recognition, do the two sequences to compare have the same number of elements? I assumed the number of samples is the same for the two sequences to compare. Or do you take a fixed amount of time for each test sequence, which could then have a different number of samples?
In any case... thanks a lot
Ugo
Nope! That is precisely what DTW was designed for: comparing sequences with different numbers of elements. One sequence may have been executed faster, or there could be deletions/insertions of elements.
In the most basic form of the algorithm (you can find it on Wikipedia), there is a constraint: the first elements of the two sequences are matched together, as well as the last elements.
You can remove that "matched last elements" constraint by returning the min of the last row (or column, depending on the application) of the DTW matrix.
For my gesture recognition, I reverse the two sequences before I start the DTW algorithm. That is to say, I always match the last positions of the two sequences together. Then I compute the DTW matrix and look for the subsequence of the analysed sequence that best matches the example sequence.
(Not sure that is clear... Ask for details if you need.)
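For contrast, the basic Wikipedia form mentioned above (both endpoints matched) simply returns the bottom-right cell of the matrix; a sketch, assuming some `dist` function between elements:

```python
def dtw(seq_a, seq_b, dist):
    """Basic DTW: first elements matched together, last elements matched
    together; the result is the bottom-right cell of the cost matrix."""
    m, n = len(seq_a), len(seq_b)
    inf = float("inf")
    d = [[inf] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = dist(seq_a[i - 1], seq_b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[m][n]
```

Even this basic form happily compares sequences of different lengths: a gesture performed slowly just repeats samples, and the warping absorbs that at no extra cost.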





I've posted a simple code snippet to identify a hand wave gesture here:
http://aswathkrishnan.tumblr.com/post/7175233975/geekalertkinecting2positionandgesture