# Gesture Recognition

### Question

• Hi,

I want to implement some basic gestures like Lean Left/Right. For this, I use the y-coordinates of the Head and Spine joints to track the vertical distance between them, along with a constraint on the difference in their depth values.

Let's say that if the vertical distance is less than 100, the body is leaning (either left or right, which is further determined from the x-coordinates). This works fine when I stand approximately 1 m away from the Kinect. But as I move one or two steps further back, the skeleton becomes smaller and the vertical distance between head and spine drops below 100, which falsely registers a leaning gesture.

Any ideas on how I can refine my gesture recognition?

Thank you

Friday, June 24, 2011 1:41 PM

• (Sorry about the English mistakes in the following.)

Well, I don't know if this helps, but here is what I did.

First of all, I extract the positions of the hands, the elbows and the shoulders from the skeleton every time a new skeleton frame is ready. Then I re-center those coordinates on the point midway between the shoulders, and finally I normalize them by using the distance between the shoulders as the unit.

I thus get a 12-dimensional vector. This is what I use for both position and gesture recognition.
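The normalization described above could look something like this (a minimal Python sketch; the original is presumably C#, and the joint names are illustrative):

```python
import math

def normalize_pose(joints):
    """Re-center joint coordinates on the midpoint between the shoulders
    and scale by the shoulder-to-shoulder distance.
    `joints` maps an (illustrative) joint name to an (x, y) coordinate."""
    lx, ly = joints["shoulder_left"]
    rx, ry = joints["shoulder_right"]
    # New origin: the point midway between the two shoulders.
    ox, oy = (lx + rx) / 2.0, (ly + ry) / 2.0
    # Unit length: the distance between the shoulders.
    unit = math.hypot(rx - lx, ry - ly)
    tracked = ["hand_left", "hand_right", "elbow_left",
               "elbow_right", "shoulder_left", "shoulder_right"]
    vector = []
    for name in tracked:
        x, y = joints[name]
        vector.extend([(x - ox) / unit, (y - oy) / unit])
    return vector  # 6 joints x 2 coordinates = 12 values
```

Because everything is divided by the shoulder distance, the resulting vector no longer depends on how far the user stands from the sensor.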

I created a "position recognizer". It holds a list of 12-dimensional vectors, each associated with a class (not a C# class). I use a simple nearest-neighbour algorithm to tell whether the current position belongs to a known class: I compute the Euclidean distance between the current position and each known position, and keep the closest one. If the distance to that closest known position is less than a fixed threshold, I have a match. Otherwise, the current position is considered "unknown".
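That nearest-neighbour check can be sketched as follows (Python illustration; function and parameter names are made up for the example):

```python
import math

def classify_position(position, known, threshold):
    """Nearest-neighbour position recognition.
    `known` is a list of (label, 12-d vector) pairs. Returns the label
    of the closest known position, or None ("unknown") when the closest
    distance exceeds `threshold`."""
    best_label, best_dist = None, float("inf")
    for label, example in known:
        # Euclidean distance between the two 12-d vectors.
        dist = math.sqrt(sum((a - b) ** 2
                             for a, b in zip(position, example)))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= threshold else None
```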

Gesture recognition is a bit more complicated: whenever a new 12-dimensional vector is ready, I store it in a buffer that keeps the last N positions in memory. The position sequence contained in that buffer is then passed to a "gesture recognizer". It works much the same way: it has a set of position sequences, each associated with a gesture name. To compute the distance between the given sequence and a known sequence, I use a basic Dynamic Time Warping (DTW) algorithm.

Actually, I use a slightly more-than-basic version. If SEQ[1...m] is the sequence to analyse and EX_SEQ[1...n] is the known sequence:

- Reverse the two sequences to compare => SEQ_R and EX_SEQ_R

- Compute the DTW matrix using SEQ_R and EX_SEQ_R => DTWmatrix[1..m,1..n]

- Return the value min(DTWmatrix[x,n]) for x between 1 and m.
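The steps above can be sketched in Python (the recursion is the standard textbook DTW formulation; the reversal and the final `min` over the last column are the modification described, and the helper names are illustrative):

```python
import math

def dist(a, b):
    # Euclidean distance between two position vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_suffix_distance(seq, example):
    """Reverse both sequences, fill the DTW matrix, then take the
    minimum of the column that consumes the whole example. This scores
    how well the *end* of `seq` (the most recent positions in the
    buffer) matches the known gesture `example`."""
    s, e = seq[::-1], example[::-1]
    m, n = len(s), len(e)
    INF = float("inf")
    # d[i][j] = DTW cost of aligning s[:i] with e[:j].
    d = [[INF] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = dist(s[i - 1], e[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # stretch the example
                                 d[i][j - 1],      # stretch the buffer
                                 d[i - 1][j - 1])  # advance both
    # Minimum over all prefixes of the reversed buffer, i.e. over all
    # suffixes of the original buffer, that cover the whole example.
    return min(d[i][n] for i in range(1, m + 1))
```

Reversing both sequences and minimizing over the last column means the whole example must be matched, but only a suffix (the most recent part) of the buffer needs to participate, so older unrelated motion in the buffer does not penalize the score.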

If the smallest DTW distance between the analysed sequence and a known sequence is below a given threshold, the corresponding gesture is recognized.

So far, I have gotten pretty good results. The main difficulty is thresholding. For now, I just set the thresholds empirically, but there are smarter ways to do it...

Was that clear enough? -_-"

Tuesday, June 28, 2011 9:18 AM

### All replies

• Hello,

Did you use scaled values?

Teemu
There is no place without shadow
Friday, June 24, 2011 4:12 PM
• Hi Teemu,

I'm not sure what you mean by scaled values.

I'm using the pixel coordinates of the joints in the skeleton frame.

Thank You

Monday, June 27, 2011 8:04 AM
• What about using vectors instead of pixels? For my gestures, I compare the direction of the user's arm to the up vector. Similarly, you could compare the angle between the up vector and the direction the spine is pointing.
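A minimal Python sketch of this idea (the joint names and the assumption that y increases upward are mine, not from the SDK):

```python
import math

def lean_angle(spine_xy, head_xy):
    """Angle in degrees between the up vector and the spine-to-head
    direction. A small angle means standing upright; a large one means
    leaning. Assumes y increases upward; Kinect image coordinates have
    y increasing downward, so flip the sign of y first if needed."""
    dx = head_xy[0] - spine_xy[0]
    dy = head_xy[1] - spine_xy[1]
    length = math.hypot(dx, dy)
    # Cosine of the angle = dot product of the normalized direction
    # with up = (0, 1), which is just the normalized dy component.
    cos_angle = dy / length
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))
```

Because an angle does not change when the skeleton shrinks with distance, this avoids the scale problem of a fixed pixel threshold.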
Monday, June 27, 2011 8:16 PM
• This is a good idea to work with. Thank you :)
Tuesday, June 28, 2011 8:42 AM
• Thank you for helping me out. But I couldn't quite understand: what does "I re-center those coordinates right between the shoulders" mean?
Wednesday, June 29, 2011 10:10 AM
• It just means I perform a coordinate-system change, using a point located between the two shoulders as the new origin.
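In other words, something like this (illustrative Python helper; the name is made up):

```python
def recenter(point, shoulder_left, shoulder_right):
    """Express `point` in a coordinate system whose origin is the
    midpoint between the two shoulders."""
    ox = (shoulder_left[0] + shoulder_right[0]) / 2.0
    oy = (shoulder_left[1] + shoulder_right[1]) / 2.0
    return (point[0] - ox, point[1] - oy)
```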
Wednesday, June 29, 2011 11:11 AM
• Rhemyst,

nice approach.

Just a question: is the vector formed of 2 coordinates for each tracked part of the skeleton? I mean, are you translating the coordinates in 2D, so the vector is 6 tracked elements × 2 coordinates?

Ugo
Wednesday, June 29, 2011 12:37 PM
• A vector representing a position is just an array with 12 values: six X values and six Y values.

double[] position = new double[12];

But as long as you can define a distance between two positions (I use a simple Euclidean distance, but any other usual or unusual distance would work as well), positions can be represented in any imaginable way, and you'll still be able to use a DTW algorithm.
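For instance, either of these would serve as the pluggable distance (Python sketches; only the Euclidean one is what this thread actually describes using):

```python
def euclidean(a, b):
    """Euclidean distance between two position arrays."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """An alternative metric: Manhattan (L1) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))
```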

I'll try to post a little video, if I can...

Wednesday, June 29, 2011 12:47 PM
• Understood,

Now, for the gesture recognition, the two sequences to compare have the same number of elements, right?

I assume that the number of samples is the same for the two sequences being compared. Or do you take a defined amount of time for each testing sequence, which could then have a different number of samples?

In any case....thanks a lot

Ugo
Wednesday, June 29, 2011 1:11 PM
• Nope! That is precisely what DTW was designed for: comparing sequences with different numbers of elements. One sequence may have been executed faster, or there could be deletions/insertions of elements.

In the most basic form of the algorithm (you can find it on Wikipedia), there is a constraint: the first elements of the two sequences are matched together, as well as the last elements.

You can remove that "matched last elements" constraint by returning the min of the last row (or column, depending on the application) of the DTW matrix.

For my gesture recognition, I reverse the two sequences before I start the DTW algorithm. That is to say, I always match the last positions of the two sequences together. Then I compute the DTW matrix and look for the subsequence of the sequence I am trying to classify that best matches the example sequence.

(Not sure this is clear... Ask for more details if you need.)

Wednesday, June 29, 2011 1:25 PM
• Here is a video : http://www.youtube.com/watch?v=XsIoN96yF3E

There are some annotations.

Basically, I make the program learn 3 gestures named "Y", "C" and "T", then do some testing. As you can see, it works regardless of the velocity of my movements. :)

Wednesday, June 29, 2011 2:51 PM
• This is great! I understand your approach, but I don't think my mathematics are up to making a version myself. Do you plan to open-source your code?

Steve

Wednesday, June 29, 2011 4:59 PM
• There is no need for math... I just copied the algorithm from the Wikipedia page and applied it to sequences of 12-dimensional vectors, plus a few improvements I added after testing.

I'll see what code I can post tomorrow.

Wednesday, June 29, 2011 5:35 PM
• Thanks man, appreciate the help :)
Wednesday, June 29, 2011 5:37 PM
• I've posted a simple code snippet to identify a hand wave gesture here.