Isolating hands in depth image

  • Question

  • Hi all,

    I've been playing with Kinect for a few days, with the goal of detecting a hand when it's open and closed. I have a solution right now, which seems to work, but it relies on the person standing in a specific z-location; I'll explain a bit more:

    I sample an area of the depth image around the hand I want to detect, then threshold that image by a certain grey-scale value, so that only elements in the scene beyond a certain distance are shown. From there, I test the pixels of the image until I've drawn a bounding box around the hand. Every frame I then compare that box against some constants to decide whether the hand is open or closed.
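    The per-frame step could be sketched roughly like this. Everything here (the region size, `GREY_THRESHOLD`, `OPEN_HAND_MIN_HEIGHT`, the assumption that the hand's pixels are brighter than the threshold in this visualisation) is illustrative, not from any Kinect API:

    ```python
    import numpy as np

    GREY_THRESHOLD = 180          # assumed 8-bit grey cut-off
    OPEN_HAND_MIN_HEIGHT = 70     # assumed bounding-box constant tested each frame

    def classify_hand(region: np.ndarray) -> str:
        """region: 2-D uint8 array sampled around the hand."""
        # Keep only pixels past the grey threshold (assumed to be the hand).
        mask = region > GREY_THRESHOLD
        ys, xs = np.nonzero(mask)
        if ys.size == 0:
            return "no hand"
        # Bounding-box height of the remaining pixels.
        box_height = ys.max() - ys.min() + 1
        return "open" if box_height >= OPEN_HAND_MIN_HEIGHT else "closed"

    # Example: a 100x100 region with an 80-pixel-tall bright blob
    region = np.zeros((100, 100), dtype=np.uint8)
    region[10:90, 40:60] = 255
    print(classify_hand(region))   # "open"
    ```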

    This works, but as I say, it requires the person to be standing in a specific spot. Even then, if they move their hand too far back, it breaks.

    I've been thinking about how to avoid this, and solve it so that the person can be standing anywhere, but I'm not really sure where to go with it. I tried posterizing the image, anywhere from 2 to 15 levels of grey, but unfortunately the grey-shades seem to be too close to each other, so the hand ends up being coloured the same as, for example, the body. 

    I also wondered if it'd be possible to map the depth position of the skeleton to the grey-shades that appear in the scene. Then I could threshold from a grey that was just darker than the body's grey, at any given time, and it'd reveal the hand only. 

    But I'm not sure how to go about these things, and not even sure if they would work. Does anyone have any suggestions? I should say, I'm using the AIR Kinect libraries, not the official Microsoft SDK, but I don't suppose this should make too much difference to any theoretical solutions.


    Thursday, July 12, 2012 7:10 PM

All replies

  • Hi. Your problem is easy to solve =)

    The first step (not sure if we are on the same page here) is to isolate the part of the depth map that contains only the hand. You can do this by taking the hand's z-value (from the skeleton data), multiplying it by 1000 (to convert from meters to millimeters), and then deleting all the pixels in the depth map whose value deviates from that hand depth by more than a threshold (+/-100 has worked well for me, which means +/- 10 cm).
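    A minimal sketch of that isolation step, assuming a raw 16-bit depth map in millimeters and the skeleton's hand z-position in meters (the function name and the +/-100 mm window are from this post; nothing here is a real Kinect API call):

    ```python
    import numpy as np

    def isolate_hand(depth_mm: np.ndarray, hand_z_m: float,
                     window_mm: int = 100) -> np.ndarray:
        """Zero out every pixel farther than window_mm from the hand depth."""
        hand_mm = hand_z_m * 1000.0            # meters -> millimeters
        mask = np.abs(depth_mm.astype(np.int32) - hand_mm) <= window_mm
        return np.where(mask, depth_mm, 0)

    # Example: hand at 1.2 m, background at 2.5 m
    depth = np.full((4, 4), 2500, dtype=np.uint16)
    depth[1:3, 1:3] = 1200                     # hand pixels
    isolated = isolate_hand(depth, 1.2)
    print(int((isolated > 0).sum()))           # 4 hand pixels remain
    ```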

    Then, I believe what you're doing is checking how large the hand-pixel area is. An open hand covers more area than a closed one (a fist). However, you need to take into account that the hand's apparent size in the depth map changes with distance: the hand of a user who is further away will appear smaller.

    You can still solve the problem by manually measuring the number of pixels / the area at a certain reference distance; 1.0 meters is a good choice. Then you can use the rule of proportion when the user is at any other distance. E.g. say you determined that, for the 640x480 depth map with the user 1.0 m away, the height of the bounding box is 90 pixels for an open hand and 50 pixels for a closed one. Your threshold would be something in between, e.g. 70 pixels. Then in your code, you would do something like:

    openHandThreshold = 70 / handPosition.z;

    and then you test the height of the bounding box against that threshold. Note that "handPosition.z" is the distance from the skeleton, in meters.
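    Putting that together as a small sketch (the 70-pixel reference calibrated at 1.0 m is the example figure from this post; the function name is illustrative):

    ```python
    REFERENCE_THRESHOLD_PX = 70.0   # bounding-box height threshold at 1.0 m

    def is_hand_open(box_height_px: float, hand_z_m: float) -> bool:
        # Apparent size scales inversely with distance, so divide the
        # calibrated threshold by the current skeleton distance in meters.
        open_hand_threshold = REFERENCE_THRESHOLD_PX / hand_z_m
        return box_height_px >= open_hand_threshold

    print(is_hand_open(80, 1.0))   # True:  80 >= 70
    print(is_hand_open(30, 2.0))   # False: 30 < 35
    ```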

    Wednesday, July 18, 2012 8:11 AM
  • Hey there,

    Sorry about the delay, but thanks a lot for your answer! I think I follow your whole approach, I just have one question about implementation of it: you say that by getting the hand's z-position (in meters) then multiplying it by 1000 to get it into millimetres, I can then use that value to remove all the pixels behind the hand in the depth map. How would I map that value to a grey value in the depth image? As in, each pixel in the depth image is a shade of grey, but how do you know which shade corresponds to which distance from the sensor?

    Thanks again though, awesomely comprehensive answer!


    Monday, July 23, 2012 10:39 AM
    Actually you need to understand the difference between data processing (using raw data) and data visualization (whose result is something else). What you want to do is data processing on the raw data, i.e. the raw depth map that contains the distances from points in your room to the Kinect, in millimeters. You want to use the raw data (and not the visualization) because your goal is not to visualize something but to detect whether the hand is closed, and also because the raw data is far more precise than the visualization (the raw data ranges from 400 to about 5000, i.e. 40 cm to 5 meters, while the visualization ranges only from 0 to 255).

    I can't know the exact answer to your question about the shade of grey, because I don't know what visualization technique you are using. Several alternatives exist, e.g. "black is 0 cm / the closest value (white is the largest value)" or "black is the largest value (and white is the closest value)". You'll have to inspect the code that converts the raw depth map to the bitmap displayed on your screen. ... But anyway, as I said before, it makes more sense to work on the raw depth map directly ;)
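    To illustrate the precision loss: squeezing the raw millimeter range (~400-5000) into 256 grey levels means each grey level covers roughly 18 mm, so nearby depths collapse into one shade. The linear "white = closest" mapping below is just one of the conventions mentioned above, not necessarily the one your library uses:

    ```python
    MIN_MM, MAX_MM = 400, 5000   # approximate raw depth range

    def to_grey(depth_mm: int) -> int:
        """Linear map to 0-255, white = closest (one possible convention)."""
        clamped = min(max(depth_mm, MIN_MM), MAX_MM)
        return int(255 * (MAX_MM - clamped) / (MAX_MM - MIN_MM))

    # Two depths 10 mm apart collapse to the same grey level...
    print(to_grey(1200), to_grey(1210))   # 210 210
    # ...but remain distinct in the raw data, which is why the
    # open/closed detection should use the raw depth map.
    ```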

    Wednesday, July 25, 2012 9:27 AM