Can I tell the Kinect to only recognise speech if it meets a certain volume threshold?

  • Question

  • I've been toying with the speech sample (red, green, blue) in the SDK and have it controlling Windows Media Center quite well. The problem I have is that the Kinect sometimes considers audio from my speakers as commands.

    I have been outputting all matched audio to .wav files to listen to, and the false positives are very quiet in comparison to my actual voice commands (the Kinect beam filtering and noise suppression seem to be doing a decent job of ignoring the TV's speakers).

    I am using Result.Confidence (I know it's not fully working, but it is useful). I think that if I can place a further condition on the audio to be a specific loudness then I could remove these false positives almost entirely.

    All the commands I use begin with the word 'xbox', so it's not that the voices from my TV speakers are actually saying the correct words.
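    The loudness condition described above could be prototyped offline against the saved .wav files before wiring anything into the recognizer. The following is a minimal Python sketch, not Kinect SDK code: the function names and the -30 dBFS threshold are assumptions made for illustration, and it only handles 16-bit PCM.

```python
import math
import sys
import wave
from array import array

FULL_SCALE = 32768.0  # magnitude of the largest 16-bit PCM sample


def rms_dbfs(samples):
    """Return the RMS level of 16-bit PCM samples in dB relative to full scale."""
    if len(samples) == 0:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")
    return 20.0 * math.log10(math.sqrt(mean_square) / FULL_SCALE)


def read_wav_samples(path):
    """Load a 16-bit PCM WAV file (such as a saved utterance) as integers."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        raw = wav.readframes(wav.getnframes())
    samples = array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return samples


def loud_enough(samples, threshold_dbfs=-30.0):
    """True when the utterance is at least as loud as the assumed threshold."""
    return rms_dbfs(samples) >= threshold_dbfs
```

    Running `rms_dbfs(read_wav_samples(path))` over a folder of real commands and false positives would show whether a single threshold separates the two groups; the right value depends on the microphone gain settings.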

    Friday, July 22, 2011 11:01 AM

Answers

  • Make sure autogaincontrol is turned OFF and also go into your Recording Devices control panel and reduce the input volume to about 50%.  (It is probably at 100).  If you are speaking reasonably loud, you'll still get good recognition, maybe even better results.

    You can try setting the "NoiseSuppression" property on or off and see which works better.

    There is also the beam angle that could help reduce false positives.  You can either set the mic to focus on the center beam only (MicArrayFixedBeam) or you can check the beam angle on recognition and check that it's within an acceptable range.

    I know these things won't completely eliminate the problem but they may help.
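    The second beam-angle option (checking the angle at recognition time) comes down to a range test. A minimal Python sketch of that test follows; how the angle is read from the sensor is left out, and the function name and the 15 degree window are assumptions, not SDK values.

```python
def beam_angle_ok(angle_degrees, center_degrees=0.0, max_offset_degrees=15.0):
    """Accept a recognition only if the reported beam angle is near the expected one.

    angle_degrees is whatever angle the audio source reported when the
    phrase was recognized; the default window is an illustrative guess.
    """
    return abs(angle_degrees - center_degrees) <= max_offset_degrees
```

    For example, if the TV speakers sit well off to one side of the sensor, a command arriving from that direction would be dropped while one spoken from the couch in front would pass.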

    You could also try creating a slightly longer and "vowely" prefix for your commands than xbox.  Rustling, tapping, clicking sounds, will be more likely to trigger consonants like x b k t p.  Try something like "Julia".  If you do use "xbox", try spelling it "ex box".  You can also create "honey pot" prefixes that sound a little bit like your command (e.g. jewels, hulio).  Then in your code you can ignore commands that use these prefixes.   The problem with most speech recognition is that if it thinks you are speaking, it will really try to match to something in your grammars and will tend to be optimistic about the match.  The Kinect is particularly bad in this respect (IMO).

    good luck

     


    awesome voice control for your htpc at www.voxcommando.com
    Monday, August 29, 2011 4:16 PM
  • I wanted to add here, in reference to the OP, that I have seen Kinect give 99%+ confidence to a 5 word command, when there was no speech at all, just the sound of typing on a keyboard. I have seen this at least twice. And it seems between 1-3 times a night the Kinect will accept a command as over 85% when it's not even close.

    To limit false positives, I'm not only checking the confidence on the overall command, but also on the first word of that command (which is a command in and of itself), because all my commands start with the same word. So in order for a speech command to be accepted by my software, the overall command has to have 88% or greater confidence, and the first word has to have 97.5% or greater confidence. So what I referenced in the first paragraph for false positives all have happened with this very finely tuned confidence threshold.
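    The two-level check described above can be sketched as follows. This is a Python illustration of the logic only: the `Word` and `Recognition` types are hypothetical stand-ins for whatever the recognizer returns, and the thresholds are the 88% and 97.5% figures from this post.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    confidence: float  # per-word confidence, 0.0 to 1.0


@dataclass
class Recognition:
    words: List[Word]
    confidence: float  # confidence for the whole phrase, 0.0 to 1.0


def accept(result, overall_min=0.88, first_word_min=0.975):
    """Accept only if both the phrase and its first (trigger) word are confident."""
    if not result.words:
        return False
    return (result.confidence >= overall_min
            and result.words[0].confidence >= first_word_min)
```

    The stricter bar on the first word works because every command starts with the same trigger word, so that word acts as a gate for the rest of the phrase.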

    I've attributed this behavior to the non-fully-functional confidence model of the Kinect framework, but I'd be interested to hear from Eddy or someone else just how inaccurate it is, and what we can expect for changes in the near term. Is the inaccuracy general, that is every confidence value, no matter the level, is not what it should be, or is it mostly accurate, but every once in a while it generates a faulty value, and some of these faulty values are spiking high above my threshold and causing false positives?

    Thursday, September 1, 2011 7:16 PM

All replies

  • pumpkinszwan,

    The Kinect SDK does not provide the support you're looking for, but your question seems to be directed more at speech recognition functionality than at Kinect-specific functionality, so you might have better luck finding what you need in the Microsoft Speech Technologies Developer Center (http://msdn.microsoft.com/en-us/speech/default.aspx).

    That being said, and out of curiosity, what is the confidence reported by the system for the false positives you see?

    Sorry for the limited help,
    Eddy


    I'm here to help
    Friday, July 22, 2011 8:56 PM
  • Eddy, thanks for the help.

     

    I have set my confidence threshold at 0.8 (I assume the range is 0-1), so these false positives are getting a confidence of at least 0.8. When listening back to the recordings there is no way they should be misinterpreted - they are often little more than quiet noise. I know the confidence functionality is not fully working in the beta, so perhaps in later versions of the SDK the false positives will go away.

     

    I only get the occasional false positive, but it can be quite annoying if the audio mutes unexpectedly!

    Saturday, July 23, 2011 12:58 AM
  • I've seen this as well. Random noise and speech that is obviously not a command cause a command to be recognized, and on occasion the confidence spikes above 0.8 or 0.9. I've attributed this to the not-fully-functional confidence value.
    Monday, August 29, 2011 3:22 PM
  • Make sure autogaincontrol is turned OFF and also go into your Recording Devices control panel and reduce the input volume to about 50%.  (It is probably at 100).  If you are speaking reasonably loud, you'll still get good recognition, maybe even better results.


    I'm going to give this a try. It makes sense to me; I've noticed it's more accurate from 5-10 feet away than it is from 1-2 ft away.
    Monday, August 29, 2011 7:50 PM
  • This makes a significant difference. On two computers I use for testing, one was at 14%, and the other was at 100%. In testing it seems 50% - 75% works best. 
    Thursday, September 1, 2011 5:43 PM
  • Thanks for posting these findings. I too am getting lots of false positives. I was wondering if there might be another way to solve this problem. Could one modify the grammar to accept a Wildcard that would catch all the junk noise? It seems like restricting the grammar to just a few choices will force the recognizer to choose amongst those choices, even though none of them are any good.
    Thursday, September 1, 2011 6:37 PM
  • I experimented with the source.AppendWildcard() method. You would append this in a grammar between two sub grammars as necessary. It does work, with limitations.

    It seems it works best with big, clear words, not simple ones like "the" or "of the". My grammar has multiple commands in each grammar: there is a command to trigger the command processor, then a command to do something, then a command that describes where to do that something, and then sometimes a command which dictates how much of that something to do. I have some more advanced grammars too, that along with what I just described split the "do something command" to "do this to something by this much", modify a larger command (to/by, on/off), and other things like that. 

    I initially tried just the .AppendWildcard() method to fill in the blanks between these commands, but it didn't work 100% reliably. It often mistook simple words and phrases like "of the" to be another word from another command. But if I added a few extra commands that pre-include the most common variations (i.e. natural speech), it has cut down on false negatives considerably. 

    Be forewarned, there are limits here. You can't list out every command explicitly (I found a practical limit on an average computer to be about 7,000 commands), and you can't make it too simple (i.e. a form of "normalization"), but there is a balance to be found in the middle, close to simple. 

    One thing I've done, which sounds like it would work for you, is added to the end of all my command grammars the AppendWildcard method. This way if it hears anything else after it has already heard enough to recognize a command it will process that command and not get caught up in the extra speech. One example is if someone said "please" after the command. My goal is to simulate natural speech as much as possible. It isn't possible with an infinite set of commands, but with a limited set there are also limited variations with normal speech. 
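    To make the trailing-wildcard idea concrete, here is a toy Python model of a grammar as a sequence of slots. It is not how the speech engine matches audio, and it assumes a wildcard may absorb zero or more words, which may differ from the real AppendWildcard behaviour; the names `WILDCARD` and `matches` are made up for this sketch.

```python
WILDCARD = object()  # stand-in for a slot added with AppendWildcard


def matches(pattern, words):
    """Return True if the word list fits the pattern.

    pattern is a list whose items are either a set of accepted words
    (one slot of the grammar) or WILDCARD, which here absorbs zero or
    more words.
    """
    if not pattern:
        return not words
    head, rest = pattern[0], pattern[1:]
    if head is WILDCARD:
        # Try every possible number of absorbed words.
        return any(matches(rest, words[i:]) for i in range(len(words) + 1))
    return bool(words) and words[0] in head and matches(rest, words[1:])


strict = [{"xbox"}, {"play", "pause"}]
relaxed = strict + [WILDCARD]  # same command, tolerant of trailing speech
```

    With these patterns, "xbox play please" fits `relaxed` but not `strict`, while "xbox stop" fits neither, which mirrors the "please" example above.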


    Thursday, September 1, 2011 6:48 PM
  • Thanks for all those great suggestions! I've just added the confidence check to my code and turned down the volume on the mic to around 65%. These two things have drastically reduced the false positives I was getting. I'm also setting the sensor a little further from where I am sitting, which also seems to help. Before it was about a foot away, now it's about two feet. I'll take a look at appending wildcards to the ends of grammars to see if that helps even more.

    Thursday, September 1, 2011 8:15 PM
  • I don't suppose you know if I can get the text representation of what was said that matched the wildcard, do you? Right now it comes back with an ellipsis in that position.

    Friday, September 2, 2011 12:26 AM
  • Unfortunately I'm 99% sure that the answer to that is NO.  To get the text back you would need to use AppendDictation instead of AppendWildcard.  This is possible with System.Speech, but not with Microsoft.Speech.  Unfortunately if you want to use the Kinect with its array steering and specially tweaked language/sound model you need to use Microsoft.Speech, which means you are limited in a number of ways.  AFAIK you also have no access to learning, or languages other than English.  You can use Kinect as a regular microphone with a program that uses System.Speech (e.g. VoxCommando) and then you will have access to learning, dictation, other languages, and possibly other benefits, but if you are going to do that, then the Kinect is actually a rather poor choice for a microphone.

    I wonder if you might be able to get the whole sentence that was spoken, though. Does result.Text work, or does that return ellipses as well?


    • Proposed as answer by Scripticus Friday, September 2, 2011 3:20 PM
    Friday, September 2, 2011 1:02 AM
  • Thanks for posting these findings. I too am getting lots of false positives. I was wondering if there might be another way to solve this problem. Could one modify the grammar to accept a Wildcard that would catch all the junk noise? It seems like restricting the grammar to just a few choices will force the recognizer to choose amongst those choices, even though none of them are any good.

    This is why I suggested using a "honey pot" prefix that would be rejected if heard.  Also, if you are going to use a prefix, it should be marked as optional, and only commands that contain the prefix should be acted on.  In other words, no prefix is another type of "honey pot".  These "honey pot" commands should be recognized by your grammar, but then rejected by your code, just as you could reject something with too low a confidence.

    There has been some talk of wildcards.  To be clear, if you are going to use a wildcard, it is only going to make your commands MORE FLEXIBLE (i.e. allow for more natural language - as suggested).  It will not help to reduce false positives, but in fact will probably make them occur more frequently.
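    The "recognized by the grammar, rejected by the code" step for honey pots could look like the Python sketch below. It assumes the recognizer hands back the matched phrase as text; the function name is hypothetical, and the prefix and honey-pot words are the examples suggested earlier in this thread.

```python
def filter_command(recognized_text, wake_prefix="julia",
                   honey_pots=("jewels", "hulio")):
    """Return (command, reason); command is None when the phrase is rejected."""
    words = recognized_text.lower().split()
    if not words:
        return None, "empty"
    first, rest = words[0], words[1:]
    if first in honey_pots:
        # The grammar matched a decoy prefix: treat it as noise.
        return None, "honey pot"
    if first != wake_prefix:
        # The optional prefix was absent, which is also a decoy case.
        return None, "no prefix"
    if not rest:
        return None, "prefix only"
    return " ".join(rest), "accepted"
```

    Returning the reason alongside the command makes it easy to log how often each decoy fires, which helps when choosing honey-pot words.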


    Friday, September 2, 2011 1:08 AM
  • You are right. The confidence value currently output by the API correlates only loosely with the quality of recognition, so using some confidence threshold instead of none will give slightly better results. However, the algorithms have not yet been trained properly to guarantee the accuracy of this confidence value, so you should treat this as a general inaccuracy: most confidence values output will be incorrect by an unpredictable amount.
    Friday, September 2, 2011 3:00 AM
  • I've developed a little project named S.A.R.A.H. (http://encausse.net/s-a-r-a-h/) with Kinect and have the same issue.

    One of the commands is recognized every 10 minutes with a high confidence level (> 0.85).

    The sentence is: "Sarah, what time is it".

    It is really strange, because sometimes it is triggered in a room with no sound, only white noise!

    I have set a "weight" on the word "Sarah", but the behaviour is still strange. Maybe there is a clever way to fix it?


    Wednesday, October 10, 2012 11:22 AM
  • Are you using the latest Kinect SDK? What you are describing was common with the betas, but not the commercial releases.
    Wednesday, October 10, 2012 6:14 PM
  • Yes, I use the latest SDK, 1.5 (not the one released 2 days ago).

    Nothing is bound to a given release:
    - I use common SAPI
    - And Skeleton sensor

    Thursday, October 11, 2012 7:39 AM
  • That's your problem: the Kinect SDK is built for the new speech server, not SAPI. You need to use the built-in SDK methods...
    Thursday, October 11, 2012 7:42 AM
  • Thanks for your answers. Which methods should I use?

    I have a little code for Kinect using KinectAudioSource (line 76):
    https://github.com/JpEncausse/WSRMacro/blob/master/WSRMacro/WSRKinectMacro.cs

    Then I use a standard SpeechRecognitionEngine (line 237):
    https://github.com/JpEncausse/WSRMacro/blob/master/WSRMacro/WSRMacro.cs

    I didn't see any SpeechRecognitionEngine in the Kinect SDK?


    Friday, October 12, 2012 12:07 PM