Speech Recognition - Responses?

  • Question

  • Wondering if someone could help me with a few questions I have about speech recognition. I'm using the Kinect as my input and have some code that I want to build on.
     [code] using System;
     using System.IO;
     using System.Linq;
     using Microsoft.Research.Kinect.Audio;  // KinectAudioSource, SystemMode (Kinect SDK beta)
     using Microsoft.Speech.AudioFormat;     // SpeechAudioFormatInfo, EncodingFormat
     using Microsoft.Speech.Recognition;     // SpeechRecognitionEngine, Grammar, Choices

     namespace WpfApplication1
    {
        class Program
        {
            private const string RecognizerId = "SR_MS_en-US_Kinect_10.0";
            static void Main(string[] args)
            {
                using (var source = new KinectAudioSource())
                {
                    source.FeatureMode = true;
                    source.AutomaticGainControl = false; //Important to turn this off for speech recognition
                    source.SystemMode = SystemMode.OptibeamArrayOnly; //No AEC for this sample
                    RecognizerInfo ri = SpeechRecognitionEngine.InstalledRecognizers().Where(r => r.Id == RecognizerId).FirstOrDefault();
                    if (ri == null)
                    {
                        Console.WriteLine("Could not find speech recognizer: {0}. Please refer to the sample requirements.", RecognizerId);
                        return;
                    }
                    Console.WriteLine("Using: {0}", ri.Name);
                    using (var sre = new SpeechRecognitionEngine(ri.Id))
                    {
                        
                        var colors = new Choices();
                        colors.Add("red");
                        colors.Add("green");
                        colors.Add("blue");
                        colors.Add("yellow");
                        colors.Add("orange");
                        colors.Add("brown");
                        colors.Add("black");
                        colors.Add("white");
                        colors.Add("pink");
                        colors.Add("go home");
                        colors.Add("Moose");
                        colors.Add("computer");
                        
                      
                        var gb = new GrammarBuilder();
                        //Specify the culture to match the recognizer in case we are running in a different culture.                                 
                        gb.Culture = ri.Culture;
                        gb.Append(colors);
                        // Create the actual Grammar instance, and then load it into the speech recognizer.
                        var g = new Grammar(gb);
                        sre.LoadGrammar(g);
                        sre.SpeechRecognized += SreSpeechRecognized;
                        sre.SpeechHypothesized += SreSpeechHypothesized;
                        sre.SpeechRecognitionRejected += SreSpeechRecognitionRejected;
                        using (Stream s = source.Start())
                        {
                            sre.SetInputToAudioStream(s,
                                                      new SpeechAudioFormatInfo(
                                                          EncodingFormat.Pcm, 16000, 16, 1,
                                                          32000, 2, null));
                            Console.WriteLine("Recognizing. Say: 'red', 'green' or 'blue'. Press ENTER to stop");
                            sre.RecognizeAsync(RecognizeMode.Multiple);
                            Console.ReadLine();
                            Console.WriteLine("Stopping recognizer ...");
                            sre.RecognizeAsyncStop();
                        }
                    }
                }
            }
            static void SreSpeechRecognitionRejected(object sender, SpeechRecognitionRejectedEventArgs e)
            {
                Console.WriteLine("\nSpeech Rejected");
                if (e.Result != null)
                    DumpRecordedAudio(e.Result.Audio);
            }
            static void SreSpeechHypothesized(object sender, SpeechHypothesizedEventArgs e)
            {
                Console.Write("\rSpeech Hypothesized: \t{0}", e.Result.Text);
            }
            static void SreSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
            {
                //This first release of the Kinect language pack doesn't have a reliable confidence model, so 
                //we don't use e.Result.Confidence here.
                Console.WriteLine("\nSpeech Recognized: \t{0}", e.Result.Text);
            }
            private static void DumpRecordedAudio(RecognizedAudio audio)
            {
                if (audio == null) return;
                int fileId = 0;
                string filename;
                while (File.Exists((filename = "RetainedAudio_" + fileId + ".wav")))
                    fileId++;
                Console.WriteLine("\nWriting file: {0}", filename);
                using (var file = new FileStream(filename, System.IO.FileMode.CreateNew))
                    audio.WriteToWaveStream(file);
            }
        }
    }
     [/code] 
    1. How could the voice recognition grammar phrases (e.g. green, blue, go home, moose, computer) confirm when they are recognized and trigger a response, whether a verbal response or an action?
     
    e.g.:
    [code] var colors = new Choices();
                        colors.Add("red");
                        colors.Add("pink");
                        colors.Add("go home");
                        colors.Add("Moose");
                        colors.Add("computer"); [/code]
     [code] Console.WriteLine("Recognizing. Say: 'red', 'green' or 'blue'. Press ENTER to stop");
                            sre.RecognizeAsync(RecognizeMode.Multiple);
                            Console.ReadLine();
                            Console.WriteLine("Stopping recognizer ...");
                            sre.RecognizeAsyncStop(); [/code]
                        
    When a phrase is recognized, the program prints the recognized phrase. How would I use this data to produce a verbal response rather than just printing it? Any help would be greatly appreciated.
    e.g.:
        [code]     static void SreSpeechRecognitionRejected(object sender, SpeechRecognitionRejectedEventArgs e)
            {
                Console.WriteLine("\nSpeech Rejected");
                if (e.Result != null)
                    DumpRecordedAudio(e.Result.Audio);
            }
            static void SreSpeechHypothesized(object sender, SpeechHypothesizedEventArgs e)
            {
                Console.Write("\rSpeech Hypothesized: \t{0}", e.Result.Text);
            }
            static void SreSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
            {
                //This first release of the Kinect language pack doesn't have a reliable confidence model, so 
                //we don't use e.Result.Confidence here.
                Console.WriteLine("\nSpeech Recognized: \t{0}", e.Result.Text); [/code] 
    Would it be possible to add more grammar choices that were not included in colors? Would that mean building a new Grammar instance?
     [code] // Create the actual Grammar instance, and then load it into the speech recognizer.
                        var g = new Grammar(gb);
                        sre.LoadGrammar(g);
                        sre.SpeechRecognized += SreSpeechRecognized;
                        sre.SpeechHypothesized += SreSpeechHypothesized;
                        sre.SpeechRecognitionRejected += SreSpeechRecognitionRejected;
                        using (Stream s = source.Start())
                        {
                            sre.SetInputToAudioStream(s,
                                                      new SpeechAudioFormatInfo(
                                                          EncodingFormat.Pcm, 16000, 16, 1,
                                                          32000, 2, null));
                            Console.WriteLine("Recognizing. Say: 'red', 'green' or 'blue'. Press ENTER to stop");
                            sre.RecognizeAsync(RecognizeMode.Multiple);
                            Console.ReadLine();
                            Console.WriteLine("Stopping recognizer ...");
                            sre.RecognizeAsyncStop(); [/code]
    Sunday, June 19, 2011 6:02 PM

Answers

  • As you may have noticed, 'SreSpeechRecognized' is called when something was recognized. You can put other logic in there. If you want to make audio responses, you could either use a speech-synthesis (text-to-speech) mechanism (if you have many different answers), assemble responses from pre-recorded audio files (like your car navigation system does), or play full pre-recorded sentences.
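
    For the text-to-speech route, here is a rough sketch of what that handler could look like using the SpeechSynthesizer from System.Speech.Synthesis (note: that is the desktop speech API, a separate assembly from the Microsoft.Speech recognizer the Kinect sample uses; the phrases and responses below are made-up examples):

     [code] using System.Speech.Synthesis;

            // Create the synthesizer once (e.g. as a static field) so the handler can reuse it.
            static SpeechSynthesizer synth = new SpeechSynthesizer();

            static void SreSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
            {
                Console.WriteLine("\nSpeech Recognized: \t{0}", e.Result.Text);

                // Branch on the recognized text to pick a spoken response or an action.
                switch (e.Result.Text)
                {
                    case "go home":
                        synth.SpeakAsync("Going home now."); // verbal response
                        // ...or call whatever method performs the "go home" action here
                        break;
                    case "computer":
                        synth.SpeakAsync("Yes?");
                        break;
                    default:
                        synth.SpeakAsync("You said " + e.Result.Text);
                        break;
                }
            } [/code]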

    There is a more complex grammar mechanism which I don't understand (yet). You can find an example in the 'Sample Shape Game' (located in users\public\documents...). From my understanding, you can define a grammar either in code or in an XML file.
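
    For the XML route, the grammar file is written in SRGS format. A minimal sketch (the file name colors.grxml is just an assumption):

     [code] <?xml version="1.0" encoding="utf-8"?>
            <grammar version="1.0" xml:lang="en-US" root="colorRule"
                     xmlns="http://www.w3.org/2001/06/grammar">
              <rule id="colorRule">
                <one-of>
                  <item>red</item>
                  <item>green</item>
                  <item>go home</item>
                </one-of>
              </rule>
            </grammar> [/code]

    The Grammar class has a constructor that takes a file path, so it could be loaded in place of the GrammarBuilder version:

     [code] var g = new Grammar("colors.grxml"); // load the SRGS file from disk
            sre.LoadGrammar(g); [/code]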

    Please correct me if I'm wrong (it's just a guess): the SpeechRecognitionEngine recognizes only one word from a Grammar instance at a time. If you have one colors grammar instance and one with numbers, and you say "red green twelve", would it recognize something like "red" and then "green twelve"?
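
    If you do load several grammars, each recognition result comes from a single grammar, and the event args tell you which one matched. A sketch (grammar names and word lists are made up):

     [code] var colorsGb = new GrammarBuilder(new Choices("red", "green", "blue"));
            var numbersGb = new GrammarBuilder(new Choices("one", "twelve"));
            sre.LoadGrammar(new Grammar(colorsGb) { Name = "colors" });
            sre.LoadGrammar(new Grammar(numbersGb) { Name = "numbers" });

            static void SreSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
            {
                // e.Result.Grammar identifies which loaded grammar produced this match.
                Console.WriteLine("{0} (from grammar '{1}')", e.Result.Text, e.Result.Grammar.Name);
            } [/code]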

    Sunday, June 19, 2011 6:47 PM

All replies

  • The SpeechRecognitionEngine would recognize one word at a time if your grammars are composed entirely of single words, but if your grammar includes phrases such as "make button yellow", then that would be recognized as a whole unit rather than as each of the individual words.
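
    For example, a phrase grammar along those lines can be built with the same GrammarBuilder approach used in the sample, by appending fixed words and then a Choices set (a sketch; the phrase and colors are illustrative):

     [code] // Recognized as one unit: e.Result.Text would be e.g. "make button yellow".
            var buttonColors = new Choices("yellow", "red", "blue");
            var phrase = new GrammarBuilder();
            phrase.Culture = ri.Culture; // match the recognizer culture, as in the sample
            phrase.Append("make button");
            phrase.Append(buttonColors);
            sre.LoadGrammar(new Grammar(phrase)); [/code]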

    The help topic for Grammar is this: http://msdn.microsoft.com/en-us/library/microsoft.speech.recognition.grammar(v=office.13).aspx

    And if you need deeper support regarding the speech APIs, the developer center for speech technologies is here: http://msdn.microsoft.com/en-us/speech/dd393287.

    Eddy

     


    I'm here to help
    Wednesday, July 13, 2011 1:44 AM