How good is the SpeechRecognitionEngine?

  • Question

  • I have a client who wants to have speech to text enabled in their application. What they want to do is to dictate observations into a Surface Pro's microphone and have the speech turned into text. Microsoft supplies a SpeechRecognitionEngine which I tried but, honestly, the results are embarrassingly unusable. Either the poor thing has been massively mishandled or it just isn't up to handling a set of words in the 10,000 range. Does anyone have commercial experience with this speech to text engine? Alternatively, are there other STT engines that are available at a reasonable price that anyone has experience with?

    Richard Lewis Haggard

    Wednesday, September 25, 2019 1:23 AM

Answers

  • After spending more time than is reasonable on this problem, I can now answer that the SpeechRecognitionEngine is really good at handling a small number of words and utterly useless as a general speech-to-text engine. If you need something other than an automated phone menu system, then you need to go elsewhere. I'm trying Dragon NaturallySpeaking next.

    Richard Lewis Haggard

    Sunday, October 13, 2019 8:41 PM

All replies

  • Hi Richard,

    Thank you for posting here.

    >>Microsoft supplies a SpeechRecognitionEngine which I tried but, honestly, the results are embarrassingly unusable.

    I would like to know what happened when you used SpeechRecognitionEngine.

    >>Does anyone have commercial experience with this speech to text engine?

    Do you want to find a third-party product to do it? If so, you could ask for help here.

    Note: This response contains a reference to a third-party World Wide Web site. Microsoft is providing this information as a convenience to you. Microsoft does not control these sites and has not tested any software or information found on these sites; therefore, Microsoft cannot make any representations regarding the quality, safety, or suitability of any software or information found there. There are inherent dangers in the use of any software found on the Internet, and Microsoft cautions you to make sure that you completely understand the risk before retrieving any software from the Internet.

    Best Regards,

    Jack



    Wednesday, September 25, 2019 7:39 AM
    Moderator
  • How are other people using this class? How many words can it work with reasonably?

    In my case the idea is to let a user work out in the field with a touchpad-class computer. He enters dictation mode and begins speaking freeform, describing what he sees.

    Here's a somewhat shortened version of the code that I've implemented to accomplish that. The app is a WPF/MVVM/Prism/Unity architecture.

    When the page is activated, Recognizer, a SpeechRecognitionEngine object, is created and initialized. The initialization process reads a text file of the 10,000 most common English words and populates a GrammarBuilder with the file's contents. The GrammarBuilder is associated with Recognizer. Input is set to the default audio device. The AudioStateChanged and SpeechRecognized events are subscribed to. AudioStateChanged is used to give the user some visual feedback as to what the Recognizer is doing. SpeechRecognized is used to append the newly recognized word to the previous words.

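    // Requires: using System; using System.IO; using System.Speech.Recognition;
    protected void SpeechToTextInitialize()
    {
        try
        {
            // One word per line; the file holds the 10,000 most common English words.
            string[] words = File.ReadAllLines(@"google-10000-english-usa.txt");

            // Add every word as an alternative, skipping entries that contain a double quote.
            Choices valueChoices = new Choices();
            foreach (string word in words)
            {
                if (word.IndexOf('"') < 0)
                    valueChoices.Add(word.Trim());
            }

            // A grammar built from a single Choices element matches one of these words per utterance.
            GrammarBuilder grammarBuilder = new GrammarBuilder();
            grammarBuilder.Append(valueChoices);

            Recognizer.LoadGrammar(new Grammar(grammarBuilder));
            Recognizer.SetInputToDefaultAudioDevice();

            // AudioStateChanged drives the visual feedback; SpeechRecognized appends each result.
            Recognizer.AudioStateChanged += new EventHandler<AudioStateChangedEventArgs>(Recognizer_AudioStateChanged);
            Recognizer.SpeechRecognized += new EventHandler<SpeechRecognizedEventArgs>(Recognizer_SpeechRecognized);
        }
        catch (Exception ex)
        {
            Console.WriteLine(ex);
        }
    }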

    A button on the UI is pressed. Its Command handler calls Recognizer.RecognizeAsync.

    When the SpeechRecognized event is raised, the recognized word is appended to the current collection of recognized words.

    Another button stops the recognition by calling Recognizer.RecognizeAsyncStop.
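
    The start, append, and stop steps described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the application's actual code: the DictationSession class and its member names are hypothetical, the engine is assumed to already have a grammar loaded and its audio input set, and RecognizeMode.Multiple is an assumption, since the description above does not say which RecognizeAsync overload is called.

    ```csharp
    using System;
    using System.Speech.Recognition;
    using System.Text;

    // Hypothetical helper; the class and member names are illustrative only.
    public class DictationSession
    {
        private readonly SpeechRecognitionEngine recognizer;
        private readonly StringBuilder transcript = new StringBuilder();

        public DictationSession(SpeechRecognitionEngine recognizer)
        {
            // Assumes LoadGrammar and SetInputToDefaultAudioDevice were already called.
            this.recognizer = recognizer;
            this.recognizer.SpeechRecognized += OnSpeechRecognized;
        }

        // Start button handler. Assumption: Multiple keeps the engine listening
        // after each result instead of stopping after the first one.
        public void Start()
        {
            recognizer.RecognizeAsync(RecognizeMode.Multiple);
        }

        // Stop button handler: stops after the current recognition operation completes.
        public void Stop()
        {
            recognizer.RecognizeAsyncStop();
        }

        // Everything recognized so far, separated by single spaces.
        public string Transcript
        {
            get { lock (transcript) { return transcript.ToString(); } }
        }

        // Appends each recognized phrase to the running transcript.
        private void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
        {
            lock (transcript)
            {
                if (transcript.Length > 0)
                    transcript.Append(' ');
                transcript.Append(e.Result.Text);
            }
        }
    }
    ```

    The lock is there because the handler may not run on the same thread that reads the transcript; in a WPF view model the result would more likely be marshalled to the UI thread instead.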

    My suspicion at the moment is that the recognizer is simply limited in how many words can exist in its grammar, because when I attempt to use the recognizer the results are unusable. One word out of twenty results in a recognized response, and that word is almost invariably incorrect. Is this something that other people have run into? Am I simply asking too much of the poor beast, or is there some magic that I'm ignoring?


    Richard Lewis Haggard

    Friday, September 27, 2019 4:09 PM
  • I tried reducing the size of the grammar to the 1,000 most common English words. Now the success rate has gone up to maybe one in ten, which is consistent with the engine only being able to work with a small vocabulary.

    Richard Lewis Haggard

    Friday, September 27, 2019 4:55 PM
  • The last entry in that forum on STT was in 2015. Not very useful, but thanks anyway.

    Richard Lewis Haggard

    Sunday, September 29, 2019 8:48 PM