UCMA, Speech Server, SDK, SAPI, TellMe - What to use to allow web applications to perform interactive TTS and ASR

    Question

  • Hello All,

     

    We have spent a fair bit of time searching the web and this site. Our requirement is to integrate speech engines, mainly TTS (and ASR in the future), so that they can be driven from a web application written in C# or similar, allowing users of the web GUI front end to perform interactive TTS conversions of the text they type into a form field/text box. Many of the TTS engines we've tried so far (Nuance, Loquendo, Cepstral, Festival, etc.) offer a number of voice fonts. However, if a user types a non-English name in Roman script, e.g. a Chinese name like Zheng, an Indian name like Rakesh, or a Russian name like Vitaly, a regular English voice mispronounces it so severely that it's hard to even tell what was said. The only workaround is to type the text phonetically; for example, Vitaly needs to be written as Vi-Taa-Lee for it to be pronounced correctly.

    So we want non-English users to be able to try these combinations out, and when they like the result, they will click "save" and the app will store the generated audio stream (in whatever format the speech server produces, e.g. a WAV or an MP3) in a SQL Server database for later access. We're assuming that for this we need an API and/or an SDK to interact with the speech engine.

    But there are so many products and versions in the Microsoft portfolio that it's hard to tell how they fit together. So many of the terms mentioned in the subject are floating around, and we can't find one place that explains the differences between them, what exists, what doesn't, and which one we're supposed to use. All we understand is that UCMA 3.0 includes the entire framework along with the speech API and allows applications to interact with the new Lync Server, and that Speech Server 2007 is now dead.

     

    What we don't know is how the SDK, SAPI 5.4, and Tellme fit in. Can any of these be used for our purpose? It's also likely that we'll soon introduce a service with ASR over the phone, so we'll need the speech platform to talk to a session border controller or some other SIP-based platform.

     

    We didn't know what forum to post this in, so we're posting it in this one. Would anyone be able to help?

     

    Thanks so much,

    TWG Support

    Saturday, February 12, 2011 1:26 AM

Answers

  • Sorry to let your question languish for so long! I don't frequent this site very often.

    For your first question, the UCMA 3.0 SDK provides the Recorder and Player classes that can be used in your scenario. The UCMA 3.0 SDK has a number of Quickstart samples. One of these samples features the Recorder class, and another features the Player class.
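
    Roughly, the Recorder side looks like the C# sketch below. This is only an outline, not the Quickstart code: it assumes an AudioVideoFlow that your UCMA 3.0 application has already established, and the method names and output path are placeholders. The Player class follows a similar attach/start pattern, using a WmaFileSource instead of a sink.

        using Microsoft.Rtc.Collaboration.AudioVideo;

        class RecorderSketch
        {
            // Assumes an AudioVideoFlow that has already been established
            // elsewhere in the UCMA 3.0 application.
            public static Recorder StartRecording(AudioVideoFlow audioVideoFlow, string outputPath)
            {
                Recorder recorder = new Recorder();
                recorder.AttachFlow(audioVideoFlow);
                recorder.SetSink(new WmaFileSink(outputPath)); // records the flow's audio to a WMA file
                recorder.Start();
                return recorder;
            }

            public static void StopRecording(Recorder recorder)
            {
                recorder.Stop();
                recorder.DetachFlow();
                // The resulting .wma file can then be read into a byte[] and
                // saved to SQL Server for later playback.
            }
        }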

    For your second question, a set of technical articles was recently published on MSDN that might be of help to you: Using Speech Recognition in UCMA 3.0 and Lync 2010. Here's a link to the first article in the five-part series: http://msdn.microsoft.com/en-us/library/gg986848.aspx. I'm not sure if this is an exact fit for what you're asking, but maybe some of the ideas discussed will be helpful.

    The articles discuss two applications - a UCMA 3.0 application that handles the speech recognition, and a Lync 2010 Silverlight application that presents a user interface. The Lync user speaks a phrase, and this audio goes to the UCMA application, into a SpeechRecognitionEngine that has loaded a grammar. If the user's speech matches the grammar, the UCMA application sends the appropriate data back to the Lync user.
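
    The recognition part of that pipeline looks roughly like the following C# sketch. It is an illustration only: it assumes an already-established AudioVideoFlow, and the sample phrases and the 8 kHz audio format are choices made for the sketch, not values taken from the articles.

        using System;
        using System.Globalization;
        using Microsoft.Rtc.Collaboration.AudioVideo;
        using Microsoft.Speech.AudioFormat;
        using Microsoft.Speech.Recognition;

        class RecognitionSketch
        {
            // Wires a UCMA audio flow into a Microsoft.Speech recognizer with a small grammar.
            public static void StartRecognition(AudioVideoFlow audioVideoFlow)
            {
                SpeechRecognitionConnector connector = new SpeechRecognitionConnector();
                connector.AttachFlow(audioVideoFlow);
                SpeechRecognitionStream stream = connector.Start();

                SpeechRecognitionEngine engine = new SpeechRecognitionEngine(new CultureInfo("en-US"));
                engine.LoadGrammar(new Grammar(new GrammarBuilder(new Choices("sales", "support", "billing"))));

                engine.SetInputToAudioStream(
                    stream,
                    new SpeechAudioFormatInfo(8000, AudioBitsPerSample.Sixteen, AudioChannel.Mono));

                engine.SpeechRecognized += (sender, e) =>
                    Console.WriteLine("Recognized: " + e.Result.Text);

                engine.RecognizeAsync(RecognizeMode.Multiple);
            }
        }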

    Like I say, I'm not sure how closely this matches what you're asking. If I'm off base here, post another question and I'll see if I can find a better answer (and sooner than 3 months!)

    Mark

    Tuesday, May 17, 2011 7:01 PM

All replies

  • First, a little about the various speech-related SDKs:

    · SAPI 5.4 (and other versions) – Speech API – intended for use in C++ COM applications.

    · Speech Server 2007 – a product that was discontinued a while back. Much of the functionality is present in two other namespaces: Microsoft.Speech and System.Speech.

    · Microsoft.Speech – geared toward .NET Framework applications running on server operating systems, such as Windows Server 2008 and Windows Server 2008 R2.

    · System.Speech – geared toward .NET Framework applications running on client platforms, such as Windows Vista and Windows 7.

    · Tellme – a Microsoft subsidiary that hosts IVR applications using VoiceXML.

     

    Microsoft Unified Communications Managed API 3.0 (UCMA 3.0) SDK includes two classes that provide the necessary plumbing so that speech recognition and speech synthesis can be used in UCMA applications. These classes are SpeechRecognitionConnector and SpeechSynthesisConnector; both are in the Microsoft.Rtc.Collaboration.AudioVideo namespace. The UCMA 3.0 SDK has sample Quickstart applications that show how these classes can be used. With the UCMA 3.0 SDK it’s relatively easy to write applications that run against Microsoft Lync Server 2010 and perform speech recognition and/or TTS. Note that UCMA and Microsoft.Speech are managed .NET libraries, but they do not ship with the .NET Framework itself.
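
    As a rough illustration, the synthesis connector is typically wired to a Microsoft.Speech synthesizer along these lines; the flow variable, method name, and the 16 kHz format are assumptions for the sketch, not prescribed values.

        using Microsoft.Rtc.Collaboration.AudioVideo;
        using Microsoft.Speech.AudioFormat;
        using Microsoft.Speech.Synthesis;

        class TtsConnectorSketch
        {
            // Assumes an AudioVideoFlow that is already active (established elsewhere in the application).
            public static void SpeakToCaller(AudioVideoFlow audioVideoFlow, string text)
            {
                SpeechSynthesisConnector connector = new SpeechSynthesisConnector();
                connector.AttachFlow(audioVideoFlow);

                SpeechSynthesizer synthesizer = new SpeechSynthesizer();
                SpeechAudioFormatInfo format =
                    new SpeechAudioFormatInfo(16000, AudioBitsPerSample.Sixteen, AudioChannel.Mono);

                // The connector derives from Stream, so it can serve as the synthesizer's output.
                synthesizer.SetOutputToAudioStream(connector, format);

                connector.Start();
                synthesizer.Speak(text);
                connector.Stop();
                connector.DetachFlow();
            }
        }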

     

    Since it seems that you are primarily interested in speech synthesis (TTS), installing the appropriate TTS engines might be what you’re looking for. Microsoft provides TTS engines for these languages and locales:

    · Catalan (Spain)

    · Chinese (separate versions for China, Hong Kong, and Taiwan)

    · Danish

    · Dutch

    · English (separate versions for Australia, Canada, India, and US)

    · German

    · Finnish

    · French (separate versions for Canada and France)

    · Italian

    · Japanese

    · Korean

    · Norwegian (Bokmål)

    · Polish

    · Portuguese (separate versions for Brazil and Portugal)

    · Russian

    · Spanish (separate versions for Spain and Mexico)

    · Swedish

     

    Here’s the link to the download site for recognition engines and TTS engines: http://www.microsoft.com/downloads/en/details.aspx?FamilyID=47ffd4e5-e682-4228-8058-dd895252a3c3&displaylang=en.
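
    Once the runtime and a TTS engine are installed, a standalone .NET application (no Lync or UCMA involved) can drive synthesis roughly as follows. This is only a sketch under a few assumptions: the Microsoft.Speech.Synthesis API closely mirrors System.Speech.Synthesis, and the voice hints, rate/volume values, and phonetic spelling are chosen purely for illustration. Pitch is not a direct property; it can be controlled through SSML prosody via SpeakSsml.

        using System;
        using System.Globalization;
        using System.IO;
        using Microsoft.Speech.Synthesis;

        class WebTtsSketch
        {
            // Synthesizes text to an in-memory WAV buffer; a web application could
            // return these bytes for preview and/or store them in SQL Server
            // (for example in a varbinary(max) column) when the user clicks "save".
            public static byte[] Synthesize(string text, CultureInfo culture, int rate, int volume)
            {
                using (SpeechSynthesizer synthesizer = new SpeechSynthesizer())
                using (MemoryStream buffer = new MemoryStream())
                {
                    synthesizer.SelectVoiceByHints(VoiceGender.NotSet, VoiceAge.NotSet, 0, culture);
                    synthesizer.Rate = rate;      // -10 .. 10
                    synthesizer.Volume = volume;  // 0 .. 100
                    synthesizer.SetOutputToWaveStream(buffer);
                    synthesizer.Speak(text);
                    return buffer.ToArray();
                }
            }

            public static void Main()
            {
                byte[] wav = Synthesize("Vi-Taa-Lee", new CultureInfo("en-US"), 0, 100);
                File.WriteAllBytes("preview.wav", wav);
                Console.WriteLine("Wrote {0} bytes of audio", wav.Length);
            }
        }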

     

    Here’s a link to the download site for UCMA 3.0 SDK and Speech Platform Server SDK: http://www.microsoft.com/en-us/Tellme/developers/default.aspx#tab=server.

    Monday, February 21, 2011 8:11 PM
  • Thank you so much, Mark. That was very helpful.

    Just two quick questions to follow up

    a) With Microsoft.Speech, can we deploy the TTS engines on a Windows Server 2008 R2 machine and have our .NET application use them directly to provide synthesis as described above? That is, a user has an interactive form where they type their text, select a language, rate, pitch, volume, etc., and click "listen" to hear what it sounds like; when they are happy, they click "save" and the synthesized audio is saved in the backend SQL database as a BLOB or in some other format.

    b) When we move on to deploying ASR, how would we get the border controller or softswitch (any SIP-based platform) to communicate with the Microsoft speech system, so that the caller's speech is sent to the Microsoft ASR engine for recognition and the result is passed to another application that acts on it, e.g. reads a file, plays a sound, or places a call? Somewhere, something needs to speak SIP/MRCPv2 so that we can forward the call to that platform and hook into the caller's audio/RTP stream. Do we then need to install a Lync server and have a .NET application get the result from it?

     

    Thanks again for helping. The original question sat for over a week before you responded; hope this one won't take as long ;)

    Cheers

    Monday, February 21, 2011 10:14 PM