Question regarding Windows Speech API!

  • Question

  • Hello,

    I have a high-quality recording, several hours long, of a single person speaking in English. I also have its transcripts. Some sentences in the speech are of interest to me: I want to find each of them in the audio, extract it, and create a separate audio file for each sentence.

    A simple algorithm I have in mind: I will train the speech recognizer using my audio and its transcripts, then use the trained recognizer to search the audio for my desired sentences. I will only search for sentences that are present in the speech. The input will therefore be a sentence, and the output will be the start time and end time of that sentence in the audio file. I will then use the start and end times to extract the audio segment from the audio file. Please note that I am not talking about text-to-speech conversion; I want to extract the real human voice from the audio file.

    I am an experienced C++ programmer, but I have no experience with speech processing or MP3 file manipulation.

    Questions:
    1) Is Windows SAPI capable of doing this type of work?
    2) If yes, which part of SAPI do I need to study or focus on?
    3) Would I need an MP3 manipulation library?
    4) If yes, please suggest which one I should use.

    regards

    • Edited by RS. Haider Tuesday, August 24, 2010 6:34 AM
    Sunday, August 22, 2010 4:42 PM
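
The last step of the plan above (cutting a segment out of the recording once its start and end times are known) can be sketched independently of any speech API. This is a minimal sketch, not a SAPI example: it assumes the recording has already been converted to an uncompressed PCM WAV file (an MP3 would need to be decoded first with a separate tool), and the function name `extract_segment` and its parameters are hypothetical, chosen only for illustration. It uses Python's standard `wave` module.

```python
import wave


def extract_segment(src_path, dst_path, start_s, end_s):
    """Copy the audio between start_s and end_s (in seconds) from a PCM WAV
    file into a new WAV file. Returns the number of frames written.

    Assumption: src_path is an uncompressed PCM WAV file.
    """
    if start_s < 0 or end_s <= start_s:
        raise ValueError("need 0 <= start_s < end_s")

    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        total = src.getnframes()
        # Convert times to frame indices, clamped to the file length.
        first = min(round(start_s * rate), total)
        last = min(round(end_s * rate), total)
        channels = src.getnchannels()
        width = src.getsampwidth()
        src.setpos(first)
        frames = src.readframes(last - first)

    with wave.open(dst_path, "wb") as dst:
        # The output keeps the source format; only the length changes.
        dst.setnchannels(channels)
        dst.setsampwidth(width)
        dst.setframerate(rate)
        dst.writeframes(frames)

    return last - first
```

Calling `extract_segment("speech.wav", "sentence_001.wav", 12.4, 15.9)` (file names hypothetical) would write roughly 3.5 seconds of audio to the new file. Working on frame indices rather than byte offsets keeps the cut aligned to whole samples for any channel count or sample width.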

Answers

  • You may want to post the question on a SAPI forum.  Your application is complex.  Not only do you want to do speech-to-text recognition, you also want to track the start and end times of each word so that you can correlate them with the known transcript.  I'm not sure the recognizer raises events carrying these timings for your application to track.  Also, you want to do this with a pre-recorded file rather than live input from the microphone, which is not the default in any speech sample that I am aware of.  Have you started playing with any of the SAPI samples at all?
     
    You posted this under the VC forum, but I really recommend you write your app in .NET because the SAPI interfaces are much easier to use from .NET.
     
    Good luck,
    David
     

    Efficiently access this forum with newsreaders like WLM, Thunderbird, and Forte Agent: http://communitybridge.codeplex.com
    • Proposed as answer by Jesse Jiang Wednesday, August 25, 2010 8:34 AM
    • Marked as answer by Jesse Jiang Monday, August 30, 2010 12:52 AM
    Monday, August 23, 2010 2:24 PM
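
The correlation step mentioned in the answer (matching per-word timings against the known transcript) can be sketched as follows. This is only an illustration under stated assumptions: it presumes some recognizer has already produced a list of `(word, start_seconds, end_seconds)` tuples, which is a hypothetical input format and not a SAPI data structure, and it uses exact word matching, whereas real recognizer output contains errors and would likely need fuzzy matching. The names `find_sentence` and `_norm` are made up for this sketch.

```python
import re


def _norm(word):
    """Lower-case a word and strip punctuation so 'Test.' matches 'test'."""
    return re.sub(r"[^a-z0-9']", "", word.lower())


def find_sentence(timed_words, sentence):
    """Return (start_s, end_s) of the first exact occurrence of `sentence`
    in `timed_words`, or None if it is not found.

    timed_words: list of (word, start_s, end_s) tuples in spoken order
                 (hypothetical recognizer output).
    """
    target = [t for t in (_norm(w) for w in sentence.split()) if t]
    if not target:
        return None

    heard = [_norm(w) for w, _, _ in timed_words]
    n = len(target)

    # Slide a window of n words over the recognized words.
    for i in range(len(heard) - n + 1):
        if heard[i:i + n] == target:
            # Start of the first word, end of the last word.
            return timed_words[i][1], timed_words[i + n - 1][2]
    return None
```

The returned pair is the start time of the sentence's first word and the end time of its last word, which is exactly the input an audio-cutting step would need.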

All replies

  • Someone please reply, I am waiting for help...
    Monday, August 23, 2010 11:16 AM
  • Thanks for replying, Mr. David Ching. I am posting my question on a SAPI forum as you suggested. I haven't started playing with the SAPI samples yet, but I will as soon as I get proper direction. And thanks for your second suggestion; I would like to use .NET!

    regards

    Tuesday, August 24, 2010 6:16 AM