What does the documentation mean by "endpoint your audio"?

  • Question

  • In the voice recognition API documentation, it says the following about the format of audio uploaded to Bing SR:

    "Your application must endpoint the audio to determine start and end of speech. The endpoints specify to the service the start and end of the request."

    What exactly does this mean? I ask because I used to have a recognition interface that worked for a few months and then suddenly broke, and I can't figure out why. This is what my code does:

    1. Get an access token and build an HttpWebRequest according to the sample code here
    2. Set Transfer-Encoding to chunked
    3. To the output stream, write a 44-byte RIFF file header, specifying a fileSize of 44 and a single data block with dataSize = 0 (I've also tried -1 or a very large arbitrary value)
    4. As the program is recording, take the 16-bit samples and dump them directly to the HTTP request stream
    5. When finished, call Close() on the request stream, and then call GetResponse() on the HttpWebRequest
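    For reference, the 44-byte header from step 3 can be sketched roughly like this (a minimal Python sketch of the RIFF/WAVE layout; the 16 kHz mono 16-bit PCM parameters are assumptions, and dataSize is left at 0 as in my code since the final length isn't known while streaming):

    ```python
    import struct

    def wav_header(sample_rate=16000, bits=16, channels=1, data_size=0):
        """Build a 44-byte RIFF/WAVE header for 16-bit PCM.

        data_size is unknown while streaming, so it stays 0 here.
        Note the standard RIFF chunk size is 36 + data_size, not 44."""
        byte_rate = sample_rate * channels * bits // 8
        block_align = channels * bits // 8
        return struct.pack(
            "<4sI4s"      # RIFF chunk: id, size, format
            "4sIHHIIHH"   # fmt chunk: id, size, codec, channels, rate, byte rate, align, bits
            "4sI",        # data chunk: id, size
            b"RIFF", 36 + data_size, b"WAVE",
            b"fmt ", 16, 1, channels, sample_rate, byte_rate, block_align, bits,
            b"data", data_size)

    header = wav_header()
    ```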

    When I do this, however, the response never comes back and I get a timeout error. I believe this is because the service is expecting more audio and never receives the "finished" signal. So how do I tell the service when I'm done sending data? Do I need to add a "cue" marker in the RIFF somewhere that marks where the end happens? Do I need to send a "wavl" chunk and give each intermediate request a separate "data" chunk? Or am I simply closing the request improperly?
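    My current understanding of "endpointing" is that the client is supposed to detect where speech starts and stops before or while sending audio, something along the lines of this energy-threshold sketch (the threshold and frame sizes are guesses on my part, not anything from the documentation):

    ```python
    import struct

    def is_speech(frame, threshold=500):
        """Crude voice-activity check: RMS energy of a frame of
        16-bit little-endian PCM samples against a fixed threshold."""
        samples = struct.unpack("<%dh" % (len(frame) // 2), frame)
        rms = (sum(s * s for s in samples) / max(len(samples), 1)) ** 0.5
        return rms > threshold

    def endpoint(frames, threshold=500, trailing_silence=30):
        """Yield frames from the first speech frame until
        trailing_silence consecutive quiet frames mark the end."""
        started = False
        silence = 0
        for f in frames:
            speech = is_speech(f, threshold)
            if not started:
                if speech:          # start-of-speech endpoint
                    started = True
                    yield f
                continue
            yield f
            silence = 0 if speech else silence + 1
            if silence >= trailing_silence:
                return              # end-of-speech endpoint
    ```

    If that is what the service expects the client to do, though, it still isn't clear to me how the end of speech maps onto ending the HTTP request.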

    Friday, March 18, 2016 8:08 PM